Learning objective: Demonstrate a clear understanding of back‑propagation of errors and of the main regression methods used in machine learning, and explain how the two are linked.
1. How this fits into the Cambridge A‑Level (9618) syllabus
Supervised learning – the algorithm is trained on examples that include the correct output (the label).
Within supervised learning the syllabus expects you to know:
Regression – predicting a continuous value (e.g. house price).
Neural networks trained by back‑propagation – a general method that can fit both continuous and categorical targets.
2. Regression methods required for the exam
2.1 Simple linear regression
Model
$$y = \beta_0 + \beta_1x + \varepsilon$$
Coefficients are obtained by minimising the sum of squared residuals (ordinary least squares):
$$\hat\beta_1=\frac{\sum_i(x_i-\bar x)(y_i-\bar y)}{\sum_i(x_i-\bar x)^2},\qquad \hat\beta_0=\bar y-\hat\beta_1\bar x.$$
Neural networks (for comparison with the regression methods above):
Typical applications – complex non‑linear relationships, image/audio, game playing
Output type – continuous or categorical (depends on output layer)
Loss function – MSE, cross‑entropy, etc.
Fitting method – iterative gradient descent using back‑propagation
3. Back‑propagation of errors
3.1 What the algorithm does
Back‑propagation computes the gradient of a chosen loss function with respect to every weight in a multilayer feed‑forward network, then updates the weights by gradient descent so that the overall error is reduced.
3.2 Learning cycle (step‑by‑step)
Initialise all weights (small random values) and biases.
Forward pass – feed an input vector \(\mathbf{x}\) through the network and record each neuron's net input \(net_j\) and activation \(a_j\).
Compute loss – most A‑Level questions use the sum‑of‑squares error
$$E=\frac12\sum_{k}(t_k-y_k)^2,$$
where \(t_k\) is the target for output node \(k\) and \(y_k\) its actual output.
Backward pass – apply the chain rule layer‑by‑layer to obtain the error term \(\delta_j\) for every neuron:
$$\delta_j = \frac{\partial E}{\partial net_j}.$$
Weight update – adjust each weight \(w_{ij}\) using gradient descent:
$$w_{ij}\leftarrow w_{ij}-\eta\,\delta_j\,a_i,$$
where \(\eta\) is the learning rate and \(a_i\) the activation of the source neuron.
Repeat steps 2‑5 for all training examples (one epoch) until the loss is acceptably low or a stopping criterion is met.
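The cycle above can be sketched in plain Python. This is a minimal illustrative network (one input, two sigmoid hidden units, one linear output); the sizes, toy data and hyper‑parameters are arbitrary choices for the sketch, not prescribed by the syllabus:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)

# Step 1: initialise weights and biases with small random values.
w1 = [random.uniform(-0.5, 0.5) for _ in range(2)]  # input -> hidden weights
b1 = [0.0, 0.0]                                     # hidden biases
w2 = [random.uniform(-0.5, 0.5) for _ in range(2)]  # hidden -> output weights
b2 = 0.0                                            # output bias
eta = 0.1                                           # learning rate

data = [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]        # toy (x, t) pairs

def total_loss():
    """Sum-of-squares error E = 1/2 * sum (t - y)^2 over the data set."""
    e = 0.0
    for x, t in data:
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(2)]
        y = sum(w2[j] * h[j] for j in range(2)) + b2
        e += 0.5 * (t - y) ** 2
    return e

loss_before = total_loss()
for _ in range(500):                     # repeat steps 2-5 for many epochs
    for x, t in data:
        # Step 2: forward pass -- record each activation.
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(2)]
        y = sum(w2[j] * h[j] for j in range(2)) + b2
        # Steps 3-4: error terms delta = dE/dnet via the chain rule.
        delta_out = y - t                                   # linear output, SSE loss
        delta_hid = [delta_out * w2[j] * h[j] * (1 - h[j]) for j in range(2)]
        # Step 5: gradient-descent update  w <- w - eta * delta * a.
        for j in range(2):
            w2[j] -= eta * delta_out * h[j]
            w1[j] -= eta * delta_hid[j] * x
            b1[j] -= eta * delta_hid[j]
        b2 -= eta * delta_out
loss_after = total_loss()
```

Each pass through the inner loop performs the forward pass, error computation and weight update for one training example; the outer loop repeats this for many epochs until the loss falls.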
3.3 Deriving the error term \(\delta\)
For a neuron \(j\) in layer \(\ell\) that feeds forward to layer \(\ell+1\), the chain rule gives
$$\delta_j^{(\ell)} = f'(net_j)\sum_{k} w_{jk}\,\delta_k^{(\ell+1)},$$
where \(f'\) is the derivative of the activation function and the sum runs over the neurons \(k\) that \(j\) feeds into. For a sigmoid activation, \(f'(net_j)=a_j(1-a_j)\).
Figure 1 – Forward flow of activations (solid arrows) and backward flow of error terms (dashed arrows) for a network with one hidden layer.
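The claim that \(\delta\) yields the true gradient can be checked numerically. The sketch below (made‑up weights, one sigmoid hidden unit, linear output, biases omitted) compares the back‑propagated gradient of the SSE loss with a central finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny fixed network with illustrative numbers: 1 input, 1 sigmoid hidden unit,
# 1 linear output; biases omitted for brevity.
x, t = 0.8, 1.0          # input and target
w1, w2 = 0.4, -0.3       # hidden and output weights

def forward(w1_, w2_):
    h = sigmoid(w1_ * x)
    y = w2_ * h
    return h, y, 0.5 * (t - y) ** 2

# Analytic gradient from the chain rule (back-propagation):
h, y, E = forward(w1, w2)
delta_out = y - t                          # dE/dnet at the linear output
delta_hid = delta_out * w2 * h * (1 - h)   # error term of the hidden unit
grad_w1 = delta_hid * x                    # dE/dw1 = delta_hid * input activation

# Numerical gradient by central finite differences:
eps = 1e-6
_, _, E_plus = forward(w1 + eps, w2)
_, _, E_minus = forward(w1 - eps, w2)
grad_w1_numeric = (E_plus - E_minus) / (2 * eps)
```

The two gradients agree to within the finite-difference error, which is a standard sanity check for any hand-written back-propagation code.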
4. Linking back‑propagation to regression
Linear output + SSE → the network is performing a linear regression on the transformed features \(\{h_j\}\) produced by the hidden layer. The hidden layer therefore acts as a set of non‑linear basis functions.
Sigmoid output + cross‑entropy → the network behaves like logistic regression, but the hidden layer supplies additional non‑linear basis functions, allowing far more flexible decision boundaries.
In both cases the weight‑update rule is exactly the gradient‑descent step used in ordinary regression; back‑propagation simply extends the same principle to arbitrarily deep architectures.
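This can be verified numerically: for a fixed set of hypothetical hidden activations, iterating the back‑propagation update for a linear output layer converges to the same weights as the closed‑form OLS solution computed on those activations.

```python
# Hypothetical hidden-layer activations h and targets t for three training examples.
H = [(0.2, 0.9), (0.5, 0.4), (0.8, 0.6)]
t = [1.0, 2.0, 3.0]

# (a) Closed-form OLS on the features h (2x2 normal equations, Cramer's rule).
a11 = sum(h1 * h1 for h1, _ in H)
a12 = sum(h1 * h2 for h1, h2 in H)
a22 = sum(h2 * h2 for _, h2 in H)
c1 = sum(h1 * ti for (h1, _), ti in zip(H, t))
c2 = sum(h2 * ti for (_, h2), ti in zip(H, t))
det = a11 * a22 - a12 * a12
w_ols = [(c1 * a22 - c2 * a12) / det, (c2 * a11 - c1 * a12) / det]

# (b) The back-propagation update for a linear output layer, iterated to convergence.
w = [0.0, 0.0]
eta = 0.5
for _ in range(2000):
    g = [0.0, 0.0]
    for (h1, h2), ti in zip(H, t):
        y = w[0] * h1 + w[1] * h2        # linear output
        delta = y - ti                   # output error term under SSE loss
        g[0] += delta * h1               # dE/dw = delta * activation
        g[1] += delta * h2
    w[0] -= eta * g[0]
    w[1] -= eta * g[1]
```

With the hidden layer frozen, gradient descent on the output weights is literally linear regression fitted iteratively, so both routes land on the same coefficients.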
5. Key points to remember for the exam
Back‑propagation applies the chain rule layer‑by‑layer to obtain \(\partial E/\partial w\) for every weight.
The learning rate \(\eta\) controls step size – too large → divergence; too small → very slow convergence.
Simple & multiple linear regression have analytical OLS solutions; logistic regression and neural networks require iterative optimisation.
Regularisation (e.g., L2 penalty) can be added to any loss function to reduce over‑fitting.
When the output layer is linear and the loss is SSE, a neural network is mathematically equivalent to linear regression on the hidden‑layer activations.
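For example, an L2 penalty \(\tfrac{\lambda}{2}w^2\) adds \(\lambda w\) to the gradient of each weight, giving the familiar "weight decay" update (one‑weight sketch with illustrative numbers):

```python
# Gradient-descent step for one weight with and without an L2 penalty.
# Regularised loss: E_reg = E + (lam/2) * w^2, so dE_reg/dw = dE/dw + lam * w.
eta, lam = 0.1, 0.01
w = 0.8
grad_E = 0.5                          # illustrative value of dE/dw for this weight

w_plain = w - eta * grad_E            # ordinary update
w_l2 = w - eta * (grad_E + lam * w)   # penalised update: shrinks w toward zero
```

The penalised step moves the weight slightly further toward zero than the plain step, which is exactly the shrinkage effect that reduces over‑fitting.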
6. Sample A‑Level exam question
Question: A neural network with one hidden layer of three sigmoid neurons is trained to predict house prices (a continuous variable). The output neuron uses a linear activation and the loss function is the sum of squared errors. Explain how back‑propagation updates the weights in this network and relate the process to linear regression.
Model answer. Output error term – because the output unit is linear and the loss is the sum of squared errors,
\[
\delta^{(2)} = y-t.
\]
Hidden error terms – propagate the error back through each sigmoid unit:
\[
\delta_j^{(1)} = \delta^{(2)}\,w_{j}^{(2)}\,h_j(1-h_j).
\]
Weight updates (gradient descent):
\[
w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta\,\delta_j^{(1)}x_i,\qquad
w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta\,\delta^{(2)}h_j,
\]
and similarly for the biases.
Relation to linear regression – the output layer is a linear combination of the hidden activations \(h_j\). Hence the network is fitting a linear regression model to the transformed feature space created by the hidden layer. The hidden layer supplies non‑linear basis functions, extending ordinary linear regression to a richer hypothesis space.
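A single back‑propagation step for this exact architecture can be sketched as follows (the weights, input and target are illustrative, and biases are omitted for brevity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One training step: one input feature, three sigmoid hidden units, linear output.
x, t = 0.6, 1.0                        # input and target (scaled house price)
w1 = [0.1, -0.2, 0.3]                  # input -> hidden weights (illustrative)
w2 = [0.5, 0.4, -0.1]                  # hidden -> output weights (illustrative)
eta = 0.05                             # learning rate

# Forward pass
h = [sigmoid(wj * x) for wj in w1]
y = sum(w2j * hj for w2j, hj in zip(w2, h))

# Error terms, exactly as in the model answer
delta2 = y - t                                          # linear output unit
delta1 = [delta2 * w2[j] * h[j] * (1 - h[j]) for j in range(3)]

# Gradient-descent weight updates
w2 = [w2[j] - eta * delta2 * h[j] for j in range(3)]
w1 = [w1[j] - eta * delta1[j] * x for j in range(3)]
```

Note that the `w2` update uses only the hidden activations `h` and the output error, which is precisely the gradient step of a linear regression on the features `h`.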
7. Teacher’s quick‑audit checklist
Syllabus element | Covered? | Action if "no"
Definition of supervised vs. unsupervised learning | ✔ | -
Purpose of regression in ML (continuous prediction) | ✔ | -
Simple & multiple linear regression – formulae & OLS solution | |