Learning objective: Demonstrate a clear understanding of back‑propagation of errors and of the main regression methods used in machine learning, and explain how the two are linked.
1. How this fits into the Cambridge A‑Level (9618) syllabus
Supervised learning – the algorithm is trained on examples that include the correct output (the label).
Within supervised learning the syllabus expects you to know:
Regression – predicting a continuous value (e.g. house price).
Neural networks trained by back‑propagation – a general method that can fit both continuous and categorical targets.
2. Regression methods required for the exam
2.1 Simple linear regression
Model
$$y = \beta_0 + \beta_1x + \varepsilon$$
Coefficients are obtained by minimising the sum of squared residuals (ordinary least squares):
$$\hat\beta_1=\frac{\sum_i(x_i-\bar x)(y_i-\bar y)}{\sum_i(x_i-\bar x)^2},\qquad \hat\beta_0=\bar y-\hat\beta_1\bar x.$$
Neural networks (for comparison with the regression methods above):
Typical applications – complex non‑linear relationships, image/audio, game playing
Output type – continuous or categorical (depends on output layer)
Loss function – MSE, cross‑entropy, etc.
Fitting method – iterative gradient descent using back‑propagation
3. Back‑propagation of errors
3.1 What the algorithm does
Back‑propagation computes the gradient of a chosen loss function with respect to every weight in a multilayer feed‑forward network, then updates the weights by gradient descent so that the overall error is reduced.
3.2 Learning cycle (step‑by‑step)
Initialise all weights (small random values) and biases.
Forward pass – feed an input vector \(\mathbf{x}\) through the network and record each neuron's net input \(net_j\) and activation \(a_j\).
Compute loss – most A‑Level questions use the sum‑of‑squares error
$$E=\frac12\sum_{k}(t_k-y_k)^2,$$
where \(t_k\) is the target for output node \(k\) and \(y_k\) its actual output.
Backward pass – apply the chain rule layer‑by‑layer to obtain the error term \(\delta_j\) for every neuron:
$$\delta_j = \frac{\partial E}{\partial net_j}.$$
Weight update – adjust each weight \(w_{ij}\) using gradient descent:
$$w_{ij}\leftarrow w_{ij}-\eta\,\delta_j\,a_i,$$
where \(\eta\) is the learning rate and \(a_i\) the activation of the source neuron.
Repeat steps 2‑5 for all training examples (one epoch) until the loss is acceptably low or a stopping criterion is met.
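The cycle above can be sketched in plain Python. This is a minimal illustrative network (one input, two sigmoid hidden units, one linear output); the sizes, toy data and hyper‑parameters are arbitrary choices for the sketch, not prescribed by the syllabus:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)

# Step 1: initialise weights and biases with small random values.
w1 = [random.uniform(-0.5, 0.5) for _ in range(2)]  # input -> hidden weights
b1 = [0.0, 0.0]                                     # hidden biases
w2 = [random.uniform(-0.5, 0.5) for _ in range(2)]  # hidden -> output weights
b2 = 0.0                                            # output bias
eta = 0.1                                           # learning rate

data = [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]        # toy (x, t) pairs

def total_loss():
    """Sum-of-squares error E = 1/2 * sum (t - y)^2 over the data set."""
    e = 0.0
    for x, t in data:
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(2)]
        y = sum(w2[j] * h[j] for j in range(2)) + b2
        e += 0.5 * (t - y) ** 2
    return e

loss_before = total_loss()
for _ in range(500):                     # repeat steps 2-5 for many epochs
    for x, t in data:
        # Step 2: forward pass -- record each activation.
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(2)]
        y = sum(w2[j] * h[j] for j in range(2)) + b2
        # Steps 3-4: error terms delta = dE/dnet via the chain rule.
        delta_out = y - t                                   # linear output, SSE loss
        delta_hid = [delta_out * w2[j] * h[j] * (1 - h[j]) for j in range(2)]
        # Step 5: gradient-descent update  w <- w - eta * delta * a.
        for j in range(2):
            w2[j] -= eta * delta_out * h[j]
            w1[j] -= eta * delta_hid[j] * x
            b1[j] -= eta * delta_hid[j]
        b2 -= eta * delta_out
loss_after = total_loss()
```

Each pass through the inner loop performs the forward pass, error computation and weight update for one training example; the outer loop repeats this for many epochs until the loss falls.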
3.3 Deriving the error term \(\delta\)
For a neuron \(j\) in layer \(\ell\) that feeds forward to layer \(\ell+1\), the chain rule gives
$$\delta_j^{(\ell)} = f'(net_j)\sum_{k} w_{jk}\,\delta_k^{(\ell+1)},$$
where \(f'\) is the derivative of the activation function and the sum runs over the neurons \(k\) that \(j\) feeds into. For a sigmoid activation, \(f'(net_j)=a_j(1-a_j)\).
Figure 1 – Forward flow of activations (solid arrows) and backward flow of error terms (dashed arrows) for a network with one hidden layer.
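The claim that \(\delta\) yields the true gradient can be checked numerically. The sketch below (made‑up weights, one sigmoid hidden unit, linear output, biases omitted) compares the back‑propagated gradient of the SSE loss with a central finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny fixed network with illustrative numbers: 1 input, 1 sigmoid hidden unit,
# 1 linear output; biases omitted for brevity.
x, t = 0.8, 1.0          # input and target
w1, w2 = 0.4, -0.3       # hidden and output weights

def forward(w1_, w2_):
    h = sigmoid(w1_ * x)
    y = w2_ * h
    return h, y, 0.5 * (t - y) ** 2

# Analytic gradient from the chain rule (back-propagation):
h, y, E = forward(w1, w2)
delta_out = y - t                          # dE/dnet at the linear output
delta_hid = delta_out * w2 * h * (1 - h)   # error term of the hidden unit
grad_w1 = delta_hid * x                    # dE/dw1 = delta_hid * input activation

# Numerical gradient by central finite differences:
eps = 1e-6
_, _, E_plus = forward(w1 + eps, w2)
_, _, E_minus = forward(w1 - eps, w2)
grad_w1_numeric = (E_plus - E_minus) / (2 * eps)
```

The two gradients agree to within the finite-difference error, which is a standard sanity check for any hand-written back-propagation code.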
4. Linking back‑propagation to regression
Linear output + SSE → the network is performing a linear regression on the transformed features \(\{h_j\}\) produced by the hidden layer. The hidden layer therefore acts as a set of non‑linear basis functions.
Sigmoid output + cross‑entropy → the network behaves like logistic regression, but the hidden layer supplies additional non‑linear basis functions, allowing far more flexible decision boundaries.
In both cases the weight‑update rule is exactly the gradient‑descent step used in ordinary regression; back‑propagation simply extends the same principle to arbitrarily deep architectures.
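This can be verified numerically: for a fixed set of hypothetical hidden activations, iterating the back‑propagation update for a linear output layer converges to the same weights as the closed‑form OLS solution computed on those activations.

```python
# Hypothetical hidden-layer activations h and targets t for three training examples.
H = [(0.2, 0.9), (0.5, 0.4), (0.8, 0.6)]
t = [1.0, 2.0, 3.0]

# (a) Closed-form OLS on the features h (2x2 normal equations, Cramer's rule).
a11 = sum(h1 * h1 for h1, _ in H)
a12 = sum(h1 * h2 for h1, h2 in H)
a22 = sum(h2 * h2 for _, h2 in H)
c1 = sum(h1 * ti for (h1, _), ti in zip(H, t))
c2 = sum(h2 * ti for (_, h2), ti in zip(H, t))
det = a11 * a22 - a12 * a12
w_ols = [(c1 * a22 - c2 * a12) / det, (c2 * a11 - c1 * a12) / det]

# (b) The back-propagation update for a linear output layer, iterated to convergence.
w = [0.0, 0.0]
eta = 0.5
for _ in range(2000):
    g = [0.0, 0.0]
    for (h1, h2), ti in zip(H, t):
        y = w[0] * h1 + w[1] * h2        # linear output
        delta = y - ti                   # output error term under SSE loss
        g[0] += delta * h1               # dE/dw = delta * activation
        g[1] += delta * h2
    w[0] -= eta * g[0]
    w[1] -= eta * g[1]
```

With the hidden layer frozen, gradient descent on the output weights is literally linear regression fitted iteratively, so both routes land on the same coefficients.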
5. Key points to remember for the exam
Back‑propagation applies the chain rule layer‑by‑layer to obtain \(\partial E/\partial w\) for every weight.
The learning rate \(\eta\) controls step size – too large → divergence; too small → very slow convergence.
Simple & multiple linear regression have analytical OLS solutions; logistic regression and neural networks require iterative optimisation.
Regularisation (e.g., L2 penalty) can be added to any loss function to reduce over‑fitting.
When the output layer is linear and the loss is SSE, a neural network is mathematically equivalent to linear regression on the hidden‑layer activations.
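For example, an L2 penalty \(\tfrac{\lambda}{2}w^2\) adds \(\lambda w\) to the gradient of each weight, giving the familiar "weight decay" update (one‑weight sketch with illustrative numbers):

```python
# Gradient-descent step for one weight with and without an L2 penalty.
# Regularised loss: E_reg = E + (lam/2) * w^2, so dE_reg/dw = dE/dw + lam * w.
eta, lam = 0.1, 0.01
w = 0.8
grad_E = 0.5                          # illustrative value of dE/dw for this weight

w_plain = w - eta * grad_E            # ordinary update
w_l2 = w - eta * (grad_E + lam * w)   # penalised update: shrinks w toward zero
```

The penalised step moves the weight slightly further toward zero than the plain step, which is exactly the shrinkage effect that reduces over‑fitting.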
6. Sample A‑Level exam question
Question: A neural network with one hidden layer of three sigmoid neurons is trained to predict house prices (a continuous variable). The output neuron uses a linear activation and the loss function is the sum of squared errors. Explain how back‑propagation updates the weights in this network and relate the process to linear regression.
Model answer. Output error term – because the output unit is linear and the loss is the sum of squared errors,
\[
\delta^{(2)} = y-t.
\]
Hidden error terms – propagate the error back through each sigmoid unit:
\[
\delta_j^{(1)} = \delta^{(2)}\,w_{j}^{(2)}\,h_j(1-h_j).
\]
Weight updates (gradient descent):
\[
w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta\,\delta_j^{(1)}x_i,\qquad
w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta\,\delta^{(2)}h_j,
\]
and similarly for the biases.
Relation to linear regression – the output layer is a linear combination of the hidden activations \(h_j\). Hence the network is fitting a linear regression model to the transformed feature space created by the hidden layer. The hidden layer supplies non‑linear basis functions, extending ordinary linear regression to a richer hypothesis space.
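A single back‑propagation step for this exact architecture can be sketched as follows (the weights, input and target are illustrative, and biases are omitted for brevity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One training step: one input feature, three sigmoid hidden units, linear output.
x, t = 0.6, 1.0                        # input and target (scaled house price)
w1 = [0.1, -0.2, 0.3]                  # input -> hidden weights (illustrative)
w2 = [0.5, 0.4, -0.1]                  # hidden -> output weights (illustrative)
eta = 0.05                             # learning rate

# Forward pass
h = [sigmoid(wj * x) for wj in w1]
y = sum(w2j * hj for w2j, hj in zip(w2, h))

# Error terms, exactly as in the model answer
delta2 = y - t                                          # linear output unit
delta1 = [delta2 * w2[j] * h[j] * (1 - h[j]) for j in range(3)]

# Gradient-descent weight updates
w2 = [w2[j] - eta * delta2 * h[j] for j in range(3)]
w1 = [w1[j] - eta * delta1[j] * x for j in range(3)]
```

Note that the `w2` update uses only the hidden activations `h` and the output error, which is precisely the gradient step of a linear regression on the features `h`.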
7. Teacher’s quick‑audit checklist
Syllabus element | Covered? | Action if "no"
Definition of supervised vs. unsupervised learning | ✔ | -
Purpose of regression in ML (continuous prediction) | ✔ | -
Simple & multiple linear regression – formulae & OLS solution | |