Show understanding of back propagation of errors and regression methods in machine learning

Topic 18.1 Artificial Intelligence (AI)

Learning objective: Demonstrate a clear understanding of back‑propagation of errors and of the main regression methods used in machine learning, and explain how the two are linked.


1. How this fits into the Cambridge A‑Level (9618) syllabus

  • Supervised learning – the algorithm is trained on examples that include the correct output (the label).
  • Within supervised learning the syllabus expects you to know:

    • Regression – predicting a continuous value (e.g. house price).
    • Neural networks trained by back‑propagation – a general method for fitting both continuous and categorical data.


2. Regression methods required for the exam

2.1 Simple linear regression

Model

\$y = \beta_0 + \beta_1 x + \varepsilon\$

Coefficients are obtained by minimising the sum of squared residuals (ordinary least squares):

\$\min_{\beta_0,\beta_1}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2\$

Closed‑form solution

\$\$\beta_1 = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i}(x_i-\bar{x})^2},
\qquad
\beta_0 = \bar{y}-\beta_1\bar{x}\$\$
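The closed-form formulas can be sketched in plain Python; the data values below are made up for illustration, chosen to lie near the line \(y = 2x\):

```python
# Sketch: ordinary least squares for simple linear regression,
# computed directly from the closed-form formulas above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# beta_1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
# beta_0 = y_bar - beta_1 * x_bar
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)  # slope close to 2, intercept close to 0
```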

2.2 Multiple linear regression

Matrix form

\$\mathbf{y}= \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\$

where \(\mathbf{X}\) is the \(n\times(p+1)\) design matrix (first column = 1 for the intercept). The OLS estimate is

\$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\!T}\mathbf{X})^{-1}\mathbf{X}^{\!T}\mathbf{y}\$
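The OLS estimate above can be sketched with NumPy. The sample size, coefficients and noise-free target below are illustrative choices; `np.linalg.solve` is used on the normal equations rather than forming the inverse explicitly, which is numerically safer:

```python
# Sketch: multiple linear regression via the normal equations, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))          # n = 100 samples, p = 2 predictors
true_beta = np.array([1.0, 2.0, -3.0])     # intercept, beta_1, beta_2 (made up)
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # n x (p+1) design matrix
y = X @ true_beta                          # noise-free target for a clean check

# beta_hat = (X^T X)^{-1} X^T y, solved as (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # recovers [1, 2, -3]
```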

2.3 Logistic regression (binary classification)

Probability of class 1 given a feature vector \(\mathbf{x}\)

\$\$P(y=1\mid\mathbf{x}) = \sigma(\mathbf{w}^{\!T}\mathbf{x}+b)
= \frac{1}{1+e^{-(\mathbf{w}^{\!T}\mathbf{x}+b)}}\$\$

Parameters \(\mathbf{w},b\) are estimated by maximising the log‑likelihood (equivalently, minimising cross‑entropy loss):

\$\$\ell(\mathbf{w},b)=\sum_{i=1}^{n}\Big[y_i\log\sigma(z_i)+(1-y_i)\log\bigl(1-\sigma(z_i)\bigr)\Big],
\qquad z_i=\mathbf{w}^{\!T}\mathbf{x}_i+b\$\$
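Because there is no closed-form solution, the parameters are found iteratively. A minimal gradient-descent sketch, assuming NumPy; the one-dimensional toy data, learning rate and iteration count are arbitrary illustrative choices:

```python
# Sketch: logistic regression fitted by gradient descent on cross-entropy loss.
import numpy as np

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])          # binary labels (made-up data)

w = np.zeros(1)
b = 0.0
eta = 0.5                                  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    p = sigmoid(X @ w + b)                 # P(y=1 | x) for every sample
    # Gradient of the mean negative log-likelihood: X^T (p - y) / n
    w -= eta * (X.T @ (p - y)) / len(y)
    b -= eta * np.sum(p - y) / len(y)

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(preds)  # matches y once the boundary settles between 1.5 and 3.0
```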

2.4 Quick comparison of the four techniques

| Method | Typical use‑case | Output type | Loss function | Training approach |
| --- | --- | --- | --- | --- |
| Simple linear regression | One predictor → continuous target | Continuous | Sum of squared errors (SSE) | Closed‑form OLS |
| Multiple linear regression | Several predictors → continuous target | Continuous | SSE | Closed‑form OLS or GD for large \(p\) |
| Logistic regression | Binary classification (e.g. spam / not‑spam) | Probability (0–1) | Cross‑entropy (log‑loss) | Iterative optimisation (GD, Newton‑Raphson) |
| Neural network (back‑propagation) | Complex non‑linear relationships, image/audio, game playing | Continuous or categorical (depends on output layer) | MSE, cross‑entropy, etc. | Iterative GD using back‑propagation |


3. Back‑propagation of errors

3.1 What the algorithm does

Back‑propagation computes the gradient of a chosen loss function with respect to every weight in a multilayer feed‑forward network, then updates the weights by gradient descent so that the overall error is reduced.

3.2 Learning cycle (step‑by‑step)

  1. Initialise all weights (small random values) and biases.
  2. Forward pass – feed an input vector \(\mathbf{x}\) through the network and record each neuron's net input \(net_j\) and activation \(a_j\).
  3. Compute loss – most A‑Level questions use the sum‑of‑squares error

    \$E=\frac12\sum_{k}(t_k-y_k)^2,\$

    where \(t_k\) is the target for output node \(k\) and \(y_k\) its actual output.

  4. Backward pass – apply the chain rule layer‑by‑layer to obtain the error term \(\delta_j\) for every neuron:

    \$\delta_j = \frac{\partial E}{\partial net_j}.\$

  5. Weight update – adjust each weight \(w_{ij}\) using gradient descent:

    \$w_{ij}\leftarrow w_{ij}-\eta\,\delta_j\,a_i,\$

    where \(\eta\) is the learning rate and \(a_i\) the activation of the source neuron.

  6. Repeat steps 2‑5 for all training examples (one epoch) until the loss is acceptably low or a stopping criterion is met.
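The six steps can be sketched as a plain-Python training loop. The architecture (2 inputs, 2 sigmoid hidden units, 1 sigmoid output), the XOR task, the seed and the hyperparameters are all illustrative choices, not prescribed by the syllabus:

```python
# Sketch of the learning cycle (steps 1-6) on the XOR problem.
import math, random

random.seed(42)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# Step 1: initialise weights with small random values, biases with zero
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-0.5, 0.5) for _ in range(2)]
b2 = 0.0
eta = 0.5

def epoch_loss():
    total = 0.0
    for x, t in data:
        h = [sigmoid(W1[j][0]*x[0] + W1[j][1]*x[1] + b1[j]) for j in range(2)]
        y = sigmoid(W2[0]*h[0] + W2[1]*h[1] + b2)
        total += 0.5 * (t - y) ** 2        # step 3: SSE loss
    return total

loss_before = epoch_loss()

for _ in range(5000):                      # step 6: repeat for many epochs
    for x, t in data:
        # Step 2: forward pass, recording activations
        h = [sigmoid(W1[j][0]*x[0] + W1[j][1]*x[1] + b1[j]) for j in range(2)]
        y = sigmoid(W2[0]*h[0] + W2[1]*h[1] + b2)
        # Step 4: error terms via the chain rule (sigmoid output and hidden)
        delta_out = (y - t) * y * (1 - y)
        delta_hid = [delta_out * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
        # Step 5: gradient-descent weight updates
        for j in range(2):
            W2[j] -= eta * delta_out * h[j]
            for i in range(2):
                W1[j][i] -= eta * delta_hid[j] * x[i]
            b1[j] -= eta * delta_hid[j]
        b2 -= eta * delta_out

loss_after = epoch_loss()
print(loss_before, loss_after)             # loss falls as training proceeds
```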

3.3 Deriving the error term \(\delta\)

For a neuron in layer \(\ell\) that feeds forward to layer \(\ell+1\):

\$\$\delta_j^{(\ell)}=
\begin{cases}
(y_j-t_j)\,f'(net_j) & \text{output neuron}\\[4pt]
\left(\displaystyle\sum_{k}\delta_k^{(\ell+1)}w_{jk}^{(\ell+1)}\right)f'(net_j) & \text{hidden neuron}
\end{cases}\$\$

\(f'\) is the derivative of the activation function (e.g. sigmoid: \(\sigma'(z)=\sigma(z)[1-\sigma(z)]\)).

3.4 Worked example – one hidden layer, three hidden neurons

Assume:

  • Input vector \(\mathbf{x}=(x_1,x_2)\).
  • Hidden layer uses the sigmoid activation \(\sigma(z)=\frac{1}{1+e^{-z}}\).
  • Output layer is linear (no activation function).
  • Loss = SSE.

  1. Forward pass

    \[
    h_j = \sigma\!\Big(\sum_i w_{ij}^{(1)}x_i + b_j^{(1)}\Big),\qquad j=1,2,3
    \]

    \[
    y = \sum_{j=1}^{3} w_{j}^{(2)}h_j + b^{(2)}.
    \]

  2. Output error term (linear unit)

    \[
    \delta^{(2)} = \frac{\partial E}{\partial y}=y-t.
    \]

  3. Hidden error terms

    \[
    \delta_j^{(1)} = \delta^{(2)}\,w_{j}^{(2)}\,\sigma'(net_j),\qquad
    \sigma'(net_j)=h_j(1-h_j).
    \]

  4. Weight updates

    \[
    w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta\,\delta_j^{(1)}x_i,
    \qquad
    b_j^{(1)} \leftarrow b_j^{(1)} - \eta\,\delta_j^{(1)}.
    \]

    \[
    w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta\,\delta^{(2)}h_j,
    \qquad
    b^{(2)} \leftarrow b^{(2)} - \eta\,\delta^{(2)}.
    \]
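The worked example can be checked numerically. The weights, biases and the single training pair below are made-up values; the sketch computes one backward pass and confirms the analytic gradient \(\delta_j^{(1)}x_i\) against a finite-difference estimate:

```python
# Numeric sketch: 2 inputs, 3 sigmoid hidden units, linear output, SSE loss.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 0.5]
t = 2.0                                         # target (made up)
W1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]     # W1[j][i], made up
b1 = [0.0, 0.1, -0.1]
W2 = [0.2, -0.3, 0.4]
b2 = 0.05

def forward(W1_):
    h = [sigmoid(W1_[j][0]*x[0] + W1_[j][1]*x[1] + b1[j]) for j in range(3)]
    y = sum(W2[j]*h[j] for j in range(3)) + b2
    return h, y

h, y = forward(W1)

# Backward pass exactly as in the worked example
delta2 = y - t                                  # linear output unit
delta1 = [delta2 * W2[j] * h[j] * (1 - h[j]) for j in range(3)]

# Analytic gradient of E with respect to W1[0][0] is delta1[0] * x[0]
grad = delta1[0] * x[0]

# Finite-difference check of the same gradient
eps = 1e-6
W1p = [row[:] for row in W1]; W1p[0][0] += eps
W1m = [row[:] for row in W1]; W1m[0][0] -= eps
E = lambda y_: 0.5 * (t - y_) ** 2
grad_fd = (E(forward(W1p)[1]) - E(forward(W1m)[1])) / (2 * eps)

print(grad, grad_fd)                            # the two agree closely
```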

Figure 1 – Forward flow of activations (solid blue arrows) and backward flow of error terms (dashed red arrows) for a network with one hidden layer.


4. Linking back‑propagation to regression

  • Linear output + SSE → the network is performing a linear regression on the transformed features \(\{h_j\}\) produced by the hidden layer. The hidden layer therefore acts as a set of non‑linear basis functions.
  • Sigmoid output + cross‑entropy → the network behaves like logistic regression, but the hidden layer supplies additional non‑linear basis functions, allowing far more flexible decision boundaries.
  • In both cases the weight‑update rule is exactly the gradient‑descent step used in ordinary regression; back‑propagation simply extends the same principle to arbitrarily deep architectures.
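The first bullet can be demonstrated directly: holding a (here untrained, randomly initialised) hidden layer fixed, the SSE-optimal output-layer weights are exactly the OLS solution computed on the hidden activations \(h_j\). The data, hidden-layer size and seed below are illustrative, and NumPy is assumed:

```python
# Sketch: the output layer of a linear-output network is linear regression
# on the hidden-layer activations, which act as non-linear basis functions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
t = np.sin(X[:, 0]) + X[:, 1] ** 2          # non-linear target (made up)

# Fixed hidden layer: 3 sigmoid units
W1 = rng.normal(size=(2, 3))
b1 = rng.normal(size=3)
H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))    # hidden activations, shape (50, 3)

# OLS of t on the transformed features {h_j}, plus an intercept column
H1 = np.column_stack([np.ones(len(H)), H])
beta = np.linalg.solve(H1.T @ H1, H1.T @ t)

# beta holds the output-layer bias and weights that minimise the SSE
sse = 0.5 * np.sum((t - H1 @ beta) ** 2)
print(sse)  # minimum achievable SSE for this fixed hidden layer
```

Back-propagation with a linear output and SSE loss converges toward this same solution, while additionally adapting the hidden layer itself.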


5. Key points to remember for the exam

  • Back‑propagation applies the chain rule layer‑by‑layer to obtain \(\partial E/\partial w\) for every weight.
  • The learning rate \(\eta\) controls step size – too large → divergence; too small → very slow convergence.
  • Simple & multiple linear regression have analytical OLS solutions; logistic regression and neural networks require iterative optimisation.
  • Regularisation (e.g., L2 penalty) can be added to any loss function to reduce over‑fitting.
  • When the output layer is linear and the loss is SSE, a neural network is mathematically equivalent to linear regression on the hidden‑layer activations.


6. Sample A‑Level exam question

Question: A neural network with one hidden layer of three sigmoid neurons is trained to predict house prices (a continuous variable). The output neuron uses a linear activation and the loss function is the sum of squared errors. Explain how back‑propagation updates the weights in this network and relate the process to linear regression.

Answer outline (full marks):

  1. Forward pass – compute hidden activations

    \[
    h_j=\sigma\!\Big(\sum_i w_{ij}^{(1)}x_i+b_j^{(1)}\Big),\qquad
    y=\sum_j w_{j}^{(2)}h_j+b^{(2)}.
    \]

  2. Output error term – because the output unit is linear,

    \[
    \delta^{(2)} = y-t.
    \]

  3. Hidden error terms – propagate the error back through each sigmoid unit:

    \[
    \delta_j^{(1)} = \delta^{(2)}\,w_{j}^{(2)}\,h_j(1-h_j).
    \]

  4. Weight updates (gradient descent):

    \[
    w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta\,\delta_j^{(1)}x_i,\qquad
    w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta\,\delta^{(2)}h_j,
    \]

    and similarly for the biases.

  5. Relation to linear regression – the output layer is a linear combination of the hidden activations \(h_j\). Hence the network is fitting a linear regression model to the transformed feature space created by the hidden layer. The hidden layer supplies non‑linear basis functions, extending ordinary linear regression to a richer hypothesis space.


7. Teacher’s quick‑audit checklist

| Syllabus element | Covered? | Action if “no” |
| --- | --- | --- |
| Definition of supervised vs. unsupervised learning | | |
| Purpose of regression in ML (continuous prediction) | | |
| Simple & multiple linear regression – formulae & OLS solution | | |
| Logistic regression – sigmoid, cross‑entropy, optimisation | | |
| Back‑propagation algorithm – steps, weight‑update rule, δ‑terms | | |
| Link between back‑propagation and regression (linear & logistic cases) | | |
| Key exam points (learning rate, regularisation, convergence) | | |
| Worked example & diagram of a network with one hidden layer | | |
| Sample A‑Level style question & full answer outline | | |

When all checks are green, the notes are ready for classroom delivery or revision.