Show understanding of back propagation of errors and regression methods in machine learning

Published by Patrick Mutisya

Cambridge A-Level Computer Science 9618 – Topic 18.1 Artificial Intelligence (AI)


Objective: Show understanding of back‑propagation of errors and regression methods in machine learning.

1. Introduction to Machine Learning in AI

Machine learning (ML) enables computers to improve performance on a task through experience. Two fundamental approaches are:

  • Supervised learning – the model is trained on labelled examples.
  • Unsupervised learning – the model discovers patterns without explicit labels.

Within supervised learning, two key techniques are regression (predicting continuous values) and neural networks trained by back‑propagation.

2. Back‑Propagation of Errors

Back‑propagation is the algorithm used to train multilayer feed‑forward neural networks. It adjusts the weights to minimise the error between the network’s output and the target values.

2.1. The Learning Process

  1. Initialise all weights (often with small random numbers).
  2. Present an input vector \$\mathbf{x}\$ and compute the forward pass to obtain the output \$\mathbf{y}\$.
  3. Calculate the error using a loss function, commonly the sum‑of‑squares:

    \$E = \frac{1}{2}\sum_{k}(t_k - y_k)^2\$

    where \$t_k\$ is the target for output node \$k\$.

  4. Propagate the error backwards through the network, computing the gradient of \$E\$ with respect to each weight.
  5. Update each weight \$w_{ij}\$ using gradient descent:

    \$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}\$

    where \$\eta\$ is the learning rate.

  6. Repeat steps 2–5 for every training example (one complete pass through the training set is an epoch), and repeat epochs until the error is acceptably low. A minimal code sketch of this loop is given below.
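
The sketch below is a minimal illustration of steps 1–6, assuming NumPy, a small 2-3-1 network with sigmoid activations, the sum-of-squares error, and a made-up dataset and learning rate; these choices are illustrative only and not prescribed by the syllabus.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy data: 4 examples, 2 inputs, 1 target each (illustrative values only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])

# Step 1: initialise weights with small random numbers
W1 = rng.normal(scale=0.5, size=(2, 3))   # input -> hidden
b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> output
b2 = np.zeros(1)
eta = 0.5                                  # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        # Step 2: forward pass
        h = sigmoid(x @ W1 + b1)                       # hidden activations
        y = sigmoid(h @ W2 + b2)                       # network output

        # Step 3: sum-of-squares error (useful for monitoring training)
        E = 0.5 * np.sum((t - y) ** 2)

        # Step 4: back-propagate the error terms (deltas)
        delta_out = (y - t) * y * (1 - y)              # output layer
        delta_hid = (delta_out @ W2.T) * h * (1 - h)   # hidden layer

        # Step 5: gradient-descent weight updates
        W2 -= eta * np.outer(h, delta_out)
        b2 -= eta * delta_out
        W1 -= eta * np.outer(x, delta_hid)
        b1 -= eta * delta_hid
```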

2.2. Deriving the Weight Update

For a weight connecting neuron \$i\$ in layer \$l\$ to neuron \$j\$ in layer \$l+1\$, the derivative is:

\$\frac{\partial E}{\partial w_{ij}} = \delta_j^{(l+1)} a_i^{(l)}\$

where \$a_i^{(l)}\$ is the activation of neuron \$i\$ and \$\delta_j^{(l+1)}\$ is the error term for neuron \$j\$, defined as:

\$\$\delta_j^{(l)} =
\begin{cases}
(y_j - t_j) \, f'(net_j) & \text{for output neurons}\\[4pt]
\left(\sum_k \delta_k^{(l+1)} w_{jk}^{(l+1)}\right) f'(net_j) & \text{for hidden neurons}
\end{cases}\$\$

\$f'\$ denotes the derivative of the activation function (e.g., sigmoid, ReLU) evaluated at the neuron's net input \$net_j\$.
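
For example, with the sigmoid activation \$f(net) = \frac{1}{1 + e^{-net}}\$ the derivative can be written in terms of the activation itself, which is what makes it convenient for back-propagation:

\$\$f'(net_j) = f(net_j)\bigl(1 - f(net_j)\bigr), \qquad \text{so for a sigmoid output neuron}\quad \delta_j = (y_j - t_j)\, y_j (1 - y_j).\$\$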

Suggested diagram: A three‑layer neural network showing forward flow of activations and backward flow of error terms.

3. Regression Methods

Regression predicts a continuous dependent variable \$y\$ from one or more independent variables \$\mathbf{x} = (x_1, x_2, \dots, x_p)\$. The most common methods in A‑Level study are linear regression and logistic regression (used for binary classification but often introduced alongside regression concepts).

3.1. Simple Linear Regression

Model: \$y = \beta_0 + \beta_1 x + \varepsilon\$

The coefficients \$\beta_0\$ (intercept) and \$\beta_1\$ (slope) are estimated by minimising the sum of squared residuals:

\$\min_{\beta_0,\beta_1}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\$

Closed‑form solution (ordinary least squares):

\$\$\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad
\beta_0 = \bar{y} - \beta_1\bar{x}\$\$
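
These formulas can be checked with a few lines of Python; the data values below are invented purely for illustration.

```python
import numpy as np

# Illustrative data: hours studied vs. exam mark (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

x_bar, y_bar = x.mean(), y.mean()

# Ordinary least squares estimates from the closed-form solution
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(f"y = {beta0:.2f} + {beta1:.2f} x")   # fitted regression line
```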

3.2. Multiple Linear Regression

Model: \$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\$

where \$\mathbf{X}\$ is an \$n \times (p+1)\$ matrix (including a column of 1’s for the intercept) and \$\boldsymbol{\beta}\$ is the vector of coefficients. The OLS estimate is:

\$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}\$
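
A minimal sketch of this estimate in Python; the design matrix and response values are invented for illustration, and in practice a dedicated least-squares solver is preferred to forming the inverse explicitly.

```python
import numpy as np

# Illustrative data: n = 5 observations, p = 2 predictors (made-up values)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([7.1, 8.0, 11.2, 15.3, 16.9])

# Add the column of 1's for the intercept term
X = np.column_stack([np.ones(len(y)), X_raw])

# OLS estimate: solve (X^T X) beta = X^T y rather than inverting X^T X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # [intercept, beta_1, beta_2]
```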

3.3. Logistic Regression (Binary Classification)

Although the output is categorical, the model is built on a regression framework. The probability of class “1” given \$\mathbf{x}\$ is modelled as:

\$P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}\$

Parameters \$\mathbf{w}, b\$ are found by maximising the log‑likelihood (or equivalently minimising the cross‑entropy loss):

\$\ell(\mathbf{w},b) = \sum_{i=1}^{n}\Big[ y_i\log\sigma(z_i) + (1-y_i)\log(1-\sigma(z_i))\Big]\$

where \$z_i = \mathbf{w}^\top \mathbf{x}_i + b\$.
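
A minimal gradient-descent sketch for fitting these parameters; the dataset, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: 2 features, binary labels (made-up values)
X = np.array([[0.5, 1.2], [1.0, 0.8], [1.5, 2.0],
              [3.0, 2.5], [3.5, 3.0], [4.0, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(2)
b = 0.0
eta = 0.1                          # learning rate

for _ in range(2000):
    z = X @ w + b                  # z_i = w^T x_i + b
    p = sigmoid(z)                 # P(y = 1 | x_i)
    # Gradient of the average cross-entropy (negative log-likelihood)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)
```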

3.4. Comparison of Regression Techniques

| Method | Typical Use-case | Output Type | Loss Function | Training Approach |
|---|---|---|---|---|
| Simple Linear Regression | Predict a single continuous variable from one predictor | Continuous | Sum of Squared Errors (SSE) | Closed-form OLS solution |
| Multiple Linear Regression | Predict a continuous variable from several predictors | Continuous | Sum of Squared Errors (SSE) | Closed-form OLS, or gradient descent for large \$p\$ |
| Logistic Regression | Binary classification (e.g., spam vs. not-spam) | Probability of class 1 (0–1) | Cross-entropy (log-loss) | Iterative optimisation (gradient descent, Newton-Raphson) |
| Neural Network (Back-propagation) | Complex non-linear relationships, image/audio, game playing | Continuous or categorical (depends on output layer) | Mean-square error, cross-entropy, etc. | Iterative gradient descent using back-propagation |

4. Linking Back‑Propagation and Regression

Back‑propagation can be viewed as a generalised regression optimiser. When the network’s output layer uses a linear activation and the loss is SSE, the network is performing a form of linear regression extended to multiple hidden layers. If a sigmoid activation and cross‑entropy loss are used, the network behaves like logistic regression but with hidden layers that capture non‑linear feature transformations.
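
As a concrete check, for a network with no hidden layer, a single linear output \$y_i = \mathbf{w}^\top\mathbf{x}_i + b\$ and SSE loss, the back-propagation update reduces to

\$\$\mathbf{w} \leftarrow \mathbf{w} - \eta\sum_{i=1}^{n}(y_i - t_i)\,\mathbf{x}_i\$\$

which is exactly gradient descent applied to the least-squares loss of Section 3.1.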

5. Key Points to Remember

  • Back‑propagation computes the gradient of the loss with respect to each weight by applying the chain rule layer‑by‑layer.
  • The learning rate \$\eta\$ controls the step size; too large can cause divergence, too small leads to slow convergence.
  • Linear regression has an analytical solution; logistic regression and neural networks require iterative optimisation.
  • Regularisation (e.g., L2 penalty) can be added to the loss to prevent over‑fitting in both regression and neural‑network contexts.

6. Sample Exam Question

Question: A neural network with one hidden layer of three sigmoid neurons is trained to predict house prices (a continuous variable). The output neuron uses a linear activation and the loss function is the sum of squared errors. Explain how back‑propagation updates the weights in this network and relate the process to linear regression.

Answer Outline:

  1. Forward pass computes hidden activations \$h_j = \sigma(\sum_i w_{ij}^{(1)}x_i + b_j^{(1)})\$ and output \$y = \sum_j w_{j}^{(2)}h_j + b^{(2)}\$.
  2. Error term for the output neuron: \$\delta^{(2)} = (y - t)\$ because the derivative of the linear activation is 1.
  3. Error terms for hidden neurons: \$\delta_j^{(1)} = \delta^{(2)} w_{j}^{(2)} \sigma'(net_j)\$.
  4. Weight updates:

    \$w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta \, \delta_j^{(1)} x_i,\$

    \$w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta \, \delta^{(2)} h_j.\$

  5. Because the output layer is linear and the loss is SSE, the network is effectively fitting a linear model to the transformed features \$h_j\$. The hidden layer creates non‑linear basis functions, extending ordinary linear regression to a richer hypothesis space.
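
The answer outline above can be turned into a runnable sketch, assuming NumPy; the house-price figures, scaling, learning rate and network initialisation below are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Illustrative data: one input feature (scaled floor area) and a scaled price target
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5]])
T = np.array([[1.00], [1.50], [1.90], [2.40], [2.80]])

W1 = rng.normal(scale=0.5, size=(1, 3))   # input -> 3 sigmoid hidden neurons
b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> single linear output
b2 = np.zeros(1)
eta = 0.05                                # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        h = sigmoid(x @ W1 + b1)                 # hidden activations h_j
        y = h @ W2 + b2                          # linear output

        delta2 = y - t                           # output delta (linear activation => f' = 1)
        delta1 = (delta2 @ W2.T) * h * (1 - h)   # hidden deltas

        W2 -= eta * np.outer(h, delta2)          # w_j^(2) updates
        b2 -= eta * delta2
        W1 -= eta * np.outer(x, delta1)          # w_ij^(1) updates
        b1 -= eta * delta1
```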