Learning objective: Demonstrate a clear understanding of back‑propagation of errors and of the main regression methods used in machine learning, and explain how the two are linked.
Simple linear regression model
\$y = \beta_0 + \beta_1 x + \varepsilon\$
Coefficients are obtained by minimising the sum of squared residuals (ordinary least squares):
\$\min_{\beta_0,\beta_1}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2\$
Closed‑form solution
\$\$\beta_1 = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i}(x_i-\bar{x})^2},
\qquad
\beta_0 = \bar{y}-\beta_1\bar{x}\$\$
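As an illustration, the closed-form estimates can be computed directly in NumPy; the data values below are made up purely for demonstration:

```python
import numpy as np

# Toy data: one predictor x, continuous target y (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates from the formulae above
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)   # intercept ~ 0.30, slope ~ 1.94 for this toy data
```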
Matrix form
\$\mathbf{y}= \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\$
where \(\mathbf{X}\) is the \(n\times(p+1)\) design matrix (first column = 1 for the intercept). The OLS estimate is
\$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\!T}\mathbf{X})^{-1}\mathbf{X}^{\!T}\mathbf{y}\$
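A short sketch of the matrix-form estimate with made-up numbers; in practice `np.linalg.lstsq` is numerically preferable, but the line below mirrors the formula directly:

```python
import numpy as np

# Illustrative design matrix: n = 4 observations, p = 2 predictors,
# with a leading column of ones for the intercept
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])
# Targets generated exactly as y = 1 + 2*x1 + 1.5*x2
y = np.array([6.0, 6.5, 13.0, 13.5])

# OLS estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)   # ~ [1.0, 2.0, 1.5] = [intercept, coef of x1, coef of x2]
```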
Logistic regression: probability of class 1 given a feature vector \(\mathbf{x}\)
\$\$P(y=1\mid\mathbf{x}) = \sigma(\mathbf{w}^{\!T}\mathbf{x}+b)
= \frac{1}{1+e^{-(\mathbf{w}^{\!T}\mathbf{x}+b)}}\$\$
Parameters \(\mathbf{w},b\) are estimated by maximising the log‑likelihood (equivalently, minimising cross‑entropy loss):
\$\$\ell(\mathbf{w},b)=\sum_{i=1}^{n}\Big[y_i\log\sigma(z_i)+(1-y_i)\log\bigl(1-\sigma(z_i)\bigr)\Big],
\qquad z_i=\mathbf{w}^{\!T}\mathbf{x}_i+b\$\$
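There is no closed form here, so the parameters are found iteratively; a minimal gradient-ascent sketch on the log-likelihood (toy data, assumed learning rate) might look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary labels (illustrative values)
X = np.array([[0.5, 1.2], [1.0, 0.3], [2.0, 2.5], [3.0, 1.8]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = np.zeros(X.shape[1]), 0.0
eta = 0.1                        # learning rate (assumed)

for _ in range(1000):
    p = sigmoid(X @ w + b)       # predicted P(y = 1 | x)
    # Gradient of the log-likelihood: sum_i (y_i - p_i) x_i
    w += eta * (X.T @ (y - p))
    b += eta * np.sum(y - p)

print(sigmoid(X @ w + b))        # fitted probabilities, close to the labels
```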
| Method | Typical use‑case | Output type | Loss function | Training approach |
|---|---|---|---|---|
| Simple linear regression | One predictor → continuous target | Continuous | Sum of squared errors (SSE) | Closed‑form OLS |
| Multiple linear regression | Several predictors → continuous target | Continuous | SSE | Closed‑form OLS or GD for large \(p\) |
| Logistic regression | Binary classification (e.g. spam / not‑spam) | Probability (0–1) | Cross‑entropy (log‑loss) | Iterative optimisation (GD, Newton‑Raphson) |
| Neural network (back‑propagation) | Complex non‑linear relationships, image/audio, game playing | Continuous or categorical (depends on output layer) | MSE, cross‑entropy, etc. | Iterative GD using back‑propagation |
Back‑propagation computes the gradient of a chosen loss function with respect to every weight in a multilayer feed‑forward network, then updates the weights by gradient descent so that the overall error is reduced.
For one training pattern the SSE loss is
\$E=\frac12\sum_{k}(t_k-y_k)^2,\$
where \(t_k\) is the target for output node \(k\) and \(y_k\) its actual output. The error signal (delta) of neuron \(j\) is defined as
\$\delta_j = \frac{\partial E}{\partial net_j},\$
and each weight on the connection from neuron \(i\) to neuron \(j\) is updated by
\$w_{ij}\leftarrow w_{ij}-\eta\,\delta_j\,a_i,\$
where \(\eta\) is the learning rate and \(a_i\) the activation of the source neuron.
For a neuron in layer \(\ell\) that feeds forward to layer \(\ell+1\):
\$\$\delta_j^{(\ell)}=
\begin{cases}
(y_j-t_j)\,f'(net_j) & \text{output neuron}\\[4pt]
\left(\displaystyle\sum_{k}\delta_k^{(\ell+1)}w_{jk}^{(\ell+1)}\right)f'(net_j) & \text{hidden neuron}
\end{cases}\$\$
\(f'\) is the derivative of the activation function (e.g. sigmoid: \(\sigma'(z)=\sigma(z)[1-\sigma(z)]\)).
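A brief sketch of how these δ-terms can be computed for a single hidden layer, assuming sigmoid hidden units and one linear output (all numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative quantities for one backward pass
h = sigmoid(np.array([0.2, -0.5, 1.0]))   # hidden activations f(net_j)
w_out = np.array([0.4, -0.3, 0.7])        # hidden-to-output weights
y_out, t = 0.8, 1.0                       # network output and target

# Output delta for SSE with a linear output: f'(net) = 1
delta_out = y_out - t

# Hidden deltas: propagate delta_out back through the outgoing weights,
# then multiply by f'(net_j) = h_j(1 - h_j) for the sigmoid
delta_hidden = delta_out * w_out * h * (1 - h)
print(delta_hidden)
```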
Worked example: a network with one hidden layer of three sigmoid neurons and a single linear output. Assume:
\[
h_j = \sigma\!\Big(\sum_i w_{ij}^{(1)}x_i + b_j^{(1)}\Big),\qquad j=1,2,3
\]
\[
y = \sum_{j=1}^{3} w_{j}^{(2)}h_j + b^{(2)}.
\]
\[
\delta^{(2)} = \frac{\partial E}{\partial y}=y-t.
\]
\[
\delta_j^{(1)} = \delta^{(2)}\,w_{j}^{(2)}\,\sigma'(net_j),\qquad
\sigma'(net_j)=h_j(1-h_j).
\]
\[
w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta\,\delta_j^{(1)}x_i,
\qquad
b_j^{(1)} \leftarrow b_j^{(1)} - \eta\,\delta_j^{(1)}.
\]
\[
w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta\,\delta^{(2)}h_j,
\qquad
b^{(2)} \leftarrow b^{(2)} - \eta\,\delta^{(2)}.
\]

Question: A neural network with one hidden layer of three sigmoid neurons is trained to predict house prices (a continuous variable). The output neuron uses a linear activation and the loss function is the sum of squared errors. Explain how back‑propagation updates the weights in this network and relate the process to linear regression.
Answer outline (full marks):
\[
h_j=\sigma\!\Big(\sum_i w_{ij}^{(1)}x_i+b_j^{(1)}\Big),\qquad
y=\sum_j w_{j}^{(2)}h_j+b^{(2)}.
\]
\[
\delta^{(2)} = y-t.
\]
\[
\delta_j^{(1)} = \delta^{(2)}\,w_{j}^{(2)}\,h_j(1-h_j).
\]
\[
w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta\,\delta_j^{(1)}x_i,\qquad
w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta\,\delta^{(2)}h_j,
\]
and similarly for the biases. To relate this to linear regression: the output layer \(y=\sum_j w_{j}^{(2)}h_j+b^{(2)}\) is itself a linear regression on the hidden activations \(h_j\), and with the SSE loss the update of \(w_{j}^{(2)}\) is exactly the gradient-descent step for linear regression; back-propagation simply extends the same rule to the hidden weights via the chain rule.
| Syllabus element | Covered? | Action if “no” |
|---|---|---|
| Definition of supervised vs. unsupervised learning | ✔ | ‑ |
| Purpose of regression in ML (continuous prediction) | ✔ | ‑ |
| Simple & multiple linear regression – formulae & OLS solution | ✔ | ‑ |
| Logistic regression – sigmoid, cross‑entropy, optimisation | ✔ | ‑ |
| Back‑propagation algorithm – steps, weight‑update rule, δ‑terms | ✔ | ‑ |
| Link between back‑propagation and regression (linear & logistic cases) | ✔ | ‑ |
| Key exam points (learning rate, regularisation, convergence) | ✔ | ‑ |
| Worked example & diagram of a network with one hidden layer | ✔ | ‑ |
| Sample A‑Level style question & full answer outline | ✔ | ‑ |
When all checks are green, the notes are ready for classroom delivery or revision.