Published by Patrick Mutisya · 14 days ago
Objective: Show understanding of back‑propagation of errors and regression methods in machine learning.
Machine learning (ML) enables computers to improve performance on a task through experience. Two fundamental approaches are supervised learning, where the model learns from labelled examples, and unsupervised learning, where it finds structure in unlabelled data.
Within supervised learning, two key techniques are regression (predicting continuous values) and neural networks trained by back‑propagation.
Back‑propagation is the algorithm used to train multilayer feed‑forward neural networks. It adjusts the weights to minimise the error between the network’s output and the target values.
For a single training example the error is commonly measured as:
\$E = \frac{1}{2}\sum_{k}(t_k - y_k)^2\$
where \$t_k\$ is the target for output node \$k\$ and \$y_k\$ is that node's actual output.
Each weight is then adjusted by gradient descent:
\$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}\$
where \$\eta\$ is the learning rate.
For a weight connecting neuron \$i\$ in layer \$l\$ to neuron \$j\$ in layer \$l+1\$, the derivative is:
\$\frac{\partial E}{\partial w_{ij}} = \delta_j^{(l+1)} a_i^{(l)}\$
where \$a_i^{(l)}\$ is the activation of neuron \$i\$ and \$\delta_j^{(l+1)}\$ is the error term for neuron \$j\$, defined as:
\$\$\delta_j^{(l)} =
\begin{cases}
(y_j - t_j) \, f'(net_j) & \text{for output neurons}\\[4pt]
\left(\sum_k \delta_k^{(l+1)} w_{jk}^{(l+1)}\right) f'(net_j) & \text{for hidden neurons}
\end{cases}\$\$
\$f'\$ denotes the derivative of the activation function (e.g., sigmoid, ReLU) evaluated at the neuron's net input \$net_j\$.
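The delta recursion above maps directly onto code. Below is a minimal sketch of one back‑propagation step for a network with a single sigmoid hidden layer and sigmoid outputs, using the squared‑error loss defined earlier; the names `W1`, `W2`, `x`, `t` and the learning rate are illustrative assumptions rather than notation fixed by the notes above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One gradient-descent step on E = 0.5 * sum_k (t_k - y_k)^2."""
    # Forward pass
    a1 = sigmoid(W1 @ x)          # hidden activations a_i
    y = sigmoid(W2 @ a1)          # network outputs y_k

    # Error terms (deltas), following the case formula above;
    # for a sigmoid, f'(net) = f(net) * (1 - f(net))
    delta2 = (y - t) * y * (1 - y)            # output neurons
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # hidden neurons

    # dE/dw_ij = delta_j * a_i, then the gradient-descent update
    W2 = W2 - eta * np.outer(delta2, a1)
    W1 = W1 - eta * np.outer(delta1, x)
    return W1, W2

# Example call: 2 inputs, 3 hidden neurons, 1 output (random toy weights)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
W1, W2 = backprop_step(np.array([0.5, -1.0]), np.array([1.0]), W1, W2)
```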
Regression predicts a continuous dependent variable \$y\$ from one or more independent variables \$\mathbf{x} = (x_1, x_2, \dots, x_p)\$. The most common methods in A‑Level study are linear regression and logistic regression (used for binary classification but often introduced alongside regression concepts).
Simple linear regression model: \$y = \beta_0 + \beta_1 x + \varepsilon\$
The coefficients \$\beta_0\$ (intercept) and \$\beta_1\$ (slope) are estimated by minimising the sum of squared residuals:
\$\min_{\beta_0,\beta_1}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\$
Closed‑form solution (ordinary least squares):
\$\$\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad
\beta_0 = \bar{y} - \beta_1\bar{x}\$\$
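As a quick illustration, the two formulas can be evaluated directly in NumPy; the data points below are made up purely to show the calculation.

```python
import numpy as np

# Made-up sample: predictor x and response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
beta0 = y_bar - beta1 * x_bar                                         # intercept

print(f"y ≈ {beta0:.3f} + {beta1:.3f} x")
```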
Multiple linear regression model: \$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\$
where \$\mathbf{X}\$ is an \$n \times (p+1)\$ matrix (including a column of 1’s for the intercept) and \$\boldsymbol{\beta}\$ is the vector of coefficients. The OLS estimate is:
\$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}\$
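A sketch of the matrix form on made-up data is shown below. Rather than inverting \$\mathbf{X}^\top \mathbf{X}\$ explicitly, it solves the equivalent normal equations \$(\mathbf{X}^\top \mathbf{X})\boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}\$, which is the numerically safer route.

```python
import numpy as np

# Made-up data: n = 4 observations, p = 2 predictors
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0]])
y = np.array([5.1, 4.9, 11.2, 10.8])

# Prepend a column of 1's for the intercept, giving an n x (p+1) design matrix
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Solve (X^T X) beta = X^T y, equivalent to the closed-form OLS estimate
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # [intercept, beta_1, beta_2]
```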
Logistic regression: although the output is categorical, the model is built on a regression framework. The probability of class “1” given \$\mathbf{x}\$ is modelled as:
\$P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}\$
Parameters \$\mathbf{w}, b\$ are found by maximising the log‑likelihood (or equivalently minimising the cross‑entropy loss):
\$\ell(\mathbf{w},b) = \sum_{i=1}^{n}\Big[ y_i\log\sigma(z_i) + (1-y_i)\log(1-\sigma(z_i))\Big]\$
where \$z_i = \mathbf{w}^\top \mathbf{x}_i + b\$.
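Unlike OLS, the log‑likelihood has no closed‑form maximiser, so the parameters are found iteratively. The sketch below uses plain gradient descent on the mean cross‑entropy; the toy data, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data: 4 samples, 2 features
X = np.array([[0.5, 1.2], [1.0, 0.8], [2.2, 2.9], [3.0, 2.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b, eta = np.zeros(X.shape[1]), 0.0, 0.1

for _ in range(5000):
    p = sigmoid(X @ w + b)           # P(y = 1 | x) for every sample
    grad_w = X.T @ (p - y) / len(y)  # gradient of mean cross-entropy w.r.t. w
    grad_b = np.mean(p - y)          # ... and w.r.t. b
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b, sigmoid(X @ w + b))      # fitted parameters and predicted probabilities
```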
| Method | Typical Use‑case | Output Type | Loss Function | Training Approach |
|---|---|---|---|---|
| Simple Linear Regression | Predict a single continuous variable from one predictor | Continuous | Sum of Squared Errors (SSE) | Closed‑form OLS solution |
| Multiple Linear Regression | Predict a continuous variable from several predictors | Continuous | Sum of Squared Errors (SSE) | Closed‑form OLS or gradient descent for large \$p\$ |
| Logistic Regression | Binary classification (e.g., spam vs. not‑spam) | Probability of class 1 (0–1) | Cross‑entropy (log‑loss) | Iterative optimisation (gradient descent, Newton‑Raphson) |
| Neural Network (Back‑propagation) | Complex non‑linear relationships, image/audio, game playing | Continuous or categorical (depends on output layer) | Mean‑square error, cross‑entropy, etc. | Iterative gradient descent using back‑propagation |
Back‑propagation can be viewed as a generalised regression optimiser. When the network’s output layer uses a linear activation and the loss is SSE, the network is performing a form of linear regression extended to multiple hidden layers. If a sigmoid activation and cross‑entropy loss are used, the network behaves like logistic regression but with hidden layers that capture non‑linear feature transformations.
Question: A neural network with one hidden layer of three sigmoid neurons is trained to predict house prices (a continuous variable). The output neuron uses a linear activation and the loss function is the sum of squared errors. Explain how back‑propagation updates the weights in this network and relate the process to linear regression.
Answer Outline:
1. Forward pass: compute the hidden activations \$h_j = \sigma(net_j)\$ and the output \$y = \sum_j w_j^{(2)} h_j\$ (linear activation).
2. Output error term: with a linear output and SSE loss, \$f'(net) = 1\$, so \$\delta^{(2)} = y - t\$.
3. Back‑propagate to the hidden layer: \$\delta_j^{(1)} = \delta^{(2)} \, w_j^{(2)} \, \sigma'(net_j)\$.
4. Gradient‑descent updates: \$w_{ij}^{(1)} \leftarrow w_{ij}^{(1)} - \eta \, \delta_j^{(1)} x_i\$ and \$w_{j}^{(2)} \leftarrow w_{j}^{(2)} - \eta \, \delta^{(2)} h_j\$.
5. Relation to linear regression: the output neuron performs a linear regression of the price on the hidden activations \$h_j\$ (same SSE loss, linear model), while back‑propagation additionally learns the non‑linear features \$h_j\$ through the hidden‑layer weights.
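The outline can be checked with a small simulation. This sketch implements the architecture in the question: three sigmoid hidden neurons, one linear output neuron, SSE loss and back‑propagation updates (biases omitted for brevity); the "house" data is synthetic and only there to make the code runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))        # e.g. normalised floor area, rooms
t = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.5    # made-up "price" targets

W1 = rng.normal(scale=0.5, size=(3, 2))    # input -> hidden weights
w2 = rng.normal(scale=0.5, size=3)         # hidden -> output weights
eta = 0.05

for _ in range(2000):
    for x, target in zip(X, t):
        h = sigmoid(W1 @ x)                # hidden activations h_j
        y = w2 @ h                         # linear output neuron

        delta_out = y - target             # linear activation: f'(net) = 1
        delta_hidden = delta_out * w2 * h * (1 - h)

        # The output-layer update is a gradient step for a linear regression
        # of the price on the hidden activations h_j.
        w2 -= eta * delta_out * h
        W1 -= eta * np.outer(delta_hidden, x)

preds = np.array([w2 @ sigmoid(W1 @ xi) for xi in X])
print("final SSE:", 0.5 * np.sum((t - preds) ** 2))
```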