Students will be able to: explain how artificial neural networks support machine learning, analyse their operation, and design a simple ANN solution using pseudocode or a high‑level language (AO1–AO3).
| Syllabus Unit | Content Covered in These Notes | AO Tag |
|---|---|---|
| AI – Neural Networks (Section 18.1) | Perceptron, activation functions, network architecture, forward & backward propagation, loss functions, training, evaluation, regularisation, ethical issues. | AO1, AO2, AO3 |
| Algorithms (Section 9) | Pseudocode for ANN training loop; analysis of computational complexity. | AO2, AO3 |
| Data Structures (Section 8) | Arrays / matrices for weights, biases and activations. | AO2 |
| Programming (Section 11) | Worked example in Python‑style syntax. | AO3 |
| Ethics and Ownership (Section 7) | Bias, privacy, transparency, employment and environmental impact. | AO1 |
Single neuron:
\[ a = \phi\!\left(\sum_{i=1}^{n} w_i x_i + b\right) \]
where $x_i$ are the inputs, $w_i$ the weights, $b$ the bias and $\phi$ the activation function.
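A minimal Python sketch of this formula (the function name `neuron`, the choice of sigmoid and the example values are illustrative only, not prescribed by the syllabus):

```python
import math

def neuron(x, w, b, phi):
    """Compute a = phi(sum_i w_i * x_i + b) for a single neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return phi(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Example: two inputs, weights 0.1 and -0.2, bias 0
print(neuron([1.0, 0.0], [0.1, -0.2], 0.0, sigmoid))  # ≈ 0.525
```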
For a network with parameters $\theta$ (all weights and biases) the loss $L(\theta)$ is minimised by gradient descent:
\[ \theta \leftarrow \theta - \eta \,\nabla_{\theta} L(\theta) \]
where $\eta$ is the learning rate. In practice the gradient is obtained by the back‑propagation algorithm.
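A toy illustration of this update rule (assuming a single parameter and the loss $L(\theta)=\theta^2$, chosen purely for demonstration):

```python
# Gradient descent on L(theta) = theta^2, whose gradient is 2*theta.
theta = 5.0   # initial parameter value (arbitrary)
eta = 0.1     # learning rate

for _ in range(50):
    grad = 2 * theta          # dL/dtheta
    theta = theta - eta * grad

print(theta)  # close to 0, the minimiser of L
```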
For a classification output layer we use soft‑max:
\[ \hat{y}_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}} \]
and cross‑entropy loss:
\[ L = -\sum_{j} y_j \log \hat{y}_j \]
Applying the chain rule gives the simple error term used in back‑propagation:
\[ \frac{\partial L}{\partial z_j} = \hat{y}_j - y_j \]
Thus the “output error” $\delta_O$ in the pseudocode is $\delta_O = \hat{y} - y$.
Numerical stability tip: compute soft‑max as $\hat{y}_j = \frac{e^{z_j - \max(z)}}{\sum_k e^{z_k - \max(z)}}$ to avoid overflow.
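A short Python sketch of the stable soft‑max, the cross‑entropy loss and the resulting output error (NumPy is used for convenience; the example logits and one‑hot target are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max(z) to avoid overflow
    return e / e.sum()

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])   # raw output scores (logits)
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)

print(cross_entropy(y_hat, y))  # loss value
print(y_hat - y)                # output error delta_O used in back-propagation
```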
# -------------------------------------------------
# TRAIN-MLP (X, Y, hiddenSize, epochs, η)
# X : matrix of training inputs (m × n)
# Y : matrix of target outputs (m × k)
# hiddenSize : number of neurons in the hidden layer
# epochs : number of full passes over the data
# η : learning rate
# -------------------------------------------------
INITIALISE weightIH ← random(n, hiddenSize) # input → hidden
INITIALISE biasH ← zeros(1, hiddenSize)
INITIALISE weightHO ← random(hiddenSize, k) # hidden → output
INITIALISE biasO ← zeros(1, k)
FOR epoch FROM 1 TO epochs DO
    FOR each training example (x, y) IN (X, Y) DO
        # ---- Forward pass ----
        zH ← x · weightIH + biasH              # linear combination
        aH ← ReLU(zH)                          # hidden activation
        zO ← aH · weightHO + biasO
        aO ← softmax(zO)                       # output probabilities
        # ---- Compute loss (cross-entropy) ----
        loss ← -Σ_j y_j * log(aO_j)
        # ---- Backward pass ----
        δO ← aO - y                            # output error (see derivation above)
        gradWeightHO ← aHᵀ · δO
        gradBiasO ← δO                         # one bias gradient per output neuron
        δH ← (δO · weightHOᵀ) ⊙ ReLU'(zH)      # hidden error
        gradWeightIH ← xᵀ · δH
        gradBiasH ← δH                         # one bias gradient per hidden neuron
        # ---- Update weights ----
        weightHO ← weightHO - η * gradWeightHO
        biasO ← biasO - η * gradBiasO
        weightIH ← weightIH - η * gradWeightIH
        biasH ← biasH - η * gradBiasH
    END FOR
END FOR
RETURN weightIH, biasH, weightHO, biasO
Complexity per epoch: $O(m·n·\text{hiddenSize} + m·\text{hiddenSize}·k)$. This satisfies the algorithmic analysis requirement for Paper 2.
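One possible NumPy translation of the TRAIN‑MLP pseudocode above (the initialisation scale, random seed and function names are implementation choices, not prescribed by the syllabus):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def softmax(z):
    e = np.exp(z - np.max(z))       # numerically stable soft-max
    return e / e.sum()

def train_mlp(X, Y, hidden_size, epochs, eta):
    """X: (m, n) inputs; Y: (m, k) one-hot targets; returns trained parameters."""
    m, n = X.shape
    k = Y.shape[1]
    rng = np.random.default_rng(0)
    weight_ih = rng.normal(0.0, 0.1, (n, hidden_size))   # input -> hidden
    bias_h = np.zeros(hidden_size)
    weight_ho = rng.normal(0.0, 0.1, (hidden_size, k))    # hidden -> output
    bias_o = np.zeros(k)

    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Forward pass
            z_h = x @ weight_ih + bias_h
            a_h = relu(z_h)
            z_o = a_h @ weight_ho + bias_o
            a_o = softmax(z_o)
            # Backward pass
            delta_o = a_o - y                              # output error
            grad_w_ho = np.outer(a_h, delta_o)
            grad_b_o = delta_o
            delta_h = (delta_o @ weight_ho.T) * relu_grad(z_h)
            grad_w_ih = np.outer(x, delta_h)
            grad_b_h = delta_h
            # Gradient-descent update
            weight_ho -= eta * grad_w_ho
            bias_o -= eta * grad_b_o
            weight_ih -= eta * grad_w_ih
            bias_h -= eta * grad_b_h

    return weight_ih, bias_h, weight_ho, bias_o
```

As in the pseudocode, a call such as `train_mlp(X, Y, hidden_size=4, epochs=100, eta=0.1)` expects `X` of shape (m, n) and one‑hot targets `Y` of shape (m, k).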
Network: 2 inputs → 1 hidden neuron (ReLU) → 1 output neuron (sigmoid). Learning rate $\eta = 0.5$.
| Step | Values |
|---|---|
| Initial weights & biases | $w_{1}=0.1,\; w_{2}=-0.2,\; b_h=0$ (hidden); $w_{h}=0.3,\; b_o=0$ (output) |
| Training sample | $x=[1,0]$, target $y=1$ |
| Forward – hidden | $z_h = 1·0.1 + 0·(-0.2) + 0 = 0.1$; $a_h = \text{ReLU}(0.1)=0.1$ |
| Forward – output | $z_o = 0.1·0.3 + 0 = 0.03$; $a_o = \sigma(0.03)=0.5075$ |
| Loss (MSE) | $L = \frac12 (y-a_o)^2 = 0.5(1-0.5075)^2 = 0.121$ |
| Backward – output error | $\delta_o = (a_o-y)\,\sigma'(z_o)$; $\sigma'(z_o)=a_o(1-a_o)=0.249$; $\delta_o = (0.5075-1)·0.249 = -0.122$ |
| Gradients (output) | $\frac{\partial L}{\partial w_h}=a_h·\delta_o = 0.1·(-0.122)= -0.0122$; $\frac{\partial L}{\partial b_o}= \delta_o = -0.122$ |
| Backward – hidden error | $\delta_h = \delta_o·w_h·\text{ReLU}'(z_h)$; ReLU' = 1 (since $z_h>0$); $\delta_h = (-0.122)·0.3 = -0.0366$ |
| Gradients (hidden) | $\frac{\partial L}{\partial w_1}=x_1·\delta_h = 1·(-0.0366)= -0.0366$; $\frac{\partial L}{\partial w_2}=x_2·\delta_h = 0$; $\frac{\partial L}{\partial b_h}= \delta_h = -0.0366$ |
| Weight update | $w_h \leftarrow 0.3 - 0.5·(-0.0122)=0.306$; $b_o \leftarrow 0 - 0.5·(-0.122)=0.061$; $w_1 \leftarrow 0.1 - 0.5·(-0.0366)=0.118$; $w_2$ unchanged; $b_h \leftarrow 0 - 0.5·(-0.0366)=0.018$ |
Repeating the process for the remaining training examples drives the network toward the desired mapping.
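The Python snippet below reproduces the worked‑example arithmetic (sigmoid output and MSE loss, as stated above, rather than the soft‑max/cross‑entropy pair used in the pseudocode). The table rounds $\delta_o$ before the updates, so the last digit of some values may differ slightly:

```python
import math

w1, w2, b_h = 0.1, -0.2, 0.0        # hidden-layer weights and bias
w_h, b_o = 0.3, 0.0                 # output-layer weight and bias
x, y, eta = [1.0, 0.0], 1.0, 0.5    # training sample and learning rate

z_h = x[0] * w1 + x[1] * w2 + b_h   # 0.1
a_h = max(0.0, z_h)                 # ReLU -> 0.1
z_o = a_h * w_h + b_o               # 0.03
a_o = 1.0 / (1.0 + math.exp(-z_o))  # sigmoid -> ~0.5075
loss = 0.5 * (y - a_o) ** 2         # ~0.121

delta_o = (a_o - y) * a_o * (1 - a_o)                # ~ -0.12
delta_h = delta_o * w_h * (1.0 if z_h > 0 else 0.0)  # ReLU' = 1 here

w_h -= eta * a_h * delta_o          # -> ~0.306
b_o -= eta * delta_o                # -> ~0.06
w1  -= eta * x[0] * delta_h         # -> ~0.118
b_h -= eta * delta_h                # -> ~0.018

print(round(a_o, 4), round(loss, 3), round(w_h, 3), round(w1, 3), round(b_h, 3))
```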
| Network Type | Typical Input Data | Key Architectural Feature | Parameter‑count Consideration | Typical Applications | Training Nuance |
|---|---|---|---|---|---|
| Multilayer Perceptron (MLP) | Fixed‑size tabular or vector data | Fully‑connected layers; static input size | Often high ($n_{\text{in}}\times n_{\text{out}}$ weights per fully‑connected layer) | Simple classification/regression, fraud detection | Standard back‑propagation; sensitive to feature scaling |
| Convolutional Neural Network (CNN) | Images, video frames, spatial grids | Learnable convolutional filters + pooling; weight sharing | Much lower than MLP for comparable depth (filters reused) | Image classification, object detection, medical imaging | Requires data augmentation; often uses batch normalisation |
| Recurrent Neural Network (RNN) | Sequences of arbitrary length (text, audio, sensor streams) | Hidden state passed forward in time | Parameters grow with hidden size, but not with sequence length | Language modelling, speech recognition, time‑series forecasting | Back‑propagation through time; prone to vanishing gradients |
| Long Short‑Term Memory (LSTM) | Long‑range sequential data | Gated cells (input, forget, output) control information flow | Similar to RNN but with extra gate parameters | Machine translation, video captioning, stock prediction | Mitigates vanishing gradient; slower per step than vanilla RNN |
| Transformer | Very long sequences (text, code, protein strings) | Self‑attention enables parallel processing of the whole sequence | Very high (multi‑head attention matrices) – usually pre‑trained | Large‑scale language models, translation, summarisation | Requires massive data & compute; fine‑tuning is common |
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | \(\displaystyle \frac{TP+TN}{TP+FP+FN+TN}\) | Balanced class distribution |
| Precision | \(\displaystyle \frac{TP}{TP+FP}\) | Cost of false positives is high |
| Recall (Sensitivity) | \(\displaystyle \frac{TP}{TP+FN}\) | Cost of false negatives is high |
| F‑score | \(\displaystyle 2\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\) | Balance between precision & recall |
| ROC‑AUC | Area under the Receiver Operating Characteristic curve | Binary classifiers; threshold‑independent |
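A short Python sketch computing these metrics from confusion‑matrix counts (the example counts are invented purely for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_score

print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
# (0.85, 0.8, 0.888..., 0.842...)
```

ROC‑AUC is not included in the sketch because it is computed from ranked prediction scores across all thresholds rather than from a single set of counts.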
**Scenario:** *A secondary school wants to introduce an AI system that automatically grades student essays for plagiarism and writing quality.*
Task (AO1): Identify **two** ethical issues and suggest **one** mitigation for each.