Students will be able to: explain how artificial neural networks support machine learning, analyse their operation, and design a simple ANN solution using pseudocode or a high‑level language (AO1–AO3).
| Syllabus Unit | Content Covered in These Notes | AO Tag |
|---|---|---|
| AI – Neural Networks (Section 18.1) | Perceptron, activation functions, network architecture, forward & backward propagation, loss functions, training, evaluation, regularisation, ethical issues. | AO1, AO2, AO3 |
| Algorithms (Section 9) | Pseudocode for ANN training loop; analysis of computational complexity. | AO2, AO3 |
| Data Structures (Section 8) | Arrays / matrices for weights, biases and activations. | AO2 |
| Programming (Section 11) | Worked example in Python‑style syntax. | AO3 |
| Ethics and Ownership (Section 7) | Bias, privacy, transparency, employment and environmental impact. | AO1 |
Single neuron:
\[
a = \phi\!\left(\sum_{i=1}^{n} w_i x_i + b\right)
\]
where \(x_i\) are the inputs, \(w_i\) the weights, \(b\) the bias and \(\phi\) the activation function.
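A minimal Python sketch of this single‑neuron computation (NumPy and a sigmoid activation are assumed here purely for illustration; the values are not from the syllabus):

```python
import numpy as np

def sigmoid(z):
    """Example activation function phi."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, phi=sigmoid):
    """Compute a = phi(sum_i w_i * x_i + b) for a single neuron."""
    return phi(np.dot(w, x) + b)

# Illustrative inputs, weights and bias
x = np.array([1.0, 0.0])
w = np.array([0.1, -0.2])
print(neuron_output(x, w, b=0.0))   # sigmoid(0.1) ≈ 0.525
```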
For a network with parameters \(\theta\) (all weights and biases) the loss \(L(\theta)\) is minimised by gradient descent:
\[
\theta \leftarrow \theta - \eta \,\nabla_{\theta} L(\theta)
\]
\(\eta\) is the learning rate. In practice the gradient is obtained by the back‑propagation algorithm.
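The update rule itself is only a few lines of Python. In the sketch below the gradient function and the toy loss \(L(\theta)=\theta^2\) are assumed purely for illustration; for a real network the gradient would come from back‑propagation:

```python
import numpy as np

def gradient_descent_step(theta, grad, eta):
    """One update: theta <- theta - eta * grad(theta)."""
    return theta - eta * grad(theta)

# Toy example: minimise L(theta) = theta^2, whose gradient is 2*theta
theta = np.array([1.0])
for _ in range(10):
    theta = gradient_descent_step(theta, lambda t: 2 * t, eta=0.1)
print(theta)   # moves toward the minimum at 0
```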
For a classification output layer we use soft‑max:
\[
\hat{y}_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}
\]
and cross‑entropy loss:
\[
L = -\sum_{j} y_j \log \hat{y}_j
\]
Applying the chain rule gives the simple error term used in back‑propagation:
\[
\frac{\partial L}{\partial z_j}= \hat{y}_j - y_j
\]
Thus the “output error” δO in the pseudocode is \(\delta_O = \hat{y} - y\).
Numerical stability tip: compute soft‑max as \(\hat{y}_j = \frac{e^{z_j - \max(z)}}{\sum_k e^{z_k - \max(z)}}\) to avoid overflow.
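A short Python sketch (NumPy assumed) of the stable soft‑max, the cross‑entropy loss and the resulting output error \(\hat{y}-y\); the logits and target used are purely illustrative:

```python
import numpy as np

def softmax(z):
    """Numerically stable soft-max: subtract max(z) before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y_hat, y):
    """L = -sum_j y_j * log(y_hat_j) for a one-hot target y."""
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])     # illustrative logits
y = np.array([1.0, 0.0, 0.0])     # one-hot target
y_hat = softmax(z)
print(cross_entropy(y_hat, y))    # scalar loss
print(y_hat - y)                  # dL/dz, the output error delta_O
```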
# -------------------------------------------------
# TRAIN-MLP (X, Y, hiddenSize, epochs, η)
# X : matrix of training inputs (m × n)
# Y : matrix of target outputs (m × k)
# hiddenSize : number of neurons in the hidden layer
# epochs : number of full passes over the data
# η : learning rate
# -------------------------------------------------
INITIALISE weightIH ← random(n, hiddenSize) # input → hidden
INITIALISE biasH ← zeros(1, hiddenSize)
INITIALISE weightHO ← random(hiddenSize, k) # hidden → output
INITIALISE biasO ← zeros(1, k)
FOR epoch FROM 1 TO epochs DO
FOR each training example (x, y) IN (X, Y) DO
# ---- Forward pass ----
zH ← x · weightIH + biasH # linear combo
aH ← ReLU(zH) # hidden activation
zO ← aH · weightHO + biasO
aO ← softmax(zO) # output probabilities
# ---- Compute loss (cross‑entropy) ----
loss ← – Σj yj * log(aO_j)
# ---- Backward pass ----
δO ← aO – y # output error (derivation box)
gradWeightHO ← aHᵀ · δO
gradBiasO ← δO # bias gradient equals the output error
δH ← (δO · weightHOᵀ) ⊙ ReLU'(zH) # hidden error
gradWeightIH ← xᵀ · δH
gradBiasH ← δH # bias gradient equals the hidden error
# ---- Update weights ----
weightHO ← weightHO – η * gradWeightHO
biasO ← biasO – η * gradBiasO
weightIH ← weightIH – η * gradWeightIH
biasH ← biasH – η * gradBiasH
END FOR
END FOR
RETURN weightIH, biasH, weightHO, biasO
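A runnable NumPy sketch of the same training loop, following the pseudocode step by step; the random seed, initial weight scale and Python names are assumptions made for illustration only:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def train_mlp(X, Y, hidden_size, epochs, eta):
    """Train a one-hidden-layer MLP with per-example gradient descent."""
    m, n = X.shape
    k = Y.shape[1]
    rng = np.random.default_rng(0)
    weight_ih = rng.normal(0, 0.1, (n, hidden_size))   # input -> hidden
    bias_h = np.zeros(hidden_size)
    weight_ho = rng.normal(0, 0.1, (hidden_size, k))    # hidden -> output
    bias_o = np.zeros(k)

    for _ in range(epochs):
        for x, y in zip(X, Y):
            # ---- Forward pass ----
            z_h = x @ weight_ih + bias_h
            a_h = relu(z_h)
            z_o = a_h @ weight_ho + bias_o
            a_o = softmax(z_o)
            loss = -np.sum(y * np.log(a_o))   # cross-entropy, monitoring only
            # ---- Backward pass ----
            delta_o = a_o - y                            # output error
            grad_w_ho = np.outer(a_h, delta_o)
            grad_b_o = delta_o
            delta_h = (delta_o @ weight_ho.T) * relu_grad(z_h)
            grad_w_ih = np.outer(x, delta_h)
            grad_b_h = delta_h
            # ---- Update weights ----
            weight_ho -= eta * grad_w_ho
            bias_o -= eta * grad_b_o
            weight_ih -= eta * grad_w_ih
            bias_h -= eta * grad_b_h

    return weight_ih, bias_h, weight_ho, bias_o
```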
Complexity per epoch: \(O(m·n·\text{hiddenSize} + m·\text{hiddenSize}·k)\). This satisfies the algorithmic analysis requirement for Paper 2.
Network: 2 inputs → 1 hidden neuron (ReLU) → 1 output neuron (sigmoid). Learning rate \(\eta = 0.5\).
| Step | Values |
|---|---|
| Initial weights & biases | \(w_1=0.1,\; w_2=-0.2,\; b_h=0\) (hidden); \(w_h=0.3,\; b_o=0\) (output) |
| Training sample | \(x=[1,0]\), target \(y=1\) |
| Forward – hidden | \(z_h = 1·0.1 + 0·(-0.2) + 0 = 0.1\); \(a_h = \text{ReLU}(0.1)=0.1\) |
| Forward – output | \(z_o = 0.1·0.3 + 0 = 0.03\); \(a_o = \sigma(0.03)=0.5075\) |
| Loss (MSE) | \(L = \frac12 (y-a_o)^2 = 0.5(1-0.5075)^2 = 0.121\) |
| Backward – output error | \(\delta_o = (a_o-y)\,\sigma'(z_o)\); \(\sigma'(z_o)=a_o(1-a_o)=0.249\); \(\delta_o = (0.5075-1)·0.249 = -0.122\) |
| Gradients (output) | \(\frac{\partial L}{\partial w_h}=a_h·\delta_o = 0.1·(-0.122)= -0.0122\); \(\frac{\partial L}{\partial b_o}= \delta_o = -0.122\) |
| Backward – hidden error | \(\delta_h = \delta_o·w_h·\text{ReLU}'(z_h)\); \(\text{ReLU}'=1\) (since \(z_h>0\)); \(\delta_h = (-0.122)·0.3 = -0.0366\) |
| Gradients (hidden) | \(\frac{\partial L}{\partial w_1}=x_1·\delta_h = 1·(-0.0366)= -0.0366\); \(\frac{\partial L}{\partial w_2}=x_2·\delta_h = 0\); \(\frac{\partial L}{\partial b_h}= \delta_h = -0.0366\) |
| Weight update | \(w_h \leftarrow 0.3 - 0.5·(-0.0122)=0.306\); \(b_o \leftarrow 0 - 0.5·(-0.122)=0.061\); \(w_1 \leftarrow 0.1 - 0.5·(-0.0366)=0.118\); \(w_2\) unchanged; \(b_h \leftarrow 0 - 0.5·(-0.0366)=0.018\) |
Repeating the process for the remaining training examples drives the network toward the desired mapping.
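The hand calculation can be checked with a short Python script. It uses only the values from the table; small differences in the final digit (e.g. \(b_o\)) come from the three‑decimal rounding used in the table:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Initial parameters and training sample from the table
w1, w2, b_h = 0.1, -0.2, 0.0
w_h, b_o = 0.3, 0.0
x1, x2, y = 1.0, 0.0, 1.0
eta = 0.5

# Forward pass
z_h = x1 * w1 + x2 * w2 + b_h            # 0.1
a_h = max(0.0, z_h)                       # ReLU -> 0.1
z_o = a_h * w_h + b_o                     # 0.03
a_o = sigmoid(z_o)                        # ≈ 0.5075

# Backward pass (MSE loss)
delta_o = (a_o - y) * a_o * (1 - a_o)     # ≈ -0.123 (table rounds to -0.122)
delta_h = delta_o * w_h * (1.0 if z_h > 0 else 0.0)   # ≈ -0.037

# Updates
w_h -= eta * a_h * delta_o                # ≈ 0.306
b_o -= eta * delta_o                      # ≈ 0.062 (table: 0.061)
w1  -= eta * x1 * delta_h                 # ≈ 0.118
b_h -= eta * delta_h                      # ≈ 0.018
print(round(w_h, 3), round(b_o, 3), round(w1, 3), round(b_h, 3))
```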
| Network Type | Typical Input Data | Key Architectural Feature | Parameter‑count Consideration | Typical Applications | Training Nuance |
|---|---|---|---|---|---|
| Multilayer Perceptron (MLP) | Fixed‑size tabular or vector data | Fully‑connected layers; static input size | Often high (weights = \(n_{\text{in}}\times n_{\text{out}}\)) | Simple classification/regression, fraud detection | Standard back‑propagation; sensitive to scaling |
| Convolutional Neural Network (CNN) | Images, video frames, spatial grids | Learnable convolutional filters + pooling; weight sharing | Much lower than MLP for comparable depth (filters reused) | Image classification, object detection, medical imaging | Requires data augmentation; often uses batch normalisation |
| Recurrent Neural Network (RNN) | Sequences of arbitrary length (text, audio, sensor streams) | Hidden state passed forward in time | Parameters grow with hidden size, but not with sequence length | Language modelling, speech recognition, time‑series forecasting | Back‑propagation through time; prone to vanishing gradients |
| Long Short‑Term Memory (LSTM) | Long‑range sequential data | Gated cells (input, forget, output) control information flow | Similar to RNN but with extra gate parameters | Machine translation, video captioning, stock prediction | Mitigates vanishing gradient; slower per step than vanilla RNN |
| Transformer | Very long sequences (text, code, protein strings) | Self‑attention enables parallel processing of the whole sequence | Very high (multi‑head attention matrices) – usually pre‑trained | Large‑scale language models, translation, summarisation | Requires massive data & compute; fine‑tuning is common |
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | \(\displaystyle \frac{TP+TN}{TP+FP+FN+TN}\) | Balanced class distribution |
| Precision | \(\displaystyle \frac{TP}{TP+FP}\) | Cost of false positives is high |
| Recall (Sensitivity) | \(\displaystyle \frac{TP}{TP+FN}\) | Cost of false negatives is high |
| F‑score | \(\displaystyle 2\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\) | Balance between precision & recall |
| ROC‑AUC | Area under the Receiver Operating Characteristic curve | Binary classifiers; threshold‑independent |
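A short Python sketch computing these metrics from confusion‑matrix counts; the example counts are purely illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts: 40 TP, 10 FP, 5 FN, 45 TN
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
# accuracy 0.85, precision 0.80, recall ≈ 0.889, F-score ≈ 0.842
```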
**Scenario:** *A secondary school wants to introduce an AI system that automatically grades student essays for plagiarism and writing quality.*
Task (AO1): Identify two ethical issues and suggest one mitigation for each.
Issue 1 – Bias: the model may systematically penalise essays written by students from under‑represented linguistic backgrounds. Mitigation: include a diverse, balanced corpus of essays from various linguistic backgrounds and perform bias‑testing before deployment.
Issue 2 – Privacy: student essays contain personal data that could be stored or reused without consent. Mitigation: anonymise all submissions, obtain explicit consent, and retain data only for the minimum period required for model improvement.