Students will be able to: explain how artificial neural networks support machine learning, analyse their operation, and design a simple ANN solution using pseudocode or a high‑level language (AO1–AO3).
| Syllabus Unit | Content Covered in These Notes | AO Tag |
|---|---|---|
| AI – Neural Networks (Section 18.1) | Perceptron, activation functions, network architecture, forward & backward propagation, loss functions, training, evaluation, regularisation, ethical issues. | AO1, AO2, AO3 |
| Algorithms (Section 9) | Pseudocode for ANN training loop; analysis of computational complexity. | AO2, AO3 |
| Data Structures (Section 8) | Arrays / matrices for weights, biases and activations. | AO2 |
| Programming (Section 11) | Worked example in Python‑style syntax. | AO3 |
| Ethics and Ownership (Section 7) | Bias, privacy, transparency, employment and environmental impact. | AO1 |
Single neuron:
\[ a = \phi\!\left(\sum_{i=1}^{n} w_i x_i + b\right) \]
where $x_i$ are the inputs, $w_i$ the weights, $b$ the bias and $\phi$ the activation function.
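A minimal Python sketch of this formula (the function name `neuron`, the choice of sigmoid and the example values are illustrative only, not prescribed by the syllabus):

```python
import math

def neuron(x, w, b, phi):
    """Compute a = phi(sum_i w_i * x_i + b) for a single neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return phi(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Example: two inputs, weights 0.1 and -0.2, bias 0
print(neuron([1.0, 0.0], [0.1, -0.2], 0.0, sigmoid))  # ≈ 0.525
```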
For a network with parameters $\theta$ (all weights and biases) the loss $L(\theta)$ is minimised by gradient descent:
\[ \theta \leftarrow \theta - \eta \,\nabla_{\theta} L(\theta) \]
where $\eta$ is the learning rate. In practice the gradient is obtained by the back‑propagation algorithm.
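A toy illustration of this update rule (assuming a single parameter and the loss $L(\theta)=\theta^2$, chosen purely for demonstration):

```python
# Gradient descent on L(theta) = theta^2, whose gradient is 2*theta.
theta = 5.0   # initial parameter value (arbitrary)
eta = 0.1     # learning rate

for _ in range(50):
    grad = 2 * theta          # dL/dtheta
    theta = theta - eta * grad

print(theta)  # close to 0, the minimiser of L
```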
For a classification output layer we use soft‑max:
\[ \hat{y}_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}} \]
and cross‑entropy loss:
\[ L = -\sum_{j} y_j \log \hat{y}_j \]
Applying the chain rule gives the simple error term used in back‑propagation:
\[ \frac{\partial L}{\partial z_j} = \hat{y}_j - y_j \]
Thus the “output error” $\delta_O$ in the pseudocode is $\delta_O = \hat{y} - y$.
Numerical stability tip: compute soft‑max as $\hat{y}_j = \frac{e^{z_j - \max(z)}}{\sum_k e^{z_k - \max(z)}}$ to avoid overflow.
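A short Python sketch of the stable soft‑max, the cross‑entropy loss and the resulting output error (NumPy is used for convenience; the example logits and one‑hot target are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max(z) to avoid overflow
    return e / e.sum()

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])   # raw output scores (logits)
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)

print(cross_entropy(y_hat, y))  # loss value
print(y_hat - y)                # output error delta_O used in back-propagation
```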
# -------------------------------------------------
# TRAIN-MLP (X, Y, hiddenSize, epochs, η)
# X : matrix of training inputs (m × n)
# Y : matrix of target outputs (m × k)
# hiddenSize : number of neurons in the hidden layer
# epochs : number of full passes over the data
# η : learning rate
# -------------------------------------------------
INITIALISE weightIH ← random(n, hiddenSize) # input → hidden
INITIALISE biasH ← zeros(1, hiddenSize)
INITIALISE weightHO ← random(hiddenSize, k) # hidden → output
INITIALISE biasO ← zeros(1, k)
FOR epoch FROM 1 TO epochs DO
    FOR each training example (x, y) IN (X, Y) DO
        # ---- Forward pass ----
        zH ← x · weightIH + biasH              # linear combination
        aH ← ReLU(zH)                          # hidden activation
        zO ← aH · weightHO + biasO
        aO ← softmax(zO)                       # output probabilities
        # ---- Compute loss (cross-entropy) ----
        loss ← -Σ_j y_j * log(aO_j)
        # ---- Backward pass ----
        δO ← aO - y                            # output error (see derivation above)
        gradWeightHO ← aHᵀ · δO
        gradBiasO ← δO                         # one bias gradient per output neuron
        δH ← (δO · weightHOᵀ) ⊙ ReLU'(zH)      # hidden error
        gradWeightIH ← xᵀ · δH
        gradBiasH ← δH                         # one bias gradient per hidden neuron
        # ---- Update weights ----
        weightHO ← weightHO - η * gradWeightHO
        biasO ← biasO - η * gradBiasO
        weightIH ← weightIH - η * gradWeightIH
        biasH ← biasH - η * gradBiasH
    END FOR
END FOR
RETURN weightIH, biasH, weightHO, biasO
Complexity per epoch: $O(m·n·\text{hiddenSize} + m·\text{hiddenSize}·k)$. This satisfies the algorithmic analysis requirement for Paper 2.
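One possible NumPy translation of the TRAIN‑MLP pseudocode above (the initialisation scale, random seed and function names are implementation choices, not prescribed by the syllabus):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def softmax(z):
    e = np.exp(z - np.max(z))       # numerically stable soft-max
    return e / e.sum()

def train_mlp(X, Y, hidden_size, epochs, eta):
    """X: (m, n) inputs; Y: (m, k) one-hot targets; returns trained parameters."""
    m, n = X.shape
    k = Y.shape[1]
    rng = np.random.default_rng(0)
    weight_ih = rng.normal(0.0, 0.1, (n, hidden_size))   # input -> hidden
    bias_h = np.zeros(hidden_size)
    weight_ho = rng.normal(0.0, 0.1, (hidden_size, k))    # hidden -> output
    bias_o = np.zeros(k)

    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Forward pass
            z_h = x @ weight_ih + bias_h
            a_h = relu(z_h)
            z_o = a_h @ weight_ho + bias_o
            a_o = softmax(z_o)
            # Backward pass
            delta_o = a_o - y                              # output error
            grad_w_ho = np.outer(a_h, delta_o)
            grad_b_o = delta_o
            delta_h = (delta_o @ weight_ho.T) * relu_grad(z_h)
            grad_w_ih = np.outer(x, delta_h)
            grad_b_h = delta_h
            # Gradient-descent update
            weight_ho -= eta * grad_w_ho
            bias_o -= eta * grad_b_o
            weight_ih -= eta * grad_w_ih
            bias_h -= eta * grad_b_h

    return weight_ih, bias_h, weight_ho, bias_o
```

As in the pseudocode, a call such as `train_mlp(X, Y, hidden_size=4, epochs=100, eta=0.1)` expects `X` of shape (m, n) and one‑hot targets `Y` of shape (m, k).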
Network: 2 inputs → 1 hidden neuron (ReLU) → 1 output neuron (sigmoid). Learning rate $\eta = 0.5$.
| Step | Values |
|---|---|
| Initial weights & biases | $w_{1}=0.1,\; w_{2}=-0.2,\; b_h=0$ (hidden); $w_{h}=0.3,\; b_o=0$ (output) |
| Training sample | $x=[1,0]$, target $y=1$ |
| Forward – hidden | $z_h = 1·0.1 + 0·(-0.2) + 0 = 0.1$; $a_h = \text{ReLU}(0.1)=0.1$ |
| Forward – output | $z_o = 0.1·0.3 + 0 = 0.03$; $a_o = \sigma(0.03)=0.5075$ |
| Loss (MSE) | $L = \frac12 (y-a_o)^2 = 0.5(1-0.5075)^2 = 0.121$ |
| Backward – output error | $\delta_o = (a_o-y)\,\sigma'(z_o)$; $\sigma'(z_o)=a_o(1-a_o)=0.249$; $\delta_o = (0.5075-1)·0.249 = -0.122$ |
| Gradients (output) | $\frac{\partial L}{\partial w_h}=a_h·\delta_o = 0.1·(-0.122)= -0.0122$; $\frac{\partial L}{\partial b_o}= \delta_o = -0.122$ |
| Backward – hidden error | $\delta_h = \delta_o·w_h·\text{ReLU}'(z_h)$; ReLU' = 1 (since $z_h>0$); $\delta_h = (-0.122)·0.3 = -0.0366$ |
| Gradients (hidden) | $\frac{\partial L}{\partial w_1}=x_1·\delta_h = 1·(-0.0366)= -0.0366$; $\frac{\partial L}{\partial w_2}=x_2·\delta_h = 0$; $\frac{\partial L}{\partial b_h}= \delta_h = -0.0366$ |
| Weight update | $w_h \leftarrow 0.3 - 0.5·(-0.0122)=0.306$; $b_o \leftarrow 0 - 0.5·(-0.122)=0.061$; $w_1 \leftarrow 0.1 - 0.5·(-0.0366)=0.118$; $w_2$ unchanged; $b_h \leftarrow 0 - 0.5·(-0.0366)=0.018$ |
Repeating the process for the remaining training examples drives the network toward the desired mapping.
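The Python snippet below reproduces the worked‑example arithmetic (sigmoid output and MSE loss, as stated above, rather than the soft‑max/cross‑entropy pair used in the pseudocode). The table rounds $\delta_o$ before the updates, so the last digit of some values may differ slightly:

```python
import math

w1, w2, b_h = 0.1, -0.2, 0.0        # hidden-layer weights and bias
w_h, b_o = 0.3, 0.0                 # output-layer weight and bias
x, y, eta = [1.0, 0.0], 1.0, 0.5    # training sample and learning rate

z_h = x[0] * w1 + x[1] * w2 + b_h   # 0.1
a_h = max(0.0, z_h)                 # ReLU -> 0.1
z_o = a_h * w_h + b_o               # 0.03
a_o = 1.0 / (1.0 + math.exp(-z_o))  # sigmoid -> ~0.5075
loss = 0.5 * (y - a_o) ** 2         # ~0.121

delta_o = (a_o - y) * a_o * (1 - a_o)                # ~ -0.12
delta_h = delta_o * w_h * (1.0 if z_h > 0 else 0.0)  # ReLU' = 1 here

w_h -= eta * a_h * delta_o          # -> ~0.306
b_o -= eta * delta_o                # -> ~0.06
w1  -= eta * x[0] * delta_h         # -> ~0.118
b_h -= eta * delta_h                # -> ~0.018

print(round(a_o, 4), round(loss, 3), round(w_h, 3), round(w1, 3), round(b_h, 3))
```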
| Network Type | Typical Input Data | Key Architectural Feature | Parameter‑count Consideration | Typical Applications | Training Nuance |
|---|---|---|---|---|---|
| Multilayer Perceptron (MLP) | Fixed‑size tabular or vector data | Fully‑connected layers; static input size | Often high ($n_{\text{in}}\times n_{\text{out}}$ weights per fully‑connected layer) | Simple classification/regression, fraud detection | Standard back‑propagation; sensitive to feature scaling |
| Convolutional Neural Network (CNN) | Images, video frames, spatial grids | Learnable convolutional filters + pooling; weight sharing | Much lower than MLP for comparable depth (filters reused) | Image classification, object detection, medical imaging | Requires data augmentation; often uses batch normalisation |
| Recurrent Neural Network (RNN) | Sequences of arbitrary length (text, audio, sensor streams) | Hidden state passed forward in time | Parameters grow with hidden size, but not with sequence length | Language modelling, speech recognition, time‑series forecasting | Back‑propagation through time; prone to vanishing gradients |
| Long Short‑Term Memory (LSTM) | Long‑range sequential data | Gated cells (input, forget, output) control information flow | Similar to RNN but with extra gate parameters | Machine translation, video captioning, stock prediction | Mitigates vanishing gradient; slower per step than vanilla RNN |
| Transformer | Very long sequences (text, code, protein strings) | Self‑attention enables parallel processing of the whole sequence | Very high (multi‑head attention matrices) – usually pre‑trained | Large‑scale language models, translation, summarisation | Requires massive data & compute; fine‑tuning is common |
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | \(\displaystyle \frac{TP+TN}{TP+FP+FN+TN}\) | Balanced class distribution |
| Precision | \(\displaystyle \frac{TP}{TP+FP}\) | Cost of false positives is high |
| Recall (Sensitivity) | \(\displaystyle \frac{TP}{TP+FN}\) | Cost of false negatives is high |
| F‑score | \(\displaystyle 2\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\) | Balance between precision & recall |
| ROC‑AUC | Area under the Receiver Operating Characteristic curve | Binary classifiers; threshold‑independent |
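A short Python sketch computing these metrics from confusion‑matrix counts (the example counts are invented purely for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_score

print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
# (0.85, 0.8, 0.888..., 0.842...)
```

ROC‑AUC is not included in the sketch because it is computed from ranked prediction scores across all thresholds rather than from a single set of counts.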
**Scenario:** *A secondary school wants to introduce an AI system that automatically grades student essays for plagiarism and writing quality.*
Task (AO1): Identify **two** ethical issues and suggest **one** mitigation for each.