Show understanding of how artificial neural networks have helped with machine learning

18.1 Artificial Intelligence – Artificial Neural Networks (ANNs)

Learning Objective

Students will be able to: explain how artificial neural networks support machine learning, analyse their operation, and design a simple ANN solution using pseudocode or a high‑level language (AO1–AO3).

Syllabus Mapping (Cambridge International AS & A Level Computer Science 9618, 2026)

| Syllabus Unit | Content Covered in These Notes | AO Tag |
| --- | --- | --- |
| AI – Neural Networks (Section 18.1) | Perceptron, activation functions, network architecture, forward & backward propagation, loss functions, training, evaluation, regularisation, ethical issues. | AO1, AO2, AO3 |
| Algorithms (Section 9) | Pseudocode for ANN training loop; analysis of computational complexity. | AO2, AO3 |
| Data Structures (Section 8) | Arrays / matrices for weights, biases and activations. | AO2 |
| Programming (Section 11) | Worked example in Python‑style syntax. | AO3 |
| Ethics and Ownership (Section 7) | Bias, privacy, transparency, employment and environmental impact. | AO1 |

What Is Not Covered (Checklist)

  • Bias‑variance trade‑off (beyond the brief mention of over‑fitting).
  • Gradient‑checking as a debugging technique.
  • Detailed hardware considerations (GPUs, TPUs).
  • Advanced optimisation algorithms (Adam, RMSProp) – only basic gradient descent is shown.

Key Concepts (AO1)

  • Neuron (perceptron): computes a weighted sum of its inputs and passes the result through an activation function.
  • Activation functions (introduce non‑linearity):

    • Sigmoid \$\displaystyle \sigma(z)=\frac{1}{1+e^{-z}}\$
    • Rectified Linear Unit (ReLU) \$f(z)=\max(0,z)\$
    • Hyperbolic tangent (tanh) \$\displaystyle \tanh(z)\$

  • Network architecture: layers of neurons (input, hidden, output) connected by weighted edges.

    • Multilayer Perceptron (MLP) – fully‑connected.
    • Convolutional Neural Network (CNN) – learnable filters + pooling.
    • Recurrent Neural Network (RNN) – hidden state across time steps.
    • Long Short‑Term Memory (LSTM) – gated RNN.
    • Transformer – self‑attention, parallel sequence processing.

  • Loss / cost function: measures error between predicted output \$\hat{y}\$ and true label \$y\$.

    • Mean‑squared error (MSE) – regression.
    • Cross‑entropy – classification (often paired with soft‑max).

  • Training (learning): optimisation of weights to minimise the loss, usually by gradient descent and back‑propagation.
  • Regularisation: L2 weight decay, dropout, early stopping, data augmentation – all aim to reduce over‑fitting.
  • Evaluation metrics: accuracy, precision, recall, F‑score, ROC‑AUC, confusion matrix.
  • Ethical considerations: bias, privacy, transparency, employment, environmental cost.

Mathematical Formulation (AO1)

Single neuron:

\[
a = \phi\!\left(\sum_{i=1}^{n} w_i x_i + b\right)
\]

where \$x_i\$ are the inputs, \$w_i\$ the weights, \$b\$ the bias and \$\phi\$ the activation function.
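
To make the formula concrete, here is a minimal Python sketch of a single neuron's forward pass with a choice of activation function (all names are illustrative, not from any library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)            # math.tanh(z) would give the tanh activation

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of inputs plus bias, passed through an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Example: two inputs, first with sigmoid, then with ReLU
print(neuron([1.0, 0.0], [0.1, -0.2], 0.0))          # sigmoid(0.1) ≈ 0.525
print(neuron([1.0, 0.0], [0.1, -0.2], 0.0, relu))    # ReLU(0.1) = 0.1
```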

For a network with parameters \$\theta\$ (all weights and biases) the loss \$L(\theta)\$ is minimised by gradient descent:

\[
\theta \leftarrow \theta - \eta \,\nabla_{\theta} L(\theta)
\]

\$\eta\$ is the learning rate. In practice the gradient is obtained by the back‑propagation algorithm.
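
A minimal sketch of this update rule applied to a single parameter, using a hypothetical quadratic loss chosen only to show the mechanics:

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = 0.0      # initial parameter value
eta = 0.1        # learning rate

for step in range(50):
    grad = 2 * (theta - 3)        # dL/dtheta
    theta = theta - eta * grad    # the update rule above

print(theta)  # approaches 3, the minimiser of L
```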

Derivation Box – Soft‑max + Cross‑entropy Gradient

For a classification output layer we use soft‑max:

\[
\hat{y}_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}
\]

and cross‑entropy loss:

\[
L = -\sum_{j} y_j \log \hat{y}_j
\]

Applying the chain rule gives the simple error term used in back‑propagation:

\[
\frac{\partial L}{\partial z_j} = \hat{y}_j - y_j
\]

Thus the “output error” \$\delta_O\$ in the pseudocode is \$\delta_O = \hat{y} - y\$.

Numerical stability tip: compute soft‑max as \$\hat{y}_j = \frac{e^{z_j - \max(z)}}{\sum_k e^{z_k - \max(z)}}\$ to avoid overflow.
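
The stable soft‑max and a numerical check of the gradient result above can be sketched as follows (NumPy; the test vectors are illustrative):

```python
import numpy as np

def softmax(z):
    # Subtract max(z) before exponentiating for numerical stability (see tip above)
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])   # raw output-layer scores
y = np.array([1.0, 0.0, 0.0])   # one-hot true label
y_hat = softmax(z)

# Central-difference numerical gradient of the loss with respect to z ...
eps = 1e-6
num_grad = np.array([
    (cross_entropy(softmax(z + eps * np.eye(3)[j]), y) -
     cross_entropy(softmax(z - eps * np.eye(3)[j]), y)) / (2 * eps)
    for j in range(3)
])

# ... matches the analytic result y_hat - y from the derivation box
print(np.allclose(num_grad, y_hat - y, atol=1e-6))   # True
```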

Back‑Propagation Overview (AO2)

  1. Forward pass – compute activations layer by layer.
  2. Loss computation – evaluate \$L\$ using the network output.
  3. Backward pass – propagate the error gradient from the output layer back to the input layer, applying the chain rule (see Derivation Box).
  4. Weight update – \$w \gets w - \eta \frac{\partial L}{\partial w}\$ for every parameter.

Pseudocode for Training a Simple MLP (AO2, AO3)

```
# -------------------------------------------------
# TRAIN-MLP (X, Y, hiddenSize, epochs, η)
#   X          : matrix of training inputs  (m × n)
#   Y          : matrix of target outputs   (m × k)
#   hiddenSize : number of neurons in the hidden layer
#   epochs     : number of full passes over the data
#   η          : learning rate
# -------------------------------------------------
INITIALISE weightIH ← random(n, hiddenSize)      # input → hidden
INITIALISE biasH    ← zeros(1, hiddenSize)
INITIALISE weightHO ← random(hiddenSize, k)      # hidden → output
INITIALISE biasO    ← zeros(1, k)

FOR epoch FROM 1 TO epochs DO
    FOR each training example (x, y) IN (X, Y) DO
        # ---- Forward pass ----
        zH ← x · weightIH + biasH                # linear combination
        aH ← ReLU(zH)                            # hidden activation
        zO ← aH · weightHO + biasO
        aO ← softmax(zO)                         # output probabilities

        # ---- Compute loss (cross-entropy) ----
        loss ← −Σj yj · log(aOj)

        # ---- Backward pass ----
        δO ← aO − y                              # output error (derivation box)
        gradWeightHO ← aHᵀ · δO
        gradBiasO    ← δO
        δH ← (δO · weightHOᵀ) ⊙ ReLU'(zH)        # hidden error
        gradWeightIH ← xᵀ · δH
        gradBiasH    ← δH

        # ---- Update weights ----
        weightHO ← weightHO − η · gradWeightHO
        biasO    ← biasO    − η · gradBiasO
        weightIH ← weightIH − η · gradWeightIH
        biasH    ← biasH    − η · gradBiasH
    END FOR
END FOR

RETURN weightIH, biasH, weightHO, biasO
```

Complexity per epoch: \$O(m·n·\text{hiddenSize} + m·\text{hiddenSize}·k)\$. This satisfies the algorithmic analysis requirement for Paper 2.
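
A runnable Python/NumPy translation of the pseudocode above, kept deliberately close to the exam-style version (a sketch; function and variable names are illustrative, not part of any library):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_deriv(z):
    return (z > 0).astype(float)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def train_mlp(X, Y, hidden_size, epochs, eta):
    """Trains a one-hidden-layer MLP with per-example gradient descent."""
    m, n = X.shape
    k = Y.shape[1]
    rng = np.random.default_rng(0)
    weight_ih = rng.normal(0, 0.1, (n, hidden_size))   # input → hidden
    bias_h = np.zeros(hidden_size)
    weight_ho = rng.normal(0, 0.1, (hidden_size, k))    # hidden → output
    bias_o = np.zeros(k)

    for epoch in range(epochs):
        for x, y in zip(X, Y):
            # ---- Forward pass ----
            z_h = x @ weight_ih + bias_h
            a_h = relu(z_h)
            z_o = a_h @ weight_ho + bias_o
            a_o = softmax(z_o)

            # ---- Backward pass (softmax + cross-entropy) ----
            delta_o = a_o - y
            grad_weight_ho = np.outer(a_h, delta_o)
            grad_bias_o = delta_o
            delta_h = (delta_o @ weight_ho.T) * relu_deriv(z_h)
            grad_weight_ih = np.outer(x, delta_h)
            grad_bias_h = delta_h

            # ---- Update weights ----
            weight_ho -= eta * grad_weight_ho
            bias_o -= eta * grad_bias_o
            weight_ih -= eta * grad_weight_ih
            bias_h -= eta * grad_bias_h

    return weight_ih, bias_h, weight_ho, bias_o

# Tiny usage example on hypothetical data: XOR encoded as one-hot classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
w_ih, b_h, w_ho, b_o = train_mlp(X, Y, hidden_size=8, epochs=2000, eta=0.1)

preds = [np.argmax(softmax(relu(x @ w_ih + b_h) @ w_ho + b_o)) for x in X]
print(preds)   # with enough epochs this usually matches the XOR pattern [0, 1, 1, 0]
```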

Worked Numerical Example (AO2, AO3)

Network: 2 inputs → 1 hidden neuron (ReLU) → 1 output neuron (sigmoid). Learning rate \$\eta = 0.5\$.

| Step | Values |
| --- | --- |
| Initial weights & biases | \$w_1=0.1,\; w_2=-0.2,\; b_h=0\$ (hidden); \$w_h=0.3,\; b_o=0\$ (output) |
| Training sample | \$x=[1,0]\$, target \$y=1\$ |
| Forward – hidden | \$z_h = 1·0.1 + 0·(-0.2) + 0 = 0.1\$; \$a_h = \text{ReLU}(0.1)=0.1\$ |
| Forward – output | \$z_o = 0.1·0.3 + 0 = 0.03\$; \$a_o = \sigma(0.03)=0.5075\$ |
| Loss (MSE) | \$L = \frac12 (y-a_o)^2 = 0.5(1-0.5075)^2 = 0.121\$ |
| Backward – output error | \$\delta_o = (a_o-y)\,\sigma'(z_o)\$; \$\sigma'(z_o)=a_o(1-a_o)=0.249\$; \$\delta_o = (0.5075-1)·0.249 = -0.122\$ |
| Gradients (output) | \$\frac{\partial L}{\partial w_h}=a_h·\delta_o = 0.1·(-0.122)= -0.0122\$; \$\frac{\partial L}{\partial b_o}= \delta_o = -0.122\$ |
| Backward – hidden error | \$\delta_h = \delta_o·w_h·\text{ReLU}'(z_h)\$; \$\text{ReLU}'(z_h) = 1\$ (since \$z_h>0\$); \$\delta_h = (-0.122)·0.3 = -0.0366\$ |
| Gradients (hidden) | \$\frac{\partial L}{\partial w_1}=x_1·\delta_h = 1·(-0.0366)= -0.0366\$; \$\frac{\partial L}{\partial w_2}=x_2·\delta_h = 0\$; \$\frac{\partial L}{\partial b_h}= \delta_h = -0.0366\$ |
| Weight update | \$w_h \leftarrow 0.3 - 0.5·(-0.0122)=0.306\$; \$b_o \leftarrow 0 - 0.5·(-0.122)=0.061\$; \$w_1 \leftarrow 0.1 - 0.5·(-0.0366)=0.118\$; \$w_2\$ unchanged; \$b_h \leftarrow 0 - 0.5·(-0.0366)=0.018\$ |

Repeating the process for the remaining training examples drives the network toward the desired mapping.
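
The same single training step can be checked in a few lines of Python (a sketch reproducing the table above; variable names are illustrative):

```python
import math

# Initial parameters (from the table above)
w1, w2, b_h = 0.1, -0.2, 0.0       # hidden neuron
w_h, b_o = 0.3, 0.0                # output neuron
eta = 0.5
x1, x2, y = 1.0, 0.0, 1.0          # training sample and target

# Forward pass
z_h = x1 * w1 + x2 * w2 + b_h      # 0.1
a_h = max(0.0, z_h)                # ReLU → 0.1
z_o = a_h * w_h + b_o              # 0.03
a_o = 1 / (1 + math.exp(-z_o))     # sigmoid → ≈ 0.5075

# MSE loss and backward pass
loss = 0.5 * (y - a_o) ** 2                        # ≈ 0.121
delta_o = (a_o - y) * a_o * (1 - a_o)              # output error
delta_h = delta_o * w_h * (1.0 if z_h > 0 else 0)  # hidden error

# Gradient-descent updates
w_h -= eta * a_h * delta_o         # ≈ 0.306
b_o -= eta * delta_o               # ≈ 0.06
w1  -= eta * x1 * delta_h          # ≈ 0.118
b_h -= eta * delta_h               # ≈ 0.018
print(round(w_h, 3), round(b_o, 3), round(w1, 3), round(b_h, 3))
# Matches the table values to within rounding of the intermediate steps
```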

How ANNs Have Advanced Machine Learning (AO1)

  1. Learning non‑linear relationships – the Universal Approximation Theorem guarantees that a sufficiently large MLP can approximate any continuous function.
  2. Automatic feature extraction – deep architectures (CNN, RNN, Transformer) learn hierarchical representations, reducing manual engineering.
  3. Scalability with data – performance often improves as more labelled examples become available.
  4. Transfer learning – pre‑trained models can be fine‑tuned for new tasks, saving time and compute.
  5. Real‑world impact – image classification, speech‑to‑text, natural‑language processing, and game‑playing (e.g., AlphaGo) demonstrate practical success.

Illustrative Applications (AO1)

  • Handwritten digit recognition (MNIST) – a simple MLP reaches >98 % accuracy.
  • Object detection – CNNs locate and label multiple objects in photographs.
  • Speech‑to‑text – RNNs and Transformers model temporal dependencies in audio streams.
  • Strategic game playing – Deep Reinforcement Learning combines ANNs with reward‑based learning (e.g., AlphaZero).

Comparison of ANN Types – When to Use Which? (AO1)

| Network Type | Typical Input Data | Key Architectural Feature | Parameter‑count Consideration | Typical Applications | Training Nuance |
| --- | --- | --- | --- | --- | --- |
| Multilayer Perceptron (MLP) | Fixed‑size tabular or vector data | Fully‑connected layers; static input size | Often high (weights = \$n_{\text{in}}\times n_{\text{out}}\$) | Simple classification/regression, fraud detection | Standard back‑propagation; sensitive to scaling |
| Convolutional Neural Network (CNN) | Images, video frames, spatial grids | Learnable convolutional filters + pooling; weight sharing | Much lower than MLP for comparable depth (filters reused) | Image classification, object detection, medical imaging | Requires data augmentation; often uses batch normalisation |
| Recurrent Neural Network (RNN) | Sequences of arbitrary length (text, audio, sensor streams) | Hidden state passed forward in time | Parameters grow with hidden size, but not with sequence length | Language modelling, speech recognition, time‑series forecasting | Back‑propagation through time; prone to vanishing gradients |
| Long Short‑Term Memory (LSTM) | Long‑range sequential data | Gated cells (input, forget, output) control information flow | Similar to RNN but with extra gate parameters | Machine translation, video captioning, stock prediction | Mitigates vanishing gradient; slower per step than vanilla RNN |
| Transformer | Very long sequences (text, code, protein strings) | Self‑attention enables parallel processing of the whole sequence | Very high (multi‑head attention matrices) – usually pre‑trained | Large‑scale language models, translation, summarisation | Requires massive data & compute; fine‑tuning is common |

Model Evaluation & Validation (AO2)

  • Train / Validation / Test split – typical ratios 60 % / 20 % / 20 %.
  • Cross‑validation – \$k\$‑fold (commonly \$k=5\$ or \$10\$) for more reliable performance estimates.
  • Confusion matrix – tabulates TP, FP, FN, TN.

| Metric | Formula | When to Use |
| --- | --- | --- |
| Accuracy | \(\displaystyle \frac{TP+TN}{TP+FP+FN+TN}\) | Balanced class distribution |
| Precision | \(\displaystyle \frac{TP}{TP+FP}\) | Cost of false positives is high |
| Recall (Sensitivity) | \(\displaystyle \frac{TP}{TP+FN}\) | Cost of false negatives is high |
| F‑score | \(\displaystyle 2\,\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\) | Balance between precision & recall |
| ROC‑AUC | Area under the Receiver Operating Characteristic curve | Binary classifiers; threshold‑independent |
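
A short sketch that computes the first four metrics directly from confusion-matrix counts (plain Python; the example counts are hypothetical):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Hypothetical confusion matrix: 40 TP, 10 FP, 5 FN, 45 TN
print(metrics(40, 10, 5, 45))   # (0.85, 0.8, 0.888..., 0.842...)
```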

Regularisation Techniques (AO2)

  • L2 weight decay – adds \$\lambda\sum w^2\$ to the loss.
  • Dropout – randomly disables a proportion \$p\$ of neurons each training step.
  • Early stopping – halt training when validation loss stops improving.
  • Data augmentation – artificially enlarge the training set (e.g., image rotations, noise injection).
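
To show how two of these techniques change the training loop from earlier, a minimal sketch (the decay coefficient, dropout rate and variable names are illustrative):

```python
import numpy as np

eta, lam = 0.1, 0.01             # learning rate and L2 coefficient (illustrative)
weights = np.array([0.5, -1.2])  # some layer's weights
grad = np.array([0.2, -0.1])     # gradient of the data loss w.r.t. these weights

# L2 weight decay: the penalty lambda * sum(w^2) adds 2 * lambda * w to the gradient
weights -= eta * (grad + 2 * lam * weights)

# Dropout at training time: randomly zero a proportion p of hidden activations,
# scaling the survivors by 1 / (1 - p) ("inverted dropout")
p = 0.5
activations = np.array([0.3, 0.0, 1.7, 0.9])
mask = (np.random.rand(activations.size) >= p) / (1 - p)
activations = activations * mask
```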

Ethical & Societal Implications (AO1)

  • Bias & fairness – training data may encode historic prejudices; models can amplify them.
  • Privacy – large datasets often contain personal information; GDPR‑compliant handling is required.
  • Transparency & explainability – “black‑box” nature makes it hard to justify decisions.
  • Employment impact – automation may displace certain job roles.
  • Environmental cost – training large models consumes significant energy.

Exam‑style Ethical Scenario

Scenario: *A secondary school wants to introduce an AI system that automatically grades student essays for plagiarism and writing quality.*

Task (AO1): Identify two ethical issues and suggest one mitigation for each.

  1. Issue – Bias in grading: The model may favour writing styles present in its training set, disadvantaging non‑native speakers.

    Mitigation: Include a diverse, balanced corpus of essays from various linguistic backgrounds and perform bias‑testing before deployment.

  2. Issue – Privacy of student work: Essays contain personal data; storing them for model training could breach GDPR.

    Mitigation: Anonymise all submissions, obtain explicit consent, and retain data only for the minimum period required for model improvement.

Quick Revision Checklist

  • Know the perceptron equation and the role of the activation function.
  • Be able to write the forward‑pass formulas for a fully‑connected layer.
  • Derive the soft‑max + cross‑entropy gradient (Derivation Box).
  • Explain back‑propagation steps and the chain rule.
  • Read and interpret the pseudocode; state its time‑complexity.
  • Recall when to use MLP, CNN, RNN/LSTM, Transformer (comparison table).
  • List at least three regularisation methods and when they are useful.
  • Define accuracy, precision, recall, F‑score and give a situation where each is preferred.
  • Identify two ethical risks of an ANN‑based system and propose mitigations.