
15.1 Processors, Parallel Processing & Virtual Machines

Learning objectives

  • Compare the main characteristics of RISC and CISC architectures.
  • Explain how a large register file and pipelining give RISC processors their high speed.
  • Identify the three classic pipeline hazards and describe at least one technique for removing each.
  • Compare the four Flynn classifications (SISD, SIMD, MISD, MIMD) and illustrate the difference between SISD and MIMD with a diagram.
  • Describe the key ideas behind massively‑parallel computers, including performance metrics such as speed‑up and Amdahl’s law.
  • Define a virtual machine, explain its layered structure and give two concrete examples of its use.

RISC vs CISC

| Aspect | RISC (Reduced‑Instruction‑Set Computer) | CISC (Complex‑Instruction‑Set Computer) |
| --- | --- | --- |
| Instruction‑set size | Small (≈ 50–100 instructions) | Large (hundreds to thousands) |
| Instruction length | Fixed (usually 32 bits) | Variable (1–15 bytes) |
| Execution time per instruction | One clock cycle (or a few) – most are simple | Many cycles – complex instructions are micro‑coded |
| Memory access model | Load/store only (memory accessed in dedicated stages) | Many instructions can read/write memory directly |
| Control unit | Hard‑wired, simple combinational logic | Micro‑programmed, more complex control memory |
| Register file | Large (≥ 32 general‑purpose registers) – keeps data on‑chip | Fewer registers; heavy reliance on memory operands |
| Typical design goal | High clock frequency + high instruction throughput | Rich instruction set for compact code density |

Why RISC processors are fast

  • Large register file – most operands are in registers, avoiding costly memory accesses.
  • Simple, fixed‑length instructions – the decoder can be hard‑wired and the pipeline stages are uniform.
  • Pipelining – each instruction is broken into independent stages that operate in parallel, giving an average throughput of ≈ one instruction per clock after the pipeline is filled.

Five‑stage pipeline (classic RISC core)

| Stage | Abbreviation | Typical operation | Registers read / written |
| --- | --- | --- | --- |
| Instruction Fetch | IF | Read the next 32‑bit word from instruction memory using the PC. | Read PC; write next PC into IF/ID pipeline register. |
| Instruction Decode / Register Fetch | ID | Decode opcode, read up to two source registers, sign‑extend immediates. | Read two source registers from the register file; write decoded fields into ID/EX register. |
| Execute / Address Calculation | EX | ALU operation, effective‑address calculation, branch‑condition evaluation. | Read ALU operands from ID/EX; write ALU result & branch decision into EX/MEM register. |
| Memory Access | MEM | Load from or store to data memory. | Read/write data memory; write loaded data into MEM/WB register. |
| Write Back | WB | Write the result (ALU output or loaded data) back to the destination register. | Write destination register (first half‑clock) from MEM/WB. |

Typical 5‑stage pipeline diagram (showing overlapping instruction flow).
*(Insert diagram with arrows: IF → ID → EX → MEM → WB, each stage holding a different instruction.)*

How pipelining increases throughput

  • Each clock cycle completes part of several instructions simultaneously → after the pipeline fill latency, the average rate is one instruction per cycle.
  • No need to raise the clock frequency; the control logic stays simple and deterministic.
  • Fixed‑length stages give predictable timing – important for real‑time applications.
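The throughput claim can be checked with simple arithmetic: an n‑instruction program on a k‑stage pipeline needs k + (n − 1) cycles instead of n·k (a minimal sketch, assuming one cycle per stage and no hazard stalls):

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    """The first instruction takes n_stages cycles to fill the pipeline;
    every later instruction then completes one cycle after the previous."""
    return n_stages + (n_instructions - 1)

n, k = 100, 5
print(cycles_unpipelined(n, k))  # 500
print(cycles_pipelined(n, k))    # 104
print(cycles_unpipelined(n, k) / cycles_pipelined(n, k))  # ≈ 4.8× speed-up
```

The ratio approaches k (here, 5) as n grows, which is why the average rate tends towards one instruction per cycle.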

Pipeline hazards

  1. Structural hazard – two stages need the same hardware resource (e.g., a single memory port).

    Mitigation: duplicate the resource (separate instruction‑fetch and data‑memory ports) or insert a stall.

  2. Data hazard – an instruction needs a result that has not yet been written back (RAW).

    Mitigation: forwarding/bypassing from EX/MEM or MEM/WB registers; otherwise insert NOPs.

  3. Control hazard – a branch or jump changes the PC before the outcome is known.

    Mitigation: static prediction (e.g., “predict not taken”), dynamic branch prediction, branch delay slots, or pipeline flush.

Hazard‑mitigation techniques (quick reference)

  • Forwarding (bypassing) – the result produced in EX or MEM is routed directly to the EX stage of a dependent instruction.
  • Branch delay slot – the instruction immediately after a branch is always executed; the compiler can place a useful, independent instruction there.
  • Branch prediction – simple static (always not‑taken) or dynamic two‑bit saturating counters reduce mis‑prediction penalties.
  • Duplicated functional units – two ALUs, separate instruction‑fetch and data‑memory ports, or multiple load/store units eliminate structural conflicts.

Register file design for a pipelined RISC core

  • Typical size: 32 general‑purpose registers (GPRs), each 32 bits (or 64 bits for a 64‑bit core).
  • Two read ports – allow the ID stage to fetch two source operands in a single cycle.
  • One write port – used by the WB stage to write the result back.
  • Conflicts between a write‑back and a register read in the same cycle are avoided by performing the write in the first half of the clock cycle and the reads in the second half, so an instruction in ID can read a value written back in that same cycle.
  • Optional third read port for special purposes (e.g., forwarding to the EX stage) in more advanced designs.

Example of forwarding

```asm
add $t0, $t1, $t2   # $t0 = $t1 + $t2
sub $t3, $t0, $t4   # $t3 = $t0 - $t4
and $t5, $t3, $t6   # $t5 = $t3 & $t6
```

Without forwarding, the `sub` would have to wait until `add` reaches the WB stage (a two‑cycle stall). With a bypass path from the EX/MEM register to the EX stage, the result of `add` is supplied directly to `sub`, eliminating the stall.
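A hazard‑detection unit performs essentially this comparison in hardware. The sketch below (a hypothetical illustration; the tuple encoding of instructions is our own) scans the three instructions above for RAW dependencies within the two‑instruction distance that forwarding must cover:

```python
# Each instruction: (destination register, [source registers])
program = [
    ("$t0", ["$t1", "$t2"]),  # add $t0, $t1, $t2
    ("$t3", ["$t0", "$t4"]),  # sub $t3, $t0, $t4  (RAW on $t0)
    ("$t5", ["$t3", "$t6"]),  # and $t5, $t3, $t6  (RAW on $t3)
]

def raw_hazards(program, window=2):
    """Report (consumer index, producer index, register) for every RAW
    dependency within `window` instructions -- the distance a classic
    5-stage pipeline covers with EX/MEM and MEM/WB forwarding paths."""
    hazards = []
    for i, (_, sources) in enumerate(program):
        for j in range(max(0, i - window), i):
            dest, _ = program[j]
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards

print(raw_hazards(program))  # [(1, 0, '$t0'), (2, 1, '$t3')]
```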

Four basic computer‑architecture classifications (Flynn’s taxonomy)

| Class | Definition | Typical example |
| --- | --- | --- |
| SISD | Single Instruction stream, Single Data stream – one processor executes one instruction at a time on one datum. | Traditional single‑core CPU (no SIMD extensions) |
| SIMD | Single Instruction stream, Multiple Data streams – one instruction operates on many data elements simultaneously. | GPU vector units, Intel SSE/AVX, ARM NEON |
| MISD | Multiple Instruction streams, Single Data stream – rare; used in specialised fault‑tolerant or pipeline‑parallel systems. | Theoretical models; some digital‑signal‑processing pipelines |
| MIMD | Multiple Instruction streams, Multiple Data streams – each processor runs its own program on its own data. | Multi‑core CPUs, clusters, distributed systems |

Comparing SISD and MIMD

Data‑flow contrast (SISD vs. MIMD).

*(Insert two small block diagrams side‑by‑side)*

- SISD: one arrow from a single control unit to a single datapath.

- MIMD: several independent cores, each with its own control unit and datapath, connected to a shared memory hierarchy.

Massively‑parallel computers

  • Definition: Systems that contain hundreds, thousands or even millions of processing elements working concurrently.
  • Typical architectures

    • SIMD GPUs – many simple cores executing the same instruction on different data.
    • MIMD many‑core CPUs (e.g., Xeon Phi, ARM big.LITTLE clusters).
    • Distributed clusters – dozens to thousands of separate nodes linked by a network.

  • Key characteristics

    • Large number of relatively simple cores.
    • Hierarchical memory: fast on‑chip shared memory, slower global memory, and off‑chip DRAM.
    • High memory‑bandwidth and low‑latency interconnects (e.g., NVLink, InfiniBand).
    • Programming models: CUDA/OpenCL for GPUs, OpenMP/MPI for CPUs and clusters.

  • Performance metrics

    • Speed‑up = T_serial / T_parallel.
    • Amdahl’s law – the theoretical limit of speed‑up given the fraction of code that must remain serial.

  • Why it matters for A‑Level – exam questions often ask you to compare single‑core performance with parallel performance, or to identify parts of an algorithm that could be parallelised.
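Both metrics can be computed directly; a small sketch (p is the parallelisable fraction, n the number of processors):

```python
def speed_up(t_serial, t_parallel):
    """Speed-up = T_serial / T_parallel."""
    return t_serial / t_parallel

def amdahl(p, n):
    """Amdahl's law: the maximum speed-up when a fraction p of the work
    can be spread over n processors and (1 - p) must stay serial."""
    return 1 / ((1 - p) + p / n)

print(amdahl(0.95, 8))      # ≈ 5.9 -- far below the ideal 8x
print(amdahl(0.95, 10**6))  # ≈ 20 -- the serial 5% caps speed-up at 1/0.05
```

Note how even a small serial fraction dominates as n grows, which is the usual point of an exam question on parallelisation limits.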

Virtual machines

  • Definition: A software abstraction that mimics a physical computer, providing its own instruction set, memory model and runtime environment.
  • Layered structure

    1. Host operating system (provides hardware access).
    2. Hypervisor or runtime (e.g., JVM, CLR, QEMU) – interprets or JIT‑compiles byte‑code.
    3. Byte‑code (e.g., Java .class files, .NET MSIL, Python .pyc).
    4. Native machine code generated at run‑time (JIT) or interpreted directly.

  • Typical uses

    • Portability – “write once, run anywhere” (Java, .NET).
    • Security isolation – sandboxing of untrusted code.
    • Managed‑runtime services – garbage collection, runtime checks.
    • Emulation – QEMU or Bochs emulate a different CPU architecture.
    • Containers (e.g., Docker) – lightweight isolation using operating‑system features (namespaces, cgroups) rather than a full hardware virtual machine.

  • Examples

    • Java Virtual Machine (JVM) – executes Java bytecode.
    • .NET Common Language Runtime (CLR) – executes Microsoft Intermediate Language (MSIL).
    • CPython (the reference Python interpreter) – executes Python bytecode (.pyc).
    • QEMU – hardware‑level emulator that can run binaries compiled for a different ISA.

Key points for examination (15.1)

  1. State two reasons why RISC processors can use a simple hard‑wired control unit.
  2. Describe each of the five pipeline stages and indicate which registers are read or written in that stage.
  3. Give one example of a structural, a data and a control hazard and name one technique that removes each.
  4. Compare SISD and MIMD, supporting your answer with a short labelled diagram of data flow.
  5. Explain the role of a virtual machine in providing platform independence and give two concrete examples of its use.


15.2 Boolean Algebra and Logic Circuits

Learning objectives

  • Construct truth tables for any Boolean expression.
  • Apply the fundamental Boolean‑algebra laws to simplify expressions.
  • Use Karnaugh maps to obtain minimal sum‑of‑products (SOP) or product‑of‑sums (POS) forms.
  • Design combinational circuits (half‑adder, full‑adder, multiplexers, decoders) using basic gates.
  • Understand the operation of basic sequential elements (SR, D, JK, T flip‑flops) and how they are combined to form registers and counters.

Logic‑gate symbols and truth tables

| Gate | Boolean function | Truth table (A B → Y) |
| --- | --- | --- |
| AND | A·B | 0 0 → 0; 0 1 → 0; 1 0 → 0; 1 1 → 1 |
| OR | A + B | 0 0 → 0; 0 1 → 1; 1 0 → 1; 1 1 → 1 |
| NOT | Ā | 0 → 1; 1 → 0 |
| NAND | ¬(A·B) | 0 0 → 1; 0 1 → 1; 1 0 → 1; 1 1 → 0 |
| NOR | ¬(A + B) | 0 0 → 1; 0 1 → 0; 1 0 → 0; 1 1 → 0 |
| XOR | A ⊕ B | 0 0 → 0; 0 1 → 1; 1 0 → 1; 1 1 → 0 |
| XNOR | ¬(A ⊕ B) | 0 0 → 1; 0 1 → 0; 1 0 → 0; 1 1 → 1 |

Fundamental Boolean‑algebra laws

  • Identity: A + 0 = A  A·1 = A
  • Null: A + 1 = 1  A·0 = 0
  • Idempotent: A + A = A  A·A = A
  • Complement: A + Ā = 1  A·Ā = 0
  • Commutative: A + B = B + A  A·B = B·A
  • Associative: (A + B) + C = A + (B + C)  (A·B)·C = A·(B·C)
  • Distributive: A·(B + C) = A·B + A·C  A + (B·C) = (A + B)·(A + C)
  • De Morgan: ¬(A + B) = Ā·B̅  ¬(A·B) = Ā + B̅

Simplifying an expression – worked example

Given F = A·B + A·Ā·C + B·C

  1. Apply the complement law: A·Ā = 0, so the middle term A·Ā·C vanishes.
  2. The expression becomes F = A·B + B·C.
  3. Factor out B: F = B·(A + C). Equivalently, the minimal SOP form is A·B + B·C.
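Any simplification can be verified by brute force – evaluate both forms on all 2³ input combinations (a Python sketch; the helper name `equivalent` is ours):

```python
from itertools import product

def equivalent(f, g, n_vars):
    """True if f and g agree on every combination of n_vars Boolean inputs."""
    return all(f(*bits) == g(*bits)
               for bits in product([False, True], repeat=n_vars))

original   = lambda a, b, c: (a and b) or (a and (not a) and c) or (b and c)
simplified = lambda a, b, c: b and (a or c)

print(equivalent(original, simplified, 3))  # True
```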

Karnaugh‑map minimisation (3‑variable example)

Minimise G = Σm(1,2,3,5,7) (i.e., true for minterms 1,2,3,5,7).

```
       BC
        00  01  11  10
A = 0 |  0   1   1   1
A = 1 |  0   1   1   0
```

  • Group the four 1’s in the two centre columns, BC = 01 and 11 (minterms 1, 3, 5, 7) → term C.
  • Group the pair in the A = 0 row where B = 1 (minterms 2 and 3) → term Ā·B.

Minimal SOP: G = C + Ā·B.
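A brute‑force check (sketch) confirms that C + Ā·B is true on exactly the minterms 1, 2, 3, 5 and 7, where the minterm index is 4A + 2B + C:

```python
from itertools import product

minterms = {1, 2, 3, 5, 7}

def g(a, b, c):
    """Candidate minimal form: G = C + A'·B."""
    return c or ((not a) and b)

# Every row of the truth table must agree with the minterm list.
for a, b, c in product([0, 1], repeat=3):
    index = 4 * a + 2 * b + c
    assert bool(g(a, b, c)) == (index in minterms)

print("G = C + A'*B matches all eight rows of the truth table")
```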

Designing combinational circuits

Half‑adder

Adds two single‑bit numbers.

| A | B | Sum (S) | Carry (C) |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |

  • Sum = A ⊕ B  ( XOR gate )
  • Carry = A·B  ( AND gate )

Full‑adder

Adds three bits: A, B and Carry‑in (Cin).

| A | B | Cin | Sum | Cout |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |

  • Sum = A ⊕ B ⊕ Cin
  • Cout = (A·B) + (B·Cin) + (A·Cin) (majority function)
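Both equations can be checked exhaustively against ordinary binary addition (a short sketch):

```python
from itertools import product

def full_adder(a, b, cin):
    """Gate-level full adder: Sum = A xor B xor Cin,
    Cout = majority(A, B, Cin)."""
    s    = a ^ b ^ cin
    cout = (a & b) | (b & cin) | (a & cin)
    return s, cout

# Every row of the truth table must agree with arithmetic: 2*Cout + Sum = A + B + Cin
for a, b, cin in product([0, 1], repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin

print("full adder agrees with binary addition on all 8 input rows")
```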

Multiplexer (2‑to‑1)

Selects one of two data inputs (D0, D1) based on select line S.

  • Output Y = (S̅·D0) + (S·D1).
  • Implemented with two AND gates, one OR gate and an inverter for S̅.
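The output equation translates directly into code (a minimal sketch):

```python
def mux2(d0, d1, s):
    """2-to-1 multiplexer: Y = (not S)·D0 + S·D1."""
    return ((not s) and d0) or (s and d1)

assert mux2(d0=1, d1=0, s=0) == 1   # S = 0 selects D0
assert mux2(d0=1, d1=0, s=1) == 0   # S = 1 selects D1
```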

Decoder (2‑to‑4)

Activates exactly one of four outputs (Y0–Y3) according to a 2‑bit binary input (A1A0).

  • Y0 = Ā1·Ā0  Y1 = Ā1·A0  Y2 = A1·Ā0  Y3 = A1·A0
  • Implemented with four two‑input AND gates plus inverters to produce Ā1 and Ā0.

Sequential logic – flip‑flops

SR latch (asynchronous)

| S | R | Q (next) |
| --- | --- | --- |
| 0 | 0 | Q (no change) |
| 0 | 1 | 0 (reset) |
| 1 | 0 | 1 (set) |
| 1 | 1 | Invalid – both outputs are forced to 0 |

Used as a building block for other flip‑flops.

D flip‑flop (edge‑triggered)

  • Data input D is sampled on a clock edge (rising or falling).
  • Qnext = D at the moment of the clock transition.
  • Eliminates the illegal state of the SR latch.

JK flip‑flop

  • Inputs J and K behave like S and R, but the “both 1” condition toggles the output.
  • Truth table:

    | J | K | Qnext |
    | --- | --- | --- |
    | 0 | 0 | Q (no change) |
    | 0 | 1 | 0 |
    | 1 | 0 | 1 |
    | 1 | 1 | ¬Q (toggle) |

T flip‑flop

  • Single input T (Toggle).
  • Qnext = ¬Q when T = 1; otherwise Qnext = Q.
  • Often built from a JK flip‑flop with J = K = T.

Registers and counters

  • Register – a group of D flip‑flops (one per bit) with a common clock; provides a fast on‑chip storage element.
  • Shift register – registers connected serially; used for serial‑to‑parallel conversion.
  • Binary counter – cascade of T or JK flip‑flops; each stage toggles on the carry from the previous stage.

Example: 4‑bit synchronous binary up‑counter

Q0 (LSB) – T‑FF with T = 1

Q1 – T‑FF with T = Q0

Q2 – T‑FF with T = Q0·Q1

Q3 – T‑FF with T = Q0·Q1·Q2

All flip‑flops share the same clock edge.

The counter advances by one on each clock pulse; the carry‑logic (AND gates) implements the “toggle when all lower bits are 1” rule.
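The toggle rule can be simulated in a few lines (a sketch; every T input is computed from the current state before all flip‑flops update on the same clock edge):

```python
def counter_step(q):
    """One clock edge of the 4-bit synchronous up-counter.
    q = [Q0, Q1, Q2, Q3]; Qn toggles when all lower bits are 1."""
    t = [1, q[0], q[0] & q[1], q[0] & q[1] & q[2]]   # T inputs, sampled together
    return [bit ^ toggle for bit, toggle in zip(q, t)]

state = [0, 0, 0, 0]
for tick in range(5):
    state = counter_step(state)
value = sum(bit << i for i, bit in enumerate(state))
print(value)  # 5 -- the counter has advanced by one per clock pulse
```

After 16 pulses the state returns to 0000, i.e. the counter wraps modulo 16.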

Registers in a pipelined RISC core

  • Each pipeline stage is separated by a set of pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB).
  • These registers hold the instruction, control signals and intermediate data, ensuring that each stage works on a stable snapshot of the instruction.
  • They are essentially small, fast registers built from D flip‑flops, clocked by the same global clock as the processor.

Key points for examination (15.2)

  1. Construct the truth table for a given Boolean expression (e.g., (A + B)·Ā).
  2. Show, step‑by‑step, how an expression is reduced using Boolean laws.
  3. Use a Karnaugh map to minimise a 3‑ or 4‑variable function and write the resulting SOP or POS.
  4. Draw a half‑adder and a full‑adder circuit using only basic gates.
  5. Explain how a D flip‑flop works, and how several D flip‑flops are combined to form a register used in a pipelined processor.