4.1 Central Processing Unit (CPU) Architecture

Learning Objective

Show understanding of how various factors contribute to the overall performance of a computer system.

1. Von Neumann Model & Stored‑Program Concept

  • The CPU, memory and I/O share a single bus system and the same memory stores both data and the program instructions – the stored‑program principle.
  • Execution proceeds by repeatedly fetching the next instruction from memory, decoding it, executing it and writing back any result (the fetch‑decode‑execute‑write‑back cycle).

2. Core CPU Components (Von Neumann Model)

  • Control Unit (CU) – Decodes the opcode in the Instruction Register and generates the control signals that direct data movement, ALU operation and timing.
  • Arithmetic‑Logic Unit (ALU) – Performs arithmetic, logical and shift operations on the operands supplied by registers or memory.
  • Registers – Fast, on‑chip storage. They are divided into:

| Register | Purpose |
|---|---|
| Program Counter (PC) | Holds the address of the next instruction to fetch. |
| Instruction Register (IR) | Holds the currently fetched instruction. |
| Memory Address Register (MAR) | Provides the address for a memory read or write. |
| Memory Data Register (MDR) | Temporarily stores data transferred to/from memory. |
| Accumulator (ACC) | Primary arithmetic result register. |
| General‑purpose registers (R0‑R7, etc.) | Hold operands and intermediate results. |
| Index Register (IX) | Used for address calculation in indexed addressing modes. |
| Stack Pointer (SP) | Points to the top of the runtime stack. |
| Status / Flag Register | Contains condition flags (zero, carry, overflow, etc.). |
| Pipeline Registers (IF/ID, ID/EX, EX/MEM, MEM/WB) | Separate the stages of a pipeline, allowing overlapped execution. |

  • Cache Memory – Multi‑level hierarchy (L1, L2, L3) of high‑speed memory that stores frequently accessed instructions and data, reducing main‑memory latency.
  • Clock Generator – Produces a regular timing pulse; the clock rate (Hz) sets the duration of each cycle – a higher clock rate increases the denominator of the CPU‑time equation, reducing CPU time.
  • Bus Architecture

    • Address bus – Carries memory addresses from the PC, MAR or other units to memory.
    • Data bus – Transfers the actual data between CPU, cache, main memory and I/O.
    • Control bus – Carries control signals such as read/write, interrupt request, and clock.

Suggested diagram: Block diagram of a CPU showing the CU, ALU, registers (including PC, IR, MAR, MDR, ACC), cache hierarchy (L1/L2/L3), clock generator, and the three buses (address, data, control). Highlight the stored‑program concept.

3. The Instruction Cycle (Fetch → Decode → Execute → Write‑back)

  1. Fetch (IF)

    • PC → MAR → Memory: the address is sent along the address bus and a memory read is initiated.
    • Instruction placed on the data bus and loaded into the IR.

  2. Decode (ID)

    • CU reads the opcode in the IR, generates control signals, and identifies required operands.
    • Operands are read from registers or from memory (address supplied by MAR).

  3. Execute (EX)

    • ALU performs the specified arithmetic, logical or shift operation.
    • Result is placed in a temporary register (often the ACC).

  4. Write‑back (WB) (optional)

    • Result is written back to a destination register or stored in memory.

Register‑Transfer Notation (RTN) example (simplified)

  PC → MAR → MEM → IR (fetch)
  IR[opcode] → CU (decode)
  IR[operand] → R1 (operand fetch)
  R1, R2 → ALU → ACC (execute)
  ACC → R3 (or MEM) (write‑back)
  PC ← PC + 1 (in practice the PC is incremented during the fetch stage)
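The cycle above can be sketched as a toy simulator. This is a minimal sketch assuming a hypothetical one‑address accumulator machine; the opcodes (LOAD, ADD, STORE, HALT) and the tuple instruction format are illustrative, not a real ISA.

```python
# Toy fetch-decode-execute loop for a hypothetical one-address accumulator machine.
# The opcodes and instruction format are illustrative, not a real ISA.
def run(program, memory):
    pc, acc = 0, 0
    while True:
        ir = program[pc]          # Fetch: PC -> MAR -> memory -> IR
        pc += 1                   # PC advances during fetch
        opcode, operand = ir      # Decode: split into opcode and operand address
        if opcode == "LOAD":      # Execute / write-back
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory

# Usage: compute memory[2] = memory[0] + memory[1]
mem = {0: 5, 1: 7, 2: 0}
prog = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)]
print(run(prog, mem)[2])  # -> 12
```

Each loop iteration corresponds to one pass through fetch, decode, execute and (for STORE) write‑back.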

4. Interrupt Handling

  • An interrupt suspends the normal fetch‑decode‑execute flow.
  • The current PC (and often status flags) are saved on the stack.
  • The CPU jumps to the address of the Interrupt Service Routine (ISR).
  • After the ISR finishes, the saved PC and status are restored and execution resumes at the point of interruption.
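The save/jump/restore sequence can be sketched in a few lines; the dictionary‑based CPU state and the names `handle_interrupt` and `isr_address` are hypothetical, chosen only to mirror the steps listed above.

```python
# Sketch of interrupt entry/exit on a hypothetical CPU state held in a dict.
def handle_interrupt(state, isr_address, isr):
    # Save return context: current PC and status flags pushed onto the stack
    state["stack"].append((state["pc"], dict(state["flags"])))
    state["pc"] = isr_address     # Jump to the Interrupt Service Routine
    isr(state)                    # ISR runs to completion
    # Restore saved PC and flags; execution resumes where it was interrupted
    state["pc"], state["flags"] = state["stack"].pop()
    return state

state = {"pc": 104, "flags": {"zero": False}, "stack": []}
handle_interrupt(state, 0x200, lambda s: None)
print(state["pc"])  # -> 104: execution resumes at the point of interruption
```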

5. Quantifying CPU Performance

The fundamental performance equation is:

\$\text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}\$

  • Instruction Count (IC) – Total number of instructions executed for a program.
  • CPI (Cycles per Instruction) – Average number of clock cycles required per instruction. It varies with instruction mix, pipeline depth, cache hit‑rate, branch‑prediction accuracy, etc.
  • Clock Rate – Frequency of the CPU clock (Hz). A higher clock rate reduces the time of each cycle, directly lowering CPU time.
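The equation translates directly into code. A minimal sketch, reusing the figures from Section 7.1 (the function name `cpu_time` is illustrative):

```python
# CPU time = (instruction count x CPI) / clock rate, straight from the equation.
def cpu_time(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

# Figures from Section 7.1: 1.2e9 instructions, CPI 1.8, 2.5 GHz clock
print(cpu_time(1.2e9, 1.8, 2.5e9))  # ≈ 0.864 seconds
```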

6. Factors Influencing Performance

| Factor | Effect on Performance | Typical Design / Mitigation |
|---|---|---|
| Clock Speed | Higher frequency shortens each cycle → lower CPU time. | Smaller transistor geometries, dynamic frequency scaling (Turbo Boost), improved cooling. |
| CPI (Cycles per Instruction) | Lower CPI → fewer cycles per instruction → faster execution. | Pipelining, superscalar issue, micro‑op fusion, out‑of‑order execution, reducing cache‑miss penalties. |
| Instruction Count | Fewer instructions → less total work. | Optimised compilers, algorithmic improvements, choice of ISA (CISC vs. RISC). |
| Cache Hierarchy (L1/L2/L3) | Effective caching cuts memory‑access latency → lower CPI for memory‑bound code. | Multi‑level caches, larger cache lines, write‑back policies, hardware prefetching, coherence protocols. |
| Pipelining | Overlaps instruction stages, effectively reducing CPI. | Deep pipelines, hazard detection, forwarding, branch prediction. |
| Branch Prediction | Accurate prediction avoids pipeline stalls caused by control hazards. | Dynamic two‑level predictors, hybrid schemes, return‑address stacks. |
| Parallelism – Multi‑core | Independent threads run simultaneously → reduced overall program time. | Multiple cores, symmetric multiprocessing (SMP), thread‑level scheduling. |
| Parallelism – SIMD / Vector Units | One instruction processes many data elements → large speed‑up for data‑parallel workloads. | SIMD extensions (SSE, AVX), GPU off‑loading, vectorised libraries. |
| Instruction‑set design (RISC vs. CISC) | RISC: simple instructions, lower CPI; CISC: complex instructions, fewer total instructions. | ISA choice influences compiler optimisation, decoder complexity, and the IC/CPI trade‑off. |
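The instruction‑count factor is worth a concrete illustration: an algorithmic change can shrink the work from O(n) executed instructions to O(1) for the same result. A small sketch (function names are illustrative):

```python
# Same result, very different instruction counts: O(n) loop vs O(1) formula.
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):   # roughly n add instructions executed
        total += i
    return total

def sum_formula(n):
    return n * (n + 1) // 2     # a handful of instructions, independent of n

print(sum_loop(1000), sum_formula(1000))  # -> 500500 500500
```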

7. Example Calculations

7.1 Single‑core performance – CPI reduction

Program: 1.2 × 10⁹ instructions

Clock rate: 2.5 GHz

Baseline CPI: 1.8

\$\text{CPU Time}_{\text{baseline}} = \frac{1.2\times10^{9}\times1.8}{2.5\times10^{9}} = 0.864\ \text{s}\$

If cache optimisation lowers the average CPI to 1.4:

\$\text{CPU Time}_{\text{optimised}} = \frac{1.2\times10^{9}\times1.4}{2.5\times10^{9}} = 0.672\ \text{s}\$

7.2 Parallel vs. sequential execution – Amdahl’s Law

Assume 60 % of the work can be perfectly parallelised across 4 cores.

Speed‑up for the parallel portion: \(S_{p}=4\).

Overall speed‑up:

\$\$S = \frac{1}{(1-P) + \frac{P}{S_{p}}} = \frac{1}{0.40 + \frac{0.60}{4}} = \frac{1}{0.55} \approx 1.82\$\$

New execution time:

\$\text{CPU Time}_{\text{parallel}} = \frac{0.864\ \text{s}}{1.82} \approx 0.475\ \text{s}\$

This illustrates that, when a substantial portion of a program can be parallelised, adding cores yields a larger performance gain than a modest CPI improvement.
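Amdahl's Law is easy to check numerically. A short sketch reproducing the figures above (the function name `amdahl_speedup` is illustrative):

```python
# Amdahl's Law: overall speed-up for parallel fraction P and per-portion speed-up S_p.
def amdahl_speedup(p, s_p):
    return 1.0 / ((1.0 - p) + p / s_p)

s = amdahl_speedup(0.60, 4)    # 60% of the work across 4 cores
print(round(s, 2))             # -> 1.82
print(round(0.864 / s, 3))     # -> 0.475 (new execution time in seconds)
```

Note how the 40% sequential fraction caps the speed‑up well below the core count of 4.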

8. Summary Checklist

  • Define the Von Neumann model and the stored‑program principle.
  • Identify and describe all core CPU components: CU, ALU, the full set of special and general‑purpose registers, cache hierarchy, clock generator, and the three buses (address, data, control).
  • Draw and explain the fetch‑decode‑execute‑write‑back cycle using register‑transfer notation; describe how an interrupt modifies the flow.
  • Recall the CPU performance equation and explain why CPI varies with architecture and workload.
  • Discuss how clock speed, CPI, instruction count, cache design, pipelining, branch prediction, ISA choice, multi‑core and SIMD parallelism affect overall speed.
  • Perform calculations:

    • Compute CPU time before and after a CPI change.
    • Apply Amdahl’s Law to compare sequential and parallel execution.