15.1 Processors, Parallel Processing and Virtual Machines
Objective
Show understanding of the characteristics of massively parallel computers.
What is a Massively Parallel Computer?
A massively parallel computer (MPC) is a system that contains a very large number of processing elements (PEs) that operate concurrently. The number of PEs can range from several hundred to millions, and they are typically organised in a regular interconnection network.
Key Characteristics
High Degree of Concurrency – thousands to millions of PEs can execute instructions at the same time.
Fine‑grained Parallelism – tasks are divided into very small sub‑tasks that can be processed independently (a minimal code sketch follows this list).
Scalable Interconnection Networks – topologies such as mesh, torus, hyper‑cube, and fat‑tree allow communication to scale with the number of PEs.
Distributed Memory – each PE often has its own local memory, reducing contention for a single shared memory space.
Low Power per PE – individual processors are usually simple and energy‑efficient, enabling large numbers to be packed together.
Fault Tolerance – redundancy and graceful degradation are built in, so failure of some PEs does not halt the whole system.
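To make fine‑grained parallelism concrete, here is a minimal Python sketch that splits a large sum into many small, independent sub‑tasks; the process pool, chunk size, and data size are illustrative assumptions rather than anything prescribed by a particular MPC.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    """One fine-grained sub-task: sum a small slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 1_000  # illustrative: produces 1,000 small sub-tasks
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # The worker processes stand in for processing elements (PEs).
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(partial_sum, chunks))

    print(total == sum(data))  # True: parallel result matches the serial one
```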
Performance Metrics
Performance of an MPC is measured using several specialised metrics (a short worked example follows the table):
| Metric | Definition | Typical Use |
| --- | --- | --- |
| Speedup ($S$) | $S = \frac{T_1}{T_p}$, where $T_1$ is the execution time on a single processor and $T_p$ the execution time on $p$ processors. | Assess how much faster a parallel system runs compared to a serial one. |
| Efficiency ($E$) | $E = \frac{S}{p}$ | Shows how well the processors are utilised. |
| Scalability | Ability of the system to maintain efficiency as $p$ increases. | Important for future expansion. |
| Throughput | Number of tasks completed per unit time. | Critical for data‑intensive applications. |
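A short worked example of the first two metrics, using made‑up timing figures purely for illustration:

```python
def speedup(t1, tp):
    """S = T1 / Tp."""
    return t1 / tp

def efficiency(s, p):
    """E = S / p."""
    return s / p

t1 = 640.0   # seconds on a single processor (assumed figure)
tp = 12.5    # seconds on p processors (assumed figure)
p = 64

s = speedup(t1, tp)   # 640 / 12.5 = 51.2
e = efficiency(s, p)  # 51.2 / 64 = 0.8, i.e. 80% of ideal linear speedup
print(f"Speedup = {s:.1f}, Efficiency = {e:.2f}")
```

With these assumed figures the 64‑processor run is 51.2 times faster than the serial run, but each processor is only 80% utilised, which typically points to communication or load‑balancing overheads.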
Common Architectures
SIMD (Single Instruction, Multiple Data)
All PEs execute the same instruction on different data elements. Example: graphics processing units (GPUs).
MIMD (Multiple Instruction, Multiple Data)
Each PE can execute its own instruction stream. Example: large‑scale clusters and many‑core processors.
Hybrid SIMD/MIMD
Combines both models, e.g., a GPU with multiple streaming multiprocessors that can run independent kernels (a short sketch contrasting the two styles follows).
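The following pure‑Python sketch contrasts the two styles at a conceptual level only; ordinary lists and threads stand in for hardware lanes and processors, so it illustrates the programming pattern rather than real SIMD/MIMD hardware execution.

```python
from concurrent.futures import ThreadPoolExecutor

data = [1, 2, 3, 4]

# SIMD-style: the SAME operation applied to every data element.
simd_result = [x * x for x in data]  # one instruction pattern, multiple data

# MIMD-style: DIFFERENT instruction streams running concurrently.
def task_a():
    return sum(range(100))   # one stream performs a reduction

def task_b():
    return max(data)         # another stream does something different

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(task_a), pool.submit(task_b)]
    mimd_results = [f.result() for f in futures]

print(simd_result, mimd_results)  # [1, 4, 9, 16] [4950, 4]
```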
Programming Models
To exploit massive parallelism, programmers use specialised models and languages (an MPI sketch follows the list):
Message Passing Interface (MPI) – explicit communication between distributed PEs.
OpenMP – shared‑memory directives for loop parallelisation.
CUDA / OpenCL – APIs for programming GPUs with thousands of cores.
MapReduce – functional model for processing large data sets across many nodes.
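As a concrete illustration of the message‑passing style, here is a minimal sketch that assumes the third‑party mpi4py package and an MPI runtime (e.g. Open MPI) are installed; the script name and problem size are arbitrary.

```python
# Run with, e.g.:  mpiexec -n 4 python example.py   (file name is arbitrary)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this PE's identifier
size = comm.Get_size()   # total number of PEs

n = 1_000_000
# Distributed memory: each rank holds and processes only its own slice.
local = sum(range(rank, n, size))

# Explicit communication: combine the partial sums on rank 0.
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print("total =", total)   # sum of 0 .. n-1
```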
Advantages of Massively Parallel Computers
Ability to solve problems that are intractable on serial machines (e.g., climate modelling, protein folding).
Energy efficiency per operation due to simple, low‑power cores.
High fault tolerance through redundancy.
Scalable performance – adding more PEs can increase throughput linearly up to a point.
Challenges and Limitations
Programming Complexity – designing algorithms that effectively distribute work and minimise communication overhead.
Communication Latency – as the number of PEs grows, the cost of data movement can dominate.
Load Balancing – uneven distribution of work leaves some processors idle while others are overloaded (see the sketch after this list).
Memory Bandwidth – contention for shared resources can limit performance.
Cost – large numbers of processors and sophisticated interconnects increase system price.
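The load‑balancing point can be illustrated with a small Python sketch in which tasks of deliberately uneven cost are handed out dynamically from a shared pool; the task sizes and worker count are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor

def work(n):
    """A task whose cost grows with n, so the workload is uneven."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # A mixture of cheap and expensive tasks (invented sizes).
    tasks = [10_000, 5_000_000, 20_000, 3_000_000, 15_000, 4_000_000]

    with ProcessPoolExecutor(max_workers=3) as pool:
        # chunksize=1: each worker fetches a new task as soon as it is free,
        # which balances the load better than a fixed, static split.
        results = list(pool.map(work, tasks, chunksize=1))

    print(len(results), "tasks completed")
```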
Real‑World Examples
System
PE Count
Primary Use
Architecture
IBM Blue Gene/Q
\overline{1},048,576 cores
Scientific simulations
MIMD, 5‑D torus interconnect
N \cdot IDIA Tesla \cdot 100 GPU
5,120 CUDA cores
Deep learning, HPC
SIMD within SMs, MIMD across SMs
Google TPU v4
\overline{4},096 cores per pod
TensorFlow workloads
Matrix‑multiply specialised units
Suggested Diagram
A 2‑D mesh interconnection network showing thousands of processing elements with local memory and routing links.
Summary
Massively parallel computers harness a very high degree of concurrency through large numbers of simple processing elements, specialised interconnects, and distributed memory. They deliver extraordinary performance for data‑intensive and compute‑heavy tasks, but require careful algorithm design, efficient communication strategies, and robust load balancing to achieve their full potential.