Describe and use methods of data validation

Published by Patrick Mutisya · 14 days ago

Cambridge A-Level Computer Science 9618 – Data Integrity: Methods of Data Validation

6.2 Data Integrity

Objective

Describe and use methods of data validation to ensure that data stored, transmitted or processed is accurate, complete and reliable.

What is Data Integrity?

Data integrity refers to the correctness and consistency of data over its entire lifecycle. It is maintained by preventing accidental or unauthorised alteration of data, and by detecting such alteration when it does occur.

Common Sources of Data Errors

  • Human entry mistakes (typos, omitted fields)
  • Transmission noise (electrical interference, signal loss)
  • Hardware faults (bad sectors, memory errors)
  • Software bugs (incorrect algorithms, overflow)

Validation Techniques

Validation can be performed at three main stages:

  1. Input validation – before data is accepted.
  2. Processing validation – during calculations or transformations.
  3. Output/Storage validation – before data is written or transmitted.

1. Syntactic Validation

Checks that the data conforms to a required format.

  • Length checks – e.g., a postcode must be exactly 6 characters.
  • Pattern checks – using regular expressions, e.g., email must match ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$.
  • Data‑type checks – ensure a field is numeric, alphabetic, date, etc.
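The three syntactic checks above could be sketched in Python as follows. The function names and the year-month-day date format are illustrative assumptions, not part of the syllabus; the email pattern is the one quoted in the list.

```python
import re
from datetime import datetime

# Pattern copied from the pattern-check bullet above.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def length_check(value, required_length):
    """Length check: the value must have exactly the required number of characters."""
    return len(value) == required_length


def pattern_check(value):
    """Pattern check: the whole value must match the email pattern."""
    return EMAIL_PATTERN.fullmatch(value) is not None


def is_integer(value):
    """Data-type check: an optional sign followed by one or more ASCII digits."""
    return re.fullmatch(r"[+-]?[0-9]+", value) is not None


def is_date(value):
    """Data-type check: the text must parse as a real year-month-day calendar date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```

Each function only inspects the form of the text; whether the value is sensible (for example, a date in the future) is left to semantic validation.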

2. Semantic Validation

Ensures the data makes sense in the real‑world context.

  • Range checks – e.g., age must be between 0 and 120.
  • Cross‑field checks – e.g., Start Date must be earlier than End Date.
  • Lookup validation – verify a code exists in a reference table (referential integrity).
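A minimal Python sketch of the three semantic checks above. The course codes stand in for a reference table and are hypothetical, as are the function names.

```python
from datetime import date

# Hypothetical reference data standing in for a table of valid course codes.
VALID_COURSE_CODES = frozenset({"CS9618", "MA9709", "PH9702"})


def range_check(age, low=0, high=120):
    """Range check: the value must lie between the limits, inclusive."""
    return low <= age <= high


def cross_field_check(start, end):
    """Cross-field check: the start date must be strictly earlier than the end date."""
    return start < end


def lookup_check(code, reference=VALID_COURSE_CODES):
    """Lookup validation: the code must exist in the reference set."""
    return code in reference
```

For example, `cross_field_check(date(2024, 9, 1), date(2025, 6, 30))` passes, while two equal dates fail because the start must come first.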

3. Redundancy Checks

Use extra bits or values that can be recomputed to detect errors.

3.1 Parity Bits

A single bit added to a binary word to make the number of 1’s either even (even parity) or odd (odd parity).

Even parity condition:

\$P = b_1 \oplus b_2 \oplus \dots \oplus b_n\$

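where \$P\$ is the parity bit and \$\oplus\$ denotes XOR. If the data bits contain an odd number of 1’s, \$P\$ is set to 1 so that the complete word (data bits plus parity bit) contains an even number of 1’s; otherwise \$P\$ is 0. For odd parity, \$P\$ takes the opposite value.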
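The even parity rule can be tried out with a short Python sketch. The function names are illustrative, and bits are held as a list of 0s and 1s for readability rather than efficiency.

```python
from functools import reduce
from operator import xor


def even_parity_bit(bits):
    """XOR of all data bits: 1 when the number of 1s is odd, otherwise 0."""
    return reduce(xor, bits, 0)


def add_even_parity(bits):
    """Return the data bits followed by their even parity bit."""
    return list(bits) + [even_parity_bit(bits)]


def passes_even_parity(word):
    """A received word passes when it contains an even number of 1s."""
    return sum(word) % 2 == 0


data = [1, 0, 1, 1, 0, 0, 0]      # three 1s, so the parity bit is 1
word = add_even_parity(data)      # [1, 0, 1, 1, 0, 0, 0, 1]

one_error = word.copy()
one_error[2] ^= 1                 # flip one bit: the check fails (error detected)

two_errors = one_error.copy()
two_errors[5] ^= 1                # flip a second bit: the check passes (error missed)
```

The last case shows the known weakness of a single parity bit: an even number of flipped bits leaves the parity unchanged.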

3.2 Checksums

A checksum is the sum of all data bytes, often reduced modulo \$2^8\$ (one byte) or \$2^{16}\$ (two bytes).

Example for an 8‑bit checksum:

\$C = \left(\sum_{i=1}^{N} d_i\right) \bmod 256\$

When the data is received, the receiver recomputes \$C\$ and compares it with the transmitted checksum.
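A minimal Python sketch of the 8-bit checksum and the receiver's comparison. The function names and the sample message are assumptions chosen for illustration.

```python
def checksum8(data):
    """Sum of all byte values, reduced modulo 256."""
    return sum(data) % 256


def verify(data, received_checksum):
    """Receiver side: recompute the checksum and compare with the one sent."""
    return checksum8(data) == received_checksum


message = b"HELLO"                # byte values 72, 69, 76, 76, 79 (sum 372)
sent = checksum8(message)         # 372 mod 256 = 116

ok = verify(b"HELLO", sent)       # True: data unchanged
changed = verify(b"HELLP", sent)  # False: one byte altered
swapped = verify(b"OLLEH", sent)  # True: reordered bytes give the same sum
```

The final line illustrates why a plain sum cannot detect bytes arriving in the wrong order.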

3.3 Cyclic Redundancy Check (CRC)

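CRC treats the data as a binary polynomial \$D(x)\$. The sender multiplies \$D(x)\$ by \$x^{k}\$, where \$k\$ is the degree of the generator polynomial \$G(x)\$ (equivalent to appending \$k\$ zero bits), and divides the result by \$G(x)\$ using modulo‑2 arithmetic. The remainder \$R(x)\$ then replaces the appended zeros.

Transmission packet: \$T(x) = D(x) \cdot x^{k} + R(x)\$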

On receipt, the receiver checks that \$T(x) \bmod G(x) = 0\$. If not, an error is detected.
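The division can be followed step by step in a small Python sketch. It stores each polynomial as an integer whose bits are the coefficients. The function names, the sample data word and the generator \$x^4 + x + 1\$ (binary 10011) are illustrative assumptions; real protocols use standardised 16‑ or 32‑bit generators.

```python
def mod2_remainder(value, generator):
    """Remainder of modulo-2 (XOR) polynomial division of value by generator."""
    gen_len = generator.bit_length()
    while value.bit_length() >= gen_len:
        # Line the generator up with the leading 1 of value and subtract (XOR).
        shift = value.bit_length() - gen_len
        value ^= generator << shift
    return value


def crc_encode(data, generator):
    """Build T(x) = D(x) * x^k + R(x), where k is the degree of the generator."""
    k = generator.bit_length() - 1
    shifted = data << k                        # append k zero bits
    return shifted | mod2_remainder(shifted, generator)


def crc_check(received, generator):
    """Receiver side: the packet is accepted when the remainder is zero."""
    return mod2_remainder(received, generator) == 0


G = 0b10011                     # x^4 + x + 1, so k = 4
D = 0b1101011011                # sample data word
T = crc_encode(D, G)            # 0b11010110111110 (remainder 1110 appended)

accepted = crc_check(T, G)      # True: no error
damaged = T ^ (1 << 3)          # flip one bit in transit
rejected = crc_check(damaged, G)  # False: error detected
```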

3.4 Cryptographic Hash Functions

Functions such as SHA‑256 produce a fixed‑length digest that changes dramatically with any alteration of the input.

Used for integrity verification of files, software distribution and database records.
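A short sketch using Python's standard `hashlib` module to compute and check a SHA‑256 digest. The helper names and the sample byte strings are assumptions for illustration.

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 digest of the data as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


def verify_digest(data, expected_hex):
    """Recompute the digest and compare it with the published value."""
    return sha256_hex(data) == expected_hex


original = b"Student record v1"
published = sha256_hex(original)               # digest stored or sent with the data

intact = verify_digest(b"Student record v1", published)    # True
tampered = verify_digest(b"Student record v2", published)  # False: one character changed
```

The digest is always 256 bits (64 hex characters) regardless of the input length, which is why it suits files of any size.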

Comparison of Redundancy Methods

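Method | Typical Size (bits) | Error Detection Capability | Common Uses
Parity Bit | 1 | Detects any single‑bit error (and any odd number of bit errors); fails for an even number of bit errors | Memory modules, simple serial links
Checksum | 8–16 | Detects any single‑byte error in a simple byte sum; misses reordered bytes and errors that cancel out | Internet protocols (e.g., the 16‑bit checksum in IP, TCP and UDP headers)
CRC | 16–32 | Detects all single‑bit errors and all burst errors no longer than the degree of \$G(x)\$; with a well‑chosen generator, also all double‑bit errors (up to a block‑length limit) and all odd numbers of bit errors | Ethernet, USB, storage devices
Hash (SHA‑256) | 256 | An accidental change goes undetected with probability of about \$2^{-256}\$; finding a deliberate collision is computationally infeasible | Software distribution, digital signatures, database integrity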

Practical Validation Example (Pseudo‑code)

function validateStudentRecord(record):
    # 1. Syntactic checks
    if not matches(record.id, r'^[A-Z]{2}\d{4}$'):
        return false, "Invalid ID format"
    if not isNumeric(record.age):
        return false, "Age must be numeric"

    # 2. Semantic checks
    age = int(record.age)
    if age < 0 or age > 120:
        return false, "Age out of realistic range"

    # 3. Cross‑field check (start date must be strictly earlier than end date)
    if record.startDate >= record.endDate:
        return false, "Start date must be earlier than end date"

    # 4. Referential integrity
    if not existsInTable('Courses', record.courseCode):
        return false, "Course code does not exist"

    # 5. Redundancy check (simple 8‑bit checksum, as in section 3.2)
    dataBytes = toBytes(record.id + record.name + record.age)
    checksum = sum(dataBytes) % 256
    if checksum != record.checksum:
        return false, "Checksum mismatch"

    return true, "Record valid"

Best Practices for Data Validation

  • Validate as early as possible – at the point of entry.
  • Never trust client‑side validation alone; always repeat checks on the server.
  • Provide clear error messages to guide correction.
  • Use layered validation – combine syntactic, semantic and redundancy checks.
  • Log validation failures for audit trails and to detect systematic issues.
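Several of these practices (layered checks, clear error messages, logging of failures) could be combined as in the sketch below. The record layout, the check list and the logger name are hypothetical; the code would run on the server side regardless of any checks already done by the client.

```python
import logging
import re

logger = logging.getLogger("validation")

# Each layer is a pair: a clear message for the user and a check function.
CHECKS = [
    ("ID must be two capital letters followed by four digits",
     lambda r: re.fullmatch(r"[A-Z]{2}[0-9]{4}", r["id"]) is not None),
    ("Age must be between 0 and 120",
     lambda r: 0 <= r["age"] <= 120),
]


def validate_layers(record, checks=CHECKS):
    """Run every check in order; return the messages of all checks that failed."""
    errors = []
    for message, check in checks:
        if not check(record):
            errors.append(message)
            # Log each failure so that systematic problems can be spotted later.
            logger.warning("Validation failed: %s (id=%r)", message, record.get("id"))
    return errors
```

Returning every failed message at once, instead of stopping at the first, lets the user correct all problems in a single attempt.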

Suggested diagram: Flowchart showing the stages of data validation (input → syntactic → semantic → redundancy → storage/transmission).

Key Take‑aways

  • Data integrity is essential for reliable computing systems.
  • Validation methods can be categorized into syntactic, semantic and redundancy techniques.
  • Parity, checksum, CRC and cryptographic hashes each offer different levels of error detection.
  • Effective validation combines multiple techniques and follows best practice guidelines.