Identify direct and indirect data sources

Data Processing and Information

1. Data and Information

Data are raw facts, figures or symbols that have no meaning on their own. Information is the result of giving data context, meaning and purpose – i.e. data become information when they are processed, interpreted and used to support decision‑making. Both direct and indirect data sources can be used to produce information, but the choice of source influences the quality, cost and timeliness of the final output.

2. Sources of Data

2.1 Direct Data Sources

Data are collected straight from the original event, object or person without any intermediate transformation.

Source | Typical Examples | Advantages | Disadvantages
Questionnaires / Surveys | Online forms, paper-based surveys | High control over questions; can target specific groups | Respondent bias; time-consuming to design and analyse
Interviews & Focus Groups | Face-to-face or video interviews, group discussions | Rich, qualitative insight; ability to probe deeper | Requires skilled interviewer; limited sample size
Data-logging (Sensors) | Temperature probes, motion detectors, GPS trackers | Real-time, objective measurements; minimal human error | Equipment cost; maintenance and calibration needed
Observation | Manual counts of footfall, behavioural observation | Direct view of actual behaviour; no reliance on self-report | Observer bias; may be intrusive
Weather Data | On-site weather stations, portable anemometers | Accurate, location-specific information | Limited to physical conditions; equipment dependent
Transaction Records (POS) | Sales receipts, checkout logs | Exact, time-stamped data; useful for trend analysis | May contain errors if system fails; privacy concerns
Manual Measurements | Ruler, stopwatch, tape measure | Simple, low-tech; immediate results | Human error; limited precision

2.2 Indirect Data Sources

Data are derived, aggregated or transformed from existing records or publications.

Source | Typical Examples | Advantages | Disadvantages
Census & Electoral Register | National population counts, voter lists | Comprehensive coverage; highly authoritative | Often outdated; limited to demographic variables
Commercial Data Sets | Market-research databases, credit-card transaction aggregates | Large volume; ready-made for analysis | Costly licences; may not match specific research needs
Statistical Reports (Government) | Labour-market statistics, health-service reports | Standardised methodology; trustworthy source | Published at set intervals; limited timeliness
Research Publications & Articles | Peer-reviewed journals, conference papers | Validated findings; detailed methodology | May be behind paywalls; context may differ
Web-Analytics Summaries | Average session duration, bounce rate from Google Analytics | Instant access; useful for digital performance | Aggregated, so raw detail is lost; dependent on tracking setup
Historical Archives | Old survey data, previous research datasets | Provides longitudinal perspective | Potential incompatibility with current formats; may be incomplete

3. Quality of Information

When evaluating any data source, consider the five criteria below. A single data set can fail on one or more of these, affecting its usefulness.

  • Accuracy – How close the data are to the true value.
  • Relevance – Whether the data answer the specific question.
  • Age (Timeliness) – How up‑to‑date the data are.
  • Level of Detail – The granularity required for the analysis.
  • Completeness – Extent to which all needed data are present.

Example: A 2020 national census provides very accurate population counts (high accuracy) but may be unsuitable for a 2025 market-trend study because the data are out of date (poor timeliness).
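
To make the completeness and timeliness criteria concrete, the minimal Python sketch below scores a small set of records against them; the field names and the five-year age threshold are illustrative assumptions, not fixed rules.

from datetime import date

# Illustrative records: Store B is missing its sales figure,
# and Store A's data were collected some years ago.
records = [
    {"name": "Store A", "sales": 1200, "collected": date(2020, 3, 1)},
    {"name": "Store B", "sales": None, "collected": date(2024, 6, 1)},
]

required_fields = ["name", "sales", "collected"]

def completeness(record):
    # Fraction of required fields that are actually present (not None)
    present = sum(1 for f in required_fields if record.get(f) is not None)
    return present / len(required_fields)

def is_current(record, max_age_years=5):
    # True if the record was collected within the assumed age limit
    age_days = (date.today() - record["collected"]).days
    return age_days <= max_age_years * 365

for r in records:
    print(r["name"], f"completeness={completeness(r):.2f}", f"current={is_current(r)}")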

4. Choosing the Appropriate Data Source

  1. Define the information requirement clearly (what, why, for whom).
  2. Decide whether real‑time, recent or historical data are needed.
  3. Check the feasibility of obtaining a direct source (e.g., can you run a questionnaire or install a sensor?).
  4. Assess the reliability, relevance and quality of any indirect source.
  5. Consider ethical, legal and cost implications (e.g., consent, GDPR, licence fees).
  6. Select the source that best balances accuracy, timeliness, cost and practicality.

Comparison of Direct and Indirect Data Sources

Aspect | Direct Source | Indirect Source
Origin of data | Collected at the point of occurrence | Derived from existing records or reports
Control over collection | High: researcher designs the method | Low: depends on how the original data were gathered
Timeliness | Often real-time or current | May be outdated or historical
Cost | Usually higher (fieldwork, equipment) | Generally lower (reuse of existing data)
Potential for bias | Depends on design and execution | May inherit biases from the original source

5. Encryption of Data

Encryption protects data from unauthorised access, especially when data travel over insecure networks or are stored on shared media.

Topic | Key Points
Why encrypt? | Confidentiality, integrity, compliance with data-protection laws (e.g., GDPR).
Symmetric encryption | Same secret key for encryption and decryption (e.g., AES). Fast, but key distribution is a challenge.
Asymmetric encryption | Public-key pair (e.g., RSA). Enables secure key exchange; slower than symmetric.
Encryption protocols | TLS/SSL: encrypted channel for web traffic (HTTPS); uses asymmetric exchange to agree on a symmetric session key. IPsec: encrypts IP packets at the network layer; used for VPNs.
Key management | Generation, safe storage, rotation and revocation of keys are essential; poor key management nullifies the benefit of encryption.
Pros / Cons | Pros: data remain confidential, tamper-evident, and meet legal requirements. Cons: processing overhead, key-management complexity, possible interoperability issues.
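
As an illustration of the symmetric case, the minimal Python sketch below uses the third-party cryptography package (Fernet, which is built on AES). The message text is illustrative, and in practice the key itself must be generated, stored and rotated securely, as noted under key management.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # same secret key encrypts and decrypts
cipher = Fernet(key)

plaintext = b"Customer record: card ending 1234"   # illustrative data
token = cipher.encrypt(plaintext)  # ciphertext, safe to store or transmit
recovered = cipher.decrypt(token)  # only possible with the same key

print(recovered == plaintext)      # True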

6. Checking the Accuracy of Data

6.1 Validation vs. Verification

Aspect | Validation | Verification
Purpose | Ensures the data are relevant and logically consistent for the intended use. | Ensures the data have been entered or transferred correctly.
Typical Methods | Range checks, type checks, length checks, format checks, check-digit, lookup tables, consistency checks, business-rule limits. | Checksums, audit trails, double-entry comparison, hash verification.
Advantages | Reduces inappropriate or nonsensical data before analysis. | Detects transcription or transmission errors.
Disadvantages | May reject legitimate outliers if rules are too strict. | Can be time-consuming; may require extra resources.
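
A minimal sketch of verification by hash comparison: the sender and the receiver each compute a SHA-256 digest of the same data, and matching digests indicate the data arrived unaltered. The data values are illustrative.

import hashlib

original = b"2025-01-07,Store A,1200.50"      # data as sent
received = b"2025-01-07,Store A,1200.50"      # data as received

def digest(data: bytes) -> str:
    # A SHA-256 digest acts as a fingerprint of the data
    return hashlib.sha256(data).hexdigest()

print(digest(original) == digest(received))   # True means verification passed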

6.2 Validation Checklist (Syllabus terminology)

Check | What it ensures
Presence (mandatory) | Field is not left blank.
Range | Value lies between defined minimum and maximum.
Type | Data are of the correct kind (numeric, text, date).
Length | Number of characters/digits is within limits.
Format | Structure matches a pattern (e.g., postcode, email).
Check-digit | Mathematical test (e.g., ISBN) confirms integrity.
Lookup | Value exists in a predefined list (e.g., country codes).
Consistency | Related fields agree (e.g., start date ≤ end date).
Limit (business rule) | Data obey organisational constraints (e.g., stock cannot be negative).
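
The Python sketch below implements a few of these checks against a hypothetical order record; the field names, limits and email pattern are illustrative assumptions rather than syllabus requirements.

import re
from datetime import date

record = {"customer_id": "C1042", "quantity": 12, "email": "sales@example.com",
          "start": date(2025, 1, 1), "end": date(2025, 1, 31)}

errors = []

# Presence check: mandatory field must not be blank
if not record.get("customer_id"):
    errors.append("customer_id missing")

# Range / limit check: quantity must lie within assumed business limits
if not (1 <= record["quantity"] <= 1000):
    errors.append("quantity out of range")

# Type check: quantity must be numeric
if not isinstance(record["quantity"], int):
    errors.append("quantity not numeric")

# Format check: email must match a simple pattern
if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]):
    errors.append("email format invalid")

# Consistency check: start date must not be after end date
if record["start"] > record["end"]:
    errors.append("start date after end date")

print("valid" if not errors else errors)

# Check-digit check (ISBN-10): weighted sum of the digits must be divisible by 11
def isbn10_valid(isbn: str) -> bool:
    digits = [10 if c in "Xx" else int(c) for c in isbn if c.isalnum()]
    return len(digits) == 10 and sum(d * w for d, w in zip(digits, range(10, 0, -1))) % 11 == 0

print(isbn10_valid("0306406152"))   # True for this well-formed ISBN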

7. Data‑Processing Methods

7.1 Overview

  • Batch processing – large volumes handled at scheduled times (e.g., nightly payroll).
  • Online (transaction‑oriented) processing – each transaction is processed immediately (e.g., ATM withdrawals).
  • Real‑time processing – results must be produced within a strict time limit (e.g., air‑traffic control).

7.2 Comparative Matrix

Criterion | Batch | Online | Real-time
Typical Use-case | Large, non-time-critical jobs (payroll, end-of-day reports) | Interactive transactions (banking, retail POS) | Safety-critical or time-sensitive systems (control, monitoring)
Data latency | Hours or days | Seconds | Milliseconds / microseconds
System load | High during scheduled run, low otherwise | Steady moderate load | Continuous high load; requires robust hardware
Advantages | Efficient for huge data sets; simple error handling. | Immediate feedback; supports concurrent users. | Meets strict timing requirements; enables automatic control.
Disadvantages | Delayed results; errors discovered late. | Higher resource consumption; more complex concurrency control. | Complex design; expensive hardware; stringent testing.

7.3 Simple Algorithms (pseudocode)

Batch processing

FOR each input file IN scheduled_folder
    READ all records
    VALIDATE each record
    TRANSFORM as required
    APPEND to master_output
END FOR
SAVE master_output

Online (transaction‑oriented) processing

WHILE system is running
    WAIT for transaction request
    IF request received THEN
        VALIDATE request
        UPDATE database
        RETURN confirmation to user
    END IF
END WHILE

Real‑time processing

LOOP every 10 ms
    CAPTURE sensor input
    IF input violates safety limit THEN
        TRIGGER alarm / corrective action
    END IF
    LOG event
END LOOP
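
The pseudocode above is language-neutral. As one possible concrete form, the short Python sketch below applies the batch pattern to CSV files in a hypothetical scheduled_folder with assumed date and amount columns; the validation and transformation steps are deliberately trivial.

import csv
from pathlib import Path

scheduled_folder = Path("scheduled_folder")   # assumed input location
master_output = []

for input_file in sorted(scheduled_folder.glob("*.csv")):
    with input_file.open(newline="") as f:
        for record in csv.DictReader(f):
            # VALIDATE: skip records with a missing or non-numeric amount
            try:
                amount = float(record["amount"])
            except (KeyError, ValueError):
                continue
            # TRANSFORM: keep only the fields needed downstream
            master_output.append({"date": record["date"], "amount": amount})

# SAVE master_output
with open("master_output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "amount"])
    writer.writeheader()
    writer.writerows(master_output)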

8. Illustrative Example

A retail chain wants to understand weekly sales trends.

  • Direct source: Export daily sales figures from each store’s POS system and aggregate them in a spreadsheet.
  • Indirect source: Use a published market‑research report that summarises industry‑wide sales patterns.

For precise internal analysis, the direct POS data are preferred. For benchmarking against competitors, the indirect market report adds valuable context.
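
As a minimal sketch of the direct-source option, the Python fragment below aggregates illustrative daily POS totals into weekly figures using ISO week numbers; the dates and sales values are made up for the example.

from collections import defaultdict
from datetime import date

daily_sales = {date(2025, 1, 6): 1200.0, date(2025, 1, 7): 950.5,
               date(2025, 1, 13): 1100.0}

weekly = defaultdict(float)
for day, amount in daily_sales.items():
    year, week, _ = day.isocalendar()          # group by ISO week number
    weekly[(year, week)] += amount

for (year, week), total in sorted(weekly.items()):
    print(f"{year} week {week}: {total:.2f}")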

9. Key Points to Remember

  • Direct data give you maximum control but can be costly and time‑consuming.
  • Indirect data are useful for background, context and comparison, but you must check relevance, age and original quality.
  • Assess the five quality criteria: accuracy, relevance, age, level of detail and completeness.
  • Validate data (presence, range, type, etc.) to ensure suitability; verify data to confirm they have been recorded correctly.
  • Encrypt sensitive data – choose the appropriate method (symmetric, asymmetric, TLS/SSL, IPsec) and manage keys securely.
  • Match the processing method (batch, online, real‑time) to the business need and resource constraints.
  • Combine direct and indirect sources where it enriches the analysis.

10. Suggested Diagram

Decision flowchart: Choose Direct or Indirect source → Assess quality criteria → Validation → Verification → (optional) Encryption → Select processing method (Batch / Online / Real‑time).

11. Mathematical Representation of Data Processing

The transformation of raw data D into useful information I can be expressed as:

$$ I = f(D) $$

where f represents the set of processing operations (sorting, filtering, aggregation, validation, encryption, etc.).
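
A tiny Python illustration of this idea, where f is a composition of filtering, sorting and aggregation applied to illustrative raw values:

D = [23, 19, None, 31, 27]                        # raw data (with a gap)

def f(data):
    cleaned = [x for x in data if x is not None]  # filtering (validation)
    ordered = sorted(cleaned)                     # sorting
    return {"min": ordered[0], "max": ordered[-1],
            "mean": sum(ordered) / len(ordered)}  # aggregation

I = f(D)   # information: a summary with context and meaning
print(I)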
