Perform data transformation and cleaning

Cambridge A‑Level IT 9626 – Complete Syllabus Notes (AS 1‑11, A‑Level 12‑21)

How to Use These Notes

  • Each topic follows the Cambridge assessment objectives: AO1 – Knowledge (definitions, concepts, terminology);
    AO2 – Practical solution (activities you can try in a spreadsheet, SQL database, Python (pandas), or any other tool);
    AO3 – Evaluation (advantages, disadvantages, ethical or societal implications).
  • Key terminology is **bold‑faced** and grouped in a “Key terminology” box for quick revision.
  • Where relevant, short examples illustrate how the concept is applied.
  • At the end of every topic you will find an Exam‑skill checklist to help you prepare for Paper 4.

Topic 1 – Data Processing

Key terminology: binary, ASCII, Unicode, lossless compression, lossy compression, file format, validation, verification, real‑time processing, sensor‑actuator loop.

AO1 – Knowledge

  • Binary representation of text, images, sound and video (e.g., ASCII for English text, Unicode for multilingual text).
  • Data compression:
    • Lossless (e.g., PNG, ZIP) – original data can be perfectly reconstructed.
    • Lossy (e.g., JPEG, MP3) – some information is permanently discarded to achieve higher compression.
  • Common file formats and extensions: .txt, .csv, .jpg, .png, .mp3, .mp4, .pdf.
  • Data validation – checking that input meets required format, range or type (e.g., “age must be a whole number between 0 and 120”).
  • Data verification – confirming that data has been correctly transferred or stored (e.g., checksum, hash comparison).
  • Real‑time processing – handling data as it is generated, typical in sensor‑actuator loops for control systems.

AO2 – Practical Activity

  • Convert the sentence “Cambridge” into ASCII, Unicode (UTF‑8) and binary.
  • Compress a high‑resolution photograph using PNG (lossless) and JPEG (lossy); record file sizes and visual differences.
  • Design a simple validation rule in a spreadsheet: ensure that a “Phone number” column contains exactly 10 digits; use COUNTIF to flag invalid entries.
  • Simulate a real‑time sensor‑actuator loop in Python: generate a random temperature every second and trigger an “alarm” if it exceeds 30 °C.

AO3 – Evaluation

  • Lossless vs. lossy compression – lossless is essential for medical imaging, legal documents, and archival; lossy is acceptable for web images where bandwidth is limited.
  • Validation & verification – prevents garbage‑in‑garbage‑out; however, overly strict validation can reject legitimate data (e.g., international phone numbers).
  • Real‑time processing – enables immediate response in safety‑critical systems, but requires robust hardware and can be expensive to implement.

Exam‑skill checklist

  • Define lossless and lossy compression and give an example of each.
  • Explain why validation is performed before data is stored.
  • Describe a situation where real‑time processing is mandatory.

Topic 2 – Hardware and Software

Key terminology: CPU, RAM, storage, motherboard, I/O device, mainframe, minicomputer, micro‑computer, super‑computer, utility software, system software, application software, proprietary software, open‑source software.

AO1 – Knowledge

  • Core hardware components:
    • CPU – processes instructions; clock speed measured in GHz.
    • RAM – volatile memory for temporary data storage.
    • Storage – HDD, SSD, magnetic tape; persistent data.
    • Motherboard – connects all components via buses.
    • I/O devices – keyboard, mouse, monitor, printers, scanners.
  • Computer classes:
    • Mainframe – high‑capacity, high‑reliability systems for bulk processing (e.g., banking). Advantages: fault tolerance, massive I/O throughput. Disadvantages: very high cost, specialised staff.
    • Super‑computer – extreme processing power for scientific simulations; uses parallel architectures.
    • Minicomputer – mid‑range, multi‑user (historical term, now largely replaced by servers).
    • Micro‑computer – personal computers, laptops, tablets.
    • Mobile devices – smartphones, tablets; ARM architecture, battery‑powered.
  • Software categories:
    • System software – operating systems, device drivers.
    • Application software – word processors, spreadsheets, browsers.
    • Utility software – backup tools, disk defragmenters, antivirus, compression utilities.
  • Licensing models:
    • Proprietary software – source code closed, licence fee.
    • Open‑source software – source code freely available, may be modified and redistributed.

AO2 – Practical Activity

  • Label the components of a disassembled laptop using a diagram.
  • Install a free open‑source application (e.g., LibreOffice) and a proprietary one (e.g., Microsoft Office); note differences in installation steps and licence prompts.
  • Use a system‑information utility (e.g., CPU‑Z) to record CPU clock speed, RAM size and storage type.

AO3 – Evaluation

  • Mainframes – high reliability and massive transaction handling are ideal for banks, but the capital expenditure is prohibitive for small businesses.
  • Super‑computers – enable climate modelling and particle‑physics research; however, they consume large amounts of electricity and require specialised facilities.
  • Utility software – improves system performance and security; yet excessive use of utilities (e.g., multiple antivirus programmes) can cause conflicts.
  • Proprietary vs. open‑source – proprietary often offers dedicated support and polished UI, while open‑source provides flexibility and lower cost but may need more technical expertise.

Exam‑skill checklist

  • List three hardware components and state their purpose.
  • Distinguish between system, application and utility software with examples.
  • Give one advantage and one disadvantage of using open‑source software in a school.

Topic 3 – Monitoring and Control

Key terminology: performance monitoring, event log, SNMP, feedback loop, PID controller, batch processing, real‑time control, sensor‑actuator loop.

AO1 – Knowledge

  • Purpose of monitoring: track performance, detect faults, ensure security, support capacity planning.
  • Common monitoring tools:
    • Task Manager / Activity Monitor – CPU, memory, disk usage.
    • Event logs – record system and application events.
    • Network monitors – Wireshark, Nagios.
    • SNMP (Simple Network Management Protocol) – remote device status.
  • Control mechanisms:
    • Feedback loop – output measured and fed back to adjust input.
    • PID controller – Proportional, Integral, Derivative control for precise regulation.
    • Automation scripts – batch files, PowerShell, cron jobs.
  • Real‑time control – sensor‑actuator loops where response must occur within strict time limits (e.g., temperature control in a chemical plant).

AO2 – Practical Activity

  • Record CPU, memory and disk usage every 30 seconds for 10 minutes using a system monitor; plot the data in a spreadsheet.
  • Create a batch script that copies a folder to a backup drive each night and logs the date‑time of the operation.
  • Write a simple Python script that reads a temperature sensor (simulated) and turns a virtual “cooling fan” on when temperature > 30 °C (demonstrates a sensor‑actuator loop).

AO3 – Evaluation

  • Real‑time monitoring – enables rapid fault detection, crucial for e‑commerce sites; however, continuous monitoring can generate large log files and raise privacy concerns.
  • Batch processing – efficient for off‑peak data loads; but delays results until the batch runs, unsuitable for time‑critical decisions.
  • PID controllers – provide smooth, accurate control in industrial settings; require careful tuning and may be over‑engineered for simple tasks.

Exam‑skill checklist

  • Explain the difference between real‑time monitoring and batch processing.
  • Describe the three components of a PID controller.
  • Identify two privacy issues that arise from employee activity monitoring.

Topic 4 – Algorithms

Key terminology: algorithm, flowchart, pseudocode, sequencing, selection, iteration, recursion, termination condition.

AO1 – Knowledge

  • An algorithm is a finite, unambiguous set of steps that solves a problem.
  • Characteristics: definite, effective, finite, input & output.
  • Algorithmic techniques:
    • Sequencing – step‑by‑step execution.
    • Selection – IF‑THEN‑ELSE decisions.
    • Iteration – loops (FOR, WHILE, REPEAT‑UNTIL).
    • Recursion – a function calling itself with a simpler case.
  • Flowchart symbols: terminator, process, decision, input/output, connector.
  • Standard pseudocode conventions: capitalised keywords (IF, THEN, END IF), indentation for blocks.

AO2 – Practical Activity

  • Design a flowchart and write pseudocode to calculate the factorial of a number using both iteration and recursion.
  • Implement the iterative version in a spreadsheet using a helper column; implement the recursive version in Python.
  • Test both implementations with n = 5 and n = 0; record the results.

AO3 – Evaluation

  • Recursion vs. iteration – recursion can simplify code for problems like tree traversal, but uses more memory (call stack) and may cause stack overflow for large inputs.
  • Flowcharts – excellent for communicating simple algorithms to non‑programmers; become unwieldy for large, nested logic, where pseudocode is clearer.

Exam‑skill checklist

  • State the four characteristics that make a set of steps an algorithm.
  • Convert a simple “find the greatest of three numbers” algorithm from pseudocode to a flowchart.
  • Explain when recursion is preferred over iteration.

Topic 5 – E‑Security

Key terminology: CIA triad, malware, phishing, DDoS, insider attack, firewall, encryption, TLS/SSL, IPsec, authentication, multi‑factor authentication (MFA), access control list (ACL).

AO1 – Knowledge

  • Confidentiality, Integrity, Availability (CIA) – the three pillars of information security.
  • Common threats:
    • Malware – viruses, worms, ransomware.
    • Phishing – deceptive emails or messages.
    • DDoS – overwhelming a service with traffic.
    • Insider attacks – unauthorised actions by employees.
  • Security controls:
    • Firewalls – packet‑filtering, stateful inspection.
    • Encryption – symmetric (AES) and asymmetric (RSA) algorithms.
    • Transport Layer Security (TLS/SSL) – encrypts data in transit for web traffic.
    • IPsec – secures IP packets at the network layer.
    • Authentication – passwords, biometrics, tokens.
    • Multi‑factor authentication (MFA) – combines two or more factors.
    • Access Control List (ACL) – defines which users or systems may access resources.

AO2 – Practical Activity

  • Generate a strong password (minimum 12 characters, mix of upper‑case, lower‑case, numbers, symbols) using a password manager; test its strength with an online entropy checker.
  • Configure a firewall rule on your router to block inbound traffic on port 23 (Telnet) and verify with telnet from another machine.
  • Set up a simple HTTPS website using a self‑signed TLS certificate; browse the site and view the certificate details.
  • Demonstrate IPsec by configuring a site‑to‑site VPN between two virtual machines (use a free tool such as StrongSwan).

AO3 – Evaluation

  • Encryption protocols – TLS/SSL provides confidentiality for web traffic but can be misconfigured (e.g., outdated cipher suites). IPsec secures all IP traffic but is more complex to manage.
  • MFA – dramatically reduces risk of credential theft; however, it adds cost and may cause usability issues for users unfamiliar with tokens.
  • Firewalls – essential perimeter defence, yet cannot stop insider attacks or encrypted malicious payloads.

Exam‑skill checklist

  • Define the CIA triad and give a real‑world example of a breach of each.
  • Explain the difference between symmetric and asymmetric encryption.
  • List two advantages and two disadvantages of implementing MFA in a small business.

Topic 6 – Digital Divide

Key terminology: digital divide, socioeconomic barrier, infrastructure gap, digital inclusion, e‑literacy.

AO1 – Knowledge

  • Definition: the gap between individuals, households or regions that have access to modern ICT and those that do not.
  • Causes:
    • Economic – cost of devices and connectivity.
    • Geographic – remote or rural areas lack infrastructure.
    • Educational – limited e‑literacy.
    • Cultural – language barriers, gender bias.
  • Impact on:
    • Education – reduced access to e‑learning resources.
    • Health – tele‑medicine services unavailable.
    • Employment – fewer remote‑work opportunities.
    • Civic participation – limited access to e‑government services.
  • Initiatives:
    • Government broadband schemes (e.g., UK “Superfast Broadband”).
    • Low‑cost devices (e.g., Raspberry Pi, Chromebook for Education).
    • Community ICT centres and mobile libraries.
    • Digital skills training programmes.

AO2 – Practical Activity

  • Research a local project that provides internet access to an underserved community; produce a 300‑word report covering objectives, technology used, and measurable outcomes.
  • Create a simple infographic (using a spreadsheet or free online tool) that compares internet penetration rates between urban and rural areas in your country.

AO3 – Evaluation

  • Free‑Wi‑Fi hotspots – improve short‑term access but can be unreliable, insecure and expensive to maintain.
  • Digital inclusion programmes – empower long‑term participation; however, without ongoing support they risk becoming “digital dump sites”.
  • Potential unintended consequences: increased dependence on technology, privacy risks from public networks.

Exam‑skill checklist

  • Identify two socioeconomic factors that contribute to the digital divide.
  • Explain how a community ICT centre can reduce the divide.
  • Discuss one possible negative side‑effect of providing free public Wi‑Fi.

Topic 7 – Expert Systems

Key terminology: expert system, knowledge base, inference engine, forward chaining, backward chaining, rule‑based system, knowledge acquisition.

AO1 – Knowledge

  • Components:
    • Knowledge base – collection of facts and rules (IF‑THEN statements).
    • Inference engine – applies rules to known facts to deduce new information.
    • User interface – allows users to query the system.
  • Reasoning methods:
    • Forward chaining – data‑driven; starts with known facts and applies rules to reach a conclusion.
    • Backward chaining – goal‑driven; starts with a hypothesis and works backwards to find supporting facts.
  • Advantages: speed, consistency, ability to work 24 h.
  • Limitations: knowledge acquisition is time‑consuming; lacks creativity; may become outdated if the knowledge base is not maintained.

AO2 – Practical Activity

  • Design a simple expert‑system rule set for diagnosing a common computer problem:
    IF computer won’t start AND power LED is off THEN check power cable.
    IF computer starts BUT no display AND monitor LED is on THEN check graphics driver.
    IF computer restarts randomly THEN check RAM modules.
            
  • Implement the rules in a spreadsheet using nested IF statements, or in Python using a dictionary of rules and a simple inference function.

AO3 – Evaluation

  • When to use an expert system – ideal for routine, well‑structured domains such as medical triage or equipment troubleshooting; not suitable for tasks requiring intuition or creativity.
  • Risks of over‑reliance – users may accept incorrect advice without questioning; maintenance overhead can be high if the underlying knowledge changes.

Exam‑skill checklist

  • Define forward chaining and give a brief example.
  • List two advantages and two disadvantages of expert systems.
  • Explain why knowledge acquisition is a major challenge.

Topic 8 – Spreadsheets

Key terminology: cell reference, relative reference, absolute reference, mixed reference, function, VLOOKUP, HLOOKUP, IF, COUNTIF, CONCATENATE, data validation, conditional formatting, pivot table.

AO1 – Knowledge

  • Cell references:
    • Relative – changes when copied (e.g., A1).
    • Absolute – fixed with $ (e.g., $A$1).
    • Mixed – one part fixed (e.g., $A1 or A$1).
  • Common functions:
    • Arithmetic – SUM, AVERAGE, MIN, MAX.
    • Lookup – VLOOKUP, HLOOKUP, INDEX/MATCH.
    • Logical – IF, COUNTIF, SUMIF.
    • Text – CONCATENATE, LEFT, RIGHT, LEN.
  • Data visualisation tools: charts (column, line, pie), conditional formatting, sparklines, pivot tables.
  • Data validation – restrict entries (e.g., drop‑down list, whole numbers only).

AO2 – Practical Activity

  • Create a budgeting spreadsheet:
    • Enter monthly income and expenses.
    • Use SUM to calculate totals.
    • Apply conditional formatting to highlight any category where spending exceeds 20 % of income.
    • Generate a pie chart showing expense distribution.
  • Build a simple inventory tracker using VLOOKUP to retrieve product details from a separate table.
  • Produce a pivot table that summarises sales by region and month.

AO3 – Evaluation

  • Strengths – rapid prototyping, easy to share, powerful built‑in functions, visual feedback.
  • Limitations – performance degrades with > 100 000 rows, limited multi‑user concurrency, error‑prone when formulas are copied without proper absolute references.
  • Consideration of data integrity: use data validation and protect cells to reduce accidental changes.

Exam‑skill checklist

  • Explain the difference between relative and absolute cell references with an example.
  • Write a formula that returns “Overbudget” if expenses exceed income, otherwise “OK”.
  • Identify one scenario where a spreadsheet would be unsuitable for data analysis.

Topic 9 – Modelling

Key terminology: model, physical model, mathematical model, simulation, prediction, optimisation, assumption, validation, sensitivity analysis.

AO1 – Knowledge

  • Types of models:
    • Physical – scale replicas (e.g., wind‑tunnel model).
    • Mathematical – equations representing relationships (e.g., linear regression).
    • Simulation – computer‑based imitation of a real system (e.g., traffic flow).
  • Purpose: prediction, explanation, optimisation, communication.
  • Key steps:
    1. Define the problem and scope.
    2. Collect and organise data.
    3. Choose an appropriate modelling technique.
    4. Validate the model against real‑world data.
    5. Refine – adjust assumptions, parameters, or structure.
  • Validation techniques: comparison with historical data, peer review, sensitivity analysis.

AO2 – Practical Activity

  • Use a spreadsheet to model car depreciation:
    • Straight‑line method: Depreciation = (Cost – Residual Value) / Useful Life.
    • Reducing‑balance method: Depreciation = Book Value × Rate (rate = 20 %).
    • Create a table showing year‑by‑year book value for each method and plot both series on a line chart.
  • Perform a sensitivity analysis by varying the depreciation rate (15 %, 20 %, 25 %) and observe the impact on the final book value.

AO3 – Evaluation

  • When a simple linear model is inadequate – when data shows curvature, seasonality, or interaction effects; a linear model may produce large residuals and misleading forecasts.
  • Ethical considerations – models influencing public policy must be transparent; hidden assumptions can bias outcomes (e.g., predictive policing).
  • Pros/cons of simulation: allows testing of “what‑if” scenarios without real‑world risk, but results depend heavily on the quality of input data.

Exam‑skill checklist

  • Define validation and give one method used to validate a model.
  • Explain why a reducing‑balance depreciation model might be more realistic for a high‑tech asset.
  • Identify a limitation of using a spreadsheet for complex simulations.

Topic 10 – Databases

Key terminology: relational database, table, record, field, primary key, foreign key, SQL, SELECT, INSERT, UPDATE, DELETE, JOIN, GROUP BY, HAVING, normalization, 1NF, 2NF, 3NF, denormalization, ACID, DBMS, user privileges.

AO1 – Knowledge

  • Relational model – data stored in tables (relations) with rows (records) and columns (fields).
  • Keys:
    • Primary key – uniquely identifies each record.
    • Foreign key – references a primary key in another table, establishing relationships.
  • SQL basics:
    SELECT … FROM … WHERE …;
    INSERT INTO … (col1, col2) VALUES (val1, val2);
    UPDATE … SET col = val WHERE condition;
    DELETE FROM … WHERE condition;
    JOIN types – INNER, LEFT, RIGHT, FULL.
    GROUP BY … HAVING …;
            
  • Normalization:
    • 1NF – atomic values, no repeating groups.
    • 2NF – 1NF + no partial dependency on a composite key.
    • 3NF – 2NF + no transitive dependency.
  • Denormalization – intentional redundancy to improve read performance; may increase update anomalies.
  • ACID properties – Atomicity, Consistency, Isolation, Durability.
  • Security – user privileges (SELECT, INSERT, UPDATE, DELETE), role‑based access control.

AO2 – Practical Activity

  • Create a small relational database with three tables: Students (StudentID PK, Name, DOB), Courses (CourseID PK, Title, Credits), Enrolments (EnrolID PK, StudentID FK, CourseID FK, Grade).
  • Write SQL queries:
    1. List all students enrolled in “Mathematics”.
    2. Show the total number of enrolments per course.
    3. Find the average grade for each student.
    4. Insert a new student and enrol them in two courses.
  • Demonstrate a denormalised view that includes student name, course title and grade for faster reporting.

AO3 – Evaluation

  • Normalization vs. denormalization – normalization eliminates redundancy and update anomalies, but can require many joins, slowing read queries. Denormalization speeds reads but risks inconsistency.
  • Security considerations – granting only the minimum required privileges follows the principle of least privilege; however, overly restrictive rights can hinder legitimate business processes.
  • Impact of ACID compliance on transaction throughput – essential for banking, but may be relaxed (BASE) in big‑data environments.

Exam‑skill checklist

  • Explain the purpose of a primary key.
  • Write a SELECT query that uses an INNER JOIN to combine two tables.
  • State one advantage and one disadvantage of denormalizing a table.

Topic 11 – Video/Audio Editing

Key terminology: codec, container, bitrate, resolution, frame rate, timeline, track, transition, rendering, export settings, compression.

AO1 – Knowledge

  • File formats and codecs:
    • Video containers – MP4, AVI, MOV.
    • Codecs – H.264/AVC, H.265/HEVC, VP9 (video); MP3, AAC, WAV (audio).
  • Editing concepts:
    • Timeline – chronological arrangement of clips.
    • Tracks – separate layers for video, audio, effects.
    • Transitions – cross‑fade, wipe, dissolve.
    • Effects – colour correction, green‑screen, speed change.
  • Export settings:
    • Resolution – e.g., 1920 × 1080 (Full HD).
    • Bitrate – determines quality and file size.
    • Frame rate – 24 fps (cinema), 30 fps (TV), 60 fps (gaming).

AO2 – Practical Activity

  • Import two video clips (10 s each) into a free editor (e.g., Shotcut or DaVinci Resolve).
  • Add a cross‑fade transition of 1 s, overlay background music (adjust volume), and add a title overlay.
  • Export a 30‑second promotional video in MP4 using H.264, 1080p, 5 Mbps bitrate.
  • Compare the exported file size with the original source files and note visual differences.

AO3 – Evaluation

  • <

Create an account or Login to take a Quiz

36 views
0 improvement suggestions

Log in to suggest improvements to this note.