From Two Million to Ten Million Records a Day: Inside the UK's Largest Health Data Platform

How we built and scaled the UK's largest health data lake, processing millions of records in near real-time while keeping sensitive data secure and governed.

Building and Running a National Data Processing Platform

The challenge

A public sector organisation running the largest health data lake in the UK needed an engineering team to build, run, and maintain the data processing platform at the core of its national data strategy. The platform ingested data from dozens of external organisations and internal systems, processed it through validation, transformation, pseudonymisation, and linkage pipelines, and distributed outputs to downstream consumers including government departments and research bodies.

What we did

Aire Logic worked as the prime engineering contractor on the platform from 2018. The technology stack was built on AWS (S3, DynamoDB, Lambda, SQS, EC2, Kinesis Firehose), Databricks, Apache Spark, Python, Terraform, and Privitar for pseudonymisation and tokenisation. Jenkins managed over 120 interconnected CI/CD pipelines. Splunk and Prometheus provided monitoring, alerting, and audit logging.

Scale under pressure

The platform’s capacity was put to the test during the pandemic, when it had to process test results in near real-time. The team scaled ingest from two million records per day to ten million within weeks, using serverless AWS infrastructure and optimised Databricks cluster configurations. An internally built load injector let the team performance test at 24 million records per day, 2.4 times the eventual peak, before that volume arrived in production.
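To put those daily volumes in per-second terms, the short Python sketch below converts each figure to an average rate and works out how often a load injector would need to send a batch to sustain a target volume. It is illustrative only: the function names and the batch size of 500 are assumptions, not details of the injector the team built, and real traffic is bursty rather than evenly spread across the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(records_per_day: int) -> float:
    """Average records per second if the daily volume arrived evenly."""
    return records_per_day / SECONDS_PER_DAY


def batch_interval_seconds(records_per_day: int, batch_size: int) -> float:
    """Seconds between batches needed to sustain the daily volume.

    batch_size is a hypothetical tuning parameter for this sketch.
    """
    return batch_size * SECONDS_PER_DAY / records_per_day


if __name__ == "__main__":
    volumes = [
        ("baseline", 2_000_000),
        ("peak", 10_000_000),
        ("load test", 24_000_000),
    ]
    for label, volume in volumes:
        rate = average_rate_per_second(volume)
        print(f"{label}: {volume:,} records/day is about {rate:.2f} records/s")

    interval = batch_interval_seconds(24_000_000, batch_size=500)
    print(f"500-record batches at load-test volume: one every {interval:.1f} s")
```

Averaged over a day, the load-test target works out at roughly 278 records per second against roughly 116 at peak, which is the 2.4x headroom described above.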

Data governance

Data governance was embedded throughout. Privitar tokenisation allowed records to be linked across datasets without exposing identifiable information. Data versioning captured received, updated, and logically deleted states on every record, enabling point-in-time extracts for reporting and audit. Access to production data required staff to hold security clearance. The team also built and ran a Trusted Research Environment that gave authorised external analysts controlled access to the data, along with the governance processes that determined who could access which data and on what legal basis.
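The idea behind versioned records and point-in-time extracts can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the platform's implementation: the RecordVersion type, its field names, and the extract_as_of function are all hypothetical, and the real system ran on Databricks and Spark rather than plain Python objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RecordVersion:
    """One immutable version of a record (hypothetical schema)."""
    record_id: str
    state: str                 # "received", "updated", or "deleted"
    recorded_at: datetime
    payload: Optional[dict]    # None for a logical delete


def extract_as_of(versions: list, as_of: datetime) -> dict:
    """Return the payload of each record as it stood at `as_of`.

    Versions are never overwritten, so the extract replays history in
    time order and keeps the latest version at or before the cut-off.
    Records whose latest version is a logical delete are left out.
    """
    latest = {}
    for version in sorted(versions, key=lambda v: v.recorded_at):
        if version.recorded_at <= as_of:
            latest[version.record_id] = version
    return {
        record_id: version.payload
        for record_id, version in latest.items()
        if version.state != "deleted"
    }


history = [
    RecordVersion("r1", "received", datetime(2020, 4, 1), {"result": "pending"}),
    RecordVersion("r1", "updated", datetime(2020, 4, 3), {"result": "negative"}),
    RecordVersion("r2", "received", datetime(2020, 4, 2), {"result": "positive"}),
    RecordVersion("r2", "deleted", datetime(2020, 4, 5), None),
]

# State as of 4 April: r1 shows its update, r2 is still present.
snapshot_early = extract_as_of(history, datetime(2020, 4, 4))

# State as of 6 April: r2 has been logically deleted and drops out.
snapshot_late = extract_as_of(history, datetime(2020, 4, 6))
```

Because a logical delete is stored as just another version, an earlier extract can still be reproduced exactly for audit after the record has been deleted.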

10 million records processed per day at peak

120+ interconnected CI/CD pipelines

24 million records per day load tested before peak volume hit