What can the public sector learn from CERN?

Discover how the European Organization for Nuclear Research processes big data using the following principles: data provenance (or data lineage), open data, and user agency.
March 23, 2022

By Dr Kevin Maguire PhD, on behalf of Aire Logic Ltd.

The European Organization for Nuclear Research (CERN) is where I started my career in data science. I currently work as a consultant software engineer and data scientist for a UK public sector organisation (PSO) on behalf of the health and care technology firm, Aire Logic.

The mission of CERN is to provide particle accelerators and other experimental apparatus that facilitate cutting-edge particle and nuclear physics. Based in Geneva, it was founded in 1954 and operates the Large Hadron Collider (LHC), the world’s largest particle collider. My work at CERN as a doctoral student was on an experiment called LHCb, which involves about 1,400 engineers, software developers, analysts and researchers.

CERN and LHCb process huge amounts of data and produce world-leading research by relying on three principles that public sector organisations can benefit from: data provenance, openness, and user agency.

Note: the points below are from my experience working on this experiment, and won’t always apply to the whole organisation. Similarly, my public sector experience mostly concerns a single organisation.

Data Provenance

When a data record has provenance, it can be traced from where it is now to where it originally came from. Each system the record passed through, every filter applied to it, and every change must be known exactly. Provenance is essential to producing robust analysis of data. 

There are two complementary methods for implementing data provenance: versioning and refreshing. For example, consider a set of SNOMED codes¹ representing heart disease that are used to select patients from a database. If we discover that we’re missing a relevant SNOMED code that represents some other form of heart disease, it must be added to the filter in the codebase.

If the codebase version number is not stored with the data, then an analyst has no way of knowing there has been a change. The analyst will see new records with the new SNOMED code, but will not know when that change occurred. Did it occur today, and new records were found today? Did it occur last week, but no records with the new code were found in that week?

When the analyst goes to publish a result from the data, like the percentage of the population with heart disease, the answer will be different on either side of the change. It will be a higher percentage after the change. If the analyst combines the results before and after the change, they are comparing apples to oranges and the result won’t make sense. 

However, if the data has provenance, meaning each record in the data contains the codebase version number, the analyst can initially split the data by version and compare across subsamples. If a deviation is found, the analyst can refer back to the codebase, or a changelog, and see that the new SNOMED code has been added. Now that the analyst knows when the change occurred they can separate the data before and after that change and produce two results.

The example raises another issue: what about the old data that has already been received? We should be ‘refreshing’ the data by running the new codebase on the old data, as we know the old data is wrong or incomplete. When a change occurs to the codebase, we can either do a ‘delta’, where we only process new data with the new codebase, or we can do a full refresh by reprocessing everything. This highlights why it is essential to keep all raw data, or as much as is practically possible.
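The versioning and refresh pattern described above can be sketched in a few lines of Python. The codes, version numbers and record layout here are illustrative stand-ins, not a real clinical code list:

```python
# Hypothetical sketch: versioned SNOMED filtering with provenance stamping.
# Codes and versions are illustrative, not a clinically validated list.

CODEBASE_VERSION = "v2"

# v1 was missing one heart-disease code; v2 adds it.
HEART_DISEASE_CODES = {
    "v1": {53741008, 414545008},
    "v2": {53741008, 414545008, 84114007},
}

def select_heart_disease(records, version=CODEBASE_VERSION):
    """Filter records and stamp each one with the codebase version used."""
    codes = HEART_DISEASE_CODES[version]
    return [
        {**r, "codebase_version": version}
        for r in records
        if r["snomed_code"] in codes
    ]

raw = [  # the 'data lake': raw records kept indefinitely
    {"patient_id": 1, "snomed_code": 53741008},
    {"patient_id": 2, "snomed_code": 84114007},
    {"patient_id": 3, "snomed_code": 271737000},  # not heart disease
]

# Full refresh: rerun the new codebase over everything in the lake.
# A 'delta' would instead apply v2 only to records received from now on.
refreshed = select_heart_disease(raw, version="v2")
```

Because every output record carries `codebase_version`, an analyst can split results by version before combining them, avoiding the apples-to-oranges comparison described earlier.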

Such an architecture that keeps all the raw data is called a ‘data lake’. It is a ‘pool’ of messy data in different formats and from different sources. Processing pipelines can then read from the lake and produce clean final assets. Having a data lake allows for data refresh, but equally, if you have data provenance you know where the data came from and can reacquire it. 

For example, if the data comes from a database owned by a GP practice, or from a web API, then its raw form can be obtained. Reacquiring the data depends on trusting that the provider also has data provenance, so in most situations where this is not the case, a data lake would be preferable.

The LHCb experiment facilitates data provenance through a series of processing stages that are run periodically as refreshes on all the data. The version number of each stage becomes a new field in the data. 

When talking about data volumes at CERN we typically use events per second (hertz) rather than size on disk (bytes), so numbers describing the size of the data are hard to come by. The typical collision event is 60 kB according to ‘LHCb Computing Resource Usage in 2018’. The rate of collisions in the detector was 20 MHz, or one collision every 50 nanoseconds, and increased to 40 MHz and 25 nanoseconds from 2015 (Maguire, 2018, p47). Thus the data size per second is roughly 1.2 to 2.4 TB. It isn’t practical to store all of this raw data to disk, so the first three stages of processing don’t store the raw data and instead add codebase version numbers as new fields to the data.
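The back-of-the-envelope arithmetic behind those figures is simply event size multiplied by event rate:

```python
# Illustrative arithmetic only: reproducing the quoted data rates.
event_size_bytes = 60e3   # ~60 kB per collision event
rate_run1_hz = 20e6       # 20 MHz (one collision every 50 ns)
rate_run2_hz = 40e6       # 40 MHz from 2015 (every 25 ns)

tb_per_s_run1 = event_size_bytes * rate_run1_hz / 1e12  # terabytes per second
tb_per_s_run2 = event_size_bytes * rate_run2_hz / 1e12
```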

The first processing stage is actually done in hardware and runs at the same time as data is being collected by the detector. The purpose of this stage is to decide whether or not to keep an event using simple filters to find interesting particles. The hardware filters don’t change often, so instead of querying a codebase, the analysts have versioned documentation. The second and third processing stages, known as the ‘triggers’, are implemented in software. The triggers reduce the data rate to 1 MHz or about 60 GB per second. There are about 500 trigger pipelines (Design and Performance of the LHCb Trigger). To avoid duplicating data, each event is stored as a single record and the yes/no decision to keep the event for each of the many triggers is stored as a new field in the data. The output from the trigger is kept in long term storage for later use. Again, these two processing stages do not preserve data provenance as some events are not stored at all. However, the trigger version numbers are stored, and the code is available for reference.
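The storage pattern described above can be pictured as one record per event, with each trigger pipeline's yes/no decision and the trigger version stored as extra fields rather than duplicating the event for each pipeline. The field and trigger names below are invented for illustration:

```python
# Hypothetical sketch of the single-record-per-event pattern.
# Trigger names, versions and field names are invented, not LHCb's real ones.

event = {
    "event_id": 421337,
    "raw_payload": b"...",           # ~60 kB of detector data in reality
    "trigger_version": "v14r2",      # lets analysts look up the exact code
    "trig_two_muon": True,           # decision of one trigger pipeline
    "trig_high_pt_electron": False,  # decision of another
}

def selected_by(event, trigger_name):
    """Check whether a given trigger pipeline kept this event."""
    return event.get(f"trig_{trigger_name}", False)
```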

The final processing stage, called the ‘stripping’, is where data provenance becomes possible given the now-reduced data size. There are at least 500 pipelines at this stage. As before, storing the version number of the stripping enables the code to be referenced to see exactly how the data has changed. Unlike previous stages, the input data to the stripping is stored indefinitely, thus the stripping can be rerun on the ‘raw’ data. Analysts are free to make changes to the stripping pipelines as they see fit, and a few times a year the stripping is refreshed, creating a new version of the output.

¹SNOMED is a clinical coding system where integer codes represent all kinds of medical procedures, conditions, events, diagnoses, medicines, etc. (SNOMED CT, 2021)

So what can public sector organisations learn from LHCb about data provenance?

  1. Store your raw data and make it available.
    I know from my experience at LHCb how challenging it is to not have access to the actual raw data (because of triggers), but conversely how useful it is to have access to some raw-like data (before the stripping). My analysis had a filter that was absolutely necessary to reduce the data volume, but that also biased the value we wanted to measure in the final data (Maguire, 2018, p82).
  2. Store a lookup for the code version with each data record, so you can look back at the code for reference.
  3. Perform data refreshes after each change to the code (if possible).
    If refreshing is not practical due to the size of the data, then small code changes can survive on deltas, with refreshes reserved for big changes. The downside is that the complexity of analysing the data increases somewhat.
  4. Ensure there is effective communication and coordination between stakeholders.
    In the ‘upstream’ direction, expect providers of data to add data provenance information or supply truly raw data. In the ‘downstream’ direction, users need to be informed of all changes to a dataset, at all points in the chain, and in a very public way. A concise, dataset-specific changelog should be provided for each code release. This should be the same for all systems that touch the data at any part of its journey. At the least, system suppliers should publish a static document for each dataset after big changes, stating what filters and processing were applied to the data. I favour dataset-specific working groups and old-fashioned mailing lists for communication and coordination.

Openness

Openness means sharing all materials and knowledge for the benefit of the wider public, without damaging the privacy of individuals. Such materials must be easily accessible and reusable where applicable.

At LHCb and CERN, the only measure of progress is publicly available knowledge. The entire purpose of the experiment is to produce data, analyse it correctly, then make the results public. The results can then be scrutinised, reproduced at other experiments, and used to drive theoretical understanding. The data are made public about 10 years after being collected. This gives LHCb scientists, who helped produce the data, time to analyse it first, and it reduces the risk of misinterpretation by external people, who lack first-hand knowledge of the detector.

Political churn means that PSOs do not always have such a singular vision. This is evidenced by the recent merger of NHS Digital and NHSX into NHS England and NHS Improvement. However, this merger appears to be positive as it aims to continue pushing digital first and to centralise disparate systems and functions (Wade-Gery, 2021).

I would like to see a future where producing knowledge in the public interest is one of the central goals of a PSO. This can be achieved through close collaborations with domain-specific experts, like researchers or charities, and a commitment to making results public by default. The Office for National Statistics (ONS) is a great example of a PSO that produces effective and valuable results in the interest of public knowledge.

The main aspect of openness is the sharing of materials like code, packages, methods and documentation. LHCb has not always been great at sharing tech or code. Many of the large CERN and LHCb software packages are open source but have limited use outside physics. 

One such package is called ROOT (ROOT: Analyzing Petabytes of Data, Scientifically), which is like combining Parquet, Python’s matplotlib, and R into a single framework. In research and academia there is a tendency to reinvent the wheel. In some cases it’s fair to say that CERN invented the wheel, and for LHCb being on the cutting edge requires writing software from scratch, but it’s not always necessary.

One of the main issues for LHCb openness is the sharing of code written by analysts for their specific piece of work. It is now part of the internal review process of analyses to make all code and materials internally available, but this is a relatively new policy. Analysis code was not made available internally while I was at LHCb, so wasn’t open to scrutiny through reuse or even review.

Within the LHCb organisation, however, there is a lot of code reuse. For example, the package used to create curving particle tracks from a line of hits in the detector sensors is the same code used in the trigger and the stripping, and it’s also available to analysts. The packages that run the stripping and the triggers are also internally available, so anyone can run them on their data whenever and wherever they like. This is facilitated by the wide use of ROOT for almost anything, and easy access to tooling as discussed in the next section (User agency). 

Particularly for the stripping, there are reusable functions that add fields to the data records. One function takes the most basic particle definition and adds its mass and momentum as new columns to the data. Another, given two particles, returns their closest distance of approach, or the angle between them if they share a point of origin, and so on. Each field that is added has its own unique name that is defined centrally.

Thus by reading the field names in a completely new dataset, you can understand what they mean and how they are calculated. Sharing code and reusable methods creates a uniformity and standardisation to the resulting data. Being able to reuse code and having standardised data assets saves time for developers and data analysts.
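A minimal sketch of this pattern, with centrally defined field names and reusable field-adding functions, might look as follows. The names and formulas are simplified stand-ins, not LHCb's real code:

```python
# Illustrative sketch of centrally named, reusable field-adding functions.
# Field names and physics are deliberately simplified.
import math

# Central registry: one canonical name per derived field.
FIELDS = {
    "momentum": "P",          # magnitude of the momentum vector
    "opening_angle": "ANGLE", # angle between two particles' momenta
}

def add_momentum(particle):
    """Add |p| to a particle record under its canonical field name."""
    px, py, pz = particle["px"], particle["py"], particle["pz"]
    particle[FIELDS["momentum"]] = math.sqrt(px**2 + py**2 + pz**2)
    return particle

def add_opening_angle(pair, p1, p2):
    """Add the angle between two particles that share a point of origin."""
    dot = p1["px"]*p2["px"] + p1["py"]*p2["py"] + p1["pz"]*p2["pz"]
    pair[FIELDS["opening_angle"]] = math.acos(dot / (p1["P"] * p2["P"]))
    return pair
```

Because the name "P" is defined once, centrally, any dataset containing that column was produced by the same function, so its meaning and calculation are unambiguous.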

The ‘problem’ of privacy

For PSOs, privacy is a significant challenge to sharing information, one which LHCb does not share. As it happens, particles do not have an address, but if they did, they probably wouldn't be home. If you are familiar with the concept of ‘differential privacy’ you will know that it’s impossible to release any piece of information, aggregated or otherwise, without reducing the privacy of the individuals (Dwork, 2014). A malicious person, with enough seemingly innocent statistical results, could in theory reconstruct information about a small group of individuals.
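The reconstruction risk described above can be shown with the simplest possible ‘differencing attack’: two innocent-looking aggregate counts that differ by one person reveal that person's condition exactly. The data and names here are invented:

```python
# Toy differencing attack: aggregates alone can leak individual data.
# All names and values are invented for illustration.

patients = {
    "alice": True,   # has the condition
    "bob": False,
    "carol": True,
}

def count_with_condition(names):
    """An 'aggregated, anonymous' statistic released publicly."""
    return sum(patients[n] for n in names)

# Two published statistics: everyone, and everyone except Alice
# (e.g. patients registered before/after a given date).
all_count = count_with_condition(["alice", "bob", "carol"])
without_alice = count_with_condition(["bob", "carol"])

# The difference exposes Alice's status, despite both counts being aggregates.
alice_has_condition = (all_count - without_alice) == 1
```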

There is a misunderstanding within PSOs of how code relates to privacy, creating an impenetrable barrier to code sharing. Information governance is not an excuse to keep code repositories locked down where only the authors can see them. At LHCb, every repository is internally available to everyone. 

A code repository at a PSO contains no information that can be used to reduce an individual’s privacy. Code is not subject to differential privacy constraints, as it reveals nothing about the actual data values, only their structure. It is possible for someone to accidentally commit data to a code repository, but that is not the repo’s intended purpose; if data is being added, the repo is not being used correctly.

As an analogy, people fall down stairs all the time, but we still use them. With the proper precautions like code reviews, training and handrails, the issue can be mitigated. The price for mistakes can be very high, but it’s worth it. As a great example, the Ministry of Justice (MOJ) has made a person linkage algorithm publicly available, which has sparked much conversation in other PSOs (Splink, 2022). 

The same argument applies to documentation, where at CERN everything is open. At PSOs, individual teams often don’t even know what exists in the organisation, as documentation is only accessible to the authors and their direct team by default. I recall writing new features for users, and then discovering to my shock months later that they didn’t have access to the user guide on the developers’ Confluence area.

The reluctance to share code even internally in PSOs leads to everyone reinventing the wheel in a slightly different way. Similarly, the lack of publicly available code means that each PSO reinvents their own wheels. 

When comparing typical PSO data, notably person-level data, to LHCb data, there are two contrasts. First, PSO data comes from a wide variety of sources while LHCb data comes from a single source: the experiment itself. Second, LHCb uses ROOT for almost everything so packages integrate easily, while PSOs use many different frameworks.

An ideal future would be one where a single method for person linkage built on Apache Spark (PySpark) is made public and maintained by MOJ; then, when the need arises, NHS England publishes a native Python port of the method, the Department for Work and Pensions (DWP) publishes a C# implementation, and so on. This would enable other PSOs to choose from various ports that match their own systems.

An additional way to facilitate sharing is to publish reference information. For example, lists of SNOMED codes and prescribed medicine codes used to find patients with certain comorbidities could be made public. Some such reference data already exist, for example census data, IMD lookups (English Indices of Deprivation 2019, 2019) or the very sensitive PDS dataset (The Personal Demographics Service, 2020).

So what can public sector organisations learn from LHCb about openness?

In summary, openness facilitates reuse and collaboration, and will reduce effort and decrease costs across the public sector. To facilitate openness, PSOs should:

  1. Commit to making knowledge sharing in the public interest a central goal of the organisation.
    Commit to creating and sharing information by default, in the form of knowledge, methods, documentation and non-sensitive data.
  2. Make code open by default, within the organisation and publicly.
    Privacy and information governance is not an excuse to keep code repositories locked down.

User agency

Of all the points in this article, this most affects me in my daily activities. I’d summarise it as: your staff are capable domain experts, give them the tools, environment and resources they need, and let them get on with it. To have agency is to be in control of what you do and how you do it.

At LHCb, users have access to a number of systems and resources. The first is a local computer cluster at CERN, which is just like the Linux clusters that most universities have. Each user can access the cluster from the command line, and has a home area where they can store their files and work on their code. New and old versions of all the typical software packages (ROOT, event reconstruction, stripping, etc.) are available and users are free to install any third party packages and code in their local area. There’s also a batch job system where longer running processes can be run. 

Next, there are many data storage options such as a few terabytes of cloud-like storage per user. There’s even a long-term data storage system that instructs a robot train to get a tape and put it into a reader. 

Finally, users have a number of tools for accessing CERN’s worldwide grid computing system (Worldwide LHC Computing Grid). Crucially, and in stark contrast to PSOs, the grid allows users to run anything. Users could run cryptocurrency miners if they desired, although the operators are wise to this. The grid provides the main data storage capacity. When running over the raw data (stripping outputs, etc.), the data files are partitioned, duplicated and distributed around the world to the various data centres. The ‘file path’ that a user has is actually a URL, which the grid uses to locate the chunks of data. The code is then sent to those locations to be closer to the data and reduce transfer times. The users are relatively tech-savvy and comfortable with the environment, or strongly expected to become so.

The environment and resources available to LHCb users give them the freedom to make their own decisions about what tools they use, how they use them and where they run them. Discovery activities like trying a new machine learning algorithm or using a different statistical method package are fast and easy and don’t require approval. Similarly, it’s easy to develop and run code locally, then deploy it to the vast computing resources.

In the public sector, the situation is very different. Users typically have a personal laptop that is absolutely locked down and many organisations have virtual desktop infrastructures (VDIs) that are similarly locked down. Users require approval from multiple persons to install programs like Python and Docker.

For computing resources, PSOs have traditionally relied on Microsoft SQL Server running on dedicated hardware. The disadvantage of these servers is the difficulty of scaling when demand grows, as the resources are fixed. They also inherently lead to siloed data, where many servers each hold slightly different copies of the same datasets, which creates data provenance issues. With the government’s ‘cloud first’ policy (Government Cloud First Policy, 2017), siloed servers are being phased out in favour of cloud-based enterprise-scale platforms. These have the advantage of being easily scaled to meet demand, and they offer a wide range of services (storage, compute, serverless, networking, machine learning, etc.) that can be easily connected to existing infrastructure.

PSOs rely on enterprise scale platforms of varying types and quality. Tools like Databricks, Apache Spark and AWS services like RDS are becoming common. These platforms invariably include access barriers between users and the data to maintain privacy. They require teams of software and DevOps engineers to build, maintain and run, which is at odds with LHCb where users help to run platforms and services.

There are subtle disadvantages to enterprise-scale platforms. In the SQL Server days, each analyst was the sovereign of their own empire and could control their environment and the tools they used by themselves or in small homogeneous working groups. Enterprise-scale platforms take away the basic agency of these analysts, who no longer understand the system they use and now have to negotiate with ‘ludicrous technology dweebs’ for each software change². In my own experience, the freedom provided by LHCb empowered and enabled me to do my best work. In contrast, the disempowerment at PSOs is frustrating and the unnecessary barriers reduce my value.

²This is an almost direct quote from an article about software used by banks: https://calpaterson.com/bank-python.html

Fundamental issues that take away the agency of users

There are two fundamental issues that take away the agency of users. The first is the necessity for privacy and who holds the responsibility for maintaining it. Enterprise scale platforms and business decisions give the responsibility of maintaining privacy to the software engineers, which makes sense, but it often goes too far. Just because an analyst is not responsible for maintaining privacy, does not mean they can’t have a permissive development environment. If there is no real data, there is obviously no danger to privacy. 

By providing locked-down environments, every little task becomes more complicated. Do you want to store your code in Git? Not possible, because the VPC is not accessible or the GUI doesn’t have Git integration. Do you want to automatically produce aggregated results and download them? Not possible, because data can’t be removed or the GUI only allows creating one plot at a time. And as discussed in the previous section (Openness), hiding Git repos and documentation from users in the name of privacy goes too far.

The second issue is the varying skill levels of users. Some users are comfortable working like developers, with code in development environments, and some are not. Fundamentally, no platform, off the shelf or otherwise, meets the needs of all users. The developers must provide each and every feature on the platform manually, often through a user interface, to the despair of tech-savvy users. Any new feature will only meet the needs of a narrow range of user skill levels.

So what can public sector organisations learn from LHCb about user agency?

  1. Give users the tools and environment they need to get on with it, without having to ask anyone for permission.
    We should treat tech-savvy analysts like developers. If developers can run a dev version of the platform on their own machine and build on it, then so can analysts. Privacy is not an excuse when there is no real data in the development environment. For example, Spark clusters can be run on a local machine, so there’s no reason why a user can’t write a pipeline or a library locally, with their own choice of tools, and deploy it to a live environment. This is one of the agile principles after all, which many PSOs follow (Principles Behind the Agile Manifesto, principle 5).

    Non tech-savvy users can learn on the job like any junior developer must, or developers can focus on building tools that are appropriate to their skill range. By providing a dev environment, both the upper and lower ranges of skill sets can be accommodated separately.
  2. Let users be engineers and work with engineers.
    If users have developer tools, they can work together with engineers to build secure and functional pipelines and analyses. This also facilitates understanding of requirements, so that services and features available on the platform are better suited to user needs. Again, maintaining a close working relationship between stakeholders is one of the agile principles (Principles Behind the Agile Manifesto, principle 4).
  3. Let users experiment with their own services and infrastructure.
    The point of cloud platforms is to have lots of different services in one place and to scale to match demand. There are lots of tools, particularly around machine learning, that analysts would like to use. They typically have to ask for permission to use these services, and an engineer will have to do the bulk of the work for them.

    Cloud providers have built-in role-based access controls that can be leveraged to give analysts as much freedom as each PSO is comfortable with. I can envisage scenarios where a user wants to build an API to make their machine learning model available or where an analyst wants to use the AWS Comprehend NLP service to process their data. Such scenarios should be possible without permission and should be encouraged.

Final word

One point that didn’t make the list but is worth mentioning is governance. LHCb is a democracy where the members vote for one of their own to lead. PSOs are led by external managers who are not always familiar with the challenges of the average user. Operational changes and churn made by senior management often serve to confuse the people on the ground without making any difference to their daily work. 

There are plenty of things that LHCb can learn from the public sector. PSOs employ full-time developers who write great code and impress on the rest of the organisation, including analysts, the importance of best practices like code reviews and unit tests. At LHCb there are dedicated developers who work on some things, but there’s no top down expectation for analysts to follow best practices when they write code. 

At LHCb, most users are nomads whose position in the experiment is temporary. A large portion of analysts are students from universities around the world. As a career option it is not desirable because of its lack of stability, and that same instability creates pressure and competition. In contrast, PSOs offer stable careers and a sustainable working environment.

To sum up, LHCb is a cutting edge organisation dedicated to increasing human knowledge. With that aim, and in acknowledgement of the domain experts who work there, users have access to an open and permissive environment of tools and resources. 

PSOs have equally important and often more urgent goals: to administer the country and facilitate health and well-being. By learning from LHCb’s data provenance, openness and user agency, PSOs can collaborate more effectively, internally and externally, to deliver on these goals.

If you found this article interesting or want to find out more, visit our Aire Analytics team page.

References

Note: all references viewed in January 2022

LHCb ‘Design and Performance of the LHCb Trigger and Full Real-Time Reconstruction in Run 2 of the LHC’ 2019

Dwork C. ‘The Algorithmic Foundations of Differential Privacy: Foundations and Trends in Theoretical Computer Science’ Volume 9, No. 3-4, pages 211–407, 2014

Ministry of Housing, Communities and Local Government ‘English indices of deprivation’ 2019

Central Data and Digital Office ‘Government Cloud First policy’ 2017

LHCb ‘LHCb Computing Resource Usage in 2018’ 2019

Maguire, K. ‘Precision measurements of indirect CP violation in the charm sector with LHCb’ 2018

NHS Digital ‘The Personal Demographics Service’ 2020

Agile Manifesto Principles

ROOT: analyzing petabytes of data, scientifically

NHS Digital ‘SNOMED CT’ 2021

Ministry of Justice ‘Splink’ (implements Fellegi-Sunter’s canonical model of record linkage in Apache Spark, including the EM algorithm to estimate model parameters) 2022

Department for Health and Social Care, Wade-Gery L. ‘Putting data, digital and tech at the heart of transforming the NHS’ 2021

Worldwide LHC Computing Grid
