While the pandemic accelerated access to technology and data within the NHS, other previously ground-breaking data sets were left languishing on older technology that was hard to maintain, slow to produce data, and hard to innovate against.
It’s great to be able to tell the story of how NHS Digital partnered with Hippo to modernise a successful, long-running service and improve the availability of one of its key data assets. That asset has many uses within the NHS, such as planning and public health, and beyond the NHS, principally in medical research such as clinical trials.
NHS Digital is the information technology partner for the NHS. It runs national services such as the Spine and Centralised Authentication and Authorisation, manages data collections from Primary Care, Secondary Care and Social Care, publishes standards, and runs a Cyber Security Centre for the NHS.
NHS Digital collects Secondary Care data from hospitals in the form of Episodes and Spells of care. These are collected using the Commissioning Data Set (CDS) standard and cover Accident and Emergency, Outpatient, and In-patient care in England. The data is processed into monthly releases in the form of Hospital Episode Statistics (HES). It contains around 800 derived or supplied fields covering hospital activity across the following key areas:
- Accident and Emergency
- Admitted Patient Care
- Adult Critical Care
- ONS matched mortality data
NHS Digital needed to move HES processing away from a legacy Oracle-based system that was hard to support, expensive to run, slow to process each month of data (up to 17 days), and hosted on hardware in a data centre earmarked for closure.
NHS Digital sought out Hippo, a key source of data engineering experience in Leeds, as a partner to migrate and modernise the processing of HES data by moving to AWS Cloud and Databricks, and away from on-premises hardware and legacy software.
A small team worked on the core functionality of HES (hundreds of derivations), the automation of HES publications, and distribution out to a number of “SQL Clones”, which support the existing infrastructure for onward processing and distribution of data sets through the Data Access Request Service. The areas worked on are described below:
Cloud migration: AWS and Databricks were used to move away from legacy hardware and software. Since processing is focussed on monthly data releases, ephemeral clusters managed by Databricks can scale up to run jobs and then be destroyed, reducing the total operating costs for HES processing.
SQL optimisation: existing queries were tuned both to run on Databricks and to join billions of records efficiently.
Don’t repeat yourself: a huge reduction in the size of the legacy code base through Jinja templating. This reduces the surface area for potential bugs and the effort required for longer-term operational maintenance.
Automation: investing in continuous delivery for the service and driving out existing manual activities to ensure that the HES monthly processing became a much less labour-intensive activity for NHS Digital.
Resilient processes: to ensure that common failure conditions could be recovered from, Hippo wrote a patch mechanism that allows simple recovery of partially processed files (previously a common and expensive failure).
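To illustrate the templating idea from the “Don’t repeat yourself” item above, here is a minimal sketch of how one Jinja template can stand in for several near-identical SQL statements. It assumes the `jinja2` Python package is installed. The table names, column names, and the `render_derivation_sql` helper are hypothetical and are not taken from the actual HES code base.

```python
from jinja2 import Environment, StrictUndefined

# Hypothetical template: one parameterised statement replaces several
# near-identical hand-written queries. All names here are invented.
DERIVATION_TEMPLATE = """\
SELECT
    record_id,
    {%- for field in derived_fields %}
    {{ field.expression }} AS {{ field.name }}{{ ',' if not loop.last else '' }}
    {%- endfor %}
FROM {{ source_table }}
WHERE activity_month = '{{ activity_month }}'"""


def render_derivation_sql(source_table, activity_month, derived_fields):
    """Render one SQL statement for a data set from the shared template."""
    # StrictUndefined makes a missing template variable raise an error
    # instead of silently rendering as an empty string.
    env = Environment(undefined=StrictUndefined)
    return env.from_string(DERIVATION_TEMPLATE).render(
        source_table=source_table,
        activity_month=activity_month,
        derived_fields=derived_fields,
    )


# Hypothetical per-data-set configuration: only the parts that differ
# between data sets live here; the query shape lives in the template.
DATASETS = {
    "apc_episodes": [
        {"name": "length_of_stay_days",
         "expression": "DATEDIFF(discharge_date, admission_date)"},
        {"name": "is_emergency",
         "expression": "admission_method LIKE '2%'"},
    ],
    "ae_attendances": [
        {"name": "arrival_hour", "expression": "HOUR(arrival_time)"},
    ],
}

rendered = {
    table: render_derivation_sql(table, "2021-04", fields)
    for table, fields in DATASETS.items()
}

for table, sql in rendered.items():
    print(f"-- {table}\n{sql}\n")
```

With this shape, adding a derivation means adding one entry to the configuration rather than copying a query, which is where the reduction in code size comes from. The template values here are assumed to come from trusted configuration, since text substituted into SQL in this way is not escaped.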
Hippo utilised Databricks and AWS Cloud as the key technologies for building the next-gen HES processing.
Value added for NHS Digital
Immediate value added
Prior to the HES work, monthly HES datasets took around 17 days to be published. Now data is available in 2-10 days. But it’s not just about speed; the work brought other significant benefits:
Removal of legacy features: previously, HES used a custom Patient Index built in-house by NHS Digital as part of HES, which used limited data to generate the index. The HES work allowed this to be moved to the NHS Master Patient Index, which uses Primary and Secondary Care sources and contains a much more complete index of patients, increasing the accuracy of the HES data.
Reduced data latency: HES feeds a huge amount of health research in the UK and is used in other services such as “NHS DigiTrials”, which helps provide data to assist in clinical trials. The reduction in processing time means that data is available sooner.
Reduced operational overheads: the HES work has reduced the “Business as Usual” manual effort for processing by creating an automated job scheduler that manages monthly workloads with minimal human intervention. Previously, workloads were predominantly triggered manually, which required human effort to run and orchestrate.
Future value added
Because of this work, in the future NHS Digital can benefit from:
Scale and cost optimisation: moving to “Evergreen” platforms in the form of AWS and Databricks means better long-term support for NHS Digital and a reduced risk of large-scale, complex migrations in the future.
Security: HES allows for a reduction in data leaving NHS Digital. Data provided by HES is now available via Databricks and services such as the Data Access Environment and Trusted Research Environments, so organisations with approved access can use the data in situ, without relying on extracts being sent and processed externally (which also creates additional security and governance overheads).
Features: moving to a modern Lakehouse-capable PaaS like Databricks means that data can be used in ways not previously possible.