Data security is a question of trust

The Biobank data found on an auction site didn't get there because of a technical hack. David Kernohan wants to talk about academic trust and malpractice

David Kernohan is Deputy Editor of Wonkhe

The appearance of de-identified UK Biobank data (covering health records and personal characteristics but not names, addresses, or contact details) on a Chinese auction site is a cause for concern.

The work of UK Biobank – like many research resources and collaborations – is based on the academic ideals of transparency and trust. It is funded by the Medical Research Council (though it is an independent organisation that also receives charitable funding) and contains information about around 500,000 volunteers: from basic healthcare records and questionnaire data through to imaging and biomarkers.

Provided you can prove you are associated with a bona fide research institution, and provided you share your research plan with the team, it offers the opportunity to access detailed health records on a de-identified basis with the option (after seeking further permissions) to access a group of research subjects to gather more information.

A high bar

These are high hurdles. You can’t just declare that you are a research institute of the calibre required to access Biobank. You need a track record of well-conducted research, and you need a history of collaborations or interactions with researchers at other respected institutions.

And you can’t just knock out a convincing research plan with your favourite AI chatbot. Getting this stuff right – the assumptions and hypotheses, the safeguards, the methodological approach – requires expertise and serious high-level academic training. Even then, when you sign up you need to complete further training on the use of Biobank tools and resources. All data analysis takes place on a secure online platform (UKB-RAP) provided by Biobank – with downloads of your completed work available when it is time to write up your research for publication.

There are no technical issues with the security of Biobank itself or the analysis platform. For all the language of “breaches” or “hacks” it is clear that the systems worked entirely as designed and the valuable personal data in Biobank remains secure.

Not the technology

As is familiar to anyone with an interest in cybersecurity, the issue here was with the humans involved.

What happened was that three accredited researchers at three recognised research institutions in China chose to carry out little or no analysis on the research platform, and thus the “results” they were able to download were simply the (de-identified) raw data. This data found its way (it is not clear how: an investigation is underway) onto an internet auction site owned by Chinese retailer Alibaba. The listings were immediately spotted, and Alibaba, UK Biobank, and the UK and Chinese governments worked together to remove them. It does not appear that the data was purchased or downloaded during the very short period the listings were live.

The first point it is tempting to leap to is why were researchers able to download the unanalysed dataset? UK Biobank is now adding technical safeguards to prevent this happening again, but why was it possible in the first place?

For me, it appears as if the idea that someone would download the entire dataset for sale was simply not considered. I mean – why would anyone go to the trouble of seeking Biobank user credentials, having a detailed project approved, and then downloading data which is of almost no value to anyone who does not already have the ability to access the data?

The surprisingly low price of research data

The most you pay for full access to Biobank is £9,000 for the first year and £3,000 for each subsequent year. If you work at an institute in an eligible lower income country these costs are waived. There are also programmes designed to support early-career researchers, and if you don’t need to access all the data the fees are lower. The agreements required are standard ones in terms of security, ethics, and citation.

There will be additional costs for compute capacity – again there are programmes to support access for eligible researchers. These costs are variable (based on the type of analysis you are carrying out) – and are not solely a Biobank cost. If you are carrying out any analysis of this scale you will incur compute costs, and the costs are standard.

Given the low up-front cost and support programmes, it is almost unfathomable that anyone would want to access data outside of the approved routes. The risks involved – you could certainly never publish or present findings, and it would have a terminal impact on your own research track record and that of your research institute and collaborators, and any onward user would need to have a deep understanding of Biobank data structures – mean that it just doesn’t make sense for any serious researcher to look for a cheaper route. Arguably, anyone who is capable of deriving insights from the data is already registered with Biobank, or at least would meet the criteria for registration.

Because the entry bar to Biobank is so high, and because the usefulness of the data outside of Biobank is limited, the download option (for which there is another fee – the download of the entire 30 petabytes in UK Biobank would cost about £1.5m, not including the cost of storage) was not constrained in a way that would flag or censure attempts to download the unanalysed dataset. This is now changing.
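
To put those figures side by side, here is a rough back-of-envelope sketch in Python. It uses only the numbers quoted above. The function names are made up for illustration, the per-terabyte figure is an implied average rather than a published tariff, and it assumes decimal units (1 PB = 1,000 TB).

```python
# Figures quoted in the article; the helper names are hypothetical.
FIRST_YEAR_FEE_GBP = 9_000          # maximum fee, first year of full access
LATER_YEAR_FEE_GBP = 3_000          # maximum fee, each subsequent year
FULL_DOWNLOAD_COST_GBP = 1_500_000  # "about £1.5m", excluding storage
DATASET_SIZE_PB = 30


def access_fee(years: int) -> int:
    """Maximum access fee for `years` consecutive years of full access."""
    if years < 1:
        return 0
    return FIRST_YEAR_FEE_GBP + LATER_YEAR_FEE_GBP * (years - 1)


def implied_download_rate_per_tb() -> float:
    """Average download cost per terabyte implied by the quoted total.

    Assumption: decimal units, so 1 PB = 1,000 TB.
    """
    return FULL_DOWNLOAD_COST_GBP / (DATASET_SIZE_PB * 1_000)


def years_of_access_for_download_cost() -> int:
    """Years of maximum access fees that the quoted download cost would cover."""
    remaining = FULL_DOWNLOAD_COST_GBP - FIRST_YEAR_FEE_GBP
    return 1 + remaining // LATER_YEAR_FEE_GBP
```

On these assumptions, the quoted download cost works out at roughly £50 per terabyte, and the same money would cover the maximum access fee for several centuries. This is only an order-of-magnitude comparison, but it illustrates why a legitimate researcher has little reason to take the data out.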

Responding to the challenge of trust

Thankfully, Biobank has already taken action, alongside the auction site owners Alibaba and the UK and Chinese governments. There is a full investigation underway at Biobank – access to the resource and analysis platform has been temporarily paused while this happens. There are technical fixes in development – access to the three accredited research institutes involved has also been revoked.

In the House of Commons, science minister Ian Murray told MPs that officials have been in contact since the government was made aware of the breach on Monday, and emphasised again that the data did not contain names, addresses, or contact details. There is new guidance coming from the government about the control of data from research studies. And in response to questions, he clarified that the three Chinese institutions in question had been permanently banned from UK Biobank.

Academia as a whole is a system of trust. We trust reputable research institutions worldwide (be they universities, charities, or the private sector) to employ ethical and diligent researchers, and we trust that the norms of openness and citation are universal. Academic malpractice – because this, not a “hack”, is what we are talking about here – is the negation of these ideas. Cases are few and far between because good standing is important to both status and participation in academic life – when they do happen they prompt a huge scandal and, most often, discreditation and banishment. This is what has happened to the researchers and institutions involved.

Our geopolitical climate, based as it currently is on suspicion and a resurgent nationalism, cuts across the self-policing academic system. Several speakers in the Commons debate – most notably Reform’s Richard Tice (“Will the Minister confirm that our generosity will not be abused by Chinese researchers…”) – drew on the location of the auction and the research institutes. Even though the last institution to be (temporarily) suspended from Biobank was Yale, there is a temptation in some quarters to cast issues like this as espionage.

Banning Chinese researchers – or anyone else who is eligible to apply and is not based in a nation currently facing international sanctions – from Biobank would achieve nothing. The network of collaborations and relationships that defines academia extends beyond borders of all sorts. The norms of academic publication mean that findings and methods are available worldwide – with the growth of pre-prints, often instantaneously. This is how research, of any sort, works.

4 Comments
The Data Dugong
27 days ago

Sadly this is not quite accurate. Up until this week anyone with access to the RAP has been able to download raw participant level data unchecked. Participant level data has been downloaded and made publicly available on code sharing websites hundreds of times https://biobank.rocher.lc/
UK Biobank’s response to this story was appalling https://www.ukbiobank.ac.uk/news/a-message-to-our-participants-protecting-your-personal-information/
Very glad the UK Biobank is now introducing further security for a known issue. Participants have been extremely generous donating their time and personal data to this project on the provision that UK Biobank will keep their data secure. I hope that public trust in using personal data for research can be restored.

Jim Smith
24 days ago

This is a surprising analysis that completely misses the point.

There is a globally recognised way of providing researchers with access to sensitive data: they’re called Trusted Research Environments (TREs – known as SDEs in the NHS) operating under the 5-Safes protocol.
The last of these is ‘Safe Outputs’ i.e. TREs put in place a series of checks that nothing researchers want to take away is potentially disclosive (which the raw biobank data clearly is).
TREs are a very simple idea: like the difference between a reference library (TREs) and a lending library (data download model). In a TRE researchers do their analysis inside a (virtual) secure environment with an airlock process to control what comes out. That way there is a series of checks (usually done by experts who know what to look for) that researchers acting in good faith don’t inadvertently ask for something which could breach privacy.
This approach recognises different fields of expertise: Researchers get to be experts in their field, and TRE staff apply their expertise in Statistical Disclosure Control to make the nuanced decisions needed about what is safe to release.

Biobank have so far failed to put in place airlock controls, despite multiple warnings over years…

Skeptic
24 days ago

What I find the most disappointing about this situation is that Biobank made a statement in Oct 2024 stating they no longer allowed downloads. Why did they then change their practices? In a world where data is the new gold, why is Biobank continuously allowed to fail and damage the reputation of health researchers who use secure environments and who understand statistical disclosure control? Controlled access and controlled release is not new, and with a wealth of secure infrastructures, systems and checks, why does Biobank not come up to standard? Bets on them putting online a new ‘no download’ secure platform, only to have the same situation repeat next year with no repercussions. We need to stop protecting them and hold them accountable.

felix RItchie
23 days ago

This op-ed shows no understanding at all of confidential data management, or even basic human psychology. Unfortunately, its Pollyanna-ish view of the world, and bafflement that the world doesn’t seem to work that way, reflects very much UKBioBank’s attitude to data security, which is why UKBB is in such a mess.

There is an enormous amount of literature on how to provide research access to data, and decades of experience. This includes both downloaded data and secure facilities. It is not controversial, or secret, or complicated, or rare. Most UK organisations manage it very well, and the UK generally is seen as a world leader in the governance of confidential research data.

In research data governance, trust is important but it is not blind. It needs to be earned – trustworthiness is what matters, for both service and researchers. Good research organisations understand the genuine risks – not of people being evil or malevolent but of people being people.

UKBB is an outlier, not an example.