There is a long tradition of scientists sharing data as soon as it’s published.
Back in the early days of computing, researchers would have to physically mail magnetic tapes in order to share this information, but luckily now it’s all done via digital networks – a much easier and more secure system.
Often when you think of research, you think about the experiments or observations you need to make to generate that data. However, most biological research these days is heavily computational, and it often relies on data that already exists in the public domain.
Data analysis is now central to how scientific research is done, and health research is no different. The digital networking that we rely on to store, process and share this data, such as Jisc’s Janet Network, has grown in significance and capability over the years, so that biological science now exists first on the internet. This has been particularly evident in the historically swift development of Covid-19 vaccines.
Managing big data
Data sets for research have become very large, which means that infrastructure requirements have had to change to accommodate them. And when the data in question relates to human health, even when anonymised, it must also be stored very securely. This has a far-reaching impact on any institution engaged in research, whether a lab, corporate entity or higher education institution.
Because of these technological advancements, the way that data analysis is done has changed. Now, rather than downloading data sets to their own machines, researchers can access the data they need via the cloud. Often data is stored within the same legal jurisdiction as that in which it’s being used, and in order to allow secure and easy sharing, good networking is important. One element of this is making sure that the network is sufficiently fast and responsive, so that large amounts of data can be moved around. But the network also needs to be robust and accessible enough that authorised researchers can treat the virtual cloud space as an extension of their own laboratories.
In turn, digital infrastructure is of increasing importance for senior leaders and policymakers at research-heavy institutions. The case for sharing and reusing data to improve research outcomes has been widely lauded, and improvements in infrastructure have already paid dividends. A close-to-home example is EMBL-EBI’s work with UK Biobank: thanks to the development of our Janet Network connection, we have transferred over 8 petabytes (PB) of human disease data from UK Biobank via the European Genome-phenome Archive (EGA) since the project began in 2017. One PB is equal to 3.4 years of continuous full HD video recording, so that’s a lot of data. The size of the UK Biobank data continues to grow as the project develops. Data sharing for research can also cut down on time, cost, and duplication of effort, allowing projects to develop more efficiently.
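To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It takes the 3.4-years-per-PB figure quoted above at face value, assumes a decimal petabyte (10^15 bytes), and uses a purely hypothetical sustained link speed of 10 Gbit/s. The function names are illustrative and are not part of any EMBL-EBI or Jisc tooling.

```python
# Back-of-the-envelope arithmetic for the data volumes mentioned above.
# Assumptions (not from the article): decimal petabytes, a 365.25-day year,
# and a hypothetical 10 Gbit/s link running at full, sustained utilisation.

BYTES_PER_PB = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600


def hd_video_years(petabytes, years_per_pb=3.4):
    """Years of continuous full HD video equivalent to the given volume."""
    return petabytes * years_per_pb


def implied_bitrate_mbps(years_per_pb=3.4):
    """Video bitrate (Mbit/s) implied by the years-per-PB figure."""
    bytes_per_second = BYTES_PER_PB / (years_per_pb * SECONDS_PER_YEAR)
    return bytes_per_second * 8 / 1e6


def transfer_days(petabytes, link_gbps):
    """Days needed to move the volume over a fully utilised link."""
    total_bits = petabytes * BYTES_PER_PB * 8
    seconds = total_bits / (link_gbps * 1e9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"8 PB is about {hd_video_years(8):.1f} years of full HD video")
    print(f"Implied video bitrate: about {implied_bitrate_mbps():.0f} Mbit/s")
    print(f"8 PB over 10 Gbit/s: about {transfer_days(8, 10):.0f} days")
```

Under these assumptions, 8 PB corresponds to roughly 27 years of continuous full HD video, and would take on the order of 74 days to move over a fully utilised 10 Gbit/s link. Real transfers are spread over years and share capacity with other traffic, but the sums give a sense of why network capacity matters for this kind of research.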
The UK Biobank database is hugely valuable, and a major contributor to the advancement of medicine and treatment all over the world. Scientists are highly dependent on these large-scale biological data sets to transform future healthcare research.
Enabling advanced technology
Another advancement that’s influenced infrastructure requirements is the development of biological imaging. This has improved greatly across a range of techniques, including microscopic imaging, MRI scans, X-ray and CT scans. The advancement in these technologies has given us a more three-dimensional view of life, and as such has revolutionised human disease research.
In addition to this, modern digital infrastructure allows for deep learning technology. Very often in the past, human intervention would be required to really interpret an image – now this is possible using computers. But using something like deep learning, or any kind of artificial intelligence (AI), requires processing an awful lot of data. Again, this means that the network used to move, store and analyse that data needs to be secure, robust, and agile enough to deal with increases in data flow as required.
This kind of technology has already had impressive real-world results. A student of mine, Hannah Meyer, recently resolved a 500-year-old medical question using improved human imaging and available datasets. Using resources from the UK Biobank, Hannah answered a question that Leonardo da Vinci first posed in 1516 – what purpose does the uneven surface of the inside of the human heart serve? Hannah found, by analysing thousands of MRI scans of the human heart provided by volunteers, that the uneven surface allows greater oxygen absorption from the blood. This has far-reaching implications for the study of heart disease and related conditions.
This really goes to show the advances that technology enables in biological research, and the demonstrable, real-world impact they make. Data science is not only the future of research, but the present, and we need to ensure the available infrastructure is up to the challenge.
Ewan will be giving a keynote speech at Networkshop49, which runs online from 27-29 April and is free for Jisc member organisations.
Hi Ewan, and thanks very much for this – a great illustration of the trajectory of research data management, and some of the challenges that lie ahead. Looking forwards, say 20-30 years, d’you think it will still be desirable for all institutions to be building and maintaining their own high-performance infrastructures? Or should we be looking to consolidate storage, processing, analysis and publication/sharing into a single, common infrastructure for all researchers?