Open data is about more than a licence

The release of the “Higher Education Student Statistics: UK, 2016/2017” (Statistical First Release 247) by HESA was accompanied around the sector by a series of sudden sharp intakes of breath in institutional data offices. It represents a brave and bold move into new ways of presenting and sharing data, and shows off a new format that will delight some and disappoint others. In this article I look at what has changed, and why.

The dash for designation

In applying for Designated Data Body status in England, HESA has made a move towards offering “open data”, suggesting that “From 2021 all of our publications will be available in open data format, allowing additional access to the information we enrich.” The Open Data Institute defines open data as “data that anyone can access, use or share,” which sounds like a pretty good thing. In many cases, though, open data has simply meant data that is available under an open (usually Creative Commons) licence – it is good to have legal clarity, but that is not at all the same as providing easily usable data. HESA should be lauded for making this move for SFR247, but it is only a starting point.

The Designated Data Body will have to comply with the Code of Practice for Official Statistics (there will be no need for it to be designated as a provider of official statistics, although HESA is). This concise manual sets out the principles that underpin the way government statistics need to be designed. Principle 8 requires that data bodies “ensure that official statistics are disseminated in forms that enable and encourage analysis and re-use.” Principle 4 requires that statistical releases “promote comparability within the UK […] by adopting common standards” and “where […] changes are made to methods or coverage, produce consistent historical data where possible.”

The new SFR

The changes to the Statistical First Release (SFR) format make the report easier to skim. Tableau-style interactive graphics allow the viewer to quickly answer immediate questions concerning particular groups or institutions. But seasoned HESA readers will note – with some considerable alarm – the absence of the old Excel tables.

This is a deliberate decision. As the release says, “All the data is presented in interactive tables on the HESA website and will not be published in Excel spreadsheets. Below each table you will find a ‘Get the data’ button; this button will allow you to download a *.csv file of the data.”

But this is the data as prepared to create the visualisation, so the clarity of the old sheets is lost, as is the consistency with previous presentations that allows for the kind of year-on-year comparison we enjoy here at Wonkhe. Utterly unforgivably, UK Provider Reference Numbers (UKPRNs) are absent.

If this is a foretaste of the way HESA data will be made available in the future, I’m not delighted with it.

Why have they done this?

HESA argues that these data releases are not designed for detailed statistical manipulation – after all, the numbers are rounded, and they’d really rather you used the more nuanced data on the (undeniably cool) HEIDIplus system. But for a lot of use cases these easily available spreadsheets fit well into data manipulation workflows – if, say, you were idly interested in how changes in student numbers correlated with TEF Awards, it would be this data you would reach for in the first instance. And if you are not lucky enough to be based in a university, it would be your only option.
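To give a flavour of that kind of idle question, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the provider names, student numbers and TEF awards are invented, and the function names are mine rather than anything HESA publishes. It works out the year-on-year percentage change for each provider and averages it by award.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical figures: (provider, students last year, students this year, TEF award).
# None of these values come from a HESA release.
PROVIDERS = [
    ("Provider A", 1000, 1100, "Gold"),
    ("Provider B", 2000, 1900, "Silver"),
    ("Provider C", 500, 550, "Gold"),
    ("Provider D", 800, 800, "Bronze"),
]


def percent_change(before, after):
    """Year-on-year change as a percentage of the earlier figure."""
    return (after - before) * 100 / before


def mean_change_by_award(providers):
    """Average the percentage change across the providers holding each award."""
    changes = defaultdict(list)
    for _name, before, after, award in providers:
        changes[award].append(percent_change(before, after))
    return {award: mean(values) for award, values in changes.items()}


summary = mean_change_by_award(PROVIDERS)
for award, change in summary.items():
    print(f"{award}: {change:+.1f}%")
```

Because the published figures are rounded, small changes at small providers should be read with caution. A comparison like this also depends on having consistent figures for both years, which is exactly what a change of format puts at risk.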

HESA told us: “The rationale behind the new SFR format is to provide more data and make it easier to find any particular statistic. Using filtered Google tables and charts we replace the need for numerous Excel tables based on different populations and with each attempting to cross-tabulate multiple variables.”

The new presentation may be more attractive but, as someone who has used this data over a number of years, I would argue that it is harder to use. For institutional data, the absence of UKPRNs means that these now need to be added manually – if you’ve ever done this, you’ll know what a bind it can be. The larger datasets have multiple rows for different aspects of the same core category, which is much more difficult to read and requires some serious hacking about in Excel to recreate the much clearer single-row approach.
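For anyone who would rather not do that hacking about by hand, the two chores can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a description of HESA’s files: the column names (provider, level, students), the figures and the UKPRN values are all placeholders I have made up, and a real name-to-UKPRN lookup would have to come from a source you trust.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format extract: one row per provider per level of study.
# Column names and figures are invented for illustration only.
LONG_CSV = """provider,level,students
Provider A,Undergraduate,1000
Provider A,Postgraduate,250
Provider B,Undergraduate,400
Provider B,Postgraduate,90
"""

# Hypothetical lookup from provider name to UKPRN (placeholder values).
UKPRN_LOOKUP = {"Provider A": "10000001", "Provider B": "10000002"}


def long_to_wide(rows, key, column, value):
    """Collapse several rows per key into one dict of values per key."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[key]][row[column]] = int(row[value])
    return dict(wide)


def add_ukprn(wide, lookup):
    """Build one record per provider; ukprn is None if the name is not in the lookup."""
    records = []
    for provider, values in wide.items():
        record = {"ukprn": lookup.get(provider), "provider": provider}
        record.update(values)
        records.append(record)
    return records


rows = list(csv.DictReader(io.StringIO(LONG_CSV)))
table = add_ukprn(long_to_wide(rows, "provider", "level", "students"), UKPRN_LOOKUP)
for record in table:
    print(record)
```

Matching on provider names is the fragile part: a renamed or merged institution silently comes back with no UKPRN, which is why having the identifier in the published file matters so much.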

A HESA spokesperson argued that “the background data tables generally contain more data than the equivalent Excel tables from previous years.” If they do, I’ve yet to find it – and even if there is further data, the fact that it is no longer directly comparable with what came previously limits its value.

Oddly, the spokesperson claimed that “They are however much more suitable for use in simple tools like pivot tables where the old tables would have needed to be stripped of their design features and cross-tabulation artefacts.” The whole point of pivot tables is that they can handle any data you throw at them, and if you need to specifically format data to put into a pivot table you are doing pivot tables wrong. Certainly, my colleague Catherine is a pivot table fan and has had no issues with using the technique with previous releases (I’m more of an old-school formulas-and-filters person, as it happens).

HESA are actively looking for your feedback on the format of this new release. If you’ve been affected by any of the issues raised here, or have other opinions about the way SFRs will be presented in the future, please do let them know.

Note: In response to this article, HESA have now added UKPRNs into some of the downloadable tables.