In the humanities, data is not just numbers

A painting, a poem, laws and politics are all studied under the humanities umbrella – the academic discipline that covers aspects of human society and culture.

“Most humanists probably wouldn’t describe their research as data-driven,” concedes Mark Taylor-Batty, deputy head of the school of English at the University of Leeds. “But when we read a novel and form a view on the narrative of that story, we’re interpreting data.

“At Leeds, we never really thought of such materials as data, let alone captured them as such, until recently, when we started building bodies of data to use more widely.”

Taylor-Batty has been pioneering the creation of a new data set around the works of the playwright Harold Pinter, and is now investigating whether that set can be adapted for wider use.

A complex discipline

The creation of digital datasets within humanities is on the rise, but faces stubborn barriers, continues Taylor-Batty. “Digital humanities is a relatively new discipline and there is no such thing as a unified methodology. We don’t fully know who else is working on this and there aren’t systems in place at a university level more broadly.

“This leads to the question of how we put procedures in place so that these wide-ranging materials can be disseminated across the sector and beyond.”

Nick Sheppard, open research advisor at Leeds, also flags the lack of coordination and connectedness: “Our problem is that we don’t know what data out there is associated with the University of Leeds, other than the long tail of data in our local repository.

“For instance, we know the Natural Environment Research Council (NERC) puts data in its data centres rather than in the local repository. If we could find a way to share or link that data to our repository, we could widen our field of research. This is just one example, but we know that there are at least 2,000 data sets that could be affiliated with the University of Leeds. It’s incredibly challenging to link these data sets from a technical as well as a cultural point of view. Technically, because we don’t really have an easy way to curate them; culturally, because they rely on authors consistently using identifiers such as ORCID.”
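To make Sheppard's point about identifiers concrete, here is a minimal Python sketch of how records held in different places might be linked once authors supply an ORCID iD consistently. The record fields (`title`, `source`, `orcid`) and the function names are assumptions made for illustration and do not describe any system at Leeds or NERC. The checksum follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its iDs, and `0000-0002-1825-0097` is the sample iD used in ORCID's own documentation.

```python
from collections import defaultdict
from typing import Optional


def orcid_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for an ORCID iD."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalise_orcid(raw: str) -> Optional[str]:
    """Return the iD as 0000-0000-0000-0000, or None if it is malformed."""
    # Accept both the bare iD and the full https://orcid.org/... form.
    compact = raw.strip().rsplit("/", 1)[-1].replace("-", "").replace(" ", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return None
    if orcid_check_character(compact[:15]) != compact[15]:
        return None
    return "-".join(compact[i:i + 4] for i in range(0, 16, 4))


def link_by_author(records):
    """Group (source, title) pairs by valid ORCID iD; collect the rest separately."""
    linked = defaultdict(list)
    unlinked = []
    for record in records:
        orcid = normalise_orcid(record.get("orcid") or "")
        if orcid is None:
            # No usable identifier: this record cannot be linked automatically.
            unlinked.append(record["title"])
        else:
            linked[orcid].append((record["source"], record["title"]))
    return dict(linked), unlinked


# Hypothetical records, invented for this example.
records = [
    {"title": "Dataset A", "source": "local repository", "orcid": "0000-0002-1825-0097"},
    {"title": "Dataset B", "source": "external data centre",
     "orcid": "https://orcid.org/0000-0002-1825-0097"},
    {"title": "Dataset C", "source": "external data centre", "orcid": None},
]
linked, unlinked = link_by_author(records)
```

In this sketch, Dataset A and Dataset B end up under the same author even though they sit in different places, while Dataset C, deposited without an identifier, is left unlinked. That is the cultural dependency Sheppard describes.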

About infrastructure

To make research data more coherent, the education and technology not-for-profit Jisc has committed to creating a national infrastructure for research. Building on the UK's Janet Network, which provides superfast connectivity for research and education, and on other existing infrastructure such as built-in cyber security and cloud services, Jisc is looking to develop a set of flexible solutions for institutions and research collaborations.

Rachel Proudfoot, research data advisor at the University of Leeds, supports the idea:

“We really need to consider the architecture of how we bring data together. If we are looking at distributed institutional repositories, we will need agreement around metadata and the use of identifiers for various elements. In this way, researchers will be able to bring together disparate types of data, including various iterations of that data.”
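As a rough illustration of what agreement around metadata could mean in practice, the Python sketch below maps records from two repositories onto one shared set of field names and then groups the iterations of each data set under a common identifier. The repository names, field names and identifier values are all hypothetical and are not drawn from any Jisc or Leeds specification.

```python
from collections import defaultdict

# Hypothetical crosswalks: each repository labels the same things differently.
CROSSWALKS = {
    "repo_a": {"title": "title", "dataset_id": "identifier", "version": "version"},
    "repo_b": {"name": "title", "pid": "identifier", "release": "version"},
}


def to_shared_schema(record: dict, source: str) -> dict:
    """Rename a repository's local fields to the agreed shared field names."""
    mapping = CROSSWALKS[source]
    shared = {target: record[field] for field, target in mapping.items() if field in record}
    shared["source"] = source
    return shared


def group_versions(records):
    """Group shared-schema records by identifier, ordered by version number."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["identifier"]].append(record)
    for versions in grouped.values():
        versions.sort(key=lambda r: r["version"])
    return dict(grouped)


# Invented example records: two iterations of one data set, held in different places.
harmonised = [
    to_shared_schema({"title": "Example data set", "dataset_id": "example-001", "version": 2}, "repo_a"),
    to_shared_schema({"name": "Example data set", "pid": "example-001", "release": 1}, "repo_b"),
]
by_identifier = group_versions(harmonised)
```

Once both repositories agree on the shared names, the two iterations can be listed side by side in version order, whichever repository holds them. Without that agreement, each new repository would need its own bespoke mapping.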

But Taylor-Batty, Sheppard and Proudfoot all signal major hurdles to overcome before a national infrastructure for humanities data can be considered.

“We’re starting from a very mixed, broken and deconstructed field”, says Taylor-Batty. “There are certain standards out there, but it would be good to have a fixed way of approaching things”.

“What I’ve experienced at Leeds is that, as researchers, we are encouraged to build a database from the bottom up, but there are no sector-wide standards. Therefore, databases are not compatible and can’t easily ‘talk’ to each other. I’ve built an application programming interface (API) into our dataset to make it easier for other computers and programs to connect to it. But it would be so much better if we could be confident of finding the majority of what we’re looking for in a specific field, or a specific area of research, in one place, rather than having to rely on a disparate local repository network.”

A culture of deposit

“There are also cultural points to consider, such as required skills and sector commitment, before we even get to the infrastructure,” adds Sheppard.

“Indeed, we need to change the culture of deposit,” agrees Proudfoot. “The motivation behind where and why researchers deposit their work is very varied. Some people feel it’s the right thing to do; they want to be transparent about their practice and value the data they generate. But a whole swathe of researchers are mainly doing it because it is a funding requirement.”

Speaking from experience, Taylor-Batty adds:

“In digital humanities, we’re dealing with a bunch of people whose expertise is not computational. My relationship with coding is much like my relationship with German; I can tell where the nouns, the verbs and the adjectives are but I can’t read the sentence. It’s about recognising what data is, the value of being able to manipulate it and what machine learning we can apply to study it.”