Do university rankings measure anything at all?

There are many complaints that we might make about university rankings.

Mostly these fall into two categories. The first is that they simply reinforce existing power structures, privileging Western, English-speaking universities and maintaining an exclusive, self-perpetuating club that is impossible to break into. The second is that the data and the processes used to analyse them are incomplete and biased towards specific universities, and that the methodology is flawed for the uses to which the rankings are being put.

The response is that rankings at least provide some objective view of university performance, that some data is required to inform strategic decision making, or perhaps that without this outside view we will be susceptible to the continued operation of old boys’ clubs, of “we know good work when we see it”.

Are rankings meaningful?

Rankings have meaning for senior university leadership because they are not infrequently their explicit and personal key performance indicators. They are meaningful for researchers because we like to be associated with other high performing researchers. They hold meaning for students because they are used to guide choices about which degrees from which university will provide the career boost that they are looking for.

But this is just to say that they are used. Are they actually useful? Does a rise from 231 to 224 (or a fall to 245) provide information that a university can usefully apply to improving itself? Is a ranking reliable enough, stable enough, and unbiased enough to help universities understand where they can improve? In particular, are rankings useful for the universities outside the top reaches, those that are actively seeking to improve their standing on a global stage?

In two pieces of work recently posted as preprints we examined two questions to address this. The first was how existing tools for identifying the outputs of a university operate as instruments. Do they provide a consistent overview of what a university produces? The second piece looked at how that data, along with other inputs, was being used.

Same information, different data

In the first piece of work we built two “toy” rankings, one based on citations and one on open access performance, and asked how they change if you use data from the three major sources of such information: Scopus, Web of Science, and Microsoft Academic. We deliberately took a naive approach, using a simple search based on the internal identifier for each university.

What we found, for a sample of 155 universities, was that the apparent rankings could change radically depending on the data source. The universities that appear at the top of rankings tended to see less of a shift, and less prestigious universities were more likely to see large ones. Worse than this, while the average change was zero, it seemed that universities from non-English speaking countries were much more likely to see large shifts in their rankings. We conclude that any ranking based on a single source of output data is less reliable for the very universities that most rely on it: the middle and lower ranking universities outside the traditional English language centres of power.
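The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used in the preprint: the university names and citation counts are invented, and `rank` and `max_rank_shift` are hypothetical helper names.

```python
# Hypothetical citation counts per university from three sources.
# All names and numbers are made up for illustration.
counts = {
    "Scopus": {"Uni A": 9000, "Uni B": 4200, "Uni C": 4000, "Uni D": 1500},
    "Web of Science": {"Uni A": 8800, "Uni B": 3900, "Uni C": 4100, "Uni D": 1450},
    "Microsoft Academic": {"Uni A": 9500, "Uni B": 3000, "Uni C": 4300, "Uni D": 3200},
}


def rank(scores):
    """Return {university: rank}, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {uni: position for position, uni in enumerate(ordered, start=1)}


def max_rank_shift(counts_by_source):
    """For each university, the gap between its best and worst rank across sources."""
    ranks = {source: rank(scores) for source, scores in counts_by_source.items()}
    universities = next(iter(counts_by_source.values()))
    return {
        uni: max(r[uni] for r in ranks.values()) - min(r[uni] for r in ranks.values())
        for uni in universities
    }


shifts = max_rank_shift(counts)
for uni, shift in shifts.items():
    print(f"{uni}: rank differs by up to {shift} place(s) depending on the source")
```

In this made-up example the top university keeps its place whichever source is used, while those below it swap positions, which is the pattern the paragraph above describes.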

This problem of reliability is reinforced when we look at the results from the second piece of work. Here we looked at real rankings, the THES, ARWU and QS, and all of their input variables from 2012-2018. In common with previous work we saw that the top 50 changed very little. While the rich may not be able to get richer (you can’t get any higher than first place), they certainly stay wealthy.

Publications and prizes

To try to understand the intent behind the rankings, we looked at how the input data related to each other. Using two different techniques, principal component analysis and factor analysis, we sought to understand whether the input data sources were actually independent of each other, whether they could be grouped into categories, and how they were related.
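A much simpler check than the principal component analysis and factor analysis mentioned above is to look at the pairwise Pearson correlations between indicators, since groups of indicators that move together are what those techniques pick up. The sketch below does only that, on invented data. The indicator names and scores are hypothetical and do not come from any real ranking.

```python
from math import sqrt

# Hypothetical indicator scores for six universities (made-up numbers).
indicators = {
    "reputation_survey": [90, 80, 30, 20, 25, 15],
    "prizes": [9, 8, 3, 2, 2, 1],
    "citations": [50, 20, 60, 25, 55, 15],
    "publications": [52, 22, 58, 27, 53, 18],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


names = list(indicators)
# Full correlation matrix as a nested dict: corr[a][b].
corr = {a: {b: pearson(indicators[a], indicators[b]) for b in names} for a in names}

for a in names:
    print(f"{a:>18}", " ".join(f"{corr[a][b]:6.2f}" for b in names))
```

With this toy data the survey and prize indicators are strongly correlated with each other, as are citations and publications, while the correlation between the two groups is weak. A real analysis would need the full PCA or factor analysis to test whether such groupings hold up.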

In all three rankings we saw a segmentation of the input data into one group that was focussed on measures of prestige and reputation (survey results, prizes) and another that focussed on measures of outputs (citation measures and various kinds of counts of publications). For each ranking, these clusters were consistent across the two methods we applied and quite stable over the several years of data collected. It is well known that universities in English-speaking countries and in North America and Northwestern Europe tend to do better on rankings. But what we also observed was an apparent difference, based on language and region, in how output measures and reputation measures related to each other. For many universities, many regions and most non-English languages, a substantial increase in the output measure might have no apparent effect on the reputation measures.

For the THES and QS rankings these reputation measures are strongly based on surveys, which have been criticised by many groups previously. Our results strengthen the idea that these surveys aren’t measuring things that are driven by the quality of research outputs. Indeed the extraordinarily high correlation between teaching and research reputation within the THES input data suggests that it is measuring neither of these things but some form of brand awareness, a measure that less privileged institutions will necessarily struggle with. There is also a massive difference between the rankings for Chinese universities: they perform very badly on the ARWU reputation component but increasingly well on the THES and QS survey measures.

Are rankings stable outside the top 100?

Taken together our results suggest that the usefulness of rankings is limited for universities outside Northwestern Europe and North America, or outside the top 100 places. For top-ranked universities the data is reliable and stable, if not necessarily strongly connected to actual performance. But as you move down the lists the rankings become both less stable and less meaningful. Outputs seem disconnected from reputation and, given they are drawn from single sources, the output measures themselves should probably not be relied on.
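One simple way to put a number on this kind of instability is the spread of a university's rank over time. The sketch below uses the standard deviation of yearly ranks. This is an assumption made for illustration rather than the measure used in the preprints, and the universities and ranks are invented.

```python
from statistics import pstdev

# Hypothetical yearly ranks over seven years (made-up numbers).
yearly_ranks = {
    "Top Uni": [3, 3, 4, 3, 2, 3, 3],
    "Upper Uni": [48, 50, 47, 49, 51, 48, 50],
    "Mid Uni": [231, 224, 245, 238, 210, 252, 229],
    "Lower Uni": [410, 455, 380, 430, 470, 395, 440],
}


def volatility(ranks):
    """Population standard deviation of a university's rank over the years."""
    return pstdev(ranks)


# Universities ordered from most stable to least stable.
by_volatility = sorted(yearly_ranks, key=lambda uni: volatility(yearly_ranks[uni]))

for uni in by_volatility:
    print(f"{uni}: rank volatility {volatility(yearly_ranks[uni]):.1f}")
```

In this invented data the volatility grows as you move down the list, so a change of a few places means far less for a university ranked in the 200s than for one near the top.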

But what of their meaning? Perhaps the most disturbing part of our results is that, while each ranking can be segmented into two components, outputs and reputation, these are not consistent across the rankings. In fact the output component of the ARWU is more correlated with the reputation component of the THES than with the output measure that you’d expect to be most similar. That is perhaps not surprising given that the ARWU output component is largely based on papers published in Nature and Science.

Taken together our two pieces of work suggest that, if you’re a university that sits outside the top echelons, if you’re not based in Europe or North America, and if you are based in a country that doesn’t have English as its first language, then your position in these rankings can be highly volatile. It is dependent on choices of input data, of survey distribution, and of weightings that are entirely outside your control.

Overall our results show that while the underlying intent of the rankings is to measure similar things, they do not, and what they do measure is very unclear. If they measure anything at all, it appears to be visibility and prestige, something that feeds on itself and would be predicted to lead to fixation at the top of the rankings. In fact it is worse than that. By giving these rankings importance and meaning, we concentrate our attention merely on doing well at them. The statistical analysis suggests that they are biased, unstable and unreliable precisely for those institutions that most rely on them to provide an “objective” view of their performance.

Co-authored by Friso Selten and Paul Groth of the University of Amsterdam.