It’s National Student Survey season, which probably means a fun new round of polarised arguments about its merits, but the key problem is surprisingly straightforward – and hiding in plain sight.

The NSS is just not that good at providing meaningful institutional comparisons at subject level. We shouldn’t be too surprised. It is hard for student surveys to provide robust data at subject level, particularly at a level of subject categorisation that is sufficiently fine-grained to broadly match institutional units (i.e. departments) and the interests of prospective students. It was acknowledged as ambitious when the NSS was launched, and there have been official concerns ever since.

## Presentation is everything

The 2004 consultation on the introduction of the NSS indicated that 95% confidence intervals (“to show the statistical reliability of the scores”) would be included on the public presentation of the data (then the Teaching Quality Information website, which turned into Unistats). Confidence intervals are a kind of margin of error around the actual score, representing the range in which we can be 95% confident that the hypothetical ‘true’ value falls. This looks like it was prompted in part by the researchers hired to develop the pilot: “There is a danger with information of this nature that when making comparisons between institutions users may attach significance to apparent differences that are not statistically significant. We are concerned that the exercise must produce results that are resistant to such misinterpretation.” They found, when using 18 relatively coarse-grained subject groups, that a fair number of comparisons between institutions were not statistically significant: 68% for business studies and 45% for medicine. The consultation on the revisions to the NSS in 2015 again highlighted the need for better explanation of the effect of sample sizes on comparisons of the data. Researchers have made the same point.

In 2012 HEFCE moved in the right direction by starting to flag whether institutions were statistically significantly better or worse than their benchmark, the score they ‘should’ have got given factors such as their mix of students. A statistically significant difference is one that we can be 95% confident represents a genuine difference in students’ perceptions, rather than just random variation. The same approach was used for the institution-level Teaching Excellence Framework, and for the subject-level TEF pilots – though using a more lenient level of confidence (90% rather than 95%). But away from the benchmarked institution-level scores provided by the OfS and the institution- and subject-level TEF, out in the wilds of league tables, Unistats and institutions’ own internal use, raw NSS scores are normally used without much acknowledgement of their statistical limitations.

## In confidence

The confidence intervals for NSS scores are easy to get hold of: the Office for Students (and HEFCE before them) provides them with the publicly available NSS data. Crucially, confidence intervals allow us to work out whether a difference between two NSS scores is statistically significant. To horribly oversimplify, if the confidence intervals for two scores *don’t* overlap, the difference is statistically significant. If they *do* overlap, it isn’t. The confidence intervals are provided precisely so that anyone sufficiently interested can assess the statistical meaningfulness of differences in NSS scores. So what happens if we use them in that way?
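As a rough illustration of the overlap rule described above, here is a minimal Python sketch. The function names and the example intervals are invented for illustration; they are not taken from the published NSS files.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) confidence intervals overlap."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a


def differs_significantly(ci_a, ci_b):
    """Simplified rule: non-overlapping intervals count as a significant difference."""
    return not intervals_overlap(ci_a, ci_b)


# Hypothetical 95% intervals (% agree) for one question in one subject.
dept_x = (60.0, 91.0)
dept_y = (70.0, 95.0)
dept_z = (93.0, 99.0)

print(differs_significantly(dept_x, dept_y))  # these intervals overlap
print(differs_significantly(dept_x, dept_z))  # these intervals do not overlap
```

The rule is deliberately conservative: two scores whose intervals overlap slightly can still differ significantly under a proper test, so this check will tend to under-count real differences rather than over-count them.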

I’ve looked at three kinds of comparisons someone might want to make using institutions’ subject-level NSS scores: comparison with the previous year, comparison with the sector average (for that subject) and comparisons between institutions. I’ve limited the analysis to 147 of the larger institutions included in the NSS data – the universities, and some small and specialist institutions such as conservatoires – and used the finest-grain JACS subject categorisation, with 108 subjects.

The results differ quite a lot by question and subject, but overall the picture is not pretty. For 2014-2015 and 2015-2016, 2% of the year-to-year comparisons were statistically significant. With 27 questions, that means a university department comparing their NSS results to the previous year should expect about half a question (2% of 27) to be meaningfully different, so the chance of the score for one question being meaningfully different is roughly 50%. On average, then, every other year the score for one of their questions will have meaningfully changed. Comparing institutions’ subject-level scores with the (subject-level) sector average, 24% of the comparisons are statistically significant. This means that, on average, a department will find that their scores for just six of the 27 NSS questions will differ meaningfully from the sector average (and 21 won’t).
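The step from a percentage of comparisons to a number of questions is simple multiplication by the 27 questions in the survey. A small Python sketch of that arithmetic, using only the figures quoted above (the function name is just for illustration):

```python
QUESTIONS = 27  # number of NSS questions referred to in the article


def expected_significant_questions(share_significant, questions=QUESTIONS):
    """Expected number of questions with a meaningful difference for one department."""
    return share_significant * questions


year_on_year = expected_significant_questions(0.02)  # 2% of comparisons
vs_sector = expected_significant_questions(0.24)     # 24% of comparisons

print(round(year_on_year, 2))  # about half a question per year
print(round(vs_sector, 2))     # about six questions
```

These are averages across the sector, so an individual department could see more or fewer meaningful differences in any given year.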

For the comparisons between institutions, to limit the processing power required I had to randomly select 50 of the 147 institutions, which still provided around two million comparisons. 6% of those comparisons were statistically significant, meaning that when a department compares their NSS score for a particular question with those of other institutions, on average 6% will be meaningful; i.e. if there are 50 other institutions to compare against, their score will differ meaningfully from three of them (on average). On average, for 10 of the 27 questions, there will be *no* other institution whose score differs meaningfully. And for 12% of institution-subject combinations the situation is particularly bad – they don’t differ significantly from any other institution, for any question. Of course, we can improve those numbers by using a less fine-grained subject categorisation (increasing the numbers of responses and so reducing the size of the confidence intervals), but then we start lumping together subjects that are quite different from the perspective of prospective students and departmental divisions.
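To give a feel for the pairwise exercise, here is a minimal Python sketch that applies the overlap rule to every pair of institutions for a single subject and question. The institution labels and intervals are made up, and the function is an illustration rather than the code used for the analysis. The last lines work out a ceiling on the number of comparisons if all 50 institutions had results for all 108 subjects and 27 questions; the real count is presumably lower because not every institution has publishable results in every subject.

```python
from itertools import combinations
from math import comb


def proportion_significant(intervals):
    """Share of institution pairs whose (lower, upper) intervals do not overlap."""
    pairs = list(combinations(intervals.values(), 2))
    significant = sum(
        1
        for (low_a, high_a), (low_b, high_b) in pairs
        if high_a < low_b or high_b < low_a
    )
    return significant / len(pairs)


# Hypothetical 95% intervals (% agree) for one subject and one question.
toy_intervals = {
    "Institution A": (60.0, 91.0),
    "Institution B": (55.0, 85.0),
    "Institution C": (92.0, 99.0),
    "Institution D": (70.0, 95.0),
}

share = proportion_significant(toy_intervals)
print(f"{share:.0%} of pairs differ significantly")

# Ceiling on comparisons: pairs of 50 institutions x 108 subjects x 27 questions.
max_comparisons = comb(50, 2) * 108 * 27
print(max_comparisons)
```

Because the number of pairs grows with the square of the number of institutions, running this for all 147 institutions would involve roughly nine times as many pairs as the 50-institution sample.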

This is only one way of looking at the data, and looking at overlap of confidence intervals isn’t the ideal way of assessing statistical significance. On the other hand, all I’ve done is use the confidence intervals the way that a department looking at their NSS scores is supposed to – I’ve just looked at the whole university sector. It also isn’t a secret. When HEFCE (successfully) proposed in 2014 to reduce the publication threshold from 23 results to 10, they provided analysis indicating that the proportion of overlap of confidence intervals between institutions (using 21 subjects) ranged between 72% (medicine) and 98% (history and philosophy).

## Appalling, but good enough?

In one sense, these results suggest that the ability of the NSS to provide meaningful comparisons at subject level is appalling, but only because there’s an expectation that it could tell us more. With institutional responses averaging 58 students at the finest-grained subject level, the confidence intervals are inevitably going to be large. They are 31 percentage points, on average, and that magnitude of confidence interval means that for an actual score of 75%, the ‘true’ value could fall anywhere between 60% and 91%. But that doesn’t mean that the NSS is useless. It’s an incredibly useful tool for suggesting where there might be areas of particularly good and not-so-good practice. It doesn’t have much to say about everything in between – most of the sector – but the real problem is the assumption that it ever could.
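To see why small response counts produce wide intervals, here is a rough Python sketch using the Wilson score interval for a proportion. This is an assumption for illustration only: it is not necessarily the method used to produce the published NSS intervals, so it will not reproduce them exactly. Note also that the interval at the average number of responses is narrower than the average interval, since the many cells with far fewer than 58 responses pull the average width up.

```python
from math import sqrt


def wilson_interval(p, n, z=1.96):
    """Wilson score interval for a proportion p observed over n responses."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Width of the interval around a 75% score for different response counts.
for n in (10, 58, 500):
    low, high = wilson_interval(0.75, n)
    print(f"n={n}: {100 * low:.0f}% to {100 * high:.0f}%")
```

The interval narrows only with the square root of the number of responses, so even a department with several hundred respondents still carries a margin of a few percentage points either side.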

To put the same point a different way, the statistical limitations of the NSS data only seem like a serious flaw because of the way the results are used. When they are used in a way that suggests that any difference in the raw numbers is statistically meaningful then the limitations are a problem. When the scores are used in a way that takes into account the confidence intervals – publicly available on the OfS website – then the limitations are just a neutral fact of life. As per usual with the NSS, its strength or weakness almost entirely depends on how we choose to deploy the results.

“For the comparisons between institutions, to limit the processing power required I had to randomly select 50 of the 147 institutions, which still provided around two million comparisons. 6% of those comparisons were statistically significant, meaning that when a department compares their NSS score for a particular question with those of other institutions, on average 6% will be meaningful”

Since the confidence interval is 95% (i.e. a 5% chance of results so different occurring by chance), doesn’t this mean – again, to oversimplify the effects on confidence intervals of multiple comparisons – that 5% of that 6% can also be explained as random chance – and therefore approximately none of the inter-institution comparisons are actually significant?

“2% of the year-to-year comparisons were statistically significant”

And for the same reason, probably none of these actually are either?

Thanks for the question, this gets a bit technical. In saying that e.g. 6% of the c.2 million comparisons between institutions (for a particular subject and question) are statistically significant, I mean that for 6% of the comparisons, the 95% confidence intervals do not overlap. If 95% confidence intervals do not overlap, we can be 99% confident that the difference is not due to chance (the reason for that is pretty technical, see the link in the piece, https://www.apastyle.org/manual/related/cumming-and-finch.pdf). So for that 6%, roughly speaking we will get a false positive 1 time in 100.
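The "99% confident" figure for non-overlapping intervals can be checked with a back-of-the-envelope calculation. The sketch below is a simplification that assumes two independent scores with equal standard errors and a normal approximation; the function name is made up for illustration. It asks what the two-sided p-value is when two 95% intervals just touch.

```python
from math import erfc, sqrt

Z_95 = 1.96  # normal critical value for a 95% confidence interval


def p_value_when_intervals_touch(z=Z_95):
    """Two-sided p-value for the difference when two intervals just touch.

    Assumes independent estimates with equal standard errors (SE).
    """
    # When the intervals just touch, the estimates are 2 * z * SE apart,
    # while the standard error of their difference is sqrt(2) * SE.
    z_diff = 2 * z / sqrt(2)
    # Two-sided tail probability of a standard normal beyond z_diff.
    return erfc(z_diff / sqrt(2))


p = p_value_when_intervals_touch()
print(f"p = {p:.4f}")
```

Under these assumptions the p-value comes out below 0.01, so any pair of intervals with a gap between them clears the 99% level with room to spare.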

Worth saying that there are better ways than this of working out the proportion of statistically significant differences (this is quite a clumsy way of doing it) but they rely on having the full individual-level dataset, which isn’t made publicly available. What I’ve tried to do is look at what institutions & departments should expect to find if they do what OfS expects them to do with the (publicly available) confidence interval data, i.e. none of this is a secret.

Great article Alex – I remember seeing you present some of this at Edinburgh and people being quite shocked! We should all think carefully about how we present these visualisations, I’m definitely guilty of leaving the CIs off visualisations of NSS data and I’m not actually sure why I do this. Maybe I can’t be bothered with the conversation? (In which case – shame on me!)

As a bit of a shameless plug, last year I was playing about with providing NSS data in tidy format, although I’m struggling to incorporate older data given the changes to the survey. The package is hosted on github if anyone wants: https://github.com/jillymackay/nss

Back in my healthcare days, we had a quite simple metric for institutions on patient dose data … Don’t be in the worst 25%

If you just simplify the use of the scores to this level they may be of some use.

But when I was working with a potential graduate recruiter, who was bombarded by lots of headline NSS scores from ‘the sellers’, they noted how much closer our scores were to some lesser recognised HEI than to Oxbridge say. They felt this said more about the stats than the institution.