It’s National Student Survey season, which probably means a fun new round of polarised arguments about its merits. But the key problem is surprisingly straightforward – and hiding in plain sight.
The NSS is just not that good at providing meaningful institutional comparisons at subject level. We shouldn’t be too surprised. It is hard for student surveys to provide robust data at subject level, particularly at a level of subject categorisation that is sufficiently fine-grained to broadly match institutional units (i.e. departments) and the interests of prospective students. It was acknowledged as ambitious when the NSS was launched, and there have been official concerns ever since.
Presentation is everything
The 2004 consultation on the introduction of the NSS indicated that 95% confidence intervals (“to show the statistical reliability of the scores”) would be included on the public presentation of the data (then the Teaching Quality Information website, which turned into Unistats). Confidence intervals are a kind of margin of error around the actual score, representing the range in which we can be 95% confident that the hypothetical ‘true’ value falls. This looks like it was prompted in part by the researchers hired to develop the pilot: “There is a danger with information of this nature that when making comparisons between institutions users may attach significance to apparent differences that are not statistically significant. We are concerned that the exercise must produce results that are resistant to such misinterpretation.” They found, when using 18 relatively coarse-grained subject groups, that a fair number of comparisons between institutions were not statistically significant: 68% for business studies and 45% for medicine. The consultation on the revisions to the NSS in 2015 again highlighted the need for better explanation of the effect of sample sizes on comparisons of the data. Researchers have made the same point.
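To make that concrete: a confidence interval for a percentage-agree score can be approximated from the score and the number of respondents. The sketch below uses the standard normal approximation for a proportion – the OfS calculates its published intervals with its own methodology, so treat these numbers as illustrative only – but it shows how quickly the intervals widen as response numbers fall.

```python
import math

def approx_ci(pct_agree, n_respondents, z=1.96):
    """Rough 95% confidence interval for a percentage-agree score.

    Uses the simple normal approximation to a proportion. The intervals
    the OfS publishes alongside the NSS data come from its own
    methodology, so this is only meant to illustrate how the interval
    widens as the number of respondents falls.
    """
    p = pct_agree / 100.0
    half_width = z * math.sqrt(p * (1 - p) / n_respondents) * 100
    return max(0.0, pct_agree - half_width), min(100.0, pct_agree + half_width)

# A department scoring 75% with 58 respondents (roughly the average at the
# finest-grained subject level) gets an interval about 22 points wide ...
print(approx_ci(75, 58))    # ~ (63.9, 86.1)
# ... while 20 respondents pushes it to nearly 38 points wide.
print(approx_ci(75, 20))    # ~ (56.0, 94.0)
```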
In 2012 HEFCE moved in the right direction by starting to flag whether institutions were statistically significantly better or worse than their benchmark, the score they ‘should’ have got given factors such as their mix of students. A statistically significant difference is one that we can be 95% confident represents a genuine difference in students’ perceptions, rather than just random variation. The same approach was used for the institution-level Teaching Excellence Framework, and for the subject-level TEF pilots – though using a more lenient level of confidence (90% rather than 95%). But away from the benchmarked institution-level scores provided by the OfS and the institution- and subject-level TEF, out in the wilds of league tables, Unistats and institutions’ own internal use, raw NSS scores are normally used without much acknowledgement of their statistical limitations.
In confidence
The confidence intervals for NSS scores are easy to get hold of: the Office for Students (and HEFCE before them) provides them with the publicly available NSS data. Crucially, confidence intervals allow us to work out whether a difference between two NSS scores is statistically significant. To horribly oversimplify, if the confidence intervals for two scores don’t overlap, the difference is statistically significant. If they do overlap, it isn’t. The confidence intervals are provided precisely so that anyone sufficiently interested can assess the statistical meaningfulness of differences in NSS scores. So what happens if we use them in that way?
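In code terms, the rule of thumb is nothing more elaborate than this (a minimal sketch – the intervals are the (lower, upper) pairs published with the data, and the non-overlap rule is the same oversimplification described above):

```python
def significantly_different(ci_a, ci_b):
    """Rule-of-thumb test used throughout this piece: treat two scores as
    statistically significantly different only if their 95% confidence
    intervals (given as (lower, upper) pairs) do not overlap."""
    lower_a, upper_a = ci_a
    lower_b, upper_b = ci_b
    return upper_a < lower_b or upper_b < lower_a

# Two departments scoring 75% and 82% on the same question:
print(significantly_different((60, 91), (68, 95)))   # False - intervals overlap
print(significantly_different((71, 79), (81, 90)))   # True  - no overlap
```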
I’ve looked at three kinds of comparisons someone might want to make using institutions’ subject-level NSS scores: comparison with the previous year, comparison with the sector average (for that subject) and comparisons between institutions. I’ve limited the analysis to 147 of the larger institutions included in the NSS data – the universities, and some small and specialist institutions such as conservatoires – and used the finest-grain JACS subject categorisation, with 108 subjects.
The results differ quite a lot by question and subject, but overall the picture is not pretty. Comparing 2014-15 with 2015-16, just 2% of the year-to-year comparisons were statistically significant. For a university department comparing their NSS results with the previous year’s, that works out at roughly half a meaningful change across the 27 questions – so on average, every other year the score for one of their questions will have meaningfully changed. Comparing institutions’ subject-level scores with the (subject-level) sector average, 24% of the comparisons are statistically significant, which means that, on average, a department will find that their scores for just six of the 27 NSS questions differ meaningfully from the sector average (and 21 won’t).
For the comparisons between institutions, to limit the processing power required I had to randomly select 50 of the 147 institutions, which still provided around two million comparisons. 6% of those comparisons were statistically significant, meaning that when a department compares their NSS score for a particular question with those of other institutions, on average 6% will be meaningful; i.e. if there are 50 other institutions to compare against, their score will differ meaningfully from three of them (on average). On average, for 10 of the 27 questions, there will be no other institution whose score differs meaningfully. And for 12% of institution-subject combinations the situation is particularly bad – they don’t differ significantly from any other institution, for any question. Of course, we can improve those numbers by using a less fine-grained subject categorisation (increasing the numbers of responses and so reducing the size of the confidence intervals), but then we start lumping together subjects that are quite different from the perspective of prospective students and departmental divisions.
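For anyone wanting to reproduce the flavour of this, the between-institution exercise is just the overlap check above applied to every pair of institutions within each subject and question. The sketch below assumes a hypothetical file and column layout – the real data download uses its own column names and needs some cleaning and reshaping first – but the counting logic is the same.

```python
from itertools import combinations
import pandas as pd

# Hypothetical layout of the published data: one row per institution,
# subject and question, with the OfS-supplied confidence interval bounds.
nss = pd.read_csv("nss_subject_level.csv")   # hypothetical file name

def share_significant(group):
    """Proportion of institution pairs whose 95% CIs do not overlap."""
    rows = group[["ci_lower", "ci_upper"]].to_numpy()
    pairs = list(combinations(range(len(rows)), 2))
    if not pairs:
        return float("nan")
    sig = sum(
        1 for i, j in pairs
        if rows[i][1] < rows[j][0] or rows[j][1] < rows[i][0]
    )
    return sig / len(pairs)

# One figure per subject/question combination, then an overall average.
by_question = nss.groupby(["subject", "question"]).apply(share_significant)
print(by_question.mean())
```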
This is only one way of looking at the data, and looking at overlap of confidence intervals isn’t the ideal way of assessing statistical significance. On the other hand, all I’ve done is use the confidence intervals the way that a department looking at their NSS scores is supposed to – I’ve just looked at the whole university sector. It also isn’t a secret. When HEFCE (successfully) proposed in 2014 to reduce the publication threshold from 23 results to 10, they provided analysis indicating that the proportion of overlap of confidence intervals between institutions (using 21 subjects) ranged between 72% (medicine) and 98% (history and philosophy).
Appalling, but good enough?
In one sense, these results suggest that the ability of the NSS to provide meaningful comparisons at subject level is appalling, but only because there’s an expectation that it could tell us more. With institutional responses averaging 58 students at the finest-grained subject level, the confidence intervals are inevitably going to be large. They are 31 percentage points wide, on average, and an interval of that size means that for an actual score of 75%, the ‘true’ value could fall anywhere between 60% and 91%. But that doesn’t mean that the NSS is useless. It’s an incredibly useful tool for suggesting where there might be areas of particularly good and not-so-good practice. It doesn’t have much to say about everything in between – most of the sector – but the real problem is the assumption that it ever could.
To put the same point a different way, the statistical limitations of the NSS data only seem like a serious flaw because of the way the results are used. When they are used in a way that suggests that any difference in the raw numbers is statistically meaningful then the limitations are a problem. When the scores are used in a way that takes into account the confidence intervals – publicly available on the OfS website – then the limitations are just a neutral fact of life. As per usual with the NSS, its strength or weakness almost entirely depends on how we choose to deploy the results.
“For the comparisons between institutions, to limit the processing power required I had to randomly select 50 of the 147 institutions, which still provided around two million comparisons. 6% of those comparisons were statistically significant, meaning that when a department compares their NSS score for a particular question with those of other institutions, on average 6% will be meaningful”
Since the confidence interval is 95% (i.e. a 5% chance of results so different occurring by chance), doesn’t this mean – again, to oversimplify the effects of multiple comparisons on confidence intervals – that 5% of that 6% can also be explained by random chance, and therefore that approximately none of the inter-institution comparisons are actually significant?
“2% of the year-to-year comparisons were statistically significant”
And for the same reason, probably none of these actually are either?
Thanks for the question – this gets a bit technical. In saying that e.g. 6% of the c.2 million comparisons between institutions (for a particular subject and question) are statistically significant, I mean that for 6% of the comparisons the 95% confidence intervals do not overlap. If two 95% confidence intervals don’t overlap, we can be roughly 99% confident that the difference is not due to chance (the reason for that is pretty technical – see the link in the piece, https://www.apastyle.org/manual/related/cumming-and-finch.pdf). So for that 6%, roughly speaking, we will get a false positive one time in 100.
Worth saying that there are better ways than this of working out the proportion of statistically significant differences (this is quite a clumsy way of doing it) but they rely on having the full individual-level dataset, which isn’t made publicly available. What I’ve tried to do is look at what institutions & departments should expect to find if they do what OfS expects them to do with the (publicly available) confidence interval data, i.e. none of this is a secret.
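If it helps, a quick simulation makes the Cumming and Finch point tangible: draw two samples from the same underlying population, build a 95% confidence interval for each, and count how often the intervals fail to overlap. A rough sketch, using the 58-respondent average from the piece and the simple normal approximation, so the exact rate will wobble a little:

```python
import math
import random

def ci(sample, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = sum(sample) / len(sample)
    half = z * math.sqrt(p * (1 - p) / len(sample))
    return p - half, p + half

def false_positive_rate(true_p=0.75, n=58, trials=20_000):
    """How often do two 95% CIs from the *same* population fail to overlap?"""
    hits = 0
    for _ in range(trials):
        a = [random.random() < true_p for _ in range(n)]
        b = [random.random() < true_p for _ in range(n)]
        (la, ua), (lb, ub) = ci(a), ci(b)
        if ua < lb or ub < la:
            hits += 1
    return hits / trials

print(false_positive_rate())   # typically around 0.005-0.01 - roughly the
                               # "one time in 100" above, not one in 20
```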
Great article Alex – I remember seeing you present some of this at Edinburgh and people being quite shocked! We should all think carefully about how we present these visualisations – I’m definitely guilty of leaving the CIs off visualisations of NSS data, and I’m not actually sure why I do this. Maybe I can’t be bothered with the conversation? (In which case – shame on me!)
As a bit of a shameless plug, last year I was playing about with providing NSS data in tidy format, although I’m struggling to incorporate older data given the changes to the survey. The package is hosted on GitHub if anyone wants it: https://github.com/jillymackay/nss
Back in my healthcare days, we had a quite simple metric for institutions on patient dose data … Don’t be in the worst 25%
If you just simplify the use of the scores to this level they may be of some use.
But when I was working with a potential graduate recruiter, who was bombarded by lots of headline NSS scores from ‘the sellers’, they noted how much closer our scores were to some lesser-recognised HEI than to Oxbridge, say. They felt this said more about the stats than the institution.