The signal and the noise: Nate Silver, Bayesian methods and higher education

When Nate Silver comes to speak at Wonkfest on November 5, we will no doubt hope to hear from him about the extraordinary state of American politics and perhaps also about how Britain’s political travails are seen on the other side of the Atlantic. But we should also take some time to quiz him on his methods, particularly about how he goes about using data as the foundation for much of his analysis. For Nate Silver is a Bayesian.

One of the ways in which he tries to distinguish the signal from the noise in political and sporting analysis is by using an extensive toolkit of Bayesian inference models. As he is at pains to point out, he doesn’t make predictions, he renders forecasts based on “conditional probability” – the chances we estimate of certain events occurring, given our assumptions and observations. In his work, he strongly encourages us to think like Bayesians.

What does this mean? The term will be familiar to some, but unknown to many. It refers to a mathematical principle first described circa 1740 by a Presbyterian minister, Thomas Bayes. The principle deals with the world of probability – in particular, the probability of something given prior knowledge about the conditions that may affect it. This principle underpins a whole system of statistical analysis.
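For readers who like to see the principle written down, it is usually expressed as below. The letters are labels chosen for this illustration: H is a hypothesis (say, “this tyre will fail”) and D is the data we observe.

```latex
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}
```

Here P(H) is the “prior” (what we believed before seeing the data), P(D | H) is how likely the data would be if the hypothesis were true, and P(H | D) is the “posterior”, our revised belief once the data are in.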

Probability versus frequency

For example, suppose we are in the business of manufacturing tyres, and we want to know how often they fail, for quality control purposes. In a familiar approach using “frequentist” statistics, we start with no prior assumptions and take a fixed random sample from a batch of tyres, wear-testing them on a testbed for 1000 miles. If the sample is large enough we state our results and also what level of confidence we have in them (by testing for their statistical significance). In a Bayesian approach, we start with an informed assumption, say, that the probability of a tyre failing within 1000 miles on a testbed will be 1 in 10,000. We then test tyres chosen randomly, one by one, updating our original assumption based on whether each tyre got past 1000 miles or not. Each sequential test gives a new data point that refines our estimate. In other words, we have an improving model of the probability of tyre failure, as opposed to taking a snapshot of the frequency of tyre failure. A similar approach was used to test ammunition in the Second World War.
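To make the one-by-one updating concrete, here is a minimal sketch in Python. It assumes a Beta prior, which is one common (but not the only) way of encoding a belief such as “1 in 10,000”; the prior parameters and the run of test outcomes are invented for illustration and are not real tyre data.

```python
def update(alpha, beta, failed):
    """Update Beta(alpha, beta) beliefs about the failure rate after one test."""
    if failed:
        return alpha + 1, beta   # one more observed failure
    return alpha, beta + 1       # one more observed survival


def estimated_failure_rate(alpha, beta):
    """Mean of the Beta distribution: our current best estimate."""
    return alpha / (alpha + beta)


# Assumed prior: Beta(1, 9999), whose mean is 1 in 10,000.
alpha, beta = 1, 9999

# Hypothetical test results: 499 tyres pass 1000 miles, then one fails.
outcomes = [False] * 499 + [True]

for failed in outcomes:
    alpha, beta = update(alpha, beta, failed)

print(f"Estimated failure rate: {estimated_failure_rate(alpha, beta):.6f}")
```

With these made-up results the estimate moves from 1 in 10,000 to 2 in 10,500, or roughly 1 in 5,250. The point is that every single test nudges the estimate, rather than waiting for a fixed sample to be completed.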

At this point, a declaration: I am no statistician. But I am interested in how we might use diverse quantitative methods to understand what happens in higher education systems, especially when the “research” concerned becomes directly instrumental in policy and performance judgements, as with LEO and the TEF. Within qualitative research, methodological diversity is taken as a given. This is not so much the case for quantitative methods, at least those used in policy-sensitive research, which tend to be dominated by survey or administrative data analysed using frequentist statistics.

But adherence to only one kind of quantitative analysis simply because it is more established, or more widely understood by policy makers, seems like a missed opportunity. This is especially so when we consider that Bayesian methods are increasingly mainstream in fields as diverse as medicine, forensic science, defence, sport, meteorology and many others. They also underpin a great deal of artificial intelligence and machine learning.

Intuitive statistics

My point here is not that we should replace all our existing analysis wholesale with an alternative analytical system. It is that we might be missing opportunities by not using a blend of statistical approaches. Both the frequentist and the Bayesian schools of statistics have strengths and weaknesses. One strength of the Bayesian approach is that in some ways it resembles human intuition, but can also be used to challenge and improve it.

For example, experienced academics form a view of their students’ likely strengths and weaknesses in future assessments, based on formative exercises. Sometimes it doesn’t take many exercises to develop a picture of a student, but then if they demonstrate a sudden improvement in something this can radically change the picture. In effect, teachers are thinking like Bayesians, revising their appraisal of students’ chances of success as they go along picking up new data points. This is not to suggest there isn’t a deeply humanistic process at work here; rather, it is to say that Bayesian methods naturally align with it. Thus we can add statistical value to teacher insights, helping teachers to “refine their intuition”, especially under the pressures that come with working in a mass HE system.

Further work could be done here to extend these methods and help teachers to understand how they work, what possibilities they offer, and what limitations they have. Indeed, like all statistics, these methods must be used responsibly. A key risk is that Bayesian models can be very sensitive to the first assumptions we make. If we were to assume at the outset that a particular student’s chances of success are zero, the assumption can’t be changed by any further data points, and that simply isn’t what we want from the perspective of an ethical pedagogy.
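The problem with a starting assumption of zero can be seen in a few lines of Python. This is only a sketch: the function names are my own, and the likelihoods (how often strong work is seen from students who do and do not go on to succeed) are assumed numbers chosen for illustration, not estimates from any real cohort.

```python
def revise(prior, p_evidence_if_success, p_evidence_if_not):
    """Apply Bayes' rule once to a belief that the student will succeed."""
    numerator = prior * p_evidence_if_success
    evidence = numerator + (1 - prior) * p_evidence_if_not
    return numerator / evidence


def revise_repeatedly(prior, n_pieces, p_if_success=0.9, p_if_not=0.2):
    """Revise the belief after n_pieces of strong work in a row."""
    belief = prior
    for _ in range(n_pieces):
        belief = revise(belief, p_if_success, p_if_not)
    return belief


# Assumed numbers: strong work is seen 90% of the time from students who
# go on to succeed, and 20% of the time from those who do not.
print(revise_repeatedly(0.01, 5))  # a sceptical but open prior
print(revise_repeatedly(0.0, 5))   # a prior of zero
```

With these made-up numbers, a sceptical but open prior of 1% climbs above 90% after five strong pieces of work, while the prior of zero stays at exactly zero however much evidence arrives.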

Handling complexity

Another strength of the Bayesian approach is its particular usefulness in “nested structures” where differing conditions, and small sample sizes, occur in different parts of the structure. To see why, back to those stacks of tyres. It’s apparent that if we want to reliably test a batch of twenty thousand tyres using a frequentist model, we probably need to sample at least a few hundred, testing them in such a way that makes them unusable later. This is quite wasteful and rather time-consuming. Reducing waste was likewise a key motivation for wartime ammunition testers to use a Bayesian system. By using a Bayesian model we can derive useful – albeit imperfect – inferences from smaller samples, and the utility increases where different conditions apply in different parts of a structure.

A frequentist approach to testing tyres might be fine if you are producing tyres for the mass market, where none of the usage conditions are extreme and they tend to average out over time. But it might be less appropriate for a specialist application, say Formula 1, where the tyre manufacturer might only make a few thousand tyres a year for all the teams put together, and the cost of each tyre might be several hundred times more than a conventional road tyre. Each team’s car might apply wear to the tyres differentially, which we might also want to understand. In this situation we need to get useful results with small samples for different conditions and a low number of tests, which we can’t even really control. When we analyse the results we are interested in the conditional probability of the tyres lasting for a certain number of laps, given track temperatures, surface gradients, weather conditions, even the quixotic question of “driver style”. In his time, Jenson Button was often complimented as being “very easy on the tyres” – but try quantifying it!
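A rough sketch of how small samples from different teams might be handled is given below. It uses a single shared Beta prior that is assumed rather than learned from the data, so it is a simplified stand-in for a full hierarchical model; the team names, test counts and prior are all hypothetical.

```python
# Hypothetical data: (tyres that failed early, tyres tested) for each team.
results = {"Team A": (0, 3), "Team B": (2, 4), "Team C": (1, 20)}

# Assumed shared prior: Beta(2, 18), i.e. a prior failure rate of 10%.
PRIOR_ALPHA, PRIOR_BETA = 2.0, 18.0


def posterior_mean(failures, tested, alpha=PRIOR_ALPHA, beta=PRIOR_BETA):
    """Posterior mean failure rate after combining the prior with the tests."""
    return (alpha + failures) / (alpha + beta + tested)


for team, (failures, tested) in results.items():
    raw = failures / tested                    # frequency in the sample alone
    pooled = posterior_mean(failures, tested)  # estimate tempered by the prior
    print(f"{team}: raw {raw:.2f}, Bayesian estimate {pooled:.2f}")
```

Team A’s three clean tests do not produce an estimate of zero failures, and Team B’s two failures out of four do not produce an estimate of 50%. Both are pulled towards the shared prior, while Team C’s larger sample is moved much less.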

The world of higher education is replete with real-world, nested, small-sample situations for which we need good tests with very limited scope for controlling those tests or even making observations. For example, we can benefit from developing statistical models suited to assessing the effect on students of specific programmes, or even modules within programmes. To this end, Bayesian statistics already form part of the armoury of techniques used in learning analytics. They could also help us with complex learning and teaching challenges such as measuring learning gain, or understanding the conditions that lead to attainment gaps between different student groups.

The bigger picture and the case for diverse methods

When we come to wider public policy there is perhaps even more scope for progress. For example, the Office for Students might use such techniques to evaluate the risk profile of providers: a probability-based approach lends itself well to this purpose, as it is fundamentally about the chances we estimate of risks crystallising. Elsewhere, in LEO, we have a major problem with the historicity of data (i.e. much of it is really quite old) – Bayesian methods could enable rigorous analysis using only more recent data. A group of academics have published an experimental paper illustrating how this might be done.

In another, related, area, the government says a subject-level TEF will go ahead, but many in the sector have major doubts about the usefulness of the piloted version because it does not in fact go down to the programme level. One reason why this is very hard to achieve is that sample sizes at this level are very small. But if deployed as a complementary (not necessarily replacement) tool, Bayesian models might enable a more granular analysis. We also know that the system of statistical analysis now utilised in the TEF, originally a very radical innovation developed to create sector-wide performance indicators, simply wasn’t designed for the purposes now imposed on it, such as assigning ratings to institutions or providing student information. Its application in the TEF has become highly controversial in the statistics community. With decisions about precisely how the subject-level TEF is to go forward still up in the air, perhaps now might be a good time to consider whether different – or mixed – statistical methods may be the way forward.

Wonkfest runs 4-5 November in London, and tickets are available. Nate Silver will speak on Tuesday afternoon and there will be a session on the future of the Teaching Excellence Framework as well as several sessions about data in higher education.