Fun with flags (pt 2): How did metrics work in TEF outcomes?

In the run-up to the TEF results, one of the features of the exercise that raised the most questions was the way in which providers’ benchmarked scores were ‘flagged’.

In December, Wonkhe published a guide to the flagging process and shared HEFCE’s provisional data on how institutions had fared. Flags are vitally important to the TEF process as they indicate the extent to which an institution is beating, or falling short of, the benchmark expected for its students.

Where flags appear on the data sheets, they can be positive or negative, or even ‘double positive’ or ‘double negative’ for more significant results in either direction. To be more precise, a double flag deviates from the benchmark by 3%, a single flag by 2% – fans of the UK Performance Indicators (UKPIs) will recall that 3% is the significance bar for that exercise. There can also be a neutral score, which means that the institution is performing more or less as expected.
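As a rough illustration of the thresholds just described, the sketch below maps a difference from benchmark (in percentage points) to a flag. It uses only the 2% and 3% figures quoted above, assumes they mean "at least", and ignores any further statistical test. The function name `classify_flag` and the symbols `++`, `+`, `=`, `-`, `--` are our own shorthand, not anything taken from the TEF data sheets.

```python
def classify_flag(difference_pp: float) -> str:
    """Turn a difference from benchmark (percentage points) into a flag symbol.

    Assumption: thresholds are 'at least 3' for a double flag and
    'at least 2' for a single flag, as quoted in the text. Any additional
    significance test is deliberately left out of this sketch.
    """
    magnitude = abs(difference_pp)
    if magnitude >= 3.0:
        strength = 2      # double flag
    elif magnitude >= 2.0:
        strength = 1      # single flag
    else:
        return "="        # neutral: performing more or less as expected

    symbol = "+" if difference_pp > 0 else "-"
    return symbol * strength


if __name__ == "__main__":
    for diff in (3.4, 2.0, 0.7, -2.5, -3.0):
        print(f"{diff:+.1f} -> {classify_flag(diff)}")
```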

While the exercise’s published guidance stated that these flags would not cancel each other out – the notion was that it should have been difficult to receive a Gold award with any negative marks against you – a number of negatively marked providers received Gold. It seems, therefore, that there has been some balancing of the ups against the downs.

However, some may read the calculations that follow as an approximation of a ‘panelless’ TEF, and we should be very careful not to allow that interpretation. In viewing these graphs we should be clear that none of this work shows what an institution ‘should have had’ – the panel has always had the final say in allocating awards based on metrics, context and the provider statement. Neither should we assume that anything here approximates the raw allocations that the panel used as a starting point – only the panel and secretariat can know what these were, and – as before – they are irrelevant in the context of determining TEF awards.

‘Implied awards’

We tested how far the flag criteria were applied by allocating an ‘implied award’ based on the criteria given in the guidance – any institution with three or more positive flags and no negative flags was given a Gold, any institution with two or more negative flags was given a Bronze, and all the others were given a Silver. The chart below shows all those institutions whose ‘implied award’ differed from their final outcome.
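The rule above is simple enough to write down in a few lines. What follows is a minimal sketch rather than the code behind the chart. It assumes that a double flag counts as one flag in its direction, and the flag symbols, the function name `implied_award` and the two example providers are all hypothetical.

```python
from typing import Sequence

POSITIVE = {"+", "++"}   # assumption: a double flag counts once, like a single
NEGATIVE = {"-", "--"}


def implied_award(flags: Sequence[str]) -> str:
    """Apply the three-way rule described in the text to one provider's flags."""
    positives = sum(1 for f in flags if f in POSITIVE)
    negatives = sum(1 for f in flags if f in NEGATIVE)

    if positives >= 3 and negatives == 0:
        return "Gold"
    if negatives >= 2:
        return "Bronze"
    return "Silver"


if __name__ == "__main__":
    # Hypothetical providers: (flags on the core metrics, final award).
    providers = {
        "Provider A": (["++", "+", "+", "=", "=", "="], "Gold"),
        "Provider B": (["+", "+", "+", "-", "=", "="], "Gold"),
    }
    for name, (flags, final) in providers.items():
        implied = implied_award(flags)
        note = "differs" if implied != final else "matches"
        print(f"{name}: implied {implied}, final {final} ({note})")
```

In this made-up example Provider B would appear on the chart, since its single negative flag gives an implied Silver against a final Gold.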

TEF final outcomes changed from initial hypothesis

Flag scoring

Next, we tried to use the flags to produce a ranking. Our scoring system combines the positive and negative flags into a single number. We’ve allocated a score of +2 to a double positive, +1 to a positive and 0 to neutral. A negative flag gets -1, and a double negative -2. We shouldn’t labour under the assumption that double flags are, in every case, ‘twice as good’ or ‘twice as bad’ as singles. You can see from the ‘z-scores’, which express the distance from the benchmark in units of standard error, that some providers are much further away than others. However, the assessors and panel were instructed to take the flags into account in the first instance when looking at providers’ performance. Thus our combination of the flags, creating a 25-point scale, gives a good idea of the spread of performance.
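To make the arithmetic concrete, here is a small sketch of the scoring described above. It assumes six core metrics per provider, which is what a 25-point scale from -12 to +12 implies. The flag symbols and the name `flag_score` are our own shorthand.

```python
from typing import Sequence

# Points per flag, as set out in the text.
FLAG_POINTS = {"++": 2, "+": 1, "=": 0, "-": -1, "--": -2}

N_CORE_METRICS = 6  # assumption: six core metrics, giving -12..+12


def flag_score(flags: Sequence[str]) -> int:
    """Sum the points for one provider's flags."""
    return sum(FLAG_POINTS[f] for f in flags)


if __name__ == "__main__":
    lowest = flag_score(["--"] * N_CORE_METRICS)
    highest = flag_score(["++"] * N_CORE_METRICS)
    print(f"scale runs from {lowest} to {highest}: "
          f"{highest - lowest + 1} possible scores")

    # A hypothetical mixed profile: ups and downs offset each other.
    print(flag_score(["++", "+", "=", "-", "--", "+"]))
```

Note that this score lets positives and negatives offset one another, which is exactly what the published guidance said the flags themselves should not do, so it is a descriptive ranking device only.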

Flags as UKPIs

We also decided to see what would happen if we treated the TEF core metrics as if they were UKPIs – in this case we simply allocated a score of 1 for a double-positive flag and of -1 for a double-negative flag. This 13-point scale gives a tighter spread, but a better indication of what would, in other settings, be seen as significant over- and under-performance.
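The UKPI-style variant can be sketched in the same way. Here we assume that single and neutral flags score 0 (the text only gives values for double flags) and that there are six core metrics, which is what a 13-point scale from -6 to +6 implies. The name `ukpi_score` and the flag symbols are hypothetical.

```python
from typing import Sequence

# Only double flags count; single and neutral flags are assumed to score 0.
UKPI_POINTS = {"++": 1, "+": 0, "=": 0, "-": 0, "--": -1}

N_CORE_METRICS = 6  # assumption: six core metrics, giving -6..+6


def ukpi_score(flags: Sequence[str]) -> int:
    """Count double positives minus double negatives for one provider."""
    return sum(UKPI_POINTS[f] for f in flags)


if __name__ == "__main__":
    lowest = ukpi_score(["--"] * N_CORE_METRICS)
    highest = ukpi_score(["++"] * N_CORE_METRICS)
    print(f"scale runs from {lowest} to {highest}: "
          f"{highest - lowest + 1} possible scores")

    # A provider with only single flags scores zero on this scale.
    print(ukpi_score(["+", "+", "+", "-", "=", "="]))
```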