Jim is an Associate Editor (SUs) at Wonkhe

The Office for Students (OfS) has published its annual analysis of sector-level degree classifications over time, and alongside it a report on Bachelors’ degree classification algorithms.

The former is of the style (and with the faults) we’ve seen before. The latter is the controversial bit, both in the extent to which parts of it represent a “new” set of regulatory requirements, and in the “new” set of rules over what universities can and can’t do when calculating degree results.

Elsewhere on the site my colleague David Kernohan tackles the regulation issue – the upshots of the “guidance” on the algorithms, including what it will expect universities to do both to algorithms in use now, and if a provider ever decides to revise them.

Here I’m looking in detail at its judgements over two practices. Universities are, to all intents and purposes, being banned from any system which discounts credits with the lowest marks – a practice which the regulator says makes it difficult to demonstrate that awards reflect achievement.

It’s also ruling out “best of” algorithm approaches – any universities that determine degree class by running multiple algorithms and selecting the one that gives the highest result will also have to cease doing so. Anyone still using these approaches by 31 July 2026 has to report itself to OfS.

Powers and process do matter, as do questions as to whether this is new regulation, or merely a practical interpretation of existing rules. But here I’m concerned with the principle. Has OfS got a point? Do systems such as those described above amount to misleading people who look at degree results over what a student has achieved?

More, not less

A few months ago now on Radio 4’s More or Less, I was asked how Covid had impacted university students’ attainment. On a show driven by data, I was wary about admitting that, as a whole, UK HE isn’t really sure.

When in-person everything was cancelled back in 2020, universities scrambled to implement “no detriment” policies that promised students wouldn’t be disadvantaged by the disruption.

Those policies took various forms – some guaranteed that classifications couldn’t fall below students’ pre-pandemic trajectory, others allowed students to select their best marks, and some excluded affected modules entirely.

By 2021, more than a third of graduates were receiving first-class honours, compared to around 16 per cent a decade earlier – with ministers and OfS on the march over the risk of “baking in” the grade inflation.

I found that pressure troubling at the time. It seemed to me that, for a variety of reasons, the pandemic had forced providers to confront a range of faults with their degree algorithms – for the students, courses and providers that we have now, it was the old algorithms that were the problem.

But the other interesting thing for me was what those “safety net” policies revealed about the astonishing diversity of practice across the sector when it comes to working out the degree classification.

For all of the comparison work done – including, in England, official metrics on the Access and Participation Dashboard over disparities in “good honours” awarding – I was wary about admitting to Radio 4’s listeners that it’s not just differences in teaching, assessment and curriculum that can drive someone getting a First here and a 2:2 up the road.

When in-person teaching returned in 2022 and 2023, the question became what “returning to normal” actually meant. Many – under regulatory pressure not to “bake in” grade inflation – removed explicit no-detriment policies, and the proportion of firsts and upper seconds did ease slightly.

But in many providers, many of the flexibilities introduced during Covid – around best-mark selection, module exclusions and borderline consideration – had made explicit and legitimate what was already implicit in many institutional frameworks. And many were kept.

Now, in England, OfS is to all intents and purposes banning a couple of the key approaches that were deployed during Covid. For a sector that prizes its autonomy above almost everything else, that’ll trigger alarm.

But a wider look at how universities actually calculate degree classifications reveals something – the current system embodies fundamentally different philosophies about what a degree represents, philosophies that produce systematically different outcomes for identical student performance, and philosophies that should not be written off lightly.

What we found

Building on David Allen’s exercise seven years ago, a couple of weeks ago I examined the publicly available degree classification regulations for more than 150 UK universities, trawling through academic handbooks, quality assurance documents and regulatory frameworks.

The shock for the Radio 4 listener on the Clapham Omnibus would be that there is no standardised national system with minor variations – instead there is a patchwork of fundamentally different approaches to calculating the same qualification.

Almost every university claims to use the same framework for UG quals – the Quality Assurance Agency benchmarks, the Framework for Higher Education Qualifications and standard grade boundaries of 70 for a first, 60 for a 2:1, 50 for a 2:2 and 40 for a third. But underneath what looks like consistency there’s extraordinary diversity in how marks are then combined into final classifications.

The variations cluster around a major divide. Some universities – predominantly but not exclusively in the Russell Group – operate on the principle that a degree classification should reflect the totality of your assessed work at higher levels. Every module (at least at Level 5 and 6) counts, every mark matters, and your classification is the weighted average of everything you did.

Other universities – predominantly post-1992 institutions but with significant exceptions – take a different view. They appear to argue that a degree classification should represent your actual capability, demonstrated through your best work.

Students encounter setbacks, personal difficulties and topics that don’t suit their strengths. Assessment should be about demonstrating competence, not punishing every misstep along a three-year journey.

Neither philosophy is obviously wrong. The first prioritises consistency and comprehensiveness. The second prioritises fairness and recognition that learning isn’t linear. But they produce systematically different outcomes, and the current system does allow both to operate under the guise of a unified national framework.

Five features that create flexibility

Five structural features appear repeatedly across university algorithms, each pushing outcomes in one direction.

1. Best-credit selection

This first one has become widespread, particularly outside the Russell Group. Rather than using all module marks, many universities allow students to drop their worst performances.

One uses the best 105 credits out of 120 at each of Levels 5 and 6. Another discards the lowest 20 credits automatically. A third takes only the best 90 credits at each level. Several others use the best 100 credits at each stage.

The rationale is obvious – why should one difficult module or one difficult semester define an entire degree?

But the consequence is equally obvious. A student who scores 75-75-75-75-55-55 across six modules averages 68.3 per cent. At universities where everything counts, that’s a 2:1. At universities using best-credit selection that drops the two 55s, it averages 75 – a clear first.
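For those who like to see the arithmetic laid out, here is a minimal sketch of that comparison in Python. The module sizes, the drop rule and the marks are illustrative rather than any particular provider’s regulations.

```python
# Illustrative only: six equally weighted 20-credit modules, comparing an
# "everything counts" mean with a best-credit rule that drops the two
# lowest-scoring modules. Not any provider's actual algorithm.

def classify(mean):
    """Map a weighted mean to a class using the standard boundaries."""
    if mean >= 70: return "First"
    if mean >= 60: return "2:1"
    if mean >= 50: return "2:2"
    if mean >= 40: return "Third"
    return "Fail"

marks = [75, 75, 75, 75, 55, 55]

all_credits = sum(marks) / len(marks)                    # 68.3
best_credits = sum(sorted(marks, reverse=True)[:4]) / 4  # 75.0

print(classify(all_credits))   # 2:1
print(classify(best_credits))  # First
```

Same student, same marks – a full classification apart depending on the rule.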

Best-credit selection is the majority position among post-92s, but virtually absent at Russell Group universities. OfS is now pretty much banning this practice.

The case against rests on B4.2(c) (academic regulations must be “designed to ensure” awards are credible) and B4.4(e) (credible means awards “reflect students’ knowledge and skills”). Discounting credits with the lowest marks “excludes part of a student’s assessed achievement” and so:

…may result in a student receiving a class of degree that overlooks material evidence of their performance against the full learning outcomes for the course.

2. Multiple calculation routes

These take that principle further. Several universities calculate your degree multiple ways and award whichever result is better. One runs two complete calculations – using only your best 100 credits at Level 6, or taking your best 100 at both levels with 20:80 weighting. You get whichever is higher.

Another offers three complete routes – unweighted mean, weighted mean and a profile-based method. Students receive the highest classification any method produces.
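To make the “best of” logic concrete, here is a hedged sketch. The two routes below are hypothetical stand-ins for the kinds described above, with made-up marks, not any institution’s published rules.

```python
# Illustrative "best of" algorithm: run every route, award whichever result
# is higher. Both routes and all marks below are hypothetical.

def best_of(routes, marks_l5, marks_l6):
    """Return the highest mean produced by any calculation route."""
    return max(route(marks_l5, marks_l6) for route in routes)

def route_final_level_only(marks_l5, marks_l6):
    # best 100 of 120 credits at Level 6 only (drop the weakest 20-credit module)
    best = sorted(marks_l6, reverse=True)[:5]
    return sum(best) / len(best)

def route_both_levels_weighted(marks_l5, marks_l6):
    # best 100 credits at each level, combined with 20:80 weighting
    best_l5 = sorted(marks_l5, reverse=True)[:5]
    best_l6 = sorted(marks_l6, reverse=True)[:5]
    return 0.2 * sum(best_l5) / 5 + 0.8 * sum(best_l6) / 5

marks_l5 = [62, 64, 66, 58, 55, 60]   # six 20-credit Level 5 modules
marks_l6 = [70, 72, 68, 66, 54, 58]   # six 20-credit Level 6 modules

print(best_of([route_final_level_only, route_both_levels_weighted],
              marks_l5, marks_l6))    # takes the higher of 66.8 and 65.8
```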

For those holding onto their “standards”, this sort of thing is mathematically guaranteed to inflate outcomes. You’re measuring the best possible interpretation of what students achieved, not what they achieved every time. As a result, comparison across institutions becomes meaningless. Again, this is now pretty much being banned.

This time, the case against is that:

…the classification awarded should not simply be the most favourable result, but the result that most accurately reflects the student’s level of achievement against the learning outcomes.

3. Borderline uplift rules

What happens on the cusps? Borderline uplift rules create all sorts of discretion around the theoretical boundaries.

One university automatically uplifts students to the higher class if two-thirds of their final-stage credits fall within that band, even if their overall average sits below the threshold. Another operates a 0.5 percentage point automatic uplift zone. Several maintain 2.0 percentage point consideration zones where students can be promoted if profile criteria are met.

If 10 per cent of students cluster around borderlines and half are uplifted, that’s a five per cent boost to top grades at each boundary – the cumulative effect is substantial.
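Here is a sketch of how a profile-based uplift might operate, assuming a two-point consideration zone and a two-thirds-of-final-stage-credits test – both drawn from the examples above, neither from any single provider.

```python
# Illustrative borderline uplift: a student whose average sits in a
# consideration zone below a boundary is promoted if enough final-stage
# credits fall in the higher band. Zone width and share are assumptions.

BOUNDARIES = {"First": 70, "2:1": 60, "2:2": 50, "Third": 40}

def classify_with_uplift(average, final_stage_marks, zone=2.0, share=2/3):
    for label, boundary in BOUNDARIES.items():
        if average >= boundary:
            return label
        if boundary - zone <= average < boundary:
            in_band = sum(1 for m in final_stage_marks if m >= boundary)
            if in_band / len(final_stage_marks) >= share:
                return label   # uplifted on profile
    return "Fail"

# 68.5 average, but four of six final-stage modules at 70 or above
print(classify_with_uplift(68.5, [72, 74, 70, 71, 62, 58]))  # First
```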

One small and specialist provider plays the counterfactual – when it gained degree-awarding powers, it explicitly removed all discretionary borderline uplift. The boundaries are fixed – and it argues this is more honest than trying to maintain discretion that inevitably becomes inconsistent.

OfS could argue borderline uplift breaches B4.2(b)’s requirement that assessments be “reliable” – defined as requiring “consistency as between students.”

When two students with 69.4 per cent overall averages receive different classifications (one uplifted to a First, one remaining a 2:1) based on mark distribution patterns or examination board discretion, the system produces inconsistent outcomes for identical demonstrated performance.

But OfS avoids this argument, likely because it would directly challenge decades of established discretion on borderlines – a core feature of the existing system. Eliminating all discretion would conflict with professional academic judgment practices that the sector considers fundamental, and OfS has chosen not to pick that fight.

4. Exit acceleration

Heavy final-year weighting amplifies improvement while minimising early difficulties. Where deployed, the near-universal pattern is now 25 to 30 per cent for Level 5 and 70 to 75 per cent for Level 6. Some institutions weight even more heavily, with year three counting for 60 per cent of the final mark.

A student who averages 55 in year two and 72 in year three gets 67.8 overall with a 25:75 weighting – a 2:1. A student who averages 72 in year two and 55 in year three gets 59.3 – just short of a 2:1.

The magnitude of change is identical – it’s just that the direction differs. The system structurally rewards late bloomers and penalises any early starters who plateau.
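Here is that trajectory effect as a minimal sketch, assuming a flat year average at each level and the 25:75 weighting used in the example above.

```python
# Illustrative exit-acceleration effect: the same two year averages, in
# opposite orders, land either side of the 2:1 boundary once Level 6 carries
# 75 per cent of the weight. Weights and marks are illustrative.

def final_mark(level5_avg, level6_avg, w_l5=0.25, w_l6=0.75):
    return w_l5 * level5_avg + w_l6 * level6_avg

late_bloomer = final_mark(55, 72)   # 67.75 -> a comfortable 2:1
early_peaker = final_mark(72, 55)   # 59.25 -> just misses a 2:1

print(round(late_bloomer, 2), round(early_peaker, 2))
```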

OfS could argue that 75 per cent final-year weighting breaches B4.2(a)’s requirement for “appropriately comprehensive” assessment. B4 Guidance 335M warns that assessment “focusing only on material taught at the end of a long course… is unlikely to provide a valid assessment of that course,” and heavy (though not exclusive) final-year emphasis arguably extends this principle – if the course’s subject matter is taught across three years, does minimising assessment of two-thirds of that teaching constitute comprehensive evaluation?

But OfS doesn’t make this argument either, likely because year weighting is explicit in published regulations, often driven by PSRB requirements, and represents settled institutional choices rather than recent innovations. Challenging it would mean questioning established pedagogical frameworks rather than targeting post-hoc changes that might mask grade inflation.

5. First-year exclusion

Finally, with a handful of institutional and PSRB exceptions, excluding the first year from the classification is now pretty much universal, removing what used to be the bottom tail of performance distributions.

While this is now so standard it seems natural, it represents a significant structural change from 20 to 30 years ago. You can score 40s across the board in first year and still graduate with a first if you score 70-plus in years two and three.

Combine it with other features, and the interaction effects compound. At universities using best 105 credits at each of Levels 5 and 6 with 30:70 weighting, only 210 of 360 total credits – 58 per cent – actually contribute to your classification. And so on.
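The credit arithmetic is easy enough to tally – this sketch simply reproduces the example in the paragraph above.

```python
# Compounding exclusions: Level 4 doesn't count at all, and only the best
# 105 of 120 credits at each of Levels 5 and 6 contribute. Figures mirror
# the example above.

total_credits = 3 * 120        # three years of 120 credits each
counted = 105 + 105            # best 105 credits at each of Levels 5 and 6

print(counted, round(counted / total_credits, 2))   # 210 credits, ~0.58
```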

OfS could argue first-year exclusion breaches comprehensiveness requirements – when combined with best-credit selection, only 210 of 360 total credits (58 per cent) might count towards classification. But the practice is now so standard, with only a handful of institutional and PSRB exceptions, that it is being treated as accepted practice rather than a compliance concern.

Targeting something this deeply embedded across the sector would face overwhelming institutional autonomy defences and would effectively require the sector to reinstate a practice it collectively abandoned over the past two decades.

OfS’ strategy is to focus regulatory pressure on recent adoptions of “inherently inflationary” practices rather than challenging longstanding sector-wide norms.

Institution type

Russell Group universities generally operate on the totality-of-work philosophy. Research-intensives typically employ single calculation methods, count all credits and maintain narrow borderline zones.

But there are exceptions. One I’ve seen has automatic borderline uplift that’s more generous than many post-92s. Another’s 2.0 percentage point borderline zone adds substantial flexibility. If anything, the pattern isn’t uniformity of rigour – it’s uniformity of philosophy.

One London university has a marks-counting scheme rather than a weighted average – what some would say is the most “rigorous” system in England. And two others – you can guess who – don’t fit this analysis at all, with subject-specific systems and no university-wide algorithms.

Post-1992s systematically deploy multiple flexibility features. Best-credit selection appears at roughly 70 per cent of post-92s. Multiple calculation routes appear at around 40 per cent of post-92s versus virtually zero per cent at research-intensive institutions. Several post-92s have introduced new, more flexible classification algorithms in the past five years, while Russell Group frameworks have been substantially stable for a decade or more.

This difference reflects real pressures. Post-92s face acute scrutiny on student outcomes from league tables, OfS monitoring and recruitment competition, and disproportionately serve students from disadvantaged backgrounds with lower prior attainment.

From one perspective, flexibility is a cynical response to metrics pressure. From another, it’s recognition that their students face different challenges. Both perspectives contain truth.

Meanwhile, Scottish universities present a different model entirely, using GPA-based calculations across SCQF Levels 9 and 10 within four-year degree structures.

The Scottish system is more internally standardised than the English system, but the two are fundamentally incompatible. As OfS attempts to mandate English standardisation, Scottish universities will surely refuse, citing devolved education powers.

London is a city with maximum algorithmic diversity within minimum geographic distance. Major London universities use radically different calculation systems despite competing for similar students. A student with identical marks might receive a 2:1 at one, a first at another and a first with a higher average at a third, purely because of algorithmic differences.

What the algorithm can’t tell you

The “five features” capture most of the systematic variation between institutional algorithms. But they’re not the whole story.

First, they measure the mechanics of aggregation, not the standards of marking. A 65 per cent essay at one university may represent genuinely different work from a 65 per cent at another. External examining is meant to moderate this, but the system depends heavily on trust and professional judgment. Algorithmic variation compounds whatever underlying marking variation exists – but marking standards themselves remain largely opaque.

Second, several important rules fall outside the five-feature framework but still create significant variation. Compensation and condonement rules – how universities handle failed modules – differ substantially. Some allow up to 30 credits of condoned failure while still classifying for honours. Others exclude students from honours classification with any substantial failure, regardless of their other marks.

Compulsory module rules also cut across the best-credit philosophy. Many universities mandate that dissertations or major projects must count toward classification even if they’re not among a student’s best marks. Others allow them to be dropped. A student who performs poorly on their dissertation but excellently elsewhere will face radically different outcomes depending on these rules.

In a world where huge numbers of students now have radically less module choice than they did just a few years ago as a result of cuts, they would have reason to feel doubly aggrieved if modules they never wanted to take in the first place now count towards their classification when they wouldn’t have done last week.

Several universities use explicit credit-volume requirements at each classification threshold. A student might need not just a 60 per cent average for a 2:1, but also at least 180 credits at 60 per cent or above, including specific volumes from the final year. This builds dual criteria into the system – you need both the average and the profile. It’s philosophically distinct from borderline uplift, which operates after the primary calculation.
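As a hedged illustration of that dual-criteria idea – the 180-credit figure mirrors the example above, while the 20-credit modules and the omission of final-year volume rules are simplifying assumptions.

```python
# Illustrative dual-criteria threshold: a 2:1 requires both a 60+ average
# and at least 180 credits at 60 or above. Module size (20 credits) and the
# absence of final-year volume rules are simplifications.

def meets_two_one(average, module_marks, credits_per_module=20,
                  required_credits=180):
    credits_at_60 = sum(credits_per_module for m in module_marks if m >= 60)
    return average >= 60 and credits_at_60 >= required_credits

marks = [62, 63, 61, 60, 65, 64, 58, 59, 56, 57, 66, 61]  # twelve modules
avg = sum(marks) / len(marks)                              # 61.0

# A 61 average, but only 160 credits at 60 or above: fails the profile test
print(round(avg, 1), meets_two_one(avg, marks))            # 61.0 False
```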

And finally, treatment of reassessed work varies. Nearly all universities cap resit marks at the pass threshold, but some exclude capped marks from “best credit” calculations while others include them. For students who fail and recover, this determines whether they can still achieve high classifications or are effectively capped at lower bands regardless of their other performance.

The point isn’t so much that I (or OfS) have missed the “real” drivers of variation – the five features genuinely are the major structural mechanisms. But the system’s complexity runs deeper than any five-point list can capture. When we layer compensation rules onto best-credit selection, compulsory modules onto multiple calculation routes, and volume requirements onto borderline uplift, the number of possible institutional configurations runs into the thousands.

The transparency problem

Every day’s a school day at Wonkhe, but what has been striking for me is quite how difficult the information has been to access and compare. Some institutions publish comprehensive regulations as dense PDF documents. Others use modular web-based regulations across multiple pages. Some bury details in programme specifications. Several have no easily locatable public explanation at all.

UUK’s position on this, I’d suggest, is something of a stretch:

University policies are now much more transparent to students. Universities are explaining how they calculate the classification of awards, what the different degree classifications mean and how external examiners ensure consistency between institutions.

Publication cycles vary unpredictably, cohort applicability is often ambiguous, and cross-referencing between regulations, programme specifications and external requirements adds layers upon layers of complexity. The result is that meaningful comparison is effectively impossible for anyone outside the quality assurance sector.

This opacity matters because it masks that non-comparability problem. When an employer sees “2:1, BA in History” on a CV, they have no way of knowing whether this candidate’s university used all marks or selected the best 100 credits, whether multiple calculation routes were available or how heavily final-year work was weighted. The classification looks identical regardless. That makes it more, not less, likely that they’ll just go on prejudices and league tables – regardless of the TEF medal.

We can estimate the impact conservatively. Year one exclusion removes perhaps 10 to 15 per cent of the performance distribution. Best-credit selection removes another five to 10 per cent. Heavy final-year weighting amplifies improvement trajectories. Multiple calculation routes guarantee some students shift up a boundary. Borderline rules uplift perhaps three to five per cent of the cohort at each threshold.

Stack these together and you could shift perhaps 15 to 25 per cent of students up one classification band compared to a system that counted everything equally with single-method calculation and no borderline flexibility. Degree classifications are measuring as much about institutional algorithm choices as about student learning or teaching quality.

Yes, but

When universities defend these features, the justifications are individually compelling. Best-credit selection rewards students’ strongest work rather than penalising every difficult moment. Multiple routes remove arbitrary disadvantage. Borderline uplift reflects that the difference between 69.4 and 69.6 per cent is statistically meaningless. Final-year emphasis recognises that learning develops over time. First-year exclusion creates space for genuine learning without constant pressure.

None of these arguments is obviously wrong. Each reflects defensible beliefs about what education is for. The problem is that they’re not universal beliefs, and the current system allows multiple philosophies to coexist under a facade of equivalence.

Post-92s add an equity dimension – their flexibility helps students from disadvantaged backgrounds who face greater obstacles. If standardisation forces them to adopt strict algorithms, degree outcomes will decline at institutions serving the most disadvantaged students. But did students really learn less, or attain to a “lower” standard?

The counterargument is that if the algorithm itself makes classifications structurally easier to achieve, you haven’t promoted equity – you’ve devalued the qualification. And without the sort of smart, skills- and competencies-based transcripts that most of our pass/fail cousins across Europe adopt, UK students end up choosing between a rock and a hard place – if only they were conscious of that choice.

The other thing that strikes me is that the arguments I made in December 2020 for “baking in” grade inflation haven’t gone away just because the pandemic has. If anything, the case for flexibility has strengthened as the cost of living crisis, inadequate maintenance support and deteriorating student mental health create circumstances that affect performance through no fault of students’ own.

Students are working longer hours in paid employment to afford rent and food, living in unsuitable accommodation, caring for family members, and managing mental health conditions at record levels. The universities that retained pandemic-era flexibilities – best-credit selection, generous borderline rules, multiple calculation routes – aren’t being cynical about grade inflation. They’re recognising that their students disproportionately face these obstacles, and that a “totality-of-work” philosophy systematically penalises students for circumstances beyond their control rather than assessing what they’re actually capable of achieving.

The philosophical question remains – should a degree classification reflect every difficult moment across three years, or should it represent genuine capability demonstrated when circumstances allow? Universities serving disadvantaged students have answered that question one way – research-intensive universities serving advantaged students have answered it another.

OfS’s intervention threatens to impose the latter philosophy sector-wide, eliminating the flexibility that helps students from disadvantaged backgrounds show their “best selves” rather than punishing them for structural inequalities that affect their week-to-week performance.

Now what

As such, a regulator seeking to intervene faces an interesting challenge with no obviously good options – albeit one of its own making. Another approach might have been to cap the most egregious practices – prohibit triple-route calculations, limit best-credit selection to 90 per cent of total credits, cap borderline zones at 1.5 percentage points.

That would eliminate the worst outliers while preserving meaningful autonomy. The sector would likely comply minimally while claiming victory, but oodles of variation would remain.

A stricter approach would be mandating identical algorithms – but that would provoke rebellion. Devolved nations would refuse, citing devolved powers and triggering a constitutional confrontation. Research-intensive universities would mount legal challenges on academic freedom grounds, if they’re not preparing to do so already. Post-92s would deploy equity arguments, claiming standardisation harms universities serving disadvantaged students.

A politically savvy but inadequate approach might have been mandatory transparency rather than prescription. Requiring universities to publish algorithms in a standardised format with some underpinning philosophy would help. That might preserve autonomy while creating a bit of accountability. Maybe competitive pressure and reputational risk would drive voluntary convergence.

But universities will resist even being forced to quantify and publicise the effects of their grading systems. They’ll argue it undermines confidence and damages the UK’s international reputation.

Given the diversity of courses, providers, students and PSRBs, algorithms also feel like a weird thing to standardise. I can make a much better case for a defined set of subject awards with a shared governance framework (including subject benchmark statements, related PSRBs and degree algorithms) than I can for tightening standardisation in isolation.

The fundamental problem is that the UK degree classification system was designed for a different age, a different sector and a different set of students. It was probably a fiction to imagine that sorting everyone into First, 2:1, 2:2 and Third was possible even 40 years ago – but today, it’s such obvious nonsense that without richer transcripts, it just becomes another way to drag down the reputation of the sector and its students.

Unfit for purpose

In 2007, the Burgess Review – commissioned by Universities UK itself – recommended replacing honours degree classifications with detailed achievement transcripts.

Burgess identified the exact problems we have today – considerable variation in institutional algorithms, the unreliability of classification as an indicator of achievement, and the fundamental inadequacy of trying to capture three years of diverse learning in a single grade.

The sector chose not to implement Burgess’s recommendations, concerned that moving away from classifications would disadvantage UK graduates in labour markets “where the classification system is well understood.”

Eighteen years later, the classification system is neither well understood nor meaningful. A 2:1 at one institution isn’t comparable to a 2:1 at another, but the system’s facade of equivalence persists.

The sector chose legibility and inertia over accuracy and ended up with neither – sticking with a system that protected institutional diversity while robbing students of the ability to show off theirs. As we see over and over again, a failure to fix the roof when the sun was shining means reform may now arrive externally imposed.

Now the regulator is knocking on the conformity door, there’s an easy response. OfS can’t take an annual pop at grade inflation if most of the sector abandons the outdated and inadequate degree classification system. Nothing in the rules seems to mandate it, some UG quals don’t use it (think regulated professional bachelors), and who knows where the White Paper’s demand for meaningful exit awards at Levels 4 and 5 fits into all of this.

Maybe we shouldn’t be surprised that a regulator that oversees a meaningless and opaque medal system with a complex algorithm that somehow boils an entire university down to “Bronze”, “Silver”, “Gold” or “Requires Improvement” is keen to keep hold of the equivalent for students.

But killing off the dated relic would send a really powerful signal – that the sector is committed to developing the whole student, explaining their skills and attributes and what’s good about them – rather than pretending that the classification makes the holder of a 2:1 “better” than those with a Third, and “worse” than those with a First.

Jonathan Alltimes
2 hours ago

The awarded marks are a codification of a qualitative judgment with reference to the academic as the standard of measurement (there is no way of knowing for certain if their judgment is consistent from year to year). Merging marks implies merging qualitative judgments, as if the judgments are consistent among academics for different courses of study (paired second markers could be a check on the boundaries), which can not be proven and I think it is not so (collegiality may go some way to setting boundaries). The division of degrees into honours and ordinary was based on more specialisation, which… Read more »

Lecturer and external examiner
2 hours ago

There are at least three rational arguments in favour of weighting later years more heavily. Firstly, many programmes (esp. but not exclusively STEM) have very highly incremental learning, to the extent that final year assessments are implicitly re-examining earlier material – indeed, perhaps even examining that material more deeply as the student is expected to have improved their grasp of it with subsequent use. In this scenario, an equal weighting of year marks would skew the average heavily towards the earlier material. Secondly, there is a good case that the degree certificate is supposed to certify the student’s capability at… Read more »