Academic judgement? Now that’s magic

Jim Dickinson questions whether academic judgment can hold in the AI era, as flawed detection tools create a system where students can't challenge misconduct allegations

Jim is an Associate Editor (SUs) at Wonkhe

Every day’s a school day.

In my head, I thought I understood the line between what counts as “academic judgement” and what doesn’t in cases, processes, appeals and complaints.

It matters because my understanding has long been that students can challenge and appeal all sorts of decisions – right up to the Office of the Independent Adjudicator (OIA) in England and Wales – but not if the decision is one that relates to matters of academic judgment.

Thus in a simplistic coin sorter, “this essay looks like a 2:1 to me” can’t be challenged, but “they’ve chucked me out for punching someone when I didn’t” can.

I’ve often wondered whether the position can hold when we think about the interaction between consumer protection law – which requires that services be carried out with reasonable skill and care – and this concept of unchallengeable academic judgement in the context of workloads.

Back in the halcyon days of Twitter, I’d regularly see posts from academic staff bemoaning the fantasy workload model in their university that somehow suggested that a 2,000 word essay could be read, graded and fed back on in 15 minutes flat.

Add in large numbers of assignments dribbling in late via extensions and accommodations, the pressure to hit turnaround times generated by the NSS question, wider workload issues and moderation processes that look increasingly thin (and which were often shredded or thinned out even further during the marking boycotts), and I imagined a judge evaluating a student’s case saying something along the lines of “to deploy your magic get-out-of-jail-free card, sunshine, you’ll need to have used more… care.”

But that’s about an academic judgement being made in a way that isn’t academically defensible. I had a conversation with an SU officer this afternoon about academic misconduct, off the back of an OfS webinar on AI that they’d attended, and now I’m more confused than ever.

Do not pass go

The bones first. The Higher Education Act 2004 mandated a body that would review complaints to replace the old “visitor” system, and it includes a line on what will and won’t qualify as follows:

A complaint which falls within subsection (1) is not a qualifying complaint to the extent that it relates to matters of academic judgment.

The concept is neither further defined nor mentioned anywhere else in UK law – but has deep roots. In medieval universities scholarly masters enjoyed autonomous assessment rights, and it gained legal recognition as universities developed formal examination systems during the Enlightenment period.

By the 20th century, academic judgment became legally protected from external interference, exemplified by landmark cases like Clark v. University of Lincolnshire and Humberside (2000), which established that courts should not intervene in academic assessments except in cases of procedural unfairness:

This is not a consideration peculiar to academic matters: religious or aesthetic questions, for example, may also fall into this class. It is a class which undoubtedly includes, in my view, such questions as what mark or class a student ought to be awarded or whether an aegrotat is justified.

The principle that most understand is that specialised academic expertise uniquely qualifies academics to evaluate student performance and maintain educational standards, free from political or economic pressures.

The OIA takes the line in the legislation and further defines things as follows:

Academic judgment is not any judgment made by an academic; it is a judgment that is made about a matter where the opinion of an academic expert is essential. So for example a judgment about marks awarded, degree classification, research methodology, whether feedback is correct or adequate, and the content or outcomes of a course will normally involve academic judgment.

It also helpfully sets out some things that it doesn’t consider fall into the ambit:

We consider that the following areas do not involve academic judgment: decisions about the fairness of procedures and whether they have been correctly interpreted and applied, how a higher education provider has communicated with the student, whether an academic has expressed an opinion outside the areas of their academic competence, what the facts of a complaint are and the way evidence has been considered, and whether there is evidence of bias or maladministration.

I’m not convinced that that “what’s in, what’s out” list properly considers or incorporates the consumer protection law issue I discuss above – but nevertheless it hangs together.

In his paper on whether the concept will hold in an era of consumerism, David Palfreyman argues that academic judgment properly applies to subjective assessments requiring specialised expertise – grading student work, designing curriculum, and evaluating learning outcomes.

Educational institutions and courts generally consider these issues beyond external scrutiny to protect academic freedom and professional autonomy – because academics possess unique qualifications to make nuanced, context-dependent judgments about academic quality that outside parties lack the expertise to evaluate effectively.

On the other side of that see-saw, he argues that academic judgment should not shield factual determinations, procedural errors, or administrative decisions from review. Where the question is whether an institution properly applied its own rules or followed fair procedures, those issues fall outside protected academic judgment.

Religious or aesthetic?

But then back in the OIA’s guidance on its scheme rules at 30.4, there’s this curious line:

Decisions about whether a student’s work contains plagiarism and the extent of that plagiarism will normally involve academic judgment, but that judgment must be evidence based.

If that feels like a fudge, it’s because it is. “Whether” feels like a significantly different concept to “extent of that”, insofar as I can see how “did you punch the student” is about weighing up facts, but “how much harm did you cause” might require an expert medical judgement. But in a way, the fudge is topped off with that last sub-clause – what the OIA will insist on if someone uses that judgement is that they’ve used some actual evidence.

And if they haven’t, that then is a process issue that becomes appealable.

The problem here in 2025 is that 30.4 starts to look a little quaint. When someone was able to say “here’s one student’s script, and here’s another” with a red sharpie pointing out the copying, I get the sense that everyone would agree that that counts as evidence.

Similarly, when Turnitin was able to trawl both the whole of the internet and every other essay ever submitted to its database, I get the sense that the Turnitin similarity score – along with any associated reports highlighting chunks of text – counts as evidence.

But generative AI is a whole different beast. If this blog overused the words “foster” and “emphasize”, used Title Case for all the subheadings, and set up loads of sentences using “By…” and then “can…”, not only would someone who reads a lot of essays “smell” AI, it would also be more likely to be flagged by software that purports to detect whether I’d used it.

That feels less like evidence. It’s guesswork based on patterns. Even if we ignore the research on who “false flags” disproportionately target, I might just like using those phrases and that style. In that scenario, I might expect a low mark for a crap essay, but it somehow feels wrong that someone can – without challenge – determine that I’m “guilty” of cheating, and that I should therefore face a warning, a cap on the mark or whatever other punishment can be meted out.

And yes, all of this relates back to an inalienable truth – the asynchronous assessment of a(ny) digital asset produced without supervision as a way of assessing a student’s learning will never again be reliable. There’s no way to prove they made it, and even if they did, it’s increasingly clear that producing it doesn’t necessarily signal that they’ve learned anything along the way.

But old habits and the economics of massification seem to be dying hard. And so in the meantime, increasing volumes of students are being “academically” judged to have “done it” when they may not have, in procedures and legal frameworks where, by definition, they can’t challenge that judgement. And an evaluation of whether someone’s done it based on concepts aligned to religion and aesthetics surely can’t be right.

Cases in point

There’s nothing that I can see in the OIA’s stock of case summaries that sheds any light on what it might or might not consider to count as “evidence” in its scheme rules.

I don’t know whether it would take as its start point “whatever the provider says counts as evidence”, or whether it might have an objective test up its sleeve if a case crossed its desk.

But what I do know is just how confusing and contradictory a whole raft of academic misconduct policies are.

The very first academic misconduct policy I found online an hour or so ago says that using AI in a way not expressly permitted is considered academic misconduct. Fair enough. It also specifies that failing to declare AI use, even when permitted, also constitutes misconduct. Also fair enough.

It defines academic judgement as a decision made by academic staff regarding the quality of the work itself or the criteria being applied. Fair enough. It also specifically states that academic judgement does not apply to factual determinations – it applies to interpretations, like assessing similarity reports or determining if the standard of work deviates significantly from a student’s usual output. Again, fair enough.

But in another section there’s another line – one which says that the extent to which assessment content is considered to be AI-generated is a matter of academic judgement.

The in-principle problem with that for me is that a great historian is not necessarily an LLM expert, or a kind of academic Columbo. Expertise in academic subject matter just doesn’t equate to expertise in detecting AI-generated content.

But the in-practice problem is the thing. AI detection tools supplying “evidence” are notoriously unreliable, and so universities using them within their “academic judgment” put students accused of using AI in an impossible situation – they can’t meaningfully challenge the accusation because the university has deemed it unchallengeable by definition, even though the evidence may be fundamentally flawed.

Academic judgements that are nothing of the sort, supported by unreliable technology, become effectively immune from substantive appeal, placing the burden on students to somehow prove a negative (that they didn’t use AI) against an “expert judgment” that might be based on little more than algorithmic guesswork or subjective impressions about writing style.
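To put a rough number on that problem, here’s a minimal back-of-the-envelope sketch in Python. Every figure in it – the size of the cohort, the proportion of students actually using AI, and the detector’s hit and false-positive rates – is an assumption invented purely for the arithmetic, not a claim about any real tool or provider. The point is only that even an apparently accurate detector generates a sizeable pool of wrongly flagged students, and nothing in the flag itself tells a panel which pool any individual sits in.

def false_accusation_split(cohort, cheating_rate, sensitivity, false_positive_rate):
    """Return (wrongly flagged, correctly flagged) counts for a cohort of submissions."""
    cheats = cohort * cheating_rate
    honest = cohort - cheats
    correctly_flagged = cheats * sensitivity        # genuine AI use that gets caught
    wrongly_flagged = honest * false_positive_rate  # honest work flagged anyway
    return wrongly_flagged, correctly_flagged

# Assumptions for illustration only: 10,000 submissions, 10% involve undeclared
# AI use, the detector catches 90% of those and wrongly flags 5% of honest work.
wrong, right = false_accusation_split(10_000, 0.10, 0.90, 0.05)
print(f"Wrongly flagged honest students: {wrong:.0f}")   # 450
print(f"Correctly flagged AI users: {right:.0f}")        # 900
print(f"Share of accusations that are false: {wrong / (wrong + right):.0%}")  # ~33%

On those invented numbers, roughly a third of the students pulled into a misconduct process did nothing wrong – and they are precisely the people the “unchallengeable academic judgement” framing leaves with no route to contest the evidence.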

Policies are riddled with this stuff. One policy hedges its bets and says that the determination of whether such AI use constitutes academic misconduct is “likely to involve academic judgement”, especially where there is a need to assess the “extent and impact” of the AI-generated content on the overall submission. Oil? Water? Give it a shake.

Another references “academic judgment” in the context of determining the “extent and nature” of plagiarism or misconduct, “including the use of AI” – with other bits of the policy making clear that that can’t be challenged if supported by “evidence”.

One I’m looking at now says that the determination of whether a student has improperly used AI tools is likely to involve academic judgement, particularly when assessing the originality of the work and whether the AI-generated content meets the required academic standards. So is the judgement whether the student cheated, or whether the essay is crap? Or, conveniently, both?

Set aside for a minute the obvious injustices of a system that seems to be profoundly incurious about how a student has come to think what they think, but seems obsessed with the method they’ve used to construct an asset that communicates those thoughts – and how redundant that approach is in a modern context.

Game over

For all sorts of reasons, I’ve long thought that “academic judgement” as something that can be deployed as a way of avoiding challenge and scrutiny is a problem. Barristers were stripped of their centuries-old immunity from negligence claims based on evolving expectations of professional accountability in the 2000s.

In medicine, the traditional “Bolam test” was that a doctor was not negligent if they acted “in accordance with a responsible body of medical opinion” among their peers. But a case in the nineties (Bolitho) added a crucial qualification – the court must be satisfied that the opinion relied upon has a “logical basis” and can withstand logical analysis.

Or take accountancy. Prior to 2002, accountants around the world enjoyed significant protection through the principle of “professional judgment” that shielded their decisions from meaningful challenge, but the US Congress’s Sarbanes-Oxley Act radically expanded their liability and oversight following Arthur Andersen’s role in facilitating Enron’s aggressive earnings management and subsequent document shredding when investigations began.

Palfreyman also picks out architects and surveyors, financial services professionals and insurance brokers, patent agents and trademark lawyers, software suppliers/consultants, clergy providing counselling services, and even sports officials – all of whom now face liability for their professional judgments despite the technical complexity of their work.

As Palfreyman notes in his analysis of the Eckersley v Binnie case (which defined the standards for “reasonable skill and care” for professionals generally), the standard that a professional “should not lag behind other ordinarily assiduous and intelligent members of his profession” and must be “alert to the hazards and risks” now applies broadly across professions, with academics sticking out like a sore, often unqualified thumb.

Maybe the principle is just about salvageable – albeit that the sorry state of moderation, external examining, workload modelling and so on does undermine the already shaky case for “we know better”. But what I’m absolutely sure of is that extending the scope of unchallengeable decisions involving “academic judgement” to whether a student broke a set of AI-misconduct rules is not only a very slippery slope, but also a sure-fire way to hasten the demise of the magic power.

3 responses to “Academic judgement? Now that’s magic”

  1. Having started my academic career in an era before assessment criteria were made available to students, even before constructive alignment informed most assessment, I have witnessed how far academic judgement, quite rightly, has at least been given quite firm guard rails. Yes, there is always going to be subjectivity and not just in arts, humanities and social sciences. However, these days I am not aware of any UK HEI which does not insist that students are made aware of what they are to be assessed on and what constitutes a ‘good’ or a ‘poor’ piece of work. Moderation of assessed work has long been in place and these days is typically prepared for with calibration too. You might argue that the opinions of the marker, moderator (and third marker if required) remain subjective, but the whole process is much more robust against bias than it certainly was 30 years ago.

    Ironically, students still express a concern that they need to be taught by the very staff who will mark them so as to get to know what will grade best, whereas that is far less of a factor than it was in the past. While sometimes unpopular, anonymous marking has been another boost in this.

    In the 2020s students have the ability to see how decisions are arrived at and what should get them a good grade. It may not get them the grade they feel they deserve but following the guidance should be enough to secure a (decent) pass. Perhaps the challenge is that all this information around assessments is not necessarily employed to best effect. In many ways it is feedforward even before the assessment has to be completed, but if students do not make use of it then the decisions made about their submissions still might appear to have come from ‘magic’. From that stems a sense of arbitrariness and thus naturally resentment. It is ironic that this was probably more applicable back in the 20th Century when staff did say things like ‘this feels like a 2.1’ than now.

    The rapid and widespread acceptance of constructive alignment for modules and especially their assessments has required a rise in assessment literacy on the part of both staff and students, and for both sides this has not always been explicitly recognised. Assessment might still have some commonalities with baking, but despite concerns about ‘unfair’ marking, effective engagement with the process by all concerned certainly moves us well away from ‘magic’.

  2. One challenge in terms of ‘evidence’ of academic misconduct came about once Turnitin came into common usage in such cases. As noted here, it shows similarity rather than pointing to plagiarism per se. With so many pieces of work being uploaded daily, common phrases for certain topics turn up constantly. Conversely, it often cannot ‘reach’ sources behind paywalls, so depends on staff familiarity with them. In addition, for some subjects, notably Law, a low similarity match can indicate a poor piece of work not citing the necessary cases. Turnitin is probably best at highlighting collusion, especially across institutions.

    What I noted was that as Turnitin came to be seen as the prime evidence of academic misconduct, any lack of such evidence was deemed to be ‘proof’ that no offence had been committed. Students would often turn up with reports from some other matching software or wrangle about the percentages of matching that constituted an offence. This may now also be a challenge when students are accused of using AI. The requirement for staff to ‘prove’ the offence really undermines the academic judgement they draw from their knowledge of the content and of the students’ work.

    I have always stated that packages like Turnitin are only ‘tools’ to aid academic judgement. Staff could have suspicions about a piece of pristine contract-cheated work but not be able to show its source, even after looking at document history or hidden characters. Thus, they put steps in place like questioning the student about what they knew of what they had submitted. The onus then falls to the student to ‘prove’ this is their work, not by producing a series of drafts, but by simply showing that they have the knowledge and skills they claimed to possess when they submitted the work.

    At the end of the day that is the role of assessment, to show that the student has achieved the learning outcomes. Unfortunately the focus has now shifted more towards treating the submitted work as a kind of product, the authenticity of which has to be proven, detached from what that assessment should actually be demonstrating.

  3. Yes, the ‘magic’ of judicial deference to the PROPER exercise of academic judgment is universal across jurisdictions and I am aware of only two court cases where it has been challenged as a concept – one under the Law of Louisiana by Judge Plotkin and one under English law (where in the latter case a bemused judge did query its reliability when told that a series of markers for a contested Criminal Law exam paper had managed to grade it variously from Fail to First!). Note PROPER – for instance, the examiners/assessors should not be piddled or ga-ga; they should be reasonably competent professionals but need not be top-notch; they should not show bias or discriminate; and the exam board must be able to add up the marks accurately! DP.
