It looks like you’re writing a disciplinary letter. Would you like help?

I chaired a fascinating debate at the University of London on Monday night on AI in education.

Jim is an Associate Editor at Wonkhe

It was part of an exciting programme of events that UoL is staging in the coming months on the future of higher education.

One of the things that struck me on the way home was the sheer scale of the threats and opportunities that sit on either side of the same generative AI coin.

I hadn’t really considered, for example, the way in which the tech might be able to facilitate access to higher education for non-native speakers and writers.

But are there downsides?

We don’t know what’s around the corner

To my considerable surprise, over the past month or so it’s emerged that several universities have been using GPT detection software to try to catch cheating students.

When Turnitin’s AI detection software launched, I figured the furore around its functionality would be enough to cause most universities to call for it to be disabled – and that’s largely what happened.

Nevertheless, some have been using it, and others have been using similar tools that purport to detect AI-written text.

Some are taking scores from the software direct to panels as the core piece of evidence. Some are using it as a trigger for an allegation and parallel investigation that is often drawn-out and hugely stressful for the student, only for the case to be dropped when SUs or lawyers start asking difficult questions.

Some are using “deal or no deal” principles, like a parking fine. Accept this penalty now – one that won’t end your student career – or we’ll investigate, and then the punishment could be worse.

I’m fairly confident that if someone has used GPT tools, the detectors will likely give the text a high score. That’s because the software fundamentally asks the same question that GPT itself asks – what is the most likely next word or sentence?
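As an illustrative sketch only – a toy model with invented word probabilities, not any vendor’s actual method – a perplexity-style detector scores how “expected” each word is given the one before it, and flags text that keeps choosing the expected option:

```python
# Toy next-word model: these probabilities are invented for illustration.
# Real detectors use a large language model's probabilities instead.
NEXT_WORD_PROB = {
    ("the",): {"cat": 0.4, "dog": 0.4, "aardvark": 0.2},
    ("cat",): {"sat": 0.7, "pondered": 0.3},
    ("dog",): {"sat": 0.6, "pondered": 0.4},
}

def predictability(words, fallback=0.05):
    """Average per-word probability under the toy model.

    Higher scores mean the text keeps choosing the 'expected' next word,
    which is exactly what perplexity-style detectors read as AI-like.
    Unknown contexts get a small fallback probability."""
    probs = []
    for prev, word in zip(words, words[1:]):
        probs.append(NEXT_WORD_PROB.get((prev,), {}).get(word, fallback))
    return sum(probs) / len(probs)

# A "safe", conservative word choice scores as more predictable...
conservative = "the cat sat".split()
# ...than a rarer but equally valid choice.
creative = "the aardvark pondered".split()

print(predictability(conservative) > predictability(creative))  # prints True
```

The point of the sketch is that nothing here measures authorship – it measures conventionality, which is why conservative-but-human writing can score “AI-like”.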

The problem is false flags. What if the software flags a bunch of text that wasn’t GPT-generated? What’s the false detection rate? The firms flogging the software tend to be coy about that.

Think of the early Covid tests – a false positive, coupled with the requirement to isolate after a positive result, posed a real risk of unnecessary isolation.
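To see why that matters at scale, here’s some back-of-the-envelope arithmetic. All the rates below are invented for illustration – as noted above, the vendors don’t publish them:

```python
# Illustrative numbers only - detector vendors rarely publish these rates.
false_positive_rate = 0.01   # 1% of honest essays wrongly flagged
true_positive_rate = 0.90    # 90% of AI-written essays caught
cheating_rate = 0.05         # assume 5% of submissions actually used AI

submissions = 10_000
cheats = submissions * cheating_rate          # 500 essays
honest = submissions - cheats                 # 9,500 essays

flagged_cheats = cheats * true_positive_rate  # 450 genuine catches
flagged_honest = honest * false_positive_rate # 95 innocent students flagged

# Of everyone flagged, what share is actually innocent?
share_innocent = flagged_honest / (flagged_cheats + flagged_honest)
print(f"{flagged_honest:.0f} honest students flagged "
      f"({share_innocent:.0%} of all flags)")
```

Even with a seemingly small 1 per cent error rate, roughly one in six of everyone flagged is innocent – and that share grows as the real rate of cheating falls.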

Of course, an even bigger risk is if the false flags end up being unevenly distributed amongst student characteristics.

Even an allegation can be devastating for a student, especially if they’re already less likely to have a support network, or if the stakes are higher because of, say, the level of investment from a family or the proximity to a final mark or grade.

That’s why this new research led by James Zou, an assistant professor of biomedical data science at Stanford University, is so important.

Toeful of Asha on the 45

In the study, 91 essays written for the Test of English as a Foreign Language (TOEFL) by non-native English speakers were uploaded to seven GPT detectors to see how well they performed.

More than half were flagged as AI-generated – with one program flagging 98 per cent of the submissions as composed by AI. But when essays written by native US English-speaking eighth graders were uploaded, detectors classed more than 90 per cent of them as human-generated.

This makes loads of sense. When we learn a new skill, we are quite mechanical at the stage of conscious competence, before we reach the confidence of unconscious competence.

Like when you’re driving. Mirror, signal, manoeuvre, feed the wheel, and all that.

Effectively the research found that non-native writers were being more conservative and predictable in how they used language and structure.

That has some quite serious implications. As well as false, unfounded or unprovable allegations, the researchers point out that if non-native writing is consistently flagged as GPT, we may ironically end up pushing non-native writers to use GPT to refine their vocabulary and linguistic diversity so that they sound more native.

And given non-native speakers may use GPT legitimately as a way to improve their English, they’re subsequently more likely to adopt grammatical structures that are common in GPT models as their own.

This all raises huge and serious questions about the use of those tools. Even if accompanied by investigations, these are all the problems we’re familiar with from “stop and search”, only on steroids – and it would almost certainly represent a new artefact of institutional racism. It has to stop.

This isn’t just about international or non-native speaking students either. Plenty of students are less confident than others, and so display less creativity/flair in their writing.

We see this a lot at Wonkhe with new writers. There’s a tendency to include phrases like “In conclusion”, already a classic GPT trope, where we would encourage the writer to signal in more creative ways that they’re summing up.

And plenty of students who are recruited on potential rather than traditional academic attainment will be less skilled at that kind of writing.

That’s not a reason to avoid marking them down. I can see why a university would want to grade writing appropriately and develop better writing in its students over time, across a breadth of subject areas.

But giving a less confident or skilled writer a lower grade or feedback aimed at improvement is one thing – accusing a student of cheating is quite another.

And then “deal or no dealing” a penalty with the threat of something worse if the scary and opaque processes still find you “guilty” on the balance of probabilities is another thing again.

Stop and ask searching questions

That universities rarely publish stats on the characteristics of students who are most likely to be accused of misconduct – both academic and non-academic – is already troubling.

I sort of get not wanting there to be a league table. But even internally, I know this is not something that’s routinely tabled for discussion.

Not doing so runs the risk of maintaining a “racist stop and search” type culture. And not doing so also prevents targeted approaches at supporting students to avoid academic misconduct in the first place.

Given that by September, Microsoft Word will incorporate AI-powered spelling, grammar and next word/sentence suggestion tools based on what’s already been written, the continued assumption that we’ll be able to use AI to detect AI is turning into a dystopian arms race where the only winners will be the tech bros in Silicon Valley.

Time spent on deeply discriminatory academic misconduct goose chases would almost certainly be better deployed on reinventing assessment for an AI age. And if nothing else, those using AI tools in this way to uphold academic “integrity” should probably look up the word’s meaning.


2 responses to “It looks like you’re writing a disciplinary letter. Would you like help?”

  1. “That universities rarely publish stats on the characteristics of students who are most likely to be accused of misconduct – both academic and non academic – is already troubling … even internally, I know this is not something that’s routinely tabled for discussion.”

    I don’t recognise this. We don’t publish it, but we always include it in analysis, and we aren’t the only ones doing this. Numbers are generally too small to draw much in the way of conclusions, but students from underrepresented groups aren’t overrepresented in the data (to date).

    We also aren’t using AI detection software though, nor have we forbidden use of AI software outright. The challenge (for us) will be to give clarity on what is and isn’t appropriate in different circumstances.

    1. “students from underrepresented groups aren’t overrepresented in the data (to date).”

      It would be interesting to see if this changes when/if AI detection software is used, though.
