Attainment, assessment and the sawtooth effect

A thoughtful pair of Ofqual reports on the impact and management of assessment changes gives the admissions debate a lot to consider.

David Kernohan is Deputy Editor of Wonkhe

Ofqual has been thinking hard about the meanings we attach to assessment at GCSE and A level.

A superb pair of reports by Ofqual’s Paul Newton – one on the “sawtooth effect” itself, and one on the way standards are maintained during periods of qualification reform – acts as a path into contemporary debates about academic standards, with particular relevance to the current interest in university admissions based on achieved qualifications rather than predictions.

A is A?

Though A level results, especially, hold the status of a gold standard in understanding the attainment and potential of prospective students, there are quite a few caveats that often go unspoken.

For instance, an exam is a sampling methodology – designed to infer overall mastery of a curriculum (which is itself a sample of the entirety of a domain of knowledge) from responses to questions probing a subset of that curriculum. This makes sense: an exam that covered the entirety of the A level English Literature curriculum would take several days to complete.

Drawing on the famous metaphor developed by Gilbert Ryle, the report argues that to truly understand how well someone can hit a target with a rifle you need to observe them attempting to do so several times under different conditions. If you make just one observation you are more likely to see performance attributable to a fluke (she may miss entirely, or hit the bullseye, for reasons entirely unconnected to proficiency).

So performance (how a person did on one occasion) is necessarily different from attainment (how well a person has mastered a domain of knowledge) – it’s the particular versus the general.
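To make that distinction concrete, here’s a minimal sketch in Python of Ryle’s marksman – the hit rate and sample sizes are my own toy numbers, not anything from the report:

```python
import random

random.seed(42)

TRUE_PROFICIENCY = 0.7  # the underlying "attainment": long-run hit rate (illustrative)

def observed_hit_rate(n_shots: int) -> float:
    """Simulate n_shots attempts; each hits with probability TRUE_PROFICIENCY."""
    hits = sum(random.random() < TRUE_PROFICIENCY for _ in range(n_shots))
    return hits / n_shots

# A single observation ("performance") is all-or-nothing – a fluke either way:
print("one shot: ", observed_hit_rate(1))    # prints 0.0 or 1.0
# Repeated observation converges on the underlying proficiency ("attainment"):
print("100 shots:", observed_hit_rate(100))  # prints something close to 0.7
```

A single three-hour exam sits closer to the one-shot end of that scale than we usually like to admit.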

What about the sawtooth?

There are ways to prepare for a particular assessment. Teachers who know how questions are likely to be asked, or who can make professional predictions about what may come up, shape the preparations and expectations students bring to the exam hall. Teachers are better at this where exam patterns are established – they do worse at this kind of preparation immediately after a major exam reform.

Many people argue that this kind of preparation disguises the “true” level of attainment of a particular cohort; others (usually diametrically opposed ideologically, but arguing from the same principles) argue that examination reform makes the first cohort after a change look like they have attained less than they probably have.

The combination of these effects can be described as the “sawtooth effect” – a sharp dip in performance immediately after a major change, a slow climb thereafter until the next major change.
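A rough illustration – a toy model of my own, not Ofqual’s – treats raw performance as stable underlying attainment plus a familiarity bonus that accumulates each year and resets to zero at a reform:

```python
# Toy model of the sawtooth: raw performance = attainment + familiarity,
# where familiarity with the assessment format climbs each year and resets
# when the qualification is reformed. All numbers are illustrative.

ATTAINMENT = 60.0        # assumed stable underlying attainment (raw marks)
FAMILIARITY_GAIN = 1.5   # marks gained per year of teacher/student familiarity
REFORM_YEARS = {2000, 2008, 2015}

familiarity = 0.0
for year in range(2000, 2020):
    if year in REFORM_YEARS:
        familiarity = 0.0            # the dip: preparation knowledge is lost
    print(year, round(ATTAINMENT + familiarity, 1))
    familiarity += FAMILIARITY_GAIN  # the slow climb until the next reform
```

Plot those numbers and you get the teeth of a saw: sharp drops at each reform, slow climbs in between.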

Seeing a saw

If you look at, say, the spread of A level grades within each cohort between 2010 and 2020 you will not see this effect, despite the radical changes in examination practice between 2015 and 2018. This is because, by the time we see the publication of this kind of data about A level grades, performance has been normalised – what the paper describes as the Comparable Outcomes approach.

“When qualifications change, we follow the principle of comparable outcomes – this means that if the national cohort for a subject is similar (in terms of past performance) to last year, then results should also be similar at a national level in that subject”

So attributing a grade to each student’s actual exam performance (in terms of a raw mark) relies on the performance of the whole cohort (and of previous cohorts). If you got 68 per cent of your exam paper right you might have received a C in 2013, an A in 2018, or a D in 2016. This approach mitigates the impact of the sawtooth effect, but arguably further weakens the link between what we know about the degree of mastery a given student has and the grade they get. A grade is much more like a place in a ranking than a measure of potential, and that presents a problem.
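A crude sketch of what that means in practice – my own simplification of Comparable Outcomes, with invented cutoffs and marks – attaches grades to a student’s rank within the cohort, so the boundaries move with the cohort while the grade distribution stays put:

```python
def grade(cohort_marks: list[float], mark: float) -> str:
    """Assign a grade by rank within the cohort, holding the grade
    distribution fixed year on year (a crude Comparable Outcomes).
    Top 20% get A, next 30% B, next 30% C, rest D. Illustrative cutoffs."""
    percentile = sum(m < mark for m in cohort_marks) / len(cohort_marks)
    if percentile >= 0.80: return "A"
    if percentile >= 0.50: return "B"
    if percentile >= 0.20: return "C"
    return "D"

easy_year = [55, 60, 64, 70, 74, 78, 82, 86, 90, 94]  # hypothetical cohorts
hard_year = [30, 35, 40, 44, 48, 52, 56, 60, 64, 68]

print(grade(easy_year, 68))  # mid-ranking mark in an easy year -> "C"
print(grade(hard_year, 68))  # same raw mark in a hard year     -> "A"
```

The same raw mark of 68 lands a C in one year and an A in another, purely because the surrounding distribution moved – which is the point.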

What A levels are for

A levels sit squarely within the Comparable Outcomes tradition, but are typically used to support progression into higher-level courses. The Ofqual paper argues that Comparable Outcomes-informed approaches are really best at managing the transition between different forms of assessment (the dip in the sawtooth, in other words).

Has the time therefore come to end this practice for A levels? Maybe. In 2020, all four UK systems departed from the standardised distribution imposed on “normal” A level grades in favour of Centre Assessed Grades, which drew (primarily) on teacher predictions informed by in-class performance. This meant that, for the first time in a decade, we saw what commentators love to call “grade inflation”, as the number of students allocated higher grades (particularly students from disadvantaged backgrounds) increased.

This was because the move abandoned both the Comparable Outcomes tradition (which can be understood as norm-referenced grading) and the Similar Cohort Adage (which brings previous performance linked to cohort demographics into play). It was the latter that caused the problems characterised by Boris Johnson’s language of “mutant algorithms” – while there is some intellectual weight to the argument that A levels are a ranking rather than a pure assessment, and thus that individual exam results do not, in essence, matter, there is little support for the idea that you should do less well in your exams because you have experienced socio-economic disadvantage.

In a regular, non-plague, year the “similar cohort adage” would multiply existing differences in attainment – when we removed it we saw better results based on teacher assessments, which were in turn based on teachers’ understanding of the kind of assessment students were likely to face and how they might perform.
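To see why anchoring grades to a centre’s history causes that problem, here’s a deliberately crude caricature – emphatically not Ofqual’s actual 2020 standardisation model, and with an invented weighting – in which a teacher-assessed grade is blended with the centre’s historical average:

```python
# Crude caricature (not Ofqual's actual 2020 model) of grading anchored to a
# centre's historical results: each student's teacher-assessed grade is pulled
# toward the centre's past average, whatever the student themselves achieved.

GRADES = ["E", "D", "C", "B", "A"]  # low to high, indexed 0..4

def standardise(teacher_grade: int, centre_history_avg: float,
                weight: float = 0.6) -> str:
    """Blend this year's teacher judgement with the centre's past average.
    weight is the share given to history -- illustrative, not calibrated."""
    blended = (1 - weight) * teacher_grade + weight * centre_history_avg
    return GRADES[round(blended)]

# Two equally able students, both judged a B (index 3) by their teachers:
print(standardise(3, centre_history_avg=3.5))  # historically strong school -> "B"
print(standardise(3, centre_history_avg=1.5))  # historically weak school   -> "C"
```

The second student’s grade drops for reasons that are entirely about where they studied, not what they know – the complaint at the heart of the “mutant algorithm” row.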

I’m labouring a point here, and it is a point I have made before – predicted grades contain more information about student mastery (because they are based on multiple, lower stakes, assessments) than the actual (post-norm-referencing) grades themselves. It would be a shame to lose that depth of information in pursuit of asserting that a single measurement means something it doesn’t.
