The post-election regrouping of the government provides a surprising and welcome opportunity to rethink the TEF before it does too much damage, and before the shaping of behaviours and the unplanned consequences become entrenched.
Let’s consider what might happen if the government took that opportunity and replaced TEF 2 (2017), whose first results will be announced tomorrow amid celebrations and wailing, with a much better TEF 3 (2018) for implementation next time round.
TEF 3 would need to start from three assumptions. First, for better and for worse we live in an era of performance management in higher education. This is dictated ultimately by public accountability, and it is an unanswerable requirement. We need performance management, not to create false differentiation between providers in order to make an artificial quasi-market look real, but because outcomes in fundamental areas matter. This means that the new public management genie won’t be pushed back into the collegial bottle, whether in research or in relation to teaching and learning. However, systems of performance management can be better or worse and should be judged by the effects they have. The task of system design is to render any new teaching and learning performance measures as valid as possible in the light of educational and social objectives, and with the least negative consequences.
Second, in the UK we are excellent at collecting comprehensive data on higher education, but less good at interpreting this data. Correlations are all too quickly turned into causes and simplistic explanations abound, such as ‘higher education generates graduate earnings’. Simple mantras can generate major distortions of the real pattern of disciplinary and institutional performance, when they become systematised into reputation judgments and funding outcomes.
Third, UK higher education is closely regulated by a potent mechanism for measuring research outcomes, the REF, while global university reputation is driven above all by standing in research publication and citation. Research-related rankings shape the sector. If we truly want the education function to gain a status equivalent to research, and to genuinely place students at the centre, we need a viable method for assessing teaching and learning, one that can evolve and improve over time (and keep a step ahead of the inevitable gaming).
Creating the virtuous circle
The key to achieving the best possible and least damaging performance management system is to create a virtuous circle between real outcomes, performance measures and the resulting competitive position. Any ranking worth its salt must be grounded in real outcomes. Improvements in real performance, in a comparative sense, must generate improvements in the relative position of the institution or discipline. If they do not, the virtuous circle has not been established and there is no incentive to improve.
For example, in the case of measures of university reputation based on surveys of academic staff, as used in some ranking systems, the virtuous circle is absent. These rankings, which reward the established university names regardless of actual effort, are not worth the paper on which they are printed. In a ranking system like this universities have incentives to boost their marketing, not their research or teaching and learning.
Yet measures of research performance based on the volume of high citation papers produced by a university do tell us about real outcomes, and arguably, that kind of ranking has driven growth in the quantity and quality of research. Likewise the quality assessment in the REF fosters improvements in research, all else being equal, and this is a primary reason why UK universities achieve stellar research outcomes from a modest national outlay on research and development. Perhaps the principal problem with the REF impact measure is that it is unclear whether the evidence-base captures real impacts, rather than simulations of impact, and thus it is unclear whether the virtuous circle has been created.
The limitations of TEF’s proxies
In TEF 2 there is no virtuous circle. The bottom line is that there is no guarantee that positive effects on outcomes and behaviour will be generated, leading to better teaching and learning in UK higher education. This is because, as designed, the indicators used in TEF 2 do not measure teaching and learning. They measure proxies, and there is no necessary relation between the proxy outcomes and the actual teaching and learning outcomes. TEF 2 is a competition game that, despite the best intentions, is decoupled from real classrooms and real learning achievement.
Graduate employment rates and salaries tell us about many things at once. They are governed by a mix of elements: the social backgrounds of students and their continuing impact on the career through social networking, the formal educational experience, the extra-curricular experience, the macro-economy and the local micro-economy, and so on. Any statistical attempt to separate the different causal elements, which continually affect each other, and isolate the singular ‘effects’ of teaching and learning, is ultimately artificial and assumption-driven.
This means that not only do outcomes related to graduate employment fail to identify educational effects accurately, but comparisons of performance over time, or between different institutions or different disciplines, have no intrinsic merit either. This data has secured a hold on the public imagination based on the mantra that ‘education causes earnings’, but educationists know that this mantra is seriously misleading. Graduate outcomes are very important, and providers do affect those outcomes, but graduate returns data should not be hijacked for the purpose of false measures of what has been learned.
The use of student satisfaction data is also problematic. There is a long history of studies in social choice theory about the conditions under which inter-personal comparisons of welfare are valid, and the NSS does not meet those conditions. Student satisfaction surveys are important in themselves, telling us something we need to know. It’s just that we cannot validly use them to measure something other than satisfaction. A key problem here is that by importing student satisfaction measures into the TEF problematic, with its financial drivers, we create an incentive for institutions, and their individual disciplines, to dumb down the academic demands of programmes so as to boost student survey outcomes. The American study by Richard Arum and Josipa Roksa, published as ‘Academically Adrift’, found some evidence that this had happened following the longer use of student satisfaction surveys and student ratings of academic faculty in the United States.
In short, two of the principal measures to be used in TEF 2 will not lead to a necessary improvement in teaching and learning, and they are likely to have perverse effects. These are very serious weaknesses.
Making TEF better
What then should TEF 3 do? Inescapably, TEF 3 needs to be based on solid, well worked through measures of actual student learning outcomes, including value added measures. Proxies won’t work and there are no short cuts. Generic measures of academic learning can tell us relatively little because in higher education learning occurs through immersion in complex domains of knowledge.
Measuring learning outcomes within one discipline in one institution is straightforward. The challenge is to devise a system which enables comparability across disciplines and between different kinds of institution, operating in varying contexts. This is difficult but not impossible. In Europe, the collaborative Tuning project on learning outcomes is currently testing measures of learning across a range of discipline clusters, and in different kinds of institutions, in several countries. It is not beyond the wit of UK higher education to do something similar. What is needed is hard, patient graft over time to devise measures of teaching and learning that will work, with the minimum negative effects on either student performance or inter-institutional fairness, and that are well piloted before they are implemented.
This is interesting, but could you provide a link to the Tuning work on learning outcomes? Google got me to http://www.unideusto.org/tuningeu/teaching-learning-a-assessment.html, but there didn’t seem to be anything there that could be operationalised in TEF 3.
I don’t understand the comments on the NSS. Firstly, the US evidence is a red herring: it’s old, and comes from a very different environment (and in the above, student satisfaction is bundled in with Rate My Professor et al). The NSS is a national and research-informed exercise. Secondly, though there is a headline element of ‘satisfaction’ in the NSS, it captures a lot more than that.
I’d be interested to know which of these questions (from NSS 2017) the ‘educationists’ referenced think aren’t usable proxies for good teaching, and why a good performance on these questions indicates dumbing down:
Are staff good at explaining things? Have they made the subject interesting?
Is the course intellectually stimulating, and has it challenged the student to achieve her best work?
Have the criteria used in marking been clear in advance?
Are marking and assessment fair?
Has feedback been timely, and have comments been helpful?
Has the student been able to contact staff when she needed to?
Has she received sufficient advice and guidance on the course, and good advice when making study choices?
In response to Stephen. None of these questions elicits data on what students have actually learned during their programmes. What they say they have learned, and whether they consider the learning experience ‘good’, ‘interesting’, ‘stimulating’, ‘challenging’, ‘clear’, ‘fair’, ‘helpful’, ‘sufficient’, etc. (you can pick any other slippery, ambiguous, value-laden signifier you like) remain trapped in the realm of subjective opinion. There is no necessary consistency between variations in subjective self-reporting and variations in objective learning.
You could ask a group of people to estimate the distance between the earth and the moon and take the average of the guesses as the truth. But it wouldn’t be very good astronomy. At least, though, that question would be about the material world. If you asked them how far it was on a Likert scale of 1-5, from ‘a very long way’ to ‘not so far at all’, that would tell you even less. Why do we have to run educational assessment using business research methods designed for customer marketing? Why can’t we design measures of actual learning outcomes, for comparative purposes? The OECD can do it for 15-year-olds in three domains in the PISA programme. It’s harder at the higher education stage because of the specialised character of knowledge, but it is not impossible.