Saturday, 9 August 2025

Good for the Goose? Norman Tebbit's Other Test

When I was a boy in the early 1980s, Norman Tebbit was an almost permanent feature on the news and airwaves. He was, from 1981 to 1987, a member of Margaret Thatcher’s cabinet. He was considered by his critics to be uncompromising, unforgiving, and unyielding; but for those same qualities he was as loved as he was loathed.



If there's one single test associated with Norman Tebbit it is undoubtedly his infamous 1990 cricket test. Coined by him, it arose because of the perceived lack of loyalty shown by South Asian and Caribbean immigrants to the English cricket team. Tebbit controversially suggested that the test should be applied to the children of these immigrants to assess whether they had genuinely integrated into this country.

However, in this post, I'm going to focus on an entirely different test: one that Tebbit failed twice, and one which, I hope, illuminates the assessment principle of reliability and one of the responses to unreliability: generosity.

Upwardly Mobile


I'm going to begin by relating a story that Tebbit tells in his 1989 autobiography Upwardly Mobile. The date is 1953, and Tebbit was flying high as an officer in the Royal Auxiliary Air Force. He was an ambitious young man and he wanted a job as a pilot for the British Overseas Airways Corporation (BOAC).

At the time, BOAC policy was to train its pilots to be multi-skilled: able to fly the plane and, if necessary, navigate it as well. Consequently, to become a fully-trained pilot Tebbit had to become a fully-trained navigator. The technical examination for the navigation licence was, he tells us, of a very high academic standard with, all told, twenty weeks of intensive instruction and many hours of swotting at home.

Thankfully, he passed every component of the exam except one: the signals practical. That’s not a bad outcome; he’d learned the necessary discipline of self-study, and now all he had to do was focus his attention on the one outstanding component. The failed assessment required candidates to send and receive Morse code at six words per minute, sent by an Aldis lamp or flashed by airfield beacons. Tebbit says that he found the six words per minute requirement difficult enough without the added difficulties of the medium. And so he failed his second attempt as well.

Not surprisingly, he was nervous when he went for his second re-sit. He took his seat in a large hall with forty other candidates, and to his great surprise the examiner called him out to the front.

—You know the rules, don’t you?
—Yes Sir.
—If you fail any subject three times, including this practical one, you have to retake the complete examination.
—Yes Sir, I do.
—Then you mustn’t fail, must you?
—No Sir, except, you see Sir, I’m not much good at this.
—I know that, so if you miss anything give me a nod and I’ll repeat it - oh and if all else fails I’ve put you next to an RAF signaller!

Needless to say, he passed. 
Or did he? 
Did he pass, or was he passed?

Unreliability and Generosity


Let’s consider the helpful examiner’s actions. How would you describe what he did? Cheating? Forgiving? A lowering of the academic standard? A helpful hand-up to someone who was obviously capable of flying a plane? Does the answer depend on your political prejudices?

Reliability is defined by the Cambridge Assessment Network as the extent to which the results of an assessment are consistent and replicable. So if an assessment is highly reliable it means that if a student took a different version of the test, or if a different examiner marked the test, they would get the same, or very nearly the same, result. I’ve written more about reliability here.
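To make "consistent and replicable" a little more concrete, here is a minimal Python sketch. It assumes the classical test theory picture in which each observed mark is a stable "true" score plus random error; the function names, cohort size, and spreads are illustrative assumptions of mine, not anything taken from the Cambridge glossary. It simulates one cohort sitting two versions of the same test and reports the correlation between the two sets of marks, which is one common way of estimating reliability.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of marks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def parallel_forms_reliability(error_sd, n_students=5000, true_sd=1.0, seed=1953):
    """Simulate one cohort sitting two versions of a test.

    Assumption: observed mark = true score + independent random error.
    Returns the correlation between the two sets of observed marks.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n_students)]
    form_a = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    form_b = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return pearson(form_a, form_b)


if __name__ == "__main__":
    # The noisier the marking, the less the two versions agree.
    for error_sd in (0.25, 0.5, 1.0):
        print(error_sd, round(parallel_forms_reliability(error_sd), 2))
```

Under these assumptions the correlation should sit close to true_sd² / (true_sd² + error_sd²), so it falls as the error grows: the same students, with the same underlying ability, start to receive noticeably different results from one sitting to the next.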

However, when assessors consider an assessment to be unreliable, generosity can be used as a compensatory technique. Did the examiner think it ridiculous that pilots should be fully-trained navigators as well? Or did the examiner disagree with the three-strikes rule? Did he think it overly harsh, an inefficient waste of everyone’s time?

On using Generosity to Combat Unreliability


Tom Benton, in his Cambridge Assessment Network paper On using Generosity to Combat Unreliability, begins from an acceptance that no assessment system is perfect and that, by focusing on the risk for individual students, we might logically decide how much generosity is required. In particular, Benton examines how to adjust assessment grading when reliability decreases, such as during unforeseen events like exam cancellations. (The COVID lockdown carry-on is the obvious context.) The article proposes different strategies for setting grade boundaries, including maintaining the original grade distribution, ensuring no student is disadvantaged by being awarded a lower grade than deserved, or maximizing overall classification accuracy. It introduces concepts like "true scores" versus "observed scores" and quantifies the impact of various reliability levels on misclassification rates across different grades. It can all get statistically complex, but ultimately Benton argues that a logical application of "benefit of the doubt" can lead to justifiable changes in grade distributions during periods of lower assessment reliability.
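The link between reliability and "benefit of the doubt" can be sketched in a few lines of Python. This is a simplified illustration of my own, not Benton's actual method: it assumes normally distributed measurement error with the textbook standard error of measurement, SD × √(1 − reliability), and the function names and example numbers are hypothetical. It asks how likely it is that a student whose true score sits just above a grade boundary is awarded the lower grade, and how much that risk shrinks if the boundary is lowered.

```python
import math
from statistics import NormalDist


def standard_error(reliability, observed_sd=1.0):
    """Standard error of measurement under classical test theory."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in the range [0, 1)")
    return observed_sd * math.sqrt(1.0 - reliability)


def prob_under_graded(true_margin, reliability, generosity=0.0, observed_sd=1.0):
    """Chance that a deserving student lands below the boundary.

    true_margin: how far the student's true score is above the boundary.
    generosity:  how far the boundary is lowered (same units as the score).
    Assumption: observed score = true score + normally distributed error.
    """
    sem = standard_error(reliability, observed_sd)
    # The student is under-graded when the error drags the observed score
    # below the (possibly lowered) boundary.
    return NormalDist(0.0, sem).cdf(-(true_margin + generosity))


if __name__ == "__main__":
    margin = 0.3  # true score 0.3 standard deviations above the boundary
    for reliability in (0.9, 0.7):
        for generosity in (0.0, 0.2):
            risk = prob_under_graded(margin, reliability, generosity)
            print(reliability, generosity, round(risk, 3))
```

The sketch only looks at one side of the ledger. Lowering the boundary by the same amount also raises the chance that a student whose true score is below the standard is passed, which is exactly the trade-off that the choice between boundary-setting strategies has to weigh.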

This positive light on generosity has implications for how it is managed. Should it be left to individual examiners, as it seems in the case of the signals practical resit, or should institutional authorities transparently bake generosity into their assessment strategy?

One might easily think that the answer would depend upon the social significance of the assessment. Is generosity easier to tolerate in a low-stakes school assessment than in an assessment taken by an apprentice electrician, where the outcome might be that he electrocutes himself, or worse, others, as a consequence of being passed rather than passing? It’s probably worth remembering that Tebbit’s examination was to allow him to fly passenger jets. There are no easy or obvious answers.

‘You're certainly relatively competent’: assessor bias due to recent experiences


One further interesting area of possibility is to be found in Yeates et al.'s academic paper, ‘You're certainly relatively competent’: assessor bias due to recent experiences. This research paper sheds an interesting light on the psychology of how assessors come to make biased judgements.

The study's context is medical education, where inter-rater score variability is a known challenge. The researchers conducted an experiment with consultant doctors assessing videos of medical trainees. They split the assessors into two groups: one viewed trainee performances in descending order of proficiency (good, then borderline, then poor), and the other viewed them in ascending order (poor, then borderline, then good).

The study found significant "contrast effects" in assessors' judgement. This means that assessors who had recently seen better performances tended to give lower scores to subsequent candidates, while those who had recently seen poorer performances tended to give higher scores to subsequent candidates. The bias was present in perceptual judgements and in judgements about abstract concepts alike. In essence, differences between the current candidate and recently seen candidates were overemphasized, leading to scores that unduly diverged.

Yeates et al. identify a number of key findings that could shed some light on the examiner's actions. The most relevant for this post include:

Normative vs. Criterion-Referenced: The findings suggest that assessors often use "normative" (comparative to others) rather than purely "criterion-referenced" (against a fixed standard) decision-making, and that these internal "norms" are easily influenced by recent experiences.

Lack of Insight: Interestingly, the assessors' confidence in their ratings did not correlate with their susceptibility to this bias, suggesting a lack of insight into how their judgement might be influenced by prior candidates.

Within their “lack of insight” finding, there might be further possibilities such as:

Lack of Professional Reflection: Assessors, like professionals in many fields, often operate under time constraints and may not be explicitly trained in, or given structured opportunities for, deep and critical self-reflection on their assessment practices. If they lack the "tools" (e.g., frameworks for detecting bias, consistent feedback on their own ratings) or the "time" (owing to workload pressures) to engage in meaningful reflection, then their lack of insight into their own bias is less about malicious intent and more about systemic limitations. In my experience, while reflection is encouraged in principle, the practical support, training, and protected time for truly meaningful self-assessment and debriefing are often scarce. This points to a need for better professional development, support, and a culture that values reflective practice in assessment. If assessors aren't equipped to critically analyze their own judgements and the factors influencing them, unconscious biases are much more likely to persist unchecked.

Dunning-Kruger Effect: The Dunning-Kruger effect describes how people with low ability at a task often overestimate their competence, precisely because they lack the metacognitive ability to recognize their own errors. Conversely, highly competent individuals may underestimate their relative competence. The Yeates et al. finding that assessors' confidence told us nothing about their susceptibility to bias appears consistent with the Dunning-Kruger phenomenon. If assessors are unconsciously incompetent at detecting or mitigating their own bias, they may confidently believe they are fair and objective even when they are not. This "unrealized or unacknowledged" lack of competence in mitigating bias could be a significant factor.

Ultimately, Yeates et al. conclude that these cognitive predilections can significantly influence assessors' judgements in ways that are unfair to candidates. There is a question here begging to be asked: unfair to which candidates?

Whilst Benton introduces "generosity" as a logical response to unreliability, the Yeates et al. paper delves into the subconscious biases that could contribute to an examiner's subjective decisions. In this new light, the decision of Tebbit's examiner might not have been a deliberate act of "generosity" or "cheating" in the conventional sense, but perhaps an unconscious "contrast effect." If the examiner had just come from assessing candidates who performed significantly worse, Tebbit's "not quite good enough" performance might have seemed "relatively competent" in contrast.

Or had he, perhaps, seen Tebbit’s scores for the other components? Were Tebbit's scores perhaps higher than those of his peers, and did the examiner, through experience, recognize Tebbit’s potential? Under those circumstances, did he think that the assessment was unfairly disadvantageous to a perfectly good pilot? We, of course, cannot know. But it is not insignificant that these questions come so easily to mind.

This latter possibility raises a further intriguing question about the assessor decision-making process: could a similar psychological effect arise when an assessor has seen the same candidate perform strongly in two or three components of an assessment, and so influence their judgement of a weaker performance in another component? In other words, is it possible that the examiner, knowing Tebbit's stellar performance in previous sections, was unconsciously swayed to pass him on the final component? It is a fascinating area that we'll delve into in a future post.
 

Unfinished Business


Nearly forty years after his BOAC assessment, Tebbit left front-line politics. In the final chapter of his subsequent 1991 book Unfinished Business, Tebbit set out what a post-Thatcher Tebbit government would have looked like. I agree with much of it. But something very particular caught my attention.

Tebbit correctly lays the blame for the regrettable state of the education system at the time on the socialist reforms of the 1960s. Without lapsing into nostalgic sentimentality, Tebbit praises the one-time tripartite structure of grammar, technical and secondary modern schooling, each of which, in its own way, served the best interests of its pupils and their country by coalescing around achievement and meritocracy. The socialist ethos sought, and for that matter still seeks, to blur the distinction between failure and success. The quicker pupils had to be held back to accommodate the pace of the slowest. What could not be achieved by the many was put beyond the reach of any. He then continues:

Despite the overwhelming evidence of falling standards a quite contrary picture is displayed by the architects of these disasters who point to statistics of ever-increasing examination success. Closer examination shows that the biggest examination cheats are not students but examiners, teachers and educationalists who simply water down standards to ensure that the appropriate quota of examinees achieve success.

That's how I remember him. Totally unafraid of left-wing intellectualism, he doesn't just pour scorn on their false gods, he pulls them down and tramples on them unmercifully. When will we see his like again? And yet. What are we to conclude about that hand-up given to him in an assessment that he simply wasn't good enough to pass on his own merit?

One shallow, simplistic solution would be to point a finger at him and cry out “hypocrite”. It would be an answer that tells us nothing. Let us, however, aim for something a little more complex. In their paper, Yeates et al. identify unfairness as one of the consequences of assessor bias. Helping Tebbit to pass let him think that he had passed. It allowed him to forget, consciously or unconsciously, the hand-up. This reading absolves Tebbit of the charge of hypocrisy, but it doesn't get him off the hook: his critique is made from a position of unrealized privilege.

And finally, if we consider the assessor's action as a form of lie, we can genuinely see how the lie distorted Tebbit's reality. The very act designed to help him inadvertently harmed his self-perception. It is a tragedy of good intentions. This, then, is the painful irony of the examiner's actions: that the person treated most unfairly was Tebbit himself.


References


‘Reliability’, Assessment 101 Glossary: 101 words and phrases for assessment professionals (nd), The Assessment Network, University of Cambridge, p.10

Benton, T. (2021). On using generosity to combat unreliability. Research Matters: A Cambridge Assessment publication, 31, 22-41

Tebbit, N.
- (1989) Upwardly Mobile, London, Futura Publications 
- (1991) Unfinished Business, London, Weidenfeld & Nicolson 
