What Ofqual's AI Marking Principles Mean for Us — and How We Built to Meet Them

Ofqual recently published a working paper on the use of AI in marking. It is a careful, substantive document — and one that every developer in this space should engage with honestly. This post explains how Top Marks AI's architecture addresses the concerns Ofqual raises, what our evidence actually shows, and why we think the ethical questions here run deeper than the regulatory ones.

What Ofqual Is Saying

Ofqual's working paper on AI use in marking does not ban AI. What it does is distinguish between different categories of AI system, identify the risks that matter most to regulators, and set out principles that responsible AI marking should meet.

The taxonomy the paper uses is worth understanding. At one end: classical feature-based systems — rule-based, limited in scope, but highly interpretable. At the other: generative large language models — the most capable, the least transparent, and the category that attracts the most scrutiny.

Top Marks AI uses large language models. We will not pretend otherwise. But the question Ofqual is actually asking is not "what type of model?" — it is "what safeguards are in place, and is the evidence for those safeguards genuine?" Those safeguards are the subject of this post.

The Non-Determinism Problem — and Our Solution

The most technically substantive concern in Ofqual's paper is non-determinism: the property of LLMs whereby the same input, submitted twice, can produce different outputs. For marking, this is not a theoretical problem. If a student's essay were re-submitted tomorrow, the score might differ. That is not the behaviour of a reliable assessment system.

Our response is architectural. Rather than submitting an essay once and accepting a single output, the system scores each essay multiple times using multiple models simultaneously, then applies a robust aggregation layer engineered specifically to identify and discard outliers. A consistency gate must be satisfied before any final score is accepted — if the models disagree beyond a defined threshold, the result is flagged rather than forced.

The effect is to treat non-determinism not as a flaw to be papered over, but as a statistical property to be measured and controlled. If the scores from multiple runs cluster tightly, confidence is high. If they diverge, the system flags the result. This does not eliminate non-determinism — nothing does — but it transforms an unpredictable property into a quantifiable one.
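To make the idea concrete, here is a minimal sketch of outlier-robust aggregation with a consistency gate. It is an illustration under stated assumptions, not the production implementation: the function name `aggregate_scores`, the median/MAD outlier rule, and the `outlier_k`, `mad_floor` and `max_spread` values are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class AggregateResult:
    score: Optional[float]  # None when the consistency gate fails
    flagged: bool           # True means "send for review", not "force a mark"
    kept: List[float]       # scores that survived outlier removal


def aggregate_scores(raw_scores: List[float],
                     outlier_k: float = 3.0,
                     mad_floor: float = 0.5,
                     max_spread: float = 2.0) -> AggregateResult:
    """Combine repeated scores for one essay (all thresholds are assumed values)."""
    if not raw_scores:
        raise ValueError("at least one score is required")

    centre = median(raw_scores)
    # Median absolute deviation: a spread estimate that one stray run cannot distort.
    mad = median([abs(s - centre) for s in raw_scores])
    # The floor keeps the tolerance non-zero when most runs agree exactly.
    tolerance = outlier_k * max(mad, mad_floor)
    kept = [s for s in raw_scores if abs(s - centre) <= tolerance]

    # Consistency gate: if the surviving scores still disagree too much, flag.
    if max(kept) - min(kept) > max_spread:
        return AggregateResult(score=None, flagged=True, kept=kept)
    return AggregateResult(score=median(kept), flagged=False, kept=kept)


# One stray run (30) is discarded; the remaining runs agree closely.
tight = aggregate_scores([18, 19, 18, 19, 30])
print(tight.score, tight.flagged)   # 18.5 False

# Runs that genuinely disagree are flagged rather than averaged away.
loose = aggregate_scores([10, 14, 18, 22])
print(loose.score, loose.flagged)   # None True
```

The design point is that the gate runs after outlier removal: a single anomalous run does not trigger a flag, but sustained disagreement does.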

Context-Specific Validity Evidence

Ofqual is explicit that general claims about LLM capability are not sufficient. Validity evidence must be tied to specific subjects, question types, and mark schemes — and it must be genuinely held-out, not inflated by in-sample fitting.

This requirement maps directly to how Top Marks AI is built. We do not have a single marking tool. We have over 400 individually calibrated tools, each calibrated against specific exam board standardisation materials for a particular subject, question type, and mark scheme.

Each tool is calibrated to preserve monotonicity (a better response must always receive a higher mark), with correct behaviour enforced at both the minimum and maximum possible scores. Throughout, calibration is validated on held-out data, so the accuracy figures we publish cannot be inflated by overfitting to the scripts used for calibration.
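One standard way to obtain a monotone calibration is isotonic regression, sketched below with the pool-adjacent-violators algorithm. This is an assumed technique chosen for illustration; the post does not say which calibration method the tools use. The names `isotonic_fit` and `make_calibrator` and the sample numbers are hypothetical. Note that this sketch guarantees only that a higher raw score never maps to a lower mark; ties are possible.

```python
from bisect import bisect_right
from typing import Callable, List


def isotonic_fit(values: List[float]) -> List[float]:
    """Pool-adjacent-violators: closest non-decreasing sequence in least squares."""
    blocks = []  # each block holds [sum, count] for a run of pooled values
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


def make_calibrator(raw: List[float], marks: List[float],
                    min_mark: float, max_mark: float) -> Callable[[float], float]:
    """Build a monotone map from raw model score to mark, clamped to the mark range."""
    pairs = sorted(zip(raw, marks))
    xs = [x for x, _ in pairs]
    ys = [min(max(y, min_mark), max_mark)
          for y in isotonic_fit([m for _, m in pairs])]

    def calibrate(x: float) -> float:
        # Outside the calibration range, hold the end values (already clamped).
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return calibrate


# Hypothetical calibration data: raw model scores against standardised marks.
calibrate = make_calibrator([0.1, 0.4, 0.5, 0.9], [2, 10, 8, 30],
                            min_mark=0, max_mark=24)
print(calibrate(0.45), calibrate(1.5))   # 9.0 24
```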

The 0.94 Pearson correlation we report for AQA English Language is not a claim about what our models can do in principle. It is a measured result on scripts that were never part of the calibration process. Ofqual asks for that kind of evidence. We have it.

Transparency and Explainability

Ofqual is right to raise opacity as a concern. An AI system that produces a mark but cannot explain how it got there is unsuitable for assessment. Students, teachers, and institutions need to understand why a mark was awarded — not just that it was.

Our response is the ScaMP feedback framework — Scaffolded, Modelled, Precise. Rather than generating a mark alongside a generic comment, the system produces feedback that explicitly traces the connection between the student's response, the mark scheme criteria, and the band awarded. Students can see which assessment objectives they met and where they fell short.

This is not cosmetic. It is a structural requirement of the system design: feedback must surface mark-scheme reasoning explicitly, making marks defensible to the student, the teacher, and any external review. The per-assessment-objective scoring architecture means that the overall mark is always decomposable into its constituent parts.
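A decomposable mark can be represented very simply. The sketch below is illustrative only: the class names, the objective labels and the marks are hypothetical, and real mark schemes may combine objectives differently. What it shows is that the total is derived from the per-objective results rather than stored separately, so it can never drift from its parts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ObjectiveResult:
    objective: str   # assessment objective label (hypothetical here)
    awarded: int     # marks awarded for this objective
    available: int   # marks available for this objective
    rationale: str   # mark-scheme reasoning shown to the student


@dataclass(frozen=True)
class MarkedResponse:
    objectives: List[ObjectiveResult]

    @property
    def total(self) -> int:
        # Derived, never stored: the overall mark is the sum of its parts.
        return sum(o.awarded for o in self.objectives)

    @property
    def available(self) -> int:
        return sum(o.available for o in self.objectives)

    def breakdown(self) -> List[str]:
        return [f"{o.objective}: {o.awarded}/{o.available} ({o.rationale})"
                for o in self.objectives]


result = MarkedResponse([
    ObjectiveResult("Content and organisation", 17, 24,
                    "clear structure; ideas not always developed"),
    ObjectiveResult("Technical accuracy", 11, 16,
                    "secure sentence control; some spelling errors"),
])
print(f"{result.total}/{result.available}")   # 28/40
for line in result.breakdown():
    print(line)
```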

Human Accountability and Scope

Ofqual's caution about AI as sole marker is understandable when evidence is limited. But at Top Marks AI, the evidence is not limited. Our independently corroborated accuracy data shows performance that materially exceeds the consistency benchmarks achieved by experienced human markers working with the same mark schemes. That raises a question worth sitting with: if AI demonstrably outperforms human marking on consistency, what does the ethical obligation to students actually require?

We are not suggesting regulatory frameworks should change overnight. What we are saying is that the conversation about AI in assessment should be led by evidence, not by default assumptions about human primacy. The teachers and leaders we work with find this framing clarifying: the question is not "AI versus human" but "what combination of AI and human gives students the most consistent, well-evidenced feedback?"

In practice, teachers remain in the loop — reviewing, contextualising, and where appropriate overriding outputs. We regard this not as a constraint but as a deliberate design choice: the combination of AI consistency with teacher expertise and contextual judgement is stronger than either alone. The case for AI marking infrastructure is precisely this — a system that makes the whole more trustworthy than the sum of its parts.

Where the Evidence Stands

The accuracy evidence we have published, corroborated independently by Ark Schools and Community Schools Trust, is a genuine attempt to answer Ofqual's call for context-specific validity evidence. We have not cherry-picked results or presented in-sample figures. Our leave-one-out cross-validation (LOO-CV) methodology scores each script using a calibration that excluded that script, so it is specifically designed to produce honest estimates that cannot be inflated by overfitting.
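For readers unfamiliar with leave-one-out cross-validation, the sketch below shows its general shape: each script is predicted by a calibration fitted on every other script, and the correlation is then computed on those held-out predictions only. The straight-line calibration is a stand-in chosen to keep the example short, and the function names and data are hypothetical; this is not a description of the actual calibration model.

```python
from math import sqrt
from typing import List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares line; stands in for a real calibration step."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def loo_predictions(raw: List[float], human: List[float]) -> List[float]:
    """Predict each script with a calibration that never saw that script."""
    preds = []
    for i in range(len(raw)):
        train_x = raw[:i] + raw[i + 1:]
        train_y = human[:i] + human[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * raw[i] + intercept)
    return preds


# Hypothetical data: raw model scores and the corresponding human marks.
raw_scores = [3.0, 5.0, 6.0, 8.0, 9.0, 12.0]
human_marks = [8.0, 11.0, 14.0, 17.0, 18.0, 25.0]
held_out = loo_predictions(raw_scores, human_marks)
print(round(pearson(held_out, human_marks), 3))
```

Because every prediction comes from a fit that excluded the script being predicted, the resulting correlation is an out-of-sample estimate rather than a measure of how well the calibration memorised its own data.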

Our production quality gates go beyond accuracy. Before any tool is deployed, it is tested for:

  • Monotonicity — does a better response always receive a higher score?
  • Cliff detection — are there sharp discontinuities where small differences in quality produce large differences in marks?
  • Coverage — does the system handle the full range of response quality, including very weak and very strong scripts?
  • Re-mark stability — do repeated runs on the same scripts produce consistent results within acceptable bounds?

These checks do not make the system perfect. What they do is make its behaviour visible and measurable — which is precisely what Ofqual is asking for.
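The four checks listed above can each be expressed as a small, testable function. The following is a sketch under assumptions: the function names, the input shapes (marks for benchmark scripts ordered from weakest to strongest) and every threshold are hypothetical, chosen only to show what "visible and measurable" can look like in code.

```python
from typing import List


def is_monotonic(marks_by_quality: List[float]) -> bool:
    """Marks for scripts ordered weakest to strongest must never decrease."""
    return all(a <= b for a, b in zip(marks_by_quality, marks_by_quality[1:]))


def find_cliffs(marks_by_quality: List[float], max_jump: float) -> List[int]:
    """Indices i where the step from script i to script i + 1 exceeds max_jump."""
    return [i for i, (a, b) in enumerate(zip(marks_by_quality,
                                             marks_by_quality[1:]))
            if abs(b - a) > max_jump]


def covers_range(marks: List[float], min_mark: float, max_mark: float,
                 margin: float) -> bool:
    """The benchmark set must reach close to both ends of the mark range."""
    return min(marks) <= min_mark + margin and max(marks) >= max_mark - margin


def remark_stable(runs_per_script: List[List[float]], tolerance: float) -> bool:
    """Repeated marks for the same script must stay within the tolerance."""
    return all(max(runs) - min(runs) <= tolerance for runs in runs_per_script)


# Hypothetical benchmark marks, ordered from weakest to strongest script.
benchmark = [2, 5, 9, 20, 24]
print(is_monotonic(benchmark))                  # True
print(find_cliffs(benchmark, max_jump=6))       # [2]
print(covers_range(benchmark, 0, 24, margin=2)) # True
print(remark_stable([[10, 11, 10], [20, 20, 21]], tolerance=1))  # True
```

In this example the tool would pass on monotonicity, coverage and stability, but the jump from 9 to 20 between adjacent benchmark scripts would be surfaced for investigation.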

To our knowledge, Top Marks AI is the only AI marking system to have its accuracy independently corroborated by multiple school groups — Ark Schools and Community Schools Trust — using scripts that were never part of the calibration process. LLMs carry interpretability challenges that simpler systems do not. We have invested significant architectural effort in counteracting those challenges: through multi-model scoring, robust aggregation, held-out validation, and structured feedback that surfaces mark-scheme reasoning explicitly. The result is a system whose behaviour is measurable, whose evidence is independently verifiable, and whose accuracy at scale exceeds what human-only marking achieves.

A Shared Goal

Ofqual's working paper is, ultimately, about protecting the integrity of assessment. So is our work. The principles it sets out — non-determinism mitigation, context-specific validity evidence, transparency, human accountability — are not obstacles to good AI marking. They are the conditions that make good AI marking possible.

The AI marking field will earn credibility only if developers engage with these principles seriously rather than treating them as compliance hurdles. We will continue to publish our evidence openly — because transparency is what separates credible AI marking from marketing noise, and because we believe our evidence speaks for itself.

If you are a school leader, MAT director, or someone responsible for assessment policy and you want to understand exactly how our system works and what the data shows, we welcome that conversation.

Richard Davis

Founder & CEO, Top Marks AI

Richard founded Top Marks AI to build AI marking infrastructure that meets the evidentiary standards schools and regulators need. He leads the technical and commercial development of the platform.
