Why Our AI Marking Is Different

Most AI marking tools are rubrics layered on top of a general-purpose language model. Ours aren't. Here's what we do instead — and why it matters.

The AI marking market is growing fast. And with that growth has come a wave of products that, under the hood, work in essentially the same way: take an off-the-shelf large language model, feed it a rubric and a student's essay, and ask it to produce a mark.

It's a reasonable-sounding approach. But it has a fundamental problem: it assumes the language model already knows how to mark to a specific exam board's standard. In practice, it doesn't. The result is marking that sounds plausible but doesn't reliably align with the grades a real examiner would give.
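To make that concrete, here is a minimal sketch of the rubric-on-an-LLM pattern we're describing, assuming an OpenAI-style chat API. The model name, prompt wording, and mark scale are all illustrative; this is the generic approach used across the market, not anything from our own stack:

```python
# A minimal sketch of the generic "rubric on top of an LLM" pattern.
# The model name, prompt wording, and mark scale are all illustrative.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

def mark_essay_generic(rubric: str, essay: str) -> str:
    """Ask a general-purpose model for a mark. Note that no calibration
    against exam-board standardisation materials happens anywhere here."""
    response = client.chat.completions.create(
        model="gpt-4o",  # an off-the-shelf model, not tuned for marking
        messages=[
            {"role": "system",
             "content": f"You are an examiner. Mark strictly to this rubric:\n{rubric}"},
            {"role": "user",
             "content": f"Mark this essay out of 20 and justify the mark:\n{essay}"},
        ],
    )
    return response.choices[0].message.content
```

Nothing in that loop ever sees a chief examiner's mark, which is exactly why its output can sound plausible without being calibrated.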

At Top Marks, we do something materially different.

400+ Bespoke Tools, Not One Generic Model

Every marking tool on our platform is built from the ground up for a specific question type, exam board, and qualification. An AQA GCSE English Language Paper 1, Question 5 tool is a completely different tool from an Edexcel A Level Politics source question tool. They don't share a model. They don't share a prompt. They are individually engineered.

For each tool, our pipeline evaluates thousands of candidate model configurations — different base models, prompt structures, and scoring approaches. We then use proprietary machine learning techniques to select and combine the configurations that align most closely with the exam board's own standardisation materials.

How we build each tool

  1. Board Materials: standardisation essays with chief examiner marks.
  2. Test Thousands: AI configurations scored against those scripts.
  3. Optimise: correlation, MAE, and percentage of scores within tolerance.
  4. Ship or Hold: the tool goes live only if the benchmarks are met (sketched in code below).
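To make the ship-or-hold step concrete, here is an illustrative sketch of a benchmark-driven selection loop. Every name and threshold below is a hypothetical stand-in, and it picks a single best configuration where our real pipeline also combines several; but the shape matches the steps above: score every candidate, keep the best, and ship only if it clears the bar.

```python
# Illustrative sketch of a benchmark-driven "ship or hold" loop.
# All names and thresholds here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class Benchmarks:
    pearson: float           # correlation with chief examiner marks
    mae: float               # mean absolute error, in marks
    pct_in_tolerance: float  # share of scripts within board tolerance

def meets_bar(b: Benchmarks) -> bool:
    # Example thresholds, in the spirit of the targets described below.
    return b.pearson >= 0.85 and b.pct_in_tolerance >= 0.80

def select_tool(candidates: Iterable[object],
                evaluate: Callable[[object], Benchmarks]) -> Optional[object]:
    """Score every candidate configuration against the standardisation
    scripts, keep the best, and ship it only if it clears the bar."""
    scored = [(config, evaluate(config)) for config in candidates]
    best_config, best = max(scored, key=lambda pair: pair[1].pearson)
    return best_config if meets_bar(best) else None  # None = hold and keep iterating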

What We Optimise For

Where many competitors rely on subjective user feedback to tune their outputs ("Did this score feel right?"), we optimise against hard metrics drawn from board standardisation essays: scripts that have been marked by chief examiners. We track three numbers, each defined precisely in the code sketch after this list:

  • Pearson correlation — how closely our marks track the chief examiner's marks across a full set of scripts. A score of 1.0 means perfect agreement; 0 means no relationship. We target 0.85+ across Humanities tools, and regularly exceed 0.90.
  • Mean Absolute Error (MAE) — the average number of marks our tool is off by, per script. Lower is better. On a typical GCSE or A Level question, our MAE is roughly half that of experienced human markers.
  • Percentage of scores within tolerance — the proportion of scripts where our mark falls within the exam board's acceptable range. We typically hit 80–85%, compared to roughly 45% for experienced human markers.
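For readers who want the exact definitions behind those three numbers, here is a small, self-contained Python sketch. It implements the standard textbook formulas, not our internal code, and the 3-mark tolerance default is simply the example used in the comparison below:

```python
# Standard definitions of the three metrics, in plain Python.
# `ai` and `examiner` are parallel lists of marks for the same scripts.
from math import sqrt

def pearson(ai: list[float], examiner: list[float]) -> float:
    """Pearson correlation: 1.0 means the marks move in perfect lockstep."""
    n = len(ai)
    mean_a = sum(ai) / n
    mean_e = sum(examiner) / n
    cov = sum((a - mean_a) * (e - mean_e) for a, e in zip(ai, examiner))
    var_a = sum((a - mean_a) ** 2 for a in ai)
    var_e = sum((e - mean_e) ** 2 for e in examiner)
    return cov / sqrt(var_a * var_e)

def mae(ai: list[float], examiner: list[float]) -> float:
    """Mean absolute error: average marks off per script. Lower is better."""
    return sum(abs(a - e) for a, e in zip(ai, examiner)) / len(ai)

def pct_within_tolerance(ai: list[float], examiner: list[float],
                         tolerance: float = 3) -> float:
    """Share of scripts where the AI mark falls within the board's tolerance."""
    hits = sum(1 for a, e in zip(ai, examiner) if abs(a - e) <= tolerance)
    return hits / len(ai)
```

Because exam boards publish their standardisation essays, anyone can run checks like these against our marks and reproduce the figures we publish.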

This isn't a black box. We publish accuracy data for every tool on our accuracy blog, so schools can verify the numbers for themselves before committing.

A Direct Comparison: Edexcel A Level Politics

To illustrate the difference this approach makes, we ran a head-to-head test. We downloaded all 51 standardisation essays from Edexcel's website for A Level Politics — scripts with known chief examiner marks — and marked them with both Top Marks AI and a competitor's tool.

Pearson correlation with the chief examiner's marks:

  • Top Marks AI: 0.84
  • Competitor: 0.48

A note on transparency: 0.84 is actually slightly below our typical range; most of our Humanities tools exceed 0.90. We chose Edexcel A Level Politics for this comparison deliberately, because it is one of our more challenging tools and the standardisation essays are freely available for anyone to download and verify our results. Even on one of our tougher tools, our marks track closely with the chief examiner's, comfortably ahead of the ~0.70 correlation that experienced human markers typically achieve according to a Cambridge University study. The competitor's correlation of 0.48 is significantly worse than human performance: too weak a relationship to rely on for a structured assessment.

| Metric | Top Marks AI | Competitor | Experienced humans |
| --- | --- | --- | --- |
| Pearson correlation | 0.84 | 0.48 | ~0.70 |
| Mean absolute error | 2.55 marks | 5.0 marks | 5+ marks |
| Within 3-mark tolerance? | Yes (on average) | No (on average) | No (on average) |

Our MAE of 2.55 marks means that, averaged across all 51 essays, our tool stayed within the chief examiner's 3-mark tolerance. The competitor averaged 5 marks out: roughly the same as human markers, but without the pedagogical judgement a human brings, and its outliers were far more egregious.

Independent verification: our accuracy findings have been corroborated by Ark Schools, one of the UK's largest and most respected multi-academy trusts. When schools ask whether our numbers are real, we can point beyond our own benchmarks to external validation.

Why It Matters

Schools are making real decisions based on AI-generated marks — setting targets, identifying intervention groups, informing reports to parents. If the marks aren't reliable, those decisions are built on sand.

The difference between our approach and a rubric-on-an-LLM isn't a technicality. It's the difference between a tool that has been rigorously calibrated to an exam board's standard and one that is, at best, an educated guess. We believe schools deserve to know which one they're getting.

Every tool we build is individually benchmarked against board standardisation materials. We publish the results. If the numbers aren't good enough, we don't ship the tool.

If you'd like to see the accuracy data for any of our 400+ tools, visit our accuracy blog — or book a demo and we'll walk you through it.
