"How reliable is your A-Level Politics AI marking system?" We hear this question constantly. So we put our tool to the test on the Edexcel A Level Politics 30-mark "no source" essay — the longer, source-free evaluative questions — using 51 of the exam board's own standardisation exemplars. Here is exactly how accurate it was, with every script's marks laid out in full.
This study focuses on Edexcel A Level Politics — specifically the 30-mark "no source" questions. These are the extended, evaluative essay questions (the "evaluate" and "to what extent" essays), where students build and sustain an argument without a source extract to lean on. They are demanding to write and demanding to mark.
Edexcel publishes a large bank of exemplar essays for these papers, and we used 51 of those exam-board-approved standardisation materials. These are the official scripts used to show teachers the full spectrum of answer quality, from low-mark responses through to near-perfect ones — the marks in our sample run all the way from 8 to 30.
We ran all 51 essays through our dedicated Edexcel Politics marking tool, then measured the correlation between the official marks the board awarded and the marks Top Marks AI assigned to those same essays. The AI was not shown the official marks or any annotations.
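The headline measure is the Pearson correlation coefficient. In short: it runs from -1 to +1, where +1 means the two sets of marks rise and fall in perfect step, 0 means no linear relationship at all, and -1 means a perfect inverse relationship.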
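For readers who want the definition behind that number, here is the standard textbook formula. The symbols are our own shorthand for this write-up: $x_i$ is the board's mark for essay $i$, $y_i$ is the AI's mark for the same essay, $\bar{x}$ and $\bar{y}$ are the two averages, and $n$ is the number of essays (51 here).

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Note that $r$ measures how consistently the two sets of marks move together, not how close they are in absolute terms, which is why the error measures further down are reported alongside it.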
What sort of correlation do experienced human markers achieve when marking scripts already marked by a lead examiner? A landmark study by Fowles (2009) set out to measure exactly that, comparing the marks of experienced GCSE examiners against the chief examiner's marks on the same scripts.
The findings are sobering. Those experienced examiners correlated with the chief examiner at only around 0.65 — a positive correlation, but far from perfect — and just ~45% of their marks fell within the exam board's tolerance of the definitive mark. Marking, it turns out, is far less consistent than most people assume, even among seasoned professionals.
Across all 51 essays, Top Marks recorded a correlation of 0.84 (0.835 to three decimal places) — a very strong positive correlation that comfortably outperforms the ~0.65 achieved by experienced human markers (Fowles, 2009), and on a harder, source-free question type.
0.84: Pearson correlation with the exam board
68.6%: of marks within 3 of the board (35 of 51)
2.55: mean absolute error, about 8.5% on a 30-mark scale
3.23: root mean square error on the 30-mark scale
The Mean Absolute Error (MAE) of 2.55 marks means that, on average, the AI's mark differed from the board's by 2.55 marks, or about 8.5% of the 30-mark scale. The within-tolerance figure tells the same story: Top Marks AI landed 68.6% of its marks (35 of 51) within 3 marks of the board, against the ~45% that experienced examiners managed in Fowles (2009). On the measure exam boards care about most, how often a marker falls inside tolerance, our tool was markedly more consistent than the human benchmark.
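To make those two measures concrete, here is a minimal Python sketch applied to the first six rows of the results table only. The function names and the `TOLERANCE` constant are our own illustrative choices, and the 3-mark tolerance is the one used in this article; the outputs describe this six-row slice, not the full study.

```python
# (board mark, AI mark) for the first six rows of the results table.
ROWS = [
    (13.0, 13.5), (27.5, 25.4), (25.0, 23.0),
    (30.0, 28.0), (18.0, 17.8), (30.0, 24.0),
]
TOLERANCE = 3  # marks; the threshold this article reports against


def mean_absolute_error(rows):
    """Average size of the gap between AI and board, ignoring direction."""
    return sum(abs(ai - board) for board, ai in rows) / len(rows)


def share_within(rows, tolerance):
    """Fraction of essays where the AI is within `tolerance` marks of the board."""
    # Round to one decimal place so float noise cannot push a gap of
    # exactly 3.0 over the limit.
    hits = sum(1 for board, ai in rows if round(abs(ai - board), 1) <= tolerance)
    return hits / len(rows)


print(f"MAE on this slice: {mean_absolute_error(ROWS):.2f} marks")
print(f"Within {TOLERANCE} marks: {share_within(ROWS, TOLERANCE):.1%}")
```

Five of these six essays fall inside tolerance; the sixth (board 30, AI 24) is a 6-mark gap and counts as a miss however close the other five are.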
We don't claim Top Marks is infallible, so when it does get things wrong, how bad is it? For that we turn to the Root Mean Square Error (RMSE), which weights large errors more heavily by squaring them before averaging: square 2 and you only reach 4, but square 5 and you jump all the way to 25. Top Marks AI's RMSE was 3.23, about 10.8% of the 30-mark scale and only modestly above the 2.55 mean absolute error, which indicates that large misses are uncommon. The largest single gap in the results table is 7.5 marks.
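The effect of squaring is easiest to see on made-up numbers. The sketch below uses two hypothetical sets of marking errors (not taken from the study) that share the same mean absolute error but differ in how the error is distributed.

```python
from math import sqrt


def mae(errors):
    """Mean absolute error: every mark of error counts equally."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean square error: errors are squared first, so big misses dominate."""
    return sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical error sets, in marks (AI minus board).
steady = [2, -2, 2, -2]   # always 2 marks out
spiky = [0, 0, 0, 8]      # usually exact, with one large miss

for name, errors in (("steady", steady), ("spiky", spiky)):
    print(f"{name}: MAE={mae(errors):.2f} RMSE={rmse(errors):.2f}")
```

Both sets have an MAE of 2 marks, but the RMSE of the second set is twice that of the first because a single 8-mark miss dominates once squared. An RMSE that sits close to the MAE therefore suggests errors of fairly even size rather than a few large ones.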
You can see the full side-by-side human and AI scores for all 51 essays below. A negative difference means we marked below the board; a positive difference means we marked above it.
| Essay ID | Board Score | Top Marks AI Score | Difference |
|---|---|---|---|
| 1-June 2019 Q2a - 30 Mark, No Source 1 (-) (13).docx | 13.0 | 13.5 | +0.5 |
| 1-June 2019 Q2a - 30 Mark, No Source 2 (-) (27.5).docx | 27.5 | 25.4 | -2.1 |
| 1-June 2019 Q2b - 30 Mark, No Source 1 (-) (25).docx | 25.0 | 23.0 | -2.0 |
| 1-June 2019 Q2b - 30 Mark, No Source 2 (-) (30).docx | 30.0 | 28.0 | -2.0 |
| 1-June 2019 Q2b - 30 Mark, No Source 3 (-) (18).docx | 18.0 | 17.8 | -0.2 |
| 1-June 2023 Q2a - 30 Mark, No Source 1 (-) (30).docx | 30.0 | 24.0 | -6.0 |
| 1-June 2023 Q2a - 30 Mark, No Source 1 (-) (13).docx | 13.0 | 18.3 | +5.3 |
| 1-June 2023 Q2b - 30 Mark, No Source 2 (-) (25).docx | 25.0 | 24.0 | -1.0 |
| 1-June 2023 Q2b - 30 Mark, No Source 3 (-) (27.5).docx | 27.5 | 22.0 | -5.5 |
| 1-June 2024 Q2a - 30 Mark, No Source 1 (-) (17).docx | 17.0 | 9.7 | -7.3 |
| 1-June 2024 Q2b - 30 Mark, No Source 1 (-) (28).docx | 28.0 | 25.7 | -2.3 |
| 1-June 2024 Q2b - 30 Mark, No Source 2 (-) (14.5).docx | 14.5 | 17.5 | +3.0 |
| 2-June 2019 Q2a - 30 Mark, No Source 1 (-) (29).docx | 29.0 | 23.4 | -5.6 |
| 2-June 2019 Q2b - 30 Mark, No Source 1 (-) (30).docx | 30.0 | 27.1 | -2.9 |
| 2-June 2023 Q2a - 30 Mark, No Source 1 (-) (27.5).docx | 27.5 | 25.0 | -2.5 |
| 2-June 2023 Q2a - 30 Mark, No Source 2 (-) (27.5).docx | 27.5 | 25.8 | -1.7 |
| 2-June 2023 Q2b - 30 Mark, No Source 1 (-) (27.5).docx | 27.5 | 26.7 | -0.8 |
| 2-June 2024 Q2a - 30 Mark, No Source 1 (-) (26).docx | 26.0 | 28.4 | +2.4 |
| 2-June 2024 Q2b - 30 Mark, No Source 1 (-) (25).docx | 25.0 | 24.7 | -0.3 |
| 3a-June 2019 Q3a - 30 Marks, No Source 1 (-) (20).docx | 20.0 | 24.1 | +4.1 |
| 3a-June 2019 Q3a - 30 Marks, No Source 2 (-) (14).docx | 14.0 | 19.4 | +5.4 |
| 3a-June 2019 Q3a - 30 Marks, No Source 3 (-) (30).docx | 30.0 | 26.7 | -3.3 |
| 3a-June 2019 Q3b - 30 Marks, No Source 1 (-) (17).docx | 17.0 | 17.0 | 0.0 |
| 3a-June 2019 Q3b - 30 Marks, No Source 2 (-) (20).docx | 20.0 | 21.5 | +1.5 |
| 3a-June 2019 Q3b - 30 Marks, No Source 3 (-) (26).docx | 26.0 | 22.1 | -3.9 |
| 3a-June 2019 Q3c - 30 Marks, No Source 1 (-) (25).docx | 25.0 | 24.0 | -1.0 |
| 3a-June 2019 Q3c - 30 Marks, No Source 2 (-) (20).docx | 20.0 | 20.1 | +0.1 |
| 3A-June 2023 Q3a -30 Marks, No Source 1 (-) (28).docx | 28.0 | 25.7 | -2.3 |
| 3A-June 2023 Q3b -30 Marks, No Source 1 (-) (26).docx | 26.0 | 25.7 | -0.3 |
| 3A-June 2023 Q3c -30 Marks, No Source 1 (-) (25).docx | 25.0 | 25.0 | 0.0 |
| 3A-June 2024 Q3a -30 Marks, No Source 1 (-) (18).docx | 18.0 | 25.5 | +7.5 |
| 3A-June 2024 Q3a -30 Marks, No Source 2 (-) (28).docx | 28.0 | 29.0 | +1.0 |
| 3A-June 2024 Q3b -30 Marks, No Source 1 (-) (15).docx | 15.0 | 21.4 | +6.4 |
| 3A-June 2024 Q3b -30 Marks, No Source 2 (-) (25).docx | 25.0 | 24.4 | -0.6 |
| 3A-June 2024 Q3c -30 Marks, No Source 1 (-) (19).docx | 19.0 | 17.5 | -1.5 |
| 3A-June 2024 Q3c -30 Marks, No Source 2 (-) (27).docx | 27.0 | 25.4 | -1.6 |
| 3B - June 2024 Q3a -30 Marks, No Source 1 (-) (27).docx | 27.0 | 28.3 | +1.3 |
| 3B - June 2024 Q3a -30 Marks, No Source 2 (-) (8).docx | 8.0 | 8.0 | 0.0 |
| 3B - June 2024 Q3b -30 Marks, No Source 1 (-) (22).docx | 22.0 | 25.6 | +3.6 |
| 3B - June 2024 Q3b -30 Marks, No Source 2 (-) (26).docx | 26.0 | 28.4 | +2.4 |
| 3B - June 2024 Q3c -30 Marks, No Source 1 (-) (25).docx | 25.0 | 25.9 | +0.9 |
| 3B - June 2024 Q3c -30 Marks, No Source 2 (-) (10).docx | 10.0 | 13.5 | +3.5 |
| 3b-June 2019 Q3a - 30 Marks, No Source 1 (-) (30).docx | 30.0 | 26.0 | -4.0 |
| 3b-June 2019 Q3a - 30 Marks, No Source 2 (-) (28).docx | 28.0 | 25.7 | -2.3 |
| 3b-June 2019 Q3b - 30 Marks, No Source 1 (-) (28).docx | 28.0 | 25.4 | -2.6 |
| 3b-June 2019 Q3b - 30 Marks, No Source 2 (-) (28).docx | 28.0 | 25.9 | -2.1 |
| 3b-June 2019 Q3c - 30 Marks, No Source 1 (-) (27).docx | 27.0 | 24.9 | -2.1 |
| 3b-June 2019 Q3c - 30 Marks, No Source 2 (-) (30).docx | 30.0 | 25.0 | -5.0 |
| Summer 2022 2b - 30 mark, no source (-) (22).docx | 22.0 | 26.4 | +4.4 |
| Summer 2022 3a - No Source 1 (-) (29).docx | 29.0 | 29.7 | +0.7 |
| Summer 2022 3b - No Source 1 (-) (30).docx | 30.0 | 29.0 | -1.0 |
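Anyone who wants to check the headline figures against the table can do so with a short script. The sketch below is one way to do it in plain Python, not the pipeline used for the study; the pairs are transcribed by hand from the table in row order, and the variable and function names are our own.

```python
from math import sqrt

# (board mark, AI mark) for all 51 essays, in table order.
PAIRS = [
    (13.0, 13.5), (27.5, 25.4), (25.0, 23.0), (30.0, 28.0), (18.0, 17.8),
    (30.0, 24.0), (13.0, 18.3), (25.0, 24.0), (27.5, 22.0), (17.0, 9.7),
    (28.0, 25.7), (14.5, 17.5), (29.0, 23.4), (30.0, 27.1), (27.5, 25.0),
    (27.5, 25.8), (27.5, 26.7), (26.0, 28.4), (25.0, 24.7), (20.0, 24.1),
    (14.0, 19.4), (30.0, 26.7), (17.0, 17.0), (20.0, 21.5), (26.0, 22.1),
    (25.0, 24.0), (20.0, 20.1), (28.0, 25.7), (26.0, 25.7), (25.0, 25.0),
    (18.0, 25.5), (28.0, 29.0), (15.0, 21.4), (25.0, 24.4), (19.0, 17.5),
    (27.0, 25.4), (27.0, 28.3), (8.0, 8.0), (22.0, 25.6), (26.0, 28.4),
    (25.0, 25.9), (10.0, 13.5), (30.0, 26.0), (28.0, 25.7), (28.0, 25.4),
    (28.0, 25.9), (27.0, 24.9), (30.0, 25.0), (22.0, 26.4), (29.0, 29.7),
    (30.0, 29.0),
]


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


n = len(PAIRS)
board = [b for b, _ in PAIRS]
ai = [a for _, a in PAIRS]
# Rounded to one decimal place, matching the Difference column and
# keeping float noise away from the 3-mark boundary.
diffs = [round(a - b, 1) for b, a in PAIRS]

r = pearson(board, ai)
mae = sum(abs(d) for d in diffs) / n
rmse = sqrt(sum(d * d for d in diffs) / n)
within_3 = sum(1 for d in diffs if abs(d) <= 3)

print(f"essays: {n}")
print(f"Pearson r: {r:.3f}")
print(f"MAE: {mae:.2f} marks")
print(f"RMSE: {rmse:.2f} marks")
print(f"within 3 marks: {within_3} of {n} ({within_3 / n:.1%})")
```

Run against the table as printed, this should land on the figures quoted in this article: a correlation of about 0.84, an MAE of 2.55, an RMSE of 3.23, and 35 of 51 essays within 3 marks.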
Absolutely. First, here's a scatter graph showing what a theoretical perfect correlation of 1 would look like — every essay sitting exactly on the diagonal:
Now the real-life graph, drawn from the data above:
On the horizontal axis is the mark given by the exam board; on the vertical, the mark given by Top Marks AI. Each dot is one essay, and you can see how closely the cloud of points hugs the line of perfect correlation.
This is the second of a pair. Our companion study looks at the Edexcel A Level Politics source-based 30-marker, where Top Marks AI scored an even higher 0.89 correlation. Together they show the accuracy holds across both the source-based and source-free 30-mark question types.
Book a demo and we'll show you how Top Marks AI marks Edexcel A Level Politics — benchmarked to the board's own standards, with the evidence to back it up.
Across 51 of Edexcel's own standardisation exemplars for the 30-mark "no source" questions, Top Marks AI correlated with the exam board's marks at 0.84 (Pearson). The average difference was 2.55 marks — about 8.5% on a 30-mark scale — and 68.6% of our marks fell within 3 marks of the board.
Research into marker reliability — most notably Fowles (2009) — found that experienced GCSE examiners correlate with the chief examiner at only around 0.65, with just ~45% of their marks falling within the board's tolerance. Top Marks AI's 0.84 correlation and 68.6% of marks within 3 of the board are stronger on both measures — and on a question type with no source extract to anchor the marking.
We used 51 official Edexcel exemplar essays for the 30-mark "no source" questions (the extended, source-free evaluative essays), spanning the June 2019, Summer 2022, June 2023 and June 2024 papers. They cover the full spectrum of quality, with board marks ranging from 8 to 30, so the correlation reflects performance across weak, middling and strong responses alike.
No. Because the marks are benchmarked to the board's standards and independently evidenced, an AI mark is a reliable first draft — not a guess. But the teacher stays in the loop for professional judgement, accountability and the things only a teacher sees. The AI removes the repetitive marking; the expertise stays human.