AI Marking for A Level Politics: 0.84 Correlation on the Edexcel 30-Mark (No Source) Question

"How reliable is your A-Level Politics AI marking system?" We hear this question constantly. So we put our tool to the test on the Edexcel A Level Politics 30-mark "no source" essay — the longer, source-free evaluative questions — using 51 of the exam board's own standardisation exemplars. Here is exactly how accurate it was, with every script's marks laid out in full.

Key Takeaways
  1. Across 51 official Edexcel exemplar essays, Top Marks AI achieved a Pearson correlation of 0.84 with the marks the exam board awarded.
  2. 68.6% of our marks landed within 3 marks of the board, and almost half (45%) within 2 marks.
  3. The average absolute difference was just 2.55 marks, about 8.5% of the 30-mark scale.
  4. That comfortably beats the benchmark for experienced human markers, who correlate at only around 0.65 with the chief examiner and place just ~45% of marks within tolerance (Fowles, 2009).
  5. This is the companion to our study on the Edexcel source-based 30-marker — and shows the accuracy holds up on the harder, source-free evaluative questions too.

What we tested

This study focuses on Edexcel A Level Politics — specifically the 30-mark "no source" questions. These are the extended, evaluative essay questions (the "evaluate" and "to what extent" essays), where students build and sustain an argument without a source extract to lean on. They are demanding to write and demanding to mark.

Edexcel publishes a large bank of exemplar essays for these papers, and we used 51 of those exam-board-approved standardisation materials. These are the official scripts used to show teachers the full spectrum of answer quality, from low-mark responses through to near-perfect ones — the marks in our sample run all the way from 8 to 30.

We ran all 51 essays through our dedicated Edexcel Politics marking tool, then measured the correlation between the official marks the board awarded and the marks Top Marks AI assigned to those same essays. The AI was not shown the official marks or any annotations.

The headline measure is the Pearson correlation coefficient. In short:

  • A value of 1 means perfect correlation — when one marker scores high, so does the other, and likewise for low scores.
  • A value of 0 means no correlation — one marker's score tells you nothing about the other's.
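  • Negative values mean the markers pull in opposite directions: when one scores high, the other tends to score low, with -1 being a perfect inverse relationship.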
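To make the three cases concrete, here is a minimal Python sketch of the Pearson calculation on made-up marks for five essays. The marks and the function name `pearson` are illustrative assumptions, not data or code from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of marks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each marker's own mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical marks for five essays (NOT taken from the study).
board = [10, 15, 20, 25, 30]

agrees = [12, 17, 22, 27, 32]     # same ordering, always 2 marks higher
inverted = [30, 25, 20, 15, 10]   # high where the board is low
unrelated = [20, 10, 30, 10, 20]  # no linear relationship with the board

print(round(pearson(board, agrees), 2))     # 1.0
print(round(pearson(board, inverted), 2))   # -1.0
print(round(pearson(board, unrelated), 2))  # 0.0
```

The first case also shows a limitation worth keeping in mind: a marker who is consistently 2 marks higher than the board still correlates perfectly. That is why the mark-difference measures reported later in this article matter alongside the correlation.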

For context, how do humans perform?

What sort of correlation do experienced human markers achieve when marking scripts already marked by a lead examiner? A landmark study by Fowles (2009) set out to measure exactly that, comparing the marks of experienced GCSE examiners against the chief examiner's marks on the same scripts.

The findings are sobering. Those experienced examiners correlated with the chief examiner at only around 0.65 — a positive correlation, but far from perfect — and just ~45% of their marks fell within the exam board's tolerance of the definitive mark. Marking, it turns out, is far less consistent than most people assume, even among seasoned professionals.

How did Top Marks AI perform?

Across all 51 essays, Top Marks recorded a correlation of 0.84 (0.835 to three decimal places) — a very strong positive correlation that comfortably outperforms the ~0.65 achieved by experienced human markers (Fowles, 2009), and on a harder, source-free question type.

  • 0.84: Pearson correlation with the exam board
  • 68.6%: share of marks within 3 of the board (35 of 51)
  • 2.55: mean absolute error, about 8.5% on a 30-mark scale
  • 3.23: root mean square error on the 30-mark scale

The Mean Absolute Error (MAE) of 2.55 means that, on average, the AI's mark differed from the board's by 2.55 marks in one direction or the other, or about 8.5% of the 30-mark scale. The within-tolerance figure tells the same story: Top Marks AI landed 68.6% of its marks within 3 marks of the board, against the ~45% that experienced examiners managed in Fowles (2009). On the measure exam boards care about most, how often a marker falls inside tolerance, our tool was markedly more consistent than the human benchmark. One caveat: Fowles studied GCSE examiners working to their own board's tolerance, so the two figures are indicative rather than strictly like-for-like.

We don't claim Top Marks is infallible, so when it does get things wrong, how bad is it? For that we turn to the Root Mean Square Error (RMSE), which squares each difference before averaging and so penalises large errors far more heavily than small ones: square 2 and you only reach 4, but square 5 and you jump all the way to 25. Top Marks AI's RMSE was 3.23, only modestly above the MAE of 2.55, which indicates that the results are not dominated by a handful of large misses. The largest single difference in the table below is 7.5 marks.
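As a rough check, the error figures above can be recomputed from the Difference column of the results table in this article. The following is a minimal Python sketch, not the tooling used in the study; the function names `error_summary` and `count_within` are illustrative choices.

```python
import math

# Signed differences (Top Marks AI minus board), copied in order from the
# Difference column of the article's results table (51 essays).
differences = [
    0.5, -2.1, -2.0, -2.0, -0.2, -6.0, 5.3, -1.0, -5.5, -7.3, -2.3, 3.0,
    -5.6, -2.9, -2.5, -1.7, -0.8, 2.4, -0.3, 4.1, 5.4, -3.3, 0.0, 1.5,
    -3.9, -1.0, 0.1, -2.3, -0.3, 0.0, 7.5, 1.0, 6.4, -0.6, -1.5, -1.6,
    1.3, 0.0, 3.6, 2.4, 0.9, 3.5, -4.0, -2.3, -2.6, -2.1, -2.1, -5.0,
    4.4, 0.7, -1.0,
]

def error_summary(diffs):
    """Return (mean absolute error, root mean square error) for signed differences."""
    n = len(diffs)
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return mae, rmse

def count_within(diffs, tolerance):
    """Count how many differences are no larger than `tolerance` in absolute size."""
    return sum(1 for d in diffs if abs(d) <= tolerance)

mae, rmse = error_summary(differences)
within_3 = count_within(differences, 3.0)
within_2 = count_within(differences, 2.0)
n = len(differences)

print(f"MAE:  {mae:.2f} marks ({mae / 30:.1%} of a 30-mark scale)")
print(f"RMSE: {rmse:.2f} marks")
print(f"Within 3 marks: {within_3} of {n} ({within_3 / n:.1%})")
print(f"Within 2 marks: {within_2} of {n} ({within_2 / n:.1%})")
```

Run on the table's values, this gives an MAE of about 2.55, an RMSE of about 3.23, 35 of 51 marks within 3 of the board, and 23 of 51 within 2, matching the figures quoted in this article.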

You can see the full side-by-side human and AI scores for all 51 essays below. A negative difference means we marked below the board; a positive difference means we marked above it.

Essay ID | Board Score | Top Marks AI Score | Difference
1-June 2019 Q2a - 30 Mark, No Source 1 (-) (13).docx | 13.0 | 13.5 | +0.5
1-June 2019 Q2a - 30 Mark, No Source 2 (-) (27.5).docx | 27.5 | 25.4 | -2.1
1-June 2019 Q2b - 30 Mark, No Source 1 (-) (25).docx | 25.0 | 23.0 | -2.0
1-June 2019 Q2b - 30 Mark, No Source 2 (-) (30).docx | 30.0 | 28.0 | -2.0
1-June 2019 Q2b - 30 Mark, No Source 3 (-) (18).docx | 18.0 | 17.8 | -0.2
1-June 2023 Q2a - 30 Mark, No Source 1 (-) (30).docx | 30.0 | 24.0 | -6.0
1-June 2023 Q2a - 30 Mark, No Source 1 (-) (13).docx | 13.0 | 18.3 | +5.3
1-June 2023 Q2b - 30 Mark, No Source 2 (-) (25).docx | 25.0 | 24.0 | -1.0
1-June 2023 Q2b - 30 Mark, No Source 3 (-) (27.5).docx | 27.5 | 22.0 | -5.5
1-June 2024 Q2a - 30 Mark, No Source 1 (-) (17).docx | 17.0 | 9.7 | -7.3
1-June 2024 Q2b - 30 Mark, No Source 1 (-) (28).docx | 28.0 | 25.7 | -2.3
1-June 2024 Q2b - 30 Mark, No Source 2 (-) (14.5).docx | 14.5 | 17.5 | +3.0
2-June 2019 Q2a - 30 Mark, No Source 1 (-) (29).docx | 29.0 | 23.4 | -5.6
2-June 2019 Q2b - 30 Mark, No Source 1 (-) (30).docx | 30.0 | 27.1 | -2.9
2-June 2023 Q2a - 30 Mark, No Source 1 (-) (27.5).docx | 27.5 | 25.0 | -2.5
2-June 2023 Q2a - 30 Mark, No Source 2 (-) (27.5).docx | 27.5 | 25.8 | -1.7
2-June 2023 Q2b - 30 Mark, No Source 1 (-) (27.5).docx | 27.5 | 26.7 | -0.8
2-June 2024 Q2a - 30 Mark, No Source 1 (-) (26).docx | 26.0 | 28.4 | +2.4
2-June 2024 Q2b - 30 Mark, No Source 1 (-) (25).docx | 25.0 | 24.7 | -0.3
3a-June 2019 Q3a - 30 Marks, No Source 1 (-) (20).docx | 20.0 | 24.1 | +4.1
3a-June 2019 Q3a - 30 Marks, No Source 2 (-) (14).docx | 14.0 | 19.4 | +5.4
3a-June 2019 Q3a - 30 Marks, No Source 3 (-) (30).docx | 30.0 | 26.7 | -3.3
3a-June 2019 Q3b - 30 Marks, No Source 1 (-) (17).docx | 17.0 | 17.0 | 0.0
3a-June 2019 Q3b - 30 Marks, No Source 2 (-) (20).docx | 20.0 | 21.5 | +1.5
3a-June 2019 Q3b - 30 Marks, No Source 3 (-) (26).docx | 26.0 | 22.1 | -3.9
3a-June 2019 Q3c - 30 Marks, No Source 1 (-) (25).docx | 25.0 | 24.0 | -1.0
3a-June 2019 Q3c - 30 Marks, No Source 2 (-) (20).docx | 20.0 | 20.1 | +0.1
3A-June 2023 Q3a -30 Marks, No Source 1 (-) (28).docx | 28.0 | 25.7 | -2.3
3A-June 2023 Q3b -30 Marks, No Source 1 (-) (26).docx | 26.0 | 25.7 | -0.3
3A-June 2023 Q3c -30 Marks, No Source 1 (-) (25).docx | 25.0 | 25.0 | 0.0
3A-June 2024 Q3a -30 Marks, No Source 1 (-) (18).docx | 18.0 | 25.5 | +7.5
3A-June 2024 Q3a -30 Marks, No Source 2 (-) (28).docx | 28.0 | 29.0 | +1.0
3A-June 2024 Q3b -30 Marks, No Source 1 (-) (15).docx | 15.0 | 21.4 | +6.4
3A-June 2024 Q3b -30 Marks, No Source 2 (-) (25).docx | 25.0 | 24.4 | -0.6
3A-June 2024 Q3c -30 Marks, No Source 1 (-) (19).docx | 19.0 | 17.5 | -1.5
3A-June 2024 Q3c -30 Marks, No Source 2 (-) (27).docx | 27.0 | 25.4 | -1.6
3B - June 2024 Q3a -30 Marks, No Source 1 (-) (27).docx | 27.0 | 28.3 | +1.3
3B - June 2024 Q3a -30 Marks, No Source 2 (-) (8).docx | 8.0 | 8.0 | 0.0
3B - June 2024 Q3b -30 Marks, No Source 1 (-) (22).docx | 22.0 | 25.6 | +3.6
3B - June 2024 Q3b -30 Marks, No Source 2 (-) (26).docx | 26.0 | 28.4 | +2.4
3B - June 2024 Q3c -30 Marks, No Source 1 (-) (25).docx | 25.0 | 25.9 | +0.9
3B - June 2024 Q3c -30 Marks, No Source 2 (-) (10).docx | 10.0 | 13.5 | +3.5
3b-June 2019 Q3a - 30 Marks, No Source 1 (-) (30).docx | 30.0 | 26.0 | -4.0
3b-June 2019 Q3a - 30 Marks, No Source 2 (-) (28).docx | 28.0 | 25.7 | -2.3
3b-June 2019 Q3b - 30 Marks, No Source 1 (-) (28).docx | 28.0 | 25.4 | -2.6
3b-June 2019 Q3b - 30 Marks, No Source 2 (-) (28).docx | 28.0 | 25.9 | -2.1
3b-June 2019 Q3c - 30 Marks, No Source 1 (-) (27).docx | 27.0 | 24.9 | -2.1
3b-June 2019 Q3c - 30 Marks, No Source 2 (-) (30).docx | 30.0 | 25.0 | -5.0
Summer 2022 2b - 30 mark, no source (-) (22).docx | 22.0 | 26.4 | +4.4
Summer 2022 3a - No Source 1 (-) (29).docx | 29.0 | 29.7 | +0.7
Summer 2022 3b - No Source 1 (-) (30).docx | 30.0 | 29.0 | -1.0

Can I see this as a graph?

Absolutely. First, here's a scatter graph showing what a theoretical perfect correlation of 1 would look like — every essay sitting exactly on the diagonal:

[Figure: Perfect correlation (y = x) reference graph, showing a theoretical perfect correlation of 1.0.]

Now the real-life graph, drawn from the data above:

[Figure: Correlation graph for the Edexcel A Level Politics 30-mark no-source question. Board score (horizontal) vs Top Marks AI score (vertical) for all 51 Edexcel exemplar essays.]

On the horizontal axis is the mark given by the exam board; on the vertical, the mark given by Top Marks AI. Each dot is one essay, and you can see how closely the cloud of points hugs the line of perfect correlation.

This is the second of a pair. Our companion study looks at the Edexcel A Level Politics source-based 30-marker, where Top Marks AI scored an even higher 0.89 correlation. Together they show the accuracy holds across both the source-based and source-free 30-mark question types.

See AI Marking for A Level Politics

Book a demo and we'll show you how Top Marks AI marks Edexcel A Level Politics — benchmarked to the board's own standards, with the evidence to back it up.

Frequently Asked Questions

How accurate is Top Marks AI on the Edexcel A Level Politics 30-mark essay?

Across 51 of Edexcel's own standardisation exemplars for the 30-mark "no source" questions, Top Marks AI correlated with the exam board's marks at 0.84 (Pearson). The average difference was 2.55 marks — about 8.5% on a 30-mark scale — and 68.6% of our marks fell within 3 marks of the board.

How does that compare to human markers?

Research into marker reliability — most notably Fowles (2009) — found that experienced GCSE examiners correlate with the chief examiner at only around 0.65, with just ~45% of their marks falling within the board's tolerance. Top Marks AI's 0.84 correlation and 68.6% of marks within 3 of the board are stronger on both measures — and on a question type with no source extract to anchor the marking.

Which essays were used in the study?

We used 51 official Edexcel exemplar essays for the 30-mark "no source" questions (the extended, source-free evaluative essays), spanning the June 2019, Summer 2022, June 2023 and June 2024 papers. They cover the full spectrum of quality, with board marks ranging from 8 to 30, so the correlation reflects performance across weak, middling and strong responses alike.

Does this mean AI should replace the teacher?

No. Because the marks are benchmarked to the board's standards and independently evidenced, an AI mark is a reliable first draft — not a guess. But the teacher stays in the loop for professional judgement, accountability and the things only a teacher sees. The AI removes the repetitive marking; the expertise stays human.

Richard Davis

Founder & CEO, Top Marks AI

Richard read English at UCL and Cambridge before founding Accolade Press, a boutique academic publisher. A lifelong educator and the author of four bestselling thriller novels, he founded Top Marks AI to bring rigorous, exam-board-calibrated marking to every school in the UK.
