AI Marking for Teachers: Top Marks AI Achieves 0.92 Correlation for AQA GCSE English Language: Paper Two, Question Five (Persuasive Writing)

Study reveals Top Marks AI achieving a 0.92 correlation for AQA GCSE English Language: Paper Two, Question Five (Persuasive Writing), 31 May 2026

"How accurate are your AI GCSE English Language marking tools?" It's one of the most common questions we receive from schools.

So we've been systematically testing how accurate Top Marks AI's GCSE English Language marking tools really are. We think you'll find the results compelling!

This time we're examining performance on AQA English Language — specifically, Paper Two, Question Five: the 40-mark persuasive and transactional writing task. It's worth pausing on that, because Q5 is the single largest and most demanding piece of extended writing on the paper — and extended writing is precisely the kind of open-ended task that markers, human or machine, find hardest to assess consistently. That makes it one of the most meaningful tests of a marking tool there is.

AQA makes available numerous exemplar essays for its exam papers, and we put our tool to the test using 30 of these exam-board-approved standardisation scripts. The exemplars span a broad spectrum of answer quality (the board scores in our sample range from 9 to 39 out of 40), and they are provided for standardisation purposes, so teachers can see what different levels of response actually look like in practice.

We took these 30 essays and ran them through our dedicated marking tool, then measured the correlation between the official marks the board awarded each essay and the marks Top Marks AI assigned to those same essays.

We used a measurement called the Pearson correlation coefficient. In short:

• A value of 1 would mean perfect correlation — when one marker assigns a high score, the other always does too, and when one assigns a low score, the other always does too.
• A value of 0 means no correlation whatsoever — knowing one marker's score tells you nothing about what the other marker awarded.
• Negative values would mean the markers systematically disagree — when one assigns high scores, the other assigns low scores.
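
For readers who want to reproduce the figure, here is a minimal sketch of the calculation in Python. It assumes the board and AI marks are transcribed exactly from the results table further down this page; the names `pearson`, `board` and `ai` are illustrative and are not part of any Top Marks AI tooling.

```python
from math import sqrt

# Board marks and Top Marks AI marks for the 30 exemplars, in table order.
board = [14, 22, 37, 9, 12, 18, 24, 31, 35, 39,
         26, 9, 14, 18, 26, 29, 38, 22, 35, 35,
         17, 22, 27, 34, 36, 18, 25, 28, 21, 37]
ai = [13.6, 22.1, 33.0, 11.0, 12.2, 18.5, 21.3, 22.8, 30.5, 32.6,
      25.2, 12.7, 14.5, 23.5, 33.3, 33.1, 35.3, 28.0, 36.8, 35.2,
      17.3, 22.4, 25.0, 35.0, 36.0, 25.6, 27.1, 27.7, 21.5, 36.3]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two markers move together, and how much each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)


r = pearson(board, ai)
print(f"Pearson r = {r:.2f}")  # prints "Pearson r = 0.92"
```

Feeding the same list in twice, as in `pearson(board, board)`, gives 1 (up to floating-point rounding), which matches the "perfect correlation" case described in the first bullet.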

For context, how do humans perform?

What sort of correlation do experienced human markers achieve when marking essays already marked by a lead examiner?

Cambridge Assessment conducted a rigorous study to measure precisely this. 200 GCSE English scripts — which had already been marked by a chief examiner — were sent to a team of experienced human markers. These experienced markers were not told what the chief examiner had given these scripts, nor were they shown any annotations.

The Pearson correlation coefficient between the marks these experienced examiners gave and the marks the chief examiner gave was just below 0.7. This indicates a positive correlation, though one that is far from perfect. If you are interested, you can find the study here.

How did Top Marks AI perform?

Our system demonstrated a correlation of 0.92 — a very strong positive correlation that comfortably outperforms the experienced human markers in the Cambridge study, and a particularly notable result given that this is the hardest writing task on the paper to mark consistently. (Top Marks AI was also not privy to the "correct marks" or any annotations.)

AQA's tolerance for this 40-mark question is ±4 marks. Of the 30 marks we gave, 22 (73%) fell within that tolerance of the board's mark.
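
The tolerance figure can be checked with a short count. This is a sketch only: the `differences` list is copied from the Difference column of the results table below, and it assumes that a difference of exactly 4 marks counts as inside the tolerance (one script sits exactly on that boundary).

```python
# Differences (Top Marks AI mark minus board mark), in table order.
differences = [-0.4, 0.1, -4.0, 2.0, 0.2, 0.5, -2.7, -8.2, -4.5, -6.4,
               -0.8, 3.7, 0.5, 5.5, 7.3, 4.1, -2.7, 6.0, 1.8, 0.2,
               0.3, 0.4, -2.0, 1.0, 0.0, 7.6, 2.1, -0.3, 0.5, -0.7]

TOLERANCE = 4  # marks either side of the board's mark

# Assumption: the boundary is inclusive, so exactly 4 marks out still counts.
within = sum(1 for d in differences if abs(d) <= TOLERANCE)
share = within / len(differences)

print(f"{within} of {len(differences)} scripts within tolerance ({share:.0%})")
# prints "22 of 30 scripts within tolerance (73%)"
```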

Another key metric is the Mean Absolute Error (MAE), for which our system scored 2.55. On average, the AI differed from the board by just 2.55 marks, only 6.4% of the 40-mark total. By contrast, in that same Cambridge study, experienced examiners marking a 40-mark question showed a Mean Absolute Error of 5.64 marks, or 14.1% of the total. In other words, Top Marks AI's average error was less than half that of experienced human markers on a comparable task.
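
As a rough illustration of how the Mean Absolute Error is worked out, the sketch below ignores the sign of each difference and averages what is left. The `differences` list is copied from the results table below, and `HUMAN_MAE` is simply the 5.64 figure quoted above, not a value recomputed from the Cambridge data.

```python
# Differences (Top Marks AI mark minus board mark), in table order.
differences = [-0.4, 0.1, -4.0, 2.0, 0.2, 0.5, -2.7, -8.2, -4.5, -6.4,
               -0.8, 3.7, 0.5, 5.5, 7.3, 4.1, -2.7, 6.0, 1.8, 0.2,
               0.3, 0.4, -2.0, 1.0, 0.0, 7.6, 2.1, -0.3, 0.5, -0.7]

MAX_MARK = 40
HUMAN_MAE = 5.64  # quoted in the text above for the Cambridge study

# Drop the sign of each difference, then take the ordinary average.
mae = sum(abs(d) for d in differences) / len(differences)

print(f"MAE = {mae:.2f} marks ({mae / MAX_MARK:.1%} of {MAX_MARK})")
# prints "MAE = 2.55 marks (6.4% of 40)"
print(f"Less than half the quoted human MAE: {mae < HUMAN_MAE / 2}")
# prints "Less than half the quoted human MAE: True"
```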

It is also worth noting that our marking showed almost no systematic bias: the average signed difference (AI mark minus board mark) was just +0.37 of a mark, so the tool is not quietly over-generous or over-harsh. On average, it sits very close to the board's standard.
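
The signed average differs from the absolute one in that over-marks and under-marks are allowed to cancel each other out. A small sketch, again using differences copied from the results table below (variable names are illustrative):

```python
# Differences (Top Marks AI mark minus board mark), in table order.
differences = [-0.4, 0.1, -4.0, 2.0, 0.2, 0.5, -2.7, -8.2, -4.5, -6.4,
               -0.8, 3.7, 0.5, 5.5, 7.3, 4.1, -2.7, 6.0, 1.8, 0.2,
               0.3, 0.4, -2.0, 1.0, 0.0, 7.6, 2.1, -0.3, 0.5, -0.7]

# Keep the signs this time, so positive and negative differences offset.
bias = sum(differences) / len(differences)

above = sum(1 for d in differences if d > 0)  # AI marked higher than the board
below = sum(1 for d in differences if d < 0)  # AI marked lower than the board

print(f"Average signed difference = {bias:+.2f} marks")
# prints "Average signed difference = +0.37 marks"
print(f"AI above board on {above} scripts, below on {below}")
# prints "AI above board on 18 scripts, below on 11"
```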

We don't claim that Top Marks AI is infallible, but when it does get things wrong, just how bad is it? For that, we turn to the Root Mean Square Error (RMSE), which weights large errors more heavily than small ones. Each error is squared, the squares are averaged, and the square root of that average is taken. When you square the number 1 you still get 1, and squaring 2 only reaches 4, but square 5 and you are suddenly all the way up at 25. That's how RMSE works: a handful of large errors pushes it up far more than many small ones.

Top Marks AI's Root Mean Square Error was 3.59, meaning that even when larger errors occur they remain small relative to the 40-mark scale. The single largest difference across all 30 scripts was 8.2 marks, with the great majority clustering far more tightly than that.
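
To make the squaring step concrete, here is a short sketch of the RMSE calculation. The differences are copied from the results table below; `big_share` is an illustrative extra, not a metric reported above, showing how much of the total squared error comes from the differences larger than 4 marks.

```python
from math import sqrt

# Differences (Top Marks AI mark minus board mark), in table order.
differences = [-0.4, 0.1, -4.0, 2.0, 0.2, 0.5, -2.7, -8.2, -4.5, -6.4,
               -0.8, 3.7, 0.5, 5.5, 7.3, 4.1, -2.7, 6.0, 1.8, 0.2,
               0.3, 0.4, -2.0, 1.0, 0.0, 7.6, 2.1, -0.3, 0.5, -0.7]

# Square each difference, average the squares, then take the square root.
squared = [d * d for d in differences]
rmse = sqrt(sum(squared) / len(squared))

# The single difference furthest from zero, sign preserved.
largest = max(differences, key=abs)

# Fraction of the total squared error contributed by differences over 4 marks.
big_share = sum(s for d, s in zip(differences, squared) if abs(d) > 4) / sum(squared)

print(f"RMSE = {rmse:.2f} marks")  # prints "RMSE = 3.59 marks"
print(f"Largest single difference = {largest:+.1f} marks")
print(f"Share of squared error from differences over 4 marks = {big_share:.0%}")
```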

You can see the full side-by-side board and AI scores below.

Essay ID	Board Score	Top Marks AI Score	Difference (AI minus Board)
Exem AQ-EL2-Q5 S 1	14.0	13.6	-0.4
Exem AQ-EL2-Q5 S 2	22.0	22.1	+0.1
Exem AQ-EL2-Q5 S 3	37.0	33.0	-4.0
Exem AQ-EL2-Q5 S 4	9.0	11.0	+2.0
Exem AQ-EL2-Q5 S 5	12.0	12.2	+0.2
Exem AQ-EL2-Q5 S 6	18.0	18.5	+0.5
Exem AQ-EL2-Q5 S 7	24.0	21.3	-2.7
Exem AQ-EL2-Q5 S 8	31.0	22.8	-8.2
Exem AQ-EL2-Q5 S 9	35.0	30.5	-4.5
Exem AQ-EL2-Q5 S 10	39.0	32.6	-6.4
Exem AQ-EL2-Q5 S 11	26.0	25.2	-0.8
Exem AQ-EL2-Q5 S 12	9.0	12.7	+3.7
Exem AQ-EL2-Q5 S 13	14.0	14.5	+0.5
Exem AQ-EL2-Q5 S 14	18.0	23.5	+5.5
Exem AQ-EL2-Q5 S 15	26.0	33.3	+7.3
Exem AQ-EL2-Q5 S 16	29.0	33.1	+4.1
Exem AQ-EL2-Q5 S 17	38.0	35.3	-2.7
Exem AQ-EL2-Q5 S 18	22.0	28.0	+6.0
Exem AQ-EL2-Q5 S 19	35.0	36.8	+1.8
Exem AQ-EL2-Q5 S 20	35.0	35.2	+0.2
Exem AQ-EL2-Q5 S 21	17.0	17.3	+0.3
Exem AQ-EL2-Q5 S 22	22.0	22.4	+0.4
Exem AQ-EL2-Q5 S 23	27.0	25.0	-2.0
Exem AQ-EL2-Q5 S 24	34.0	35.0	+1.0
Exem AQ-EL2-Q5 S 25	36.0	36.0	+0.0
Exem AQ-EL2-Q5 S 26	18.0	25.6	+7.6
Exem AQ-EL2-Q5 S 27	25.0	27.1	+2.1
Exem AQ-EL2-Q5 S 28	28.0	27.7	-0.3
Exem AQ-EL2-Q5 S 29	21.0	21.5	+0.5
Exem AQ-EL2-Q5 S 30	37.0	36.3	-0.7

Can I see a graph to help me visualise this?

Absolutely.

First, here's a scatter graph to show you what a theoretical perfect correlation of 1 would look like:

Now, let's look at the real-life graph, drawn from the data above:

Scatter graph of exam board marks against Top Marks AI marks for AQA GCSE English Language Paper 2, Question 5

On the horizontal axis is the mark given by the exam board; on the vertical, the mark given by Top Marks AI. Each dot is an essay, and the dashed line shows perfect agreement. You can see how closely the points hug that line — the signature of a very strong correlation.

