AI Marking for Teachers: Top Marks AI Achieves 0.97 Correlation for AQA GCSE English Literature: Nineteenth Century Prose

Study reveals Top Marks AI achieving 0.97 correlation for AQA GCSE English Literature: Nineteenth Century Prose, February 11, 2026

"Can we really trust AI to mark GCSE English Literature essays?" We encounter this question regularly when speaking with teachers and educational institutions.

So we've run comprehensive evaluations to demonstrate just how accurate Top Marks' GCSE English Literature AI marking tools really are. The results speak for themselves!

In this study, we're analysing AQA English Literature -- specifically, the AQA GCSE English Literature: Nineteenth Century Prose question.

AQA makes available numerous exemplar essays for its exam papers, and we've put our tool to the test using 36 of these exam-board-approved standardisation materials. They are the official materials that show teachers what answers look like across a broad spectrum of quality.

We ran all 36 essays through our dedicated marking tool. Then we measured the correlation between the official marks the board awarded each essay and the marks Top Marks AI assigned to those same essays.

We employed the Pearson correlation coefficient. In short:

• A value of 1 would mean perfect correlation -- when one marker assigns a high score, the other always does too, and when one assigns a low score, the other always does too.
• A value of 0 means no correlation whatsoever -- knowing one marker's score tells you nothing about what the other marker awarded.
• Negative values, down to a minimum of -1, would mean the markers systematically disagree -- when one assigns high scores, the other assigns low scores.
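To make the cases above concrete, here is a minimal Python sketch of the Pearson calculation. The marks for "marker A" and "marker B" are hypothetical numbers chosen purely for illustration; they are not taken from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two sets of marks vary together, and how each varies on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical marks from "marker A", for illustration only.
marker_a = [5, 10, 15, 20, 25]

# Marker B always awards exactly 2 marks more: still a perfect correlation.
agree = pearson(marker_a, [m + 2 for m in marker_a])

# Marker B reverses the order of the essays: a perfect negative correlation.
disagree = pearson(marker_a, [30 - m for m in marker_a])

print(round(agree, 6), round(disagree, 6))  # prints: 1.0 -1.0
```

Note that the first example scores a perfect 1 even though the two markers never award the same mark. Correlation captures whether two markers move together, not whether their marks match, which is why it is worth reading alongside error measures such as the average difference in marks.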

For context, how do humans perform?

What sort of correlation do experienced human markers achieve when marking essays that have also been marked by a lead examiner?

In an authoritative academic paper that AQA has cited before Parliament, a set of scripts was marked first by AQA human markers and then by the chief examiner. The chief examiner was not told what marks the original examiners had given these scripts, nor were they shown any annotations.

The Pearson correlation coefficient between the scores these experienced examiners gave and those of the chief examiner was 0.67. This indicates a positive correlation, though far from perfect. If you are interested, you can find the study here.

And this study is not a fluke. Cambridge Assessment took 200 scripts that had been given an initial score by OCR markers, then sent these same scripts for blind marking to experienced human examiners. The correlation between the two sets of scores was below 0.7. You can find the study here.

How did Top Marks AI perform?

Our system demonstrated a correlation of 0.97 -- an incredibly strong positive correlation that far outperforms the experienced human markers in the Cambridge study. (Top Marks AI was also not privy to the "correct marks" or any annotations.)

Moreover, 91.67% of the marks we gave (33 of the 36 essays) were within 3 marks of the mark awarded by the exam board.

Another interesting metric is the Mean Absolute Error, for which our system scored 1.53. On average, the AI differed from the board by 1.53 marks, which is comfortably within 3 marks. As a percentage of the 30 marks available, that's an average difference of 5.1%.

In contrast, in that same Cambridge study, experienced examiners marking a 40-mark question showed a Mean Absolute Error of 5.64 marks, a difference of 14.1%. These results highlight the exceptional accuracy of Top Marks AI compared to traditional marking practices.
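Because the two questions carry different totals (30 marks and 40 marks), the errors are only comparable once each is expressed as a share of the marks available. A short sketch of that arithmetic follows; the function name is our own, chosen for illustration.

```python
def error_as_percentage(mean_abs_error, max_marks):
    """Express an average error in marks as a percentage of the marks available."""
    return 100 * mean_abs_error / max_marks

# Figures quoted in the text above.
top_marks_pct = error_as_percentage(1.53, 30)   # Top Marks AI, 30-mark question
cambridge_pct = error_as_percentage(5.64, 40)   # human examiners, 40-mark question

print(f"{top_marks_pct:.1f}% vs {cambridge_pct:.1f}%")  # prints: 5.1% vs 14.1%
```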

We don't claim that Top Marks is infallible, but when it does get things wrong, just how bad is it? Let's turn to the Root Mean Square Error (RMSE) to find out. RMSE is an error measure that gives extra weight to large mistakes. When you square the number 1, you still get 1, and when you square 2, you only make a small jump to 4. But square 5, and you're suddenly all the way up at 25. That's how RMSE works -- it (essentially!) highlights large errors by squaring them.

Top Marks AI's Root Mean Square Error was 1.95, meaning even when larger errors occur, they remain remarkably small relative to the 30-mark scale.
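To see why RMSE is a useful companion to the Mean Absolute Error, consider this small sketch. Both sets of errors are hypothetical, invented only to show the effect of squaring.

```python
from math import sqrt

def mean_absolute_error(errors):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(e) for e in errors) / len(errors)

def root_mean_square_error(errors):
    """Square each error, average the squares, then take the square root."""
    return sqrt(sum(e * e for e in errors) / len(errors))

# Two hypothetical markers whose errors add up to the same total (8 marks).
steady = [2, 2, 2, 2]    # always 2 marks out
erratic = [0, 0, 0, 8]   # usually spot on, but once badly wrong

for name, errors in (("steady", steady), ("erratic", erratic)):
    print(name, mean_absolute_error(errors), root_mean_square_error(errors))
# prints:
# steady 2.0 2.0
# erratic 2.0 4.0
```

The two markers have the same Mean Absolute Error, but the one with a single large miss has double the RMSE. An RMSE that sits close to the Mean Absolute Error therefore suggests the errors are fairly even in size rather than dominated by a few big misses.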

You can see the full side-by-side human and AI scores below.

Essay ID	Board Score	Top Marks AI Score	Difference
Exem AQ-19c S5 (?) (8).docx	8.0	7.1	-0.9
Exem AQ-19c S15 (?) (30).docx	30.0	30.0	+0.0
Exem AQ-19c S2 (?) (8).docx	8.0	7.7	-0.3
Exem AQ-19c S1 (?) (21).docx	21.0	23.6	+2.6
Exem AQ-19c S6 (?) (9).docx	9.0	8.3	-0.7
Exem AQ-19c S3 (?) (30).docx	30.0	30.0	+0.0
Exem AQ-19c S16 (?) (30).docx	30.0	30.0	+0.0
Exem AQ-19c S11 (?) (6).docx	6.0	6.8	+0.8
Exem AQ-19c S17 (?) (17).docx	17.0	14.1	-2.9
Exem AQ-19c S7 (?) (17).docx	17.0	17.3	+0.3
Exem AQ-19c S4 (?) (11).docx	11.0	8.0	-3.0
Exem AQ-19c S12 (?) (20).docx	20.0	17.9	-2.1
Exem AQ-19c S18 (?) (4).docx	4.0	3.9	-0.1
Exem AQ-19c S8 (?) (17).docx	17.0	18.1	+1.1
Exem AQ-19c S19 (?) (11).docx	11.0	7.3	-3.7
Exem AQ-19c S13 (?) (10).docx	10.0	12.2	+2.2
Exem AQ-19c S9 (?) (18).docx	18.0	19.1	+1.1
Exem AQ-19c S14 (?) (17).docx	17.0	15.1	-1.9
Exem AQ-19c S23 (?) (20).docx	20.0	18.7	-1.3
Exem AQ-19c S21 (?) (5).docx	5.0	5.4	+0.4
Exem AQ-19c S10 (?) (16).docx	16.0	14.7	-1.3
Exem AQ-19c S22 (?) (27).docx	27.0	28.5	+1.5
Exem AQ-19c S20 (?) (23).docx	23.0	21.4	-1.6
Exem AQ-19c S32 (?) (25).docx	25.0	23.6	-1.4
Exem AQ-19c S31 (?) (22).docx	22.0	24.2	+2.2
Exem AQ-19c S33 (?) (27).docx	27.0	24.6	-2.4
Exem AQ-19c S35 (?) (18).docx	18.0	23.3	+5.3
Exem AQ-19c S30 (?) (17).docx	17.0	17.5	+0.5
Exem AQ-19c S34 (?) (23).docx	23.0	19.3	-3.7
Exem AQ-19c S28 (?) (20).docx	20.0	21.6	+1.6
Exem AQ-19c S36 (?) (27).docx	27.0	25.9	-1.1
Exem AQ-19c S27 (?) (27).docx	27.0	30.0	+3.0
Exem AQ-19c S26 (?) (30).docx	30.0	30.0	+0.0
Exem AQ-19c S25 (?) (18).docx	18.0	17.2	-0.8
Exem AQ-19c S29 (?) (16).docx	16.0	17.4	+1.4
Exem AQ-19c S24 (?) (30).docx	30.0	28.2	-1.8
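
If you would like to check the headline figures yourself, the sketch below recomputes them from the table above. It is an illustrative reconstruction, not the code used in the study, and it makes two assumptions: the AI marks are used as shown (to one decimal place), and "within 3 marks" includes a difference of exactly 3.

```python
from math import sqrt

# (board mark, Top Marks AI mark) for each of the 36 essays, in table order.
SCORES = [
    (8.0, 7.1), (30.0, 30.0), (8.0, 7.7), (21.0, 23.6), (9.0, 8.3), (30.0, 30.0),
    (30.0, 30.0), (6.0, 6.8), (17.0, 14.1), (17.0, 17.3), (11.0, 8.0), (20.0, 17.9),
    (4.0, 3.9), (17.0, 18.1), (11.0, 7.3), (10.0, 12.2), (18.0, 19.1), (17.0, 15.1),
    (20.0, 18.7), (5.0, 5.4), (16.0, 14.7), (27.0, 28.5), (23.0, 21.4), (25.0, 23.6),
    (22.0, 24.2), (27.0, 24.6), (18.0, 23.3), (17.0, 17.5), (23.0, 19.3), (20.0, 21.6),
    (27.0, 25.9), (27.0, 30.0), (30.0, 30.0), (18.0, 17.2), (16.0, 17.4), (30.0, 28.2),
]

board = [b for b, _ in SCORES]
ai = [a for _, a in SCORES]
diffs = [a - b for b, a in SCORES]
n = len(SCORES)

# Pearson correlation between board marks and AI marks.
mean_b = sum(board) / n
mean_a = sum(ai) / n
cov = sum((b - mean_b) * (a - mean_a) for b, a in SCORES)
var_b = sum((b - mean_b) ** 2 for b in board)
var_a = sum((a - mean_a) ** 2 for a in ai)
r = cov / sqrt(var_b * var_a)

# Mean Absolute Error and Root Mean Square Error of the differences.
mae = sum(abs(d) for d in diffs) / n
rmse = sqrt(sum(d * d for d in diffs) / n)

# Essays within 3 marks (inclusive; tiny tolerance guards against float rounding).
within_3 = sum(1 for d in diffs if abs(d) <= 3.0 + 1e-9)

print(f"Pearson r: {r:.2f}")
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}")
print(f"Within 3 marks: {within_3}/{n} = {100 * within_3 / n:.2f}%")
```

Run against the table, this should reproduce the figures quoted earlier to two decimal places: a correlation of 0.97, a Mean Absolute Error of 1.53, a Root Mean Square Error of 1.95, and 33 of 36 essays (91.67%) within 3 marks.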

Can I see a graph to help me visualise this?

Absolutely.

First, here's a scatter graph to show you what a theoretical perfect correlation of 1 would look like:

Now, let's look at the real-life graph, drawn from the data above:

Actual Correlation Graph for AQA GCSE English Literature: Nineteenth Century Prose

On the horizontal axis, we have the mark given by the exam board. On the vertical, the mark given by Top Marks AI. The individual dots are the essays -- each dot's position tells us both the mark given by the exam board and the mark given by Top Marks AI. You can see how closely the real graph resembles the theoretical graph depicting perfect correlation.

Discover how Top Marks AI can revolutionise assessment in education. Contact us at hello@topmarks.ai.
