This 10-mark ‘How does the writer…?’ question asks students to analyse how the writer uses language to achieve effects. Below we share the accuracy of Top Marks AI across 40 Eduqas exemplars.
For questions with a limited mark range like this 10-mark question, we focus on a particularly important metric: Mean Absolute Error (MAE). MAE tells us, on average, by how many marks our AI differs from the exam board. A low MAE means high accuracy.
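To make that concrete, here is a minimal sketch of how MAE is computed. The function name and marks are ours for illustration only; they are not taken from our evaluation data.

```python
# Minimal sketch of Mean Absolute Error (MAE).
# The marks below are illustrative, not our real evaluation data.
def mean_absolute_error(board_marks, ai_marks):
    """Average absolute difference between two lists of marks."""
    return sum(abs(b - a) for b, a in zip(board_marks, ai_marks)) / len(board_marks)

board = [5.0, 7.0, 3.0]  # hypothetical exam-board marks
ai = [5.3, 6.5, 3.3]     # hypothetical AI marks
print(mean_absolute_error(board, ai))  # 0.3666... -- well under one mark
```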
As such, these results demonstrate just how accurate Top Marks' GCSE English Language AI marking tools really are. We think you'll be impressed.
For this analysis we examined scripts from Eduqas English Language, specifically Component 2, Question 2: the 'How does the writer…?' question.
Eduqas makes numerous exemplar essays available for its exam papers, and we put our tool to the test using 40 of those exam-board-approved standardisation materials. The exemplars span the lower to higher bands, providing a realistic spread of answer quality for evaluation.
We ran all 40 essays through our dedicated marking tool, then measured the difference between the official mark the board awarded each essay and the mark Top Marks AI assigned to the same essay.
What level of accuracy do experienced human markers achieve when marking essays already marked by a chief examiner?
Cambridge Assessment conducted a rigorous study to measure precisely this. Two hundred GCSE English scripts, each already marked by a chief examiner, were sent to a team of experienced human markers. The markers were not told what marks the chief examiner had awarded, nor were they shown any annotations.
The Mean Absolute Error (average difference) between the experienced markers and the chief examiner was 5.64 marks on a 40-mark question — that's an average difference of 14.1%. If you are interested, you can find the study here.
Across those 40 scripts, our system achieved a Mean Absolute Error of 0.85 marks on this 10-mark question. That's an average difference of just 8.5%, significantly better than the 14.1% difference the experienced human markers achieved in the Cambridge study.
Moreover, 77.5% of the marks we gave were within 1 mark of the exam board's official mark.
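For the curious, that statistic is just a count. Here is a minimal sketch with hypothetical marks:

```python
# Proportion of AI marks within 1 mark of the board's mark (illustrative data).
board = [3.0, 5.0, 6.0, 7.0]  # hypothetical exam-board marks
ai = [3.3, 5.3, 8.0, 6.5]     # hypothetical AI marks
within_one = sum(abs(b - a) <= 1.0 for b, a in zip(board, ai)) / len(board)
print(f"{within_one:.1%}")  # 75.0%
```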
As an additional measure of accuracy, we also calculated the Pearson correlation coefficient, which was 0.87. This indicates a strong positive relationship between our marks and the exam board's marks. For context, the Cambridge study reports experienced human markers at just under r = 0.70 against the chief examiner on a 40-mark question, so r = 0.87 is materially higher.
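For readers who want the mechanics, here is a self-contained sketch of the Pearson coefficient; in practice a library routine such as scipy.stats.pearsonr computes the same value. The marks are hypothetical.

```python
# Sketch of the Pearson correlation coefficient, computed by hand.
import math

def pearson_r(xs, ys):
    """Pearson r; equivalent to scipy.stats.pearsonr(xs, ys)[0]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfect agreement gives r = 1.0; noisy agreement drops below it.
print(pearson_r([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0
print(pearson_r([1, 2, 3, 4], [1.2, 1.8, 3.4, 3.9]))  # ~0.98
```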
We don't claim that Top Marks is infallible, so when it does get things wrong, how badly does it miss? To find out, let's turn to Root Mean Square Error (RMSE), a measure of the severity of large errors. Because each error is squared before averaging, small errors barely register (1 squared is still 1, and 2 squared is only 4), while large errors balloon (5 squared jumps all the way to 25). RMSE therefore penalises occasional big mistakes far more heavily than consistently small ones.
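Here is a small sketch of that squaring effect, again with made-up marks: two marking profiles with the same MAE can have very different RMSEs.

```python
# Sketch of Root Mean Square Error (RMSE) with made-up marks: squaring
# before averaging makes one large error outweigh several small ones.
import math

def rmse(board_marks, ai_marks):
    n = len(board_marks)
    return math.sqrt(sum((b - a) ** 2 for b, a in zip(board_marks, ai_marks)) / n)

# Both profiles below have an MAE of exactly 1.0 mark...
print(rmse([5, 5, 5, 5], [6, 4, 6, 4]))  # 1.0 -- four steady small errors
print(rmse([5, 5, 5, 5], [5, 5, 5, 9]))  # 2.0 -- one big error dominates
```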
Top Marks AI's Root Mean Square Error was 1.08, meaning even when larger errors occur, they remain remarkably small relative to the 10-mark scale.
You can see the full side-by-side human and AI scores below.
| Essay ID | Board Mark (/10) | Top Marks AI Mark (/10) | Difference |
|---|---|---|---|
| Summer 2022 Component 2 Question Two 1 (-) (3).pdf | 3.0 | 3.3 | +0.3 |
| Summer 2023 Component 2 Question Two 1 (-) (5).pdf | 5.0 | 5.3 | +0.3 |
| Summer 2018 Component 2 Question Two 1 (-) (6).pdf | 6.0 | 8.0 | +2.0 |
| Summer 2017 Component 2 Question Two 1 (-) (7).pdf | 7.0 | 6.5 | -0.5 |
| November 2018 Component 2 Question Two 1 (-) (5).pdf | 5.0 | 6.3 | +1.3 |
| November 2022 Component 2 Question Two 1 (-) (2).pdf | 2.0 | 2.7 | +0.7 |
| November 2023 Component 2 Question Two 1 (-) (2).pdf | 2.0 | 3.0 | +1.0 |
| CPD 2019 7C Question Two 1 (-) (2).pdf | 2.0 | 2.7 | +0.7 |
| CPD 2017 6A Question Two 1 (-) (3).pdf | 3.0 | 2.7 | -0.3 |
| CPD 7C Autumn 2018 Question Two 1 (-) (2.5).pdf | 2.5 | 3.0 | +0.5 |
| November 2017 Component 2 Question Two 1 (-) (4).pdf | 4.0 | 4.7 | +0.7 |
| CPD Autumn 2017 Question Two 1 (-) (6.5).pdf | 6.5 | 8.0 | +1.5 |
| Summer 2023 Component 2 Question Two 2 (-) (1).pdf | 1.0 | 1.3 | +0.3 |
| Summer 2022 Component 2 Question Two 2 (-) (6).pdf | 6.0 | 6.7 | +0.7 |
| Summer 2018 Component 2 Question Two 2 (-) (2).pdf | 2.0 | 5.3 | +3.3 |
| Summer 2017 Component 2 Question Two 2 (-) (5).pdf | 5.0 | 4.0 | -1.0 |
| November 2022 Component 2 Question Two 2 (-) (6).pdf | 6.0 | 4.7 | -1.3 |
| November 2023 Component 2 Question Two 2 (-) (5).pdf | 5.0 | 6.0 | +1.0 |
| November 2018 Component 2 Question Two 2 (-) (7).pdf | 7.0 | 6.0 | -1.0 |
| CPD 2019 7C Question Two 2 (-) (4.5).pdf | 4.5 | 4.3 | -0.2 |
| CPD 2017 6A Question Two 2 (-) (5).pdf | 5.0 | 4.7 | -0.3 |
| CPD Autumn 2017 Question Two 2 (-) (3.5).pdf | 3.5 | 3.2 | -0.3 |
| November 2017 Component 2 Question Two 2 (-) (5).pdf | 5.0 | 5.7 | +0.7 |
| Summer 2023 Component 2 Question Two 3 (-) (7).pdf | 7.0 | 6.0 | -1.0 |
| Summer 2022 Component 2 Question Two 3 (-) (9).pdf | 9.0 | 7.7 | -1.3 |
| Summer 2017 Component 2 Question Two 3 (-) (6).pdf | 6.0 | 4.7 | -1.3 |
| Summer 2018 Component 2 Question Two 3 (-) (4).pdf | 4.0 | 6.5 | +2.5 |
| November 2018 Component 2 Question Two 3 (-) (4).pdf | 4.0 | 4.7 | +0.7 |
| November 2022 Component 2 Question Two 3 (-) (8).pdf | 8.0 | 7.7 | -0.3 |
| November 2023 Component 2 Question Two 3 (-) (7).pdf | 7.0 | 5.0 | -2.0 |
| CPD 2017 6A Question Two 3 (-) (8).pdf | 8.0 | 8.0 | +0.0 |
| CPD 2019 7C Question Two 3 (-) (7.5).pdf | 7.5 | 8.0 | +0.5 |
| CPD 7C Autumn 2018 Question Two 3 (-) (7.5).pdf | 7.5 | 7.3 | -0.2 |
| CPD Autumn 2017 Question Two 3 (-) (8.5).pdf | 8.5 | 7.7 | -0.8 |
| November 2017 Component 2 Question Two 3 (-) (7).pdf | 7.0 | 7.7 | +0.7 |
| Summer 2018 Component 2 Question Two 4 (-) (10).pdf | 10.0 | 9.3 | -0.7 |
| CPD 2017 6A Question Two 4 (-) (4).pdf | 4.0 | 4.3 | +0.3 |
| November 2017 Component 2 Question Two 4 (-) (5).pdf | 5.0 | 5.7 | +0.7 |
| CPD 2017 6A Question Two 5 (-) (6).pdf | 6.0 | 6.2 | +0.2 |
| CPD 2017 6A Question Two 6 (-) (7).pdf | 7.0 | 8.0 | +1.0 |
First, here's a scatter graph to show you what a theoretical perfect correlation of 1 would look like:
Now, let's look at the real-life graph, drawn from the data above:
On the horizontal axis is the mark given by the exam board; on the vertical axis, the mark given by Top Marks AI. Each dot is an essay, and its position shows both the board's mark and Top Marks AI's mark for that essay. You can see how closely the real data resembles the theoretical graph of perfect correlation.
Discover how Top Marks AI can revolutionise assessment in education. Contact us at info@topmarks.ai.