Teachers know the dilemma all too well. We encourage our students to practise their writing, especially in the lead-up to exams, but there simply isn’t enough time to mark everything they produce. The result is often a sense of frustration – on both sides.
The big hope has been the development of automated marking tools. But despite some early promise, feedback tended to be vague, offering comments such as ‘There may be an issue with this sentence’ without explaining what the issue was or how to fix it. And no system even claimed to be able to assess whether the learner had actually achieved the communicative purpose of the task.
The arrival of generative AI has changed all this. After 18 months of intensive work at Clarity, we’ve developed an AI marking tool that goes beyond what we expected when we started work on it, and it delivers three big breakthroughs: it gives learners targeted feedback and suggestions for improvement, it evaluates the learner’s communicative competence, and it grades the writing with the same accuracy and reliability as a human marker.
The first breakthrough we discovered was simply the depth of feedback that AI can provide. With skilful design and a thorough technical understanding of AI tools, we were able to build a model that can analyse a student’s writing against the kind of criteria you find in CEFR descriptors. This goes way beyond the response you get if you simply ask AI to ‘mark my IELTS essay’. It evaluates the learner’s ability in structure, organisation, vocabulary and grammar, and gives targeted suggestions for how they can improve what they wrote.
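For readers curious about what this kind of criterion-based marking can look like under the hood, here is a minimal, purely illustrative sketch – not Clarity’s actual implementation. It assumes an OpenAI-style chat API for concreteness, and the criteria list, prompt wording and model name are all invented for the example.

```python
# Illustrative sketch only -- not Clarity's implementation.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in the environment.
import json
from openai import OpenAI

client = OpenAI()

# Invented, CEFR-flavoured criteria for the purposes of the example.
CRITERIA = ["structure", "organisation", "vocabulary", "grammar"]

def mark_writing(student_text: str, target_level: str = "B2") -> dict:
    """Ask the model for targeted, criterion-by-criterion feedback as JSON."""
    prompt = (
        f"You are an experienced English writing examiner. Assess the text below "
        f"against CEFR {target_level} descriptors for these criteria: {', '.join(CRITERIA)}. "
        "For each criterion, give two or three specific suggestions for improvement, "
        "quoting the learner's own words where possible. "
        "Respond as JSON with one key per criterion.\n\n"
        f"Student text:\n{student_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    feedback = mark_writing("Dear Sir, I am writing for complain about my recent stay...")
    print(json.dumps(feedback, indent=2))
```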
The second major development came in what is often called task achievement. We all know that good writing isn’t simply about correct grammar or good vocabulary. As teachers we want to know whether the learner can write in a way that successfully communicates the intended message to the intended reader. Historically, this has been the hardest part for automated systems to evaluate.
A key part of our AI model, therefore, is information about the specific task. Including generic writing descriptors, as described above, is essential, but we also need to provide full details about each individual task – a description of what the learner is asked to do, the specific skills that the teacher wants the learner to show (e.g. how to start and end an email appropriately, or how to structure a for-and-against essay), and the standards that define what task achievement means at different levels. Armed with this background information, the system is able to provide reliable feedback on how well the writer addresses the task.
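To make the idea of task-specific information concrete, here is a small sketch of the kind of detail a task definition might carry and how it could be combined with generic descriptors before marking. The field names and example content are invented for illustration; they are not Clarity’s schema.

```python
# Illustrative sketch of the kind of task-specific information described above.
# Field names and example content are invented; they are not Clarity's schema.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    description: str                        # what the learner is asked to do
    target_skills: list[str]                # skills the teacher wants to see demonstrated
    achievement_standards: dict[str, str]   # what task achievement looks like at each level

email_task = TaskSpec(
    description="Write a 120-150 word email to a hotel manager complaining about your stay.",
    target_skills=[
        "open and close an email appropriately",
        "state the complaint clearly and politely",
        "request a specific remedy",
    ],
    achievement_standards={
        "B1": "States the complaint and asks for action, though tone may be inconsistent.",
        "B2": "Communicates the complaint clearly, with appropriate register throughout.",
    },
)

def build_marking_prompt(task: TaskSpec, generic_descriptors: str, student_text: str) -> str:
    """Combine generic CEFR-style descriptors with task-specific detail before marking."""
    skills = "\n".join(f"- {s}" for s in task.target_skills)
    standards = "\n".join(f"- {lvl}: {desc}" for lvl, desc in task.achievement_standards.items())
    return (
        f"Task: {task.description}\n"
        f"Skills the learner should demonstrate:\n{skills}\n"
        f"Task achievement standards:\n{standards}\n"
        f"Generic writing descriptors:\n{generic_descriptors}\n\n"
        f"Assess how well the following response addresses the task:\n{student_text}"
    )
```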
The result of these two breakthroughs is a level of feedback that even the most diligent teacher couldn’t realistically provide for a class of thirty students producing multiple practice texts in preparation for an exam. Using this tool, the learner can write and rewrite sample answers and get feedback that is immediate, thorough, and directly linked to the learning objectives of the task.
There was one more breakthrough, and it was the most challenging one – reliable and accurate grading. In fact, at the outset of the project, our priority was the feedback element; we viewed grading as an aspiration, only worth incorporating if we could get it to a reliable standard. The foundation of the grading system was the same as the foundation for feedback: the standards and descriptors that enable the feedback also allow the AI system to evaluate writing against the skills and competences in the frameworks that teachers typically use to track learner progress.
We asked the AI system to grade a large set of student responses using all the available information and to give scores for a range of criteria – vocabulary, grammar, organisation, coherence and task achievement. The scores looked plausible, but of course accurate grading requires evidence. So the final stage of development was a calibration test: we recruited human examiners with extensive experience in marking high-stakes exam scripts and asked them to grade the same set of responses we had used with the AI.
Our hope was that we would see reasonable agreement between human and AI scores, though we assumed we would still need further rounds of refinement to bring the AI up to the level of expert human markers. In fact, the results surprised us. Statistically, the AI scores sat squarely within the range of human scores, producing a reliability coefficient that showed the AI’s judgments were effectively indistinguishable from those of experienced markers.
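The article doesn’t specify which reliability coefficient was used, but for readers who want a sense of what such a calibration involves, here is an illustrative calculation of two common agreement measures between two sets of band scores. The scores below are made up; in a real study they would come from the human examiners and the AI marking the same scripts.

```python
# Illustrative only: the article does not say which reliability coefficient was used.
# The scores below are invented; in practice they would come from the calibration study.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Overall band scores for the same ten scripts (invented data).
human_scores = np.array([5, 6, 7, 4, 6, 5, 7, 8, 6, 5])
ai_scores    = np.array([5, 6, 6, 4, 6, 5, 7, 8, 7, 5])

# Pearson correlation: how closely the two sets of scores move together.
pearson_r = np.corrcoef(human_scores, ai_scores)[0, 1]

# Quadratic-weighted kappa: chance-corrected agreement for ordinal band scores,
# a common choice when comparing two markers of exam scripts.
qwk = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

print(f"Pearson r: {pearson_r:.2f}")
print(f"Quadratic-weighted kappa: {qwk:.2f}")
```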
In describing the development of the tool above, I’ve glossed over the intensity of the work. There were numerous iterations, with a few breakthrough moments but many periods when progress was slow and painstaking. The key was the collaboration between the technical team and the pedagogic team, which ensured that we didn’t simply deliver a clever piece of technology (which it is) but, above all, something that genuinely helps learners and teachers.
