It’s fair to say there’s plenty of scepticism among teachers about the use of AI in testing. Can a machine match a human’s ability to assess a learner’s proficiency in English? In particular, can AI really evaluate a learner’s communicative competence in spoken and written English? The evidence suggests that, with the right test and the right system, automated marking can score spoken English responses to the same level of accuracy as the original expert human raters. But as with all uses of AI, success depends on the right combination of human skills and machine capability.
The first part of the equation is the design of the test. This is where a team of human experts devise a set of tasks that will require test takers to demonstrate their communicative abilities. In the case of a speaking test, it’s not just about pronunciation and fluency; it’s about using language in realistic communicative scenarios. A well-designed test will elicit responses that show the test taker’s ability – and by the same token, a poorly designed test won’t.
The second part of the equation is to evaluate the responses. In the case of Clarity’s Dynamic Speaking Test (DST), evaluation is done against five criteria: pronunciation, fluency, vocabulary, grammar and task achievement. The last of these – task achievement – draws on ‘can do’ descriptors from the CEFR and is at the heart of communicative competence. But can it be accurately assessed by a machine?
This was one of the critical questions we examined when devising DST. We know that AI can measure things such as words per minute, the ratio of flowing speech to hesitant pauses, or pronunciation accuracy. But how can it measure the degree to which a test taker has properly completed a task? This comes down to prompt engineering – the way we brief the AI to generate consistent and reliable output. It’s a mix of science and art, and another example of how success depends on the right combination of human and machine expertise.
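Before looking at how that briefing works, it may help to see what the more ‘countable’ measures look like. Below is a minimal sketch, in Python, of how words per minute and a pause ratio might be computed from a timestamped transcript; the transcript format, field names and pause threshold are assumptions made for illustration, not DST’s actual marking engine.

```python
# Minimal sketch: fluency-style measures from a timestamped transcript.
# The transcript format (word, start, end in seconds) is an assumption;
# real ASR output and DST's internal representation may well differ.

transcript = [
    {"word": "I", "start": 0.0, "end": 0.2},
    {"word": "would", "start": 0.2, "end": 0.5},
    {"word": "rather", "start": 1.6, "end": 1.9},   # long hesitation before this word
    {"word": "meet", "start": 2.0, "end": 2.3},
    {"word": "on", "start": 2.3, "end": 2.4},
    {"word": "Friday", "start": 2.4, "end": 2.9},
]

PAUSE_THRESHOLD = 0.5  # silence (in seconds) counted as a hesitant pause; illustrative value

def words_per_minute(words):
    """Speech rate over the whole response."""
    duration = words[-1]["end"] - words[0]["start"]
    return len(words) / duration * 60

def pause_ratio(words, threshold=PAUSE_THRESHOLD):
    """Proportion of the response spent in pauses longer than the threshold."""
    total = words[-1]["end"] - words[0]["start"]
    gaps = (nxt["start"] - cur["end"] for cur, nxt in zip(words, words[1:]))
    return sum(gap for gap in gaps if gap > threshold) / total

print(f"Words per minute: {words_per_minute(transcript):.0f}")
print(f"Pause ratio: {pause_ratio(transcript):.2f}")
```

Measures like these can be counted directly from the recording. Judging whether a task has been achieved is a different kind of problem, and that is what the prompt has to spell out.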
Prompt engineering is about writing clear and effective instructions to the AI system. In the case of assessment, that means defining the construct (i.e. the skills we want the test taker to demonstrate) and also the standards that will be used to measure achievement.
Let’s imagine a task where the test taker is made an offer which they have to decline (with reasons) and then suggest an alternative solution. On top of the detailed rubrics used to score pronunciation and language use, the test designer writes a prompt which also instructs the AI to focus on the skills required to complete this specific task:
Does the test taker …
- decline the offer?
- provide clear reasons?
- suggest an alternative solution?
The AI then assesses responses for evidence of each of these points and scores them accordingly. This task achievement score is added to the measures of pronunciation, fluency, grammar and vocabulary to create an overall score.
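To make this concrete, here is a minimal sketch of how such a briefing and scoring step might be wired together. The prompt wording, the `ask_model` placeholder and the equal weighting of the five criteria are all assumptions made for illustration; they are not DST’s actual prompt, marking model or weighting.

```python
# Minimal sketch: a task-achievement prompt plus score aggregation.
# The prompt wording, the ask_model placeholder and the equal weighting
# are illustrative assumptions, not DST's actual implementation.

TASK_ACHIEVEMENT_PROMPT = """You are rating a spoken response to this task:
the test taker is made an offer, has to decline it with reasons, and then
has to suggest an alternative solution.

For the transcript below, answer yes or no to each question:
1. Does the test taker decline the offer?
2. Does the test taker provide clear reasons?
3. Does the test taker suggest an alternative solution?

Transcript:
{transcript}
"""

def ask_model(prompt: str) -> list[bool]:
    """Placeholder for whatever AI marking system is used.
    Assumed to return one yes/no judgement per question."""
    raise NotImplementedError

def task_achievement_score(transcript: str) -> float:
    """Score 0-1: the proportion of task requirements met."""
    checks = ask_model(TASK_ACHIEVEMENT_PROMPT.format(transcript=transcript))
    return sum(checks) / len(checks)

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Combine the five criterion scores (each 0-1) with a simple
    unweighted mean; the real combination is not reproduced here."""
    return sum(criterion_scores.values()) / len(criterion_scores)

# Example with invented criterion scores:
scores = {
    "pronunciation": 0.70,
    "fluency": 0.65,
    "vocabulary": 0.60,
    "grammar": 0.55,
    "task_achievement": 2 / 3,  # e.g. two of the three checks met
}
print(f"Overall: {overall_score(scores):.2f}")
```

Spelling the task requirements out as explicit checks is what allows the same standard to be applied to every response.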
The example above is clearly a simplified version of the prompt, but the key to success is that the prompt is produced by an expert test designer who understands the test construct and is then reviewed by a developer who understands how best to structure AI prompts. The evaluation can then be carried out consistently and objectively by the AI system. (For more detail about the marking, see How the AI calculates the CEFR level.)
As in the development of any new test, pre-testing was carried out with learners from different countries to establish the reliability of DST. We then set up workshops in which experienced raters scored the tests independently before a) comparing their scores with each other, and then b) comparing their scores with the AI scores. When the AI v. human scores were analysed, we found a level of reliability similar to that of the human v. human scores. This confirmed that the AI system was effective in assessing not only the linguistic features (pronunciation, fluency, grammar and vocabulary) but also communicative competence (task achievement).
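One common way of quantifying this kind of rater agreement is quadratic weighted kappa on the awarded CEFR levels. The sketch below shows that calculation with invented scores, purely to illustrate the general approach; it does not reproduce the actual analysis or figures.

```python
# Minimal sketch: comparing human v. human and human v. AI agreement
# on awarded CEFR levels with quadratic weighted kappa.
# The levels below are invented purely for illustration.
from sklearn.metrics import cohen_kappa_score

CEFR = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}

def encode(levels):
    """Map CEFR labels onto an ordered integer scale."""
    return [CEFR[level] for level in levels]

# One awarded level per test taker, from two human raters and the AI system.
rater_1 = encode(["B1", "B2", "A2", "C1", "B2", "B1", "A2", "B2"])
rater_2 = encode(["B1", "B2", "B1", "C1", "B2", "B2", "A2", "B2"])
ai      = encode(["B1", "B2", "A2", "C1", "B1", "B1", "A2", "B2"])

# Quadratic weighting penalises large disagreements (e.g. A2 v. C1)
# far more heavily than adjacent-level disagreements.
human_v_human = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
human_v_ai    = cohen_kappa_score(rater_1, ai, weights="quadratic")

print(f"Human v. human kappa: {human_v_human:.2f}")
print(f"Human v. AI kappa:    {human_v_ai:.2f}")
```

When the human v. AI figure falls in the same range as the human v. human figure, that is the kind of comparable reliability described above.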
Creating a valid, automarked test isn’t about handing the whole job over to AI. It’s about using AI for the things it can do better than humans. After all, we know from research that it is a huge challenge for markers to listen to a test taker’s response while simultaneously assessing it against multiple criteria – and to keep doing so accurately and consistently over time. AI marking excels at exactly this kind of job. The final results, however, are only meaningful if the test has been designed by experts.
