Can you trust the results of an AI speaking test?

by Martin Moore | 14 May 2024

It’s fair to say there’s plenty of scepticism among teachers about the use of AI in testing. Can a machine really match a human’s ability to assess a learner’s proficiency in English? What about speaking and writing, where you want to measure more than just technical competence (grammar, pronunciation, fluency, etc.) and are looking to evaluate communicative competence as well? When Clarity first started exploring automated speaking tests, we held off developing a test for precisely these reasons. But with the recent leap forward in AI, Clarity’s new Dynamic Speaking Test (DST) is achieving results which align closely with those given by human raters.

One reason for this scepticism is that the talk around AI sometimes gives the impression that it’s a kind of magic, whereas in reality the way an AI system is trained is not so different from the way a human rater is trained – except for the sheer volume of data it can process. When training human raters, we ask them to score a small sample of responses that have already been benchmarked by experts, and we calibrate their scores against the benchmarked scores. In the case of the DST, the AI system was trained on a sample of five thousand benchmarked responses from students around the world (each manually scored by three expert human raters), and machine learning was used to create a scoring algorithm. With further analysis of more than a million data points across many thousands more responses, the algorithm was refined until it could rate new spoken English responses to the same level of accuracy as the original expert human raters.
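For readers curious about what this kind of training looks like in practice, here is a minimal, purely illustrative sketch – not Clarity’s actual pipeline. It assumes each response has already been reduced to numeric speech features and that the training target is the average of the three expert ratings; the feature set, model choice and data are all invented for the example.

```python
# Illustrative sketch only - not Clarity's actual training pipeline.
# Assumes each response has been reduced to numeric features and that
# the training target is the mean of the three expert raters' scores.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder data: 5,000 responses x 20 hypothetical speech features
# (fluency, pronunciation, lexical range, etc.).
features = rng.normal(size=(5000, 20))

# Each response was scored by three expert raters; the benchmark score
# used for training is their average.
rater_scores = rng.integers(1, 10, size=(5000, 3))
benchmark = rater_scores.mean(axis=1)

# Fit a regression model that learns to predict the benchmark score.
model = GradientBoostingRegressor()
model.fit(features, benchmark)

# Cross-validation checks how well the model generalises to unseen
# responses. On real data this quantifies agreement with the experts;
# with random placeholder data the figure is of course meaningless.
r2_scores = cross_val_score(model, features, benchmark, cv=5, scoring="r2")
print(f"Mean cross-validated R^2: {r2_scores.mean():.2f}")
```

The point is not the particular model but the workflow: expert benchmarks in, a predictive scoring function out, refined until its predictions track the experts on held-out responses.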

Another source of scepticism is whether AI can judge the whole response, not just individual features like pronunciation or fluency. In other words, can it evaluate how well someone is actually communicating? This is where the advances in AI kick in, with DST using a ‘task achievement’ algorithm to assess whether the speaker really has given their opinion on the issue raised in the task, or suggested a solution to the problem described. It’s this use of AI that allows DST to assess a speaker’s communicative ability as well as their technical speaking skills.
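As a purely hypothetical illustration of the idea – and emphatically not DST’s actual algorithm – a task-achievement check needs some way of asking whether the response contains the discourse moves the task calls for, such as stating an opinion or proposing a solution. The rubric, patterns and function below are invented for the example.

```python
# Hypothetical illustration of a task-achievement check - not DST's algorithm.
# Scores how many of a task's expected discourse moves (stating an opinion,
# suggesting a solution) are evidenced in the transcribed response.
import re

TASK_RUBRIC = {
    "gives_opinion": [r"\bI (think|believe|feel)\b", r"\bin my (opinion|view)\b"],
    "suggests_solution": [r"\b(we|you|they) (could|should)\b", r"\bone solution\b"],
}

def task_achievement(transcript: str) -> float:
    """Return the fraction of rubric criteria with at least one match."""
    met = 0
    for criterion, patterns in TASK_RUBRIC.items():
        if any(re.search(p, transcript, re.IGNORECASE) for p in patterns):
            met += 1
    return met / len(TASK_RUBRIC)

response = ("In my opinion the town needs better public transport, "
            "and one solution would be cheaper bus fares.")
print(task_achievement(response))  # prints 1.0: both criteria evidenced
```

A production system would of course rely on trained language models rather than hand-written patterns; the sketch only shows the shape of the check – did the speaker actually do what the task asked?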

As in the development of any new test, pre-testing was carried out with learners from different countries in order to check the reliability of DST. We then set up workshops in which experienced raters scored the tests independently before a) comparing their own scores with each other, and then b) comparing their scores with the AI scores. It’s worth saying something at this point about the reliability of human scoring. Test providers routinely measure the variance in scoring between human raters and it is well-established that ‘inter-rater reliability’ never approaches 100%. A well-calibrated test is one where the statistical variance between raters is relatively low. Therefore, in judging AI’s performance, we need to evaluate whether the correlation between AI and human scores is comparable to the correlation between two or more sets of human scores.
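To make that concrete, inter-rater reliability is often quantified with a simple correlation between the score sets. Here is a sketch under the assumption that scores sit on a numeric scale; the figures are invented for demonstration.

```python
# Illustrative sketch: measuring human inter-rater reliability with
# Pearson correlation. The scores below are invented for demonstration.
import numpy as np

# Independent scores from two trained human raters on the same ten responses.
rater_a = np.array([5, 7, 6, 8, 4, 9, 5, 6, 7, 8])
rater_b = np.array([6, 7, 5, 8, 4, 8, 5, 7, 7, 9])

human_vs_human = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"human vs human correlation: {human_vs_human:.2f}")
# Even well-trained raters rarely agree perfectly, so this figure becomes
# the benchmark against which the AI's agreement is judged.
```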

So, what were the findings of the calibration workshop on DST scores? First of all, the expert human raters compared their own scores with each other, which as expected showed some variance. They then discussed their differences and agreed on a model ‘human’ score for each response. When the AI scores were compared with these model human scores, the level of inter-rater reliability was similar to that found between the human raters themselves, meaning that results produced by DST are likely to be as accurate as those produced by a set of trained human raters.
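Continuing the invented example above, the same comparison can be sketched in a few lines: take an agreed consensus score for each response (approximated here by the raters’ mean, whereas in the workshop it was negotiated) and see whether the AI’s correlation with that consensus sits in the same range as the human-vs-human figure.

```python
# Illustrative continuation of the sketch above (invented scores).
import numpy as np

rater_a = np.array([5, 7, 6, 8, 4, 9, 5, 6, 7, 8])
rater_b = np.array([6, 7, 5, 8, 4, 8, 5, 7, 7, 9])
ai_score = np.array([5, 7, 6, 8, 5, 9, 5, 6, 8, 8])

# Approximate the agreed 'model human' score with the raters' mean.
model_human = (rater_a + rater_b) / 2

human_vs_human = np.corrcoef(rater_a, rater_b)[0, 1]
ai_vs_model = np.corrcoef(ai_score, model_human)[0, 1]

print(f"human vs human:    {human_vs_human:.2f}")
print(f"AI vs model human: {ai_vs_model:.2f}")
# Comparable correlations suggest the AI agrees with the consensus about
# as well as individual human raters agree with each other.
```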

Interestingly, at the start of the calibration workshop, several of the expert raters were openly sceptical about the potential of an AI test, but by the end they were impressed not only by the accuracy of the scores but also by the range of tasks and the quality of responses. The conclusion of the pre-testing workshop was that users can have confidence in the value of the results given by DST – whether it’s teachers needing to place students at the start of a course, or employers needing to judge whether a candidate’s spoken English is good enough for them to proceed to the next round of the recruitment process.

Martin Moore, Editor