AI speaking assessment: How we designed a valid test

by Martin Moore | 16 April 2024

Ever since ClarityEnglish launched the Dynamic Placement Test in 2017, the ambition has been to complement it with a speaking test. But the problem was always: how do we deliver a speaking test with the same reliability, validity and speed as DPT? What use is an automated test if its results are unreliable? What if the tasks are so inauthentic that they don’t actually measure someone’s ability to use English in real situations?

In 2017, the technology simply wasn’t good enough to produce an automarked speaking test. When comparisons were made with human markers, the results were too erratic. Although some scores coincided, many varied by more than a full CEFR level in either direction. Clearly, a test isn’t useful if a large number of your students receive significantly inaccurate scores and you can’t tell which ones they are.

To compensate for this unreliability, test creators have typically simplified their tests, focusing on mechanical tasks that are easier for the AI to mark. If, for example, a test-taker only has to read a sentence off the screen, the AI has a clear model to assess against and can generate consistent scores. But what does a task like that really tell you about a learner’s speaking ability? If you were conducting a face-to-face test, would you use tasks like this to assess your students?

In the last couple of years, however, AI has taken a leap forward, and this enhanced capability has given us the chance to develop a speaking test that is both reliable and valid as a placement tool. We’ll talk about advances in reliability in a future post, but the question of validity is just as important. The purpose of a speaking test must be to show how well a learner can communicate in spoken English. It must consist of a range of tasks in which test-takers have to deal with the types of scenarios they’ll encounter when they speak English for real. For example, we can ask the test-taker to read an email of complaint, summarise the problem and come up with a solution. A human marker will focus not only on features like pronunciation, fluency and lexis but, just as importantly, on how well the test-taker addresses the task. It’s this second half that has been the biggest challenge for AI.

The exciting thing is that the AI we use in the Dynamic Speaking Test has proved itself capable of doing this accurately and reliably. The algorithm doesn’t just measure the mechanical features; it also measures the extent to which the test-taker can complete a task by speaking English. So it’s no use for a candidate to pre-learn a set script and reproduce it fluently ‒ the AI detects the response’s relevance and weights the score according to how well the task is achieved. As a result, the Dynamic Speaking Test can use the type of tasks we’d use in a human-marked test, with test-takers producing spontaneous answers to tasks that target a range of oral competencies and domains from the CEFR. This means the test really measures what you want it to measure, while having all the benefits of an automated test ‒ simultaneous delivery to hundreds of students and instant results.
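To make the general idea concrete, here is a minimal toy sketch of how a task-relevance term can pull down an otherwise fluent but off-topic response. This is purely illustrative ‒ every name, scale and weight below is hypothetical and does not represent ClarityEnglish’s actual algorithm.

```python
# Illustrative sketch only: combining "mechanical" speech scores with a
# task-relevance term. All names and weights are hypothetical, not the
# Dynamic Speaking Test's actual scoring model.

from dataclasses import dataclass


@dataclass
class SpeechFeatures:
    pronunciation: float  # 0.0-1.0, assumed output of an acoustic model
    fluency: float        # 0.0-1.0, e.g. based on pauses and speech rate
    lexis: float          # 0.0-1.0, vocabulary range and accuracy


def combined_score(features: SpeechFeatures, task_relevance: float) -> float:
    """Scale the average of the mechanical features by task achievement.

    `task_relevance` (0.0-1.0) stands in for a semantic comparison between
    the response and the task prompt. A fluent but pre-learned, off-topic
    answer is pulled down because the relevance term scales the whole score.
    """
    mechanical = (features.pronunciation + features.fluency + features.lexis) / 3
    return mechanical * task_relevance


# Example: a fluent, well-pronounced scripted answer that ignores the task.
scripted = SpeechFeatures(pronunciation=0.9, fluency=0.95, lexis=0.85)
print(round(combined_score(scripted, task_relevance=0.2), 2))  # 0.18
```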

Martin Moore, Editor