Machine marking of receptive skills is nothing new: IELTS Reading and Listening papers have been marked by computers for years. So, in a very similar way, have the parking ticket you feed into the reader and your ATM card at the cash point. The input matches a tightly restricted and predictable pattern, which triggers an equally predictable response: an IELTS band score, a parking fee or a number of bank notes of particular denominations. Simple stuff.
When it comes to machine marking of speaking and writing, though, we find that both the input and the output are less predictable and less restricted. As a result, the grading process is much, much more complex. That’s why we always get the right amount of cash from the ATM (in 30 years I’ve never been given more or less), but we often hear Amazon’s Alexa say ‘Sorry, I am not sure.’
So, for a machine to ‘understand’ and respond to productive skills, we can’t rely on closely defined pattern-matching. We need another approach, and currently there are two main contenders.
Aligning marked assessment to a standard
Before we examine them, though, we need to be clear about our goal. If we are assessing student writing, this means aligning the marking to a standard. If the standard is the CEFR, we need to know whether the student’s answer shows that they ‘can do’ a particular task, for example at A1 that they ‘can write a short, simple postcard’. For a human marker this is judged through a marking scheme that makes clear what the target is and what sort of evidence to look for: grammatical and lexical complexity, for example. The human marker can annotate the answer to highlight the evidence they have used to make an assessment. This helps with consistency for an individual marker, and with reliability across a body of markers.
Let’s now turn to how Artificial Intelligence aligns marking to standards.
Natural Language Processing
Natural Language Processing (NLP) is concerned with enhancing the quality of interactions between machines and human language. In the very early days this meant word-for-word machine translation between English and Russian. (It was famously defeated by homonyms when ‘The spirit is willing but the flesh is weak’ reverse-translated as ‘The vodka is promising, but the meat is rancid’.)
Sixty years later, NLP involves breaking a text into words and chunks, and analysing the syntax, semantics and discourse. Some models rely on a pipeline of tools, with each tool processing an individual element of the text and passing it on to the next process. As the text moves through the pipeline, it is inspected and described: the output is a measure of grammar, vocabulary and meaning. This is, broadly speaking, how a (human) linguist builds their own model of the text.
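To make that pipeline idea concrete, here is a minimal sketch using the open-source spaCy library; the choice of toolkit and the example sentence are mine, not something the approach depends on.

```python
# A minimal sketch of an NLP pipeline using spaCy (an assumed choice of
# toolkit). Each component adds its own layer of annotation to the text.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline: tagger, parser, lemmatiser
doc = nlp("The coffee was absolutely delicious, but the service was slow.")

# Each token carries the annotations added along the pipeline:
# its dictionary form, part of speech and syntactic role.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# The 'words and chunks' the pipeline has identified.
print([sent.text for sent in doc.sents])
print([chunk.text for chunk in doc.noun_chunks])
```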
Classification
The other main approach is classification. Let’s take a hypothetical example, where a student is given this task: ‘Write a short review of a restaurant or coffee shop. You want to award 4 out of 5 stars. The restaurant was good, but not perfect.’
The first step is to get some data to create a machine learning model. We can grab this from Yelp, the review website, which has about 6,900,000 authentic reviews (we put 10% of them aside to use as test cases once we have built the model). The dataset includes the text of each review and its classification, which in this case is the number of stars that the reviewer gave. These are fed into one of the many open source algorithms available.
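A hedged sketch of that first step might look like this in Python; the file name and column names are assumptions about how the review data could be stored.

```python
# Load the review data and hold back 10% of it as a test set.
# 'yelp_reviews.csv' and its column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

reviews = pd.read_csv("yelp_reviews.csv")  # columns: 'text', 'stars'

train, test = train_test_split(reviews, test_size=0.1, random_state=42)
X_train, y_train = train["text"], train["stars"]
X_test, y_test = test["text"], test["stars"]
```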
The algorithm spots patterns in the data and works out which patterns lead to which classification. At the very simplest level, the word ‘delicious’ might correlate with four stars, while ‘absolutely delicious’ leads to five. A lot goes on inside the algorithm to avoid over-matching, cope with messy data and keep the complexity from growing exponentially. The end product is our machine learning model. This can be tested with the 10% of reviews we put aside, and refined if necessary. We can then feed in a new bit of data — the student’s review — and the model will predict, with a confidence level, the classification.
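Continuing the sketch above, the modelling step could be as simple as the following; scikit-learn and these particular settings are my assumptions rather than a description of any real marking engine.

```python
# Train a simple text classifier: TF-IDF word features feeding a
# logistic regression. One of many possible open source choices.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Test the model against the 10% of reviews we held back.
print("Accuracy on held-out reviews:", model.score(X_test, y_test))

# Feed in a new bit of data -- the student's review -- and get a predicted
# star rating, plus a confidence level for each possible rating.
student_review = "The pizza was delicious and the staff were friendly."
print(model.predict([student_review]))
print(model.predict_proba([student_review]))
```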
To summarise, on the one hand, NLP works like a skilled human, but speeds the process along. On the other hand, we do not know (or perhaps even care) how the classification works, but we can test whether it produces accurate results.
So, do these two techniques allow us to align their marking to a standard? Stunningly well, actually. After all, the classification model has learnt to follow the output of human marking. In our example it matches review texts to a star-based grading system, but it could equally match an essay to an IELTS band score or a task to a CEFR descriptor. This means we can design the model to align with whatever criteria we choose.
The NLP pipeline, by contrast, does not give you an answer. You end up with a lot of information about the written work, and how it relates to the marking criteria. You use this information to assign a CEFR level (or other classification) to the text.
It is in combining the two that machine learning promises so much. If we have enough annotated data, and spend time tweaking the classification algorithm, and perhaps add some constraints and anchors from the NLP descriptions, we will make the best marker there is — and this holds out the very real promise that in the foreseeable future you, as a teacher, will get your evenings back.
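Purely as an illustration of one possible reading of that combination, the sketch below bolts a couple of hand-built measures (text length and lexical variety, standing in for the much richer descriptions a real NLP pipeline produces) onto the same kind of classifier used earlier; the specific features and libraries are assumptions of mine.

```python
# Combine simple NLP-style measurements with word-based features,
# then feed both into the classifier. A sketch, not a real marking engine.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class SimpleTextStats(BaseEstimator, TransformerMixin):
    """Turn each text into two coarse measures: length and lexical variety."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = []
        for text in X:
            words = text.split()
            n_words = len(words)
            variety = len(set(words)) / n_words if n_words else 0.0
            rows.append([n_words, variety])
        return np.array(rows)

features = make_union(TfidfVectorizer(), SimpleTextStats())
model = make_pipeline(features, LogisticRegression(max_iter=1000))
# Train and predict exactly as before: model.fit(X_train, y_train), then
# model.predict_proba([student_review]) for a grade plus a confidence level.
```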