Putting a test “to the test” — how to make it valid and reliable

by Andrew Stokes | 16 April 2019

When I was teaching, I made a test for the end of every course I gave; it was no big deal, just part of a teacher’s work. Or so I thought until a conversation with a friend who worked at OUP. “Take the Oxford Placement Test,” I said. “How do you know whether the questions actually work?” “Well,” he replied, “for a start, each question is looked at by at least 100 people.” “A hundred people,” I thought. “Wow! That’s impressive.”

Now I’m in publishing, and having worked on the Dynamic Placement Test for the last three years, I understand that, impressive as those 100 people are, they are actually just one part of a much bigger picture. So in anticipation of the Dynamic Placement Test Standard Setting event next month (more on standard setting below), I’d like to look at the other parts of this bigger picture: at what the whole team does to make sure the test you use is both valid and reliable.

1. Anchor items

An early step in putting the test together is to find “anchor items”. These are questions that have been taken from existing tests and are proven, through the use of data analysis, to correspond to a specific level. For the Dynamic Placement Test, anchor items were taken from telc language tests at all levels from A1 to C2 of the CEFR. These anchor items make up a certain percentage of the test and they sit alongside newly created items.

We know the anchor items work, so we can measure new items against them. If, for example, a large number of candidates answer the B1 anchor items correctly but have difficulty with a new “B1” item, the new item is not performing as it should. Developers can then evaluate it and decide whether to adapt or replace it.
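To make that concrete, here is a minimal sketch in Python of how such a check might work. Everything in it (the function names, the response data, the 0.15 tolerance) is invented for illustration; the real comparison relies on proper psychometric modelling rather than a simple average.

    # Toy illustration (not Clarity's or telc's actual method): flag a new
    # item whose facility (proportion of correct answers) strays too far
    # from the average facility of the anchor items at the same CEFR level.

    def facility(responses):
        """Proportion of candidates answering an item correctly (0.0-1.0)."""
        return sum(responses) / len(responses)

    def flag_against_anchors(anchor_results, new_item_results, tolerance=0.15):
        """Return True if the new item's facility deviates from the mean
        anchor facility by more than `tolerance` (a made-up threshold)."""
        anchor_mean = sum(facility(r) for r in anchor_results) / len(anchor_results)
        return abs(facility(new_item_results) - anchor_mean) > tolerance

    # 1 = correct, 0 = incorrect; each list is one item across six candidates.
    b1_anchors = [
        [1, 1, 0, 1, 1, 1],   # proven B1 anchor items
        [1, 0, 1, 1, 1, 0],
    ]
    new_b1_item = [0, 0, 1, 0, 0, 0]  # candidates struggle with this one

    if flag_against_anchors(b1_anchors, new_b1_item):
        print("New item underperforms its level: adapt or replace it.")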

2. Item analysis

If they adapt the item, they will then carry out further analysis to check its performance — and the process continues for the duration of a test’s life. But what is this analysis? The University of Washington’s Office of Educational Assessment describes it like this: “Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items.”

The Dynamic Placement Test is digital, so it is easy to collect and anonymise data. Clarity then sends this off to the telc team in Frankfurt, who run it through their item analysis algorithms. Analysis can do other, exciting things too (a simple illustration of the core statistics follows the list below). It can look at:

  • performance of test takers on mobile devices vs computers
  • performance of homogeneous vs heterogeneous L1 test taker groups
  • performance of different L1 groups (comparing Arabic and Tagalog speakers for example)
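What do those algorithms actually compute? Two classical statistics sit at the heart of most item analysis: difficulty (the proportion of candidates who answer an item correctly) and discrimination (how well the item separates strong candidates from weak ones). The sketch below, in Python, is a simplified illustration of these two measures; it is not telc’s actual algorithm, and the response data is made up.

    from statistics import mean, pstdev

    def item_difficulty(item_scores):
        """Classical difficulty (facility) index: proportion correct."""
        return mean(item_scores)

    def item_discrimination(item_scores, total_scores):
        """Point-biserial correlation between an item (0/1) and the rest of
        the test. Uses the corrected total (total minus the item itself) so
        the item is not correlated with itself."""
        rest = [t - i for t, i in zip(total_scores, item_scores)]
        sx, sy = pstdev(item_scores), pstdev(rest)
        if sx == 0 or sy == 0:   # everyone scored the same: undefined
            return 0.0
        mx, my = mean(item_scores), mean(rest)
        cov = mean((i - mx) * (r - my) for i, r in zip(item_scores, rest))
        return cov / (sx * sy)

    # Rows = candidates, columns = items (1 = correct, 0 = incorrect).
    matrix = [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 0, 0, 1],
    ]
    totals = [sum(row) for row in matrix]
    for j in range(4):
        col = [row[j] for row in matrix]
        print(f"item {j}: difficulty={item_difficulty(col):.2f}, "
              f"discrimination={item_discrimination(col, totals):.2f}")

An item with a low or negative discrimination, like the last one in this example (only the weakest candidate got it right), is exactly the kind of item that goes back to the developers for adaptation or replacement.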

3. Editorial

The item analysis is done by computer, of course, and it may not detect issues that are obvious to the human eye. That means each test item needs to be reviewed by a cross-cultural panel of editors. What are they looking for? Firstly, they need to make sure that each question is fair. For example, in a reading text about Australia, the question “What is the capital of the Northern Territory?” favours test takers living in Australia over those in, say, Brazil, because Australian candidates are more likely to know the answer without having to read the text.

Next, we need to avoid items that might be upsetting on grounds of gender, religion, race or politics. We don’t want test takers to feel uncomfortable or under strain. This is why the panel needs to be cross-cultural: it simply may not occur to a German editor that a Saudi Arabian student could find a red cross offensive.

A third example is accent in the listening section of the test. English is an official language in more than 50 countries and it’s up to the editors to ensure that a range of accents is represented. Perhaps there should also be speakers from countries where English is not an official language but is widely spoken. So the editors are the first filter in making sure that no one group of test takers has an advantage over any other group.

4. Standard setting

And finally, we come back to the 100 people who so impressed me. Most of them would have been part of standard setting events — mini-conferences where a whole range of language professionals look at the test items one by one. These might be testing experts, English teachers, PhD students, lecturers or corporate English trainers, and they should come from a variety of countries, sectors and cultures. The primary task is to discuss and decide whether each item matches its designated “can do” statement and to ensure that every item is correctly mapped to the CEFR scale.

Clarity and telc language tests will hold the first standard setting event for the Dynamic Placement Test in Hong Kong in May 2019. With experts coming from countries across Europe, Asia, the Middle East and Australia, it promises to be a lively and exciting event.


Further reading:

Overview of item analysis
telc language tests
About Dynamic Placement Test

Andrew Stokes, Publisher, ClarityEnglish