The quest for VoiceGPT

by Andrew Stokes | 28 March 2023

The voices we hear around us in day-to-day life aren’t always what they seem. Take announcements on public transport. Sometimes they are obviously human: the speaker has an unusual accent or corrects themself a couple of times; sometimes they are so robotic that they can only come from a voice generator. And then there are the ones in the middle that leave you thinking ‘Human or machine? Hmmm, I’m really not sure.’

If the answer is ‘machine’, there are some obvious ways in which computer-generated voices (CGV) can help with language learning. In this post, we’ll identify potential benefits of using CGV as a resource for teachers, students and publishers, then we’ll ask whether the technology is up to the job yet.

It’s generally accepted that, in the context of global English language teaching (ELT), students should aim for ‘international intelligibility’ in their speaking. So while learning (and test) materials may predominantly use ‘standard’ British, North American or Australian models, they should include other accents too: Indian, Caribbean, Hispanic, Chinese, and so on. It’s difficult to imagine circumstances in the 2020s in which a globally focused language learner would not be exposed to a variety of accents. But how do we improve access to these voices?

The good news is that it’s never been easier for teachers to find a range of accents: Radio Garden will transport you with a flick of the mouse from Anime Radio in Bison, Kansas to Glorious Radio in Colombo via all points in between. But it’s difficult to exploit this audio in class. At least 90% of the output is music; where there is voice you can’t record it; and anyway, it’s aimed almost exclusively at listeners at C1 or C2 level.

The enticing solution CGV offers is the possibility of quickly and easily creating your own audio, with a similar variety of accents but at the right language level and at an appropriate speed of delivery. It would be both beneficial and highly motivating to use this audio in class for listening activities or (receptive) pronunciation tasks.

This would work equally well for publishers who may want to include a range of accents but find it impractical or unaffordable to do so because voice talents typically charge by the half-day and budgets won’t run to it.

Many students already use the voice generator in Google Translate (or similar) to find out how to pronounce unknown words. This works less well at the sentence level because the delivery tends to be monotonous, but a more natural-sounding CGV could similarly help with the suprasegmental aspects of pronunciation (stress, rhythm and intonation), both receptive and productive.

So the demand is there. How about the supply? We explored three voice generators: Revoicer, Murf and Azure. Our benchmarks were that the output should be sufficiently human to be an acceptable pronunciation model for a learner, and that the site should be practical to use and affordable. All the sites had ‘minority’ accents, though the majority of voices were American and British.

1. Revoicer

This site is easy to use, and usability tended not to be a problem with any of the sites we sampled. The biggest issue with Revoicer is the lack of consistency both within and across voices. Some voices sound natural but others are very robotic. And a voice that sounds quite good in preview often sounds unnatural when given a longer text to read out. With a list, for example, intonation is an important element in the meaning, but the expected rise, rise, fall pattern is not there in the initial output, and there’s no way of manipulating the intonation within an utterance.

2. Murf.ai

With Murf.ai, the most popular and best reviewed site, we tested sentence stress. Take this utterance: ‘Actually, I asked for a medium cappuccino, not a large one.’ The expected stress on ‘medium’ was not there in the initial output, but Murf has an emphasis function enabling you to add stress to particular words. However, it doesn’t seem to recognise that there are three elements to sentence stress: pitch, volume and length, and that these apply only to the stressed syllable. Murf manipulates only the pitch, and applies it to the whole word – and this, of course, sounds extremely strange.

3. Azure

The Microsoft Azure voice generator seems to be a generation ahead of the other two we tried. However, to use it you have to subscribe to a whole network of Azure services, which immediately rules it out for teachers, small publishers and students. So we had to exclude it on cost grounds – in fact, we weren’t even able to put it through its paces properly or to get a quote for the price. It’s also worth noting that if you decide to subscribe to Revoicer, the site is less than transparent about its charging model. Its ‘One Time Payment’ (US$65) is immediately followed by an invitation to make another one-time payment – and it’s not difficult to guess where that will lead.

Based on a few hours’ research, the answer to the question ‘Human or machine?’ in public transport announcements is probably ‘Human’. We found that CGV sites are generally practical and easy to use but lack the sophistication and fine-tuning required in the language classroom. The pricing models are both expensive and opaque. But the technology is still fun to explore and exciting to use, and with the speed at which it is developing, it would be surprising if there weren’t dramatic improvements within the next year.

 

Andrew Stokes, Publisher, ClarityEnglish
