What makes speech and language interfaces hard to create? Part 2: Speech

Introduction

This is the third in a series of articles about using speech and text to interact with computers:

Why are speech or language interfaces useful?
What makes speech and language interfaces hard to create? Part 1: Overview
What makes speech and language interfaces hard to create? Part 2: Speech
What makes speech and language interfaces hard to create? Part 3: Language
When is a speech and language interface a poor choice?

This article builds on the first two and asks: what makes speech so hard for computers?

The next article in this series will be about what makes text so hard for computers.

An analogy

I alluded to this in the previous article, but here’s an analogy that might help to explain one of the basic problems in speech interfaces.

You can do this for real, or just imagine it. Wherever you are, stand up (if you’re not already), and then walk to the nearest door between inside and outside. Go through this door, then turn around and walk back to where you started. Now do this again. However, you must reproduce what you did the first time exactly. The way you moved your arms – how your shoulders, elbows and wrists flexed (at what time and by how much), the way you moved your legs and feet, the way your torso twisted, the way your head moved. Reproduce all of that exactly. (If you’re a wheelchair user, then I hope you can interpret this accordingly.)

It’s surprisingly hard to get it exact, even though walking is something you do every day.

[Image: Mickey Mouse walking]

Speech is a similar problem. There is a lot of scope for variation, even though there is only a small number of moving parts in the system (lips, tongue, jaw, vocal cords and lungs). Some variation is incidental – how you move or speak today vs. yesterday. Some variation might be important – for instance different dance moves in a ballet or signs in sign language.

Variation in speech

If you look at speech from the point of view of phonetics there are three dimensions that can vary, which are the three dimensions of a spectrogram (the x and y axes, plus the colour).

Duration – you can slow down or speed up
Volume – you can be louder or quieter
Frequency – you can speak higher or lower

However, we’re not usually interested in knowing how high-pitched someone’s talking, or how quietly or quickly; we’re usually interested in what words they’re saying. So in speech recognition we have to navigate the variations in these low-level properties, to arrive at a higher-level understanding: the words they’re saying.
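To make the three dimensions concrete, here is a minimal sketch of how a spectrogram can be computed. It assumes NumPy is available, and it uses two synthetic tones as a stand-in for speech (a quiet, low one followed by a louder, higher one). The function name, frame length and hop size are my own choices for illustration, not anything a particular speech system uses.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=256, hop=128):
    """Return (times, freqs, magnitudes) for a 1-D signal.

    magnitudes[i, j] is the strength of frequency freqs[j] at time times[i]:
    time is the x axis, frequency the y axis, magnitude the colour.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and taper each one.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return times, freqs, magnitudes

# Synthetic stand-in for speech (an assumption, not real audio):
# half a second of a quiet 500 Hz tone, then a louder 2000 Hz tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
low = 0.2 * np.sin(2 * np.pi * 500 * t[:4000])
high = 1.0 * np.sin(2 * np.pi * 2000 * t[4000:])
signal = np.concatenate([low, high])

times, freqs, mags = spectrogram(signal, sample_rate)
first, last = mags[0], mags[-1]
print("start:", freqs[first.argmax()], "Hz, peak magnitude", first.max())
print("end:  ", freqs[last.argmax()], "Hz, peak magnitude", last.max())
```

The first and last frames peak at different frequencies and with different strengths, which is exactly the kind of low-level variation a recogniser has to see past to get to the words.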

The tricks your brain plays

Before we carry on with the theory, here’s another thought experiment. I’ve done this as an actual experiment, and you could too if you have audio editing software like Audacity.

Imagine you have a recording of someone saying we hid the stain. In your imaginary (or actual) audio editing program, delete a couple of bits of this and then shuffle what’s left together so that there are no gaps. The bits to delete are half of the i in hid (either half will do) and all of the s in stain. Then play the shorter version of the recording – what do you think you will hear?

What I heard was we hit the Dane. That was weird – shortening the vowel in hid changed which consonant seemed to come next, and removing the s from the cluster of consonants at the start of stain changed the remaining consonant in that cluster.

I mention this because it shows how the context a sound is in changes how it is said. Our brains have learned to work with these context-dependent changes when decoding the noises that come into our ears. When we fiddle with that noise such that the normal link between context and noise is broken, our brains continue to use the rules that link context and noise, but now the rules lead them back to a different context (a different set of words).

In British English, the duration of a vowel sound is affected by the consonant that follows it. If the consonant is voiced (your vocal cords are making some noise), the vowel is longer than if it’s followed by a voiceless consonant. t and d are a pair of consonants where your mouth is doing the same thing, but your vocal cords are working during d and not during t. So when the vowel is shortened to the length it would have before a voiceless consonant, your brain treats the vowel’s length as a stronger cue to which consonant follows than whether that consonant is actually voiced or voiceless.

The last word shows that we kind of lie to ourselves and each other, at least in British English. We don’t actually say stain, we normally say something closer to sdane but pretend that it was stain. Your vocal cords need to be working for the vowel sound, and instead of turning on abruptly at the start of the vowel, they start to work during the previous consonant, which makes it sound more like sdane. However, because sdane isn’t a word in British English, the brain ignores this difference in the audio signal and assumes that you actually said stain. When you remove the s sound, your brain hears what was there all along.

Given this is what’s actually in the audio signal, you can begin to see that it can be hard to decode it, or to create a realistic audio signal if we’re doing speech synthesis.

What is encoded in a speech signal?

If you refer back to the previous post, you can see all the linguistic layers. The information in these layers all has to squash down into the speech signal somehow. As well as that, there is information about the speaker – age, gender, nationality, region, physical state (tired, drunk etc.), emotional state (excited, sad etc.).

In speech recognition it’s even harder. As well as the signal we’re interested in, there’s noise that we’re not. This could be other people talking in the background (as in a café), other background noise (trains going by at a station), the speaker’s beard or piercings rubbing against the microphone, and coughs and sniffs.

The speaker is unlikely to be perfect all the time – they might stumble over words, leading to fillers like err, stuttering, repeating themselves etc.

The medium itself might not be perfect. A phone actually throws away a lot of the frequency range, which can make the letter S sound like the letter F. This might make e.g. a UK postcode hard to recognise reliably. Did the speaker say CB3 9SW or CB3 9FW?
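Here is a rough sketch of that loss of frequency range, assuming NumPy. Traditional phone lines carry roughly 300 to 3400 Hz, so the code simply zeroes everything outside that band. The two pure tones are stand-ins I have made up for illustration: one for the part of speech that sits inside the band, and one at 6000 Hz for the kind of high-frequency hiss that helps you tell an S from an F. Real phone systems filter more gently than this, so treat it as a caricature.

```python
import numpy as np

def band_limit(signal, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Crudely imitate a narrowband phone line by zeroing FFT bins outside the band."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0
    return np.fft.irfft(spectrum, n=len(signal))

def rms(x):
    """Root-mean-square level, a simple measure of how much signal is left."""
    return float(np.sqrt(np.mean(x ** 2)))

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
in_band = np.sin(2 * np.pi * 1000 * t)     # hypothetical stand-in: inside the phone band
hiss_like = np.sin(2 * np.pi * 6000 * t)   # hypothetical stand-in: high-frequency hiss

print("in-band tone after the phone line:", rms(band_limit(in_band, sample_rate)))
print("hiss-like tone after the phone line:", rms(band_limit(hiss_like, sample_rate)))
```

The in-band tone comes through untouched, while the hiss-like tone all but vanishes. If the cue that separates two sounds lives mostly in the part that was thrown away, the recogniser is left guessing.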

Basic problem – speech recognition

In speech recognition you need to match what’s coming in against a list of known words to see which one matches best. You need to ignore some differences (male vs. female speaker, variation in regional accents etc.) while paying attention to other differences (cat vs cut, clock vs clog etc.).

To make matters worse, as mentioned in the previous article, you don’t know where the boundaries are between words. The speech signal is a fairly unbroken stream of noise, and the speech recognition system has to choose where to apply boundaries in order to chunk it into words. For many people, the audio that they produce when they say up at eight o’clock could just as easily be chunked up into a potato clock.
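The chunking problem can be shown with a toy example. The sketch below takes one unbroken sequence of sounds and finds every way of splitting it into words from a tiny lexicon. The sound symbols and pronunciations are hypothetical and heavily simplified (they are my rough approximation of casual speech, not real phonetic data), and a real recogniser works with probabilities rather than exact matches.

```python
# Hypothetical, heavily simplified casual pronunciations (not real phonetic data).
LEXICON = {
    "up": ("uh", "p"),
    "at": ("uh", "t"),
    "eight": ("ay", "t"),
    "o'clock": ("uh", "k", "l", "o", "k"),
    "a": ("uh",),
    "potato": ("p", "uh", "t", "ay", "t", "uh"),
    "clock": ("k", "l", "o", "k"),
}

def segmentations(sounds, lexicon=LEXICON):
    """Return every way to split the sound sequence into words from the lexicon."""
    sounds = tuple(sounds)
    if not sounds:
        return [[]]  # one way to segment nothing: no words
    results = []
    for word, pron in lexicon.items():
        # If the stream starts with this word's sounds, segment the remainder.
        if sounds[:len(pron)] == pron:
            for rest in segmentations(sounds[len(pron):], lexicon):
                results.append([word] + rest)
    return results

# One unbroken stream of sounds, with no word boundaries marked.
stream = "uh p uh t ay t uh k l o k".split()
for words in segmentations(stream):
    print(" ".join(words))
```

With this toy lexicon the same stream of sounds comes out as both up at eight o'clock and a potato clock (plus a third reading, up at eight a clock). Nothing in the sounds themselves says which one was meant.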

A computational linguistics professor once told me this interesting comparison. A native adult speaker of a language might understand tens of thousands of words (even if they don’t use them all). So, even if you could somehow know that the previous word has just finished, you have something like a 1 in 40,000 guess for the next word. If, as well as knowing that the previous word had finished, you can keep track of the context at all the linguistic levels (syntax etc.) then you can narrow that down to about 1 in 32. (If you had previously been talking about oranges, it’s unlikely that you’d suddenly start talking about gutters. If you had just said the word the, it’s unlikely that the next word would also be the – unless you’re talking about 80s indie bands.)
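One way to get a feel for those two numbers is to turn them into bits, i.e. how many yes/no questions you would need to pin down the next word. This is only back-of-the-envelope arithmetic using the rough figures above, and it assumes every candidate word is equally likely, which is a simplification.

```python
import math

vocabulary_size = 40_000   # rough figure from the text: no context at all
with_context = 32          # rough figure from the text: context at all levels

# Simplifying assumption: every candidate word is equally likely.
bits_without = math.log2(vocabulary_size)
bits_with = math.log2(with_context)

print(f"{bits_without:.1f} bits vs {bits_with:.1f} bits")   # 15.3 bits vs 5.0 bits
print(vocabulary_size / with_context)                       # 1250.0
```

So on these figures, context shrinks the guessing game by a factor of 1,250, which is why the higher linguistic levels matter so much to a recogniser.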

Basic problem – speech synthesis

For a given set of input words (e.g. as text), you need to pick a gender, age, nationality, regional accent, emotional state and so on. You need some base representation of the sound of each word based on those choices, bearing in mind the effect of context on sounds. (So if you’re using rules to go from letters to sounds, the results won’t sound natural.)

The basic representation of the words’ sounds then needs to be modified based on whether it’s a sentence or question, where the word is in the sentence, where the word’s clause is in the sentence and so on.

As an example, consider this text from the US Declaration of Independence:

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.

This sentence is quite long but is understandable if someone read it to you. As they read it their speech would convey the structure of the overall sentence and where each word is in that structure. For instance:

self-evident ends a clause and announces that a list follows, so the tone will fall and the word will be lengthened slightly and be followed by a pause;
equal and Rights both end an element in that list which isn’t the last element, so the tone will rise slightly and there will be another pause after each one;
Life and Liberty are elements of a nested list which is inside the last element of the outer list, and so they will be followed by a pause that is shorter than the pause after equal and Rights;
Happiness ends the inner list, the outer list and the whole sentence, so the tone will fall and the word will be lengthened.

For synthesised speech to sound natural and be easy to understand it has to convey this structural (syntactic) information, as well as information from all the other linguistic levels. Merely plucking each word’s sound from a dictionary and stringing these sounds together produces a robotic noise that is hard work to listen to as it is so unnatural.
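The cues described for the Declaration of Independence sentence can be written down as a small table of rules. This is only a sketch: the boundary labels are mine, the words are labelled by hand (working out the structure automatically is the hard part), and the pause lengths are made-up numbers chosen just to respect the ordering described above. Where the description above doesn't specify a tone, the rule leaves it as None.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Prosody:
    tone: Optional[str]   # "rise", "fall", or None where the text does not say
    lengthen: bool        # whether the word itself is stretched
    pause_ms: int         # hypothetical pause after the word, in milliseconds

# Hypothetical boundary types and values; only their ordering comes from the text.
RULES = {
    "clause_end_before_list": Prosody(tone="fall", lengthen=True, pause_ms=300),
    "outer_list_item": Prosody(tone="rise", lengthen=False, pause_ms=300),
    "inner_list_item": Prosody(tone=None, lengthen=False, pause_ms=150),
    "sentence_end": Prosody(tone="fall", lengthen=True, pause_ms=0),
}

# Hand-labelled boundaries; a real system would have to derive these from syntax.
BOUNDARIES = {
    "self-evident": "clause_end_before_list",
    "equal": "outer_list_item",
    "Rights": "outer_list_item",
    "Life": "inner_list_item",
    "Liberty": "inner_list_item",
    "Happiness": "sentence_end",
}

def annotate(words):
    """Pair each word with its prosody, or None for words at no boundary."""
    return [(word, RULES.get(BOUNDARIES.get(word))) for word in words]

sentence = ("We hold these truths to be self-evident, that all men are created "
            "equal, that they are endowed by their Creator with certain "
            "unalienable Rights, that among these are Life, Liberty and the "
            "pursuit of Happiness.")
words = [w.strip(",.") for w in sentence.split()]

for word, prosody in annotate(words):
    if prosody is not None:
        print(word, prosody)
```

Even this toy version needs to know the structure of the sentence before it can say anything about tone or pauses, and the structure isn't something you can read straight off the list of words.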

Summary

Speech is a very rich soup of information. As made by humans in real life, it’s made even more complex due to hesitation, background noise and other things outside the scope of the words being said. It’s natural for the soup to never be exactly the same from one speaker to the next, or even one moment to the next for the same speaker.

Recognition means having to cope with this variation and still pick some information out of the rich soup. Synthesis, to make something that sounds natural and comfortable to humans, means creating this rich structure out of very little, and then ideally not just doing this once but doing it slightly differently each time.