This is the first article in a series about speech and language:
- Why are speech or language interfaces useful?
- What makes speech or language interfaces hard to create? Part 1: Overview
- What makes speech or language interfaces hard to create? Part 2: Speech
- What makes speech or language interfaces hard to create? Part 3: Language
- When is a speech or language interface a poor choice?
By speech or language interfaces, I mean things like: speech synthesis as in a satnav, speech recognition as in Siri or Alexa, or understanding text as in a chatbot or the context-sensitive help from Word or Grammarly. We’re getting increasingly used to things like this, but what’s going on that makes them a different kind of thing from the previous ways of using a computer, based on menus and mice?
The big thing about these interfaces is that they’re much closer to humans than a GUI is. If you think about the parties involved, there’s the user on one side and the computer on the other. The user probably doesn’t care much about the computer, other than it’s a way of getting things done. (Sad people like me might be interested in the computer, but it’s like with me and my car – I’m glad I have one, but I usually take it for granted as a way of getting me to work and I don’t tinker with the suspension or transmission.)
Imagine that I as the user have a goal which is to send a message to Paul Carter to say my train will arrive at 4pm. In GUI-based systems, the icons and so on help, but I still have to break down my goal into a series of steps imposed on me by the computer:
- Find the email program’s icon on the screen
- Move the mouse so that the pointer is on it and double-click
- Move the mouse so that the pointer is on New Message and single-click
- Move the mouse so that the pointer is on the To field and type Paul Carter
- Move the mouse so that the pointer is in the Body field and type Hi Paul, my train will arrive at 4 pm
- Move the mouse so that the pointer is on the Send button and single-click
I might have learned keyboard shortcuts, but they are effectively the same for these purposes – I have to press the keys that the computer has said I need to, at the times set by the computer. In terms of the two parties and the interface between us, it’s a long way over towards the computer.
With a speech based system I can do something more like this:
- Say “OK Google”
- Say “Send an email to Paul Carter”
- Wait for the computer to say “Sending a message to Paul Carter”
- Say “Hi Paul, my train will arrive at 4 pm” and then pause.
- Wait for the computer to say “Sending”
There isn’t much difference in the number of steps, but I’m doing things that are natural to me (speaking and listening) and most of them are close to the task in hand rather than the computer-imposed overhead of menus and so on.
The interface is much closer to me than before:
I am delegating to the computer a lot of the detail of how it solves my problem. I don’t care where the email program is stored, how I access it, where it keeps its list of email addresses, and so on. Those are all irrelevant details to me, and a speech-based interface takes them away.
It’s similar for purely text systems like a Google search. While I can break out things like Boolean logic if I want to, I can also choose to ignore tricky implementation details. I don’t need to use a weird syntax that identifies key parts of the text as the important search terms, I don’t need to say how a word changes in different situations (like present / past tense, or plural / singular) and it can even understand typos. In short, it can cope with how I already am (spelling mistakes and all), rather than forcing me to be like a computer.
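As a toy illustration of coping with typos, here’s a sketch using simple fuzzy string matching from Python’s standard library. (This is only an illustration of the idea, not how Google actually implements it, and the vocabulary is one I made up.)

```python
import difflib

# A toy vocabulary of known search terms; a real engine's index is vastly larger.
KNOWN_TERMS = ["receive", "flight", "booking", "return", "train"]

def tolerant_lookup(word):
    """Map a possibly misspelled word onto a known term, if one is close enough."""
    matches = difflib.get_close_matches(word.lower(), KNOWN_TERMS, n=1, cutoff=0.8)
    return matches[0] if matches else None
```

For example, `tolerant_lookup("recieve")` maps the common misspelling back to `"receive"`, while a word nothing like any known term returns `None`: the system copes with how I already am, spelling mistakes and all.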
Following my lead
A well-designed speech and language system will adapt to the user’s actions, so that the user achieves their goal as painlessly as possible, in a way that would be harder to do with a GUI-based system.
Imagine that I want to book a flight. The first thing I say to the computer could be any of these:
- I want to fly tomorrow
- I want a return flight to Oslo
- I want to fly to Oslo from Stansted tomorrow and return two days later
In order to make a booking, even assuming it’s for one person on a return trip, and that I don’t care about important details like the airline, the price, the class of seat or the time of day, I need to give these bits of information:
- The airport I’m flying from
- The airport I’m flying to
- The date I’m leaving
- The date I’m returning
(The fewer the assumptions, the longer this list gets.)
I could give one to four items from that list in my first sentence, which gives 4 + 4×3 + 4×3×2 + 4×3×2×1 = 64 possible first sentences in terms of which bits of information I include and their order. (There are even more if you allow for synonyms, like “I want a return ticket from … “, “I want to fly from …” etc.)
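That count is just the number of ordered selections of one to four items from a list of four, which a couple of lines of Python can confirm:

```python
from math import perm

# Ordered ways to mention k of the 4 booking details, summed over k = 1..4.
total = sum(perm(4, k) for k in range(1, 5))  # 4 + 12 + 24 + 24
print(total)  # 64
```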
A well-designed system can recognise what I’ve given and what’s missing, and ask only for what’s missing. I’m not forced to follow the single order that a wizard in a GUI wants to impose on me.
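This “work out what’s missing and ask for it” pattern is usually called slot filling. Here’s a rough sketch of the idea, using the four booking details above. (The slot names, prompts and toy regular expressions are mine, not any real product’s; a real system would use a proper language-understanding component rather than regexes.)

```python
import re

# The four details from the list above; names are illustrative.
SLOTS = ["origin", "destination", "departure_date", "return_date"]

# Toy patterns standing in for a real language-understanding component.
PATTERNS = {
    "origin": re.compile(r"from ([A-Z]\w+)"),
    "destination": re.compile(r"to ([A-Z]\w+)"),
    "departure_date": re.compile(r"\b(tomorrow)\b"),
    "return_date": re.compile(r"return(?:ing)? (.+ later|tomorrow)"),
}

PROMPTS = {
    "origin": "Which airport are you flying from?",
    "destination": "Where would you like to fly to?",
    "departure_date": "When do you want to leave?",
    "return_date": "When will you return?",
}

def extract_slots(utterance, booking):
    """Fill in whichever slots the utterance mentions, in any order."""
    for slot, pattern in PATTERNS.items():
        if booking[slot] is None:
            match = pattern.search(utterance)
            if match:
                booking[slot] = match.group(1)
    return booking

def next_question(booking):
    """Ask only for what is still missing; None means we can book."""
    for slot in SLOTS:
        if booking[slot] is None:
            return PROMPTS[slot]
    return None
```

Given “I want to fly to Oslo from Stansted tomorrow and return two days later”, all four slots fill at once and there is nothing left to ask; given only “I want a return flight to Oslo”, the system’s next question is about the origin airport. Either way, I led and the computer followed.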
I’m deliberately including lots of things here, for people of lots of different abilities. It’s too easy to think of accessibility as something that applies only to a group of people different from you whom you can ignore. This is not true in general (as well as being a horrible position to take). It’s even less true with speech and language interfaces, if nothing else because the group of users they help is potentially everyone.
As well as doing things like giving people such as Prof Stephen Hawking a voice, speech synthesis means that even able-bodied drivers can get directions safely from their satnav or phone. Not only can speech recognition mean that someone with mobility problems can control things in their home, it also means that someone working on a production line can issue commands over a headset microphone while their hands are busy.
I hope that this has unpacked the benefits of speech and language to user interfaces, and hence to users. That’s not to say they’re useful for all tasks, or easy to build. These are topics I hope to cover in future posts.