Introduction
This post is in a series about computers, speech and language:
- Why are speech and language interfaces useful?
- What makes speech and language interfaces hard to create? Part 1: Overview
- What makes speech and language interfaces hard to create? Part 2: Speech
- What makes speech and language interfaces hard to create? Part 3: Language
- When is a speech and language interface a poor choice?
Having described in a previous post when speech and language can be an excellent choice for a user interface, I’d like to talk in this post about tasks for which they are a poor choice. Note that for some people they are the only choice, because, for instance, they can’t see or type. That is a valid discussion to have, but it deserves its own post or more, and in this post I will concentrate on when all options are available.
Seeing patterns and structure
We can hear many things at once, but only people with more musical skill than me, like choirmasters and conductors, can hear the individual parts and how they relate. That’s why I like the podcast Song Exploder – as the artist describes the process of making a piece of music, its components are introduced separately, so that when I hear the end result I appreciate its richness much more than I would normally.
So for audio duffers like me, while I can appreciate an overall sound, it’s hard to break a complex sound into its parts and perceive a structure. The structure in music – instrumental or choral or even musicals – is much clearer to me visually. (Follow those links to some gorgeous visualisations.)
It might be just because we haven’t developed the techniques or technology yet, or because of how Western culture has developed over history, but complex information is usually more understandable when seen than when heard. We can use hearing to quickly detect direction and distance, but not to take in a complex picture.
An interesting example of this contrast is a podcast from Radiolab about the colour ranges that different creatures can see. This would be straightforward to represent visually – just portions of a spectrum, extending into the infra-red and ultra-violet if necessary. However, this is a podcast, i.e. audio only. They took an interesting approach by arranging colours on the musical scale, with low notes towards the red end and high notes towards the blue end. They then had a choir sing all the notes mapped to the colours that a given animal could see, and you could tell the size of the colour range from the sound.
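To make that mapping concrete, here’s a toy sketch in Python of the kind of encoding I mean. The note range, wavelength endpoints and linear mapping are all assumptions of mine rather than Radiolab’s actual method; the point is just that a wider colour range comes out as a bigger chord.

```python
# A toy sketch of the kind of colour-to-pitch mapping described above.
# The note range, wavelength endpoints and linear scale are my assumptions,
# not Radiolab's actual method: red (long wavelengths) maps to low notes,
# blue / ultra-violet (short wavelengths) to high notes.

LOW_NOTE, HIGH_NOTE = 48, 84       # MIDI notes C3 to C6 (assumed)
RED_NM, UV_NM = 750.0, 300.0       # wavelength endpoints in nanometres (assumed)

def wavelength_to_midi(nm: float) -> int:
    """Linearly map a wavelength to a MIDI note, inverted so red is low."""
    fraction = (RED_NM - nm) / (RED_NM - UV_NM)   # 0.0 at red, 1.0 at UV
    return round(LOW_NOTE + fraction * (HIGH_NOTE - LOW_NOTE))

def colour_range_chord(shortest_nm: float, longest_nm: float) -> list[int]:
    """Every semitone between the notes for a creature's range endpoints."""
    low = wavelength_to_midi(longest_nm)    # longest wavelength -> lowest note
    high = wavelength_to_midi(shortest_nm)  # shortest wavelength -> highest note
    return list(range(low, high + 1))

print(len(colour_range_chord(400, 700)))  # human visible range: 25 notes
print(len(colour_range_chord(300, 700)))  # a bird that sees UV: 33 notes
```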
However, I think this is an exception that proves the rule. There was only one dimension in the information space, and the relationship between the bits of information (the individual colours for an animal) wasn’t as important as their number and their minimum / maximum values. I wouldn’t want to take an arbitrary graph or visualisation and turn it into sound.
Similarly, the adage that a picture is worth a thousand words is so often true. Written language can be just as poor at conveying richly structured information. (It can be amazing at conveying other things, like emotion.)
Scanning through a list of options
This is a special case of the structure problem above. If there is a list of things, a spoken interface will give them to you one at a time, starting at one end of the list and moving to the other.
It would be much easier to scan through a visual menu of words or pictures. You can move about this list in whichever order you prefer, skip things, and so on.
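To put rough numbers on that, here’s a toy sketch (entirely my own illustration, with invented timings) of what it costs to reach, say, the fifteenth email in a spoken list versus a visual one:

```python
# A toy model (entirely my own, with made-up timings) of the cost of
# reaching one item in a spoken list versus a visual one: audio is strictly
# sequential, so the k-th item costs k read-outs, while a screen shows
# everything at once and your eye jumps straight to it.

SECONDS_PER_SPOKEN_ITEM = 3.0    # assumed time to read one email subject aloud
SECONDS_TO_SPOT_ON_SCREEN = 1.0  # assumed time to spot an item visually

def spoken_time_to_reach(item_index: int) -> float:
    """A spoken interface must read every item before the one you want."""
    return (item_index + 1) * SECONDS_PER_SPOKEN_ITEM

def visual_time_to_reach(item_index: int) -> float:
    """A visual list is roughly constant time, wherever the item is."""
    return SECONDS_TO_SPOT_ON_SCREEN

print(spoken_time_to_reach(14))   # 15th email: 45 seconds of listening
print(visual_time_to_reach(14))   # 15th email: about 1 second of scanning
```

However crude the numbers, the shape of the difference is the point: the spoken cost grows with the length of the list, while the visual cost barely does.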
I know this is a statement of the bleedin’ obvious, but sometimes user experience people or gadget freaks jump on the latest shiny bandwagon even when it’s headed in the opposite direction to the bleedin’ obvious. Just because Alexa can read you all your emails doesn’t mean that’s the best way for you to read them – just pull out your phone and scan them all more quickly. Better yet, a smart speaker with a screen combines the best of both worlds – speech recognition / synthesis when that makes most sense, and a GUI when that’s better.
Summary
I’m not saying any style of user interface is inherently bad or good. I am saying that user interface designers should (as always) respect the particular user and their use cases, and not assume one size fits all.
There is an old tale, probably apocryphal, of an early public demo of speech recognition, in which someone from the audience shouted out “arr emm dash arr star”. This was converted to text – rm -r *, which recursively deletes the files and directories under the current directory – and sent to the command line of the Unix workstation running the demo of the speech recognition software.
Whether this caused the workstation to fail, or other adverse outcomes, is not recorded.