Speech interfaces are ready to listen
By Fred Hapgood
(IDG) -- In the early '90s CIO Magazine stuck a fork, editorially speaking,
into speech recognition and declared it done.
The reason for such excitement was that the technical problems that had
bottled up the technology in labs for decades were finally being addressed.
The most basic issue had been associating specific vocalizations with specific
phonemes. (Phonemes are the basic units of speech, such as the "wuh" sound in
"one.")
Making the associations required compiling huge databases of how the more than
40 English-language phonemes are spoken by those of different ages, genders,
linguistic cultures and under different phone-line conditions. Developers then
had to write programs that could find the degree of fit between a given user's
vocalization and one of those samples.
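To make the idea of "degree of fit" concrete, here is a minimal sketch that compares an incoming vocalization with stored samples and picks the closest one. It assumes each sound has already been reduced to a short list of numbers; the feature values, the SAMPLES table and the function names are all hypothetical, and real recognizers of the period used far richer acoustic models than a simple distance measure.

```python
import math

# Hypothetical reference samples: phoneme label -> feature vectors collected
# from different speakers. The numbers are made up for illustration.
SAMPLES = {
    "w": [(0.9, 0.1, 0.3), (0.8, 0.2, 0.35)],
    "ah": [(0.2, 0.9, 0.5), (0.25, 0.8, 0.6)],
    "n": [(0.4, 0.3, 0.9), (0.5, 0.35, 0.8)],
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_phoneme(features, samples=SAMPLES):
    """Return (label, distance) for the stored sample closest to `features`.

    A smaller distance means a better fit.
    """
    best_label, best_dist = None, float("inf")
    for label, vectors in samples.items():
        for vec in vectors:
            d = distance(features, vec)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist
```

Calling best_phoneme((0.85, 0.15, 0.3)) would pick the "w" entry here, because one of its stored samples lies closest to the input.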
Once software recognizes a phoneme, the sound has to be assigned to a meaning
(unless the product in question is a simple dictation engine for directly
recording and transcribing speech). That meant building more databases, this
time of all the ways humans might express meanings of interest, such as
"yes," "sure," "right," "correct," "yep," "yup," "yeah," "uh-huh," "fine,"
"OK," "affirmative," "good" and so on. Since the culture was throwing off new
expressions constantly -- "whatever," "no doubt," "word" -- the programs also
needed to be easy to update in the field. Then grammar algorithms needed to be
written and hardware developed that was fast enough to do all that computation
in real-time, yet cheap enough for ordinary businesses to buy.
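The database of expressions described above can be pictured as a lookup table that maps many phrasings to one meaning and that can be extended after deployment. The sketch below is only an illustration: the MEANINGS table, the "AFFIRM" label and the function names are assumptions, not any vendor's actual design.

```python
# Hypothetical phrase table: each meaning maps to the expressions that signal it.
MEANINGS = {
    "AFFIRM": {
        "yes", "sure", "right", "correct", "yep", "yup", "yeah",
        "uh-huh", "fine", "ok", "affirmative", "good",
    },
}


def normalize(phrase):
    """Lowercase the phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


def add_expression(meaning, phrase, table=MEANINGS):
    """Register a new expression for a meaning (an update 'in the field')."""
    table.setdefault(meaning, set()).add(normalize(phrase))


def interpret(phrase, table=MEANINGS):
    """Return the meaning label for a phrase, or None if it is unknown."""
    key = normalize(phrase)
    for meaning, phrases in table.items():
        if key in phrases:
            return meaning
    return None
```

With this table, interpret("no doubt") returns None until add_expression("AFFIRM", "no doubt") has been called, after which it returns "AFFIRM".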
By the early '90s all those pieces were falling into place or were close to
doing so, and we felt the implications were significant. A wide range of
enterprise functions, from order-entry to customer support to incident and
inspection reporting, were about to get cheaper and easier to perform.
Computation and telephony were going to merge. We were all going to get out of
voice-mail jail. "A decade from now perhaps speech recognition will be as
ubiquitous as voice-messaging is today," we concluded.
While considerable progress has been made since then, we're not there yet.
Were we wrong in our judgment of what had been achieved technically? Probably
not.
It turned out, however, that speech recognition is only partly a technical
problem. Recognition implies a conversation, and conversations make sense only
in the context of relationships. When humans enter relationships they
immediately impose a structure of assumptions and expectations. Is the person
smart? Knowledgeable? Nice? Lazy? Snobbish? That structure controls the
interaction. If a comprehension problem comes up during a conversation with a
smart person we assume we are at fault and take on the responsibility of
working it out. We do the same if we think our respondent is not too bright
but basically nice. On the other hand, if we think the other party is lazy,
doesn't care or, worse, is trying to manipulate us, we behave very differently.
Those relationship issues are just as important when talking with machines as
with people; even more so, since most users were and are uncertain about how
to talk to software. "Suppose you said you wanted to go to Boston and you
heard the reply, 'I don't understand,'" says William Meisel, president of TMA
Associates, a speech recognition consultancy in Tarzana, California. "This was
a common response at the time. But what didn't (the computer) understand? Was
it your pronunciation? Usage? The logical thread? You didn't know."
What you did know was that the program refused to give help when you needed
it. This refusal became a cue in and of itself -- a sign that the machine
planned to shift all the work of the conversation onto the user. Humans
reacted to that the same way they would have in a conversation, with
resentment and irritation. They raised their voices and sounded out words as
if they were speaking to a child. Their voices became stressed. They changed
their pitch. They started to swear. This would confuse the program even more,
until eventually users hung up with a bang.
Today the industry understands the importance of giving the user as much help
as possible, Meisel concludes, which might mean building another database --
this time of the most common errors -- and writing prompts that suggest
specific solutions to problems. For example, the computer could ask, "Do you
want Austin or Boston?" This does more than locate the problem as a
pronunciation issue; it reassures the user that the program has the smarts to
understand the situation and is willing to help the speaker solve it, which in
turn makes users more disposed to working with the program.
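A database of common errors combined with a targeted prompt might look like the following minimal sketch. The CONFUSABLE table, the confidence threshold and the fallback wording are all assumptions made for illustration; only the "Austin or Boston" prompt comes from the example above.

```python
# Hypothetical database of commonly confused city names (illustrative only).
CONFUSABLE = {
    "austin": ["boston"],
    "boston": ["austin"],
}


def clarify_prompt(heard, confidence, threshold=0.8, confusable=CONFUSABLE):
    """Return a helpful follow-up prompt, or None if no follow-up is needed.

    `heard` is the recognizer's best guess and `confidence` is a score
    between 0 and 1. Both are assumed to come from the recognition engine.
    """
    if confidence >= threshold:
        return None  # confident enough: accept the guess as-is
    key = heard.lower()
    rivals = confusable.get(key)
    if rivals:
        # Name the likely alternatives instead of saying "I don't understand".
        options = sorted([key] + rivals)
        names = " or ".join(name.title() for name in options)
        return f"Do you want {names}?"
    # Assumed fallback wording for words with no known confusions.
    return "I'm sorry, which city did you say?"
```

A low-confidence "Boston" produces the prompt "Do you want Austin or Boston?", while a high-confidence one passes through without interrupting the caller.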
Mike Phillips, CTO of SpeechWorks International, a speech recognition product
and services vendor in Boston, offers many other examples. Answering a
question with "unauthorized request" might work on a Web page, for instance,
but in the context of speech it communicates a haughty indifference. A better
answer would be, "I'm sorry, my supervisor doesn't allow me to make that
transaction" in a sympathetic, you-and-me-against-the-world tone of voice. The
old way of saying that a database is down was simply, "That database is down."
That might work as an error message onscreen, but in the context of speech a
better way might be to sigh and say, "I'm sorry, the system is giving me a lot
of trouble right now."
Speech recognition systems used to ask people to "Wait for the prompt to
finish before speaking." That made the technology easier to implement, but it
communicated a snippy insistence on privilege and hierarchy that annoyed
users. Today most speech recognition software is "barge-in enabled," which
means that speech recognition programs defer to users whenever they interrupt.
"The point is to keep assuring the user that the system is on her side,"
Phillips says.
During the past few years, the underlying technology has continued to improve.
(Meisel estimates that the error rate in phoneme recognition falls by
about 30 percent per year.) The technology is now in the peculiar position of
outrunning expectations, says Phillips. Good speech recognition is perfectly
capable of handling a complete sentence, such as "I want to take the redeye
from Boston to Austin a week from Wednesday," but most users still want a
highly structured interaction that prompts for each element of the
transaction. Phillips' hope is that as their experience with speech
recognition applications grows, users will relax and conversations will get
more ambitious and wide-ranging. But whatever happens, the programs will
always be very, very nice.
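The contrast between handling a complete sentence and prompting for each element can be sketched as slot filling: take whatever the caller volunteers, then ask only for what is still missing. The city list, slot names and prompt wording below are hypothetical, and the sketch works on text that has already been transcribed rather than on audio.

```python
import re

# Hypothetical list of cities the application knows about.
CITIES = ("Boston", "Austin")

# Assumed prompt wording for each slot the caller has not yet supplied.
PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where would you like to go?",
}


def fill_slots(utterance):
    """Extract origin and destination from a transcribed sentence."""
    city = "|".join(CITIES)
    slots = {"origin": None, "destination": None}
    match = re.search(rf"\bfrom ({city})\b", utterance, re.IGNORECASE)
    if match:
        slots["origin"] = match.group(1).title()
    match = re.search(rf"\bto ({city})\b", utterance, re.IGNORECASE)
    if match:
        slots["destination"] = match.group(1).title()
    return slots


def next_prompt(slots):
    """Return the prompt for the first empty slot, or None if all are filled."""
    for name in ("origin", "destination"):
        if slots[name] is None:
            return PROMPTS[name]
    return None
```

Given the redeye sentence quoted above, both slots are filled in one pass and no prompt is needed. A caller who says only "I want to go to Austin" is asked where the trip starts.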