Re: Text to speech programs

Dan Clemmensen (Dan@Clemmensen.ShireNet.com)
Thu, 24 Jul 1997 22:20:12 -0400


CALYK@aol.com wrote:
>
> All they need to do is to get someone to record their voice for each word,
> probably a few different times, with different stresses each time. The
> problem comes in assembling the grammar to fit which stressed words should be
> used. Does typical grammar apply here or are there new forms of language
> composition/evaluation that can be applied?
>
> danny (some dumb kid)

You are correct in principle, but your solution would be rather
expensive both in initial cost and in storage space. Still, it may be
time to look at this approach again.

This is essentially a good idea. It violates no known laws of physics :-)

Let's see: the English language has a bunch of words, but the typical
working vocabulary for this sort of thing might be about 15,000 roots,
with (say) an average of 4 words per root. (I'm guessing, but check a
dictionary.) So, we need to pay a reader to read 60,000 words, four
times each. With associated overhead like clicking a button when the
word is flashed on the screen, the paid reader might do a word every 4
seconds for (say) 4 hours per paid workday. That's 3600 words per day.
Call it 20 workdays for one pass through the list, or somewhere around
70 for all four readings, and you can probably get it done by cheap
labor (i.e., a high school kid). OK, the up-front cost isn't as high as
we thought!
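
The recording estimate above can be checked with a few lines of
arithmetic. This is only a sketch using the guessed figures from this
post (15,000 roots, 4 words per root, 4 readings per word, one reading
every 4 seconds, 4 hours a day); the variable names are mine.

```python
# Guessed inputs taken from the post; none of these are measured values.
ROOTS = 15_000
WORDS_PER_ROOT = 4
READINGS_PER_WORD = 4
SECONDS_PER_READING = 4
HOURS_PER_DAY = 4

words = ROOTS * WORDS_PER_ROOT                                   # distinct words to record
readings_per_day = HOURS_PER_DAY * 3600 // SECONDS_PER_READING   # readings in one paid day
days_one_pass = words / readings_per_day                         # every word read once
days_all_passes = days_one_pass * READINGS_PER_WORD              # every word read four times

print(f"{words} words, {readings_per_day} readings/day")
print(f"one pass: {days_one_pass:.1f} workdays")
print(f"four passes: {days_all_passes:.1f} workdays")
```

With these inputs a single pass comes to a bit under 17 workdays, and
reading every word four times takes about four times that.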

Now, what about the storage cost? I'd guess that a word will take, on
average, half a second to speak. One modern real-time continuous speech
compression algorithm, known as A-CELP, can compress high-quality
speech into 4000 bits/sec, or 500 bytes/sec. I strongly suspect that
you can derive an off-line compression algorithm that can cut this in
half, to 250 bytes/sec, or 125 bytes/word. So, our database isn't too
big after all: 7.5 MB for one recording of each of the 60,000 words, or
30 MB if we keep all four readings. On a modern computer, that's
nothing.
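
The storage figure works out as follows. Again this is a sketch under
the assumptions stated in this post: 4000 bits/sec for the codec, a
guessed factor-of-two gain from off-line compression, and half a second
per word. The names are mine, and the factor of two is a guess, not a
property of any real codec.

```python
# Assumed inputs from the post.
CODEC_BITS_PER_SECOND = 4000   # quoted real-time rate
OFFLINE_GAIN = 2               # guessed extra compression from working off-line
SECONDS_PER_WORD = 0.5         # guessed average spoken length of a word
WORDS = 60_000
READINGS_PER_WORD = 4

bytes_per_second = CODEC_BITS_PER_SECOND // 8 // OFFLINE_GAIN   # 250
bytes_per_word = bytes_per_second * SECONDS_PER_WORD             # 125.0
one_reading_total = WORDS * bytes_per_word                       # one recording per word
all_readings_total = one_reading_total * READINGS_PER_WORD       # four recordings per word

print(f"{bytes_per_word:.0f} bytes per word")
print(f"{one_reading_total / 1e6:.1f} MB for one recording per word")
print(f"{all_readings_total / 1e6:.1f} MB for four recordings per word")
```

One recording per word gives 7.5 MB; keeping all four stressed
readings of every word gives 30 MB.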

The computer would still have to handle words other than these 60,000
using a phonetic dictionary, and any word not in the reader's list or
in the phonetic dictionary would still have to fall back on the rules
dictionary, just as the current crop of programs does. However, we can
use a much bigger phonetic dictionary than today's programs, because we
can most likely write a program that can parse the phonetic spellings
from the Random House Unabridged CD-ROM or any other CD-ROM dictionary
(after paying the royalty, of course!)
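
The three-level fallback described above (recorded word, then phonetic
dictionary, then rules) might look something like this. Everything
here is hypothetical: the function names, the toy dictionaries, the
file name, and especially letter_rules, which just spells the word out
and stands in for a real letter-to-sound rules engine.

```python
def letter_rules(word):
    # Hypothetical stand-in for a rules dictionary: spells the word letter by letter.
    return " ".join(word.upper())

def pronounce(word, recorded, phonetic, rules=letter_rules):
    """Return (source, data) using the first source that knows the word."""
    key = word.lower()
    if key in recorded:          # 1. a human recording exists
        return ("recording", recorded[key])
    if key in phonetic:          # 2. a phonetic spelling from a dictionary
        return ("phonetic", phonetic[key])
    return ("rules", rules(key)) # 3. last resort: spelling-to-sound rules

# Toy data for illustration only.
recorded = {"table": "table.acelp"}
phonetic = {"occasional": "uh-KAY-zhuh-nl"}

for w in ["Table", "occasional", "zyx"]:
    print(w, pronounce(w, recorded, phonetic))
```

The order of the checks is the whole point: the better the source, the
earlier it is tried.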

Yes, grammatical analysis would be needed for stressing, and yes, it
would be a big problem. My guess is that the program would get it wrong
about half the time. It's amazing how often a radio announcer messes up
an advertisement when a word is used in an unusual way. For instance,
the emphasis should be different on the second word of

"We have an occasional problem" and "We have an occasional table"