Hasegawa-Johnson works to improve speech recognition
By Susan Kantor, ECE ILLINOIS
December 4, 2009
- ECE Prof. Mark Hasegawa-Johnson is working to improve a computer's ability to recognize human speech.
- He is working with prosody, the stress and rhythm patterns of spoken language.
- This research is related to a three-year project on speech recognition for people with cerebral palsy.
ECE Associate Professor Mark Allan Hasegawa-Johnson is working to put words into people’s mouths, with the help of a computer, of course. His research focuses on speech recognition, the ability of a computer to recognize spoken words and convert them into text.
“Speech recognition now works very well if you’re willing to wear a head-mounted microphone, and if you’re speaking in a dialect that the speech recognizer has been trained to recognize,” Hasegawa-Johnson said. “Otherwise, it doesn’t work.”
One of the areas Hasegawa-Johnson is investigating is the use of prosody, the stress and rhythm patterns of naturally spoken language. Speech recognizers typically model two levels of language structure: words, and phonemes, the individual consonants and vowels that make up words. Higher levels of structure, like syntactic phrases, and lower levels, like how the movements of the tongue and lips are planned, are not considered.
“What we’ve been doing is gradually adding some of the representations of some of those other levels of structure to the probabilistic models we use to build speech recognition,” Hasegawa-Johnson said.
Speech recognizers do not work as well when speech varies. Hasegawa-Johnson has been working to correctly recognize spontaneous speech despite the disfluencies (“ums,” “uhs,” and word fragments) that are a normal part of it. In a person’s speech planning process, related words are grouped together. The planning mechanism puts each group into a queue, and the words are spoken. While those words are being spoken, the next group should already be ready in the queue. When that doesn’t happen, speech becomes disfluent. Every truncated word causes two speech recognition errors.
“Most of what we do here is to build better probabilistic models of the ways in which the audio is related to the things the person was trying to say,” Hasegawa-Johnson said. “We try to use probability theory to describe all of the audio in the world.”
Graduate students use between 3,000 and 4,000 hours of recorded English speech to test the code they write. Hasegawa-Johnson and his research team are trying to learn how likely certain phonemes are to overlap in speech, such as when a “K” sound is softened when it is preceded by a vowel.
This research is related to a three-year project on speech recognition for people with cerebral palsy that Hasegawa-Johnson recently completed (and is hoping to continue). He hopes to develop ways for speech recognition to be used for accessing the Internet, writing documents, and finding work.
“A lot of people with speech disorders can be understood well by those who know them well, but someone walking in off the street can’t understand what they’re saying,” Hasegawa-Johnson said. “We can get over that by having them record some speech and then trying to model their speech as well as we can.”
Working with students both on campus and at other universities, Hasegawa-Johnson developed a speech recognition game in which people with cerebral palsy talk to the computer, and it tries to recognize what they’ve said. He is also trying to build a keyboard interface that can type documents using a limited vocabulary.
“It’s satisfying to do something that you can immediately see the effect,” Hasegawa-Johnson said. “We’re working with individual students here and elsewhere. It’s quite motivating to see someone trying to use a keyboard, and he just can’t use a keyboard, and then put a speech recognizer in front of him. Quite frankly, our speech recognizers are not good enough yet to replace a keyboard for him, but you see what that could mean to him.”