I’ve seen the stories (on boingboing, remember the news) about the Elmo kids’ book that has interactive audio and has been telling kids, "Who wants to die!" in an apparent prank by someone involved with making the book. I’ve also read the press release today by the publisher:
the track was recorded as ‘Uh oh, who has to go’ and due to compression of the digital audio file, some consumers hear a different phrase… We are absolutely certain that the audio file was not tampered with.
Covering their ass, I thought, until I heard the audio sample in this news video from KNDU, and now I believe that the publisher is correct. The Elmo sentence in question is an excellent example of the psychological principle of priming, whereby what you perceive can be affected by your expectations. Listen to the sample expecting to hear "Who wants to die," and that is exactly what you hear. However, listen expecting to hear "Who has to go," and the correct phrase becomes what you hear. Listen to the video a couple of times and force yourself to "expect" each of the two phrases, and many of you will in fact switch what you hear depending on your expectation.
Of course, the first person who misidentified the sample as "Who wants to die" wasn’t expecting to hear this frightening threat, so priming wasn’t the reason they had their misunderstanding (even though the sentence demonstrates priming very well). How did consumers hear this unintended death threat, then, if priming wasn’t the reason?
There are two main confusions with the sentence in question: "has" is confused with "wants", and "go" is confused with "die". As the publisher stated, the Elmo book certainly uses some severe compression scheme to reduce the bit rate needed to store the speech in the book; that’s obvious just by listening to it. This compression scheme distorts the speech (in addition to the distortion that comes from the annoying Elmo voice), adds a certain amount of noise, and reduces the speech bandwidth. Any of these could lead to confusions in the consonants and vowels perceived in the sentence. I decided to pull out some research papers on speech confusion and see if there’s an explanation for this mix-up.
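To make "adds noise and reduces the bandwidth" a bit more concrete, here is a rough sketch of those two kinds of degradation applied to a toy signal. Everything in it is my own assumption for illustration: the 8000 Hz sample rate, the three test tones, the 200-1200 Hz band, the 12 dB signal-to-noise ratio, and the helper names. It is not the codec used in the book, which I know nothing about beyond how it sounds.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (slow, but fine for a short demo signal)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def band_limit(x, sample_rate, low_hz, high_hz):
    """Zero every frequency bin that falls outside [low_hz, high_hz]."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bins above n/2 mirror the bins below it (negative frequencies).
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return idft(spectrum)


def add_noise(x, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power is about snr_db."""
    signal_power = sum(s * s for s in x) / len(x)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in x]


def tone_power(x, sample_rate, freq_hz):
    """Power in the single DFT bin nearest freq_hz (hypothetical inspection helper)."""
    n = len(x)
    k = round(freq_hz * n / sample_rate)
    coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
    return abs(coeff / n) ** 2


# Toy "speech": three unit-amplitude tones, one below, one inside, one above the band.
SAMPLE_RATE = 8000  # assumed for the demo
N = 400             # 20 Hz per DFT bin, so each tone sits exactly on a bin
clean = [sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in (100, 700, 3000))
         for t in range(N)]

limited = band_limit(clean, SAMPLE_RATE, 200, 1200)   # assumed band
noisy = add_noise(limited, 12, random.Random(0))      # assumed 12 dB SNR

for f in (100, 700, 3000):
    print(f, "Hz before:", round(tone_power(clean, SAMPLE_RATE, f), 3),
          "after band-limiting:", round(tone_power(limited, SAMPLE_RATE, f), 3))
```

Only the tone inside the band survives the filter, which is the point: any cue that lives mostly above or below the band is simply gone before the listener gets a chance to use it.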
Classic research by Miller and Nicely in 1955 looked at the impact of noise and bandwidth on consonant confusions. According to their research, for speech at a +12 dB signal-to-noise ratio and a bandwidth of 200-1200 Hz (probably not a bad approximation to the severe compression applied to the Elmo speech), the phoneme /g/ will be incorrectly identified as a /d/ as often as it is correctly identified as a /g/ (see the confusion matrix in the figure to the right, where the relevant data is highlighted in yellow). This begins to explain confusing "go" with "die": the word sounds like it starts with a /d/ instead of a /g/ due to the crappy compression system.
The vowel confusion is a little more difficult to explain, but I’ll try, assuming that the two vowels are represented by the diphthongs /OW/ and /AY/. The vowel sound in "go" has a similar first-formant time course to the vowel in "die" (according to Rabiner and Juang), so again a compression system that limits the bandwidth of speech might make the two vowel sounds more alike.
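If you haven't read a confusion matrix before, here is a small sketch of how one row turns into "heard as /d/ as often as /g/". The counts below are numbers I made up so the arithmetic is easy to follow; they are NOT Miller and Nicely's data, and the function names are just mine.

```python
# Hypothetical counts for illustration only; NOT Miller and Nicely's data.
# Outer key: the consonant that was spoken. Inner keys: what listeners reported.
confusions = {
    "g": {"g": 40, "d": 40, "b": 20},
    "t": {"t": 30, "p": 45, "k": 25},
}


def response_probabilities(row):
    """Turn one row of raw counts into the fraction of times each response was given."""
    total = sum(row.values())
    return {heard: count / total for heard, count in row.items()}


def most_likely_response(row):
    """Return the response listeners gave most often for this spoken consonant."""
    return max(row, key=row.get)


for spoken, row in confusions.items():
    probs = response_probabilities(row)
    print(f"spoken /{spoken}/:",
          ", ".join(f"heard /{h}/ {p:.0%}" for h, p in probs.items()))
```

Reading across a row tells you what a spoken consonant tends to turn into. When an off-diagonal entry is as big as (or bigger than) the diagonal one, the "wrong" consonant is at least as likely to be heard as the right one.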
So now I’ve explained from a scientific basis how Elmo’s "go" could be misinterpreted as "die".
A similar explanation can be made for the vowels in the confusion of "has" with "wants": both words have similar first formants. The consonant confusion with these words is more difficult to explain. Confusing /h/ with /w/ isn’t common according to research by Wang and Bilger in 1973 (Miller and Nicely’s paper did not look at these consonants). The /h/ is a voiceless fricative and the /w/ is a voiced glide, so the two are rarely confused. I suspect that the compression distortion obliterated the soft consonant /h/ and allowed the listener to imagine whatever consonant they wanted.
This opens a whole new line of work for linguists–alerting companies when their crappy compression systems may cause customers mental anguish (or worse if it’s in a car’s GPS system). You don’t need to mind your p’s and q’s but be careful because, according to Miller and Nicely under the noisy conditions I considered above, the phoneme /t/ is more likely to be heard incorrectly as a /p/ than correctly as a /t/. So, if you get your face slapped at a noisy bar asking a woman if she wants to see your cool trick, at least now you know why.