The human voice - computer interface

Text to Speech and Speech to Text

Dr. Dan Boye
Physics Department
Davidson College
daboye@davidson.edu

The voice–computer interface affords many opportunities for exploration because it is not yet a mature technology. Many issues still need to be solved before the interface works seamlessly. This website uses a series of exercises to investigate the successes and shortcomings of Text to Speech (TTS) and Automated Speech Recognition (ASR) routines. The exercises are designed to help non-science majors explore concepts in much the same way that a scientist does. This work is made possible through a 2006 Technology Fellowship from the Associated Colleges of the South.

Text to Speech (TTS)

Log on to a computer in Dana 126 using natreader as both the username and password.

The formant map presents the vowel loops, defined by the first and second formant frequencies, used in vowel recognition.

In Audacity, use the pull-down box next to the microphone icon to choose the recording source. Choosing "stereo mix" lets you record the sounds that are being played through your sound card, so you can record the voices that Natural Reader is using. (You may need to go to View/Toolbars and check the Mixer Toolbar.)

TTS Exercises

1. Use Audacity to record Microsoft Sam reading some text. You can use a webpage or a text/MS Word document. In Audacity, select the entire recording and plot the spectrum (Analyze/Plot Spectrum). What is the frequency range of this voice? What are the frequency ranges of two of the professional software voices (Paul, Kate, or More voices)?

2. Write out 10 vowel sounds from the formant map in MS Word. Record a human reading them, then have Sam read them in Natural Reader. In both cases, capture the sound with Audacity and export each recording as its own .wav file (two files in total). Analyze the formant structure of the different vowel sounds in SFSWIN (Tools/Speech/Analysis/Formant Estimate Track). Plot the positions of the first and second formants on the formant map for both the human and computer voices. How well do your data fit in the vowel loops? What observations can you make comparing the two voices?
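One way to put a number on the comparison in exercise 2 is to measure how far apart the two voices land on the formant map for each vowel. The sketch below is only an illustration, not part of the original exercise: the function name `formant_distances`, the vowel labels, and all of the frequency values are made-up placeholders, and you would substitute the first and second formant readings you obtained from SFSWIN.

```python
import math


def formant_distances(human, computer):
    """Return the distance in Hz on the F1/F2 map between two voices, per vowel.

    Each argument maps a vowel label to a (first formant, second formant)
    pair in Hz. Only vowels present in both dictionaries are compared.
    """
    distances = {}
    for vowel, (f1_h, f2_h) in human.items():
        if vowel in computer:
            f1_c, f2_c = computer[vowel]
            # Straight-line distance between the two points on the map.
            distances[vowel] = math.hypot(f1_h - f1_c, f2_h - f2_c)
    return distances


# Placeholder numbers for illustration only; replace with your own readings.
human_voice = {"vowel A": (300.0, 2300.0), "vowel B": (700.0, 1100.0)}
computer_voice = {"vowel A": (330.0, 2260.0), "vowel B": (700.0, 1100.0)}

for vowel, distance in formant_distances(human_voice, computer_voice).items():
    print(f"{vowel}: {distance:.0f} Hz apart")
```

A small distance means the two voices place that vowel at nearly the same spot on the map. The number does not say whether either point falls inside the vowel loop, so it complements the plot rather than replacing it.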

Speech to Text (ASR)

Lecture notes

AccessScience article on Speech Recognition and AccessScience block diagram of the speech recognition model (Figure 1).

ASR Exercises

Open Dragon NaturallySpeaking. Select Activate Later.

Follow the microphone setup and ASR training procedure.  Once you have trained the software to recognize your voice, continue with the following exercises.

1. Read and transcribe the "Fundamental Statements of Frequency Analysis".  Calculate the error rate.
2. Repeat the reading a second time and calculate the error rate.  How does it compare to your first reading?
3. Have another person read the same statements and calculate their error rate.  Comment on the differences in your two voices.
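
The exercises above do not say how the error rate should be calculated. One common choice, assumed here, is the word error rate: the number of substituted, deleted, and inserted words divided by the number of words in the original text. The sketch below is one possible way to compute it; the function name `word_error_rate` and the sample sentences are illustrative and do not come from the "Fundamental Statements of Frequency Analysis".

```python
def word_error_rate(reference, transcript):
    """Return (substitutions + deletions + insertions) / words in reference.

    Words are compared case-insensitively after splitting on whitespace.
    Punctuation is NOT removed, so "fox." and "fox" count as different words.
    """
    ref = reference.lower().split()
    hyp = transcript.lower().split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # prev[j] holds the fewest edits needed to turn the reference words
    # processed so far into the first j words of the transcript.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word inserted in transcript
                prev[j - 1] + cost,  # words match, or one replaced the other
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one wrong word plus one missing word.
rate = word_error_rate("the quick brown fox jumps", "the quick round fox")
print(f"Error rate: {rate:.0%}")  # prints "Error rate: 40%"
```

Because inserted words count as errors, this rate can exceed 100% when the software adds many words that were never spoken. If you use a different definition of error rate, use the same one for all three exercises so the results can be compared.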