The Human Voice–Computer Interface

Text to Speech and Speech to Text

Dr. Dan Boye
Physics Department
Davidson College
daboye@davidson.edu

The voice–computer interface affords many opportunities for exploration because it is not yet a mature technology. Many issues must still be solved before the interface works seamlessly. This website uses a series of exercises to investigate the successes and shortcomings of Text to Speech (TTS) and Automatic Speech Recognition (ASR) routines. The exercises are designed to help non-science majors explore concepts in much the same way that a scientist does. This work is made possible through a 2006 Technology Fellowship from the Associated Colleges of the South.

Text to Speech (TTS)

Download and install the latest version of the free tool from Natural Reader. This software reads highlighted text aloud and sends the output through the sound card to the computer speakers.

The formant map shows the vowel loops used in vowel recognition. Each loop is a region defined by the first and second formant frequencies.
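One simple way to place a measured pair of formants on such a map is to find the closest reference vowel. The sketch below is illustrative only. The reference values are approximate averages for adult male speakers as commonly quoted from Peterson and Barney (1952), keyed by an example word, and may not match the course formant map. The name nearest_vowel is hypothetical, and straight-line distance in Hz is a stand-in for checking which vowel loop a point falls in.

```python
import math

# Approximate average (F1, F2) in Hz for adult male speakers, keyed by an
# example word. Assumed illustrative values; the course map may differ.
REFERENCE_FORMANTS = {
    "heed": (270, 2290),
    "hid": (390, 1990),
    "head": (530, 1840),
    "had": (660, 1720),
    "hod": (730, 1090),
    "hawed": (570, 840),
    "hood": (440, 1020),
    "who'd": (300, 870),
    "hud": (640, 1190),
    "heard": (490, 1350),
}


def nearest_vowel(f1_hz, f2_hz):
    """Return the example word whose reference (F1, F2) is closest to the measurement."""
    def distance(word):
        ref_f1, ref_f2 = REFERENCE_FORMANTS[word]
        return math.hypot(f1_hz - ref_f1, f2_hz - ref_f2)

    return min(REFERENCE_FORMANTS, key=distance)


# Example: a measurement with a low F1 and a high F2.
print(nearest_vowel(280, 2250))
```

A measurement that lands far from every reference point suggests either a formant-tracking error or a vowel quality that differs from the reference speakers.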

Download the latest version of the free audio software Audacity. The pull-down box next to the microphone icon selects the recording source. Choosing "wave out mix" lets you record sounds that are being played through your sound card, so you can record the voices that Natural Reader uses.

TTS Exercises

1. Record the voices Michael, Michelle, and Sam reading the same text. Select the whole text and plot the spectrum. What are the frequency ranges for the different voices? What about the frequency range of the professional software voices?
2. Write out 10 vowel sounds of the formant map in MS Word. Record a human reading them and Natural Reader reading them. Capture the sound with Audacity and save it in .wav format. Analyze the formant structure of the different vowel sounds in SFSWIN. Plot the position of the first and second formants on the formant map. How well do your data fit within the vowel loops? What observations can you make comparing the two voices?
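The frequency range asked for in exercise 1 can also be estimated outside Audacity. The following is a minimal sketch and not Audacity's own algorithm. It assumes a 16-bit PCM .wav file, analyzes only the first power-of-two block of samples, and defines "frequency range" as the band that holds the central 90% of the spectral energy, which is one reasonable choice among several. All function names are hypothetical.

```python
import cmath
import math
import struct
import wave


def read_wav_mono(path):
    """Read a 16-bit PCM .wav file; return (samples, sample_rate) for the first channel."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    values = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return list(values[::channels]), rate


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def frequency_range(samples, sample_rate, energy_fraction=0.9):
    """Return (low_hz, high_hz) of the band holding the central share of spectral energy."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = 1
    while n * 2 <= len(samples):
        n *= 2  # largest power of two that fits in the recording
    # A Hann window reduces leakage from cutting the signal at arbitrary points.
    windowed = [
        samples[i] * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i in range(n)
    ]
    power = [abs(c) ** 2 for c in fft(windowed)[: n // 2]]
    tail = (1 - energy_fraction) / 2 * sum(power)

    low_bin, cumulative = 0, 0.0
    for i, p in enumerate(power):  # walk up from 0 Hz
        cumulative += p
        if cumulative >= tail:
            low_bin = i
            break
    high_bin, cumulative = len(power) - 1, 0.0
    for i in range(len(power) - 1, -1, -1):  # walk down from the top bin
        cumulative += power[i]
        if cumulative >= tail:
            high_bin = i
            break
    bin_hz = sample_rate / n
    return low_bin * bin_hz, high_bin * bin_hz


# Synthetic check: two equal tones at 200 Hz and 1000 Hz.
rate = 8000
tone = [
    math.sin(2 * math.pi * 200 * i / rate) + math.sin(2 * math.pi * 1000 * i / rate)
    for i in range(4096)
]
low, high = frequency_range(tone, rate)
print(f"{low:.0f} Hz to {high:.0f} Hz")
```

For a real recording, pass the result of read_wav_mono to frequency_range and compare the bands reported for each voice.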

Speech to Text (ASR)

Lecture notes

AccessScience article on Speech Recognition and AS block diagram of speech recognition model (Figure 1).

ASR Exercises

Open MS Word. From the Tools menu, select Speech.

(If Speech is not listed as an option, you will need to activate the Speech Tools in MS Office. The easiest way is to open the Help file in MS Word by pressing the F1 key. Type "speech recognition" into the search field and press Enter. Find the entry that says "use speech recognition" and follow the directions in that help document. You will need to be logged in as a computer administrator to do this.)

Follow the microphone setup and ASR training procedure.  Once you have trained the software to recognize your voice, continue with the following exercises.

1. Read the "Fundamental Statements of Frequency Analysis".  Calculate the error rate.
2. Repeat the reading and calculate the error rate again. How does it compare to your first reading?
3. Have another person read the same statements and calculate their error rate.  Comment on the differences in your two voices.
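The exercises above do not say how to compute the error rate. One common definition, assumed here, is the word error rate: the smallest number of word substitutions, deletions, and insertions needed to turn the recognized text into the text that was read, divided by the number of words that were read. The sketch below computes it with a standard edit-distance table. The function names and the example sentences are made up for illustration and are not taken from the course statements.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the number of reference words."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference text must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: what was read versus what the recognizer typed.
read_aloud = "The fundamental frequency sets the pitch."
recognized = "the fundamental frequency sets a pitch"
print(f"error rate: {word_error_rate(read_aloud, recognized):.1%}")
```

Because insertions count as errors, this rate can exceed 100% when the recognizer types many extra words. A simpler count of wrong words divided by total words is also acceptable if it is applied the same way to every reading.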

Using SFSWIN

Download the latest version of SFSWIN, a software package used in speech research.

Synthesize speech using SFS

Record a speech signal, then run:
1. Tools | Speech | Analysis | Fundamental Frequency | Fundamental Frequency Track
2. Tools | Speech | Analysis | Formant Estimates Track (select synthesizer control data output)
This should give you a basic set of data for formant synthesis. Then run:
3. Tools | Synthesis Data | Synthesize speech
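To see why a fundamental frequency track and a set of formant estimates are enough to synthesize speech, it can help to build a very small formant synthesizer by hand. The sketch below is not the SFS implementation. It assumes a steady vowel with a fixed pitch, uses a train of impulses as the voice source, and passes it through a cascade of second-order resonators of the kind used in Klatt-style synthesizers. The function names, the bandwidths, and the example formant values are illustrative assumptions.

```python
import math
import struct
import wave


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients (a, b, c) for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1 - b - c  # gives the resonator unit gain at 0 Hz
    return a, b, c


def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Return samples in [-1, 1] for a steady vowel.

    formants is a list of (frequency_hz, bandwidth_hz) pairs.
    """
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    n = int(duration_s * sample_rate)
    period = max(1, int(round(sample_rate / f0_hz)))
    # Voice source: one impulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Vocal tract: each resonator shapes one formant peak.
    for freq_hz, bandwidth_hz in formants:
        a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate)
        y1 = y2 = 0.0
        shaped = []
        for x in signal:
            y = a * x + b * y1 + c * y2
            shaped.append(y)
            y2, y1 = y1, y
        signal = shaped
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


def write_wav(path, samples, sample_rate=16000):
    """Save samples in [-1, 1] as a 16-bit mono .wav file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
```

For example, write_wav("vowel.wav", synthesize_vowel(120, [(730, 60), (1090, 90), (2440, 120)])) produces a buzzy, roughly "ah"-like sound that can be opened in Audacity or SFSWIN. The file name and the numbers are placeholders; substituting the formant frequencies measured from your own recording shows how much of the vowel identity those few numbers carry.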