Real time measurements of the vocal tract resonances during speech.

Julien Epps, Annette Dowd, John Smith and Joe Wolfe
School of Physics
The University of New South Wales
Sydney 2052 Australia

Published in: ESCA. Eurospeech97. Rhodes, Greece. ISSN 1018-4074, pp 721-724.

Abstract

The formants of speech sounds are usually attributed to resonances of the vocal tract. Formant frequencies are commonly estimated by inspection of spectrograms or by automated techniques such as linear prediction. In this paper we measure the frequencies of the first two resonances of the vocal tract directly, in real time, using acoustic impedance spectrometry. The vocal tract is excited by a carefully calibrated, broad band acoustic current signal applied outside the lips while the subject is speaking. The sound pressure response is analysed to give the resonant frequencies. We compare this new method (Real-time Acoustic Vocal tract Excitation or RAVE) with linear prediction and we report the vocal tract resonances for eleven vowels of Australian English. We also report preliminary results of using feedback from vocal tract excitation as a speech trainer, and its effect on improving the pronunciation of foreign vowel sounds by monolingual anglophones.


Introduction

In the source and filter model of voiced speech, the resonances of the vocal tract (the filter) modify the harmonic-rich signal from the vocal folds (the source). If the pitch of the source signal is sufficiently low, these resonances give rise to formants: frequency bands of increased power in the spectrum of the sound radiated from the lips. (The term formant is sometimes used to describe both the resonances and the local maxima in the sound spectrum. We measure the two independently, so we reserve the terms "resonance" for the former and "formant" for the latter. We refer to them as (R1,R2,...) and (F1,F2,...) respectively.)

Formants are commonly estimated by inspection of spectrograms, or by automated routines such as linear prediction (LP). There is an inherent limitation to the precision of such estimates: they necessarily include an error which cannot be very much less than the frequency at which the tract's response is sampled, i.e. the pitch frequency. This imprecision may not be an important problem for many applications, especially if the speaker has a low pitched voice with a fundamental pitch of, say, 100 Hz or less. It is more important for higher pitched voices. For the voices of children and some women, the pitch frequency may be so high that the formants are poorly defined (Clark and Yallop, 1990). This is particularly the case when the pitch frequency is of the order of, or greater than, the frequency of the first resonance (Sundberg, 1987).

The relatively poor precision in formant estimation obtained from excitation by the speech signal is usually not a problem when the signal is used for recognition by human listeners. Normal spoken and written languages have a high level of internal redundancy (Fletcher, 1992) and so, with appropriate semantic and linguistic context, high word recognition rates may be achieved with poor resolution of the vowel formants, or sometimes even in the [k-mpl-t -bs-ns -v v--lz]. If automated recognition systems are to identify phonemes successfully without contextual clues, greater precision in formant or resonance estimation would be useful.

Speech training is another application which requires precise measurements in real time. People with very poor or no hearing have difficulty learning accurate speech because they lack auditory feedback. Adults learning foreign languages rarely acquire good accents because their auditory feedback is complicated by categorisation and interference: they interpret a foreign phoneme in terms of one in their native language and then reproduce a sound more like that with which they are familiar. A real time system which measures accurately the relevant articulatory properties of the vocal tract can be used to give feedback which is not compromised by categorisation and interference (Dowd et al, 1996).

Other investigators have used various methods to determine vocal tract resonances with higher precision than that available from voiced speech excitation. Pham Thi Ngoc and Badin (1994) report measurements made by exciting the tract mechanically near the glottis. This technique is good for measuring the vocal tract transfer function, but is perhaps less suitable for application to speech training because it is a little invasive and because it is also possible that phonation during such stimulation is not the same as phonation in the absence of external mechanical excitation. The resonances may also be determined from the spectrum of whispered speech, but the signal is noisy and unpredictable, so time averaging is necessary for precise measurement (Pham Thi Ngoc, 1995; Dowd, 1995).

We describe a system for rapidly, precisely and non-invasively measuring the resonances of the vocal tract during phonation, its application in the measurement of resonances for 11 vowels in Australian English, and the preliminary results of using feedback from vocal tract excitation as a speech trainer.

Materials and Methods

Impedance spectrometer. The measurements use a development of an acoustic impedance spectrometer described by Wolfe et al (1995) and Wolfe and Smith (1995). A signal which comprises the sum of several hundred sine waves is synthesized by a computer, converted to an analogue signal, amplified and input to an enclosed loudspeaker connected to the large end of an exponential horn. Near the other end of the horn (the source) is a small microphone (Fig 1), and both are positioned just outside the subject's lips, with a hemi-cylindrical cowl touching the face just below the nose. The signal is calibrated with a reference acoustic load: in this case it is the free field near the subject's mouth, with the subject's face (mouth closed) and the cowl acting as baffles. During calibration, the amplitude of each of the sine waves is adjusted so that the resulting spectrum measured by the microphone with the subject's mouth closed is independent of frequency or "flat". When the subject opens his/her mouth to speak, the vocal tract is in parallel with the acoustic field and its resonances appear as maxima followed by steep falls in the broad band response, as shown in Figs 2c and 2d.

To determine these resonances from the total signal, the speech signal is removed. This is done by measuring its pitch (using a high order low pass or band pass filter and zero-crossings) and then removing integral multiples of this frequency (+/- 20 Hz) from the combined signal. These gaps are filled by interpolation. A routine then searches for the largest negative-going discontinuities between levels of adjacent frequency bands averaged over 25 Hz (for R1) and 100 Hz (for R2). The frequency spacing is about 5 Hz but, because of the calculations, each cycle takes a little longer than 200 ms and new values for R1, R2 and other data are displayed four times per second. Technical details and the performance of the method are described elsewhere (Epps et al, 1997).
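A minimal sketch of the extraction step just described is given below. This is not the authors' code: it assumes that the calibrated broad band response (sampled at the roughly 5 Hz spaced excitation frequencies) and a pitch estimate are already available, and the function name, search bands and decibel averaging are illustrative choices only.

    import numpy as np

    def extract_resonance(freqs, response_db, f0, band, avg_width):
        """Return the frequency (Hz) of the largest negative-going step within
        `band`, after removing +/- 20 Hz around each harmonic of f0."""
        freqs = np.asarray(freqs, dtype=float)
        level = np.asarray(response_db, dtype=float).copy()

        # Remove bins contaminated by the voice: integral multiples of f0, +/- 20 Hz.
        harmonics = np.arange(f0, freqs[-1] + f0, f0)
        contaminated = np.any(np.abs(freqs[:, None] - harmonics[None, :]) <= 20.0, axis=1)
        clean = ~contaminated

        # Fill the gaps by interpolation from the uncontaminated bins.
        level[contaminated] = np.interp(freqs[contaminated], freqs[clean], level[clean])

        # Average adjacent bins into bands (25 Hz wide for R1, 100 Hz wide for R2).
        lo, hi = band
        edges = np.arange(lo, hi + avg_width, avg_width)
        centres = 0.5 * (edges[:-1] + edges[1:])
        band_levels = np.array([level[(freqs >= a) & (freqs < b)].mean()
                                for a, b in zip(edges[:-1], edges[1:])])

        # The resonance appears as a maximum followed by a steep fall, so take
        # the largest negative-going discontinuity between adjacent bands.
        step = np.diff(band_levels)
        i = int(np.argmin(step))
        return 0.5 * (centres[i] + centres[i + 1])

    # Illustrative calls with assumed search bands:
    # R1 = extract_resonance(freqs, response_db, f0, band=(200, 1200), avg_width=25)
    # R2 = extract_resonance(freqs, response_db, f0, band=(800, 2100), avg_width=100)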

Australian vowels. The vocal tracts of 33 young Australian men (students at the University of New South Wales) were measured. The acoustic source and microphone were positioned outside their mouths, with the microphone about 10 mm and the source about 30 mm from their lips. An instruction sheet asked them to pronounce and to sustain the words "heed", "hid", "head", "had", "hard", "hot", "hoard", "hood", "who'd", "hut" and "heard". 15 measurements were taken for each vowel.

Comparison with Linear Prediction. To compare RAVE and the standard automated technique of linear prediction (LP) (Makhoul 1975), we measured the formant frequencies of one male and one female subject. Linear prediction is a spectral smoothing process which fits a curve to the envelope of the speech spectrum and reports peaks in the fitted curve. Subjects were requested to produce the same set of 11 Australian vowels, in their usual conversational voice, into a microphone. A 24th order LP model was fitted to their speech in real time, and the formant frequencies were estimated by calculating the frequencies of the two lowest-frequency poles of the model. 20 estimates of their first two formant frequencies were recorded for each vowel sound. The same two speakers were also requested to produce the same vowels while their vocal tract resonances were measured using RAVE, as described above.
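As an illustration of the comparison technique only (not the real-time system actually used here), the sketch below estimates formants by fitting an all-pole model to a single windowed frame and converting the angles of its lowest-frequency complex poles to frequencies. The pre-emphasis coefficient, the cut-off used to discard near-DC poles and the sample rate in the example call are assumptions.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lp_formants(frame, fs, order=24, n_formants=2):
        """Estimate the lowest formant frequencies (Hz) of one speech frame."""
        x = np.asarray(frame, dtype=float)
        x = np.append(x[0], x[1:] - 0.97 * x[:-1])     # pre-emphasis (coefficient assumed)
        x = x * np.hamming(len(x))                      # analysis window

        # Autocorrelation method: solve the Toeplitz normal equations for the
        # prediction coefficients a[1..order].
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        a = solve_toeplitz(r[:order], r[1:order + 1])

        # Roots of A(z) = 1 - a1 z^-1 - ... - ap z^-p; keep the upper half-plane
        # poles and convert their angles to frequencies in Hz.
        poles = np.roots(np.concatenate(([1.0], -a)))
        poles = poles[poles.imag > 0]
        freqs = np.sort(np.angle(poles) * fs / (2.0 * np.pi))
        freqs = freqs[freqs > 90.0]                     # discard near-DC poles (assumed cut-off)
        return freqs[:n_formants]

    # e.g. F1, F2 = lp_formants(frame, fs=16000) for a 20-30 ms frame; a real-time
    # system would apply this to successive overlapping frames.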

Speech trainer. The experiment using vocal tract feedback as a speech trainer used an earlier version of the spectrometer and software which did not allow measurements during phonation. For this study, subjects were therefore taught to raise the soft palate (velum) and to mime the production of a vowel. "Target" sounds were chosen from the French language, as spoken by 5 female native speakers of the language. The sounds were recorded on magnetic tape, and the frequencies of the first two resonances of the tract were measured. 11 monolingual Australian women volunteers were asked to imitate these target sounds. One group received only auditory feedback - they listened to the sounds and attempted to reproduce them. Another group used both auditory feedback and vocal tract feedback. The latter consisted of displaying to the subject the acoustic response of her own vocal tract, upon which were superimposed the values of the resonant frequencies of the target speaker. Tape recordings of subjects attempting to imitate the target sounds were distributed to a listening panel of native speakers who indicated which vowel they thought the subject had produced. The procedure is described in detail in a manuscript submitted elsewhere.

Fig. 1. A synthesized signal containing several hundred frequency components is input to the amplifier, speaker and impedance matching horn. The microphone signal includes both the subject's voice and the response of the vocal tract to the external signal. The monitor may be used to give the subject instructions for recording data, or to display visual feedback about his/her vocal tract.

Results

Examples of the operation of RAVE on a male and a female voice are shown in Fig. 2. (a) and (b) show the spectra of the voices alone, while (c) and (d) show the spectra measured with both phonation and external excitation. To compare RAVE with linear prediction (LP), 20 measurements were made of the resonances and formants for each of 11 vowels of Australian English as spoken by one male and one female speaker. Single-subject samples were used to minimise intra-sample variation. The first two resonances were measured using RAVE. The first two formants were measured using LP. In each case RAVE gave a smaller variation (Table 1).

Fig. 2 (a) is the magnitude spectrum for a speaker with pitch frequency 126 Hz. From this, one can estimate formants at about 0.5 and 1.4 kHz. (b) is the spectrum for the same vowel spoken by a speaker with pitch frequency 361 Hz. From this spectrum it is much more difficult to estimate formant frequencies. The RAVE technique uses a calibrated broad band source to excite the vocal tract from just outside the lips. In this example it contains 354 sine waves with frequencies equally spaced over the range 0.2 to 2.1 kHz. (c) and (d) show the spectra measured with the vocal tract excited by the broad band source during pronunciation of the same vowels. At a modest conversational level, the harmonics of the speaker's voice are seen above the broad band spectrum at frequencies less than about 1.6 kHz. The maxima followed by large negative slopes in the broad band response at 0.4 and 1.3 kHz (in c) and at 0.5 and 1.1 kHz (in d) are due to the resonances of the vocal tracts in the configuration for these vowels.
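For readers who wish to experiment, a minimal sketch of this kind of multi-sine excitation and "flat response" calibration follows. Only the 354 components spanning 0.2 to 2.1 kHz come from the text; the sample rate, random phases and proportional calibration update are assumptions, not the authors' routine.

    import numpy as np

    FS = 44100                                   # assumed sample rate (Hz)
    FREQS = np.linspace(200.0, 2100.0, 354)      # the 354 excitation frequencies (Hz)

    def synthesise(amps, phases, duration=0.2):
        """One segment of the broad band excitation: a sum of sines at FREQS."""
        t = np.arange(int(duration * FS)) / FS
        return (amps[:, None] * np.sin(2 * np.pi * FREQS[:, None] * t
                                       + phases[:, None])).sum(axis=0)

    def calibrate(amps, measured_levels, target=1.0):
        """Rescale each component so that the spectrum measured with the subject's
        mouth closed becomes flat (an illustrative proportional update)."""
        return amps * target / np.asarray(measured_levels, dtype=float)

    # e.g. amps = np.ones_like(FREQS); phases = 2 * np.pi * np.random.rand(len(FREQS))
    # signal = synthesise(amps, phases); amps = calibrate(amps, levels_mouth_closed)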

Table 1. The average standard deviations in the formants (measured using LP) and the vocal tract resonances (measured using RAVE) for 20 measurements each of 11 Australian vowels.

                  Male Voice                     Female Voice
         s.d.(F1,R1)   s.d.(F2,R2)       s.d.(F1,R1)   s.d.(F2,R2)
            (Hz)          (Hz)              (Hz)          (Hz)

LP           122          240                 206          498
RAVE          64           16                  91           31

Fig. 3. Measurements of the vocal tract resonances (R2,R1) of 33 young Australian men pronouncing 11 Australian vowels. The mean (R2,R1) is at the centre of each ellipse and the semi-axes are the standard deviations. The figure is a "screen dump" of a version of the RAVE technique used as a speech trainer, and for that reason the axes are also labelled "Position of tongue constriction" and "Jaw opening". These parameters were not measured directly, but they are correlated with R2 and R1. A cursor on the monitor (the cross at 0.52, 1.20 kHz) shows the current configuration of the user's own vocal tract in real time, and the ellipses are used as targets which change colour when "hit" by the cursor.

Fig. 3 shows the first two resonances of 11 vowels of Australian English as measured on 33 Australian men. This figure also shows the screen used when RAVE is applied to speech training. In this application, target areas are displayed on the screen, along with the current configuration (R1,R2) of the user's vocal tract, shown as a moving cursor. The axes are labelled as jaw opening and tongue position respectively as an initial aid to moving the cursor. After a little training, users can steer the cursor just by "thinking about where they want it to go": it is a little like a video game, but with mouth control rather than a joystick.
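The "hit" test implied by this display can be sketched as a simple normalised ellipse check: the cursor hits a target when it lies inside the ellipse whose centre is the mean (R2,R1) for that vowel and whose semi-axes are the standard deviations. The function name, the one-standard-deviation threshold and the example values (in kHz) are assumptions for illustration.

    def hits_target(r2, r1, mean_r2, mean_r1, sd_r2, sd_r1):
        """True if the cursor (R2, R1) lies inside the target ellipse."""
        return ((r2 - mean_r2) / sd_r2) ** 2 + ((r1 - mean_r1) / sd_r1) ** 2 <= 1.0

    # e.g. hits_target(1.20, 0.52, mean_r2=1.25, mean_r1=0.50, sd_r2=0.08, sd_r1=0.05)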

Fig. 4 shows preliminary results from a study to investigate the use of feedback about the vocal tract as a speech trainer for foreign language teaching. Six vowels from French were chosen as targets. These may be considered as three pairs of vowels which are often confused by non-native speakers. The vowels in "patte" and "pâte" are acoustically similar and are occasionally confused by native speakers. Those in "paix" and "thé" are also rather similar. Those in "poux" and "pu" are quite different acoustically. They are virtually never confused by native speakers of French, but relatively often confused by English speakers.

Fig. 4. Measurements of the first two vocal tract resonances by monolingual anglophone subjects attempting to produce six "target" vowels spoken by native French speakers. The head of each arrow is the target value, the tail is the average for the subjects and the semi-axes of the ellipses are the standard deviations among the subjects. (Short arrows are good imitations, small ellipses show little variability.) (a) shows the results for the subjects using auditory feedback only. (b) shows the results for subjects who had spent one hour learning vocal tract feedback and who were then given both auditory and vocal tract feedback to imitate the target vowels.

Fig. 4 shows that subjects who used vocal tract feedback as well as auditory feedback produced values of R1 and R2 which were very similar to those of the target native speakers. Tape recordings of the sounds made by the different groups were played to a listening panel of native speakers who were asked to identify the vowel sounds. The recognition rate was significantly higher for the group using both types of feedback.

Conclusion

Auditory feedback is the basis of standard language teaching: students hear a sound and attempt to imitate it, a method they have been using all their lives. Vocal tract feedback, on the other hand, is a new type of feedback and involves novel coordination between eye and articulation. Nevertheless, one to two hours of training with this feedback significantly improved the articulation and comprehensibility of our subjects. These preliminary results (to be reported in detail elsewhere) suggest that RAVE has considerable potential in language laboratories and in speech pathology.

Acknowledgements. We acknowledge support from the Australian Research Council and thank our volunteer subjects.

Patent. The technology is the subject of provisional patents and the authors would welcome enquiries from companies interested in manufacturing the device.

References

  • Clark, J. and Yallop, C. An Introduction to Phonetics and Phonology, (Blackwell, Oxford 1990).
  • Dowd, A. Real time non-invasive measurements of vocal tract impedance spectra and applications to speech training. Undergraduate thesis, Medical Physics, UNSW Sydney (1995).
  • Dowd, A.; Smith, J. and Wolfe, J. Real time, non-invasive measurements of vocal tract resonances: application to speech training. Acoustics Australia 24: 53-60 (1996).
  • Epps, J., Smith, J.R. and Wolfe, J. A novel instrument to measure acoustic resonances of the vocal tract during speech. Measurement Science and Technology 8: 1112-1121 (1997).
  • Fletcher, N.H. Acoustic Systems in Biology (Oxford, NY 1992).
  • Makhoul, J. Linear Prediction: A Tutorial Review. Proc. IEEE 63: 561-579 (1975).
  • Pham Thi Ngoc, Y. "Caracterisation acoustique du conduit vocal: fonctions de transfert acoustiques et sources de bruit" Doctoral thesis, Institut National Polytechnique de Grenoble (1995).
  • Pham Thi Ngoc, Y. and Badin, P. Vocal tract acoustic transfer function measurements: further developments and applications. J. de Physique IV C5: 549-552 (1994).
  • Sundberg, J. The Science of the Singing Voice, (Northern Illinois Univ. Press., De Kalb, Ill 1987).
  • Wolfe, J. and Smith, J. A comparison of acoustic impedances of flutes - a preliminary study. Intl. Symposium on Musical Acoustics, Dourdan, France. 100-106 (1995).
  • Wolfe, J., Smith, J., Brielbeck, G. and Stocker, F. A system for real time measurement of acoustic transfer functions. Acoustics Australia 23: 19-20 (1995).

 
