Post by Admin on Dec 16, 2019 18:36:06 GMT
VOWEL QUALITIES CAN CONTRAST WITHOUT A LOW LARYNX: FUNDAMENTALS OF SPEECH ACOUSTICS
In a recent handbook on animal bioacoustics, Fitch and Suthers (93) discuss the difficulties biologists encounter trying to adapt the principles and methods of speech research to study animal communication. As an advanced introduction to the topic and to subsequent discussion, we aim in this section to present certain fundamentals of speech production, VT modeling from birth to adulthood, and vowel system organization that we have found crucial for the analysis of primate vocalizations. Many of these ideas are detailed further in classic textbooks [e.g., (94, 95)].
The material we present also draws on concepts and methods developed over more than two decades of work by multiple overlapping international multidisciplinary teams (including researchers in phonetics and vowel universals, VT modeling, acoustic speech processing, anatomy, genetics, VT ontogeny, speech development, paleo- and physical anthropology, primatology, and cognition), linking a core group at GIPSA-lab in Grenoble, France, with researchers from many different laboratories. This multidisciplinary collaboration was necessary to reopen the doors to lines of inquiry and research on the emergence of speech that had been effectively barred by the consensus around LDT.
Source-filter theory
The acoustic spectrum and formant structure of the speech signal were made visible by the spectrograph (96) and then readable by the acoustic theory of speech production (97), the two combining to reveal the relationship between formants and certain key aspects of the VT configurations. This section explains the core of that theory and its instantiation in modern articulatory-acoustic research on speech.
Principles. Fant’s acoustic theory of speech production (97) is also known as the source-filter theory because it explains speech sounds as arising from glottal vibration as a source signal, which is then modified by the VT as a filter (with the simplifying assumption that minor interactions between the glottal source and the VT are discounted). The speech signal in typical vowels (i.e., voiced oral vowels) comes from the expiratory airflow that initiates or maintains periodic modulation by the aerodynamic effects of its passage through the glottis (the gap between the vocal folds). That source signal is a plane wave that moves from the glottis, through the VT, and out at the lips. It is spectrally simple, with the energy decreasing rapidly up the F0’s harmonics (successive multiples of F0, the vocal folds’ frequency of vibration). The VT transforms the acoustic spectrum supplied by the laryngeal source and redistributes its spectral energy by filtering it according to the resonance characteristics implied by the VT’s configuration.
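The source-filter decomposition described above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes an idealized source whose harmonics fall off at a fixed rate (here -12 dB per octave, an illustrative value) and a VT filter modeled as a cascade of second-order resonators. The function names, formant frequencies, and bandwidths are all hypothetical choices made for the example.

```python
import numpy as np

def harmonic_source(f0, n_harmonics, rolloff_db_per_octave=-12.0):
    """Harmonic frequencies (Hz) and levels (dB) of an idealized glottal source.

    The fixed roll-off per octave is an assumption made for illustration.
    """
    k = np.arange(1, n_harmonics + 1)
    freqs = k * f0                              # successive multiples of F0
    levels_db = rolloff_db_per_octave * np.log2(k)
    return freqs, levels_db

def vt_filter_db(freqs, formants, bandwidths):
    """Gain (dB) of a cascade of second-order resonators, one per formant."""
    s = 2j * np.pi * np.asarray(freqs, dtype=float)
    gain = np.ones_like(s)
    for fc, bw in zip(formants, bandwidths):
        pole = -np.pi * bw + 2j * np.pi * fc
        # Each resonator is normalized to unit gain at 0 Hz.
        gain *= (pole * np.conj(pole)) / ((s - pole) * (s - np.conj(pole)))
    return 20 * np.log10(np.abs(gain))

if __name__ == "__main__":
    # Hypothetical neutral-vowel values, chosen only for the demonstration.
    freqs, source_db = harmonic_source(f0=100.0, n_harmonics=30)
    filter_db = vt_filter_db(freqs, [500.0, 1500.0, 2500.0], [60.0, 90.0, 120.0])
    radiated_db = source_db + filter_db         # filtering is additive in dB
    for f, level in zip(freqs[:8], radiated_db[:8]):
        print(f"{f:6.0f} Hz  {level:6.1f} dB")
```

In this sketch the source contributes only the falling harmonic levels, while the peaks in the radiated spectrum come entirely from the filter, which is the separation the theory relies on.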
Acoustic theory allows determination of the VT’s filter characteristics through several analytical steps. First, because the VT is a bent nonuniform tube, the VTL must be evaluated in 2D along the VT’s median line in the midsagittal plane, from the glottis to the lips. MRI [e.g., (98)] has replaced lateral radiography and tomography [e.g., (97)] for this step. Then, the transverse cross-sectional area, the third dimension of the VT, is sampled in planes perpendicular to and along that median line. These sampled areas are reconstituted as “stacked” cylinders aligned axially on their centers, resulting in a straight tube with variable sections [the acoustic effects of straightening are negligible; (99)], which is acoustically equivalent to the VT. Plotting those areas gives the area function, specifying the cross-sectional area of each cylindrical element as a function of its distance from the glottis.
Fig. 1 Source-filter theory.
Interpreting vowel spectra as the result of the transformation, by the VT, of the glottal source. (A) Source. Top: view from above on the vocal folds; bottom: spectrum of the glottal source signal, with high amplitude in low frequencies. (B) Filter. Top: sagittal section of the VT, with the median line (dotted) where VTL is calculated; middle: VT’s area function; bottom: VT’s acoustic transfer function; black: calculated from the area function; red: extracted by LPC analysis from speech signal. (C) Radiated sound. Top: classic spectrogram of a synthesized vowel, showing formants and their peak frequencies over time [here calculated by Praat (171)]; bottom: vowel’s acoustic spectrum, whose amplitude envelope (dotted line) is imposed on the source signal by the VT transfer function.
From that area function—which is an acoustically sufficient 3D representation of the VT—Fant’s theory enables calculation of the acoustic transfer function, which quantifies the filter that the VT applies to the glottal source. The amplified regions in the resulting spectrum are termed formants, labeled Fn and numbered from low to high frequencies (F1 the lowest, F2 next, etc.). They are the key to contrasting vowel qualities in human speech (specifically, the first three formants, F1-F3, characterize the contrasting vowels of human speech, acoustically for researchers and perceptually for listeners). Formants can also be calculated indirectly from a recorded speech signal, either by visual analysis of a classic spectrogram or through digital processing using fast Fourier transform or, as we discuss below, LPC analysis. The analytical keys to the source-filter theory are illustrated in Fig. 1 (presented right to left, in keeping with the century-plus convention in phonetics of presenting the sagittal section from its left, a convention we follow throughout this paper).
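To make the step from area function to formants concrete, here is a simplified sketch, not the computation used in the paper. It treats the stacked cylinders as a lossless tube with rigid walls, a closed glottis, and zero acoustic pressure at the lips (no radiation impedance), and multiplies one standard transmission-line (chain) matrix per cylinder. Under those assumptions the formants are the frequencies at which the D element of the overall matrix crosses zero. The constants and function names are assumptions introduced for the example.

```python
import numpy as np

SPEED_OF_SOUND = 350.0  # m/s, assumed value for warm, humid air
AIR_DENSITY = 1.14      # kg/m^3, assumed value

def chain_matrix_d(freqs, areas, section_length):
    """D element of the glottis-to-lips chain matrix of a lossless tube.

    areas: cross-sectional areas (m^2), ordered from glottis to lips.
    section_length: length (m) of every cylindrical section.
    """
    k = 2 * np.pi * np.asarray(freqs, dtype=float) / SPEED_OF_SOUND
    n = k.size
    total = np.tile(np.eye(2, dtype=complex), (n, 1, 1))
    cos_kl = np.cos(k * section_length)
    sin_kl = np.sin(k * section_length)
    for area in areas:
        z = AIR_DENSITY * SPEED_OF_SOUND / area   # characteristic impedance
        m = np.empty((n, 2, 2), dtype=complex)
        m[:, 0, 0] = cos_kl
        m[:, 0, 1] = 1j * z * sin_kl
        m[:, 1, 0] = 1j * sin_kl / z
        m[:, 1, 1] = cos_kl
        total = total @ m                          # one matrix per frequency
    # For a lossless tube D is purely real; U_lips / U_glottis = 1 / D.
    return total[:, 1, 1].real

def formants_from_area_function(areas, section_length, f_max=5000.0, step=1.0):
    """Resonance frequencies (Hz) below f_max, found as zero crossings of D."""
    freqs = np.arange(step, f_max, step)
    d = chain_matrix_d(freqs, areas, section_length)
    idx = np.nonzero(np.signbit(d[:-1]) != np.signbit(d[1:]))[0]
    f_lo, f_hi = freqs[idx], freqs[idx + 1]
    d_lo, d_hi = d[idx], d[idx + 1]
    # Linear interpolation between the two grid points around each crossing.
    return f_lo - d_lo * (f_hi - f_lo) / (d_hi - d_lo)

if __name__ == "__main__":
    # Illustrative uniform tube: 35 sections of 0.5 cm (17.5 cm), 4 cm^2 each.
    uniform = np.full(35, 4e-4)
    print(formants_from_area_function(uniform, 0.005))
```

For the uniform tube in the usage example, the resonances fall close to the quarter-wavelength values (2n − 1)c/4L, that is, near 500, 1500, and 2500 Hz for a 17.5-cm tube at the assumed sound speed. A measured area function would simply replace the constant array, although a realistic model would also need wall losses and lip radiation.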
Source-filter theory has been widely adopted for research on mammal vocalizations and, more generally, for bioacoustics (40, 93, 100, 101). Because the spectral energy in human speech rolls off sharply as frequency rises (as noted above), detection of F4 is uncertain and that of higher formants is rare. In other primates, though, researchers typically detect F6 (43) and higher, sometimes up to F12 (102). This is possible because the glottal source signal in primate vocalization is regularly very rich up to the high spectral frequencies.
Formant extraction by LPC. LPC was developed in the late 1960s [see (37) for historical perspectives] and became common in speech research during the 1970s to allow estimation of the VT transfer function directly from the speech signal. While the mathematical details and LPC’s specific strengths, difficulties, and limits are beyond our scope here [see (103, 104), or (105) for different treatments of those aspects], LPC exploits the redundancy in the signal by using a sequence of samples of the digitized signal to predict the subsequent samples. This method has the advantage of separately calculating the rapid changes attributable to the laryngeal source from the slower changes in resonance patterns (i.e., formants) due to movement of the VT. This makes it essentially parallel to the source-filter theory (97) in separating the effects of the glottal source from the VT as a resonating filter and gives it its power as a tool to isolate and specify formant values. Note, however, that to use LPC successfully, the user must specify the number of formants expected in the signal. Years of experience with LPC have taught researchers to manage these settings for the different talkers and various vowels of human speech, but the settings are difficult to determine for primate vocalizations, and we are just beginning the long effort to master them (35). It is perhaps for this reason that, as we will see later, animal communication research did not incorporate LPC analysis until the late 1980s, about two decades after its appearance in speech research.
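The principle of LPC formant estimation can be sketched as follows. This is a minimal illustration of the autocorrelation method, not the procedure used by the authors or by any particular software package. It predicts each sample from the preceding ones, then reads formant candidates from the roots of the prediction polynomial. The rule of thumb used to set the order (twice the expected number of formants, plus two), the thresholds used to discard implausible roots, and all function names are assumptions made for the example. A small synthesizer is included only so the sketch can be run on a signal with known resonances.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC; returns [1, a1, ..., a_order]."""
    windowed = frame * np.hamming(len(frame))
    full = np.correlate(windowed, windowed, mode="full")
    n = len(windowed)
    r = full[n - 1 : n + order]                  # lags 0 .. order
    # Toeplitz normal equations: sum_k a_k r[|i-k|] = -r[i]
    big_r = np.array([[r[abs(i - j)] for j in range(order)]
                      for i in range(order)])
    a = np.linalg.solve(big_r, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def lpc_formants(frame, sample_rate, n_formants):
    """Formant candidates (Hz) from the roots of the LPC polynomial."""
    order = 2 * n_formants + 2                   # assumed rule of thumb
    roots = np.roots(lpc_coefficients(frame, order))
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * sample_rate / (2 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * sample_rate / np.pi
    keep = (freqs > 90.0) & (bandwidths < 400.0) # assumed plausibility limits
    return np.sort(freqs[keep])

def synthesize_vowel(formants, bandwidths, f0, sample_rate, duration):
    """Impulse train passed through cascaded two-pole resonators (test signal)."""
    n = int(duration * sample_rate)
    signal = np.zeros(n)
    signal[:: int(round(sample_rate / f0))] = 1.0
    for fc, bw in zip(formants, bandwidths):
        radius = np.exp(-np.pi * bw / sample_rate)
        a1 = -2 * radius * np.cos(2 * np.pi * fc / sample_rate)
        a2 = radius * radius
        out = np.zeros(n)
        y1 = y2 = 0.0
        for i in range(n):
            y = signal[i] - a1 * y1 - a2 * y2
            out[i] = y
            y2, y1 = y1, y
        signal = out
    return signal

if __name__ == "__main__":
    fs = 10000
    vowel = synthesize_vowel([500.0, 1500.0, 2500.0], [60.0, 90.0, 120.0],
                             f0=100.0, sample_rate=fs, duration=0.2)
    print(lpc_formants(vowel[500:1012], fs, n_formants=3))
```

The dependence on `n_formants` is the practical difficulty noted above: choosing too low an order merges neighboring resonances, while too high an order introduces spurious peaks, and suitable values for nonhuman vocal tracts are not known in advance.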