The ICP Virtual Talking Head
The ICP Virtual Talking Head is an anthropomorphic model that coherently integrates several aspects of speech production. It is partially based on data gathered from one reference subject. Its core consists of an articulatory model coupled with a vocal tract aerodynamic-acoustic model that includes a voice source. The head is called virtual because it can display articulators that are usually hidden, such as the tongue or the velum, a capability referred to here as augmented reality.
The X-ray database for a reference subject
To build an articulatory model that mimics the specific male reference speaker, a database of midsagittal profiles of the speech articulators (jaw, tongue, lips, etc.) was collected by cineradiography. Front views of the lips (painted blue to facilitate contour detection) and the speech sounds were recorded synchronously. The speaker uttered French vowels and /VCV/ sequences containing fricative and plosive consonants.
The articulatory model
The articulatory model aims at simulating the various articulators used in speech. It is based on a factor analysis of articulatory measurements performed on the contours of the articulators extracted from the cineradiographic images. The movement of the jaw, the rigid articulator that carries the tongue, is controlled in this model by one parameter termed jaw height. The shape of the tongue is then mainly specified by three parameters that express the degrees of freedom of tongue movements: tongue body, tongue dorsum and tongue tip. The lips are principally controlled by two parameters: lip height and lip protrusion.
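The factor-analysis description above suggests a linear model in which an articulator contour is a mean shape plus a weighted sum of factor loadings. The following is a minimal sketch under that assumption, not the ICP implementation: the parameter names follow the text, but the function names, the contour size and the random placeholder values for the mean contour and loadings are all hypothetical. A real model would estimate them from the cineradiographic contours.

```python
import numpy as np

# Control parameters named in the text (identifiers are illustrative).
PARAMS = ["jaw_height", "tongue_body", "tongue_dorsum", "tongue_tip",
          "lip_height", "lip_protrusion"]

def make_toy_model(n_points=20, seed=0):
    """Return (mean_contour, loadings) filled with placeholder values.

    These numbers are NOT measured data; they only give the arrays
    the right shapes for the demonstration.
    """
    rng = np.random.default_rng(seed)
    mean_contour = rng.normal(size=2 * n_points)  # stacked x/y coordinates
    loadings = rng.normal(size=(2 * n_points, len(PARAMS)))
    return mean_contour, loadings

def synthesize_contour(params, mean_contour, loadings):
    """Assumed linear factor model: contour = mean + loadings @ params."""
    # Parameters that are not supplied default to 0 (the mean position).
    p = np.array([params.get(name, 0.0) for name in PARAMS])
    return mean_contour + loadings @ p

mean_contour, loadings = make_toy_model()
neutral = synthesize_contour({}, mean_contour, loadings)
lowered = synthesize_contour({"jaw_height": -1.0}, mean_contour, loadings)
```

With all parameters at zero the sketch returns the mean contour; changing a single parameter moves every contour point along that parameter's loading vector, which is how one value such as jaw height can displace several articulators at once.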
The aerodynamic and acoustic models
Once the position of the articulators, and thus the shape of the instrument (the vocal tract), is defined, air has to be blown through the tube in order to produce acoustic sources. For vowels and other voiced sounds, the voice source is controlled by the pressure in the lungs and the distance between the vocal folds. For consonants such as fricatives, a noise source is generated in the vicinity of the constriction or of the incisors.
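To illustrate how the shape of the tube determines the acoustics, the sketch below uses the textbook idealization of a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of the quarter-wavelength frequency. This is a deliberate simplification and not the aerodynamic-acoustic model of the Virtual Talking Head; the function name, the 17.5 cm tract length and the 350 m/s speed of sound are assumptions chosen for the example.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end.

    F_n = (2n - 1) * c / (4 * L), for n = 1, 2, ...
    """
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, n_formants + 1)]

# Assumed vocal tract length of 17.5 cm (illustrative value).
formants = uniform_tube_formants(0.175)
```

For this assumed length the resonances come out at roughly 500, 1500 and 2500 Hz. Moving the articulators changes the tube's cross-sectional area along its length, which shifts these resonances and distinguishes one vowel from another.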
Learning articulatory movements from sounds
The Virtual Talking Head thus constituted can produce speech sounds, provided that appropriate temporal control is exerted over the articulator movements.
A basic method for determining the articulatory movements needed to produce synthetic speech is to derive them from an analysis of the original speech sound produced by the reference speaker. An inversion procedure recovers the articulator movements from the acoustic characteristics of the sound, thereby mimicking the original talker. The results can be illustrated with the vowel-fricative-vowel sequence [azha]. First, the sequence uttered by the subject can be heard. Second, the sequence resynthesized with the control parameters determined by inversion is played. The third item is similar to the second, but it is produced with the vocal folds opened wide during the consonant in order to prevent them from vibrating: this results in the voiceless consonant [sh], the counterpart of the voiced consonant [zh]. Other sequences are also available ([ava], [aza], [azi], [azu]).
Finally, more complex items can be produced, such as the French sentence "Sophie, je suis fâché ! Vous savez ?".
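The inversion idea described above can be sketched as a frame-by-frame search for articulatory parameters whose predicted acoustics match the recorded ones. The example below is only a toy under stated assumptions: the real articulatory-to-acoustic map is nonlinear and would call for iterative optimization, whereas here a small hypothetical linear matrix (FORWARD) stands in for it so that each frame has a closed-form regularized least-squares solution. The smoothness term, which keeps each frame close to the previous one, is likewise an assumption and not a documented part of the ICP procedure.

```python
import numpy as np

# Hypothetical linear stand-in for the articulatory-to-acoustic map:
# 6 articulatory parameters -> 3 acoustic features. Not measured data.
FORWARD = np.array([
    [1.0, 0.0, 0.0, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.5],
])

def invert_frame(target, previous, forward=FORWARD, smoothness=1e-3):
    """Minimize ||forward @ p - target||^2 + smoothness * ||p - previous||^2.

    Setting the gradient to zero gives the linear system solved below.
    """
    n_params = forward.shape[1]
    lhs = forward.T @ forward + smoothness * np.eye(n_params)
    rhs = forward.T @ target + smoothness * previous
    return np.linalg.solve(lhs, rhs)

def invert_sequence(targets, forward=FORWARD, smoothness=1e-3):
    """Invert a sequence of acoustic frames, starting from a neutral pose."""
    previous = np.zeros(forward.shape[1])
    trajectory = []
    for target in targets:
        previous = invert_frame(target, previous, forward, smoothness)
        trajectory.append(previous)
    return np.array(trajectory)

# Illustrative acoustic targets (arbitrary normalized units).
targets = np.array([
    [0.0, 0.0, 0.0],
    [0.2, 0.1, 0.0],
    [0.4, 0.3, 0.1],
    [0.5, 0.4, 0.3],
])
trajectory = invert_sequence(targets)
resynthesized = trajectory @ FORWARD.T  # acoustics predicted from the result
```

Because six parameters map onto only three acoustic features, many articulatory configurations produce the same sound. The smoothness term resolves this ambiguity in the sketch by preferring the solution closest to the preceding frame, which keeps the recovered trajectory continuous over time.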
This demonstration broadly illustrates the data-driven approach followed by ICP in the study of speech production phenomena. In addition to the knowledge gained about speech, this type of study allows the development of applications such as audiovisual text-to-speech synthesis. The possibility of augmented reality also opens the way to developments in the area of foreign language pronunciation learning, or speech rehabilitation.