A little story of speech communication
Told by the researchers of the Speech Communication Institute in Grenoble, France
Institut de la Communication Parlée - CNRS - INPG - Université Stendhal
(http://www.icp.inpg.fr)
Speech is a basic and unique property of human beings, and is our principal means of communication. It is therefore a fascinating area for research and technology, involving a large number of interdisciplinary collaborations between engineering, human and life sciences.
First and foremost, speech begins with an aero-acoustical instrument, the vocal tract, which is a physical system that must be studied and modelled as precisely as possible, using the experimental and theoretical tools provided by acoustics and fluid mechanics. This system converts the quasi-static pressure provided by the lungs into sound, making use of various kinds of acoustic sources: the voice source, which is due to the self-sustained oscillation of the vocal folds, whose tension controls the pitch of the voice; bursts, at the release following a plosive occlusion; and noise sources due to turbulence created by obstacles to the air flow, for example when the tongue comes close to the palate to utter "fricative" consonants such as "s" or "sh". The vocal instrument can be played by modifying its timbre, through the control of the resonances of the vocal tract. This is done by continuously displacing the lips, jaw, tongue and velum: "keys" that are called the speech "articulators". The temporal variations of the acoustic timbre produce in the listener's brain the basic auditory correlates of phonetic contrasts between the phonemes of a given language (typically, about 20 to 40 phonemes in each language).
Vocal tract acoustic modelling is based on models that are checked against measurements. The figure shows the acoustic pressure field radiated at the lips: a comparison between theoretical prediction (taking into account the higher order modes), numerical simulation (Transmission Line Matrix method), and measurements on models.
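The link between vocal tract shape and resonances can be illustrated with the simplest textbook approximation: a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of a quarter wavelength. The sketch below is only an illustration under that assumption; the function name `tube_resonances`, the tube lengths, and the speed of sound of 350 m/s are choices made here and are not taken from the ICP models described above.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength formula: F_k = (2k - 1) * c / (4 * L).
    This ignores the real, non-uniform shape of the vocal tract.
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_resonances + 1)]


# Hypothetical tube lengths, chosen only to show the trend:
# a shorter tube has higher resonances.
for label, length in [("0.175 m tube", 0.175), ("0.100 m tube", 0.100)]:
    print(label, [round(f) for f in tube_resonances(length)])
```

Moving the articulators amounts to departing from this uniform tube, which shifts each resonance up or down and so changes the timbre.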
When the speech researcher has this instrument at his disposal, through computer models, he must learn "how to play it", just as the child has to learn to speak. This is what the infant attempts when he produces his first tentative vocalizations at 7 months, in babbling; and then, during the first years of life, when he systematically explores the possibilities of his instrument, in order to play it with more and more skill and efficiency. The researcher attempts to study the moving vocal tract in various experimental paradigms, from baby to adult and male to female speech, involving clear or perturbed speech, quiet or noisy environments, slow or fast speaking rates, etc. He looks for as many kinds of experimental data as possible: acoustic, aerodynamic, cineradiographic, or articulatory, involving electromagnetic sensors or 2D- or 3D-imaging techniques. He also takes advantage of new brain imaging techniques to determine what could be the neural circuits of speech in the human brain. From these data, speech labs are able to develop biomechanical models of the lips, jaw and tongue, and finally to design "Virtual Talking Heads". Such anthropomorphic agents can then be controlled by algorithms inspired by cognitive principles, and are technically similar to various robotic systems: the basic ingredients consist of defining the acoustical or auditory task, elaborating inversion procedures to infer input motor commands from output goals, implementing motor learning and generalisation procedures to be able to "play" any phonetic chain and translate it into a sequence of motor gestures, and finally uttering suitable sounds. The "speech robotics" framework provides the theoretical basis for "articulatory synthesis", which can then be assessed with respect to both its perceptual quality and its resemblance to human performance, from speech acquisition to adult production.
The results of this work are also of great use in the field of speech therapy and rehabilitation for speech disorders caused by disease or surgery.
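The "inversion" step mentioned above, inferring a motor command from an acoustic goal, can be sketched on a deliberately tiny example. The code below assumes a one-parameter forward model (the first resonance of a uniform closed-open tube as a function of its length) and inverts it by bisection. The names `forward_model` and `invert`, the search bounds, and the choice of bisection are all assumptions made for illustration; real articulatory inversion involves many parameters and is not described in detail in this text.

```python
def forward_model(length_m, speed_of_sound=350.0):
    """Toy forward model: first resonance (Hz) of a uniform closed-open tube."""
    return speed_of_sound / (4.0 * length_m)


def invert(target_hz, lo=0.05, hi=0.30, tol=1e-6):
    """Find the tube length (m) whose first resonance matches target_hz.

    Bisection works here because the forward model decreases
    monotonically as the tube gets longer.
    """
    if not (forward_model(hi) <= target_hz <= forward_model(lo)):
        raise ValueError("target outside the range reachable by the model")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if forward_model(mid) > target_hz:
            lo = mid  # resonance too high: the tube must be longer
        else:
            hi = mid  # resonance too low (or equal): the tube must be shorter
    return 0.5 * (lo + hi)


length = invert(500.0)
print(f"length for a 500 Hz first resonance: {length:.4f} m")
```

The same pattern (a forward model plus a search over its inputs) carries over to richer models, although a multi-parameter model generally has many commands that reach the same acoustic goal, so extra constraints are needed to pick one.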
One variant of the ICP Talking Heads. Biomechanical models of jaw and tongue, geometrical lip model, and face with transparent skin.
This kind of anthropomorphic speech agent can be integrated into (tele)communication systems for man-man or man-machine dialog, in order to exploit two major properties of such talking faces. Firstly, these are intrinsically multimodal agents, visible as well as audible. It is a well-known fact that speech can be "read from the lips", not only by hearing-impaired people, but also by persons with normal hearing. Lip reading provides very significant intelligibility gains and increased communication efficiency both in noisy environments and in complex linguistic tasks: just think of the difficulty of understanding a foreign language on the telephone or on the radio, when you cannot read the speaker's lips. Multimodality is increasingly acknowledged as a major characteristic of speech communication, both for audiovisual speech synthesis - talking faces uttering any message from a printed orthographic sequence, thus producing coherent auditory and visual outputs - and for the design of systems achieving automatic recognition of audiovisual speech. This leads to significant improvements of vocal interfaces for man-machine dialog in noisy environments. Secondly, these talking heads are not simple artefacts reproducing speech sounds and possibly images, but are true coherent physical models. Such models can be inserted into multimedia telecommunication systems (video telephony), by distant animation of a virtual agent driven by your own movements, reliably reproducing on-line the lip gestures of a speaker in a dialog: this is the topic of the "Labiophone" project of the ELESA Federation, which groups together in Grenoble various partners working in the fields of speech and signal processing, automation, and electronics.
In the Labiophone project, lip movements are captured on line from the movements of a given speaker thanks to a series of image processing algorithms and speech production models, and these movements enable the face of a distant agent to be driven within a telecommunication system. Talking heads can also be used in language learning systems, in which all speech gestures (including lip, jaw and tongue movements) can be made visible for learning the phonemes of a foreign language: an application of great potential use in our multi-lingual Europe!
The "Labiophone" (ICP - ELESA - LIS - Ganymédia): 3D model for lip tracking.
Lip movements are estimated by projecting a speaker-dependent 3D model of the lips onto the video image. The articulatory parameters of the 3D model are adjusted so that the pixels inside the lips have the appropriate color statistics. Such an analysis-synthesis technique provides the optimal 3D shape, which can then be viewed from a different angle. The figure shows the profile (on the left) estimated from the face image (on the right).
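The analysis-synthesis idea can be illustrated on a toy problem: synthesize a lip region from a model parameter, compare the color statistics inside and outside that region, and keep the parameter that fits best. The sketch below is far simpler than the tracker described above. It assumes a single row of pixel "redness" values, a lip model with one parameter (the half-width of a centred interval), and an exhaustive search. All names (`synth_mask`, `score`, `fit_half_width`) and the pixel values are hypothetical.

```python
def synth_mask(half_width, size):
    """Synthesis: project the one-parameter 'lip model' onto a row of pixels."""
    centre = size // 2
    return [abs(i - centre) <= half_width for i in range(size)]


def score(image, mask):
    """Color statistic: mean redness inside the model minus mean outside."""
    inside = [p for p, m in zip(image, mask) if m]
    outside = [p for p, m in zip(image, mask) if not m]
    if not inside or not outside:
        return float("-inf")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


def fit_half_width(image):
    """Analysis: keep the parameter whose synthesized mask scores best."""
    size = len(image)
    return max(range(size // 2), key=lambda w: score(image, synth_mask(w, size)))


# Hypothetical image row: red lips (0.9) spanning 4 pixels either side
# of the centre, on a paler skin background (0.2).
image = [0.9 if abs(i - 10) <= 4 else 0.2 for i in range(21)]
print("estimated half-width:", fit_half_width(image))
```

A real tracker would replace the interval with the projected 3D lip model and the exhaustive search with an optimization over several articulatory parameters, but the synthesize-compare-adjust loop is the same.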
Once these lip and face movements have been captured and studied, they can be predicted from the phonetic content. ICP has recently built a Text-to-audiovisual Synthesis System, which also includes the prediction of the hand gestures of so-called manual cued speech. This virtual speaker could be of great help for deaf people!
Finally, once this instrument and the relevant skills needed to play it are available as computer software facilities, the speech researcher can begin to ask questions related to human language. The challenge is to determine how language could have "emerged" from pre-linguistic precursors, and what kinds of constraints on its emergence could be inferred from the properties of the physical, perceptual, and production systems necessary for its existence. In this vein, for example, there has been a theoretical debate on the possibility that the specific position of the larynx (low in Homo sapiens, but suggested to have been higher in Neanderthal man) would have been a pre-requisite for human language. However, we have shown in our laboratory, in collaboration with specialists in acoustics on the one hand and anthropology on the other, that this hypothesis is wrong: in fact it is quite likely that the larynx was as low in the Neanderthal as in Homo sapiens, and in any case the emergence of speech was a question of motor skill rather than vocal tract ability. Another area concerns the development of artificial life models of sensory-motor interactions to predict what kinds of vowels, consonants and syllables would emerge if a set of speech agents attempted to communicate with maximal efficiency, that is, by minimizing their articulatory efforts while maximizing their chance of being understood. Models of this kind have been shown to be quite useful in understanding some "regularities" of human languages, such as the fact that languages, though all different, systematically exploit several basic phonetic configurations, such as the vowels [i a u], the plosives [b d g] or the nasals [m n], and the alternation of vowels and consonants in syllables.
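One half of this efficiency argument, maximizing the chance of being understood, can be sketched as a dispersion problem: among a few candidate vowels, pick the subset whose members are as far apart as possible in a perceptual space. The code below is a toy illustration only. The formant values are rough, illustrative numbers assumed here, not measurements from this work. Frequencies are converted to the Bark scale with Traunmüller's approximation, articulatory effort is ignored, and the function names are hypothetical. It needs Python 3.8 or later for `math.dist`.

```python
import itertools
import math

# Rough (F1, F2) values in Hz, assumed for illustration only.
VOWELS = {
    "i": (280.0, 2250.0),
    "e": (400.0, 2000.0),
    "a": (750.0, 1300.0),
    "o": (450.0, 850.0),
    "u": (300.0, 800.0),
}


def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark auditory scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def min_distance(names, vowels):
    """Smallest pairwise distance (in Bark) within a candidate vowel system."""
    points = [tuple(hz_to_bark(f) for f in vowels[n]) for n in names]
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def best_system(vowels, size):
    """Exhaustively pick the system whose closest pair is farthest apart."""
    return max(itertools.combinations(vowels, size),
               key=lambda names: min_distance(names, vowels))


print(best_system(VOWELS, 3))
```

With these assumed values the best three-vowel system is the "corner" set i, a, u. Accounting for articulatory effort or using a different perceptual distance would change the details.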
Skulls of a human adult, a human baby, and a baby monkey. Acoustic simulations have shown that nothing prevents any of them from talking: the challenge is now to understand language acquisition (ontogenesis) and emergence (phylogenesis).
This (short) story has been told by a laboratory called ICP (the Speech Communication Institute / Institut de la Communication Parlée), supported by two departments of the French CNRS (Centre National de la Recherche Scientifique), namely the Engineering Sciences and the Human and Social Sciences Departments, and by two Grenoble universities (INPG / National Polytechnical Institute, and Stendhal University). This laboratory groups together about 100 people, specialists in acoustics, motor control, physiology and psychology of perceptual systems, dynamical systems, signal processing, phonetics and phonology. It carries out research on speech perception and production, on the development of audiovisual speech synthesis and recognition devices and telecommunication systems, and on the study of the ontogeny and phylogeny of human language. It is funded by various local, national and international sources, including of course many EEC projects in the fields of computer and information sciences, speech and language, and telematics.
In conclusion, ICP, firmly standing on its own three feet (signal, language and cognition) and seeking to preserve a good equilibrium between fundamental research on human language and technological developments for man-machine communication, telecommunications and linguistic engineering, is happy to have told you this little story about this fascinating domain, and hopes it has succeeded in arousing your interest in the study of speech communication!
This CD-ROM was produced by Christian Bulfone with the collaboration of Alain Arnal, Pierre Badin, Gérard Bailly, Louis-Jean Boë, Xavier Pelorson, Gordon Ramsay, Lionel Revéret, Christophe Savariaux, and Jean-Luc Schwartz.