Towards an AudioVisual Virtual Talking Head: 3D articulatory modeling of tongue, lips and face based on MRI and video images
(completed with video illustrations)
Part of a presentation at the
Fifth Speech Production Seminar - Seeon - 1-4 May 2000
(P. Badin, P. Borel, G. Bailly, L.
Revéret, M. Baciu, and C. Segebarth. Towards an audiovisual
virtual talking head: 3D articulatory modeling of tongue, lips
and face based on MRI and video images. In Proceedings of the 5th Seminar
on Speech Production: Models and Data & CREST Workshop on Models of
Speech Production: Motor Planning and Articulatory Modelling, pages
261-264, Kloster Seeon, Germany, May 2000)
Animation of a vowel sequence (2 MB)
Animation of a consonant sequence (9 MB)
Introduction
- Traditional articulatory modeling is limited to the midsagittal plane
  - midsagittal contours to area functions?
  - lateral consonants?
  - acoustical transverse modes?
  - fluid mechanics?
  - visual aspects of speech communication?
- MRI & image processing
  - acquisition of 3D articulatory data for the vocal tract and speech organs
  - acquisition of 3D articulatory data for the lips and face
Objectives
- exploration of the independent linear degrees of freedom of the speech articulators
- development of 3D linear articulatory models
- development of virtual talking heads
The linear articulatory modeling approach
- Speaker / Language / Corpus
- Dimensionality of speech articulator shapes / positions
  - independent linear degrees of freedom
  - multilinear analysis
- Compromise between
  - explanation of the data variance
  - number of control parameters
  - biomechanical likelihood
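The compromise between explained variance and number of control parameters can be illustrated with a minimal Python sketch. It assumes plain PCA on synthetic data: the array sizes, the noise level and the 95 % target are illustrative assumptions, not values from the study, and biomechanical likelihood is not addressed.

```python
import numpy as np

def explained_variance_ratio(data):
    """Fraction of the total variance carried by each principal component."""
    centered = data - data.mean(axis=0)           # remove the mean articulation
    # Squared singular values of the centered data are proportional
    # to the variances of the principal components (sorted descending).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_needed(data, target=0.95):
    """Smallest number of components whose cumulative variance reaches target."""
    cumulative = np.cumsum(explained_variance_ratio(data))
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))

# Synthetic stand-in for articulatory data (NOT the MRI corpus):
# 40 articulations, 60 coordinates, generated from 3 hidden factors plus noise.
rng = np.random.default_rng(0)
n_articulations, n_coordinates, n_true_factors = 40, 60, 3
factors = rng.normal(size=(n_articulations, n_true_factors))
loadings = rng.normal(size=(n_true_factors, n_coordinates))
noise = 0.05 * rng.normal(size=(n_articulations, n_coordinates))
data = factors @ loadings + noise

ratios = explained_variance_ratio(data)
print(np.round(np.cumsum(ratios[:5]), 3))     # cumulative variance explained
print(components_needed(data, target=0.95))   # parameters needed for 95 %
```

Raising the target adds control parameters; lowering it leaves more of the data variance unexplained.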
The articulatory data
- MRI => tongue, jaw (vocal tract, lips)
- Video => lips, face, jaw
- One male French speaker
- Corpus of artificially sustained articulations
  - French vowels [a E e i y u o O ? ¿ ]
  - French consonants in three symmetrical vocalic contexts [aCa], [iCi], [uCu] with C = [p t k f s S { l]
Raw MRI data processing
Figures: axial stack, oblique stack and coronal stack; midsagittal section reconstructed from the three stacks.
MRI contours processing
- limitation by hard palate and jaw
- re-slicing by the semipolar grid
Figures: raw data; final contours in the semipolar grid system.
Video data for lips and face measurements
Measurement of 3D coordinates of the face fleshpoints
Measurement of 3D coordinates of the lip shape
Video data for jaw position measurement
Guided Multilinear Analysis (GMA)
- Multilinear Analysis
- Predictors chosen as:
  - Factors extracted by Principal Component Analysis on the whole or part of the articulators
  - Direct articulatory measurements
    - jaw height
    - jaw advance
    - upper lip protrusion
    - etc.
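One possible reading of the guided analysis, assumed here rather than stated on the slides, is that the contribution of a chosen predictor (for instance a direct measurement such as jaw height) is first removed from the data by linear regression, and PCA is then applied to the residual to obtain the next predictor. The Python sketch below follows that assumption on synthetic data. The function names `regress_out` and `first_pca_factor` and all numerical values are hypothetical.

```python
import numpy as np

def regress_out(data, predictor):
    """Least-squares fit of every data column on a single predictor.

    Returns (loadings, residual) such that
    centered data = outer(centered predictor, loadings) + residual.
    """
    centered = data - data.mean(axis=0)
    p = predictor - predictor.mean()
    loadings = (p @ centered) / (p @ p)           # one slope per coordinate
    residual = centered - np.outer(p, loadings)
    return loadings, residual

def first_pca_factor(residual):
    """Scores and loadings of the first principal component of the residual."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, 0] * s[0], vt[0]

# Synthetic data (assumption): 40 articulations, 30 coordinates driven by
# a measured jaw height and a hidden tongue body factor.
rng = np.random.default_rng(1)
n = 40
jaw_height = rng.normal(size=n)
tongue_body = rng.normal(size=n)
data = (np.outer(jaw_height, rng.normal(size=30))
        + np.outer(tongue_body, rng.normal(size=30))
        + 0.01 * rng.normal(size=(n, 30)))

jaw_loadings, residual = regress_out(data, jaw_height)   # guided step
scores, loadings = first_pca_factor(residual)            # PCA on what is left
```

By construction the residual, and therefore the extracted factor, is uncorrelated with the predictor that was regressed out.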
Degrees of freedom of the jaw
- Original data: 3D coordinates of the lower incisor
- PCA:
  - first factor jaw1 (~ Jaw Height)
  - second factor jaw2 (~ Jaw Advance)
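As a rough illustration of this step, the sketch below runs PCA on synthetic 3D incisor positions and keeps the first two factors. The axis convention (x lateral, y front-back, z vertical), the units and the spread of the simulated points are assumptions made for the example, and `jaw_factors` is a hypothetical helper name.

```python
import numpy as np

def jaw_factors(incisor_xyz, n_factors=2):
    """PCA of lower-incisor positions (rows = articulations, columns = x, y, z)."""
    mean = incisor_xyz.mean(axis=0)
    centered = incisor_xyz - mean
    covariance = np.cov(centered, rowvar=False)             # 3 x 3 matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:n_factors]       # largest first
    axes = eigenvectors[:, order]                           # 3 x n_factors
    scores = centered @ axes                                # one column per factor
    explained = eigenvalues[order] / eigenvalues.sum()
    return mean, axes, scores, explained

# Synthetic incisor positions (assumption): large vertical motion,
# smaller front-back motion, almost no lateral motion.
rng = np.random.default_rng(2)
n = 40
xyz = np.column_stack([
    0.02 * rng.normal(size=n),   # x: lateral
    0.15 * rng.normal(size=n),   # y: front-back
    0.50 * rng.normal(size=n),   # z: vertical
])

mean, axes, scores, explained = jaw_factors(xyz)
print(np.round(axes, 2))         # column 0 ~ vertical, column 1 ~ front-back
print(np.round(explained, 3))    # variance fraction of each factor
```

With data of this kind the first axis is close to vertical (comparable to jaw1) and the second close to front-back (comparable to jaw2). The sign of each axis is arbitrary.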
Degrees of freedom of the tongue
Effect of jaw height parameter jaw1 [6 MB "avi" video file]
Effect of tongue body parameter TB [6 MB "avi" video file]
Effect of tongue dorsum parameter TD [6 MB "avi" video file]
Effect of tongue tip parameter TT [6 MB "avi" video file]
Effect of tongue advance parameter TA [6 MB "avi" video file]
Effect of tongue extremity parameter T1 [6 MB "avi" video file]
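Each clip above shows the effect of a single control parameter. In a linear articulatory model a shape is the mean shape plus a weighted sum of loading vectors, so such a clip corresponds to sweeping one weight while the others stay at zero. The Python sketch below shows this on a toy mesh. The loadings are random numbers and the sweep range of -3 to +3 is an assumption; only the parameter names come from the slides.

```python
import numpy as np

def synthesize(mean_shape, loadings, parameters):
    """Linear model: shape = mean_shape + sum_k parameters[k] * loadings[k]."""
    return mean_shape + parameters @ loadings

def sweep(mean_shape, loadings, index, values):
    """Shapes obtained by varying one parameter, all others held at zero."""
    frames = []
    for value in values:
        parameters = np.zeros(loadings.shape[0])
        parameters[index] = value
        frames.append(synthesize(mean_shape, loadings, parameters))
    return np.array(frames)

names = ["jaw1", "TB", "TD", "TT", "TA", "T1"]   # parameter names from the slides
rng = np.random.default_rng(3)
n_coordinates = 12                                # toy mesh: 4 points x 3 coordinates
mean_shape = rng.normal(size=n_coordinates)       # made-up mean shape
loadings = rng.normal(size=(len(names), n_coordinates))  # made-up loadings

# Seven frames of a "TB" sweep; the middle frame is the mean shape.
frames = sweep(mean_shape, loadings, names.index("TB"), np.linspace(-3, 3, 7))
print(frames.shape)
```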
Examples of 3D tongue sequences tracked by inversion from X-ray films
This tongue animation of a vowel sequence [3.5 MB "avi" video file] was built from this X-ray film of the vowel sequence [2 MB "mov" video file].
Here is an animation of a plosive sequence [9 MB "avi" video file].
Degrees of freedom of lips / face
Effect of jaw height parameter jaw1 [6 MB "avi" video file]
Effect of lip protrusion parameter lips1 [6 MB "avi" video file]
Effect of lip height parameter lips2 [6 MB "avi" video file]
Effect of lip vertical elevation parameter lips3 [6 MB "avi" video file]
Effect of jaw advance parameter jaw2 [6 MB "avi" video file]
Examples of sequences tracked from video
This face animation of plosive sequences [9 MB "avi" video file] was tracked from a video film.
This face animation of labiodental sequences [9 MB "avi" video file] was also tracked from a video film.
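The slides do not describe the tracking procedure. One simple way to pose such a problem, assumed here purely for illustration, is to find the control parameters of a linear model that best reproduce a set of measured 3D points in the least-squares sense. The Python sketch below does this with five parameters, matching the count of lip / face parameters listed above. The model values are random and `fit_parameters` is a hypothetical helper.

```python
import numpy as np

def fit_parameters(observed, mean_shape, loadings):
    """Parameters p minimizing || mean_shape + p @ loadings - observed ||."""
    solution, *_ = np.linalg.lstsq(loadings.T, observed - mean_shape, rcond=None)
    return solution

# Made-up linear model: 5 parameters, 30 coordinates (10 points x 3).
rng = np.random.default_rng(4)
n_parameters, n_coordinates = 5, 30
mean_shape = rng.normal(size=n_coordinates)
loadings = rng.normal(size=(n_parameters, n_coordinates))

# Simulated measurement: a known parameter set plus small measurement noise.
true_parameters = np.array([1.0, -0.5, 0.2, 0.0, 0.8])
observed = (mean_shape + true_parameters @ loadings
            + 0.01 * rng.normal(size=n_coordinates))

estimated = fit_parameters(observed, mean_shape, loadings)
print(np.round(estimated, 2))    # close to true_parameters
```

Applied frame by frame, the estimated parameters form a trajectory that can drive the model.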
Conclusions
- Articulatory data on phonemes
- Estimation of the independent linear degrees of freedom of the articulators
- Number of degrees of freedom of the 3D tongue not markedly higher than in the midsagittal plane
- Linear articulatory models of tongue, lips and face
Future work
- less noisy data from MRI
- more subjects
- complete vocal tract (incl. velum)
- acoustics
- emotion (smile, etc.)
Perspectives
- Inversion from audiovisual signals
  - inversion of the speech signal to determine internal articulators
  - tracking of external articulators (incl. jaw) from video
- Animation of clones for telecommunications
- AudioVisual text-to-speech synthesis
- Aids to language learning
Example of a sequence inverted from formants