Generation of Audiovisual Prosody for Expressive Virtual Actors


Supervisor: Gérard BAILLY

Doctoral school: Mathématiques, sciences et technologies de l'information, informatique (MSTII)

Specialty: Computer Science

Affiliation: UJF

Home institution: Universitat Autònoma de Barcelona (Spain)

Funding: Doctoral contract


Thesis start date: 01/12/2012

Defense date: 23/11/2015


Jury composition:
Mr. Ronan BOULIC, PRF, EPFL Lausanne, Reviewer
Ms. Catherine PELACHAUD, DR CNRS, Telecom Paris Tech, Reviewer
Ms. Marie-Christine ROUSSET, PRF, Université Grenoble Alpes, Examiner
Mr. Marc SWERTS, PRF, Tilburg University, Examiner
Mr. Slim OUNI, MCF/HDR, Université de Lorraine, Nancy, Examiner
Mr. Rémi RONFARD, CR/HDR INRIA, Université Grenoble Alpes, Thesis supervisor
Mr. Gérard BAILLY, DR CNRS, Université Grenoble Alpes, Thesis co-supervisor


Abstract: The work presented in this thesis addresses the problem of generating audiovisual prosody for expressive virtual actors. A virtual actor is represented by a 3D talking head, and an audiovisual performance comprises facial expressions, head movements, gaze direction and the speech signal. While a substantial amount of work has been dedicated to emotions, we explore here expressive verbal behaviors that signal mental states, i.e., how speakers feel about what they say. We explore the characteristics of these so-called dramatic attitudes and the way they are encoded with speaker-specific prosodic signatures, i.e., patterns of trajectories of audiovisual prosodic parameters. We analyze and model a set of 16 attitudes which encode interactive dimensions of face-to-face communication in dramatic dialogues. We ask two semi-professional actors to perform these 16 attitudes, first in isolation (exercises in style) in a series of 35 carrier sentences, and second in a short interactive dialog extracted from the theater play Hands Around by Arthur Schnitzler, under the guidance of a professional theater director. The audiovisual trajectories are analyzed at both the frame level and the utterance level. To synthesize expressive performances, we use both a frame-based conversion system for generating segmental features and a prosodic model for suprasegmental features. The prosodic model considers both the spatial and temporal dimensions in the analysis and generation of prosody by introducing dynamic audiovisual units. Along with the implementation of the presented system, the following topics are discussed in detail: the state of the art (virtual actors, visual prosody, speech-driven animation, text-to-visual speech, expressive audiovisual conversion), the recording of an expressive corpus of dramatic attitudes, the data analysis and characterization, the generation of audiovisual prosody, and the evaluation of the synthesized audiovisual performances.

