Toward a real-time silent speech interface driven by ultrasound and video imaging

Welcome!



Welcome to the Ultraspeech II website! Ultraspeech II is a research project funded by the 6th Christian Benoît Award, which I received in 2011. For further information about the project, don't hesitate to contact me ;-) - Thomas Hueber



The goal of the Ultraspeech II project is to build a real-time prototype of a silent speech interface (SSI), i.e. a device allowing speech communication without the need to vocalize. An SSI could be used in situations where silence is required (such as a silent cell phone), or for communication in very noisy environments. Further applications are possible in the medical field. For example, an SSI could be used by laryngectomized patients as an alternative to the electrolarynx, which produces a very robotic voice; to oesophageal speech, which is difficult to master; or to tracheo-oesophageal speech, which requires additional surgery. The system is based on the analysis of the vocal tract configuration during silent articulation using ultrasound and video imaging. Articulatory movements are captured by a non-invasive multimodal imaging system composed of an ultrasound transducer placed beneath the chin and a video camera in front of the lips. Articulatory features extracted from the visual data are converted into audible speech using statistical mapping techniques (ANN/GMM/HMM).
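To give a flavor of the GMM variant of this statistical mapping, the sketch below implements minimum mean-square error (MMSE) regression under a joint Gaussian mixture over articulatory and acoustic features. This is a generic illustration with toy 1-D features and hand-picked parameters, not the actual Ultraspeech II model (which works on high-dimensional visual and spectral features learned from data):

```python
import numpy as np

# Toy joint GMM over [x; y]: articulatory feature x and acoustic
# feature y, both 1-D here. Parameters are illustrative, not trained.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 1.0],      # [mu_x, mu_y], component 1
                  [2.0, 3.0]])     # [mu_x, mu_y], component 2
covs = np.array([[[1.0,  0.8],     # joint covariance, component 1
                  [0.8,  1.0]],
                 [[1.0, -0.5],     # joint covariance, component 2
                  [-0.5, 1.0]]])

def gmm_map(x):
    """MMSE estimate of the acoustic feature y given articulatory x."""
    # Marginal likelihood of x under each component (1-D Gaussian).
    var_x = covs[:, 0, 0]
    px = weights * np.exp(-0.5 * (x - means[:, 0]) ** 2 / var_x) \
         / np.sqrt(2.0 * np.pi * var_x)
    post = px / px.sum()                      # responsibilities p(m | x)
    # Conditional mean of y for each component: mu_y + S_yx / S_xx * (x - mu_x)
    cond = means[:, 1] + covs[:, 1, 0] / var_x * (x - means[:, 0])
    return float(post @ cond)
```

At run time, each incoming articulatory frame is pushed through `gmm_map` to produce the corresponding acoustic parameters, which are then fed to a vocoder for synthesis.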



Ultraspeech II builds on my previous work on silent speech interfaces, which started in 2006 during my PhD and continued as a CNRS researcher at GIPSA-lab (see my homepage). In that work, the extraction of the visual (articulatory) features from the ultrasound and video images, their conversion into acoustic parameters (articulatory-to-acoustic mapping), and the synthesis of the audio speech signal were all done offline. However, to be usable in a realistic communication device, this entire processing chain must run in real time. This real-time implementation is the goal of this project and the main specification of Ultraspeech II.


Overview of Ultraspeech II

Being involved in a “perception-action” loop has a strong impact on speech production. A speaker communicating through a silent speech interface could potentially exploit the feedback provided by the communication partner or by the device itself to adapt his or her own production and maximize the quality of the communication. This scenario, illustrated in figure 7, can be envisioned only if the silent speech interface runs in “real time”.


By “real time”, I mean the necessity to respect constraints on response time, i.e. operational deadlines from event to system response. In the context of articulatory-to-acoustic mapping, this means that the delay between an articulatory event and the corresponding modification of the audio signal has to be constant (and, of course, as short as possible). In this research project, we are adapting the algorithms developed in my previous work to these real-time constraints. The algorithms are implemented in a demonstrator which enables us to investigate the use of a silent speech interface in a realistic communication situation.
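In a frame-based pipeline, this constant delay is essentially the buffering accumulated along the chain: each stage adds one frame of latency. The figures below are hypothetical, chosen only to illustrate the arithmetic; they are not the actual Ultraspeech II numbers:

```python
# Hypothetical frame-based pipeline: latency = buffered frames / rate.
frame_len = 256        # samples per analysis/synthesis frame (illustrative)
sample_rate = 16000    # audio sampling rate in Hz (illustrative)
n_buffers = 2          # one frame buffered at acquisition, one at synthesis

# Total algorithmic latency in milliseconds: constant by construction,
# since every frame traverses the same fixed-size buffers.
latency_ms = 1000.0 * n_buffers * frame_len / sample_rate
print(latency_ms)      # 32.0 ms under these assumptions
```

Shrinking the frame length reduces the latency but raises the per-frame computation rate, which is the central trade-off a real-time implementation has to negotiate.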


The architecture of the Ultraspeech II system is illustrated in the figure below.

The Ultraspeech II system is composed of two distinct modules: (1) an extended version of my acquisition system Ultraspeech I (more info at www.ultraspeech.com), dedicated to the synchronous recording of ultrasound, video and audio speech data; and (2) UltraspeechMax, a new module implemented in the real-time graphical programming environment Max/MSP, dedicated to the articulatory-to-acoustic mapping and to the generation of the multimodal output speech signal. The two modules communicate via the Open Sound Control (OSC) network protocol and can run on distinct computers in order to reduce the computation load.
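To make the inter-module link concrete, the sketch below encodes a single OSC message using only the Python standard library, following the OSC 1.0 wire format (null-terminated, 4-byte-aligned address and type-tag strings, followed by big-endian arguments). The address `/ssi/f0` is a made-up example; the actual OSC namespace used between Ultraspeech and UltraspeechMax is not documented here:

```python
import struct

def osc_pad(s: bytes) -> bytes:
    """Null-terminate and pad to a 4-byte boundary, per the OSC 1.0 spec."""
    return s + b'\x00' * (4 - len(s) % 4)

def osc_message(address: str, value: float) -> bytes:
    """Encode an OSC message carrying one float32 argument."""
    return (osc_pad(address.encode('ascii'))   # address pattern
            + osc_pad(b',f')                   # type tag: one float
            + struct.pack('>f', value))        # big-endian float32

# Hypothetical example: send a pitch value to the synthesis module.
msg = osc_message('/ssi/f0', 120.0)
# In practice the packet would go out over UDP, e.g.:
#   import socket
#   socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(msg, (host, port))
```

On the Max/MSP side, such packets are typically received and dispatched with the stock `udpreceive` object, which understands OSC natively; this is one reason OSC is a natural choice for linking an external acquisition process to a Max patch.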