Bayesian Cognition and Machine Learning for Speech Communication

MIAI - Multidisciplinary Institute in Artificial Intelligence


The "Bayesian Cognition and Machine Learning for Speech Communication" chair, part of the Grenoble MIAI institute, brings together researchers whose areas of expertise lie in the fields of Speech Communication, Cognition, Machine Learning, and Probabilistic Modeling of sensorimotor systems. The team members come from Gipsa-lab (UMR CNRS 5216) and LPNC (UMR CNRS 5105), two labs at UGA and Grenoble INP. The aim of the chair is to build a global computational model of speech production and speech perception, that is, a system able to learn how to speak and to perceive speech from examples provided by the environment. To this end, an original approach is proposed that combines the algorithmic and mathematical framework of data-driven Deep Learning with hypothesis-driven Probabilistic Modelling. This approach is designed to yield more interpretable, and thus more explainable and transferable, models, with faster and more economical implementations and greater robustness and versatility. Our aim is to build models of speech communication that reach the state-of-the-art performance of current deep-learning-based systems while drastically limiting the amount of training data required.

Scientific approach

Jointly modeling the speech production and perception processes amounts to designing models of the relationships between the various variables involved in those processes, namely motor/control variables, multi-sensory variables and linguistic/phonological variables (Figure 1). Over the years, the members of the group have explored two complementary approaches to do this. Firstly, a hypothesis-driven probabilistic programming approach (Tenenbaum et al., 2011; Bessière et al., 2013) has been used to explicitly design a set of multidimensional probabilistic functions that link the variables. These probabilistic models are defined from theoretical hypotheses about the physical mechanisms, neurocognitive processing and representations of speech production and speech perception in humans. This has led to a number of significant insights concerning speech perception in adverse conditions (Moulin-Frier et al., 2012; Laurent et al., 2015), variability and robustness in speech production (Patri et al., 2015, 2018), speech representations in the brain (Barnaud et al., 2018) and the emergence of sound systems in a society of communicating agents (Moulin-Frier et al., 2015; Schwartz et al., 2016). Secondly, data-driven deep-learning algorithmic and mathematical frameworks such as artificial deep neural networks make it possible to establish a direct mapping, i.e., a deterministic regression, between subsets of variables. This approach was the basis for the development of efficient systems for speech processing, voice conversion, noise reduction, feature extraction, acoustic-articulatory inversion, speech synthesis and brain-computer interfaces (Hueber et al., 2015; Hueber & Bailly, 2016; Bocquelet et al., 2016; Fabre et al., 2017; Girin et al., 2017; Schultz et al., 2017).
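The first approach can be illustrated with a deliberately minimal sketch, not the chair's actual model: a discrete phonological variable (Phi), a motor variable (M) and a sensory variable (S) are linked by hypothesis-driven conditional distributions, P(Phi, M, S) = P(Phi) P(M | Phi) P(S | M), and perception is then Bayesian inference of P(Phi | S) obtained by marginalizing over M. All variable values and probability tables below are invented for the illustration.

```python
# Toy Bayesian perceptuo-motor model: P(Phi, M, S) = P(Phi) P(M|Phi) P(S|M).
# Perception = inferring P(Phi | S) by summing out the motor variable M.

phonemes = ["a", "i"]
motors = ["open", "close"]

p_phi = {"a": 0.5, "i": 0.5}                 # prior over phonological categories
p_m_given_phi = {                            # motor plan given phoneme
    "a": {"open": 0.9, "close": 0.1},
    "i": {"open": 0.2, "close": 0.8},
}
p_s_given_m = {                              # sensory outcome given motor act
    "open":  {"low_F1": 0.1, "high_F1": 0.9},
    "close": {"low_F1": 0.8, "high_F1": 0.2},
}

def perceive(s):
    """Return the posterior P(Phi | S=s), marginalizing over M."""
    scores = {
        phi: sum(p_phi[phi] * p_m_given_phi[phi][m] * p_s_given_m[m][s]
                 for m in motors)
        for phi in phonemes
    }
    z = sum(scores.values())                 # normalization constant P(S=s)
    return {phi: v / z for phi, v in scores.items()}

posterior = perceive("high_F1")
```

Replacing the hand-written probability tables with distributions learned from data is precisely where this approach meets the deep-learning one described next.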
The computational challenge addressed in the program of the chair is to take the best from these two complementary approaches in order to elaborate a computational system able to learn how to produce and perceive speech from examples provided by the environment. These two approaches have recently begun to intersect with the emergence of Deep Generative Models, such as Variational AutoEncoders (VAEs) and Generative Adversarial Networks (GANs). In these models, the parameters characterizing probability distributions of the data are encoded/mapped within deep neural networks. Such models can be used as supervised probabilistic priors in a more general probabilistic model, and they provide efficient ways to extract, model and manipulate the low-dimensional latent space that represents the structure of the high-dimensional data.
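The core mechanism can be sketched in a few lines, under stated assumptions: in a VAE, the encoder network does not output a latent code directly but the parameters (mean and log-variance) of a Gaussian over the latent space, and sampling uses the reparameterization trick z = mu + sigma * eps so that the stochastic step remains differentiable. The toy "encoder" below is a hypothetical stand-in for a trained deep network.

```python
# Sketch of the VAE latent-sampling step (reparameterization trick).
import math
import random

def toy_encoder(x):
    """Stand-in for a deep encoder: maps data x to Gaussian parameters."""
    mu = [0.5 * xi for xi in x]          # hypothetical learned mean mapping
    log_var = [-1.0 for _ in x]          # fixed log-variance for the sketch
    return mu, log_var

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + exp(log_var / 2) * eps, eps ~ N(0, 1)."""
    return [m + math.exp(lv / 2) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)                   # seeded for reproducibility
x = [1.0, -2.0, 0.5]                     # a toy high-dimensional observation
mu, log_var = toy_encoder(x)
z = sample_latent(mu, log_var, rng)      # low-dimensional stochastic latent code
```

Because the noise eps is drawn outside the network, gradients of a reconstruction loss can flow through mu and log_var, which is what makes these distribution parameters learnable by backpropagation.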

Figure 1: Processes and variables included in the joint model of speech production and speech perception

Research Program

In this context, it is crucial to clarify how complex multidimensional probabilistic relations between speech production and perception variables can be learned, and how this learning can be made more efficient. A systematic exploration of the numerous multidimensional spaces involved in these processes is unfeasible. This is why a cognition-based probabilistic programming approach is favored: it provides a structure for the development of the computational models and the learning processes in relation to existing knowledge about physics (aeroacoustics, biomechanics) and neuroscience/psychology (sensory-motor control of speech, language representations in the brain, developmental schedule of speech and language acquisition in infancy and childhood). The overall program of the project consists of combining the formalization of the speech communication model made possible by the Probabilistic Programming framework with the learning capabilities of advanced machine learning methods, in order to tackle these learning issues, elaborate new learning models and evaluate them (Figure 2).
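The data-efficiency argument can be made concrete with a hypothetical toy example: if the model structure is fixed by prior hypotheses (here, one Gaussian sensory distribution per phonological category), the only free parameters are a mean and a standard deviation per category, which can be estimated from a handful of examples rather than learned end-to-end. The formant-like values and categories below are invented for the sketch.

```python
# Hypothesis-driven structure: P(S | Phi) is assumed Gaussian per category,
# so learning reduces to estimating (mean, std) from very few examples.
import math
import statistics

# A few labelled "sensory" examples (formant-like values, invented).
examples = {
    "a": [700.0, 720.0, 690.0, 710.0],
    "i": [300.0, 310.0, 290.0, 305.0],
}

# Fit the two free parameters of each category's Gaussian.
params = {
    phi: (statistics.mean(vals), statistics.stdev(vals))
    for phi, vals in examples.items()
}

def classify(s):
    """Return the category maximizing the Gaussian log-likelihood of S=s."""
    def loglik(phi):
        mu, sigma = params[phi]
        return -math.log(sigma) - 0.5 * ((s - mu) / sigma) ** 2
    return max(params, key=loglik)
```

Four examples per category suffice here because the Gaussian hypothesis carries most of the modeling burden; an unstructured regression of the same decision boundary would need far more data.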

Figure 2: Components of the research program

More specifically, during the next four years, research will be organized around several main topics.

The impact of this research program on the development of speech technologies will be evaluated in terms of the speed and completeness of learning, as well as the amount of data required to reach a satisfactory level of learning. Moreover, in the context of deep learning, the analysis of the representations learned by deep artificial neural networks from raw articulatory, acoustic and linguistic data could provide important insights into the sensory-motor representations potentially encoded in the human brain.


References

  • Barnaud, M.L., Bessière, P., Diard, J., & Schwartz, J.L. (2018). Reanalyzing neurocognitive data on the role of the motor system in speech perception within COSMO, a Bayesian perceptuo-motor model of speech communication. Brain and Language, 187, 19-32.
  • Bessière, P., Mazer, E., Ahuactzin-Larios, J.-M., & Mekhnacha, K. (2013). Bayesian Programming. Boca Raton, FL: CRC Press.
  • Bocquelet, F., Hueber, T., Girin, L., Savariaux, C., & Yvert, B. (2016). Real-time control of an articulatory-based speech synthesizer for brain computer interfaces. PLoS Computational Biology, 12(11): e1005119.
  • Fabre, D., Hueber, T., Girin, L., Alameda-Pineda, X., & Badin, P. (2017). Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract. Speech Communication, 93, 63-75.
  • Girin, L., Hueber, T., & Alameda-Pineda, X. (2017). Extending the cascaded Gaussian mixture regression framework for cross-speaker acoustic-articulatory mapping. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(3), 662-673.
  • Hueber, T., & Bailly, G. (2016). Statistical conversion of silent articulation into audible speech using full-covariance HMM. Computer Speech & Language, 36, 274-293.
  • Hueber, T., Girin, L., Alameda-Pineda, X., & Bailly, G. (2015). Speaker-adaptive acoustic-articulatory inversion using cascaded Gaussian mixture regression. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12), 2246-2259.
  • Moulin-Frier, C., Laurent, R., Bessière, P., Schwartz, J. L., & Diard, J. (2012). Adverse conditions improve distinguishability of auditory, motor, and perceptuo-motor theories of speech perception: An exploratory Bayesian modelling study. Language and Cognitive Processes, 27(7-8), 1240-1263.
  • Moulin-Frier, C., Diard, J., Schwartz, J. L., & Bessière, P. (2015). COSMO (“Communicating about Objects using Sensory–Motor Operations”): A Bayesian modeling framework for studying speech communication and the emergence of phonological systems. Journal of Phonetics, 53, 5-41.
  • Patri, J. F., Perrier, P., Schwartz, J. L., & Diard, J. (2018). What drives the perceptual change resulting from speech motor adaptation? Evaluation of hypotheses in a Bayesian modeling framework. PLoS Computational Biology, 14(1), e1005942.
  • Patri, J. F., Diard, J., & Perrier, P. (2015). Optimal speech motor control and token-to-token variability: a Bayesian modeling approach. Biological cybernetics, 109(6), 611-626.
  • Schultz, T., Wand, M., Hueber, T., Krusienski, D. J., Herff, C., & Brumberg, J. S. (2017). Biosignal-based spoken communication: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12), 2257-2271.
  • Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279-1285.