Inferring Prosody from Facial Cues for EMG-based Synthesis of Silent Speech
by Christian Johner, Matthias Janke, Michael Wand, Tanja Schultz
Abstract:
In this paper we introduce a system that detects prosodic elements in a spoken utterance based on signals from the facial muscles. The proposed system can augment our surface electromyography (EMG) based Silent Speech Interface to make synthesized speech more natural. Having shown in (Nakamura, Janke, Wand, & Schultz, 2011) that it is possible to produce understandable synthesized speech from EMG signals, our current interest is to improve the quality and expressivity of the synthesis. We show that a standard phonetically balanced German speech corpus with only a few additional utterances is sufficient to train a system that can discriminate yes/no questions from normal speech and also distinguish between normal and emphasized words in an utterance. To detect prosodic information in facial muscle movement, we extend our EMG-based speech synthesis system with two additional EMG channels, recording the movements of the facial muscles musculus corrugator and musculus frontalis. Our classification method uses frame-based SVM classification, followed by a majority vote to classify a whole word. Our system achieves F-scores of up to 0.68 for the recognition of emphasized words and 1.0 for the classification between questions and normal utterances, although the results show large variations depending on the feature combination used for training.
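The frame-based SVM classification with per-word majority voting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two-dimensional feature vectors and class clusters here are synthetic stand-ins for the EMG-derived features the actual system extracts from the corrugator and frontalis channels.

```python
# Sketch: classify each frame of a word with an SVM, then assign the
# word the label that wins the majority vote over its frames.
# Features and labels are synthetic; the real system uses EMG features.
import numpy as np
from collections import Counter
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy training frames: class 0 ("normal") clustered near 0,
# class 1 ("emphasized") clustered near 3, two features per frame.
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                     rng.normal(3.0, 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf").fit(X_train, y_train)

def classify_word(frames, clf):
    """Predict a label per frame, then return the majority label."""
    votes = clf.predict(frames)
    return Counter(votes).most_common(1)[0][0]

# Usage: a word represented by 20 frames drawn near the "emphasized"
# cluster should be voted into class 1 even if a few frames disagree.
word_frames = rng.normal(3.0, 1.0, (20, 2))
word_label = classify_word(word_frames, clf)
```

The majority vote smooths over individual frame misclassifications, which is why a noisy frame-level classifier can still yield a reliable word-level decision.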
Reference:
Inferring Prosody from Facial Cues for EMG-based Synthesis of Silent Speech (Christian Johner, Matthias Janke, Michael Wand, Tanja Schultz), In 4th International Conference on Applied Human Factors and Ergonomics, 2012.
Bibtex Entry:
@inproceedings{johner2012inferring,
  title={Inferring Prosody from Facial Cues for EMG-based Synthesis of Silent Speech},
  year={2012},
  booktitle={4th International Conference on Applied Human Factors and Ergonomics},
  url={https://www.csl.uni-bremen.de/cms/images/documents/publications/johner_christian_ahfe_paper.pdf},
  abstract={In this paper we introduce a system that detects prosodic elements in a spoken utterance based on signals from the facial muscles. The proposed system can augment our surface electromyography (EMG) based Silent Speech Interface to make synthesized speech more natural. Having shown in (Nakamura, Janke, Wand, & Schultz, 2011) that it is possible to produce understandable synthesized speech from EMG signals, our current interest is to improve the quality and expressivity of the synthesis. We show that a standard phonetically balanced German speech corpus with only a few additional utterances is sufficient to train a system that can discriminate yes/no questions from normal speech and also distinguish between normal and emphasized words in an utterance. To detect prosodic information in facial muscle movement, we extend our EMG-based speech synthesis system with two additional EMG channels, recording the movements of the facial muscles musculus corrugator and musculus frontalis. Our classification method uses frame-based SVM classification, followed by a majority vote to classify a whole word. Our system achieves F-scores of up to 0.68 for the recognition of emphasized words and 1.0 for the classification between questions and normal utterances, although the results show large variations depending on the feature combination used for training.},
  keywords={EMG, synthesis, prosody, speech recognition},
  author={Johner, Christian and Janke, Matthias and Wand, Michael and Schultz, Tanja}
}