Indonesian Text to Audio Visual Speech with Animated Talking Head


DOI: https://doi.org/10.15866/irecos.v11i3.8706

Abstract


Most people now rely on computer systems in their daily routines. A supporting system is therefore needed so that communication between users and the computer can proceed smoothly and naturally. Speech is a primary means of communication, and integrating visual speech information with it improves and sharpens its naturalness. Combining the two in an audio-visual speech mode is an effective way to enhance the quality of human-computer interaction. A Text To Audio Visual Speech (TTAVS) system transforms text input into a speech message, synthesizes the corresponding visual speech, and delivers both to the user simultaneously as a single message. This paper addresses the development of a TTAVS system for the Indonesian language. A Mean Opinion Score (MOS) test is used to evaluate the performance of the system; native speakers of Indonesian, among them lecturers of Indonesian, were selected as participants. The average MOS indicates a promising result.
Copyright © 2016 Praise Worthy Prize - All rights reserved.
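
For readers unfamiliar with the evaluation metric, the sketch below shows how a Mean Opinion Score is conventionally computed from listener ratings. It is a minimal illustration only, not the authors' evaluation code: the 1-to-5 rating scale is the usual MOS convention, and the rater data and function names are invented for the example.

```python
from statistics import mean, stdev

# Hypothetical ratings: each inner list holds one rater's opinion scores
# for the synthesized audio-visual utterances, on the conventional
# five-point absolute category rating scale (1 = bad ... 5 = excellent).
ratings = [
    [4, 5, 4, 3, 4],  # rater 1
    [3, 4, 4, 4, 5],  # rater 2
    [4, 4, 3, 4, 4],  # rater 3
]

def mos(scores):
    """Mean Opinion Score: the arithmetic mean over all individual ratings."""
    flat = [s for rater in scores for s in rater]
    return mean(flat)

all_scores = [s for rater in ratings for s in rater]
print(f"MOS = {mos(ratings):.2f} "
      f"(n = {len(all_scores)}, sd = {stdev(all_scores):.2f})")
```

Reported MOS values are typically accompanied by the number of ratings collected and a dispersion measure, as in the print statement above.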

Keywords


Animated Talking Head; Audio Visual Speech; Text to Speech; Text to Audio Visual Speech; Visual Speech



