
Spectral Correlative Mapping Approach for Transformation of Expressivity in Marathi Speech


DOI: https://doi.org/10.15866/irecap.v8i1.13895

Abstract


Imposing appropriate expressivity on a plain (neutral) speech segment is referred to as emotion transformation. Expressivity enriches the meaning of an utterance; in a speech signal, emotion is conveyed chiefly through prosody, a quality that only human speakers naturally supply. Appropriate prosody modification of the plain synthetic speech generated by a text-to-speech synthesizer therefore plays a vital role in developing an effective speech interface platform. This study focuses on the Marathi regional language. It presents a spectral correlative prosodic mapping approach based on spectral mode decomposition (SMD) to achieve emotion transformation. The prosodic features of the expressive speech are computed over the signal segments that exhibit high spectral correlation. The resulting feature vector, which represents the target speaking style, is transplanted onto the mean feature vector derived from the neutral speech. Before the target feature vector replaces the source values, it is adaptively recomputed to reduce the spectral error between the source and target emotion segments. Speech re-synthesized from these prosody-modified features with minimized spectral error sounds like the target expressive speech. The approach yields better quality because the computed features are not transplanted as-is but are fine-tuned to minimize the spectral error between the source and target segments, as confirmed by waveform, spectrogram, and objective measurements.
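The abstract does not spell out the SMD-based mapping itself, so the following Python sketch is only a minimal illustration of the transplant idea it describes: score frame pairs by spectral correlation, estimate the target style from the highly correlated segments, adaptively pull that style toward the neutral mean to limit spectral error, and shift the neutral features accordingly. It assumes time-aligned parallel neutral and expressive utterances; the function and parameter names (transplant_prosody, corr_threshold, alpha) are hypothetical, not taken from the paper.

    import numpy as np

    def spectral_correlation(src_spec, tgt_spec):
        """Normalized correlation between two magnitude-spectrum frames."""
        s = src_spec - src_spec.mean()
        t = tgt_spec - tgt_spec.mean()
        denom = np.linalg.norm(s) * np.linalg.norm(t)
        return float(s @ t / denom) if denom > 0 else 0.0

    def transplant_prosody(neutral_feats, expressive_feats,
                           neutral_spec, expressive_spec,
                           corr_threshold=0.8, alpha=0.5):
        """Hypothetical sketch of spectral-correlation-guided prosody transplant.

        Assumes time-aligned parallel utterances:
          neutral_feats, expressive_feats : (n_frames, n_features) prosodic
              features per frame (e.g. F0, energy, duration ratio)
          neutral_spec, expressive_spec   : (n_frames, n_bins) magnitude spectra
        Returns prosody-modified features for re-synthesis.
        """
        # 1. Score each frame pair by spectral correlation.
        corr = np.array([spectral_correlation(n, e)
                         for n, e in zip(neutral_spec, expressive_spec)])

        # 2. Estimate the target speaking style from the highly correlated
        #    segments only; fall back to the best frame if none pass.
        mask = corr >= corr_threshold
        if not mask.any():
            mask = corr == corr.max()
        target_style = expressive_feats[mask].mean(axis=0)

        # 3. Mean feature vector of the neutral (source) speech.
        source_mean = neutral_feats.mean(axis=0)

        # 4. Adaptive recomputation: pull the target vector toward the
        #    source mean so the transplant introduces less spectral error.
        adapted_target = source_mean + alpha * (target_style - source_mean)

        # 5. Transplant: shift every neutral frame by the adapted offset.
        return neutral_feats + (adapted_target - source_mean)

In the method the abstract describes, the waveform would then be re-synthesized from the modified features and evaluated against the target expression via waveform, spectrogram, and objective measures.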

Keywords


Spectral Mode Decomposition (SMD); Spectral Correlation; Prosody; Emotion Transformation



