In recent years, tremendous progress has been made in voice and speaker recognition, and the corresponding technologies have entered our everyday lives, e.g. when we control smartphones by voice or talk to automated agents that guide us through service hotlines. In the coming years and decades, speech processing technologies are expected to reach the next stage: they will gain social skills, making dialogs more intuitive and searches more efficient, and making speech analysis systems in the commercial, healthcare and security sectors more robust and universal. One essential requirement is that speech analysis systems not only recognize the textual content of verbal expressions ("what is said") but also understand the paralinguistic content ("who says it, how, where, and in what context"). To reach this goal, it is necessary to reliably detect speaker attributes and states such as age, gender, height, accent/dialect, personality, state of health, tiredness, sobriety and affective state.
The research project iHEARu, short for "Intelligent systems' Holistic Evolving Analysis of Real-life Universal speaker characteristics", pushes the limits of intelligent speaker analysis and aims to develop novel speech analysis methods. These include machine learning techniques that recognize multiple attributes simultaneously and leverage the correlations between them (e.g. BLSTM-RNNs or multi-task linear regression), as well as semi-supervised learning methods that allow large training datasets to be used without requiring extensive manual annotation effort.
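As a rough illustration of the multi-task regression idea, multiple speaker attributes can be predicted jointly from one shared acoustic feature matrix. The following minimal sketch uses synthetic data; the feature and attribute names are hypothetical and this is not project code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for acoustic features (e.g. pitch/energy statistics):
# 200 utterances, 10 features. The two target columns play the role of two
# correlated speaker attributes (say, age and height); purely illustrative.
X = rng.normal(size=(200, 10))
W_true = rng.normal(size=(10, 2))
Y = X @ W_true + 0.1 * rng.normal(size=(200, 2))

# Multi-task linear (ridge) regression: one shared closed-form solve
# yields the weight matrix for all attributes at once.
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ Y)

# Per-attribute root-mean-square error on the training data.
rmse = np.sqrt(((X @ W - Y) ** 2).mean(axis=0))
```

Note that in this plain ridge formulation the joint solve is mathematically equivalent to fitting each attribute separately; the correlations between attributes are actually exploited only when the tasks are coupled, e.g. through shared hidden layers in a BLSTM-RNN or joint regularizers across the weight columns, as pursued in the project.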
New realistic speech data is collected from public sources such as video streaming sites and media reports and partially annotated through crowdsourcing. In addition, existing datasets already available for research purposes are used, and new speech data is recorded from volunteer subjects at TUM or gathered via online platforms provided by the institute. Furthermore, perception studies are planned in which participants classify samples of real or synthetic speech with respect to certain attributes and speaker states. The performance of automatic systems will then be compared to the recognition accuracy of human subjects.
Any project results will be made available to the research community in the form of publications and open-source software toolkits in order to foster collaboration between teams.
S. Hantke, Z. Zhang, and B. Schuller, “Towards Intelligent Crowdsourcing for Audio Data Annotation: Integrating Active Learning in the Real World,” in Proceedings INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association, ISCA, Stockholm, Sweden, August 2017, pp. 3951–3955. [pdf] [bib]
S. Amiriparian, S. Pugachevskiy, N. Cummins, S. Hantke, J. Pohjalainen, G. Keren, and B. Schuller, “CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms,” in Proceedings 7th biannual Conference on Affective Computing and Intelligent Interaction (ACII 2017), IEEE, San Antonio, TX, October 2017, pp. 340–345. [pdf] [bib]
S. Hantke, A. Batliner, and B. Schuller, “Ethics for Crowdsourced Corpus Collection, Data Annotation and its Application in the Web-based Game iHEARu-PLAY,” in Proceedings 1st International Workshop on ETHics In Corpus Collection, Annotation and Application (ETHI-CA2 2016), satellite of the 10th Language Resources and Evaluation Conference (LREC 2016), ELRA, Portoroz, Slovenia, May 2016, pp. 54–59. [pdf] [bib]
S. Hantke, F. Eyben, T. Appel, and B. Schuller, “iHEARu-PLAY: Introducing a game for crowdsourced data collection for affective computing,” in Proceedings 1st International Workshop on Automatic Sentiment Analysis in the Wild (WASA 2015), held in conjunction with the 6th biannual Conference on Affective Computing and Intelligent Interaction (ACII 2015), IEEE, Xi’an, P. R. China, 2015. [pdf] [bib]
We would be very happy if researchers working with iHEARu-PLAY informed us about their scientific publications: please send us an email with the full citation of your paper in the body of the message and, if possible, with the PDF file attached. After checking, publications will be listed here to ease access for other researchers interested in iHEARu-PLAY.
S. Hantke, H. Sagha, N. Cummins, and B. Schuller, “Emotional Speech of Mentally and Physically Disabled Individuals: Introducing The EmotAsS Database and First Findings,” in Proceedings INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association, ISCA, Stockholm, Sweden, August 2017, pp. 3137–3141. [pdf] [bib]
A. Baird, S. H. Jorgensen, E. Parada-Cabaleiro, S. Hantke, N. Cummins, and B. Schuller, “Perception of Paralinguistic Traits in Synthesized Voices,” in Proceedings of the 11th Audio Mostly Conference on Interaction with Sound (Audio Mostly), ACM, London, UK, August 2017. 5 pages, to appear [pdf] [bib]
E. Parada-Cabaleiro, A. Baird, A. Batliner, N. Cummins, S. Hantke, and B. Schuller, “The Perception of Emotions in Noisified Non-Sense Speech,” in Proceedings INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association, ISCA, Stockholm, Sweden, August 2017, pp. 3246–3250. [pdf] [bib]
B. Schuller, J.-G. Ganascia, and L. Devillers, “Multimodal Sentiment Analysis in the Wild: Ethical considerations on Data Collection, Annotation, and Exploitation,” in Proceedings of the 1st International Workshop on ETHics In Corpus Collection, Annotation and Application (ETHI-CA2 2016), satellite of the 10th Language Resources and Evaluation Conference (LREC 2016), ELRA, Portoroz, Slovenia, May 2016, pp. 29–34. [pdf] [bib]
We kindly thank the authors of the open-source tools WEKA and openSMILE for allowing us to make use of their applications. We further thank audEERING GmbH for sponsoring access to sensAI, their world-leading tool for identifying emotion and affective states from the voice. We would also like to thank Dr. Zixing Zhang for providing the baseline code for the integrated active learning algorithms.
Eibe Frank, Mark A. Hall, and Ian H. Witten (2016): “The WEKA Workbench”, Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan Kaufmann, Fourth Edition, 2016.
Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller: “Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor”, in Proceedings ACM Multimedia (MM), ACM, Barcelona, Spain, October 2013, pp. 835–838. ISBN 978-1-4503-2404-5, doi:10.1145/2502081.2502224