What we do

Over the last few years, tremendous progress has been made in the field of voice and speaker recognition, and the corresponding technologies have since made their way into everyday life, e.g. when we control smartphones by voice or talk to robots on the phone that guide us through service hotlines. In the coming years and decades, speech processing technologies are expected to reach the next stage: they will gain social skills that make dialogs more intuitive and searches more efficient, and that make speech analysis systems in the commercial, healthcare and security sectors more robust and universal. One essential requirement is that speech analysis systems not only recognize the textual content of verbal expressions ("what is said") but also understand the paralinguistic content ("who says it, how, where, and in what context"). To reach this goal, it is necessary to reliably detect speaker attributes and states such as age, gender, height, accent/dialect, personality, state of health, tiredness, sobriety and affective state.

The research project iHEARu, short for "Intelligent systems' Holistic Evolving Analysis of Real-life Universal speaker characteristics", aims to push the limits of intelligent speaker analysis by developing new speech analysis methods. These include machine learning techniques that recognize multiple attributes at the same time and leverage any correlations between them (e.g. BLSTM-RNN or multi-task linear regression), as well as semi-supervised learning methods that allow large training datasets to be used without much annotation effort.

New realistic speech data is collected from public sources such as video streaming sites and media reports and partially annotated through crowdsourcing. In addition, existing datasets that are already available for research purposes are used, and new speech data is recorded from volunteers at TUM or gathered via online platforms provided by the institute. Furthermore, perception studies are planned in which participants are asked to classify samples of real or synthetic speech with regard to certain attributes and speaker states. The performance of automatic systems will then be compared to the recognition accuracy of human subjects.

Any project results will be made available to the research community in the form of publications and open-source software toolkits in order to foster collaboration between teams.

Publications on iHEARu-PLAY

Publications using iHEARu-PLAY

We would be very happy if researchers working with iHEARu-PLAY informed us of their scientific publications: please send us an email with the full citation of your paper in the body of the message and, if possible, with the PDF file attached. After checking, publications will be listed here to ease access for other researchers interested in iHEARu-PLAY.

Integrated open-source tools

We thank the authors of the open-source tools WEKA [1] and openSMILE [2] for allowing us to make use of their applications. We further thank audEERING GmbH for sponsoring access to sensAI, their world-leading tool for identifying emotion and affective state from the voice. We would also like to thank Dr. Zixing Zhang for providing the baseline code for the integrated active learning algorithms.