Keynote Speakers

Towards Robust Audio Deepfake Detection and Attribution

Prof. Jianhua Tao – Monday (Dec 2)

Abstract: Audio deepfake detection has attracted increasing attention. Although previous studies have made some attempts at audio deepfake detection and attribution, the generalization and robustness of existing models remain poor when they are evaluated on mismatched datasets containing multiple unseen attacks such as VALL-E and GPT-4o. This talk will provide an overview of recent progress in audio deepfake detection and attribution, with a particular emphasis on how to improve the robustness of these models and make them more reliable in real-world applications. The talk will also provide a more comprehensive understanding of the reasons behind the models' discrimination decisions, helping users understand the detection process and build trust in anti-deepfake technologies.
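
To make the two tasks concrete, the sketch below pairs a shared audio encoder with a detection head (real vs. fake) and a closed-set attribution head (which known generator produced the clip). It is a generic illustration under assumed feature and label choices, not the speaker's system.

# A minimal sketch (not the speaker's system) of the two tasks the abstract
# names: detection (is this clip real or generated?) and attribution (which
# generator produced it?). A shared encoder feeds two classification heads;
# feature type, sizes, and the set of known generators are assumptions here.
import torch
import torch.nn as nn

class DetectAndAttribute(nn.Module):
    def __init__(self, n_feats=60, hidden=128, n_known_generators=5):
        super().__init__()
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True)
        self.detect = nn.Linear(hidden, 2)                       # real vs. fake
        self.attribute = nn.Linear(hidden, n_known_generators)   # which known attack

    def forward(self, feats):            # feats: (batch, time, n_feats), e.g. LFCC frames
        _, h = self.encoder(feats)       # h: (1, batch, hidden), final hidden state
        h = h.squeeze(0)
        return self.detect(h), self.attribute(h)

model = DetectAndAttribute()
feats = torch.randn(4, 300, 60)          # a batch of random stand-in features
detection_logits, attribution_logits = model(feats)
# Unseen attacks (e.g. new TTS systems) are exactly where such closed-set
# attribution heads break down, which is the robustness gap the talk addresses.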

Biography: Prof. Tao is a Professor at Tsinghua University. He was Deputy Director of the National Laboratory of Pattern Recognition from 2014 to 2022 and Director of the Sino-European Laboratory of Informatics, Automation and Applied Mathematics (LIAMA) from 2015 to 2022. Prof. Tao is a recognized scholar in speech and language processing, multimodal human-computer interaction, and affective computing. He was elected Chairperson of the ISCA SIG-CSLP (2019-2020) and served as Technical Program Chair of INTERSPEECH 2020. He is a Fellow of the China Computer Federation (CCF). He has published more than 300 papers in venues including IEEE TPAMI, TASLP, TAC, PR, NIPS, ICML, AAAI, and ICASSP. His recent awards include the NSFC Award for Distinguished Young Scholars (2014), the National Special Support Program for High-Level Talents (2018), Best Paper Awards of NCMMSC (2001, 2015, 2017), and Best Paper Awards of CHCI (2011, 2013, 2015, 2016). He has delivered numerous invited and keynote talks at venues such as Speech Prosody (2012, 2018) and NCMMSC (2017). He was also an elected member of the Executive Committee of the AAAC (2007-2017) and served on the Steering Committee of the IEEE Transactions on Affective Computing (2009-2017). He currently serves as an ISCA Board member, Subject Editor of Speech Communication, and Editorial Board Member of the Journal on Multimodal User Interfaces.

End-to-End Audio Processing: From On-Device Models to LLMs

Dr. Tara N. Sainath – Tuesday (Dec 3)

Abstract: End-to-end (E2E) speech recognition has become a popular research paradigm in recent years, allowing the modular components of a conventional speech recognition system (acoustic model, pronunciation model, language model) to be replaced by a single neural network. In this talk, we will discuss a multi-year research journey of E2E modeling for speech recognition at Google. This journey started with building E2E models that surpass the performance of conventional models across many different quality and latency metrics, as well as the productionization of E2E models for the Pixel 4, 5, and 6 phones. We then looked at expanding these models in terms of both size and language coverage. Towards this, we will touch on the Universal Speech Model, as well as more open-ended audio tasks achievable with large language models (LLMs).
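
As a concrete illustration of the E2E idea, the minimal sketch below maps audio features straight to output tokens with a single network trained under a CTC objective. The architecture, feature type, and vocabulary are illustrative assumptions, not Google's production recipe.

# A minimal sketch of the E2E idea the talk describes: one neural network maps
# audio features directly to text tokens, replacing the separate acoustic,
# pronunciation, and language models. Sizes and features are illustrative.
import torch
import torch.nn as nn

class TinyE2EASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab=32):   # vocab includes blank at index 0
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.to_tokens = nn.Linear(hidden, vocab)

    def forward(self, mel):                  # mel: (batch, time, n_mels)
        enc, _ = self.encoder(mel)
        return self.to_tokens(enc).log_softmax(dim=-1)   # (batch, time, vocab)

model = TinyE2EASR()
ctc = nn.CTCLoss(blank=0)

mel = torch.randn(4, 200, 80)                # a batch of random log-mel "spectrograms"
targets = torch.randint(1, 32, (4, 20))      # random token ids (never the blank)
log_probs = model(mel).transpose(0, 1)       # CTCLoss expects (time, batch, vocab)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 20))
loss.backward()                              # gradients for one training step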

Biography: Dr. Tara Sainath is a leading expert in speech recognition and deep neural networks, holding S.B., M.Eng., and Ph.D. degrees in Electrical Engineering and Computer Science from MIT. After stints at the IBM T.J. Watson Research Center, she now serves as a Tech Lead in Google DeepMind's Audio Pillar, integrating audio capabilities with large language models (LLMs). Dr. Sainath's leadership is exemplified by her roles as Program Chair for ICLR (2017, 2018) and her extensive work co-organizing influential conferences and workshops. Her contributions to the field have been recognized with an IEEE Fellowship, the 2021 IEEE SPS Industrial Innovation Award, and the 2022 IEEE SPS Signal Processing Magazine Best Paper Award.

Large Language-Audio Models and Applications

Prof. Wenwu Wang – Wednesday (Dec 4)

Abstract: Large Language Models (LLMs) are being explored in audio processing to interpret and generate meaningful patterns from complex sound data, such as speech, music, environmental noise, sound effects, and other non-verbal audio. Combined with acoustic models, LLMs offer great potential for addressing a variety of problems in audio processing, such as audio captioning, audio generation, source separation, and audio coding. This talk will cover recent advances in using LLMs to address audio-related challenges. Topics will include language-audio models for mapping and aligning audio with textual data, their applications across various audio tasks, the creation of language-audio datasets, and potential future directions in language-audio learning. We will demonstrate our recent works in this area, for example AudioLDM, AudioLDM2 and WavJourney for audio generation and storytelling, AudioSep for audio source separation, ACTUAL for audio captioning, SemantiCodec for audio coding, WavCraft for content creation and editing, and APT-LLMs for audio reasoning, as well as the datasets WavCaps, Sound-VECaps, and AudioSetCaps for training and evaluating large language-audio models.
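
For readers unfamiliar with how audio is "mapped and aligned" with text, the sketch below shows one common recipe: a symmetric contrastive loss over paired audio and caption embeddings, in the spirit of language-audio pretraining. The encoders, dimensions, and temperature are placeholder assumptions rather than details of the systems listed above.

# A minimal sketch of aligning audio with textual data: a symmetric contrastive
# objective that pulls matching audio/caption embedding pairs together. The
# random tensors stand in for the outputs of pretrained audio and text encoders.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature               # pairwise cosine similarities
    labels = torch.arange(a.size(0))             # the i-th clip matches the i-th caption
    return (F.cross_entropy(logits, labels) +        # audio -> text direction
            F.cross_entropy(logits.T, labels)) / 2   # text -> audio direction

# Toy usage with random "embeddings" standing in for encoder outputs.
audio_emb = torch.randn(8, 512, requires_grad=True)
text_emb = torch.randn(8, 512, requires_grad=True)
loss = contrastive_alignment_loss(audio_emb, text_emb)
loss.backward()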

Biography: Wenwu Wang is a Professor in Signal Processing and Machine Learning at the University of Surrey, UK. He is also an AI Fellow at the Surrey Institute for People-Centred Artificial Intelligence. His current research interests include signal processing, machine learning and perception, artificial intelligence, machine audition (listening), and statistical anomaly detection. He has (co-)authored over 300 papers in these areas and is a (co-)author or (co-)recipient of more than 15 awards, including the 2022 IEEE Signal Processing Society Young Author Best Paper Award, the ICAUS 2021 Best Paper Award, the DCASE 2020 and 2023 Judges' Award, the DCASE 2019 and 2020 Reproducible System Award, and the LVA/ICA 2018 Best Student Paper Award. He is an Associate Editor (2020-2025) for IEEE/ACM Transactions on Audio, Speech, and Language Processing and an Associate Editor (2024-2026) for IEEE Transactions on Multimedia. He was a Senior Area Editor (2019-2023) and Associate Editor (2014-2018) for IEEE Transactions on Signal Processing. He is the elected Chair (2023-2024) of the IEEE Signal Processing Society (SPS) Machine Learning for Signal Processing Technical Committee, a Board Member (2023-2024) of the IEEE SPS Technical Directions Board, the elected Chair (2025-2027) and Vice Chair (2022-2024) of the EURASIP Technical Area Committee on Acoustic, Speech and Music Signal Processing, and an elected Member (2021-2026) of the IEEE SPS Signal Processing Theory and Methods Technical Committee. He has served on the organising committees of INTERSPEECH 2022, IEEE ICASSP 2019 & 2024, IEEE MLSP 2013 & 2024, and SSP 2009, and is Technical Program Co-Chair of IEEE MLSP 2025. He has been an invited Keynote or Plenary Speaker at more than 20 international conferences and workshops.

A Theory of Unsupervised Speech Recognition

Prof. Mark Hasegawa-Johnson – Thursday (Dec 5)

Abstract: An unsupervised automatic speech recognizer (UASR) is a learner that observes a corpus of text, and a corpus of untranscribed speech in the same language, and infers a mapping from speech to text based on similarities in the sequence statistics between the two.  Wang demonstrated in 2023 that if the transition probabilities in natural language are chosen randomly from a subgaussian prior, then UASR is possible with probability one.  Unfortunately, while the subgaussian assumption is plausible for phonemes, it is not plausible for words, whose frequencies follow a Zipf distribution.  In 2008, Chan proposed that the Zipf distribution “can be exploited effectively in language acquisition” by a rule-based model that acquires new rules only when they are necessary to explain previously unexplained features of the data.  When applied to speech, I claim that Chan’s model implies a closed-loop speech chain in which the UASR learns the most frequent units, which are then used to resynthesize speech, permitting further refinement of the ASR.  We have demonstrated that unsupervised text-to-speech is intelligible, that it can be used to generate additional training data for self-supervised learning, and that unsupervised units can be used to learn a better grapheme-to-phoneme transduction of text.
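
The small numerical sketch below illustrates the Zipfian point: under a 1/rank frequency law, a handful of the most frequent words already covers most running tokens, which is why a learner that first acquires only the most frequent units can still explain the bulk of the data. The vocabulary size and exponent are illustrative assumptions, not figures from the talk.

# A small illustration (assumed parameters, not from the talk) of why the Zipf
# distribution matters here: under f(rank) ∝ 1/rank, the most frequent words
# cover a large share of running text, so acquiring the most frequent units
# first already explains most of the data, leaving a long tail of rare words
# that a subgaussian prior over word frequencies would not capture.
import numpy as np

V = 50_000                                   # assumed vocabulary size
ranks = np.arange(1, V + 1)
zipf = (1.0 / ranks) / np.sum(1.0 / ranks)   # Zipf law with exponent s = 1, normalized

for k in (100, 1_000, 10_000):
    coverage = zipf[:k].sum()
    print(f"top {k:>6,} word ranks cover {coverage:.1%} of tokens")
# Under this model the top 1,000 ranks already cover well over half of all
# tokens, while the remaining tens of thousands of rare words form the tail.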

Biography: Mark A. Hasegawa-Johnson is the M.E. Van Valkenburg Professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign.  His research converts facts about speech production into low-resource transfer learning algorithms that can be used to make speech technology more fair, more inclusive, and more accessible.  His research has been featured in online stories by the Wall Street Journal, CNN, CNET, and The Atlantic.  Dr. Hasegawa-Johnson is a Fellow of the IEEE, of the Acoustical Society of America, and of the International Speech Communication Association, and he is currently Deputy Editor of the IEEE Transactions on Audio, Speech, and Language Processing.