SLT 2024 Tentative Program

(This is a tentative program, subject to minor adjustments.)

Program overview: https://2024.ieeeslt.org/program/

Day 1, Dec 2, Monday

S1: Opening Session (Day 1, Dec 2, Monday, 08:30 – 09:00)

K1: Keynote 1 (Day 1, Dec 2, Monday, 09:00 – 10:00)

S2: Coffee Break (Day 1, Dec 2, Monday, 10:00 – 10:30)

P1: Poster Session 1: Speech Recognition (Day 1, Dec 2, Monday, 10:30 – 12:30)

28: PromptKWS: A Novel Prompt-Guided Open-Vocabulary Keyword Spotting Framework 

Xu, Gaopeng; Li, Chengfei; Wang, Xianliang; Zhu, Li; Wei, Juan; Li, Wenpeng; Niu, Jianwei; Gao, Jie

43: Personalizing Large Sequence-to-Sequence Speech Foundation Models with Speaker Representations 

Wagner, Dominik; Baumann, Ilja; Ranzenberger, Thomas; Riedhammer, Korbinian; Bocklet, Tobias

97: Label-Looping: Highly Efficient Decoding for Transducers 

Bataev, Vladimir; Xu, Hainan; Galvez, Daniel; Lavrukhin, Vitaly; Ginsburg, Boris

102: Advancing Multi-Talker ASR Performance with Large Language Models 

Shi, Mohan; Jin, Zengrui; Xu, Yaoxun; Xu, Yong; Zhang, Shi-Xiong; Wei, Kun; Shao, Yiwen; Zhang, Chunlei; Yu, Dong

114: Token-Weighted RNN-T for Learning from Flawed Data 

Keren, Gil; Zhou, Wei; Kalinli, Ozlem

137: Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model 

Huang, Hukai; Lin, Jiayan; Wang, Kaidi; Li, Yishuang; Guan, Wenhao; Li, Lin; Hong, Qingyang

149: Language Bias in Self-Supervised Learning for Automatic Speech Recognition 

Storey, Ed; Harte, Naomi; Bell, Peter

150: Robust Audiovisual Speech Recognition Models with Mixture-of-Experts 

Wu, Yihan; Peng, Yifan; Lu, Yichen; Chang, Xuankai; Song, Ruihua; Watanabe, Shinji

162: Hybrid Attention-Based Encoder-Decoder Model for Efficient Language Model Adaptation 

Ling, Shaoshi; Ye, Guoli; Zhao, Rui; Gong, Yifan

165: SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-Channel Multi-Speaker ASR on Arbitrary Microphone Arrays 

Shao, Yiwen; Xu, Yong; Khudanpur, Sanjeev; Yu, Dong

171: Effective Text Adaptation for LLM-Based ASR Through Soft Prompt Fine-Tuning 

Ma, Yingyi; Liu, Zhe; Kalinli, Ozlem

173: Temporal Order Preserved Optimal Transport-Based Cross-Modal Knowledge Transfer Learning for ASR 

Lu, Xugang; Shen, Peng; Tsao, Yu; Kawai, Hisashi

187: Contextualized Automatic Speech Recognition with Dynamic Vocabulary 

Sudo, Yui; Fukumoto, Yosuke; Shakeel, Muhammad; Peng, Yifan; Watanabe, Shinji

209: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper 

Yang, Chih-Kai; Huang, Kuan-Po; Lee, Hung-yi

221: An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition 

Wang, Yi-Cheng; Pai, Li-Ting; Yan, Bi-Cheng; Wang, Hsin-Wei; Lin, Chi-Han; Chen, Berlin

300: Training Large ASR Encoders with Differential Privacy 

Chauhan, Geeticka; Chien, Steve; Thakkar, Om; Thakurta, Abhradeep; Narayanan, Arun

309: Transducer Consistency Regularization for Speech-to-Text Applications 

Tseng, Cindy S.; Tang, Yun; Apsingekar, Vijendra Raj

324: Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data 

Tseng, Liang-Hsuan; Chen, Zih-Ching; Chang, Weishun; Lee, Cheng-Kuang; Huang, Tsung-Ren; Lee, Hung-yi

389: CTC-Assisted LLM-Based Contextual ASR 

Yang, Guanrou; Ma, Ziyang; Gao, Zhifu; Zhang, Shiliang; Chen, Xie

53: Automatic Time Alignment Generation for End-to-End ASR Using Acoustic Probability Modeling 

Jiang, Dongcheng; Zhang, Chao; Woodland, Phil

73: Continual Learning with Embedding Layer Surgery and Task-Wise Beam 

Kwok, Chin Yuen; Yip, Jia Qi; Chng, Eng Siong

93: BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 

Huang, He; Chen, Zhehuai; Puvvada, Krishna C.; Żelasko, Piotr; Balam, Jagadeesh; Ginsburg, Boris; Koluguri, Nithin Rao; Hrinchuk, Oleksii

116: Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription 

Vieting, Peter; Berger, Simon; von Neumann, Thilo; Boeddeker, Christoph; Schlüter, Ralf; Haeb-Umbach, Reinhold

122: Mamba-Based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition 

Masuyama, Yoshiki; Miyazaki, Koichi; Murata, Masato

201: An Analysis of Linear Complexity Attention Substitutes with BEST-RQ 

Whetten, Ryan; Parcollet, Titouan; Moumen, Adel; Dinarelli, Marco; Estève, Yannick

214: Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models 

Gao, Xiaoxue; Chen, Nancy

357: Lite ASR Transformer: A Lightweight Transformer Architecture for Automatic Speech Recognition 

Sagaya Mary N J, Metilda; Umesh, S.

S3: Lunch (Day 1, Dec 2, Monday, 12:30 – 14:00)

T1: Invited Talk 1 (Day 1, Dec 2, Monday, 14:00 – 15:00)

P2: Poster Session 2: Speech Recognition and Enhancement (Day 1, Dec 2, Monday, 15:00 – 17:00)

22: Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition 

Shi, Hao; Gao, Yuan; Ni, Zhaoheng; Kawahara, Tatsuya

161: Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models 

Poncelet, Jakob; Wang, Yujun; Van hamme, Hugo

170: Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Multi-Task Automatic Speech Recognition Models 

Raina, Vyas; Gales, Mark

283: Improving Rare-Word Recognition of Whisper in Zero-Shot Settings 

Jogi, Yash; Aggarwal, Vaibhav; Nair, Shabari S.; Verma, Yash; Kubba, Aayush

343: Augmenting Automatic Speech Recognition Models with Disfluency Detection 

Amann, Robin; Li, Zhaolin; Bruno, Barbara; Niehues, Jan

246: Enhancing Unified Streaming and Non-Streaming ASR Through Curriculum Learning with Easy-to-Hard Tasks 

Yang, Yuting; Li, Yuke; Zhou, Lifeng; Du, Binbin; Zhu, Haoqi

78: DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition 

Shao, Hang; Liu, Bei; Wang, Wei; Gong, Xun; Qian, Yanmin

129: Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition 

Wang, Shih-Heng; Shi, Jiatong; Huang, Chien-Yu; Watanabe, Shinji; Lee, Hung-Yi

157: Longer Is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation 

Koluguri, Nithin Rao; Bartley, Travis M.; Xu, Hainan; Hrinchuk, Oleksii; Balam, Jagadeesh; Ginsburg, Boris; Kucsko, Georg

178: Semi-Supervised Learning for Code-Switching ASR with Large Language Model Filter 

Xi, Yu; Ding, Wen; Yu, Kai; Lai, Junjie

301: Parameter Averaging Is All You Need to Prevent Forgetting 

Plantinga, Peter W.; Yoo, Jaekwon; Girma, Abenezer G.; Dhir, Chandra

304: Advancing CTC Models for Better Speech Alignment: A Topological Approach 

Zhao, Zeyu; Bell, Peter

37: DualSep: A Lightweight Dual-Encoder Convolutional Recurrent Network for Real-Time In-Car Speech Separation 

Wang, Ziqian; Sun, Jiayao; Zhang, Zihan; Li, Xingchen; Liu, Jie; Xie, Lei

39: DDTSE: Discriminative Diffusion Model for Target Speech Extraction 

Zhang, Leying; Qian, Yao; Yu, Linfeng; Wang, Heming; Yang, Hemin; Liu, Shujie; Zhou, Long; Qian, Yanmin

89: An Investigation of Incorporating Mamba for Speech Enhancement 

Chao, Rong; Cheng, Wen-Huang; La Quatra, Moreno; Siniscalchi, Sabato M.; Yang, Chao-Han Huck; Fu, Szu-Wei; Tsao, Yu

117: Effective Noise-Aware Data Simulation for Domain-Adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation 

Wang, Chien-Chun; Chen, Li-Wei; Lee, Hung-Shin; Chen, Berlin; Wang, Hsin-Min

120: SMRU: Split-and-Merge Recurrent-Based UNet for Acoustic Echo Cancellation and Noise Suppression 

Sun, Zhihang; Li, Andong; Chen, Rilin; Zhang, Hao; Yu, Meng; Zhou, Yi; Yu, Dong

139: On the Effectiveness of Enrollment Speech Augmentation for Target Speaker Extraction 

Li, Junjie; Zhang, Ke; Wang, Shuai; Li, Haizhou; Mak, M. W.; Lee, Kong Aik

142: Diffusion-Based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement 

Li, Chenda; Cornell, Samuele; Watanabe, Shinji; Qian, Yanmin

216: NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Fusion 

De Silva, Dashanka D. N.; Cai, Siqi; Pahuja, Saurav; Schultz, Tanja; Li, Haizhou

325: Enhancing Speaker Extraction Through Rectifying Target Confusion 

Wang, Jiahe; Wang, Shuai; Li, Junjie; Zhang, Ke; Qian, Yanmin; Li, Haizhou

368: Diff-PLC: A Diffusion-Based Approach for Effective Packet Loss Concealment 

Yang, Da-Hee; Chang, Joon-Hyuk

369: Improving Curriculum Learning for Target Speaker Extraction with Synthetic Speakers 

Liu, Yun; Liu, Xuechen; Yamagishi, Junichi

212: FLANEC: Exploring Flan-T5 for Post-ASR Error Correction 

La Quatra, Moreno; Salerno, Valerio Mario; Tsao, Yu; Siniscalchi, Sabato Marco

415: Language Model-Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition 

Yang, Chao-Han Huck; Park, Tae Jin; Gong, Yuan; Li, Yuanchao; Lin, Yen-Ting; Chen, Zhehuai; Hu, Yuchen; Chen, Chen; Dhawan, Kunal; Żelasko, Piotr; Zhang, Chao; Chen, Yun-Nung; Tsao, Yu; Balam, Jagadeesh; Ginsburg, Boris; Siniscalchi, Sabato M.; Chng, Eng Siong; Bell, Peter; Lai, Catherine; Watanabe, Shinji; Stolcke, Andreas

400: FGCL: Fine-Grained Contrastive Learning for Mandarin Stuttering Event Detection 

Jiang, Han; Wang, Wenyu; Zhou, Yiquan; Ding, Hongwu; Jiacheng, Xu; Zhu, Jihua

402: Data Augmentation Techniques for Improved Performance in the SLT 2024 Mandarin Stuttering Event Detection and ASR Challenge 

Wang, Weiwei; Feng, Zhijin; Song, Qingyuan; Wei, Wenyang; Wang, Yansong

403: Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge 

Xue, Hongfei; Gong, Rong; Shao, Mingchen; Xu, Xin; Wang, Lezhi; Xie, Lei; Bu, Hui; Zhou, Jiaming; Qin, Yong; Du, Jun; Li, Ming; Zhang, Binbin; Jia, Bin

410: Enhanced ASR for Stuttering Speech: Combining Adversarial and Signal-Based Data Augmentation 

Huang, Shangkun; Zhang, Dejun; Deng, Jing; Zheng, Rong

S4: Coffee Break (Day 1, Dec 2, Monday, 17:00 – 17:30)

S5: TBD (Day 1, Dec 2, Monday, 17:30 – 18:30)

S6: Welcome Reception (Day 1, Dec 2, Monday, 19:00 – 20:30)

Day 2, Dec 3, Tuesday

K2: Keynote 2 (Day 2, Dec 3, Tuesday, 09:00 – 10:00)

S7: Coffee Break (Day 2, Dec 3, Tuesday, 10:00 – 10:30)

P3: Poster Session 3: Speech Processing (Day 2, Dec 3, Tuesday, 10:30 – 12:30)

66: Property Neurons in Self-Supervised Speech Transformers 

Lin, Tzu-Quan; Lin, Guan-Ting; Lee, Hung-Yi; Tang, Hao

145: Privacy Versus Emotion Preservation Trade-Offs in Emotion-Preserving Speaker Anonymization 

Cai, Zexin; Xinyuan, Henry Li; Garg, Ashi; Andrews, Nicholas O.; Garcia, Paola; Wiesner, Matthew S.; Duh, Kevin; Khudanpur, Sanjeev

217: Estimating the Completeness of Discrete Speech Units 

Yeh, Sung-Lin; Tang, Hao

314: Investigation of Speaker Representation for Target-Speaker Speech Processing 

Ashihara, Takanori; Moriya, Takafumi; Horiguchi, Shota; Peng, Junyi; Ochiai, Tsubasa; Delcroix, Marc; Matsuura, Kohei; Sato, Hiroshi

226: Crossmodal ASR Error Correction with Discrete Speech Units 

Li, Yuanchao; Chen, Pinzhen; Bell, Peter; Lai, Catherine

241: Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech-Integrated Large Language Models 

Lin, Yi-Cheng; Lin, Tzu-Quan; Yang, Chih-Kai; Lu, Ke-Han; Chen, Wei-Chih; Kuan, Chun-Yi; Lee, Hung-Yi

265: Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition 

Kim, Sungnyun; Jang, Kangwook; Bae, Sangmin; Kim, Hoirin; Yun, Se-Young

269: Data-Efficient Reflow for Few-Step Audio Generation 

Wu, Lemeng; Ni, Zhaoheng; Shi, Bowen; Hsu, Wei-Ning; Le Lan, Gael; Nagaraja, Varun; Kumar, Anurag; Mei, Xinhao; Xiong, Yunyang; Soran, Bilge; Krishnamoorthi, Raghuraman; Shi, Yangyang; Chandra, Vikas

99: Optimizing Byte-Level Representation for End-to-End ASR 

Hsiao, Roger; Deng, Liuhui; McDermott, Erik; Travadi, Ruchir; Zhuang, Xiaodan

179: Romanization Encoding for Multilingual ASR 

Ding, Wen; Jia, Fei; Xu, Hainan; Xi, Yu; Lai, Junjie; Ginsburg, Boris

254: Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection 

Yang, Tzu-Ting; Wang, Hsin-Wei; Wang, Yi-Cheng; Chen, Berlin

263: Language-Independent Prosody-Enhanced Speech Representations for Multilingual Speech Synthesis 

Liu, Chang; Ling, Zhen-Hua; Hu, Ya-Jun

359: Classification of Spontaneous and Scripted Speech for Multilingual Audio 

Elisha, Shahar; McDowell, Andrew J.; Beguerisse-Díaz, Mariano; Benetos, Emmanouil

40: GMP-TL: Gender-Augmented Multi-Scale Pseudo-Label Enhanced Transfer Learning for Speech Emotion Recognition 

Pan, Yu; Yang, Yuguang; Huang, Yuheng; Jin, Tiancheng; Yin, Jingjing; Hu, Yanni; Lu, Heng; Ma, Lei; Zhao, Jianjun

81: Embracing Ambiguity and Subjectivity Using the All-Inclusive Aggregation Rule for Evaluating Multi-Label Speech Emotion Recognition Systems 

Chou, Huang-Cheng; Wu, Haibin; Goncalves, Lucas; Leem, Seong-Gyun; Salman, Ali N.; Busso, Carlos; Lee, Hung-Yi; Lee, Chi-Chun

83: Open-Emotion: A Reproducible Emo-SUPERB for Speech Emotion Recognition Systems 

Wu, Haibin; Chou, Huang-Cheng; Chang, Kai-Wei; Goncalves, Lucas; Du, Jiawei; Jang, Jyh-Shing Roger; Lee, Chi-Chun; Lee, Hung-Yi

225: Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques 

Li, Yuanchao; Bell, Peter; Lai, Catherine

282: Beyond the Binary: Limitations and Possibilities of Gender-Related Speech Technology Research 

Sanchez, Ariadna; Ross, Alice; Markl, Nina

352: Enhancing Domain Generalization in Speech Emotion Recognition by Combining Domain-Variant Representations and Domain-Invariant Classifiers 

Lee, Shi-Wook

47: MDCTCodec: A Lightweight MDCT-Based Neural Audio Codec Towards High Sampling Rate and Low Bitrate Scenarios 

Jiang, Xiao-Hang; Ai, Yang; Zheng, Rui-Chen; Du, Hui-Peng; Lu, Ye-Xin; Ling, Zhen-Hua

51: Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder 

Guo, Haohan; Xie, Fenglong; Yang, Dongchao; Lu, Hui; Wu, Xixi; Meng, Helen

267: Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation 

Li, Jiaqi; Wang, Dongmei; Wang, Xiaofei; Qian, Yao; Zhou, Long; Liu, Shujie; Yousefi, Midia; Li, Canrun; Tsai, Chung-Hsien; Xiao, Zhen; Liu, Yanqing; Chen, Junkun; Zhao, Sheng; Li, Jinyu; Wu, Zhizheng; Zeng, Michael

280: ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech 

Shi, Jiatong; Tian, Jinchuan; Wu, Yihan; Jung, Jee-Weon; Yip, Jia Qi; Masuyama, Yoshiki; Chen, William; Wu, Yuning; Tang, Yuxun; Baali, Massa; Alharthi, Dareen; Zhang, Dong; Deng, Ruifan; Srivastava, Tejes; Wu, Haibin; Liu, Alexander H.; Raj, Bhiksha; Jin, Qin; Song, Ruihua; Watanabe, Shinji

336: Codec-SUPERB @ SLT 2024: A Lightweight Benchmark for Neural Codec Models 

Wu, Haibin; Chen, Xuanjun; Lin, Yi-Cheng; Du, Jiawei; Chang, Kai-Wei; Lu, Ke-Han; Liu, Alexander H.; Chung, Ho Lam; Wu, Yuan-Kuei; Yang, Dongchao; Liu, Songxiang; Wu, Yi-Chiao; Tan, Xu; Glass, James; Watanabe, Shinji; Lee, Hung-Yi

61: Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge 

Liu, Shuiyun; Kong, Yuxiang; Guo, Pengcheng; Zhuang, Weiji; Gao, Peng; Wang, Yujun; Xie, Lei

234: PB-LRDWWS System for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge 

Wang, Shiyao; Zhou, Jiaming; Zhao, Shiwan; Qin, Yong

404: Summary of Low-Resource Dysarthria Wake-Up Word Spotting Challenge 

Gao, Ming; Chen, Hang; Du, Jun; Xu, Xin; Guo, Hongxiao; Bu, Hui; Li, Ming; Lee, Chin-Hui

417: Domain Adaptation and Unified Knowledge Base Motivate Better Retrieval Models in Dialog Systems with RAG 

Lin, Huadong; Chen, Yirong; Tao, Wenyu; Chen, Mingyu; Xu, Xiangmin; Xing, Xiaofen

S8: Lunch (Day 2, Dec 3, Tuesday, 12:30 – 14:00)

T2: Invited Talk 2 (Day 2, Dec 3, Tuesday, 14:00 – 15:00)

P4: Poster Session 4: Speech Synthesis (Day 2, Dec 3, Tuesday, 15:00 – 17:00)

21: AS-Speech: Adaptive Style for Speech Synthesis 

Li, Zhipeng; Xing, Xiaofen; Wang, Jun; Chen, Shuaiqi; Yu, Guoqiao; Wan, Guanglu; Xu, Xiangmin

31: Room Impulse Responses Help Attackers Evade Deep Fake Detection 

Luong, Hieu-Thi; Truong, Duc-Tuan; Lee, Kong Aik; Chng, Eng Siong

35: Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech 

Wang, Hankun; Du, Chenpeng; Guo, Yiwei; Wang, Shuai; Chen, Xie; Yu, Kai

46: Stage-Wise and Prior-Aware Neural Speech Phase Prediction 

Liu, Fei; Ai, Yang; Du, Hui-Peng; Lu, Ye-Xin; Zheng, Rui-Chen; Ling, Zhen-Hua

52: SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model-Based Text-to-Speech Synthesis 

Guo, Haohan; Xie, Fenglong; Yang, Dongchao; Wu, Xixi; Meng, Helen; Xie, Kun; Guo, Dake

56: Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits 

Huang, Sung-Feng; Kuo, Heng-Cheng; Chen, Zhehuai; Yang, Xuesong; Yang, Chao-Han Huck; Tsao, Yu; Wang, Yu-Chiang Frank; Lee, Hung-Yi; Fu, Szu-Wei

62: DNN-Based Ensemble Singing Voice Synthesis with Interactions Between Singers 

Hyodo, Hiroaki; Takamichi, Shinnosuke; Nakamura, Tomohiko; Koguchi, Junya; Saruwatari, Hiroshi

87: Investigating Disentanglement in a Phoneme-Level Speech Codec for Prosody Modeling 

Karapiperis, Sotirios; Ellinas, Nikolaos; Vioni, Alexandra; Oh, Junkwang; Jho, Gunu; Hwang, Inchul; Raptis, Spyros

128: InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself 

Zeng, Chang; Wang, Chunhui; Miao, Xiaoxiao; Zhao, Jian; Jiang, Zhonglin; Chen, Yong

166: E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS 

Eskimez, Sefik Emre; Wang, Xiaofei; Thakker, Manthan; Li, Canrun; Tsai, Chung-Hsien; Xiao, Zhen; Yang, Hemin; Zhu, Zirun; Tang, Min; Tan, Xu; Liu, Yanqing; Zhao, Sheng; Kanda, Naoyuki

167: Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech 

Wu, Haibin; Wang, Xiaofei; Eskimez, Sefik Emre; Thakker, Manthan; Tompkins, Daniel; Tsai, Chung-Hsien; Li, Canrun; Xiao, Zhen; Zhao, Sheng; Li, Jinyu; Kanda, Naoyuki

184: Disentangling the Prosody and Semantic Information with Pre-Trained Model for In-Context Learning-Based Zero-Shot Voice Conversion 

Chen, Zhengyang; Wang, Shuai; Zhang, Mingyang; Liu, Xuechen; Yamagishi, Junichi; Qian, Yanmin

195: NDVQ: Robust Neural Audio Codec with Distribution-Based Vector Quantization 

Niu, Zhikang; Chen, Sanyuan; Zhou, Long; Ma, Ziyang; Chen, Xie; Liu, Shujie

228: Fast, High-Quality, and Parameter-Efficient Articulatory Synthesis Using Differentiable DSP 

Liu, Yisi; Yu, Bohan; Lin, Drake; Wu, Peter; Cho, Cheol Jun; Anumanchipalli, Gopala Krishna

299: VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation 

Yu, Yifeng; Shi, Jiatong; Wu, Yuning; Tang, Yuxun; Watanabe, Shinji

316: End-to-End Streaming Model for Low-Latency Speech Anonymization 

Quamer, Waris; Gutierrez-Osuna, Ricardo

326: Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids’ Story Speech Synthesis 

Chung, Raymond

332: Discrete Unit-Based Masking for Improving Disentanglement in Voice Conversion 

Lee, Philip; Ülgen, İsmail Rasim; Sisman, Berrak

345: Cross-Dialect Text-to-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT 

Yamauchi, Kazuki; Saito, Yuki; Saruwatari, Hiroshi

353: Leveraging Diverse Semantic-Based Audio Pretrained Models for Singing Voice Conversion 

Zhang, Xueyao; Fang, Zihao; Gu, Yicheng; Chen, Haopeng; Zou, Lexiao; Zhang, Junan; Xue, Liumeng; Wu, Zhizheng

394: TTSDS: Text-to-Speech Distribution Score 

Minixhofer, Christoph D.; Klejch, Ondřej; Bell, Peter

396: The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction 

Huang, Wen-Chin; Fu, Szu-Wei; Cooper, Erica; Zezario, Ryandhimas E.; Toda, Tomoki; Wang, Hsin-Min; Yamagishi, Junichi; Tsao, Yu

406: Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion 

Shi, Yu-Fei; Ai, Yang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua

407: The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech 

Baba, Kaito; Nakata, Wataru; Saito, Yuki; Saruwatari, Hiroshi

261: Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024 

Guragain, Anmol; Liu, Tianchi; Pan, Zihan; Sailor, Hardik B.; Wang, Qiongqiong

323: SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge 

Zhang, You; Zang, Yongyi; Shi, Jiatong; Yamamoto, Ryuichi; Toda, Tomoki; Duan, Zhiyao

348: XWSB: A Blend System Utilizing XLS-R and WavLM with SLS Classifier Detection System for SVDD 2024 Challenge 

Qi, Shan Zhang; Wen, Shuangbing; Yan, Fangke; Hu, Tao; Li, Jun

416: Integrating Self-Supervised Pre-Training with Adversarial Learning for Synthesized Song Detection 

Wang, Yankai; Du, Yuxuan; Zhang, Dejun; Zheng, Rong; Deng, Jing

S9: Coffee Break (Day 2, Dec 3, Tuesday, 17:00 – 17:30)

S10: TBD (Day 2, Dec 3, Tuesday, 17:30 – 18:30)

Day 3, Dec 4, Wednesday

K3: Keynote 3 (Day 3, Dec 4, Wednesday, 09:00 – 10:00)

S11: Coffee Break (Day 3, Dec 4, Wednesday, 10:00 – 10:30)

P5: Poster Session 5: Machine Learning & Resources (Day 3, Dec 4, Wednesday, 10:30 – 12:30)

15: HeightCeleb: An Enrichment of VoxCeleb Dataset with Speaker Height Information 

Kacprzak, Stanisław; Kowalczyk, Konrad

57: ESPnet-EZ: Python-Only ESPnet for Easy Fine-Tuning and Integration 

Someki, Masao; Choi, Kwanghee; Arora, Siddhant; Chen, William; Cornell, Samuele; Han, Jionghao; Peng, Yifan; Shi, Jiatong; Srivastav, Vaibhav; Watanabe, Shinji

106: Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models 

Lin, Yi-Cheng; Chen, Wei-Chih; Lee, Hung-Yi

124: Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit 

Zhang, Xueyao; Xue, Liumeng; Gu, Yicheng; Wang, Yuancheng; Li, Jiaqi; He, Haorui; Wang, Chaoren; Liu, Songting; Chen, Xi; Zhang, Junan; Fang, Zihao; Chen, Haopeng; Tang, Tze Ying; Zou, Lexiao; Wang, Mingxuan; Han, Jun; Chen, Kai; Li, Haizhou; Wu, Zhizheng

126: Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation 

He, Haorui; Shang, Zengqiang; Wang, Chaoren; Li, Xuyuan; Gu, Yicheng; Hua, Hua; Liu, Liwei; Yang, Chen; Li, Jiaqi; Shi, Peiyang; Wang, Yuancheng; Chen, Kai; Zhang, Pengyuan; Wu, Zhizheng

191: FLORAS 50: A Massively Multilingual Multitask Benchmark for Long-Form Conversational Speech 

Chen, William; Yan, Brian; Chen, Chih-Chen; Watanabe, Shinji

295: Massively Multilingual Forced Aligner Leveraging Self-Supervised Discrete Units 

Inaguma, Hirofumi; Kulikov, Ilia; Ni, Zhaoheng; Popuri, Sravya; Tomasello, Paden P.

298: Speech Recognition for Analysis of Police Radio Communication 

Srivastava, Tejes; Chou, Ju-Chieh; Shroff, Priyank; Livescu, Karen; Graziul, Christopher

307: Large Language Models as User Agents for Evaluating Task-Oriented Dialogue Systems 

Kazi, Taaha; Lyu, Ruiliang; Zhou, Sizhe; Hakkani-Tur, Dilek; Tur, Gokhan

334: DFADD: The Diffusion and Flow-Matching-Based Audio Deepfake Dataset 

Du, Jiawei; Lin, I-Ming; Chiu, I-Hsiang; Chen, Xuanjun; Wu, Haibin; Ren, Wenze; Tsao, Yu; Lee, Hung-Yi; Jang, Roger

384: SpMis: An Investigation of Synthetic Spoken Misinformation Detection 

Liu, Peizhuo; Wang, Li; Renqiang, He; He, Haorui; Wang, Lei; Zheng, Huadi; Shi, Jie; Xiao, Tong; Wu, Zhizheng

7: Self-Supervised Speech Models for Word-Level Stuttered Speech Detection 

Shih, Yi-Jen; Gkalitsiou, Zoi; Dimakis, Alex; Harwath, David

30: Enhancing Automatic Speech Assessment Leveraging Heterogeneous Features and Soft Labels for Ordinal Classification 

Peng, Wen Hsuan; Chen, Sally; Chen, Berlin

92: Speech Recognition-Based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech 

Lee, Jeehyun; Choi, Yerin; Koo, Myoung-Wan

144: Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget 

Liu, Andy T.; Lin, Yi-Cheng; Wu, Haibin; Winkler, Stefan; Lee, Hung-Yi

175: Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models 

Zheng, Xinhu; Jiang, Anbai; Han, Bing; Qian, Yanmin; Fan, Pingyi; Liu, Jia; Zhang, Wei-Qiang

233: Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis 

Nguyen, Tuan Manh; Fredouille, Corinne; Ghio, Alain; Balaguer, Mathieu; Woisard, Virginie

249: Hierarchical Multi-Path and Multi-Model Selection for Fake Speech Detection 

Feng, Chang; Zhao, Yiyang; Sun, Guangzhi; Chen, Zehua; Wang, Shuai; Zhang, Chao; Xu, Mingxing; Zheng, Thomas Fang

251: Semi-Supervised Learning for Robust Speech Evaluation 

Zhang, Huayun; Wong, Jeremy H. M.; Lin, Geyu; Chen, Nancy

264: GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-Shot Keyword Spotting 

Zhu, Pai; Bartel, Jacob W.; Agarwal, Dhruuv; Partridge, Kurt; Park, Hyun Jin; Wang, Quan

312: A Simple HMM with Self-Supervised Representations for Phone Segmentation 

Yang, Gene-Ping; Tang, Hao

331: DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners 

Bhati, Saurabhchand; Gong, Yuan; Karlinsky, Leonid; Kuehne, Hilde; Feris, Rogerio; Glass, James

337: RAND: Robustness-Aware Norm Decay for Quantized Neural Networks 

Qiu, David; Rim, David; Ding, Shaojin; Rybakov, Oleg; He, Yanzhang

377: SWIM: Short-Window CNN Integrated with Mamba for EEG-Based Auditory Spatial Attention Decoding 

Zhang, Ziyang; Thwaites, Andrew; Woolgar, Alexandra; Moore, Brian C. J.; Zhang, Chao

12: Automated Speaking Assessment of Conversation Tests with a Novel Graph-Based Modeling Method on Spoken Response Coherence 

Li, Jiun-Ting; Yan, Bi-Cheng; Lo, Tien-Hong; Wang, Yi-Cheng; Hsu, Yung-Chang; Chen, Berlin

115: Conditional Label Smoothing for LLM-Based Data Augmentation in Medical Text Classification 

Becker, Luca Manuel; Pracht, Philip; Sertdal, Peter; Uboreck, Jil; Bendel, Alexander; Martin, Rainer

180: Plan, Generate and Optimize: Extending Large Language Models for Dialogue Systems via Prompt-Based Collaborative Method 

Guo, Mengfei; Chen, Si; Huang, Yi; Feng, Junlan

202: Taming NLU Noise: Student-Teacher Learning for Robust Dialogue Policy 

Rohmatillah, Mahdin; Chien, Jen-Tzung

S12: Lunch (Day 3, Dec 4, Wednesday, 12:30 – 14:00)

T3: Invited Talk 3 (Day 3, Dec 4, Wednesday, 14:00 – 15:00)

PD1: Panel Discussion (Day 3, Dec 4, Wednesday, 15:00 – 17:00)

S13: Coffee Break (Day 3, Dec 4, Wednesday, 17:00 – 17:30)

D2: Demos (Day 3, Dec 4, Wednesday, 17:30 – 18:30)

S14: Gala Dinner (Day 3, Dec 4, Wednesday, 19:00 – 20:30)

Day 4, Dec 5, Thursday

K4: Keynote 4 (Day 4, Dec 5, Thursday, 09:00 – 10:00)

S15: Coffee Break (Day 4, Dec 5, Thursday, 10:00 – 10:30)

P6: Poster Session 6: Spoken Language Processing (Day 4, Dec 5, Thursday, 10:30 – 12:30)

65: Stutter-Solver: End-to-End Multi-Lingual Dysfluency Detection 

Zhou, Xuanru; Cho, Cheol Jun; Sharma, Ayati; Morin, Brittany; Baquirin, David; Vonk, Jet; Ezzes, Zoe; Miller, Zachary; Tee, Boon Lead; Gorno Tempini, Maria Luisa; Lian, Jiachen; Anumanchipalli, Gopala Krishna

105: ProGRes: Prompted Generative Rescoring on ASR N-Best 

Tur, Ada D.; Ravanelli, Mirco; Moumen, Adel

153: SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model 

Shams, Siavash; Dindar, Sukru Samet; Jiang, Xilin; Mesgarani, Nima

215: Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation 

Kuan, Chun-Yi; Yang, Chih-Kai; Huang, Wei-Ping; Lu, Ke-Han; Lee, Hung-Yi

174: CTC-GMM: CTC-Guided Modality Matching for Fast and Accurate Streaming Speech Translation 

Zhao, Rui; Li, Jinyu; Fan, Ruchao; Post, Matt

223: Long-Form End-to-End Speech Translation via Latent Alignment Segmentation 

Polák, Peter; Bojar, Ondrej

272: Confidence Estimation for LLM-Based Dialogue State Tracking 

Sun, Yijyun; Dey, Suvodip; Hakkani-Tur, Dilek; Tur, Gokhan

278: The 2nd FutureDial Challenge: Dialog Systems with Retrieval-Augmented Generation (FutureDial-RAG) 

Cai, Yucheng; Chen, Si; Wu, Yuxuan; Huang, Yi; Feng, Junlan; Ou, Zhijian

130: Zero-Shot Audio Topic Reranking Using Large Language Models 

Qian, Mengjie; Ma, Rao; Liusie, Adian; Loweimi, Erfan; Knill, Katherine M.; Gales, Mark

59: Clean Label Attacks Against SLU Systems 

Xinyuan, Henry Li; Thebaud, Thomas; Joshi, Sonal; Villalba, Jesus Antonio; Dehak, Najim; Khudanpur, Sanjeev

168: WHISMA: A Speech-LLM to Perform Zero-Shot Spoken Language Understanding 

Li, Mohan; Do, Cong-Thanh; Keizer, Simon; Farag, Youmna; Stoyanchev, Svetlana; Doddipatla, Rama S.

169: Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer 

Sunder, Vishal; Fosler-Lussier, Eric

208: Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT 

Komatsu, Ryota; Shinozaki, Takahiro

355: Just ASR + LLM? A Study on Speech Large Language Models’ Ability to Identify and Understand Speaker in Spoken Dialogue 

Wu, Junkai; Fan, Xulin; Lu, Bo-Ru; Jiang, Xilin; Mesgarani, Nima; Hasegawa-Johnson, Mark A.; Ostendorf, Mari

5: Enhancing Open-Set Speaker Identification Through Rapid Tuning with Speaker Reciprocal Points and Negative Samples 

Chen, Zhiyong; Ai, Zhiqi; Li, Xinnuo; Xu, Shugong

10: Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches 

Zeng, Chang; Miao, Xiaoxiao; Wang, Xin; Cooper, Erica; Yamagishi, Junichi

49: Adversarial Purification for Speaker Verification by Two-Stage Diffusion Models 

Bai, Yibo; XiaoLei, Zhang; Li, Xuelong

103: Measuring Sound Symbolism in Audio-Visual Models 

Tseng, Wei-Cheng; Shih, Yi-Jen; Harwath, David; Mooney, Raymond

111: Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes 

Kukanov, Ivan; Laakkonen, Janne; Kinnunen, Tomi H.; Hautamäki, Ville

135: On the Generation and Removal of Speaker Adversarial Perturbation for Voice Privacy Protection 

Guo, Chenyang; Chen, Liping; Li, Zhuhai; Lee, Kong Aik; Ling, Zhen-Hua; Guo, Wu

138: Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing 

Liu, Tianchi; Kukanov, Ivan; Pan, Zihan; Wang, Qiongqiong; Sailor, Hardik B.; Lee, Kong Aik

141: Enhance Low-Resource Spoken Language Identification via Cross-Modality Retrieval and Cross-Lingual Text-to-Speech Synthesis 

Ma, Min; Wang, Yuan; Kastner, Kyle; Caswell, Isaac; Yoon, Charles; Rosenberg, Andrew

148: Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings 

Horiguchi, Shota; Ando, Atsushi; Moriya, Takafumi; Ashihara, Takanori; Sato, Hiroshi; Tawara, Naohiro; Delcroix, Marc

259: PDAF: A Phonetic Debiasing Attention Framework for Speaker Verification 

Baali, Massa; Aldoobi, Abdulhamid; Dhamyal, Hira; Singh, Rita; Raj, Bhiksha

356: INX-SpeakerHub: A 2000-Hour Indian Multilingual Speaker Identification Corpus 

Sagaya Mary N J, Metilda; Umesh, S.

100: Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR 

Wang, Weiqing; Dhawan, Kunal; Park, Tae Jin; Puvvada, Krishna C.; Medennikov, Ivan; Majumdar, Somshubra; Huang, He; Balam, Jagadeesh; Ginsburg, Boris

405: Exploring Self-Supervised Representations for Text-Dependent Speaker Verification 

Sreekanth, Sankala

147: Distillation-Based Feature Extraction Algorithm for Source Speaker Verification 

Ma, Xinlei; Lu, Wenhuan; Zhang, Ruiteng; Xu, Junhai; Lu, Xugang; Wei, Jianguo

224: Speaker Contrastive Learning for Source Speaker Tracing 

Wang, Qing; Guo, Hongmei; Kang, Jian; Du, Mengjie; Li, Jie; XiaoLei, Zhang; Xie, Lei

411: The Database and Benchmark for the Source Speaker Tracing Challenge 2024 

Li, Ze; Lin, Yuke; Tian, Yao; Suo, Hongbin; Zhang, Pengyuan; Ren, Yanzhen; Cai, Zexin; Nishizaki, Hiromitsu; Li, Ming

S16: Closing Ceremony (Day 4, Dec 5, Thursday, 12:30 – 13:00)