SLT 2024 Tentative Program
(This is a tentative program, subject to minor adjustments.)
Program overview: https://2024.ieeeslt.org/program/
Day 1, Dec 2, Monday
S1: Opening Session (Day 1, Dec 2, Monday, 08:30 – 09:00)
K1: Keynote 1 (Day 1, Dec 2, Monday, 09:00 – 10:00)
S2: Coffee Break (Day 1, Dec 2, Monday, 10:00 – 10:30)
P1: Poster Session 1: Speech Recognition (Day 1, Dec 2, Monday, 10:30 – 12:30)
28: PromptKWS: A Novel Prompt-Guided Open-Vocabulary Keyword Spotting Framework
Xu, Gaopeng; Li, Chengfei; Wang, Xianliang; Zhu, Li; Wei, Juan; Li, Wenpeng; Niu, Jianwei; Gao, Jie
43: Personalizing Large Sequence-to-Sequence Speech Foundation Models with Speaker Representations
Wagner, Dominik; Baumann, Ilja; Ranzenberger, Thomas; Riedhammer, Korbinian; Bocklet, Tobias
97: Label-Looping: Highly Efficient Decoding for Transducers
Bataev, Vladimir; Xu, Hainan; Galvez, Daniel; Lavrukhin, Vitaly; Ginsburg, Boris
102: Advancing Multi-Talker ASR Performance with Large Language Models
Shi, Mohan; Jin, Zengrui; Xu, Yaoxun; Xu, Yong; Zhang, Shi-Xiong; Wei, Kun; Shao, Yiwen; Zhang, Chunlei; Yu, Dong
114: Token-Weighted RNN-T for Learning from Flawed Data
Keren, Gil; Zhou, Wei; Kalinli, Ozlem
137: Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model
Huang, Hukai; Lin, Jiayan; Wang, Kaidi; Li, Yishuang; Guan, Wenhao; Li, Lin; Hong, Qingyang
149: Language Bias in Self-Supervised Learning for Automatic Speech Recognition
Storey, Ed; Harte, Naomi; Bell, Peter
150: Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
Wu, Yihan; Peng, Yifan; Lu, Yichen; Chang, Xuankai; Song, Ruihua; Watanabe, Shinji
162: Hybrid Attention-Based Encoder-Decoder Model for Efficient Language Model Adaptation
Ling, Shaoshi; Ye, Guoli; Zhao, Rui; Gong, Yifan
165: SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-Channel Multi-Speaker ASR on Arbitrary Microphone Arrays
Shao, Yiwen; Xu, Yong; Khudanpur, Sanjeev; Yu, Dong
171: Effective Text Adaptation for LLM-Based ASR Through Soft Prompt Fine-Tuning
Ma, Yingyi; Liu, Zhe; Kalinli, Ozlem
173: Temporal Order Preserved Optimal Transport-Based Cross-Modal Knowledge Transfer Learning for ASR
Lu, Xugang; Shen, Peng; Tsao, Yu; Kawai, Hisashi
187: Contextualized Automatic Speech Recognition with Dynamic Vocabulary
Sudo, Yui; Fukumoto, Yosuke; Shakeel, Muhammad; Peng, Yifan; Watanabe, Shinji
209: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
Yang, Chih-Kai; Huang, Kuan-Po; Lee, Hung-yi
221: An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition
Wang, Yi-Cheng; Pai, Li-Ting; Yan, Bi-Cheng; Wang, Hsin-Wei; Lin, Chi-Han; Chen, Berlin
300: Training Large ASR Encoders with Differential Privacy
Chauhan, Geeticka; Chien, Steve; Thakkar, Om; Thakurta, Abhradeep; Narayanan, Arun
309: Transducer Consistency Regularization for Speech-to-Text Applications
Tseng, Cindy S.; Tang, Yun; Apsingekar, Vijendra Raj
324: Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data
Tseng, Liang-Hsuan; Chen, Zih-Ching; Chang, Weishun; Lee, Cheng-Kuang; Huang, Tsung-Ren; Lee, Hung-yi
389: CTC-Assisted LLM-Based Contextual ASR
Yang, Guanrou; Ma, Ziyang; Gao, Zhifu; Zhang, Shiliang; Chen, Xie
53: Automatic Time Alignment Generation for End-to-End ASR Using Acoustic Probability Modeling
Jiang, Dongcheng; Zhang, Chao; Woodland, Phil
73: Continual Learning with Embedding Layer Surgery and Task-Wise Beam
Kwok, Chin Yuen; Yip, Jia Qi; Chng, Eng Siong
93: BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
Huang, He; Chen, Zhehuai; Puvvada, Krishna C.; Żelasko, Piotr; Balam, Jagadeesh; Ginsburg, Boris; Koluguri, Nithin Rao; Hrinchuk, Oleksii
116: Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription
Vieting, Peter; Berger, Simon; von Neumann, Thilo; Boeddeker, Christoph; Schlüter, Ralf; Haeb-Umbach, Reinhold
122: Mamba-Based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition
Masuyama, Yoshiki; Miyazaki, Koichi; Murata, Masato
201: An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
Whetten, Ryan; Parcollet, Titouan; Moumen, Adel; Dinarelli, Marco; Estève, Yannick
214: Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models
Gao, Xiaoxue; Chen, Nancy
357: Lite ASR Transformer: A Lightweight Transformer Architecture for Automatic Speech Recognition
Sagaya Mary N J, Metilda; Umesh, S.
S3: Lunch (Day 1, Dec 2, Monday, 12:30 – 14:00)
T1: Invited Talk 1 (Day 1, Dec 2, Monday, 14:00 – 15:00)
P2: Poster Session 2: Speech Recognition and Enhancement (Day 1, Dec 2, Monday, 15:00 – 17:00)
22: Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition
Shi, Hao; Gao, Yuan; Ni, Zhaoheng; Kawahara, Tatsuya
161: Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models
Poncelet, Jakob; Wang, Yujun; Van hamme, Hugo
170: Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Multi-Task Automatic Speech Recognition Models
Raina, Vyas; Gales, Mark
283: Improving Rare-Word Recognition of Whisper in Zero-Shot Settings
Jogi, Yash; Aggarwal, Vaibhav; Nair, Shabari S.; Verma, Yash; Kubba, Aayush
343: Augmenting Automatic Speech Recognition Models with Disfluency Detection
Amann, Robin; Li, Zhaolin; Bruno, Barbara; Niehues, Jan
246: Enhancing Unified Streaming and Non-Streaming ASR Through Curriculum Learning with Easy-to-Hard Tasks
Yang, Yuting; Li, Yuke; Zhou, Lifeng; Du, Binbin; Zhu, Haoqi
78: DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition
Shao, Hang; Liu, Bei; Wang, Wei; Gong, Xun; Qian, Yanmin
129: Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition
Wang, Shih-Heng; Shi, Jiatong; Huang, Chien-Yu; Watanabe, Shinji; Lee, Hung-Yi
157: Longer Is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation
Koluguri, Nithin Rao; Bartley, Travis M.; Xu, Hainan; Hrinchuk, Oleksii; Balam, Jagadeesh; Ginsburg, Boris; Kucsko, Georg
178: Semi-Supervised Learning for Code-Switching ASR with Large Language Model Filter
Xi, Yu; Ding, Wen; Yu, Kai; Lai, Junjie
301: Parameter Averaging Is All You Need to Prevent Forgetting
Plantinga, Peter W.; Yoo, Jaekwon; Girma, Abenezer G.; Dhir, Chandra
304: Advancing CTC Models for Better Speech Alignment: A Topological Approach
Zhao, Zeyu; Bell, Peter
37: DualSep: A Lightweight Dual-Encoder Convolutional Recurrent Network for Real-Time In-Car Speech Separation
Wang, Ziqian; Sun, Jiayao; Zhang, Zihan; Li, Xingchen; Liu, Jie; Xie, Lei
39: DDTSE: Discriminative Diffusion Model for Target Speech Extraction
Zhang, Leying; Qian, Yao; Yu, Linfeng; Wang, Heming; Yang, Hemin; Liu, Shujie; Zhou, Long; Qian, Yanmin
89: An Investigation of Incorporating Mamba for Speech Enhancement
Chao, Rong; Cheng, Wen-Huang; La Quatra, Moreno; Siniscalchi, Sabato M.; Yang, Chao-Han Huck; Fu, Szu-Wei; Tsao, Yu
117: Effective Noise-Aware Data Simulation for Domain-Adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
Wang, Chien-Chun; Chen, Li-Wei; Lee, Hung-Shin; Chen, Berlin; Wang, Hsin-Min
120: SMRU: Split-and-Merge Recurrent-Based UNet for Acoustic Echo Cancellation and Noise Suppression
Sun, Zhihang; Li, Andong; Chen, Rilin; Zhang, Hao; Yu, Meng; Zhou, Yi; Yu, Dong
139: On the Effectiveness of Enrollment Speech Augmentation for Target Speaker Extraction
Li, Junjie; Zhang, Ke; Wang, Shuai; Li, Haizhou; Mak, M. W.; Lee, Kong Aik
142: Diffusion-Based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement
Li, Chenda; Cornell, Samuele; Watanabe, Shinji; Qian, Yanmin
216: NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Fusion
De Silva, Dashanka D. N.; Cai, Siqi; Pahuja, Saurav; Schultz, Tanja; Li, Haizhou
325: Enhancing Speaker Extraction Through Rectifying Target Confusion
Wang, Jiahe; Wang, Shuai; Li, Junjie; Zhang, Ke; Qian, Yanmin; Li, Haizhou
368: Diff-PLC: A Diffusion-Based Approach for Effective Packet Loss Concealment
Yang, Da-Hee; Chang, Joon-Hyuk
369: Improving Curriculum Learning for Target Speaker Extraction with Synthetic Speakers
Liu, Yun; Liu, Xuechen; Yamagishi, Junichi
212: FLANEC: Exploring Flan-T5 for Post-ASR Error Correction
La Quatra, Moreno; Salerno, Valerio Mario; Tsao, Yu; Siniscalchi, Sabato Marco
415: Language Model-Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
Yang, Chao-Han Huck; Park, Tae Jin; Gong, Yuan; Li, Yuanchao; Lin, Yen-Ting; Chen, Zhehuai; Hu, Yuchen; Chen, Chen; Dhawan, Kunal; Żelasko, Piotr; Zhang, Chao; Chen, Yun-Nung; Tsao, Yu; Balam, Jagadeesh; Ginsburg, Boris; Siniscalchi, Sabato M.; Chng, Eng Siong; Bell, Peter; Lai, Catherine; Watanabe, Shinji; Stolcke, Andreas
400: FGCL: Fine-Grained Contrastive Learning for Mandarin Stuttering Event Detection
Jiang, Han; Wang, Wenyu; Zhou, Yiquan; Ding, Hongwu; Jiacheng, Xu; Zhu, Jihua
402: Data Augmentation Techniques for Improved Performance in the SLT 2024 Mandarin Stuttering Event Detection and ASR Challenge
Wang, Weiwei; Feng, Zhijin; Song, Qingyuan; Wei, Wenyang; Wang, Yansong
403: Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge
Xue, Hongfei; Gong, Rong; Shao, Mingchen; Xu, Xin; Wang, Lezhi; Xie, Lei; Bu, Hui; Zhou, Jiaming; Qin, Yong; Du, Jun; Li, Ming; Zhang, Binbin; Jia, Bin
410: Enhanced ASR for Stuttering Speech: Combining Adversarial and Signal-Based Data Augmentation
Huang, Shangkun; Zhang, Dejun; Deng, Jing; Zheng, Rong
S4: Coffee Break (Day 1, Dec 2, Monday, 17:00 – 17:30)
S5: TBD (Day 1, Dec 2, Monday, 17:30 – 18:30)
S6: Welcome Reception (Day 1, Dec 2, Monday, 19:00 – 20:30)
Day 2, Dec 3, Tuesday
K2: Keynote 2 (Day 2, Dec 3, Tuesday, 09:00 – 10:00)
S7: Coffee Break (Day 2, Dec 3, Tuesday, 10:00 – 10:30)
P3: Poster Session 3: Speech Processing (Day 2, Dec 3, Tuesday, 10:30 – 12:30)
66: Property Neurons in Self-Supervised Speech Transformers
Lin, Tzu-Quan; Lin, Guan-Ting; Lee, Hung-Yi; Tang, Hao
145: Privacy Versus Emotion Preservation Trade-Offs in Emotion-Preserving Speaker Anonymization
Cai, Zexin; Xinyuan, Henry Li; Garg, Ashi; Andrews, Nicholas O.; Garcia, Paola; Wiesner, Matthew S.; Duh, Kevin; Khudanpur, Sanjeev
217: Estimating the Completeness of Discrete Speech Units
Yeh, Sung-Lin; Tang, Hao
314: Investigation of Speaker Representation for Target-Speaker Speech Processing
Ashihara, Takanori; Moriya, Takafumi; Horiguchi, Shota; Peng, Junyi; Ochiai, Tsubasa; Delcroix, Marc; Matsuura, Kohei; Sato, Hiroshi
226: Crossmodal ASR Error Correction with Discrete Speech Units
Li, Yuanchao; Chen, Pinzhen; Bell, Peter; Lai, Catherine
241: Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech-Integrated Large Language Models
Lin, Yi-Cheng; Lin, Tzu-Quan; Yang, Chih-Kai; Lu, Ke-Han; Chen, Wei-Chih; Kuan, Chun-Yi; Lee, Hung-Yi
265: Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
Kim, Sungnyun; Jang, Kangwook; Bae, Sangmin; Kim, Hoirin; Yun, Se-Young
269: Data-Efficient Reflow for Few-Step Audio Generation
Wu, Lemeng; Ni, Zhaoheng; Shi, Bowen; Hsu, Wei-Ning; Le Lan, Gael; Nagaraja, Varun; Kumar, Anurag; Mei, Xinhao; Xiong, Yunyang; Soran, Bilge; Krishnamoorthi, Raghuraman; Shi, Yangyang; Chandra, Vikas
99: Optimizing Byte-Level Representation for End-to-End ASR
Hsiao, Roger; Deng, Liuhui; McDermott, Erik; Travadi, Ruchir; Zhuang, Xiaodan
179: Romanization Encoding for Multilingual ASR
Ding, Wen; Jia, Fei; Xu, Hainan; Xi, Yu; Lai, Junjie; Ginsburg, Boris
254: Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection
Yang, Tzu-Ting; Wang, Hsin-Wei; Wang, Yi-Cheng; Chen, Berlin
263: Language-Independent Prosody-Enhanced Speech Representations for Multilingual Speech Synthesis
Liu, Chang; Ling, Zhen-Hua; Hu, Ya-Jun
359: Classification of Spontaneous and Scripted Speech for Multilingual Audio
Elisha, Shahar; McDowell, Andrew J.; Beguerisse-Díaz, Mariano; Benetos, Emmanouil
40: GMP-TL: Gender-Augmented Multi-Scale Pseudo-Label Enhanced Transfer Learning for Speech Emotion Recognition
Pan, Yu; Yang, Yuguang; Huang, Yuheng; Jin, Tiancheng; Yin, Jingjing; Hu, Yanni; Lu, Heng; Ma, Lei; Zhao, Jianjun
81: Embracing Ambiguity and Subjectivity Using the All-Inclusive Aggregation Rule for Evaluating Multi-Label Speech Emotion Recognition Systems
Chou, Huang-Cheng; Wu, Haibin; Goncalves, Lucas; Leem, Seong-Gyun; Salman, Ali N.; Busso, Carlos; Lee, Hung-Yi; Lee, Chi-Chun
83: Open-Emotion: A Reproducible Emo-SUPERB for Speech Emotion Recognition Systems
Wu, Haibin; Chou, Huang-Cheng; Chang, Kai-Wei; Goncalves, Lucas; Du, Jiawei; Jang, Jyh-Shing Roger; Lee, Chi-Chun; Lee, Hung-Yi
225: Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
Li, Yuanchao; Bell, Peter; Lai, Catherine
282: Beyond the Binary: Limitations and Possibilities of Gender-Related Speech Technology Research
Sanchez, Ariadna; Ross, Alice; Markl, Nina
352: Enhancing Domain Generalization in Speech Emotion Recognition by Combining Domain-Variant Representations and Domain-Invariant Classifiers
Lee, Shi-Wook
47: MDCTCodec: A Lightweight MDCT-Based Neural Audio Codec Towards High Sampling Rate and Low Bitrate Scenarios
Jiang, Xiao-Hang; Ai, Yang; Zheng, Rui-Chen; Du, Hui-Peng; Lu, Ye-Xin; Ling, Zhen-Hua
51: Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
Guo, Haohan; Xie, Fenglong; Yang, Dongchao; Lu, Hui; Wu, Xixi; Meng, Helen
267: Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
Li, Jiaqi; Wang, Dongmei; Wang, Xiaofei; Qian, Yao; Zhou, Long; Liu, Shujie; Yousefi, Midia; Li, Canrun; Tsai, Chung-Hsien; Xiao, Zhen; Liu, Yanqing; Chen, Junkun; Zhao, Sheng; Li, Jinyu; Wu, Zhizheng; Zeng, Michael
280: ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech
Shi, Jiatong; Tian, Jinchuan; Wu, Yihan; Jung, Jee-Weon; Yip, Jia Qi; Masuyama, Yoshiki; Chen, William; Wu, Yuning; Tang, Yuxun; Baali, Massa; Alharthi, Dareen; Zhang, Dong; Deng, Ruifan; Srivastava, Tejes; Wu, Haibin; Liu, Alexander H.; Raj, Bhiksha; Jin, Qin; Song, Ruihua; Watanabe, Shinji
336: Codec-SUPERB @ SLT 2024: A Lightweight Benchmark for Neural Codec Models
Wu, Haibin; Chen, Xuanjun; Lin, Yi-Cheng; Du, Jiawei; Chang, Kai-Wei; Lu, Ke-Han; Liu, Alexander H.; Chung, Ho Lam; Wu, Yuan-Kuei; Yang, Dongchao; Liu, Songxiang; Wu, Yi-Chiao; Tan, Xu; Glass, James; Watanabe, Shinji; Lee, Hung-Yi
61: Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge
Liu, Shuiyun; Kong, Yuxiang; Guo, Pengcheng; Zhuang, Weiji; Gao, Peng; Wang, Yujun; Xie, Lei
234: PB-LRDWWS System for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge
Wang, Shiyao; Zhou, Jiaming; Zhao, Shiwan; Qin, Yong
404: Summary of Low-Resource Dysarthria Wake-Up Word Spotting Challenge
Gao, Ming; Chen, Hang; Du, Jun; Xu, Xin; Guo, Hongxiao; Bu, Hui; Li, Ming; Lee, Chin-Hui
417: Domain Adaptation and Unified Knowledge Base Motivate Better Retrieval Models in Dialog Systems with RAG
Lin, Huadong; Chen, Yirong; Tao, Wenyu; Chen, Mingyu; Xu, Xiangmin; Xing, Xiaofen
S8: Lunch (Day 2, Dec 3, Tuesday, 12:30 – 14:00)
T2: Invited Talk 2 (Day 2, Dec 3, Tuesday, 14:00 – 15:00)
P4: Poster Session 4: Speech Synthesis (Day 2, Dec 3, Tuesday, 15:00 – 17:00)
21: AS-Speech: Adaptive Style for Speech Synthesis
Li, Zhipeng; Xing, Xiaofen; Wang, Jun; Chen, Shuaiqi; Yu, Guoqiao; Wan, Guanglu; Xu, Xiangmin
31: Room Impulse Responses Help Attackers Evade Deep Fake Detection
Luong, Hieu-Thi; Truong, Duc-Tuan; Lee, Kong Aik; Chng, Eng Siong
35: Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech
Wang, Hankun; Du, Chenpeng; Guo, Yiwei; Wang, Shuai; Chen, Xie; Yu, Kai
46: Stage-Wise and Prior-Aware Neural Speech Phase Prediction
Liu, Fei; Ai, Yang; Du, Hui-Peng; Lu, Ye-Xin; Zheng, Rui-Chen; Ling, Zhen-Hua
52: SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model-Based Text-to-Speech Synthesis
Guo, Haohan; Xie, Fenglong; Yang, Dongchao; Wu, Xixi; Meng, Helen; Xie, Kun; Guo, Dake
56: Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits
Huang, Sung-Feng; Kuo, Heng-Cheng; Chen, Zhehuai; Yang, Xuesong; Yang, Chao-Han Huck; Tsao, Yu; Wang, Yu-Chiang Frank; Lee, Hung-Yi; Fu, Szu-Wei
62: DNN-Based Ensemble Singing Voice Synthesis with Interactions Between Singers
Hyodo, Hiroaki; Takamichi, Shinnosuke; Nakamura, Tomohiko; Koguchi, Junya; Saruwatari, Hiroshi
87: Investigating Disentanglement in a Phoneme-Level Speech Codec for Prosody Modeling
Karapiperis, Sotirios; Ellinas, Nikolaos; Vioni, Alexandra; Oh, Junkwang; Jho, Gunu; Hwang, Inchul; Raptis, Spyros
128: InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself
Zeng, Chang; Wang, Chunhui; Miao, Xiaoxiao; Zhao, Jian; Jiang, Zhonglin; Chen, Yong
166: E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Eskimez, Sefik Emre; Wang, Xiaofei; Thakker, Manthan; Li, Canrun; Tsai, Chung-Hsien; Xiao, Zhen; Yang, Hemin; Zhu, Zirun; Tang, Min; Tan, Xu; Liu, Yanqing; Zhao, Sheng; Kanda, Naoyuki
167: Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech
Wu, Haibin; Wang, Xiaofei; Eskimez, Sefik Emre; Thakker, Manthan; Tompkins, Daniel; Tsai, Chung-Hsien; Li, Canrun; Xiao, Zhen; Zhao, Sheng; Li, Jinyu; Kanda, Naoyuki
184: Disentangling the Prosody and Semantic Information with Pre-Trained Model for In-Context Learning-Based Zero-Shot Voice Conversion
Chen, Zhengyang; Wang, Shuai; Zhang, Mingyang; Liu, Xuechen; Yamagishi, Junichi; Qian, Yanmin
195: NDVQ: Robust Neural Audio Codec with Distribution-Based Vector Quantization
Niu, Zhikang; Chen, Sanyuan; Zhou, Long; Ma, Ziyang; Chen, Xie; Liu, Shujie
228: Fast, High-Quality, and Parameter-Efficient Articulatory Synthesis Using Differentiable DSP
Liu, Yisi; Yu, Bohan; Lin, Drake; Wu, Peter; Cho, Cheol Jun; Anumanchipalli, Gopala Krishna
299: VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation
Yu, Yifeng; Shi, Jiatong; Wu, Yuning; Tang, Yuxun; Watanabe, Shinji
316: End-to-End Streaming Model for Low-Latency Speech Anonymization
Quamer, Waris; Gutierrez-Osuna, Ricardo
326: Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids’ Story Speech Synthesis
Chung, Raymond
332: Discrete Unit-Based Masking for Improving Disentanglement in Voice Conversion
Lee, Philip; Ülgen, İsmail Rasim; Sisman, Berrak
345: Cross-Dialect Text-to-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
Yamauchi, Kazuki; Saito, Yuki; Saruwatari, Hiroshi
353: Leveraging Diverse Semantic-Based Audio Pretrained Models for Singing Voice Conversion
Zhang, Xueyao; Fang, Zihao; Gu, Yicheng; Chen, Haopeng; Zou, Lexiao; Zhang, Junan; Xue, Liumeng; Wu, Zhizheng
394: TTSDS: Text-to-Speech Distribution Score
Minixhofer, Christoph D.; Klejch, Ondřej; Bell, Peter
396: The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction
Huang, Wen-Chin; Fu, Szu-Wei; Cooper, Erica; Zezario, Ryandhimas E.; Toda, Tomoki; Wang, Hsin-Min; Yamagishi, Junichi; Tsao, Yu
406: Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion
Shi, Yu-Fei; Ai, Yang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua
407: The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech
Baba, Kaito; Nakata, Wataru; Saito, Yuki; Saruwatari, Hiroshi
261: Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
Guragain, Anmol; Liu, Tianchi; Pan, Zihan; Sailor, Hardik B.; Wang, Qiongqiong
323: SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge
Zhang, You; Zang, Yongyi; Shi, Jiatong; Yamamoto, Ryuichi; Toda, Tomoki; Duan, Zhiyao
348: XWSB: A Blend System Utilizing XLS-R and WavLM with SLS Classifier Detection System for SVDD 2024 Challenge
Qi, Shan Zhang; Wen, Shuangbing; Yan, Fangke; Hu, Tao; Li, Jun
416: Integrating Self-Supervised Pre-Training with Adversarial Learning for Synthesized Song Detection
Wang, Yankai; Du, Yuxuan; Zhang, Dejun; Zheng, Rong; Deng, Jing
S9: Coffee Break (Day 2, Dec 3, Tuesday, 17:00 – 17:30)
S10: TBD (Day 2, Dec 3, Tuesday, 17:30 – 18:30)
Day 3, Dec 4, Wednesday
K3: Keynote 3 (Day 3, Dec 4, Wednesday, 09:00 – 10:00)
S11: Coffee Break (Day 3, Dec 4, Wednesday, 10:00 – 10:30)
P5: Poster Session 5: Machine Learning & Resources (Day 3, Dec 4, Wednesday, 10:30 – 12:30)
15: HeightCeleb: An Enrichment of VoxCeleb Dataset with Speaker Height Information
Kacprzak, Stanisław; Kowalczyk, Konrad
57: ESPnet-EZ: Python-Only ESPnet for Easy Fine-Tuning and Integration
Someki, Masao; Choi, Kwanghee; Arora, Siddhant; Chen, William; Cornell, Samuele; Han, Jionghao; Peng, Yifan; Shi, Jiatong; Srivastav, Vaibhav; Watanabe, Shinji
106: Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models
Lin, Yi-Cheng; Chen, Wei-Chih; Lee, Hung-Yi
124: Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit
Zhang, Xueyao; Xue, Liumeng; Gu, Yicheng; Wang, Yuancheng; Li, Jiaqi; He, Haorui; Wang, Chaoren; Liu, Songting; Chen, Xi; Zhang, Junan; Fang, Zihao; Chen, Haopeng; Tang, Tze Ying; Zou, Lexiao; Wang, Mingxuan; Han, Jun; Chen, Kai; Li, Haizhou; Wu, Zhizheng
126: Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
He, Haorui; Shang, Zengqiang; Wang, Chaoren; Li, Xuyuan; Gu, Yicheng; Hua, Hua; Liu, Liwei; Yang, Chen; Li, Jiaqi; Shi, Peiyang; Wang, Yuancheng; Chen, Kai; Zhang, Pengyuan; Wu, Zhizheng
191: FLORAS 50: A Massively Multilingual Multitask Benchmark for Long-Form Conversational Speech
Chen, William; Yan, Brian; Chen, Chih-Chen; Watanabe, Shinji
295: Massively Multilingual Forced Aligner Leveraging Self-Supervised Discrete Units
Inaguma, Hirofumi; Kulikov, Ilia; Ni, Zhaoheng; Popuri, Sravya; Tomasello, Paden P.
298: Speech Recognition for Analysis of Police Radio Communication
Srivastava, Tejes; Chou, Ju-Chieh; Shroff, Priyank; Livescu, Karen; Graziul, Christopher
307: Large Language Models as User Agents for Evaluating Task-Oriented Dialogue Systems
Kazi, Taaha; Lyu, Ruiliang; Zhou, Sizhe; Hakkani-Tur, Dilek; Tur, Gokhan
334: DFADD: The Diffusion and Flow-Matching-Based Audio Deepfake Dataset
Du, Jiawei; Lin, I-Ming; Chiu, I-Hsiang; Chen, Xuanjun; Wu, Haibin; Ren, Wenze; Tsao, Yu; Lee, Hung-Yi; Jang, Roger
384: SpMis: An Investigation of Synthetic Spoken Misinformation Detection
Liu, Peizhuo; Wang, Li; Renqiang, He; He, Haorui; Wang, Lei; Zheng, Huadi; Shi, Jie; Xiao, Tong; Wu, Zhizheng
7: Self-Supervised Speech Models for Word-Level Stuttered Speech Detection
Shih, Yi-Jen; Gkalitsiou, Zoi; Dimakis, Alex; Harwath, David
30: Enhancing Automatic Speech Assessment Leveraging Heterogeneous Features and Soft Labels for Ordinal Classification
Peng, Wen Hsuan; Chen, Sally; Chen, Berlin
92: Speech Recognition-Based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech
Lee, Jeehyun; Choi, Yerin; Koo, Myoung-Wan
144: Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget
Liu, Andy T.; Lin, Yi-Cheng; Wu, Haibin; Winkler, Stefan; Lee, Hung-Yi
175: Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models
Zheng, Xinhu; Jiang, Anbai; Han, Bing; Qian, Yanmin; Fan, Pingyi; Liu, Jia; Zhang, Wei-Qiang
233: Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis
Nguyen, Tuan Manh; Fredouille, Corinne; Ghio, Alain; Balaguer, Mathieu; Woisard, Virginie
249: Hierarchical Multi-Path and Multi-Model Selection for Fake Speech Detection
Feng, Chang; Zhao, Yiyang; Sun, Guangzhi; Chen, Zehua; Wang, Shuai; Zhang, Chao; Xu, Mingxing; Zheng, Thomas Fang
251: Semi-Supervised Learning for Robust Speech Evaluation
Zhang, Huayun; Wong, Jeremy H. M.; Lin, Geyu; Chen, Nancy
264: GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-Shot Keyword Spotting
Zhu, Pai; Bartel, Jacob W.; Agarwal, Dhruuv; Partridge, Kurt; Park, Hyun Jin; Wang, Quan
312: A Simple HMM with Self-Supervised Representations for Phone Segmentation
Yang, Gene-Ping; Tang, Hao
331: DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
Bhati, Saurabhchand; Gong, Yuan; Karlinsky, Leonid; Kuehne, Hilde; Feris, Rogerio; Glass, James
337: RAND: Robustness-Aware Norm Decay for Quantized Neural Networks
Qiu, David; Rim, David; Ding, Shaojin; Rybakov, Oleg; He, Yanzhang
377: SWIM: Short-Window CNN Integrated with Mamba for EEG-Based Auditory Spatial Attention Decoding
Zhang, Ziyang; Thwaites, Andrew; Woolgar, Alexandra; Moore, Brian C. J.; Zhang, Chao
12: Automated Speaking Assessment of Conversation Tests with a Novel Graph-Based Modeling Method on Spoken Response Coherence
Li, Jiun-Ting; Yan, Bi-Cheng; Lo, Tien-Hong; Wang, Yi-Cheng; Hsu, Yung-Chang; Chen, Berlin
115: Conditional Label Smoothing for LLM-Based Data Augmentation in Medical Text Classification
Becker, Luca Manuel; Pracht, Philip; Sertdal, Peter; Uboreck, Jil; Bendel, Alexander; Martin, Rainer
180: Plan, Generate and Optimize: Extending Large Language Models for Dialogue Systems via Prompt-Based Collaborative Method
Guo, Mengfei; Chen, Si; Huang, Yi; Feng, Junlan
202: Taming NLU Noise: Student-Teacher Learning for Robust Dialogue Policy
Rohmatillah, Mahdin; Chien, Jen-Tzung
S12: Lunch (Day 3, Dec 4, Wednesday, 12:30 – 14:00)
T3: Invited Talk 3 (Day 3, Dec 4, Wednesday, 14:00 – 15:00)
PD1: Panel Discussion (Day 3, Dec 4, Wednesday, 15:00 – 17:00)
S13: Coffee Break (Day 3, Dec 4, Wednesday, 17:00 – 17:30)
D2: Demos (Day 3, Dec 4, Wednesday, 17:30 – 18:30)
S14: Gala Dinner (Day 3, Dec 4, Wednesday, 19:00 – 20:30)
Day 4, Dec 5, Thursday
K4: Keynote 4 (Day 4, Dec 5, Thursday, 09:00 – 10:00)
S15: Coffee Break (Day 4, Dec 5, Thursday, 10:00 – 10:30)
P6: Poster Session 6: Spoken Language Processing (Day 4, Dec 5, Thursday, 10:30 – 12:30)
65: Stutter-Solver: End-to-End Multi-Lingual Dysfluency Detection
Zhou, Xuanru; Cho, Cheol Jun; Sharma, Ayati; Morin, Brittany; Baquirin, David; Vonk, Jet; Ezzes, Zoe; Miller, Zachary; Tee, Boon Lead; Gorno Tempini, Maria Luisa; Lian, Jiachen; Anumanchipalli, Gopala Krishna
105: ProGRes: Prompted Generative Rescoring on ASR N-Best
Tur, Ada D.; Ravanelli, Mirco; Moumen, Adel
153: SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
Shams, Siavash; Dindar, Sukru Samet; Jiang, Xilin; Mesgarani, Nima
215: Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
Kuan, Chun-Yi; Yang, Chih-Kai; Huang, Wei-Ping; Lu, Ke-Han; Lee, Hung-Yi
174: CTC-GMM: CTC-Guided Modality Matching for Fast and Accurate Streaming Speech Translation
Zhao, Rui; Li, Jinyu; Fan, Ruchao; Post, Matt
223: Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
Polák, Peter; Bojar, Ondrej
272: Confidence Estimation for LLM-Based Dialogue State Tracking
Sun, Yijyun; Dey, Suvodip; Hakkani-Tur, Dilek; Tur, Gokhan
278: The 2nd FutureDial Challenge: Dialog Systems with Retrieval-Augmented Generation (FutureDial-RAG)
Cai, Yucheng; Chen, Si; Wu, Yuxuan; Huang, Yi; Feng, Junlan; Ou, Zhijian
130: Zero-Shot Audio Topic Reranking Using Large Language Models
Qian, Mengjie; Ma, Rao; Liusie, Adian; Loweimi, Erfan; Knill, Katherine M.; Gales, Mark
59: Clean Label Attacks Against SLU Systems
Xinyuan, Henry Li; Thebaud, Thomas; Joshi, Sonal; Villalba, Jesus Antonio; Dehak, Najim; Khudanpur, Sanjeev
168: WHISMA: A Speech-LLM to Perform Zero-Shot Spoken Language Understanding
Li, Mohan; Do, Cong-Thanh; Keizer, Simon; Farag, Youmna; Stoyanchev, Svetlana; Doddipatla, Rama S.
169: Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer
Sunder, Vishal; Fosler-Lussier, Eric
208: Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT
Komatsu, Ryota; Shinozaki, Takahiro
355: Just ASR + LLM? A Study on Speech Large Language Models’ Ability to Identify and Understand Speaker in Spoken Dialogue
Wu, Junkai; Fan, Xulin; Lu, Bo-Ru; Jiang, Xilin; Mesgarani, Nima; Hasegawa-Johnson, Mark A.; Ostendorf, Mari
5: Enhancing Open-Set Speaker Identification Through Rapid Tuning with Speaker Reciprocal Points and Negative Samples
Chen, Zhiyong; Ai, Zhiqi; Li, Xinnuo; Xu, Shugong
10: Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches
Zeng, Chang; Miao, Xiaoxiao; Wang, Xin; Cooper, Erica; Yamagishi, Junichi
49: Adversarial Purification for Speaker Verification by Two-Stage Diffusion Models
Bai, Yibo; XiaoLei, Zhang; Li, Xuelong
103: Measuring Sound Symbolism in Audio-Visual Models
Tseng, Wei-Cheng; Shih, Yi-Jen; Harwath, David; Mooney, Raymond
111: Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes
Kukanov, Ivan; Laakkonen, Janne; Kinnunen, Tomi H.; Hautamäki, Ville
135: On the Generation and Removal of Speaker Adversarial Perturbation for Voice Privacy Protection
Guo, Chenyang; Chen, Liping; Li, Zhuhai; Lee, Kong Aik; Ling, Zhen-Hua; Guo, Wu
138: Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing
Liu, Tianchi; Kukanov, Ivan; Pan, Zihan; Wang, Qiongqiong; Sailor, Hardik B.; Lee, Kong Aik
141: Enhance Low-Resource Spoken Language Identification via Cross-Modality Retrieval and Cross-Lingual Text-to-Speech Synthesis
Ma, Min; Wang, Yuan; Kastner, Kyle; Caswell, Isaac; Yoon, Charles; Rosenberg, Andrew
148: Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings
Horiguchi, Shota; Ando, Atsushi; Moriya, Takafumi; Ashihara, Takanori; Sato, Hiroshi; Tawara, Naohiro; Delcroix, Marc
259: PDAF: A Phonetic Debiasing Attention Framework for Speaker Verification
Baali, Massa; Aldoobi, Abdulhamid; Dhamyal, Hira; Singh, Rita; Raj, Bhiksha
356: INX-SpeakerHub: A 2000-Hour Indian Multilingual Speaker Identification Corpus
Sagaya Mary N J, Metilda; Umesh, S.
100: Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR
Wang, Weiqing; Dhawan, Kunal; Park, Tae Jin; Puvvada, Krishna C.; Medennikov, Ivan; Majumdar, Somshubra; Huang, He; Balam, Jagadeesh; Ginsburg, Boris
405: Exploring Self-Supervised Representations for Text-Dependent Speaker Verification
Sreekanth, Sankala
147: Distillation-Based Feature Extraction Algorithm for Source Speaker Verification
Ma, Xinlei; Lu, Wenhuan; Zhang, Ruiteng; Xu, Junhai; Lu, Xugang; Wei, Jianguo
224: Speaker Contrastive Learning for Source Speaker Tracing
Wang, Qing; Guo, Hongmei; Kang, Jian; Du, Mengjie; Li, Jie; XiaoLei, Zhang; Xie, Lei
411: The Database and Benchmark for the Source Speaker Tracing Challenge 2024
Li, Ze; Lin, Yuke; Tian, Yao; Suo, Hongbin; Zhang, Pengyuan; Ren, Yanzhen; Cai, Zexin; Nishizaki, Hiromitsu; Li, Ming