Challenge Description

Following the success of two previous Short-duration Speaker Verification (SdSV) Challenges, the Text-dependent Speaker Verification (TdSV) Challenge 2024 focuses on the relevance of recent training strategies, such as self-supervised learning. The challenge evaluates the TdSV task in two practical scenarios: conventional TdSV using predefined passphrases (Task 1) and TdSV using user-defined passphrases (Task 2). In Task 1, participants train a speaker encoder model on a large training dataset drawn from a predefined pool of 10 phrases. Speaker models are created from three repetitions of a specific passphrase from the phrase pool. In Task 2, participants train a speaker encoder model on a large text-independent training dataset; in addition, utterances from a predefined pool of 6 phrases are available for each in-domain training speaker. Three cash prizes will be awarded for each task based on results on the evaluation dataset and other qualitative factors.
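As a rough illustration of the enrollment-and-scoring setup described above (not an official baseline), the sketch below averages the embeddings of the three enrollment repetitions into a speaker model and scores a test utterance with cosine similarity; the speaker encoder itself is a hypothetical placeholder for whatever model participants train.

import numpy as np

def enroll(encoder, enrollment_wavs):
    # Build a speaker model by averaging the embeddings of the three
    # enrollment repetitions of the chosen passphrase.
    embeddings = np.stack([encoder(wav) for wav in enrollment_wavs])
    model = embeddings.mean(axis=0)
    return model / np.linalg.norm(model)

def score(encoder, speaker_model, test_wav):
    # Cosine similarity between the speaker model and the test embedding;
    # higher scores suggest the same speaker uttering the same passphrase.
    embedding = encoder(test_wav)
    embedding = embedding / np.linalg.norm(embedding)
    return float(np.dot(speaker_model, embedding))

# `encoder` stands for any trained speaker encoder that maps a waveform to a
# fixed-dimensional embedding; it is a placeholder, not part of the challenge release.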

Challenge Website

https://tdsvc.github.io

Organizers

Hossein ZEINALI

Amirkabir University of Technology, Iran

hsn.zeinali@gmail.com

Kong Aik LEE

The Hong Kong Polytechnic University, Hong Kong

kong-aik.lee@polyu.edu.hk

Jahangir ALAM

Computer Research Institute of Montreal, Canada

jahangir.alam@crim.ca

Lukáš BURGET

Brno University of Technology, Czech Republic

burget@fit.vutbr.cz

Challenge Description

The SLT-24 GenASR challenge explores the emerging field of post-ASR text modeling through three distinct tasks: (1) ASR-quality aware: N-best ASR-LM correction, (2) speaker aware: speaker-tagging assignment, and (3) non-semantic aware: post-ASR text-to-emotion recognition. We provide open benchmarks and systems to encourage researchers worldwide to advance open speech technology within modest computing budgets and to push the limits of performance on ASR-related tasks. We include two tracks, one with LM size limited to <= 7B parameters and one with unlimited size, to welcome all speech communities.
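As an informal illustration of Task 1 (N-best ASR-LM correction), the sketch below formats an N-best hypothesis list into a correction prompt for an arbitrary LM and scores the output with word error rate via the jiwer package; the prompt wording, the toy hypotheses, and the fallback choice are assumptions, not the official baseline.

from jiwer import wer  # pip install jiwer

def build_correction_prompt(nbest):
    # Format ranked ASR hypotheses into a single correction prompt.
    lines = [f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest)]
    return (
        "Ranked ASR hypotheses for one utterance:\n"
        + "\n".join(lines)
        + "\nReturn the most likely correct transcript."
    )

nbest = ["i scream for ice cream", "eye scream for ice cream"]  # toy example
prompt = build_correction_prompt(nbest)
# corrected = my_llm(prompt)   # any LM within the track's size limit (placeholder)
corrected = nbest[0]           # trivial fallback: keep the top hypothesis
print(wer("i scream for ice cream", corrected))  # 0.0 for this toy case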

Challenge Website

https://sites.google.com/view/gensec-challenge/home

Honorary Challenge Chair

Andreas Stolcke

Uniphore and ICSI

Organizers

Huck Yang

NVIDIA Research

hucky@nvidia.com

Taejin Park

NVIDIA NeMo

taejinp@nvidia.com

Yuan Gong

Massachusetts Institute of Technology CSAIL, United States

yuangong@mit.edu

Yuanchao Li

University of Edinburgh, United Kingdom

yuanchao.li@ed.ac.uk

Challenge Description

The Singing Voice Deepfake Detection (SVDD) Challenge addresses the surge in AI-generated singing voices, which now closely resemble natural human singing and align seamlessly with musical scores, raising concerns for artists and the music industry. Unlike spoken voice, the singing voice introduces unique detection challenges due to its intricate musical nature and prominent background music. This pioneering challenge is the first to focus on detecting both authentic and synthetic singing voices in lab-controlled and real-world scenarios. It features two distinct tracks: CtrSVDD, for controlled settings, utilizing top-performing singing voice synthesis and singing voice conversion systems, and WildSVDD, which expands our previous SingFake dataset to offer a wide array of real-world examples. This effort aims not only to advance the field of synthetic singing voice detection but also to address the music industry’s concerns about singing voice authenticity, paving the way for more secure digital musical experiences.

Challenge Website

https://challenge.singfake.org/

Organizers

You Zhang

University of Rochester, United States

you.zhang@rochester.edu

Yongyi Zang

University of Rochester, United States

yongyi.zang@rochester.edu

Jiatong Shi

Carnegie Mellon University, United States

jiatongs@andrew.cmu.edu

Ryuichi Yamamoto

Nagoya University, Japan

yamamoto.ryuichi@g.sp.m.is.nagoya-u.ac.jp

Tomoki Toda

Nagoya University, Japan

tomoki@icts.nagoya-u.ac.jp

Zhiyao Duan

University of Rochester, United States

zhiyao.duan@rochester.edu

Challenge Description

AI-generated content (AIGC) has become a very hot topic and is widely used in our daily lives. In the speech field, several challenges specifically target the audio anti-spoofing countermeasure problem, e.g., ASVspoof and ADD. However, there have been limited efforts to address the source speaker tracing problem. Source speaker tracing aims to identify information about the source speaker, or even to reconstruct the source speaker’s speech, from speech signals manipulated in different scenarios, e.g., voice conversion, speaker anonymization, and speech editing. This year’s challenge focuses on source speaker verification against voice conversion: participants are asked to decide whether two converted utterances originate from the same source speaker. Besides the challenge track, there is also a research track to attract submissions related to source speaker tracing and other topics related to speech anti-spoofing countermeasures.

Challenge Website

https://sstc-challenge.github.io/

Organizers

Ming Li

Duke Kunshan University, China

ming.li369@dukekunshan.edu.cn

Pengyuan Zhang

Chinese Academy of Sciences, China

zhangpengyuan@hccl.ioa.ac.cn

Yanzhen Ren

Wuhan University, China

renyz@whu.edu.cn

Zexin Cai

Johns Hopkins University, United States

zcai21@jh.edu

Hiromitsu Nishizaki

University of Yamanashi, Japan

hnishi@yamanashi.ac.jp

Challenge Description

The VoiceMOS Challenge 2024 (VMC 2024) is the third edition of the VMC series. The purpose of the challenge is to compare different systems and approaches to the task of predicting human ratings of speech, usually in terms of the mean opinion score (MOS). This year, there are three tracks. The first track aims to predict MOS ratings, collected through a separate listening test, for a “zoomed-in” subset comprising the top systems from the VMC 2022 dataset. The second track is based on a new dataset containing samples and ratings from singing voice synthesis and conversion systems. The third track is semi-supervised MOS prediction for noisy, clean, and enhanced speech, where participants are only allowed to use a very small amount of MOS-labeled data provided by the organizers.
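MOS prediction challenges of this kind are commonly evaluated by correlating predicted scores with listener ratings; the sketch below computes a system-level Spearman rank correlation with scipy as one such metric. The metric choice and the toy numbers are assumptions for illustration, not the official VMC 2024 evaluation protocol.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-utterance predictions and listener MOS, grouped by system.
pred = {"sys_A": [3.1, 3.4, 2.9], "sys_B": [4.2, 4.0, 4.5], "sys_C": [2.0, 2.3, 2.1]}
true = {"sys_A": [3.0, 3.5, 3.2], "sys_B": [4.1, 4.4, 4.3], "sys_C": [2.2, 2.0, 2.4]}

systems = sorted(pred)
pred_sys = np.array([np.mean(pred[s]) for s in systems])  # system-level averages
true_sys = np.array([np.mean(true[s]) for s in systems])

srcc, _ = spearmanr(pred_sys, true_sys)
print(f"System-level SRCC: {srcc:.3f}")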

Challenge Website

https://sites.google.com/view/voicemos-challenge

Organizers

Wen-Chin Huang

Nagoya University, Japan

wen.chinhuang@g.sp.m.is.nagoya-u.ac.jp

Szu-Wei Fu

NVIDIA Research, Taiwan

szuweif@nvidia.com

Erica Cooper

National Institute of Information and Communications Technology, Japan

ecooper@nii.ac.jp

Ryandhimas Zezario

Academia Sinica, Taiwan

ryandhimas@citi.sinica.edu.tw

Tomoki Toda

Nagoya University, Japan

tomoki@icts.nagoya-u.ac.jp

Hsin-Min Wang

Academia Sinica, Taiwan

whm@iis.sinica.edu.tw

Junichi Yamagishi

National Institute of Informatics, Japan

jyamagis@nii.ac.jp

Yu Tsao

Academia Sinica, Taiwan

yu.tsao@citi.sinica.edu.tw

Challenge Description

Stuttering affects 1% of the global population, influencing social and mental wellbeing and posing challenges to communication and self-esteem. Although its causes remain unknown, early intervention is crucial. However, regions like Mainland China have a shortage of speech-language professionals, which hinders timely support. The StutteringSpeech Challenge addresses this gap through innovation in stuttering event detection (SED) and automatic speech recognition (ASR) technology. This initiative focuses on encouraging user interface designs that are inclusive of people who stutter. It comprises Task I, detecting stuttering events in audio; Task II, building efficient ASR systems that transcribe stuttered speech; and Task III, a research paper track enriching our understanding of stuttering speech technology. This challenge underlines the commitment to fostering awareness and creating inclusive technology, thereby improving communication accessibility in today’s smart device and chatbot ecosystem.

Challenge Website

http://stutteringspeech.org/

Organizers

Rong Gong

StammerTalk

rong.gong@stammertalk.net

Lei Xie

Northwestern Polytechnical University, China

lxie@nwpu.edu.cn

Hui Bu

AIShell Inc., China

buhui@aishelldata.com

Eng Siong Chng

Nanyang Technological University, Singapore

ASESChng@ntu.edu.sg

Binbin Zhang

WeNet Open Source Community

binbzha@qq.com

Ming Li

Duke Kunshan University, China

ming.li369@dukekunshan.edu.cn

Yong Qin

Nankai University, China

qinyong@nankai.edu.cn

Jun Du

University of Science and Technology of China, China

jundu@ustc.edu.cn

Hongfei Xue

Northwestern Polytechnical University, China

hfxue@mail.nwpu.edu.cn

Jiaming Zhou

Nankai University, China

zhoujiaming@mail.nankai.edu.cn

Xin Xu

AIShell Inc., China

xuxin@aishelldata.com

Challenge Description

The goal of this challenge is to encourage innovative methods and a comprehensive understanding of the capability of codec models. The challenge will conduct a comprehensive analysis that provides insights into codec models from both application and signal perspectives, diverging from previous codec papers that predominantly focus on signal-level comparisons. The diverse set of signal-level metrics, including Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), short-time Fourier transform (STFT) distance, Mel cepstral distance, F0 Pearson correlation coefficient (F0CORR), and SpeechBERTScore, enables a thorough evaluation of sound quality across various dimensions, encompassing spectral fidelity, temporal dynamics, perceptual clarity, and intelligibility. The application-angle evaluation will comprehensively analyze each codec’s ability to preserve crucial audio information, encompassing content (WER for ASR), speaker timbre (EER for ASV), emotion, and general audio characteristics. We hope this challenge can inspire innovative research in neural codec development, promote innovation in the neural audio codec field, and advance the research frontier.
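To make the signal-level evaluation concrete, the sketch below computes two of the listed metrics, PESQ and STOI, between a reference waveform and its codec reconstruction using the commonly available pesq and pystoi packages; the file names are placeholders, and this is not the challenge’s official scoring code.

import soundfile as sf   # pip install soundfile
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

# Placeholder paths: the original waveform and the codec-resynthesized version.
ref, sr = sf.read("reference.wav")
deg, _ = sf.read("codec_output.wav")

# PESQ in wide-band mode assumes 16 kHz audio; STOI works at the native rate.
print("PESQ:", pesq(sr, ref, deg, "wb"))
print("STOI:", stoi(ref, deg, sr, extended=False))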

Challenge Website

https://codecsuperb.github.io/

Organizers

Hung-yi Lee

National Taiwan University, Taiwan

hungyilee@ntu.edu.tw

Haibin Wu

National Taiwan University, Taiwan

f07921092@ntu.edu.tw

Kai-Wei Chang

National Taiwan University, Taiwan

kaiwei.chang.tw@gmail.com

Alexander H. Liu

Massachusetts Institute of Technology, United States

alexhliu@mit.edu

Songxiang Liu

miHoYo

songxiangliu.cuhk@gmail.com

Dongchao Yang

The Chinese University of Hong Kong, Hong Kong

Yi-Chiao Wu

Meta

yichiaowu@meta.com

Xu Tan

Microsoft

xuta@microsoft.com

James Glass

Massachusetts Institute of Technology, United States

glass@csail.mit.edu

Shinji Watanabe

Carnegie Mellon University, United States

shinjiw@ieee.org

Challenge Description

With the increasing prevalence of speech-enabled applications, smart home technology has become commonplace in numerous households. For most people, waking up and controlling smart devices by voice is no longer a difficult task. However, individuals with dysarthria face significant challenges in utilizing these technologies due to the inherent variability of their speech. Dysarthria is a motor speech disorder commonly associated with conditions such as cerebral palsy, Parkinson’s disease, amyotrophic lateral sclerosis, and stroke. Dysarthric speakers often experience pronounced difficulties in articulation, fluency, speech rate, volume, and clarity. Consequently, their speech is difficult for commercially available smart devices to comprehend. Moreover, dysarthric individuals frequently have additional motor impairments, further complicating their ability to operate household devices. As a result, voice control has emerged as an ideal solution for enabling dysarthric speakers to execute various commands, helping them lead simple and independent lives.

This challenge addresses speaker-dependent wake-up word spotting using only a small amount of wake-up word audio from the target speaker. This research has the potential not only to enhance the quality of life for individuals with dysarthria, but also to help smart devices better accommodate diverse user requirements, making them a truly universal technology. We hope that the challenge will raise awareness about dysarthria and encourage greater participation in related research. By doing so, we are committed to promoting awareness and understanding of dysarthria in society and eliminating discrimination and prejudice against people with dysarthria.

Challenge Website

https://www.lrdwws.org/

Organizers

Jun Du

University of Science and Technology of China, China

jundu@ustc.edu.cn

Hui Bu

Beijing AIShell Technology Co. Ltd, China

buhui@aishelldata.com

Ming Li

Duke Kunshan University, China

ming.li369@dukekunshan.edu.cn

Ming Gao

University of Science and Technology of China, China

vivigreeeen@mail.ustc.edu.cn

Hang Chen

University of Science and Technology of China, China

ch199703@mail.ustc.edu.cn

Xin Xu

Beijing AISHELL Technology Co., Ltd., China

xuxin@aishelldata.com

Hongxiao Guo

Beijing AISHELL Technology Co., Ltd., China

guohongxiao@aishelldata.com

Chin-Hui Lee

Georgia Institute of Technology, United States

chl@ece.gatech.edu

Challenge Description

Developing intelligent dialog systems has been one of the longest-running goals in AI. In recent years, significant progress has been made in building dialog systems, driven by breakthroughs in deep learning methods and the large amounts of conversational data made available for system development.

There are still many challenges in building future dialog systems. The first FutureDial challenge focused on building semi-supervised and reinforced task-oriented dialog systems (FutureDialSereTOD) and was successfully held at the EMNLP 2022 SereTOD workshop. ChatGPT, a generative dialog system that emerged at the end of 2022, marked another remarkable step forward in engaging users in open-domain dialogs. However, problems such as hallucination and fabrication still hinder the use of such systems in real-life applications like customer service, which require pinpoint accuracy. Retrieval-augmented generation (RAG) has been introduced to enhance dialog systems with information retrieved from external knowledge bases and has attracted increasing interest. RAG has been shown to help dialog systems reply with higher accuracy and factuality, providing more informative and grounded responses. However, challenges remain for RAG-based dialog systems, such as designing retrievers that can draw on multiple knowledge sources and building RAG-based dialog systems that can effectively utilize available tools and API calls for retrieval.

Following the success of the 1st FutureDial challenge, the 2nd FutureDial challenge aims to benchmark and stimulate research in building dialog systems with RAG, based on the newly released dialog dataset MobileCS2. We aim to create a forum to discuss key challenges in the field and share findings from real-world applications.
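As a loose sketch of the RAG loop described above (the embedder and generator are hypothetical placeholders, not the MobileCS2 baseline), retrieval can be as simple as cosine similarity between a query embedding and embedded knowledge-base entries, with the retrieved snippets prepended to the dialog context before generation.

import numpy as np

def retrieve(query_emb, kb_embs, kb_texts, top_k=1):
    # Return the top-k knowledge snippets by cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    kb = kb_embs / np.linalg.norm(kb_embs, axis=1, keepdims=True)
    scores = kb @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [kb_texts[i] for i in best]

def respond(generator, dialog_history, retrieved):
    # Ground the next response on the retrieved knowledge.
    prompt = (
        "Knowledge:\n" + "\n".join(retrieved)
        + "\nDialog:\n" + "\n".join(dialog_history)
        + "\nAgent:"
    )
    return generator(prompt)

# `query_emb`/`kb_embs` come from any text encoder and `generator` is any LM;
# both stand in for whatever models participants choose.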

Challenge Website

http://futuredial.org/

Organizers

Junlan Feng

China Mobile

fengjunlan@chinamobile.com

Zhijian Ou

Tsinghua University, China

ozj@tsinghua.edu.cn

Yi Huang

China Mobile

huangyi@chinamobile.com

Si Chen

China Mobile

chensiyjy@chinamobile.com

Yucheng Cai

Tsinghua University, China

cyc22@mails.tsinghua.edu.cn