A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding

4 November 2021
Yingzhi Wang
Abdelmoumene Boumadane
A. Heba

Papers citing "A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding"

50 / 83 papers shown
EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning
Xingfeng Li
Xiaohan Shi
Junjie Li
Yongwei Li
M. Unoki
Tomoki Toda
Masato Akagi
48
0
0
25 Nov 2025
Enabling Automatic Self-Talk Detection via Earables
Euihyeok Lee
Seonghyeon Kim
Sanghun Im
Heung-Seon Oh
Seungwoo Kang
89
0
0
10 Nov 2025
MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech
Junming Yuan
Ying Shi
D. Wang
Lantian Li
A. Hamdulla
SSL
425
0
0
09 Nov 2025
Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition
Jing-Tong Tzeng
John H. L. Hansen
Chi-Chun Lee
MoE
153
1
0
10 Sep 2025
EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis
Shuai Tan
Bin Ji
186
2
0
19 Aug 2025
EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Hugo Thimonier
Antony Perzo
Renaud Seguier
145
2
0
19 Aug 2025
Human Feedback Driven Dynamic Speech Emotion Recognition
Ilya Fedorov
Dmitry Korobchenko
57
0
0
18 Aug 2025
Deep Learning Approaches for Multimodal Intent Recognition: A Survey
Jingwei Zhao
Yuhua Wen
Qifei Li
Minchi Hu
Yingying Zhou
...
Junyang Wu
Yingming Gao
Zhengqi Wen
Jianhua Tao
Ya Li
ViT
191
1
0
24 Jul 2025
Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information
Nicholas Sanders
Yuanchao Li
Korin Richmond
Simon King
214
1
0
21 May 2025
Representation of perceived prosodic similarity of conversational feedback
Livia Qian
Carol Figueroa
Gabriel Skantze
120
0
0
19 May 2025
BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition
Computer Speech and Language (CSL), 2025
Paige Tuttosi
Mantaj Dhillon
Luna Sang
Shane Eastwood
Poorvi Bhatia
Quang Minh Dinh
Avni Kapoor
Yewon Jin
Angelica Lim
332
3
0
30 Apr 2025
Can Diffusion Models Disentangle? A Theoretical Perspective
Liming Wang
Muhammad Jehanzeb Mirza
Yishu Gong
Yuan Gong
Jiaqi Zhang
Brian Tracey
Katerina Placek
Marco Vilela
James Glass
DiffM
CoGe
399
0
0
31 Mar 2025
Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Aneesha Sampath
James Tavernor
E. Provost
311
4
0
17 Feb 2025
Evaluating the Impact of Discriminative and Generative E2E Speech Enhancement Models on Syllable Stress Preservation
Rangavajjala Sankara Bharadwaj
Jhansi Mallela
Sai Harshitha Aluru
Chiranjeevi Yarra
191
1
0
11 Dec 2024
Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
Li-Wei Chen
Takuya Higuchi
He Bai
Ahmed Hussen Abdelaziz
Alexander Rudnicky
Shinji Watanabe
Tatiana Likhomanenko
B. Theobald
Zakaria Aldeneh
311
1
0
16 Sep 2024
Continuous Learning of Transformer-based Audio Deepfake Detection
Tuan Duy Nguyen Le
Kah Kuan Teh
Huy Dat Tran
ViT
184
7
0
09 Sep 2024
NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
He Huang
Taejin Park
Kunal Dhawan
Ivan Medennikov
Krishna Puvvada
Nithin Rao Koluguri
Weiqing Wang
Jagadeesh Balam
Boris Ginsburg
SSL
AI4TS
330
4
0
23 Aug 2024
VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints
Jinghua Tang
Liyun Zhang
Yu Lu
Lanqing Yang
YiChao Chen
Minjie Bian
Xiaoshan Li
Guangtao Xue
166
2
0
23 Aug 2024
SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Yi Zhu
Surya Koppisetti
Trang Tran
Gaurav Bharaj
407
22
0
26 Jul 2024
Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
Li Zhang
Ning Jiang
Qing Wang
Yuehong Li
Quan Lu
Lei Xie
238
16
0
14 Jul 2024
MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition
J. Duret
Mickael Rouvier
Yannick Esteve
118
4
0
08 Jul 2024
A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition
Shreya G. Upadhyay
John H. L. Hansen
Chi-Chun Lee
269
7
0
06 Jul 2024
Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations
Bulat Khaertdinov
Pedro Jeuris
Annanda Sousa
Enrique Hortal
232
2
0
12 Jun 2024
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets
Shahin Amiriparian
Filip Packań
Maurice Gerczuk
Björn W. Schuller
104
18
0
11 Jun 2024
SpeechVerse: A Large-scale Generalizable Audio Language Model
Nilaksh Das
Saket Dingliwal
S. Ronanki
Rohit Paturi
David Huang
...
Monica Sunkara
S. Srinivasan
Kyu J. Han
Katrin Kirchhoff
485
67
0
14 May 2024
A Large-Scale Evaluation of Speech Foundation Models
Shu-Wen Yang
Heng-Jui Chang
Zili Huang
Andy T. Liu
Cheng-I Jeff Lai
...
Kushal Lakhotia
Shang-Wen Li
Abdelrahman Mohamed
Shinji Watanabe
Hung-yi Lee
278
56
0
15 Apr 2024
EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
European Conference on Computer Vision (ECCV), 2024
Shuai Tan
Bin Ji
Mengxiao Bi
Ye Pan
259
67
0
02 Apr 2024
Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters
Umberto Cappellazzo
Daniele Falavigna
Alessio Brutti
MoE
189
6
0
01 Feb 2024
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Zakaria Aldeneh
Takuya Higuchi
Jee-weon Jung
Skyler Seto
Tatiana Likhomanenko
Stephen Shum
Ahmed Hussen Abdelaziz
Shinji Watanabe
B. Theobald
SSL
170
4
0
01 Feb 2024
A Multi-Task, Multi-Modal Approach for Predicting Categorical and Dimensional Emotions
Alex-Răzvan Ispas
Théo Deschamps-Berger
Laurence Devillers
147
4
0
31 Dec 2023
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Ziyang Ma
Zhisheng Zheng
Jiaxin Ye
Jinchao Li
Zhifu Gao
Shiliang Zhang
Xie Chen
MDE
SLR
SSL
302
243
0
23 Dec 2023
Speech and Text-Based Emotion Recognizer
Varun Sharma
70
0
0
10 Dec 2023
Generalized zero-shot audio-to-intent classification
Automatic Speech Recognition & Understanding (ASRU), 2023
Veera Raghavendra Elluru
Devang Kulshreshtha
Rohit Paturi
S. Bodapati
S. Ronanki
207
4
0
04 Nov 2023
Enhancing expressivity transfer in textless speech-to-speech translation
Automatic Speech Recognition & Understanding (ASRU), 2023
J. Duret
Benjamin O’Brien
Yannick Esteve
Titouan Parcollet
168
3
0
11 Oct 2023
Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Jianqiao Lu
Wenyong Huang
Nianzu Zheng
Xingshan Zeng
Y. Yeung
Xiao Chen
SyDa
256
1
0
09 Oct 2023
Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
International Conference on Learning Representations (ICLR), 2023
Jiatong Shi
Hirofumi Inaguma
Xutai Ma
Ilia Kulikov
Anna Y. Sun
261
36
0
04 Oct 2023
Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Shuai Wang
Qibing Bai
Qi Liu
Jianwei Yu
Zhengyang Chen
Bing Han
Yan-min Qian
Haizhou Li
214
2
0
21 Sep 2023
Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Ziyang Ma
Wen Wu
Zhisheng Zheng
Yiwei Guo
Qian Chen
Shiliang Zhang
Xie Chen
244
29
0
19 Sep 2023
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023
ACM Multimedia (ACM MM), 2023
Haotian Wang
Yuxuan Xi
Hang Chen
Jun Du
Yan Song
...
Pengfei Hu
Ya Jiang
Shi Cheng
Jie Zhang
Yuzhe Weng
213
5
0
11 Sep 2023
Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Debaditya Shome
Ali Etemad
183
9
0
09 Sep 2023
Leveraging Label Information for Multimodal Emotion Recognition
Interspeech, 2023
Pei-Hsin Wang
Sunlu Zeng
Junqing Chen
Lu Fan
Meng Chen
Youzheng Wu
Xiaodong He
239
6
0
05 Sep 2023
Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
Computer Speech and Language (CSL), 2023
Salah Zaiem
Youcef Kemiche
Titouan Parcollet
S. Essid
Mirco Ravanelli
SSL
234
19
0
28 Aug 2023
Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition
Anant Singh
Akshat Gupta
188
5
0
17 Aug 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
IEEE Transactions on Multimedia (IEEE TMM), 2023
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
187
26
0
15 Aug 2023
Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling
Interspeech, 2023
Hengguan Huang
Jagadeesh Balam
Boris Ginsburg
181
6
0
13 Jul 2023
Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data
Computer Speech and Language (CSL), 2023
Guangzhi Sun
Chuxu Zhang
Ivan Vulić
Paweł Budzianowski
P. Woodland
189
6
0
04 Jul 2023
Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data
Speech Synthesis Workshop (SSW), 2023
J. Duret
Titouan Parcollet
Yannick Esteve
133
4
0
29 Jun 2023
Speech Emotion Diarization: Which Emotion Appears When?
Automatic Speech Recognition & Understanding (ASRU), 2023
Yingzhi Wang
Mirco Ravanelli
Alya Yacoubi
149
21
0
22 Jun 2023
Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2023
Yuya Yamamoto
168
3
0
22 Jun 2023
Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations
Nayan Anand
Meenakshi Sirigiraju
Chiranjeevi Yarra
123
1
0
15 Jun 2023