wav2vec: Unsupervised Pre-training for Speech Recognition

11 April 2019
Steffen Schneider
Alexei Baevski
R. Collobert
Michael Auli
    SSL
arXiv:1904.05862

Papers citing "wav2vec: Unsupervised Pre-training for Speech Recognition"

50 / 190 papers shown
Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
International Conference on Learning Representations (ICLR), 2024
Jiahao Cui
Hui Li
Yao Yao
Hao Zhu
Hanlin Shang
Kaihui Cheng
Hang Zhou
Siyu Zhu
Jingdong Wang
DiffM
VGen
334
76
0
10 Oct 2024
Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap
Georgia Channing
Juil Sock
Ronald Clark
Christian Schroeder de Witt
194
5
0
09 Oct 2024
InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries
Asian Conference on Machine Learning (ACML), 2024
Mengze Hong
Chen Jason Zhang
Lingxiao Yang
Wailing Ng
Chen Zhang
219
3
0
29 Sep 2024
DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis
Fa-Ting Hong
Yunfei Liu
Yu Li
Changyin Zhou
Fei Yu
D. Xu
DiffM
239
3
0
16 Sep 2024
Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Huang-Cheng Chou
Haibin Wu
Hung-yi Lee
Chi-Chun Lee
410
3
0
16 Sep 2024
Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages
Yao-Fei Cheng
Li-Wei Chen
Hung-Shin Lee
Hsin-Min Wang
298
1
0
13 Sep 2024
Layer-aware TDNN: Speaker Recognition Using Multi-Layer Features from Pre-Trained Models
Jin Sob Kim
Hyun Joon Park
Wooseok Shin
Juan Yun
Sung Won Han
SLR
454
2
0
12 Sep 2024
What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Kavya Manohar
Leena G Pillai
236
4
0
04 Sep 2024
CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention
Gaojie Lin
Jianwen Jiang
Chao Liang
Tianyun Zhong
Jiaqi Yang
Yanbo Zheng
VGen
DiffM
553
33
0
03 Sep 2024
GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis
Yijie Jin
209
3
0
27 Aug 2024
Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation
Hemant Yadav
Sunayana Sitaram
R. Shah
SSL
305
0
0
20 Aug 2024
ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2024
Nakamasa Inoue
Shinta Otake
Takumi Hirose
Masanari Ohi
Rei Kawakami
247
6
0
28 Jul 2024
Sentiment Reasoning for Healthcare
Khai-Nguyen Nguyen
Khai Le-Duc
Bach Phan Tat
Duy Le
Long Vo-Dang
LRM
385
3
0
24 Jul 2024
EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
Zhiyuan Chen
Jiajiong Cao
Zhiquan Chen
Yuming Li
Chenguang Ma
VGen
274
159
0
11 Jul 2024
STONE: Self-supervised Tonality Estimator
Yuexuan Kong
Vincent Lostanlen
Gabriel Meseguer-Brocal
Stella Wong
Mathieu Lagrange
Romain Hennequin
329
7
0
10 Jul 2024
Unsupervised Concept Drift Detection from Deep Learning Representations in Real-time
Salvatore Greco
Bartolomeo Vacchetti
D. Apiletti
Tania Cerquitelli
261
13
0
24 Jun 2024
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
Mingwang Xu
Hui Li
Qingkun Su
Hanlin Shang
Liwei Zhang
Ce Liu
Jingdong Wang
Yao Yao
Siyu Zhu
VGen
255
166
0
13 Jun 2024
Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation
Eungbeom Kim
Hantae Kim
Kyogu Lee
185
2
0
12 Jun 2024
MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
Interspeech, 2024
Hemant Yadav
Sunayana Sitaram
R. Shah
SSL
305
3
0
09 Jun 2024
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
Siavash Shams
Sukru Samet Dindar
Xilin Jiang
N. Mesgarani
Mamba
322
39
0
20 May 2024
Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition
Dongyuan Li
Ying Zhang
Yusong Wang
Funakoshi Kataro
Manabu Okumura
286
3
0
01 May 2024
Mai Hoʻomāuna i ka ʻAi: Language Models Improve Automatic Speech Recognition in Hawaiian
Kaavya Chaparala
Guido Zarrella
Bruce Torres Fischer
Larry Kimura
Oiwi Parker Jones
AuLLM
157
0
0
03 Apr 2024
FeatUp: A Model-Agnostic Framework for Features at Any Resolution
Stephanie Fu
Mark Hamilton
Laura E. Brandt
Axel Feldmann
Zhoutong Zhang
William T. Freeman
MDE
317
83
0
15 Mar 2024
VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis
Wei-wei Lin
Chenhang He
Man-Wai Mak
Jiachen Lian
Kong Aik Lee
DiffM
241
2
0
01 Mar 2024
Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART
Aniket Tathe
Anand Kamble
Suyash Kumbharkar
Atharva Bhandare
Anirban C. Mitra
140
3
0
01 Mar 2024
EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Linrui Tian
Qi Wang
Bang Zhang
Liefeng Bo
DiffM
318
218
0
27 Feb 2024
Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0
Taein Kang
Soyul Han
Sunmook Choi
Jaejin Seo
Sanghyeok Chung
Seungeun Lee
Seungsang Oh
Il-Youp Kwak
247
10
0
27 Feb 2024
Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues
Maneesh Bilalpur
Mert Inan
Dorsa Zeinali
Jeffrey F. Cohn
Malihe Alikhani
276
1
0
13 Feb 2024
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
Chen Chen
Ruizhe Li
Yuchen Hu
Sabato Marco Siniscalchi
Pin-Yu Chen
Ensiong Chng
Chao-Han Huck Yang
227
32
0
08 Feb 2024
Streaming Sequence Transduction through Dynamic Compression
Weiting Tan
Yunmo Chen
Tongfei Chen
Guanghui Qin
Haoran Xu
Heidi C. Zhang
Benjamin Van Durme
Philipp Koehn
500
2
0
02 Feb 2024
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Jiajun He
Xiaohan Shi
Xingfeng Li
Tomoki Toda
215
31
0
24 Jan 2024
Towards Weakly Supervised Text-to-Audio Grounding
Xuenan Xu
Ziyang Ma
Mengyue Wu
Kai Yu
AI4TS
353
17
0
05 Jan 2024
USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Shaojin Ding
David Qiu
David Rim
Yanzhang He
Oleg Rybakov
...
Tara N. Sainath
Zhonglin Han
Jian Li
Amir Yazdanbakhsh
Shivani Agrawal
MQ
472
13
0
13 Dec 2023
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
Zhengcong Fei
Mingyuan Fan
Junshi Huang
377
33
0
27 Nov 2023
Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables
European Signal Processing Conference (EUSIPCO), 2023
Ahmed Adel Attia
Yashish M. Siriwardena
Carol Espy-Wilson
SSL
221
13
0
17 Sep 2023
Indonesian Automatic Speech Recognition with XLSR-53
Social Science Research Network (SSRN), 2022
Panji Arisaputra
Amalia Zahra
116
10
0
20 Aug 2023
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
Haohe Liu
Yiitan Yuan
Xubo Liu
Xinhao Mei
Qiuqiang Kong
Qiao Tian
Yuping Wang
Wenwu Wang
Yuxuan Wang
Mark D. Plumbley
DiffM
326
378
0
10 Aug 2023
Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition
IEEE Transactions on Affective Computing (IEEE Trans. Affective Comput.), 2023
Weidong Chen
Xiaofen Xing
Peihao Chen
Xiangmin Xu
VLM
298
65
0
20 Jul 2023
Multimodal Audio-textual Architecture for Robust Spoken Language Understanding
Anderson R. Avila
Mehdi Rezagholizadeh
Chao Xing
162
1
0
12 Jun 2023
PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
Affective Computing and Intelligent Interaction (ACII), 2023
Tiantian Feng
Shrikanth Narayanan
257
41
0
08 Jun 2023
HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders
Interspeech, 2023
Doyeon Kim
Soo-Whan Chung
Hyewon Han
Youna Ji
Hong-Goo Kang
178
12
0
02 Jun 2023
Scaling Speech Technology to 1,000+ Languages
Journal of Machine Learning Research (JMLR), 2023
Vineel Pratap
Andros Tjandra
Bowen Shi
Paden Tomasello
Arun Babu
...
Yossi Adi
Xiaohui Zhang
Wei-Ning Hsu
Alexis Conneau
Michael Auli
VLM
389
522
0
22 May 2023
Duplex Diffusion Models Improve Speech-to-Speech Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xianchao Wu
DiffM
219
5
0
22 May 2023
TrustSER: On the Trustworthiness of Fine-tuning Pre-trained Speech Embeddings For Speech Emotion Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Tiantian Feng
Rajat Hebbar
Shrikanth Narayanan
167
11
0
18 May 2023
A Survey on Time-Series Pre-Trained Models
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2023
Qianli Ma
Ziqiang Liu
Zhenjing Zheng
Ziyang Huang
Siying Zhu
Zhongzhong Yu
James T. Kwok
AI4TS
282
88
0
18 May 2023
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
Neural Networks (NN), 2022
Samir Sadok
Simon Leglaive
Laurent Girin
Xavier Alameda-Pineda
Renaud Séguier
356
21
0
05 May 2023
MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning
ACM Multimedia (ACM MM), 2023
Zheng Lian
Haiyang Sun
Guoying Zhao
Kang Chen
Mingyu Xu
...
Meng Wang
Xiaoshi Zhong
Guoying Zhao
Björn W. Schuller
Jianhua Tao
271
81
0
18 Apr 2023
HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition
Soumya Dutta
Sriram Ganapathy
354
23
0
14 Apr 2023
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2023
Yuchen Hu
Cheng Chen
Qiu-shi Zhu
Eng Siong Chng
298
18
0
11 Apr 2023
Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition
IEEE Transactions on Affective Computing (IEEE Trans. Affective Comput.), 2023
Yujin Wu
Mohamed Daoudi
A. Amad
196
77
0
29 Mar 2023