1904.05862
Cited By
v1
v2
v3
v4 (latest)
wav2vec: Unsupervised Pre-training for Speech Recognition
11 April 2019
Steffen Schneider
Alexei Baevski
R. Collobert
Michael Auli
SSL
Papers citing "wav2vec: Unsupervised Pre-training for Speech Recognition"
50 / 190 papers shown
EmoCAST: Emotional Talking Portrait via Emotive Text Description
Yiguo Jiang
Xiaodong Cun
Yong Zhang
Yudian Zheng
Fan Tang
Chi-Man Pun
DiffM
132
0
0
24 Dec 2025
Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Nimrod Berman
O. Joglekar
Eitan Kosman
Dotan Di Castro
Omri Azencot
DiffM
221
2
0
23 Oct 2025
Proprioceptive Image: An Image Representation of Proprioceptive Data from Quadruped Robots for Contact Estimation Learning
G. Abati
J. C. V. Soares
Giulio Turrisi
Victor Barasuol
Claudio Semini
121
0
0
16 Oct 2025
On the Alignment Between Supervised and Self-Supervised Contrastive Learning
Achleshwar Luthra
Priyadarsi Mishra
Tomer Galanti
SSL
171
0
0
09 Oct 2025
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Cheng-Han Chiang
Xiaofei Wang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
Shujie Liu
Zhendong Wang
Zhengyuan Yang
Hung-yi Lee
Lijuan Wang
LLMAG
ReLM
RALM
LRM
184
3
0
08 Oct 2025
AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang
Nurendra Choudhary
Jiangshu Du
Edward W. Huang
Philip S. Yu
Karthik Subbian
Danai Koutra
148
0
0
07 Oct 2025
Audio Driven Real-Time Facial Animation for Social Telepresence
Jiye Lee
Chenghui Li
Linh Tran
S. Wei
Jason M. Saragih
Alexander Richard
Hanbyul Joo
Shaojie Bai
VGen
152
0
0
01 Oct 2025
Reference-free automatic speech severity evaluation using acoustic unit language modelling
B. Halpern
Tomoki Toda
115
2
0
01 Oct 2025
StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing
Liyang Chen
Tianze Zhou
Xu He
Boshi Tang
Zhiyong Wu
Yang Huang
Yang Wu
Zhongqian Sun
Wei Yang
Helen M. Meng
DiffM
202
0
0
26 Sep 2025
KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation
Tianle Lyu
Junchuan Zhao
Ye Wang
VGen
122
0
0
24 Sep 2025
Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition
Niclas Pokel
Pehuén Moure
Roman Boehringer
Shih-Chii Liu
Yingqiang Gao
127
0
0
23 Sep 2025
SONAR: Self-Distilled Continual Pre-training for Domain Adaptive Audio Representation
Xicheng Zhang
Yuan Gao
Wangjin Zhou
Zicheng Yuan
Keisuke Imoto
Tatsuya Kawahara
CLL
113
0
0
19 Sep 2025
Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion
Shanghong Li
Chiam Wen Qi Ruth
Hong Xu
Fang Liu
111
0
0
19 Sep 2025
Speech Language Models for Under-Represented Languages: Insights from Wolof
Yaya Sy
Dioula Doucouré
Christophe Cerisara
Irina Illina
AuLLM
145
0
0
18 Sep 2025
Unified Learnable 2D Convolutional Feature Extraction for ASR
Peter Vieting
Benedikt Hilmes
Ralf Schlüter
Hermann Ney
SSL
158
0
0
12 Sep 2025
Contextualized Token Discrimination for Speech Search Query Correction
Junyu Lu
Di Jiang
Mengze Hong
Victor Junqiu Wei
Qintian Guo
Zhiyang Su
113
2
0
04 Sep 2025
Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning
Abdullah Abdelfattah
M. Khalil
Hazem M. Abbas
120
0
0
27 Aug 2025
Wan-S2V: Audio-Driven Cinematic Video Generation
Xin Gao
Li Hu
Siqi Hu
Mingyang Huang
Chaonan Ji
...
Peng Zhang
Xindi Zhang
Zhe Zhang
Jingren Zhou
Lian Zhuo
DiffM
VGen
142
20
0
26 Aug 2025
Amplifying Emotional Signals: Data-Efficient Deep Learning for Robust Speech Emotion Recognition
Tai Vu
176
0
0
26 Aug 2025
Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English
Nguyen Huu Nhat Minh
Tran Nguyen Anh
Truong Dinh Dung
Vo Van Nam
Le Pham Tuyen
89
1
0
22 Aug 2025
Foundation Models for Cross-Domain EEG Analysis Application: A Survey
Hongqi Li
Yitong Chen
Yujuan Wang
Weihang Ni
Haodong Zhang
196
2
0
21 Aug 2025
CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing
Abdul Rehman
Jian-Jun Zhang
Xiaosong Yang
130
1
0
21 Aug 2025
EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Hugo Thimonier
Antony Perzo
Renaud Seguier
145
1
0
19 Aug 2025
InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
Shaoshu Yang
Zhe Kong
Feng Gao
Meng Cheng
Xiangyu Liu
...
Zhuoliang Kang
Tong Lu
Xunliang Cai
Ran He
Xiaoming Wei
VGen
127
10
0
19 Aug 2025
HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization
Hyebin Ahn
Kangwook Jang
Hoirin Kim
101
1
0
17 Aug 2025
Class Unbiasing for Generalization in Medical Diagnosis
Lishi Zuo
Man-Wai Mak
Lu Yi
Youzhi Tu
187
0
0
09 Aug 2025
Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech
Jingyuan Xing
Zhipeng Li
Jialong Mai
Xiaofen Xing
Xiangmin Xu
215
0
0
06 Aug 2025
Multimodal Referring Segmentation: A Survey
Henghui Ding
Song Tang
Shuting He
Chang-rui Liu
Zuxuan Wu
Yu-Gang Jiang
384
11
0
01 Aug 2025
Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability
Xiaoxu Zhu
Junhua Li
Aaron J. Li
Yiming Ren
Baoxiang Li
189
0
0
19 Jul 2025
MoDA: Multi-modal Diffusion Architecture for Talking Head Generation
Xinyang Li
Gen Li
Zhihui Lin
Yichen Qian
Gongxin Yao
Weinan Jia
Aowen Wang
Weihua Chen
Fan Wang
DiffM
VGen
282
0
0
04 Jul 2025
Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding
Duc Cao-Dinh
Khai Le-Duc
Anh Dao
Bach Phan Tat
Chris Ngo
Duy M. H. Nguyen
Nguyen X. Khanh
Thanh Nguyen-Tang
226
0
0
01 Jul 2025
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Gaojie Lin
Jianwen Jiang
Jiaqi Yang
Zerong Zheng
Chao Liang
DiffM
VGen
1.3K
85
0
01 Jul 2025
Manipulated Regions Localization For Partially Deepfake Audio: A Survey
Jiayi He
Jiangyan Yi
Jianhua Tao
Siding Zeng
Hao Gu
193
2
0
17 Jun 2025
AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models
Chih-Kai Yang
Neo Ho
Yi-Jyun Lee
Hung-yi Lee
AuLLM
373
4
0
05 Jun 2025
SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction
International Conference on Signal Processing and Communications (ICSPC), 2024
Saurabh Agrawal
Raj Gohil
Gopal Kumar Agrawal
Vikram C M
Kushal Verma
150
1
0
02 Jun 2025
Revisiting SSL for sound event detection: complementary fusion and adaptive post-processing
Journal of King Saud University: Computer and Information Sciences (J. King Saud Univ. Comput. Inf. Sci.), 2025
Hanfang Cui
Longfei Song
Li Li
Dongxing Xu
Yanhua Long
346
0
0
17 May 2025
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
J. Choi
Ji-Hoon Kim
Kim Sung-Bin
Tae-Hyun Oh
Joon Son Chung
DiffM
457
1
0
29 Apr 2025
StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Yeona Hong
Hyewon Han
Woo-Jin Chung
Hong-Goo Kang
MQ
342
0
0
21 Apr 2025
DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
Chengxuan Qian
Shuo Xing
Shawn Li
Yue Zhao
Zhengzhong Tu
328
11
0
14 Mar 2025
Dimitra: Audio-driven Diffusion model for Expressive Talking Head Generation
Baptiste Chopin
Tashvik Dhamija
P. Balaji
Yaohui Wang
A. Dantcheva
DiffM
VGen
285
3
0
24 Feb 2025
Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models
Taj Jones-McCormick
Aukosh Jagannath
S. Sen
405
2
0
24 Feb 2025
On the Robust Approximation of ASR Metrics
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Abdul Waheed
Hanin Atwany
Rita Singh
Bhiksha Raj
315
2
0
18 Feb 2025
Evaluation of Deep Audio Representations for Hearables
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Fabian Gröger
Pascal Baumann
Ludovic Amruthalingam
Laurent Simon
Ruksana Giurda
Simone Lionetti
364
1
0
10 Feb 2025
WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Rajath Rao
Adithya Ganesan
Oscar Kjell
Jonah Luby
Akshay Raghavan
...
B. Luft
Camilo Ruggero
Neville Ryant
R. Kotov
H. Andrew Schwartz
460
2
0
15 Jan 2025
FAST: Fast Audio Spectrogram Transformer
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Anugunj Naman
Gaibo Zhang
144
2
0
03 Jan 2025
Memory-Centric Computing: Recent Advances in Processing-in-DRAM
O. Mutlu
Ataberk Olgun
Geraldo F. Oliveira
Ismail Emir Yüksel
321
11
0
26 Dec 2024
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer
Computer Vision and Pattern Recognition (CVPR), 2024
Jiahao Cui
Hui Li
Yun Zhan
Hanlin Shang
K. Cheng
Yuqi Ma
Shan Mu
Hang Zhou
Jingdong Wang
Siyu Zhu
ViT
VGen
545
78
0
01 Dec 2024
Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques
Applied Soft Computing (Appl. Soft Comput.), 2024
David Ortiz-Perez
Manuel Benavent-Lledo
José García Rodríguez
David Tomás
M. Flores Vizcaya-Moreno
231
3
0
24 Oct 2024
Detecting Adversarial Examples
Furkan Mumcu
Yasin Yilmaz
AAML
260
4
0
22 Oct 2024
Beyond Fixed Topologies: Unregistered Training and Comprehensive Evaluation Metrics for 3D Talking Heads
Federico Nocentini
T. Besnier
Claudio Ferrari
Sylvain Arguillere
Stefano Berretti
Mohamed Daoudi
365
2
0
14 Oct 2024