data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

International Conference on Machine Learning (ICML), 2022
7 February 2022
Alexei Baevski
Wei-Ning Hsu
Qiantong Xu
Arun Babu
Jiatao Gu
Michael Auli
SSL, VLM, ViT

Papers citing "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language"

50 / 605 papers shown
Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach
Huu Tuong Tu
Ha Viet Khanh
Tran Tien Dat
Vu Huan
Thien Van Luong
Nguyen Tien Cuong
Nguyen Thi Thu Trang
76
0
0
25 Nov 2025
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Wei-Cheng Tseng
Xuanru Zhou
Mingyue Huo
Yiwen Shao
Hao Zhang
Dong Yu
CLIP, AI4TS, VLM
100
0
0
20 Nov 2025
Unifying Model and Layer Fusion for Speech Foundation Models
Yi-Jen Shih
David Harwath
MoMe
224
0
0
11 Nov 2025
Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
Ziliang Chen
Tianang Xiao
Jusheng Zhang
Yongsen Zheng
Xipeng Chen
CLIP
80
0
0
30 Oct 2025
Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning
Suman Sanyal
SSL
234
0
0
28 Oct 2025
SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling
Samuel J. Barrett
Docko Sow
64
0
0
21 Oct 2025
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
Wenxi Chen
X. Wang
Ruiqi Yan
Yihao Chen
Zhikang Niu
...
Yuzhe Liang
Hanlin Wen
Shunshun Yin
Ming Tao
Xie Chen
108
1
0
19 Oct 2025
Unifying Vision-Language Latents for Zero-label Image Caption Enhancement
Sanghyun Byun
Jung Guack
Mohanad Odema
Baisub Lee
Jacob Song
Woo Seong Chung
VLM
63
0
0
14 Oct 2025
A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG
Emilio Estevan
María Sierra-Torralba
Eduardo López-Larraz
Luis Montesano
102
0
0
09 Oct 2025
On the Alignment Between Supervised and Self-Supervised Contrastive Learning
Achleshwar Luthra
Priyadarsi Mishra
Tomer Galanti
SSL
135
0
0
09 Oct 2025
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation
Vaibhav Srivastav
Steven Zheng
Eric Bezzam
Eustache Le Bihan
Nithin Rao Koluguri
Piotr Żelasko
164
0
0
08 Oct 2025
Alternatives To Next Token Prediction In Text Generation - A Survey
Charlie Wyatt
Aditya Joshi
Flora D. Salim
80
0
0
29 Sep 2025
Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification
Lukas Rauch
René Heinrich
Houtan Ghaffari
Lukas Miklautz
Ilyass Moummad
Bernhard Sick
Christoph Scholz
241
1
0
29 Sep 2025
WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms
Goksenin Yuksel
Pierre Guetschel
Michael Tangermann
Marcel van Gerven
Kiki van der Heijden
AI4TS
108
0
0
27 Sep 2025
An overview of neural architectures for self-supervised audio representation learning from masked spectrograms
Sarthak Yadav
Sergios Theodoridis
Zheng-Hua Tan
Mamba
147
0
0
23 Sep 2025
HARNESS: Lightweight Distilled Arabic Speech Foundation Models
Vrunda N. Sukhadia
Shammur A. Chowdhury
105
0
0
18 Sep 2025
Label-Efficient Grasp Joint Prediction with Point-JEPA
Jed Guzelkabaagac
Boris Petrović
3DPC
127
0
0
13 Sep 2025
DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition
Yifei Wang
Wenbin Wang
Yong Luo
72
0
0
12 Sep 2025
Deep Learning for Tuberculosis Screening in a High-burden Setting using Cough Analysis and Speech Foundation Models
Ning Ma
Bahman Mirheidari
Guy J. Brown
N. Sanjase
Solomon Chifwamba
Seke Muzazu
Monde Muyoyeta
Mary Kagujje
LM&MA
139
0
0
11 Sep 2025
LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
Hai Huang
Yann LeCun
Randall Balestriero
151
2
0
11 Sep 2025
Segment Transformer: AI-Generated Music Detection via Music Structural Analysis
Yumin Kim
Seonghyeon Go
68
0
0
10 Sep 2025
Diffusion-Based Action Recognition Generalizes to Untrained Domains
Rogério Guimarães
Frank Xiao
Pietro Perona
Markus Marks
221
0
0
10 Sep 2025
Mitigating Data Imbalance in Automated Speaking Assessment
Fong-Chun Tsai
Kuan-Tang Huang
Bi-Cheng Yan
Tien-Hong Lo
Berlin Chen
84
0
0
03 Sep 2025
Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models
Pattern Recognition Letters (Pattern Recogn. Lett.), 2025
Subham Kutum
Abhijit Sinha
H. Kathania
Sudarsana Reddy Kadiri
Mahesh Chandra Govil
64
1
0
28 Aug 2025
Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech?
IEEE Signal Processing Letters (IEEE SPL), 2025
Abhijit Sinha
H. Kathania
Sudarsana Reddy Kadiri
Shrikanth Narayanan
64
0
0
28 Aug 2025
From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
Anthony Bisulco
Rahul Ramesh
Randall Balestriero
Pratik Chaudhari
94
0
0
21 Aug 2025
MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning
Aurian Quélennec
Pierre Chouteau
Geoffroy Peeters
S. Essid
128
0
0
18 Aug 2025
Learn Faster and Remember More: Balancing Exploration and Exploitation for Continual Test-time Adaptation
Pinci Yang
Peisong Wen
Ke Ma
Qianqian Xu
CLL, TTA
194
0
0
18 Aug 2025
HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization
Hyebin Ahn
Kangwook Jang
Hoirin Kim
64
1
0
17 Aug 2025
RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
Suhang Hu
Wei Hu
Yuhang Su
Fan Zhang
ReLM, LRM, VLM
216
0
0
17 Aug 2025
VARAN: Variational Inference for Self-Supervised Speech Models Fine-Tuning on Downstream Tasks
Daria Diatlova
Nikita Balagansky
Alexander Varlamov
Egor Spirin
DRL
152
0
0
16 Aug 2025
Benchmarking Prosody Encoding in Discrete Speech Tokens
Kentaro Onda
Satoru Fukayama
Daisuke Saito
Nobuaki Minematsu
60
1
0
15 Aug 2025
Emphasis Sensitivity in Speech Representations
Shaun Cassini
Thomas Hain
Anton Ragni
76
0
0
15 Aug 2025
S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision
Huihui Xu
Jin Ye
Hongqiu Wang
Changkai Ji
Jiashi Lin
...
Chenglong Ma
Tianbin Li
Lihao Liu
Junjun He
Lei Zhu
134
0
0
09 Aug 2025
PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
Transactions of the International Society for Music Information Retrieval (TISMIR), 2025
Alain Riou
Bernardo Torres
Ben Hayes
Stefan Lattner
Gaëtan Hadjeres
Gaël Richard
Geoffroy Peeters
196
2
0
02 Aug 2025
Foundation Models for Bioacoustics -- a Comparative Review
Raphael Schwinger
Paria Vali Zadeh
Lukas Rauch
Mats Kurz
Tom Hauschild
Sam Lapp
Sven Tomforde
VLM
97
1
0
02 Aug 2025
MINR: Implicit Neural Representations with Masked Image Modelling
Sua Lee
Joonhun Lee
Myungjoo Kang
103
1
0
30 Jul 2025
FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation
Pingyi Fan
Anbai Jiang
Shuwei Zhang
Zhiqiang Lv
Bing Han
...
Wei Zhang
Yanmin Qian
Xie Chen
Cheng Lu
Jia Liu
97
1
0
22 Jul 2025
Decoding Translation-Related Functional Sequences in 5'UTRs Using Interpretable Deep Learning Models
Yuxi Lin
Yaxue Fang
Zehong Zhang
Zhouwu Liu
Siyun Zhong
Fulong Yu
80
0
0
22 Jul 2025
Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition
Mengzhe Geng
Patrick Littell
Aidan Pine
PENÁĆ
Marc Tessier
Roland Kuhn
108
0
0
14 Jul 2025
USAD: Universal Speech and Audio Representation via Distillation
Heng-Jui Chang
Saurabhchand Bhati
James R. Glass
Alexander H. Liu
247
2
0
23 Jun 2025
Discrete JEPA: Learning Discrete Token Representations without Reconstruction
Junyeob Baek
Hosung Lee
Christopher Hoang
Mengye Ren
Sungjin Ahn
187
0
0
17 Jun 2025
SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes
International Conference on Learning Representations (ICLR), 2025
Tony Alex
S. Ahmed
A. Mustafa
Muhammad Awais
Philip J. B. Jackson
141
7
0
13 Jun 2025
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation
Yanlong Chen
Mattia Orlandi
Pierangelo Maria Rapa
Simone Benatti
Luca Benini
Yawei Li
361
0
0
12 Jun 2025
Vision Generalist Model: A Survey
International Journal of Computer Vision (IJCV), 2025
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
257
0
0
11 Jun 2025
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
IEEE International Conference on Robotics and Automation (ICRA), 2025
Yihe Tang
Wenlong Huang
Yingke Wang
Chengshu Li
Roy Yuan
Ruohan Zhang
Jiajun Wu
Li Fei-Fei
236
12
0
10 Jun 2025
Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech
Jingyu Li
Lingchao Mao
Hairong Wang
Zhendong Wang
Xi Mao
Xuelei Sherry Ni
102
0
0
09 Jun 2025
MoCA: Multi-modal Cross-masked Autoencoder for Digital Health Measurements
Howon Ryu
Y. Chen
Yacun Wang
Andrea Z. LaCroix
Chongzhi Di
L. Natarajan
Yu Wang
Jingjing Zou
250
0
0
02 Jun 2025
GigaAM: Efficient Self-Supervised Learner for Speech Recognition
Aleksandr Kutsakov
Alexandr Maximenko
Georgii Gospodinov
Pavel Bogomolov
Fyodor Minkin
177
0
0
01 Jun 2025
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
Sarthak Kumar Maharana
Saksham Singh Kushwaha
Baoming Zhang
Adrian Rodriguez
Songtao Wei
Yapeng Tian
Yunhui Guo
TTA, VLM
201
0
0
31 May 2025