
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

International Conference on Machine Learning (ICML), 2022

7 February 2022

Papers citing "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language"

50 / 605 papers shown

Title
Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach Huu Tuong Tu Ha Viet Khanh Tran Tien Dat Vu Huan Thien Van Luong Nguyen Tien Cuong Nguyen Thi Thu Trang 88 0 0 25 Nov 2025
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation Wei-Cheng Tseng Xuanru Zhou Mingyue Huo Yiwen Shao Hao Zhang Dong Yu CLIP AI4TS VLM 112 0 0 20 Nov 2025
Unifying Model and Layer Fusion for Speech Foundation Models Yi-Jen Shih David Harwath MoMe 232 0 0 11 Nov 2025
Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens Ziliang Chen Tianang Xiao Jusheng Zhang Yongsen Zheng Xipeng Chen CLIP 80 0 0 30 Oct 2025
Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning Suman Sanyal SSL 254 0 0 28 Oct 2025
SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling Samuel J. Barrett Docko Sow 76 0 0 21 Oct 2025
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization Wenxi Chen X. Wang Ruiqi Yan Yihao Chen Zhikang Niu ... Yuzhe Liang Hanlin Wen Shunshun Yin Ming Tao Xie Chen 112 1 0 19 Oct 2025
Unifying Vision-Language Latents for Zero-label Image Caption Enhancement Sanghyun Byun Jung Guack Mohanad Odema Baisub Lee Jacob Song Woo Seong Chung VLM 67 0 0 14 Oct 2025
A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG Emilio Estevan María Sierra-Torralba Eduardo López-Larraz Luis Montesano 102 0 0 09 Oct 2025
On the Alignment Between Supervised and Self-Supervised Contrastive Learning Achleshwar Luthra Priyadarsi Mishra Tomer Galanti SSL 151 0 0 09 Oct 2025
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation Vaibhav Srivastav Steven Zheng Eric Bezzam Eustache Le Bihan Nithin Rao Koluguri Piotr Żelasko 168 0 0 08 Oct 2025
Alternatives To Next Token Prediction In Text Generation - A Survey Charlie Wyatt Aditya Joshi Flora D. Salim 84 0 0 29 Sep 2025
Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification Lukas Rauch René Heinrich Houtan Ghaffari Lukas Miklautz Ilyass Moummad Bernhard Sick Christoph Scholz 245 1 0 29 Sep 2025
WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms Goksenin Yuksel Pierre Guetschel Michael Tangermann Marcel van Gerven Kiki van der Heijden AI4TS 108 0 0 27 Sep 2025
An overview of neural architectures for self-supervised audio representation learning from masked spectrograms Sarthak Yadav Sergios Theodoridis Zheng-Hua Tan Mamba 163 0 0 23 Sep 2025
HARNESS: Lightweight Distilled Arabic Speech Foundation Models Vrunda N. Sukhadia Shammur A. Chowdhury 113 0 0 18 Sep 2025
Label-Efficient Grasp Joint Prediction with Point-JEPA Jed Guzelkabaagac Boris Petrović 3DPC 127 0 0 13 Sep 2025
DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition Yifei Wang Wenbin Wang Yong Luo 72 0 0 12 Sep 2025
Deep Learning for Tuberculosis Screening in a High-burden Setting using Cough Analysis and Speech Foundation Models Ning Ma Bahman Mirheidari Guy J. Brown N. Sanjase N. Sanjase Solomon Chifwamba Seke Muzazu Monde Muyoyeta Mary Kagujje LM&MA 139 0 0 11 Sep 2025
LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures Hai Huang Yann LeCun Randall Balestriero 155 3 0 11 Sep 2025
Segment Transformer: AI-Generated Music Detection via Music Structural Analysis Yumin Kim Seonghyeon Go 72 0 0 10 Sep 2025
Diffusion-Based Action Recognition Generalizes to Untrained Domains Rogério Guimarães Frank Xiao Pietro Perona Markus Marks 229 0 0 10 Sep 2025
Mitigating Data Imbalance in Automated Speaking Assessment Fong-Chun Tsai Kuan-Tang Huang Bi-Cheng Yan Tien-Hong Lo Berlin Chen 92 0 0 03 Sep 2025
Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models Pattern Recognition Letters (Pattern Recogn. Lett.), 2025 Subham Kutum Abhijit Sinha H. Kathania Sudarsana Reddy Kadiri Mahesh Chandra Govil 64 1 0 28 Aug 2025
Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech? IEEE Signal Processing Letters (IEEE SPL), 2025 Abhijit Sinha H. Kathania Sudarsana Reddy Kadiri Shrikanth Narayanan 68 0 0 28 Aug 2025
From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations Anthony Bisulco Rahul Ramesh Randall Balestriero Pratik Chaudhari 94 0 0 21 Aug 2025
MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning Aurian Quélennec Pierre Chouteau Geoffroy Peeters S. Essid 140 0 0 18 Aug 2025
Learn Faster and Remember More: Balancing Exploration and Exploitation for Continual Test-time Adaptation Pinci Yang Peisong Wen Ke Ma Qianqian Xu CLL TTA 214 0 0 18 Aug 2025
HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization Hyebin Ahn Kangwook Jang Hoirin Kim 72 1 0 17 Aug 2025
RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning Suhang Hu Wei Hu Yuhang Su Fan Zhang ReLM LRM VLM 224 0 0 17 Aug 2025
VARAN: Variational Inference for Self-Supervised Speech Models Fine-Tuning on Downstream Tasks Daria Diatlova Nikita Balagansky Alexander Varlamov Egor Spirin DRL 168 0 0 16 Aug 2025
Benchmarking Prosody Encoding in Discrete Speech Tokens Kentaro Onda Satoru Fukayama Daisuke Saito Nobuaki Minematsu 64 1 0 15 Aug 2025
Emphasis Sensitivity in Speech Representations Shaun Cassini Thomas Hain Anton Ragni 76 0 0 15 Aug 2025
S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision Huihui Xu Jin Ye Hongqiu Wang Changkai Ji Jiashi Lin ... Chenglong Ma Tianbin Li Lihao Liu Junjun He Lei Zhu 147 0 0 09 Aug 2025
PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective Transactions of the International Society for Music Information Retrieval (TISMIR), 2025 Alain Riou Bernardo Torres Ben Hayes Stefan Lattner Gaëtan Hadjeres Gaël Richard Geoffroy Peeters 200 3 0 02 Aug 2025
Foundation Models for Bioacoustics -- a Comparative Review Raphael Schwinger Paria Vali Zadeh Lukas Rauch Mats Kurz Tom Hauschild Sam Lapp Sven Tomforde VLM 97 1 0 02 Aug 2025
MINR: Implicit Neural Representations with Masked Image Modelling Sua Lee Joonhun Lee Myungjoo Kang 103 1 0 30 Jul 2025
FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation Pingyi Fan Anbai Jiang Shuwei Zhang Zhiqiang Lv Bing Han ... Wei Zhang Yanmin Qian Xie Chen Cheng Lu Jia Liu 105 1 0 22 Jul 2025
Decoding Translation-Related Functional Sequences in 5'UTRs Using Interpretable Deep Learning Models Yuxi Lin Yaxue Fang Zehong Zhang Zhouwu Liu Siyun Zhong Fulong Yu 88 0 0 22 Jul 2025
Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Mengzhe Geng Patrick Littell Aidan Pine PENÁĆ Marc Tessier Roland Kuhn 108 0 0 14 Jul 2025
USAD: Universal Speech and Audio Representation via Distillation Heng-Jui Chang Saurabhchand Bhati James R. Glass Alexander H. Liu 251 2 0 23 Jun 2025
Discrete JEPA: Learning Discrete Token Representations without Reconstruction Junyeob Baek Hosung Lee Christopher Hoang Mengye Ren Sungjin Ahn 195 0 0 17 Jun 2025
SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes International Conference on Learning Representations (ICLR), 2025 Tony Alex S. Ahmed A. Mustafa Muhammad Awais Philip J. B. Jackson 141 7 0 13 Jun 2025
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation Yanlong Chen Mattia Orlandi Pierangelo Maria Rapa Simone Benatti Luca Benini Yawei Li 361 1 0 12 Jun 2025
Vision Generalist Model: A Survey International Journal of Computer Vision (IJCV), 2025 Ziyi Wang Yongming Rao Shuofeng Sun Xinrun Liu Yi Wei ... Zuyan Liu Yanbo Wang Hongmin Liu Jie Zhou Jiwen Lu 261 0 0 11 Jun 2025
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation IEEE International Conference on Robotics and Automation (ICRA), 2025 Yihe Tang Wenlong Huang Yingke Wang Chengshu Li Roy Yuan Ruohan Zhang Jiajun Wu Li Fei-Fei 240 12 0 10 Jun 2025
Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech Jingyu Li Lingchao Mao Hairong Wang Zhendong Wang Xi Mao Xuelei Sherry Ni 102 0 0 09 Jun 2025
MoCA: Multi-modal Cross-masked Autoencoder for Digital Health Measurements Howon Ryu Y. Chen Yacun Wang Andrea Z. LaCroix Chongzhi Di L. Natarajan Yu Wang Jingjing Zou 250 0 0 02 Jun 2025
GigaAM: Efficient Self-Supervised Learner for Speech Recognition Aleksandr Kutsakov Alexandr Maximenko Georgii Gospodinov Pavel Bogomolov Fyodor Minkin 177 0 0 01 Jun 2025
$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time Sarthak Kumar Maharana Saksham Singh Kushwaha Baoming Zhang Adrian Rodriguez Songtao Wei Yapeng Tian Yunhui Guo TTA VLM 209 0 0 31 May 2025