arXiv: 2202.03555

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

International Conference on Machine Learning (ICML), 2022

7 February 2022

Papers citing "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language"

50 / 609 papers shown

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024

Shinji Watanabe

283

6

0

07 Dec 2024

Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation

...

Shanghang Zhang

420

41

0

27 Nov 2024

Image Generation Diversity Issues and How to Tame Them

Computer Vision and Pattern Recognition (CVPR), 2024

Mischa Dombrowski

Sarah Cechnicka

Hadrien Reynaud

322

11

0

25 Nov 2024

Everything is a Video: Unifying Modalities through Next-Frame Prediction

T. Winterbottom

Noura Al Moubayed

279

2

0

15 Nov 2024

ParaLBench: A Large-Scale Benchmark for Computational Paralinguistics over Acoustic Foundation Models

IEEE Transactions on Affective Computing (IEEE Trans. Affective Comput.), 2024

102

8

0

14 Nov 2024

Speech Separation with Pretrained Frontend to Minimize Domain Mismatch

IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2024

285

11

0

05 Nov 2024

Music Foundation Model as Generic Booster for Music Downstream Tasks

...

Yuichiro Koyama

530

6

0

02 Nov 2024

DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models

314

5

0

31 Oct 2024

Sparsh: Self-supervised touch representations for vision-based tactile sensing

Conference on Robot Learning (CoRL), 2024

Carolina Higuera

Chaithanya Krishna Bodduluri

Patrick E. Lancaster

...

Mustafa Mukadam

270

47

0

31 Oct 2024

Enhancing TTS Stability in Hebrew using Discrete Semantic Units

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

172

3

0

28 Oct 2024

Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Neural Information Processing Systems (NeurIPS), 2024

313

6

0

25 Oct 2024

AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Carlos Carvalho

215

1

0

18 Oct 2024

Self-supervised contrastive learning performs non-linear system identification

International Conference on Learning Representations (ICLR), 2024

Rodrigo González Laiz

Steffen Schneider

291

4

0

18 Oct 2024

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Ramaneswaran Selvakumar

248

4

0

17 Oct 2024

Investigation of Speaker Representation for Target-Speaker Speech Processing

Spoken Language Technology Workshop (SLT), 2024

Takanori Ashihara

Takafumi Moriya

Shota Horiguchi

231

2

0

15 Oct 2024

JOOCI: a Framework for Learning Comprehensive Speech Representations

Sunayana Sitaram

325

0

0

14 Oct 2024

Adaptive Diffusion Terrain Generator for Autonomous Uneven Terrain Navigation

Conference on Robot Learning (CoRL), 2024

160

0

0

14 Oct 2024

Locality Alignment Improves Vision-Language Models

International Conference on Learning Representations (ICLR), 2024

Tatsunori Hashimoto

592

11

0

14 Oct 2024

Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive Architecture

Sehun Kim

205

7

0

11 Oct 2024

Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

Surya Koppisetti

170

2

0

09 Oct 2024

Forte : Finding Outliers with Representation Typicality Estimation

International Conference on Learning Representations (ICLR), 2024

Debargha Ganguly

Warren Morningstar

Vipin Chaudhary

266

4

0

02 Oct 2024

Denoising with a Joint-Embedding Predictive Architecture

International Conference on Learning Representations (ICLR), 2024

Dengsheng Chen

Jie Hu

Enhua Wu

482

5

0

02 Oct 2024

You Only Speak Once to See

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Wenhao Yang

Wenhuan Lu

Lei Li

228

4

0

27 Sep 2024

Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization

314

3

0

25 Sep 2024

Point-PNG: Conditional Pseudo-Negatives Generation for Point Cloud Pre-Training

Sutharsan Mahendren

Saimunur Rahman

Tharindu Fernando

Sridha Sridharan

Peyman Moghadam

367

0

0

24 Sep 2024

CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification

Ladislav Mošner

Themos Stafylakis

Lukáš Burget

Jan Černocký

176

4

0

23 Sep 2024

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

International Conference on Speech and Computer (SPECOM), 2024

Nikola Ljubesic

Danijel Koržinek

228

5

0

23 Sep 2024

Is Tokenization Needed for Masked Particle Modelling?

Matthew Leigh

Samuel Klein

François Charton

Tobias Golling

Michael Kagan

Margarita Osadchy

238

18

0

19 Sep 2024

Self-supervised Speech Models for Word-Level Stuttered Speech Detection

Spoken Language Technology Workshop (SLT), 2024

Yi-Jen Shih

David Harwath

243

6

0

16 Sep 2024

Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Huang-Cheng Chou

422

3

0

16 Sep 2024

NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training

Youjia Huang

Mingkun Huang

Yuxuan Wang

223

2

0

13 Sep 2024

Connecting Concept Convexity and Human-Machine Alignment in Deep Neural Networks

Teresa Dorszewski

Lenka Tětková

Lorenz Linhardt

Lars Kai Hansen

235

1

0

10 Sep 2024

A Survey of the Self Supervised Learning Mechanisms for Vision Transformers

Tariq Habib Afridi

...

Muhammad Zaigham Zaheer

Tangina Sultana

Ziaurrehman Tanoli

949

12

0

30 Aug 2024

SSDM: Scalable Speech Dysfluency Modeling

Neural Information Processing Systems (NeurIPS), 2024

Xuanru Zhou

Gopala Anumanchipalli

291

19

0

29 Aug 2024

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

International Conference on Learning Representations (ICLR), 2024

...

Rongjie Huang

Yidi Jiang

Qian Chen

Zhou Zhao

405

122

0

29 Aug 2024

GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis

Yijie Jin

214

3

0

27 Aug 2024

NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Taejin Park

Kunal Dhawan

Ivan Medennikov

Krishna Puvvada

Nithin Rao Koluguri

Jagadeesh Balam

Boris Ginsburg

328

4

0

23 Aug 2024

BUT Systems and Analyses for the ASVspoof 5 Challenge

Oldřich Plchot

Vojtěch Staněk

...

Themos Stafylakis

Dmitriy Beveraki

Lukáš Burget

181

8

0

20 Aug 2024

mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design

AAAI Conference on Artificial Intelligence (AAAI), 2024

191

5

0

16 Aug 2024

SpectralEarth: Training Hyperspectral Foundation Models at Scale

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE JSTARS), 2024

Nassim Ait Ali Braham

316

33

0

15 Aug 2024

Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation

International Society for Music Information Retrieval Conference (ISMIR), 2024

Gaëtan Hadjeres

Geoffroy Peeters

262

6

0

05 Aug 2024

Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

302

11

0

31 Jul 2024

Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances

Julia Hirschberg

390

37

0

31 Jul 2024

SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Surya Koppisetti

405

22

0

26 Jul 2024

Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

206

14

0

22 Jul 2024

Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning

Shuai Wang

Zheng-Shou Chen

Kong Aik Lee

344

23

0

21 Jul 2024

Linear-Complexity Self-Supervised Learning for Speech Processing

Shucong Zhang

Titouan Parcollet

Rogier van Dalen

Sourav Bhattacharya

278

1

0

18 Jul 2024

ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders

Carlos Hinojosa

248

8

0

17 Jul 2024

A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification

Neehar Kondapaneni

Fernando Pérez-Cruz

402

15

0

16 Jul 2024

Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing

Ioannis Maniadis Metaxas

Georgios Tzimiropoulos

307

2

0

15 Jul 2024
