2212.09058
BEATs: Audio Pre-Training with Acoustic Tokenizers
18 December 2022
Sanyuan Chen
Yu-Huan Wu
Chengyi Wang
Shujie Liu
Daniel C. Tompkins
Zhuo Chen
Furu Wei
Papers citing "BEATs: Audio Pre-Training with Acoustic Tokenizers"
41 / 41 papers shown
Title
Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis
Daisuke Niizumi
Daiki Takeuchi
Masahiro Yasuda
Binh Thien Nguyen
Yasunori Ohishi
N. Harada
27
0
0
25 Apr 2025
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey
R. Swaminathan
K V Vijay Girish
Arunasish Sen
Jian Xie
Grant P. Strimel
Andreas Schwarz
43
0
0
12 Apr 2025
Formula-Supervised Sound Event Detection: Pre-Training Without Real Data
Yuto Shibata
Keitaro Tanaka
Yoshiaki Bando
Keisuke Imoto
Hirokatsu Kataoka
Yoshimitsu Aoki
26
0
0
06 Apr 2025
Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study
Liying Han
Gaofeng Dong
Xiaomin Ouyang
Lance M. Kaplan
Federico Cerutti
Mani B. Srivastava
50
0
0
15 Mar 2025
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
Antoni Bigata
Michał Stypułkowski
Rodrigo Mira
Stella Bounareli
Konstantinos Vougioukas
Zoe Landgraf
Nikita Drobyshev
Maciej Zięba
Stavros Petridis
M. Pantic
DiffM
VGen
63
2
0
03 Mar 2025
How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations
Hyunji Lee
Danni Liu
Supriti Sinhamahapatra
Jan Niehues
103
0
0
21 Feb 2025
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Junyi Ao
Yuancheng Wang
Xiaohai Tian
Dekun Chen
J. Zhang
Lu Lu
Y. Wang
Haizhou Li
Z. Wu
AuLLM
75
16
0
17 Jan 2025
HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids
Dyah A. M. G. Wisnu
Stefano Rini
Ryandhimas E. Zezario
Hsin-Min Wang
Yu Tsao
49
0
0
10 Jan 2025
Do Language Models Understand Time?
Xi Ding
Lei Wang
162
0
0
18 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
78
0
0
16 Dec 2024
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
Yichen He
Yuan Lin
Jianchao Wu
Hanchong Zhang
Yuchen Zhang
Ruicheng Le
VGen
VLM
51
2
0
11 Nov 2024
Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
Han Yin
Yang Xiao
Jisheng Bai
Rohan Kumar Das
31
0
0
02 Nov 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
29
2
0
23 Oct 2024
Recent Advances in Speech Language Models: A Survey
Wenqian Cui
Dianzhi Yu
Xiaoqi Jiao
Ziqiao Meng
Guangyan Zhang
Qichao Wang
Yiwen Guo
Irwin King
AuLLM
59
14
0
01 Oct 2024
Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
Siyin Wang
Wenyi Yu
Yudong Yang
Changli Tang
Yixuan Li
...
Jun Zhang
Guangzhi Sun
Lu Lu
Yuxuan Wang
Chao Zhang
AuLLM
LM&MA
65
5
0
25 Sep 2024
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
Lam Pham
Phat Lam
Dat Tran
Hieu Tang
Tin Nguyen
Alexander Schindler
Canh Vu
Alexander Polonsky
Canh Vu
46
3
0
23 Sep 2024
What Are They Doing? Joint Audio-Speech Co-Reasoning
Yingzhi Wang
Pooneh Mousavi
Artem Ploujnikov
Mirco Ravanelli
AuLLM
44
0
0
22 Sep 2024
Effective Pre-Training of Audio Transformers for Sound Event Detection
Florian Schmid
T. Morocutti
Francesco Foscarin
Jan Schluter
Paul Primus
Gerhard Widmer
ViT
18
1
0
14 Sep 2024
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
W. Zhang
Shuo Sun
Bin Wang
Xunlong Zou
Zhuohan Liu
Yingxu He
Geyu Lin
Nancy F. Chen
A. Aw
AuLLM
65
1
0
10 Sep 2024
ICSD: An Open-source Dataset for Infant Cry and Snoring Detection
Qingyu Liu
Longfei Song
Dongxing Xu
Yanhua Long
37
0
0
20 Aug 2024
Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training
Florian Schmid
Paul Primus
T. Morocutti
Jonathan Greif
Gerhard Widmer
19
5
0
17 Jul 2024
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serra
26
2
0
08 Jul 2024
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Anbai Jiang
Bing Han
Zhiqiang Lv
Yufeng Deng
Wei-Qiang Zhang
Xie Chen
Yanmin Qian
Jia Liu
Pingyi Fan
19
3
0
17 Jun 2024
GameVibe: A Multimodal Affective Game Corpus
M. Barthet
Maria Kaselimi
Kosmas Pinitas
Konstantinos Makantasis
Antonios Liapis
Georgios N. Yannakakis
20
3
0
17 Jun 2024
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Yunxin Li
Shenyuan Jiang
Baotian Hu
Longyue Wang
Wanqi Zhong
Wenhan Luo
Lin Ma
Min-Ling Zhang
MoE
30
27
0
18 May 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
72
4
0
08 Feb 2024
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang
Wenyi Yu
Guangzhi Sun
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Chao Zhang
LM&MA
AuLLM
28
195
0
20 Oct 2023
Cross-modal Cognitive Consensus guided Audio-Visual Segmentation
Zhaofeng Shi
Qingbo Wu
Fanman Meng
Linfeng Xu
Hongliang Li
VOS
16
3
0
10 Oct 2023
Efficient Supervised Training of Audio Transformers for Music Representation Learning
Pablo Alonso-Jiménez
Xavier Serra
Dmitry Bogdanov
ViT
13
3
0
28 Sep 2023
Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
Youbin Jeon
Yanzhen Ren
VLM
17
0
0
28 Sep 2023
Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation
Xilin Jiang
Cong Han
Yinghao Aaron Li
N. Mesgarani
SSL
8
1
0
27 Sep 2023
AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
Zhaohui Li
Haitao Wang
Xinghua Jiang
24
1
0
14 Aug 2023
Noise-aware Speech Enhancement using Diffusion Probabilistic Model
Yuchen Hu
Cheng Chen
Ruizhe Li
Qiu-shi Zhu
E. Chng
DiffM
8
9
0
16 Jul 2023
Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
25
155
0
19 May 2023
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
Ke Chen
Xingjian Du
Bilei Zhu
Zejun Ma
Taylor Berg-Kirkpatrick
Shlomo Dubnov
ViT
114
262
0
02 Feb 2022
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
258
7,337
0
11 Nov 2021
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
253
4,735
0
24 Feb 2021
PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation
Yuan Gong
Yu-An Chung
James R. Glass
VLM
99
144
0
02 Feb 2021
Interspeech 2021 Deep Noise Suppression Challenge
Chandan K. A. Reddy
Harishchandra Dubey
K. Koishida
A. Nair
Vishak Gopal
Ross Cutler
Sebastian Braun
H. Gamper
R. Aichner
Sriram Srinivasan
AI4CE
72
160
0
06 Jan 2021
CLAR: Contrastive Learning of Auditory Representations
Haider Al-Tahan
Y. Mohsenzadeh
SSL
108
55
0
19 Oct 2020
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
228
29,632
0
16 Jan 2013