ResearchTrend.AI
AST: Audio Spectrogram Transformer

5 April 2021
Yuan Gong
Yu-An Chung
James R. Glass
    ViT
arXiv: 2104.01778

Papers citing "AST: Audio Spectrogram Transformer"

50 / 142 papers shown
Title
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Paul Primus
Florian Schmid
Gerhard Widmer
CLIP
AI4TS
VLM
31
0
0
12 May 2025
Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
Xilin Jiang
Junkai Wu
Vishal B. Choudhari
N. Mesgarani
VLM
30
0
0
11 May 2025
Learning Music Audio Representations With Limited Data
Christos Plachouras
Emmanouil Benetos
Johan Pauwels
21
0
0
09 May 2025
MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models
Soheil Zibakhsh Shabgahi
Yaman Jandali
F. Koushanfar
MoMe
AAML
52
0
0
06 May 2025
Token Communication-Driven Multimodal Large Models in Resource-Constrained Multiuser Networks
Junhe Zhang
Wanli Ni
Pengwei Wang
Dongyu Wang
24
0
0
06 May 2025
OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
Shengkai Chen
Yifang Yin
Jinming Cao
Shili Xiang
Zhenguang Liu
Roger Zimmermann
VOS
VLM
39
0
0
30 Apr 2025
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP
VLM
76
0
0
30 Apr 2025
PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies
Jialiang Zhao
Naveen Kuppuswamy
S. Feng
Benjamin Burchfiel
Edward H. Adelson
37
1
0
27 Apr 2025
M2R2: MulitModal Robotic Representation for Temporal Action Segmentation
Daniel Sliwowski
Dongheui Lee
22
1
0
25 Apr 2025
Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis
Daisuke Niizumi
Daiki Takeuchi
Masahiro Yasuda
Binh Thien Nguyen
Yasunori Ohishi
N. Harada
27
0
0
25 Apr 2025
Formula-Supervised Sound Event Detection: Pre-Training Without Real Data
Yuto Shibata
Keitaro Tanaka
Yoshiaki Bando
Keisuke Imoto
Hirokatsu Kataoka
Yoshimitsu Aoki
26
0
0
06 Apr 2025
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
41
0
0
03 Apr 2025
^RFLAV: Rolling Flow matching for infinite Audio Video generation
Alex Ergasti
Giuseppe Tarollo
Filippo Botti
Tomaso Fontanini
Claudio Ferrari
Massimo Bertozzi
Andrea Prati
VGen
45
0
0
13 Mar 2025
Hedge Fund Portfolio Construction Using PolyModel Theory and iTransformer
Siqiao Zhao
Zhikang Dong
Zeyu Cao
Raphael Douady
50
6
0
17 Feb 2025
Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues
David Sasu
Zehui Wu
Ziwei Gong
Run Chen
Pengyuan Shi
Lin Ai
Julia Hirschberg
Natalie Schluter
58
1
0
16 Feb 2025
Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling
Jakob Poncelet
Hugo Van hamme
69
0
0
05 Feb 2025
Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
J. P. Muñoz
Jinjie Yuan
Nilesh Jain
Mamba
68
1
0
28 Jan 2025
Hybrid Losses for Hierarchical Embedding Learning
Haokun Tian
Stefan Lattner
Brian McFee
Charalampos Saitis
43
0
0
22 Jan 2025
Noise-Agnostic Multitask Whisper Training for Reducing False Alarm Errors in Call-for-Help Detection
Myeonghoon Ryu
June-Woo Kim
Minseok Oh
Suji Lee
Han Park
36
0
0
20 Jan 2025
AudioBERT: Audio Knowledge Augmented Language Model
Hyunjong Ok
Suho Yoo
Jaeho Lee
AuLLM
RALM
VLM
42
0
0
17 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
79
2
0
10 Jan 2025
Contrastive Learning from Exploratory Actions: Leveraging Natural Interactions for Preference Elicitation
N. Dennler
S. Nikolaidis
Maja J. Matarić
101
0
0
03 Jan 2025
GraFPrint: A GNN-Based Approach for Audio Identification
Aditya Bhattacharjee
Shubhr Singh
Emmanouil Benetos
21
0
0
14 Oct 2024
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Sreyan Ghosh
Sonal Kumar
Zhifeng Kong
Rafael Valle
Bryan Catanzaro
Dinesh Manocha
DiffM
39
2
0
02 Oct 2024
Recent Advances in Speech Language Models: A Survey
Wenqian Cui
Dianzhi Yu
Xiaoqi Jiao
Ziqiao Meng
Guangyan Zhang
Qichao Wang
Yiwen Guo
Irwin King
AuLLM
59
14
0
01 Oct 2024
MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events
Xiaoyu Yang
Qiujia Li
Chao Zhang
P. Woodland
18
0
0
25 Sep 2024
Generalization in birdsong classification: impact of transfer learning methods and dataset characteristics
Burooj Ghani
Vincent J. Kalkman
Bob Planqué
Willem-Pier Vellinga
L. Gill
Dan Stowell
VLM
24
5
0
21 Sep 2024
LC-Protonets: Multi-Label Few-Shot Learning for World Music Audio Tagging
Charilaos Papaioannou
Emmanouil Benetos
Alexandros Potamianos
20
0
0
17 Sep 2024
MusicLIME: Explainable Multimodal Music Understanding
Theodoros Sotirou
Vassilis Lyberatos
Orfeas Menis-Mastromichalakis
Giorgos Stamou
26
2
0
16 Sep 2024
Effective Pre-Training of Audio Transformers for Sound Event Detection
Florian Schmid
T. Morocutti
Francesco Foscarin
Jan Schluter
Paul Primus
Gerhard Widmer
ViT
23
2
0
14 Sep 2024
SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
Md Awsafur Rahman
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
Bishmoy Paul
S. Fattah
38
7
0
26 Aug 2024
D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching
Jingyu Liu
Minquan Wang
Ye Ma
Bo Wang
Aozhu Chen
Quan Chen
Peng Jiang
Xirong Li
38
1
0
23 Aug 2024
Sampling Foundational Transformer: A Theoretical Perspective
Viet Anh Nguyen
Minh Lenhat
Khoa Nguyen
Duong Duc Hieu
Dao Huu Hung
Truong Son-Hy
42
0
0
11 Aug 2024
Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training
Ye Lin Tun
Chu Myaet Thwal
Minh N. H. Nguyen
Choong Seon Hong
36
0
0
22 Jul 2024
AudioInsight: Detecting Social Contexts Relevant to Social Anxiety from Speech
Varun Reddy
Zhiyuan Wang
Emma R. Toner
Max Larrazabal
M. Boukhechba
B. Teachman
Laura E. Barnes
31
4
0
19 Jul 2024
Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
Xuenan Xu
Pingyue Zhang
Ming Yan
Ji Zhang
Mengyue Wu
VLM
21
0
0
19 Jul 2024
ASGIR: Audio Spectrogram Transformer Guided Classification And Information Retrieval For Birds
Yashwardhan Chaudhuri
Paridhi Mundra
Arnesh Batra
Orchid Chetia Phukan
Arun Balaji Buduru
23
1
0
10 Jul 2024
Towards Attention-based Contrastive Learning for Audio Spoof Detection
C. Goel
Surya Koppisetti
Ben Colman
Ali Shahriyari
Gaurav Bharaj
50
5
0
03 Jul 2024
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Anbai Jiang
Bing Han
Zhiqiang Lv
Yufeng Deng
Wei-Qiang Zhang
Xie Chen
Yanmin Qian
Jia Liu
Pingyi Fan
32
3
0
17 Jun 2024
MambaLRP: Explaining Selective State Space Sequence Models
F. Jafari
G. Montavon
Klaus-Robert Müller
Oliver Eberle
Mamba
54
9
0
11 Jun 2024
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition
Andreas Triantafyllopoulos
A. Batliner
Simon Rampp
M. Milling
Björn Schuller
VLM
20
0
0
10 Jun 2024
Audio-based Step-count Estimation for Running -- Windowing and Neural Network Baselines
Philipp Wagner
Andreas Triantafyllopoulos
Alexander Gebhard
Björn Schuller
35
0
0
10 Jun 2024
Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture
Ohad Cohen
G. Hazan
Sharon Gannot
31
1
0
05 Jun 2024
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
Siavash Shams
Sukru Samet Dindar
Xilin Jiang
N. Mesgarani
Mamba
64
18
0
20 May 2024
RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification
June-Woo Kim
Miika Toikkanen
Sangmin Bae
Minseok Kim
Ho-Young Jung
30
5
0
05 May 2024
AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition
Kin Wai Lau
Yasar Abbas Ur Rehman
L. Po
33
1
0
21 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
37
5
0
28 Mar 2024
uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
Afrina Tabassum
Dung N. Tran
Trung D. Q. Dang
Ismini Lourentzou
K. Koishida
42
0
0
14 Mar 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
44
1
0
23 Feb 2024
Multimodal Action Quality Assessment
Ling-an Zeng
Wei-Shi Zheng
43
13
0
31 Jan 2024