AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
11 October 2022
Tanvir Mahmud
Diana Marculescu
    CLIP

Papers citing "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization"

25 / 25 papers shown
Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty
Victor Croisfelt
João Henrique Inacio de Souza
Shashi Raj Pandey
B. Soret
P. Popovski
153
0
0
20 Nov 2025
Energy-Efficient Domain-Specific Artificial Intelligence Models and Agents: Pathways and Paradigms
Abhijit Chatterjee
N. Jha
Jonathan D. Cohen
Thomas Griffiths
Hongjing Lu
Diana Marculescu
Ashiqur Rasul
Keshab K. Parhi
LLMAG, AI4CE
328
0
0
24 Oct 2025
CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization
Jinxing Zhou
Ziheng Zhou
Yanghao Zhou
Yuxin Mao
Zhangling Duan
Dan Guo
100
1
0
06 Aug 2025
GRAM: Spatial general-purpose audio representation models for real-world applications
Goksenin Yuksel
Marcel van Gerven
Kiki van der Heijden
194
1
0
01 Jun 2025
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
X. Yu
Yan Fang
Xiaojie Jin
Yao Zhao
Yunchao Wei
231
1
0
29 May 2025
Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
Sooyoung Park
Arda Senocak
Joon Son Chung
VLM
229
0
0
08 May 2025
Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds
Computer Vision and Pattern Recognition (CVPR), 2025
E. Shaar
Ariel Shaulov
Gal Chechik
Lior Wolf
VLM
283
1
0
17 Mar 2025
Towards Open-Vocabulary Audio-Visual Event Localization
Computer Vision and Pattern Recognition (CVPR), 2024
Jinxing Zhou
Dan Guo
Ruohao Guo
Yuxin Mao
Jingjing Hu
Yiran Zhong
Xiaojun Chang
Ming Wang
VLM
439
19
0
18 Nov 2024
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Tianyu Yang
Yiyang Nan
Lisen Dai
Zhenwen Liang
Yapeng Tian
Wei Wei
248
1
0
07 Nov 2024
CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization
ACM Multimedia (MM), 2024
Xiang He
Xiangxi Liu
Yang Li
Dongcheng Zhao
Guobin Shen
Qingqun Kong
Xin Yang
Yi Zeng
212
12
0
04 Aug 2024
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
Tanvir Mahmud
Shentong Mo
Yapeng Tian
Diana Marculescu
142
7
0
07 Jun 2024
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
215
33
0
03 Jun 2024
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All
Yuanhuiyi Lyu
Xueye Zheng
Dahun Kim
Lin Wang
216
20
0
25 May 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
203
10
0
28 Mar 2024
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
Yuanhuiyi Lyu
Xueye Zheng
Jiazhou Zhou
Lin Wang
196
39
0
19 Mar 2024
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Jinxiang Liu
Yikun Liu
Fei Zhang
Chen Ju
Ya Zhang
Yanfeng Wang
251
25
0
17 Mar 2024
Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation
Yuanhuiyi Lyu
Xueye Zheng
Lin Wang
DiffM
185
12
0
31 Jan 2024
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Haoyi Duan
Yan Xia
Mingze Zhou
Li Tang
Jieming Zhu
Zhou Zhao
VLM
258
38
0
09 Nov 2023
Can CLIP Help Sound Source Localization?
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Sooyoung Park
Arda Senocak
Joon Son Chung
145
15
0
07 Nov 2023
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
European Conference on Computer Vision (ECCV), 2023
Jiazhou Zhou
Xueye Zheng
Yuanhuiyi Lyu
Lin Wang
VLM
305
26
0
06 Aug 2023
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
Neural Information Processing Systems (NeurIPS), 2023
Yingying Fan
Yu Wu
Bo Du
Yutian Lin
241
17
0
01 Jun 2023
Improving Audio-Visual Video Parsing with Pseudo Visual Labels
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
195
21
0
04 Mar 2023
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Computer Vision and Pattern Recognition (CVPR), 2022
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
268
106
0
15 Dec 2022
Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization
IEEE Transactions on Multimedia (IEEE TMM), 2022
Yuanyuan Jiang
Jianqin Yin
Yonghao Dang
106
14
0
11 Oct 2022
Xception: Deep Learning with Depthwise Separable Convolutions
Computer Vision and Pattern Recognition (CVPR), 2016
François Chollet
MDE, BDL, PINN
2.5K
16,498
0
07 Oct 2016