Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

21 July 2020
Yapeng Tian
Dingzeyu Li
Chenliang Xu

Papers citing "Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing"

50 / 136 papers shown
Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
Yaru Chen
Ruohao Guo
Liting Gao
Yang Xiang
Qingyu Luo
Zhenbo Li
Wenwu Wang
249
0
0
17 Sep 2025
TEn-CATG:Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
Yaru Chen
Faegheh Sardari
Peiliang Zhang
Ruohao Guo
Yang Xiang
Zhenbo Li
Wenwu Wang
303
0
0
04 Sep 2025
CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization
Jinxing Zhou
Ziheng Zhou
Yanghao Zhou
Yuxin Mao
Zhangling Duan
Dan Guo
228
4
0
06 Aug 2025
Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
Jinxing Zhou
Yanghao Zhou
Mingfei Han
Tong Wang
Xiaojun Chang
Hisham Cholakkal
Rao Muhammad Anwer
VOS, LRM
289
5
0
06 Aug 2025
Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning
Jiong Yin
Liang-Sheng Li
Jiehua Zhang
Yuhan Gao
Chenggang Yan
Xichun Sheng
CLL
294
1
0
29 Jul 2025
ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
Huilai Li
Yonghao Dang
Ying Xing
Yiming Wang
Jianqin Yin
240
0
0
14 Jul 2025
MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
Langyu Wang
Bingke Zhu
Yingying Chen
Yiyuan Zhang
Ming Tang
Jinqiao Wang
VLM
383
1
0
02 Jul 2025
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
X. Yu
Yan Fang
Xiaojie Jin
Yao Zhao
Yunchao Wei
326
2
0
29 May 2025
Learning to Highlight Audio by Watching Movies
Computer Vision and Pattern Recognition (CVPR), 2025
Chao Huang
Ruohan Gao
J. M. F. Tsang
Jan Kurcius
Cagdas Bilen
Chenliang Xu
Anurag Kumar
Sanjeel Parekh
VGen
371
5
0
17 May 2025
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
Computer Vision and Pattern Recognition (CVPR), 2025
Yung-Hsuan Lai
Janek Ebbers
Yu-Chiang Frank Wang
François Germain
Michael Jeffrey Jones
Moitreya Chatterjee
247
1
0
14 May 2025
Audio-visual Event Localization on Portrait Mode Short Videos
Wuyang Liu
Yi Chai
Yongpeng Yan
Yanzhen Ren
346
2
0
09 Apr 2025
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection
Peng Wu
Wanshun Su
Guansong Pang
Yujia Sun
Qingsen Yan
Peng Wang
Yujiao Shi
VLM
460
6
0
06 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
International Conference on Learning Representations (ICLR), 2025
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM, AuLLM
410
10
0
02 Apr 2025
Continual Cross-Modal Generalization
Yan Xia
Hai Huang
Minghui Fang
Zhou Zhao
CLL
315
1
0
01 Apr 2025
Enhancing Multi-modal Models with Heterogeneous MoE Adapters for Fine-tuning
Sashuai Zhou
Hai Huang
Yan Xia
MoMe, MoE
329
3
0
26 Mar 2025
Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds
Computer Vision and Pattern Recognition (CVPR), 2025
E. Shaar
Ariel Shaulov
Gal Chechik
Lior Wolf
VLM
461
2
0
17 Mar 2025
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Computer Vision and Pattern Recognition (CVPR), 2025
Henghui Du
Guangyao Li
Chang Zhou
Chunjie Zhang
Alan Zhao
D. Hu
302
18
0
17 Mar 2025
AudioX: A Unified Framework for Anything-to-Audio Generation
Zeyue Tian
Yizhu Jin
Zhaoyang Liu
Ruibin Yuan
Xu Tan
Qifeng Chen
Wei Xue
Yike Guo
VGen
585
33
0
13 Mar 2025
MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
Hao Zhou
Xiaobao Guo
Yuzhe Zhu
A. Kong
DiffM
540
2
0
13 Mar 2025
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi
Dimitra Emmanouilidou
H. Gamper
601
3
0
12 Mar 2025
Learning Musical Representations for Music Performance Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Xingjian Diao
Chunhui Zhang
Tingxuan Wu
Ming Cheng
Z. Ouyang
Weiyi Wu
Jiang Gui
366
27
0
10 Feb 2025
Audio-Language Datasets of Scenes and Events: A Survey
IEEE Access, 2024
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
630
10
0
10 Jan 2025
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Computer Vision and Pattern Recognition (CVPR), 2022
Akam Rahimi
Triantafyllos Afouras
Andrew Zisserman
424
34
0
02 Jan 2025
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
ACM Symposium on Applied Computing (SAC), 2024
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
390
6
0
26 Dec 2024
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
AAAI Conference on Artificial Intelligence (AAAI), 2024
Pengcheng Zhao
Jinxing Zhou
Yang Zhao
Dan Guo
Yanxiang Chen
376
19
0
15 Dec 2024
Towards Open-Vocabulary Audio-Visual Event Localization
Computer Vision and Pattern Recognition (CVPR), 2024
Jinxing Zhou
Dan Guo
Ruohao Guo
Yuxin Mao
Jingjing Hu
Yiran Zhong
Xiaojun Chang
Ming Wang
VLM
622
30
0
18 Nov 2024
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Tianyu Yang
Yiyang Nan
Lisen Dai
Zhenwen Liang
Yapeng Tian
Wei Wei
411
2
0
07 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Neural Information Processing Systems (NeurIPS), 2024
Shentong Mo
Yibing Song
308
4
0
30 Oct 2024
Multi-scale Multi-instance Visual Sound Localization and Segmentation
Shentong Mo
Haofan Wang
316
3
0
31 Aug 2024
VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features
Intelligent Data Analysis (IDA), 2024
Ananya Pandey
Dinesh Kumar Vishwakarma
284
4
0
05 Aug 2024
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
Guangyao Li
Henghui Du
Di Hu
268
21
0
30 Jul 2024
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo
Pedro Morgado
291
9
0
18 Jul 2024
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Jinxing Zhou
Dan Guo
Yuxin Mao
Yiran Zhong
Xiaojun Chang
Meng Wang
286
36
0
11 Jul 2024
Semantic Grouping Network for Audio Source Separation
Shentong Mo
Yapeng Tian
357
5
0
04 Jul 2024
SAVE: Segment Audio-Visual Easy way using Segment Anything Model
Khanh-Binh Nguyen
Chae Jung Park
VLM, VOS
439
5
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM, MLLM
452
29
0
01 Jul 2024
Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation
International Conference on Information Photonics (ICIP), 2024
Juhyeong Seon
Woobin Im
Sebin Lee
Jumin Lee
Sung-eui Yoon
278
7
0
10 Jun 2024
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
Tanvir Mahmud
Shentong Mo
Yapeng Tian
Diana Marculescu
214
14
0
07 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
VGen
708
40
0
06 Jun 2024
SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
Rishit Dagli
Shivesh Prakash
Robert Wu
H. Khosravani
417
17
0
06 Jun 2024
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
280
42
0
03 Jun 2024
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
European Conference on Computer Vision (ECCV), 2024
Faegheh Sardari
A. Mustafa
Philip J. B. Jackson
Adrian Hilton
529
11
0
17 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
362
5
0
13 May 2024
TIM: A Time Interval Machine for Audio-Visual Action Recognition
Jacob Chalk
Jaesung Huh
Evangelos Kazakos
Andrew Zisserman
Dima Damen
338
29
0
08 Apr 2024
UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Tiantian Geng
Teng Wang
Jinming Duan
Yanfu Zhang
Weili Guan
Feng Zheng
Ling Shao
372
5
0
04 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
361
12
0
28 Mar 2024
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yunlong Tang
Daiki Shimada
Jing Bi
Hang Hua
Chenliang Xu
VGen
471
17
0
24 Mar 2024
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
IEEE Transactions on Multimedia (IEEE TMM), 2024
Shijian Deng
Erin E. Kosloski
Siddhi Patel
Zeke A. Barnett
Yiyang Nan
...
William T. Doan
Matthew Wang
Harsh Singh
P. Rollins
Yapeng Tian
325
15
0
22 Mar 2024
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Jinxiang Liu
Yikun Liu
Fei Zhang
Chen Ju
Ya Zhang
Yanfeng Wang
390
29
0
17 Mar 2024
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues
Qilang Ye
Zitong Yu
Xin Liu
275
4
0
11 Mar 2024