Learning to Answer Questions in Dynamic Audio-Visual Scenarios


26 March 2022
Guangyao Li
Yake Wei
Yapeng Tian
Chenliang Xu
Ji-Rong Wen
Di Hu

Papers citing "Learning to Answer Questions in Dynamic Audio-Visual Scenarios"

50 / 92 papers shown
Token Communication-Driven Multimodal Large Models in Resource-Constrained Multiuser Networks
Junhe Zhang
Wanli Ni
Pengwei Wang
Dongyu Wang
11
0
0
06 May 2025
Kimi-Audio Technical Report
KimiTeam
Ding Ding
Zeqian Ju
Yichong Leng
S. Liu
...
Z. Yang
Aoxiong Yin
Ruibin Yuan
Y. Zhang
Zaida Zhou
AuLLM
VLM
108
1
0
25 Apr 2025
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
Sifei Li
Mining Tan
Feier Shen
Minyan Luo
Zijiao Yin
Fan Tang
W. Dong
Changsheng Xu
57
0
0
17 Apr 2025
TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models
Jaewoo Lee
Keyang Xuan
Chanakya Ekbote
Sandeep Polisetty
Yi Ren Fung
Paul Pu Liang
VLM
37
0
0
14 Apr 2025
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Haoran Hao
Jiaming Han
Yiyuan Zhang
Xiangyu Yue
32
0
0
14 Apr 2025
A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
Tinh-Anh Nguyen-Nhu
H. Tran
Nguyen-Khang Le
Minh-Nhat Nguyen
T. Nguyen
...
Huu-Phong Phan-Nguyen
Huy-Thach Pham
Quan Nguyen
Hoang M. Le
Quang-Vinh Dinh
44
0
0
12 Apr 2025
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection
Peng Wu
Wanshun Su
Guansong Pang
Yujia Sun
Qingsen Yan
Peng Wang
Y. Zhang
VLM
50
0
0
06 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
58
2
0
02 Apr 2025
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
J. Liu
P. Wang
Jing Tao
Zhou Su
45
0
0
01 Apr 2025
WikiVideo: Article Generation from Multiple Videos
Alexander Martin
Reno Kriz
William Walden
Kate Sanders
Hannah Recknor
Eugene Yang
Francis Ferraro
Benjamin Van Durme
DiffM
VGen
44
1
0
01 Apr 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
36
0
0
29 Mar 2025
Enhancing Multi-modal Models with Heterogeneous MoE Adapters for Fine-tuning
Sashuai Zhou
Hai Huang
Yan Xia
MoMe
MoE
70
0
0
26 Mar 2025
PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu
Yiquan Li
Khoi Duc Nguyen
Yiwu Zhong
Yin Li
KELM
LRM
79
0
0
25 Mar 2025
ACVUBench: Audio-Centric Video Understanding Benchmark
Y. Yang
Jimin Zhuang
Guangzhi Sun
Changli Tang
Y. Li
P. Li
Yifan Jiang
W. Li
Z. Ma
Chao Zhang
AuLLM
CoGe
53
0
0
25 Mar 2025
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Henghui Du
Guangyao Li
Chang Zhou
Chunjie Zhang
Alan Zhao
D. Hu
54
0
0
17 Mar 2025
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
Chen Liu
Liying Yang
Peike Li
Dadong Wang
Lincheng Li
Xin Yu
VOS
94
0
0
17 Mar 2025
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Gorjan Radevski
Teodora Popordanoska
Matthew B. Blaschko
Tinne Tuytelaars
53
0
0
12 Mar 2025
Question-Aware Gaussian Experts for Audio-Visual Question Answering
Hongyeob Kim
Inyoung Jung
Dayoon Suh
Youjia Zhang
Sangmin Lee
Sungeun Hong
61
0
0
06 Mar 2025
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh
Zhifeng Kong
Sonal Kumar
S. Sakshi
Jaehyeon Kim
Wei Ping
Rafael Valle
Dinesh Manocha
Bryan Catanzaro
MLLM
AuLLM
LRM
49
4
0
06 Mar 2025
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Zitang Zhou
Ke Mei
Yu Lu
Tianyi Wang
Fengyun Rao
83
2
0
03 Mar 2025
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Guangzhi Sun
Yudong Yang
Jimin Zhuang
Changli Tang
Y. Li
W. Li
Z. Ma
Chao Zhang
LRM
MLLM
VLM
64
2
0
17 Feb 2025
Learning Musical Representations for Music Performance Question Answering
Xingjian Diao
Chunhui Zhang
Tingxuan Wu
Ming Cheng
Z. Ouyang
Weiyi Wu
Jiang Gui
62
5
0
10 Feb 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
D. Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
104
102
0
10 Jan 2025
Patch-level Sounding Object Tracking for Audio-Visual Question Answering
Zhangbin Li
Jinxing Zhou
J. Zhang
Shengeng Tang
Kun Li
D. Guo
75
4
0
14 Dec 2024
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong
Kaituo Feng
B. Li
Yibing Wang
Mofan Cheng
...
Jiaming Han
Benyou Wang
Yutong Bai
Z. Yang
Xiangyu Yue
MLLM
AuLLM
VLM
82
5
0
03 Dec 2024
Towards Open-Vocabulary Audio-Visual Event Localization
Jinxing Zhou
D. Guo
Ruohao Guo
Yuxin Mao
Jingjing Hu
Yiran Zhong
Xiaojun Chang
M. Wang
VLM
46
3
0
18 Nov 2024
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Tianyu Yang
Yiyang Nan
Lisen Dai
Zhenwen Liang
Yapeng Tian
X. Zhang
26
0
0
07 Nov 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
29
2
0
23 Oct 2024
Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech
Shuwei He
Rui Liu
H. Li
19
4
0
18 Oct 2024
OMCAT: Omni Context Aware Transformer
Arushi Goel
Karan Sapra
Matthieu Le
Rafael Valle
Andrew Tao
Bryan Catanzaro
MLLM
VLM
16
0
0
15 Oct 2024
On-the-fly Modulation for Balanced Multimodal Learning
Yake Wei
D. Hu
Henghui Du
Ji-Rong Wen
13
7
0
15 Oct 2024
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
Qingni Wang
Tiantian Geng
Zhiyuan Wang
Teng Wang
Bo Fu
Feng Zheng
22
4
0
10 Oct 2024
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
...
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
LRM
41
11
0
23 Sep 2024
Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation
Runze Yuan
Tao Liu
Wenke Ma
Xuelong Li
24
6
0
02 Aug 2024
Learning Trimodal Relation for AVQA with Missing Modality
Kyu Ri Park
Hong Joo Lee
Jung Uk Kim
29
1
0
23 Jul 2024
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Yaoting Wang
Peiwen Sun
Dongzhan Zhou
Guangyao Li
Honggang Zhang
Di Hu
VOS
35
5
0
15 Jul 2024
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Jinxing Zhou
Dan Guo
Yuxin Mao
Yiran Zhong
Xiaojun Chang
Meng Wang
31
11
0
11 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM
MLLM
29
9
0
01 Jul 2024
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
Zhe Yang
Wenrui Li
Guanghui Cheng
Mamba
21
0
0
14 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
38
1
0
13 Jun 2024
Towards Multilingual Audio-Visual Question Answering
Orchid Chetia Phukan
Priyabrata Mallick
Swarup Ranjan Behera
Aalekhya Satya Narayani
Arun Balaji Buduru
Rajesh Sharma
37
0
0
13 Jun 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
47
4
0
10 Jun 2024
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
53
4
0
03 Jun 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
47
31
0
26 May 2024
CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models
Guangzhi Sun
Potsawee Manakul
Adian Liusie
Kunat Pipatanakul
Chao Zhang
P. Woodland
Mark J. F. Gales
HILM
MLLM
16
7
0
22 May 2024
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
Faegheh Sardari
A. Mustafa
Philip J. B. Jackson
Adrian Hilton
14
2
0
17 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
30
1
0
13 May 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
30
4
0
18 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
30
5
0
28 Mar 2024
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
Yunlong Tang
Daiki Shimada
Jing Bi
Chenliang Xu
VGen
24
17
0
24 Mar 2024