ResearchTrend.AI
Learning to Answer Questions in Dynamic Audio-Visual Scenarios

26 March 2022
Guangyao Li
Yake Wei
Yapeng Tian
Chenliang Xu
Ji-Rong Wen
Di Hu
arXiv: 2203.14072

Papers citing "Learning to Answer Questions in Dynamic Audio-Visual Scenarios"

42 / 92 papers shown
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Shijian Deng
Erin E. Kosloski
Siddhi Patel
Zeke A. Barnett
Yiyang Nan
...
William T. Doan
Matthew Wang
Harsh Singh
P. Rollins
Yapeng Tian
26
4
0
22 Mar 2024
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues
Qilang Ye
Zitong Yu
Xin Liu
33
1
0
11 Mar 2024
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Qilang Ye
Zitong Yu
Rui Shao
Xinyu Xie
Philip H. S. Torr
Xiaochun Cao
MLLM
33
24
0
07 Mar 2024
Model Composition for Multimodal Large Language Models
Chi Chen
Yiyang Du
Zheng Fang
Ziyue Wang
Fuwen Luo
...
Ming Yan
Ji Zhang
Fei Huang
Maosong Sun
Yang Janet Liu
MoMe
21
3
0
20 Feb 2024
M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation
Hongcheng Liu
Pingjie Wang
Yu Wang
Yanfeng Wang
30
1
0
19 Feb 2024
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Qian Yang
Jin Xu
Wenrui Liu
Yunfei Chu
Ziyue Jiang
...
Yichong Leng
Yuanjun Lv
Zhou Zhao
Chang Zhou
Jingren Zhou
LM&MA
AuLLM
ALM
44
56
0
12 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
38
11
0
10 Feb 2024
Quantifying and Enhancing Multi-modal Robustness with Modality Preference
Zequn Yang
Yake Wei
Ce Liang
Di Hu
AAML
19
9
0
09 Feb 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
77
4
0
08 Feb 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Wei Ping
Rafael Valle
Bryan Catanzaro
AuLLM
LM&MA
MLLM
59
73
0
02 Feb 2024
AQUALLM: Audio Question Answering Data Generation Using Large Language Models
Swarup Ranjan Behera
Krishna Mohan Injeti
Jaya Sai Kiran Patibandla
P. Pokala
Pailla Balakrishna Reddy
AuLLM
13
4
0
28 Dec 2023
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering
Zhangbin Li
Dan Guo
Jinxing Zhou
Jing Zhang
Meng Wang
19
11
0
20 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
17
36
0
11 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq R. Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
28
45
0
30 Nov 2023
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Haoyi Duan
Yan Xia
Mingze Zhou
Li Tang
Jieming Zhu
Zhou Zhao
VLM
11
17
0
09 Nov 2023
Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Changsheng Lv
Shuai Zhang
Yapeng Tian
Mengshi Qi
Huadong Ma
CML
37
16
0
30 Oct 2023
Audio-Visual Instance Segmentation
Ruohao Guo
Yaru Chen
Yanyu Qi
Wenzhen Yue
Dantong Niu
...
Wenzhen Yue
Ji Shi
Qixun Wang
Peiliang Zhang
Buwen Liang
VLM
VOS
26
2
0
28 Oct 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
13
9
0
25 Oct 2023
MISAR: A Multimodal Instructional System with Augmented Reality
Jing Bi
Nguyen Nguyen
A. Vosoughi
Chenliang Xu
40
11
0
18 Oct 2023
CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing
Yaru Chen
Ruohao Guo
Xubo Liu
Peipei Wu
Guangyao Li
Zhenbo Li
Wenwu Wang
32
7
0
11 Oct 2023
Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering
Xiulong Liu
Zhikang Dong
Peng Zhang
17
21
0
10 Oct 2023
Class-Incremental Grouping Network for Continual Audio-Visual Learning
Shentong Mo
Weiguo Pian
Yapeng Tian
CLL
VLM
27
21
0
11 Sep 2023
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
25
245
0
17 Aug 2023
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Guangyao Li
Wenxuan Hou
Di Hu
21
26
0
10 Aug 2023
CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments
Xiulong Liu
Sudipta Paul
Moitreya Chatterjee
A. Cherian
18
8
0
06 Jun 2023
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Qiangchang Wang
Yilong Yin
21
0
0
02 Jun 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
J. Liu
30
95
0
29 May 2023
Multi-Scale Attention for Audio Question Answering
Guangyao Li
Yixin Xu
Di Hu
14
16
0
29 May 2023
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Yun-hsuan Lai
Yen-Chun Chen
Y. Wang
8
8
0
27 May 2023
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Zijia Zhao
Longteng Guo
Tongtian Yue
Si-Qing Chen
Shuai Shao
Xinxin Zhu
Zehuan Yuan
Jing Liu
MLLM
27
51
0
25 May 2023
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios
Yuanyuan Jiang
Jianqin Yin
8
6
0
21 May 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jing Liu
VLM
26
99
0
17 Apr 2023
Robust Cross-Modal Knowledge Distillation for Unconstrained Videos
Wenke Xia
Xingjian Li
Andong Deng
Haoyi Xiong
Dejing Dou
Di Hu
11
4
0
16 Apr 2023
Improving Audio-Visual Video Parsing with Pseudo Visual Labels
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
31
11
0
04 Mar 2023
Balanced Audiovisual Dataset for Imbalance Analysis
Wenke Xia
Xu Zhao
Xincheng Pang
Changqing Zhang
Di Hu
21
1
0
14 Feb 2023
Revisiting Pre-training in Audio-Visual Learning
Ruoxuan Feng
Wenke Xia
Di Hu
17
1
0
07 Feb 2023
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Mohit Bansal
Gedas Bertasius
26
69
0
15 Dec 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
28
16
0
05 Oct 2022
Learning in Audio-visual Context: A Review, Analysis, and New Perspective
Yake Wei
Di Hu
Yapeng Tian
Xuelong Li
41
54
0
20 Aug 2022
Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
Grant Van Horn
Rui Qian
Kimberly Wilber
Hartwig Adam
Oisin Mac Aodha
Serge J. Belongie
19
10
0
21 Jul 2022
Video Question Answering: Datasets, Algorithms and Challenges
Yaoyao Zhong
Junbin Xiao
Wei Ji
Yicong Li
Wei Deng
Tat-Seng Chua
13
83
0
02 Mar 2022
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
Ruohan Gao
Kristen Grauman
CVBM
185
196
0
08 Jan 2021