ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1904.05876
  4. Cited By
A Simple Baseline for Audio-Visual Scene-Aware Dialog

A Simple Baseline for Audio-Visual Scene-Aware Dialog

11 April 2019
Idan Schwartz
A. Schwing
Tamir Hazan
ArXivPDFHTML

Papers citing "A Simple Baseline for Audio-Visual Scene-Aware Dialog"

49 / 49 papers shown
Title
Question-Aware Gaussian Experts for Audio-Visual Question Answering
Hongyeob Kim
Inyoung Jung
Dayoon Suh
Youjia Zhang
Sangmin Lee
Sungeun Hong
61
0
0
06 Mar 2025
Patch-level Sounding Object Tracking for Audio-Visual Question Answering
Patch-level Sounding Object Tracking for Audio-Visual Question Answering
Zhangbin Li
Jinxing Zhou
J. Zhang
Shengeng Tang
Kun Li
D. Guo
75
4
0
14 Dec 2024
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing
  Audio-Visual Question Answering
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Tianyu Yang
Yiyang Nan
Lisen Dai
Zhenwen Liang
Yapeng Tian
X. Zhang
26
0
0
07 Nov 2024
On-the-fly Modulation for Balanced Multimodal Learning
On-the-fly Modulation for Balanced Multimodal Learning
Yake Wei
D. Hu
Henghui Du
Ji-Rong Wen
11
7
0
15 Oct 2024
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
Guangyao Li
Henghui Du
Di Hu
16
1
0
30 Jul 2024
Learning Trimodal Relation for AVQA with Missing Modality
Learning Trimodal Relation for AVQA with Missing Modality
Kyu Ri Park
Hong Joo Lee
Jung Uk Kim
26
1
0
23 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and
  Time
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM
MLLM
29
9
0
01 Jul 2024
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual
  Question Answering
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
Zhe Yang
Wenrui Li
Guanghui Cheng
Mamba
21
0
0
14 Jun 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
47
4
0
10 Jun 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual
  Question Answering
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
30
1
0
13 May 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
30
4
0
18 Apr 2024
Answering Diverse Questions via Text Attached with Key Audio-Visual
  Clues
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues
Qilang Ye
Zitong Yu
Xin Liu
25
1
0
11 Mar 2024
CAT: Enhancing Multimodal Large Language Model to Answer Questions in
  Dynamic Audio-Visual Scenarios
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Qilang Ye
Zitong Yu
Rui Shao
Xinyu Xie
Philip H. S. Torr
Xiaochun Cao
MLLM
30
24
0
07 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
18
13
0
06 Mar 2024
M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced
  Video-grounded Dialogue Generation
M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation
Hongcheng Liu
Pingjie Wang
Yu Wang
Yanfeng Wang
22
1
0
19 Feb 2024
Social Media Ready Caption Generation for Brands
Social Media Ready Caption Generation for Brands
Himanshu Maheshwari
Koustava Goswami
Apoorv Saxena
Balaji Vasan Srinivasan
14
1
0
03 Jan 2024
Object-aware Adaptive-Positivity Learning for Audio-Visual Question
  Answering
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering
Zhangbin Li
Dan Guo
Jinxing Zhou
Jing Zhang
Meng Wang
19
11
0
20 Dec 2023
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual
  Downstream Tasks
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Haoyi Duan
Yan Xia
Mingze Zhou
Li Tang
Jieming Zhu
Zhou Zhao
VLM
11
17
0
09 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
6
9
0
25 Oct 2023
Progressive Spatio-temporal Perception for Audio-Visual Question
  Answering
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Guangyao Li
Wenxuan Hou
Di Hu
21
26
0
10 Aug 2023
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event
  Parser
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Yun-hsuan Lai
Yen-Chun Chen
Y. Wang
6
8
0
27 May 2023
Target-Aware Spatio-Temporal Reasoning via Answering Questions in
  Dynamics Audio-Visual Scenarios
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios
Yuanyuan Jiang
Jianqin Yin
8
4
0
21 May 2023
ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based
  Polishing
ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing
Zequn Zeng
Hao Zhang
Zhengjue Wang
Ruiying Lu
Dongsheng Wang
Bo Chen
BDL
DiffM
11
32
0
04 Mar 2023
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Mohit Bansal
Gedas Bertasius
15
69
0
15 Dec 2022
End-to-End Multimodal Representation Learning for Video Dialog
End-to-End Multimodal Representation Learning for Video Dialog
Huda AlAmri
Anthony Bilic
Michael Hu
Apoorva Beedu
Irfan Essa
17
3
0
26 Oct 2022
Describing Sets of Images with Textual-PCA
Describing Sets of Images with Textual-PCA
Oded Hupert
Idan Schwartz
Lior Wolf
CoGe
23
1
0
21 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
25
16
0
05 Oct 2022
Learning in Audio-visual Context: A Review, Analysis, and New
  Perspective
Learning in Audio-visual Context: A Review, Analysis, and New Perspective
Yake Wei
Di Hu
Yapeng Tian
Xuelong Li
33
54
0
20 Aug 2022
Zero-Shot Video Captioning with Evolving Pseudo-Tokens
Zero-Shot Video Captioning with Evolving Pseudo-Tokens
Yoad Tewel
Yoav Shalev
Roy Nadler
Idan Schwartz
Lior Wolf
29
25
0
22 Jul 2022
Enabling Harmonious Human-Machine Interaction with Visual-Context
  Augmented Dialogue System: A Review
Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review
Hao Wang
Bin Guo
Y. Zeng
Yasan Ding
Chen Qiu
Ying Zhang
Li Yao
Zhiwen Yu
22
2
0
02 Jul 2022
Dual Decomposition of Convex Optimization Layers for Consistent
  Attention in Medical Images
Dual Decomposition of Convex Optimization Layers for Consistent Attention in Medical Images
Tom Ron
M. Weiler-Sagie
Tamir Hazan
FAtt
MedIm
11
6
0
06 Jun 2022
Learning to Retrieve Videos by Asking Questions
Learning to Retrieve Videos by Asking Questions
Avinash Madasu
Junier Oliva
Gedas Bertasius
VGen
25
15
0
11 May 2022
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Guangyao Li
Yake Wei
Yapeng Tian
Chenliang Xu
Ji-Rong Wen
Di Hu
20
133
0
26 Mar 2022
AssistQ: Affordance-centric Question-driven Task Completion for
  Egocentric Assistant
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant
B. Wong
Joya Chen
You Wu
Stan Weixian Lei
Dongxing Mao
Difei Gao
Mike Zheng Shou
EgoV
21
27
0
08 Mar 2022
Latent Space Explanation by Intervention
Latent Space Explanation by Intervention
Itai Gat
Guy Lorberbom
Idan Schwartz
Tamir Hazan
BDL
21
14
0
09 Dec 2021
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic
  Arithmetic
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
Yoad Tewel
Yoav Shalev
Idan Schwartz
Lior Wolf
VLM
30
191
0
29 Nov 2021
Perceptual Score: What Data Modalities Does Your Model Perceive?
Perceptual Score: What Data Modalities Does Your Model Perceive?
Itai Gat
Idan Schwartz
A. Schwing
14
30
0
27 Oct 2021
Video and Text Matching with Conditioned Embeddings
Video and Text Matching with Conditioned Embeddings
Ameen Ali
Idan Schwartz
Tamir Hazan
Lior Wolf
49
13
0
21 Oct 2021
Saying the Unseen: Video Descriptions via Dialog Agents
Saying the Unseen: Video Descriptions via Dialog Agents
Ye Zhu
Yu Wu
Yi Yang
Yan Yan
14
6
0
26 Jun 2021
Ensemble of MRR and NDCG models for Visual Dialog
Ensemble of MRR and NDCG models for Visual Dialog
Idan Schwartz
22
7
0
15 Apr 2021
Learning Reasoning Paths over Semantic Graphs for Video-grounded
  Dialogues
Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues
Hung Le
Nancy F. Chen
S. Hoi
26
14
0
01 Mar 2021
VX2TEXT: End-to-End Learning of Video-Based Text Generation From
  Multimodal Inputs
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Xudong Lin
Gedas Bertasius
Jue Wang
Shih-Fu Chang
Devi Parikh
Lorenzo Torresani
VGen
17
66
0
28 Jan 2021
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing
  Functional Entropies
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
Itai Gat
Idan Schwartz
A. Schwing
Tamir Hazan
45
88
0
21 Oct 2020
Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
Ye Zhu
Yu Wu
Yi Yang
Yan Yan
6
12
0
18 Aug 2020
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal
  Shuffled Transformers
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Shijie Geng
Peng Gao
Moitreya Chatterjee
Chiori Hori
Jonathan Le Roux
Yongfeng Zhang
Hongsheng Li
A. Cherian
19
11
0
08 Jul 2020
DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style
  Word Generator
DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator
Hwanhee Lee
Seunghyun Yoon
Franck Dernoncourt
Doo Soon Kim
Trung Bui
Kyomin Jung
12
15
0
01 Apr 2020
Trends in Integration of Vision and Language Research: A Survey of
  Tasks, Datasets, and Methods
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
Aditya Mogadala
M. Kalimuthu
Dietrich Klakow
VLM
10
132
0
22 Jul 2019
Factor Graph Attention
Factor Graph Attention
Idan Schwartz
Seunghak Yu
Tamir Hazan
A. Schwing
11
110
0
11 Apr 2019
Multimodal Compact Bilinear Pooling for Visual Question Answering and
  Visual Grounding
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Akira Fukui
Dong Huk Park
Daylen Yang
Anna Rohrbach
Trevor Darrell
Marcus Rohrbach
144
1,458
0
06 Jun 2016
1