Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2401.08392
Cited By
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
16 January 2024
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
Re-assign community
ArXiv
PDF
HTML
Papers citing
"DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)"
34 / 34 papers shown
Title
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
Zhuo Zhi
Qiangqiang Wu
Minghe shen
W. J. Li
Yinchuan Li
Kun Shao
Kaiwen Zhou
LLMAG
28
0
0
06 Apr 2025
Agentic Keyframe Search for Video Question Answering
Sunqi Fan
Meng-Hao Guo
Shuojin Yang
45
0
0
20 Mar 2025
Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection
Yucheng Suo
Fan Ma
Kaixin Shen
Linchao Zhu
Yi Yang
VLM
43
0
0
12 Mar 2025
Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation
Minghan Chen
Guikun Chen
Wenguan Wang
Yi Yang
52
3
0
16 Sep 2024
PiPa++: Towards Unification of Domain Adaptive Semantic Segmentation via Self-supervised Learning
Mu Chen
Zhedong Zheng
Yi Yang
30
0
0
24 Jul 2024
Navigation Instruction Generation with BEV Perception and Large Language Models
Sheng Fan
Rui Liu
Wenguan Wang
Yi Yang
32
5
0
21 Jul 2024
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
Jian Ma
Wenguan Wang
Yi Yang
Feng Zheng
DiffM
33
0
0
15 Jul 2024
VividDreamer: Invariant Score Distillation For Hyper-Realistic Text-to-3D Generation
Wenjie Zhuo
Fan Ma
Hehe Fan
Yi Yang
DiffM
26
8
0
13 Jul 2024
Controllable Navigation Instruction Generation with Chain of Thought Prompting
Xianghao Kong
Jinyu Chen
Wenguan Wang
Hang Su
Xiaolin Hu
Yi Yang
Si Liu
LRM
23
3
0
10 Jul 2024
MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis
Dewei Zhou
Y. Li
Fan Ma
Zongxin Yang
Y. Yang
85
11
0
02 Jul 2024
Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization
Yuchi Liu
Jaskirat Singh
Gaowen Liu
Ali Payani
Liang Zheng
LLMAG
49
4
0
30 May 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
40
56
0
29 May 2024
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang
Hehe Fan
Yi Yang
30
3
0
24 May 2024
Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity
Ruijie Quan
Wenguan Wang
Zhibo Tian
Fan Ma
Yi Yang
28
12
0
29 Mar 2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan
Xiaojian Ma
Rujie Wu
Yuntao Du
Jiaqi Li
Zhi Gao
Qing Li
VLM
LLMAG
40
55
0
18 Mar 2024
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang
Yuhui Zhang
Orr Zohar
Serena Yeung-Levy
VLM
88
51
0
15 Mar 2024
Large Multimodal Agents: A Survey
Junlin Xie
Zhihong Chen
Ruifei Zhang
Xiang Wan
Guanbin Li
LM&Ro
LLMAG
34
4
0
23 Feb 2024
HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting
Zhen Zhou
Fan Ma
Hehe Fan
Yi Yang
3DGS
19
26
0
09 Feb 2024
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
Dewei Zhou
You Li
Fan Ma
Zongxin Yang
Yi Yang
DiffM
18
57
0
08 Feb 2024
CapHuman: Capture Your Moments in Parallel Universes
Chao Liang
Fan Ma
Linchao Zhu
Yingying Deng
Yi Yang
DiffM
13
17
0
01 Feb 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Ping Luo
Jiebo Luo
Chenliang Xu
VLM
38
76
0
29 Dec 2023
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction
Zechuan Zhang
Zongxin Yang
Yi Yang
9
34
0
10 Dec 2023
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
Junke Wang
Dongdong Chen
Chong Luo
Xiyang Dai
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
87
54
0
27 Apr 2023
Generative Agents: Interactive Simulacra of Human Behavior
J. Park
Joseph C. O'Brien
Carrie J. Cai
Meredith Ringel Morris
Percy Liang
Michael S. Bernstein
LM&Ro
AI4CE
204
1,701
0
07 Apr 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks
Yuxiang Wu
Yu Zhao
Baotian Hu
Pasquale Minervini
Pontus Stenetorp
Sebastian Riedel
RALM
KELM
41
42
0
30 Oct 2022
Binding Language Models in Symbolic Languages
Zhoujun Cheng
Tianbao Xie
Peng Shi
Chengzu Li
Rahul Nadkarni
...
Dragomir R. Radev
Mari Ostendorf
Luke Zettlemoyer
Noah A. Smith
Tao Yu
LMTD
109
195
0
06 Oct 2022
Video Graph Transformer for Video Question Answering
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
131
73
0
12 Jul 2022
Training Language Models with Memory Augmentation
Zexuan Zhong
Tao Lei
Danqi Chen
RALM
221
126
0
25 May 2022
Scalable Video Object Segmentation with Identification Mechanism
Zongxin Yang
Jiaxu Miao
Yunchao Wei
Wenguan Wang
Xiaohan Wang
Yi Yang
VOS
26
23
0
22 Mar 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
380
4,010
0
28 Jan 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
212
682
0
13 Oct 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
272
1,939
0
09 Feb 2021
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
188
375
0
01 Feb 2021
1