2406.00622
Cited By
v1
v2 (latest)
Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
2 June 2024
Xingrui Wang
Wufei Ma
Angtian Wang
Shuo Chen
Adam Kortylewski
Yaoyao Liu
Papers citing
"Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering"
34 / 34 papers shown
Seeing the Wind from a Falling Leaf
Zhiyuan Gao
Jiageng Mao
Hong-Xing Yu
Haozhe Lou
Emily Yue-Ting Jia
J. Barbič
Jiajun Wu
Yue Wang
VGen
PINN
247
0
0
30 Nov 2025
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Hunar Batra
Haoqin Tu
Hardy Chen
Yuanze Lin
Cihang Xie
Ronald Clark
OffRL
ReLM
LRM
374
0
0
10 Nov 2025
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
S. Yu
Yuxin Chen
Hao Ju
Lianjie Jia
Fuxi Zhang
...
Lin Song
Lijun Wang
Yanwei Li
Y. Shan
Huchuan Lu
LRM
319
9
0
23 Sep 2025
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou
Alexander Vilesov
Xuehai He
Ziyu Wan
Shuwang Zhang
Aditya Nagachandra
Di Chang
DongDong Chen
Xin Eric Wang
A. Kadambi
VLM
185
0
0
04 Aug 2025
Augmented Vision-Language Models: A Systematic Review
Anthony C Davis
Burhan Sadiq
Tianmin Shu
Chien-Ming Huang
VLM
LRM
196
0
0
24 Jul 2025
IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments
Florian Bordes
Q. Garrido
Justine T Kao
Adina Williams
Michael G. Rabbat
Emmanuel Dupoux
PINN
223
14
0
11 Jun 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2025
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
Jieneng Chen
LRM
471
21
0
01 May 2025
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Wufei Ma
Kai Li
Zhongshi Jiang
Moustafa Meshry
Qihao Liu
Huiyu Wang
Christian Hane
Yaoyao Liu
VGen
245
2
0
18 Jul 2024
STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu
Shoubin Yu
Zhenfang Chen
Joshua B. Tenenbaum
Chuang Gan
470
257
0
15 May 2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu
Yilin Zhao
Daquan Zhou
Zhijie Lin
See Kiong Ng
Jiashi Feng
MLLM
VLM
274
276
0
25 Apr 2024
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
Zhicheng Zheng
Xin Yan
Zhenfang Chen
Jingzhou Wang
Qin Zhi Eddie Lim
Joshua B. Tenenbaum
Chuang Gan
LRM
209
14
0
09 Feb 2024
Source-Free and Image-Only Unsupervised Domain Adaptation for Category Level Object Pose Estimation
Prakhar Kaushik
Aayush Mishra
Adam Kortylewski
Yaoyao Liu
3DH
234
9
0
19 Jan 2024
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
1.6K
1,181
0
16 Nov 2023
3D-Aware Visual Question Answering about Parts, Poses and Occlusions
Neural Information Processing Systems (NeurIPS), 2023
Xingrui Wang
Wufei Ma
Zhuowan Li
Adam Kortylewski
Yaoyao Liu
CoGe
318
21
0
27 Oct 2023
Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties
Neural Information Processing Systems (NeurIPS), 2023
H. Tung
Mingyu Ding
Zhenfang Chen
Daniel M. Bear
Chuang Gan
J. Tenenbaum
Daniel L. K. Yamins
Judy Fan
Kevin A. Smith
223
28
0
27 Jun 2023
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
454
446
0
06 Dec 2022
Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning
Computer Vision and Pattern Recognition (CVPR), 2022
Zhuowan Li
Xingrui Wang
Elias Stengel-Eskin
Adam Kortylewski
Wufei Ma
Benjamin Van Durme
Max Planck Institute for Informatics
OOD
LRM
232
102
0
01 Dec 2022
CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Maitreya Patel
Tejas Gokhale
Chitta Baral
Yezhou Yang
244
13
0
07 Nov 2022
LAION-5B: An open large-scale dataset for training next generation image-text models
Neural Information Processing Systems (NeurIPS), 2022
Christoph Schuhmann
Romain Beaumont
Richard Vencu
Cade Gordon
Ross Wightman
...
Srivatsa Kundurthy
Katherine Crowson
Ludwig Schmidt
R. Kaczmarczyk
J. Jitsev
VLM
MLLM
CLIP
887
4,531
0
16 Oct 2022
Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features
European Conference on Computer Vision (ECCV), 2022
Wufei Ma
Angtian Wang
Alan Yuille
Adam Kortylewski
3DH
3DV
204
30
0
12 Sep 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
314
39
0
10 May 2022
ComPhy: Compositional Physical Reasoning of Objects and Events from Videos
International Conference on Learning Representations (ICLR), 2022
Zhenfang Chen
Kexin Yi
Yunzhu Li
Mingyu Ding
Antonio Torralba
J. Tenenbaum
Chuang Gan
CoGe
OCL
212
60
0
02 May 2022
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
Neural Information Processing Systems (NeurIPS), 2021
Mingyu Ding
Zhenfang Chen
Tao Du
Ping Luo
J. Tenenbaum
Chuang Gan
VGen
PINN
OCL
226
79
0
28 Oct 2021
Physion: Evaluating Physical Prediction from Vision in Humans and Machines
Daniel M. Bear
E. Wang
Damian Mrowca
Felix Binder
Hsiau-Yu Fish Tung
...
Li Fei-Fei
Nancy Kanwisher
J. Tenenbaum
Daniel L. K. Yamins
Judith E. Fan
OOD
544
116
0
15 Jun 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
837
1,442
0
01 Apr 2021
NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation
International Conference on Learning Representations (ICLR), 2021
Angtian Wang
Adam Kortylewski
Alan Yuille
3DH
222
52
0
29 Jan 2021
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions
Tayfun Ates
Muhammed Samil Atesoglu
Cagatay Yigit
İlker Kesen
Mert Kobaş
Erkut Erdem
Aykut Erdem
T. Goksun
Deniz Yuret
311
38
0
08 Dec 2020
CoKe: Localized Contrastive Learning for Robust Keypoint Detection
Yutong Bai
Angtian Wang
Adam Kortylewski
Alan Yuille
207
18
0
29 Sep 2020
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
International Conference on Learning Representations (ICLR), 2019
Rohit Girdhar
Deva Ramanan
382
192
0
10 Oct 2019
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
International Conference on Learning Representations (ICLR), 2019
Kexin Yi
Yuta Saito
Yunzhu Li
Pushmeet Kohli
Jiajun Wu
Antonio Torralba
J. Tenenbaum
NAI
400
480
0
03 Oct 2019
Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning
Shichen Liu
Tianye Li
Weikai Chen
Hao Li
3DV
403
759
0
03 Apr 2019
FiLM: Visual Reasoning with a General Conditioning Layer
Ethan Perez
Florian Strub
H. D. Vries
Vincent Dumoulin
Aaron Courville
FAtt
AIMat
OffRL
AI4CE
776
2,882
0
22 Sep 2017
The "something something" video database for learning and evaluating visual common sense
IEEE International Conference on Computer Vision (ICCV), 2017
Raghav Goyal
Samira Ebrahimi Kahou
Vincent Michalski
Joanna Materzynska
S. Westphal
...
Moritz Mueller-Freitag
F. Hoppe
Christian Thurau
Ingo Bax
Roland Memisevic
VLM
434
1,782
0
13 Jun 2017
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015
Shaoqing Ren
Kaiming He
Ross B. Girshick
Jian Sun
AIMat
ObjD
2.5K
69,422
0
04 Jun 2015