Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering

2 June 2024

Papers citing "Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering"

34 / 34 papers shown

Seeing the Wind from a Falling Leaf

323

30 Nov 2025

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

423

10 Nov 2025

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

...

413

23 Sep 2025

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

297

04 Aug 2025

Augmented Vision-Language Models: A Systematic Review

228

24 Jul 2025

IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

314

11 Jun 2025

SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

Computer Vision and Pattern Recognition (CVPR), 2025

560

01 May 2025

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

385

18 Jul 2024

STAR: A Benchmark for Situated Reasoning in Real-World Videos

Chuang Gan

594

284

15 May 2024

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

See Kiong Ng

369

324

25 Apr 2024

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Joshua B. Tenenbaum

Chuang Gan

LRM

250

09 Feb 2024

Source-Free and Image-Only Unsupervised Domain Adaptation for Category Level Object Pose Estimation

345

19 Jan 2024

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin

1.8K

1,402

16 Nov 2023

3D-Aware Visual Question Answering about Parts, Poses and Occlusions

Neural Information Processing Systems (NeurIPS), 2023

423

27 Oct 2023

Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties

Neural Information Processing Systems (NeurIPS), 2023

Mingyu Ding

Chuang Gan

274

27 Jun 2023

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang

...

Yu Qiao

597

492

06 Dec 2022

Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning

Computer Vision and Pattern Recognition (CVPR), 2022

Elias Stengel-Eskin

Max Planck Institute for Informatics

OOD LRM

269

114

01 Dec 2022

CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Maitreya Patel

Tejas Gokhale

Chitta Baral

Yezhou Yang

291

07 Nov 2022

LAION-5B: An open large-scale dataset for training next generation image-text models

Neural Information Processing Systems (NeurIPS), 2022

...

1.5K

4,964

16 Oct 2022

Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features

European Conference on Computer Vision (ECCV), 2022

252

12 Sep 2022

Learning to Answer Visual Questions from Web Videos

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

436

10 May 2022

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

International Conference on Learning Representations (ICLR), 2022

Zhenfang Chen

Kexin Yi

Yunzhu Li

Mingyu Ding

Antonio Torralba

J. Tenenbaum

Chuang Gan

CoGe OCL

270

02 May 2022

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Neural Information Processing Systems (NeurIPS), 2021

Mingyu Ding

Chuang Gan

280

28 Oct 2021

Physion: Evaluating Physical Prediction from Vision in Humans and Machines

...

Li Fei-Fei

Nancy Kanwisher

642

131

15 Jun 2021

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

IEEE International Conference on Computer Vision (ICCV), 2021

1.1K

1,535

01 Apr 2021

NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation

International Conference on Learning Representations (ICLR), 2021

292

29 Jan 2021

CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions

Tayfun Ates

Muhammed Samil Atesoglu

415

08 Dec 2020

CoKe: Localized Contrastive Learning for Robust Keypoint Detection

285

29 Sep 2020

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

International Conference on Learning Representations (ICLR), 2019

Rohit Girdhar

Deva Ramanan

443

198

10 Oct 2019

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

International Conference on Learning Representations (ICLR), 2019

Yunzhu Li

Jiajun Wu

528

480

03 Oct 2019

Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning

471

786

03 Apr 2019

FiLM: Visual Reasoning with a General Conditioning Layer

Aaron Courville

FAtt AIMat OffRL AI4CE

879

3,392

22 Sep 2017

The "something something" video database for learning and evaluating visual common sense

IEEE International Conference on Computer Vision (ICCV), 2017

Raghav Goyal

Samira Ebrahimi Kahou

Vincent Michalski

Joanna Materzynska

S. Westphal

...

Moritz Mueller-Freitag

572

1,886

13 Jun 2017

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015

3.6K

71,917

04 Jun 2015