Human-centric Spatio-Temporal Video Grounding With Visual Transformers

10 November 2020

Zongheng Tang

Papers citing "Human-centric Spatio-Temporal Video Grounding With Visual Transformers"

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

204

03 Dec 2025

Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

382

26 Nov 2025

Vidi2.5: Large Multimodal Models for Video Understanding and Creation

...

Yicheng He

Yiming Cui

Zhenfang Chen

Zhihua Wu

Zuhua Lin

107

24 Nov 2025

OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

237

21 Nov 2025

R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios

292

21 Nov 2025

NVIDIA Nemotron Nano V2 VL

Nvidia

Amala Sanjay Deshmukh

...

377

06 Nov 2025

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

406

27 Oct 2025

SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

323

14 Oct 2025

Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey

237

12 Oct 2025

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

168

18 Sep 2025

RynnEC: Bringing MLLMs into Embodied World

242

19 Aug 2025

Fine-grained Spatiotemporal Grounding on Egocentric Videos

329

01 Aug 2025

Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Victoria A. Webster-Wood

Fuzheng Zhang

Deyi Xiong

293

11 Jun 2025

Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought

453

10 Jun 2025

V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

371

14 Mar 2025

OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

436

13 Mar 2025

Large-scale Pre-training for Grounded Video Caption Generation

Evangelos Kazakos

Cordelia Schmid

Josef Sivic

563

13 Mar 2025

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

Hanyu Zhou

Gim Hee Lee

328

10 Mar 2025

UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

394

08 Mar 2025

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

International Conference on Learning Representations (ICLR), 2025

346

16 Feb 2025

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Computer Vision and Pattern Recognition (CVPR), 2025

Subhashree Radhakrishnan

560

14 Jan 2025

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Computer Vision and Pattern Recognition (CVPR), 2024

...

493

31 Dec 2024

Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Neural Information Processing Systems (NeurIPS), 2024

638

31 Dec 2024

Towards Visual Grounding: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

1.1K

28 Dec 2024

VideoOrion: Tokenizing Object Dynamics in Videos

Sipeng Zheng

Zongqing Lu

434

25 Nov 2024

Grounded Video Caption Generation

Evangelos Kazakos

Cordelia Schmid

Josef Sivic

296

12 Nov 2024

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Computer Vision and Pattern Recognition (CVPR), 2024

548

07 Nov 2024

SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

414

14 Oct 2024

Described Spatial-Temporal Video Detection

You Qin

320

08 Jul 2024

Artemis: Towards Referential Understanding in Complex Videos

236

01 Jun 2024

Open-Vocabulary Spatio-Temporal Action Detection

Tao Wu

Gangshan Wu

270

17 May 2024

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

Dongjing Shan

Guiqiang Chen

ViT

347

07 Apr 2024

IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting

319

18 Mar 2024

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Yifei Huang

335

136

14 Mar 2024

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

349

02 Mar 2024

Context-Guided Spatio-Temporal Video Grounding

Computer Vision and Pattern Recognition (CVPR), 2024

373

03 Jan 2024

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Computer Vision and Pattern Recognition (CVPR), 2023

Syed Talal Wasim

Muzammal Naseer

Salman Khan

Ming-Hsuan Yang

Fahad Shahbaz Khan

398

31 Dec 2023

Video Understanding with Large Language Models: A Survey

...

860

202

29 Dec 2023

Panoptic Video Scene Graph Generation

Computer Vision and Pattern Recognition (CVPR), 2023

Xiangtai Li

...

Ziwei Liu

343

28 Nov 2023

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Salman Khan

Mubarak Shah

Fahad Khan

VLM MLLM

257

22 Nov 2023

Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation

IEEE Transactions on Image Processing (IEEE TIP), 2023

Hefeng Wu

359

23 Sep 2023

MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

IEEE International Conference on Computer Vision (ICCV), 2023

357

217

16 Aug 2023

DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection

Computer Vision and Pattern Recognition (CVPR), 2023

Zongheng Tang

Yifan Sun

Si Liu

Yi Yang

ViT

223

14 Apr 2023

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Computer Vision and Pattern Recognition (CVPR), 2023

390

29 Mar 2023

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Machine Intelligence Research (MIR), 2023

Yaowei Wang

Yonghong Tian

Wen Gao

AI4CE VLM

584

286

20 Feb 2023

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

Neural Information Processing Systems (NeurIPS), 2022

277

27 Sep 2022

STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding

274

06 Jul 2022

TubeDETR: Spatio-Temporal Video Grounding with Transformers

Computer Vision and Pattern Recognition (CVPR), 2022

370

127

30 Mar 2022

End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

Zhou Zhao

...

Peng Wang

333

15 Mar 2022

Temporal Sentence Grounding in Videos: A Survey and Future Directions

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

460

20 Jan 2022