ResearchTrend.AI
Human-centric Spatio-Temporal Video Grounding With Visual Transformers

10 November 2020
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu

Papers citing "Human-centric Spatio-Temporal Video Grounding With Visual Transformers"

50 / 57 papers shown
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
Qiáo Xu
Tianwen Qian
Yuqian Fu
Kailing Li
Yang Jiao
Jiacheng Zhang
Xiaoling Wang
Liang He
204
2
0
03 Dec 2025
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Xin Gu
H. Zhang
Qihang Fan
Jingxuan Niu
Zhipeng Zhang
Libo Zhang
G. Chen
Fan Chen
Longyin Wen
Sijie Zhu
AI4TS, LRM
382
2
0
26 Nov 2025
Vidi2.5: Large Multimodal Models for Video Understanding and Creation
Vidi Team
Celong Liu
Chia-Wen Kuo
Chuang Huang
Dawei Du
...
Yicheng He
Yiming Cui
Zhenfang Chen
Zhihua Wu
Zuhua Lin
107
0
0
24 Nov 2025
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
Hong-xia Gao
J. Wu
X. Xu
Kangni Xie
Yunchen Zhang
Bin Zhong
Xurui Gao
Min-Ling Zhang
AI4TS
237
1
0
21 Nov 2025
R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios
Lu Zhu
Tiantian Geng
Yangye Chen
Teng Wang
Ping Lu
Feng Zheng
AI4TS
292
1
0
21 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
377
3
0
06 Nov 2025
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan
W. Zhang
Xin Li
Shihao Wang
Kehan Li
Wentong Li
Jun Xiao
Lei Zhang
Beng Chin Ooi
ObjD
406
3
0
27 Oct 2025
SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Tanveer Hannan
Shuaicong Wu
Mark Weber
Antonio Terpin
Jindong Gu
Rajat Koner
Aljosa Osep
Laura Leal-Taixé
Thomas Seidl
323
1
0
14 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
237
1
0
12 Oct 2025
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Zaiquan Yang
Yuhao Liu
Gerhard Hancke
Rynson W. H. Lau
AI4TS
168
3
0
18 Sep 2025
RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang
Yuqian Yuan
Yunxuan Mao
Kehan Li
Jiangpin Liu
Zhikai Wang
Xin Li
F. Wang
Deli Zhao
VGen, LM&Ro
242
7
0
19 Aug 2025
Fine-grained Spatiotemporal Grounding on Egocentric Videos
Shuo Liang
Yiwu Zhong
Zi-Yuan Hu
Yeyao Tao
Liwei Wang
EgoV
329
5
0
01 Aug 2025
Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Linhao Yu
Xinguang Ji
Yahui Liu
Fanheng Kong
Chenxi Sun
Jingyuan Zhang
Hongzhi Zhang
Victoria A. Webster-Wood
Fuzheng Zhang
Deyi Xiong
293
2
0
11 Jun 2025
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought
Shuyi Zhang
Xiaoshuai Hao
Yingbo Tang
Lingfeng Zhang
Pengwei Wang
Zhongyuan Wang
Hongxuan Ma
Shanghang Zhang
VGen, AI4TS
453
14
0
10 Jun 2025
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Zixu Cheng
Jian Hu
Ziquan Liu
Chenyang Si
Wei Li
Shaogang Gong
LRM
371
31
0
14 Mar 2025
OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
Jiali Yao
Xinran Deng
Xin Gu
Mengrui Dai
Bing Fan
Zhipeng Zhang
Yan Huang
Heng Fan
L. Zhang
436
5
0
13 Mar 2025
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
563
3
0
13 Mar 2025
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou
Gim Hee Lee
328
3
0
10 Mar 2025
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Baining Zhao
Jianjie Fang
Zichao Dai
Liang Luo
Jirong Zha
...
Chen Gao
Yijiao Wang
Jinqiang Cui
Xinlei Chen
Yongqian Li
394
25
0
08 Mar 2025
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
International Conference on Learning Representations (ICLR), 2025
Xin Gu
Yaojie Shen
Chenxi Luo
Tiejian Luo
Yan Huang
Lu Ma
Heng Fan
L. Zhang
346
10
0
16 Feb 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Computer Vision and Pattern Recognition (CVPR), 2025
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD, VLM
560
12
0
14 Jan 2025
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Computer Vision and Pattern Recognition (CVPR), 2024
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
493
49
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Neural Information Processing Systems (NeurIPS), 2024
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
638
85
0
31 Dec 2024
Towards Visual Grounding: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
1.1K
37
0
28 Dec 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
434
11
0
25 Nov 2024
Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
296
0
0
12 Nov 2024
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Computer Vision and Pattern Recognition (CVPR), 2024
Shehan Munasinghe
Hanan Gani
Wenqi Zhu
Jiale Cao
Eric P. Xing
Fahad Shahbaz Khan
Salman Khan
MLLM, VGen, VLM
548
34
0
07 Nov 2024
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Rasoul Shafipour
David Harrison
Maxwell Horton
Jeffrey Marker
Houman Bedayat
Sachin Mehta
Mohammad Rastegari
Mahyar Najibi
Saman Naderiparizi
MQ
414
7
0
14 Oct 2024
Described Spatial-Temporal Video Detection
Wei Ji
Xiangyan Liu
Yingfei Sun
Jiajun Deng
You Qin
Ammar Nuwanna
Mengyao Qiu
Lina Wei
Roger Zimmermann
320
3
0
08 Jul 2024
Artemis: Towards Referential Understanding in Complex Videos
Jihao Qiu
Yuan Zhang
Xi Tang
Lingxi Xie
Tianren Ma
Pengyu Yan
David Doermann
Qixiang Ye
Yunjie Tian
VLM, VGen
236
24
0
01 Jun 2024
Open-Vocabulary Spatio-Temporal Action Detection
Tao Wu
Shuqiu Ge
Jie Qin
Gangshan Wu
Limin Wang
ObjD
270
9
0
17 May 2024
GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets
Dongjing Shan
Guiqiang Chen
ViT
347
1
0
07 Apr 2024
IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
Hang Wang
Zhi-Qi Cheng
Youtian Du
Lei Zhang
319
2
0
18 Mar 2024
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
Guo Chen
Yifei Huang
Jilan Xu
Baoqi Pei
Zhe Chen
Zhiqi Li
Jiahao Wang
Kunchang Li
Tong Lu
Limin Wang
Mamba
335
136
0
14 Mar 2024
Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection
Chenchen Tao
Chong Wang
Yuexian Zou
Xiaohao Peng
Yan Han
Jiangbo Qian
349
7
0
02 Mar 2024
Context-Guided Spatio-Temporal Video Grounding
Computer Vision and Pattern Recognition (CVPR), 2024
Xin Gu
Hengrui Fan
Yan Huang
Tiejian Luo
Libo Zhang
373
42
0
03 Jan 2024
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
Computer Vision and Pattern Recognition (CVPR), 2023
Syed Talal Wasim
Muzammal Naseer
Salman Khan
Ming-Hsuan Yang
Fahad Shahbaz Khan
398
32
0
31 Dec 2023
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
VLM
860
202
0
29 Dec 2023
Panoptic Video Scene Graph Generation
Computer Vision and Pattern Recognition (CVPR), 2023
Jingkang Yang
Wen-Hsiao Peng
Xiangtai Li
Zujin Guo
Liangyu Chen
...
Zheng Ma
Kaiyang Zhou
Wayne Zhang
Chen Change Loy
Ziwei Liu
VOS
343
58
0
28 Nov 2023
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Shehan Munasinghe
Rusiru Thushara
Muhammad Maaz
H. Rasheed
Salman Khan
Mubarak Shah
Fahad Khan
VLM, MLLM
257
51
0
22 Nov 2023
Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation
IEEE Transactions on Image Processing (IEEE TIP), 2023
Tao Pu
Tianshui Chen
Hefeng Wu
Yongyi Lu
Liangjie Lin
ViT
359
19
0
23 Sep 2023
MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
IEEE International Conference on Computer Vision (ICCV), 2023
Henghui Ding
Chang Liu
Shuting He
Xudong Jiang
Chen Change Loy
VOS
357
217
0
16 Aug 2023
DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection
Computer Vision and Pattern Recognition (CVPR), 2023
Zongheng Tang
Yifan Sun
Si Liu
Yi Yang
ViT
223
12
0
14 Apr 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Computer Vision and Pattern Recognition (CVPR), 2023
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
390
10
0
29 Mar 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Machine Intelligence Research (MIR), 2023
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE, VLM
584
286
0
20 Feb 2023
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Neural Information Processing Systems (NeurIPS), 2022
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
277
51
0
27 Sep 2022
STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding
Zihang Lin
Chaolei Tan
Jianfang Hu
Zhi Jin
Tiancai Ye
Weihao Zheng
274
5
0
06 Jul 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
370
127
0
30 Mar 2022
End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Meng Li
Tianbao Wang
Haoyu Zhang
Shengyu Zhang
Zhou Zhao
...
Wenming Tan
Jin Wang
Peng Wang
Shi Pu
Leilei Gan
333
46
0
15 Mar 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
3DGS
460
56
0
20 Jan 2022