TubeDETR: Spatio-Temporal Video Grounding with Transformers

30 March 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
    ViT

Papers citing "TubeDETR: Spatio-Temporal Video Grounding with Transformers"

50 / 69 papers shown
MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer
Divyanshu Mishra
Pramit Saha
He Zhao
Netzahualcoyotl Hernandez-Cruz
Olga Patey
A. Papageorghiou
J. A. Noble
21
0
0
08 Apr 2025
VideoGEM: Training-free Action Grounding in Videos
Felix Vogel
Walid Bousselham
Anna Kukleva
Nina Shvetsova
Hilde Kuehne
LM&Ro
VLM
120
0
0
26 Mar 2025
Action tube generation by person query matching for spatio-temporal action detection
Kazuki Omi
Jion Oshima
Toru Tamaki
60
0
0
17 Mar 2025
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory
Saket Gurukar
Asim Kadav
VLM
50
0
0
17 Mar 2025
OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
Jiali Yao
Xinran Deng
Xin Gu
Mengrui Dai
Bing Fan
Zhipeng Zhang
Yan Huang
Heng Fan
L. Zhang
56
0
0
13 Mar 2025
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
54
0
0
13 Mar 2025
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
Xin Gu
Yaojie Shen
Chenxi Luo
Tiejian Luo
Yan Huang
Yuewei Lin
Heng Fan
L. Zhang
58
1
0
16 Feb 2025
Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
28
0
0
12 Nov 2024
PESFormer: Boosting Macro- and Micro-expression Spotting with Direct Timestamp Encoding
Wang-Wang Yu
Kai-Fu Yang
Xiangrui Hu
Jingwen Jiang
Hong-Mei Yan
Yong-Jie Li
24
0
0
24 Oct 2024
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding
Yang Liu
Daizong Liu
Wei Hu
3DPC
16
1
0
21 Oct 2024
Described Spatial-Temporal Video Detection
Wei Ji
Xiangyan Liu
Yingfei Sun
Jiajun Deng
You Qin
Ammar Nuwanna
Mengyao Qiu
Lina Wei
Roger Zimmermann
32
2
0
08 Jul 2024
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA
Hailiang Zhang
Dian Chao
Zhihao Guan
Yang Yang
35
0
0
02 Jul 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
49
7
0
21 Mar 2024
IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
Hang Wang
Zhi-Qi Cheng
Youtian Du
Lei Zhang
21
1
0
18 Mar 2024
Context-Guided Spatio-Temporal Video Grounding
Xin Gu
Hengrui Fan
Yan Huang
Tiejian Luo
Libo Zhang
26
14
0
03 Jan 2024
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
Syed Talal Wasim
Muzammal Naseer
Salman Khan
Ming-Hsuan Yang
Fahad Shahbaz Khan
12
12
0
31 Dec 2023
Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
Yifan Lu
Ziqi Zhang
Chunfen Yuan
Peng Li
Yan Wang
Bing Li
Weiming Hu
30
3
0
25 Dec 2023
LLM4VG: Large Language Models Evaluation for Video Grounding
Wei Feng
Xin Wang
Hong Chen
Zeyang Zhang
Zihan Song
Yuwei Zhou
Wenwu Zhu
31
8
0
21 Dec 2023
Perception Test 2023: A Summary of the First Challenge And Outcome
Joseph Heyward
João Carreira
Dima Damen
Andrew Zisserman
Viorica Patraucean
14
0
0
20 Dec 2023
Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar
Yongqin Xian
A. Tonioni
Andrew Zisserman
Federico Tombari
30
12
0
19 Dec 2023
TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking
Raghav Goyal
Wan-Cyuan Fan
Mennatullah Siam
Leonid Sigal
VOS
32
2
0
13 Dec 2023
Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval
Love Panta
Prashant Shrestha
Brabeem Sapkota
Amrita Bhattarai
Suresh Manandhar
Anand Kumar Sah
23
2
0
12 Dec 2023
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Trong-Thuan Nguyen
Pha Nguyen
Khoa Luu
22
12
0
05 Dec 2023
REACT: Recognize Every Action Everywhere All At Once
N. V. R. Chappa
Pha Nguyen
P. Dobbs
Khoa Luu
30
6
0
27 Nov 2023
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models
Jingru Yi
Burak Uzkent
Oana Ignat
Zili Li
Amanmeet Garg
Xiang Yu
Linda Liu
VLM
25
1
0
05 Nov 2023
Video Referring Expression Comprehension via Transformer with Content-conditioned Query
Jiang Ji
Meng Cao
Tengtao Song
Long Chen
Yi Wang
Yuexian Zou
13
6
0
25 Oct 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
13
26
0
25 Sep 2023
Dense Object Grounding in 3D Scenes
Wencan Huang
Daizong Liu
Wei Hu
13
17
0
05 Sep 2023
MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
Henghui Ding
Chang Liu
Shuting He
Xudong Jiang
Chen Change Loy
VOS
33
101
0
16 Aug 2023
Memory-and-Anticipation Transformer for Online Action Understanding
Jiahao Wang
Guo Chen
Yifei Huang
Liming Wang
Tong Lu
OffRL
54
37
0
15 Aug 2023
D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
Hanjun Li
Xiujun Shu
Su He
Ruizhi Qiao
Wei Wen
Taian Guo
Bei Gan
Xing Sun
12
11
0
08 Aug 2023
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model
Peng Wu
Jing Liu
Xiangteng He
Yuxin Peng
Peng Wang
Yanning Zhang
32
29
0
24 Jul 2023
Multi-Modal Machine Learning for Assessing Gaming Skills in Online Streaming: A Case Study with CS:GO
Longxiang Zhang
Wenping Wang
37
1
0
23 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
25
244
0
13 Jul 2023
Look, Remember and Reason: Grounded reasoning in videos with language models
Apratim Bhattacharyya
Sunny Panchal
Mingu Lee
Reza Pourreza
Pulkit Madan
Roland Memisevic
LRM
33
7
0
30 Jun 2023
Dense Video Object Captioning from Disjoint Supervision
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
20
2
0
20 Jun 2023
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Chun-Hsiao Yeh
Bryan C. Russell
Josef Sivic
Fabian Caba Heilbron
Simon Jenni
VLM
MLLM
44
9
0
16 Jun 2023
Single-Stage Visual Query Localization in Egocentric Videos
Hanwen Jiang
Santhosh Kumar Ramakrishnan
Kristen Grauman
13
13
0
15 Jun 2023
Type-to-Track: Retrieve Any Object via Prompt-based Tracking
Pha Nguyen
Kha Gia Quach
Kris M. Kitani
Khoa Luu
30
18
0
22 May 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
27
7
0
29 Mar 2023
Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Sounak Mondal
Zhibo Yang
Seoyoung Ahn
Dimitris Samaras
G. Zelinsky
Minh Hoai
17
29
0
27 Mar 2023
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
WonJun Moon
Sangeek Hyun
S. Park
Dongchan Park
Jae-Pil Heo
ViT
41
106
0
24 Mar 2023
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Xiang Fang
Daizong Liu
Pan Zhou
Guoshun Nan
23
37
0
14 Mar 2023
Referring Multi-Object Tracking
Dongming Wu
Wencheng Han
Tiancai Wang
Xingping Dong
Xiangyu Zhang
Jianbing Shen
24
71
0
06 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
23
220
0
27 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
8
7
0
16 Feb 2023
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
43
74
0
17 Nov 2022
Grounded Video Situation Recognition
Zeeshan Khan
C. V. Jawahar
Makarand Tapaswi
22
13
0
19 Oct 2022
Video Referring Expression Comprehension via Transformer with Content-aware Query
Ji Jiang
Meng Cao
Tengtao Song
Yuexian Zou
19
5
0
06 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
33
16
0
05 Oct 2022