arXiv: 2303.16990
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
29 March 2023
Brian Chen, Nina Shvetsova, Andrew Rouditchenko, D. Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James R. Glass, Hilde Kuehne

Papers citing "What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions" (8 papers):
Evangelos Kazakos, Cordelia Schmid, Josef Sivic. "Large-scale Pre-training for Grounded Video Caption Generation." 13 Mar 2025. 0 citations.
Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu. "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding." 27 Sep 2022. 32 citations.
Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu. "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection." 20 Sep 2022. 151 citations. [CLIP, VLM]
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, ..., Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." 13 Oct 2021. 1,017 citations. [EgoV]
Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan C. Russell. "Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions." 07 Oct 2021. 15 citations.
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong. "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text." 22 Apr 2021. 573 citations. [ViT]
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li. "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval." 18 Apr 2021. 771 citations. [CLIP, VLM]
Tomáš Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." 16 Jan 2013. 31,150 citations.