arXiv: 2110.10596
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan C. Russell
20 October 2021
Tags: SSL
Papers citing "Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos" (8 papers shown)
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos, Cordelia Schmid, Josef Sivic
13 Mar 2025
Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment
Zihui Xue, Kristen Grauman
Tags: EgoV
08 Jun 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen, Nina Shvetsova, Andrew Rouditchenko, D. Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James R. Glass, Hilde Kuehne
29 Mar 2023
Referring Multi-Object Tracking
Dongming Wu, Wencheng Han, Tiancai Wang, Xingping Dong, Xiangyu Zhang, Jianbing Shen
06 Mar 2023
Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures
Nannan Li, Kevin J. Shih, Bryan A. Plummer
04 Oct 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti, Noël Vouitsis, Junwei Ma, Keyvan Golestan, M. Volkovs, Animesh Garg, Guangwei Yu
28 Mar 2022
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen, Haoqi Fan, Ross B. Girshick, Kaiming He
Tags: SSL
09 Mar 2020
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov, Kai Chen, G. Corrado, J. Dean
Tags: 3DV
16 Jan 2013