End-to-End Learning of Visual Representations from Uncurated Instructional Videos

13 December 2019
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
    VGen
    SSL

Papers citing "End-to-End Learning of Visual Representations from Uncurated Instructional Videos"

50 / 179 papers shown
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
Siteng Huang
Biao Gong
Yulin Pan
Jianwen Jiang
Yiliang Lv
Yuyuan Li
Donglin Wang
VLM
VPVLM
16
41
0
23 Nov 2022
Unifying Tracking and Image-Video Object Detection
Peirong Liu
Rui Wang
Pengchuan Zhang
Omid Poursaeed
Yipin Zhou
Xuefei Cao
Sreya Dutta Roy
Ashish Shah
Ser-Nam Lim
11
0
0
20 Nov 2022
Self-supervised remote sensing feature learning: Learning Paradigms, Challenges, and Future Works
Chao Tao
Ji Qi
Mingning Guo
Qing Zhu
Haifeng Li
SSL
19
56
0
15 Nov 2022
Unsupervised Audio-Visual Lecture Segmentation
Darshan Singh
Anchit Gupta
C. V. Jawahar
Makarand Tapaswi
VOS
16
4
0
29 Oct 2022
Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
Minjoon Jung
Seongho Choi
Joo-Kyung Kim
Jin-Hwa Kim
Byoung-Tak Zhang
29
7
0
23 Oct 2022
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
44
11
0
10 Oct 2022
A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
Anil Batra
Shreyank N. Gowda
Frank Keller
Laura Sevilla-Lara
24
5
0
30 Sep 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
49
28
0
28 Sep 2022
Graph Soft-Contrastive Learning via Neighborhood Ranking
Zhiyuan Ning
P. Wang
Pengyang Wang
Ziyue Qiao
Wei Fan
Denghui Zhang
Yi Du
Yuanchun Zhou
16
13
0
28 Sep 2022
UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
Janghyeon Lee
Jongsuk Kim
Hyounguk Shon
Bumsoo Kim
Seung Wook Kim
Honglak Lee
Junmo Kim
CLIP
VLM
50
53
0
27 Sep 2022
FETA: Towards Specializing Foundation Models for Expert Task Applications
Amit Alfassy
Assaf Arbelle
Oshri Halimi
Sivan Harary
Roei Herzig
...
Christoph Auer
Kate Saenko
Peter W. J. Staar
Rogerio Feris
Leonid Karlinsky
21
19
0
08 Sep 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
19
63
0
04 Sep 2022
Partially Relevant Video Retrieval
Jianfeng Dong
Xianke Chen
Minsong Zhang
Xun Yang
Shujie Chen
Xirong Li
Xun Wang
14
39
0
26 Aug 2022
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey
Yanbei Chen
Massimiliano Mancini
Xiatian Zhu
Zeynep Akata
36
113
0
24 Aug 2022
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
Alex Falcon
G. Serra
O. Lanz
VGen
34
25
0
03 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
16
17
0
01 Aug 2022
Negative Samples are at Large: Leveraging Hard-distance Elastic Loss for Re-identification
Hyungtae Lee
Sungmin Eum
H. Kwon
VLM
15
4
0
20 Jul 2022
Zero-Shot Temporal Action Detection via Vision-Language Prompting
Sauradip Nag
Xiatian Zhu
Yi-Zhe Song
Tao Xiang
VLM
25
65
0
17 Jul 2022
LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
Sumanth Gurram
An Fang
David M. Chan
John F. Canny
VLM
AI4TS
28
1
0
16 Jul 2022
SVGraph: Learning Semantic Graphs from Instructional Videos
Madeline Chantry Schiappa
Y. S. Rawat
17
4
0
16 Jul 2022
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Jingjia Huang
Yinan Li
Jiashi Feng
Xinglong Wu
Xiaoshuai Sun
Rongrong Ji
VLM
19
48
0
16 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
10
267
0
15 Jul 2022
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
Madeline Chantry Schiappa
Shruti Vyas
Hamid Palangi
Y. S. Rawat
Vibhav Vineet
VLM
120
17
0
05 Jul 2022
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Burak Satar
Hongyuan Zhu
Hanwang Zhang
J. Lim
18
3
0
29 Jun 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Y. S. Rawat
M. Shah
SSL
34
131
0
18 Jun 2022
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Linxi Fan
Guanzhi Wang
Yunfan Jiang
Ajay Mandlekar
Yuncong Yang
Haoyi Zhu
Andrew Tang
De-An Huang
Yuke Zhu
Anima Anandkumar
LM&Ro
42
347
0
17 Jun 2022
OmniMAE: Single Model Masked Pretraining on Images and Videos
Rohit Girdhar
Alaaeldin El-Nouby
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
ViT
29
97
0
16 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
18
81
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
41
525
0
13 Jun 2022
Revisiting the "Video" in Video-Language Understanding
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
27
155
0
03 Jun 2022
Egocentric Video-Language Pretraining
Kevin Qinghong Lin
Alex Jinpeng Wang
Mattia Soldan
Michael Wray
Rui Yan
...
Hongfa Wang
Dima Damen
Bernard Ghanem
Wei Liu
Mike Zheng Shou
VLM
EgoV
31
188
0
03 Jun 2022
Breaking with Fixed Set Pathology Recognition through Report-Guided Contrastive Training
C. Seibold
Simon Reiß
M. Sarfraz
Rainer Stiefelhagen
Jens Kleesiek
16
31
0
14 May 2022
Scene Consistency Representation Learning for Video Scene Segmentation
Haoqian Wu
Keyu Chen
Yanan Luo
Ruizhi Qiao
Bo Ren
Haozhe Liu
Weicheng Xie
Linlin Shen
SSL
31
16
0
11 May 2022
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
32
33
0
10 May 2022
Weakly-supervised segmentation of referring expressions
Robin Strudel
Ivan Laptev
Cordelia Schmid
19
21
0
10 May 2022
Scaling up sign spotting through sign language dictionaries
Gül Varol
Liliane Momeni
Samuel Albanie
Triantafyllos Afouras
Andrew Zisserman
21
14
0
09 May 2022
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
Henghui Zhao
Isma Hadji
Nikita Dvornik
Konstantinos G. Derpanis
Richard P. Wildes
Allan D. Jepson
20
45
0
04 May 2022
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition
Haodong Duan
Nanxuan Zhao
Kai-xiang Chen
Dahua Lin
ViT
AI4TS
31
19
0
04 May 2022
i-Code: An Integrative and Composable Multimodal Learning Framework
Ziyi Yang
Yuwei Fang
Chenguang Zhu
Reid Pryzant
Dongdong Chen
...
Bin Xiao
Yuanxun Lu
Takuya Yoshioka
Michael Zeng
Xuedong Huang
40
45
0
03 May 2022
On Negative Sampling for Audio-Visual Contrastive Learning from Movies
Mahdi M. Kalayeh
Shervin Ardeshir
Lingyi Liu
Nagendra Kamath
Ashok Chandrashekar
SSL
22
3
0
29 Apr 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
46
3,328
0
29 Apr 2022
Relevance-based Margin for Contrastively-trained Video Retrieval Models
Alex Falcon
Swathikiran Sudhakaran
G. Serra
Sergio Escalera
O. Lanz
32
7
0
27 Apr 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
13
43
0
26 Apr 2022
Contrastive Language-Action Pre-training for Temporal Localization
Mengmeng Xu
Erhan Gundogdu
Maksim
Bernard Ghanem
M. Donoser
Loris Bazzani
30
27
0
26 Apr 2022
Frequency Selective Augmentation for Video Representation Learning
Jinhyung Kim
Taeoh Kim
Minho Shim
Dongyoon Han
Dongyoon Wee
Junmo Kim
AI4TS
41
3
0
08 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
18
18
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
33
39
0
06 Apr 2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng
Maria Attarian
Brian Ichter
K. Choromanski
Adrian S. Wong
...
Michael S. Ryoo
Vikas Sindhwani
Johnny Lee
Vincent Vanhoucke
Peter R. Florence
ReLM
LRM
13
571
0
01 Apr 2022
Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
Tian Yun
Usha Bhalla
Ellie Pavlick
Chen Sun
ReLM
CoGe
VLM
LRM
31
23
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
28
94
0
30 Mar 2022