ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2102.06183
  4. Cited By
Less is More: ClipBERT for Video-and-Language Learning via Sparse
  Sampling

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

11 February 2021
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
    CLIP
ArXivPDFHTML

Papers citing "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling"

50 / 80 papers shown
Title
Empowering Agentic Video Analytics Systems with Video Language Models
Empowering Agentic Video Analytics Systems with Video Language Models
Yuxuan Yan
Shiqi Jiang
Ting Cao
Y. Yang
Qianqian Yang
Yuanchao Shu
Y. Yang
Lili Qiu
VLM
67
0
0
01 May 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
34
0
0
04 Apr 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
62
0
0
16 Mar 2025
Learning to Reason Iteratively and Parallelly for Complex Visual
  Reasoning Scenarios
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
63
2
0
20 Nov 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
87
29
0
26 Sep 2024
Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data
Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data
Lukas Heine
Fabian Horst
Jana Fragemann
Gijs Luijten
M. Balzer
Jan Egger
F. Bahnsen
M. Sarfraz
Jens Kleesiek
15
0
0
25 Sep 2024
Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment
Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment
Jin Chen
Kaijing Ma
Haojian Huang
Jiayu Shen
Han Fang
Xianghao Zang
Chao Ban
76
2
0
17 Sep 2024
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen
Tianxiang Hao
Tao He
Sicheng Zhao
Pengzhang Liu
Yongjun Bao
Guiguang Ding
Guiguang Ding
43
6
0
02 Sep 2024
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
Tz-Ying Wu
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
EgoV
51
0
0
28 Jul 2024
Causal Understanding For Video Question Answering
Causal Understanding For Video Question Answering
Bhanu Prakash Reddy Guda
Tanmay Kulkarni
Adithya Sampath
Swarnashree Mysore Sathyendra
CML
31
0
0
23 Jul 2024
End-to-End Video Question Answering with Frame Scoring Mechanisms and
  Adaptive Sampling
End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
Jianxin Liang
Xiaojun Meng
Yueqian Wang
Chang Liu
Qun Liu
Dongyan Zhao
24
5
0
21 Jul 2024
AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video
  Grounding
AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding
Xing Zhang
Jiaxi Gu
Haoyu Zhao
Shicong Wang
Hang Xu
Renjing Pei
Songcen Xu
Zuxuan Wu
Yu-Gang Jiang
31
0
0
11 Jun 2024
Multi-Modal Generative Embedding Model
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
26
3
0
29 May 2024
Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification
Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification
Weizhen He
Yiheng Deng
Yunfeng Yan
Feng Zhu
Yizhou Wang
Lei Bai
Qingsong Xie
Donglian Qi
Wanli Ouyang
Shixiang Tang
90
2
0
28 May 2024
Text-Video Retrieval with Global-Local Semantic Consistent Learning
Text-Video Retrieval with Global-Local Semantic Consistent Learning
Haonan Zhang
Pengpeng Zeng
Lianli Gao
Jingkuan Song
Yihang Duan
Xinyu Lyu
Hengtao Shen
VLM
CLIP
23
2
0
21 May 2024
STAR: A Benchmark for Situated Reasoning in Real-World Videos
STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu
Shoubin Yu
Zhenfang Chen
Joshua B Tenenbaum
Chuang Gan
31
176
0
15 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual
  Question Answering
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
27
1
0
13 May 2024
VideoDistill: Language-aware Vision Distillation for Video Question
  Answering
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
31
1
0
01 Apr 2024
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
Yifei Huang
Guo Chen
Jilan Xu
Mingfang Zhang
Lijin Yang
...
Hongjie Zhang
Lu Dong
Yali Wang
Limin Wang
Yu Qiao
EgoV
49
32
0
24 Mar 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
58
4
0
08 Feb 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
39
35
0
16 Jan 2024
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan
Md. Mohaiminul Islam
Thomas Seidl
Gedas Bertasius
22
3
0
11 Dec 2023
FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding
FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding
Thanh-Dat Truong
Utsav Prabhu
Bhiksha Raj
Jackson Cothren
Khoa Luu
CLL
26
3
0
27 Nov 2023
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video
  Retrieval
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
Konstantin Yakovlev
Gregory Polyakov
I. Alimova
Alexander Podolskiy
A. Bout
Sergey I. Nikolenko
Irina Piontkovskaya
CLIP
14
1
0
14 Nov 2023
Latent Wander: an Alternative Interface for Interactive and
  Serendipitous Discovery of Large AV Archives
Latent Wander: an Alternative Interface for Interactive and Serendipitous Discovery of Large AV Archives
Yuchen Yang
Linyida Zhang
8
2
0
09 Oct 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Bipin Rajendran
Bashir M. Al-Hashimi
MLLM
VLM
26
2
0
27 Sep 2023
Representation Learning for Sequential Volumetric Design Tasks
Representation Learning for Sequential Volumetric Design Tasks
Md Ferdous Alam
Yi Wang
Linh Tran
Chin-Yi Cheng
Jieliang Luo
3DV
8
2
0
05 Sep 2023
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
Weizhen He
Yihe Deng
Shixiang Tang
Qihao Chen
Qingsong Xie
...
Feng Zhu
Rui Zhao
Wanli Ouyang
Donglian Qi
Yunfeng Yan
62
18
0
13 Jun 2023
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set
  Alignment
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Peng Jin
Hao Li
Ze-Long Cheng
Jinfa Huang
Zhennan Wang
Li-ming Yuan
Chang-rui Liu
Jie Chen
15
31
0
20 May 2023
Self-Chained Image-Language Model for Video Localization and Question
  Answering
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
22
129
0
11 May 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and
  Language
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP
VLM
16
1
0
10 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Procedure-Aware Pretraining for Instructional Video Understanding
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
17
38
0
31 Mar 2023
Learning video embedding space with Natural Language Supervision
Learning video embedding space with Natural Language Supervision
P. Uppala
Abhishek Bamotra
S. Priya
Vaidehi Joshi
CLIP
10
1
0
25 Mar 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for
  Cross-Modal Representation Learning
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Peng Jin
Jinfa Huang
Pengfei Xiong
Shangxuan Tian
Chang-rui Liu
Xiang Ji
Li-ming Yuan
Jie Chen
23
48
0
25 Mar 2023
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Peng Jin
Hao Li
Ze-Long Cheng
Kehan Li
Xiang Ji
Chang-rui Liu
Li-ming Yuan
Jie Chen
DiffM
VGen
13
52
0
17 Mar 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
14
10
0
12 Mar 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for
  Video-Language Pre-training
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
17
8
0
20 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text
  Retrieval
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
Yizhen Chen
Jie Wang
Lijian Lin
Zhongang Qi
Jin Ma
Ying Shan
VLM
11
18
0
30 Jan 2023
UATVR: Uncertainty-Adaptive Text-Video Retrieval
UATVR: Uncertainty-Adaptive Text-Video Retrieval
Bo Fang
Wenhao Wu
Chang-rui Liu
Yu Zhou
Yuxin Song
Weiping Wang
Min Yang
Xiang Ji
Jingdong Wang
17
45
0
16 Jan 2023
See, Think, Confirm: Interactive Prompting Between Vision and Language
  Models for Knowledge-based Visual Reasoning
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
Zhenfang Chen
Qinhong Zhou
Yikang Shen
Yining Hong
Hao Zhang
Chuang Gan
LRM
VLM
26
35
0
12 Jan 2023
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language
  Pre-training
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
26
15
0
21 Nov 2022
Cross-Modal Adapter for Text-Video Retrieval
Cross-Modal Adapter for Text-Video Retrieval
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Jie Zhou
S. Song
Gao Huang
38
35
0
17 Nov 2022
Masked Vision-Language Transformer in Fashion
Masked Vision-Language Transformer in Fashion
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Christos Sakaridis
Luc Van Gool
12
25
0
27 Oct 2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal
  Modeling
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
15
18
0
21 Oct 2022
A New Path: Scaling Vision-and-Language Navigation with Synthetic
  Instructions and Imitation Learning
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath
Peter Anderson
Su Wang
Jing Yu Koh
Alexander Ku
Austin Waters
Yinfei Yang
Jason Baldridge
Zarana Parekh
LM&Ro
15
45
0
06 Oct 2022
TVLT: Textless Vision-Language Transformer
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
35
28
0
28 Sep 2022
MuMUR : Multilingual Multimodal Universal Retrieval
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
22
3
0
24 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
8
17
0
01 Aug 2022
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Jingjia Huang
Yinan Li
Jiashi Feng
Xinglong Wu
Xiaoshuai Sun
Rongrong Ji
VLM
19
46
0
16 Jul 2022
Self-Supervised Learning for Videos: A Survey
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Y. S. Rawat
M. Shah
SSL
11
130
0
18 Jun 2022
12
Next