arXiv:2103.16553
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
30 March 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
Papers citing "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers"
12 / 12 papers shown
Title
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
60
0
0
21 Feb 2025
Are Diffusion Models Vision-And-Language Reasoners?
Benno Krojer
Elinor Poole-Dayan
Vikram S. Voleti
Christopher Pal
Siva Reddy
16
12
0
25 May 2023
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Dongwon Kim
Nam-Won Kim
Suha Kwak
8
37
0
30 Nov 2022
Cross-Modal Adapter for Text-Video Retrieval
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Jie Zhou
S. Song
Gao Huang
38
35
0
17 Nov 2022
Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval
Abhra Chaudhuri
Massimiliano Mancini
Yanbei Chen
Zeynep Akata
Anjan Dutta
8
5
0
19 Oct 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
10
17
0
01 Aug 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
41
518
0
13 Jun 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
14
148
0
28 Mar 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
20
101
0
16 Jan 2022
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
250
922
0
24 Sep 2019
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Vijay Badrinarayanan
Alex Kendall
R. Cipolla
SSeg
420
15,438
0
02 Nov 2015
A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics
Yunchao Gong
Qifa Ke
Michael Isard
Svetlana Lazebnik
3DV
58
583
0
18 Dec 2012