ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

29 November 2021
Yoad Tewel
Yoav Shalev
Idan Schwartz
Lior Wolf
    VLM

Papers citing "ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic"

27 / 127 papers shown
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
Sophia Gu
Christopher Clark
Aniruddha Kembhavi
VLM
6
23
0
17 Nov 2022
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
Junyan Wang
Yi Zhang
Ming Yan
Ji Zhang
Jitao Sang
VLM
14
8
0
14 Nov 2022
Large-Scale Bidirectional Training for Zero-Shot Image Captioning
Taehoon Kim
Mark A Marsden
Pyunghwan Ahn
Sangyun Kim
Sihaeng Lee
Alessandra Sala
S. Kim
VLM
19
4
0
13 Nov 2022
Partial Visual-Semantic Embedding: Fashion Intelligence System with Sensitive Part-by-Part Learning
Ryotaro Shimizu
Takuma Nakamura
M. Goto
3DH
12
0
0
12 Nov 2022
Text-Only Training for Image Captioning using Noise-Injected CLIP
David Nukrai
Ron Mokady
Amir Globerson
VLM
CLIP
38
69
0
01 Nov 2022
Describing Sets of Images with Textual-PCA
Oded Hupert
Idan Schwartz
Lior Wolf
CoGe
15
1
0
21 Oct 2022
Composing Ensembles of Pre-trained Models via Iterative Consensus
Shuang Li
Yilun Du
J. Tenenbaum
Antonio Torralba
Igor Mordatch
MoMe
11
23
0
20 Oct 2022
Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
Mohammad Reza Taesiri
Finlay Macklon
Yihe Wang
Hengshuo Shen
C. Bezemer
ELM
LLMAG
MLLM
15
13
0
05 Oct 2022
Zero-Shot Video Captioning with Evolving Pseudo-Tokens
Yoad Tewel
Yoav Shalev
Roy Nadler
Idan Schwartz
Lior Wolf
22
25
0
22 Jul 2022
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Tal Shaharabany
Yoad Tewel
Lior Wolf
ObjD
20
15
0
19 Jun 2022
Language Models are General-Purpose Interfaces
Y. Hao
Haoyu Song
Li Dong
Shaohan Huang
Zewen Chi
Wenhui Wang
Shuming Ma
Furu Wei
MLLM
11
95
0
13 Jun 2022
Multimodal Knowledge Alignment with Reinforcement Learning
Youngjae Yu
Jiwan Chung
Heeseung Yun
Jack Hessel
J. Park
...
Prithviraj Ammanabrolu
Rowan Zellers
Ronan Le Bras
Gunhee Kim
Yejin Choi
VLM
109
35
0
25 May 2022
Toxicity Detection with Generative Prompt-based Inference
Yau-Shian Wang
Y. Chang
64
34
0
24 May 2022
Language Models Can See: Plugging Visual Controls in Text Generation
Yixuan Su
Tian Lan
Yahui Liu
Fangyu Liu
Dani Yogatama
Yan Wang
Lingpeng Kong
Nigel Collier
VLM
MLLM
40
91
0
05 May 2022
XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
Chan-Jan Hsu
Hung-yi Lee
Yu Tsao
VLM
24
3
0
15 Apr 2022
No Token Left Behind: Explainability-Aided Image Classification and Generation
Roni Paiss
Hila Chefer
Lior Wolf
VLM
6
29
0
11 Apr 2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng
Maria Attarian
Brian Ichter
K. Choromanski
Adrian S. Wong
...
Michael S. Ryoo
Vikas Sindhwani
Johnny Lee
Vincent Vanhoucke
Peter R. Florence
ReLM
LRM
8
567
0
01 Apr 2022
Image-Based CLIP-Guided Essence Transfer
Hila Chefer
Sagie Benaim
Roni Paiss
Lior Wolf
CLIP
6
50
0
24 Oct 2021
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen
Liunian Harold Li
Hao Tan
Mohit Bansal
Anna Rohrbach
Kai-Wei Chang
Z. Yao
Kurt Keutzer
CLIP
VLM
MLLM
174
342
0
13 Jul 2021
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
253
4,735
0
24 Feb 2021
Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network
Jiayi Ji
Yunpeng Luo
Xiaoshuai Sun
Fuhai Chen
Gen Luo
Yongjian Wu
Yue Gao
Rongrong Ji
ViT
35
163
0
13 Dec 2020
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
Itai Gat
Idan Schwartz
A. Schwing
Tamir Hazan
39
88
0
21 Oct 2020
Normalized and Geometry-Aware Self-Attention Network for Image Captioning
Longteng Guo
Jing Liu
Xinxin Zhu
Peng Yao
Shichen Lu
Hanqing Lu
ViT
108
188
0
19 Mar 2020
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
273
1,561
0
18 Sep 2019
Neural Baby Talk
Jiasen Lu
Jianwei Yang
Dhruv Batra
Devi Parikh
VLM
189
432
0
27 Mar 2018
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
Jiasen Lu
Caiming Xiong
Devi Parikh
R. Socher
81
443
0
06 Dec 2016
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
228
29,632
0
16 Jan 2013