2306.07915
Cited By
v1
v2
v3
v4
v5 (latest)
Image Captioners Are Scalable Vision Learners Too
Neural Information Processing Systems (NeurIPS), 2023
13 June 2023
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
Papers citing
"Image Captioners Are Scalable Vision Learners Too"
34 / 34 papers shown
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Wei-Cheng Tseng
Xuanru Zhou
Mingyue Huo
Yiwen Shao
Hao Zhang
Dong Yu
CLIP
AI4TS
VLM
96
0
0
20 Nov 2025
AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages
Mardiyyah Oduwole
Prince Mireku
Fatimo Adebanjo
Oluwatosin Olajide
Mahi Aminu Aliyu
Jekaterina Novikova
73
0
0
20 Oct 2025
Comprehensive language-image pre-training for 3D medical image understanding
Tassilo Wald
Ibrahim Ethem Hamamci
Yuan Gao
Sam Bond-Taylor
H. Sharma
...
Klaus H. Maier-Hein
Panagiotis Korfiatis
Valentina Salvatelli
Javier Alvarez-Valle
Fernando Pérez-García
MedIm
VLM
104
0
0
16 Oct 2025
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Tiancheng Gu
Kaicheng Yang
Kaichen Zhang
Xiang An
Ziyong Feng
Y. Zhang
Weidong Cai
Jiankang Deng
Lidong Bing
149
4
0
15 Oct 2025
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng
Guangyi Chen
Tianpei Gu
Lingjing Kong
Yan Li
Zeyu Tang
Kun Zhang
120
1
0
12 Oct 2025
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Yanqing Liu
Xianhang Li
Letian Zhang
Zirui Wang
Zeyu Zheng
Yuyin Zhou
Cihang Xie
VLM
161
2
0
01 Sep 2025
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Junha Song
Yongsik Jo
So Yeon Min
Quanting Xie
Taehwan Kim
Yonatan Bisk
Jaegul Choo
VLM
132
0
0
29 Aug 2025
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang
Zhengyuan Yang
Chao Feng
Yongyuan Liang
Yuhang Zhou
...
Chung-Ching Lin
Kevin Lin
Linjie Li
Furong Huang
Lijuan Wang
OffRL
LRM
262
7
0
11 Jun 2025
SensorLM: Learning the Language of Wearable Sensors
Yuwei Zhang
Kumar Ayush
Siyuan Qiao
A. Heydari
Girish Narayanswamy
...
Shwetak N. Patel
Cecilia Mascolo
Xin Liu
Daniel J. McDuff
Yuzhe Yang
355
12
0
10 Jun 2025
A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
Vishaal Udandarao
Mehdi Cherti
Shyamgopal Karthik
J. Jitsev
Samuel Albanie
Matthias Bethge
CoGe
152
1
0
09 Jun 2025
Progressive Scaling Visual Object Tracking
Jack Hong
Shilin Yan
Zehao Xiao
Jiayin Cai
Xiaolong Jiang
Yao Hu
Henghui Ding
251
1
0
26 May 2025
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Tiancheng Gu
Kaicheng Yang
Ziyong Feng
Xingjun Wang
Yanzhao Zhang
Dingkun Long
Yingda Chen
Weidong Cai
Jiankang Deng
VLM
817
34
0
24 Apr 2025
Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
Vaishnavh Nagarajan
Chen Henry Wu
Charles Ding
Aditi Raghunathan
521
12
0
21 Apr 2025
Can Masked Autoencoders Also Listen to Birds?
Lukas Rauch
Ilyass Moummad
René Heinrich
Alexis Joly
Bernhard Sick
Christoph Scholz
425
8
0
17 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
568
90
0
17 Apr 2025
A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Zhaoyi Li
Xiaohan Zhao
Dong-Dong Wu
Jiacheng Cui
Zhiqiang Shen
AAML
VLM
421
8
0
13 Mar 2025
MASS: Overcoming Language Bias in Image-Text Matching
AAAI Conference on Artificial Intelligence (AAAI), 2025
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
177
0
0
20 Jan 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Computer Vision and Pattern Recognition (CVPR), 2023
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Celine Lee
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzadeh
VLM
240
38
0
31 Dec 2024
Bringing Multimodality to Amazon Visual Search System
Knowledge Discovery and Data Mining (KDD), 2024
Xinliang Zhu
Michael Huang
Han Ding
Jinyu Yang
Kelvin Chen
...
Son Dinh Tran
Benjamin Z. Yao
Doug Gray
Anuj Bindal
Arnab Dhua
229
7
0
17 Dec 2024
Classification Done Right for Vision-Language Pre-Training
Neural Information Processing Systems (NeurIPS), 2024
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP
VLM
355
6
0
05 Nov 2024
$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
Apoorv Khandelwal
Tian Yun
Nihal V. Nayak
Jack Merullo
Stephen H. Bach
Chen Sun
Ellie Pavlick
VLM
AI4CE
OnRL
230
5
0
30 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Neural Information Processing Systems (NeurIPS), 2024
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
544
59
0
18 Oct 2024
Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
501
11
0
14 Oct 2024
Generative Semantic Communication via Textual Prompts: Latency Performance Tradeoffs
IEEE Transactions on Vehicular Technology (IEEE Trans. Veh. Technol.), 2024
Mengmeng Ren
Li Qiao
Long Yang
Zhen Gao
Jian Chen
Mahdi Boloursaz Mashhadi
Pei Xiao
Rahim Tafazolli
Mehdi Bennis
VLM
313
11
0
15 Sep 2024
Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
H. Kerdegari
Kyle Higgins
Dennis Veselkov
I. Laponogov
I. Poļaka
...
Junior Andrea Pescino
M. Leja
M. Dinis-Ribeiro
T. F. Kanonnikoff
Kirill Veselkov
356
5
0
26 Jun 2024
A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Thomas Stegmüller
Tim Lebailly
Nikola Dukic
Behzad Bozorgtabar
Tinne Tuytelaars
Jean-Philippe Thiran
VLM
371
3
0
23 Jun 2024
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
Neural Information Processing Systems (NeurIPS), 2024
Imanol Miranda
Ander Salaberria
Eneko Agirre
Gorka Azkune
CoGe
209
5
0
14 Jun 2024
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Irene Huang
Wei Lin
M. Jehanzeb Mirza
Jacob A. Hansen
Sivan Doveh
...
Trevor Darrell
Chuang Gan
Aude Oliva
Rogerio Feris
Leonid Karlinsky
CoGe
LRM
157
16
0
12 Jun 2024
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev
Alexander Kunitsyn
Andrei Ivaniuta
VLM
MLLM
166
15
0
02 Apr 2024
The pitfalls of next-token prediction
International Conference on Machine Learning (ICML), 2024
Gregor Bachmann
Vaishnavh Nagarajan
381
128
0
11 Mar 2024
Cacophony: An Improved Contrastive Audio-Text Model
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2024
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
248
21
0
10 Feb 2024
Looking at words and points with attention: a benchmark for text-to-shape coherence
Andrea Amaduzzi
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
116
3
0
14 Sep 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
384
150
0
25 Jul 2023
Vision Learners Meet Web Image-Text Pairs
Bingchen Zhao
Quan Cui
Hao Wu
Osamu Yoshie
Cheng Yang
Oisin Mac Aodha
VLM
173
6
0
17 Jan 2023