PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer
Andreas Steiner
André Susano Pinto
Alexander Kolesnikov
Xiao Wang
Daniel M. Salz
Maxim Neumann
Ibrahim M. Alabdulmohsin
Michael Tschannen
Emanuele Bugliarello
Thomas Unterthiner
Daniel Keysers
Skanda Koppula
Fangyu Liu
Adam Grycner
Alexey Gritsenko
Neil Houlsby
Manoj Kumar
Keran Rong
Julian Martin Eisenschlos
Rishabh Kabra
Matthias Bauer
Matko Bošnjak
Xi Chen
Matthias Minderer
Paul Voigtlaender
Ioana Bica
Ivana Balazevic
Joan Puigcerver
Pinelopi Papalampidi
Olivier J. Hénaff
Xi Xiong
Radu Soricut
Jeremiah Harmsen
Xiaohua Zhai

Abstract
PaliGemma is an open Vision-Language Model (VLM) based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that transfers effectively. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.
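The abstract describes PaliGemma's composition: a SigLIP-So400m vision encoder feeding image tokens into the Gemma-2B language model. A minimal sketch of this common "image tokens as prefix" wiring is shown below; all dimensions, the stand-in encoder, and the projection weights are illustrative placeholders, not the real model's configuration.

```python
import numpy as np

# All sizes below are hypothetical, chosen small for illustration;
# the real SigLIP-So400m and Gemma-2B widths and token counts differ.
rng = np.random.default_rng(0)

NUM_IMAGE_TOKENS = 16   # number of patch embeddings from the vision encoder
VISION_DIM = 32         # vision encoder output width
MODEL_DIM = 64          # language-model embedding width
SEQ_TEXT = 8            # length of the text prompt

def vision_encoder(image):
    """Stand-in for SigLIP-So400m: maps an image to patch embeddings."""
    return rng.standard_normal((NUM_IMAGE_TOKENS, VISION_DIM))

def project(tokens, w):
    """Linear projection from the vision width to the LM embedding width."""
    return tokens @ w

image = np.zeros((224, 224, 3))                       # dummy input image
w_proj = rng.standard_normal((VISION_DIM, MODEL_DIM))  # placeholder weights

image_tokens = project(vision_encoder(image), w_proj)
text_tokens = rng.standard_normal((SEQ_TEXT, MODEL_DIM))  # dummy text embeddings

# The multimodal input is the image-token prefix followed by the text
# tokens; a decoder like Gemma would then process this sequence.
multimodal_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(multimodal_input.shape)  # (24, 64)
```

The key design choice sketched here is that the vision encoder's outputs are projected into the language model's embedding space and simply prepended to the text, so the decoder treats image content as ordinary sequence context.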