
T5Gemma 2: Seeing, Reading, and Understanding Longer

16 December 2025

Biao Zhang

Paul Suganthan

Gaël Liu

Ilya Philippov

Sahil Dua

Ben Hora

Kat Black

Gus Martins

Omar Sanseviero

Shreya Pathak

Cassidy Hardin

Francesco Visin

Jiageng Zhang

Kathleen Kenealy

Qin Yin

Xiaodan Song

Olivier Lacombe

Armand Joulin

Tris Warkentin

Adam Roberts



Main: 8 pages

1 figure

Bibliography: 5 pages

6 tables

Abstract

We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal, and long-context capabilities. T5Gemma 2 follows the T5Gemma adaptation recipe (via UL2), which adapts a pretrained decoder-only model into an encoder-decoder model, and extends it from the text-only regime to the multimodal one, building on the Gemma 3 models. We further propose two methods to improve efficiency: tied word embeddings, which share all embeddings across the encoder and decoder, and merged attention, which unifies decoder self-attention and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy across architectures and modalities, as well as the unique strength of the encoder-decoder architecture for long-context modeling. As with T5Gemma, T5Gemma 2 achieves pretraining performance comparable to or better than that of its Gemma 3 counterpart, and significantly better post-training performance. We release the pretrained models (270M-270M, 1B-1B, and 4B-4B) to the community for future research.
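To make the idea of merged attention concrete, the sketch below shows one plausible reading of "a single joint module": each decoder position attends, in one softmax, over the concatenation of all encoder states and the causal prefix of decoder states. This is an illustrative assumption, not the paper's implementation. The function name `merged_attention` is hypothetical, and the sketch uses a single head with identity query/key/value projections to stay short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merged_attention(enc_states, dec_states):
    """One attention pass over [encoder states ; decoder states].

    Assumptions (not taken from the paper): single head, identity
    Q/K/V projections, scaled dot-product scores.
    Decoder position i sees every encoder position (the cross-attention
    part) and decoder positions 0..i (the causal self-attention part).
    """
    d = len(dec_states[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for i, q in enumerate(dec_states):
        # Joint key/value set: all encoder states plus the causal decoder prefix.
        visible = enc_states + dec_states[: i + 1]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in visible]
        w = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(wj * v[t] for wj, v in zip(w, visible)) for t in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs: 2 encoder states and 3 decoder states of width 2.
enc = [[1.0, 0.0], [0.0, 1.0]]
dec = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
out, w = merged_attention(enc, dec)
```

In this reading, one set of attention weights replaces the two separate softmaxes of a standard decoder layer, so the number of visible keys grows from `len(enc) + 1` at the first decoder position to `len(enc) + len(dec)` at the last.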
