
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gabriela Surita
Natalie Clay
Justin Gilmer
Rui Zhu
Mahdis Mahdieh
Sharad Vikram
Fan Yang
Kiran Vodrahalli
Chulayuth Asawaroengchai
Livio Baldini Soares
Laurent El Shafey
Santiago Ontanon
Steven Hansen
Mia Chen
Xerxes Dotiwalla
Yuan Liu
Ivo Danihelka
Lora Aroyo
Rishabh Joshi
Minh Giang
Chung-Cheng Chiu
Wojciech Stokowiec
Emanuel Taropa
Daniel Toyama
Dan Hurt
Xiaowei Li
Dan Horgan
Anudhyan Boral
Dian Yu
Sumit Bagri
Jason Riesa
Charlie Chen
Ruibo Liu
Arun Ahuja
Stephen Spencer
Yiran Mao
Xinyi Wu
Seb Arnold
Mohammad Saleh
Siyuan Qiao
Henryk Michalewski
Jean-Baptiste Alayrac
Alexander Pritzel
Michael Isard
Michael B. Chang
Vahab Mirrokni
Evan Palmer
Pei Sun
Hardie Cate
Lei Zhang
Hui Li
Wojciech Fica
Daniel Balle
Diane Wu
Andrey Khorlin
Adnan Ozturel
Han Lu
Ming Zhang
Sara McCarthy
Misha Khalman
Xinjian Li
Pedro Valenzuela
Yuchung Cheng
Christos Kaplanis
Kazuki Osawa
Shubin Zhao
Inderjit Dhillon
Shubham Agrawal
Kevin Ramirez
Abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; and (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state of the art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks, achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier: when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
