Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Ross McIlroy
Henryk Michalewski
Zaheer Abbas
Richard Ives
Pranav Shyam
Evgenii Eltyshev
Siddhartha Brahma
Zhenkai Zhu
James Lottes
Sebastian Riedel
Libin Bai
Taylor Tobin
Marvin Ritter
Danilo Martins
Junhyuk Oh
Ken Franko
Thanumalayan Sankaranarayana Pillai
Hannah Sheahan
Anna Bortsova
Kevin Swersky
Alban Rrustemi
James Qin
Arnar Mar Hrafnkelsson
Ian Mackinnon
Ravi Addanki
Jaehoon Lee
Carrie Grimes Bostock
Hyeontaek Lim
Carlos Araya
Ankur Bapna
Stephanie Winkler
Lexi Walker
Richard Tanburn
Johnson Jia
Chu-Cheng Lin
Nan Hua
Kiam Choo
Jennifer Beattie
Jeff Seibert
Mohamed Elhawaty
Andrey Khorlin
Yannie Liang
Carl Crous
Dessie Petrova
Denese Owusu-Afriyie
Vinod Koverkathu
Slav Petrov
Main: 34 pages · 30 figures · 44 tables · Appendix: 120 pages
Abstract

In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
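The ">99%" recall figure above comes from "needle in a haystack" style evaluations: a short secret fact is buried at varying depths inside a very long filler context, and the model is asked to retrieve it. A minimal sketch of such a harness is below; the helper names and parameters are illustrative assumptions, not the paper's actual evaluation code, and the model call itself is left out.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Repeat filler text and bury the needle at a relative depth in [0, 1]."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def recall_score(answers: list[str], expected: str) -> float:
    """Fraction of probes whose answer contains the expected secret."""
    hits = sum(1 for a in answers if expected in a)
    return hits / len(answers)

# Example: bury one fact halfway into ~1000 filler sentences. A full harness
# would sweep both insertion depth and total context length (up to 10M tokens
# in the paper), prompting the model with each haystack and scoring recall.
needle = "The magic number is 4913."
haystack = build_haystack("The grass is green.", needle, n_sentences=1000, depth=0.5)
assert needle in haystack
```

Sweeping depth and length together produces the recall-versus-context-length grids commonly reported for long-context models.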
