Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Paul Barham
Gregory Thornton
Zhen Yang
Nathan Schucher
Salem Haykal
Nithya Attaluri
Timothy Chung
Petko Georgiev
Ambrose Slone
Siyuan Qiao
Yunhao Tang
Yingjie Miao
Robin Strudel
Sanjay Ganapathy
Rhys May
Arpi Vezer
Sean Sechrist
Justin Mao-Jones
Beer Changpinyo
Anastasia Petrushkina
Rory Blevins
Anirudh Baddepudi
Zafarali Ahmed
Aishwarya Kamath
Tom Hudson
Wojciech Stokowiec
Mina Khan
Elspeth White
Joe Stanton
Carlos Araya
Andy Swing
Ishita Dasgupta
Iain Barr
Alexander Chen
Praveen Kallakuri
Pidong Wang
Nilesh Tripuraneni
James Manyika
Chloe Thornton
Nishesh Gupta
Yicheng Wang
Sandeep Kumar
Daniel Vlasic
Nir Shabat
Guolong Su
Laura Knight
Carl Crous
Solomon Kim
Slav Petrov
Abstract

In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
