Gemma 2: Improving Open Language Models at a Practical Size

31 July 2024
Gemma Team
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
Surya Bhupatiraju
Léonard Hussenot
Thomas Mesnard
Bobak Shahriari
Alexandre Ramé
Johan Ferret
Peter J. Liu
P. Tafti
Abe Friesen
Michelle Casbon
Sabela Ramos
Ravin Kumar
Charline Le Lan
Sammy Jerome
Anton Tsitsulin
Nino Vieillard
Piotr Stańczyk
Sertan Girgin
Nikola Momchev
Matt Hoffman
S. Thakoor
Jean-Bastien Grill
Behnam Neyshabur
Olivier Bachem
Alanna Walton
Aliaksei Severyn
Alicia Parrish
Aliya Ahmad
Allen Hutchison
Alvin Abdagic
Amanda Carl
Amy Shen
Andy Brock
Andy Coenen
Anthony Laforge
Antonia Paterson
Ben Bastian
Bilal Piot
Boxi Wu
Brandon Royal
Charlie Chen
Chintu Kumar
Chris Perry
Christopher A. Welty
Christopher A. Choquette-Choo
Danila Sinopalnikov
David Weinberger
Dimple Vijaykumar
Dominika Rogozińska
D. Herbison
Elisa Bandy
Emma Wang
Eric Noland
Erica Moreira
Evan Senter
Evgenii Eltyshev
Francesco Visin
Gabriel Rasskin
Gary Wei
Glenn Cameron
Gus Martins
Hadi Hashemi
Hanna Klimczak-Plucińska
Harleen Batra
H. Dhand
Ivan Nardini
Jacinda Mein
Jack Zhou
James Svensson
Jeff Stanway
Jetha Chan
Jin Zhou
Joana Carrasqueira
Joana Iljazi
Jocelyn Becker
Joe Fernandez
Joost R. van Amersfoort
Josh Gordon
Josh Lipschultz
Joshua Newlan
Junsong Ji
Kareem Mohamed
Kartikeya Badola
Kat Black
Katie Millican
Keelin McDonell
Kelvin Nguyen
Kiranbir Sodhia
Kish Greene
Lars Lowe Sjösund
Lauren Usui
Laurent Sifre
L. Heuermann
Leticia Lago
Lilly McNealus
Livio Baldini Soares
Logan Kilpatrick
Lucas Dixon
Luciano Martins
Machel Reid
Manvinder Singh
Mark Iverson
Martin Gorner
Mat Velloso
Mateo Wirth
Matt Davidow
Matt Miller
Matthew Rahtz
Matthew Watson
Meg Risdal
Mehran Kazemi
Michael Moynihan
Ming Zhang
Minsuk Kahng
Minwoo Park
Mofi Rahman
Mohit Khatwani
Natalie Dao
Nenshad Bardoliwalla
Nesh Devanathan
Neta Dumai
Nilay Chauhan
O. Wahltinez
Pankil Botarda
Parker Barnes
P. Barham
Paul Michel
Pengchong Jin
Petko Georgiev
Phil Culliton
Pradeep Kuppala
Ramona Comanescu
Ramona Merhej
Reena Jana
R. Rokni
Rishabh Agarwal
Ryan Mullins
Samaneh Saadat
Sara Mc Carthy
Sarah Perrin
Sébastien Arnold
Sebastian Krause
Shengyang Dai
S. Garg
Shruti Sheth
S. Ronstrom
Susan Chan
Timothy Jordan
Ting-To Yu
Tom Eccles
Tom Hennigan
Tomáš Kočiský
Tulsee Doshi
Vihan Jain
Vikas Yadav
Vilobh Meshram
Vishal Dharmadhikari
Warren Barkley
Wei Wei
Wenming Ye
Woohyun Han
Woosuk Kwon
Xiang Xu
Zhe Shen
Zhitao Gong
Zichuan Wei
Victor Cotruta
Phoebe Kirk
Anand Rao
Minh Giang
Ludovic Peran
T. Warkentin
Eli Collins
Joelle Barral
Zoubin Ghahramani
R. Hadsell
D. Sculley
Jeanine Banks
Anca Dragan
Slav Petrov
Oriol Vinyals
Jeffrey Dean
Demis Hassabis
Koray Kavukcuoglu
Clement Farabet
Elena Buchatskaya
Sebastian Borgeaud
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
Communities: VLM, MoE, OSLM
Abstract

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
