NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Laya Sleiman
Leon Derczynski
Luis Vega
Maer Rodrigues de Melo
Makesh Narsimhan Sreedhar
Marcin Chochowski
Mark Cai
Markus Kliegl
Marta Stepniewska-Dziubinska
Matvei Novikov
Mehrzad Samadi
Meredith Price
Meriem Boubdir
Michael Boone
Michael Evans
Michal Bien
Michal Zawalski
Miguel Martinez
Mike Chrzanowski
Mohammad Shoeybi
Mostofa Patwary
Namit Dhameja
Nave Assaf
Negar Habibi
Nidhi Bhatia
Nikki Pope
Nima Tajbakhsh
Nirmal Kumar Juluru
Oleg Rybakov
Oleksii Hrinchuk
Oleksii Kuchaiev
Oluwatobi Olabiyi
Pablo Ribalta
Padmavathy Subramanian
Parth Chadha
Pavlo Molchanov
Peter Dykas
Peter Jin
Piotr Bialecki
Piotr Januszewski
Pradeep Thalasta
Prashant Gaikwad
Prasoon Varshney
Pritam Gundecha
Przemek Tredak
Rabeeh Karimi Mahabadi
Rajen Patel
Ran El-Yaniv
Ranjit Rajan
Ria Cheruvu
Rima Shahbazyan
Ritika Borkar
Ritu Gala
Roger Waleffe
Ruoxi Zhang
Russell J. Hewett
Ryan Prenger
Sahil Jain
Samuel Kriman
Sanjeev Satheesh
Saori Kaji
Sarah Yurick
Saurav Muralidharan
Sean Narenthiran
Seonmyeong Bak
Sepehr Sameni
Seungju Han
Shanmugam Ramasamy
Shaona Ghosh
Sharath Turuvekere Sreenivas
Shelby Thomas
Shizhe Diao
Shreya Gopal
Shrimai Prabhumoye
Shubham Toshniwal
Shuoyang Ding
Siddharth Singh
Siddhartha Jain
Somshubra Majumdar
Stefania Alborghetti
Syeda Nahida Akter
Terry Kong
Tim Moon
Tomasz Hliwiak
Tomer Asida
Tony Wang
Twinkle Vashishth
Tyler Poon
Udi Karpas
Vahid Noroozi
Venkat Srinivasan
Vijay Korthikanti
Vikram Fugro
Vineeth Kalluru
Vitaly Kurin
Vitaly Lavrukhin
Wasi Uddin Ahmad
Wei Du
Wonmin Byeon
Ximing Lu
Xin Dong
Yashaswi Karnati
Yejin Choi
Yian Zhang
Ying Lin
Yonggan Fu
Yoshi Suhara
Zhen Dong
Zhiyu Li
Zhongbo Zhu
Zijia Chen
et al. (111 additional authors not shown)
Main: 27 pages, 4 figures, 9 tables; Bibliography: 2 pages; Appendix: 14 pages
Abstract

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while delivering up to 6x higher inference throughput in reasoning settings such as 8k input and 16k output tokens. We are releasing the Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
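To make the single-GPU constraint concrete, the sketch below redoes the back-of-the-envelope arithmetic implied by the abstract: a 12-billion-parameter model in bfloat16 does not even fit its weights within a 22 GiB budget, while the compressed 9-billion-parameter model leaves roughly 5 GiB of headroom for the attention KV cache and Mamba state at 128k context. The parameter counts are nominal, and embeddings, activations, and runtime buffers are ignored, so this is an illustrative sketch rather than a measured memory profile.

```python
# Rough memory-budget sketch for single-A10G inference (22 GiB, bfloat16),
# using only figures quoted in the abstract. Nominal parameter counts;
# activation buffers, Mamba state, and framework overhead are not modeled.

GIB = 1024 ** 3
BYTES_PER_PARAM = 2      # bfloat16
GPU_BUDGET_GIB = 22      # A10G memory budget cited in the abstract

def weight_gib(num_params: float) -> float:
    """GiB needed just to hold the model weights in bfloat16."""
    return num_params * BYTES_PER_PARAM / GIB

for name, params in [("Nemotron-Nano-12B-v2-Base", 12e9),
                     ("Nemotron-Nano-9B-v2", 9e9)]:
    w = weight_gib(params)
    headroom = GPU_BUDGET_GIB - w
    print(f"{name}: weights ~ {w:.1f} GiB, "
          f"headroom ~ {headroom:.1f} GiB for KV cache / Mamba state")
```

Running this prints roughly 22.4 GiB of weights for the 12B base model (over budget before any cache is allocated) versus about 16.8 GiB for the 9B model, which is consistent with the paper's stated motivation for compressing and distilling the 12B model down to 9B to support 128k-token inference on a single A10G.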
