arXiv:2301.03988

SantaCoder: don't reach for the stars!

9 January 2023
Loubna Ben Allal
Raymond Li
Denis Kocetkov
Chenghao Mou
Christopher Akiki
Carlos Muñoz Ferrandis
Niklas Muennighoff
Mayank Mishra
A. Gu
Manan Dey
Logesh Kumar Umapathi
Carolyn Jane Anderson
Yangtian Zi
J. Lamy-Poirier
Hailey Schoelkopf
S. Troshin
Dmitry Abulkhanov
Manuel Romero
M. Lappert
F. Toni
Bernardo García del Río
Qian Liu
Shamik Bose
Urvashi Bhattacharyya
Terry Yue Zhuo
I. Yu
Paulo Villegas
Marco Zocca
Sourab Mangrulkar
D. Lansky
Huu Nguyen
Danish Contractor
Luisa Villa
Jia Li
Dzmitry Bahdanau
Yacine Jernite
Sean M. Hughes
Daniel Fried
Arjun Guha
H. D. Vries
Leandro von Werra
Abstract

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.
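To make the near-duplicate filtering idea concrete, here is a minimal sketch. The abstract does not say how near-duplicates are detected, so the technique below (exact pairwise Jaccard similarity over token shingles) is an illustrative stand-in and not the report's pipeline. The function names, the shingle size `k`, and the `threshold` value are all hypothetical.

```python
import re


def shingles(text, k=5):
    """Return the set of k-token shingles of a source file (assumed tokenization: \\w+)."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        # Very short files become a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(files, threshold=0.85, k=5):
    """Keep a file only if it is below `threshold` similarity to every file kept so far.

    A lower threshold is a more aggressive filter. This is O(n^2) and only
    suitable for small illustrations; the threshold is a hypothetical value.
    """
    kept, kept_shingles = [], []
    for text in files:
        s = shingles(text, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    a = "def add(a, b): return a + b # sum of two numbers here"
    b = "def add(a, b): return a + b # sum of two numbers now"
    c = "class Stack: pass # an unrelated file with different tokens entirely"
    print(len(filter_near_duplicates([a, b, c], threshold=0.7)))
```

At corpus scale a quadratic comparison like this is impractical, and approximate methods such as MinHash with locality-sensitive hashing are the usual substitute.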
