CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

Nadezhda Chirkova, Sergey Troshin
1 August 2023

Papers citing "CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code"

14 papers
AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation
Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, David Lo
25 Apr 2024

Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan, Gabriele Synnaeve, Baptiste Rozière
01 Feb 2024

FREED++: Improving RL Agents for Fragment-Based Molecule Generation by Thorough Reproduction
Alexander Telepov, Artem Tsypin, Kuzma Khrabrov, Sergey Yakukhnov, Pavel Strashnov, ..., Egor Rumiantsev, Daniel Ezhov, Manvel Avetisian, Olga Popova, Artur Kadurin
18 Jan 2024

Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey
Xinyu She, Yue Liu, Yanjie Zhao, Yiling He, Li Li, C. Tantithamthavorn, Zhan Qin, Haoyu Wang
27 Oct 2023

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara
17 Oct 2023

Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, ..., Malte Ostendorff, Samuel Weinbach, R. Sifa, Stefan Kesselheim, Nicolas Flores-Herr
12 Oct 2023

Efficient Inference for Multilingual Neural Machine Translation
Alexandre Berard, Dain Lee, S. Clinchant, K. Jung, Vassilina Nikoulina
14 Sep 2021

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Yue Wang, Weishi Wang, Shafiq R. Joty, S. Hoi
02 Sep 2021

DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample
15 Feb 2021

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, ..., Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu
09 Feb 2021

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
31 Dec 2020

Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, Jason Riesa
24 Oct 2020

A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code
Nadezhda Chirkova, Sergey Troshin
23 Oct 2020

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, M. Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, ..., Alex Rudnick, Oriol Vinyals, G. Corrado, Macduff Hughes, J. Dean
26 Sep 2016