CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

Nadezhda Chirkova, Sergey Troshin
1 August 2023

Papers citing "CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code"

14 papers
AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation
Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, David Lo
25 Apr 2024

Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan, Gabriele Synnaeve, Baptiste Rozière
01 Feb 2024

FREED++: Improving RL Agents for Fragment-Based Molecule Generation by Thorough Reproduction
Alexander Telepov, Artem Tsypin, Kuzma Khrabrov, Sergey Yakukhnov, Pavel Strashnov, ..., Egor Rumiantsev, Daniel Ezhov, Manvel Avetisian, Olga Popova, Artur Kadurin
18 Jan 2024

Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey
Xinyu She, Yue Liu, Yanjie Zhao, Yiling He, Li Li, C. Tantithamthavorn, Zhan Qin, Haoyu Wang
27 Oct 2023

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara
17 Oct 2023

Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, ..., Malte Ostendorff, Samuel Weinbach, R. Sifa, Stefan Kesselheim, Nicolas Flores-Herr
12 Oct 2023

Efficient Inference for Multilingual Neural Machine Translation
Alexandre Berard, Dain Lee, S. Clinchant, K. Jung, Vassilina Nikoulina
14 Sep 2021

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Yue Wang, Weishi Wang, Shafiq R. Joty, S. Hoi
02 Sep 2021

DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample
15 Feb 2021

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, ..., Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu
09 Feb 2021

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
31 Dec 2020

Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, Jason Riesa
24 Oct 2020

A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code
Nadezhda Chirkova, Sergey Troshin
23 Oct 2020

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, M. Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, ..., Alex Rudnick, Oriol Vinyals, G. Corrado, Macduff Hughes, J. Dean
26 Sep 2016