
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
arXiv:2407.16607 (v3, latest) · 23 July 2024
J. Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith

Papers citing "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?"

12 papers shown
When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity
Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Shiyu Huang
14 Sep 2025

Speculating LLMs' Chinese Training Data Pollution from Their Tokens
Qingjie Zhang, Di Wang, Haoting Qian, Liu Yan, Tianwei Zhang, Ke Xu, Qi Li, Minlie Huang, Hewu Li, Han Qiu
25 Aug 2025

Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
Tomohiro Sawada, Kartik Goyal
08 Aug 2025

TokAlign: Efficient Vocabulary Adaptation via Token Alignment (ACL 2025)
Chong Li, Jiajun Zhang, Chengqing Zong
04 Jun 2025

Learning Dynamics in Continual Pre-Training for Large Language Models
Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, D. Zeng
12 May 2025

On Linear Representations and Pretraining Data Frequency in Language Models (ICLR 2025)
Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar
16 Apr 2025

SuperBPE: Space Travel for Language Models
Alisa Liu, J. Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi
17 Mar 2025

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation (ACL 2024)
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
18 Dec 2024

VersaTune: An Efficient Data Composition Framework for Training Multi-Capability LLMs
Keer Lu, Keshi Zhao, Zhuoran Zhang, Zheng Liang, Da Pan, ..., Xin Wu, Guosheng Dong, Bin Cui, Tengjiao Wang, Wentao Zhang
18 Nov 2024

Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language (International Journal of Information Technology, 2024)
Sagar Tamang, Dibya Jyoti Bora
28 Sep 2024

Batching BPE Tokenization Merges
Alexander P. Morgan
05 Aug 2024

MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hoffman, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith
11 Jul 2024