FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text

10 December 2025

Binbin XU

Main: 6 pages; Bibliography: 1 page; Appendix: 1 page; 5 figures; 3 tables

Abstract

We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace.
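
To make the idea of per-character statistics with aggregate and year-level counts concrete, the following is a minimal sketch in Python. It is not the authors' pipeline and does not reflect the released file schema: the function names (`count_by_year`, `to_rows`, `filter_by_category`) and the row fields are hypothetical, and the input is assumed to be a small in-memory list of `(year, text)` pairs. Only the Unicode general category is attached, because Python's standard `unicodedata` module does not expose script or block.

```python
from collections import Counter, defaultdict
import unicodedata


def count_by_year(docs):
    """Count characters per year from (year, text) pairs (hypothetical input shape)."""
    per_year = defaultdict(Counter)
    for year, text in docs:
        per_year[year].update(text)  # Counter.update counts each character
    return dict(per_year)


def to_rows(per_year):
    """Build one row per character with aggregate and year-level counts."""
    aggregate = Counter()
    for counts in per_year.values():
        aggregate.update(counts)
    total = sum(aggregate.values())
    rows = []
    for ch, n in aggregate.most_common():
        rows.append({
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",
            "count": n,
            "rel_freq": n / total,
            # General category only; script and block need another data source.
            "category": unicodedata.category(ch),
            "by_year": {y: c[ch] for y, c in sorted(per_year.items()) if c[ch]},
        })
    return rows


def filter_by_category(rows, prefix):
    """Keep rows whose general category starts with prefix, e.g. 'L' for letters."""
    return [r for r in rows if r["category"].startswith(prefix)]


docs = [(2013, "aab"), (2025, "a 😀")]
rows = to_rows(count_by_year(docs))
letters = filter_by_category(rows, "L")
symbols = filter_by_category(rows, "S")
```

Because no filtering happens at counting time, the emoji and the space survive in `rows`; a downstream user narrows the table afterwards with a category prefix, which mirrors the unfiltered-then-filter-later design the abstract describes.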
