arXiv:2201.06642
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
17 January 2022
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
Papers citing
"Towards a Cleaner Document-Oriented Multilingual Crawled Corpus"
50 / 97 papers shown
Improving Informally Romanized Language Identification
Adrian Benton
Alexander Gutkin
Christo Kirov
Brian Roark
43
0
0
30 Apr 2025
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Roussel Rahman
ReLM
ELM
LRM
46
0
0
31 Mar 2025
ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding
Indraneil Paul
Haoyi Yang
Goran Glavas
Kristian Kersting
Iryna Gurevych
AAML
SyDa
34
0
0
27 Mar 2025
KréyoLID From Language Identification Towards Language Mining
Rasul Dent
Pedro Ortiz Suarez
Thibault Clérice
Benoît Sagot
51
0
0
09 Mar 2025
Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
A. Zebaze
Benoît Sagot
Rachel Bawden
70
0
0
06 Mar 2025
Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien
Charles-Étienne Joseph
Zain Sarwar
Ashwinee Panda
Anirban Das
Shi-Xiong Zhang
Stephen Rawls
S.
Eugene Belilovsky
Irina Rish
MoE
73
0
0
06 Mar 2025
Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
Paul Janson
Vaibhav Singh
Paria Mehrbod
Adam Ibrahim
Irina Rish
Eugene Belilovsky
Benjamin Thérien
CLL
73
0
0
04 Mar 2025
UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings
Layba Fiaz
Munief Hassan Tahir
Sana Shams
Sarmad Hussain
49
0
0
24 Feb 2025
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
Yingli Shen
Wen Lai
Shuo Wang
Xueren Zhang
Kangyang Luo
Alexander M. Fraser
Maosong Sun
47
0
0
17 Feb 2025
Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali
Sharad Duwal
Suraj Prasai
Suresh Manandhar
CLL
79
1
0
18 Dec 2024
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan
Neemesh Yadav
Sarah Masud
Md. Shad Akhtar
71
0
0
16 Dec 2024
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
Bethel Melesse Tessema
Akhil Kedia
Tae-Sun Chung
67
0
0
21 Nov 2024
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schütze
VLM
36
5
0
31 Oct 2024
Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact
Junhua Liu
Bin Fu
LRM
26
1
0
23 Oct 2024
State of NLP in Kenya: A Survey
Cynthia Jayne Amol
Everlyn Asiko Chimoto
Rose Delilah Gesicho
Antony M. Gitau
Naome A. Etori
...
Catherine Gitau
Antony Ndolo
Lilian D. A. Wanzare
Albert Njoroge Kahira
Ronald Tombe
21
1
0
13 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolò Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
79
2
0
11 Oct 2024
Neural machine translation system for Lezgian, Russian and Azerbaijani languages
Alidar Asvarov
Andrey Grabovoy
29
0
0
07 Oct 2024
Inner-Probe: Discovering Copyright-related Data Generation in LLM Architecture
Qichao Ma
Rui-Jie Zhu
Peiye Liu
Renye Yan
Fahong Zhang
...
Meng Li
Zhaofei Yu
Zongwei Wang
Yimao Cai
Tiejun Huang
45
0
0
06 Oct 2024
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis
Daoyang Li
Mingyu Jin
Qingcheng Zeng
Mengnan Du
55
2
0
22 Sep 2024
Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough
Konstantin Dobler
Gerard de Melo
44
1
0
28 Aug 2024
A Survey of Large Language Models for European Languages
Wazir Ali
S. Pyysalo
39
2
0
27 Aug 2024
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
Iker García-Ferrero
Alon Jacovi
Jonas Hanselle
Yanai Elazar
...
Yu-Min Tseng
Vishaal Udandarao
Zengzhi Wang
Ruijie Xu
Jinglin Yang
34
5
0
31 Jul 2024
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
J. Hayase
Alisa Liu
Yejin Choi
Sewoong Oh
Noah A. Smith
37
10
0
23 Jul 2024
Do Multilingual Large Language Models Mitigate Stereotype Bias?
Shangrui Nie
Michael Fromm
Charles F Welch
Rebekka Görge
Akbar Karimi
Joan Plepi
Nazia Afsan Mowmita
Nicolas Flores-Herr
Mehdi Ali
Lucie Flek
16
3
0
08 Jul 2024
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz
Satak Kumar Dey
Ruwad Naswan
Hasnaen Adil
Khondker Salman Sayeed
Haz Sameen Shahgir
29
0
0
29 Jun 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo
Hynek Kydlícek
Loubna Ben Allal
Anton Lozhkov
Margaret Mitchell
Colin Raffel
Leandro von Werra
Thomas Wolf
38
184
0
25 Jun 2024
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models
Lynn Chua
Badih Ghazi
Yangsibo Huang
Pritish Kamath
Ravi Kumar
Pasin Manurangsi
Amer Sinha
Chulin Xie
Chiyuan Zhang
51
1
0
23 Jun 2024
Multilingual Large Language Models and Curse of Multilinguality
Daniil Gurgurov
Tanja Bäumel
Tatiana Anikina
78
4
0
15 Jun 2024
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral
A. Zebaze
Pedro Ortiz Suarez
Julien Abadji
Rémi Lacroix
Cordelia Schmid
Rachel Bawden
Benoît Sagot
39
3
0
13 Jun 2024
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models
Martin Courtois
Malte Ostendorff
Leonhard Hennig
Georg Rehm
31
2
0
10 Jun 2024
Targeted Multilingual Adaptation for Low-resource Language Families
C.M. Downey
Terra Blevins
Dhwani Serai
Dwija Parikh
Shane Steinert-Threlkeld
32
2
0
20 May 2024
OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Mihai Masala
Denis C. Ilie-Ablachim
D. Corlatescu
Miruna Zavelca
Marius Leordeanu
Horia Velicu
Marius Popescu
Mihai Dascalu
Traian Rebedea
35
2
0
13 May 2024
Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking
Emre Can Acikgoz
Mete Erdogan
Deniz Yuret
30
7
0
07 May 2024
Building a Large Japanese Web Corpus for Large Language Models
Naoaki Okazaki
Kakeru Hattori
Hirai Shota
Hiroki Iida
Masanari Ohi
Kazuki Fujii
Taishi Nakamura
Mengsay Loem
Rio Yokota
Sakae Mizuki
47
6
0
27 Apr 2024
ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models
Trong-Hieu Nguyen
Anh-Cuong Le
Viet-Cuong Nguyen
25
0
0
17 Apr 2024
PRODIS - a speech database and a phoneme-based language model for the study of predictability effects in Polish
Zofia Malisz
Jan Foremski
Malgorzata Kul
23
1
0
15 Apr 2024
JaFIn: Japanese Financial Instruction Dataset
Kota Tanabe
Masahiro Suzuki
Hiroki Sakaji
Itsuki Noda
39
1
0
14 Apr 2024
HyperCLOVA X Technical Report
Kang Min Yoo
Jaegeun Han
Sookyo In
Heewon Jeon
Jisu Jeong
...
Hyunkyung Noh
Se-Eun Choi
Sang-Woo Lee
Jung Hwa Lim
Nako Sung
VLM
27
8
0
02 Apr 2024
An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models
Fahim Faisal
Antonios Anastasopoulos
27
0
0
29 Mar 2024
NSINA: A News Corpus for Sinhala
Hansi Hettiarachchi
Damith Premasiri
Lasitha Uyangodage
Tharindu Ranasinghe
31
2
0
25 Mar 2024
A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert
Graeme Nail
Nikolay Arefyev
Marta Bañón
Jelmer van der Linde
...
Gema Ramírez-Sánchez
Andrey Kutuzov
S. Pyysalo
Stephan Oepen
Jörg Tiedemann
VLM
28
20
0
20 Mar 2024
Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese
Meet Doshi
Raj Dabre
Pushpak Bhattacharyya
SyDa
26
2
0
20 Mar 2024
Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
M. Alrefaie
Nour Eldin Morsy
Nada Samir
21
6
0
17 Mar 2024
Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family
Rodrigo Santos
João Rodrigues
Luís Gomes
Joao Silva
António Branco
Henrique Lopes Cardoso
T. Osório
Bernardo Leite
28
8
0
04 Mar 2024
VNLP: Turkish NLP Package
Meliksah Turker
Mehmet Erdi Ari
Aydin Han
35
1
0
02 Mar 2024
VBART: The Turkish LLM
Meliksah Turker
Mehmet Erdi Ari
Aydin Han
VLM
23
4
0
02 Mar 2024
PeLLE: Encoder-based language models for Brazilian Portuguese based on open data
Guilherme Lamartine de Mello
Marcelo Finger
F. Serras
M. Carpi
Marcos Menon Jose
Pedro Henrique Domingues
Paulo Cavalim
21
0
0
29 Feb 2024
GlórIA -- A Generative and Open Large Language Model for Portuguese
Ricardo Lopes
João Magalhães
David Semedo
27
8
0
20 Feb 2024
An Analysis of Language Frequency and Error Correction for Esperanto
Junhong Liang
15
0
0
15 Feb 2024
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Shivalika Singh
Freddie Vargus
Daniel D'souza
Börje F. Karlsson
Abinaya Mahendiran
...
Max Bartolo
Julia Kreutzer
A. Ustun
Marzieh Fadaee
Sara Hooker
117
115
0
09 Feb 2024