Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Transactions of the Association for Computational Linguistics (TACL), 2021
22 March 2021
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
Nasanbayar Ulzii-Orshikh
A. Tapo
Nishant Subramani
Artem Sokolov
Claytone Sikasote
Monang Setyawan
Supheakmungkol Sarin
Sokhar Samb
Benoît Sagot
Clara E. Rivera
Annette Rios Gonzales
Isabel Papadimitriou
Salomey Osei
Pedro Ortiz Suarez
Iroro Orife
Kelechi Ogueji
Andre Niyongabo Rubungo
Toan Q. Nguyen
Mathias Müller
A. Muller
Shamsuddeen Hassan Muhammad
N. Muhammad
Ayanda Mnyakeni
Jamshidbek Mirzakhalov
Tapiwanashe Matangira
Colin Leong
Nze Lawson
Sneha Kudugunta
Yacine Jernite
M. Jenny
Orhan Firat
Bonaventure F. P. Dossou
Sakhile Dlamini
Nisansa de Silva
Sakine Çabuk Ballı
Stella Biderman
A. Battisti
Ahmed Baruwa
Ankur Bapna
P. Baljekar
Israel Abebe Azime
Ayodele Awokoya
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi

Papers citing "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets"

50 / 191 papers shown
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett
T. Chang
Stella Biderman
Benjamin Bergen
140
0
0
24 Oct 2025
SemiAdapt and SemiLoRA: Efficient Domain Adaptation for Transformer-based Low-Resource Language Translation with a Case Study on Irish
Josh McGiff
Nikola S. Nikolov
64
1
0
21 Oct 2025
Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study
Ayan Majumdar
Feihao Chen
Jinghui Li
Xiaozhen Wang
164
0
0
06 Oct 2025
Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity
Yeongbin Seo
Gayoung Kim
Jaehyung Kim
Jinyoung Yeo
119
0
0
23 Sep 2025
Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
Thales Sales Almeida
Rodrigo Nogueira
Hélio Pedrini
140
4
0
10 Sep 2025
Social Bias in Multilingual Language Models: A Survey
Lance Calvin Lim Gamboa
Yue Feng
Mark Lee
227
0
0
27 Aug 2025
The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks
Zachary Hopton
Jannis Vamvas
Andrin Büchler
Anna Rutkiewicz
Rico Cathomas
Rico Sennrich
95
0
0
22 Aug 2025
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Negar Foroutan
Clara Meister
Debjit Paul
Joel Niklaus
Sina Ahmadi
Antoine Bosselut
Rico Sennrich
188
2
0
06 Aug 2025
Synthetic Voice Data for Automatic Speech Recognition in African Languages
Brian DeRenzi
Anna Dixon
Mohamed Aymane Farhi
Christian Resch
164
2
0
23 Jul 2025
Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
Yikang Liu
Wanyang Zhang
Yiming Wang
Jialong Tang
Pei Zhang
Baosong Yang
Fei Huang
Rui Wang
Hai Hu
110
0
0
16 Jul 2025
Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead
Jesujoba Oluwadara Alabi
Michael A. Hedderich
David Ifeoluwa Adelani
Dietrich Klakow
457
4
0
27 May 2025
The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
Chris C. Emezue
NaijaVoices Community
Busayo Awobade
A. Owodunni
Handel Emezue
...
Nefertiti Nneoma Emezue
Sewade Ogun
Bunmi Akinremi
David Ifeoluwa Adelani
Chris Pal
229
4
0
26 May 2025
Enhancing LLMs via High-Knowledge Data Selection
AAAI Conference on Artificial Intelligence (AAAI), 2025
Feiyu Duan
Xuemiao Zhang
Sirui Wang
Haoran Que
Yuqi Liu
Wenge Rong
Xunliang Cai
476
3
0
20 May 2025
Improving Informally Romanized Language Identification
Adrian Benton
Alexander Gutkin
Christo Kirov
Brian Roark
328
0
0
30 Apr 2025
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Laurie Burchell
Ona de Gibert
Nikolay Arefyev
Mikko Aulamo
Marta Bañón
...
Pavel Stepachev
Jörg Tiedemann
Dušan Variš
Tereza Vojtěchová
Jaume Zaragoza-Bernabeu
436
11
0
13 Mar 2025
KréyoLID From Language Identification Towards Language Mining
Rasul Dent
Pedro Ortiz Suarez
Thibault Clérice
Benoît Sagot
167
0
0
09 Mar 2025
Designing Speech Technologies for Australian Aboriginal English: Opportunities, Risks and Participation
Conference on Fairness, Accountability and Transparency (FAccT), 2025
Ben Hutchinson
Celeste Rodríguez Louro
Glenys Collard
Ned Cooper
419
0
0
05 Mar 2025
Autoencoder-Based Framework to Capture Vocabulary Quality in NLP
Vu Minh Hoang Dang
Rakesh M. Verma
125
0
0
28 Feb 2025
Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Aloka Fernando
Nisansa de Silva
Menan Velyuthan
Charitha Rathnayake
Surangika Ranathunga
302
1
0
26 Feb 2025
What are Foundation Models Cooking in the Post-Soviet World?
Anton Lavrouk
Tarek Naous
Alan Ritter
Wei Xu
444
2
0
25 Feb 2025
Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs
Jonathan Rystrøm
Hannah Rose Kirk
Scott A. Hale
440
17
0
23 Feb 2025
Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang
Yao Lu
Maurice Weber
Max Ryabinin
David Ifeoluwa Adelani
Yihong Chen
Raphael Tang
Pontus Stenetorp
LRM
332
7
0
20 Feb 2025
Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages
Aloka Fernando
Surangika Ranathunga
CLL
90
2
0
10 Jan 2025
ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study
Eric Modesitt
Ke Yang
Spencer Hulsey
Chengxiang Zhai
Volodymyr Kindratenko
144
2
0
19 Dec 2024
Alignment at Pre-training! Towards Native Alignment for Arabic LLMs
Neural Information Processing Systems (NeurIPS), 2024
Juhao Liang
Zhenyang Cai
Jianqing Zhu
Huang Huang
Kewei Zong
...
Juncai He
Lian Zhang
Haoyang Li
Benyou Wang
Jinchao Xu
LLMSV
124
8
0
04 Dec 2024
LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
Y. Kim
Hyunsoo Ha
Seonghoon Yang
Sukyung Lee
Jihoo Kim
Chanjun Park
101
1
0
18 Nov 2024
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Khaoula Chehbouni
Jonathan Colaço-Carr
Yash More
Jackie CK Cheung
G. Farnadi
524
7
0
12 Nov 2024
Identifying Implicit Social Biases in Vision-Language Models
AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2024
Kimia Hamidieh
Haoran Zhang
Walter Gerych
Thomas Hartvigsen
Elisa Kreiss
VLM
204
32
0
01 Nov 2024
Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language
Jiayi Wang
Yao Lu
Maurice Weber
Max Ryabinin
Yihong Chen
Raphael Tang
Pontus Stenetorp
LRM
240
3
0
31 Oct 2024
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Neural Information Processing Systems (NeurIPS), 2024
Amir Hossein Kargaran
François Yvon
Hinrich Schütze
VLM
266
11
0
31 Oct 2024
Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository
S. Tamang
D. J. Bora
138
1
0
15 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
International Conference on Learning Representations (ICLR), 2024
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
219
8
0
12 Oct 2024
From N-grams to Pre-trained Multilingual Models For Language Identification
Thapelo Sindane
Vukosi Marivate
219
4
0
11 Oct 2024
X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale
International Conference on Learning Representations (ICLR), 2024
Haoran Xu
Kenton W. Murray
Philipp Koehn
Hieu T. Hoang
Akiko Eriguchi
Huda Khayrallah
295
27
0
04 Oct 2024
Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization
Kaden Uhlig
Joern Wuebker
Raphael Reinauer
John DeNero
318
0
0
26 Sep 2024
Evaluating Cultural Awareness of LLMs for Yoruba, Malayalam, and English
Fiifi Dawson
Zainab Mosunmola
Sahil Pocker
Raj Abhijit Dandekar
Rajat Dandekar
Sreedath Panat
182
7
0
14 Sep 2024
Correcting FLORES Evaluation Dataset for Four African Languages
Conference on Machine Translation (WMT), 2024
Idris Abdulmumin
Sthembiso Mkhwanazi
Mahlatse S. Mbooi
Shamsuddeen Hassan Muhammad
Ibrahim Said Ahmad
Neo Putini
Miehleketo Mathebula
Matimba Shingange
T. Gwadabe
Vukosi Marivate
279
12
0
01 Sep 2024
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
Iker García-Ferrero
Alon Jacovi
Jonas Hanselle
Yanai Elazar
...
Yu-Min Tseng
Vishaal Udandarao
Zengzhi Wang
Ruijie Xu
Jinglin Yang
259
13
0
31 Jul 2024
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre
Robert Mahari
Ariel N. Lee
Campbell Lund
Hamidah Oderinwale
...
Hanlin Li
Daphne Ippolito
Sara Hooker
Jad Kabbara
Sandy Pentland
309
62
0
20 Jul 2024
Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency
Abeba Birhane
Marek McGann
147
19
0
11 Jul 2024
A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training
Michał Perełkiewicz
Rafał Poświata
187
7
0
10 Jul 2024
Recent Advancements and Challenges of Turkic Central Asian Language Processing
Yana Veitsman
223
6
0
06 Jul 2024
Toucan: Many-to-Many Translation for 150 African Language Pairs
AbdelRahim Elmadany
Ife Adebara
Muhammad Abdul-Mageed
231
5
0
05 Jul 2024
How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng
Di Wu
Christof Monz
331
4
0
02 Jul 2024
Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark
Fabio Mercorio
Mario Mezzanzanica
Daniele Potertì
Antonio Serino
Andrea Seveso
225
9
0
25 Jun 2024
Less can be more: representational vs. stereotypical gender bias in facial expression recognition
Iris Dominguez-Catena
D. Paternain
A. Jurio
M. Galar
188
5
0
25 Jun 2024
Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora
Erik Derner
Sara Sansalvador de la Fuente
Yoan Gutiérrez
Paloma Moreda
Nuria Oliver
248
0
0
19 Jun 2024
Quantifying Geospatial in the Common Crawl Corpus
Ilya Ilyankou
Meihui Wang
Stefano Cavazzi
James Haworth
259
6
0
07 Jun 2024
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
David Ifeoluwa Adelani
Jessica Ojo
Israel Abebe Azime
Jian Yun Zhuang
Jesujoba Oluwadara Alabi
...
Salomey Osei
Sokhar Samb
Tadesse Kebede Guge
Pontus Stenetorp
ELM
447
25
0
05 Jun 2024
An Open Multilingual System for Scoring Readability of Wikipedia
Mykola Trokhymovych
Indira Sen
Martin Gerlach
212
9
0
03 Jun 2024