2103.12028
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Transactions of the Association for Computational Linguistics (TACL), 2021
22 March 2021
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
Nasanbayar Ulzii-Orshikh
A. Tapo
Nishant Subramani
Artem Sokolov
Claytone Sikasote
Monang Setyawan
Supheakmungkol Sarin
Sokhar Samb
Benoît Sagot
Clara E. Rivera
Annette Rios Gonzales
Isabel Papadimitriou
Salomey Osei
Pedro Ortiz Suarez
Iroro Orife
Kelechi Ogueji
Andre Niyongabo Rubungo
Toan Q. Nguyen
Mathias Müller
A. Muller
Shamsuddeen Hassan Muhammad
N. Muhammad
Ayanda Mnyakeni
Jamshidbek Mirzakhalov
Tapiwanashe Matangira
Colin Leong
Nze Lawson
Sneha Kudugunta
Yacine Jernite
M. Jenny
Orhan Firat
Bonaventure F. P. Dossou
Sakhile Dlamini
Nisansa de Silva
Sakine Çabuk Ballı
Stella Biderman
A. Battisti
Ahmed Baruwa
Ankur Bapna
P. Baljekar
Israel Abebe Azime
Ayodele Awokoya
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
Papers citing "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets"
50 / 191 papers shown
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
International Conference on Machine Learning (ICML), 2023
Stella Biderman
Hailey Schoelkopf
Quentin G. Anthony
Herbie Bradley
Kyle O'Brien
...
USVSN Sai Prashanth
Edward Raff
Aviya Skowron
Lintang Sutawika
Oskar van der Wal
372
1,616
0
03 Apr 2023
Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation
Alex Jones
Isaac Caswell
Ishan Saxena
Orhan Firat
250
12
0
27 Mar 2023
AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages
Chris C. Emezue
Sanchit Gandhi
Lewis Tunstall
Abubakar Abid
Josh Meyer
...
Douwe Kiela
Yacine Jernite
Julien Chaumond
Merve Noyan
Omar Sanseviero
156
4
0
22 Mar 2023
CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization
International Conference on Language Resources and Evaluation (LREC), 2023
Ruochen Zhang
Carsten Eickhoff
302
9
0
07 Mar 2023
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Neural Information Processing Systems (NeurIPS), 2023
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
...
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
AI4CE
AILaw
200
194
0
07 Mar 2023
Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Christopher Akiki
Odunayo Ogundepo
Aleksandra Piktus
Xinyu Crystina Zhang
Akintunde Oladipo
Jimmy J. Lin
Martin Potthast
246
5
0
28 Feb 2023
The ROOTS Search Tool: Data Transparency for LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Aleksandra Piktus
Christopher Akiki
Paulo Villegas
Hugo Laurençon
Gérard Dupont
A. Luccioni
Yacine Jernite
Anna Rogers
VLM
271
36
0
27 Feb 2023
Auditing large language models: a three-layered approach
AI and Ethics (AE), 2023
Jakob Mökander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
456
267
0
16 Feb 2023
Investigating Multi-source Active Learning for Natural Language Inference
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Ard Snijders
Douwe Kiela
Katerina Margatina
226
9
0
14 Feb 2023
Beyond Arabic: Software for Perso-Arabic Script Manipulation
Workshop on Arabic Natural Language Processing (WANLP), 2023
Alexander Gutkin
Cibu Johny
R. Doctor
Brian Roark
R. Sproat
173
4
0
26 Jan 2023
On the State of German (Abstractive) Text Summarization
Datenbanksysteme für Business, Technologie und Web (BTW), 2023
Dennis Aumiller
Jing Fan
Michael Gertz
231
1
0
17 Jan 2023
SERENGETI: Massively Multilingual Language Models for Africa
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Ife Adebara
AbdelRahim Elmadany
Muhammad Abdul-Mageed
Alcides Alcoba Inciarte
275
41
0
21 Dec 2022
Trustworthy Social Bias Measurement
AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2022
Rishi Bommasani
Abigail Z. Jacobs
235
13
0
20 Dec 2022
Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data
Timm Jansen
Yangling Tong
V. Zevallos
Pedro Ortiz Suarez
152
24
0
20 Dec 2022
Synthetic Pre-Training Tasks for Neural Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Zexue He
Graeme W. Blackwood
Yikang Shen
Julian McAuley
Rogerio Feris
245
6
0
19 Dec 2022
LR-Sum: Summarization for Less-Resourced Languages
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Chester Palen-Michel
Constantine Lignos
160
7
0
19 Dec 2022
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Zheng-Xin Yong
Hailey Schoelkopf
Niklas Muennighoff
Alham Fikri Aji
David Ifeoluwa Adelani
...
Genta Indra Winata
Stella Biderman
Edward Raff
Dragomir R. Radev
Vassilina Nikoulina
CLL
VLM
AI4CE
LRM
362
106
0
19 Dec 2022
Synthesis and Evaluation of a Domain-specific Large Data Set for Dungeons & Dragons
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2022
Akila Peiris
Nisansa de Silva
118
5
0
18 Dec 2022
Lessons learned from the evaluation of Spanish Language Models
Rodrigo Agerri
Eneko Agirre
ELM
226
16
0
16 Dec 2022
In-context Examples Selection for Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Sweta Agrawal
Chunting Zhou
M. Lewis
Luke Zettlemoyer
Marjan Ghazvininejad
LRM
309
231
0
05 Dec 2022
TyDiP: A Dataset for Politeness Classification in Nine Typologically Diverse Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
A. Srinivasan
Eunsol Choi
162
22
0
29 Nov 2022
Learnings from Technological Interventions in a Low Resource Language: Enhancing Information Access in Gondi
Devansh Mehta
Harshita Diddee
Ananya Saxena
Anurag Shukla
Sebastin Santy
...
B. M. L. Srivastava
Alok Sharma
Vishnu Prasad
U. Venkanna
Kalika Bali
154
1
0
29 Nov 2022
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xinyan Velocity Yu
Akari Asai
Trina Chatterjee
Junjie Hu
Eunsol Choi
212
28
0
28 Nov 2022
Measuring Harmful Representations in Scandinavian Language Models
Samia Touileb
Debora Nozza
222
13
0
21 Nov 2022
Efficient Transformers with Dynamic Token Pooling
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Piotr Nawrot
J. Chorowski
Adrian Łańcucki
Edoardo Ponti
225
69
0
17 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
824
2,733
0
09 Nov 2022
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Colin Leong
Joshua Nemecek
Jacob Mansdorfer
Anna Filighera
A. Owodunni
Daniel Whitenack
VLM
AI4CE
386
32
0
26 Oct 2022
Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2022
Gihan Weeraprameshwara
Vihanga Jayawickrama
Nisansa de Silva
Yudhanjaya Wijeratne
147
4
0
26 Oct 2022
Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Iker García-Ferrero
Rodrigo Agerri
German Rigau
164
25
0
23 Oct 2022
University of Cape Town's WMT22 System: Multilingual Machine Translation for Southern African Languages
Conference on Machine Translation (WMT), 2022
Khalid N. Elmadani
Francois Meyer
Jan Buys
113
2
0
21 Oct 2022
AfroLID: A Neural Language Identification Tool for African Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Ife Adebara
AbdelRahim Elmadany
Muhammad Abdul-Mageed
Alcides Alcoba Inciarte
247
39
0
21 Oct 2022
Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages
Conference on Machine Translation (WMT), 2022
Idris Abdulmumin
Michael Beukman
Jesujoba Oluwadara Alabi
Chris C. Emezue
Everlyn Asiko
...
Shamsuddeen Hassan Muhammad
Mofetoluwa Adeyemi
Oreen Yousuf
Sahib Singh
T. Gwadabe
313
10
0
19 Oct 2022
Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
Spoken Language Technology Workshop (SLT), 2022
Zhehuai Chen
Ankur Bapna
Andrew Rosenberg
Yu Zhang
Bhuvana Ramabhadran
Pedro J. Moreno
Nanxin Chen
211
17
0
18 Oct 2022
Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Surangika Ranathunga
Nisansa de Silva
248
49
0
16 Oct 2022
Rethinking Annotation: Can Language Learners Contribute?
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Haneul Yoo
Rifki Afina Putri
Changyoon Lee
Youngin Lee
So-Yeon Ahn
Luan Tuyen Chau
Alice Oh
177
2
0
13 Oct 2022
Subword Segmental Language Modelling for Nguni Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Francois Meyer
Jan Buys
166
6
0
12 Oct 2022
Language Varieties of Italy: Technology Challenges and Opportunities
Transactions of the Association for Computational Linguistics (TACL), 2022
Alan Ramponi
264
13
0
20 Sep 2022
MaXM: Towards Multilingual Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Soravit Changpinyo
Linting Xue
Michal Yarom
Ashish V. Thapliyal
Idan Szpektor
J. Amelot
Xi Chen
Radu Soricut
245
8
0
12 Sep 2022
Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation
Bryan Li
Mohammad Sadegh Rasooli
Ajay Patel
Chris Callison-Burch
168
4
0
06 Sep 2022
Efficient Methods for Natural Language Processing: A Survey
Transactions of the Association for Computational Linguistics (TACL), 2022
Marcos Vinícius Treviso
Ji-Ung Lee
Tianchu Ji
Betty van Aken
Qingqing Cao
...
Emma Strubell
Niranjan Balasubramanian
Leon Derczynski
Iryna Gurevych
Roy Schwartz
365
140
0
31 Aug 2022
Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Nuno M. Guerreiro
Elena Voita
André F. T. Martins
HILM
273
68
0
10 Aug 2022
The Impact of Data Corruption on Named Entity Recognition for Low-resourced Languages
Manuel A. Fokam
Michael Beukman
154
0
0
09 Aug 2022
esCorpius: A Massive Spanish Crawling Corpus
IberSPEECH Conference (IberSPEECH), 2022
Asier Gutiérrez-Fandiño
David Pérez-Fernández
Jordi Armengol-Estapé
D. Griol
Z. Callejas
312
4
0
30 Jun 2022
Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Aditi Chaudhary
Arun Sampath
Ashwin Sheshadri
Antonios Anastasopoulos
Graham Neubig
AI4Ed
220
4
0
10 Jun 2022
Detecting Label Errors by using Pre-Trained Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Derek Chong
Jenny Hong
Christopher D. Manning
NoLa
274
23
0
25 May 2022
Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Tu Vu
Aditya Barua
Brian Lester
Daniel Cer
Mohit Iyyer
Noah Constant
CLL
317
71
0
25 May 2022
Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
Workshop on Deep Learning Approaches for Low-Resource Natural Language Processing (ALNLP), 2022
Kurt Micallef
Albert Gatt
Marc Tanti
Lonneke van der Plas
Claudia Borg
223
33
0
21 May 2022
Evaluation of Transfer Learning for Polish with a Text-to-Text Model
International Conference on Language Resources and Evaluation (LREC), 2022
Aleksandra Chrabrowa
Lukasz Dragan
Karol Grzegorczyk
D. Kajtoch
Mikołaj Koszowski
Robert Mroczkowski
Piotr Rybak
188
21
0
18 May 2022
Extracting Latent Steering Vectors from Pretrained Language Models
Findings of the Association for Computational Linguistics (Findings), 2022
Nishant Subramani
Nivedita Suresh
Matthew E. Peters
LLMSV
178
138
0
10 May 2022
Building Machine Translation Systems for the Next Thousand Languages
Ankur Bapna
Isaac Caswell
Julia Kreutzer
Orhan Firat
D. Esch
...
Apurva Shah
Yanping Huang
Zhiwen Chen
Yonghui Wu
Macduff Hughes
270
108
0
09 May 2022