2103.12028
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Transactions of the Association for Computational Linguistics (TACL), 2021
22 March 2021
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
Nasanbayar Ulzii-Orshikh
A. Tapo
Nishant Subramani
Artem Sokolov
Claytone Sikasote
Monang Setyawan
Supheakmungkol Sarin
Sokhar Samb
Benoît Sagot
Clara E. Rivera
Annette Rios Gonzales
Isabel Papadimitriou
Salomey Osei
Pedro Ortiz Suarez
Iroro Orife
Kelechi Ogueji
Andre Niyongabo Rubungo
Toan Q. Nguyen
Mathias Müller
A. Muller
Shamsuddeen Hassan Muhammad
N. Muhammad
Ayanda Mnyakeni
Jamshidbek Mirzakhalov
Tapiwanashe Matangira
Colin Leong
Nze Lawson
Sneha Kudugunta
Yacine Jernite
M. Jenny
Orhan Firat
Bonaventure F. P. Dossou
Sakhile Dlamini
Nisansa de Silva
Sakine Çabuk Ballı
Stella Biderman
A. Battisti
Ahmed Baruwa
Ankur Bapna
P. Baljekar
Israel Abebe Azime
Ayodele Awokoya
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
Papers citing
"Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets"
41 / 191 papers shown
A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
David Ifeoluwa Adelani
Jesujoba Oluwadara Alabi
Angela Fan
Julia Kreutzer
Xiaoyu Shen
...
Ayodele Awokoya
Happy Buzaaba
Blessing K. Sibanda
Andiswa Bukula
Sam Manthalu
441
130
0
04 May 2022
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Conference on Fairness, Accountability and Transparency (FAccT), 2022
Yacine Jernite
Huu Nguyen
Stella Biderman
A. Rogers
Maraim Masoud
...
Jorg Frohberg
Aaron Gokaslan
Peter Henderson
Rishi Bommasani
Margaret Mitchell
224
57
0
04 May 2022
Handling and Presenting Harmful Text in NLP Research
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Hannah Rose Kirk
Abeba Birhane
Bertie Vidgen
Leon Derczynski
290
58
0
29 Apr 2022
RobBERTje: a Distilled Dutch BERT Model
Pieter Delobelle
Thomas Winters
Bettina Berendt
194
15
0
28 Apr 2022
The Risks of Machine Learning Systems
Samson Tan
Araz Taeihagh
K. Baxter
130
7
0
21 Apr 2022
Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Terra Blevins
Luke Zettlemoyer
312
103
0
17 Apr 2022
mGPT: Few-Shot Learners Go Multilingual
Transactions of the Association for Computational Linguistics (TACL), 2022
Oleh Shliazhko
Alena Fenogenova
Maria Tikhonova
Vladislav Mikhailov
Anastasia Kozlova
Tatiana Shavrina
360
190
0
15 Apr 2022
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black
Stella Biderman
Eric Hallahan
Quentin G. Anthony
Leo Gao
...
Shivanshu Purohit
Laria Reynolds
J. Tow
Benqi Wang
Samuel Weinbach
371
949
0
14 Apr 2022
Experimental Standards for Deep Learning in Natural Language Processing Research
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Dennis Ulmer
Elisa Bassignana
Max Müller-Eberstein
Daniel Varab
Mike Zhang
Rob van der Goot
Christian Hardmeier
Barbara Plank
260
12
0
13 Apr 2022
Considerations for Multilingual Wikipedia Research
Isaac Johnson
Emily A. Lescak
130
4
0
05 Apr 2022
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Alham Fikri Aji
Genta Indra Winata
Fajri Koto
Samuel Cahyawijaya
Ade Romadhony
...
David Moeljadi
Radityo Eko Prasojo
Timothy Baldwin
Jey Han Lau
Sebastian Ruder
226
126
0
24 Mar 2022
Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?
Findings (Findings), 2022
E. Lee
Sarubi Thillainathan
Shravan Nayak
Surangika Ranathunga
David Ifeoluwa Adelani
Ruisi Su
Arya D. McCarthy
VLM
345
51
0
16 Mar 2022
Does Corpus Quality Really Matter for Low-Resource Languages?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Mikel Artetxe
Itziar Aldabe
Rodrigo Agerri
Olatz Perez-de-Viñaspre
Aitor Soroa Etxabe
227
21
0
15 Mar 2022
Can Synthetic Translations Improve Bitext Quality?
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Eleftheria Briakou
Marine Carpuat
144
6
0
15 Mar 2022
Toward More Meaningful Resources for Lower-resourced Languages
Findings (Findings), 2022
Constantine Lignos
Nolan Holley
Chester Palen-Michel
Jonne Saleva
141
20
0
24 Feb 2022
Sequence-to-Sequence Resources for Catalan
Ona de Gibert
Ksenia Kharitonova
B. Figueras
Jordi Armengol-Estapé
Maite Melero
60
0
0
14 Feb 2022
Cedille: A large autoregressive French language model
Martin Müller
Florian Laurent
196
23
0
07 Feb 2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Angelina McMillan-Major
Zaid Alyafeai
Stella Biderman
Kimbo Chen
F. Toni
...
Aitor Soroa Etxabe
Pedro Ortiz Suarez
Zeerak Talat
Daniel Alexander van Strien
Yacine Jernite
209
14
0
25 Jan 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
International Conference on Language Resources and Evaluation (LREC), 2022
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
203
189
0
17 Jan 2022
Multilingual Open Text Release 1: Public Domain News in 44 Languages
International Conference on Language Resources and Evaluation (LREC), 2022
Chester Palen-Michel
June-Woo Kim
Constantine Lignos
VLM
136
14
0
14 Jan 2022
A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language Models
International Conference on Language Resources and Evaluation (LREC), 2022
Vésteinn Snæbjarnarson
Haukur Barri Símonarson
Pétur Orri Ragnarsson
Svanhvít Lilja Ingólfsdóttir
H. Jónsson
Vilhjálmur Þorsteinsson
H. Einarsson
270
31
0
14 Jan 2022
Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data
International Conference on Artificial Intelligence in Electronics Engineering (AIEE), 2022
Gihan Weeraprameshwara
Vihanga Jayawickrama
Nisansa de Silva
Yudhanjaya Wijeratne
150
6
0
11 Jan 2022
DOCmT5: Document-Level Pretraining of Multilingual Language Models
Chia-Hsuan Lee
Aditya Siddhant
Viresh Ratnakar
Melvin Johnson
LRM
156
13
0
16 Dec 2021
Ethical and social risks of harm from Language Models
Laura Weidinger
John F. J. Mellor
Maribeth Rauh
Conor Griffin
J. Uesato
...
Lisa Anne Hendricks
William S. Isaac
Sean Legassick
G. Irving
Iason Gabriel
PILM
535
1,307
0
08 Dec 2021
Seeking Sinhala Sentiment: Predicting Facebook Reactions of Sinhala Posts
Vihanga Jayawickrama
Gihan Weeraprameshwara
Nisansa de Silva
Yudhanjaya Wijeratne
128
6
0
01 Dec 2021
Analysis of Data Augmentation Methods for Low-Resource Maltese ASR
A. DeMarco
C. Mena
Albert Gatt
Claudia Borg
A. Williams
Lonneke van der Plas
161
0
0
15 Nov 2021
BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation
Eleftheria Briakou
Sida Wang
Luke Zettlemoyer
Marjan Ghazvininejad
179
5
0
12 Nov 2021
Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey
ACM Computing Surveys (CSUR), 2021
Bonan Min
Hayley L Ross
Elior Sulem
Amir Pouran Ben Veyseh
Thien Huu Nguyen
Oscar Sainz
Eneko Agirre
Ilana Heinz
Dan Roth
LM&MA
VLM
AI4CE
429
1,365
0
01 Nov 2021
PAGnol: An Extra-Large French Generative Model
Julien Launay
E. L. Tommasone
B. Pannier
François Boniface
A. Chatelain
Alessandro Cappelli
Iacopo Poli
Djamé Seddah
AILaw
MoE
AI4CE
213
9
0
16 Oct 2021
Sparks: Inspiration for Science Writing using Language Models
Katy Ilonka Gero
Vivian Liu
Lydia B. Chilton
LRM
282
190
0
14 Oct 2021
Few-shot Controllable Style Transfer for Low-Resource Multilingual Settings
Kalpesh Krishna
Deepak Nathani
Xavier Garcia
Bidisha Samanta
Partha P. Talukdar
195
27
0
14 Oct 2021
Training Dynamic based data filtering may not work for NLP datasets
Arka Talukdar
Monika Dagar
Prachi Gupta
Varun G. Menon
NoLa
118
3
0
19 Sep 2021
Datasets: A Community Library for Natural Language Processing
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Quentin Lhoest
Albert Villanova del Moral
Yacine Jernite
A. Thakur
Patrick von Platen
...
Thibault Goehringer
Victor Mustar
François Lagunas
Alexander M. Rush
Thomas Wolf
579
705
0
07 Sep 2021
Survey of Low-Resource Machine Translation
Computational Linguistics (CL), 2021
Barry Haddow
Rachel Bawden
Antonio Valerio Miceli Barone
Jindřich Helcl
Alexandra Birch
AIMat
502
196
0
01 Sep 2021
AraT5: Text-to-Text Transformers for Arabic Language Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
El Moatez Billah Nagoudi
AbdelRahim Elmadany
Muhammad Abdul-Mageed
348
156
0
31 Aug 2021
Neural Machine Translation for Low-Resource Languages: A Survey
ACM Computing Surveys (CSUR), 2021
Surangika Ranathunga
E. Lee
Marjana Prifti Skenduli
Ravi Shekhar
Mehreen Alam
Rishemjit Kaur
321
322
0
29 Jun 2021
What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
A. Luccioni
J. Viviano
365
136
0
06 May 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
309
562
0
18 Apr 2021
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
Transactions of the Association for Computational Linguistics (TACL), 2021
Gowtham Ramesh
Sumanth Doddapaneni
Aravinth Bheemaraj
Mayank Jobanputra
AK Raghavan
...
K. Deepak
Vivek Raghavan
Anoop Kunchukuttan
Pratyush Kumar
Mitesh Khapra
LRM
369
266
0
12 Apr 2021
The Effect of Domain and Diacritics in Yorùbá-English Neural Machine Translation
Machine Translation Summit (MT Summit), 2021
David Ifeoluwa Adelani
Dana Ruiter
Jesujoba Oluwadara Alabi
Damilola Adebonojo
Adesina Ayeni
Mofetoluwa Adeyemi
Ayodele Awokoya
C. España-Bonet
240
45
0
15 Mar 2021
Data and its (dis)contents: A survey of dataset development and use in machine learning research
Amandalynne Paullada
Inioluwa Deborah Raji
Emily M. Bender
Emily L. Denton
A. Hanna
313
599
0
09 Dec 2020