Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2010.14571
Cited By
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
27 October 2020
Isaac Caswell
Theresa Breiner
D. Esch
Ankur Bapna
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus"
20 / 20 papers shown
Title
Improving Informally Romanized Language Identification
Adrian Benton
Alexander Gutkin
Christo Kirov
Brian Roark
43
0
0
30 Apr 2025
Sorting the Babble in Babel: Assessing the Performance of Language Detection Algorithms on the OpenAlex Database
Maxime Holmberg Sainte-Marie
Diego Kozlowski
Lucía Céspedes
Vincent Larivière
80
0
0
05 Feb 2025
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schutze
VLM
36
5
0
31 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
32
2
0
12 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
79
2
0
11 Oct 2024
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
David Ifeoluwa Adelani
Jessica Ojo
Israel Abebe Azime
Jian Yun Zhuang
Jesujoba Oluwadara Alabi
...
Salomey Osei
Sokhar Samb
Tadesse Kebede Guge
Pontus Stenetorp
Pontus Stenetorp
ELM
55
7
0
05 Jun 2024
Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability
Wei-Rui Chen
Ife Adebara
Khai Duy Doan
Qisheng Liao
Muhammad Abdul-Mageed
17
5
0
16 Nov 2023
Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text
Isaac Caswell
Lisa Wang
Isabel Papadimitriou
26
0
0
11 Nov 2023
Multimodal Modeling For Spoken Language Identification
Shikhar Bharadwaj
Min Ma
Shikhar Vashishth
Ankur Bapna
Sriram Ganapathy
...
Yu Zhang
D. Esch
Sandy Ritchie
Partha P. Talukdar
Jason Riesa
27
0
0
19 Sep 2023
Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation
Alex Jones
Isaac Caswell
Ishan Saxena
Orhan Firat
21
8
0
27 Mar 2023
AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages
Chris C. Emezue
Sanchit Gandhi
Lewis Tunstall
Abubakar Abid
Josh Meyer
...
Douwe Kiela
Yacine Jernite
Julien Chaumond
Merve Noyan
Omar Sanseviero
23
2
0
22 Mar 2023
High-Resource Methodological Bias in Low-Resource Investigations
Maartje ter Hoeve
David Grangier
Natalie Schluter
33
2
0
14 Nov 2022
Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
Zhehuai Chen
Ankur Bapna
Andrew Rosenberg
Yu Zhang
Bhuvana Ramabhadran
Pedro J. Moreno
Nanxin Chen
26
17
0
18 Oct 2022
Language Modelling with Pixels
Phillip Rust
Jonas F. Lotz
Emanuele Bugliarello
Elizabeth Salesky
Miryam de Lhoneux
Desmond Elliott
VLM
30
46
0
14 Jul 2022
NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis
Shamsuddeen Hassan Muhammad
David Ifeoluwa Adelani
Sebastian Ruder
I. Ahmad
Idris Abdulmumin
...
Chris C. Emezue
Saheed Abdul
Anuoluwapo Aremu
Alipio Jeorge
P. Brazdil
27
95
0
20 Jan 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
34
153
0
17 Jan 2022
Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning
Aditya Siddhant
Ankur Bapna
Orhan Firat
Yuan Cao
M. Chen
Isaac Caswell
Xavier Garcia
ELM
LRM
31
29
0
09 Jan 2022
Efficient Test Time Adapter Ensembling for Low-resource Language Varieties
Xinyi Wang
Yulia Tsvetkov
Sebastian Ruder
Graham Neubig
25
34
0
10 Sep 2021
Survey of Low-Resource Machine Translation
Barry Haddow
Rachel Bawden
Antonio Valerio Miceli Barone
Jindvrich Helcl
Alexandra Birch
AIMat
29
147
0
01 Sep 2021
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,950
0
20 Apr 2018
1