Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2210.14712
Cited By
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
26 October 2022
Colin Leong
Joshua Nemecek
Jacob Mansdorfer
Anna Filighera
A. Owodunni
Daniel Whitenack
VLM
AI4CE
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks"
18 / 18 papers shown
Title
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
Yingli Shen
Wen Lai
Shuo Wang
Xueren Zhang
Kangyang Luo
Alexander M. Fraser
Maosong Sun
47
0
0
17 Feb 2025
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schutze
VLM
36
5
0
31 Oct 2024
The Zeno's Paradox of `Low-Resource' Languages
H. Nigatu
A. Tonja
Benjamin Rosman
Thamar Solorio
Monojit Choudhury
111
5
0
28 Oct 2024
State of NLP in Kenya: A Survey
Cynthia Jayne Amol
Everlyn Asiko Chimoto
Rose Delilah Gesicho
Antony M. Gitau
Naome A. Etori
...
Catherine Gitau
Antony Ndolo
Lilian D. A. Wanzare
Albert Njoroge Kahira
Ronald Tombe
21
1
0
13 Oct 2024
Goldfish: Monolingual Language Models for 350 Languages
Tyler A. Chang
Catherine Arnett
Zhuowen Tu
Benjamin Bergen
LRM
36
4
0
19 Aug 2024
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia
Rahmad Mahendra
Salsabil Maulana Akbar
Lester James Validad Miranda
Jennifer Santoso
...
Genta Indra Winata
Ruochen Zhang
Fajri Koto
Zheng-Xin Yong
Samuel Cahyawijaya
79
9
0
14 Jun 2024
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral
A. Zebaze
Pedro Ortiz Suarez
Julien Abadji
Rémi Lacroix
Cordelia Schmid
Rachel Bawden
Benoît Sagot
39
3
0
13 Jun 2024
An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance
Simran Khanuja
Sathyanarayanan Ramamoorthy
Yueqi Song
Graham Neubig
DiffM
20
11
0
01 Apr 2024
Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences
Jesse Atuhurra
Hidetaka Kamigaito
39
0
0
31 Mar 2024
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Samuel Cahyawijaya
Holy Lovenia
Pascale Fung
38
34
0
25 Mar 2024
Multi-dimensional data refining strategy for effective fine-tuning LLMs
Thanh Nguyen Ngoc
Q. Tran
Arthur Tang
Bao Nguyen
Thuy Nguyen
Thanh Pham
14
0
0
02 Nov 2023
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Sneha Kudugunta
Isaac Caswell
Biao Zhang
Xavier Garcia
Christopher A. Choquette-Choo
...
Derrick Xin
Aditya Kusupati
Romi Stella
Ankur Bapna
Orhan Firat
59
118
0
09 Sep 2023
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
Claytone Sikasote
Eunice Mukonde
Md Mahfuz Ibn Alam
Antonios Anastasopoulos
18
6
0
26 May 2023
This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models
Bryan Li
Samar Haider
Chris Callison-Burch
15
16
0
24 May 2023
LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages
M. Agarwal
Md Mahfuz Ibn Alam
Antonios Anastasopoulos
25
5
0
23 May 2023
Scaling Speech Technology to 1,000+ Languages
Vineel Pratap
Andros Tjandra
Bowen Shi
Paden Tomasello
Arun Babu
...
Yossi Adi
Xiaohui Zhang
Wei-Ning Hsu
Alexis Conneau
Michael Auli
VLM
75
298
0
22 May 2023
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Ayyoob Imani
Peiqin Lin
Amir Hossein Kargaran
Silvia Severini
Masoud Jalili Sabet
...
Chunlan Ma
Helmut Schmid
André F. T. Martins
François Yvon
Hinrich Schütze
ALM
LRM
31
95
0
20 May 2023
Systematic Inequalities in Language Technology Performance across the World's Languages
Damián E. Blasi
Antonios Anastasopoulos
Graham Neubig
111
131
0
13 Oct 2021
1