arXiv: 2303.03915
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset


7 March 2023
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
Teven Le Scao
Leandro von Werra
Chenghao Mou
E. G. Ponferrada
Huu Nguyen
Jörg Frohberg
Mario Šaško
Quentin Lhoest
Angelina McMillan-Major
Gérard Dupont
Stella Biderman
Anna Rogers
Loubna Ben Allal
F. Toni
Giada Pistilli
Olivier Nguyen
Somaieh Nikpoor
Maraim Masoud
Pierre Colombo
Javier de la Rosa
Paulo Villegas
Tristan Thrush
Shayne Longpre
Sebastian Nagel
Leon Weber
M. Muñoz
Jian Zhu
Daniel Alexander van Strien
Zaid Alyafeai
Khalid Almubarak
Minh Chien Vu
Itziar Gonzalez-Dios
Aitor Soroa Etxabe
Kyle Lo
Manan Dey
Pedro Ortiz Suarez
Aaron Gokaslan
Shamik Bose
David Ifeoluwa Adelani
Long Phan
H. Tran
I. Yu
S. Pai
Jenny Chim
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
    AI4CE
    AILaw

Papers citing "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset"

31 / 31 papers shown
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
55
0
0
23 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
40
0
0
01 Apr 2025
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
81
0
0
15 Mar 2025
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schütze
VLM
36
5
0
31 Oct 2024
Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen
Philip Arthur
Qianyu Feng
Cong Duy Vu Hoang
Yu-Heng Hong
...
Mark Johnson
K. K.
Don Dharmasiri
Long Duong
Yuan-Fang Li
SyDa
46
1
0
16 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolò Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Köhler
Johannes Leveling
VLM
79
2
0
11 Oct 2024
Improving Pretraining Data Using Perplexity Correlations
Tristan Thrush
Christopher Potts
Tatsunori Hashimoto
32
17
0
09 Sep 2024
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Rishabh Maheshwary
Vikas Yadav
Hoang Nguyen
Khyati Mahajan
Sathwik Tejaswi Madhusudhan
40
3
0
24 Jun 2024
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Hoyeon Chang
Jinho Park
Seonghyeon Ye
Sohee Yang
Youngkyung Seo
Du-Seong Chang
Minjoon Seo
KELM
28
30
0
17 Jun 2024
Datasets for Multilingual Answer Sentence Selection
Matteo Gabburo
S. Campese
Federico Agostini
Alessandro Moschitti
26
0
0
14 Jun 2024
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Zachary Ankner
Cody Blakeney
Kartik K. Sreenivasan
Max Marion
Matthew L. Leavitt
Mansheej Paul
30
23
0
30 May 2024
The Mosaic Memory of Large Language Models
Igor Shilov
Matthieu Meeus
Yves-Alexandre de Montjoye
39
3
0
24 May 2024
Where does In-context Translation Happen in Large Language Models
Suzanna Sia
David Mueller
Kevin Duh
LRM
33
0
0
07 Mar 2024
PALO: A Polyglot Large Multimodal Model for 5B People
Muhammad Maaz
H. Rasheed
Abdelrahman M. Shaker
Salman Khan
Hisham Cholakkal
Rao M. Anwer
Timothy Baldwin
M. Felsberg
Fahad S. Khan
VLM
LRM
53
12
0
22 Feb 2024
CBQ: Cross-Block Quantization for Large Language Models
Xin Ding
Xiaoyu Liu
Zhijun Tu
Yun-feng Zhang
Wei Li
...
Hanting Chen
Yehui Tang
Zhiwei Xiong
Baoqun Yin
Yunhe Wang
MQ
22
11
0
13 Dec 2023
MELA: Multilingual Evaluation of Linguistic Acceptability
Ziyin Zhang
Yikang Liu
Wei Huang
Junyu Mao
Rui Wang
Hai Hu
22
3
0
15 Nov 2023
Large Language Models as Superpositions of Cultural Perspectives
Grgur Kovač
Masataka Sawayama
Rémy Portelas
Cédric Colas
Peter Ford Dominey
Pierre-Yves Oudeyer
LLMAG
20
32
0
15 Jul 2023
GPT-SW3: An Autoregressive Language Model for the Nordic Languages
Ariel Ekgren
Amaru Cuba Gyllensten
Felix Stollenwerk
Joey Öhman
T. Isbister
Evangelia Gogoulou
F. Carlsson
Alice Heiman
Judit Casademont
Magnus Sahlgren
16
13
0
22 May 2023
Prompting with Pseudo-Code Instructions
Mayank Mishra
Prince Kumar
Riyaz Ahmad Bhat
V. Rudramurthy
Danish Contractor
Srikanth G. Tamilselvam
30
13
0
19 May 2023
How Good are Commercial Large Language Models on African Languages?
Jessica Ojo
Kelechi Ogueji
19
5
0
11 May 2023
LACoS-BLOOM: Low-rank Adaptation with Contrastive objective on 8 bits Siamese-BLOOM
Wenhui Hua
Brian Williams
Davood Shamsi
19
3
0
10 May 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
22
39
0
07 Apr 2023
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
Zheng-Xin Yong
Hailey Schoelkopf
Niklas Muennighoff
Alham Fikri Aji
David Ifeoluwa Adelani
...
Genta Indra Winata
Stella Biderman
Edward Raff
Dragomir R. Radev
Vassilina Nikoulina
CLL
VLM
AI4CE
LRM
27
81
0
19 Dec 2022
Large Language Models Struggle to Learn Long-Tail Knowledge
Nikhil Kandpal
H. Deng
Adam Roberts
Eric Wallace
Colin Raffel
RALM
KELM
31
373
0
15 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
27
2,297
0
09 Nov 2022
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Zaid Alyafeai
Maraim Masoud
Mustafa Ghaleb
Maged S. Al-Shaibani
36
25
0
13 Oct 2021
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé
C. Carrino
Carlos Rodríguez-Penagos
Ona de Gibert Bonet
Carme Armentano-Oller
Aitor Gonzalez-Agirre
Maite Melero
Marta Villegas
58
42
0
16 Jul 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
588
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
245
1,977
0
31 Dec 2020
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi
Yuhao Zhang
Yuhui Zhang
Jason Bolton
Christopher D. Manning
AI4TS
193
1,638
0
16 Mar 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
226
4,424
0
23 Jan 2020