CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

17 September 2023
Thuat Nguyen
Chien Van Nguyen
Viet Dac Lai
Hieu Man
Nghia Trung Ngo
Franck Dernoncourt
Ryan A. Rossi
Thien Huu Nguyen

Papers citing "CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages"

50 / 65 papers shown
Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2
Vytenis Šliogeris
Povilas Daniušis
Arturas Nakvosas
CLL
32
0
0
09 May 2025
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
Luca Moroni
Giovanni Puccetti
Pere-Lluís Huguet Cabot
Andrei Stefan Bejgu
Edoardo Barba
Alessio Miaschi
F. Dell’Orletta
Andrea Esuli
Roberto Navigli
30
0
0
23 Apr 2025
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Minghao Wu
Weixuan Wang
Sinuo Liu
Huifeng Yin
Xintong Wang
Yu Zhao
Chenyang Lyu
Longyue Wang
Weihua Luo
Kaifu Zhang
ELM
74
0
0
22 Apr 2025
Kuwain 1.5B: An Arabic SLM via Language Injection
Khalil Hennara
Sara Chrouf
Mohamed Motaism Hamed
Zeina Aldallal
Omar Hadid
Safwan AlModhayan
29
1
0
21 Apr 2025
From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time
Mikkel Wildner Kildeberg
Emil Allerslev Schledermann
Nicolaj Larsen
Rob van der Goot
31
0
0
02 Apr 2025
XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation
Vivek Iyer
Ricardo Rei
Pinzhen Chen
Alexandra Birch
SyDa
LM&MA
66
0
0
29 Mar 2025
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Laurie Burchell
Ona de Gibert
Nikolay Arefyev
Mikko Aulamo
Marta Bañón
...
Pavel Stepachev
Jörg Tiedemann
Dušan Variš
Tereza Vojtěchová
Jaume Zaragoza-Bernabeu
43
1
0
13 Mar 2025
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation
Wenhui Zhang
Huiyu Xu
Zhibo Wang
Zeqing He
Ziqi Zhu
Kui Ren
AAML
PILM
67
0
0
09 Mar 2025
EuroBERT: Scaling Multilingual Encoders for European Languages
Nicolas Boizard
Hippolyte Gisserot-Boukhlef
Duarte M. Alves
André F. T. Martins
Ayoub Hammal
...
Maxime Peyrard
Nuno M. Guerreiro
Patrick Fernandes
Ricardo Rei
Pierre Colombo
99
1
0
07 Mar 2025
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
Shahriar Kabir Nahin
R. N. Nandi
Sagor Sarker
Quazi Sarwar Muhtaseem
Md. Kowsher
Apu Chandraw Shill
Md Ibrahim
Mehadi Hasan Menon
Tareq Al Muntasir
Firoj Alam
66
0
0
24 Feb 2025
UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings
Layba Fiaz
Munief Hassan Tahir
Sana Shams
Sarmad Hussain
49
0
0
24 Feb 2025
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
Yingli Shen
Wen Lai
Shuo Wang
Xueren Zhang
Kangyang Luo
Alexander M. Fraser
Maosong Sun
47
0
0
17 Feb 2025
Matina: A Large-Scale 73B Token Persian Text Corpus
Sara Bourbour Hosseinbeigi
Fatemeh Taherinezhad
Heshaam Faili
Hamed Baghbani
Fatemeh Nadi
Mostafa Amiri
71
0
0
13 Feb 2025
Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study
Menglong Cui
Pengzhi Gao
Wei Liu
Jian Luan
Bin Wang
LRM
41
0
0
04 Feb 2025
BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment
Shaolei Zhang
Kehao Zhang
Qingkai Fang
Shoutao Guo
Yan Zhou
Xiaodong Liu
Yang Feng
ALM
69
1
0
25 Nov 2024
DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
Hexuan Deng
Wenxiang Jiao
Xuebo Liu
Min Zhang
Zhaopeng Tu
VLM
75
0
0
21 Nov 2024
Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation
Suhang Wu
Jialong Tang
Baosong Yang
Ante Wang
Kaidi Jia
Jiawei Yu
Junfeng Yao
Jinsong Su
22
1
0
29 Oct 2024
BongLLaMA: LLaMA for Bangla Language
Abdullah Khan Zehady
Safi Al Mamun
Naymul Islam
Santu Karmaker
ALM
19
1
0
28 Oct 2024
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
Shaoxiong Ji
Zihao Li
Indraneil Paul
Jaakko Paavola
Peiqin Lin
...
Dayyán O'Brien
Hengyu Luo
Hinrich Schütze
Jörg Tiedemann
Barry Haddow
CLL
35
3
0
26 Sep 2024
EuroLLM: Multilingual Language Models for Europe
Pedro Henrique Martins
Patrick Fernandes
Joao Alves
Nuno M. Guerreiro
Ricardo Rei
...
Pierre Colombo
Barry Haddow
José G. C. de Souza
Alexandra Birch
André F. T. Martins
27
16
0
24 Sep 2024
Small Language Models: Survey, Measurements, and Insights
Zhenyan Lu
Xiang Li
Dongqi Cai
Rongjie Yi
Fangming Liu
Xiwen Zhang
Nicholas D. Lane
Mengwei Xu
ObjD
LRM
51
36
0
24 Sep 2024
Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task
Gaëtan Caillaut
Raheel Qader
Mariam Nakhlé
Jingshu Liu
Jean-Gabriel Barthélemy
20
1
0
23 Sep 2024
EMMeTT: Efficient Multimodal Machine Translation Training
Piotr Żelasko
Zhehuai Chen
Mengru Wang
Daniel Galvez
Oleksii Hrinchuk
Shuoyang Ding
Ke Hu
Jagadeesh Balam
Vitaly Lavrukhin
Boris Ginsburg
28
1
0
20 Sep 2024
Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
Yungi Kim
Hyunsoo Ha
Sukyung Lee
Jihoo Kim
Seonghoon Yang
Chanjun Park
26
0
0
15 Sep 2024
A Survey of Large Language Models for European Languages
Wazir Ali
S. Pyysalo
39
2
0
27 Aug 2024
Meltemi: The first open Large Language Model for Greek
Leon Voukoutis
Dimitris Roussis
Georgios Paraskevopoulos
Sokratis Sofianopoulos
Prokopis Prokopidis
Vassilis Papavasileiou
Athanasios Katsamanis
Stelios Piperidis
V. Katsouros
VLM
33
7
0
30 Jul 2024
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang
Hou Pong Chan
Yiran Zhao
Mahani Aljunied
Jianyu Wang
...
Zhiqiang Hu
Weiwen Xu
Yew Ken Chia
Xin Li
Li Bing
LRM
52
7
0
29 Jul 2024
mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
Xin Zhang
Yanzhao Zhang
Dingkun Long
Wen Xie
Ziqi Dai
...
Pengjun Xie
Fei Huang
Meishan Zhang
Wenjie Li
Min Zhang
35
73
0
29 Jul 2024
PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment
Jiahuan Li
Shujian Huang
Xinyu Dai
Jiajun Chen
LLMSV
38
5
0
23 Jul 2024
Mitigating Catastrophic Forgetting in Language Transfer via Model Merging
Anton Alexandrov
Veselin Raychev
Mark Niklas Muller
Ce Zhang
Martin Vechev
Kristina Toutanova
MoMe
CLL
KELM
40
13
0
11 Jul 2024
CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation
Krishnakant Bhatt
Karthika N J
Ganesh Ramakrishnan
P. Jyothi
24
1
0
08 Jul 2024
Unlocking the Potential of Model Merging for Low-Resource Languages
Mingxu Tao
Chen Zhang
Quzhe Huang
Tianyao Ma
Songfang Huang
Dongyan Zhao
Yansong Feng
CLL
MoMe
27
3
0
04 Jul 2024
RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Ekaterina Taktasheva
Maxim Bazhukov
Kirill Koncha
Alena Fenogenova
Ekaterina Artemova
Vladislav Mikhailov
37
9
0
27 Jun 2024
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Rishabh Maheshwary
Vikas Yadav
Hoang Nguyen
Khyati Mahajan
Sathwik Tejaswi Madhusudhan
40
3
0
24 Jun 2024
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia
Rahmad Mahendra
Salsabil Maulana Akbar
Lester James Validad Miranda
Jennifer Santoso
...
Genta Indra Winata
Ruochen Zhang
Fajri Koto
Zheng-Xin Yong
Samuel Cahyawijaya
77
9
0
14 Jun 2024
Decoding the Diversity: A Review of the Indic AI Research Landscape
Sankalp KJ
Vinija Jain
S. Bhaduri
Tamoghna Roy
Aman Chadha
47
5
0
13 Jun 2024
X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions
Chong Li
Wen Yang
Jiajun Zhang
Jinliang Lu
Shaonan Wang
Chengqing Zong
39
6
0
30 May 2024
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
Ge Zhang
Scott Qu
Jiaheng Liu
Chenchen Zhang
Chenghua Lin
...
Zi-Kai Zhao
Jiajun Zhang
Wanli Ouyang
Wenhao Huang
Wenhu Chen
ELM
43
44
0
29 May 2024
Token-wise Influential Training Data Retrieval for Large Language Models
Huawei Lin
Jikai Long
Zhaozhuo Xu
Weijie Zhao
42
3
0
20 May 2024
OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Mihai Masala
Denis C. Ilie-Ablachim
D. Corlatescu
Miruna Zavelca
Marius Leordeanu
Horia Velicu
Marius Popescu
Mihai Dascalu
Traian Rebedea
35
2
0
13 May 2024
ChuXin: 1.6B Technical Report
Xiaomin Zhuang
Yufan Jiang
Qiaozhi He
Zhihua Wu
ALM
41
0
0
08 May 2024
Impact of emoji exclusion on the performance of Arabic sarcasm detection models
Ghalyah H. Aleryani
Wael A. Deabes
Khaled Albishre
Alaa E. Abdel-Hakim
20
0
0
03 May 2024
Introducing cosmosGPT: Monolingual Training for Turkish Language Models
Himmet Toprak Kesgin
M. K. Yuce
Eren Dogan
M. E. Uzun
Atahan Uz
H. E. Seyrek
Ahmed Zeer
M. Amasyalı
38
9
0
26 Apr 2024
SambaLingo: Teaching Large Language Models New Languages
Zoltan Csaki
Bo Li
Jonathan Li
Qiantong Xu
Pian Pawakapan
Leon Zhang
Yun Du
Hengyu Zhao
Changran Hu
Urmish Thakker
32
6
0
08 Apr 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin
Qiguang Chen
Yuhang Zhou
Zhi Chen
Yinghui Li
Lizi Liao
Min Li
Wanxiang Che
Philip S. Yu
LRM
47
36
0
07 Apr 2024
A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias
Yuemei Xu
Ling Hu
Jiayi Zhao
Zihan Qiu
Yuqi Ye
Hanwen Gu
LRM
19
36
0
01 Apr 2024
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
Taishi Nakamura
Mayank Mishra
Simone Tedeschi
Yekun Chai
Jason T Stillerman
...
Virendra Mehta
Matthew Blumberg
Victor May
Huu Nguyen
S. Pyysalo
LRM
21
7
0
30 Mar 2024
Latxa: An Open Language Model and Evaluation Suite for Basque
Julen Etxaniz
Oscar Sainz
Naiara Pérez
Itziar Aldabe
German Rigau
Eneko Agirre
Aitor Ormazabal
Mikel Artetxe
A. Soroa
ELM
34
22
0
29 Mar 2024
A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert
Graeme Nail
Nikolay Arefyev
Marta Bañón
Jelmer van der Linde
...
Gema Ramírez-Sánchez
Andrey Kutuzov
S. Pyysalo
Stephan Oepen
Jörg Tiedemann
VLM
28
20
0
20 Mar 2024
Yi: Open Foundation Models by 01.AI
01.AI
Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLM
LRM
121
497
0
07 Mar 2024