
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

17 September 2023
Thuat Nguyen
Chien Van Nguyen
Viet Dac Lai
Hieu Man
Nghia Trung Ngo
Franck Dernoncourt
Ryan A. Rossi
Thien Huu Nguyen

Papers citing "CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages"

15 / 65 papers shown
Title
Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family
Rodrigo Santos
João Rodrigues
Luís Gomes
Joao Silva
António Branco
Henrique Lopes Cardoso
T. Osório
Bernardo Leite
33
8
0
04 Mar 2024
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Wilson Wongso
David Samuel Setiawan
Steven Limcorn
Ananto Joyoadikusumo
32
1
0
04 Mar 2024
Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models
Nguyen Quang Duc
Le Hai Son
Nguyen Duc Nhan
Nguyen Dich Nhat Minh
Le Thanh Huong
D. V. Sang
3DV
RALM
25
2
0
03 Mar 2024
Towards Building Multilingual Language Model for Medicine
Pengcheng Qiu
Chaoyi Wu
Xiaoman Zhang
Weixiong Lin
Haicheng Wang
Ya-Qin Zhang
Yanfeng Wang
Weidi Xie
LM&MA
ELM
33
64
0
21 Feb 2024
TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese
N. Corrêa
Sophia Falk
Shiza Fatimah
Aniket Sen
N. D. Oliveira
20
9
0
30 Jan 2024
Text Embedding Inversion Security for Multilingual Language Models
Yiyi Chen
Heather Lent
Johannes Bjerva
19
14
0
22 Jan 2024
Adapting Large Language Models for Document-Level Machine Translation
Minghao Wu
Thuy-Trang Vu
Lizhen Qu
George F. Foster
Gholamreza Haffari
80
42
0
12 Jan 2024
VinaLLaMA: LLaMA-based Vietnamese Foundation Model
Quan Van Nguyen
Huy Quang Pham
Dung Dao
ALM
16
8
0
18 Dec 2023
MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China
Chen Zhang
Mingxu Tao
Quzhe Huang
Jiuheng Lin
Zhibin Chen
Yansong Feng
25
2
0
14 Nov 2023
Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback
Viet Dac Lai
Chien Van Nguyen
Nghia Trung Ngo
Thuat Nguyen
Franck Dernoncourt
Ryan A. Rossi
Thien Huu Nguyen
ALM
38
127
0
29 Jul 2023
ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
Viet Dac Lai
Nghia Trung Ngo
Amir Pouran Ben Veyseh
Hieu Man
Franck Dernoncourt
Trung Bui
Thien Huu Nguyen
ELM
LM&MA
25
267
0
12 Apr 2023
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
590
0
14 Jul 2021
Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models
Alex Tamkin
Miles Brundage
Jack Clark
Deep Ganguli
AILaw
ELM
192
258
0
04 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
243
1,817
0
17 Sep 2019