Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

20 May 2023
Ayyoob Imani
Peiqin Lin
Amir Hossein Kargaran
Silvia Severini
Masoud Jalili Sabet
Nora Kassner
Chunlan Ma
Helmut Schmid
André F. T. Martins
François Yvon
Hinrich Schütze
    ALM
    LRM

Papers citing "Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages"

50 / 80 papers shown
HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
Enes Özeren
Yihong Liu
Hinrich Schütze
25
0
0
21 Apr 2025
Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Monojit Choudhury
Shivam Chauhan
Rocktim Jyoti Das
Dhruv Sahnan
Xudong Han
...
Rituraj Joshi
Gurpreet Gosal
Avraham Sheinin
Natalia Vassilieva
Preslav Nakov
16
0
0
08 Apr 2025
Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation
Toqeer Ehsan
Thamar Solorio
36
0
0
07 Apr 2025
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Roussel Rahman
ReLM
ELM
LRM
46
0
0
31 Mar 2025
PAD: Towards Efficient Data Generation for Transfer Learning Using Phrase Alignment
Jong Myoung Kim
Young-Jun Lee
Ho-Jin Choi
Sangkeun Jung
53
0
0
24 Mar 2025
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Laurie Burchell
Ona de Gibert
Nikolay Arefyev
Mikko Aulamo
Marta Bañón
...
Pavel Stepachev
Jörg Tiedemann
Dušan Variš
Tereza Vojtěchová
Jaume Zaragoza-Bernabeu
38
1
0
13 Mar 2025
Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
A. Zebaze
Benoît Sagot
Rachel Bawden
67
0
0
06 Mar 2025
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Fajri Koto
Rituraj Joshi
Nurdaulet Mukhituly
Y. Wang
Zhuohan Xie
...
Avraham Sheinin
Natalia Vassilieva
Neha Sengupta
Larry Murray
Preslav Nakov
ALM
KELM
36
0
0
03 Mar 2025
Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang
Yao Lu
Maurice Weber
Max Ryabinin
David Ifeoluwa Adelani
Yihong Chen
Raphael Tang
Pontus Stenetorp
LRM
67
2
0
20 Feb 2025
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
Yingli Shen
Wen Lai
Shuo Wang
Xueren Zhang
Kangyang Luo
Alexander M. Fraser
Maosong Sun
47
0
0
17 Feb 2025
Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models
Sina Bagheri Nezhad
Ameeta Agrawal
Rhitabrat Pokharel
LRM
74
2
0
17 Dec 2024
A Practical Guide to Fine-tuning Language Models with Limited Data
Márton Szép
Daniel Rueckert
Rüdiger von Eisenhart-Rothe
Florian Hinterwimmer
SyDa
ALM
35
2
0
14 Nov 2024
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schütze
VLM
26
5
0
31 Oct 2024
Troubling Taxonomies in GenAI Evaluation
Glen Berman
Ned Cooper
Wesley Hanwen Deng
Ben Hutchinson
31
0
0
30 Oct 2024
The Zeno's Paradox of 'Low-Resource' Languages
H. Nigatu
A. Tonja
Benjamin Rosman
Thamar Solorio
Monojit Choudhury
26
5
0
28 Oct 2024
State of NLP in Kenya: A Survey
Cynthia Jayne Amol
Everlyn Asiko Chimoto
Rose Delilah Gesicho
Antony M. Gitau
Naome A. Etori
...
Catherine Gitau
Antony Ndolo
Lilian D. A. Wanzare
Albert Njoroge Kahira
Ronald Tombe
16
1
0
13 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolò Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
65
2
0
11 Oct 2024
NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models
William Tan
Kevin Zhu
15
0
0
10 Oct 2024
How Transliterations Improve Crosslingual Alignment
Yihong Liu
Mingyang Wang
Amir Hossein Kargaran
Ayyoob Imani
Orgest Xhelili
Haotian Ye
Chunlan Ma
François Yvon
Hinrich Schütze
23
2
0
25 Sep 2024
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis
Daoyang Li
Mingyu Jin
Qingcheng Zeng
Mengnan Du
45
2
0
22 Sep 2024
SpeechTaxi: On Multilingual Semantic Speech Classification
Lennart Keller
Goran Glavaš
16
0
0
10 Sep 2024
Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation
Vivek Iyer
Bhavitvya Malik
Pavel Stepachev
Pinzhen Chen
Barry Haddow
Alexandra Birch
ALM
16
3
0
23 Aug 2024
Goldfish: Monolingual Language Models for 350 Languages
Tyler A. Chang
Catherine Arnett
Zhuowen Tu
Benjamin Bergen
LRM
28
4
0
19 Aug 2024
Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
Yongxin Huang
Kexin Wang
Goran Glavaš
Iryna Gurevych
39
0
0
20 Jul 2024
A Principled Framework for Evaluating on Typologically Diverse Languages
Esther Ploeger
Wessel Poelman
Andreas Holck Høeg-Petersen
Anders Schlichtkrull
Miryam de Lhoneux
Johannes Bjerva
31
1
0
06 Jul 2024
Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts
Chunlan Ma
Yihong Liu
Haotian Ye
Hinrich Schütze
19
1
0
02 Jul 2024
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models
Peiqin Lin
André F. T. Martins
Hinrich Schütze
41
2
0
29 Jun 2024
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Orgest Xhelili
Yihong Liu
Hinrich Schütze
18
6
0
28 Jun 2024
Low-Resource Machine Translation through the Lens of Personalized Federated Learning
Viktor Moskvoretskii
N. Tupitsa
Chris Biemann
Samuel Horváth
Eduard A. Gorbunov
Irina Nikishina
FedML
18
0
0
18 Jun 2024
Decoding the Diversity: A Review of the Indic AI Research Landscape
Sankalp KJ
Vinija Jain
S. Bhaduri
Tamoghna Roy
Aman Chadha
34
5
0
13 Jun 2024
MINERS: Multilingual Language Models as Semantic Retrievers
Genta Indra Winata
Ruochen Zhang
David Ifeoluwa Adelani
RALM
31
5
0
11 Jun 2024
Targeted Multilingual Adaptation for Low-resource Language Families
C.M. Downey
Terra Blevins
Dhwani Serai
Dwija Parikh
Shane Steinert-Threlkeld
19
0
0
20 May 2024
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Yihong Liu
Chunlan Ma
Haotian Ye
Hinrich Schütze
21
2
0
16 May 2024
UCCIX: Irish-eXcellence Large Language Model
Khanh-Tung Tran
Barry O’Sullivan
Hoang D. Nguyen
25
3
0
13 May 2024
Natural Language Processing RELIES on Linguistics
Juri Opitz
Shira Wein
Nathan Schneider
AI4CE
38
7
0
09 May 2024
XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples
Peiqin Lin
André F. T. Martins
Hinrich Schütze
RALM
39
2
0
08 May 2024
What Drives Performance in Multilingual Language Models?
Sina Bagheri Nezhad
Ameeta Agrawal
LRM
25
9
0
29 Apr 2024
Toxicity Classification in Ukrainian
Daryna Dementieva
Valeriia Khylenko
N. Babakov
Georg Groh
14
1
0
27 Apr 2024
Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing
Ben Hutchinson
61
0
0
23 Apr 2024
Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain
Iker García-Ferrero
Rodrigo Agerri
Aitziber Atutxa Salazar
Elena Cabrio
Iker de la Iglesia
...
Johana Ramirez-Romero
German Rigau
J. M. Villa-Gonzalez
S. Villata
Andrea Zaninello
73
19
0
11 Apr 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin
Qiguang Chen
Yuhang Zhou
Zhi Chen
Yinghui Li
Lizi Liao
Min Li
Wanxiang Che
Philip S. Yu
LRM
47
35
0
07 Apr 2024
Large Language Model (LLM) AI text generation detection based on transformer deep learning algorithm
Yuhong Mo
Hao Qin
Yushan Dong
Ziyi Zhu
Zhenglin Li
DeLMO
13
40
0
06 Apr 2024
MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness
Shijia Zhou
Huangyan Shan
Barbara Plank
Robert Litschko
16
2
0
03 Apr 2024
ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model
Osvaldo Luamba Quinjica
David Ifeoluwa Adelani
14
0
0
03 Apr 2024
A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert
Graeme Nail
Nikolay Arefyev
Marta Bañón
Jelmer van der Linde
...
Gema Ramírez-Sánchez
Andrey Kutuzov
S. Pyysalo
Stephan Oepen
Jörg Tiedemann
VLM
23
20
0
20 Mar 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
Carolin Holtermann
Paul Röttger
Timm Dill
Anne Lauscher
ELM
LRM
16
10
0
06 Mar 2024
Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet?
E. Razumovskaia
Ivan Vulić
Anna Korhonen
16
5
0
04 Mar 2024
The Hidden Space of Transformer Language Adapters
Jesujoba Oluwadara Alabi
Marius Mosbach
Matan Eyal
Dietrich Klakow
Mor Geva
35
7
1
20 Feb 2024
Shallow Synthesis of Knowledge in GPT-Generated Texts: A Case Study in Automatic Related Work Composition
Anna Martin-Boyle
Aahan Tyagi
Marti A. Hearst
Dongyeop Kang
21
7
0
19 Feb 2024
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
A. Üstün
Viraat Aryabumi
Zheng-Xin Yong
Wei-Yin Ko
Daniel D'souza
...
Shayne Longpre
Niklas Muennighoff
Marzieh Fadaee
Julia Kreutzer
Sara Hooker
ALM
ELM
SyDa
LRM
27
192
0
12 Feb 2024