One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

24 March 2022
Alham Fikri Aji
Genta Indra Winata
Fajri Koto
Samuel Cahyawijaya
Ade Romadhony
Rahmad Mahendra
Kemal Kurniawan
David Moeljadi
Radityo Eko Prasojo
Timothy Baldwin
Jey Han Lau
Sebastian Ruder

Papers citing "One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia"

50 / 50 papers shown
Title
Improving Informally Romanized Language Identification
Adrian Benton
Alexander Gutkin
Christo Kirov
Brian Roark
38
0
0
30 Apr 2025
HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs
Tsz Chung Cheng
Chung Shing Cheng
Chaak Ming Lau
Eugene Tin-Ho Lam
Chun Yat Wong
Hoi On Yu
Cheuk Hei Chong
ELM
53
1
0
16 Mar 2025
Designing Speech Technologies for Australian Aboriginal English: Opportunities, Risks and Participation
Ben Hutchinson
Celeste Rodríguez Louro
Glenys Collard
Ned Cooper
57
0
0
05 Mar 2025
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Muhammad Farid Adilazuarda
M. Wijanarko
Lucky Susanto
Khumaisa Nur'aini
Derry Wijaya
Alham Fikri Aji
47
0
0
25 Feb 2025
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages
Jia Guo
Longxu Dou
Guangtao Zeng
Stanley Kok
Wei Lu
Qian Liu
ELM
LRM
65
1
0
02 Dec 2024
Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models
Garry Kuwanto
Chaitanya Agarwal
Genta Indra Winata
Derry Wijaya
44
1
0
30 Oct 2024
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang
Hou Pong Chan
Yiran Zhao
Mahani Aljunied
Jianyu Wang
...
Zhiqiang Hu
Weiwen Xu
Yew Ken Chia
Xin Li
Li Bing
LRM
41
0
0
29 Jul 2024
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Orevaoghene Ahia
Anuoluwapo Aremu
Diana Abagyan
Hila Gonen
David Ifeoluwa Adelani
Daud Abolade
Noah A. Smith
Yulia Tsvetkov
46
3
0
27 Jun 2024
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia
Rahmad Mahendra
Salsabil Maulana Akbar
Lester James Validad Miranda
Jennifer Santoso
...
Genta Indra Winata
Ruochen Zhang
Fajri Koto
Zheng-Xin Yong
Samuel Cahyawijaya
69
9
0
14 Jun 2024
IndoCulture: Exploring Geographically-Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces
Fajri Koto
Rahmad Mahendra
Nurul Aisyah
Timothy Baldwin
LRM
56
16
0
02 Apr 2024
Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages
Joanito Agili Lopo
Radius Tanone
33
2
0
01 Apr 2024
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Samuel Cahyawijaya
Holy Lovenia
Pascale Fung
33
32
0
25 Mar 2024
Simple Hack for Transformers against Heavy Long-Text Classification on a Time- and Memory-Limited GPU Service
Mirza Alim Mutasodirin
Radityo Eko Prasojo
Achmad F. Abka
Hanif Rasyidi
VLM
12
0
0
19 Mar 2024
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Wilson Wongso
David Samuel Setiawan
Steven Limcorn
Ananto Joyoadikusumo
19
1
0
04 Mar 2024
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Rifki Afina Putri
Faiz Ghifari Haznitrama
Dea Adhista
Alice H. Oh
37
14
0
27 Feb 2024
Could We Have Had Better Multilingual LLMs If English Was Not the Central Language?
Ryandito Diandaru
Lucky Susanto
Zilu Tang
Ayu Purwarianti
Derry Wijaya
20
1
0
21 Feb 2024
Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model
Zhiwei He
Xing Wang
Wenxiang Jiao
Zhuosheng Zhang
Rui Wang
Shuming Shi
Zhaopeng Tu
ALM
21
24
0
23 Jan 2024
Natural Language Processing for Dialects of a Language: A Survey
Aditya Joshi
Raj Dabre
Diptesh Kanojia
Zhuang Li
Haolan Zhan
Gholamreza Haffari
Doris Dippold
LM&MA
10
27
0
11 Jan 2024
IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages
Muhammad Farid Adilazuarda
Samuel Cahyawijaya
Genta Indra Winata
Pascale Fung
Ayu Purwarianti
24
11
0
21 Nov 2023
A Material Lens on Coloniality in NLP
William B. Held
Camille Harris
Michael Best
Diyi Yang
10
11
0
14 Nov 2023
Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia
Lucky Susanto
Ryandito Diandaru
Adila Alfa Krisnadhi
Ayu Purwarianti
Derry Wijaya
11
0
0
02 Nov 2023
IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems
Muhammad Dehan Al Kautsar
Rahmah Khoirussyifa' Nurdini
Samuel Cahyawijaya
Genta Indra Winata
Ayu Purwarianti
11
0
0
02 Nov 2023
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
A. Seza Doğruöz
Sunayana Sitaram
Zheng-Xin Yong
19
13
0
31 Oct 2023
Quantifying the Dialect Gap and its Correlates Across Languages
Anjali Kantharuban
Ivan Vulić
Anna Korhonen
54
19
0
23 Oct 2023
CebuaNER: A New Baseline Cebuano Named Entity Recognition Model
Ma. Beatrice Emanuela Pilar
Ellyza Mari Papas
Mary Loise Buenaventura
Dane Dedoroy
M. D. Montefalcon
Jay Rhald Padilla
Lany L. Maceda
Mideth B. Abisado
Joseph Marvin Imperial
8
1
0
01 Oct 2023
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Samuel Cahyawijaya
Holy Lovenia
Fajri Koto
Dea Adhista
Emmanuel Dave
...
Genta Indra Winata
David Moeljadi
Alham Fikri Aji
Ayu Purwarianti
Pascale Fung
34
7
0
19 Sep 2023
AlbNER: A Corpus for Named Entity Recognition in Albanian
Erion Çano
14
1
0
15 Sep 2023
Lexical Diversity in Kinship Across Languages and Dialects
H. Khalilia
Gábor Bella
Abed Alhakim Freihat
Shandy Darma
Fausto Giunchiglia
6
7
0
24 Aug 2023
Multi-lingual and Multi-cultural Figurative Language Understanding
Anubha Kabra
Emmy Liu
Simran Khanuja
Alham Fikri Aji
Genta Indra Winata
Samuel Cahyawijaya
Anuoluwapo Aremu
Perez Ogayo
Graham Neubig
6
26
0
25 May 2023
Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation
Haonan Li
Fajri Koto
Minghao Wu
Alham Fikri Aji
Timothy Baldwin
ALM
14
73
0
24 May 2023
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
Akari Asai
Sneha Kudugunta
Xinyan Velocity Yu
Terra Blevins
Hila Gonen
Machel Reid
Yulia Tsvetkov
Sebastian Ruder
Hannaneh Hajishirzi
15
53
0
24 May 2023
InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning
Samuel Cahyawijaya
Holy Lovenia
Tiezheng Yu
Willy Chung
Pascale Fung
ALM
36
14
0
23 May 2023
A Survey of Corpora for Germanic Low-Resource Languages and Dialects
Verena Blaschke
Hinrich Schütze
Barbara Plank
9
13
0
19 Apr 2023
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Zheng-Xin Yong
Ruochen Zhang
Jessica Zosa Forde
Skyler Wang
Arjun Subramonian
...
Yinghua Tan
Long Phan
Rowena Garcia
Thamar Solorio
Alham Fikri Aji
LRM
33
28
0
23 Mar 2023
Fairness in Language Models Beyond English: Gaps and Challenges
Krithika Ramesh
Sunayana Sitaram
Monojit Choudhury
14
23
0
24 Feb 2023
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Yejin Bang
Samuel Cahyawijaya
Nayeon Lee
Wenliang Dai
Dan Su
...
Tiezheng Yu
Willy Chung
Quyet V. Do
Yan Xu
Pascale Fung
ReLM
LRM
11
1,311
0
08 Feb 2023
A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies
A. Seza Doğruöz
Sunayana Sitaram
Barbara E. Bullock
Almeida Jacqueline Toribio
55
72
0
05 Jan 2023
SERENGETI: Massively Multilingual Language Models for Africa
Ife Adebara
AbdelRahim Elmadany
Muhammad Abdul-Mageed
Alcides Alcoba Inciarte
12
29
0
21 Dec 2022
The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges
Genta Indra Winata
Alham Fikri Aji
Zheng-Xin Yong
Thamar Solorio
17
31
0
19 Dec 2022
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya
Holy Lovenia
Alham Fikri Aji
Genta Indra Winata
Bryan Wilie
...
Timothy Baldwin
Sebastian Ruder
Herry Sujaini
S. Sakti
Ayu Purwarianti
11
47
0
19 Dec 2022
Multilingual Relation Classification via Efficient and Effective Prompting
Yuxuan Chen
David Harbecke
Leonhard Hennig
LRM
16
11
0
25 Oct 2022
Rethinking Round-Trip Translation for Machine Translation Evaluation
Terry Yue Zhuo
Qiongkai Xu
Xuanli He
Trevor Cohn
LRM
9
2
0
15 Sep 2022
NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages
Samuel Cahyawijaya
Alham Fikri Aji
Holy Lovenia
Genta Indra Winata
Bryan Wilie
...
Fajri Koto
David Moeljadi
Karissa Vincentio
Ade Romadhony
Ayu Purwarianti
24
5
0
21 Jul 2022
Location-based Twitter Filtering for the Creation of Low-Resource Language Datasets in Indonesian Local Languages
Mukhlis Amien
Chong Feng
Heyan Huang
8
3
0
15 Jun 2022
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Genta Indra Winata
Alham Fikri Aji
Samuel Cahyawijaya
Rahmad Mahendra
Fajri Koto
...
Pascale Fung
Timothy Baldwin
Jey Han Lau
Rico Sennrich
Sebastian Ruder
21
77
0
31 May 2022
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Zaid Alyafeai
Maraim Masoud
Mustafa Ghaleb
Maged S. Al-Shaibani
28
21
0
13 Oct 2021
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
E. Ponti
Siva Reddy
Nigel Collier
Desmond Elliott
VLM
LRM
87
167
0
28 Sep 2021
'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP
Anna Rogers
Timothy Baldwin
Kobi Leins
99
64
0
14 Sep 2021
IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization
Fajri Koto
Jey Han Lau
Timothy Baldwin
VLM
49
82
0
10 Sep 2021
Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences
Genta Indra Winata
Andrea Madotto
Chien-Sheng Wu
Pascale Fung
SyDa
124
92
0
18 Sep 2019