Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

Annual Meeting of the Association for Computational Linguistics (ACL), 2022
11 December 2022
Sumanth Doddapaneni
Rahul Aralikatte
Gowtham Ramesh
Shreyansh Goyal
Mitesh M. Khapra
Anoop Kunchukuttan
Pratyush Kumar
    ELM

Papers citing "Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages"

50 / 65 papers shown
ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages
Neha Joshi
Pamir Gogoi
Aasim Mirza
Aayush Jansari
Aditya Yadavalli
Ayushi Pandey
Arunima Shukla
Deepthi Sudharsan
Kalika Bali
Vivek Seshadri
40
0
0
30 Nov 2025
IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages
Ayush Maheshwari
Kaushal Sharma
Vivek Patel
Aditya Maheshwari
ELM
69
0
0
29 Nov 2025
Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
Firoj Ahmmed Patwary
Abdullah Al Noman
52
0
0
07 Nov 2025
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana
Arul Menezes
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
108
0
0
05 Nov 2025
Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Pooja Singh
Shashwat Bhardwaj
V. Sharma
Sandeep Kumar
92
0
0
01 Nov 2025
Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA
Sandipan Majhi
Paheli Bhattacharya
68
0
0
29 Oct 2025
Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
Neel Prabhanjan Rachamalla
Aravind Konakalla
Gautam Rajeev
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
124
0
0
08 Oct 2025
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse
Sebastian Ruder
Tony Lin
Oksana Kurylo
Haruka Takagi
Janice Lam
Nicolò Busetto
Denise Diaz
Francisco Guzmán
128
0
0
30 Sep 2025
DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture
Arijit Maji
Raghvendra Kumar
Akash Ghosh
Anushka
Nemil Shah
Abhilekh Borah
Vanshika Shah
Nishant Mishra
Sriparna Saha
VLM
144
2
0
23 Sep 2025
Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia - Current Stage and Challenges
Sampoorna Poria
Xiaolei Huang
184
0
0
15 Sep 2025
Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis
Anusha Kamath
Kanishk Singla
Rakesh Paul
Raviraj Joshi
Utkarsh Vaidya
Sanjay Singh Chauhan
Niranjan Wartikar
108
0
0
27 Aug 2025
Quantifying Language Disparities in Multilingual Large Language Models
Songbo Hu
Ivan Vulić
Anna Korhonen
104
2
0
23 Aug 2025
SEA-BED: Southeast Asia Embedding Benchmark
Wuttikorn Ponwitayarat
Raymond Ng
Jann Railey Montalan
Thura Aung
Jian Gang Ngui
...
Panuthep Tasawong
Erik Cambria
Ekapol Chuangsuwanich
Sarana Nutanong
Peerat Limkonchotiwat
134
1
0
17 Aug 2025
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula
Sandipan Dandapat
D. Sharma
Parameswari Krishnamurthy
203
0
0
11 Aug 2025
LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning
Sandipan Dhar
Mayank Gupta
Preeti Rao
DiffM
35
0
0
07 Jul 2025
Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models
Md. Tanzib Hosain
Rajan Das Gupta
Md. Kishor Morol
186
2
0
24 May 2025
CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling
Aditeya Baral
Allen George Ajith
Roshan Nayak
Mrityunjay Abhijeet Bhanja
120
1
0
19 May 2025
Tokenization Matters: Improving Zero-Shot NER for Indic Languages
IEEE International Conference on Electro/Information Technology (EIT), 2025
Priyaranjan Pattnayak
Hitesh Laxmichand Patel
Amit Agarwal
182
9
0
23 Apr 2025
Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Monojit Choudhury
Shivam Chauhan
Rocktim Jyoti Das
Dhruv Sahnan
Xudong Han
...
Rituraj Joshi
Gurpreet Gosal
Avraham Sheinin
Natalia Vassilieva
Preslav Nakov
229
5
0
08 Apr 2025
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jann Railey Montalan
Jimson Paulo Layacan
David Demitri Africa
Richell Isaiah Flores
Michael T. Lopez II
Theresa Denise Magsajo
Anjanette Cayabyab
William-Chandra Tjhi
230
3
0
19 Feb 2025
Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Poulami Ghosh
Mary Dabre
Pushpak Bhattacharyya
AAML
262
0
0
14 Dec 2024
Development of Pre-Trained Transformer-based Models for the Nepali Language
Prajwal Thapa
Jinu Nyachhyon
Mridul Sharma
Bal Krishna Bal
247
4
0
24 Nov 2024
1-800-SHARED-TASKS @ NLU of Devanagari Script Languages: Detection of Language, Hate Speech, and Targets using LLMs
Jebish Purbey
Siddartha Pullakhandam
Kanwal Mehreen
Muhammad Arham
Drishti Sharma
Ashay Srivastava
Ram Mohan Rao Kadiyala
118
2
0
11 Nov 2024
ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams
Srija Anand
Praveen Srinivasa Varadhan
Mehak Singal
Mitesh M. Khapra
145
2
0
23 Oct 2024
Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus
Raviraj Joshi
Kanishk Singla
Anusha Kamath
Raunak Kalani
Rakesh Paul
Utkarsh Vaidya
Sanjay Singh Chauhan
Niranjan Wartikar
Eileen Long
SyDa
CLL
352
21
0
18 Oct 2024
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Dilip Venkatesh
Mary Dabre
Anoop Kunchukuttan
Mitesh Khapra
ELM
334
7
0
17 Oct 2024
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Amir Hossein Kargaran
Ali Modarressi
Nafiseh Nikeghbal
Jana Diesner
François Yvon
Hinrich Schütze
ELM
288
16
0
08 Oct 2024
Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024
Jann Railey Montalan
Jian Gang Ngui
Wei Qi Leong
Yosephine Susanto
Hamsawardhini Rengarajan
William-Chandra Tjhi
Alham Fikri Aji
404
6
0
20 Sep 2024
L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024
Pritika Rohera
Chaitrali Ginimav
Akanksha Salunke
Gayatri Sawant
Raviraj Joshi
192
7
0
13 Sep 2024
Hindi-BEIR : A Large Scale Retrieval Benchmark in Hindi
Arkadeep Acharya
Rudra Murthy
Vishwajeet Kumar
Jaydeep Sen
177
2
0
18 Aug 2024
INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages
A. Singh
Rudra Murthy
Vishwajeet Kumar
Jaydeep Sen
Ashish Mittal
Ganesh Ramakrishnan
454
19
0
18 Jul 2024
An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models
Nandini Mundra
Aditya Nanda Kishore
Mary Dabre
Ratish Puduppully
Anoop Kunchukuttan
Mitesh Khapra
149
15
0
08 Jul 2024
Unlocking the Potential of Model Merging for Low-Resource Languages
Mingxu Tao
Chen Zhang
Quzhe Huang
Tianyao Ma
Songfang Huang
Dongyan Zhao
Yansong Feng
CLL
MoMe
243
10
0
04 Jul 2024
Soft Language Prompts for Language Transfer
Ivan Vykopal
Simon Ostermann
Marian Simko
AAML
236
6
0
02 Jul 2024
Multilingual Trolley Problems for Language Models
Zhijing Jin
Sydney Levine
Max Kleiman-Weiner
Giorgio Piatti
Jiarui Liu
...
András Strausz
Mrinmaya Sachan
Amélie Reymond
Yejin Choi
Bernhard Schölkopf
LRM
331
25
0
02 Jul 2024
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz
Satak Kumar Dey
Ruwad Naswan
Hasnaen Adil
Khondker Salman Sayeed
Haz Sameen Shahgir
209
4
0
29 Jun 2024
A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge
Xiaopeng Wang
Yi Lu
Xin Qi
Zhiyong Wang
Yuankun Xie
Shuchen Shi
Ruibo Fu
75
0
0
22 Jun 2024
PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts
Varun Gumma
Aditya Yadavalli
Vivek Seshadri
Manohar Swaminathan
Sunayana Sitaram
ELM
243
23
0
21 Jun 2024
Exploring Design Choices for Building Language-Specific LLMs
Atula Tejaswi
Nilesh Gupta
Eunsol Choi
197
21
0
20 Jun 2024
Decoding the Diversity: A Review of the Indic AI Research Landscape
Sankalp KJ
Vinija Jain
S. Bhaduri
Tamoghna Roy
Vasu Sharma
203
7
0
13 Jun 2024
Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Rishav Hada
Safiya Husain
Varun Gumma
Harshita Diddee
Aditya Yadavalli
...
Nidhi Kulkarni
U. Gadiraju
Aditya Vashistha
Vivek Seshadri
Kalika Bali
261
14
0
10 May 2024
IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages
Harman Singh
Nitish Gupta
Shikhar Bharadwaj
Dinesh Tewari
Partha P. Talukdar
ELM
202
51
0
25 Apr 2024
TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
Gopichand Kanumolu
Lokesh Madasu
Nirmal Surange
Manish Shrivastava
156
2
0
17 Apr 2024
IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context
Nihar Ranjan Sahoo
Pranamya Prashant Kulkarni
Narjis Asad
Arif Ahmad
Tanu Goyal
Aparna Garimella
Pushpak Bhattacharyya
255
24
0
29 Mar 2024
Pretraining Language Models Using Translationese
Meet Doshi
Mary Dabre
Pushpak Bhattacharyya
SyDa
359
2
0
20 Mar 2024
BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English
H. M. Q. H. Sheikh Shafayat
Rishav Hada
Isaac Cowhey
Rifki Afina
Jerry Tworek
Lorie De Leon
129
3
0
16 Mar 2024
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
International Conference on Language Resources and Evaluation (LREC), 2024
Eunsu Kim
Juyoung Suk
Philhoon Oh
Haneul Yoo
Hyunjung Shim
Alice Oh
ELM
381
45
0
11 Mar 2024
Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Arijit Nag
Animesh Mukherjee
Niloy Ganguly
Soumen Chakrabarti
195
8
0
08 Mar 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
Carolin Holtermann
Paul Röttger
Timm Dill
Anne Lauscher
ELM
LRM
255
33
0
06 Mar 2024
Direct Punjabi to English speech translation using discrete units
Prabhjot Kaur
L. A. M. Bush
Weisong Shi
151
2
0
25 Feb 2024