
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

25 January 2023

Luke Zettlemoyer

Madian Khabsa


Papers citing "XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models"


Explaining and Mitigating Crosslingual Tokenizer Inequities Catherine Arnett T. Chang Stella Biderman Benjamin Bergen 152 0 0 24 Oct 2025
Model-Aware Tokenizer Transfer Mykola Haltiuk Aleksander Smywiński-Pohl 112 0 0 24 Oct 2025
Quick-CapsNet (QCN): A fast alternative to Capsule Networks ACS/IEEE International Conference on Computer Systems and Applications (AICCSA), 2020 Pouya Shiri Ramin Sharifi A. Baniasadi 3DPC 157 0 0 08 Oct 2025
Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework Mosong Ma Tania Stathaki Michalis Lazarou MedIm GAN 241 0 0 07 Oct 2025
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models Julie Kallini Dan Jurafsky Christopher Potts Martijn Bartelds 181 0 0 23 Sep 2025
SEA-BED: Southeast Asia Embedding Benchmark Wuttikorn Ponwitayarat Raymond Ng Jann Railey Montalan Thura Aung Jian Gang Ngui ... Panuthep Tasawong Erik Cambria Ekapol Chuangsuwanich Sarana Nutanong Peerat Limkonchotiwat 162 1 0 17 Aug 2025
Meta CLIP 2: A Worldwide Scaling Recipe Yung-Sung Chuang Yang Li Dong Wang Ching-Feng Yeh Kehan Lyu ... Zhuang Liu Saining Xie Anuj Kumar Shang-Wen Li Hu Xu CLIP VLM 356 13 0 29 Jul 2025
Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent? Xi Ai Mahardika Krisna Ihsani Min-Yen Kan HILM 189 1 0 17 Jul 2025
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks Luel Hagos Beyene Vivek Verma Min Ma Jesujoba Oluwadara Alabi Fabian David Schmidt Joyce Nakatumba-Nabende David Ifeoluwa Adelani 323 2 0 10 Jun 2025
Incorporating Domain Knowledge into Materials Tokenization Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Yerim Oh Jun-Hyung Park Junho Kim SungHo Kim S. Lee 156 0 0 09 Jun 2025
Crosslingual Reasoning through Test-Time Scaling Zheng-Xin Yong Muhammad Farid Adilazuarda Jonibek Mansurov Ruochen Zhang Niklas Muennighoff Carsten Eickhoff Genta Indra Winata Julia Kreutzer Stephen H. Bach Alham Fikri Aji LRM ELM 969 27 0 08 May 2025
HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization Enes Özeren Yihong Liu Hinrich Schütze 250 1 0 21 Apr 2025
Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations Computers in Human Behavior (CHB), 2025 Mahjabin Nahar Eun-Ju Lee Jin Won Park Dongwon Lee HILM 533 0 0 01 Apr 2025
Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions Guy Bar-Shalom Fabrizio Frasca Derek Lim Yoav Gelberg Yftah Ziser Ran El-Yaniv Gal Chechik Haggai Maron 402 2 0 18 Mar 2025
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Xiulin Yang Tatsuya Aoyama Yuekun Yao Ethan Wilcox 430 5 0 26 Feb 2025
Scaling Embedding Layers in Language Models Da Yu Edith Cohen Badih Ghazi Yangsibo Huang Pritish Kamath Ravi Kumar Daogao Liu Chiyuan Zhang 484 7 0 03 Feb 2025
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies Ehsaneddin Asgari Yassine El Kheir Mohammad Ali Sadraei Javaheri 276 12 0 02 Feb 2025
PixelWorld: How Far Are We from Perceiving Everything as Pixels? Zhiheng Lyu Xueguang Ma Wenhu Chen 643 3 0 31 Jan 2025
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Kushal Tatariya Vladimir Araujo Thomas Bauwens Miryam de Lhoneux VLM 236 1 0 15 Oct 2024
IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? Akhilesh Aravapalli Mounika Marreddy R. Mamidi Subba Reddy Oota 285 2 0 03 Oct 2024
Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models International Conference on Learning Representations (ICLR), 2024 Lucas Bandarkar Benjamin Muller Pritish Yuvraj Rui Hou Nayan Singhal Hongjiang Lv Bing-Quan Liu KELM LRM MoMe 443 12 0 02 Oct 2024
LangSAMP: Language-Script Aware Multilingual Pretraining Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Yihong Liu Haotian Ye Chunlan Ma Mingyang Wang Hinrich Schütze VLM 497 2 0 26 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Pavel Chizhov Catherine Arnett Elizaveta Korotkova Ivan P. Yamshchikov 224 14 0 06 Sep 2024
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies Chaofan Tao Qian Liu Longxu Dou Niklas Muennighoff Zhongwei Wan Ping Luo Min Lin Ngai Wong PILM 296 91 0 18 Jul 2024
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation Markus Frohmann Igor Sterner Ivan Vulić Benjamin Minixhofer Markus Schedl VLM 273 40 0 24 Jun 2024
ThaiCoref: Thai Coreference Resolution Dataset Pontakorn Trakuekul Wei Qi Leong Charin Polpanumas Jitkapat Sawatphol William-Chandra Tjhi Attapol T. Rutherford 156 0 0 10 Jun 2024
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling Tomasz Limisiewicz Terra Blevins Hila Gonen Orevaoghene Ahia Luke Zettlemoyer 301 28 0 15 Mar 2024
Getting the most out of your tokenizer for pre-training and domain adaptation Gautier Dagan Gabriele Synnaeve Baptiste Rozière 342 54 0 01 Feb 2024
SurreyAI 2023 Submission for the Quality Estimation Shared Task Conference on Machine Translation (WMT), 2023 Archchana Sindhujan Helen Treharne Constantin Orasan Tharindu Ranasinghe 180 4 0 01 Dec 2023
A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models Yi Zhou Jose Camacho-Collados Danushka Bollegala 417 6 0 19 Oct 2023
One For All & All For One: Bypassing Hyperparameter Tuning with Model Averaging For Cross-Lingual Transfer Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Fabian David Schmidt Ivan Vulić Goran Glavaš MoMe 117 5 0 16 Oct 2023
OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch Science China Information Sciences (Sci China Inf Sci), 2023 Juntao Li Zecheng Tang Yuyang Ding Pinzheng Wang Pei Guo ... Wenliang Chen Guohong Fu Qiaoming Zhu Guodong Zhou Hao Fei 356 8 0 19 Sep 2023
MultiLegalPile: A 689GB Multilingual Legal Corpus Annual Meeting of the Association for Computational Linguistics (ACL), 2023 Joel Niklaus Veton Matoshi Matthias Sturmer Ilias Chalkidis Daniel E. Ho AILaw ELM 402 59 0 03 Jun 2023
An Efficient Multilingual Language Model Compression through Vocabulary Trimming Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Asahi Ushio Yi Zhou Jose Camacho-Collados 379 15 0 24 May 2023
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Elizabeth Salesky Neha Verma Philipp Koehn Matt Post 285 19 0 23 May 2023
Small Models are Valuable Plug-ins for Large Language Models Annual Meeting of the Association for Computational Linguistics (ACL), 2023 Canwen Xu Yichong Xu Shuohang Wang Yang Liu Chenguang Zhu Julian McAuley LLMAG 218 73 0 15 May 2023
Evaluating Inter-Bilingual Semantic Parsing for Indian Languages Divyanshu Aggarwal V. Gupta Anoop Kunchukuttan 200 3 0 25 Apr 2023
Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022 Zhengxuan Wu Alex Tamkin Isabel Papadimitriou 251 14 0 24 Feb 2022
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages Transactions of the Association for Computational Linguistics (TACL), 2020 J. Clark Eunsol Choi Michael Collins Dan Garrette Tom Kwiatkowski Vitaly Nikolaev J. Palomaki 536 686 0 10 Mar 2020