2301.10472
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
25 January 2023
Davis Liang
Hila Gonen
Yuning Mao
Rui Hou
Naman Goyal
Marjan Ghazvininejad
Luke Zettlemoyer
Madian Khabsa
Papers citing
"XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models"
39 / 39 papers shown
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett
T. Chang
Stella Biderman
Benjamin Bergen
152
0
0
24 Oct 2025
Model-Aware Tokenizer Transfer
Mykola Haltiuk
Aleksander Smywiński-Pohl
112
0
0
24 Oct 2025
Quick-CapsNet (QCN): A fast alternative to Capsule Networks
ACS/IEEE International Conference on Computer Systems and Applications (AICCSA), 2020
Pouya Shiri
Ramin Sharifi
A. Baniasadi
3DPC
157
0
0
08 Oct 2025
Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework
Mosong Ma
Tania Stathaki
Michalis Lazarou
MedIm
GAN
241
0
0
07 Oct 2025
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models
Julie Kallini
Dan Jurafsky
Christopher Potts
Martijn Bartelds
181
0
0
23 Sep 2025
SEA-BED: Southeast Asia Embedding Benchmark
Wuttikorn Ponwitayarat
Raymond Ng
Jann Railey Montalan
Thura Aung
Jian Gang Ngui
...
Panuthep Tasawong
Erik Cambria
Ekapol Chuangsuwanich
Sarana Nutanong
Peerat Limkonchotiwat
162
1
0
17 Aug 2025
Meta CLIP 2: A Worldwide Scaling Recipe
Yung-Sung Chuang
Yang Li
Dong Wang
Ching-Feng Yeh
Kehan Lyu
...
Zhuang Liu
Saining Xie
Anuj Kumar
Shang-Wen Li
Hu Xu
CLIP
VLM
356
13
0
29 Jul 2025
Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?
Xi Ai
Mahardika Krisna Ihsani
Min-Yen Kan
HILM
189
1
0
17 Jul 2025
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks
Luel Hagos Beyene
Vivek Verma
Min Ma
Jesujoba Oluwadara Alabi
Fabian David Schmidt
Joyce Nakatumba-Nabende
David Ifeoluwa Adelani
323
2
0
10 Jun 2025
Incorporating Domain Knowledge into Materials Tokenization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yerim Oh
Jun-Hyung Park
Junho Kim
SungHo Kim
S. Lee
156
0
0
09 Jun 2025
Crosslingual Reasoning through Test-Time Scaling
Zheng-Xin Yong
Muhammad Farid Adilazuarda
Jonibek Mansurov
Ruochen Zhang
Niklas Muennighoff
Carsten Eickhoff
Genta Indra Winata
Julia Kreutzer
Stephen H. Bach
Alham Fikri Aji
LRM
ELM
969
27
0
08 May 2025
HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
Enes Özeren
Yihong Liu
Hinrich Schütze
250
1
0
21 Apr 2025
Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations
Computers in Human Behavior (CHB), 2025
Mahjabin Nahar
Eun-Ju Lee
Jin Won Park
Dongwon Lee
HILM
533
0
0
01 Apr 2025
Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions
Guy Bar-Shalom
Fabrizio Frasca
Derek Lim
Yoav Gelberg
Yftah Ziser
Ran El-Yaniv
Gal Chechik
Haggai Maron
402
2
0
18 Mar 2025
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiulin Yang
Tatsuya Aoyama
Yuekun Yao
Ethan Wilcox
430
5
0
26 Feb 2025
Scaling Embedding Layers in Language Models
Da Yu
Edith Cohen
Badih Ghazi
Yangsibo Huang
Pritish Kamath
Ravi Kumar
Daogao Liu
Chiyuan Zhang
484
7
0
03 Feb 2025
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Ehsaneddin Asgari
Yassine El Kheir
Mohammad Ali Sadraei Javaheri
276
12
0
02 Feb 2025
PixelWorld: How Far Are We from Perceiving Everything as Pixels?
Zhiheng Lyu
Xueguang Ma
Wenhu Chen
643
3
0
31 Jan 2025
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Kushal Tatariya
Vladimir Araujo
Thomas Bauwens
Miryam de Lhoneux
VLM
236
1
0
15 Oct 2024
IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
Akhilesh Aravapalli
Mounika Marreddy
R. Mamidi
Subba Reddy Oota
285
2
0
03 Oct 2024
Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
International Conference on Learning Representations (ICLR), 2024
Lucas Bandarkar
Benjamin Muller
Pritish Yuvraj
Rui Hou
Nayan Singhal
Hongjiang Lv
Bing-Quan Liu
KELM
LRM
MoMe
443
12
0
02 Oct 2024
LangSAMP: Language-Script Aware Multilingual Pretraining
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yihong Liu
Haotian Ye
Chunlan Ma
Mingyang Wang
Hinrich Schütze
VLM
497
2
0
26 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
224
14
0
06 Sep 2024
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Chaofan Tao
Qian Liu
Longxu Dou
Niklas Muennighoff
Zhongwei Wan
Ping Luo
Min Lin
Ngai Wong
PILM
296
91
0
18 Jul 2024
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Markus Frohmann
Igor Sterner
Ivan Vulić
Benjamin Minixhofer
Markus Schedl
VLM
273
40
0
24 Jun 2024
ThaiCoref: Thai Coreference Resolution Dataset
Pontakorn Trakuekul
Wei Qi Leong
Charin Polpanumas
Jitkapat Sawatphol
William-Chandra Tjhi
Attapol T. Rutherford
156
0
0
10 Jun 2024
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz
Terra Blevins
Hila Gonen
Orevaoghene Ahia
Luke Zettlemoyer
301
28
0
15 Mar 2024
Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan
Gabriele Synnaeve
Baptiste Rozière
342
54
0
01 Feb 2024
SurreyAI 2023 Submission for the Quality Estimation Shared Task
Conference on Machine Translation (WMT), 2023
Archchana Sindhujan
Helen Treharne
Constantin Orasan
Tharindu Ranasinghe
180
4
0
01 Dec 2023
A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models
Yi Zhou
Jose Camacho-Collados
Danushka Bollegala
417
6
0
19 Oct 2023
One For All & All For One: Bypassing Hyperparameter Tuning with Model Averaging For Cross-Lingual Transfer
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Fabian David Schmidt
Ivan Vulić
Goran Glavaš
MoMe
117
5
0
16 Oct 2023
OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch
Science China Information Sciences (Sci China Inf Sci), 2023
Juntao Li
Zecheng Tang
Yuyang Ding
Pinzheng Wang
Pei Guo
...
Wenliang Chen
Guohong Fu
Qiaoming Zhu
Guodong Zhou
Hao Fei
356
8
0
19 Sep 2023
MultiLegalPile: A 689GB Multilingual Legal Corpus
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Joel Niklaus
Veton Matoshi
Matthias Stürmer
Ilias Chalkidis
Daniel E. Ho
AILaw
ELM
402
59
0
03 Jun 2023
An Efficient Multilingual Language Model Compression through Vocabulary Trimming
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Asahi Ushio
Yi Zhou
Jose Camacho-Collados
379
15
0
24 May 2023
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Elizabeth Salesky
Neha Verma
Philipp Koehn
Matt Post
285
19
0
23 May 2023
Small Models are Valuable Plug-ins for Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Canwen Xu
Yichong Xu
Shuohang Wang
Yang Liu
Chenguang Zhu
Julian McAuley
LLMAG
218
73
0
15 May 2023
Evaluating Inter-Bilingual Semantic Parsing for Indian Languages
Divyanshu Aggarwal
V. Gupta
Anoop Kunchukuttan
200
3
0
25 Apr 2023
Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Zhengxuan Wu
Alex Tamkin
Isabel Papadimitriou
251
14
0
24 Feb 2022
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Transactions of the Association for Computational Linguistics (TACL), 2020
J. Clark
Eunsol Choi
Michael Collins
Dan Garrette
Tom Kwiatkowski
Vitaly Nikolaev
J. Palomaki
536
686
0
10 Mar 2020