Breaking Character: Are Subwords Good Enough for MRLs After All?

10 April 2022

Papers citing "Breaking Character: Are Subwords Good Enough for MRLs After All?"

14 / 14 papers shown

Title | Authors | Tag | Counts | Date
Splintering Nonconcatenative Languages for Better Tokenization | Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter | | 57 0 0 | 18 Mar 2025
MenakBERT -- Hebrew Diacriticizer | Ido Cohen, Jacob Gidron, Idan Pinto | VLM | 16 0 0 | 03 Oct 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance | Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, Reut Tsarfaty | | 43 22 0 | 10 Mar 2024
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations | Aina Garí Soler, Matthieu Labeau, Chloé Clavel | VLM | 30 2 0 | 22 Feb 2024
D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models | Adi Rosenthal, Nadav Shaked | | 11 0 0 | 30 Jan 2024
Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew | Eylon Gueta, Omer Goldman, Reut Tsarfaty | | 11 1 0 | 01 Nov 2023
Text Rendering Strategies for Pixel Language Models | Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott | VLM | 22 11 0 | 01 Nov 2023
What is the best recipe for character-level encoder-only modelling? | Kris Cao | | 32 2 0 | 09 May 2023
Impact of Subword Pooling Strategy on Cross-lingual Event Detection | Shantanu Agarwal, Steven Fincke, Chris Jenkins, Scott Miller, Elizabeth Boschee | | 14 2 0 | 22 Feb 2023
Multilingual Sequence-to-Sequence Models for Hebrew NLP | Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty | | 27 4 0 | 19 Dec 2022
Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All | Eylon Guetta, Avi Shmidman, Shaltiel Shmidman, C. Shmidman, Joshua Guedalia, Moshe Koppel, Dan Bareket, Amit Seker, Reut Tsarfaty | VLM | 16 14 0 | 28 Nov 2022
Incorporating Context into Subword Vocabularies | Shaked Yehezkel, Yuval Pinter | | 35 8 0 | 13 Oct 2022
Language Modelling with Pixels | Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott | VLM | 30 46 0 | 14 Jul 2022
ParaShoot: A Hebrew Question Answering Dataset | Omri Keren, Omer Levy | | 29 17 0 | 23 Sep 2021