2005.06606
Cited By
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
3 May 2020
Xuanli He
Gholamreza Haffari
Mohammad Norouzi
Papers citing
"Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation"
33 / 33 papers shown
Title
Lexically Grounded Subword Segmentation
Jindřich Libovický
Jindřich Helcl
35
1
0
19 Jun 2024
Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
Haoran Lian
Yizhe Xiong
Jianwei Niu
Shasha Mo
Zhenpeng Su
Zijia Lin
Peng Liu
Hui Chen
Guiguang Ding
34
1
0
27 Apr 2024
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Khuyagbaatar Batsuren
Ekaterina Vylomova
Verna Dankers
Tsetsuukhei Delgerbaatar
Omri Uzan
Yuval Pinter
Gábor Bella
27
9
0
20 Apr 2024
Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation
Francois Meyer
Jan Buys
27
2
0
12 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Omri Uzan
Craig W. Schmidt
Chris Tanner
Yuval Pinter
38
14
0
02 Mar 2024
Tokenization Is More Than Compression
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
38
28
0
28 Feb 2024
Two Counterexamples to Tokenization and the Noiseless Channel
Marco Cognetta
Vilém Zouhar
Sangwhan Moon
Naoaki Okazaki
27
0
0
22 Feb 2024
Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
David Yunis
Justin Jung
Falcon Z. Dai
Matthew R. Walter
OffRL
35
0
0
08 Sep 2023
SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation
Haiyue Song
Raj Dabre
Chenhui Chu
Sadao Kurohashi
Eiichiro Sumita
14
3
0
31 Jul 2023
Should you marginalize over possible tokenizations?
Nadezhda Chirkova
Germán Kruszewski
Jos Rozen
Marc Dymetman
14
10
0
30 Jun 2023
Evolution of Efficient Symbolic Communication Codes
Anton Kolonin
15
0
0
04 Jun 2023
Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation
Francois Meyer
Jan Buys
33
8
0
11 May 2023
What changes when you randomly choose BPE merge operations? Not much
Jonne Saleva
Constantine Lignos
20
6
0
04 May 2023
Tokenization Preference for Human and Machine Learning Model: An Annotation Study
Tatsuya Hiraoka
Tomoya Iwakura
24
1
0
21 Apr 2023
Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing
Tatsuya Hiraoka
Tomoya Iwakura
12
0
0
21 Apr 2023
Elementwise Language Representation
Du-Yeong Kim
Jeeeun Kim
28
0
0
27 Feb 2023
Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
Kaiser Sun
Peng Qi
Yuhao Zhang
Lan Liu
William Yang Wang
Zhiheng Huang
24
7
0
19 Dec 2022
Extending the Subwording Model of Multilingual Pretrained Models for New Languages
K. Imamura
Eiichiro Sumita
VLM
27
3
0
29 Nov 2022
Incorporating Context into Subword Vocabularies
Shaked Yehezkel
Yuval Pinter
39
8
0
13 Oct 2022
How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?
Ali Araabi
Christof Monz
Vlad Niculae
20
10
0
10 Aug 2022
The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Khuyagbaatar Batsuren
Gábor Bella
Aryaman Arora
Viktor Martinović
Kyle Gorman
...
Magda Ševčíková
Kateřina Pelegrinová
Fausto Giunchiglia
Ryan Cotterell
Ekaterina Vylomova
23
39
0
15 Jun 2022
Local Byte Fusion for Neural Machine Translation
Makesh Narsimhan Sreedhar
Xiangpeng Wan
Yu-Jie Cheng
Junjie Hu
22
4
0
23 May 2022
Improving Tokenisation by Alternative Treatment of Spaces
Edward Gow-Smith
Harish Tayyar Madabushi
Carolina Scarton
Aline Villavicencio
29
20
0
08 Apr 2022
LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
Keita Nonaka
Kazutaka Yamanouchi
Tomohiro I
Tsuyoshi Okita
Kazutaka Shimada
H. Sakamoto
19
8
0
28 Feb 2022
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
Sabrina J. Mielke
Zaid Alyafeai
Elizabeth Salesky
Colin Raffel
Manan Dey
...
Arun Raja
Chenglei Si
Wilson Y. Lee
Benoît Sagot
Samson Tan
28
140
0
20 Dec 2021
You should evaluate your language model on marginal likelihood over tokenisations
Kris Cao
Laura Rimell
23
23
0
06 Sep 2021
Survey of Low-Resource Machine Translation
Barry Haddow
Rachel Bawden
Antonio Valerio Miceli Barone
Jindřich Helcl
Alexandra Birch
AIMat
29
147
0
01 Sep 2021
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Yi Tay
Vinh Q. Tran
Sebastian Ruder
Jai Gupta
Hyung Won Chung
Dara Bahri
Zhen Qin
Simon Baumgartner
Cong Yu
Donald Metzler
45
152
0
23 Jun 2021
How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation
Marco Gaido
Beatrice Savoldi
L. Bentivogli
Matteo Negri
Marco Turchi
56
15
0
28 May 2021
Joint Optimization of Tokenization and Downstream Model
Tatsuya Hiraoka
Sho Takase
Kei Uchiumi
Atsushi Keyaki
Naoaki Okazaki
14
17
0
26 May 2021
Multi-view Subword Regularization
Xinyi Wang
Sebastian Ruder
Graham Neubig
19
45
0
15 Mar 2021
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
A. Laptev
A. Andrusenko
Ivan Podluzhny
Anton Mitrofanov
Ivan Medennikov
Yuri N. Matveev
VLM
18
14
0
12 Mar 2021
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu
M. Schuster
Z. Chen
Quoc V. Le
Mohammad Norouzi
...
Alex Rudnick
Oriol Vinyals
G. Corrado
Macduff Hughes
J. Dean
AIMat
716
6,743
0
26 Sep 2016