Faster Transformer Decoding: N-gram Masked Self-Attention
Abstract
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S, we propose truncating the target-side window used for computing self-attention by making an N-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4, ..., 8, depending on the task.
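
The abstract does not spell out the mask itself; as a minimal illustrative sketch (not code from the paper), assuming the N-gram restriction simply limits each target position to attending over itself and the previous N-1 target tokens, the boolean self-attention mask could be built as follows (the function name `ngram_causal_mask` and the NumPy formulation are this sketch's own choices):

```python
import numpy as np

def ngram_causal_mask(seq_len: int, n: int) -> np.ndarray:
    """Boolean self-attention mask of shape (seq_len, seq_len).

    Entry (i, j) is True when target position i is allowed to attend to
    position j, i.e. j is not in the future and lies within the last n
    positions (position i itself plus its n-1 predecessors).
    """
    positions = np.arange(seq_len)
    # Causal constraint: a query at position i may not look ahead of i.
    causal = positions[None, :] <= positions[:, None]
    # N-gram constraint: the key position must be within the last n tokens.
    window = positions[:, None] - positions[None, :] < n
    return causal & window

# Example: with n = 3, position 4 attends only to positions 2, 3, and 4.
print(ngram_causal_mask(6, 3).astype(int))
```

Applying such a mask before the softmax would zero out attention to positions outside the window, which is what allows the decoder to keep only a bounded amount of target-side state per step.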
