FASTSUBS: An Efficient Admissible Algorithm for Finding the Most Likely Lexical Substitutes Using a Statistical Language Model

IEEE Signal Processing Letters (SPL), 2012

24 May 2012

Abstract

Lexical substitutes have found use in word sense disambiguation, unsupervised part-of-speech induction, paraphrasing, machine translation, and text simplification. Using a statistical language model to find the most likely substitutes in a given context is a successful approach, but the cost of a naive algorithm is proportional to the vocabulary size. This paper presents the Fastsubs algorithm, which efficiently and correctly identifies the most likely lexical substitutes for a given context based on a statistical language model without going through most of the vocabulary. The efficiency of Fastsubs makes large-scale experiments based on lexical substitutes feasible. For example, it is possible to compute the top 10 substitutes for each of the 1,173,766 tokens in the Penn Treebank in about 6 hours on a typical workstation. The same task would take about 6 days with the naive algorithm. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available from the author's website at http://goo.gl/jzKH0.
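
To make the naive baseline concrete, the following is a minimal sketch of exhaustive substitute search, not the Fastsubs algorithm itself. It assumes a toy bigram model: the probability table, the vocabulary, the `UNSEEN_LOGP` constant, and the function names are all hypothetical and chosen only for illustration. Each candidate is scored by the log-probability of the bigrams it forms with its left and right neighbours, and every vocabulary word is visited, which is why the cost grows with vocabulary size.

```python
import heapq
import math

# Toy bigram log-probabilities (hypothetical values, for illustration only).
BIGRAM_LOGP = {
    ("the", "cat"): math.log(0.20),
    ("the", "dog"): math.log(0.30),
    ("the", "car"): math.log(0.10),
    ("cat", "sat"): math.log(0.40),
    ("dog", "sat"): math.log(0.20),
    ("car", "sat"): math.log(0.01),
}
VOCAB = ["cat", "dog", "car", "sat", "the"]
UNSEEN_LOGP = math.log(1e-6)  # crude stand-in for real back-off smoothing


def bigram_logp(prev, word):
    """Log-probability of `word` following `prev` under the toy model."""
    return BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP)


def naive_top_substitutes(left, right, k):
    """Score every vocabulary word in the slot between `left` and `right`.

    The loop touches the whole vocabulary, so the cost per token is
    proportional to the vocabulary size.
    """
    scored = ((bigram_logp(left, w) + bigram_logp(w, right), w) for w in VOCAB)
    return [w for _, w in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    print(naive_top_substitutes("the", "sat", 2))
```

With a realistic vocabulary, this loop is repeated for every token in the corpus. That per-token scan of the full vocabulary is the cost the abstract says Fastsubs avoids.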
