Deep models are commonly vulnerable to adversarial examples. In this paper, we propose the first algorithm that effectively generates both positive and negative adversarial examples for paraphrase identification. We first sample an original sentence pair from the dataset and then adversarially replace some word pairs with difficult common words. We take multiple steps and use beam search to find a modification that makes the target model fail, thereby obtaining an adversarial example. The word replacement is also constrained by heuristic rules and a language model to preserve the label and language quality during modification. Experiments show that the performance of the target models drops severely on our adversarially modified examples. Meanwhile, human annotators are much less affected, and the generated sentences retain good language quality. We also show that adversarial training with the generated adversarial examples improves model robustness, whereas previous work provides little improvement on our adversarial examples.
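To illustrate the kind of beam-search word-replacement attack the abstract describes, below is a minimal Python sketch. It is not the authors' implementation; the callables `target_model`, `candidate_replacements`, and `lm_score`, the scoring weights, and the stopping rule are all assumed placeholders.

```python
# Hedged sketch of a beam-search word-replacement attack on a paraphrase
# identification model. All function names below are hypothetical stand-ins,
# not the paper's actual code.
from typing import Callable, List, Tuple


def beam_search_attack(
    sent_a: List[str],
    sent_b: List[str],
    label: int,  # 1 = paraphrase, 0 = non-paraphrase (gold label)
    target_model: Callable[[List[str], List[str]], float],   # returns P(paraphrase)
    candidate_replacements: Callable[[str], List[str]],       # common-word candidates
    lm_score: Callable[[List[str]], float],                   # language-model fluency
    beam_size: int = 5,
    max_steps: int = 4,
) -> Tuple[List[str], List[str]]:
    """Iteratively replace word pairs, keeping the beam_size edits that most
    reduce the target model's confidence in the true label while staying
    fluent under the language model."""
    beam = [(sent_a, sent_b)]
    for _ in range(max_steps):
        scored = []
        for a, b in beam:
            for i, word in enumerate(a):
                for repl in candidate_replacements(word):
                    # Replace the word in both sentences (a word *pair*) so the
                    # gold paraphrase / non-paraphrase label is preserved.
                    new_a = a[:i] + [repl] + a[i + 1:]
                    new_b = [repl if w == word else w for w in b]
                    conf = target_model(new_a, new_b)
                    # Higher score = closer to flipping the model's prediction.
                    attack_score = conf if label == 0 else 1.0 - conf
                    scored.append((attack_score + 0.1 * lm_score(new_a), new_a, new_b))
        scored.sort(key=lambda t: t[0], reverse=True)
        beam = [(a, b) for _, a, b in scored[:beam_size]]
        best_a, best_b = beam[0]
        best_conf = target_model(best_a, best_b)
        # Stop as soon as the target model predicts the wrong label.
        if (label == 1 and best_conf < 0.5) or (label == 0 and best_conf >= 0.5):
            return best_a, best_b
    return beam[0]
```

In this sketch the beam keeps the modifications that best combine attack strength and fluency; the paper's actual heuristic rules and constraints may differ.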