"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

10 April 2022

Papers citing "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Title | Authors | Tags | Counts | Date
Learning on LLM Output Signatures for gray-box LLM Behavior Analysis | Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron | | 62 0 0 | 18 Mar 2025
TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification | Lingfeng Shen, Ze Zhang, Haiyun Jiang, Ying Chen | AAML | 23 5 0 | 03 Feb 2023
ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation | Fan Yin, Yao Li, Cho-Jui Hsieh, Kai-Wei Chang | AAML | 58 4 0 | 22 Oct 2022
An Interpretability Evaluation Benchmark for Pre-trained Language Models | Ya-Ming Shen, Lijie Wang, Ying Chen, Xinyan Xiao, Jing Liu, Hua-Hong Wu | | 27 4 0 | 28 Jul 2022
Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO | Javier Rando, Nasib Naimi, Thomas Baumann, Max Mathys | AAML | 18 5 0 | 14 Jun 2022
Certified Robustness to Adversarial Word Substitutions | Robin Jia, Aditi Raghunathan, Kerem Göksel, Percy Liang | AAML | 178 290 0 | 03 Sep 2019
Generating Natural Language Adversarial Examples | M. Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang | AAML | 243 914 0 | 21 Apr 2018
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks | Mohit Iyyer, John Wieting, Kevin Gimpel, Luke Zettlemoyer | AAML GAN | 185 711 0 | 17 Apr 2018
Adversarial examples in the physical world | Alexey Kurakin, Ian Goodfellow, Samy Bengio | SILM AAML | 250 5,830 0 | 08 Jul 2016