LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?

20 July 2023

Nicolas Papernot

Papers citing "LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?"

Title | Authors | Tags | Counts | Date
XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs | Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, V. P. | - | 78 0 0 | 30 Apr 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control | Hannah Cyberey, David E. Evans | LLMSV | 72 0 0 | 23 Apr 2025
What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices | Sander Noels, Guillaume Bied, Maarten Buyl, Alexander Rogiers, Yousra Fettach, Jefrey Lijffijt, Tijl De Bie | - | 28 0 0 | 04 Apr 2025
Exploring LLMs for Malware Detection: Review, Framework Design, and Countermeasure Approaches | Jamal N. Al-Karaki, Muhammad Al-Zafar Khan, Marwan Omar | - | 26 4 0 | 11 Sep 2024
Securing the Future of GenAI: Policy and Technology | Mihai Christodorescu, Craven, S. Feizi, Neil Zhenqiang Gong, Mia Hoffmann, ..., Jessica Newman, Emelia Probasco, Yanjun Qi, Khawaja Shams, Turek | SILM | 26 3 0 | 21 May 2024
Navigating LLM Ethics: Advancements, Challenges, and Future Directions | Junfeng Jiao, S. Afroogh, Yiming Xu, Connor Phillips | AILaw | 55 19 0 | 14 May 2024
Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency | Akila Wickramasekara, F. Breitinger, Mark Scanlon | - | 42 7 0 | 29 Feb 2024
StruQ: Defending Against Prompt Injection with Structured Queries | Sizhe Chen, Julien Piet, Chawin Sitawarin, David A. Wagner | SILM, AAML | 22 65 0 | 09 Feb 2024
Demystifying RCE Vulnerabilities in LLM-Integrated Apps | Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, Kai Chen | SILM | 29 19 0 | 06 Sep 2023
The Internal State of an LLM Knows When It's Lying | A. Azaria, Tom Michael Mitchell | HILM | 216 297 0 | 26 Apr 2023
On the Impossible Safety of Large AI Models | El-Mahdi El-Mhamdi, Sadegh Farhadkhani, R. Guerraoui, Nirupam Gupta, L. Hoang, Rafael Pinot, Sébastien Rouault, John Stephan | - | 26 31 0 | 30 Sep 2022
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 301 11,730 0 | 04 Mar 2022
Fine-Tuning Language Models from Human Preferences | Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving | ALM | 275 1,561 0 | 18 Sep 2019