Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

16 February 2025

Papers citing "Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?"

3 / 3 papers shown

Title
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction Y. Chen Haoran Li Yuan Sui Y. Liu Yufei He Y. Song Bryan Hooi AAML SILM 61 0 0 29 Apr 2025
AI Awareness X. Li Haoyuan Shi Rongwu Xu Wei Xu 54 0 0 25 Apr 2025
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger Yulin Chen Haoran Li Zihao Zheng Zihao Zheng Yangqiu Song Bryan Hooi 30 6 0 17 Aug 2024