2502.12206
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
16 February 2025
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
ALM
Papers citing
"Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?"
3 / 3 papers shown
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Y. Chen
Haoran Li
Yuan Sui
Y. Liu
Yufei He
Y. Song
Bryan Hooi
AAML
SILM
61
0
0
29 Apr 2025
AI Awareness
X. Li
Haoyuan Shi
Rongwu Xu
Wei Xu
54
0
0
25 Apr 2025
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
Yulin Chen
Haoran Li
Zihao Zheng
Yangqiu Song
Bryan Hooi
30
6
0
17 Aug 2024