
Do LLMs "know" internally when they follow instructions?

18 October 2024
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
arXiv:2410.14516 · PDF · HTML

Papers citing "Do LLMs "know" internally when they follow instructions?"

2 / 2 papers shown
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary
Yakai Li
Jiekang Hu
Weiduan Sang
Luping Ma
Jing Xie
Weijuan Zhang
Aimin Yu
Shijie Zhao
Qingjia Huang
Qihang Zhou
AAML
28 Apr 2025
I'm Sorry Dave: How the old world of personnel security can inform the new world of AI insider risk
Paul Martin
Sarah Mercer
26 Mar 2025