Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

20 August 2024
Haoyu Wang, Bingzhe Wu, Yatao Bian, Yongzhe Chang, Xueqian Wang, Peilin Zhao
arXiv:2408.10668

Papers citing "Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation"

3 of 3 citing papers shown

  1. Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
     Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, Tong Zhang
     18 Jun 2024 · Metrics: 46 / 132 / 0

  2. Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping
     Haoyu Wang, Guozheng Ma, Ziqiao Meng, Zeyu Qin, Li Shen, ..., Liu Liu, Yatao Bian, Tingyang Xu, Xueqian Wang, Peilin Zhao
     12 Feb 2024 · Metrics: 55 / 12 / 0

  3. Training language models to follow instructions with human feedback
     Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
     Topics: OSLM, ALM
     04 Mar 2022 · Metrics: 303 / 11,730 / 0