arXiv: 2505.14667
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
20 May 2025
Wonje Jeung
Sangyeon Yoon
Minsuk Kahng
Albert No
LRM
LLMSV
ArXiv (abs)
PDF
HTML
HuggingFace (1 upvote)
GitHub (69★)
Papers citing "SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment" (48 papers)
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
Yingzhi Mao
Chunkang Zhang
Junxiang Wang
Xinyan Guan
Boxi Cao
Yaojie Lu
Hongyu Lin
Xianpei Han
Le Sun
LRM
ELM
332
1
0
24 Oct 2025
Large Reasoning Models Learn Better Alignment from Flawed Thinking
ShengYun Peng
Eric Michael Smith
Ivan Evtimov
Song Jiang
Pin-Yu Chen
Hongyuan Zhan
Haozhu Wang
Duen Horng Chau
Mahesh Pasupuleti
Jianfeng Chi
OffRL
LRM
148
4
0
01 Oct 2025
A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models
Wonje Jeung
Sangyeon Yoon
Yoonjun Cho
Dongjae Jeon
Sangwoo Shin
Hyesoo Hong
Albert No
DiffM
137
0
0
27 Sep 2025
A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models
Yanbo Wang
Yongcan Yu
Jian Liang
Ran He
HILM
LRM
205
5
0
04 Sep 2025
R-TOFU: Unlearning in Large Reasoning Models
Sangyeon Yoon
Wonje Jeung
Albert No
MU
LRM
449
2
0
21 May 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David Evans
LLMSV
520
7
0
23 Apr 2025
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
Cunchun Li
Longji Xu
Ruipeng Wang
Zijun Yao
Kun Wang
An Zhang
Xiang Wang
Tat-Seng Chua
AAML
LRM
312
32
0
09 Apr 2025
Representation Bending for Large Language Model Safety
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ashkan Yousefpour
Taeheon Kim
Ryan S. Kwon
Seungbeen Lee
Wonje Jeung
Seungju Han
Alvin Wan
Harrison Ngan
Youngjae Yu
Jonghyun Choi
AAML
ALM
KELM
438
12
0
02 Apr 2025
Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Zonghao Ying
Guangyi Zheng
Yongxin Huang
Deyue Zhang
Wenxin Zhang
Quanchen Zou
Aishan Liu
Xianglong Liu
Dacheng Tao
ELM
289
24
0
19 Mar 2025
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Zachary Yahn
Yichang Xu
Ling Liu
328
63
0
01 Mar 2025
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Fengqing Jiang
Zhangchen Xu
Yuetai Li
Luyao Niu
Zhen Xiang
Yue Liu
Bill Yuchen Lin
Radha Poovendran
KELM
ELM
LRM
286
76
0
17 Feb 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
OffRL
AI4TS
LRM
ReLM
VLM
1.2K
5,517
0
22 Jan 2025
Large Language Model Safety: A Holistic Survey
Dan Shi
Shangda Wu
Yufei Huang
Zhigen Li
Yongqi Leng
...
Zishan Guo
Linhao Yu
Ling Shi
Bojian Jiang
Deyi Xiong
ELM
LM&MA
292
38
0
23 Dec 2024
Large Language Models Still Exhibit Bias in Long Text
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Wonje Jeung
Dongjae Jeon
Ashkan Yousefpour
Jonghyun Choi
ALM
504
12
0
23 Oct 2024
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Fei Wang
Ninareh Mehrabi
Palash Goyal
Rahul Gupta
Kai-Wei Chang
Aram Galstyan
ALM
226
7
0
07 Oct 2024
Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions
Qingbin Zeng
Qinglong Yang
Shunan Dong
Heming Du
Liang Zheng
Fengli Xu
Yong Li
LLMAG
LM&Ro
320
21
0
08 Aug 2024
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang
Kavel Rao
Seungju Han
Allyson Ettinger
Faeze Brahman
...
Niloofar Mireshghallah
Ximing Lu
Maarten Sap
Yejin Choi
Nouha Dziri
197
134
0
26 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Neural Information Processing Systems (NeurIPS), 2024
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
620
206
0
06 Jun 2024
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Weikai Lu
Huiping Zhuang
Jianwei Wang
Zhengdong Lu
Zelin Chen
Huiping Zhuang
Cen Chen
MU
AAML
KELM
298
47
0
08 Apr 2024
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Ruiqi Zhang
Licong Lin
Yu Bai
Song Mei
MU
337
307
0
08 Apr 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
International Conference on Learning Representations (ICLR), 2024
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
AAML
780
367
0
02 Apr 2024
A StrongREJECT for Empty Jailbreaks
Alexandra Souly
Qingyuan Lu
Dillon Bowen
Tu Trinh
Elvis Hsieh
...
Pieter Abbeel
Justin Svegliato
Scott Emmons
Olivia Watkins
Sam Toyer
259
188
0
15 Feb 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
1.5K
3,768
0
05 Feb 2024
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Zhen Xiang
Fengqing Jiang
Zidi Xiong
Bhaskar Ramasubramanian
Radha Poovendran
Bo Li
LRM
SILM
280
80
0
20 Jan 2024
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega
Isha Chaudhary
Changming Xu
Gagandeep Singh
AAML
250
40
0
19 Dec 2023
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Neural Information Processing Systems (NeurIPS), 2023
Anay Mehrotra
Manolis Zampetakis
Paul Kassianik
Blaine Nelson
Hyrum Anderson
Yaron Singer
Amin Karbasi
348
442
0
04 Dec 2023
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH
ELM
461
1,639
0
20 Nov 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Avi Schwarzschild
Guang Cheng
Hamed Hassani
George J. Pappas
Eric Wong
AAML
643
1,061
0
12 Oct 2023
Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong
Cristina Menghini
Stephen H. Bach
SILM
434
267
0
03 Oct 2023
At Which Training Stage Does Code Data Help LLMs Reasoning?
International Conference on Learning Representations (ICLR), 2023
Xiaogang Jia
Yue Liu
Yue Yu
Yuanliang Zhang
Yu Jiang
Changjian Wang
Shanshan Li
LRM
SyDa
362
90
0
28 Sep 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
924
507
0
19 Sep 2023
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Conference on Computer and Communications Security (CCS), 2023
Xinyue Shen
Sihao Lin
Michael Backes
Yun Shen
Yang Zhang
SILM
431
454
0
07 Aug 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Paul Röttger
Hannah Rose Kirk
Bertie Vidgen
Giuseppe Attanasio
Federico Bianchi
Dirk Hovy
ALM
ELM
AILaw
386
255
0
02 Aug 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
623
2,269
0
27 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
8.0K
15,207
0
18 Jul 2023
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Neural Information Processing Systems (NeurIPS), 2023
Jiaming Ji
Mickel Liu
Juntao Dai
Xuehai Pan
Chi Zhang
Ce Bian
Chi Zhang
Ruiyang Sun
Yizhou Wang
Yaodong Yang
ALM
400
707
0
10 Jul 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Neural Information Processing Systems (NeurIPS), 2023
Rafael Rafailov
Archit Sharma
E. Mitchell
Stefano Ermon
Christopher D. Manning
Chelsea Finn
ALM
860
6,697
0
29 May 2023
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Neural Information Processing Systems (NeurIPS), 2023
Shunyu Yao
Dian Yu
Jeffrey Zhao
Izhak Shafran
Thomas Griffiths
Yuan Cao
Karthik Narasimhan
LM&Ro
LRM
AI4CE
535
3,077
0
17 May 2023
Editing Models with Task Arithmetic
International Conference on Learning Representations (ICLR), 2022
Gabriel Ilharco
Marco Tulio Ribeiro
Mitchell Wortsman
Suchin Gururangan
Ludwig Schmidt
Hannaneh Hajishirzi
Ali Farhadi
KELM
MoMe
MU
1.2K
734
0
08 Dec 2022
ReAct: Synergizing Reasoning and Acting in Language Models
International Conference on Learning Representations (ICLR), 2022
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
2.4K
5,256
0
06 Oct 2022
Training language models to follow instructions with human feedback
Neural Information Processing Systems (NeurIPS), 2022
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
2.1K
17,490
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Neural Information Processing Systems (NeurIPS), 2022
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
2.3K
14,449
0
28 Jan 2022
Program Synthesis with Large Language Models
Jacob Austin
Augustus Odena
Maxwell Nye
Maarten Bosma
Henryk Michalewski
...
Ellen Jiang
Carrie J. Cai
Michael Terry
Quoc V. Le
Charles Sutton
ELM
AIMat
ReCod
ALM
418
2,869
0
16 Aug 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
Basel Alomair
Jacob Steinhardt
ReLM
FaML
904
3,932
0
05 Mar 2021
Measuring Massive Multitask Language Understanding
International Conference on Learning Representations (ICLR), 2020
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
2.2K
6,566
0
07 Sep 2020
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELM
RALM
LRM
971
3,751
0
14 Mar 2018
Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry
Aleksandar Makelov
Ludwig Schmidt
Dimitris Tsipras
Adrian Vladu
SILM
OOD
1.4K
13,707
0
19 Jun 2017
Deep reinforcement learning from human preferences
Neural Information Processing Systems (NeurIPS), 2017
Paul Christiano
Jan Leike
Tom B. Brown
Miljan Martic
Shane Legg
Dario Amodei
1.6K
4,387
0
12 Jun 2017