arXiv: 2505.14667
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
20 May 2025
Wonje Jeung
Sangyeon Yoon
Minsuk Kahng
Albert No
LRM
LLMSV
ArXiv (abs)
PDF
HTML
HuggingFace (1 upvote)
GitHub (69★)
Papers citing "SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment" (48 papers)
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
Yingzhi Mao
Chunkang Zhang
Junxiang Wang
Xinyan Guan
Boxi Cao
Yaojie Lu
Hongyu Lin
Xianpei Han
Le Sun
LRM
ELM
332
1
0
24 Oct 2025
Large Reasoning Models Learn Better Alignment from Flawed Thinking
ShengYun Peng
Eric Michael Smith
Ivan Evtimov
Song Jiang
Pin-Yu Chen
Hongyuan Zhan
Haozhu Wang
Duen Horng Chau
Mahesh Pasupuleti
Jianfeng Chi
OffRL
LRM
148
4
0
01 Oct 2025
A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models
Wonje Jeung
Sangyeon Yoon
Yoonjun Cho
Dongjae Jeon
Sangwoo Shin
Hyesoo Hong
Albert No
DiffM
137
0
0
27 Sep 2025
A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models
Yanbo Wang
Yongcan Yu
Jian Liang
Ran He
HILM
LRM
205
5
0
04 Sep 2025
R-TOFU: Unlearning in Large Reasoning Models
Sangyeon Yoon
Wonje Jeung
Albert No
MU
LRM
449
2
0
21 May 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David Evans
LLMSV
520
7
0
23 Apr 2025
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
Cunchun Li
Longji Xu
Ruipeng Wang
Zijun Yao
Kun Wang
An Zhang
Xiang Wang
Tat-Seng Chua
AAML
LRM
312
32
0
09 Apr 2025
Representation Bending for Large Language Model Safety
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ashkan Yousefpour
Taeheon Kim
Ryan S. Kwon
Seungbeen Lee
Wonje Jeung
Seungju Han
Alvin Wan
Harrison Ngan
Youngjae Yu
Jonghyun Choi
AAML
ALM
KELM
438
12
0
02 Apr 2025
Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Zonghao Ying
Guangyi Zheng
Yongxin Huang
Deyue Zhang
Wenxin Zhang
Quanchen Zou
Aishan Liu
Xianglong Liu
Dacheng Tao
ELM
289
24
0
19 Mar 2025
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Zachary Yahn
Yichang Xu
Ling Liu
328
63
0
01 Mar 2025
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Fengqing Jiang
Zhangchen Xu
Yuetai Li
Luyao Niu
Zhen Xiang
Yue Liu
Bill Yuchen Lin
Radha Poovendran
KELM
ELM
LRM
286
76
0
17 Feb 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
OffRL
AI4TS
LRM
ReLM
VLM
1.2K
5,517
0
22 Jan 2025
Large Language Model Safety: A Holistic Survey
Dan Shi
Shangda Wu
Yufei Huang
Zhigen Li
Yongqi Leng
...
Zishan Guo
Linhao Yu
Ling Shi
Bojian Jiang
Deyi Xiong
ELM
LM&MA
292
38
0
23 Dec 2024
Large Language Models Still Exhibit Bias in Long Text
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Wonje Jeung
Dongjae Jeon
Ashkan Yousefpour
Jonghyun Choi
ALM
504
12
0
23 Oct 2024
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Fei Wang
Ninareh Mehrabi
Palash Goyal
Rahul Gupta
Kai-Wei Chang
Aram Galstyan
ALM
226
7
0
07 Oct 2024
Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions
Qingbin Zeng
Qinglong Yang
Shunan Dong
Heming Du
Liang Zheng
Fengli Xu
Yong Li
LLMAG
LM&Ro
320
21
0
08 Aug 2024
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang
Kavel Rao
Seungju Han
Allyson Ettinger
Faeze Brahman
...
Niloofar Mireshghallah
Ximing Lu
Maarten Sap
Yejin Choi
Nouha Dziri
197
134
0
26 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Neural Information Processing Systems (NeurIPS), 2024
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
620
206
0
06 Jun 2024
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Weikai Lu
Huiping Zhuang
Jianwei Wang
Zhengdong Lu
Zelin Chen
Huiping Zhuang
Cen Chen
MU
AAML
KELM
298
47
0
08 Apr 2024
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Ruiqi Zhang
Licong Lin
Yu Bai
Song Mei
MU
337
307
0
08 Apr 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
International Conference on Learning Representations (ICLR), 2024
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
AAML
780
367
0
02 Apr 2024
A StrongREJECT for Empty Jailbreaks
Alexandra Souly
Qingyuan Lu
Dillon Bowen
Tu Trinh
Elvis Hsieh
...
Pieter Abbeel
Justin Svegliato
Scott Emmons
Olivia Watkins
Sam Toyer
259
188
0
15 Feb 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
1.5K
3,768
0
05 Feb 2024
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Zhen Xiang
Fengqing Jiang
Zidi Xiong
Bhaskar Ramasubramanian
Radha Poovendran
Bo Li
LRM
SILM
280
80
0
20 Jan 2024
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega
Isha Chaudhary
Changming Xu
Gagandeep Singh
AAML
250
40
0
19 Dec 2023
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Neural Information Processing Systems (NeurIPS), 2023
Anay Mehrotra
Manolis Zampetakis
Paul Kassianik
Blaine Nelson
Hyrum Anderson
Yaron Singer
Amin Karbasi
348
442
0
04 Dec 2023
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH
ELM
461
1,639
0
20 Nov 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Avi Schwarzschild
Guang Cheng
Hamed Hassani
George J. Pappas
Eric Wong
AAML
643
1,061
0
12 Oct 2023
Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong
Cristina Menghini
Stephen H. Bach
SILM
434
267
0
03 Oct 2023
At Which Training Stage Does Code Data Help LLMs Reasoning?
International Conference on Learning Representations (ICLR), 2023
Xiaogang Jia
Yue Liu
Yue Yu
Yuanliang Zhang
Yu Jiang
Changjian Wang
Shanshan Li
LRM
SyDa
362
90
0
28 Sep 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
924
507
0
19 Sep 2023
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Conference on Computer and Communications Security (CCS), 2023
Xinyue Shen
Sihao Lin
Michael Backes
Yun Shen
Yang Zhang
SILM
431
454
0
07 Aug 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Paul Röttger
Hannah Rose Kirk
Bertie Vidgen
Giuseppe Attanasio
Federico Bianchi
Dirk Hovy
ALM
ELM
AILaw
386
255
0
02 Aug 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
623
2,269
0
27 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
8.0K
15,207
0
18 Jul 2023
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Neural Information Processing Systems (NeurIPS), 2023
Jiaming Ji
Mickel Liu
Juntao Dai
Xuehai Pan
Chi Zhang
Ce Bian
Chi Zhang
Ruiyang Sun
Yizhou Wang
Yaodong Yang
ALM
400
707
0
10 Jul 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Neural Information Processing Systems (NeurIPS), 2023
Rafael Rafailov
Archit Sharma
E. Mitchell
Stefano Ermon
Christopher D. Manning
Chelsea Finn
ALM
860
6,697
0
29 May 2023
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Neural Information Processing Systems (NeurIPS), 2023
Shunyu Yao
Dian Yu
Jeffrey Zhao
Izhak Shafran
Thomas Griffiths
Yuan Cao
Karthik Narasimhan
LM&Ro
LRM
AI4CE
535
3,077
0
17 May 2023
Editing Models with Task Arithmetic
International Conference on Learning Representations (ICLR), 2022
Gabriel Ilharco
Marco Tulio Ribeiro
Mitchell Wortsman
Suchin Gururangan
Ludwig Schmidt
Hannaneh Hajishirzi
Ali Farhadi
KELM
MoMe
MU
1.2K
734
0
08 Dec 2022
ReAct: Synergizing Reasoning and Acting in Language Models
International Conference on Learning Representations (ICLR), 2022
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
2.4K
5,256
0
06 Oct 2022
Training language models to follow instructions with human feedback
Neural Information Processing Systems (NeurIPS), 2022
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
2.1K
17,490
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Neural Information Processing Systems (NeurIPS), 2022
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
2.3K
14,449
0
28 Jan 2022
Program Synthesis with Large Language Models
Jacob Austin
Augustus Odena
Maxwell Nye
Maarten Bosma
Henryk Michalewski
...
Ellen Jiang
Carrie J. Cai
Michael Terry
Quoc V. Le
Charles Sutton
ELM
AIMat
ReCod
ALM
418
2,869
0
16 Aug 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
Basel Alomair
Jacob Steinhardt
ReLM
FaML
904
3,932
0
05 Mar 2021
Measuring Massive Multitask Language Understanding
International Conference on Learning Representations (ICLR), 2020
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
2.2K
6,566
0
07 Sep 2020
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELM
RALM
LRM
971
3,751
0
14 Mar 2018
Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry
Aleksandar Makelov
Ludwig Schmidt
Dimitris Tsipras
Adrian Vladu
SILM
OOD
1.4K
13,707
0
19 Jun 2017
Deep reinforcement learning from human preferences
Neural Information Processing Systems (NeurIPS), 2017
Paul Christiano
Jan Leike
Tom B. Brown
Miljan Martic
Shane Legg
Dario Amodei
1.6K
4,387
0
12 Jun 2017