PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

20 June 2024
Yalan Qin
Chongye Guo
Borong Zhang
Boyuan Chen
Josef Dai
Boren Zheng
Tianyi Qiu
Boxun Li
Kaile Wang
Boxuan Li
Sirui Han
Wenhan Luo
Yaodong Yang

Papers citing "PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference"

16 / 16 papers shown
MetaGDPO: Alleviating Catastrophic Forgetting with Metacognitive Knowledge through Group Direct Preference Optimization
Lanxue Zhang
Lei Shen
Fang Fang
Fanglong Dong
R. Liu
Yanan Cao
CLL
223
1
0
15 Nov 2025
DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Y. Wang
Chris Yuhao Liu
Quan Liu
Jinglong Pang
Wei Wei
Yujia Bao
Yang Liu
MU
430
2
0
08 Nov 2025
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Yichi Zhang
Yue Ding
Jingwen Yang
Tianwei Luo
Dongbai Li
Ranjie Duan
Qiang Liu
Hang Su
Yinpeng Dong
Jun Zhu
LRM
197
3
0
29 Sep 2025
PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces
Jiawei Zhao
Yuang Qi
Weiming Zhang
Nenghai Yu
Kejiang Chen
LRM
184
0
0
26 Sep 2025
Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple LLM Judges
Yuqi Tang
Kehua Feng
Yunfeng Wang
Zhiwen Chen
Chengfei Lv
Gang Yu
Qiang Zhang
Keyan Ding
Huajun Chen
ELM
288
1
0
01 Aug 2025
SAFER: Probing Safety in Reward Models with Sparse Autoencoder
Sihang Li
Wei Shi
Ziyuan Xie
Tao Liang
OffRL
226
2
0
01 Jul 2025
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang
Jiaxin Song
Yifeng Gao
Xin Wang
Yang Yao
Yan Teng
Jiabo He
Yingchun Wang
Yu-Gang Jiang
494
5
0
17 May 2025
Probing and Inducing Combinational Creativity in Vision-Language Models
Yongqian Peng
Yuxi Ma
Minghua Yi
Yuxuan Wang
Yizhou Wang
Chuxu Zhang
Yixin Zhu
Zilong Zheng
MLLM
CoGe
539
5
0
17 Apr 2025
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability
Yuanhang Zhang
Zihao Zeng
Dongbai Li
Yao Huang
Zhijie Deng
Yinpeng Dong
LRM
317
48
0
14 Apr 2025
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Shuo Tang
Xianghe Pang
Zexi Liu
Bohan Tang
Guangyi Liu
Xiaowen Dong
Yanjie Wang
Yanfeng Wang
Tian Jin
SyDa
LLMAG
611
20
0
21 Feb 2025
Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
International Conference on Learning Representations (ICLR), 2024
Xueru Wen
Jie Lou
Yaojie Lu
Hongyu Lin
Xing Yu
Xinyu Lu
Xianpei Han
Jia Zheng
Debing Zhang
Le Sun
ALM
882
21
0
17 Feb 2025
STAIR: Improving Safety Alignment with Introspective Reasoning
Yuanhang Zhang
Siyuan Zhang
Yao Huang
Zeyu Xia
Zhengwei Fang
Xiao Yang
Ranjie Duan
Dong Yan
Yinpeng Dong
Jun Zhu
LRM
LLMSV
455
52
0
04 Feb 2025
Latent Feature Mining for Predictive Model Enhancement with Large Language Models
Bingxuan Li
Pengyi Shi
Amy Ward
290
32
0
06 Oct 2024
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
Pratiksha Thaker
Shengyuan Hu
Neil Kale
Yash Maurya
Zhiwei Steven Wu
Virginia Smith
MU
422
49
0
03 Oct 2024
Towards Understanding Sycophancy in Language Models
Mrinank Sharma
Meg Tong
Tomasz Korbak
David Duvenaud
Amanda Askell
...
Oliver Rausch
Nicholas Schiefer
Da Yan
Miranda Zhang
Ethan Perez
1.2K
707
0
20 Oct 2023
Baichuan 2: Open Large-scale Language Models
Ai Ming Yang
Bin Xiao
Bingning Wang
Borong Zhang
Ce Bian
...
Youxin Jiang
Yuchen Gao
Yupeng Zhang
Guosheng Dong
Zhiying Wu
ELM
LRM
998
966
0
19 Sep 2023