Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

International Conference on Learning Representations (ICLR), 2025
11 October 2024
Jingyu Zhang
Ahmed Elgohary
Ahmed Magooda
Daniel Khashabi
Benjamin Van Durme
ArXiv (abs) · PDF · HTML · HuggingFace (14 upvotes) · GitHub (178★)

Papers citing "Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements"

Showing 50 of 75 citing papers (page 1 of 2).
Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
Prasoon Varshney
Makesh Narsimhan Sreedhar
Liwei Jiang
Traian Rebedea
Christopher Parisien
167
0
0
07 Nov 2025
Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng
Vidhisha Balachandran
Chan Young Park
Faeze Brahman
Sachin Kumar
LRM
317
3
0
30 Oct 2025
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang
Haozhu Wang
Eric Michael Smith
Sid Wang
Amr Sharaf
Mahesh Pasupuleti
Benjamin Van Durme
Daniel Khashabi
Jason Weston
Hongyuan Zhan
175
3
0
09 Oct 2025
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu
Yihao Quan
Zeru Shi
Zhenting Wang
Yanshu Li
Ruixiang Tang
182
1
0
05 Oct 2025
DynaGuard: A Dynamic Guardian Model With User-Defined Policies
Monte Hoover
Vatsal Baherwani
Neel Jain
Khalid Saifullah
Joseph Vincent
Chirag Jain
Melissa Kazemi Rad
C. Bayan Bruss
Ashwinee Panda
Tom Goldstein
325
1
0
02 Sep 2025
NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Birong Pan
Mayi Xu
Qiankun Pi
Jianhao Chen
Yuanyuan Zhu
Ming Zhong
T. Qian
174
2
0
13 Aug 2025
A Survey on Training-free Alignment of Large Language Models
Birong Pan
Yongqi Li
Jiasheng Si
Sibo Wei
Mayi Xu
Shen Zhou
Yuanyuan Zhu
Ming Zhong
T. Qian
3DV, LM&MA
547
2
0
12 Aug 2025
Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training
Jianfeng Si
Lin Sun
Zhewen Tan
Xiangzheng Zhang
MU
272
6
0
12 Aug 2025
PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
Han Jiang
Dongyao Zhu
Zhihua Wei
Xiaoyuan Yi
Ziang Xiao
Xing Xie
299
2
0
22 Jul 2025
Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
Nell Watson
Ahmed Amer
Evan Harris
Preeti Ravindra
Shujun Zhang
291
1
0
08 Jun 2025
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae
Hyunbin Jin
Taesup Kim
258
0
0
07 Jun 2025
Aligning VLM Assistants with Personalized Situated Cognition
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yongqi Li
Shen Zhou
Xiaohu Li
Xin Miao
Jintao Wen
...
Birong Pan
Hankun Kang
Yuanyuan Zhu
Ming Zhong
T. Qian
292
2
0
01 Jun 2025
Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models
Makesh Narsimhan Sreedhar
Traian Rebedea
Christopher Parisien
LRM
298
5
0
26 May 2025
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
Knowledge Discovery and Data Mining (KDD), 2023
Jingwei Yi
Yueqi Xie
Bin Zhu
Emre Kiciman
Guangzhong Sun
Xing Xie
Fangzhao Wu
AAML
599
211
0
28 Jan 2025
SafeWorld: Geo-Diverse Safety Alignment
Neural Information Processing Systems (NeurIPS), 2024
Da Yin
Haoyi Qiu
Kung-Hsiang Huang
Kai-Wei Chang
Nanyun Peng
412
12
0
09 Dec 2024
Backtracking Improves Generation Safety
Yiming Zhang
Jianfeng Chi
Hailey Nguyen
Kartikeya Upasani
Daniel M. Bikel
Jason Weston
Eric Michael Smith
SILM
395
27
0
22 Sep 2024
How Well Do LLMs Identify Cultural Unity in Diversity?
Jialin Li
Junli Wang
Junjie Hu
Ming Jiang
262
12
0
09 Aug 2024
Improving Context-Aware Preference Modeling for Language Models
Silviu Pitis
Ziang Xiao
Nicolas Le Roux
Alessandro Sordoni
302
23
0
20 Jul 2024
ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions
Chan Young Park
Shuyue Stella Li
Hayoung Jung
Svitlana Volkova
Tanushree Mitra
David Jurgens
Yulia Tsvetkov
273
17
0
02 Jul 2024
Decoding-Time Language Model Alignment with Multiple Objectives
Ruizhe Shi
Yifang Chen
Yushi Hu
Alisa Liu
Hannaneh Hajishirzi
Noah A. Smith
Simon Du
431
84
0
27 Jun 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
Beyza Ermis
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
410
57
0
26 Jun 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han
Kavel Rao
Allyson Ettinger
Liwei Jiang
Bill Yuchen Lin
Nathan Lambert
Yejin Choi
Nouha Dziri
461
318
0
26 Jun 2024
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake
Eunsol Choi
Greg Durrett
476
33
0
25 Jun 2024
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Tianle Li
Wei-Lin Chiang
Evan Frick
Lisa Dunlap
Tianhao Wu
Banghua Zhu
Joseph E. Gonzalez
Ion Stoica
ALM
413
411
0
17 Jun 2024
How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment
Heyan Huang
Yinghao Li
Huashan Sun
Yu Bai
Yang Gao
250
7
0
17 Jun 2024
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
Daiwei Chen
Yi Chen
Aniket Rege
Ramya Korlakai Vinayak
377
44
0
12 Jun 2024
Collective Constitutional AI: Aligning a Language Model with Public Input
Saffron Huang
Divya Siddarth
Liane Lovitt
Thomas I. Liao
Esin Durmus
Alex Tamkin
Deep Ganguli
ELM
463
163
0
12 Jun 2024
Is In-Context Learning Sufficient for Instruction Following in LLMs?
Hao Zhao
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
666
22
0
30 May 2024
Normative Modules: A Generative Agent Architecture for Learning Norms that Supports Multi-Agent Cooperation
Atrisha Sarkar
Andrei Ioan Muresanu
Carter Blair
Aaryam Sharma
Rakshit S Trivedi
Gillian K Hadfield
360
6
0
29 May 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
435
300
0
19 Apr 2024
Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback
Vincent Conitzer
Rachel Freedman
J. Heitzig
Wesley H. Holliday
Bob M. Jacobs
...
Eric Pacuit
Stuart Russell
Hailey Schoelkopf
Emanuel Tewolde
W. Zwicker
409
75
0
16 Apr 2024
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge
Yu Ying Chiu
Amirhossein Ajalloeian
Maria Antoniak
Chan Young Park
Shuyue Stella Li
Mehar Bhatia
Sahithya Ravi
Yulia Tsvetkov
Vered Shwartz
Yejin Choi
248
35
0
10 Apr 2024
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
Yiju Guo
Ganqu Cui
Lifan Yuan
Ning Ding
Jiexin Wang
...
Ruobing Xie
Jie Zhou
Yankai Lin
Zhiyuan Liu
Maosong Sun
360
110
0
29 Feb 2024
Investigating Cultural Alignment of Large Language Models
Badr AlKhamissi
Muhammad N. ElNokrashy
Mai AlKhamissi
Mona T. Diab
492
150
0
20 Feb 2024
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Rui Yang
Xiaoman Pan
Feng Luo
Delin Qu
Han Zhong
Dong Yu
Jianshu Chen
651
138
0
15 Feb 2024
Suppressing Pink Elephants with Direct Principle Feedback
Louis Castricato
Nathan Lile
Suraj Anand
Hailey Schoelkopf
Siddharth Verma
Stella Biderman
302
13
0
12 Feb 2024
CultureLLM: Incorporating Cultural Differences into Large Language Models
Cheng-rong Li
Mengzhou Chen
Yongfeng Zhang
Sunayana Sitaram
Xing Xie
VLM
346
67
0
09 Feb 2024
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Bill Yuchen Lin
Abhilasha Ravichander
Ximing Lu
Nouha Dziri
Melanie Sclar
Khyathi Chandu
Chandra Bhagavatula
Yejin Choi
321
297
0
04 Dec 2023
Cultural Bias and Cultural Alignment of Large Language Models
PNAS Nexus, 2023
Yan Tao
Olga Viberg
Ryan S. Baker
René F. Kizilcec
ELM
547
275
0
23 Nov 2023
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
Bertie Vidgen
Nino Scherrer
Hannah Rose Kirk
Rebecca Qian
Anand Kannappan
Scott A. Hale
Paul Röttger
ALM, ELM
541
55
0
14 Nov 2023
Removing RLHF Protections in GPT-4 via Fine-Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Qiusi Zhan
Richard Fang
R. Bindu
Akul Gupta
Tatsunori Hashimoto
Daniel Kang
MU, AAML
369
162
0
09 Nov 2023
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
International Conference on Learning Representations (ICLR), 2023
Sam Toyer
Olivia Watkins
Ethan Mendes
Justin Svegliato
Luke Bailey
...
Karim Elmaaroufi
Pieter Abbeel
Trevor Darrell
Alan Ritter
Stuart J. Russell
435
120
0
02 Nov 2023
Controlled Decoding from Language Models
International Conference on Machine Learning (ICML), 2023
Sidharth Mudgal
Jong Lee
H. Ganapathy
Yaguang Li
Tao Wang
...
Michael Collins
Trevor Strohman
Jilin Chen
Alex Beutel
Ahmad Beirami
562
127
0
25 Oct 2023
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging
Joel Jang
Seungone Kim
Bill Yuchen Lin
Yizhong Wang
Jack Hessel
Luke Zettlemoyer
Hannaneh Hajishirzi
Yejin Choi
Prithviraj Ammanabrolu
MoMe
384
245
0
17 Oct 2023
Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
H. Deng
Colin Raffel
568
79
0
14 Oct 2023
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yi Dong
Zhilin Wang
Makesh Narsimhan Sreedhar
Xianchao Wu
Oleksii Kuchaiev
ALM, LLMSV
371
106
0
09 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
International Conference on Learning Representations (ICLR), 2023
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
487
1,058
0
05 Oct 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
International Conference on Learning Representations (ICLR), 2023
Federico Bianchi
Mirac Suzgun
Giuseppe Attanasio
Paul Röttger
Dan Jurafsky
Tatsunori Hashimoto
James Zou
ALM, LM&MA, LRM
400
362
0
14 Sep 2023
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
AAAI Conference on Artificial Intelligence (AAAI), 2023
Taylor Sorensen
Liwei Jiang
Jena D. Hwang
Sydney Levine
Valentina Pyatkin
...
Kavel Rao
Chandra Bhagavatula
Maarten Sap
J. Tasioulas
Yejin Choi
SLR
598
108
0
02 Sep 2023
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
Xiaochuang Han
171
21
0
08 Aug 2023