Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

International Conference on Learning Representations (ICLR), 2025
11 October 2024
Jingyu Zhang
Ahmed Elgohary
Ahmed Magooda
Daniel Khashabi
Benjamin Van Durme
ArXiv (abs) · PDF · HTML · HuggingFace (14 upvotes) · GitHub (178★)

Papers citing "Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements"

Showing 50 of 75 citing papers (page 1 of 2).
Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
Prasoon Varshney
Makesh Narsimhan Sreedhar
Liwei Jiang
Traian Rebedea
Christopher Parisien
167
0
0
07 Nov 2025
Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng
Vidhisha Balachandran
Chan Young Park
Faeze Brahman
Sachin Kumar
LRM
317
3
0
30 Oct 2025
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang
Haozhu Wang
Eric Michael Smith
Sid Wang
Amr Sharaf
Mahesh Pasupuleti
Benjamin Van Durme
Daniel Khashabi
Jason Weston
Hongyuan Zhan
175
3
0
09 Oct 2025
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu
Yihao Quan
Zeru Shi
Zhenting Wang
Yanshu Li
Ruixiang Tang
182
1
0
05 Oct 2025
DynaGuard: A Dynamic Guardian Model With User-Defined Policies
Monte Hoover
Vatsal Baherwani
Neel Jain
Khalid Saifullah
Joseph Vincent
Chirag Jain
Melissa Kazemi Rad
C. Bayan Bruss
Ashwinee Panda
Tom Goldstein
325
1
0
02 Sep 2025
NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Birong Pan
Mayi Xu
Qiankun Pi
Jianhao Chen
Yuanyuan Zhu
Ming Zhong
T. Qian
174
2
0
13 Aug 2025
A Survey on Training-free Alignment of Large Language Models
Birong Pan
Yongqi Li
Jiasheng Si
Sibo Wei
Mayi Xu
Shen Zhou
Yuanyuan Zhu
Ming Zhong
T. Qian
3DV, LM&MA
547
2
0
12 Aug 2025
Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training
Jianfeng Si
Lin Sun
Zhewen Tan
Xiangzheng Zhang
MU
272
6
0
12 Aug 2025
PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
Han Jiang
Dongyao Zhu
Zhihua Wei
Xiaoyuan Yi
Ziang Xiao
Xing Xie
299
2
0
22 Jul 2025
Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
Nell Watson
Ahmed Amer
Evan Harris
Preeti Ravindra
Shujun Zhang
291
1
0
08 Jun 2025
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae
Hyunbin Jin
Taesup Kim
258
0
0
07 Jun 2025
Aligning VLM Assistants with Personalized Situated Cognition
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yongqi Li
Shen Zhou
Xiaohu Li
Xin Miao
Jintao Wen
...
Birong Pan
Hankun Kang
Yuanyuan Zhu
Ming Zhong
T. Qian
292
2
0
01 Jun 2025
Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models
Makesh Narsimhan Sreedhar
Traian Rebedea
Christopher Parisien
LRM
298
5
0
26 May 2025
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
Knowledge Discovery and Data Mining (KDD), 2023
Jingwei Yi
Yueqi Xie
Bin Zhu
Emre Kiciman
Guangzhong Sun
Xing Xie
Fangzhao Wu
AAML
599
211
0
28 Jan 2025
SafeWorld: Geo-Diverse Safety Alignment
Neural Information Processing Systems (NeurIPS), 2024
Da Yin
Haoyi Qiu
Kung-Hsiang Huang
Kai-Wei Chang
Nanyun Peng
412
12
0
09 Dec 2024
Backtracking Improves Generation Safety
Yiming Zhang
Jianfeng Chi
Hailey Nguyen
Kartikeya Upasani
Daniel M. Bikel
Jason Weston
Eric Michael Smith
SILM
395
27
0
22 Sep 2024
How Well Do LLMs Identify Cultural Unity in Diversity?
Jialin Li
Junli Wang
Junjie Hu
Ming Jiang
262
12
0
09 Aug 2024
Improving Context-Aware Preference Modeling for Language Models
Silviu Pitis
Ziang Xiao
Nicolas Le Roux
Alessandro Sordoni
302
23
0
20 Jul 2024
ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions
Chan Young Park
Shuyue Stella Li
Hayoung Jung
Svitlana Volkova
Tanushree Mitra
David Jurgens
Yulia Tsvetkov
273
17
0
02 Jul 2024
Decoding-Time Language Model Alignment with Multiple Objectives
Ruizhe Shi
Yifang Chen
Yushi Hu
Alisa Liu
Hannaneh Hajishirzi
Noah A. Smith
Simon Du
431
84
0
27 Jun 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
Beyza Ermis
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
410
57
0
26 Jun 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han
Kavel Rao
Allyson Ettinger
Liwei Jiang
Bill Yuchen Lin
Nathan Lambert
Yejin Choi
Nouha Dziri
461
318
0
26 Jun 2024
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake
Eunsol Choi
Greg Durrett
476
33
0
25 Jun 2024
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Tianle Li
Wei-Lin Chiang
Evan Frick
Lisa Dunlap
Tianhao Wu
Banghua Zhu
Joseph E. Gonzalez
Ion Stoica
ALM
413
411
0
17 Jun 2024
How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment
Heyan Huang
Yinghao Li
Huashan Sun
Yu Bai
Yang Gao
250
7
0
17 Jun 2024
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
Daiwei Chen
Yi Chen
Aniket Rege
Ramya Korlakai Vinayak
377
44
0
12 Jun 2024
Collective Constitutional AI: Aligning a Language Model with Public Input
Saffron Huang
Divya Siddarth
Liane Lovitt
Thomas I. Liao
Esin Durmus
Alex Tamkin
Deep Ganguli
ELM
463
163
0
12 Jun 2024
Is In-Context Learning Sufficient for Instruction Following in LLMs?
Hao Zhao
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
666
22
0
30 May 2024
Normative Modules: A Generative Agent Architecture for Learning Norms that Supports Multi-Agent Cooperation
Atrisha Sarkar
Andrei Ioan Muresanu
Carter Blair
Aaryam Sharma
Rakshit S Trivedi
Gillian K Hadfield
360
6
0
29 May 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
435
300
0
19 Apr 2024
Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback
Vincent Conitzer
Rachel Freedman
J. Heitzig
Wesley H. Holliday
Bob M. Jacobs
...
Eric Pacuit
Stuart Russell
Hailey Schoelkopf
Emanuel Tewolde
W. Zwicker
409
75
0
16 Apr 2024
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge
Yu Ying Chiu
Amirhossein Ajalloeian
Maria Antoniak
Chan Young Park
Shuyue Stella Li
Mehar Bhatia
Sahithya Ravi
Yulia Tsvetkov
Vered Shwartz
Yejin Choi
248
35
0
10 Apr 2024
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
Yiju Guo
Ganqu Cui
Lifan Yuan
Ning Ding
Jiexin Wang
...
Ruobing Xie
Jie Zhou
Yankai Lin
Zhiyuan Liu
Maosong Sun
360
110
0
29 Feb 2024
Investigating Cultural Alignment of Large Language Models
Badr AlKhamissi
Muhammad N. ElNokrashy
Mai AlKhamissi
Mona T. Diab
492
150
0
20 Feb 2024
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Rui Yang
Xiaoman Pan
Feng Luo
Delin Qu
Han Zhong
Dong Yu
Jianshu Chen
651
138
0
15 Feb 2024
Suppressing Pink Elephants with Direct Principle Feedback
Louis Castricato
Nathan Lile
Suraj Anand
Hailey Schoelkopf
Siddharth Verma
Stella Biderman
302
13
0
12 Feb 2024
CultureLLM: Incorporating Cultural Differences into Large Language Models
Cheng-rong Li
Mengzhou Chen
Yongfeng Zhang
Sunayana Sitaram
Xing Xie
VLM
346
67
0
09 Feb 2024
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Bill Yuchen Lin
Abhilasha Ravichander
Ximing Lu
Nouha Dziri
Melanie Sclar
Khyathi Chandu
Chandra Bhagavatula
Yejin Choi
321
297
0
04 Dec 2023
Cultural Bias and Cultural Alignment of Large Language Models
PNAS Nexus, 2023
Yan Tao
Olga Viberg
Ryan S. Baker
René F. Kizilcec
ELM
547
275
0
23 Nov 2023
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
Bertie Vidgen
Nino Scherrer
Hannah Rose Kirk
Rebecca Qian
Anand Kannappan
Scott A. Hale
Paul Röttger
ALM, ELM
541
55
0
14 Nov 2023
Removing RLHF Protections in GPT-4 via Fine-Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Qiusi Zhan
Richard Fang
R. Bindu
Akul Gupta
Tatsunori Hashimoto
Daniel Kang
MU, AAML
369
162
0
09 Nov 2023
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
International Conference on Learning Representations (ICLR), 2023
Sam Toyer
Olivia Watkins
Ethan Mendes
Justin Svegliato
Luke Bailey
...
Karim Elmaaroufi
Pieter Abbeel
Trevor Darrell
Alan Ritter
Stuart J. Russell
435
120
0
02 Nov 2023
Controlled Decoding from Language Models
International Conference on Machine Learning (ICML), 2023
Sidharth Mudgal
Jong Lee
H. Ganapathy
Yaguang Li
Tao Wang
...
Michael Collins
Trevor Strohman
Jilin Chen
Alex Beutel
Ahmad Beirami
562
127
0
25 Oct 2023
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging
Joel Jang
Seungone Kim
Bill Yuchen Lin
Yizhong Wang
Jack Hessel
Luke Zettlemoyer
Hannaneh Hajishirzi
Yejin Choi
Prithviraj Ammanabrolu
MoMe
384
245
0
17 Oct 2023
Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
H. Deng
Colin Raffel
568
79
0
14 Oct 2023
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yi Dong
Zhilin Wang
Makesh Narsimhan Sreedhar
Xianchao Wu
Oleksii Kuchaiev
ALM, LLMSV
371
106
0
09 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
International Conference on Learning Representations (ICLR), 2023
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
487
1,058
0
05 Oct 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
International Conference on Learning Representations (ICLR), 2023
Federico Bianchi
Mirac Suzgun
Giuseppe Attanasio
Paul Röttger
Dan Jurafsky
Tatsunori Hashimoto
James Zou
ALM, LM&MA, LRM
400
362
0
14 Sep 2023
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
AAAI Conference on Artificial Intelligence (AAAI), 2023
Taylor Sorensen
Liwei Jiang
Jena D. Hwang
Sydney Levine
Valentina Pyatkin
...
Kavel Rao
Chandra Bhagavatula
Maarten Sap
J. Tasioulas
Yejin Choi
SLR
598
108
0
02 Sep 2023
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
Xiaochuang Han
171
21
0
08 Aug 2023