ResearchTrend.AI
RAIN: Your Language Models Can Align Themselves without Finetuning

International Conference on Learning Representations (ICLR), 2024
13 September 2023
Yuhui Li
Fangyun Wei
Jinjing Zhao
Chao Zhang
Hongyang R. Zhang
    SILM
ArXiv (abs) · PDF · HTML · HuggingFace (3 upvotes)

Papers citing "RAIN: Your Language Models Can Align Themselves without Finetuning"

50 / 114 papers shown
SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security
Wei Zhao
Zhe Li
Jun Sun
AAML
196
0
0
04 Dec 2025
Factors That Support Grounded Responses in LLM Conversations: A Rapid Review
Gabriele Cesar Iwashima
Claudia Susie Rodrigues
Claudio Dipolitto
Geraldo Xexéo
95
0
0
24 Nov 2025
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren
Shahar Katz
Lior Wolf
AAML
236
2
0
15 Nov 2025
Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
Sekitoshi Kanai
Tsukasa Yoshida
Hiroshi Takahashi
Haru Kuroki
Kazumune Hashimoto
145
0
0
30 Oct 2025
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong
Shuya Feng
Nima Naderloui
Shenao Yan
Jingyu Zhang
Biying Liu
Ali Arastehfard
Heqing Huang
Yuan Hong
AAML
296
2
0
17 Oct 2025
Proactive defense against LLM Jailbreak
Weiliang Zhao
Jinjun Peng
Daniel Ben-Levi
Zhou Yu
Junfeng Yang
AAML
203
2
0
06 Oct 2025
Kwai Keye-VL 1.5 Technical Report
Biao Yang
Bin Wen
Boyang Ding
Changyi Liu
Chenglong Chu
...
S. Wang
X. Luo
Yan Li
Yuhang Hu
Zixing Zhang
VLM
377
32
0
01 Sep 2025
SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Xiangman Li
Xiaodong Wu
Qi Li
Jianbing Ni
Rongxing Lu
AAML · MU · KELM
118
1
0
21 Aug 2025
Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Sajib Biswas
Mao Nishino
Samuel Jacob Chacko
Xiuwen Liu
AAML
225
2
0
20 Aug 2025
A Survey on Training-free Alignment of Large Language Models
Birong Pan
Yongqi Li
Jiasheng Si
Sibo Wei
Mayi Xu
Shen Zhou
Yuanyuan Zhu
Ming Zhong
T. Qian
3DV · LM&MA
533
2
0
12 Aug 2025
P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis
Feifan Song
Bofei Gao
Yifan Song
Yi Liu
Weimin Xiong
Yuyang Song
Tianyu Liu
Guoyin Wang
Houfeng Wang
ALM · LLMSV
225
1
0
06 Aug 2025
PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
Yelim Ahn
Jaejin Lee
AAML
86
1
0
02 Aug 2025
SDD: Self-Degraded Defense against Malicious Fine-tuning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
ZiXuan Chen
Weikai Lu
Xin Lin
Ziqian Zeng
AAML
201
7
0
27 Jul 2025
PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
Han Jiang
Dongyao Zhu
Zhihua Wei
Xiaoyuan Yi
Ziang Xiao
Xing Xie
283
1
0
22 Jul 2025
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda
Chunpeng Ma
Masayuki Asahara
390
6
0
11 Jun 2025
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Feifan Song
Shaohang Wei
Wen Luo
Yuxuan Fan
Tianyu Liu
Guoyin Wang
Houfeng Wang
258
5
0
09 Jun 2025
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
Shaona Ghosh
Amrita Bhattacharjee
Yftah Ziser
Christopher Parisien
LLMSV
381
8
0
01 Jun 2025
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
Kundan Krishna
Joseph Y Cheng
Charles Maalouf
Leon A Gatys
354
2
0
30 May 2025
LLM Agents Should Employ Security Principles
Kaiyuan Zhang
Zian Su
Pin-Yu Chen
E. Bertino
Xiangyu Zhang
Ninghui Li
LLMAG
403
15
0
29 May 2025
Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Y. Zhang
Yu Yu
Bo Tang
Yu Zhu
Chuxiong Sun
...
Jie Hu
Zipeng Xie
Zhiyu Li
Feiyu Xiong
Edward Chung
517
0
0
26 May 2025
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
Zhexin Zhang
Xian Qi Loye
Victor Shea-Jay Huang
Junxiao Yang
Qi Zhu
...
Fei Mi
Lifeng Shang
Yingkang Wang
Hongning Wang
Shiyu Huang
LRM
384
16
0
21 May 2025
Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models
Md Rafi Ur Rashid
Vishnu Asutosh Dasu
Ye Wang
Gang Tan
Shagufta Mehnaz
AAML · ELM
443
0
0
20 May 2025
LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities
Kalyan Nakka
Jimmy Dani
Ausmit Mondal
Nitesh Saxena
AAML
282
0
0
08 May 2025
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Michael A. Hedderich
Anyi Wang
Raoyuan Zhao
Florian Eichin
Jonas Fischer
Barbara Plank
405
4
0
22 Apr 2025
Geneshift: Impact of different scenario shift on Jailbreaking LLM
Tianyi Wu
Zhiwei Xue
Yue Liu
Jiaheng Zhang
Bryan Hooi
See-Kiong Ng
398
2
0
10 Apr 2025
A Survey on Personalized and Pluralistic Preference Alignment in Large Language Models
Zhouhang Xie
Junda Wu
Yiran Shen
Yu Xia
Xintong Li
...
Sachin Kumar
Bodhisattwa Prasad Majumder
Jingbo Shang
Prithviraj Ammanabrolu
Julian McAuley
501
10
0
09 Apr 2025
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Yiwei Chen
Yuguang Yao
Yihua Zhang
Bingquan Shen
Gaowen Liu
Sijia Liu
AAML · MU
468
9
0
14 Mar 2025
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation
Wenhui Zhang
Huiyu Xu
Peng Kuang
Zeqing He
Ziqi Zhu
Kui Ren
AAML · PILM
260
5
0
09 Mar 2025
DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models
Ruizhe Chen
Wenhao Chai
Zhifei Yang
Xiaotian Zhang
Qiufeng Wang
Tony Q.S. Quek
Soujanya Poria
Zuozhu Liu
565
3
0
06 Mar 2025
Test-Time Alignment for Large Language Models via Textual Model Predictive Control
Kuang-Da Wang
Teng-Ruei Chen
Yu-Heng Hung
Shuoyang Ding
Yueh-Hua Wu
Yu-Chun Wang
Chao-Han Huck Yang
Wen-Chih Peng
Ping-Chun Hsieh
401
0
0
28 Feb 2025
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng
Xiaolong Jin
Jinyuan Jia
Xinsong Zhang
AAML
873
22
0
27 Feb 2025
Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs
International Conference on Learning Representations (ICLR), 2025
Zhaowei Zhang
Fengshuo Bai
Qizhi Chen
Chengdong Ma
Mingzhi Wang
Haoran Sun
Zilong Zheng
Wenbo Ding
703
23
0
26 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
Volkan Cevher
AAML
343
7
0
24 Feb 2025
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Zhexin Zhang
Leqi Lei
Junxiao Yang
Xijie Huang
Yida Lu
...
Xianqi Lei
Changzai Pan
Lei Sha
Han Wang
Shiyu Huang
AAML
275
11
0
24 Feb 2025
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Somnath Banerjee
Sayan Layek
Pratyush Chatterjee
Animesh Mukherjee
Rima Hazra
LLMSV
436
5
0
16 Feb 2025
Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions
Jingxin Xu
Guoshun Nan
Sheng Guan
Sicong Leng
Wenshu Fan
Zixiao Wang
Yuyang Ma
Zhili Zhou
Yanzhao Hou
Xiaofeng Tao
LM&MA
368
2
0
08 Feb 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Neural Information Processing Systems (NeurIPS), 2024
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
465
50
0
28 Jan 2025
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
AAAI Conference on Artificial Intelligence (AAAI), 2025
Jonathan Nöther
Adish Singla
Goran Radanović
AAML
472
4
0
14 Jan 2025
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Yang Ouyang
Hengrui Gu
Shuhang Lin
Qingfeng Lan
Jie Peng
B. Kailkhura
Tianlong Chen
Kaixiong Zhou
AAML
371
10
0
05 Jan 2025
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM · AILaw
1.3K
388
0
25 Nov 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan
Yanjiang Liu
Xinyu Lu
Boxi Cao
Xianpei Han
...
Le Sun
Jie Lou
Bowen Yu
Yaojie Lu
Hongyu Lin
ALM
647
9
0
18 Nov 2024
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Somanshu Singla
Zhen Wang
Tianyang Liu
Abdullah Ashfaq
Zhiting Hu
Eric Xing
374
14
0
13 Nov 2024
SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jiawei Zhao
Kejiang Chen
Weinan Zhang
Nenghai Yu
AAML
675
8
0
03 Nov 2024
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization
Xiyue Peng
Hengquan Guo
Jiawei Zhang
Dongqing Zou
Ziyu Shao
Honghao Wei
Xin Liu
391
6
0
25 Oct 2024
Adversarial Attacks on Large Language Models Using Regularized Relaxation
Samuel Jacob Chacko
Sajib Biswas
Chashi Mahiul Islam
Fatema Tabassum Liza
Xiuwen Liu
AAML
277
10
0
24 Oct 2024
LLMScan: Causal Scan for LLM Misbehavior Detection
Mengdi Zhang
Kai Kiat Goh
Peixin Zhang
Jun Sun
Rose Lin Xin
Hongyu Zhang
740
6
0
22 Oct 2024
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Jiahao Qiu
Yifu Lu
Yifan Zeng
Jiacheng Guo
Jiayi Geng
...
Ling Yang
Kaixuan Huang
Yue Wu
Mengdi Wang
582
57
0
18 Oct 2024
SPIN: Self-Supervised Prompt INjection
Leon Zhou
Junfeng Yang
Chengzhi Mao
AAML · SILM
291
1
0
17 Oct 2024
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu
Yue Feng
Zhao Xu
Lixin Su
Xinyu Ma
D. Yin
Hao Liu
ELM
348
39
0
11 Oct 2024
FlipAttack: Jailbreak LLMs via Flipping
Yue Liu
Xiaoxin He
Miao Xiong
Jinlan Fu
Shumin Deng
Bryan Hooi
AAML
270
55
0
02 Oct 2024
Page 1 of 3