Refusal Direction is Universal Across Safety-Aligned Languages

22 May 2025

Papers citing "Refusal Direction is Universal Across Safety-Aligned Languages"

Crosslingual Reasoning through Test-Time Scaling

Zheng-Xin Yong

Muhammad Farid Adilazuarda

979

08 May 2025

Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

307

05 Apr 2025

The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Nikhil Verma

Manasa Bharadwaj

288

03 Apr 2025

Do Multilingual LLMs Think In English?

Lisa Schut

Y. Gal

Sebastian Farquhar

295

24 Feb 2025

Robustness of Large Language Models Against Adversarial Attacks

269

22 Dec 2024

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

International Conference on Learning Representations (ICLR), 2024

Xinpeng Wang

Chengzhi Hu

Paul Röttger

Barbara Plank

442

04 Oct 2024

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

388

12 Jul 2024

Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

Orgest Xhelili

Yihong Liu

Hinrich Schütze

307

28 Jun 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Bill Yuchen Lin

Nathan Lambert

Yejin Choi

Nouha Dziri

383

234

26 Jun 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Tinghao Xie

Xiangyu Qi

Yi Zeng

Yangsibo Huang

Udari Madhushani Sehwag

...

Bo Li

Kai Li

431

141

20 Jun 2024

Refusal in Language Models Is Mediated by a Single Direction

Nina Panickssery

394

430

17 Jun 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

355

238

19 Apr 2024

mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?

193

18 Apr 2024

Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback

Hongshen Xu

Kai Yu

335

27 Mar 2024

Do Llamas Work in English? On the Latent Language of Multilingual Transformers

566

226

16 Feb 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

...

363

753

06 Feb 2024

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Yang Liu

281

30 Jan 2024

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Lingfeng Shen

Jingyu Zhang

Daniel Khashabi

198

23 Jan 2024

TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

282

12 Jan 2024

Removing RLHF Protections in GPT-4 via Fine-Tuning

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Tatsunori Hashimoto

335

143

09 Nov 2023

Safe RLHF: Safe Reinforcement Learning from Human Feedback

Jiaming Ji

418

540

19 Oct 2023

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

International Conference on Learning Representations (ICLR), 2023

260

414

10 Oct 2023

Multilingual Jailbreak Challenges in Large Language Models

International Conference on Learning Representations (ICLR), 2023

509

196

10 Oct 2023

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Dahua Lin

229

252

04 Oct 2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

International Conference on Learning Representations (ICLR), 2023

Federico Bianchi

Mirac Suzgun

Giuseppe Attanasio

Paul Röttger

Dan Jurafsky

Tatsunori Hashimoto

James Zou

316

328

14 Sep 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models

J. Zico Kolter

647

2,367

27 Jul 2023

Jailbroken: How Does LLM Safety Training Fail?

Neural Information Processing Systems (NeurIPS), 2023

Alexander Wei

Nika Haghtalab

Jacob Steinhardt

773

1,467

05 Jul 2023

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Yi Liu

Lida Zhao

Kailong Wang

Yang Liu

430

617

23 May 2023

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

ACM Transactions on Knowledge Discovery from Data (TKDD), 2023

433

940

26 Apr 2023

The Geometry of Multilingual Language Model Representations

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Tyler A. Chang

Zhuowen Tu

Benjamin Bergen

359

22 May 2022

Training language models to follow instructions with human feedback

Neural Information Processing Systems (NeurIPS), 2022

Carroll L. Wainwright

...

2.1K

17,754

04 Mar 2022

Smoothed Contrastive Learning for Unsupervised Sentence Embedding

International Conference on Computational Linguistics (COLING), 2021

245

09 Sep 2021

On Learning Universal Representations Across Languages

International Conference on Learning Representations (ICLR), 2020

422

31 Jul 2020

On the Language Neutrality of Pre-trained Multilingual Representations

Findings (Findings), 2020

Jindrich Libovický

Rudolf Rosa

Kangyang Luo

466

115

09 Apr 2020

Unsupervised Cross-lingual Representation Learning at Scale

Annual Meeting of the Association for Computational Linguistics (ACL), 2019

Francisco Guzmán

Luke Zettlemoyer

499

7,725

05 Nov 2019

On the Cross-lingual Transferability of Monolingual Representations

Annual Meeting of the Association for Computational Linguistics (ACL), 2019

Mikel Artetxe

Sebastian Ruder

Dani Yogatama

650

851

25 Oct 2019

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

Nils Reimers

Iryna Gurevych

2.0K

15,707

27 Aug 2019

How multilingual is Multilingual BERT?

Annual Meeting of the Association for Computational Linguistics (ACL), 2019

548

1,592

04 Jun 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

3.0K

109,193

11 Oct 2018

Unsupervised Machine Translation Using Monolingual Corpora Only

505

1,130

31 Oct 2017

Word Translation Without Parallel Data

966

1,730

11 Oct 2017

Attention Is All You Need

Neural Information Processing Systems (NeurIPS), 2017

4.4K

163,656

12 Jun 2017

Deep reinforcement learning from human preferences

Neural Information Processing Systems (NeurIPS), 2017

1.6K

4,461

12 Jun 2017