2505.17306
Refusal Direction is Universal Across Safety-Aligned Languages
22 May 2025
Xinpeng Wang
Mingyang Wang
Yihong Liu
Hinrich Schütze
Barbara Plank
Papers citing
"Refusal Direction is Universal Across Safety-Aligned Languages"
43 / 43 papers shown
Crosslingual Reasoning through Test-Time Scaling
Zheng-Xin Yong
Muhammad Farid Adilazuarda
Jonibek Mansurov
Ruochen Zhang
Niklas Muennighoff
Carsten Eickhoff
Genta Indra Winata
Julia Kreutzer
Stephen H. Bach
Alham Fikri Aji
LRM
ELM
979
31
0
08 May 2025
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Mingyang Wang
Heike Adel
Lukas Lange
Yihong Liu
Ercong Nie
Jannik Strötgen
Hinrich Schütze
HILM
307
27
0
05 Apr 2025
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
Nikhil Verma
Manasa Bharadwaj
288
3
0
03 Apr 2025
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
295
53
0
24 Feb 2025
Robustness of Large Language Models Against Adversarial Attacks
Yiyi Tao
Yixian Shen
Hang Zhang
Yanxin Shen
Lun Wang
Chuanqi Shi
Shaoshuai Du
AAML
269
14
0
22 Dec 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
International Conference on Learning Representations (ICLR), 2024
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
442
24
0
04 Oct 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Shu Yang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
388
49
0
12 Jul 2024
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Orgest Xhelili
Yihong Liu
Hinrich Schütze
307
14
0
28 Jun 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han
Kavel Rao
Allyson Ettinger
Liwei Jiang
Bill Yuchen Lin
Nathan Lambert
Yejin Choi
Nouha Dziri
383
234
0
26 Jun 2024
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie
Xiangyu Qi
Yi Zeng
Yangsibo Huang
Udari Madhushani Sehwag
...
Bo Li
Kai Li
Danqi Chen
Peter Henderson
Prateek Mittal
ALM
ELM
431
141
0
20 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
394
430
0
17 Jun 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
355
238
0
19 Apr 2024
mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?
Tianze Hua
Tian Yun
Ellie Pavlick
LRM
193
16
0
18 Apr 2024
Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
Hongshen Xu
Zichen Zhu
Situo Zhang
Da Ma
Shuai Fan
Lu Chen
Kai Yu
HILM
335
59
0
27 Mar 2024
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
Chris Wendler
V. Veselovsky
Giovanni Monea
Robert West
566
226
0
16 Feb 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
...
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
AAML
363
753
0
06 Feb 2024
A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
Jie Li
Yi Liu
Chongyang Liu
Ling Shi
Xiaoning Ren
Yaowen Zheng
Yang Liu
Yinxing Xue
AAML
281
39
0
30 Jan 2024
The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Lingfeng Shen
Weiting Tan
Sihao Chen
Yunmo Chen
Jingyu Zhang
Haoran Xu
Boyuan Zheng
Philipp Koehn
Daniel Khashabi
198
69
0
23 Jan 2024
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yihong Liu
Chunlan Ma
Haotian Ye
Hinrich Schütze
282
3
0
12 Jan 2024
Removing RLHF Protections in GPT-4 via Fine-Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Qiusi Zhan
Richard Fang
R. Bindu
Akul Gupta
Tatsunori Hashimoto
Daniel Kang
MU
AAML
335
143
0
09 Nov 2023
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai
Xuehai Pan
Ruiyang Sun
Jiaming Ji
Xinbo Xu
Mickel Liu
Yizhou Wang
Yaodong Yang
418
540
0
19 Oct 2023
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
International Conference on Learning Representations (ICLR), 2023
Yangsibo Huang
Samyak Gupta
Mengzhou Xia
Kai Li
Danqi Chen
AAML
260
414
0
10 Oct 2023
Multilingual Jailbreak Challenges in Large Language Models
International Conference on Learning Representations (ICLR), 2023
Yue Deng
Wenxuan Zhang
Sinno Jialin Pan
Lidong Bing
AAML
509
196
0
10 Oct 2023
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang
Xiao Wang
Tao Gui
Linda R. Petzold
William Y. Wang
Xun Zhao
Dahua Lin
229
252
0
04 Oct 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
International Conference on Learning Representations (ICLR), 2023
Federico Bianchi
Mirac Suzgun
Giuseppe Attanasio
Paul Röttger
Dan Jurafsky
Tatsunori Hashimoto
James Zou
ALM
LM&MA
LRM
316
328
0
14 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
647
2,367
0
27 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Neural Information Processing Systems (NeurIPS), 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
773
1,467
0
05 Jul 2023
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Yi Liu
Gelei Deng
Yulong Shen
Yuekang Li
Yaowen Zheng
Ying Zhang
Lida Zhao
Tianwei Zhang
Kailong Wang
Yang Liu
430
617
0
23 May 2023
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
ACM Transactions on Knowledge Discovery from Data (TKDD), 2023
Jingfeng Yang
Hongye Jin
Ruixiang Tang
Xiaotian Han
Qizhang Feng
Haoming Jiang
Bing Yin
Helen Zhou
LM&MA
433
940
0
26 Apr 2023
The Geometry of Multilingual Language Model Representations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Tyler A. Chang
Zhuowen Tu
Benjamin Bergen
359
83
0
22 May 2022
Training language models to follow instructions with human feedback
Neural Information Processing Systems (NeurIPS), 2022
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
2.1K
17,754
0
04 Mar 2022
Smoothed Contrastive Learning for Unsupervised Sentence Embedding
International Conference on Computational Linguistics (COLING), 2021
Xing Wu
Chaochen Gao
Liangjun Zang
Jizhong Han
Zhongyuan Wang
Songlin Hu
SSL
AILaw
245
28
0
09 Sep 2021
On Learning Universal Representations Across Languages
International Conference on Learning Representations (ICLR), 2020
Xiangpeng Wei
Rongxiang Weng
Yue Hu
Luxi Xing
Heng Yu
Weihua Luo
SSL
VLM
422
91
0
31 Jul 2020
On the Language Neutrality of Pre-trained Multilingual Representations
Findings of EMNLP (Findings), 2020
Jindrich Libovický
Rudolf Rosa
Kangyang Luo
466
115
0
09 Apr 2020
Unsupervised Cross-lingual Representation Learning at Scale
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Alexis Conneau
Kartikay Khandelwal
Naman Goyal
Vishrav Chaudhary
Guillaume Wenzek
Francisco Guzmán
Edouard Grave
Myle Ott
Luke Zettlemoyer
Veselin Stoyanov
499
7,725
0
05 Nov 2019
On the Cross-lingual Transferability of Monolingual Representations
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Mikel Artetxe
Sebastian Ruder
Dani Yogatama
650
851
0
25 Oct 2019
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Nils Reimers
Iryna Gurevych
2.0K
15,707
0
27 Aug 2019
How multilingual is Multilingual BERT?
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Telmo Pires
Eva Schlinger
Dan Garrette
LRM
VLM
548
1,592
0
04 Jun 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
3.0K
109,193
0
11 Oct 2018
Unsupervised Machine Translation Using Monolingual Corpora Only
Guillaume Lample
Alexis Conneau
Ludovic Denoyer
Marc'Aurelio Ranzato
SSL
505
1,130
0
31 Oct 2017
Word Translation Without Parallel Data
Alexis Conneau
Guillaume Lample
Marc'Aurelio Ranzato
Ludovic Denoyer
Edouard Grave
966
1,730
0
11 Oct 2017
Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
4.4K
163,656
0
12 Jun 2017
Deep reinforcement learning from human preferences
Neural Information Processing Systems (NeurIPS), 2017
Paul Christiano
Jan Leike
Tom B. Brown
Miljan Martic
Shane Legg
Dario Amodei
1.6K
4,461
0
12 Jun 2017