Refusal Direction is Universal Across Safety-Aligned Languages


22 May 2025
Xinpeng Wang
Mingyang Wang
Yihong Liu
Hinrich Schütze
Barbara Plank

Papers citing "Refusal Direction is Universal Across Safety-Aligned Languages"

43 / 43 papers shown
Crosslingual Reasoning through Test-Time Scaling
Zheng-Xin Yong
Muhammad Farid Adilazuarda
Jonibek Mansurov
Ruochen Zhang
Niklas Muennighoff
Carsten Eickhoff
Genta Indra Winata
Julia Kreutzer
Stephen H. Bach
Alham Fikri Aji
LRM, ELM
979
31
0
08 May 2025
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Mingyang Wang
Heike Adel
Lukas Lange
Yihong Liu
Ercong Nie
Jannik Strötgen
Hinrich Schütze
HILM
307
27
0
05 Apr 2025
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
Nikhil Verma
Manasa Bharadwaj
288
3
0
03 Apr 2025
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
295
53
0
24 Feb 2025
Robustness of Large Language Models Against Adversarial Attacks
Yiyi Tao
Yixian Shen
Hang Zhang
Yanxin Shen
Lun Wang
Chuanqi Shi
Shaoshuai Du
AAML
269
14
0
22 Dec 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
International Conference on Learning Representations (ICLR), 2024
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
442
24
0
04 Oct 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Shu Yang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
388
49
0
12 Jul 2024
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Orgest Xhelili
Yihong Liu
Hinrich Schütze
307
14
0
28 Jun 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han
Kavel Rao
Allyson Ettinger
Liwei Jiang
Bill Yuchen Lin
Nathan Lambert
Yejin Choi
Nouha Dziri
383
234
0
26 Jun 2024
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie
Xiangyu Qi
Yi Zeng
Yangsibo Huang
Udari Madhushani Sehwag
...
Bo Li
Kai Li
Danqi Chen
Peter Henderson
Prateek Mittal
ALM, ELM
431
141
0
20 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
394
430
0
17 Jun 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
355
238
0
19 Apr 2024
mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?
Tianze Hua
Tian Yun
Ellie Pavlick
LRM
193
16
0
18 Apr 2024
Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
Hongshen Xu
Zichen Zhu
Situo Zhang
Da Ma
Shuai Fan
Lu Chen
Kai Yu
HILM
335
59
0
27 Mar 2024
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
Chris Wendler
V. Veselovsky
Giovanni Monea
Robert West
566
226
0
16 Feb 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
...
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
AAML
363
753
0
06 Feb 2024
A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
Jie Li
Yi Liu
Chongyang Liu
Ling Shi
Xiaoning Ren
Yaowen Zheng
Yang Liu
Yinxing Xue
AAML
281
39
0
30 Jan 2024
The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Lingfeng Shen
Weiting Tan
Sihao Chen
Yunmo Chen
Jingyu Zhang
Haoran Xu
Boyuan Zheng
Philipp Koehn
Daniel Khashabi
198
69
0
23 Jan 2024
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yihong Liu
Chunlan Ma
Haotian Ye
Hinrich Schütze
282
3
0
12 Jan 2024
Removing RLHF Protections in GPT-4 via Fine-Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Qiusi Zhan
Richard Fang
R. Bindu
Akul Gupta
Tatsunori Hashimoto
Daniel Kang
MU, AAML
335
143
0
09 Nov 2023
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai
Xuehai Pan
Ruiyang Sun
Jiaming Ji
Xinbo Xu
Mickel Liu
Yizhou Wang
Yaodong Yang
418
540
0
19 Oct 2023
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
International Conference on Learning Representations (ICLR), 2023
Yangsibo Huang
Samyak Gupta
Mengzhou Xia
Kai Li
Danqi Chen
AAML
260
414
0
10 Oct 2023
Multilingual Jailbreak Challenges in Large Language Models
International Conference on Learning Representations (ICLR), 2023
Yue Deng
Wenxuan Zhang
Sinno Jialin Pan
Lidong Bing
AAML
509
196
0
10 Oct 2023
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang
Xiao Wang
Tao Gui
Linda R. Petzold
William Y. Wang
Xun Zhao
Dahua Lin
229
252
0
04 Oct 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
International Conference on Learning Representations (ICLR), 2023
Federico Bianchi
Mirac Suzgun
Giuseppe Attanasio
Paul Röttger
Dan Jurafsky
Tatsunori Hashimoto
James Zou
ALM, LM&MA, LRM
316
328
0
14 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
647
2,367
0
27 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Neural Information Processing Systems (NeurIPS), 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
773
1,467
0
05 Jul 2023
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Yi Liu
Gelei Deng
Yulong Shen
Yuekang Li
Yaowen Zheng
Ying Zhang
Lida Zhao
Tianwei Zhang
Kailong Wang
Yang Liu
430
617
0
23 May 2023
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
ACM Transactions on Knowledge Discovery from Data (TKDD), 2023
Jingfeng Yang
Hongye Jin
Ruixiang Tang
Xiaotian Han
Qizhang Feng
Haoming Jiang
Bing Yin
Helen Zhou
LM&MA
433
940
0
26 Apr 2023
The Geometry of Multilingual Language Model Representations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Tyler A. Chang
Zhuowen Tu
Benjamin Bergen
359
83
0
22 May 2022
Training language models to follow instructions with human feedback
Neural Information Processing Systems (NeurIPS), 2022
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM, ALM
2.1K
17,754
0
04 Mar 2022
Smoothed Contrastive Learning for Unsupervised Sentence Embedding
International Conference on Computational Linguistics (COLING), 2021
Xing Wu
Chaochen Gao
Liangjun Zang
Jizhong Han
Zhongyuan Wang
Songlin Hu
SSL, AILaw
245
28
0
09 Sep 2021
On Learning Universal Representations Across Languages
International Conference on Learning Representations (ICLR), 2020
Xiangpeng Wei
Rongxiang Weng
Yue Hu
Luxi Xing
Heng Yu
Weihua Luo
SSL, VLM
422
91
0
31 Jul 2020
On the Language Neutrality of Pre-trained Multilingual Representations
Findings (Findings), 2020
Jindrich Libovický
Rudolf Rosa
Kangyang Luo
466
115
0
09 Apr 2020
Unsupervised Cross-lingual Representation Learning at Scale
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Alexis Conneau
Kartikay Khandelwal
Naman Goyal
Vishrav Chaudhary
Guillaume Wenzek
Francisco Guzmán
Edouard Grave
Myle Ott
Luke Zettlemoyer
Veselin Stoyanov
499
7,725
0
05 Nov 2019
On the Cross-lingual Transferability of Monolingual Representations
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Mikel Artetxe
Sebastian Ruder
Dani Yogatama
650
851
0
25 Oct 2019
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Nils Reimers
Iryna Gurevych
2.0K
15,707
0
27 Aug 2019
How multilingual is Multilingual BERT?
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Telmo Pires
Eva Schlinger
Dan Garrette
LRM, VLM
548
1,592
0
04 Jun 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM, SSL, SSeg
3.0K
109,193
0
11 Oct 2018
Unsupervised Machine Translation Using Monolingual Corpora Only
Guillaume Lample
Alexis Conneau
Ludovic Denoyer
Marc'Aurelio Ranzato
SSL
505
1,130
0
31 Oct 2017
Word Translation Without Parallel Data
Alexis Conneau
Guillaume Lample
Marc'Aurelio Ranzato
Ludovic Denoyer
Edouard Grave
966
1,730
0
11 Oct 2017
Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
4.4K
163,656
0
12 Jun 2017
Deep reinforcement learning from human preferences
Neural Information Processing Systems (NeurIPS), 2017
Paul Christiano
Jan Leike
Tom B. Brown
Miljan Martic
Shane Legg
Dario Amodei
1.6K
4,461
0
12 Jun 2017