A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

3 January 2024

Jonathan K. Kummerfeld

Rada Mihalcea

ArXiv PDF HTML

Papers citing "A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity"

20 / 20 papers shown

Title
TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation Gwen Yidou Weng Benjie Wang Guy Van den Broeck BDL 20 0 0 25 Apr 2025
Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots Erfan Shayegani G M Shahariar Sara Abdali Lei Yu Nael B. Abu-Ghazaleh Yue Dong AAML 34 0 0 01 Apr 2025
Shared Global and Local Geometry of Language Model Embeddings Andrew Lee Melanie Weber F. Viégas Martin Wattenberg FedML 45 1 0 27 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models Thomas Winninger Boussad Addad Katarzyna Kapusta AAML 59 0 0 08 Mar 2025
FADE: Why Bad Descriptions Happen to Good Features Bruno Puri Aakriti Jain Elena Golimblevskaia Patrick Kahardipraja Thomas Wiegand Wojciech Samek Sebastian Lapuschkin 35 0 0 24 Feb 2025
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint Qianli Ma Dongrui Liu Qian Chen Linfeng Zhang Jing Shao MoMe 41 0 0 24 Feb 2025
Understanding and Rectifying Safety Perception Distortion in VLMs Xiaohan Zou Jian Kang George Kesidis Lu Lin 84 0 0 18 Feb 2025
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation Vera Neplenbroek Arianna Bisazza Raquel Fernández 85 0 0 17 Feb 2025
ICLR: In-Context Learning of Representations Core Francisco Park Andrew Lee Ekdeep Singh Lubana Yongyi Yang Maya Okawa Kento Nishi Martin Wattenberg Hidenori Tanaka AIFin 105 3 0 29 Dec 2024
Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning H. Fernando Han Shen Parikshit Ram Yi Zhou Horst Samulowitz Nathalie Baracaldo Tianyi Chen CLL 44 2 0 20 Oct 2024
On the Role of Attention Heads in Large Language Model Safety Z. Zhou Haiyang Yu Xinghua Zhang Rongwu Xu Fei Huang Kun Wang Yang Liu Junfeng Fang Yongbin Li 27 5 0 17 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering Alessandro Stolfo Vidhisha Balachandran Safoora Yousefi Eric Horvitz Besmira Nushi LLMSV 40 13 0 15 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 55 2 0 30 Sep 2024
An Adversarial Perspective on Machine Unlearning for AI Safety Jakub Łucki Boyi Wei Yangsibo Huang Peter Henderson F. Tramèr Javier Rando MU AAML 54 31 0 26 Sep 2024
Representing Rule-based Chatbots with Transformers Dan Friedman Abhishek Panigrahi Danqi Chen 35 1 0 15 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 47 18 0 02 Jul 2024
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 180 152 0 28 Apr 2023
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 301 11,730 0 04 Mar 2022
The Woman Worked as a Babysitter: On Biases in Language Generation Emily Sheng Kai-Wei Chang Premkumar Natarajan Nanyun Peng 193 607 0 03 Sep 2019
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 196 876 0 03 May 2018