Interpreting Bias in Large Language Models: A Feature-Based Approach

18 June 2024

Nirmalendu Prakash

Lee Ka Wei Roy

Papers citing "Interpreting Bias in Large Language Models: A Feature-Based Approach"

Title | Authors | Tag | Counts | Date
AtP*: An efficient and scalable method for localizing LLM behaviour to components | János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda | KELM | 43 42 0 | 01 Mar 2024
Dissecting Recall of Factual Associations in Auto-Regressive Language Models | Mor Geva, Jasmijn Bastings, Katja Filippova, Amir Globerson | KELM | 189 261 0 | 28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | | 210 494 0 | 01 Nov 2022
The Woman Worked as a Babysitter: On Biases in Language Generation | Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng | | 206 616 0 | 03 Sep 2019