2406.12347
Interpreting Bias in Large Language Models: A Feature-Based Approach
18 June 2024
Nirmalendu Prakash
Lee Ka Wei Roy
Papers citing "Interpreting Bias in Large Language Models: A Feature-Based Approach"
4 / 4 papers shown
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
43
42
0
01 Mar 2024
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
189
261
0
28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
494
0
01 Nov 2022
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
206
616
0
03 Sep 2019