2502.19721
Unsupervised Concept Vector Extraction for Bias Control in LLMs
27 February 2025
Hannah Cyberey
Yangfeng Ji
David Evans
LLMSV
Papers citing
"Unsupervised Concept Vector Extraction for Bias Control in LLMs"
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David Evans
LLMSV
623
13
0
23 Apr 2025
BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
Zhiting Fan
Ruizhe Chen
Ruiling Xu
Zuozhu Liu
KELM
466
32
0
14 Jul 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
454
558
0
17 Jun 2024
Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
Eve Fleisig
G. Smith
Madeline Bossi
Ishita Rustagi
Xavier Yin
Dan Klein
356
67
0
13 Jun 2024
Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting
Masahiro Kaneko
Danushka Bollegala
Naoaki Okazaki
Timothy Baldwin
LRM
320
58
0
28 Jan 2024
Steering Llama 2 via Contrastive Activation Addition
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
665
617
0
09 Dec 2023
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan
Kartikeya Upasani
Jianfeng Chi
Rashi Rungta
Krithika Iyer
...
Michael Tontchev
Qing Hu
Brian Fuller
Davide Testuggine
Madian Khabsa
AI4MH
664
891
0
07 Dec 2023
Linear Representations of Sentiment in Large Language Models
Curt Tigges
Oskar John Hollinsworth
Atticus Geiger
Neel Nanda
MILM
267
143
0
23 Oct 2023
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023
Abhijith Chintam
Rahel Beloch
Willem H. Zuidema
Michael Hanna
Oskar van der Wal
354
20
0
19 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
623
459
0
10 Oct 2023
Qwen Technical Report
Jinze Bai
Shuai Bai
Yunfei Chu
Zeyu Cui
Kai Dang
...
Zhenru Zhang
Chang Zhou
Jingren Zhou
Xiaohuan Zhou
Tianhang Zhu
OSLM
1.0K
3,549
0
28 Sep 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
12.3K
16,448
0
18 Jul 2023
Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Myra Cheng
Esin Durmus
Dan Jurafsky
330
300
0
29 May 2023
A Trip Towards Fairness: Bias and De-Biasing in Large Language Models
Leonardo Ranaldi
Elena Sofia Ruzzetti
Davide Venditti
Dario Onorati
Fabio Massimo Zanzotto
317
44
0
23 May 2023
The Capacity for Moral Self-Correction in Large Language Models
Deep Ganguli
Amanda Askell
Nicholas Schiefer
Thomas I. Liao
Kamilė Lukošiūtė
...
Tom B. Brown
C. Olah
Jack Clark
Sam Bowman
Jared Kaplan
LRM
ReLM
379
201
0
15 Feb 2023
Discovering Language Model Behaviors with Model-Written Evaluations
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
...
Danny Hernandez
Deep Ganguli
Evan Hubinger
Nicholas Schiefer
Jared Kaplan
ALM
447
692
0
19 Dec 2022
Theories of "Gender" in NLP Bias Research
Conference on Fairness, Accountability and Transparency (FAccT), 2022
Hannah Devinney
Jenny Björklund
H. Björklund
AI4CE
368
93
0
05 May 2022
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Alisa Liu
Maarten Sap
Ximing Lu
Swabha Swayamdipta
Chandra Bhagavatula
Noah A. Smith
Yejin Choi
MU
664
473
0
07 May 2021
NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints
North American Chapter of the Association for Computational Linguistics (NAACL), 2020
Ximing Lu
Peter West
Rowan Zellers
Ronan Le Bras
Chandra Bhagavatula
Yejin Choi
NAI
418
167
0
24 Oct 2020
Investigating African-American Vernacular English in Transformer-Based Text Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Sophie Groenwold
Li-hsueh Ou
Aesha Parekh
Samhita Honnavalli
Sharon Levy
Diba Mirza
William Yang Wang
349
86
0
06 Oct 2020
Toward Gender-Inclusive Coreference Resolution
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Yang Trista Cao
Hal Daumé
541
157
0
30 Oct 2019
Gender Bias in Coreference Resolution
Rachel Rudinger
Jason Naradowsky
Brian Leonard
Benjamin Van Durme
471
722
0
25 Apr 2018
Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness
Michael Kearns
Seth Neel
Aaron Roth
Zhiwei Steven Wu
FaML
986
880
0
14 Nov 2017