Unsupervised Concept Vector Extraction for Bias Control in LLMs

27 February 2025

Papers citing "Unsupervised Concept Vector Extraction for Bias Control in LLMs"

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control

Hannah Cyberey

David Evans

LLMSV

623

23 Apr 2025

BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs

Zuozhu Liu

466

14 Jul 2024

Refusal in Language Models Is Mediated by a Single Direction

Nina Panickssery

454

558

17 Jun 2024

Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

356

13 Jun 2024

Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting

320

28 Jan 2024

Steering Llama 2 via Contrastive Activation Addition

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Alexander Matt Turner

LLMSV

665

617

09 Dec 2023

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Kartikeya Upasani

Rashi Rungta

...

Madian Khabsa

664

891

07 Dec 2023

Linear Representations of Sentiment in Large Language Models

Curt Tigges

Oskar John Hollinsworth

Atticus Geiger

Neel Nanda

MILM

267

143

23 Oct 2023

Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023

354

19 Oct 2023

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks

Max Tegmark

HILM

623

459

10 Oct 2023

Qwen Technical Report

Jinze Bai

Shuai Bai

Yunfei Chu

Zeyu Cui

Kai Dang

...

Zhenru Zhang

Chang Zhou

Jingren Zhou

Xiaohuan Zhou

Tianhang Zhu

OSLM

1.0K

3,549

28 Sep 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models

Louis Martin

...

Sharan Narang

Sergey Edunov

12.3K

16,448

18 Jul 2023

Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Myra Cheng

Esin Durmus

Dan Jurafsky

330

300

29 May 2023

A Trip Towards Fairness: Bias and De-Biasing in Large Language Models

Fabio Massimo Zanzotto

317

23 May 2023

The Capacity for Moral Self-Correction in Large Language Models

Deep Ganguli

...

379

201

15 Feb 2023

Discovering Language Model Behaviors with Model-Written Evaluations

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

...

Deep Ganguli

447

692

19 Dec 2022

Theories of "Gender" in NLP Bias Research

Conference on Fairness, Accountability and Transparency (FAccT), 2022

368

05 May 2022

DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts

Annual Meeting of the Association for Computational Linguistics (ACL), 2021

Yejin Choi

664

473

07 May 2021

NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints

North American Chapter of the Association for Computational Linguistics (NAACL), 2020

Yejin Choi

418

167

24 Oct 2020

Investigating African-American Vernacular English in Transformer-Based Text Generation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

349

06 Oct 2020

Toward Gender-Inclusive Coreference Resolution

Annual Meeting of the Association for Computational Linguistics (ACL), 2019

Yang Trista Cao

Hal Daumé

541

157

30 Oct 2019

Gender Bias in Coreference Resolution

471

722

25 Apr 2018

Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness

986

880

14 Nov 2017