Unsupervised Concept Vector Extraction for Bias Control in LLMs

27 February 2025
Hannah Cyberey
Yangfeng Ji
David Evans
    LLMSV
arXiv: 2502.19721

Papers citing "Unsupervised Concept Vector Extraction for Bias Control in LLMs"

23 / 23 papers shown
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David Evans
LLMSV
623
13
0
23 Apr 2025
BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
Zhiting Fan
Ruizhe Chen
Ruiling Xu
Zuozhu Liu
KELM
466
32
0
14 Jul 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
454
558
0
17 Jun 2024
Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
Eve Fleisig
G. Smith
Madeline Bossi
Ishita Rustagi
Xavier Yin
Dan Klein
356
67
0
13 Jun 2024
Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting
Masahiro Kaneko
Danushka Bollegala
Naoaki Okazaki
Timothy Baldwin
LRM
320
58
0
28 Jan 2024
Steering Llama 2 via Contrastive Activation Addition
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
665
617
0
09 Dec 2023
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan
Kartikeya Upasani
Jianfeng Chi
Rashi Rungta
Krithika Iyer
...
Michael Tontchev
Qing Hu
Brian Fuller
Davide Testuggine
Madian Khabsa
AI4MH
664
891
0
07 Dec 2023
Linear Representations of Sentiment in Large Language Models
Curt Tigges
Oskar John Hollinsworth
Atticus Geiger
Neel Nanda
MILM
267
143
0
23 Oct 2023
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023
Abhijith Chintam
Rahel Beloch
Willem H. Zuidema
Michael Hanna
Oskar van der Wal
354
20
0
19 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
623
459
0
10 Oct 2023
Qwen Technical Report
Jinze Bai
Shuai Bai
Yunfei Chu
Zeyu Cui
Kai Dang
...
Zhenru Zhang
Chang Zhou
Jingren Zhou
Xiaohuan Zhou
Tianhang Zhu
OSLM
1.0K
3,549
0
28 Sep 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH ALM
12.3K
16,448
0
18 Jul 2023
Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Myra Cheng
Esin Durmus
Dan Jurafsky
330
300
0
29 May 2023
A Trip Towards Fairness: Bias and De-Biasing in Large Language Models
Leonardo Ranaldi
Elena Sofia Ruzzetti
Davide Venditti
Dario Onorati
Fabio Massimo Zanzotto
317
44
0
23 May 2023
The Capacity for Moral Self-Correction in Large Language Models
Deep Ganguli
Amanda Askell
Nicholas Schiefer
Thomas I. Liao
Kamilė Lukošiūtė
...
Tom B. Brown
C. Olah
Jack Clark
Sam Bowman
Jared Kaplan
LRM ReLM
379
201
0
15 Feb 2023
Discovering Language Model Behaviors with Model-Written Evaluations
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
...
Danny Hernandez
Deep Ganguli
Evan Hubinger
Nicholas Schiefer
Jared Kaplan
ALM
447
692
0
19 Dec 2022
Theories of "Gender" in NLP Bias Research
Conference on Fairness, Accountability and Transparency (FAccT), 2022
Hannah Devinney
Jenny Björklund
H. Björklund
AI4CE
368
93
0
05 May 2022
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Alisa Liu
Maarten Sap
Ximing Lu
Swabha Swayamdipta
Chandra Bhagavatula
Noah A. Smith
Yejin Choi
MU
664
473
0
07 May 2021
NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints
North American Chapter of the Association for Computational Linguistics (NAACL), 2020
Ximing Lu
Peter West
Rowan Zellers
Ronan Le Bras
Chandra Bhagavatula
Yejin Choi
NAI
418
167
0
24 Oct 2020
Investigating African-American Vernacular English in Transformer-Based Text Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Sophie Groenwold
Li-hsueh Ou
Aesha Parekh
Samhita Honnavalli
Sharon Levy
Diba Mirza
William Yang Wang
349
86
0
06 Oct 2020
Toward Gender-Inclusive Coreference Resolution
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
Yang Trista Cao
Hal Daumé
541
157
0
30 Oct 2019
Gender Bias in Coreference Resolution
Rachel Rudinger
Jason Naradowsky
Brian Leonard
Benjamin Van Durme
471
722
0
25 Apr 2018
Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness
Michael Kearns
Seth Neel
Aaron Roth
Zhiwei Steven Wu
FaML
986
880
0
14 Nov 2017