Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

Annual Meeting of the Association for Computational Linguistics (ACL), 2020

16 April 2020

Papers citing "Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection"

50 / 310 papers shown

An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation

165

02 Dec 2025

Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models

162

22 Nov 2025

Spectral Identifiability for Interpretable Probe Geometry

William Hao-Cheng Huang

167

20 Nov 2025

HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning

220

19 Nov 2025

Extending Fair Null-Space Projections for Continuous Attributes to Kernel Methods

Felix Störck

Fabian Hinder

Barbara Hammer

126

05 Nov 2025

TriCon-Fair: Triplet Contrastive Learning for Mitigating Social Bias in Pre-trained Language Models

182

02 Nov 2025

Can SAEs reveal and mitigate racial biases of LLMs in healthcare?

Hiba Ahsan

Byron C. Wallace

LLMSV

246

31 Oct 2025

Understanding Fairness and Prediction Error through Subspace Decomposition and Influence Analysis

161

27 Oct 2025

The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems

154

13 Oct 2025

Language steering in latent space to mitigate unintended code-switching

254

11 Oct 2025

Counterfactually Fair Conformal Prediction

193

09 Oct 2025

Mitigating Biases in Language Models via Bias Unlearning

250

30 Sep 2025

BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

142

30 Sep 2025

Causally-Enhanced Reinforcement Policy Optimization

236

27 Sep 2025

Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes

223

25 Sep 2025

Memory in Large Language Models: Mechanisms, Evaluation and Evolution

260

23 Sep 2025

Fair-GPTQ: Bias-Aware Quantization for Large Language Models

Irina Proskurina

Guillaume Metzler

Julien Velcin

257

18 Sep 2025

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

349

16 Sep 2025

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

408

16 Sep 2025

Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Antonio Bărbălău

Cristian Daniel Păduraru

Teodor Poncu

Alexandru Tifrea

Elena Burceanu

239

13 Sep 2025

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

181

28 Aug 2025

Caught in the Act: a mechanistic approach to detecting deception

154

27 Aug 2025

CausalSent: Interpretable Sentiment Classification with RieszNet

Daniel Frees

Martin Pollack

CML

209

25 Aug 2025

Debiasing Multilingual LLMs in Cross-lingual Latent Space

176

25 Aug 2025

VideoEraser: Concept Erasure in Text-to-Video Diffusion Models

357

21 Aug 2025

Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing

213

15 Aug 2025

NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

297

02 Aug 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Senthooran Rajamanoharan

Neel Nanda

OODD LLMSV

513

22 Jul 2025

Distributional Machine Unlearning via Selective Data Removal

Youssef Allouah

R. Guerraoui

Sanmi Koyejo

307

20 Jul 2025

Nonlinear Concept Erasure: a Density Matching Approach

Antoine Saillenfest

Pirmin Lemberger

260

16 Jul 2025

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

240

11 Jul 2025

Reason to Rote: Rethinking Memorization in Reasoning

252

07 Jul 2025

The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure

Alexander Miserlis Hoyle

372

01 Jul 2025

Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs

Sayed Mohammad Vakilzadeh Hatefi

301

16 Jun 2025

Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Alicja Dobrzeniecka

Antske Fokkens

Pia Sommerauer

157

13 Jun 2025

Convergent Linear Representations of Emergent Misalignment

Anna Soligo

Edward Turner

Senthooran Rajamanoharan

Neel Nanda

MoMe

288

13 Jun 2025

Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Adam Karvonen

Samuel Marks

387

12 Jun 2025

Preserving Task-Relevant Information Under Linear Concept Removal

420

12 Jun 2025

Iterative Multilingual Spectral Attribute Erasure

267

12 Jun 2025

MANBench: Is Your Multimodal Model Smarter than Human?

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

272

04 Jun 2025

COSMIC: Generalized Refusal Direction Identification in LLM Activations

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

469

30 May 2025

Precise In-Parameter Concept Erasure in Large Language Models

452

28 May 2025

Paying Alignment Tax with Contrastive Learning

Buse Sibel Korkmaz

Rahul Nair

Elizabeth M. Daly

Antonio del Rio Chanona

354

25 May 2025

Advertising in AI systems: Society must be vigilant

Menghua Wu

Yujia Bao

377

23 May 2025

Sparse Activation Editing for Reliable Instruction Following in Narratives

235

22 May 2025

Do Language Models Use Their Depth Efficiently?

Róbert Csordás

Christopher D. Manning

Christopher Potts

666

20 May 2025

Mitigating Group-Level Fairness Disparities in Federated Visual Language Models

940

03 May 2025

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

508

24 Apr 2025

FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

406

20 Apr 2025

On Linear Representations and Pretraining Data Frequency in Language Models

International Conference on Learning Representations (ICLR), 2025

554

16 Apr 2025