Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective

25 June 2024

Papers citing "Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective"

6 / 6 papers shown

Title
FineScope : Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation Chaitali Bhattacharyya Yeseong Kim 45 0 0 01 May 2025
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT Zhengfu He Xuyang Ge Qiong Tang Tianxiang Sun Qinyuan Cheng Xipeng Qiu 32 20 0 19 Feb 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity Andrew Lee Xiaoyan Bai Itamar Pres Martin Wattenberg Jonathan K. Kummerfeld Rada Mihalcea 64 95 0 03 Jan 2024
Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction Ashish Sharma Kevin Rushton Inna Wanyin Lin David Wadden Khendra G. Lucas Adam S. Miner Theresa Nguyen Tim Althoff 71 71 0 04 May 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 153 186 0 02 May 2023
On Feature Decorrelation in Self-Supervised Learning Tianyu Hua Wenxiao Wang Zihui Xue Sucheng Ren Yue Wang Hang Zhao SSL OOD 112 186 0 02 May 2021