Fraud-R1: A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements. Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning. North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks. Neural Information Processing Systems (NeurIPS), 2023.
Evaluating the Moral Beliefs Encoded in LLMs. Neural Information Processing Systems (NeurIPS), 2023.
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Generative Agents: Interactive Simulacra of Human Behavior. ACM Symposium on User Interface Software and Technology (UIST), 2023.
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Mass-Editing Memory in a Transformer. International Conference on Learning Representations (ICLR), 2023.
Training language models to follow instructions with human feedback. Neural Information Processing Systems (NeurIPS), 2022.
Locating and Editing Factual Associations in GPT. Neural Information Processing Systems (NeurIPS), 2022.
Transformer Feed-Forward Layers Are Key-Value Memories. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
Attention Is All You Need. Neural Information Processing Systems (NeurIPS), 2017.