But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

23 May 2025

Papers citing "But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors"

Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks

Madeline Brumley

Joe Kwon

David M. Krueger

Dmitrii Krasheninnikov

Usman Anwar

LLMSV

212

11 Nov 2024

Improving Steering Vectors by Targeting Sparse Autoencoder Features

373

04 Nov 2024

STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Robert D Morabito

Sangmitra Madhusudan

Tyler McDonald

Ali Emami

286

20 Sep 2024

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team

Morgane Riviere

...

623

1,583

31 Jul 2024

Managing extreme AI risks amid rapid progress

...

351

26 Oct 2023

Discovering Language Model Behaviors with Model-Written Evaluations

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

...

Deep Ganguli

359

601

19 Dec 2022

Discovering Latent Knowledge in Language Models Without Supervision

International Conference on Learning Representations (ICLR), 2022

417

542

07 Dec 2022

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization

394

310

16 Sep 2018