Dissecting Human and LLM Preferences (arXiv:2402.11296)
17 February 2024
Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu
Papers citing "Dissecting Human and LLM Preferences" (7 papers)

Human Preferences for Constructive Interactions in Language Model Alignment
Yara Kyrychenko, Jon Roozenbeek, Brandon Davidson, S. V. D. Linden, Ramit Debnath
05 Mar 2025

Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
Michal Bravansky, Vaclav Kubon, Suhas Hariharan, Robert Kirk
24 Feb 2025

AI Alignment at Your Discretion
Maarten Buyl, Hadi Khalaf, C. M. Verdun, Lucas Monteiro Paes, Caio Vieira Machado, Flavio du Pin Calmon
10 Feb 2025

Aligning to Thousands of Preferences via System Message Generalization
Seongyun Lee, Sue Hyun Park, Seungone Kim, Minjoon Seo
28 May 2024

Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, D. Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
20 Oct 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
18 Sep 2019