Inverse Constitutional AI: Compressing Preferences into Principles

2 June 2024

Eyke Hüllermeier

Papers citing "Inverse Constitutional AI: Compressing Preferences into Principles"

Title | Authors | Topic tags | Counts | Date
Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction | Michal Bravansky, Vaclav Kubon, Suhas Hariharan, Robert Kirk | - | 50 0 0 | 24 Feb 2025
AI Alignment at Your Discretion | Maarten Buyl, Hadi Khalaf, C. M. Verdun, Lucas Monteiro Paes, Caio Vieira Machado, Flavio du Pin Calmon | - | 26 0 0 | 10 Feb 2025
IntentGPT: Few-shot Intent Discovery with Large Language Models | Juan A. Rodriguez, Nicholas Botzer, David Vazquez, Christopher Pal, M. Pedersoli, I. Laradji | VLM | 52 1 0 | 16 Nov 2024
Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment | Andrew Konya, Aviv Ovadya, K. J. Kevin Feng, Quan Ze Chen, Lisa Schirch, Colin Irwin, Amy X. Zhang | ALM | 42 0 0 | 15 Nov 2024
Policy Prototyping for LLMs: Pluralistic Alignment via Interactive and Collaborative Policymaking | K. J. Kevin Feng, Inyoung Cheong, Quan Ze Chen, Amy X. Zhang | - | 31 0 0 | 13 Sep 2024
Self-Directed Synthetic Dialogues and Revisions Technical Report | Nathan Lambert, Hailey Schoelkopf, Aaron Gokaslan, Luca Soldaini, Valentina Pyatkin, Louis Castricato | SyDa | 35 2 0 | 25 Jul 2024
ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions | Chan Young Park, Shuyue Stella Li, Hayoung Jung, Svitlana Volkova, Tanushree Mitra, David Jurgens, Yulia Tsvetkov | - | 36 5 0 | 02 Jul 2024
Humans or LLMs as the Judge? A Study on Judgement Biases | Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang | - | 56 89 0 | 16 Feb 2024
Specific versus General Principles for Constitutional AI | Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, ..., Zac Hatfield-Dodds, Sören Mindermann, Nicholas Joseph, Sam McCandlish, Jared Kaplan | AILaw | 54 24 0 | 20 Oct 2023
Towards Understanding Sycophancy in Language Models | Mrinank Sharma, Meg Tong, Tomasz Korbak, D. Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez | - | 207 178 0 | 20 Oct 2023
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 301 11,730 0 | 04 Mar 2022