Axioms for AI Alignment from Human Feedback (arXiv:2405.14758)
23 May 2024
Luise Ge, Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira, Yevgeniy Vorobeychik, Junlin Wu
Papers citing "Axioms for AI Alignment from Human Feedback" (12 papers)

Strategyproof Reinforcement Learning from Human Feedback
Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Z. Kwiatkowska
13 Mar 2025

Machine Learning Should Maximize Welfare, Not (Only) Accuracy
Nir Rosenfeld, Haifeng Xu
HAI, FaML
17 Feb 2025

Game Theory Meets Large Language Models: A Systematic Survey
Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu
LM&MA, OffRL, AI4CE
13 Feb 2025

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson
ALM
28 Jan 2025

Clone-Robust AI Alignment
Ariel D. Procaccia, Benjamin G. Schiffer, Shirley Zhang
17 Jan 2025

Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe
17 Jan 2025

Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models
Roberto-Rafael Maura-Rivero, Chirag Nagpal, Roma Patel, Francesco Visin
08 Jan 2025

Representative Social Choice: From Learning Theory to AI Alignment
Tianyi Qiu
FedML
31 Oct 2024

SEAL: Systematic Error Analysis for Value ALignment
Manon Revel, Matteo Cargnelutti, Tyna Eloundou, Greg Leppert
16 Aug 2024

Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback
Vincent Conitzer, Rachel Freedman, J. Heitzig, Wesley H. Holliday, Bob M. Jacobs, ..., Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, W. Zwicker
16 Apr 2024

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM
04 Mar 2022

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM
18 Sep 2019