Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback (arXiv:2404.10271)

16 April 2024
Vincent Conitzer
Rachel Freedman
J. Heitzig
Wesley H. Holliday
Bob M. Jacobs
Nathan Lambert
Milan Mossé
Eric Pacuit
Stuart Russell
Hailey Schoelkopf
Emanuel Tewolde
W. Zwicker

Papers citing "Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback"

25 papers shown
  • Capturing Individual Human Preferences with Reward Features. André Barreto, Vincent Dumoulin, Yiran Mao, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle. 21 Mar 2025.
  • Strategyproof Reinforcement Learning from Human Feedback. Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Z. Kwiatkowska. 13 Mar 2025.
  • Societal Alignment Frameworks Can Improve LLM Alignment. Karolina Stańczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, ..., Timothy P. Lillicrap, Ana Marasović, Sylvie Delacroix, Gillian K. Hadfield, Siva Reddy. 27 Feb 2025.
  • LIVS: A Pluralistic Alignment Dataset for Inclusive Public Spaces. Rashid Mushkani, Shravan Nayak, Hugo Berard, Allison Cohen, Shin Koseki, Hadrien Bertrand. 27 Feb 2025.
  • Machine Learning Should Maximize Welfare, Not (Only) Accuracy. Nir Rosenfeld, Haifeng Xu. 17 Feb 2025.
  • Game Theory Meets Large Language Models: A Systematic Survey. Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu. 13 Feb 2025.
  • AI Alignment at Your Discretion. Maarten Buyl, Hadi Khalaf, C. M. Verdun, Lucas Monteiro Paes, Caio Vieira Machado, Flavio du Pin Calmon. 10 Feb 2025.
  • The Battling Influencers Game: Nash Equilibria Structure of a Potential Game and Implications to Value Alignment. Young Wu, Yancheng Zhu, Jin-Yi Cai, Xiaojin Zhu. 03 Feb 2025.
  • Clone-Robust AI Alignment. Ariel D. Procaccia, Benjamin G. Schiffer, Shirley Zhang. 17 Jan 2025.
  • Pluralistic Alignment Over Time. Toryn Q. Klassen, P. A. Alamdari, Sheila A. McIlraith. 16 Nov 2024.
  • Policy Aggregation. Parand A. Alamdari, Soroush Ebadian, Ariel D. Procaccia. 06 Nov 2024.
  • Representative Social Choice: From Learning Theory to AI Alignment. Tianyi Qiu. 31 Oct 2024.
  • Soft Condorcet Optimization for Ranking of General Agents. Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, I. Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, Doina Precup. 31 Oct 2024.
  • Self-Pluralising Culture Alignment for Large Language Models. Shaoyang Xu, Yongqi Leng, Linhao Yu, Deyi Xiong. 16 Oct 2024.
  • Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements. Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme. 11 Oct 2024.
  • Alignment of Diffusion Models: Fundamentals, Challenges, and Future. Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie. 11 Sep 2024.
  • Abductive and Contrastive Explanations for Scoring Rules in Voting. Clément Contet, Umberto Grandi, Jérome Mengin. 23 Aug 2024.
  • Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models? Yuu Jinnai. 24 Jun 2024.
  • Direct Preference Optimization With Unobserved Preference Heterogeneity. Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis. 23 May 2024.
  • Push and Pull: A Framework for Measuring Attentional Agency on Digital Platforms. Zachary Wojtowicz, Shrey Jain, Nicholas Vincent. 23 May 2024.
  • Mapping Social Choice Theory to RLHF. Jessica Dai, Eve Fleisig. 19 Apr 2024.
  • Suppressing Pink Elephants with Direct Principle Feedback. Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman. 12 Feb 2024.
  • The Political Preferences of LLMs. David Rozado. 02 Feb 2024.
  • Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. 04 Mar 2022.
  • Fine-Tuning Language Models from Human Preferences. Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving. 18 Sep 2019.