The Alignment Problem from a Deep Learning Perspective

30 August 2022

Papers citing "The Alignment Problem from a Deep Learning Perspective"

30 / 130 papers shown

Frontier AI Regulation: Managing Emerging Risks to Public Safety Markus Anderljung Joslyn Barnhart Anton Korinek Jade Leung Cullen O'Keefe ... Jonas Schuett Yonadav Shavit Divya Siddarth Robert F. Trager Kevin J. Wolf SILM 37 116 0 06 Jul 2023
Evaluating Shutdown Avoidance of Language Models in Textual Scenarios Teun van der Weij Simon Lermen Leon Lang LLMAG 6 4 0 03 Jul 2023
Transformers in Healthcare: A Survey Subhash Nerella S. Bandyopadhyay Jiaqing Zhang Miguel Contreras Scott Siegel ... Jessica Sena B. Shickel A. Bihorac Kia Khezeli Parisa Rashidi MedIm AI4CE 19 25 0 30 Jun 2023
Are aligned neural networks adversarially aligned? Nicholas Carlini Milad Nasr Christopher A. Choquette-Choo Matthew Jagielski Irena Gao ... Pang Wei Koh Daphne Ippolito Katherine Lee Florian Tramèr Ludwig Schmidt AAML 22 221 0 26 Jun 2023
Apolitical Intelligence? Auditing Delphi's responses on controversial political issues in the US J. H. Rystrøm 11 0 0 22 Jun 2023
Inverse Scaling: When Bigger Isn't Better I. R. McKenzie Alexander Lyzhov Michael Pieler Alicia Parrish Aaron Mueller ... Yuhui Zhang Zhengping Zhou Najoung Kim Sam Bowman Ethan Perez 19 126 0 15 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards Alexandre Ramé Guillaume Couairon Mustafa Shukor Corentin Dancette Jean-Baptiste Gaya Laure Soulier Matthieu Cord MoMe 35 135 0 07 Jun 2023
Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety C. Mitelut Ben Smith Peter Vamplew 11 3 0 30 May 2023
Incentivizing honest performative predictions with proper scoring rules Caspar Oesterheld Johannes Treutlein Emery Cooper Rubi Hudson 25 5 0 28 May 2023
Model evaluation for extreme risks Toby Shevlane Sebastian Farquhar Ben Garfinkel Mary Phuong Jess Whittlestone ... Vijay Bolina Jack Clark Yoshua Bengio Paul Christiano Allan Dafoe ELM 10 151 0 24 May 2023
The Knowledge Alignment Problem: Bridging Human and External Knowledge for Large Language Models Shuo Zhang Liangming Pan Junzhou Zhao W. Wang HILM 21 0 0 23 May 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 153 186 0 02 May 2023
Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation Patrick Fernandes Aman Madaan Emmy Liu António Farinhas Pedro Henrique Martins ... José G. C. de Souza Shuyan Zhou Tongshuang Wu Graham Neubig André F. T. Martins ALM 113 56 0 01 May 2023
Fundamental Limitations of Alignment in Large Language Models Yotam Wolf Noam Wies Oshri Avnery Yoav Levine Amnon Shashua ALM 6 137 0 19 Apr 2023
Power-seeking can be probable and predictive for trained agents Victoria Krakovna János Kramár TDI 11 16 0 13 Apr 2023
Generative Agents: Interactive Simulacra of Human Behavior J. Park Joseph C. O'Brien Carrie J. Cai Meredith Ringel Morris Percy Liang Michael S. Bernstein LM&Ro AI4CE 215 1,727 0 07 Apr 2023
Eight Things to Know about Large Language Models Sam Bowman ALM 15 110 0 02 Apr 2023
Democratising AI: Multiple Meanings, Goals, and Methods Elizabeth Seger Aviv Ovadya Ben Garfinkel Divya Siddarth Allan Dafoe 14 54 0 22 Mar 2023
Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards John J. Nay ELM AILaw 22 15 0 24 Jan 2023
Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes Justin Reppert Ben Rachbach Charlie George Luke Stebbing Ju-Seung Byun Maggie Appleton Andreas Stuhlmüller ReLM LRM 31 16 0 04 Jan 2023
Inclusive Artificial Intelligence Dilip Arumugam Shi Dong Benjamin Van Roy 28 1 0 24 Dec 2022
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 491 0 01 Nov 2022
Scaling Laws for Reward Model Overoptimization Leo Gao John Schulman Jacob Hilton ALM 28 470 0 19 Oct 2022
Improving alignment of dialogue agents via targeted human judgements Amelia Glaese Nat McAleese Maja Trębacz John Aslanides Vlad Firoiu ... John F. J. Mellor Demis Hassabis Koray Kavukcuoglu Lisa Anne Hendricks G. Irving ALM AAML 225 500 0 28 Sep 2022
Defining and Characterizing Reward Hacking Joar Skalse Nikolaus H. R. Howe Dmitrii Krasheninnikov David M. Krueger 57 53 0 27 Sep 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans John J. Nay ELM AILaw 84 27 0 14 Sep 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 303 11,881 0 04 Mar 2022
Unsolved Problems in ML Safety Dan Hendrycks Nicholas Carlini John Schulman Jacob Steinhardt 173 272 0 28 Sep 2021
Constructing Unrestricted Adversarial Examples with Generative Models Yang Song Rui Shu Nate Kushman Stefano Ermon GAN AAML 174 302 0 21 May 2018
AI safety via debate G. Irving Paul Christiano Dario Amodei 199 199 0 02 May 2018