Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2209.00626
Cited By
The Alignment Problem from a Deep Learning Perspective
30 August 2022
Richard Ngo
Lawrence Chan
Sören Mindermann
Re-assign community
ArXiv
PDF
HTML
Papers citing
"The Alignment Problem from a Deep Learning Perspective"
30 / 130 papers shown
Title
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung
Joslyn Barnhart
Anton Korinek
Jade Leung
Cullen O'Keefe
...
Jonas Schuett
Yonadav Shavit
Divya Siddarth
Robert F. Trager
Kevin J. Wolf
SILM
37
116
0
06 Jul 2023
Evaluating Shutdown Avoidance of Language Models in Textual Scenarios
Teun van der Weij
Simon Lermen
Leon Lang
LLMAG
6
4
0
03 Jul 2023
Transformers in Healthcare: A Survey
Subhash Nerella
S. Bandyopadhyay
Jiaqing Zhang
Miguel Contreras
Scott Siegel
...
Jessica Sena
B. Shickel
A. Bihorac
Kia Khezeli
Parisa Rashidi
MedIm
AI4CE
19
25
0
30 Jun 2023
Are aligned neural networks adversarially aligned?
Nicholas Carlini
Milad Nasr
Christopher A. Choquette-Choo
Matthew Jagielski
Irena Gao
...
Pang Wei Koh
Daphne Ippolito
Katherine Lee
Florian Tramèr
Ludwig Schmidt
AAML
22
221
0
26 Jun 2023
Apolitical Intelligence? Auditing Delphi's responses on controversial political issues in the US
J. H. Rystrøm
11
0
0
22 Jun 2023
Inverse Scaling: When Bigger Isn't Better
I. R. McKenzie
Alexander Lyzhov
Michael Pieler
Alicia Parrish
Aaron Mueller
...
Yuhui Zhang
Zhengping Zhou
Najoung Kim
Sam Bowman
Ethan Perez
19
126
0
15 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Ramé
Guillaume Couairon
Mustafa Shukor
Corentin Dancette
Jean-Baptiste Gaya
Laure Soulier
Matthieu Cord
MoMe
35
135
0
07 Jun 2023
Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety
C. Mitelut
Ben Smith
Peter Vamplew
11
3
0
30 May 2023
Incentivizing honest performative predictions with proper scoring rules
Caspar Oesterheld
Johannes Treutlein
Emery Cooper
Rubi Hudson
25
5
0
28 May 2023
Model evaluation for extreme risks
Toby Shevlane
Sebastian Farquhar
Ben Garfinkel
Mary Phuong
Jess Whittlestone
...
Vijay Bolina
Jack Clark
Yoshua Bengio
Paul Christiano
Allan Dafoe
ELM
10
151
0
24 May 2023
The Knowledge Alignment Problem: Bridging Human and External Knowledge for Large Language Models
Shuo Zhang
Liangming Pan
Junzhou Zhao
W. Wang
HILM
21
0
0
23 May 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
153
186
0
02 May 2023
Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
Patrick Fernandes
Aman Madaan
Emmy Liu
António Farinhas
Pedro Henrique Martins
...
José G. C. de Souza
Shuyan Zhou
Tongshuang Wu
Graham Neubig
André F. T. Martins
ALM
113
56
0
01 May 2023
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf
Noam Wies
Oshri Avnery
Yoav Levine
Amnon Shashua
ALM
6
137
0
19 Apr 2023
Power-seeking can be probable and predictive for trained agents
Victoria Krakovna
János Kramár
TDI
11
16
0
13 Apr 2023
Generative Agents: Interactive Simulacra of Human Behavior
J. Park
Joseph C. O'Brien
Carrie J. Cai
Meredith Ringel Morris
Percy Liang
Michael S. Bernstein
LM&Ro
AI4CE
215
1,727
0
07 Apr 2023
Eight Things to Know about Large Language Models
Sam Bowman
ALM
15
110
0
02 Apr 2023
Democratising AI: Multiple Meanings, Goals, and Methods
Elizabeth Seger
Aviv Ovadya
Ben Garfinkel
Divya Siddarth
Allan Dafoe
14
54
0
22 Mar 2023
Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards
John J. Nay
ELM
AILaw
22
15
0
24 Jan 2023
Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
Justin Reppert
Ben Rachbach
Charlie George
Luke Stebbing
Ju-Seung Byun
Maggie Appleton
Andreas Stuhlmuller
ReLM
LRM
31
16
0
04 Jan 2023
Inclusive Artificial Intelligence
Dilip Arumugam
Shi Dong
Benjamin Van Roy
28
1
0
24 Dec 2022
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
491
0
01 Nov 2022
Scaling Laws for Reward Model Overoptimization
Leo Gao
John Schulman
Jacob Hilton
ALM
28
470
0
19 Oct 2022
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese
Nat McAleese
Maja Trkebacz
John Aslanides
Vlad Firoiu
...
John F. J. Mellor
Demis Hassabis
Koray Kavukcuoglu
Lisa Anne Hendricks
G. Irving
ALM
AAML
225
500
0
28 Sep 2022
Defining and Characterizing Reward Hacking
Joar Skalse
Nikolaus H. R. Howe
Dmitrii Krasheninnikov
David M. Krueger
57
53
0
27 Sep 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans
John J. Nay
ELM
AILaw
84
27
0
14 Sep 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
303
11,881
0
04 Mar 2022
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
173
272
0
28 Sep 2021
Constructing Unrestricted Adversarial Examples with Generative Models
Yang Song
Rui Shu
Nate Kushman
Stefano Ermon
GAN
AAML
174
302
0
21 May 2018
AI safety via debate
G. Irving
Paul Christiano
Dario Amodei
199
199
0
02 May 2018
Previous
1
2
3