Model evaluation for extreme risks (arXiv:2305.15324)
24 May 2023
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, S. Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
ELM
Papers citing "Model evaluation for extreme risks" (50 of 101 papers shown)
OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
Chuang Liu, Linhao Yu, Jiaxuan Li, Renren Jin, Yufei Huang, ..., Tao Liu, Jinwang Song, Hongying Zan, Sun Li, Deyi Xiong
ELM · 32 · 6 · 0 · 18 Mar 2024

Trust AI Regulation? Discerning users are vital to build trust and effective AI regulation
Zainab Alalawi, Paolo Bova, Theodor Cimpeanu, A. D. Stefano, M. H. Duong, ..., Han The Anh, Marcus Krellner, Bianca Ogbo, Simon T. Powers, Filippo Zimmaro
53 · 13 · 0 · 14 Mar 2024

Extending Activation Steering to Broad Skills and Multiple Behaviours
Teun van der Weij, Massimo Poesio, Nandi Schoots
LLMSV · 16 · 11 · 0 · 09 Mar 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, ..., Yan Shoshitaishvili, Jimmy Ba, K. Esvelt, Alexandr Wang, Dan Hendrycks
ELM · 43 · 140 · 0 · 05 Mar 2024

Farsight: Fostering Responsible AI Awareness During AI Application Prototyping
Zijie J. Wang, Chinmay Kulkarni, Lauren Wilcox, Michael Terry, Michael A. Madaio
38 · 43 · 0 · 23 Feb 2024

Rethinking Machine Unlearning for Large Language Models
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, ..., Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu
AILaw · MU · 65 · 81 · 0 · 13 Feb 2024

Mapping the Ethics of Generative AI: A Comprehensive Scoping Review
Thilo Hagendorff
21 · 35 · 0 · 13 Feb 2024

Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari
AAML · 30 · 66 · 0 · 29 Jan 2024

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
AAML · 17 · 76 · 0 · 25 Jan 2024

Visibility into AI Agents
Alan Chan, Carson Ezell, Max Kaufmann, K. Wei, Lewis Hammond, ..., Nitarshan Rajkumar, David M. Krueger, Noam Kolt, Lennart Heim, Markus Anderljung
13 · 31 · 0 · 23 Jan 2024

Evaluating Language-Model Agents on Realistic Autonomous Tasks
Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, ..., H. Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano
ELM · LLMAG · 19 · 78 · 0 · 18 Dec 2023

Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning
Xizhou Bu, Wenjuan Li, Zhengxiong Liu, Zhiqiang Ma, Panfeng Huang
20 · 1 · 0 · 18 Dec 2023

AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
22 · 38 · 0 · 12 Dec 2023

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
P. Bricman
19 · 0 · 0 · 01 Dec 2023

Towards Responsible Governance of Biological Design Tools
Richard Moulange, Max Langenkamp, Tessa Alexanian, Samuel Curtis, Morgan Livingston
ELM · SILM · 34 · 2 · 0 · 27 Nov 2023

Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, Hidenori Tanaka
CoGe · 27 · 6 · 0 · 21 Nov 2023

Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework
Markus Anderljung, Everett Thornton Smith, Joe O'Brien, Lisa Soder, Ben Bucknall, Emma Bluemke, Jonas Schuett, Robert F. Trager, Lacey Strahm, Rumman Chowdhury
27 · 16 · 0 · 15 Nov 2023

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang
MLLM · 129 · 117 · 0 · 09 Nov 2023

Levels of AGI for Operationalizing Progress on the Path to AGI
Meredith Ringel Morris, Jascha Narain Sohl-Dickstein, Noah Fiedel, T. Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, Shane Legg
AI4CE · 25 · 62 · 0 · 04 Nov 2023

Contextual Confidence and Generative AI
Shrey Jain, Zoe Hitzig, Pamela Mishkin
28 · 5 · 0 · 02 Nov 2023

Managing extreme AI risks amid rapid progress
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, ..., Philip H. S. Torr, Stuart J. Russell, Daniel Kahneman, J. Brauner, Sören Mindermann
24 · 63 · 0 · 26 Oct 2023

A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions
Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F. Wong, Lidia S. Chao
DeLMO · 24 · 23 · 0 · 23 Oct 2023

Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
27 · 286 · 0 · 19 Oct 2023

Multinational AGI Consortium (MAGIC): A Proposal for International Coordination on AI
Jason Hausenloy, Andrea Miotti, Claire Dennis
17 · 1 · 0 · 13 Oct 2023

Welfare Diplomacy: Benchmarking Language Model Cooperation
Gabriel Mukobi, Hannah Erlebach, Niklas Lauffer, Lewis Hammond, Alan Chan, Jesse Clifton
LM&Ro · 28 · 13 · 0 · 13 Oct 2023

GPT-who: An Information Density-based Machine-Generated Text Detector
Saranya Venkatraman, Adaku Uchendu, Dongwon Lee
DeLMO · 24 · 33 · 0 · 09 Oct 2023

Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation
Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, Gao Huang
LLMAG · 26 · 61 · 0 · 02 Oct 2023

Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Elizabeth Seger, Noemi Dreksler, Richard Moulange, Emily Dardaman, Jonas Schuett, ..., Emma Bluemke, Michael Aird, Patrick Levermore, Julian Hazell, Abhishek Gupta
11 · 40 · 0 · 29 Sep 2023

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
Chaoqi Wang, Yibo Jiang, Yuguang Yang, Han Liu, Yuxin Chen
19 · 81 · 0 · 28 Sep 2023

Large Language Model Alignment: A Survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
LM&MA · 14 · 177 · 0 · 26 Sep 2023

Can LLM-Generated Misinformation Be Detected?
Canyu Chen, Kai Shu
DeLMO · 29 · 158 · 0 · 25 Sep 2023

AI Risk Profiles: A Standards Proposal for Pre-Deployment AI Risk Disclosures
E. Sherman, Ian W. Eisenberg
18 · 5 · 0 · 22 Sep 2023

Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
LLMAG · LRM · 8 · 61 · 0 · 01 Sep 2023

International Governance of Civilian AI: A Jurisdictional Certification Approach
Robert F. Trager, Benjamin Harack, Anka Reuel, A. Carnegie, Lennart Heim, ..., R. Lall, Owen Larter, Seán Ó hÉigeartaigh, Simon Staffell, José Jaime Villalobos
14 · 20 · 0 · 29 Aug 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
25 · 139 · 0 · 28 Aug 2023

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin
36 · 102 · 0 · 25 Aug 2023

Deception Abilities Emerged in Large Language Models
Thilo Hagendorff
LLMAG · 28 · 74 · 0 · 31 Jul 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
ALM · OffRL · 39 · 470 · 0 · 27 Jul 2023

FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
ALM · 35 · 97 · 0 · 20 Jul 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, G. Irving, Rohin Shah, Vladimir Mikulik
21 · 100 · 0 · 18 Jul 2023

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
ALM · 14 · 396 · 0 · 10 Jul 2023

Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O'Keefe, ..., Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert F. Trager, Kevin J. Wolf
SILM · 44 · 116 · 0 · 06 Jul 2023

Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models
Aidan O'Gara
6 · 36 · 0 · 05 Jul 2023

GPT4GEO: How a Language Model Sees the World's Geography
Jonathan Roberts, Timo Lüddecke, Sowmen Das, Kai Han, Samuel Albanie
19 · 59 · 0 · 30 May 2023

Does Human Collaboration Enhance the Accuracy of Identifying LLM-Generated Deepfake Texts?
Adaku Uchendu, Jooyoung Lee, Hua Shen, Thai Le, Ting-Hao 'Kenneth' Huang, Dongwon Lee
DeLMO · 36 · 31 · 0 · 03 Apr 2023

Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
ELM · AI4MH · AI4CE · ALM · 254 · 2,232 · 0 · 22 Mar 2023

From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML
Shalaleh Rismani, Renee Shelby, A. Smart, Edgar W. Jatho, Joshua A. Kroll, AJung Moon, Negar Rostamzadeh
29 · 36 · 0 · 06 Oct 2022

Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving
ALM · AAML · 225 · 500 · 0 · 28 Sep 2022

The Alignment Problem from a Deep Learning Perspective
Richard Ngo, Lawrence Chan, Sören Mindermann
52 · 181 · 0 · 30 Aug 2022

Fairness and Bias in Robot Learning
Laura Londoño, Juana Valeria Hurtado, Nora Hertz, P. Kellmeyer, S. Voeneky, Abhinav Valada
FaML · 21 · 9 · 0 · 07 Jul 2022