2304.03279
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
International Conference on Machine Learning (ICML), 2023
6 April 2023
Alexander Pan
Chan Jun Shern
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks
Papers citing
"Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark"
2 / 52 papers shown
Title
SWAN: A Generic Framework for Auditing Textual Conversational Systems
T. Sakai
93
10
0
15 May 2023
The Alignment Problem from a Deep Learning Perspective
International Conference on Learning Representations (ICLR), 2022
Richard Ngo
Lawrence Chan
Sören Mindermann
435
243
0
30 Aug 2022