Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
and Ethical Behavior in the MACHIAVELLI Benchmark

International Conference on Machine Learning (ICML), 2023

6 April 2023

Thomas Woodside

Papers citing "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark"

2 / 52 papers shown

Title
SWAN: A Generic Framework for Auditing Textual Conversational Systems | T. Sakai | 93 10 0 | 15 May 2023
The Alignment Problem from a Deep Learning Perspective | International Conference on Learning Representations (ICLR), 2022 | Richard Ngo, Lawrence Chan, Sören Mindermann | 435 243 0 | 30 Aug 2022