2304.03279
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
International Conference on Machine Learning (ICML), 2023
6 April 2023
Alexander Pan
Chan Jun Shern
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks
Papers citing "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark"
50 / 52 papers shown
Morality in AI. A plea to embed morality in LLM architectures and frameworks
Gunter Bombaerts
Bram Delisse
Uzay Kaymak
AI4TS
37
0
0
21 Nov 2025
Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Davi Bastos Costa
Felippe Alves
Renato Vicente
101
0
0
11 Nov 2025
Risk Management for Mitigating Benchmark Failure Modes: BenchRisk
Sean McGregor
Victor Lu
Vassil Tashev
Armstrong Foundjem
Aishwarya Ramasethu
...
Chris Knotz
Kongtao Chen
Alicia Parrish
Anka Reuel
Heather Frase
101
0
0
24 Oct 2025
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Adi Simhi
Jonathan Herzig
Martin Tutek
Itay Itzhak
Idan Szpektor
Yonatan Belinkov
LLMAG
72
0
0
01 Oct 2025
Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm
Alireza Mohamadi
Ali Yavari
52
0
0
15 Sep 2025
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan
Zheng-Xin Yong
Stephen H. Bach
LRM
116
7
0
16 Jul 2025
PRISON: Unmasking the Criminal Potential of Large Language Models
Xinyi Wu
Geng Hong
Pei Chen
Yueyue Chen
Xudong Pan
Min Yang
182
0
0
19 Jun 2025
Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
Nell Watson
Ahmed Amer
Evan Harris
Preeti Ravindra
Shujun Zhang
147
1
0
08 Jun 2025
Towards provable probabilistic safety for scalable embodied AI systems
Linxuan He
Qing-Shan Jia
Ang Li
Hongyan Sang
Ling Wang
...
Yisen Wang
Peng Wei
Zhongyuan Wang
Henry X. Liu
Shuo Feng
185
0
0
05 Jun 2025
Abstract Counterfactuals for Language Model Agents
Edoardo Pona
Milad Kazemi
Yali Du
David Watson
Nicola Paoletti
LLMAG
227
0
0
03 Jun 2025
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Steffen Backmann
David Guzman Piedrahita
Emanuel Tewolde
Amélie Reymond
Bernhard Schölkopf
Zhijing Jin
247
4
0
25 May 2025
Discovering Forbidden Topics in Language Models
Can Rager
Chris Wendler
Rohit Gandikota
David Bau
276
4
0
23 May 2025
Rethinking Prompt Optimizers: From Prompt Merits to Optimization
Zixiao Zhu
Hanzhang Zhou
Zijian Feng
Tianjiao Li
Chua Jia Jim Deryl
Mak Lee Onn
Gee Wah Ng
Kezhi Mao
LRM
319
1
0
15 May 2025
OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
Yichen Wu
Xudong Pan
Geng Hong
Min Yang
LLMAG
204
13
0
18 Apr 2025
Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Seungwon Lim
Seungbeen Lee
Dongjun Min
Youngjae Yu
AI4CE
332
0
0
09 Apr 2025
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
Seungwon Lim
Sungwoong Kim
Jihwan Yu
Sungjae Lee
Jiwan Chung
Youngjae Yu
404
2
0
18 Mar 2025
DarkBench: Benchmarking Dark Patterns in Large Language Models
International Conference on Learning Representations (ICLR), 2025
Esben Kran
Hieu Minh "Jord" Nguyen
Akash Kundu
Sami Jawhar
Jinsuk Park
Mateusz Maria Jurewicz
176
16
0
13 Mar 2025
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren
Arunim Agarwal
Mantas Mazeika
Cristina Menghini
Robert Vacareanu
...
Matias Geralnik
Adam Khoja
Dean Lee
Summer Yue
Dan Hendrycks
HILM
ALM
363
19
0
05 Mar 2025
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley
Daniel Tan
Niels Warncke
Anna Sztyber-Betley
Xuchan Bao
Martín Soto
Nathan Labenz
Owain Evans
AAML
577
97
0
24 Feb 2025
On Memory Construction and Retrieval for Personalized Conversational Agents
International Conference on Learning Representations (ICLR), 2025
Zhuoshi Pan
Qianhui Wu
Huiqiang Jiang
Xufang Luo
Hao Cheng
...
Yue Yang
Chin-Yew Lin
H. Vicky Zhao
Lili Qiu
Jianfeng Gao
RALM
325
21
0
08 Feb 2025
The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
Dylan Waldner
Risto Miikkulainen
356
2
0
08 Feb 2025
Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma
Richard Willis
Yali Du
Joel Z Leibo
Michael Luck
272
10
0
28 Jan 2025
Cyber Shadows: Neutralizing Security Threats with AI and Targeted Policy Measures
IEEE Transactions on Artificial Intelligence (IEEE TAI), 2025
Marc Schmitt
Pantelis Koutroumpis
233
6
0
03 Jan 2025
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
Ruimeng Ye
Yang Xiao
Bo Hui
ALM
ELM
OffRL
242
5
0
16 Oct 2024
Intuitions of Compromise: Utilitarianism vs. Contractualism
Jared Moore
Yejin Choi
Sydney Levine
182
1
0
07 Oct 2024
DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life
International Conference on Learning Representations (ICLR), 2024
Yu Ying Chiu
Liwei Jiang
Yejin Choi
264
24
0
03 Oct 2024
Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI
International Conference on Web and Social Media (ICWSM), 2024
Nicholas Pangakis
Samuel Wolken
255
12
0
14 Sep 2024
User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions
International Conference on Human Factors in Computing Systems (CHI), 2024
Xianzhe Fan
Qing Xiao
Xuhui Zhou
Jiaxin Pei
Maarten Sap
Zhicong Lu
Hong Shen
258
22
0
01 Sep 2024
Can Artificial Intelligence Embody Moral Values?
AI and Ethics (AI & Ethics), 2024
T. Swoboda
Lode Lauwaert
96
2
0
22 Aug 2024
Reinforcement Learning and Machine ethics: a systematic review
Ajay Vishwanath
Louise A. Dennis
Marija Slavkovik
226
4
0
02 Jul 2024
Branching Narratives: Character Decision Points Detection
Alexey Tikhonov
127
2
0
12 May 2024
Towards Generalizable Agents in Text-Based Educational Environments: A Study of Integrating RL with LLMs
Bahar Radmehr
Adish Singla
Tanja Käser
LLMAG
AI4CE
174
7
0
29 Apr 2024
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
Paul Röttger
Fabio Pernisi
Bertie Vidgen
Dirk Hovy
ELM
KELM
304
58
0
08 Apr 2024
Exploring AI Problem Formulation with Children via Teachable Machines
Utkarsh Dwivedi
Salma Elsayed-Ali
Elizabeth M. Bonsignore
Hernisa Kacorri
141
10
0
28 Feb 2024
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Haoxiang Wang
Yong Lin
Wei Xiong
Rui Yang
Shizhe Diao
Delin Qu
Han Zhao
Tong Zhang
346
122
0
28 Feb 2024
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
Jinhao Duan
Renming Zhang
James Diffenderfer
B. Kailkhura
Lichao Sun
Elias Stengel-Eskin
Mohit Bansal
Tianlong Chen
Kaidi Xu
ELM
LRM
250
87
0
19 Feb 2024
FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema
Junru Lu
Siyu An
Min Zhang
Yulan He
Di Yin
Xing Sun
233
5
0
19 Feb 2024
AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
Siwei Yang
Bingchen Zhao
Cihang Xie
LRM
135
7
0
14 Feb 2024
LLM Harmony: Multi-Agent Communication for Problem Solving
Sumedh Rasal
LLMAG
136
37
0
02 Jan 2024
MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
Neural Information Processing Systems (NeurIPS), 2023
Allen Nie
Yuhui Zhang
Atharva Amdekar
Chris Piech
Tatsunori Hashimoto
Tobias Gerstenberg
194
55
0
30 Oct 2023
In-Context Learning Dynamics with Random Binary Sequences
International Conference on Learning Representations (ICLR), 2023
Eric J. Bigelow
Ekdeep Singh Lubana
Robert P. Dick
Hidenori Tanaka
T. Ullman
300
12
0
26 Oct 2023
SuperHF: Supervised Iterative Learning from Human Feedback
Gabriel Mukobi
Peter Chatain
Su Fong
Robert Windesheim
Gitta Kutyniok
Kush S. Bhatia
Silas Alberti
ALM
184
12
0
25 Oct 2023
Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI
Mahyar Abbasian
Elahe Khatibi
Iman Azimi
David Oniani
Zahra Shakeri Hossein Abad
...
Bryant Lin
Olivier Gevaert
Li-Jia Li
Ramesh C. Jain
Amir M. Rahmani
LM&MA
ELM
AI4MH
401
116
0
21 Sep 2023
RAIN: Your Language Models Can Align Themselves without Finetuning
International Conference on Learning Representations (ICLR), 2023
Yuhui Li
Fangyun Wei
Jinjing Zhao
Chao Zhang
Hongyang R. Zhang
SILM
219
152
0
13 Sep 2023
Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity
PLoS ONE (PLoS ONE), 2023
A. Amirova
T. Fteropoulli
Nafiso Ahmed
Martin R. Cowie
Joel Z Leibo
242
17
0
06 Sep 2023
Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models
Qingyue Wang
Y. Fu
Yanan Cao
Zhiliang Tian
Dacheng Tao
LLMAG
KELM
RALM
488
44
0
29 Aug 2023
Deceptive Alignment Monitoring
Andres Carranza
Dhruv Pai
Rylan Schaeffer
Arnuv Tandon
Oluwasanmi Koyejo
176
13
0
20 Jul 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung
Joslyn Barnhart
Anton Korinek
Jade Leung
Cullen O'Keefe
...
Jonas Schuett
Yonadav Shavit
Divya Siddarth
Robert F. Trager
Kevin J. Wolf
SILM
297
150
0
06 Jul 2023
Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models
Aidan O'Gara
140
48
0
05 Jul 2023
An Overview of Catastrophic AI Risks
Dan Hendrycks
Mantas Mazeika
Thomas Woodside
SILM
444
238
0
21 Jun 2023