
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

International Conference on Machine Learning (ICML), 2023
6 April 2023
Alexander Pan, Chan Jun Shern, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Papers citing "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark"

50 / 52 papers shown
Morality in AI. A plea to embed morality in LLM architectures and frameworks
Gunter Bombaerts, Bram Delisse, Uzay Kaymak
21 Nov 2025

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Davi Bastos Costa, Felippe Alves, Renato Vicente
11 Nov 2025

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk
Sean McGregor, Victor Lu, Vassil Tashev, Armstrong Foundjem, Aishwarya Ramasethu, ..., Chris Knotz, Kongtao Chen, Alicia Parrish, Anka Reuel, Heather Frase
24 Oct 2025

ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
01 Oct 2025

Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm
Alireza Mohamadi, Ali Yavari
15 Sep 2025

Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach
16 Jul 2025

PRISON: Unmasking the Criminal Potential of Large Language Models
Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang
19 Jun 2025

Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
Nell Watson, Ahmed Amer, Evan Harris, Preeti Ravindra, Shujun Zhang
08 Jun 2025

Towards provable probabilistic safety for scalable embodied AI systems
Linxuan He, Qing-Shan Jia, Ang Li, Hongyan Sang, Ling Wang, ..., Yisen Wang, Peng Wei, Zhongyuan Wang, Henry X. Liu, Shuo Feng
05 Jun 2025

Abstract Counterfactuals for Language Model Agents
Edoardo Pona, Milad Kazemi, Yali Du, David Watson, Nicola Paoletti
03 Jun 2025

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Amélie Reymond, Bernhard Schölkopf, Zhijing Jin
25 May 2025

Discovering Forbidden Topics in Language Models
Can Rager, Chris Wendler, Rohit Gandikota, David Bau
23 May 2025

Rethinking Prompt Optimizers: From Prompt Merits to Optimization
Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Tianjiao Li, Chua Jia Jim Deryl, Mak Lee Onn, Gee Wah Ng, Kezhi Mao
15 May 2025

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
Yichen Wu, Xudong Pan, Geng Hong, Min Yang
18 Apr 2025

Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Seungwon Lim, Seungbeen Lee, Dongjun Min, Youngjae Yu
09 Apr 2025

VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
Seungwon Lim, Sungwoong Kim, Jihwan Yu, Sungjae Lee, Jiwan Chung, Youngjae Yu
18 Mar 2025

DarkBench: Benchmarking Dark Patterns in Large Language Models
International Conference on Learning Representations (ICLR), 2025
Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz
13 Mar 2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
05 Mar 2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
24 Feb 2025

On Memory Construction and Retrieval for Personalized Conversational Agents
International Conference on Learning Representations (ICLR), 2025
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, ..., Yue Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Jianfeng Gao
08 Feb 2025

The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
Dylan Waldner, Risto Miikkulainen
08 Feb 2025

Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma
Richard Willis, Yali Du, Joel Z Leibo, Michael Luck
28 Jan 2025

Cyber Shadows: Neutralizing Security Threats with AI and Targeted Policy Measures
IEEE Transactions on Artificial Intelligence (IEEE TAI), 2025
Marc Schmitt, Pantelis Koutroumpis
03 Jan 2025

Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
Ruimeng Ye, Yang Xiao, Bo Hui
16 Oct 2024

Intuitions of Compromise: Utilitarianism vs. Contractualism
Jared Moore, Yejin Choi, Sydney Levine
07 Oct 2024

DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life
International Conference on Learning Representations (ICLR), 2024
Yu Ying Chiu, Liwei Jiang, Yejin Choi
03 Oct 2024

Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI
International Conference on Web and Social Media (ICWSM), 2024
Nicholas Pangakis, Samuel Wolken
14 Sep 2024

User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions
International Conference on Human Factors in Computing Systems (CHI), 2024
Xianzhe Fan, Qing Xiao, Xuhui Zhou, Jiaxin Pei, Maarten Sap, Zhicong Lu, Hong Shen
01 Sep 2024

Can Artificial Intelligence Embody Moral Values?
AI and Ethics, 2024
T. Swoboda, Lode Lauwaert
22 Aug 2024

Reinforcement Learning and Machine Ethics: A Systematic Review
Ajay Vishwanath, Louise A. Dennis, Marija Slavkovik
02 Jul 2024

Branching Narratives: Character Decision Points Detection
Alexey Tikhonov
12 May 2024

Towards Generalizable Agents in Text-Based Educational Environments: A Study of Integrating RL with LLMs
Bahar Radmehr, Adish Singla, Tanja Käser
29 Apr 2024

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy
08 Apr 2024

Exploring AI Problem Formulation with Children via Teachable Machines
Utkarsh Dwivedi, Salma Elsayed-Ali, Elizabeth M. Bonsignore, Hernisa Kacorri
28 Feb 2024

Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Delin Qu, Han Zhao, Tong Zhang
28 Feb 2024

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
Jinhao Duan, Renming Zhang, James Diffenderfer, B. Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu
19 Feb 2024

FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema
Junru Lu, Siyu An, Min Zhang, Yulan He, Di Yin, Xing Sun
19 Feb 2024

AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
Siwei Yang, Bingchen Zhao, Cihang Xie
14 Feb 2024

LLM Harmony: Multi-Agent Communication for Problem Solving
Sumedh Rasal
02 Jan 2024

MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
Neural Information Processing Systems (NeurIPS), 2023
Allen Nie, Yuhui Zhang, Atharva Amdekar, Chris Piech, Tatsunori Hashimoto, Tobias Gerstenberg
30 Oct 2023

In-Context Learning Dynamics with Random Binary Sequences
International Conference on Learning Representations (ICLR), 2023
Eric J. Bigelow, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, T. Ullman
26 Oct 2023

SuperHF: Supervised Iterative Learning from Human Feedback
Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush S. Bhatia, Silas Alberti
25 Oct 2023

Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI
Mahyar Abbasian, Elahe Khatibi, Iman Azimi, David Oniani, Zahra Shakeri Hossein Abad, ..., Bryant Lin, Olivier Gevaert, Li-Jia Li, Ramesh C. Jain, Amir M. Rahmani
21 Sep 2023

RAIN: Your Language Models Can Align Themselves without Finetuning
International Conference on Learning Representations (ICLR), 2023
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang R. Zhang
13 Sep 2023

Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity
PLoS ONE, 2023
A. Amirova, T. Fteropoulli, Nafiso Ahmed, Martin R. Cowie, Joel Z Leibo
06 Sep 2023

Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models
Qingyue Wang, Y. Fu, Yanan Cao, Zhiliang Tian, Dacheng Tao
29 Aug 2023

Deceptive Alignment Monitoring
Andres Carranza, Dhruv Pai, Rylan Schaeffer, Arnuv Tandon, Oluwasanmi Koyejo
20 Jul 2023

Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O'Keefe, ..., Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert F. Trager, Kevin J. Wolf
06 Jul 2023

Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models
Aidan O'Gara
05 Jul 2023

An Overview of Catastrophic AI Risks
Dan Hendrycks, Mantas Mazeika, Thomas Woodside
21 Jun 2023