
Moral Alignment for LLM Agents

International Conference on Learning Representations (ICLR), 2024

02 Oct 2024

Elizaveta Tennant

Stephen Hailes

Mirco Musolesi


Papers citing "Moral Alignment for LLM Agents"

50 / 69 papers shown

Black-Box Guardrail Reverse-engineering Attack

243

06 Nov 2025

Accumulating Context Changes the Beliefs of Language Models

532

03 Nov 2025

Advancing Automated Ethical Profiling in SE: a Zero-Shot Evaluation of LLM Reasoning

111

01 Oct 2025

MoVa: Towards Generalizable Classification of Human Morals and Values

140

29 Sep 2025

Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm

Alireza Mohamadi

Ali Yavari

141

15 Sep 2025

From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users

396

24 Aug 2025

Black Box Deployed -- Functional Criteria for Artificial Moral Agents in the LLM Era

Matthew E. Brophy

181

17 Jul 2025

Many LLMs Are More Utilitarian Than One

271

01 Jul 2025

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

...

Muhammad Khurram Khan

Meng Han

LLMAG

452

24 Jun 2025

Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives

...

530

11 Jun 2025

Who Gets the Kidney? Human-AI Alignment, Indecision, and Moral Values

251

30 May 2025

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

Steffen Backmann

David Guzman Piedrahita

402

25 May 2025

Interpretable Risk Mitigation in LLM Agent Systems

Jan Chojnacki

LLMAG

506

15 May 2025

Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking

369

24 Apr 2025

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

590

15 Apr 2025

Efficient Reinforcement Learning with Large Language Model Priors

Xue Yan

Yan Song

Xidong Feng

Mengyue Yang

Haifeng Zhang

Haitham Bou Ammar

Jun Wang

OffRL

287

10 Oct 2024

Collective Constitutional AI: Aligning a Language Model with Public Input

451

155

12 Jun 2024

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

420

111

01 Apr 2024

Scaling Instructable Agents Across Many Simulated Worlds

Arun Ahuja

...

439

13 Mar 2024

Dynamics of Moral Behavior in Heterogeneous Populations of Learning Agents

Elizaveta Tennant

Stephen Hailes

Mirco Musolesi

363

07 Mar 2024

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Lichao Sun

Elias Stengel-Eskin

Mohit Bansal

Tianlong Chen

Kaidi Xu

ELM LRM

404

107

19 Feb 2024

(Ir)rationality and Cognitive Biases in Large Language Models

Olivia Macmillan-Scott

Mirco Musolesi

LRM

325

14 Feb 2024

A Roadmap to Pluralistic Alignment

Niloofar Mireshghallah

...

Yejin Choi

457

173

07 Feb 2024

Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation

Luca Beurer-Kellner

Marc Fischer

Martin Vechev

435

07 Feb 2024

Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis

AAAI Conference on Artificial Intelligence (AAAI), 2023

294

115

09 Dec 2023

Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia

Edgar A. Duénez-Guzmán

William A. Cunningham

398

06 Dec 2023

Moral Foundations of Large Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Marwa Abdulhai

Gregory Serapio-Garcia

Clément Crepy

Daria Valter

John Canny

Natasha Jaques

LRM

312

23 Oct 2023

Towards Understanding Sycophancy in Language Models

...

1.2K

657

20 Oct 2023

Cognitive Architectures for Language Agents

755

331

05 Sep 2023

Taken out of context: On measuring situational awareness in LLMs

248

116

01 Sep 2023

A Survey on Large Language Model based Autonomous Agents

Lei Wang

...

Yankai Lin

843

2,553

22 Aug 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

...

Dorsa Sadigh

Dylan Hadfield-Menell

ALM OffRL

464

799

27 Jul 2023

From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought

281

139

22 Jun 2023

Strategic Reasoning with Language Models

Kanishk Gandhi

Dorsa Sadigh

Noah D. Goodman

LM&Ro LRM

234

30 May 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Neural Information Processing Systems (NeurIPS), 2023

Christopher D. Manning

Chelsea Finn

ALM

1.1K

7,889

29 May 2023

Training Socially Aligned Language Models on Simulated Social Interactions

International Conference on Learning Representations (ICLR), 2023

Ruibo Liu

Diyi Yang

405

26 May 2023

Playing repeated games with Large Language Models

Nature Human Behaviour (Nat Hum Behav), 2023

1.3K

228

26 May 2023

Voyager: An Open-Ended Embodied Agent with Large Language Models

Linxi Fan

688

1,429

25 May 2023

Role-Play with Large Language Models

Nature (Nature), 2023

269

505

25 May 2023

Gorilla: Large Language Model Connected with Massive APIs

Neural Information Processing Systems (NeurIPS), 2023

Tianjun Zhang

548

1,039

24 May 2023

Generative Agents: Interactive Simulacra of Human Behavior

ACM Symposium on User Interface Software and Technology (UIST), 2023

Cristina Mata

Joseph C. O'Brien

Carrie J. Cai

Meredith Ringel Morris

Abigail Z. Jacobs

Michael S. Bernstein

LM&Ro AI4CE

1.1K

3,613

07 Apr 2023

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Neural Information Processing Systems (NeurIPS), 2023

Yongliang Shen

Kaitao Song

Xu Tan

Dongsheng Li

Weiming Lu

Yueting Zhuang

MLLM

1.3K

1,371

30 Mar 2023

Reflexion: Language Agents with Verbal Reinforcement Learning

Neural Information Processing Systems (NeurIPS), 2023

932

2,945

20 Mar 2023

GPT-4 Technical Report

...

5.3K

23,506

15 Mar 2023

Toolformer: Language Models Can Teach Themselves to Use Tools

Neural Information Processing Systems (NeurIPS), 2023

Luke Zettlemoyer

691

3,323

09 Feb 2023

Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning

International Joint Conference on Artificial Intelligence (IJCAI), 2023

Elizaveta Tennant

Stephen Hailes

Mirco Musolesi

400

20 Jan 2023

The political ideology of conversational AI: Converging evidence on ChatGPT's pro-environmental, left-libertarian orientation

Social Science Research Network (SSRN), 2023

Jochen Hartmann

Jasper Schwenzow

Maximilian Witte

330

307

05 Jan 2023

Constitutional AI: Harmlessness from AI Feedback

...

1.5K

2,709

15 Dec 2022

ReAct: Synergizing Reasoning and Acting in Language Models

International Conference on Learning Representations (ICLR), 2022

Dian Yu

3.4K

6,822

06 Oct 2022

Improving alignment of dialogue agents via targeted human judgements

...

639

660

28 Sep 2022