Title
Morality in AI. A plea to embed morality in LLM architectures and frameworks Gunter Bombaerts Bram Delisse Uzay Kaymak AI4TS 37 0 0 21 Nov 2025
Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models Davi Bastos Costa Felippe Alves Renato Vicente 101 0 0 11 Nov 2025
Risk Management for Mitigating Benchmark Failure Modes: BenchRisk Sean McGregor Victor Lu Vassil Tashev Armstrong Foundjem Aishwarya Ramasethu ... Chris Knotz Kongtao Chen Alicia Parrish Anka Reuel Heather Frase 101 0 0 24 Oct 2025
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs Adi Simhi Jonathan Herzig Martin Tutek Itay Itzhak Idan Szpektor Yonatan Belinkov LLMAG 72 0 0 01 Oct 2025
Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm Alireza Mohamadi Ali Yavari 52 0 0 15 Sep 2025
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models Yik Siu Chan Zheng-Xin Yong Stephen H. Bach LRM 116 7 0 16 Jul 2025
PRISON: Unmasking the Criminal Potential of Large Language Models Xinyi Wu Geng Hong Pei Chen Yueyue Chen Xudong Pan Min Yang 182 0 0 19 Jun 2025
Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values Nell Watson Ahmed Amer Evan Harris Preeti Ravindra Shujun Zhang 147 1 0 08 Jun 2025
Towards provable probabilistic safety for scalable embodied AI systems Linxuan He Qing-Shan Jia Ang Li Hongyan Sang Ling Wang ... Yisen Wang Peng Wei Zhongyuan Wang Henry X. Liu Shuo Feng 185 0 0 05 Jun 2025
Abstract Counterfactuals for Language Model Agents Edoardo Pona Milad Kazemi Yali Du David Watson Nicola Paoletti LLMAG 227 0 0 03 Jun 2025
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas Steffen Backmann David Guzman Piedrahita Emanuel Tewolde Amélie Reymond Bernhard Schölkopf Zhijing Jin 247 4 0 25 May 2025
Discovering Forbidden Topics in Language Models Can Rager Chris Wendler Rohit Gandikota David Bau 276 4 0 23 May 2025
Rethinking Prompt Optimizers: From Prompt Merits to Optimization Zixiao Zhu Hanzhang Zhou Zijian Feng Tianjiao Li Chua Jia Jim Deryl Mak Lee Onn Gee Wah Ng Kezhi Mao LRM 319 1 0 15 May 2025
OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation Yichen Wu Xudong Pan Geng Hong Min Yang LLMAG 204 13 0 18 Apr 2025
Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Seungwon Lim Seungbeen Lee Dongjun Min Youngjae Yu AI4CE 332 0 0 09 Apr 2025
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms Seungwon Lim Sungwoong Kim Jihwan Yu Sungjae Lee Jiwan Chung Youngjae Yu 404 2 0 18 Mar 2025
DarkBench: Benchmarking Dark Patterns in Large Language Models International Conference on Learning Representations (ICLR), 2025 Esben Kran Hieu Minh "Jord" Nguyen Akash Kundu Sami Jawhar Jinsuk Park Mateusz Maria Jurewicz 176 16 0 13 Mar 2025
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems Richard Ren Arunim Agarwal Mantas Mazeika Cristina Menghini Robert Vacareanu ... Matias Geralnik Adam Khoja Dean Lee Summer Yue Dan Hendrycks HILM ALM 363 19 0 05 Mar 2025
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs Jan Betley Daniel Tan Niels Warncke Anna Sztyber-Betley Xuchan Bao Martín Soto Nathan Labenz Owain Evans AAML 577 97 0 24 Feb 2025
On Memory Construction and Retrieval for Personalized Conversational Agents International Conference on Learning Representations (ICLR), 2025 Zhuoshi Pan Qianhui Wu Huiqiang Jiang Xufang Luo Hao Cheng ... Yue Yang Chin-Yew Lin H. Vicky Zhao Lili Qiu Jianfeng Gao RALM 325 21 0 08 Feb 2025
The Odyssey of the Fittest: Can Agents Survive and Still Be Good? Dylan Waldner Risto Miikkulainen 356 2 0 08 Feb 2025
Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma Richard Willis Yali Du Joel Z Leibo Michael Luck 272 10 0 28 Jan 2025
Cyber Shadows: Neutralizing Security Threats with AI and Targeted Policy Measures IEEE Transactions on Artificial Intelligence (IEEE TAI), 2025 Marc Schmitt Pantelis Koutroumpis 233 6 0 03 Jan 2025
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning Ruimeng Ye Yang Xiao Bo Hui ALM ELM OffRL 242 5 0 16 Oct 2024
Intuitions of Compromise: Utilitarianism vs. Contractualism Jared Moore Yejin Choi Sydney Levine 182 1 0 07 Oct 2024
DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life International Conference on Learning Representations (ICLR), 2024 Yu Ying Chiu Liwei Jiang Yejin Choi 264 24 0 03 Oct 2024
Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI International Conference on Web and Social Media (ICWSM), 2024 Nicholas Pangakis Samuel Wolken 255 12 0 14 Sep 2024
User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions International Conference on Human Factors in Computing Systems (CHI), 2024 Xianzhe Fan Qing Xiao Xuhui Zhou Jiaxin Pei Maarten Sap Zhicong Lu Hong Shen 258 22 0 01 Sep 2024
Can Artificial Intelligence Embody Moral Values? AI and Ethics (AI & Ethics), 2024 T. Swoboda Lode Lauwaert 96 2 0 22 Aug 2024
Reinforcement Learning and Machine ethics: a systematic review Ajay Vishwanath Louise A. Dennis Marija Slavkovik 226 4 0 02 Jul 2024
Branching Narratives: Character Decision Points Detection Alexey Tikhonov 127 2 0 12 May 2024
Towards Generalizable Agents in Text-Based Educational Environments: A Study of Integrating RL with LLMs Bahar Radmehr Adish Singla Tanja Käser LLMAG AI4CE 174 7 0 29 Apr 2024
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety Paul Röttger Fabio Pernisi Bertie Vidgen Dirk Hovy ELM KELM 304 58 0 08 Apr 2024
Exploring AI Problem Formulation with Children via Teachable Machines Utkarsh Dwivedi Salma Elsayed-Ali Elizabeth M. Bonsignore Hernisa Kacorri 141 10 0 28 Feb 2024
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards Haoxiang Wang Yong Lin Wei Xiong Rui Yang Shizhe Diao Delin Qu Han Zhao Tong Zhang 346 122 0 28 Feb 2024
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations Jinhao Duan Renming Zhang James Diffenderfer B. Kailkhura Lichao Sun Elias Stengel-Eskin Mohit Bansal Tianlong Chen Kaidi Xu ELM LRM 250 87 0 19 Feb 2024
FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema Junru Lu Siyu An Min Zhang Yulan He Di Yin Xing Sun 233 5 0 19 Feb 2024
AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability Siwei Yang Bingchen Zhao Cihang Xie LRM 135 7 0 14 Feb 2024
LLM Harmony: Multi-Agent Communication for Problem Solving Sumedh Rasal LLMAG 136 37 0 02 Jan 2024
MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks Neural Information Processing Systems (NeurIPS), 2023 Allen Nie Yuhui Zhang Atharva Amdekar Chris Piech Tatsunori Hashimoto Tobias Gerstenberg 194 55 0 30 Oct 2023
In-Context Learning Dynamics with Random Binary Sequences International Conference on Learning Representations (ICLR), 2023 Eric J. Bigelow Ekdeep Singh Lubana Robert P. Dick Hidenori Tanaka T. Ullman 300 12 0 26 Oct 2023
SuperHF: Supervised Iterative Learning from Human Feedback Gabriel Mukobi Peter Chatain Su Fong Robert Windesheim Gitta Kutyniok Kush S. Bhatia Silas Alberti ALM 184 12 0 25 Oct 2023
Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI Mahyar Abbasian Elahe Khatibi Iman Azimi David Oniani Zahra Shakeri Hossein Abad ... Bryant Lin Olivier Gevaert Li-Jia Li Ramesh C. Jain Amir M. Rahmani LM&MA ELM AI4MH 401 116 0 21 Sep 2023
RAIN: Your Language Models Can Align Themselves without Finetuning International Conference on Learning Representations (ICLR), 2023 Yuhui Li Fangyun Wei Jinjing Zhao Chao Zhang Hongyang R. Zhang SILM 219 152 0 13 Sep 2023
Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity PLoS ONE (PLoS ONE), 2023 A. Amirova T. Fteropoulli Nafiso Ahmed Martin R. Cowie Joel Z Leibo 242 17 0 06 Sep 2023
Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models Qingyue Wang Y. Fu Yanan Cao Zhiliang Tian Zhiliang Tian Dacheng Tao LLMAG KELM RALM 488 44 0 29 Aug 2023
Deceptive Alignment Monitoring Andres Carranza Dhruv Pai Rylan Schaeffer Arnuv Tandon Oluwasanmi Koyejo 176 13 0 20 Jul 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety Markus Anderljung Joslyn Barnhart Anton Korinek Jade Leung Cullen O'Keefe ... Jonas Schuett Yonadav Shavit Divya Siddarth Robert F. Trager Kevin J. Wolf SILM 297 150 0 06 Jul 2023
Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models Aidan O'Gara 140 48 0 05 Jul 2023
An Overview of Catastrophic AI Risks Dan Hendrycks Mantas Mazeika Thomas Woodside SILM 444 238 0 21 Jun 2023