Deception Abilities Emerged in Large Language Models

Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2023

31 July 2023

Thilo Hagendorff

LLMAG

Papers citing "Deception Abilities Emerged in Large Language Models"

50 / 61 papers shown

Are Your Agents Upward Deceivers?

...

224

04 Dec 2025

Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

248

29 Nov 2025

Estimating the Error of Large Language Models at Pairwise Text Comparison

Tianyi Li

132

25 Oct 2025

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

222

17 Oct 2025

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

127

16 Oct 2025

Scheming Ability in LLM-to-LLM Strategic Interactions

Thao Pham

LLMAG LRM

168

11 Oct 2025

VelLMes: A high-interaction AI-based deception framework

183

08 Oct 2025

Know Thyself? On the Incapability and Implications of AI Self-Recognition

327

03 Oct 2025

A Single Character can Make or Break Your LLM Evals

155

02 Oct 2025

The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind

137

23 Sep 2025

Psychometric Personality Shaping Modulates Capabilities and Safety in Language Models

Jose Hernandez-Orallo

178

19 Sep 2025

Caught in the Act: a mechanistic approach to detecting deception

158

27 Aug 2025

A Multi-Task Evaluation of LLMs' Processing of Academic Text Input

Tianyi Li

Yu Qin

Olivia R. Liu Sheng

183

15 Aug 2025

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

253

08 Aug 2025

Against racing to AGI: Cooperation, deterrence, and catastrophic risks

Leonard Dung

Max Hellrigel-Holderbaum

AI4CE

294

29 Jul 2025

PRISON: Unmasking the Criminal Potential of Large Language Models

339

19 Jun 2025

Large Language Models are Near-Optimal Decision-Makers with a Non-Human Learning Behavior

248

19 Jun 2025

Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors

Harish Tayyar Madabushi

Melissa Torgbi

C. Bonial

467

29 May 2025

Mitigating Deceptive Alignment via Self-Monitoring

...

353

24 May 2025

Exploring the generalization of LLM truth directions on conversational formats

Timour Ichmoukhamedov

David Martens

276

14 May 2025

534

25 Apr 2025

Super Co-alignment of Human and AI for Sustainable Symbiotic Society

...

699

24 Apr 2025

OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation

383

18 Apr 2025

Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models

Thilo Hagendorff

Sarah Fabi

ReLM ELM LRM

280

14 Apr 2025

Measurement of LLM's Philosophies of Human Nature

425

03 Apr 2025

Do Large Language Models Exhibit Spontaneous Rational Deception?

Samuel M. Taylor

Benjamin K. Bergen

LRM

456

31 Mar 2025

I'm Sorry Dave: How the old world of personnel security can inform the new world of AI insider risk

Paul Martin

Sarah Mercer

1.2K

26 Mar 2025

Research Superalignment Should Advance Now with Alternating Competence and Conformity Optimization

421

08 Mar 2025

This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

Lorenz Wolf

Sangwoong Yoon

Ilija Bogunovic

256

07 Mar 2025

OpenAI o1 System Card

...

448

21 Dec 2024

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

Diogo Schwerz de Lucena

296

20 Dec 2024

Bio-inspired AI: Integrating Biological Complexity into Artificial Intelligence

Nima Dehghani

Michael Levin

294

22 Nov 2024

Can LLMs make trade-offs involving stipulated pain and pleasure states?

Blaise Agüera y Arcas

Jonathan Birch

320

01 Nov 2024

Towards evaluations-based safety cases for AI scheming

...

328

29 Oct 2024

An Auditing Test To Detect Behavioral Shift in Language Models

International Conference on Learning Representations (ICLR), 2024

505

25 Oct 2024

Do LLMs write like humans? Variation in grammatical and rhetorical styles

Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2024

384

21 Oct 2024

Who is Undercover? Guiding LLMs to Explore Multi-Perspective Team Tactic in the Game

248

20 Oct 2024

Assistive AI for Augmenting Human Decision-making

Natabara Máté Gyöngyössy

Krisztina Menyhárd-Balázs

András Simonyi

Patrick van der Smagt

Zsolt Ződi

András Lőrincz

374

18 Oct 2024

FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas

420

14 Oct 2024

Neural Decompiling of Tracr Transformers

IAPR International Workshop on Artificial Neural Networks in Pattern Recognition (ANNPR), 2024

Hannes Thurnherr

Kaspar Riesen

ViT

349

29 Sep 2024

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

...

386

22 Aug 2024

A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs

Jake R. Watts

Joel Sokol

272

24 Jul 2024

Truth is Universal: Robust Detection of Lies in LLMs

293

03 Jul 2024

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

Tanush Chopra

Michael Li

01 Jul 2024

BeHonest: Benchmarking Honesty in Large Language Models

Steffi Chern

Zhulin Hu

Yuqing Yang

Ethan Chern

Binjie Wang

367

19 Jun 2024

An Assessment of Model-On-Model Deception

Julius Heitkoetter

Michael Gerovitch

Laker Newhouse

208

10 May 2024

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

327

08 May 2024

Safeguarding Marketing Research: The Generation, Identification, and Mitigation of AI-Fabricated Disinformation

Anirban Mukherjee

199

17 Mar 2024

Mapping the Ethics of Generative AI: A Comprehensive Scoping Review

Thilo Hagendorff

293

100

13 Feb 2024

Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models

Linge Guo

162

07 Feb 2024