Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi, Evan Hubinger
25 April 2024
arXiv: 2405.01576
Papers citing "Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant" (8 papers)
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
07 Mar 2025
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks
Hieu Minh "Jord" Nguyen
LM&MA, LRM
10 Feb 2025
OpenAI o1 System Card
OpenAI: Aaron Jaech, Adam Tauman Kalai, Adam Lerer, ..., Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, Zhuohan Li
ELM, LRM, AI4CE
21 Dec 2024
Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, ..., Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
ELM
29 Oct 2024
Truth is Universal: Robust Detection of Lies in LLMs
Lennart Bürger, Fred Hamprecht, B. Nadler
HILM
03 Jul 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM
11 Jun 2024
Scheming AIs: Will AIs fake alignment during training in order to get power?
Joe Carlsmith
14 Nov 2023
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
20 Oct 2023