2402.18496
Cited By
Language Models Represent Beliefs of Self and Others
28 February 2024
Wentao Zhu
Zhining Zhang
Yizhou Wang
MILM
LRM
Papers citing "Language Models Represent Beliefs of Self and Others"
13 / 13 papers shown
Title
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
Yuchun Miao
Sen Zhang
Liang Ding
Yuqi Zhang
L. Zhang
Dacheng Tao
81
3
0
31 Jan 2025
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu
Michael Vaiana
Judd Rosenblatt
Cameron Berg
Diogo Schwerz de Lucena
66
0
0
20 Dec 2024
Learning Human-Aware Robot Policies for Adaptive Assistance
Jason Qin
Shikun Ban
Wentao Zhu
Yizhou Wang
Dimitris Samaras
76
0
0
16 Dec 2024
FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas
Yu Lei
Hao Liu
Chengxing Xie
Songjia Liu
Zhiyu Yin
Canyu Chen
G. Li
Philip H. S. Torr
Zhen Wu
18
1
0
14 Oct 2024
DynFrs: An Efficient Framework for Machine Unlearning in Random Forest
Shurong Wang
Zhuoyang Shen
Xinbao Qiao
Tongning Zhang
Meng Zhang
MU
11
0
0
02 Oct 2024
Benchmarking Mental State Representations in Language Models
Matteo Bortoletto
Constantin Ruhdorfer
Lei Shi
Andreas Bulling
AI4MH
LRM
33
4
0
25 Jun 2024
Truth-value judgment in language models: belief directions are context sensitive
Stefan F. Schouten
Peter Bloem
Ilia Markov
Piek Vossen
KELM
55
0
0
29 Apr 2024
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI
Xiao Bi
Deli Chen
Guanting Chen
...
Yao Zhao
Shangyan Zhou
Shunfeng Zhou
Qihao Zhu
Yuheng Zou
LRM
ALM
131
298
0
05 Jan 2024
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
197
2,953
0
22 Mar 2023
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
117
314
0
21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
219
291
0
24 Feb 2021
Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others
Kanishk Gandhi
Gala Stojnic
Brenden Lake
M. Dillon
36
46
0
23 Feb 2021
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,927
0
20 Apr 2018