Language Models Represent Beliefs of Self and Others

28 February 2024

Papers citing "Language Models Represent Beliefs of Self and Others"

13 / 13 papers shown

Title
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking Yuchun Miao Sen Zhang Liang Ding Yuqi Zhang L. Zhang Dacheng Tao 81 3 0 31 Jan 2025
Towards Safe and Honest AI Agents with Neural Self-Other Overlap Marc Carauleanu Michael Vaiana Judd Rosenblatt Cameron Berg Diogo Schwerz de Lucena 66 0 0 20 Dec 2024
Learning Human-Aware Robot Policies for Adaptive Assistance Jason Qin Shikun Ban Wentao Zhu Yizhou Wang Dimitris Samaras 76 0 0 16 Dec 2024
FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas Yu Lei Hao Liu Chengxing Xie Songjia Liu Zhiyu Yin Canyu Chen G. Li Philip H. S. Torr Zhen Wu 18 1 0 14 Oct 2024
DynFrs: An Efficient Framework for Machine Unlearning in Random Forest Shurong Wang Zhuoyang Shen Xinbao Qiao Tongning Zhang Meng Zhang MU 11 0 0 02 Oct 2024
Benchmarking Mental State Representations in Language Models Matteo Bortoletto Constantin Ruhdorfer Lei Shi Andreas Bulling AI4MH LRM 33 4 0 25 Jun 2024
Truth-value judgment in language models: belief directions are context sensitive Stefan F. Schouten Peter Bloem Ilia Markov Piek Vossen KELM 55 0 0 29 Apr 2024
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism DeepSeek-AI: Xiao Bi Deli Chen Guanting Chen ... Yao Zhao Shangyan Zhou Shunfeng Zhou Qihao Zhu Yuheng Zou LRM ALM 131 298 0 05 Jan 2024
Sparks of Artificial General Intelligence: Early experiments with GPT-4 Sébastien Bubeck Varun Chandrasekaran Ronen Eldan J. Gehrke Eric Horvitz ... Scott M. Lundberg Harsha Nori Hamid Palangi Marco Tulio Ribeiro Yi Zhang ELM AI4MH AI4CE ALM 197 2,953 0 22 Mar 2023
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 117 314 0 21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 219 291 0 24 Feb 2021
Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others Kanishk Gandhi Gala Stojnic Brenden Lake M. Dillon 36 46 0 23 Feb 2021
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Alex Wang Amanpreet Singh Julian Michael Felix Hill Omer Levy Samuel R. Bowman ELM 294 6,927 0 20 Apr 2018