Monitoring Latent World States in Language Models with Propositional Probes

27 June 2024

Jacob Steinhardt

Papers citing "Monitoring Latent World States in Language Models with Propositional Probes"

Title | Authors | Tags | Counts | Date
Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models | Guy Kaplan, Michael Toker, Yuval Reif, Yonatan Belinkov, Roy Schwartz | DiffM | 48 0 0 | 01 Apr 2025
Mind the Gap: Bridging the Divide Between AI Aspirations and the Reality of Autonomous Characterization | Grace Guinan, Addison Salvador, Michelle A. Smeaton, Andrew Glaws, Hilary Egan, Brian C. Wyatt, Babak Anasori, K. Fiedler, M. Olszta, Steven Spurgeon | - | 63 0 0 | 25 Feb 2025
Mechanistic Interpretability of Emotion Inference in Large Language Models | Ala Nekouvaght Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, Jonathan Gratch | - | 49 0 0 | 08 Feb 2025
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages | Jannik Brinkmann, Chris Wendler, Christian Bartelt, Aaron Mueller | - | 41 9 0 | 10 Jan 2025
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis | Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy | OffRL, LRM, NAI | 29 0 0 | 06 Nov 2024
Relational Composition in Neural Networks: A Survey and Call to Action | Martin Wattenberg, Fernanda Viégas | CoGe | 36 9 0 | 19 Jul 2024
Towards Understanding Sycophancy in Language Models | Mrinank Sharma, Meg Tong, Tomasz Korbak, D. Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez | - | 209 178 0 | 20 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Samuel Marks, Max Tegmark | HILM | 91 164 0 | 10 Oct 2023
The System Model and the User Model: Exploring AI Dashboard Design | Fernanda Viégas, Martin Wattenberg | - | 21 6 0 | 04 May 2023
Entity Tracking in Language Models | Najoung Kim, Sebastian Schuster | - | 50 16 0 | 03 May 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations | Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman | CML | 73 98 0 | 05 Mar 2023
Leveraging Large Language Models for Multiple Choice Question Answering | Joshua Robinson, Christopher Rytting, David Wingate | ELM | 126 181 0 | 22 Oct 2022
BBQ: A Hand-Built Bias Benchmark for Question Answering | Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Sam Bowman | - | 210 364 0 | 15 Oct 2021