Truth is Universal: Robust Detection of Lies in LLMs

3 July 2024

Papers citing "Truth is Universal: Robust Detection of Lies in LLMs"

Title | Authors | Tags | Counts | Date
Prompt-Guided Internal States for Hallucination Detection of Large Language Models | Fujie Zhang, Peiqi Yu, Biao Yi, Baolei Zhang, Tong Li, Zheli Liu | HILM, LRM | 46 0 0 | 07 Nov 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models | Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao | - | 54 18 0 | 02 Jul 2024
The Platonic Representation Hypothesis | Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola | - | 72 107 0 | 13 May 2024
Gemma: Open Models Based on Gemini Research and Technology | Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, ..., Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, Kathleen Kenealy | VLM, LLMAG | 123 415 0 | 13 Mar 2024
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Samuel Marks, Max Tegmark | HILM | 91 164 0 | 10 Oct 2023
The Internal State of an LLM Knows When It's Lying | A. Azaria, Tom Michael Mitchell | HILM | 210 297 0 | 26 Apr 2023
Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML, MILM | 117 314 0 | 21 Sep 2022