Discovering Latent Knowledge in Language Models Without Supervision

7 December 2022

Papers citing "Discovering Latent Knowledge in Language Models Without Supervision"

50 / 267 papers shown

Does Representation Matter? Exploring Intermediate Layers in Large Language Models Oscar Skean Md Rifat Arefin Yann LeCun Ravid Shwartz-Ziv 76 7 0 12 Dec 2024
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models Cameron Tice Philipp Alexander Kreer Nathan Helm-Burger Prithviraj Singh Shahani Fedor Ryzhenkov Jacob Haimes Felix Hofstätter Teun van der Weij 74 1 0 02 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning Keito Kudo Yoichi Aoki Tatsuki Kuribayashi Shusaku Sone Masaya Taniguchi Ana Brassard Keisuke Sakaguchi Kentaro Inui ReLM LRM 69 0 0 02 Dec 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit Zeqing He Zhibo Wang Zhixuan Chu Huiyu Xu Rui Zheng Kui Ren Chun Chen 49 3 0 17 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks Nathalie Maria Kirch Severin Field Helen Yannakoudakis Stephen Casper 26 1 0 02 Nov 2024
Improving Uncertainty Quantification in Large Language Models via Semantic Embeddings Yashvir S. Grewal Edwin V. Bonilla Thang D. Bui UQCV 23 3 0 30 Oct 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling Emanuele Marconato Sébastien Lachapelle Sebastian Weichwald Luigi Gresele 61 3 0 30 Oct 2024
Distinguishing Ignorance from Error in LLM Hallucinations Adi Simhi Jonathan Herzig Idan Szpektor Yonatan Belinkov HILM 53 2 0 29 Oct 2024
Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models Mohammad Beigi Sijia Wang Ying Shen Zihao Lin Adithya Kulkarni ... Ming Jin Jin-Hee Cho Dawei Zhou Chang-Tien Lu Lifu Huang 21 1 0 26 Oct 2024
Leveraging the Domain Adaptation of Retrieval Augmented Generation Models for Question Answering and Reducing Hallucination Salman Rakin Md. A. R. Shibly Zahin M. Hossain Zeeshan Khan Md. Mostofa Akbar 18 1 0 23 Oct 2024
DEAN: Deactivating the Coupled Neurons to Mitigate Fairness-Privacy Conflicts in Large Language Models Chen Qian Dongrui Liu Jie Zhang Yong Liu Jing Shao 24 1 0 22 Oct 2024
Chatting with Bots: AI, Speech Acts, and the Edge of Assertion Iwan Williams Tim Bayne 24 1 0 22 Oct 2024
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion Denitsa Saynova Lovisa Hagström Moa Johansson Richard Johansson Marco Kuhlmann HILM 34 0 0 18 Oct 2024
Do LLMs "know" internally when they follow instructions? Juyeon Heo Christina Heinze-Deml Oussama Elachqar Shirley Ren Udhay Nallasamy Andy Miller Kwan Ho Ryan Chan Jaya Narain 44 3 0 18 Oct 2024
Enhancing Fact Retrieval in PLMs through Truthfulness Paul Youssef Jorg Schlotterer C. Seifert KELM HILM 20 0 0 17 Oct 2024
Anchored Alignment for Self-Explanations Enhancement Luis Felipe Villa-Arenas Ata Nizamoglu Qianli Wang Sebastian Möller Vera Schmitt 19 0 0 17 Oct 2024
Balancing Label Quantity and Quality for Scalable Elicitation Alex Troy Mallen Nora Belrose 25 1 0 17 Oct 2024
Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation Yiming Wang Pei Zhang Baosong Yang Derek F. Wong Rui-cang Wang LRM 40 4 0 17 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering Alessandro Stolfo Vidhisha Balachandran Safoora Yousefi Eric Horvitz Besmira Nushi LLMSV 49 13 0 15 Oct 2024
Safety-Aware Fine-Tuning of Large Language Models Hyeong Kyu Choi Xuefeng Du Yixuan Li 35 10 0 13 Oct 2024
NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models Zheng Yi Ho Siyuan Liang Sen Zhang Yibing Zhan Dacheng Tao 26 1 0 11 Oct 2024
Chip-Tuning: Classify Before Language Models Say Fangwei Zhu Dian Li Jiajun Huang Gang Liu Hui Wang Zhifang Sui 23 0 0 09 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs Guy Kaplan Matanel Oren Yuval Reif Roy Schwartz 39 12 0 08 Oct 2024
Intuitions of Compromise: Utilitarianism vs. Contractualism Jared Moore Yejin Choi Sydney Levine 21 0 0 07 Oct 2024
Evaluating Language Model Character Traits Francis Rhys Ward Zejia Yang Alex Jackson Randy Brown Chandler Smith Grace Colverd Louis Thomson Raymond Douglas Patrik Bartak Andrew Rowan 32 0 0 05 Oct 2024
Understanding Reasoning in Chain-of-Thought from the Hopfieldian View Lijie Hu Liang Liu Shu Yang Xin Chen Zhen Tan Muhammad Asif Ali Mengdi Li Di Wang LRM 39 1 0 04 Oct 2024
FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs Deema Alnuhait Neeraja Kirtane Muhammad Khalifa Hao Peng HILM LRM 34 2 0 03 Oct 2024
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations Hadas Orgad Michael Toker Zorik Gekhman Roi Reichart Idan Szpektor Hadas Kotek Yonatan Belinkov HILM AIFin 46 24 0 03 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA Eduard Tulchinskii Laida Kushnareva Kristian Kuznetsov Anastasia Voznyuk Andrei Andriiainen Irina Piontkovskaya Evgeny Burnaev Serguei Barannikov 63 1 0 03 Oct 2024
Integrative Decoding: Improve Factuality via Implicit Self-consistency Yi Cheng Xiao Liang Yeyun Gong Wen Xiao Song Wang ... Wenjie Li Jian Jiao Qi Chen Peng Cheng Wayne Xiong HILM 50 1 0 02 Oct 2024
Towards Inference-time Category-wise Safety Steering for Large Language Models Amrita Bhattacharjee Shaona Ghosh Traian Rebedea Christopher Parisien LLMSV 21 2 0 02 Oct 2024
Attention layers provably solve single-location regression P. Marion Raphael Berthier Gérard Biau Claire Boyer 49 2 0 02 Oct 2024
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data Xuefeng Du Reshmi Ghosh Robert Sim Ahmed Salem Vitor Carvalho Emily Lawton Yixuan Li Jack W. Stokes VLM AAML 32 5 0 01 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training L. Yu Virginie Do Karen Hambardzumyan Nicola Cancedda AAML 53 9 0 30 Sep 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 57 2 0 30 Sep 2024
A Survey on the Honesty of Large Language Models Siheng Li Cheng Yang Taiqiang Wu Chufan Shi Yuji Zhang ... Jie Zhou Yujiu Yang Ngai Wong Xixin Wu Wai Lam HILM 27 4 0 27 Sep 2024
HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection Xuefeng Du Chaowei Xiao Yixuan Li HILM 27 16 0 26 Sep 2024
Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective Van-Cuong Pham Thien Huu Nguyen LLMSV 33 3 0 16 Sep 2024
HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making Sumera Anjum Hanzhi Zhang Wenjun Zhou Eun Jin Paek Xiaopeng Zhao Yunhe Feng 15 1 0 16 Sep 2024
Optimal ablation for interpretability Maximilian Li Lucas Janson FAtt 44 2 0 16 Sep 2024
On the Relationship between Truth and Political Bias in Language Models S. Fulay William Brannon Shrestha Mohanty Cassandra Overney Elinor Poole-Dayan Deb Roy Jad Kabbara HILM 22 1 0 09 Sep 2024
Modularity in Transformers: Investigating Neuron Separability & Specialization Nicholas Pochinkov Thomas Jones Mohammed Rashidur Rahman 25 0 0 30 Aug 2024
Personality Alignment of Large Language Models Minjun Zhu Linyi Yang Yue Zhang ALM 50 5 0 21 Aug 2024
A Little Confidence Goes a Long Way J. Scoville Shang Gao Devanshu Agrawal Javed Qadrud-Din 19 0 0 20 Aug 2024
Lower Layer Matters: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused Dingwei Chen Feiteng Fang Shiwen Ni Feng Liang Ruifeng Xu Min Yang Chengming Li HILM 14 1 0 16 Aug 2024
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs Shiping Liu Kecheng Zheng Wei Chen MLLM 41 33 0 31 Jul 2024
Cluster-norm for Unsupervised Probing of Knowledge Walter Laurito Sharan Maiya Grégoire Dhimoïla Owen Yeung Kaarel Hänni 27 2 0 26 Jul 2024
Internal Consistency and Self-Feedback in Large Language Models: A Survey Xun Liang Shichao Song Zifan Zheng Hanyu Wang Qingchen Yu ... Rong-Hua Li Peng Cheng Zhonghao Wang Feiyu Xiong Zhiyu Li HILM LRM 56 24 0 19 Jul 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach Nils Palumbo Ravi Mangal Zifan Wang Saranya Vijayakumar Corina S. Pasareanu Somesh Jha 34 1 0 18 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors Daniel Tan David Chanin Aengus Lynch Dimitrios Kanoulas Brooks Paige Adrià Garriga-Alonso Robert Kirk LLMSV 84 16 0 17 Jul 2024