SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

1 June 2025

Papers citing "SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs"

Title
Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT
Guy Bar-Shalom, Fabrizio Frasca, Yaniv Galron, Yftah Ziser, Haggai Maron
MLLM | 0 0 0 | 30 Sep 2025

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
Harethah Shairah, Hasan Hammoud, G. Turkiyyah, Bernard Ghanem
LLMSV | 60 1 0 | 28 Aug 2025

MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair
Changqing Li, Tianlin Li, Xiaohan Zhang, Aishan Liu, Li Pan
KELM, LLMSV | 44 0 0 | 09 Aug 2025