Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Hongyu Chen, Seraphina Goldfarb-Tarrant
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
arXiv:2503.09347 · 12 March 2025
Papers citing "Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts" (6 papers)
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Zixuan Hu, Li Shen, Zhenyi Wang, Yongxian Wei, Dacheng Tao
AAML · 31 Oct 2025

A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users
Nishant Balepur, Matthew Shu, Yoo Yeon Sung, Seraphina Goldfarb-Tarrant, Shi Feng, Fumeng Yang, Rachel Rudinger, Jordan L. Boyd-Graber
23 Sep 2025

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni, Mohammed Haddou, Jackie CK Cheung, G. Farnadi
LLMAG · 25 Aug 2025

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Ruosen Li, Teerth Patel, Xinya Du
LLMAG, ALM · 03 Jan 2025

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, ..., Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
ELM, AILaw · 25 Nov 2024

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes
ELM, ALM · 18 Jun 2024