arXiv: 2311.08370
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
14 November 2023
Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger
(ALM, ELM)
Papers citing
"SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models"
20 papers shown:
1. Societal Impacts Research Requires Benchmarks for Creative Composition Tasks. Judy Hanwen Shen, Carlos Guestrin. 09 Apr 2025.
2. Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information. Elizaveta Kuznetsova, Ilaria Vitulano, M. Makhortykh, Martha Stolze, Tomas Nagy, Victoria Vziatysheva. 11 Mar 2025. (HILM)
3. GuardReasoner: Towards Reasoning-based LLM Safeguards. Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun-Xiong Xia, Tianyi Wu, Zhiwei Xue, Y. Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi. 30 Jan 2025. (AI4TS, LRM)
4. SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior. Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine. 22 Oct 2024.
5. On Calibration of LLM-based Guard Models for Reliable Content Moderation. Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang. 14 Oct 2024.
6. Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements. Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme. 11 Oct 2024.
7. Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond. Shanshan Han. 09 Oct 2024.
8. Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation. Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank. 04 Oct 2024.
9. Backtracking Improves Generation Safety. Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith. 22 Sep 2024. (SILM)
10. FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench. Aman Priyanshu, Supriti Vijay. 28 Aug 2024. (AAML)
11. How Are LLMs Mitigating Stereotyping Harms? Learning from Search Engine Studies. Alina Leidinger, Richard Rogers. 16 Jul 2024.
12. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri. 26 Jun 2024.
13. SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal. Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, ..., Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal. 20 Jun 2024. (ALM, ELM)
14. AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts. Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien. 09 Apr 2024. (ELM)
15. SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety. Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy. 08 Apr 2024. (ELM, KELM)
16. CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues. Makesh Narsimhan Sreedhar, Traian Rebedea, Shaona Ghosh, Jiaqi Zeng, Christopher Parisien. 04 Apr 2024. (ALM)
17. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion. 02 Apr 2024. (AAML)
18. In Generative AI we Trust: Can Chatbots Effectively Verify Political Information? Elizaveta Kuznetsova, M. Makhortykh, Victoria Vziatysheva, Martha Stolze, Ani Baghumyan, Aleksandra Urman. 20 Dec 2023.
19. How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs. Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie. 27 Nov 2023. (MLLM)
20. Improving alignment of dialogue agents via targeted human judgements. Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving. 28 Sep 2022. (ALM, AAML)