BiasX: "Thinking Slow" in Toxic Content Moderation with Explanations of Implied Social Biases (arXiv:2305.13589)
23 May 2023
Yiming Zhang, Sravani Nanduri, Liwei Jiang, Tongshuang Wu, Maarten Sap
Papers citing "BiasX: "Thinking Slow" in Toxic Content Moderation with Explanations of Implied Social Biases" (8 of 8 papers shown)
Real-World Gaps in AI Governance Research
Ilan Strauss, Isobel Moure, Tim O'Reilly, Sruly Rosenblat (30 Apr 2025)
Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations
David Hartmann, Amin Oueslati, Dimitri Staufer, Lena Pohlmann, Simon Munzert, Hendrik Heuer (03 Mar 2025)
SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior
Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine (22 Oct 2024)
Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech
Neemesh Yadav, Sarah Masud, Vikram Goyal, Md. Shad Akhtar, Tanmoy Chakraborty (06 Jun 2024)
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien (09 Apr 2024)
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason W. Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou (21 Mar 2022)
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe (04 Mar 2022)
e-SNLI: Natural Language Inference with Natural Language Explanations
Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, Phil Blunsom (04 Dec 2018)