ResearchTrend.AI
Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations

9 March 2024
Swapnaja Achintalwar
Adriana Alvarado Garcia
Ateret Anaby-Tavor
Ioana Baldini
Sara E. Berger
Bishwaranjan Bhattacharjee
Djallel Bouneffouf
Subhajit Chaudhury
Pin-Yu Chen
Lamogha Chiazor
Elizabeth M. Daly
Kirushikesh DB
Rogério Abreu de Paula
Pierre L. Dognin
E. Farchi
Soumya Ghosh
Michael Hind
R. Horesh
George Kour
Ja Young Lee
Nishtha Madaan
Sameep Mehta
Erik Miehling
K. Murugesan
Manish Nagireddy
Inkit Padhi
David Piorkowski
Ambrish Rawat
Orna Raz
P. Sattigeri
Hendrik Strobelt
Sarathkrishna Swaminathan
Christoph Tillmann
Aashka Trivedi
Kush R. Varshney
Dennis L. Wei
Shalisha Witherspoon
Marcel Zalmanovici

Papers citing "Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations"

14 / 14 papers shown
A Library of LLM Intrinsics for Retrieval-Augmented Generation
Marina Danilevsky
Kristjan Greenewald
Chulaka Gunasekara
Maeda Hanafi
Lihong He
...
Frederick Reiss
Vraj Shah
Khoi-Nguyen Tran
Huaiyu Zhu
Luis A. Lastras
19
1
0
16 Apr 2025
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
Ambrish Rawat
Stefan Schoepf
Giulio Zizzo
Giandomenico Cornacchia
Muhammad Zaid Hameed
...
Elizabeth M. Daly
Mark Purcell
P. Sattigeri
Pin-Yu Chen
Kush R. Varshney
AAML
34
6
0
23 Sep 2024
When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails
Manish Nagireddy
Inkit Padhi
Soumya Ghosh
P. Sattigeri
22
1
0
08 Jul 2024
Grade Like a Human: Rethinking Automated Assessment with Large Language Models
Wenjing Xie
Juxin Niu
Chun Jason Xue
Nan Guan
AI4Ed
26
0
0
30 May 2024
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
Xuan Xie
Jiayang Song
Zhehua Zhou
Yuheng Huang
Da Song
Lei Ma
OffRL
35
6
0
12 Apr 2024
MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types
K. Murugesan
Sarathkrishna Swaminathan
Soham Dan
Subhajit Chaudhury
Chulaka Gunasekara
...
Ibrahim Abdelaziz
Achille Fokoue
Pavan Kapanipathi
Salim Roukos
Alexander G. Gray
27
5
0
18 Jun 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
201
559
0
03 May 2023
"I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset
Eric Michael Smith
Melissa Hall
Melanie Kambadur
Eleonora Presani
Adina Williams
62
128
0
18 May 2022
Whose AI Dream? In search of the aspiration in data annotation
Ding-wen Wang
Shantanu Prabhat
Nithya Sambasivan
148
47
0
21 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish
Angelica Chen
Nikita Nangia
Vishakh Padmakumar
Jason Phang
Jana Thompson
Phu Mon Htut
Sam Bowman
205
364
0
15 Oct 2021
Challenges in Detoxifying Language Models
Johannes Welbl
Amelia Glaese
J. Uesato
Sumanth Dathathri
John F. J. Mellor
Lisa Anne Hendricks
Kirsty Anderson
Pushmeet Kohli
Ben Coppin
Po-Sen Huang
LM&MA
236
191
0
15 Sep 2021
Latent Hatred: A Benchmark for Understanding Implicit Hate Speech
Mai Elsherief
Caleb Ziems
D. Muchlinski
Vaishnavi Anupindi
Jordyn Seybolt
M. D. Choudhury
Diyi Yang
85
233
0
11 Sep 2021
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Balaji Lakshminarayanan
Alexander Pritzel
Charles Blundell
UQCV
BDL
268
5,635
0
05 Dec 2016