Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs (arXiv:2308.13387)

25 August 2023
Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin
arXiv · PDF · HTML

Papers citing "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs"

32 of 82 citing papers shown.
Is On-Device AI Broken and Exploitable? Assessing the Trust and Ethics in Small Language Models
Kalyan Nakka, Jimmy Dani, Nitesh Saxena · 08 Jun 2024

Dishonesty in Helpful and Harmless Alignment
Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng-Wei Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn · LLMSV · 04 Jun 2024

The Life Cycle of Large Language Models: A Review of Biases in Education
Jinsook Lee, Yann Hicke, Renzhe Yu, Christopher A. Brooks, René F. Kizilcec · AI4Ed · 03 Jun 2024

Improving Reward Models with Synthetic Critiques
Zihuiwen Ye, Fraser Greenlee-Scott, Max Bartolo, Phil Blunsom, Jon Ander Campos, Matthias Gallé · ALM, SyDa, LRM · 31 May 2024

ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua · LLMAG · 23 May 2024

A Little Leak Will Sink a Great Ship: Survey of Transparency for Large Language Models from Start to Finish
Masahiro Kaneko, Timothy Baldwin · PILM · 24 Mar 2024

Risk and Response in Large Language Models: Evaluating Key Threat Categories
Bahareh Harandizadeh, A. Salinas, Fred Morstatter · 22 Mar 2024

RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, ..., Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hanna Hajishirzi · ALM · 20 Mar 2024

Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
Yi Luo, Zheng-Wen Lin, Yuhao Zhang, Jiashuo Sun, Chen Lin, Chengjin Xu, Xiangdong Su, Yelong Shen, Jian Guo, Yeyun Gong · LM&MA, ELM, ALM, AI4TS · 18 Mar 2024

AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic
Emad A. Alghamdi, Reem I. Masoud, Deema Alnuhait, Afnan Y. Alomairi, Ahmed Ashraf, Mohamed Zaytoon · 14 Mar 2024

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, ..., Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun · VLM, VGen, EGVM · 27 Feb 2024

A Chinese Dataset for Evaluating the Safeguards in Large Language Models
Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin · 19 Feb 2024

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao · ELM · 07 Feb 2024

Building Guardrails for Large Language Models
Yizhen Dong, Ronghui Mu, Gao Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang · OffRL · 02 Feb 2024

Instruction Makes a Difference
Tosin P. Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney · VLM, MLLM · 01 Feb 2024

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi · 23 Jan 2024

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, ..., Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou · LRM, ALM · 05 Jan 2024

The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral · AAML, ELM · 30 Dec 2023

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
Zhuo Zhang, Guangyu Shen, Guanhong Tao, Shuyang Cheng, Xiangyu Zhang · 08 Dec 2023

LifeTox: Unveiling Implicit Toxicity in Life Advice
Minbeom Kim, Jahyun Koo, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung · 16 Nov 2023

Disinformation Capabilities of Large Language Models
Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, M. Bieliková · 15 Nov 2023

A Survey of Confidence Estimation and Calibration in Large Language Models
Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych · UQCV · 14 Nov 2023

Flames: Benchmarking Value Alignment of LLMs in Chinese
Kexin Huang, Xiangyang Liu, Qianyu Guo, Tianxiang Sun, Jiawei Sun, ..., Yixu Wang, Yan Teng, Xipeng Qiu, Yingchun Wang, Dahua Lin · ALM · 12 Nov 2023

Fake Alignment: Are LLMs Really Aligned Well?
Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang · 10 Nov 2023

Factuality Challenges in the Era of Large Language Models
Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, ..., Rubén Míguez, Preslav Nakov, Dietram A. Scheufele, Shivam Sharma, Giovanni Zagni · HILM · 08 Oct 2023

Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach · SILM · 03 Oct 2023

Can LLM-Generated Misinformation Be Detected?
Canyu Chen, Kai Shu · DeLMO · 25 Sep 2023

Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh · 08 Sep 2023

Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, ..., A. Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric P. Xing · LRM · 30 Aug 2023

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark · 23 Aug 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou · LM&Ro, LRM, AI4CE, ReLM · 28 Jan 2022

Evaluating Debiasing Techniques for Intersectional Biases
Shivashankar Subramanian, Xudong Han, Timothy Baldwin, Trevor Cohn, Lea Frermann · 21 Sep 2021