2308.13768
Cited By
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
26 August 2023
Charles O'Neill
Jack Miller
I. Ciucă
Y. Ting 丁
Thang Bui
Papers citing
"Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content"
10 / 10 papers shown
Title
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde
Alasdair Paren
Preetham Arvind
Maxime Kayser
Tom Rainforth
Thomas Lukasiewicz
Bernard Ghanem
Philip H. S. Torr
Adel Bibi
45
1
0
26 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
V. Cevher
AAML
47
0
0
24 Feb 2025
Poisoning Language Models During Instruction Tuning
Alexander Wan
Eric Wallace
Sheng Shen
Dan Klein
SILM
90
124
0
01 May 2023
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
218
443
0
23 Aug 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
306
11,909
0
04 Mar 2022
A Survey of Toxic Comment Classification Methods
Kehan Wang
Jiaxi Yang
Hongjun Wu
15
8
0
13 Dec 2021
Analyzing Dynamic Adversarial Training Data in the Limit
Eric Wallace
Adina Williams
Robin Jia
Douwe Kiela
184
30
0
16 Oct 2021
Challenges in Detoxifying Language Models
Johannes Welbl
Amelia Glaese
J. Uesato
Sumanth Dathathri
John F. J. Mellor
Lisa Anne Hendricks
Kirsty Anderson
Pushmeet Kohli
Ben Coppin
Po-Sen Huang
LM&MA
242
193
0
15 Sep 2021
Machine Learning Suites for Online Toxicity Detection
David A. Noever
80
33
0
03 Oct 2018
NICT's Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task
Rui Wang
Benjamin Marie
Masao Utiyama
Eiichiro Sumita
11
4
0
19 Sep 2018