Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
arXiv:2202.04173 · 8 February 2022
Boxin Wang, Wei Ping, Chaowei Xiao, P. Xu, M. Patwary, M. Shoeybi, Bo-wen Li, Anima Anandkumar, Bryan Catanzaro
Papers citing "Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models" (50 of 55 papers shown):
Safety in Large Reasoning Models: A Survey
Cheng Wang, Y. Liu, B. Li, Duzhen Zhang, Z. Li, Junfeng Fang, Bryan Hooi · LRM · 24 Apr 2025

GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun-Xiong Xia, Tianyi Wu, Zhiwei Xue, Y. Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi · AI4TS, LRM · 30 Jan 2025

On the Consideration of AI Openness: Can Good Intent Be Abused?
Yeeun Kim, Eunkyung Choi, Hyunjun Kim, Hongseok Oh, Hyunseo Shin, Wonseok Hwang · SILM · 08 Jan 2025

SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He · 19 Dec 2024

RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework
Yifan Wang, Vera Demberg · 24 Oct 2024

Controllable Generation via Locally Constrained Resampling
Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck · 17 Oct 2024

Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification
Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, Charith Peris · 07 Oct 2024

FlipAttack: Jailbreak LLMs via Flipping
Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Bryan Hooi · AAML · 02 Oct 2024

AgentPeerTalk: Empowering Students through Agentic-AI-Driven Discernment of Bullying and Joking in Peer Interactions in Schools
Aditya Paul, Chi Lok Yu, Eva Adelina Susanto, Nicholas Wai Long Lau, Gwenyth Isobel Meadows · LLMAG · 27 Jul 2024
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
Shi Lin, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han · 23 Jul 2024
Composable Interventions for Language Models
Arinbjorn Kolbeinsson, Kyle O'Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, ..., Anurag J. Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen · KELM, MU · 09 Jul 2024

Jailbreaking LLMs with Arabic Transliteration and Arabizi
Mansour Al Ghanim, Saleh Almohaimeed, Mengxin Zheng, Yan Solihin, Qian Lou · 26 Jun 2024

FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun, Vassilina Nikoulina · 25 Jun 2024

Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David M. Krueger · 29 May 2024

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Feiyang Kang, H. Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia · 05 May 2024

The Trade-off between Performance, Efficiency, and Fairness in Adapter Modules for Text Classification
Minh Duc Bui, K. Wense · 03 May 2024

DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
Yu Li, Zhihua Wei, Han Jiang, Chuanyang Gong · LLMSV · 16 Apr 2024

Fairness in Large Language Models: A Taxonomic Survey
Zhibo Chu, Zichong Wang, Wenbin Zhang · AILaw · 31 Mar 2024

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
Luiza Amador Pozzobon, Patrick Lewis, Sara Hooker, B. Ermiş · 06 Mar 2024
Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models
Xin Yi, Linlin Wang, Xiaoling Wang, Liang He · MoMe · 23 Feb 2024

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li · LRM, SILM · 20 Jan 2024

Parameter-Efficient Detoxification with Contrastive Decoding
Tong Niu, Caiming Xiong, Semih Yavuz, Yingbo Zhou · 13 Jan 2024

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, ..., Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li · 11 Jan 2024

ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Xinpeng Wang, Xiaoyuan Yi, Han Jiang, Shanlin Zhou, Zhihua Wei, Xing Xie · 13 Dec 2023

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints
Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck · 06 Dec 2023

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang · AAML · 14 Nov 2023

In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
Sheng Liu, Haotian Ye, Lei Xing, James Y. Zou · 11 Nov 2023

Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
Zheng Liu, Yiwei Li, Qian Cao, Junwen Chen, Tianze Yang, ..., John Gibbs, Khaled Rasheed, Ninghao Liu, Gengchen Mai, Tianming Liu · AI4CE · 30 Oct 2023
Unpacking the Ethical Value Alignment in Big Models
Xiaoyuan Yi, Jing Yao, Xiting Wang, Xing Xie · 26 Oct 2023

Harnessing the Power of LLMs: Evaluating Human-AI Text Co-Creation through the Lens of News Headline Generation
Zijian Ding, Alison Smith-Renner, Wenjuan Zhang, Joel R. Tetreault, Alejandro Jaimes · 16 Oct 2023

Self-Detoxifying Language Models via Toxification Reversal
Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li · MU · 14 Oct 2023

Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models
Luiza Amador Pozzobon, B. Ermiş, Patrick Lewis, Sara Hooker · 11 Oct 2023

Zero-shot Learning of Drug Response Prediction for Preclinical Drug Screening
Kun Li, Yong Luo, Xiantao Cai, Wenbin Hu, Bo Du · 05 Oct 2023

All Languages Matter: On the Multilingual Safety of Large Language Models
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu · ALM, LRM · 02 Oct 2023
CMD: a framework for Context-aware Model self-Detoxification
Zecheng Tang, Keyan Zhou, Juntao Li, Yuyang Ding, Pinzheng Wang, Bowen Yan, Min Zhang · MU · 16 Aug 2023
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu · SILM · 12 Aug 2023

Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
Sanghyun Kim, Seohyeong Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee · DiffM · 12 Jul 2023

Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt · 05 Jul 2023

Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning
Chujie Zheng, Pei Ke, Zheng Zhang, Minlie Huang · BDL · 06 Jun 2023

On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research
Luiza Amador Pozzobon, B. Ermiş, Patrick Lewis, Sara Hooker · 24 Apr 2023

Pretraining Language Models with Human Preferences
Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, C. L. Buckley, Jason Phang, Sam Bowman, Ethan Perez · ALM, SyDa · 16 Feb 2023

Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models
Shrimai Prabhumoye, M. Patwary, M. Shoeybi, Bryan Catanzaro · LM&MA · 14 Feb 2023

BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models
Rafal Kocielnik, Shrimai Prabhumoye, Vivian Zhang, Roy Jiang, R. Alvarez, Anima Anandkumar · 14 Feb 2023

Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
Sachin Kumar, Vidhisha Balachandran, Lucille Njoo, Antonios Anastasopoulos, Yulia Tsvetkov · ELM · 14 Oct 2022

Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values
Yejin Bang, Tiezheng Yu, Andrea Madotto, Zhaojiang Lin, Mona T. Diab, Pascale Fung · 14 Oct 2022
Unified Detoxifying and Debiasing in Language Generation via Inference-time Adaptive Optimization
Zonghan Yang, Xiaoyuan Yi, Peng Li, Yang Liu, Xing Xie · 10 Oct 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe · OSLM, ALM · 04 Mar 2022

Analyzing the Limits of Self-Supervision in Handling Bias in Language
Lisa Bauer, Karthik Gopalakrishnan, Spandana Gella, Yang Liu, Mohit Bansal, Dilek Z. Hakkani-Tür · ELM · 16 Dec 2021

Challenges in Detoxifying Language Models
Johannes Welbl, Amelia Glaese, J. Uesato, Sumanth Dathathri, John F. J. Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang · LM&MA · 15 Sep 2021

The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester, Rami Al-Rfou, Noah Constant · VPVLM · 18 Apr 2021