2202.09662
Reward Modeling for Mitigating Toxicity in Transformer-based Language Models
19 February 2022
Farshid Faal
K. Schmitt
Jia Yuan Yu
Papers citing "Reward Modeling for Mitigating Toxicity in Transformer-based Language Models"
17 / 17 papers shown
Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
Lukas Toral
Teddy Lazebnik
100
0
0
10 Sep 2025
IF-GUIDE: Influence Function-Guided Detoxification of LLMs
Zachary Coalson
Juhan Bae
Nicholas Carlini
Sanghyun Hong
TDI
333
1
0
02 Jun 2025
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun
Vassilina Nikoulina
200
5
0
25 Jun 2024
More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
Aaron Jiaxun Li
Satyapriya Krishna
Himabindu Lakkaraju
122
9
0
29 Apr 2024
From Google Gemini to OpenAI Q* (Q-Star): A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape
Timothy R. McIntosh
Teo Susnjak
Tong Liu
Paul Watters
Malka N. Halgamuge
341
70
0
18 Dec 2023
Axiomatic Preference Modeling for Longform Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Corby Rosset
Guoqing Zheng
Victor C. Dibia
Ahmed Hassan Awadallah
Paul Bennett
SyDa
111
6
0
02 Dec 2023
Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models
Sungjoo Byun
Dongjun Jang
Hyemi Jo
Hyopil Shin
106
3
0
30 Nov 2023
Don't Add, don't Miss: Effective Content Preserving Generation from Pre-Selected Text Spans
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Aviv Slobodkin
Avi Caciularu
Eran Hirsch
Ido Dagan
150
3
0
13 Oct 2023
Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Boyang Xue
Weichao Wang
Hongru Wang
Fei Mi
Rui Wang
Yasheng Wang
Lifeng Shang
Xin Jiang
Qun Liu
Kam-Fai Wong
KELM
HILM
473
22
0
12 Oct 2023
Aligning Language Models with Human Preferences via a Bayesian Approach
Neural Information Processing Systems (NeurIPS), 2023
Jiashuo Wang
Haozhao Wang
Shichao Sun
Wenjie Li
ALM
262
34
0
09 Oct 2023
Transformers in Reinforcement Learning: A Survey
Pranav Agarwal
A. Rahman
P. St-Charles
Simon J. D. Prince
Samira Ebrahimi Kahou
OffRL
204
27
0
12 Jul 2023
CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Rahul Madhavan
Rishabh Garg
Kahini Wadhawan
S. Mehta
186
6
0
01 Jun 2023
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
European Association for Machine Translation Conferences/Workshops (EAMT), 2023
Javier García Gilabert
Carlos Escolano
Marta R. Costa-jussá
CLL
MU
139
2
0
19 May 2023
On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Luiza Amador Pozzobon
Beyza Ermis
Patrick Lewis
Sara Hooker
148
50
0
24 Apr 2023
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ameet Deshpande
Vishvak Murahari
Tanmay Rajpurohit
Ashwin Kalyan
Karthik Narasimhan
LM&MA
LLMAG
181
442
0
11 Apr 2023
Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding
Maximillian Chen
Alexandros Papangelis
Chenyang Tao
Andrew Rosenbaum
Seokhwan Kim
Yang Liu
Zhou Yu
Dilek Z. Hakkani-Tür
168
38
0
25 Oct 2022
Quark: Controllable Text Generation with Reinforced Unlearning
Neural Information Processing Systems (NeurIPS), 2022
Ximing Lu
Sean Welleck
Jack Hessel
Liwei Jiang
Lianhui Qin
Peter West
Prithviraj Ammanabrolu
Yejin Choi
MU
393
251
0
26 May 2022