
Reward Modeling for Mitigating Toxicity in Transformer-based Language Models
arXiv:2202.09662 (v6, latest)
19 February 2022
Farshid Faal, K. Schmitt, Jia Yuan Yu

Papers citing "Reward Modeling for Mitigating Toxicity in Transformer-based Language Models"

17 papers:

  1. Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing. Lukas Toral, Teddy Lazebnik. 10 Sep 2025.
  2. IF-GUIDE: Influence Function-Guided Detoxification of LLMs. Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong. 02 Jun 2025.
  3. FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts. Caroline Brun, Vassilina Nikoulina. 25 Jun 2024.
  4. More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness. Aaron Jiaxun Li, Satyapriya Krishna, Himabindu Lakkaraju. 29 Apr 2024.
  5. From Google Gemini to OpenAI Q* (Q-Star): A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape. Timothy R. McIntosh, Teo Susnjak, Tong Liu, Paul Watters, Malka N. Halgamuge. 18 Dec 2023.
  6. Axiomatic Preference Modeling for Longform Question Answering. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. Corby Rosset, Guoqing Zheng, Victor C. Dibia, Ahmed Hassan Awadallah, Paul Bennett. 02 Dec 2023.
  7. Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models. Sungjoo Byun, Dongjun Jang, Hyemi Jo, Hyopil Shin. 30 Nov 2023.
  8. Don't Add, Don't Miss: Effective Content Preserving Generation from Pre-Selected Text Spans. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan. 13 Oct 2023.
  9. Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. Boyang Xue, Weichao Wang, Hongru Wang, Fei Mi, Rui Wang, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong. 12 Oct 2023.
  10. Aligning Language Models with Human Preferences via a Bayesian Approach. Neural Information Processing Systems (NeurIPS), 2023. Jiashuo Wang, Haozhao Wang, Shichao Sun, Wenjie Li. 09 Oct 2023.
  11. Transformers in Reinforcement Learning: A Survey. Pranav Agarwal, A. Rahman, P. St-Charles, Simon J. D. Prince, Samira Ebrahimi Kahou. 12 Jul 2023.
  12. CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. Rahul Madhavan, Rishabh Garg, Kahini Wadhawan, S. Mehta. 01 Jun 2023.
  13. ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation. European Association for Machine Translation Conferences/Workshops (EAMT), 2023. Javier García Gilabert, Carlos Escolano, Marta R. Costa-jussà. 19 May 2023.
  14. On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. Luiza Amador Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker. 24 Apr 2023.
  15. Toxicity in ChatGPT: Analyzing Persona-assigned Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan. 11 Apr 2023.
  16. Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding. Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Andrew Rosenbaum, Seokhwan Kim, Yang Liu, Zhou Yu, Dilek Z. Hakkani-Tür. 25 Oct 2022.
  17. Quark: Controllable Text Generation with Reinforced Unlearning. Neural Information Processing Systems (NeurIPS), 2022. Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi. 26 May 2022.