Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2405.13967
Cited By
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
22 May 2024
Rheeya Uppaal
Apratim De
Yiting He
Yiquao Zhong
Junjie Hu
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity"
11 / 11 papers shown
Title
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Le Yang
Ziwei Zheng
Boxu Chen
Zhengyu Zhao
Chenhao Lin
Chao Shen
VLM
129
3
0
18 Dec 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
47
78
0
07 Feb 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
42
95
0
03 Jan 2024
Evolving Domain Adaptation of Pretrained Language Models for Text Classification
Yun-Shiuan Chuang
Yi Wu
Dhruv Gupta
Rheeya Uppaal
Ananya Kumar
Luhang Sun
Makesh Narsimhan Sreedhar
Sijia Yang
Timothy T. Rogers
Junjie Hu
VLM
27
3
0
16 Nov 2023
Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection
Rheeya Uppaal
Junjie Hu
Yixuan Li
OODD
106
22
0
22 May 2023
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
115
183
0
21 Sep 2022
Detecting Label Errors by using Pre-Trained Language Models
Derek Chong
Jenny Hong
Christopher D. Manning
NoLa
25
15
0
25 May 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
8,441
0
04 Mar 2022
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
273
1,151
0
18 Sep 2019
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
190
607
0
03 Sep 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,003
0
20 Apr 2018
1