2406.02577
Are PPO-ed Language Models Hackable?
28 May 2024
Suraj Anand
David Getzen
Papers citing "Are PPO-ed Language Models Hackable?"
3 / 3 papers shown
Title
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
64
95
0
03 Jan 2024
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
245
1,986
0
31 Dec 2020
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
206
615
0
03 Sep 2019