2406.08124
Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
12 June 2024
Duanyu Feng
Bowen Qin
Chen Huang
Youcheng Huang
Zheng-Wei Zhang
Wenqiang Lei
Papers citing "Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets"
4 / 4 papers shown
Title
The Platonic Representation Hypothesis
Minyoung Huh
Brian Cheung
Tongzhou Wang
Phillip Isola
72
107
0
13 May 2024
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Qinyu Zhao
Ming Xu
Kartik Gupta
Akshay Asthana
Liang Zheng
Stephen Gould
21
7
0
14 Mar 2024
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
117
314
0
21 Sep 2022
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
273
1,561
0
18 Sep 2019