2311.14743
Cited By
A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift
21 November 2023
Will LeVine
Benjamin Pikus
Tony Chen
Sean Hendryx
Papers citing "A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift"
11 / 11 papers shown
Title
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong
Wei Shen
Yanzeng Li
Songyang Gao
Hua Lu
Yicheng Chen
Yang Zhang
Wei Zhou
Jinjie Gu
Lei Zou
LRM
38
2
0
12 Apr 2025
Interpreting Language Reward Models via Contrastive Explanations
Junqi Jiang
Tom Bewley
Saumitra Mishra
Freddy Lecue
Manuela Veloso
74
0
0
25 Nov 2024
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
Judy Hanwen Shen
Archit Sharma
Jun Qin
37
4
0
15 Sep 2024
MetaRM: Shifted Distributions Alignment via Meta-Learning
Shihan Dou
Yan Liu
Enyu Zhou
Tianlong Li
Haoxiang Jia
...
Junjie Ye
Rui Zheng
Tao Gui
Qi Zhang
Xuanjing Huang
OOD
36
2
0
01 May 2024
Filtered Direct Preference Optimization
Tetsuro Morimura
Mitsuki Sakamoto
Yuu Jinnai
Kenshi Abe
Kaito Ariu
35
13
0
22 Apr 2024
Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy
Will LeVine
Benjamin Pikus
Jacob Phillips
Berk Norman
Fernando Amat Gil
Sean Hendryx
OODD
47
1
0
22 Jan 2024
Diagnosing Model Performance Under Distribution Shift
Tiffany Cai
Hongseok Namkoong
Steve Yadlowsky
32
27
0
03 Mar 2023
Extremely Simple Activation Shaping for Out-of-Distribution Detection
Andrija Djurisic
Nebojsa Bozanic
Arjun Ashok
Rosanne Liu
OODD
158
148
0
20 Sep 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
303
11,881
0
04 Mar 2022
On the Importance of Gradients for Detecting Distributional Shifts in the Wild
Rui Huang
Andrew Geng
Yixuan Li
173
326
0
01 Oct 2021
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
275
1,583
0
18 Sep 2019