A Baseline Analysis of Reward Models' Ability To Accurately Analyze
Foundation Models Under Distribution Shift

21 November 2023

Papers citing "A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift"

Title | Authors | Tags | Counts | Date
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future | Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou | LRM | 38 2 0 | 12 Apr 2025
Interpreting Language Reward Models via Contrastive Explanations | Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso | - | 74 0 0 | 25 Nov 2024
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison | Judy Hanwen Shen, Archit Sharma, Jun Qin | - | 37 4 0 | 15 Sep 2024
MetaRM: Shifted Distributions Alignment via Meta-Learning | Shihan Dou, Yan Liu, Enyu Zhou, Tianlong Li, Haoxiang Jia, ..., Junjie Ye, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang | OOD | 36 2 0 | 01 May 2024
Filtered Direct Preference Optimization | Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Air | - | 35 13 0 | 22 Apr 2024
Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy | Will LeVine, Benjamin Pikus, Jacob Phillips, Berk Norman, Fernando Amat Gil, Sean Hendryx | OODD | 47 1 0 | 22 Jan 2024
Diagnosing Model Performance Under Distribution Shift | Tiffany Cai, Hongseok Namkoong, Steve Yadlowsky | - | 32 27 0 | 03 Mar 2023
Extremely Simple Activation Shaping for Out-of-Distribution Detection | Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, Rosanne Liu | OODD | 158 148 0 | 20 Sep 2022
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 303 11,881 0 | 04 Mar 2022
On the Importance of Gradients for Detecting Distributional Shifts in the Wild | Rui Huang, Andrew Geng, Yixuan Li | - | 173 326 0 | 01 Oct 2021
Fine-Tuning Language Models from Human Preferences | Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving | ALM | 275 1,583 0 | 18 Sep 2019