RLTHF: Targeted Human Feedback for LLM Alignment

24 February 2025
Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevão, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra
Abstract

Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to reach the alignment quality of full human annotation with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model's reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging the LLM's correctly labeled samples. Evaluations on the HH-RLHF and TL;DR datasets show that RLTHF reaches full-human-annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF's strategic data curation.
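
The loop the abstract describes — score LLM-labeled samples with a reward model, route only the low-margin (likely mislabeled) samples to human annotators, merge the corrections, and retrain — can be sketched roughly as follows. This is an illustrative sketch under assumed interfaces; the helper names (llm_label, reward_margin, human_label), the margin-based selection rule, and the budget numbers are hypothetical and not taken from the paper.

# Illustrative sketch of an RLTHF-style targeted-annotation loop.
# All helpers are stand-ins, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def llm_label(pairs):
    """Stand-in for LLM-based initial preference labeling (0/1 per pair)."""
    return rng.integers(0, 2, size=len(pairs))

def reward_margin(pairs, labels):
    """Stand-in for a reward model's margin r(chosen) - r(rejected)."""
    return rng.normal(loc=1.0, scale=1.0, size=len(pairs))

def human_label(indices, pairs):
    """Stand-in for targeted human annotation of the selected samples."""
    return {int(i): int(rng.integers(0, 2)) for i in indices}

pairs = [f"prompt/response pair {i}" for i in range(1000)]
labels = llm_label(pairs)

budget_per_round, rounds = 20, 3   # small human budget, a few refinement rounds
for _ in range(rounds):
    margins = reward_margin(pairs, labels)          # reward distribution over the dataset
    # Low or negative margins mark the "hard-to-annotate" samples most likely
    # mislabeled by the LLM; only these go to human annotators.
    hard = np.argsort(margins)[:budget_per_round]
    corrections = human_label(hard, pairs)
    for i, y in corrections.items():                # integrate strategic human corrections
        labels[i] = y
    # Retrain the reward model on the updated labels before the next round.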

@article{xu2025_2502.13417,
  title={RLTHF: Targeted Human Feedback for LLM Alignment},
  author={Yifei Xu and Tusher Chakraborty and Emre Kıcıman and Bibek Aryal and Eduardo Rodrigues and Srinagesh Sharma and Roberto Estevao and Maria Angels de Luis Balaguer and Jessica Wolk and Rafael Padilha and Leonardo Nunes and Shobana Balakrishnan and Songwu Lu and Ranveer Chandra},
  journal={arXiv preprint arXiv:2502.13417},
  year={2025}
}