
On the Exponential Convergence for Offline RLHF with Pairwise Comparisons

Abstract

We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to ascertain the optimal action for each state, with the ultimate goal of minimizing the {\em simple regret}. We propose an algorithm, \underline{RL} with \underline{L}ocally \underline{O}ptimal \underline{W}eights or {\sc RL-LOW}, which yields a simple regret of the exponential form $\exp(-\Omega(n/H))$, where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound for offline RLHF with pairwise comparisons. Interestingly, the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of {\sc RL-LOW}. In view of privacy considerations in practical applications, we also extend {\sc RL-LOW} to the setting of $(\varepsilon,\delta)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of {\sc RL-LOW} in preserving the privacy of the observed rewards. By establishing instance-dependent bounds with exponential convergence, our work fills a gap left by existing studies, which concentrate on worst-case regrets with {\em inverse polynomial convergence} (e.g., $\widetilde{O}(\frac{1}{\sqrt{n}})$) for offline RLHF with pairwise comparisons.
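To make the claimed rates concrete, the following display is an illustrative sketch only (the precise definitions of the simple regret and of the hardness quantity $H$ are those in the paper; the notation here is indicative, not the authors' exact statement). The abstract's upper and lower bounds take the form
$$
\text{SimpleRegret}(n) \;\le\; \exp\big(-\Omega(n/H)\big)
\qquad\text{and}\qquad
\text{SimpleRegret}(n) \;\ge\; \exp\big(-O(n/H)\big),
$$
so the two bounds agree order-wise in the exponent, in contrast to worst-case guarantees of the form $\widetilde{O}(1/\sqrt{n})$ that decay only inverse-polynomially in $n$.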
