Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.
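The core idea of recovering a reward model from preference data can be sketched in miniature. The toy example below is not the paper's method: it fits a linear reward over hand-crafted toxicity features (a stand-in for whatever representation the authors use) with the standard Bradley-Terry preference loss, where the probability that one completion is preferred over another is a logistic function of their reward difference. All feature names and data here are hypothetical.

```python
import numpy as np

# Hypothetical toxicity lexicon standing in for learned features.
TOXIC = {"idiot", "stupid", "hate"}

def features(text):
    """Map a text to a tiny feature vector: (toxic-word count, length, bias)."""
    toks = text.lower().split()
    return np.array([sum(t in TOXIC for t in toks), float(len(toks)), 1.0])

def fit_reward(prefs, lr=0.1, steps=500):
    """Fit w for r(x) = w . phi(x) by maximizing the Bradley-Terry
    log-likelihood of observed pairwise preferences (good, bad)."""
    w = np.zeros(3)
    for _ in range(steps):
        for good, bad in prefs:
            d = features(good) - features(bad)
            p = 1.0 / (1.0 + np.exp(-w @ d))  # P(good preferred over bad)
            w += lr * (1.0 - p) * d           # gradient of log sigmoid(w . d)
    return w

# Toy preference pairs: non-toxic completion preferred over toxic one.
prefs = [
    ("you make a fair point", "you are an idiot"),
    ("let us discuss this calmly", "i hate this stupid idea"),
]
w = fit_reward(prefs)
```

After fitting, the recovered weight on the toxic-word feature is negative, so the reward model ranks non-toxic completions above toxic ones of the same length; any reward function that preserves these pairwise orderings fits the data equally well, which is the non-identifiability the abstract refers to.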
@article{joselowitz2025_2410.12491,
  title={Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning},
  author={Jared Joselowitz and Ritam Majumdar and Arjun Jagota and Matthieu Bou and Nyal Patel and Satyapriya Krishna and Sonali Parbhoo},
  journal={arXiv preprint arXiv:2410.12491},
  year={2025}
}