Playpen: An Environment for Exploring Learning Through Conversational Interaction

11 April 2025
Nicola Horst
Davide Mazzaccara
Antonia Schmidt
Michael Sullivan
Filippo Momentè
Luca Franceschetti
Philipp Sadler
Sherzod Hakimov
Alberto Testoni
Raffaella Bernardi
Raquel Fernández
Alexander Koller
Oliver Lemon
David Schlangen
Mario Giulianelli
Alessandro Suglia
    OffRL
Abstract

Are we running out of learning signal? Predicting the next word in an existing text has turned out to be a powerful signal, at least at scale. But there are signs that we are running out of this resource. In recent months, interaction between learner and feedback-giver has come into focus, both for "alignment" (with a reward model judging the quality of instruction-following attempts) and for improving "reasoning" (with process- and outcome-based verifiers judging reasoning steps). In this paper, we explore to what extent synthetic interaction in what we call Dialogue Games (goal-directed and rule-governed activities driven predominantly by verbal actions) can provide a learning signal, and how this signal can be used. We introduce an environment for producing such interaction data, both offline and online, with a Large Language Model acting as counterpart to the learner model. We investigate the effects of supervised fine-tuning on this data, as well as reinforcement learning setups such as DPO and GRPO. All of these approaches achieve some improvement on in-domain games, but only GRPO generalises to out-of-domain games while retaining competitive performance on reference-based tasks. We release the framework and the baseline training setups in the hope that this can foster research in this promising new direction.
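
As a rough illustration of how a game outcome can act as a learning signal, the sketch below plays a toy word-guessing game several times and converts the episode scores into group-relative advantages, the normalisation step GRPO is generally described as using. Everything here is an assumption made for illustration: the game, the names `play_guessing_game`, `make_random_guesser` and `group_relative_advantages`, and the use of the population standard deviation. It is not the Playpen API, not one of the paper's games, and not the authors' training setup.

```python
import random
import statistics
from typing import Callable, List

# Hypothetical toy vocabulary; not taken from the paper.
WORDS = ["apple", "river", "cloud", "stone"]

# A guesser maps the list of its previous guesses to the next guess.
Guesser = Callable[[List[str]], str]


def play_guessing_game(guesser: Guesser, secret: str, max_turns: int = 2) -> float:
    """Play one episode of a rule-governed toy game.

    The learner (guesser) proposes one word per turn. The episode score is
    1.0 if the secret is named within max_turns, otherwise 0.0.
    """
    history: List[str] = []
    for _ in range(max_turns):
        guess = guesser(history)
        history.append(guess)
        if guess == secret:
            return 1.0
    return 0.0


def make_random_guesser(rng: random.Random) -> Guesser:
    """Stand-in for a learner model: guesses unseen words at random."""
    def guess(history: List[str]) -> str:
        remaining = [w for w in WORDS if w not in history]
        return rng.choice(remaining)
    return guess


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the mean and spread of its own group.

    Assumption: population standard deviation, with eps guarding against
    division by zero when every episode in the group scores the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    # One "group": several sampled episodes of the same game instance.
    rewards = [play_guessing_game(make_random_guesser(rng), "river") for _ in range(8)]
    print("episode scores:", rewards)
    print("advantages:", group_relative_advantages(rewards))
```

Episodes that score above the group average receive a positive advantage and those below a negative one, so no separate value model is needed in this sketch. A group in which every episode scores the same yields all-zero advantages and therefore no signal.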

@article{horst2025_2504.08590,
  title={Playpen: An Environment for Exploring Learning Through Conversational Interaction},
  author={Nicola Horst and Davide Mazzaccara and Antonia Schmidt and Michael Sullivan and Filippo Momentè and Luca Franceschetti and Philipp Sadler and Sherzod Hakimov and Alberto Testoni and Raffaella Bernardi and Raquel Fernández and Alexander Koller and Oliver Lemon and David Schlangen and Mario Giulianelli and Alessandro Suglia},
  journal={arXiv preprint arXiv:2504.08590},
  year={2025}
}