arXiv:2402.11111

Language Models as Science Tutors

16 February 2024
Alexis Chevalier
Jiayi Geng
Alexander Wettig
Howard Chen
Sebastian Mizera
Toni Annala
Max Jameson Aragon
Arturo Rodríguez Fanlo
Simon Frieder
Simon Machado
Akshara Prabhakar
Ellie Thieu
Jiachen T. Wang
Zirui Wang
Xindi Wu
Mengzhou Xia
Wenhan Jia
Jiatong Yu
Jun-Jie Zhu
Z. Ren
Sanjeev Arora
Danqi Chen
Abstract

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.
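
Although this page carries only the abstract, a minimal sketch of how one might query one of the released LM tutors with a long textbook chapter (in the spirit of TutorEval) is given below, using the Hugging Face transformers library. The checkpoint name "princeton-nlp/Llemma-7B-32K-MathMix", the input file "textbook_chapter.txt", and the prompt format are assumptions for illustration only; consult the authors' release for the actual model identifiers and expected chat template.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; not confirmed by this page.
model_id = "princeton-nlp/Llemma-7B-32K-MathMix"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A long STEM chapter; the tutors advertise a 32K-token context window.
with open("textbook_chapter.txt") as f:
    chapter = f.read()

question = "Why does the proof in Section 2.3 require compactness?"

# Plain completion-style prompt; the released models may expect a specific
# chat template, so treat this formatting as a placeholder.
prompt = f"{chapter}\n\nStudent: {question}\nTutor:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)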
