SciCode: A Research Coding Benchmark Curated by Scientists

18 July 2024
Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, K. Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Min Zhu, K. Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu A. Huerta, Hao Peng
Abstract

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop high-quality, realistic evaluations that still challenge them. We address this issue by examining LMs' ability to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created SciCode, a scientist-curated coding benchmark. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. Each problem offers an optional description of useful scientific background, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude 3.5 Sonnet, the best-performing model among those tested, solves only 4.6% of the problems in the most realistic setting. We believe SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the future development and evaluation of scientific AI.
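The benchmark structure described in the abstract suggests a straightforward evaluation harness: model-generated code for each subproblem is executed against scientist-written tests, and a main problem counts as solved only when every one of its subproblems passes. The Python sketch below illustrates this idea; the `SubProblem` fields and the `evaluate` harness are hypothetical assumptions for illustration, not SciCode's actual data format or API.

```python
# Hypothetical sketch of a SciCode-style subproblem and pass/fail harness.
# Field names and evaluation logic are assumptions, not the benchmark's API.
from dataclasses import dataclass, field


@dataclass
class SubProblem:
    description: str                                 # task the model must implement
    background: str = ""                             # optional scientific background
    tests: list[str] = field(default_factory=list)   # assert-style test snippets


def evaluate(candidate_code: str, sub: SubProblem) -> bool:
    """Execute generated code, then each gold test, in a shared namespace.

    A subproblem is solved only if all of its tests pass; a main problem
    would count as solved only if all of its subproblems are.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # defines the requested function(s)
        for test in sub.tests:
            exec(test, namespace)         # assert-based checks raise on failure
    except Exception:
        return False
    return True


# Usage with a toy subproblem (not taken from the benchmark).
sub = SubProblem(
    description="Implement add_vectors(a, b) returning the elementwise sum.",
    tests=["assert add_vectors([1, 2], [3, 4]) == [4, 6]"],
)
candidate = "def add_vectors(a, b):\n    return [x + y for x, y in zip(a, b)]"
print(evaluate(candidate, sub))  # True
```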
