A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course

25 March 2024
Will Yeadon
Alex Peach
Craig P. Testrow
Abstract

This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against solely student work and a mixed category containing both student and GPT-4 contributions in university-level physics coding assignments using the Python language. Comparing 50 student submissions to 50 AI-generated submissions across different categories, each marked blindly by three independent markers, we amassed n = 300 data points. Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = 2.482 × 10⁻¹⁰). Prompt engineering significantly improved scores for both GPT-4 (p = 1.661 × 10⁻⁴) and GPT-3.5 (p = 4.967 × 10⁻⁹). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They accurately identified the authorship, with 92.1% of the work categorized as 'Definitely Human' being human-authored. Simplifying this to a binary 'AI' or 'Human' categorization resulted in an average accuracy rate of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.
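The sketch below illustrates the kind of analysis the abstract describes: comparing two groups of marks with a significance test and collapsing a four-point Likert authorship judgement to a binary 'AI' vs 'Human' label to compute detection accuracy. It is a minimal sketch only; the score distributions, sample sizes per array, and toy labels are assumptions, not the authors' data or code, and a Welch's t-test is used here as one plausible choice of test.

```python
# Minimal sketch (not the authors' code): illustrates the style of analysis
# described in the abstract, using made-up marks and toy labels.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical marks (out of 100) for two categories: student submissions and
# GPT-4 with prompt engineering, each marked by three blinded markers.
# Means/SDs are illustrative, loosely matching the reported averages.
student_marks = rng.normal(91.9, 5.0, size=150)
gpt4_pe_marks = rng.normal(81.1, 10.0, size=150)

# Welch's t-test (unequal variances) for the difference in mean marks.
t_stat, p_value = stats.ttest_ind(student_marks, gpt4_pe_marks, equal_var=False)
print(f"mean student = {student_marks.mean():.1f}, "
      f"mean GPT-4+PE = {gpt4_pe_marks.mean():.1f}, p = {p_value:.3g}")

# Collapsing four-point Likert authorship guesses to binary 'AI'/'Human'
# and scoring accuracy against toy ground-truth authorship labels.
guesses = ["Definitely Human", "Probably AI", "Definitely AI", "Probably Human"]
truth = ["Human", "AI", "AI", "Human"]

binary_guesses = ["AI" if "AI" in g else "Human" for g in guesses]
accuracy = np.mean([g == t for g, t in zip(binary_guesses, truth)])
print(f"binary detection accuracy = {accuracy:.0%}")
```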
