We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fixes to $32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (this https URL). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
@article{miserendino2025_2502.12115,
title={ SWE-Lancer: Can Frontier LLMs Earn \$1 Million from Real-World Freelance Software Engineering? },
author={ Samuel Miserendino and Michele Wang and Tejal Patwardhan and Johannes Heidecke },
journal={arXiv preprint arXiv:2502.12115},
year={ 2025 }
}