SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

17 February 2025
Samuel Miserendino
Ming Wang
Tejal Patwardhan
Johannes Heidecke
ArXiv (abs) | PDF | HTML | HuggingFace (46 upvotes)

Papers citing "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?"

43 / 43 papers shown
CostNav: A Navigation Benchmark for Cost-Aware Evaluation of Embodied Agents
Haebin Seong
Sungmin Kim
Minchan Kim
Yongjun Cho
Myunchul Joe
...
Yoonshik Kim
Samwoo Seong
Yubeen Park
Youngjae Yu
Yunsung Lee
76
0
0
25 Nov 2025
UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI
Darvin Yi
Teng Liu
Mattie Terzolo
Lance Hasson
Ayan Sinh
Pablo Mendes
Andrew Rabinovich
32
0
0
15 Nov 2025
Analyzing Political Text at Scale with Online Tensor LDA
Sara Kangaslahti
Danny Ebanks
Jean Kossaifi
Anqi Liu
R. Alvarez
A. Anandkumar
52
0
0
11 Nov 2025
Remote Labor Index: Measuring AI Automation of Remote Work
Mantas Mazeika
Alice Gatti
Cristina Menghini
Udari Madhushani Sehwag
Shivam Singhal
...
Summer Yue
Alexandr Wang
Bing Liu
Ernesto Hernandez
Dan Hendrycks
91
2
0
30 Oct 2025
MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
Jia-Kai Dong
I-Wei Huang
Chun-Tin Wu
Yi-Tien Tsai
80
0
0
22 Oct 2025
Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents
Yihong Tang
Kehai Chen
Liang Yue
Jinxin Fan
Caishen Zhou
...
Kaiyang Guo
Xingshan Zeng
Wenjing Cun
L. Shang
Min Zhang
LLMAG
118
0
0
20 Oct 2025
PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
Y. Liu
D. Zhu
Zena Al-Khalili
Dai Cheng
Yanjun Chen
Dietrich Klakow
Wei Zhang
Xiaoyu Shen
48
0
0
14 Oct 2025
UnitTenX: Generating Tests for Legacy Packages with AI Agents Powered by Formal Verification
Yiannis Charalambous
Claudionor N. Coelho Jr
Luis Lamb
Lucas C. Cordeiro
64
0
0
06 Oct 2025
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Tejal Patwardhan
Rachel Dias
Elizabeth Proehl
Grace Kim
Michele Wang
...
David Li
Michael Sharman
Alexandra Barr
Amelia Glaese
Jerry Tworek
AI4TS, ALM
108
7
0
05 Oct 2025
The AI Productivity Index (APEX)
Bertie Vidgen
Abby Fennelly
Evan Pinnix
Chirag Mahapatra
Zach Richards
...
Eric Topol
Osvald Nitski
Brendan Foody
ALM, ELM
65
3
0
30 Sep 2025
InfoAgent: Advancing Autonomous Information-Seeking Agents
Gongrui Zhang
Jialiang Zhu
Ruiqi Yang
Kai Qiu
Miaosen Zhang
...
Yuan Zhang
Xin Li
Zhaoyi Liu
Xin Geng
Baining Guo
LM&Ro
53
1
0
29 Sep 2025
WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
Zimu Lu
Houxing Ren
Yunqiao Yang
Ke Wang
Zhuofan Zong
Junting Pan
Mingjie Zhan
Jiaming Song
LLMAG
61
0
0
26 Sep 2025
Evaluating LLM Generated Detection Rules in Cybersecurity
Anna Bertiger
Bobby Filar
Aryan Luthra
Stefano Meschiari
Aiden Mitchell
Sam Scholten
Vivek Sharath
44
0
0
20 Sep 2025
GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
Ziyi Ni
Huacan Wang
Shuo Zhang
Shuo Lu
Ziyang He
...
Xin Li
Chen-Hao Hu
Binxing Jiao
Daxin Jiang
Pin Lyu
144
4
0
26 Aug 2025
You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
Yutong Bian
Xianhao Lin
Yupeng Xie
Tianyang Liu
Mingchen Zhuge
...
Jiaqi Chen
Xiangru Tang
Yongxin Ni
Sirui Hong
Chenglin Wu
72
1
0
17 Aug 2025
Kimi K2: Open Agentic Intelligence
Kimi Team
Yifan Bai
Yiping Bao
Guanduo Chen
Jiahao Chen
...
Qifeng Teng
Chensi Wang
Dinglu Wang
Feng Wang
Haiming Wang
MoE, VLM, LRM
120
58
0
28 Jul 2025
Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback
C. Maddila
Adam Tait
Claire Chang
Daniel Cheng
Nauman Ahmad
...
Payam Shodjai
Killian Murphy
James Everingham
Aparna Ramani
Peter C. Rigby
103
1
0
24 Jul 2025
ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry
Tianze Xu
Pengrui Lu
Lyumanshan Ye
Xiangkun Hu
Pengfei Liu
ELM
163
5
0
22 Jul 2025
SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation
G. Oliva
Gopi Krishnan Rajbahadur
Haoxiang Zhang
Yihao Chen
Zhilong Chen
Arthur Leung
Dayi Lin
Boyuan Chen
Ahmed E. Hassan
309
2
0
12 Jul 2025
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
Joel Becker
Nate Rush
Elizabeth Barnes
David Rein
140
30
0
12 Jul 2025
Establishing Best Practices for Building Rigorous Agentic Benchmarks
Yuxuan Zhu
Tengjun Jin
Yada Pruksachatkun
Andy K. Zhang
Shu Liu
...
Sarah Schwettmann
Matei A. Zaharia
Ion Stoica
Percy Liang
Daniel Kang
449
8
0
03 Jul 2025
CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
Danning Xie
Mingwei Zheng
Xuwei Liu
Jiannan Wang
Chengpeng Wang
Lin Tan
Xiangyu Zhang
ALM, LRM
92
7
0
03 Jul 2025
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang
Zhengyuan Yang
Chao Feng
Yongyuan Liang
Yuhang Zhou
...
Chung-Ching Lin
Kevin Lin
Linjie Li
Furong Huang
L. Wang
OffRL, LRM
238
7
0
11 Jun 2025
Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents
Kaivalya Hariharan
Uzay Girit
Atticus Wang
Jacob Andreas
LLMAG, LRM
89
1
0
30 May 2025
GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
Manish Shetty
Naman Jain
Jinjian Liu
Vijay Kethanaboyina
Koushik Sen
Ion Stoica
ELM
180
9
0
29 May 2025
Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs
Zhenhao Zhou
Zhuochen Huang
Yike He
Chong Wang
Jiajun Wang
Yijian Wu
Xin Peng
Yiling Lou
70
0
0
26 May 2025
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Zimu Lu
Yiran Yang
Houxing Ren
Haotian Hou
Han Xiao
Ke Wang
Weikang Shi
Aojun Zhou
Mingjie Zhan
Haoyang Li
LLMAG
353
13
0
06 May 2025
Cost-of-Pass: An Economic Framework for Evaluating Language Models
Mehmet Hamza Erol
Batu El
Mirac Suzgun
Mert Yuksekgonul
J. Zou
ELM
264
13
0
17 Apr 2025
OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs
Wasi Uddin Ahmad
Aleksander Ficek
Mehrzad Samadi
Jocelyn Huang
Vahid Noroozi
Somshubra Majumdar
Boris Ginsburg
ALM
223
10
0
05 Apr 2025
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Daoguang Zan
Zhirong Huang
Wei Liu
Hanwu Chen
L. Zhang
...
Jing Su
Tianyu Liu
Rui Long
Kai Shen
Liang Xiang
233
36
0
03 Apr 2025
Z1: Efficient Test-time Scaling with Code
Zhaojian Yu
Yinghao Wu
Yilun Zhao
Arman Cohan
Jinqiang Cui
LRM
277
27
0
01 Apr 2025
Survey on Evaluation of LLM-based Agents
Asaf Yehudai
Lilach Eden
Alan Li
Guy Uziel
Yilun Zhao
Roy Bar-Haim
Arman Cohan
Michal Shmueli-Scheuer
LLMAG, ELM
413
62
0
20 Mar 2025
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Shanghaoran Quan
Jiaxi Yang
Bowen Yu
Jian Xu
Dayiheng Liu
...
Zeyu Cui
Yang Fan
Yanzhe Zhang
Binyuan Hui
Junyang Lin
ALM, ELM, LRM
301
70
0
02 Jan 2025
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
John Yang
Carlos E. Jimenez
Alex Zhang
K. Lieret
Joyce Yang
...
Gabriel Synnaeve
Karthik Narasimhan
Diyi Yang
Sida I. Wang
Ofir Press
186
82
0
04 Oct 2024
SciCode: A Research Coding Benchmark Curated by Scientists
Minyang Tian
Luyu Gao
Shizhuo Dylan Zhang
Xinan Chen
Cunwei Fan
...
Tianhua Tao
Ofir Press
Jamie Callan
Eliu A. Huerta
Yuan Yao
ELM
166
57
0
18 Jul 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
515
342
0
22 Jun 2024
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
Shudan Zhang
Hanlin Zhao
Xiao Liu
Qinkai Zheng
Zehan Qi
Xiaotao Gu
Xiaohan Zhang
Yuxiao Dong
Jie Tang
ELM
199
22
0
07 May 2024
Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM
Michelle S. Lam
Janice Teoh
James A. Landay
Jeffrey Heer
Michael S. Bernstein
207
87
0
18 Apr 2024
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
International Conference on Learning Representations (ICLR), 2024
Naman Jain
King Han
Alex Gu
Wen-Ding Li
Fanjia Yan
Tianjun Zhang
Sida I. Wang
Armando Solar-Lezama
Koushik Sen
Ion Stoica
ELM
369
867
0
12 Mar 2024
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
International Conference on Learning Representations (ICLR), 2023
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
308
1,186
0
10 Oct 2023
Goal Driven Discovery of Distributional Differences via Language Descriptions
Neural Information Processing Systems (NeurIPS), 2023
Ruiqi Zhong
Peter Zhang
Steve Li
Jinwoo Ahn
Dan Klein
Jacob Steinhardt
226
61
0
28 Feb 2023
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM, ALM
1.0K
7,423
0
07 Jul 2021
Measuring Coding Challenge Competence With APPS
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
...
Collin Burns
Samir Puranik
Horace He
Basel Alomair
Jacob Steinhardt
ELM, AIMat, ALM
881
869
0
20 May 2021