OSS-Bench: Benchmark Generator for Coding LLMs
18 May 2025
Yuancheng Jiang
Roland Yap
Zhenkai Liang

Papers citing "OSS-Bench: Benchmark Generator for Coding LLMs"

15 / 15 papers shown
LL3M: Large Language 3D Modelers
Sining Lu
Guan Chen
Nam Anh Dinh
Itai Lang
Ari Holtzman
Rana Hanocka
AI4CE
194
7
0
11 Aug 2025
SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories
Connor Dilgren
Purva Chiniya
Luke Griffith
Yu Ding
Yizheng Chen
367
9
0
29 Apr 2025
CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
Peiding Wang
Lulu Zhang
Fang Liu
Lin Shi
Minxiao Li
Bo Shen
An Fu
ELM, LRM
944
15
0
05 Mar 2025
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Shanghaoran Quan
Jiaxi Yang
Bowen Yu
Jian Xu
Dayiheng Liu
...
Zeyu Cui
Yang Fan
Yanzhe Zhang
Binyuan Hui
Junyang Lin
ALM, ELM, LRM
488
89
0
02 Jan 2025
Fuzzing the PHP Interpreter via Dataflow Fusion
Yuancheng Jiang
Chuqi Zhang
Bonan Ruan
Jiahao Liu
Manuel Rigger
Roland Yap
Zhenkai Liang
262
5
0
29 Oct 2024
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
Jia Zheng
Boxi Cao
Zhengzhao Ma
Ruotong Pan
Hongyu Lin
Yaojie Lu
Xianpei Han
Le Sun
ALM
252
18
0
16 Jul 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
789
453
0
22 Jun 2024
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
Qiwei Peng
Yekun Chai
Xuhong Li
ELM, LM&MA
386
95
0
26 Feb 2024
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
Xiangru Tang
Yuliang Liu
Zefan Cai
Yan Shao
Junjie Lu
...
Yujia Qin
Wangchunshu Zhou
Yilun Zhao
Arman Cohan
Mark B. Gerstein
ELM, LLMAG
452
49
0
16 Nov 2023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
International Conference on Learning Representations (ICLR), 2023
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
545
1,851
0
10 Oct 2023
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Neural Information Processing Systems (NeurIPS), 2023
Jiawei Liu
Chun Xia
Yuyao Wang
Lingming Zhang
ELM, ALM
1.2K
1,605
0
02 May 2023
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
Knowledge Discovery and Data Mining (KDD), 2023
Qinkai Zheng
Xiao Xia
Xu Zou
Yuxiao Dong
Shanshan Wang
...
Andi Wang
Yang Li
Teng Su
Zhilin Yang
Jie Tang
ELM, ALM, SyDa
465
512
0
30 Mar 2023
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models
Hossein Hajipour
Keno Hassler
Thorsten Holz
Lea Schönherr
Mario Fritz
ELM
409
64
0
08 Feb 2023
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM, ALM
2.7K
9,078
0
07 Jul 2021
SPoC: Search-based Pseudocode to Code
Neural Information Processing Systems (NeurIPS), 2019
Sumith Kulal
Panupong Pasupat
Kartik Chandra
Mina Lee
Oded Padon
A. Aiken
Abigail Z. Jacobs
364
309
0
12 Jun 2019