Aligning Offline Metrics and Human Judgments of Value for Code Generation Models

29 October 2022
Victor C. Dibia
Adam Fourney
Gagan Bansal
Forough Poursabzi-Sangdeh
Han Liu
Saleema Amershi
    ALM
    OffRL

Papers citing "Aligning Offline Metrics and Human Judgments of Value for Code Generation Models"

15 / 15 papers shown
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
Adam Fourney
Gagan Bansal
Hussein Mozannar
Cheng Tan
Eduardo Salinas
...
Victor C. Dibia
Ahmed Hassan Awadallah
Ece Kamar
Rafah Hosn
Saleema Amershi
AI4CE
LRM
LLMAG
38
34
0
07 Nov 2024
Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition
Majeed Kazemitabaar
Jack Williams
Ian Drosos
Tovi Grossman
Austin Z. Henley
Carina Negreanu
Advait Sarkar
19
17
0
02 Jul 2024
Assessing and Verifying Task Utility in LLM-Powered Applications
Negar Arabzadeh
Siqing Huo
Nikhil Mehta
Qingyun Wu
Chi Wang
Ahmed Hassan Awadallah
Charles L. A. Clarke
Julia Kiseleva
30
10
0
03 May 2024
On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation
Atharva Naik
33
2
0
26 Apr 2024
The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers
Hussein Mozannar
Valerie Chen
Mohammed Alsobay
Subhro Das
Sebastian Zhao
Dennis L. Wei
Manish Nagireddy
P. Sattigeri
Ameet Talwalkar
David Sontag
ELM
38
18
0
03 Apr 2024
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
Yiqing Xie
Alex Xie
Divyanshu Sheth
Pengfei Liu
Daniel Fried
Carolyn Rose
40
8
0
31 Mar 2024
Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications
Negar Arabzadeh
Julia Kiseleva
Qingyun Wu
Chi Wang
Ahmed Hassan Awadallah
Victor C. Dibia
Adam Fourney
Charles L. A. Clarke
LLMAG
24
7
0
14 Feb 2024
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
Norbert Tihanyi
M. Ferrag
Ridhi Jain
Tamás Bisztray
Merouane Debbah
ELM
22
18
0
12 Feb 2024
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
94
25
0
13 May 2022
Productivity Assessment of Neural Code Completion
Albert Ziegler
Eirini Kalliamvakou
Shawn Simister
Ganesh Sittampalam
Alice Li
Andrew Rice
Devon Rifkin
E. Aftandilian
102
176
0
13 May 2022
A Systematic Evaluation of Large Language Models of Code
Frank F. Xu
Uri Alon
Graham Neubig
Vincent J. Hellendoorn
ELM
ALM
196
624
0
26 Feb 2022
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Yue Wang
Weishi Wang
Shafiq R. Joty
S. Hoi
204
1,451
0
02 Sep 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
588
0
14 Jul 2021
Measuring Coding Challenge Competence With APPS
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
...
Collin Burns
Samir Puranik
Horace He
D. Song
Jacob Steinhardt
ELM
AIMat
ALM
194
614
0
20 May 2021
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu
Daya Guo
Shuo Ren
Junjie Huang
Alexey Svyatkovskiy
...
Nan Duan
Neel Sundaresan
Shao Kun Deng
Shengyu Fu
Shujie Liu
ELM
190
853
0
09 Feb 2021