
Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores

Main: 17 pages
Bibliography: 3 pages
Appendix: 1 page
7 figures, 4 tables
Abstract

A good summary can be very useful during program comprehension. While a brief, fluent, and relevant summary is helpful, producing one requires significant human effort, and good summaries are often unavailable in software projects, making maintenance more difficult. There has been a considerable body of research into automated AI-based methods that use Large Language Models (LLMs) to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies.
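To make the similarity measures mentioned above concrete, here is a minimal sketch (not from the paper itself) of how a generated summary might be scored against a human-written reference using BLEU via NLTK and BERTScore via the bert-score package; the example strings are hypothetical.

```python
# Hypothetical illustration of scoring a generated code summary against a
# human reference, assuming: pip install nltk bert-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

reference = "Returns the index of the first matching element, or -1 if none is found."
candidate = "Return index of first element that matches, or -1 when not found."

# BLEU: n-gram overlap between candidate and reference tokens.
# Smoothing avoids zero scores when short sentences lack some n-grams.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# BERTScore: cosine similarity of contextual token embeddings, so
# paraphrases score higher than under pure n-gram overlap.
precision, recall, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU:      {bleu:.3f}")
print(f"BERTScore: {f1.item():.3f}")
```

In practice, BERTScore tends to reward paraphrases like the pair above, while BLEU penalizes them for low exact n-gram overlap, which is one reason embedding-based measures have been evaluated against human judgments.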
