
Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores

Main: 17 pages
Bibliography: 3 pages
Appendix: 1 page
7 figures, 4 tables
Abstract

A good summary can be very useful during program comprehension. While a brief, fluent, and relevant summary is helpful, producing one requires significant human effort, and good summaries are often unavailable in software projects, making maintenance more difficult. There has been a considerable body of research into automated AI-based methods that use Large Language Models (LLMs) to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies.
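To make the similarity measures mentioned above concrete, here is a minimal sketch (not from the paper itself) of how a generated summary might be scored against a human-written reference using BLEU via NLTK and BERTScore via the bert-score package; the example strings are hypothetical.

```python
# Hypothetical illustration of scoring a generated code summary against a
# human reference, assuming: pip install nltk bert-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

reference = "Returns the index of the first matching element, or -1 if none is found."
candidate = "Return index of first element that matches, or -1 when not found."

# BLEU: n-gram overlap between candidate and reference tokens.
# Smoothing avoids zero scores when short sentences lack some n-grams.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# BERTScore: cosine similarity of contextual token embeddings, so
# paraphrases score higher than under pure n-gram overlap.
precision, recall, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU:      {bleu:.3f}")
print(f"BERTScore: {f1.item():.3f}")
```

In practice, BERTScore tends to reward paraphrases like the pair above, while BLEU penalizes them for low exact n-gram overlap, which is one reason embedding-based measures have been evaluated against human judgments.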
