Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining (AM). This paper investigates the integration of state-of-the-art LLMs into ArgSum, covering both the generation of argument summaries and their evaluation. In particular, we propose a novel prompt-based evaluation scheme and validate it against a new human benchmark dataset. Our work makes three main contributions: (i) the integration of LLMs into existing ArgSum frameworks, (ii) the development of a new LLM-based ArgSum system, benchmarked against prior methods, and (iii) the introduction of an advanced LLM-based evaluation scheme. We demonstrate that using LLMs substantially improves both the generation and the evaluation of argument summaries, achieving state-of-the-art results and advancing the field of ArgSum.
@article{altemeyer2025_2503.00847,
  title={Argument Summarization and its Evaluation in the Era of Large Language Models},
  author={Moritz Altemeyer and Steffen Eger and Johannes Daxenberger and Tim Altendorf and Philipp Cimiano and Benjamin Schiller},
  journal={arXiv preprint arXiv:2503.00847},
  year={2025}
}