
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

International Conference on Learning Representations (ICLR), 2024

3 September 2024


Main: 10 Pages

4 Figures

Bibliography: 4 Pages

8 Tables

Appendix: 8 Pages

Abstract

Benchmarks such as "Needle-in-a-Haystack" (NIAH), RULER, and NeedleBench are commonly used to evaluate the long-context capabilities of large language models (LLMs). While these benchmarks measure how well models understand long input sequences, they do not effectively gauge the quality of long-form text generation, a critical capability for applications such as design proposals and creative writing. To address this gap, we introduce LongGenBench, a long-form text generation benchmark that tests models' ability to identify specific events within long generated text sequences. In this benchmark, we prompt long-context LLMs to create long-form text that must include particular events or constraints, and we evaluate their ability to incorporate these elements. We evaluated ten long-context LLMs across four distinct scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on LongGenBench, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
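
To make the evaluation idea concrete, the following is a minimal sketch of how one might score whether a generated text incorporates a set of required events. It is an illustration only, not the scoring procedure used by LongGenBench: the function name `completion_rate`, the case-insensitive substring matching, and the example events are all assumptions introduced here.

```python
def completion_rate(generated_text: str, required_events: list[str]) -> float:
    """Return the fraction of required events that appear in the text.

    Hypothetical helper: matching is a plain case-insensitive substring
    check, which is an assumption and much cruder than a real evaluator.
    """
    if not required_events:
        return 0.0  # nothing was required, so report zero rather than divide by zero
    text = generated_text.lower()
    hits = sum(1 for event in required_events if event.lower() in text)
    return hits / len(required_events)


# Hypothetical prompt constraints and a hypothetical model output.
required = ["fire drill", "budget review", "team offsite"]
draft = (
    "Week 1: routine work. "
    "Week 2: Budget review with finance. "
    "Week 3: fire drill at noon."
)

rate = completion_rate(draft, required)
print(f"{rate:.2f}")  # two of the three required events are present
```

A score like this could be computed separately for each generation-length setting to observe how constraint adherence changes as outputs grow longer.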
