Measuring Information Distortion in Hierarchical Ultra long Novel Generation: The Optimal Expansion Ratio

Abstract

Writing novels with Large Language Models (LLMs) raises a critical question: how much human-authored outline is necessary to generate a high-quality million-word novel? While frameworks such as DOME, Plan&Write, and LongWriter have improved stylistic coherence and logical consistency, they primarily target shorter novels (10k--100k words), leaving ultra-long generation largely unexplored. Drawing on insights from recent text compression methods such as LLMZip and LLM2Vec, we conduct an information-theoretic analysis that quantifies the distortion that occurs when LLMs compress and reconstruct ultra-long novels under varying compression-expansion ratios. We introduce a hierarchical two-stage generation pipeline (outline -> detailed outline -> manuscript) and find an optimal outline length that balances information preservation against human effort. Through extensive experiments with Chinese novels, we establish that the two-stage hierarchical outline approach significantly reduces semantic distortion compared to single-stage methods. Our findings provide empirically grounded guidance for authors and researchers collaborating with LLMs to create million-word novels.
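To make the compression-expansion ratio concrete, the sketch below computes per-stage and end-to-end expansion ratios for a two-stage pipeline. This is an illustrative example, not code from the paper; the word counts are hypothetical placeholders, and the `expansion_ratio` helper is our own naming.

```python
# Illustrative sketch (not from the paper): expansion ratios for a
# hypothetical two-stage pipeline (outline -> detailed outline -> manuscript).

def expansion_ratio(source_words: int, target_words: int) -> float:
    """Words produced in the expanded text per word of its source."""
    if source_words <= 0:
        raise ValueError("source text must be non-empty")
    return target_words / source_words

# Hypothetical word counts for a million-word novel
outline = 2_000
detailed_outline = 50_000
manuscript = 1_000_000

stage1 = expansion_ratio(outline, detailed_outline)    # outline -> detailed outline
stage2 = expansion_ratio(detailed_outline, manuscript)  # detailed outline -> manuscript
overall = expansion_ratio(outline, manuscript)          # end-to-end expansion

print(stage1, stage2, overall)  # -> 25.0 20.0 500.0
```

Under this framing, the paper's question becomes: at which intermediate (detailed-outline) length is semantic distortion minimized for a fixed end-to-end ratio?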

@article{shen2025_2505.12572,
  title={Measuring Information Distortion in Hierarchical Ultra long Novel Generation: The Optimal Expansion Ratio},
  author={Hanwen Shen and Ting Ying},
  journal={arXiv preprint arXiv:2505.12572},
  year={2025}
}