
Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs

International Conference on Text, Speech and Dialogue (TSD), 2025
Main: 8 pages · 9 figures · 4 tables · Bibliography: 3 pages · Appendix: 2 pages
Abstract

We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps' exponent β (vocabulary growth), Taylor's exponent α (word-frequency fluctuation scaling), compression rate r (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural-language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower β) and higher term consistency (higher α) than general texts. Within the legal domain, statutory codes have the lowest β and highest α, reflecting strict drafting conventions, while cases and deeds show higher β and lower α. In contrast, GPT-generated text shows statistics more aligned with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities that current generative models do not fully replicate.
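As a rough illustration of two of the metrics named above, the following sketch estimates Heaps' exponent β (the slope of log vocabulary size against log text length) and the compression rate r (compressed size over raw size) for a tokenized corpus. This is a minimal illustration using standard log-log regression and zlib, not the paper's actual pipeline; the function names and sampling parameters are assumptions for the example.

```python
import math
import zlib

def heaps_exponent(tokens, num_points=20):
    """Estimate Heaps' exponent beta from V(n) ~ n^beta, where V(n) is
    the number of distinct words among the first n tokens."""
    total = len(tokens)
    # Sample vocabulary size at evenly spaced checkpoints along the text.
    checkpoints = {max(1, round(total * (i + 1) / num_points))
                   for i in range(num_points)}
    log_n, log_v = [], []
    seen = set()
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i in checkpoints:
            log_n.append(math.log(i))
            log_v.append(math.log(len(seen)))
    # Least-squares slope of log V(n) on log n gives beta.
    k = len(log_n)
    mx, my = sum(log_n) / k, sum(log_v) / k
    num = sum((x - mx) * (y - my) for x, y in zip(log_n, log_v))
    den = sum((x - mx) ** 2 for x in log_n)
    return num / den

def compression_rate(text):
    """Redundancy proxy: compressed bytes over raw bytes (lower = more redundant)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)
```

On a Zipf-like synthetic corpus, `heaps_exponent` typically yields a slope strictly between 0 and 1, consistent with sublinear vocabulary growth; highly repetitive (e.g. statutory-style) text would push β down and the compression rate down as well.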
