Lost in Space: Finding the Right Tokens for Structured Output

20 February 2025

Sil Hamilton

David Mimno

Main: 7 pages; Bibliography: 3 pages; Appendix: 2 pages; 3 figures; 5 tables

Abstract

General-purpose language models are trained to produce varied natural language outputs, but some tasks, such as annotation or classification, need more specific output formats. LLM systems increasingly support structured output, which enforces a format by sampling tokens according to a grammar, but it also unpredictably reduces downstream performance. Are there systematic differences between grammars that appear semantically (and often visually) similar to humans? To answer this, we test four popular model families with five different output formats on four common NLP benchmarks. We find that all models perform most accurately when guided to use formats that respect convention, such as letters for multiple choice and real numbers for numerical prediction. Performance also improves by 5%-10% when models are guided to return tokens that incorporate leading whitespace, with smaller models benefiting the most. We find that leading whitespace helps models avoid structural deficiencies in subword token representations. Finally, we present best practices for researchers using language models as zero-shot classifiers with structured output.
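To make the idea of grammar-constrained sampling concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy vocabulary with made-up next-token logits (the names `toy_logits`, `bare_grammar`, `spaced_grammar`, and `constrained_choice` are hypothetical), and it treats a "grammar" as nothing more than a set of allowed tokens. It illustrates how two grammars that look alike to a human ("A" versus " A") select different subword tokens and can therefore yield different answers.

```python
import math


def constrained_choice(logits, allowed):
    """Restrict the next-token distribution to `allowed` and renormalize.

    Returns the most probable allowed token and the renormalized probabilities.
    """
    masked = {tok: logits[tok] for tok in allowed if tok in logits}
    if not masked:
        raise ValueError("no allowed token is present in the vocabulary")
    peak = max(masked.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in masked.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical logits for illustration only; these numbers are NOT from the paper.
toy_logits = {"A": 0.2, "B": 0.9, " A": 2.5, " B": 1.0, "The": 3.0}

# Two grammars that look the same to a reader but map to different tokens.
bare_grammar = ["A", "B"]        # no leading whitespace
spaced_grammar = [" A", " B"]    # leading whitespace included in the token

bare_best, bare_probs = constrained_choice(toy_logits, bare_grammar)
spaced_best, spaced_probs = constrained_choice(toy_logits, spaced_grammar)

print(repr(bare_best), repr(spaced_best))
```

With these assumed logits, the two grammars disagree on the answer, which is the kind of format sensitivity the abstract describes. A real structured-output system applies the same masking step over a model's full vocabulary at every decoding step.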
