Neurosymbolic Information Extraction from Transactional Documents

International Journal on Document Analysis and Recognition (IJDAR), 2025

10 December 2025

Arthur Hemmer

Mickaël Coustaty

Nicola Bartolo

Jean-Marc Ogier

Main: 12 pages, 2 figures, 4 tables. Bibliography: 1 page. Appendix: 1 page.

Abstract

This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$ scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
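
The three validation levels named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`subtotal`, `tax`, `total`), the constraint `subtotal + tax = total`, the rounding tolerance, and every function name below are assumptions chosen for illustration, and the paper's actual schema and constraints are likely richer.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical schema fields; the paper's schema is not reproduced here.
REQUIRED_FIELDS = ("subtotal", "tax", "total")


def parse_candidate(raw):
    """Syntactic level: the candidate must parse as a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def to_amounts(data):
    """Task level: each required field must be present and be a finite amount."""
    amounts = {}
    for field in REQUIRED_FIELDS:
        try:
            value = Decimal(str(data[field]))
        except (KeyError, InvalidOperation):
            return None
        if not value.is_finite():
            return None
        amounts[field] = value
    return amounts


def satisfies_arithmetic(amounts, tolerance=Decimal("0.01")):
    """Domain level: assumed constraint subtotal + tax == total, within a tolerance."""
    difference = amounts["subtotal"] + amounts["tax"] - amounts["total"]
    return abs(difference) <= tolerance


def filter_candidates(raw_candidates):
    """Keep only the candidates that pass all three validation levels."""
    valid = []
    for raw in raw_candidates:
        data = parse_candidate(raw)
        if data is None:
            continue
        amounts = to_amounts(data)
        if amounts is None:
            continue
        if satisfies_arithmetic(amounts):
            valid.append(data)
    return valid
```

Given several candidate strings produced by a language model for the same document, `filter_candidates` returns only those that are well-formed, complete, and arithmetically consistent. Under the assumptions above, the surviving candidates could then serve as zero-shot output or as labels for distillation.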
