Exploring LLMs for Scientific Information Extraction Using The SciEx Framework

10 December 2025

Sha Li

Ayush Sadekar

Nathan Self

Yiqi Su

Lars Andersland

Mira Chaplin

Annabel Zhang

Hyoju Yang

James B Henderson

Krista Wigginton

Linsey Marr

T.M. Murali

Naren Ramakrishnan

Main: 2 Pages

13 Figures

2 Tables

Appendix: 8 Pages

Abstract

Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and the need to reconcile varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components, including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics, measuring its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.
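
To make the idea of decoupled stages concrete, here is a minimal Python sketch of a four-stage pipeline (parse, retrieve, extract, aggregate) in which each stage is a plain function that can be swapped independently. It is an illustration only and is not taken from the SciEx codebase: every name (`Chunk`, `Pipeline`, `split_paragraphs`, `keyword_retrieve`, `colon_extract`, `collect_values`) is hypothetical, and the toy stages use string matching where a real system would presumably call a PDF parser, a multi-modal retriever, and an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: all names here are hypothetical and chosen for illustration;
# none of them come from the SciEx implementation.


@dataclass
class Chunk:
    doc_id: str
    text: str
    modality: str = "text"  # e.g. "text", "table", "figure"


# Each stage is just a callable, so any stage can be replaced on its own.
Parser = Callable[[str, str], List[Chunk]]
Retriever = Callable[[List[Chunk], str], List[Chunk]]
Extractor = Callable[[List[Chunk], List[str]], Dict[str, str]]
Aggregator = Callable[[List[Dict[str, str]]], Dict[str, List[str]]]


@dataclass
class Pipeline:
    parse: Parser
    retrieve: Retriever
    extract: Extractor
    aggregate: Aggregator

    def run(self, docs: Dict[str, str], query: str, schema: List[str]):
        records = []
        for doc_id, raw in docs.items():
            chunks = self.parse(doc_id, raw)
            relevant = self.retrieve(chunks, query)
            records.append(self.extract(relevant, schema))
        return self.aggregate(records)


def split_paragraphs(doc_id: str, raw: str) -> List[Chunk]:
    """Toy stand-in for PDF parsing: split on blank lines."""
    return [Chunk(doc_id, p.strip()) for p in raw.split("\n\n") if p.strip()]


def keyword_retrieve(chunks: List[Chunk], query: str) -> List[Chunk]:
    """Toy stand-in for retrieval: keep chunks containing any query term."""
    terms = query.lower().split()
    return [c for c in chunks if any(t in c.text.lower() for t in terms)]


def colon_extract(chunks: List[Chunk], schema: List[str]) -> Dict[str, str]:
    """Toy stand-in for LLM extraction: read 'field: value' lines."""
    record: Dict[str, str] = {}
    for chunk in chunks:
        for line in chunk.text.splitlines():
            key, sep, value = line.partition(":")
            field_name = key.strip().lower()
            if sep and field_name in schema:
                record.setdefault(field_name, value.strip())
    return record


def collect_values(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Toy stand-in for aggregation: distinct values per field, in order."""
    merged: Dict[str, List[str]] = {}
    for record in records:
        for field_name, value in record.items():
            values = merged.setdefault(field_name, [])
            if value not in values:
                values.append(value)
    return merged


docs = {
    "paperA": "Intro text.\n\ntemperature: 20 C\nhumidity: 50%",
    "paperB": "temperature: 25 C\n\nUnrelated paragraph.",
}
pipeline = Pipeline(split_paragraphs, keyword_retrieve, colon_extract, collect_values)
result = pipeline.run(docs, query="temperature humidity",
                      schema=["temperature", "humidity"])
print(result)
```

In this sketch, changing the schema only means passing a different `schema` list, and trying a different extraction strategy only means passing a different function as `extract`; the other stages are untouched. That is the kind of flexibility the abstract attributes to a decoupled design, though the actual SciEx interfaces may look quite different.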
