An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact

Abstract

The European Union Deforestation Regulation (EUDR) requires companies to prove that their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail: they rely heavily on broad financial metrics and manual data collection, which limits both regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses large language models (LLMs) to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy, and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches to improve data reliability. Applied to SEC EDGAR filings in the Mining, Oil & Gas, and Utilities sectors, the pipeline demonstrates significant improvements over traditional zero-shot prompting approaches, particularly in extraction accuracy and validation coverage. This work advances NLP-driven automation for regulatory compliance, corporate social responsibility (CSR), and environmental, social, and governance (ESG) applications, with broad sectoral applicability.
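
The abstract names the three ingredients of IRZ-CoT prompting but does not give the prompt wording. As a rough illustration only, the sketch below shows one way such a prompt could be assembled: a role statement, explicit instructions, and a zero-shot chain-of-thought trigger with no worked examples. The function name `build_irz_cot_prompt`, the field names, and all prompt text are assumptions made for this example and are not taken from the paper.

```python
def build_irz_cot_prompt(filing_text: str, fields: list[str]) -> str:
    """Assemble an illustrative IRZ-CoT style prompt (hypothetical wording)."""
    # Role-based part: tell the model which persona to adopt (assumed wording).
    role = "You are an analyst who extracts physical asset data from SEC filings."

    # Instructional part: explicit task description and output rules (assumed wording).
    instructions = (
        "Extract every physical asset mentioned in the filing excerpt. "
        "For each asset, report these fields: " + ", ".join(fields) + ". "
        "Write 'unknown' for any field the text does not state."
    )

    # Zero-shot chain-of-thought part: a reasoning trigger, with no worked examples.
    cot_trigger = "Let's think step by step."

    return "\n\n".join(
        [role, instructions, "Filing excerpt:\n" + filing_text, cot_trigger]
    )


# Hypothetical usage with made-up filing text and field names.
prompt = build_irz_cot_prompt(
    "The Company operates an open-pit mine in Nevada.",
    ["asset_name", "asset_type", "location"],
)
print(prompt)
```

The resulting string would be sent to whichever LLM the pipeline uses; the model call itself is left out because the abstract does not specify it.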

@article{menon2025_2505.05494,
  title={An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact},
  author={Avanija Menon and Ovidiu Serban},
  journal={arXiv preprint arXiv:2505.05494},
  year={2025}
}