
From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding

Leonardo Gonzalez
Main: 5 pages · 4 figures · 6 tables · Bibliography: 2 pages
Abstract

Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe Images2Slides, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, Images2Slides achieves an overall element recovery rate of 0.989±0.057 (text: 0.985±0.083, images: 1.000±0.000), with mean text transcription error CER = 0.033±0.149 and mean layout fidelity IoU = 0.364±0.161 for text regions and 0.644±0.131 for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
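The core geometric step the abstract describes, mapping pixel-space regions into slide coordinates before issuing batch update requests, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the region dictionary keys (`type`, `text`, `bbox`), the helper name, and the assumed 10 × 5.625 in slide size are assumptions; the request shapes (`createShape`, `insertText`) follow the public Google Slides batchUpdate API.

```python
# Sketch: convert one VLM-extracted pixel region into Google Slides
# batchUpdate requests. Assumptions: region schema {'type','text','bbox'},
# a default 16:9 slide page, and hypothetical identifiers.

EMU_PER_INCH = 914400                      # English Metric Units per inch
SLIDE_W_EMU = 10 * EMU_PER_INCH            # default page: 10 in wide
SLIDE_H_EMU = int(5.625 * EMU_PER_INCH)    # default page: 5.625 in tall

def px_region_to_requests(region, img_w, img_h, page_id, object_id):
    """Map a pixel-space bbox [x, y, w, h] onto the slide page and build
    requests that create an editable text box with the transcribed text."""
    sx = SLIDE_W_EMU / img_w               # horizontal pixel -> EMU scale
    sy = SLIDE_H_EMU / img_h               # vertical pixel -> EMU scale
    x, y, w, h = region["bbox"]
    create = {
        "createShape": {
            "objectId": object_id,
            "shapeType": "TEXT_BOX",
            "elementProperties": {
                "pageObjectId": page_id,
                "size": {
                    "width":  {"magnitude": round(w * sx), "unit": "EMU"},
                    "height": {"magnitude": round(h * sy), "unit": "EMU"},
                },
                # translateX/Y place the shape's top-left corner
                "transform": {
                    "scaleX": 1, "scaleY": 1,
                    "translateX": round(x * sx),
                    "translateY": round(y * sy),
                    "unit": "EMU",
                },
            },
        }
    }
    insert = {"insertText": {"objectId": object_id, "text": region["text"]}}
    return [create, insert]

# Example: a 1000x562 px infographic with one detected title region.
reqs = px_region_to_requests(
    {"type": "text", "text": "Q3 Revenue", "bbox": [100, 50, 400, 80]},
    img_w=1000, img_h=562, page_id="p1", object_id="title_1")
```

The resulting list would be passed as the `requests` field of a `presentations.batchUpdate` call; deterministic postprocessing (deduplication, z-ordering, size calibration) would happen between extraction and this step.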
