PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. Traditional open source tools often produce lower quality extractions compared to vision language models (VLMs), but reliance on the best VLMs can be prohibitively costly (e.g., over 6,240 USD per million PDF pages for GPT-4o) or infeasible if the PDFs cannot be sent to proprietary APIs. We present olmOCR, an open-source toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on olmOCR-mix-0225, a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text, and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups, and can convert a million PDF pages for only 176 USD. To aid comparison with existing systems, we also introduce olmOCR-Bench, a curated set of 1,400 PDFs capturing many content types that remain challenging even for the best tools and VLMs, including formulas, tables, tiny fonts, old scans, and more. We find olmOCR outperforms even top VLMs including GPT-4o, Gemini Flash 2, and Qwen-2.5-VL. We openly release all components of olmOCR: our fine-tuned VLM model, training code and data, an efficient inference pipeline that supports vLLM and SGLang backends, and the olmOCR-Bench benchmark.