bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

21 August 2023
Imam Mohammad Zulkarnain
Shayekh Bin Islam
Md. Zami Al Zunaed Farabe
Md. Mehedi Hasan Shawon
Jawaril Munshad Abedin
Beig Rajibul Hasan
Marsia Haque
Istiak Shihab
Syed Mobassir
Md. Nazmuddoha Ansary
Asif Sushmit
Farig Sadeque
Abstract

Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, suffer from the lack of large-scale datasets for various document OCR components such as word-level OCR, document layout extraction, and distortion correction, which are available as individual modules in high-resource languages. In this paper, we introduce Bengali...AI-BRACU-OCR (bbOCR): an open-source, scalable document OCR system that reconstructs Bengali documents into a structured, searchable, digitized format, leveraging a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluations, both of which use a novel diversified evaluation dataset and comprehensive evaluation metrics. Our evaluation suggests that the proposed solution is preferable to the current state-of-the-art Bengali OCR systems. The source code and datasets are available at https://bengaliai.github.io/bbocr.
