Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

14 February 2025
Granite Vision Team
Leonid Karlinsky
Assaf Arbelle
Abraham Daniels
Ahmed Nassar
Amit Alfassi
Bo Wu
Eli Schwartz
Dhiraj Joshi
Jovana Kondic
Nimrod Shabtay
Pengyuan Li
Roei Herzig
Shafiq Abedin
Shaked Perek
Sivan Harary
Udi Barzelay
Adi Raz Goldfarb
Aude Oliva
Ben Wieles
Bishwaranjan Bhattacharjee
Brandon Huang
Christoph Auer
Dan Gutfreund
David Beymer
David Wood
Hilde Kuehne
Jacob A. Hansen
Joseph Shtok
Ken C. L. Wong
Luis Angel Bathen
Mayank Mishra
Maksym Lysak
Michele Dolfi
Mikhail Yurochkin
Nikolaos Livathinos
Nimrod Harel
Ophir Azulai
Oshri Naparstek
Rafael Teixeira de Lima
Rameswar Panda
Sivan Doveh
Shubham Gupta
Subhro Das
Syed Zawad
Yusik Kim
Zexue He
Alexander Brooks
Gabe Goodhart
Anita Govindjee
Derek Leist
Ibrahim Ibrahim
Aya Soffer
David D. Cox
Kate Soule
Luis A. Lastras
Nirmit Desai
Shila Ofek-koifman
Sriram Raghavan
Tanveer Syeda-Mahmood
Peter W. J. Staar
Tal Drory
Rogerio Feris
Tags: VLM, AI4TS
Abstract

We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach at test time that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks for visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test-set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache-2 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See this https URL for model weights.
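Since the model is released under the Apache-2 license with open weights, it can be loaded with standard tooling. The sketch below shows minimal inference with Hugging Face transformers; the repository id is a placeholder (the release URL above is not reproduced here), and the exact Auto classes may differ from the released checkpoint's configuration.

# Minimal inference sketch with Hugging Face transformers.
# The model id below is hypothetical; substitute the id from the
# release URL referenced in the abstract.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-2b"  # placeholder id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("invoice.png")  # any document image
prompt = "Extract the table from this document."
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))

The test-time safety classifier is described here only at a high level. The following is an illustrative sketch of the general idea, not the paper's implementation: pool one summary feature per attention head and fit an L1-penalized linear probe, so that only a sparse subset of attention features receives nonzero weight when flagging harmful inputs.

# Illustrative sketch (not the paper's method): sparse linear probe
# over attention-head features for safety classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_heads = 512, 128  # synthetic stand-ins for real data

# Placeholder features: in practice, per-head attention summaries
# extracted from the VLM on prompts labeled safe/harmful.
X = rng.normal(size=(n_samples, n_heads))
y = rng.integers(0, 2, size=n_samples)

# The L1 penalty drives most head weights to zero, approximating the
# "sparse set of attention vectors" described in the abstract.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
active = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-8)
print(f"{len(active)} of {n_heads} attention features retained")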

@article{team2025_2502.09927,
  title={Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence},
  author={Granite Vision Team and Leonid Karlinsky and Assaf Arbelle and Abraham Daniels and Ahmed Nassar and Amit Alfassi and Bo Wu and Eli Schwartz and Dhiraj Joshi and Jovana Kondic and Nimrod Shabtay and Pengyuan Li and Roei Herzig and Shafiq Abedin and Shaked Perek and Sivan Harary and Udi Barzelay and Adi Raz Goldfarb and Aude Oliva and Ben Wieles and Bishwaranjan Bhattacharjee and Brandon Huang and Christoph Auer and Dan Gutfreund and David Beymer and David Wood and Hilde Kuehne and Jacob Hansen and Joseph Shtok and Ken Wong and Luis Angel Bathen and Mayank Mishra and Maksym Lysak and Michele Dolfi and Mikhail Yurochkin and Nikolaos Livathinos and Nimrod Harel and Ophir Azulai and Oshri Naparstek and Rafael Teixeira de Lima and Rameswar Panda and Sivan Doveh and Shubham Gupta and Subhro Das and Syed Zawad and Yusik Kim and Zexue He and Alexander Brooks and Gabe Goodhart and Anita Govindjee and Derek Leist and Ibrahim Ibrahim and Aya Soffer and David Cox and Kate Soule and Luis Lastras and Nirmit Desai and Shila Ofek-koifman and Sriram Raghavan and Tanveer Syeda-Mahmood and Peter Staar and Tal Drory and Rogerio Feris},
  journal={arXiv preprint arXiv:2502.09927},
  year={2025}
}