
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

18 November 2025


Main: 27 pages; Bibliography: 2 pages; 7 figures; 8 tables

Abstract

We introduce Orion, a visual agent that integrates vision-based reasoning with tool-augmented execution to achieve powerful, precise, multi-step visual intelligence across images, video, and documents. Unlike traditional vision-language models that generate descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance across MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic VLM capabilities to production-grade visual intelligence. Through its agentic, tool-augmented approach, Orion enables autonomous visual reasoning that bridges neural perception with symbolic execution, marking the transition from passive visual understanding to active, tool-driven visual intelligence.

Try Orion for free at: this https URL

Learn more at: this https URL
