Adaptive Orchestration for Inference of Large Foundation Models at the Edge

Abstract

Large Foundation Models (LFMs), including multi-modal and generative AI models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments presents significant challenges for workload orchestration. We propose a novel adaptive orchestration method and system tailored specifically for managing distributed inference workloads across multi-access edge computing (MEC) infrastructures. Our approach enhances traditional workload orchestration with three dynamic mechanisms: (1) adaptive workload distribution that selects optimal, interconnected edge nodes based on runtime capacity profiling; (2) dynamic redistribution of LFM partitions as operational conditions evolve; and (3) real-time reconfiguration (e.g., re-splitting) of LFM layers to balance performance and privacy requirements. The proposed framework introduces an architecture for adaptive split inference, enabling real-time, QoS-aware management of inference workloads. We present a reference architecture, detail its operational mechanisms, and demonstrate its application through use cases drawn from real-world scenarios.
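To make the adaptive re-splitting idea concrete, the following minimal Python sketch (not the paper's implementation; all names such as NodeProfile and best_split, and all numbers, are illustrative assumptions) shows how a QoS-aware orchestrator might choose a split point between two profiled edge nodes by minimizing estimated end-to-end latency:

# A minimal sketch (not the authors' implementation) of QoS-aware split-point
# selection: given per-layer compute costs and profiled node capacities, pick
# the layer boundary at which to split an LFM across two edge nodes so that
# estimated end-to-end latency is minimized. All names and numbers are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class NodeProfile:
    name: str
    flops_per_s: float  # profiled compute capacity
    link_mbps: float    # profiled bandwidth to the next hop

def estimate_latency(layer_flops, activation_mb, split, head, tail):
    # Layers [0, split) run on `head`; layers [split, n) run on `tail`.
    compute_head = sum(layer_flops[:split]) / head.flops_per_s
    compute_tail = sum(layer_flops[split:]) / tail.flops_per_s
    # Transfer the intermediate activation across the split boundary.
    transfer = activation_mb[split] * 8 / head.link_mbps
    return compute_head + transfer + compute_tail

def best_split(layer_flops, activation_mb, head, tail):
    # Re-evaluated at runtime as capacity profiles change (adaptive re-splitting).
    return min(range(1, len(layer_flops)),
               key=lambda s: estimate_latency(layer_flops, activation_mb, s, head, tail))

if __name__ == "__main__":
    flops = [2e9] * 12            # e.g., 12 transformer blocks, FLOPs per block
    acts = [4.0] * 12             # MB of activations at each layer boundary
    device = NodeProfile("edge-device", 5e10, 100.0)
    server = NodeProfile("mec-server", 5e11, 1000.0)
    print("split after layer", best_split(flops, acts, device, server))

Re-running best_split whenever capacity profiles change corresponds to the dynamic redistribution and re-splitting described above; a full orchestrator would add privacy constraints and multi-node partitioning on top of this latency objective.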

@article{koch2025_2504.03668,
  title={Adaptive Orchestration for Inference of Large Foundation Models at the Edge},
  author={Fernando Koch and Aladin Djuhera and Alecio Binotto},
  journal={arXiv preprint arXiv:2504.03668},
  year={2025}
}