
MMAC-Copilot: Multi-modal Agent Collaboration Operating Copilot

Abstract

Large language model agents that interact with PC applications often face limitations due to their singular mode of interaction with real-world environments, leading to restricted versatility and frequent hallucinations. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework that leverages the collective expertise of diverse agents to enhance interaction capability with applications. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on its specific domain knowledge, effectively reducing the hallucinations associated with knowledge-domain gaps. We evaluate MMAC-Copilot on the GAIA benchmark and on our newly introduced Visual Interaction Benchmark (VIBench), which focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios. MMAC-Copilot achieves exceptional performance on GAIA, with an average improvement of 6.8\% over existing leading systems, and also demonstrates strong capability on VIBench. We hope this work inspires further research in this field and provides a more comprehensive assessment of autonomous agents. The anonymous GitHub repository is available at \href{this https URL}{Anonymous Github}.
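As a rough illustration of how a team collaboration chain of this kind might be orchestrated (this is not the authors' implementation; the agent roles, class names, and dispatch logic below are hypothetical), consider the following Python sketch, in which each agent contributes an insight from its own domain and later agents build on earlier contributions:

# Minimal, hypothetical sketch of a "team collaboration chain": each agent
# contributes an insight from its own domain, and later agents see earlier
# contributions, so domain-knowledge gaps are narrowed before acting.
# Names (Agent, CollaborationChain, etc.) are illustrative, not from the paper.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    name: str
    domain: str
    # An agent turns the task plus prior insights into a new insight.
    contribute: Callable[[str, List[str]], str]


@dataclass
class CollaborationChain:
    agents: List[Agent] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        insights: List[str] = []
        for agent in self.agents:
            # Each agent sees the task and all insights gathered so far.
            insight = agent.contribute(task, insights)
            insights.append(f"[{agent.name}/{agent.domain}] {insight}")
        return insights


if __name__ == "__main__":
    chain = CollaborationChain(agents=[
        Agent("Planner", "task decomposition",
              lambda task, prior: f"split '{task}' into UI steps"),
        Agent("Vision", "screen understanding",
              lambda task, prior: "locate the target window and buttons"),
        Agent("Operator", "keyboard/mouse control",
              lambda task, prior: f"execute actions using {len(prior)} prior insights"),
    ])
    for line in chain.run("open the spreadsheet and chart column B"):
        print(line)

In an actual system, each contribute call would wrap a query to a (possibly multimodal) model rather than a stub, but the chained hand-off of domain-specific insights is the pattern the abstract describes.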

@article{song2025_2404.18074,
  title={MMAC-Copilot: Multi-modal Agent Collaboration Operating Copilot},
  author={Zirui Song and Yaohang Li and Meng Fang and Yanda Li and Zhenhao Chen and Zecheng Shi and Yuan Huang and Xiuying Chen and Ling Chen},
  journal={arXiv preprint arXiv:2404.18074},
  year={2025}
}