
MMAC-Copilot: Multi-modal Agent Collaboration Operating Copilot

Abstract

Large language model agents that interact with PC applications often face limitations due to their singular mode of interaction with real-world environments, leading to restricted versatility and frequent hallucinations. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework that leverages the collective expertise of diverse agents to enhance interaction capability with applications. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on its specific domain knowledge, effectively reducing the hallucinations associated with knowledge-domain gaps. We evaluate MMAC-Copilot on the GAIA benchmark and on our newly introduced Visual Interaction Benchmark (VIBench), which focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios. MMAC-Copilot achieves exceptional performance on GAIA, with an average improvement of 6.8\% over existing leading systems, and also demonstrates strong capability on VIBench. We hope this work inspires further research in this field and provides a more comprehensive assessment of autonomous agents. The anonymous GitHub repository is available at \href{this https URL}{Anonymous Github}.
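As a rough illustration of how a team collaboration chain of this kind might be orchestrated (this is not the authors' implementation; the agent roles, class names, and dispatch logic below are hypothetical), consider the following Python sketch, in which each agent contributes an insight from its own domain and later agents build on earlier contributions:

# Minimal, hypothetical sketch of a "team collaboration chain": each agent
# contributes an insight from its own domain, and later agents see earlier
# contributions, so domain-knowledge gaps are narrowed before acting.
# Names (Agent, CollaborationChain, etc.) are illustrative, not from the paper.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    name: str
    domain: str
    # An agent turns the task plus prior insights into a new insight.
    contribute: Callable[[str, List[str]], str]


@dataclass
class CollaborationChain:
    agents: List[Agent] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        insights: List[str] = []
        for agent in self.agents:
            # Each agent sees the task and all insights gathered so far.
            insight = agent.contribute(task, insights)
            insights.append(f"[{agent.name}/{agent.domain}] {insight}")
        return insights


if __name__ == "__main__":
    chain = CollaborationChain(agents=[
        Agent("Planner", "task decomposition",
              lambda task, prior: f"split '{task}' into UI steps"),
        Agent("Vision", "screen understanding",
              lambda task, prior: "locate the target window and buttons"),
        Agent("Operator", "keyboard/mouse control",
              lambda task, prior: f"execute actions using {len(prior)} prior insights"),
    ])
    for line in chain.run("open the spreadsheet and chart column B"):
        print(line)

In an actual system, each contribute call would wrap a query to a (possibly multimodal) model rather than a stub, but the chained hand-off of domain-specific insights is the pattern the abstract describes.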

@article{song2025_2404.18074,
  title={MMAC-Copilot: Multi-modal Agent Collaboration Operating Copilot},
  author={Zirui Song and Yaohang Li and Meng Fang and Yanda Li and Zhenhao Chen and Zecheng Shi and Yuan Huang and Xiuying Chen and Ling Chen},
  journal={arXiv preprint arXiv:2404.18074},
  year={2025}
}