How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Abstract

Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates decision-making capabilities of LLMs through the lens of Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We introduce GAMA(γ)-Bench, which evaluates LLMs' Gaming Ability in Multi-Agent environments. γ-Bench includes eight classical multi-agent games and a scoring scheme specially designed to quantitatively assess LLMs' performance. Leveraging γ-Bench, we investigate LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate twelve versions from six models, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. We find that Gemini-1.5-Pro outperforms other models with a score of 63.8 out of 100, followed by LLaMA-3.1-70B and GPT-4 with scores of 60.9 and 60.5, respectively. The code and experimental results are made publicly available via https://github.com/CUHK-ARISE/GAMABench.
