How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

18 March 2024

Shu Yang

Youliang Yuan

Michael R. Lyu

Abstract

Decision-making, a complex task that draws on many distinct abilities, offers an excellent framework for assessing Large Language Models (LLMs). Our research investigates the decision-making capabilities of LLMs through the lens of Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We introduce GAMA($\gamma$)-Bench, which evaluates LLMs' Gaming Ability in Multi-Agent environments. $\gamma$-Bench includes eight classical multi-agent games and a scoring scheme specially designed to quantitatively assess LLMs' performance. Leveraging $\gamma$-Bench, we investigate LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfactory robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate twelve versions from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. We find that Gemini-1.5-Pro outperforms the other models with a score of $63.8$ out of $100$, followed by LLaMA-3.1-70B and GPT-4 with scores of $60.9$ and $60.5$, respectively. The code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.
