γ-Bench: Evaluating LLMs in Multi-Agent Games

This paper introduces γ-Bench, a novel framework for evaluating the gaming ability of large language models (LLMs) in complex, multi-agent environments. The benchmark comprises eight classical game-theory scenarios with a dynamic scoring scheme and adjustable parameters, designed to assess LLMs' robustness, generalizability, and strategic reasoning. Evaluating thirteen LLMs from six model families, the study finds that Gemini-1.5-Pro currently achieves the top performance. The research also examines how prompt engineering and different game settings affect LLM decision-making.
