Researchers introduce GENSTRAT, a methodology that uses procedurally generated strategic game environments to evaluate Large Language Models (LLMs) as economic agents. The approach provides deployment-relevant diagnostics by assessing model competence across six strategic axes.
Large Language Models (LLMs) are increasingly used as economic agents in dynamic settings like auctions and bidding. Existing benchmarks fail to capture the complexity of real-world strategic environments. To address this gap, the authors introduce GENSTRAT, a framework that generates a distribution of two-player zero-sum imperfect-information card games on demand for evaluation. This setup allows for evergreen, uncontaminated testing.
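To illustrate what generating games "on demand" could look like, here is a minimal sketch of seeded procedural sampling. The summary does not describe GENSTRAT's actual generator, so the `GameSpec` fields (`deck_size`, `hand_size`, `rounds`, `hidden_cards`), their ranges, and the `sample_game` function are all hypothetical and serve only to show the idea: a new seed yields a game that cannot have appeared in training data, while a fixed seed keeps an evaluation reproducible.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GameSpec:
    """Hypothetical parameters of a two-player card game (illustrative only)."""
    deck_size: int
    hand_size: int
    rounds: int
    hidden_cards: int  # cards per hand that the opponent cannot see


def sample_game(seed: int) -> GameSpec:
    """Draw one game specification from an assumed distribution."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the game
    hand_size = rng.randint(1, 5)
    # The deck must be large enough to deal both players a full hand.
    deck_size = rng.randint(2 * hand_size, 2 * hand_size + 20)
    rounds = rng.randint(1, 6)
    hidden_cards = rng.randint(1, hand_size)  # at least one hidden card: imperfect information
    return GameSpec(deck_size, hand_size, rounds, hidden_cards)


# Fresh seeds give a fresh evaluation set; reusing them reproduces it exactly.
fresh_games = [sample_game(seed) for seed in range(1000, 1005)]
```

A real generator would also need to emit rules, payoffs, and legal actions; this sketch covers only the sampling step.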
GENSTRAT integrates this game distribution with a capability-profile methodology, decomposing model competence across six axes: state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness. Furthermore, it introduces a 'jaggedness measure' to detect unpredictable shifts in a model's advantage between strategically similar games.
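The summary does not give the formula for the jaggedness measure, so the following is only one plausible reading, stated as an assumption: take pairs of games judged strategically similar and average the absolute change in a model's advantage across each pair. The function name `jaggedness`, the game labels, and the advantage values below are invented for illustration.

```python
from statistics import mean


def jaggedness(advantage, similar_pairs):
    """Mean absolute change in advantage across pairs of similar games.

    This is an assumed definition for illustration, not the authors' formula.
    advantage: dict mapping game id -> model's advantage in that game
    similar_pairs: iterable of (game_a, game_b) judged strategically similar
    """
    gaps = [abs(advantage[a] - advantage[b]) for a, b in similar_pairs]
    return mean(gaps) if gaps else 0.0


# Two hypothetical models with the same average advantage (0.2).
smooth = {"g1": 0.20, "g2": 0.22, "g3": 0.18}
jagged = {"g1": 0.60, "g2": -0.10, "g3": 0.10}
pairs = [("g1", "g2"), ("g2", "g3")]

print(jaggedness(smooth, pairs))  # small: advantage barely moves between neighbors
print(jaggedness(jagged, pairs))  # large: advantage swings between neighbors
```

Under this reading, two models with identical average strength can still differ sharply in jaggedness, which is the kind of distinction a ranking alone would hide.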
In a head-to-head tournament involving nine frontier and open-weight LLMs, newer models performed better on average. The capability profiles also revealed differences among models of similar overall strength: for instance, two of the top-three models exhibited greater local volatility than another. The profiles thus offer a diagnostic layer beyond simple ranking for deployment-relevant assessment.