How good are large language models at playing games?


Video games, with their demands on perception, memory, and strategic planning, seem like a natural arena for testing the capabilities of modern Large Language Models (LLMs). However, researchers have found that simply “dropping” LLMs into popular games often fails to provide an effective evaluation. A new benchmark, LMGAME-BENCH, developed by a team from UC San Diego, MBZUAI, and UC Berkeley, aims to change that by creating a more reliable and insightful way to assess how well LLMs can truly play.

Why LLMs often falter in standard game environments

While games have long served as a crucial testbed for reinforcement learning, using them to evaluate the complex agentic skills of today’s LLMs—their ability to see, reason, and plan over many steps—has proven tricky. The researchers behind LMGAME-BENCH identified three primary reasons why direct evaluation often falls short:

  • Brittle vision perception: Even advanced vision-language models (VLMs) can struggle with the nuanced visual understanding required to interpret complex game UIs and dynamic scenes accurately.
  • Prompt sensitivity: The performance of LLMs can vary wildly based on the specific wording and structure of the prompts used to guide their actions, making comparisons between models unreliable.
  • Potential data contamination: Many popular games have extensive online footprints, including walkthroughs, discussions, and visual assets. If an LLM has encountered this data during its training, its performance might reflect memorization rather than genuine problem-solving skills.

These issues often lead to LLMs performing poorly, sometimes no better than random action-taking, making it difficult to discern their true capabilities or distinguish between different models.

To overcome these hurdles, the researchers developed LMGAME-BENCH. This benchmark features a suite of well-known platformer, puzzle, and narrative-driven games, all accessible through a unified Gym-style API. More importantly, it incorporates several key innovations:
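The "unified Gym-style API" means every game exposes the same reset/step loop that reinforcement-learning environments use, so an agent can be swapped between games without new glue code. The sketch below shows what such an interface could look like; the class names, fields, and toy Sokoban logic are invented for illustration and are not the benchmark's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    observation: str       # textual/symbolic view of the game state
    reward: float          # score change caused by the last action
    done: bool             # whether the game run has ended
    info: dict = field(default_factory=dict)

class ToySokobanEnv:
    """A minimal Gym-style environment: reset() starts a run, step() advances it."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> str:
        self.steps = 0
        return "player at (1,1); box at (2,1); target at (3,1)"

    def step(self, action: str) -> StepResult:
        self.steps += 1
        solved = action == "push right"   # toy win condition
        done = solved or self.steps >= self.max_steps
        return StepResult(
            observation="box at (3,1)" if solved else "no change",
            reward=1.0 if solved else 0.0,
            done=done,
        )

env = ToySokobanEnv()
obs = env.reset()
result = env.step("push right")   # an LLM agent would supply this action string
```

An LLM agent plugs into this loop by mapping each observation (plus any harness context) to the next action string, which is what makes a shared interface across six very different games possible.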

A diverse test of skills

LMGAME-BENCH utilizes six popular games, chosen for their familiarity and the broad spectrum of cognitive skills they test:

  • Super Mario Bros: Evaluates visual perception, 2D spatial reasoning, and goal-directed planning with partial observability.
  • Tetris: Tests pattern recognition, spatial reasoning for tile matching, and long-horizon planning.
  • Sokoban: Emphasizes visual perception, spatial reasoning for character and box navigation, and long-horizon planning that is critical for avoiding deadlocks in low-fault-tolerance scenarios.
  • Candy Crush: Requires visual perception for identifying candies, spatial reasoning for anticipating chain reactions, and long-horizon planning to maximize points with limited moves.
  • 2048: Assesses visual perception for tracking tile values, spatial reasoning for managing merges, and goal-directed planning.
  • Ace Attorney: Stresses long-context language understanding, causal and deductive reasoning from extensive dialogues and evidence, and long-horizon, low-fault-tolerance decision making in multi-stage trials.

Scaffolding LLMs for meaningful interaction

A core component of LMGAME-BENCH is its “gaming harness,” a set of modular supports designed to address the inherent limitations of current LLMs and enable more meaningful evaluation. These modules can be toggled on or off for experiments:

  • Perception modules: These convert game UI inputs (visual layouts, text) into symbolic representations or textual descriptions that LLMs can more easily process. For grid-based games like Sokoban, this means a text-based table of object coordinates. For text-rich games like Ace Attorney, it involves extracting dialogue and describing visual cues. This helps minimize errors stemming purely from visual misinterpretation.
  • Memory modules: To aid in long-horizon planning, especially in games with rapidly expanding decision spaces like Sokoban and Tetris, the harness includes memory support. This consists of a transient memory (recording past game states and actions) and a reflection module (encoding lessons learned to avoid past failures and narrow the action space).
  • Reasoning modules: The benchmark is designed to accommodate models that use complex reasoning processes, such as long chain-of-thought (CoT) reasoning, by allowing models to generate detailed reasoning traces before deciding on an action.
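To make the perception and memory ideas concrete, here is a small illustrative sketch for a Sokoban-like grid. The grid symbols, function names, and memory policy are invented for this example, not taken from the paper's harness; the point is just the shape of the idea — serialize the board into object coordinates, and keep a bounded record of recent state/action pairs.

```python
# Toy Sokoban-like grid: P = player, B = box, T = target, # = wall
GRID = [
    ["#", "#", "#", "#"],
    ["#", "P", "B", "#"],
    ["#", ".", "T", "#"],
    ["#", "#", "#", "#"],
]
SYMBOLS = {"P": "player", "B": "box", "T": "target"}

def perceive(grid):
    """Perception-style module: convert a 2D grid into a text listing
    of object coordinates that a language model can read directly."""
    parts = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell in SYMBOLS:
                parts.append(f"{SYMBOLS[cell]} at ({r},{c})")
    return "; ".join(parts)

class TransientMemory:
    """Memory-style module: record recent (state, action) pairs so the
    model can see its own history; old entries are evicted."""

    def __init__(self, limit: int = 10):
        self.records = []
        self.limit = limit

    def add(self, state: str, action: str) -> None:
        self.records.append((state, action))
        self.records = self.records[-self.limit:]

    def render(self) -> str:
        return "\n".join(f"{i}: {s} -> {a}"
                         for i, (s, a) in enumerate(self.records))

print(perceive(GRID))  # player at (1,1); box at (1,2); target at (2,2)
```

A harnessed agent would prepend `perceive(...)` output and `memory.render()` to its prompt at each step, which is how these modules reduce pure vision errors and shrink the effective decision space.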

The study found that activating this harness significantly boosts scores, with 86.7% of game runs outperforming a random baseline when harnessed, compared to only 40% without. This creates clearer performance gaps between models.

Tackling data contamination and prompt variance

LMGAME-BENCH implements specific strategies to ensure fair and reliable evaluations:

  • Data contamination checks & mitigation: For games like Super Mario Bros (vision) and Ace Attorney (text), where assets are widely available online, the team developed checks. For Ace Attorney, they found an initial correlation between model output similarity to fan transcripts and performance. However, after applying mitigation techniques like entity masking, paraphrasing, and enforced reasoning, this correlation disappeared, with rankings then aligning more with judged reasoning quality. For games with combinatorial state spaces (Tetris, 2048, Candy Crush, Sokoban), contamination risk was deemed negligible.
  • Prompt standardization: Recognizing that prompt engineering can drastically affect LLM performance, LMGAME-BENCH employs a two-stage optimization technique. First, an empirical approach based on standardized formats common in agentic workflows is used. Second, DSPy (a framework for algorithmically optimizing LLM prompts and weights) is leveraged to refine prompts further, aiming for the best average performance across models and reducing performance variance. For example, in 2048, this reduced cross-prompt performance variance by between 33.8% and 63.5%.
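As a toy illustration of what that variance reduction means, the snippet below compares a model's score spread across hand-written prompt variants with its spread after prompt standardization. All the scores here are invented placeholders, not numbers from the paper.

```python
from statistics import pvariance

# Hypothetical scores for one model on 2048 under four prompt variants
raw_prompt_scores = [0.50, 0.30, 0.70, 0.50]      # hand-written prompts
standardized_scores = [0.50, 0.36, 0.64, 0.50]    # after two-stage optimization

before = pvariance(raw_prompt_scores)
after = pvariance(standardized_scores)
reduction = 1 - after / before
print(f"variance reduced by {reduction:.1%}")     # variance reduced by 51.0%
```

Lower cross-prompt variance is what makes model-to-model comparisons on the leaderboard meaningful: differences reflect the model, not the prompt wording.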

Key findings from LMGAME-BENCH

The researchers evaluated 13 leading models. Without the gaming harness, most models performed poorly, often near random baselines, especially in complex games like Sokoban and Ace Attorney. With the harness, performance improved significantly, and the benchmark effectively differentiated between models. Top performers included models with strong reasoning capabilities like o3 and o1, followed by Gemini-2.5-pro-preview and Claude-3.7-sonnet-20250219 (thinking). Among non-reasoning models, GPT-4.1-2025-04-14 led its category.

A fascinating aspect of the study involved understanding what underlying capabilities game performance correlates with. By comparing LMGAME-BENCH results with performance on 20 other established benchmarks (spanning math, coding, language, visual reasoning, etc.), the team found:

  • Sokoban performance showed strong correlations with math and coding benchmarks.
  • Tetris and 2048 aligned closely with pattern recognition tasks.
  • Candy Crush related to coding, suggesting algorithmic reasoning.
  • Ace Attorney strongly correlated with language understanding benchmarks.
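A correlation analysis of this kind can be sketched in a few lines: take each model's score on a game and on a reference benchmark, convert both to ranks, and compute a Spearman rank correlation. The score vectors below are invented placeholders (the paper compares against 20 real benchmarks).

```python
def rank(values):
    """Assign 1-based ranks to distinct values (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1))."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-model scores on Sokoban vs. a math benchmark
sokoban = [0.9, 0.7, 0.4, 0.2, 0.6]
math_bm = [0.8, 0.6, 0.5, 0.1, 0.7]
print(spearman(sokoban, math_bm))  # 0.9
```

A coefficient near 1 means the two benchmarks rank models almost identically, which is the sense in which Sokoban "correlates with" math and coding.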

Using low-rank matrix factorization and linear modeling, the researchers further decomposed game performance into latent abilities. For instance, they identified features corresponding to language/multi-task knowledge, coding, symbolic/puzzle-solving, and physical reasoning. Different games in LMGAME-BENCH were shown to load on unique combinations of these latent abilities, suggesting that games evaluate a richer, more compositional set of skills than many benchmarks that test capabilities in isolation.
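The low-rank idea can be pictured as factoring a models-by-games score matrix into per-model ability weights times per-game ability loadings. The sketch below extracts a single latent factor by alternating least squares on an invented score matrix; the paper's actual analysis uses more factors plus linear modeling, so this is only the core mechanic.

```python
def rank1_factor(M, iters=100):
    """Return vectors (u, v) with M[i][j] ≈ u[i] * v[j], found by
    alternating least-squares updates (a rank-1 matrix factorization)."""
    n_rows, n_cols = len(M), len(M[0])
    v = [1.0] * n_cols
    u = [0.0] * n_rows
    for _ in range(iters):
        vv = sum(x * x for x in v)
        u = [sum(M[i][j] * v[j] for j in range(n_cols)) / vv
             for i in range(n_rows)]
        uu = sum(x * x for x in u)
        v = [sum(M[i][j] * u[i] for i in range(n_rows)) / uu
             for j in range(n_cols)]
    return u, v

# Rows: models; columns: games (e.g. Sokoban, Tetris, 2048) — invented numbers
scores = [
    [0.9, 0.8, 0.7],
    [0.6, 0.5, 0.45],
    [0.3, 0.25, 0.2],
]
u, v = rank1_factor(scores)
approx = [[u[i] * v[j] for j in range(3)] for i in range(3)]
```

Here `u` plays the role of a latent ability level per model and `v` the degree to which each game loads on that ability; with several factors, different games load on different combinations, which is the compositional picture the study describes.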


Perhaps one of the most exciting findings was the potential for game-based training to generalize. The team fine-tuned a Qwen2.5-7B-Instruct model using reinforcement learning (RL) on simplified versions of Sokoban and Tetris.

The results were compelling:

  • Training on Sokoban led to strong gains in more complex Sokoban scenarios, improved performance on the planning task Blocksworld, and even showed zero-shot improvement on Tetris.
  • Similarly, training on Tetris enhanced performance on other planning tasks and cross-game scenarios.
  • Interestingly, while these spatial reasoning and planning heuristics transferred effectively, they did not improve performance on math or coding tasks like GSM8K or BIRD. However, game-trained models did show improvement on the agentic WebShop benchmark, suggesting grid-game-derived skills can benefit some real-world decision-making tasks.
