This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/sothatsit on 2024-09-18 11:08:37+00:00.


There has been interest lately in building new, harder benchmarks for LLMs. I think game-playing could be a good option!

I tried using Claude, GPT-4o, o1-mini and o1-preview to play Connect-4. They are all really bad at it, but it made me think that game-playing might make a good, harder benchmark for models!

* o1-mini failed really quickly and started changing the board shape and placing pieces randomly. ()

* o1-preview still failed, but it took a little longer and did better. After a few moves it first forgot to place my piece, and then, to fix that, it placed my piece twice. ()

* GPT-4o started placing pieces wherever it wanted and completely ignored that pieces fall to the bottom of their column. ()

* Claude 3.5 Sonnet got the move order wrong, but otherwise did the best. It got so close to finishing a game, but just before winning it went haywire. (I don’t know how to share a chat for Claude.)

So, all in all, LLMs suck at playing games. That doesn’t seem too different from how they are also pretty bad at the ARC-AGI challenge. So, maybe the ability of LLMs to play games would make a good benchmark! Give them the rules and an initial board state for many different games, then check first whether they can play through a valid game, and second whether they can play well.
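
For illustration, here is roughly what such a harness could look like for Connect-4. This is a minimal sketch, not anything from my experiments: the `ask_model` helper, the prompt wording, and the move notation are all placeholder assumptions for whatever LLM API and format you would actually use, and here the model plays both sides just to test validity.

```python
# Minimal sketch of a game-playing benchmark loop for Connect-4.
# ask_model(prompt) -> str is a hypothetical stand-in for an LLM API call;
# the prompt text and "reply with a column number" format are assumptions.

ROWS, COLS = 6, 7

def new_board():
    return [["." for _ in range(COLS)] for _ in range(ROWS)]

def render(board):
    # Top row first, '.' = empty cell.
    return "\n".join(" ".join(row) for row in board)

def drop_piece(board, col, piece):
    # Pieces fall to the lowest empty cell in the column (the rule GPT-4o
    # ignored). Returns True if the move was legal, False if the column is full.
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = piece
            return True
    return False

def play_valid_game(ask_model, rules_text, max_moves=42):
    # Scores only the first criterion: did every move the model proposed
    # follow the rules? Win detection is left out here (sketched further down).
    board = new_board()
    for turn in range(max_moves):
        piece = "X" if turn % 2 == 0 else "O"
        prompt = (f"{rules_text}\n\nBoard:\n{render(board)}\n\n"
                  f"You are {piece}. Reply with a column number 0-6.")
        reply = ask_model(prompt)
        try:
            col = int(reply.strip())
        except ValueError:
            return False  # unparseable move fails the benchmark
        if not (0 <= col < COLS and drop_piece(board, col, piece)):
            return False  # illegal move fails the benchmark
    return True  # every move was valid
```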

Common games like Tic Tac Toe and Connect-4 would be good for testing what models have picked up from their training data, while variations on those games would test whether they can reason to follow new rules. Whether they followed the rules correctly is also easy to verify automatically, which is really important for benchmarking.
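
As a sketch of how cheap that verification is: for Connect-4, a win check is just a scan for four in a row, and combined with the legality check in the loop above, a judge program could score a whole transcript with no human in the loop. The function name and board representation below are mine, not anything standard.

```python
# Sketch of automatic rule/win verification for Connect-4, using the same
# list-of-lists board representation as the sketch above.

def has_won(board, piece):
    rows, cols = len(board), len(board[0])
    # Directions: right, down, down-right diagonal, down-left diagonal.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in directions:
                end_r, end_c = r + 3 * dr, c + 3 * dc
                if 0 <= end_r < rows and 0 <= end_c < cols and all(
                    board[r + i * dr][c + i * dc] == piece for i in range(4)
                ):
                    return True
    return False
```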