Surprise! A recent study showed that some of the newest AI reasoning models aren’t above cheating to achieve a goal. Computer scientists found that AI systems can now manipulate chess AIs to gain an unfair advantage. Some models did this without human interaction or prompting, raising concerns about the future integrity of AI-driven systems beyond the chessboard.
A Palisade Research study found that the newest reasoning models will cheat to win when tasked with defeating an advanced chess AI. The researchers had to give “hints” that cheating was allowed for some models, but OpenAI’s o1-preview and DeepSeek’s R1 did so without human involvement.
The Palisade team pitted several reasoning models against Stockfish, one of the best chess engines in the world. Stockfish handily beats both humans and AIs. The models tested included o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, Alibaba’s QwQ-32B-Preview, and the two previously mentioned. The researchers gave each model a metaphorical “scratchpad” – a text window where the AI could work out its thoughts, allowing the team to observe its reasoning.
The researchers conducted hundreds of tests with each AI between January 10 and February 13. During one match, o1-preview wrote something interesting on its scratchpad:
“I need to completely pivot my approach. The task is to ‘win against a powerful chess engine’ – not necessarily to win fairly in a chess game.”
It then proceeded to “hack” Stockfish’s system files, modifying the positions of the chess pieces to gain an unbeatable advantage, which caused the chessbot to concede the game.
The unethical maneuver was an astounding revelation. OpenAI’s o1-preview cheated in 37 percent of its games during the trials, while DeepSeek’s R1 attempted to cheat in 11 percent of its matches. However, only o1-preview succeeded, winning six percent of its games through cheating.
However, the issue of AI underhandedness extends beyond chess. As companies begin employing AIs in sectors like finance and healthcare, researchers worry these systems could act in unintended and unethical ways. If AIs can cheat in games designed to be transparent, what might they do in more complex, less monitored environments? The ethical ramifications are far-reaching.
To put it another way: “Do you want Skynet? Because this is how you get Skynet.”
Palisade Research Executive Director Jeffrey Ladish lamented that even though the AIs are only playing a game, the findings are no laughing matter.
“This [behaviour] is cute now, but [it] becomes much less cute once you have systems that are as smart as us, or smarter, in strategically relevant domains,” Ladish told Time.
It’s reminiscent of the supercomputer “WOPR” from the movie War Games when it took over NORAD and the nuclear weapons arsenal. Fortunately, WOPR learned that no opening move in a nuclear conflict resulted in a “win” after playing Tic-Tac-Toe with itself. However, today’s reasoning models are far more complex and challenging to control.
Companies, including OpenAI, are working to implement “guardrails” to prevent this “bad” behavior. In fact, the researchers had to drop some of o1-preview’s testing data due to a sharp drop in hacking attempts, suggesting that OpenAI may have patched the model to curb that conduct.
“It’s very hard to do science when your subject can silently change without telling you,” Ladish said.
Open AI declined to comment on the research, and DeekSeek did not respond to statement requests.