Oracle is the first chess engine that plays like a human, from amateur to super GM. She can play like a 2800-rated player in classical or an 800-rated player in blitz, or at any other level, in any time control.
I’ve spent hundreds of hours over the last few months creating Oracle. This journey has blown my mind many times and changed my view on LLMs and chess.
In this article, we’ll take a close look at how Oracle works. It will be a long read, but hopefully, you’ll find this exploration of chess artificial intelligence as fascinating and enlightening as I did.
Table of Contents
- YouTube video
- Author’s note
- Examples of Oracle’s predictions
- Move-matching accuracy – Oracle vs. Maia
- Human Thinking in Chess
- How does Oracle work?
- Evaluations in Chess
- From GPT to Oracle
- Oracle’s name
- Cost
- Contributions and Future of Oracle
- Download
- Support and Donation
YouTube video
Author’s note
I am Yosha Iglesias, a FIDE Master and Woman International Master with no previous coding experience and no academic background. While this article might attract researchers, it has no scientific pretensions. That being said, all the findings presented here are globally reproducible, although individual Oracle predictions might vary depending on the hardware.
Oracle is an open-source project written in Python and licensed under the MIT License. You can download Oracle’s code from my GitHub page.
Examples of Oracle’s predictions
Unlike Stockfish, Oracle doesn’t display the 3 best moves, but the 3 likeliest moves, considering the players, the time control, and the previous moves. If the best move isn’t one of the 3 likeliest, she also displays it. For each move, she gives the likelihood of it being played and its evaluation, which is White’s expected score out of 100 games (more on evaluations later).
In this position taken from the last game of the 2023 World Championship, Oracle thinks that the blunder Qc7?? is the likeliest move, with a 37% probability of being played. Nepo did play Qc7 and went on to lose the game.
Facing the Scholar’s mate attempt, Oracle predicts the best move g6 with a 99.5% probability if we tell her it’s a Carlsen vs. Kasparov classical game.
But if we tell her it’s a Jane, rated 608, vs. John, rated 602, blitz game, Oracle gives Nf6?? a 29% likelihood! You may have noticed that Oracle annotates each move and colors it according to its strength and the player’s level.
It’s important to understand that Oracle uses the previous moves to make her predictions. For instance, in a Dubov vs. Nepo game starting with 1. Nc3 Nc6, Oracle predicts that the 3 likeliest moves are the 3 best moves according to Stockfish.
But after 11. Nc3 Nc6 in the infamous “Dance of the Knights” pre-arranged draw of the 2023 World Blitz Championship, Oracle predicts the blunder 12. Nb1 that was actually played!
Oracle has the potential to revolutionize many areas of chess, the first being broadcasting.
Imagine that you are watching the last round of the Candidates. You see that Fabi is totally winning according to Stockfish. But you turn Oracle on and see that, even though Fabi is theoretically winning, he has a huge risk of blundering.
Sure enough, Fabi did play Rc7??, the likeliest move according to Oracle.
Oracle could also transform opening preparation: instead of only looking at the best moves, you could look at the moves your opponent would likely play.
Oracle could be used to create human-like bots of any level. You could use an Oracle-based bot to play a bullet game against a 2800 super grandmaster, a blitz game against a 2500 grandmaster, a rapid game against a 2300 FIDE Master, or a classical game against a 1500 amateur.
Oracle could also contribute to anti-cheating.
Black’s account was banned for cheating after playing dozens of games like this for months. In this position, Black can simply take the queen with fxg4, but instead played Bxg2, the best move according to Stockfish. Oracle gives this move only a 1% chance of being played. Someone repeatedly playing unlikely moves that happen to be Stockfish’s top choice is highly suspicious.
In this position, reached a few moves later, White has just played Re1. Almost every human at this level would take the rook, which mates in 3, but Stockfish shows a mate in 2 with Rf1+. According to Oracle, Rf1+ had less than a 2% chance of being played.
Just a couple of moves from one game can be enough, not to ban someone, but to raise suspicion. In addition to existing methods, Oracle could prevent such cheaters from going undetected for months.
Sadly, Oracle could be used for clever cheating as well…
You probably wonder, “Sure, all of this looks great, but does Oracle really work?”
We’ll come back to Oracle’s metrics in more detail later, but for now, let’s focus on one of the most important: move-matching accuracy.
Move-matching accuracy – Oracle vs. Maia
Created by a team of researchers, Maia was previously considered the best human-like chess engine. Maia is designed to predict the moves of amateurs only, rated from 1100 to 1900. It has a move-matching accuracy of around 52%, meaning that for 52% of moves, the move played by the human was the likeliest according to Maia. For the same moves, the humans played Stockfish’s best move only 39% of the time.
I’ve tested Oracle on sets of games of different levels and time controls. Her move-matching accuracy was always between 56% and 63%, much better than Maia and better than Stockfish even at the highest level of classical chess. For instance, in the 2023 Pogchamps games, the players played Stockfish’s first move 40% of the time and Oracle’s likeliest move 57% of the time. In the 2023 World Chess Championship, the players played Stockfish’s first move 59.5% of the time and Oracle’s likeliest move 62.5% of the time.
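To make the metric concrete, here is a minimal sketch of how move-matching accuracy can be computed from (predicted, played) move pairs. The function name and data format are illustrative, not Oracle’s actual code:

```python
def move_matching_accuracy(predictions):
    """predictions: a list of (likeliest_move, played_move) pairs in SAN notation.

    Returns the fraction of positions where the model's likeliest move was
    the move the human actually played.
    """
    if not predictions:
        return 0.0
    matches = sum(1 for likeliest, played in predictions if likeliest == played)
    return matches / len(predictions)

# Hypothetical mini-sample: the model's top prediction matched 2 of the 3 played moves.
sample = [("Qc7", "Qc7"), ("g6", "g6"), ("Bd6", "h6")]
print(f"{move_matching_accuracy(sample):.1%}")  # 66.7%
```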
So, Oracle works. But how does she work? Before answering that question, we need to look at how humans think in chess.
Human Thinking in Chess
In his book How Life Imitates Chess, Garry Kasparov summed up the thinking process of chess Grandmasters, “It’s intuition first, then calculation”.
Magnus Carlsen echoed this view when he declared, “I normally do what my intuition tells me to do. Most of the time spent thinking is just to double-check”
Vishy Anand also explained, “Intuition in chess can be defined as the first move that comes to mind when you see a position.”
The power of intuition sometimes lends a mystical quality to a player’s genius, as when Capablanca replied to a journalist who asked how many moves in advance he could see: “Only one. But that’s always the best one.”
In reality, there is no magic power, divine grace, or even innate genius. Intuition is nothing more than pattern recognition.
To sum up, in chess, we have:
Human Thinking = Intuition + Analysis
Where:
- Intuition is the fast, unconscious, and effortless pattern recognition
- Analysis is all the slow, conscious, and effortful processes aiming to verify, correct, or complement intuition.
Analysis is used here as an umbrella term that covers many processes like calculation, planning, listing imbalances, using a method (the Silman Method, the Dorfman Method, the Kotov Method…), etc. Of course, intuition and analysis are intertwined, and a player often uses their intuition during analysis (to determine candidate moves or to evaluate positions).
As explained by Daniel Kahneman in his best-seller Thinking, Fast and Slow, one’s intuition is more likely to be right if…
- The domain is predictable: constant and rule-based, not chaotic or random. This is true of chess, but not of casino games, which are purely random. A roulette player may have an “intuition” that red will hit, but this intuition is worthless.
- The person is experienced. You can generally trust the intuition of someone who has spent 10,000 hours practicing chess, but not the intuition of a beginner who has just learned the rules. Even a grandmaster’s intuition is worthless in non-standard, atypical positions.
- During the learning phase, the person has received feedback. A pointless form of training would be to solve puzzles that automatically move on to the next one as soon as you make a move, without telling you whether you were right or wrong.
Intuition’s strengths
1. Find the best move in quiet positions
In this position, the vast majority of club players will instantly understand that the best move is Nf1. Even if they have never seen this exact position, they can extrapolate from hundreds of similar positions where the best move is Nf1.
While many club players might feel puzzled in this position, most titled players will instantly find the best idea 12. Bxf6! Bxf6 13. Bd5! to end up in a typical good knight vs. bad bishop situation. This idea has been seen countless times, including in the following classic by Bobby Fischer:
In a similar position, Fischer played 15. Bxf6! Bxf6 16. Bd5! making this idea a well-known pattern.
2. Find a combination that matches a known pattern
In this position, Black played 24… Qg1+! a move that puzzle-rush users will find instantly.
Intuition’s limits
1. It’s impossible to find the best move if it’s not a typical pattern
a) Simple calculation
The above position is a mate in one that I composed. It is designed so that all the natural tries fail (1. Qg7?? is illegal; 1. f7+?, 1. Rxb8+?, 1. Qxb8?, 1. Rg3+?, 1. Bxb3+?, 1. Ne7+?, and 1. Nh6+? don’t give mate). Unless you are an experienced problem solver, the solution might very well be the last move you consider, because it is designed to be counter-intuitive!
b) Complex calculation
In this instant classic, Black first sacrificed the rook with 26… Rxb3!! and after 27. axb3 Ra8 28. Kd1, he sacrificed the queen with 28…Qxh5+!! leading to a mate in a few moves. Because this combination doesn’t match any known pattern, it can only be found through calculation. One cannot be certain that Rxb3 works before calculating. But you might think, “Now that this combination is famous, someone in a similar situation might recognize the pattern and intuitively find Rxb3, right?”
Wrong. Because the combination is so complex, with all the pieces involved, there is a butterfly effect where the slightest change, which might seem unimportant at first sight, may change the evaluation of the combination.
Having the pawn on c6 instead of c5 seems to change nothing, and indeed 26… Rxb3? 27. axb3 Ra8 28. Kd1?? would lose to the same mate as in the game, starting with 28… Qxh5+!!. But after 27… Ra8, White has the powerful 28. Ba7!!, with the idea 28… Rxa7 29. Qxd3 +-.
Such untypical and complex combinations can’t be found intuitively and require calculation.
2. The intuitive move can sometimes be a huge blunder
In the above position, Hikaru played the “natural” 20… h6?? As he later explained in this video, as soon as he had played it, he realized that White had 21. Qh7#. Incredibly, Magnus also missed this simple mate in 1 and played 21. Ne4??
h6?? is the intuitive move at the super grandmaster level because in many similar positions, it would be the best; but here it blunders mate in 1.
How does Oracle work?
1. GPT is a fantastic predictor of intuitive moves
The simplest way to create a module that predicts human moves is to reproduce the human thought process.
The innovative idea behind Oracle is to use GPT to mimic intuition.
At this point, you might be thinking, “What? GPT? But everyone knows that it sucks at chess. I’ve seen several videos where it can’t play more than a few moves before making an illegal one.”
Just a few months ago, I used to think of LLMs as stochastic parrots who might by chance play a few legal moves but would be unable to play a whole game, let alone play chess well. How could they, considering they’re just text predictors?
I discovered that some computer researchers claimed that GPT could play chess quite decently. It all started with this tweet by Grant Slatton, in which he explains that you can use GPT in completion mode (not as a chatbot) to make it play at around 1800 Elo by prompting a game in PGN format.
Other researchers have looked into the matter, notably Mathieu Acher, who rates GPT at 1750 Elo, while pointing out that it makes illegal moves in ~16% of games. When you put it like that, a program that makes a lot of illegal moves and plays like an average club player doesn’t sound very impressive, let alone useful. But I was fascinated that a text predictor could play chess at all.
I discovered that GPT-3.5 Turbo Instruct has been trained on millions of games from Chess.com, Lichess, and ChessBase. Its training data stops in 2021, yet it’s just as good on games played after that date. In almost every game of chess you play, you quickly reach the novelty – a move that has never been played before in the history of the game. Even though it has been fed millions of games, I wondered, how can a stochastic parrot predict legal and rather good moves beyond the novelty? Can we say that GPT understands the rules of chess? Can we even say that it understands the game of chess? Does it have an internal representation of chess, as Adam Karvonen thinks?
After spending dozens and dozens of hours testing GPT on all kinds of games, I came to understand that GPT can be compared to a super GM who would be asked to read the PGN header and then all the moves until the end of the prompt, and who would predict the next move instantly, intuitively. The super GM is asked not what move they would make, but what move they think the player would make, knowing the previous moves and the header.
When I realized this, I thought one could use GPT to create an AI that played like a human. The only problem was that I’d never coded before in my life. So I learned the basics, and then got to work… with the help of GPT.
As GPT attributes a probability to each predicted token, we can compute the probability of each candidate move given by GPT (see the sketch after the list below). Here, GPT predicts 3 possible moves, with the following (normalized) probabilities:
- Qg1+ -> 98.10%
- Nf2+ -> 1.46%
- Rxe8 -> 0.44%
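I don’t reproduce Oracle’s prompting code here. The sketch below takes a cruder shortcut than reading the per-token probabilities described above: it simply samples many completions from the OpenAI completions endpoint and counts them. The prompt format, parameters, and function name are illustrative assumptions, not Oracle’s actual code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def estimate_move_probabilities(pgn_prompt, n_samples=50):
    """Estimate gpt-3.5-turbo-instruct's next-move distribution for a PGN prompt.

    pgn_prompt: the PGN headers plus the moves played so far, ending right
    where the next move is expected (e.g. '... 24. Rb8 ' with a trailing space).
    """
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=pgn_prompt,
        max_tokens=6,      # enough tokens for a single SAN move
        temperature=1.0,   # sample from the model's own distribution
        n=n_samples,
    )
    counts = {}
    for choice in response.choices:
        text = choice.text.strip()
        if text:
            move = text.split()[0]  # keep only the first SAN move of the completion
            counts[move] = counts.get(move, 0) + 1
    total = sum(counts.values()) or 1
    return {move: count / total for move, count in sorted(counts.items(), key=lambda kv: -kv[1])}
```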
After thousands of tests, I realized that while GPT is overall excellent in the probabilities it gives, it regularly makes mistakes, due to its inability to calculate. I came to understand that these mistakes were not random, but systematic. They can therefore be corrected using Stockfish evaluations.
2. From GPT’s probabilities to Oracle’s
I first created a program that takes GPT’s probabilities for all legal candidate moves and normalizes them. Oracle then modifies these raw probabilities to correct the mistakes caused by GPT’s inability to calculate, according to multiple factors (a hypothetical sketch follows the list):
- the player’s rating
- the time control
- whether the move is a mistake
- how big a mistake it is
- whether the move leads to a forced mate
- in how many moves the mate is forced
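Oracle’s actual correction function isn’t reproduced here. The sketch below is only a hypothetical illustration of the idea: penalize a candidate move’s raw probability more heavily the bigger the mistake, the stronger the player, and the slower the time control, then renormalize. The constants and names are made up, and the forced-mate factors are omitted:

```python
import math

def adjust_probabilities(raw_probs, expected_score_loss, rating, time_control):
    """Hypothetical correction of GPT's raw move probabilities (not Oracle's real code).

    raw_probs: {move: normalized probability from GPT}.
    expected_score_loss: {move: expected-score loss vs. the best move, in points}.
    rating: the player's Elo. time_control: 'bullet' | 'blitz' | 'rapid' | 'classical'.
    """
    # The bigger the mistake, the stronger the player, and the slower the time
    # control, the more the raw probability is penalized (illustrative constants).
    time_factor = {"bullet": 0.5, "blitz": 1.0, "rapid": 2.0, "classical": 3.0}[time_control]
    strength = rating / 1000
    adjusted = {
        move: p * math.exp(-0.05 * expected_score_loss[move] * strength * time_factor)
        for move, p in raw_probs.items()
    }
    total = sum(adjusted.values())
    return {move: p / total for move, p in adjusted.items()}

# With made-up inputs, a blunder like Naka's h6?? keeps a real chance in bullet
# but becomes negligible in classical, as in the example discussed below:
raw = {"Bd6": 0.40, "h6": 0.35, "Re8": 0.25}   # hypothetical GPT probabilities
esl = {"Bd6": 0.0, "h6": 50.0, "Re8": 3.0}     # hypothetical expected-score losses
print(adjust_probabilities(raw, esl, 2800, "bullet"))
print(adjust_probabilities(raw, esl, 2800, "classical"))
```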
Let’s take a look at some examples to better understand these modifications. In the positions below, you can see Oracle’s and GPT’s probabilities, the latter in parentheses.
a) In intuitive positions, GPT is good enough and Oracle barely modifies its probabilities
b) In less intuitive positions, Oracle’s and GPT’s probabilities can greatly differ
Here are GPT’s and Oracle’s predictions for the move where Naka blundered mate in one, if we tell Oracle it’s a bullet, a blitz (like it was), or a classical game:
Bullet
Blitz
Classical
While the likelihood of Bd6 according to Oracle increases with the time control, the likelihood of h6?? decreases. In classical, Oracle gives h6?? less than a 0.50% chance of being played. Meanwhile, GPT’s probabilities stay more or less the same.
Let’s compare GPT’s and Oracle’s predictions if we tell them White is rated 1453, 2153 (as they are in reality), or 2853:
1453 (blitz)
2153 (blitz)
2853 (blitz)
The likelihood of Qxh7+ according to Oracle increases with White’s rating. At 2153, Oracle predicts that Rd2?? is the likeliest move, as it happened in the game. At 2853, she predicts that Qxh7+ is the likeliest move.
Let’s sum up Oracle’s functioning in one diagram:
Before we dive deeper into Oracle’s performances, we need to talk a bit about evaluations in chess.
Evaluations in Chess
There are only 3 objective, or theoretical, evaluations in chess. A position can be:
- Winning for White
- A draw
- Winning for Black
These evaluations are essential properties of the positions they refer to and do not depend on the context in which the position has arisen.
All other evaluations are subjective or practical, which means they are only valid for some player against another player, in certain conditions.
Subjective evaluations can be expressed in:
- phrases: “White is probably winning but Black has drawing chances”
- signs: “+/-”
- centipawns: “+1.12”
- expected score: “75%”
Let’s take an example:
Magnus (2882) vs. Garry (2851), classical game
Jane (608) vs. John (602), bullet game
While we can’t be certain about the theoretical evaluation of this position, we can safely assume it should be a draw.
For Stockfish, the evaluation of a position is the same as the evaluation of the best move, in that case, -0.30. It’s a subjective evaluation, true for Stockfish against itself.
To evaluate a position, Oracle follows these three steps (the averaging step is sketched in code after the list):
- Oracle uses Stockfish to evaluate each candidate move (in centipawns)
- Oracle transforms these evaluations in centipawns into an Expected Score depending on the players’ rating and the time control
- Oracle averages the Expected Score of each candidate move, weighted by its likelihood
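The third step is a simple likelihood-weighted average. Here is a minimal sketch, with hypothetical numbers for the Scholar’s-mate position at beginner level:

```python
def oracle_evaluation(candidates):
    """Oracle-style position evaluation: the average expected score of the
    candidate moves, weighted by each move's likelihood of being played.

    candidates: a list of (probability, expected_score) pairs, where the expected
    score is White's score out of 100 for the position after that move.
    """
    return sum(prob * score for prob, score in candidates)

# Hypothetical numbers for the Scholar's-mate position in a ~600-rated game:
# g6 keeps Black fine, but Nf6?? walks into Qxf7#.
beginner_candidates = [(0.60, 48.0), (0.29, 95.0), (0.11, 55.0)]
print(oracle_evaluation(beginner_candidates))  # ~62: much better for White in practice
```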
While Oracle thinks Black has slightly better chances in the Carlsen vs. Kasparov classical game, she considers that White has much better chances in the Jane vs. John bullet game, because in the latter, Black has a much higher risk of blundering with Nf6??, and the advantage after g6 doesn’t matter much. You can also note that Qe7 and g6 are given no mark for Kasparov, but a good-move “!” mark for John.
To sum up, the above position is likely a draw with perfect play, slightly better for Black at computer and super-grandmaster level, but much better for White at beginner level.
Prag vs. Caruana
Stockfish: -3.58
Oracle: 12.72%
Caruana vs. Nepo
Stockfish: +3.65
Oracle: 60.41%
In these two positions, taken from the 2024 Candidates, Fabiano Caruana has a ~3.6-pawn advantage according to Stockfish. But Stockfish doesn’t take into account the likelihood of blundering. Oracle is very optimistic about his chances against Prag, with an expected score of 87%, but much less so against Nepo, with an expected score of 60%. Fabi won the former and drew the latter.
While Oracle’s evaluations are more useful and relevant than Stockfish’s, they are still far from perfect, and in theory, an AI could give much better evaluations.
Let’s consider the following four characteristics of an evaluation system:
1. The evaluation type
Stockfish gives evaluations in centipawns, an obsolete and widely criticized way of evaluating chess positions (see the article Centipawns Suck by Nate Solon).
Oracle uses the Expected Score, which is easier to understand and makes it possible to take the risk of blundering into account.
But the Expected Score doesn’t give the probability of winning, drawing, and losing. A 75% Expected Score doesn’t differentiate between these 3 different scenarios:
- Win 75%, Draw 0%, Lose 25%
- Win 60%, Draw 30%, Lose 10%
- Win 50%, Draw 50%, Lose 0%
The most useful type of evaluation is to give the probabilities of winning, drawing, and losing.
2. The players for which the evaluation is valid
Stockfish’s evaluations are valid for itself against itself. Oracle’s evaluations are valid for the better of the two actual players playing against themselves. While Oracle understands that a +1 advantage is not the same at 1000 Elo as at 2500 Elo, she can’t take into consideration the rating difference between the two players. A much better evaluation system would give evaluations that are valid for the player to move against their actual opponent.
3. Considering the possibility of blundering
For Stockfish, the evaluation of a position is the evaluation of the best move, which makes sense as Stockfish gives evaluations that are true for itself against itself, and it’ll play the move it considers to be the best. But it also means that Stockfish says nothing of the risk of blundering. For Oracle, the evaluation of a position is the average evaluation of each candidate move, weighted by its likelihood, which means she considers the possibility of blundering, but only on the next move. A better evaluation system would consider the probability of blundering in the next few moves.
4. The clock situation
Stockfish doesn’t take the clock into account at all. Oracle distinguishes between bullet, blitz, rapid, and classical. That is important as being on +3 according to Stockfish doesn’t lead to the same expected score in bullet or classical. But Oracle doesn’t use the actual time on the clock, which would be useful especially in time trouble.
Centipawns to Expected Score
To convert Stockfish’s evaluations in centipawns into an expected score, Oracle first adjusts the better player’s rating depending on the time control:
- Bullet: adjusted rating = rating
- Blitz: adjusted rating = rating + 200
- Rapid: adjusted rating = rating + 700
- Classical: adjusted rating = rating + 1200
She then applies a conversion formula, where:
- R is the better player’s adjusted rating
- E is Stockfish’s evaluation in centipawns
Let’s see a graphic for different ratings to understand better:
The graphic clarifies a couple of important points:
For the same centipawn evaluation, the higher the rating, the higher the expected score.
The function is non-linear: the difference between the expected scores at +800 and +600 is much smaller than the difference between the expected scores at +200 and 0.
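The exact conversion formula isn’t reproduced here, but a function with the same shape (a higher expected score for a higher adjusted rating at a fixed evaluation, and non-linear in the evaluation) is a logistic curve whose steepness grows with R. The version below is an illustrative assumption, not Oracle’s actual formula; only the time-control adjustments are taken from the list above:

```python
import math

TIME_CONTROL_BONUS = {"bullet": 0, "blitz": 200, "rapid": 700, "classical": 1200}

def expected_score(eval_cp, rating, time_control):
    """Illustrative centipawns-to-expected-score conversion (not Oracle's exact formula).

    eval_cp: Stockfish evaluation E in centipawns (positive means better for White).
    Returns White's expected score as a percentage.
    """
    r = rating + TIME_CONTROL_BONUS[time_control]  # adjusted rating R
    k = r / 1_000_000                              # steepness grows with R (assumed)
    return 100 / (1 + math.exp(-k * eval_cp))

# The same +300 (three-pawn) advantage converts much better for a strong
# classical player than for a weak bullet player:
print(round(expected_score(300, 2800, "classical"), 1))  # ~77 with these assumed constants
print(round(expected_score(300, 800, "bullet"), 1))      # ~56 with these assumed constants
```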
Expected Score Loss
You might have heard of centipawn loss, a metric defined as the difference between the evaluation in centipawns of the best move and that of the move actually played. Centipawn loss is a rather poor metric for measuring the quality of a move, because losing 200 centipawns is critical if it takes you from +1 to -1, but much less so if it takes you from +10 to +8.
The Expected Score Loss doesn’t have this defect, which is why Chess.com uses ESL for its move classification.
For the next chapter’s statistics, I chose a similar classification, with a fixed rating of 2500 (a code sketch follows the list):
- Good Moves: 0 ≤ ESL ≤ 5
- Inaccuracies: 5 < ESL ≤ 10
- Mistakes: 10 < ESL ≤ 20
- Blunders: 20 < ESL
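In code, this classification is straightforward; the expected scores in the example are hypothetical:

```python
def expected_score_loss(best_expected_score, played_expected_score):
    """ESL in percentage points: the expected score of the best move minus
    that of the move actually played."""
    return best_expected_score - played_expected_score

def classify(esl):
    """Move classification used for the statistics in the next chapter."""
    if esl <= 5:
        return "Good move"
    if esl <= 10:
        return "Inaccuracy"
    if esl <= 20:
        return "Mistake"
    return "Blunder"

# Dropping from +10 to +8 barely changes the expected score at a 2500 rating,
# while dropping from +1 to -1 costs a large chunk of it (numbers are hypothetical):
print(classify(expected_score_loss(98.0, 96.0)))  # Good move (ESL = 2)
print(classify(expected_score_loss(62.0, 38.0)))  # Blunder (ESL = 24)
```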
From GPT to Oracle
Oracle GPT is a version of Oracle that takes all the legal moves predicted by GPT and normalizes their probabilities without any other change. It’s basically GPT, but with legal moves only.
Oracle has a better move-matching accuracy than GPT across all levels, and the difference increases with the level.
- For the Pogchamps games, GPT has a 56.5% move-matching accuracy, against 57% for Oracle.
- For the Magnus vs. Naka blitz, GPT has a 60% move-matching accuracy against 61.5% for Oracle.
- For the 2023 World Championship, GPT has a 54% move-matching accuracy, against 62.5% for Oracle.
You might be wondering: if Oracle can play like Magnus or Naka in blitz when she predicts them, and GPT’s move-matching accuracy is very high (60%!), only 1.5 percentage points below Oracle’s, does that mean that GPT can play like a super grandmaster in blitz? Is a text predictor capable of playing chess like the best humans?
To better understand the level difference between the players, GPT, and Oracle, let’s take a look at the quality of their moves.
Carlsen vs. Nakamura, 1+1 bullet games
Both GPT and Oracle predicted more or less the same percentage of good moves, inaccuracies, and mistakes as played by Magnus and Hikaru.
The big difference comes from blunders. About 2.5% of Magnus and Naka’s moves were blunders. Oracle predicted around 3% blunders, but GPT predicted around 5.5%, more than twice as many as were actually played!
In chess, your level is defined less by your ability to find good moves than by your ability to avoid blunders. GPT’s high blunder rate explains why it has been estimated at around 1800 Elo despite being such a good predictor of super grandmasters’ moves.
On average, Oracle predicted moves of the same level as the ones played by Magnus and Naka, while GPT predicted much worse moves.
Ding Liren vs. Nepomniachtchi 2023 World Chess Championship
As expected, GPT’s inability to calculate becomes more salient as the level of play increases. While Oracle predicted slightly better moves on average than the ones played by Ding and Nepo, GPT predicted much worse moves.
Pogchamps games (rating ~900)
At a low level, humans make significantly more blunders than those predicted by both Oracle and GPT. How can we explain this phenomenon?
Firstly, while GPT’s inability to calculate is detrimental at a high level, it doesn’t matter at a low level. When it predicts amateur games, GPT isn’t like an amateur playing intuitively, but like a super grandmaster trying to predict the amateur’s next move intuitively. And because a super grandmaster’s intuitive capabilities surpass amateur calculation abilities, it doesn’t matter if the amateur can calculate while the grandmaster and GPT can’t.
Secondly, while some blunders can be predicted, most cannot. Let’s take two examples.
Here CDawgVA played the blunder Qe8, as predicted by GPT and Oracle. If we ask 100 players of his level their next move in this position, Qe8 would likely be the most common answer. Qe8 is a predictable blunder.
Here too, let’s imagine we ask 100 players of White’s level what they would play. The majority would answer with a bad move, either a mistake or a blunder, but the single most common answer is Bxc6, the best move in the position. This creates a paradoxical situation reminiscent of the “wisdom of crowds” effect: individually, most players choose a random, unpredictable bad move, but collectively, they choose the only good move. In the game, White played Qh6, an unpredictable blunder.
To sum up: at the highest level, such random blunders are rare, so the wisdom of crowds effect is much less significant. What matters is GPT’s inability to calculate. GPT plays much worse than super grandmasters, while Oracle can play at a super grandmaster level.
At a low level, GPT’s inability to calculate doesn’t matter and what counts is the high frequency of such unpredictable blunders. So while GPT and especially Oracle are fantastic predictors of amateurs’ moves, both play much better than amateurs when they try to predict amateurs’ moves.
| Low Level | High Level |
| --- | --- |
| GPT’s inability to calculate doesn’t matter | GPT’s inability to calculate matters |
| Many unpredictable blunders | Few unpredictable blunders |
| Strong wisdom of crowds effect | Weak wisdom of crowds effect |
| GPT predicts better moves | GPT predicts worse moves |
| Oracle predicts better moves | Oracle predicts equally good moves |
Oracle’s name
I’ve decided to name my chess engine Oracle because just like the Oracle from The Matrix, her predictions feel magical even though they are just pure calculations performed by a program. For that reason, Oracle should be referred to as she/her.
Cost
Because Oracle uses GPT for her predictions, she is costly! The average cost is ~400 predictions per $1, but it can vary greatly with the length of the prompt (up to 10,000 predictions per $1 for opening moves with a short header).
Contributions and Future of Oracle
The next significant step for Oracle would be the creation of an open-source LLM trained on full PGNs with headers to replace GPT-3.5 Turbo Instruct, making Oracle completely free. Following this, Oracle could be turned into a user-friendly executable file and used on a large scale for broadcasts, training, opening preparation, anti-cheating, bots creation, and so on.
Download
Oracle is an open-source project written in Python and licensed under the MIT License. You can download Oracle’s code from my GitHub page.
Support and Donation
I have dedicated several hundred hours to this project and invested a significant amount of money. As a professional chess player and coach, my resources are limited. While I am happy to offer Oracle to the chess and scientific communities for free, any donation would be greatly appreciated.
If you value my work and wish to support Oracle and me, please consider making a donation.
I would be very thankful. 🙏