DogeyStamp

Chess engine, pt. 3: Elo, and rigorous SPRT testing

2025-03-15

This post is part of a series about building a chess-playing engine.

Introduction (0) | ← Last part (2) | Next part (4) →

In the last part, we went over the negamax algorithm, which is the core of how a chess engine can play good chess.

Honestly though, I think the content in this post is the most important part of the series, in terms of how essential it is to writing a good chess engine. Minimax and negamax are well-covered, even trite topics at this point, but proper testing methodology is something many other websites won't teach you. Despite the relative obscurity of this topic, you need to test, and you should start as soon as possible.

This post will talk about the Sequential Probability Ratio Test1, or SPRT. What is this test? Essentially, it is a statistical technique that can help you determine if an improved version of your engine is actually better than the old version.

By always running an SPRT when you create an improved version, you can be confident that your engine is actually better, and that it's not a fluke. Otherwise, you might introduce changes that weaken your engine, but because of bad testing methods, you'll think it's actually better. And then, in the worst-case scenario, you'll have to tear apart and re-test every single piece of your code.

So, learn good habits; do SPRT.

In this post, I’m not going to thoroughly explain all the details of the mathematics of SPRT (since I don’t have a solid background in statistics). For our purposes, we just need to understand what it’s doing, and its main results.

Elo ratings

First, some prerequisite context. How do we determine that a chess engine (or a version of the engine) is better than another? We measure the performance of the engines by having them play matches against each other. Usually, a win is worth 1 point, a draw 0.5 points, and a loss 0 points.

For example, if Engine A wins 3, draws 1, and loses 1 game against Engine B, then Engine A has 3.5 points, and Engine B has 1.5 points. In total in this match, 5 games were played, and these 5 points were distributed between the players.

However, comparing performance based on win/draw/loss counts or points is a bit unwieldy. For example, if Engine A and Engine B play each other and their score is 12.5 to 10.5, while Engine B and Engine C get a score of 8.5 to 9.5, which engine is the best? Are you confused by all these numbers?

If so, you should be happy to know that a certain chess player, Arpad Elo, invented the Elo rating system to make our life easier. Using the point scores, Elo’s system lets you calculate a single rating number that quantifies a chess player’s performance, whether the player is human or machine. (If you want to see these calculations, please consult the relevant section on Wikipedia.)

Elo rating lets you easily compare the performance of players. If a player A has a higher rating than player B, player A is expected, on average, to perform better, i.e. score more points against player B. A bigger rating difference also predicts that there is a bigger skill difference, i.e. one player will score a lot more points than the other.

For instance, as of writing, Magnus Carlsen has a FIDE rating of 2837.1, and Hikaru Nakamura has a rating of 2799.4. Generally, Magnus Carlsen is considered to be a better chess player.

Note that Elo’s performance predictions say what should “probably” happen; it’s always possible, but improbable, that a bad player wins against a better player. Elo provides a framework that assigns probabilities for these outcomes.
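Concretely, the standard logistic Elo model predicts that a player rated R_A scores, on average, 1 / (1 + 10^((R_B - R_A) / 400)) points per game against a player rated R_B. Here is a small sketch of that formula in Python (the expected_score helper is my own name for illustration, not from any library):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected points per game for player A against player B,
    under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Equal ratings: both players expect half a point per game.
print(expected_score(1500.0, 1500.0))  # 0.5

# Carlsen (2837.1) vs Nakamura (2799.4): a modest edge of about 0.554 points/game.
print(round(expected_score(2837.1, 2799.4), 3))  # 0.554
```

Note how a 37.7-point rating gap translates to only a small expected edge per game; it takes a gap of about 200 points before the stronger player expects roughly 0.75 points per game.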

FIDE and many other chess organizations use an Elo rating system.2 Generally, even between different player pools that both use the same Elo system, ratings are not comparable. Machines also have their own Elo ratings. At the CCRL website, you can see a leaderboard of some of the best chess engines, as well as their rating. As I said, this is not really comparable to FIDE rating, or any other rating.

The SPRT

Now that we know what Elo is, let’s return to the main topic of the sequential probability ratio test.

What is an SPRT?

Note: This section is more mathematical in nature, and is not absolutely essential to using SPRT in engine development. However, I wrote this because having a base-level understanding of what you’re doing is helpful. Skip this if you’re allergic to maths: in practice, this process is handled entirely by existing tools, so you don’t strictly need to understand it.

The goal of an SPRT in chess is to accurately estimate the performance of an engine versus another engine. When testing, we run a series of games between the engines we want to test. Let’s say we want to test that engine A is better than engine B by some amount. We have two possible hypotheses:

- “No improvement”: engine A is not better than engine B (for example, a 0 Elo difference).
- “Improvement”: engine A is better than engine B by the given amount (for example, 10 Elo).3

These can be considered as “models” for the data. The models give the probability of certain outcomes. For example, let’s say we ran 1000 games, and the outcomes of these games for Engine A are WDLWWDWLLWLLWWWDD.... That is, Engine A won, then drew, then lost, then won, won, and so on, for a total of 1000 games.4 Suppose (with invented numbers, purely for illustration) that the “improvement” model assigns this exact sequence a probability of 3 × 10^-450, while the “no improvement” model assigns it 1 × 10^-450.

These probabilities are called the likelihood of this outcome, under a given hypothesis. If the likelihood for a given hypothesis is higher, that means the hypothesis is better at explaining the results we got, and thus it seems like a better model.

Let’s define the likelihood ratio LR of our hypotheses as follows:

LR = (likelihood with "improvement" model) / (likelihood with "no improvement" model)

So, LR is a value between 0 and infinity. When it is lower than 1, that means our hypothesis of “no improvement” is a better explanation of our data. When it is higher than 1, our hypothesis of “improvement” is a better explanation.

Often, chess SPRT tools will convert this into a neat number using a logarithm. We can define the log-likelihood ratio as

LLR = log(LR)

so that if the improvement hypothesis is more likely, it will be positive, and otherwise it will be negative. This is the number you’ll be seeing when you run an SPRT in practice.
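As a toy illustration, suppose each hypothesis assigns a (completely made-up) probability to each game outcome; the log-likelihood ratio is then just a running sum of per-game log ratios over the results:

```python
import math

# Hypothetical per-game outcome probabilities under each model.
# These numbers are invented for illustration only.
H0 = {"W": 0.33, "D": 0.34, "L": 0.33}  # "no improvement"
H1 = {"W": 0.36, "D": 0.34, "L": 0.30}  # "improvement"

def llr(outcomes: str) -> float:
    """Log-likelihood ratio of H1 vs H0 for a game sequence like 'WDLWW...'."""
    return sum(math.log(H1[o] / H0[o]) for o in outcomes)

# More wins than losses pushes the LLR up; every new game nudges it.
print(llr("WDLWWDWLL"))  # slightly positive: 4 wins vs 3 losses
```

Each game simply adds log(p_improvement / p_no_improvement) for its outcome, which is why the test can be updated cheaply after every game.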

We set two bounds: A, a negative number, and B, a positive number. If log(LR) <= A, then we are reasonably certain that the “no improvement” hypothesis is true. If log(LR) >= B, then we think “improvement” is the correct hypothesis.

But what happens if A < log(LR) < B? That means we don’t have enough data to convincingly pick either hypothesis over the other. Remember: at this point we are running chess games between engine A and engine B, and collecting the outcomes of the game. So, in SPRT, if we don’t have enough data, we keep running more games, until we get enough data to prove either hypothesis.

Once SPRT is confident of the hypothesis, we stop playing the chess games. Thus, we collect exactly enough samples to prove that one of our hypotheses is more likely than the other.

Now, where do the parameters A and B come from? There’s a formula that takes two parameters, α and β, and gives us values for A and B. α represents the probability of a false positive (i.e. a type I error), and β represents the probability of a false negative (i.e. a type II error).

So if you configure it so that α = 0.05, there is a 5% chance that the SPRT will wrongly determine that the “improvement” hypothesis is true, when it is actually false.
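That formula is Wald’s classic approximation: A = log(β / (1 - α)) and B = log((1 - β) / α). This is standard SPRT theory rather than anything chess-specific, and it is easy to check numerically; the common chess-testing choice α = 0.05, β = 0.1 yields the (-2.25, 2.89) bounds that fastchess reports:

```python
import math

def sprt_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """Wald's approximate stopping bounds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # A: accept "no improvement"
    upper = math.log((1.0 - beta) / alpha)   # B: accept "improvement"
    return lower, upper

a, b = sprt_bounds(alpha=0.05, beta=0.1)
print(round(a, 2), round(b, 2))  # -2.25 2.89
```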

Note: For more information about the mathematics of SPRT, please see these links:

A lot of the information in this section comes from these pages.

Practical use: fastchess

Alright, enough math. How do we use the SPRT to test chess engines?

As it turns out, other people have already developed tools that perform SPRT tests for us. The one I use is fastchess. Once you have an engine that supports UCI (i.e. it implements the essential UCI commands from part 1 of this series), you can plug it into fastchess, and it will take care of everything.

First, you need to download fastchess’s source code and compile it using make (it is written in C++).

Next, you need to download an opening book. This is because if you play the same engine against the same opponent 1000 times from the same starting position, the outcomes will likely be identical every time. An opening book introduces variation by playing a few unique moves in the opening before the engines take over. For my project, I used 8moves_v3.pgn from Stockfish.

Then, you can start an engine chess tournament using fastchess:

fastchess \
    -engine cmd=engine_old name="The old engine" \
    -engine cmd=engine_new name="The new engine" \
    -pgnout file="games.pgn" \
    -openings file=8moves_v3.pgn format=pgn order=random \
    -each tc=8+0.08 \
    -rounds 5000 -repeat \
    -concurrency 8 \
    -recover \
    -sprt elo0=0 elo1=10 alpha=0.05 beta=0.1

Flag by flag, this is what this command means:

- -engine cmd=... name=...: registers an engine to play, given its executable and a display name.
- -pgnout file="games.pgn": saves all games played to games.pgn.
- -openings file=8moves_v3.pgn format=pgn order=random: uses the opening book, picking openings in random order.
- -each tc=8+0.08: gives each engine a time control of 8 seconds for the whole game, plus 0.08 seconds of increment per move.5
- -rounds 5000 -repeat: plays up to 5000 rounds; -repeat makes each round a pair of two games with the same opening and the colors swapped.
- -concurrency 8: runs 8 games in parallel.
- -recover: restarts an engine if it crashes, instead of aborting the whole tournament.
- -sprt elo0=0 elo1=10 alpha=0.05 beta=0.1: runs an SPRT with the “no improvement” hypothesis at 0 Elo and the “improvement” hypothesis at 10 Elo, with the error probabilities α = 0.05 and β = 0.1.

Fastchess has a decent man page and --help page, so consult those for more information.

One of the main pain points of using fastchess is that you have to repeatedly compile your engines and deal with all the executable files. Many people (including me) accelerate this process by writing a shell script to automate it. Here is my shell script. Essentially, in my code repository, I can use git tag to mark the old and new versions with descriptive names. Then, the script takes the tags, compiles the respective versions, and runs fastchess on them.

Another important point is the time control. Humans typically don’t play 8+0.08, because they can’t really think that fast. But in SPRT, you typically need thousands of games to achieve a statistically significant result. Therefore, developers often use a short time control (called STC), so that we can run these games in a practical time frame. The opposite is a long time control (LTC), which is closer to human blitz chess.

At this point in the series, I still haven’t covered time management, so if you’re writing an engine and you don’t have that feature yet, cap its search depth at some low number such that it’s fast enough to play a game at 8+0.08.

The other parameters you might want to tune are the SPRT hypotheses. [0, 10] works for testing a relatively big improvement. But if you are doing a non-regression test, i.e. testing that engine A is at least as good as engine B, you should use [-10, 0] instead. For example, this could be useful if you do a refactor and want to make sure you did not accidentally weaken the engine.

Also, if you’re testing minuscule changes that only slightly affect engine performance, you might want to use tighter bounds, like [0, 5]. See the Chess Programming Wiki article about SPRT for more examples.

Interpreting the output

Once you’re running fastchess, you should start seeing the outcomes of the chess games. Periodically, fastchess will also give a status report that looks like this:

--------------------------------------------------
Results of c_i nnue08a-512 (920bea1) vs c_i eval-saturating (9b3074e) (8+0.08, NULL, 500MB, 8moves_v3.pgn):
Elo: 107.54 +/- 127.62, nElo: 144.54 +/- 152.27
LOS: 96.86 %, DrawRatio: 50.00 %, PairsRatio: 4.00
Games: 20, Wins: 12, Losses: 6, Draws: 2, Points: 13.0 (65.00 %)
Ptnml(0-2): [0, 1, 5, 1, 3], WL/DD Ratio: inf
LLR: 0.17 (-2.25, 2.89) [0.00, 10.00]
--------------------------------------------------

The main statistic you want to look at in this status report is the LLR (here, 0.17), which is the log-likelihood ratio of SPRT, as discussed above. The (-2.25, 2.89) represent the A and B bounds, and [0.00, 10.00] are the Elo hypotheses.

Never ever stop the SPRT prematurely. In this example, you want to wait for the LLR to either be above 2.89, or below -2.25. If you see something like 30 wins and 50 losses, and you think “wow, this version must be worse, so I will stop the test”, that is wrong! For all we know, the engine might just be having a bad day and isn’t performing well. The whole point of SPRT is that the log-likelihood ratio will tell you if the engine is better or not, statistically speaking.

Once the SPRT is confident about a hypothesis, fastchess will automatically stop itself. Here, for example, is the final outcome of a passed SPRT:

--------------------------------------------------
Results of c_i nnue08a-512 (920bea1) vs c_i eval-saturating (9b3074e) (8+0.08, NULL, 500MB, 8moves_v3.pgn):
Elo: 19.57 +/- 13.22, nElo: 22.46 +/- 15.13
LOS: 99.82 %, DrawRatio: 37.81 %, PairsRatio: 1.27
Games: 2026, Wins: 867, Losses: 753, Draws: 406, Points: 1070.0 (52.81 %)
Ptnml(0-2): [134, 143, 383, 181, 172], WL/DD Ratio: 8.34
LLR: 2.91 (-2.25, 2.89) [0.00, 10.00]
--------------------------------------------------

As you can see, 2.91 > 2.89, so we are confident that the version nnue08a-512 is more likely to be 10 Elo better than eval-saturating, rather than 0 Elo. A negative LLR would be the opposite.

However, if your LLR is between the bounds (-2.25, 2.89), then that is not a valid result! You need to run the test for longer to get enough data to prove your hypothesis.
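As a sanity check on reports like the one above: the Elo figure is, to a good approximation, just the Elo difference implied by the overall score percentage under the logistic model, i.e. -400 · log10(1/s - 1) for an average score s. This little sketch (my own helper, not fastchess’s actual code, which may differ in details such as the confidence interval) recovers the 19.57 Elo from the final report:

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by an average score per game (0 < score < 1),
    under the standard logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# 1070 points out of 2026 games, as in the passed SPRT report above.
print(round(elo_from_score(1070.0 / 2026.0), 2))  # 19.57
```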

If you do accidentally stop fastchess, it automatically saves its state to a file called config.json. You can resume the SPRT by running

$ fastchess -config file=config.json

and it will restore your existing data.

Game pairs & pentanomial results

Remember when I said that fastchess will run two games per round? These are called game pairs, and the two games use the same opening. We use game pairs because every opening will give an advantage to a certain player. Even the starting position in chess is unfair, because White moves first. Therefore, for fairness reasons, we need players to play both White and Black in the round.

If we have engine A, and engine B, then on the first game, A is White, B is Black. Then on the second game, we swap it so B is White, A is Black. Because of this symmetric nature, we can say that the outcomes of individual games may be biased by the opening, but the outcomes of game pairs are unbiased in that regard.

These are the outcomes possible for the two games in a game pair:

Points  Outcome  Description
0       LL       loss, loss
0.5     LD       loss, draw
1       WL       win, loss
1       DD       draw, draw
1.5     WD       win, draw
2       WW       win, win

Since both DD and WL are 1 point, they are considered the same outcome. Thus, we have 5 possible outcomes, giving a pentanomial model. In fastchess, the Ptnml(0-2) statistic counts the game pairs that had each outcome. For example, if you see Ptnml(0-2): [134, 143, 383, 181, 172] that means there were 134 LL, 143 LD, 383 WL or DD, and so on.
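As a quick sketch of how these counts come about (a hypothetical tally function for illustration, not fastchess’s actual code), each pair’s total score of 0, 0.5, 1, 1.5, or 2 points indexes one of the five buckets:

```python
POINTS = {"W": 1.0, "D": 0.5, "L": 0.0}

def pentanomial(pairs: list[str]) -> list[int]:
    """Tally game pairs (e.g. 'WL') into the five Ptnml(0-2) buckets."""
    counts = [0, 0, 0, 0, 0]
    for pair in pairs:
        score = POINTS[pair[0]] + POINTS[pair[1]]  # 0 .. 2 in steps of 0.5
        counts[int(score * 2)] += 1                # bucket index 0 .. 4
    return counts

# One pair of each outcome from the table above (WL and DD share a bucket).
print(pentanomial(["LL", "LD", "WL", "DD", "WD", "WW"]))  # [1, 1, 2, 1, 1]
```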

Fastchess can also finish tests faster than some other SPRT tools, because the pentanomial model can pick a hypothesis with fewer games than the trinomial (win/draw/loss only) model, given the same data.

Making your own Elo leaderboard

SPRT is the right tool for saying, “I am confident this version is better than that version.” However, if you want to rank a bunch of engines on a leaderboard, the better tool is Ordo.

Ordo takes the outcomes of chess games between any number of players (human or engine), and computes Elo ratings from them.

Once you build it from source, here is an example usage:

$ ordo -o ratings.txt -- file1.pgn file2.pgn

-o ratings.txt means output to ratings.txt, and file1.pgn, file2.pgn are the games it should read from. You can use a glob like games/*.pgn to match all PGN files in a directory games/.

Here are some ratings of different versions of my engine:

   # PLAYER                                :  RATING   POINTS  PLAYED   (%)
   1 c_i ttable-smaller4 (3060140)         :  1719.0   5897.0   11770    50
   2 c_i nnue08a-512 (920bea1)             :  1718.3   9428.0   18760    50
   3 c_i duplicate (920bea1)               :  1717.8   2478.0    4963    50
   4 c_i eval-saturating (9b3074e)         :  1698.8   3358.0    6804    49
   5 c_i ttable-smaller3 (a40e3b7)         :  1696.9   4636.5    9159    51
   6 c_i ttable-smaller2 (7123e9e)         :  1688.5   1510.0    2997    50
   7 c_i ttable-beta-cutoff (3baeda6)      :  1685.8   7544.0   15104    50
   8 c_i ttable-alpha-cutoff (a3649e4)     :  1685.5   2609.5    5223    50
   9 c_i ttable-smaller (bd477ba)          :  1674.2    689.0    1425    48
  10 c_i recap-heuristic2 (de55bab)        :  1654.8   1448.5    2839    51

  25 c_i lmr-new-new3 (cf52e13)            :  1501.8   4811.0    9626    50
  26 c_i lmr-new-new2 (3b748a1)            :  1499.3   4814.5    9614    50
  27 c_i no-frac (c0e36f5)                 :  1495.5  10259.5   19983    51
  28 c_i null-window-more2 (f5a17d8)       :  1484.4    749.5    1548    48
  29 c_i history-heuristic (dcff015)       :  1481.5    638.5    1330    48

 150 c_i nnue3-512 (4503401)               :  1012.4     11.5      28    41
 151 c_i contempt-disabled (caa3bc4)       :  1008.7     28.5      58    49
 152 c_i promote-extension2 (9b711bd)      :  1006.4     28.5      60    48
 153 c_i hash-opt-fix (79cfbef v)          :  1002.3     83.5     183    46
 154 c_i short-time2 (290a69a)             :  1000.5     28.0      60    47
 155 c_i promote-extension (359dd5a)       :  1000.0     41.0      88    47
 156 c_i contempt (3fe4b8c)                :   995.0     11.5      26    44
 157 c_i engine2 (97db55b)                 :   990.7    139.5     950    15
 158 c_i avoid-rep7 (e27e18e)              :   990.6    110.5     258    43
 159 c_i contempt3 (68d3d7e)               :   976.4     10.5      26    40
 160 c_i avoid-rep (db365c1)               :   972.4     13.5      32    42

(Since there are many versions, I’ve cut out many rows.)

Ordo provides nice leaderboard-style ratings for your engines. So, once you test a lot, you’ll be able to see the progression.

Conclusion

If you are writing a chess engine, run an SPRT on every change you make, with no exceptions. If you don’t, it’s quite possible that you’ll regret it.

Before I move on to the next topic, here is some further reading about engine testing:

In the coming parts of this series, I’ll be going over the main improvements that you can make to a chess engine. If you’re following along by writing your own engine, you should be able to implement these changes and, using SPRT, be confident that your engine is improving.

Next part →


  1. My apologies for writing “SPR Test testing” in the title. 

  2. A notable exception is Lichess, which uses the Glicko 2 rating system. 

  3. Internally, tools first convert Elo values to scores before performing the SPRT. See fastchess’s source code for an example. 

  4. In practice, SPRT isn’t performed on the win/draw/loss outcomes of individual games in computer chess; we use the outcomes of game pairs to reduce bias. See this section of this post. 

  5. I personally had to talk to the Engine Programming Discord members to discover the 8+0.08 time control, since it’s not really documented in many places online. Less than this, and your engine could forfeit games on time, and more time takes too long for the SPRT to finish.