December 19, 2018

This post features different strategies to implement a bot for a turn-based game like Tic-Tac-Toe. Some are pragmatic while others are overkill for simple games like Tic-Tac-Toe. The point is to give you a taste of some techniques that you can then apply to more elaborate games.

It’s always tempting to create a bot that’s governed by predictable rules. These rules come from knowledge of game strategy and loosely follow how a human player thinks while playing the game.

Examples of rules that make sense in Tic-Tac-Toe:

- Start in the center square.
- If the opponent has two in a row, perform a block.

At each turn the bot checks against these rules and plays accordingly whenever a rule fires. It falls back to a random move when no rule is applicable on a particular turn.

Try playing against this bot and see how it fares:

You’ll notice that it isn’t terrible, but it can still be beaten. It also does some silly things like not finishing off a game even when it has two in a row because we didn’t explicitly instruct it to do so. We can continue strengthening this bot by adding more rules, but let’s explore some techniques that don’t require as much domain knowledge instead.

The previous approach is beatable because the bot doesn’t look ahead and consider all the future ramifications of making a particular move. This is solved by the following techniques which simulate the game a number of moves ahead and then pick a move that is likely to result in victory.

The current game state and all the future ones that can be reached form a directed graph called a game tree. Each game state is a node in the tree and each edge represents a move that takes the game from the source node to its child. The current game state sits at the root of this tree.

Let’s say we’re in this state and it is X’s turn to play:

O | O | X |

X | O | |

X |

If we explore all possible moves from this point forward, we get the following game tree. At this point we try to determine which of the three branches to take.

Minimax will ensure that we pick the middle branch so that X is guaranteed to win. It does so by assigning a score to each node in the game tree that represents the benefit to X:

- A game state that results in X winning is assigned a score of
**+1**. - A game state that results in X losing is assigned a score of
**-1**. - Draw states are assigned a score of
**0**.

The intermediate nodes are scored depending on whose turn it is at that node:

If it is X, we pick the the child with the highest score and assign that to the node (X picks a move that

*maximizes*its score).If it is O, we do the opposite and pick the child with the least score. The intuition here is that O will play optimally and pick a move that results in the least benefit to X (i.e. it

*minimizes*the score).

The roles of X and O are swapped on the next turn when O
becomes the current player (O becomes the *maximizer* and
X becomes the *minimizer*).

Here are the scores for the game tree that we just looked at:

Once the three nodes at the top are scored, we can just pick the node with the highest score as the next move.

You can try playing against the following bot that uses Minimax. It cannot be beaten.

In games with large search spaces, optimizations like Alpha-Beta Pruning come in handy to avoid searching down every single path in the game tree.

MCTS is another approach that searches the game tree before deciding which move to pick next. However, it works quite differently from Minimax.

For one, it does not exhaustively search the game tree. Rather, it samples different paths randomly and then builds up statistics that inform it which moves work well. As these statistics build up, the algorithm continues to exploit moves that are known to work well in order to probe them further.

This is balanced by some amount of exploration of other paths
so that the algorithm doesn’t get stuck down one or two paths
that are yielding good results so far. It is this balance of *exploration* vs. *exploitation* that makes
MCTS a great choice for game trees that have huge branching
factors and are too large to search exhaustively. It can also
be applied to games with more than two players.

Another advantage of MCTS is that we can control the number of iterations, effectively creating a tuneable bot that is stronger the more time you give it per turn. The bot can be stopped at any point and it will still suggest a move to play.

It is important to reiterate that MCTS doesn’t explore the entire game tree. Rather, it maintains a parallel tree of statistics that it expands repeatedly. Each node in this tree corresponds to a single node in the game tree.

Here is what we store at each node:

${n_i}$ = number of times we visited this node

${w_i}$ = value / reward from the perspective of the player that just moved

It executes the following four steps repeatedly until we tell it to stop:

**1. Selection**

This phase starts at the root node and keeps follows nodes down the tree until it finds a node that is either a terminal, or a node that hasn’t fully been expanded (i.e. all its children haven’t been added to the MCTS tree yet).

The child selection policy at each level of the tree must ensure that we visit promising paths often while also ensuring that we don’t neglect other children.

A popular algorithm to do this is the **Upper Confidence Bound (UCT)**.
It boils down to calculating the following value for each
child *i* and then picking the child that has the highest
**UCT** value:

$\dfrac{w_i}{n_i} + \sqrt{\dfrac{2 \ln{n_p}}{n_i}}$

($n_p$ is the number of times the parent of *i* has been visited)

The first term ensures *exploitation* (as ${w_i}$ grows the
UCT value will also grow) and the second term
ensures *exploration* (as we’ll see later, ${n_p}$ grows
every time one of its children is visited, so this term
grows for every child that is not visited during a cycle).

**2. Expansion**

Once we’ve selected a node, we randomly add one of its children to the MCTS tree.

**3. Playout**

We now simulate the game all the way to the end starting
at the node we picked in the **Expansion** phase. Note that none of the game states
visited during this simulation are actually added to the MCTS
tree. We just use the result of the simulation to update
the statistics in the tree.

This phase can just choose moves at random, but we will get better results if we choose plausible moves using some heuristics like the ones described at the beginning of this post.

**4. Backpropagation**

This phase propagates the result of the simulation back up
the tree all the way to the root. We increment ${n_i}$ at
each node on the path from the root to the **Playout** node.
We also increment ${w_i}$, but only if the simulation
resulted in a win for the player that just moved at that
node.

Once we’ve accumulated enough statistics (either by checking for convergence or just running the algorithm for a fixed number of iterations), we pick the child of the root that has been visited the most as the next move. This is the node with the highest ${n_i}$ among its siblings.

Enough talk, now see the algorithm in action. Click *step*
repeatedly to step through the algorithm. You’ll notice that
it converges to the result we achieved through Minimax
after some iterations. The numbers at each node are ${w_i}$ / ${n_i}$.

The keen observer will note that Tic-Tac-Toe doesn’t actually have that many game states. A lazy upper bound can be calculated as 3^9 = 19683 (each square can either be empty, contain an X or an O) which is quite small. The actual number is even smaller because we also end up counting states that are impossible during play (a grid filled with X’s for example). We can reduce this even further by grouping together states that share rotational symmetry or are mirror images of each other.

So, in theory we can precompute the best move at each game state (using an approach like Minimax) and then store it in a lookup table. This results in a fast (and predictable) bot that is unbeatable.

How we choose to encode the game state affects how big this lookup table will be. One approach would be to represent the board as a ternary number.

If we use the following digit positions:

0 | 1 | 2 |

3 | 4 | 5 |

6 | 7 | 8 |

and encode X as 1, O as 0, and an empty cell as 2, then the game state:

X | O | O |

X | X | |

X | O |

gets encoded as the ternary number 201211001, which is 14446 in base-10. This can be stored in an integer.

Bots that get stronger over repeated play sessions are probably the most interesting of the lot. In the previous section we computed the lookup table by using a tree search algorithm, but there’s no reason that you can’t populate it with moves made by actual human players!

Even a naive approach of just picking the most frequent move that humans make at a particular game state can lead to an interesting bot that “learns” how to play by observing.

Note that this is different from machine learning techniques like Reinforcement Learning, which is a topic for another post.

*Tic-Tac-Toe examples were built using boardgame.io*.