Solving Every Sudoku Puzzle Quickly

In this essay I develop and implement an algorithm to solve every Sudoku puzzle. It turns out to be quite straightforward (one page of code to solve 5 easy puzzles per second, or five pages to solve 40,000 hard puzzles per second) using two computer science ideas: constraint satisfaction and search. (Note: this page is an expansion of my previous Sudoku essay; that one emphasized simplicity of implementation; this one emphasizes efficiency and covers the ideas of search and constraint satisfaction in much more detail.)

Sudoku puzzles defined

Sudoku is a logic puzzle game involving filling in squares in a grid. We will use these definitions: the grid consists of 81 squares, organized into 9 rows, 9 columns, and 9 boxes of 3×3 squares each; rows, columns, and boxes are collectively called units; a puzzle is a grid in which some of the squares are filled with digits; a solution is a grid in which every square is filled and no two squares in a unit have the same digit; and a puzzle is well-formed if it has exactly one solution. Here is a typical well-formed puzzle and its solution:

4 . . |. . . |8 . 5      4 1 7 |3 6 9 |8 2 5
. 3 . |. . . |. . .      6 3 2 |1 5 8 |9 4 7
. . . |7 . . |. . .      9 5 8 |7 2 4 |3 1 6
------+------+------     ------+------+------
. 2 . |. . . |. 6 .      8 2 5 |4 3 7 |1 6 9
. . . |. 8 . |4 . .      7 9 1 |5 8 6 |4 3 2
. . . |. 1 . |. . .      3 4 6 |9 1 2 |7 5 8
------+------+------     ------+------+------
. . . |6 . 3 |. 7 .      2 8 9 |6 4 3 |5 7 1
5 . . |2 . . |. . .      5 7 3 |2 9 1 |6 8 4
1 . 4 |. . . |. . .      1 6 4 |8 7 5 |2 9 3

Below we see the three units (row, column, and box) for the square that is filled with the 2 in the fourth row, second column, and we can verify that no two squares in any unit have the same digit:

[The solution grid is shown three times here, highlighting in turn the row, the column, and the box that contain that 2.]

We will define the peers of a square to be all the other squares in that square's row, column, and box. Thus, each square must be filled by a digit that is different from all of the square's peers. Here are all the peers of the square with that same 2. Notice that this square (and every other square) has exactly three units (one row, one column and one box) and 20 peers.

[The solution grid is shown once more, highlighting the 20 peers of that square.]

Sudoku data structures

Let's start to think about how to write a program to solve Sudoku. We'll need to define an algorithm for solving puzzles, and we'll need data structures for representing all the concepts we defined above: puzzle, grid, digit, row, column, box, unit, peer, and solution. We will also see that it will be useful to represent a set of possible digits that could potentially fill a square.

Let's start with the grid, because that drives everything else. I can think of three possibilities for representing grids. Below I show how each square is represented under each of the three schemes, along with sample Python code to create a grid, define a square, and fill a square with a digit.

2-D [9 × 9] Array:
(0,0) (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8)
(1,0) (1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (1,7) (1,8)
(2,0) (2,1) (2,2) (2,3) (2,4) (2,5) (2,6) (2,7) (2,8)
(3,0) (3,1) (3,2) (3,3) (3,4) (3,5) (3,6) (3,7) (3,8)
(4,0) (4,1) (4,2) (4,3) (4,4) (4,5) (4,6) (4,7) (4,8)
(5,0) (5,1) (5,2) (5,3) (5,4) (5,5) (5,6) (5,7) (5,8)
(6,0) (6,1) (6,2) (6,3) (6,4) (6,5) (6,6) (6,7) (6,8)
(7,0) (7,1) (7,2) (7,3) (7,4) (7,5) (7,6) (7,7) (7,8)
(8,0) (8,1) (8,2) (8,3) (8,4) (8,5) (8,6) (8,7) (8,8)
grid = [['4', '.', ...], ...]
square = (y, x) = (3, 1)
grid[y][x] = '2'

1-D [81] Array:
 0  1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62
63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80
grid = ['4', '.', ...]
square = 28
grid[square] = '2'

Hash table:
A1 A2 A3 A4 A5 A6 A7 A8 A9
B1 B2 B3 B4 B5 B6 B7 B8 B9
C1 C2 C3 C4 C5 C6 C7 C8 C9
D1 D2 D3 D4 D5 D6 D7 D8 D9
E1 E2 E3 E4 E5 E6 E7 E8 E9
F1 F2 F3 F4 F5 F6 F7 F8 F9
G1 G2 G3 G4 G5 G6 G7 G8 G9
H1 H2 H3 H4 H5 H6 H7 H8 H9
I1 I2 I3 I4 I5 I6 I7 I8 I9
grid = {'A1': '4', 'A2': '.', ...}
square = 'D2'
grid[square] = '2'

Let's look at each possible grid representation in turn.

2-D [9 × 9] Array:

Pro: Seems like the most "obvious" choice to mirror the 2-D layout of squares.
Con: If we want to iterate over all squares, we need two loops, not one. Making a copy of a grid is complicated. We can't say grid[square] = value, we have to say grid[y][x] = value. The notion of "square" becomes less important; x and y coordinates become more important, but they really aren't crucial to the puzzle (just to the way it is typically drawn).

1-D [81] Array:

Pro: Easy to iterate over all the squares. Easy to make a copy of a grid. Takes a minimal amount of storage.
Con: We need to be able to map from the 1-D array into a 2-D layout, for printing, or for determining where the rows and columns are. This can be done as follows: Rows are numbered 0-8 (top to bottom), columns are numbered 0-8 (left to right), and then row r, column c is square number 9r + c.

Hash table:

Pro: Squares don't have to be integers. Easy to represent a partially-filled grid.
Con: Slightly less efficient. Harder to make a copy of a grid.

Given these pros and cons, I chose the 1-D array (or list in Python) to represent a grid. That means a square is represented as an integer from 0 to 80. The content of each square is a character, either a digit or '.' for empty. (In my previous essay, grids were hash tables (what Python calls a dict), because I was focusing on simplicity and debuggability and didn't care about efficiency.)
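As an aside, the mapping between the 1-D and 2-D views is simple arithmetic. Here is a small sketch (the two function names are mine, for illustration; they are not part of the program):

def square_number(r, c):
    "Row r, column c (each 0 to 8) is square number 9r + c."
    return 9 * r + c

def row_and_column(square):
    "The inverse mapping; for example, square 28 is row 3, column 1."
    return divmod(square, 9)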

Note: It is also possible to consider puzzles of different sizes. Instead of having boxes of size 3×3, each box could be 2×2 or 4×4 or 5×5, etc. We'll use the variable name N for the size of a unit (row or column), and the variable Nbox for the width or height of a box. So the traditional Sudoku puzzle has N = 9 (and thus N² = 81 squares) and Nbox = 3. A puzzle with Nbox = 6 has N = 36 and Nbox⁴ = N² = 1296 squares. Of course, each puzzle must have N "digits" as well (we can use letters as digits when N > 9).

We can implement the notions of squares, digits, units, and peers as follows. The squares are all the integers from 0 up to (but not including) N×N; the digits are the characters from '1' to '9' when N=9, and include more characters when N is 16 or 25 or 36 or 49 (N=49 means a grid of 2401 squares; I thought that was big enough). We define units[s] to be the list of three lists of squares denoting the three units that s participates in, and peers[s] to be the set of squares that are the peers of s. The choice of what to represent as a set and what as a list is determined by convenience in writing the program later on; for now consider those choices unimportant.

# The size of the grid, and the squares within.
Nbox     = 3            # Each box is of size Nbox x Nbox (3 for the traditional puzzle)
N        = Nbox * Nbox  # The whole grid is of size N x N
squares  = range(N * N) # ... and thus has N x N squares
Square   = int          # Squares are implemented as integers
Digit    = str          # Digits are implemented as characters
empty    = '.'          # The mark for an empty square

# The allowable digits, as a string and as a set.
digitstr = '123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmn'[:N] # E.g., '123456789' for N = 9
digitset = set(digitstr)

def roundto(n: int, m: int) -> int:
    "Round the integer n down to a multiple of m."
    return m * (n // m)

def unitsfor(s: Square) -> [[Square]]:
    "Return the three units for square s."
    # (r, c, b) are the first squares in the (row, col, box) of s
    r = roundto(s, N)
    c = s % N
    b = roundto(s, N * Nbox) + roundto(c, Nbox)
    return [list(range(r, r + N)),    # row
            list(range(c, N * N, N)), # col
            [br + bc                  # box
             for br in range(b, b + Nbox * N, N)
             for bc in range(Nbox)]]

# Define units[s] and peers[s] for all squares s
units      = [unitsfor(s) for s in squares]
peers      = [set().union(*units[s]) - {s} for s in squares]
every_unit = set(tuple(u) for s in squares for u in units[s])

We can check that the peers and units look good:

>>> units[28]
[[27, 28, 29, 30, 31, 32, 33, 34, 35], 
 [1, 10, 19, 28, 37, 46, 55, 64, 73], 
 [27, 28, 29, 36, 37, 38, 45, 46, 47]]
>>> sorted(peers[28])
[1, 10, 19, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 45, 46, 47, 55, 64, 73]
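
We can also confirm the earlier claim that every square has exactly three units and 20 peers, and that there are 27 units in all (a quick sanity check I've added; not part of the original listing):

assert all(len(units[s]) == 3 for s in squares)   # one row, one column, one box
assert all(len(peers[s]) == 20 for s in squares)  # 8 + 8 + 8, minus the 4 box squares shared with the row and column
assert len(every_unit) == 27                      # 9 rows + 9 columns + 9 boxes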

Sudoku notation and drawing grids

How are we going to input a puzzle to our program? Looking around on the web, we see that there are several common notations for describing a puzzle as a string of text. Below I list four example formats, each representing the puzzle diagrammed at the top of this page.

4.....8.5.3..........7......2.....6.....8.4......1.......6.3.7.5..2.....1.4......


4.....8.5 .3....... ...7..... .2.....6. ....8.4.. ....1.... ...6.3.7. 5..2..... 1.4......


400000805
030000000
000700000
020000060
000080400
000010000
000603070
500200000
104000000


4 . . |. . . |8 . 5
. 3 . |. . . |. . .
. . . |7 . . |. . .
------+------+------
. 2 . |. . . |. 6 .
. . . |. 8 . |4 . .
. . . |. 1 . |. . .
------+------+------
. . . |6 . 3 |. 7 .
5 . . |2 . . |. . .
1 . 4 |. . . |. . .
We will develop a function that accepts any of these formats; we call the format a gridstring and define it to be a string that contains exactly N² instances of either a digit or the characters '.' or '0' to indicate an empty square. Other characters (such as spaces and hyphens) are ignored.

Here is the function Grid that takes a gridstring as input and returns a grid, and the inverse function, show, that takes a grid and returns a pretty gridstring:

def Grid(gridstring: str):
    """Convert a gridstring into a grid: a list of values for squares. 
    Accepts digits, or '.' or '0' for empty; all other characters ignored."""
    gridstring = gridstring.replace('0', empty)
    grid = [c for c in gridstring if c in digitset or c == empty]
    if len(grid) != N * N:
        raise ValueError('Wrong length for a grid')
    return grid

def show(grid: Grid) -> str:
    "Convert a grid to a pretty gridstring."
    dashes = '\n' + '+'.join(['--'*Nbox]*Nbox) + '\n'
    boxes = '\n'.join(['|'.join(['{} '*Nbox]*Nbox)]*Nbox)
    template = dashes.join([boxes]*Nbox) 
    return template.format(*grid)

And here is an example of the functions at work:
>>> puzzle1 = "4.....8.5.3..........7......2.....6.....8.4......1.......6.3.7.5..2.....1.4......"
>>> Grid(puzzle1)
['4', '.', '.', '.', '.', '.', '8', '.', '5',
 '.', '3', '.', '.', '.', '.', '.', '.', '.',
 '.', '.', '.', '7', '.', '.', '.', '.', '.',
 '.', '2', '.', '.', '.', '.', '.', '6', '.',
 '.', '.', '.', '.', '8', '.', '4', '.', '.',
 '.', '.', '.', '.', '1', '.', '.', '.', '.',
 '.', '.', '.', '6', '.', '3', '.', '7', '.',
 '5', '.', '.', '2', '.', '.', '.', '.', '.',
 '1', '.', '4', '.', '.', '.', '.', '.', '.']

>>> print(show(Grid(puzzle1)))
4 . . |. . . |8 . 5
. 3 . |. . . |. . .
. . . |7 . . |. . .
------+------+------
. 2 . |. . . |. 6 .
. . . |. 8 . |4 . .
. . . |. 1 . |. . .
------+------+------
. . . |6 . 3 |. 7 .
5 . . |2 . . |. . .
1 . 4 |. . . |. . .

We'll need a function to check/verify a solution. The definition of a solution was given at the top of this page, but we will implement the check slightly differently from the way the definition is stated: instead of checking both that every square is filled and that no two squares in a unit have the same digit, we will instead check that every unit forms a set of digits which is equal to the set of all digits (digitset). We still check that the filled-in squares of the puzzle are unchanged in the proposed solution.

def is_solution(candidate: Grid, puzzle: Grid) -> bool:
    """A candidate grid is a solution to a puzzle if every unit of the candidate
    contains all the digits, and the filled squares of the original puzzle remain 
    unchanged in the candidate."""
    return (all(set(candidate[s] for s in unit) == digitset for unit in every_unit)
            and
            all(puzzle[s] == candidate[s] for s in squares if puzzle[s] != empty))
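
For example, using the puzzle from the top of the page and the solution shown beside it (the nine solution rows concatenated into one gridstring):

>>> puzzle = Grid('4.....8.5.3..........7......2.....6.....8.4......1.......6.3.7.5..2.....1.4......')
>>> answer = Grid('417369825632158947958724316825437169791586432346912758289643571573291684164875293')
>>> is_solution(answer, puzzle)
True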

Searching for solutions

Humans solve Sudoku problems by using logical deduction to infer that a certain digit is (or is not) the value of a square; such deductions are repeated until all squares are filled with a single known digit. An example of an inference rule that allows deductions is "If there is only one empty square in a unit, fill it with the digit that has not yet appeared in the unit." If the set of inference rules is sound and complete and the puzzle is well-formed, the human will only need a pencil--never an eraser. We could write an algorithm to do deductions like that, but the trick is in knowing when you have enough inference rules to solve every possible puzzle. I don't know how to guarantee that.

Therefore we will take a different route: searching for a solution that works by trying out possibilities one by one. We will describe a series of search algorithms, starting with the hopelessly inefficient and ending with one that can solve over 40,000 hard Sudoku puzzles per second. However, when implemented by a human, this strategy requires an eraser as well as a pencil.

Algorithm 1: generate and test

Here is a simple (but inefficient) search algorithm:
Generate and test algorithm: First fill in all the empty squares with digits. Then check to see if we have a solution. If not, then fill the empty squares with a different combination of digits. Repeat with different combinations until a solution is found.

Sometimes generate and test is the best you can do. Herb Simon, in his classic book The Sciences of the Artificial (page 194), describes the task of opening a safe that has ten dials, each with 100 numbers on it. If the safe is well-designed there is no better approach than to generate-and-test with all 100¹⁰ combinations.

Let's consider an example of a tiny puzzle, with N=4 (to run this example, set Nbox = 2 and re-run the definitions above):

>>> tiny = '...4 4.3. .4.. .241'

>>> print(show(Grid(tiny)))
. . |. 4 
4 . |3 . 
----+----
. 4 |. . 
. 2 |4 1 
There are nine empty squares, so we need to fill them in with nine digits. Let's do it systematically, starting with 111111111, then 111111112, 111111113, 111111114, and on to 111111121, 111111122, and so on. The function itertools.product is useful for doing this kind of enumeration over possibilities. Here is the implementation of generate and test, encoded in the function solve, which takes a gridstring as input and returns a solution if it can find one, else None.
import itertools

def solve(gridstring: str) -> Grid:
    "Generate every combination of digits for the empty squares; test each one."
    puzzle = Grid(gridstring)
    possibilities = [(digitstr if v == empty else v) for v in puzzle]
    for candidate in itertools.product(*possibilities):
        if is_solution(candidate, puzzle):
            return candidate
    return None
We can use it to solve the tiny puzzle:
>>> print(show(solve(tiny)))
1 2 |3 4
3 4 |1 2
----+----
4 3 |2 1
2 1 |4 3

It took 0.8 seconds to solve this tiny 4×4 puzzle (it tried about 100,000 possibilities before finding the solution). Now consider the 9×9 puzzle defined above as puzzle1. It has 17 filled squares, and thus 81 - 17 = 64 empty squares. So there are 9⁶⁴ possible combinations (about 10⁶¹) to consider. How long will it take to run through them?

Suppose we have access to a secret new custom 1024 core 10 GHz CPU that features special hardware instructions that can generate and test a position in one clock cycle, and let's say we could afford to buy a billion of these, and while we're shopping, let's say we also pick up a time machine and go back 13 billion years to the early days of the universe, and a fleet of starships with which we visit 10 billion galaxies and in each galaxy convince the inhabitants of 10 billion planets to each buy a billion of these computers, and we all start our programs running, managing to distribute the cases perfectly so that no computation is duplicated or wasted. Then we can estimate that we'd be less than 4% done with this one puzzle by now.
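
If you doubt that estimate, here is a back-of-the-envelope check (a sketch; all the numbers come from the story above):

computers = 1e10 * 1e10 * 1e9         # 10 billion galaxies x 10 billion planets x a billion computers each
rate      = computers * 1024 * 10e9   # 1024 cores per computer, one test per cycle at 10 GHz
seconds   = 13e9 * 365.25 * 24 * 3600 # 13 billion years
print('{:.1%} done'.format(rate * seconds / 9**64))  # about 3.6%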

Generate and test is not the algorithm you're looking for. Move along.

Algorithm 2: combinatorial search

In Simon's safe example, he next supposes a defective safe, in which a faint click is heard whenever a dial is set to the right position. With this safe it only takes 100 × 10 trials to discover the correct combination, not 100¹⁰. The moral is that if we have some selective knowledge of components of the safe (or puzzle) that tell us if a partial solution is on track, our search will be much easier.

Fortunately for us, Sudoku is like the defective safe. To be precise, it is like a safe with 81 dials, each with 9 numbers, and sometimes if you put one of the dials in the wrong position you hear a sound, but sometimes not (depending on the positions of the other dials and on how carefully you listen). For example, in the lower-left corner of the tiny puzzle above, if we try 1, 2, or 4 we would immediately get feedback that those digits won't work, because they already appear elsewhere in the bottom row. Therefore, the lower-left corner must be filled by a 3. There are 262,144 ways to fill in the 9 empty squares, but right away we can eliminate 196,608 of them; we need only consider the ones that have a 3 in that position.

Unfortunately for us, Sudoku does not give up all its secrets so easily. Sometimes, when we consider a square, we can't immediately tell what digit to put there. For example, in the upper-left of the tiny puzzle, only 4 can be eliminated as a possibility. That's ok--we're allowing ourselves the use of an eraser, so just guess one of the remaining three possibilities; if it doesn't pan out, erase it and try one of the others. This gives us:

Combinatorial search algorithm: Start filling squares with digits, one at a time, always making sure that the digit we put in a square is not the same as any peer. If there are several possible digits, pick one, but remember the others. If we reach a square where every digit conflicts with a peer, back up and consider a different digit for the previous square. Stop when we successfully fill every square, or give up if we tried all possibilities without finding a solution.
The progress of this algorithm can be described as a search tree, where each node represents a partially-filled (incremental) state of the puzzle, and a branch corresponds to filling in a square with a digit. (We display the new digits in red.) Here is a search tree for the tiny puzzle:

A few things to note about this particular search tree:

The code to implement combinatorial search is straightforward:

def search(grid: Grid) -> Grid:
    """Select an unfilled square, try each possible digit in order, recursively searching.
    When all squares filled: success; when no more digits to try: return Inconsistent for failure."""
    if grid is Inconsistent: 
        return Inconsistent
    s = select_square(grid)
    if s is None: 
        return grid
    for d in possible_values(grid, s):
        result = search(assign(grid[:], s, d))
        if result is not Inconsistent: 
            return result
    return Inconsistent

def select_square(grid: Grid) -> Square:
    "Return the first square that is not filled with a digit; or None if all squares are filled."
    for s in squares:
        if grid[s] not in digitset: return s
    return None

def possible_values(grid: Grid, s: Square) -> {Digit}:
    "A square can be any digit that is not already taken by a peer."
    return digitset - set(grid[p] for p in peers[s])

def assign(grid, s, d):
    "For now, simply assign grid[s] = d."
    grid[s] = d
    return grid

def solve(gridstring):
    "Parse gridstring and search for a solution."
    return search(Grid(gridstring))

Inconsistent = ['?'] * N * N  ## A special grid used to indicate that the puzzle is unsolvable
The key function is search. It obeys the convention that if it is passed Inconsistent (to indicate an invalid grid) it returns Inconsistent, meaning that no solution can be found. Otherwise it selects some unfilled square to work on. If select_square returns None that is actually good news: it means that all the squares are already filled, and we are done: we have a solution, so we just return it. Otherwise, for each possible digit that could fill square s we try to assign that digit to s and search for a solution from there. If some call to search succeeds, return it; but if a digit assignment causes the search to fail, just keep going with the next digit assignment. If all digit assignments fail, then the whole call to search fails, and we back up to the previous recursive call. (Note that the expression grid[:] means to make a copy of the entire grid.)

We call this type of combinatorial search a constraint satisfaction search: think of each square as being a variable that can take on a value (a digit), and the rules of Sudoku as being constraints on the possible values. A state in the search tree is then an assignment of values to some subset of variables in a way that does not violate any constraint. The constraint satisfaction approach has found many fun applications in puzzles and serious applications in problem solving.

Here we see solve (and therefore search) in action:

>>> print(show(solve(puzzle1)))
4 1 7 |3 6 9 |8 2 5
6 3 2 |1 5 8 |9 4 7
9 5 8 |7 2 4 |3 1 6
------+------+------
8 2 5 |4 3 7 |1 6 9
7 9 1 |5 8 6 |4 3 2
3 4 6 |9 1 2 |7 5 8
------+------+------
2 8 9 |6 4 3 |5 7 1
5 7 3 |2 9 1 |6 8 4
1 6 4 |8 7 5 |2 9 3
That's all there is to it! The only reason we don't stop this article now is that the program is still too slow for some purposes. We're not talking billions-of-years slow, but it did take 3 minutes and 45 seconds to solve this puzzle (a rather hard one). On a benchmark of 50 easy puzzles the program was fast enough, taking a total of 9.5 seconds (a rate of 5 puzzles per second, or 5 Hz).

However, if you want to solve a million hard puzzles, you probably don't want to wait a few years, and you would prefer a faster algorithm. Look again at the search tree above, and consider how we can find the solution faster. Here are four general strategies:

  A. We could try to move the solution to the left. Since we go top-down, left-to-right, this means we would find it sooner. This can potentially be done with better value ordering, the order in which the function possible_values considers which digit to try next. If we were able to choose right every time, there would in effect be no search--the solution would be in the lower-left corner of the tree and we would arrive at it directly with no backtracking.
  B. We could try to move the solution closer to the top. If we only fill in one square at a time then the solution will always be at the same level (the number of empty squares in the original puzzle). But if we could fill in several squares at a time, the solution gets closer to the top and we can find it faster. Filling in multiple squares at a time can be accomplished with constraint propagation--using the constraints imposed by the rules of Sudoku to conclude that an unfilled square must take on a certain digit, thus sanctioning us to fill in several squares at once.
  C. We could prune branches (mark them with an X and backtrack) earlier. This can also be done with constraint propagation, but in this case rather than filling in squares, we would notice that some square has no possible digit, and thus that we can backtrack immediately.
  D. We could consider the squares in a better order, a process called variable ordering, which in our program is implemented by the function select_square. At each node in the tree we pick one square to branch on. The current implementation of select_square just considers the squares in order from 0 to 80 and chooses the first unfilled one, but it could be more clever. A different ordering could be better (or worse) for any of the three reasons above.

Algorithm 3: constraint checking with forward checking

We will now consider an approach to strategy (C). The combinatorial search algorithm already prunes the search tree down to size by cutting off a branch as soon as we detect that there is no possible digit for the next square. But we can prune more branches earlier with a process known as forward checking.

Consider this branch of the search tree from above:

As soon as we place a 1 in the upper-left corner, there is an inconsistency: there is no digit that can go in the square in the second row, second column (because 1 is in the same box, 2 is in the same column, and 3 and 4 are in the same row). But our combinatorial search does not discover that until three more levels in the tree, when it gets around to selecting the offending square. The idea of forward checking is that we don't wait: every time we place a digit, we check the unfilled squares and if one of them has no possible digits left we backtrack immediately rather than waiting to select the offending square.

However, there is a cost to forward checking: we need to figure out if any square anywhere in the puzzle has no possible digits left. To do that efficiently we modify the combinatorial search algorithm as follows:

Constraint checking with forward checking algorithm: Initialize each square (each grid[s]) to hold the set of possible digits for that square (that is, all the digits that have not been assigned to a peer). Start filling squares with digits, one at a time, always making sure that the digit we put in a square is not the same as any peer. Whenever we assign a digit d to a square, eliminate that digit from the set of possible digits for all peers of the square. Back up and consider a different digit for the previous square whenever any square is reduced to the empty set of possibilities. Stop when we successfully fill every square, or give up if we tried all possibilities without finding a solution.

Note that grid[s] holds the set of possible digits for square s in the same way that humans use a pencil to write the possible digits for some squares into the squares themselves.

We will describe the implementation in two steps. First, we change the implementation of an unfilled square from a '.' to a set of possible digits. (Each set could be represented by a Python set or list, but I chose instead to use a string of digits. So the value for the upper-left square in the tiny puzzle (before it is assigned a digit) would be '123'.) This convention is set up by the function initialize and we will see that it is maintained by assign. One thing becomes easier: the function possible_values can now just look up the precomputed set of possible digits rather than having to compute it:

def solve(gridstring):
    "Parse gridstring, handle initial constraints, and search for a solution."
    return search(initialize(Grid(gridstring)))

def initialize(puzzle):
    """First set each square in the result grid to have the set of all possible digits.
    Then assign each filled square to the specified digit."""
    assert len(puzzle) == N*N
    grid = N*N * [digitstr]
    for s in squares:
        if puzzle[s] in digitset: grid = assign(grid, s, puzzle[s])
    return grid

def possible_values(grid, s): return grid[s]

Next we'll change assign(grid, s, d) so that it calls the function eliminate to eliminate d as a possibility from all the peers of s and fails if any peer has no remaining possible digits. (We add an extra level of indirection--eliminate calls check which calls forward_check to see if any square has no possible digits--we do this because we will soon introduce other kinds of checks.) Note that the functions search and select_square remain unchanged.

def assign(grid, s, d):
    """Assign grid[s] = d and eliminate d from the peers of s.
    Return the updated grid, or Inconsistent if inconsistency detected."""
    if d not in grid[s]: return Inconsistent # d is not among the possibilities
    grid[s] = d
    if not all(eliminate(grid, p, d) for p in peers[s]): return Inconsistent
    return grid

def eliminate(grid, s, d):
    "Remove d from the possibilities for grid[s]. Return something falsy if checking finds an inconsistency."
    if d not in grid[s]: return grid # Already eliminated d
    grid[s] = grid[s].replace(d, '')
    return check(grid, s, d)

def check(grid, s, d): 
    "Return True iff the constraints on square s after eliminating d are all satisfied."
    return forward_check(grid, s)

def forward_check(grid, s):
    "Fail if s has no possible digit, or one value which is the same as a peer."
    ndigits = len(grid[s])
    return not (ndigits==0 or (ndigits==1 and any(grid[p]==grid[s] for p in peers[s])))
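
With these pieces in place, we can see the pencil-marks convention at work on the tiny puzzle (again with Nbox = 2, so that digitstr is '1234'); the upper-left square ends up holding the three possibilities promised above:

>>> grid = initialize(Grid(tiny))
>>> grid[0]
'123'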

How much does forward checking help? For the easy puzzles, not much. One reason they are called "easy" is that they don't require forward checking. For the harder puzzle1 puzzle the solving speed is improved by a factor of 5. But that's still 43 seconds. Let's keep improving.

Algorithm 4: constraint propagation with arc consistency

With forward checking, every time we eliminate a possible digit from a square we peek ahead to peer squares to see if they will cause a failure. But we can do much more if instead of just peeking at peers, we modify their set of possible digits (when appropriate according to the constraint rules). Modifying a peer can then cause a modification of a peer of a peer, a peer of a peer of a peer, and so on.

This type of cascade affecting multiple squares throughout the puzzle is called constraint propagation, in contrast to the constraint checking that we saw in the previous section. Constraint propagation contributes to strategy (C), pruning earlier, and strategy (B), moving the solution nearer to the top (because we can fill in multiple squares at a time).

For example, suppose we assign grid[s] = '1', and suppose that some peer square s2 has the possibilities '12'. Now we know that s2 can only be '2', so we can make that assignment. Then if some peer of s2 has the possible digits '23', we now know it must take the digit '3', and so on.

This particular type of constraint propagation is called arc consistency; it gets its exotic name from the practice of describing a constraint satisfaction problem as a graph where the nodes are the variables (in Sudoku, the squares) and the arcs are the constraints. We make the arcs consistent one by one, and we can end up traversing multiple links in the graph. In contrast, the forward checking algorithm only peeks ahead one arc in the graph.

The implementation of arc consistency for Sudoku is a simple change to what we already have:

def arc_consistent(grid, s):
    "Return true if s has multiple digits left, or one that we can consistently assign."
    ndigits = len(grid[s])
    return ndigits >= 2 or (ndigits == 1 and assign(grid, s, grid[s]) is not Inconsistent)

def check(grid, s, d): return arc_consistent(grid, s)

Compared to forward checking, arc consistency reduces the time to solve puzzle1 from 43 seconds down to 0.16 seconds (a 266-fold speedup). The 50 easy problems improve from a rate of 5 per second to 282 per second, a 56-fold speedup. In fact, 70% of the easy problems have their search tree pruned to a single node with this approach! Clearly, arc consistency is very effective on Sudoku problems (as it is on many constraint satisfaction problems), and one might be tempted to stop here. But if you want to solve a million hard puzzles, you're still looking at a day or two of compute time, so let's keep going.

Algorithm 5: dual consistency

So far we have thought of each square as a variable that can take on one of 9 possible values (a digit). An alternative, or dual, representation is to think of each digit in each unit as a variable that can take on one of 9 possible values (a square). If we reach the point where a digit has only one possible place to go in a unit, then we can assign it there, and if it has no possible places, then we can fail. We could keep track of the set of possible squares for each of these 243 dual variables, but instead we will recompute the values as part of the checking process:
def dual_consistent(grid, s, d):
    """After eliminating d from grid[s], check each unit of s and make sure there is some
    position in the unit for d. If only one possible place is left for d, assign it."""
    for u in units[s]:
        places_for_d = [s2 for s2 in u if d in grid[s2]]
        nplaces = len(places_for_d)
        if nplaces == 0 or (nplaces == 1 and assign(grid, places_for_d[0], d) is Inconsistent):
            return False
    return True

def check(grid, s, d):
    return arc_consistent(grid, s) and dual_consistent(grid, s, d)

Dual consistency doesn't help on the easy puzzles, but on a set of hard puzzles, top870, dual consistency gives a 7-fold speedup over arc consistency.

Algorithm 6: naked pairs

The literature on Sudoku solving strategies includes some colorful names; one of them is the so-called naked pairs strategy. It involves finding two squares in a unit which each have the same two possible digits. For example, suppose two squares in a box have both been whittled down to the possible digits 3 and 5. We don't know which is 3 and which is 5, but either way, we know that no other square in the box can be either 3 or 5, so both 3 and 5 can be eliminated from every other square. Here is the code:
def naked_pairs(grid, s):
    """Look for two squares in a unit with the same two possible digits. 
    For example, if s and s2 both have the value '35', then we know that 3 and 5
    must go in those two squares. We don't know which is which, but we can eliminate 
    3 and 5 from any other square s3 that is in the unit."""
    vals = grid[s]
    if len(vals) != 2: return True
    for u in units[s]:
        for s2 in u:
            if s2 != s and grid[s2] == vals:
                # Found a naked pair: s and s2; remove their two vals from other squares in the unit
                for s3 in u:
                    if s != s3 != s2:
                        if not all(eliminate(grid, s3, v) for v in vals):
                            return False
    return True

def check(grid, s, d):
    return arc_consistent(grid, s) and dual_consistent(grid, s, d) and naked_pairs(grid, s)

Naked pairs is also an effective strategy -- on top870, adding naked pairs gives us a speed almost twice as fast as dual consistency, and about 13 times faster than arc consistency alone.

Algorithms 7-9: variable ordering

Note that our select_square function does not do much in the way of selection: it just returns the first square it finds that has not been filled yet. Could being more clever about selecting a square speed things up? This is the variable ordering problem. We'll stick with arc consistency plus dual consistency plus naked pairs, but we'll compare the variable ordering strategy we've been using so far (which we will call firstV because it chooses the first possible variable) with three alternative selection strategies: randomV, which chooses a random unfilled square; MinRV, which chooses a square with the Minimum number of Remaining Values (that is, if there is a square with only 2 possible values we would choose that over one with 3 or more values); and MaxRV, which reverses that choice. Here is the code; to run it, we would assign select_square to one of the following functions:
import random
 
def select_square_randomV(grid):
    "Return a random unfilled square."
    unfilled = [s for s in squares if len(grid[s]) > 1]
    return random.choice(unfilled) if unfilled else None

def select_square_minRV(grid):
    "Return an unfilled square with the fewest possible digits; or None if no unfilled squares."
    unfilled = [s for s in squares if len(grid[s]) > 1]
    return min(unfilled, key=lambda s: len(grid[s])) if unfilled else None

def select_square_maxRV(grid):
    "Return an unfilled square with the most possible digits; or None if no unfilled squares."
    unfilled = [s for s in squares if len(grid[s]) > 1]
    return max(unfilled, key=lambda s: len(grid[s])) if unfilled else None
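
For example, to adopt the MinRV strategy, we simply rebind the name:

select_square = select_square_minRV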

MinRV is generally held to be a good strategy; the idea is that if there are only two possibilities remaining then on average, half the time you will have moved the solution into the left-most branch of the search tree. Benchmarks show that MinRV is in fact the best variable ordering strategy for my implementation of Sudoku; it gives a speedup of about 20% over the first-variable ordering (and even more over random ordering and MaxRV).

Algorithms 10-12: value ordering

Now we will consider value ordering: given that we have selected a variable, in what order should we consider the possible digits? The current definition of possible_values just goes in increasing order of digits: 1,2,3, etc. In the literature on constraint satisfaction there is the concept of the least constraining value: the value that rules out the fewest choices for the neighbors (peers). We will benchmark least constraining value, most constraining value, and random ordering. Here is the code; to test these algorithms, assign possible_values to one of the three possibilities listed here:
def constraining_value(grid, s, d):
    "Count the number of peers of s that would have 0 or 1 digit left without d."
    return sum((d in grid[s2] and len(grid[s2]) <= 2) for s2 in peers[s])

def possible_values_most_constrained(grid, s):
    "Most constrained value first."
    return sorted(grid[s], key=lambda d: -constraining_value(grid, s, d))

def possible_values_least_constrained(grid, s):
    "Least constrained value first."
    return sorted(grid[s], key=lambda d: constraining_value(grid, s, d))

def possible_values_random(grid, s):
    "Values in random order."
    vals = list(grid[s])
    random.shuffle(vals)
    return vals
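
Likewise, to test one of these orderings we rebind possible_values, for example:

possible_values = possible_values_least_constrained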

It turns out that value ordering makes no consistent difference in the speed of execution. Perhaps that is because I re-compute constraining_value for every candidate digit each time. If I kept a sorted queue of precomputed constraining values, I might see a speedup.

Benchmarks

I have collected some benchmark problems from around the web that will be used to measure the performance of algorithms. Thanks particularly to Gordon Royle, Peter Cock, Guenter Stertenbrink and Jean Charles Meyrignac. Here is a description of the files:

Filename   N puzzles   Description
puzzle1            1   hard puzzle, called puzzle1 above
easy50            50   easy puzzles
hardest           11   puzzles from around the web advertised as "hardest"
top95             95   hard puzzles
top870           870   hard puzzles
top2365        2,365   hard puzzles
subig20       17,445   hard puzzles
min17         49,151   hard puzzles with 17 filled squares
ALL           69,091   all of the above (duplicates eliminated)
ALL2         138,182   puzzles from ALL, plus the reverse of each of them

I ran the benchmark code on five algorithms: combinatorial search (comb), forward checking (FC), and arc consistency by itself (AC), with dual consistency (AC+Dual), and with dual plus naked pairs (AC+D+NP). For each benchmark file we can see the name of the file, the number of puzzles, the average speed of solving (expressed two ways: as a rate in Hertz (puzzles solved per second) and as the average number of seconds per puzzle), and the maximum time across all puzzles. Here are the results:

version     name        N       Hz      avg      max
   comb  puzzle1        1        0  225.729  225.729

     FC  puzzle1        1        0   42.132   42.132
     FC   easy50       50        5    0.172    0.987

     AC  puzzle1        1        6    0.159    0.159
     AC   easy50       50      285    0.004    0.025
     AC   top870      870        4    0.221   16.197

AC+Dual  puzzle1        1       62    0.016    0.016
AC+Dual   easy50       50      188    0.005    0.009
AC+Dual   top870      870       33    0.030    1.273

AC+D+NP  puzzle1        1      147    0.007    0.007
AC+D+NP   easy50       50      183    0.005    0.007
AC+D+NP   top870      870       58    0.017    0.596

We can see that on the hard puzzles (top870) the combination of all three checking procedures (AC+D+NP) is the fastest; on the easy puzzles the more complex checking is not needed, and AC alone is good enough.

Next, variable ordering. The benchmarks below uphold the idea that MinRV is a good strategy; it decreases the average time by a modest amount and the maximum time by a lot. (On the easy50 benchmark, MinRV falls just slightly behind firstV, indicating again that on easy puzzles, simple is better, but the complexity pays off on harder puzzles.)

version     name        N       Hz      avg      max
 firstV   easy50       50      183    0.005    0.007
 firstV   top870      870       58    0.017    0.596

randomV   easy50       50      165    0.006    0.019
randomV   top870      870       25    0.039    1.762

  MaxRV   easy50       50      164    0.006    0.017
  MaxRV   top870      870        4    0.223   19.664

  MinRV   easy50       50      182    0.005    0.008
  MinRV   top870      870       73    0.014    0.185
And here are the value ordering results; we see that value ordering makes almost no difference.
 version    name        N       Hz      avg      max
 1stVal   easy50       50      182    0.005    0.008
 1stVal   top870      870       73    0.014    0.185

MostCVl   easy50       50      184    0.005    0.008
MostCVl   top870      870       72    0.014    0.193

LeastCV   easy50       50      181    0.006    0.009
LeastCV   top870      870       72    0.014    0.194

RandVal   easy50       50      185    0.005    0.008
RandVal   top870      870       75    0.013    0.209

Algorithm 13: Concurrency

One thing struck me as I was running these benchmarks: my CPU would be reporting 96% to 98% usage for Python, which may sound good, but my computer actually has six cores, so it could be reporting nearly 600%. In Python 3.2 there is a nice facility for executing several non-communicating functions concurrently, using a separate Python process for each one.
import concurrent.futures

def solve_list(puzzles, max_workers=6):
    "Solve all the puzzles in parallel processes; return True iff all are solved."
    results = concurrent.futures.ProcessPoolExecutor(max_workers).map(solve, puzzles)
    return Inconsistent not in results
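
For example, here is a sketch of how solve_list might be driven (assuming the benchmark file holds one gridstring per line; the __main__ guard is required for process-based executors on some platforms):

if __name__ == '__main__':
    puzzles = open('ALL.txt').read().splitlines()
    print(solve_list(puzzles))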

Algorithm 14: Different compiler/language

Therefore I decided to port the program to Java, which has much better facilities for concurrency. The translation is fairly straightforward, but there are some changes; some because of the nature of Java, and some because I wanted to take advantage of this rewrite opportunity to improve efficiency. Here are the main differences between the Python version and the Java version:

We won't go over the Java program line-by-line, since it is similar to the Python program. Instead we'll skip to the conclusion: on the ALL.txt benchmark file, the Java version with just one thread solves 5,625 puzzles per second, which is 95 times faster than the Python version.

When we start adding threads, the performance increases almost linearly with the number of threads up to 6 (we reach a speedup of 5.66 times over a single thread). That is about what I would expect with 6 CPUs.

As we add a few more threads we continue to run faster, but the slope of the curve is about cut in half from 7 up to 10 threads (where we have a speedup of up to 7.26). From there we have a long flat plateau where the speedup remains between 7 and 7.5 no matter how many threads we add. The best time recorded was 43,873 puzzles per second (with 40 threads, but random variation suggests that 40 threads may be no better than 30 or 20). This is 700 times faster than the Python program.

Here we see the performance curve, with two sample runs done for each number of threads, and with the results expressed two ways: by total number of puzzles solved per second and by the speedup over a single thread:

I have to say I was surprised by two things: that the curve continued to rise past 6 threads, and that the plateau was so long. I expected the curve to peak at 6 and then drop off. In retrospect, I think both surprises can be explained by the facts that thread creation is fast, thread overhead is small, and each core can take advantage of instruction level parallelism. I have a 6-Core Intel Xeon "Westmere" processor, which can execute (various parts of) several instructions simultaneously in a pipeline process. Apparently, multiple threads running at the same time (on each core) can keep this pipeline filled better than one thread. Another advantage of having more than 6 threads is that it is less likely that one thread will take an extra long time (because it happens to have a few very difficult puzzles). With more threads, each sublist is shorter, and thus the expected time when only one thread is left, working on the remaining puzzles, is shorter.

One more thing: we need to make clear the distinction between improving throughput and latency. Throughput is the total number of puzzles that can be solved in a given time. The multithreading approach increased throughput by a factor of 7. Latency is the time it takes to solve a single puzzle. Latency is not helped at all by multithreading. It takes the same amount of time to solve a single puzzle with single threading as with multithreading; it is just that in multithreading we are doing 6 or 7 puzzles at once.

Alternative approaches

At 40,000 puzzles per second I've exceeded my expectations for this project, and I'm not tempted to try other approaches. But for completeness, I will mention three alternative approaches and describe why I decided not to try each one:

Don't let my reluctance stop you: it would be a great project to implement any of these techniques for Sudoku and see how they perform.

Every puzzle?

I made the bold claim that I could solve every Sudoku puzzle. This was meant to emphasize the ideas that the constraint satisfaction search is (1) exhaustive and (2) fast. I've run it on a few hundred thousand of the hardest problems I could find with good results. But does that mean I can solve every puzzle?

There might be a bug in the program that causes it to crash or give erroneous results for some puzzle. That seems less likely every time the program successfully processes another 100,000 puzzles, but, even though the program does pass the test suite, I have not yet proved the program correct, nor provided complete test coverage of every line.

There might be a puzzle that takes a really long time to solve. Let's consider just the Java version for this, since it is so much faster. I modified Sudoku.java to keep track of the execution time of each individual puzzle, and applied it to all the puzzles in ALL.txt, along with the reversal of each of those puzzles (that is, put the last square first, etc.). That gives us 138,182 puzzles. The average time to solve these puzzles is 184 microseconds, and the maximum time is 34,762 microseconds (or 0.034762 seconds). But if we looked at 100 billion puzzles instead of 100 thousand, would some puzzle take significantly more time? To try to answer this I made a chart of the maximum time as a function of the number of puzzles attempted. I also show the average time and standard deviation:

N        mean  stddev      max    (times in microseconds)
10         94      14      120
100        94      29      288
1000      100      48      617
10000     134     267   17,963
100000    184     447   34,762
138182    184     431   34,762

We can see that the maximum time tends to about double for each 10-fold increase in the number of puzzles solved, except for one big jump where it goes up by a factor of 30. This suggests that if we increase the number of puzzles by a factor of 10⁶ (to 100 billion puzzles) the maximum time might increase by a factor of 2⁶ (to about 2 seconds), or there might be another 30-fold jump (to about 30 seconds). Or not--we just don't have enough data, nor do we have a reasonable model of the underlying distribution we are sampling from.

Here's a chart of the base 10 log of the number of puzzles at each execution time interval up to 0.01 seconds. The peak point says that there are about 10⁵ or 100,000 puzzles that took between 50 and 100 microseconds:

It might help to look at the problem slightly differently: instead of measuring the time to completion, let's measure the size of the search tree, which we can compute just by incrementing a counter every time we make a recursive call to search. For the 138,182 puzzles, the minimum tree size was 1 and the maximum 6,724. The median was 6 and the mean 20. To get a better idea of the distribution, here is a table of percentiles for the distribution of tree sizes:

Percentile:   1%   10%   25%   50%   75%   90%   99%   99.9%   99.99%   99.999%   100%
Tree size:     1     1     3     6    15    38   245   1,041    3,122     4,102  6,724
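
The counting itself is a one-line change to search (a sketch of the instrumentation; reset nodes to 0 before each puzzle):

nodes = 0  # size of the search tree for the current puzzle

def search(grid):
    global nodes
    nodes += 1
    if grid is Inconsistent:
        return Inconsistent
    s = select_square(grid)
    if s is None:
        return grid
    for d in possible_values(grid, s):
        result = search(assign(grid[:], s, d))
        if result is not Inconsistent:
            return result
    return Inconsistent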

And here is a histogram of the number of puzzles at each search tree size up to size 100 nodes (which accounts for 96.6% of the puzzles):

In my previous attempt to explain Sudoku solvers, I generated a million random puzzles, almost all of which were ill-formed (that is, they didn't have exactly one solution). The toughest one took 189 seconds for the old Python program to solve, 101 seconds for the new Python program described here, and 1.03 seconds for the Java program. My random procedure also generated some impossible puzzles (ones with no solutions); the toughest one of those is really hard: it requires 11.5 seconds for the Java program. In terms of tree size, the toughest solvable puzzle had 162,310 nodes, and the impossible one had 2,112,362 nodes.

I'm fairly confident in saying that if you generate random puzzles according to any reasonable random distribution, fewer than one in a billion of them will take as long as a minute for the Java program to solve. But if you used a clever adversarial approach to find the hardest possible puzzle particularly designed to thwart my program (the puzzle where the solution is in the lower-right corner of the largest possible search tree), I just don't know how long it would take or how big the tree would be. (If I suspected such an adversarial attack, I would use a random variable-ordering procedure to defend against it.)

Deduction versus search revisited

We started off by saying that humans use deduction to solve Sudoku puzzles, and computers use backtracking search. But actually the distinction between the two approaches is more subtle than it may seem. We used the metaphor of an eraser to say that search is different because search needs an eraser and deduction does not (because logic is monotonic).

But if you have enough scratch paper, you don't need an eraser. Think about it: I said that when there are two choices for some square, we write down the first one, recursively search to see if we can find a solution, and if not, erase what we did and try the other possible digit. But we don't really need an eraser at all; we could just use pieces of scratch paper to try out one alternative and then the other.

Another way to look at it: consider the (human) Sudoku strategy known as naked quads, which is similar to naked pairs, but involves finding four squares in a unit that share the same four digits. If we can find such an arrangement, we can make a logical deduction that those four digits appear nowhere else in the unit. But just finding the four squares is a form of search. If you're going to allow a search through four squares as a prerequisite to a deduction, then you should equally allow a search through a search tree of four nodes--and 38% of all the puzzles tested can be completely solved in a four node search tree. So the moral is that search is not qualitatively different than deduction--it just seems quantitatively different when the search trees are larger than a few nodes. From a logical point of view, even the generate-and-test search can be seen as a form of deduction--it is just a very complex deduction involving 81 squares at a time.

Why?

Why did I do this? As my friend Ben Laurie, the computer security expert, has stated, Sudoku is "a denial of service attack on human intellect". Many people (including my wife) were infected by the virus, and I wanted to convince them that the problem didn't need any more of their time. I have had people tell me that it worked for them, so I've made the world more productive. And perhaps along the way I've taught something about Python, combinatorial search, constraint satisfaction, Java, and concurrency.


Peter Norvig