Solving Every Sudoku Puzzle Quickly

In this essay I develop and implement an algorithm to solve every Sudoku puzzle. It turns out to be quite straightforward (one page of code to solve 5 easy puzzles per second, or five pages to solve 5000 hard puzzles per second) using two ideas: search and constraint satisfaction.

Sudoku notation and preliminary concepts

A Sudoku puzzle is a square grid of 81 squares, which is composed of a 3x3 array of boxes, where each box is a 3x3 array of squares. We use the term unit for a row, column, or box of 9 squares. We define the peers of a square s to be all those squares that share a unit with s. We define Sudoku puzzles and solutions as follows:
A puzzle is a grid in which a subset of squares have been filled with a digit 1 to 9 and the rest are unfilled.

A solution to a puzzle is a grid where every square is filled, every unit contains all of the 9 digits, and the filled squares of the original puzzle remain unchanged in the solution.

A well-formed puzzle has exactly one solution.

Here is a typical well-formed puzzle and its solution:

 4 . . |. . . |8 . 5     4 1 7 |3 6 9 |8 2 5 
 . 3 . |. . . |. . .     6 3 2 |1 5 8 |9 4 7
 . . . |7 . . |. . .     9 5 8 |7 2 4 |3 1 6 
 ------+------+------    ------+------+------
 . 2 . |. . . |. 6 .     8 2 5 |4 3 7 |1 6 9 
 . . . |. 8 . |4 . .     7 9 1 |5 8 6 |4 3 2 
 . . . |. 1 . |. . .     3 4 6 |9 1 2 |7 5 8 
 ------+------+------    ------+------+------
 . . . |6 . 3 |. 7 .     2 8 9 |6 4 3 |5 7 1 
 5 . . |2 . . |. . .     5 7 3 |2 9 1 |6 8 4 
 1 . 4 |. . . |. . .     1 6 4 |8 7 5 |2 9 3 

Let's start to think about how to write a program to solve Sudoku. Clearly we will need to represent digits, squares and units, and we will see that it is useful (but not necessary) to represent peers. We will number the columns 0 to 8 (left to right) and the rows also 0 to 8 (top to bottom). The squares will be numbered 0 to 80 (left-to-right, top-to-bottom). Below we show the numbers of the squares and the 3 units and 20 peers for square number 19:

                                00 01 02| 03 04 05| 06 07 08
                                09 10 11| 12 13 14| 15 16 17
                                18 19 20| 21 22 23| 24 25 26
                               ---------+---------+---------
                                27 28 29| 30 31 32| 33 34 35
                                36 37 38| 39 40 41| 42 43 44
                                45 46 47| 48 49 50| 51 52 53
                               ---------+---------+---------
                                54 55 56| 57 58 59| 60 61 62
                                63 64 65| 66 67 68| 69 70 71
                                72 73 74| 75 76 77| 78 79 80



    01   |         |                    |         |            00 01 02|         |         
    10   |         |                    |         |            09 10 11|         |         
    19   |         |            18 19 20| 21 22 23| 24 25 26   18 19 20|         |         
---------+---------+---------  ---------+---------+---------  ---------+---------+---------
    28   |         |                    |         |                    |         |         
    37   |         |                    |         |                    |         |         
    46   |         |                    |         |                    |         |         
---------+---------+---------  ---------+---------+---------  ---------+---------+---------
    55   |         |                    |         |                    |         |         
    64   |         |                    |         |                    |         |         
    73   |         |                    |         |                    |         |         

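This numbering makes converting between square numbers and (row, column) coordinates a matter of simple arithmetic; a quick sketch assuming N = 9:

```python
N = 9
def square(r, c): return N*r + c
def row_col(s): return divmod(s, N)

assert square(2, 1) == 19      # row 2, column 1 is square 19, as in the diagram
assert row_col(19) == (2, 1)
assert row_col(80) == (8, 8)   # the bottom-right corner
```
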
The numbering scheme suggests representing a grid as a one-dimensional array of length 81. One might think that a two-dimensional 9x9 array would be the obvious data structure. But two factors argue against this: First, it will be more common to iterate over all the squares in a grid (and less common to iterate over the rows and then the columns). Second, two-dimensional arrays are a bit more complicated to deal with in the programming languages we will use.

Note: It is also possible to consider puzzles of different sizes. Instead of having boxes of size 3x3, each box could be 2x2 or 4x4 or 5x5, etc. We'll use the variable name Nbox for the box size, and N for the whole puzzle size. So the traditional Sudoku puzzle has Nbox = 3 and N = 9, and therefore 81 squares; a puzzle with Nbox = 5 has N = 25 and Nbox^4 = 625 squares. Of course, each puzzle must have N "digits" as well (we can use letters as digits when N > 9).

We can implement the notions of digits, squares, units, and peers in the programming language Python as follows:

def grouper(n, items):
    "Group items n at a time: grouper(3, 'abcdefgh') => ['abc', 'def', 'gh']"
    return [items[i:i+n] for i in range(0, len(items), n)]

def cross(Rows, Cols):
    "All squares in these rows and columns"
    return [N*r+c for r in Rows for c in Cols]

Nbox     = 3         # Each box is of size Nbox x Nbox
N        = Nbox*Nbox # The whole grid is of size N x N
rows     = range(N)
cols     = rows
squares  = range(N*N)
digits   = '123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ!'[:N] # E.g., '123456789' for N = 9
digitset = set(digits)
rowunits = [cross([r], cols) for r in rows]
colunits = [cross(rows, [c]) for c in cols]
blocks   = grouper(Nbox, rows) # E.g., [[0, 1, 2], [3, 4, 5], [6, 7, 8]] for N = 9
boxunits = [cross(rblock, cblock) for rblock in blocks for cblock in blocks]
allunits = rowunits + colunits + boxunits
units    = [[u for u in allunits if s in u] for s in squares]
peers    = [set(s2 for u in units[s] for s2 in u if s2!=s) for s in squares]
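
As a sanity check (reproducing the definitions so the snippet runs on its own), the counts come out as advertised: 27 units of 9 squares each, and 3 units and 20 peers per square:

```python
def grouper(n, items):
    return [items[i:i+n] for i in range(0, len(items), n)]

def cross(Rows, Cols):
    return [N*r+c for r in Rows for c in Cols]

Nbox = 3; N = Nbox*Nbox
rows = range(N); cols = rows
squares = range(N*N)
rowunits = [cross([r], cols) for r in rows]
colunits = [cross(rows, [c]) for c in cols]
blocks   = grouper(Nbox, rows)
boxunits = [cross(rb, cb) for rb in blocks for cb in blocks]
allunits = rowunits + colunits + boxunits
units    = [[u for u in allunits if s in u] for s in squares]
peers    = [set(s2 for u in units[s] for s2 in u if s2 != s) for s in squares]

assert len(allunits) == 27                        # 9 rows + 9 columns + 9 boxes
assert all(len(u) == 9 for u in allunits)
assert all(len(units[s]) == 3 for s in squares)
assert all(len(peers[s]) == 20 for s in squares)  # 8 row + 8 column + 4 box-only
assert {1, 10, 28, 37, 46, 55, 64, 73} < peers[19]  # the column peers of square 19
```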

The next step is to define the Sudoku playing grid. We will actually define two representations for grids:
A gridstring is a string describing a puzzle, convenient for input and output: digits stand for filled squares, periods (or zeros) for empty squares, and all other characters are ignored.

A grid is a list of N*N values, one per square; this is the representation the rest of the program computes with.

As an example, each of the following three gridstrings represents the same puzzle:
"4.....8.5.3..........7......2.....6.....8.4......1.......6.3.7.5..2.....1.4......"

"""
400000805
030000000
000700000
020000060
000080400
000010000
000603070
500200000
104000000"""

"""
4 . . |. . . |8 . 5
. 3 . |. . . |. . .
. . . |7 . . |. . .
------+------+------
. 2 . |. . . |. 6 .
. . . |. 8 . |4 . .
. . . |. 1 . |. . .
------+------+------
. . . |6 . 3 |. 7 .
5 . . |2 . . |. . .
1 . 4 |. . . |. . .
"""
Here is the function parse that takes a gridstring as input and returns a grid; and the inverse function, show, which converts a grid to a gridstring in pretty 2D format:
def parse(gridstring):
    """Convert a string into a grid: a list of values. Accepts digits (or '.' or '0'
    for empty); all other characters ignored. In result, '.' used for empty."""
    gridstring = gridstring.replace('0', '.')
    grid = [c for c in gridstring if c in digits or c == '.']
    assert len(grid) == N*N
    return grid

def show(grid):
    "Convert a grid to a gridstring laid out in 2D, with lines between boxes."
    return "INCONSISTENCY" if grid is False else show_template.format(*grid)

show_line = '\n' + '+'.join(['--'*Nbox]*Nbox) + '\n'
show_template = show_line.join(['\n'.join(['|'.join(['{} '*Nbox]*Nbox)]*Nbox)]*Nbox)
And here is an example of the functions at work:
>>> puzzle1 = "4.....8.5.3..........7......2.....6.....8.4......1.......6.3.7.5..2.....1.4......"
>>> parse(puzzle1)
['4', '.', '.', '.', '.', '.', '8', '.', '5',
 '.', '3', '.', '.', '.', '.', '.', '.', '.',
 '.', '.', '.', '7', '.', '.', '.', '.', '.',
 '.', '2', '.', '.', '.', '.', '.', '6', '.',
 '.', '.', '.', '.', '8', '.', '4', '.', '.',
 '.', '.', '.', '.', '1', '.', '.', '.', '.',
 '.', '.', '.', '6', '.', '3', '.', '7', '.',
 '5', '.', '.', '2', '.', '.', '.', '.', '.',
 '1', '.', '4', '.', '.', '.', '.', '.', '.']

>>> print(show(parse(puzzle1)))
4 . . |. . . |8 . 5
. 3 . |. . . |. . .
. . . |7 . . |. . .
------+------+------
. 2 . |. . . |. 6 .
. . . |. 8 . |4 . .
. . . |. 1 . |. . .
------+------+------
. . . |6 . 3 |. 7 .
5 . . |2 . . |. . .
1 . 4 |. . . |. . .

>>> ''.join(parse(puzzle1))
'4.....8.5.3..........7......2.....6.....8.4......1.......6.3.7.5..2.....1.4......'

>>> ''.join(parse(puzzle1)) == puzzle1
True
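
The template machinery scales with Nbox. A quick check at Nbox = 2 (the 16-character values string here is a hypothetical solved 4x4 grid):

```python
Nbox = 2
show_line = '\n' + '+'.join(['--'*Nbox]*Nbox) + '\n'
show_template = show_line.join(['\n'.join(['|'.join(['{} '*Nbox]*Nbox)]*Nbox)]*Nbox)

assert show_line == '\n----+----\n'
assert show_template.count('{}') == Nbox**4   # one slot per square: 16
out = show_template.format(*'2314413214233241')
assert out.splitlines()[0] == '2 3 |1 4 '
```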

Now let's define what it means to be a solution: as we implied before, a solution is a grid where every unit is filled with a permutation of the digits 1 to 9, and the filled squares of the original puzzle remain unchanged in the solution:

def is_solution(grid, puzzle):
    """A grid is a solution to a puzzle if each unit is a permutation
    of the digits and if the filled squares remain unchanged."""
    filled_squares = [s for s in squares if puzzle[s] in digitset]
    def is_permutation(unit): return set(grid[s] for s in unit) == digitset
    return (grid is not False
            and all(is_permutation(u) for u in allunits)
            and all(grid[s]==puzzle[s] for s in filled_squares))
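
As a quick check of is_solution, here is a small 4x4 instance (Nbox = 2), with just enough of the earlier definitions reproduced so the snippet runs on its own; the puzzle and solution are written as flat gridstrings:

```python
def cross(Rows, Cols): return [N*r+c for r in Rows for c in Cols]

Nbox = 2; N = Nbox*Nbox
rows = cols = range(N)
squares = range(N*N)
digitset = set('1234')
blocks = [range(i, i+Nbox) for i in range(0, N, Nbox)]
allunits = ([cross([r], cols) for r in rows] +
            [cross(rows, [c]) for c in cols] +
            [cross(rb, cb) for rb in blocks for cb in blocks])

def is_solution(grid, puzzle):
    filled_squares = [s for s in squares if puzzle[s] in digitset]
    def is_permutation(unit): return set(grid[s] for s in unit) == digitset
    return (grid is not False
            and all(is_permutation(u) for u in allunits)
            and all(grid[s] == puzzle[s] for s in filled_squares))

puzzle   = list('...44.3..4...241')   # 16 squares, '.' for empty
solution = list('2314413214233241')
assert is_solution(solution, puzzle)
assert not is_solution(puzzle, puzzle)          # unfilled squares remain
assert not is_solution(list('1111'*4), puzzle)  # units are not permutations
```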

Searching for solutions

Humans solve Sudoku problems by using logical deduction to infer that a certain digit is (or is not) the value of a square; such deductions are repeated until all squares are filled with a single known digit. If the inference rules are sound and complete and the puzzle is well-formed, the human will only need a pencil--never an eraser. We could write an algorithm to do deductions like that, but the trick is in knowing when you have enough inference rules to solve every possible puzzle.

Therefore we will take a different route. Technically, it is known as model checking rather than deduction, and it consists of searching for a solution that works by trying out possibilities one by one. We will describe a series of search algorithms, starting with the hopelessly inefficient and ending with one that can solve over 5000 of the hardest Sudoku puzzles per second. However, when implemented by a human, this strategy requires an eraser as well as a pencil.

Algorithm 1: generate and test

Here is the simplest possible search algorithm, which we call generate and test:
First fill in all the empty squares with digits. Then check to see if we have a solution. If not, then fill the empty squares with a different combination of digits. Repeat with different combinations until a solution is found.

Sometimes generate and test is the best you can do. Herb Simon, in his classic book The Sciences of the Artificial (page 194), describes the task of opening a safe that has ten dials, each with 100 numbers on it. If the safe is well-designed there is no better approach than trying all 100^10 combinations.

Here is the implementation of generate and test for Sudoku:

def solve(gridstring):
    "Generate every combination of digits for the empty squares; test each one."
    from itertools import product
    puzzle = parse(gridstring)
    for candidate in product(*[(v if v in digits else digits) for v in puzzle]):
        if is_solution(candidate, puzzle):
            return candidate
    return None
And here it is solving a tiny puzzle (for this 4x4 example we first set Nbox = 2). There are nine empty squares, so generate and test first tries to fill them with the digits 111111111, then 111111112, then 111111113, and so on until a solution is found with 231121233:
>>> tiny = '...4 4.3. .4.. .241'

>>> print(show(parse(tiny)))
. . |. 4 
4 . |3 . 
----+----
. 4 |. . 
. 2 |4 1 

>>> print(show(solve(tiny)))
2 3 |1 4 
4 1 |3 2 
----+----
1 4 |2 3 
3 2 |4 1 

It took 0.8 seconds to solve this tiny 4x4 puzzle. Now consider the 9x9 puzzle defined above as puzzle1. It has 17 filled squares, and thus 64 empty squares. So there are 9^64 possible combinations (about 10^61) to consider. How long will it take to run through them?

Suppose we have access to a secret new custom 1024 core 10 GHz CPU that features special hardware instructions that can generate and test a position in one clock cycle, and let's say we could afford to buy a billion of these, and while we're shopping, let's say we also pick up a time machine and go back 13 billion years to the early days of the universe, and a fleet of starships with which we visit 10 billion galaxies and in each galaxy convince the inhabitants of 10 billion planets to each buy a billion of these computers, and we all start our programs running, managing to distribute the cases perfectly so that no computation is duplicated or wasted. Then we can estimate that we'd be less than 4% done with this one puzzle by now.
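
That estimate checks out; here is the arithmetic, using the constants from the story above:

```python
positions = 9**64                    # candidate fillings of the 64 empty squares
per_cpu   = 1024 * 10e9              # 1024 cores at 10 GHz, one test per cycle
cpus      = 10e9 * 10e9 * 1e9        # 10B galaxies x 10B planets x 1B CPUs each
seconds   = 13e9 * 365.25 * 24 * 3600  # 13 billion years
fraction  = per_cpu * cpus * seconds / positions
assert 0.03 < fraction < 0.04        # less than 4% done
```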

Generate and test is not the algorithm you're looking for. Move along.

Algorithm 2: incremental search

In Simon's safe example, he next supposes a defective safe, in which a faint click is heard whenever a dial is set to the right position. With this safe it only takes 100 × 10 trials to discover the correct combination, not 100^10. If we have some selective knowledge of components of the safe (or puzzle), our search will be much easier.

Fortunately for us, Sudoku is more like the defective safe. To be precise, it is like a safe with 81 dials, each with 9 numbers, and sometimes if you put one of the dials in the wrong position you hear a sound, but sometimes not (depending on the positions of the other dials). For example, in the lower-left corner of the tiny puzzle above, if we try 1, 2, or 4 we would immediately get feedback that those digits won't work, because they already appear elsewhere in the bottom row. Therefore, the lower-left corner must be filled by a 3. There are 262,144 ways to fill in the 9 empty squares, but right away we can eliminate 196,608 of them; we need only consider the ones that start with a 3.

Unfortunately for us, Sudoku does not give up all its secrets so easily. Sometimes, when we consider a square, we can't immediately tell what to put there. For example, in the upper-left of the tiny puzzle, only 4 can be eliminated as a possibility. That's ok--we're allowing ourselves the use of an eraser, so just guess one of the remaining three possibilities; if it doesn't pan out, erase it and try one of the others. This gives us the incremental search algorithm:

Start filling squares with digits, always making sure that the digit we put in a square is not the same as any peer. If we reach a square where every digit conflicts with a peer, back up and consider a different digit for the previous square. Stop when we successfully fill every square, or give up if we tried all possibilities without finding a solution.
The progress of this algorithm can be described as a search tree, where each node represents a partially-filled (incremental) state of the puzzle, and a branch corresponds to filling in a square with a digit. (We display the new digits in red.) Here is a search tree for the tiny puzzle:

A few things to note about this search tree:

The code to implement incremental search is straightforward:

def search(grid):
    """Select an unfilled square, try each possible value in order, recursively searching.
    When all squares filled: success; when no more digits to try: return False for failure."""
    if grid is False: return False
    s = select_square(grid)
    if s is None: return grid
    for d in possible_values(grid, s):
        result = search(assign(grid[:], s, d))
        if result: return result
    return False

def select_square(grid):
    "Return the first square that is not filled yet; or None if all squares are filled."
    for s in squares:
        if grid[s] == '.': return s
    return None

def possible_values(grid, s):
    "A square can be any digit that is not already taken by a peer."
    return digitset - set(grid[p] for p in peers[s])

def assign(grid, s, d):
    "For now, simply assign grid[s] = d."
    grid[s] = d
    return grid

def solve(gridstring):
    "Parse gridstring and search for a solution."
    return search(parse(gridstring))
The key function is search. It obeys the convention that if it is passed False (to indicate an invalid grid) it returns False, meaning that no solution can be found. Otherwise it selects some unfilled square to work on. If select_square returns None that is actually good news: it means that all the squares are already filled, and we are done: we have a solution, so we just return it. Otherwise, the for loop says that for each possible digit that could fill square s we try to assign that digit to s and search for a solution from there. If some call to search succeeds, return it; but if a digit assignment causes the search to fail, just keep going with the next digit assignment. If all digit assignments fail, then the whole call to search fails, and we back up to the previous recursive call.

Here we see solve (and therefore search) in action:

>>> print(show(solve(puzzle1)))
4 1 7 |3 6 9 |8 2 5
6 3 2 |1 5 8 |9 4 7
9 5 8 |7 2 4 |3 1 6
------+------+------
8 2 5 |4 3 7 |1 6 9
7 9 1 |5 8 6 |4 3 2
3 4 6 |9 1 2 |7 5 8
------+------+------
2 8 9 |6 4 3 |5 7 1
5 7 3 |2 9 1 |6 8 4
1 6 4 |8 7 5 |2 9 3
That's all there is to it! The only reason we don't stop this article now is that the program is still too slow for some purposes. We're not talking billions-of-years slow, but it did take 3 minutes and 45 seconds to solve this puzzle (a rather hard one). On a benchmark of 50 easy puzzles the program took a total of 9.5 seconds; a rate of 5 Hz. Not bad. If you're happy with that speed, you can stop reading now.

However, if you wanted to solve a million puzzles, you'd prefer a faster algorithm. Look again at the search tree above, and consider how we can find the solution faster. Here are some general strategies:

  1. We could try to move the solution to the left. This can potentially be done with better value ordering, the order in which the function possible_values considers which digit to try next. If we were able to choose right every time, there would be no search--we would arrive directly at the solution.
  2. We could try to move the solution closer to the top. If we only fill in one square at a time then the solution will always be at the same level (the number of empty squares in the original puzzle). But if we could fill in several squares at a time, the solution gets closer to the top and we can find it faster. Filling in multiple squares at a time can be accomplished with constraint propagation--the process of using the rules of Sudoku to conclude that a particular unfilled square must take on a certain value.
  3. We could waste less time on the branches that eventually lead to an inconsistency (X). That is, we could notice earlier that a grid is destined to an inconsistency. This is another kind of constraint propagation, but one that concludes that a particular square has no possible value.
  4. We could consider the squares in a better order, a process called variable ordering, which in our program is implemented by the function select_square. At each node in the tree we pick one square to branch on. The current implementation of select_square just considers the squares in order from 0 to 80 and chooses the first unfilled one, but it could be more clever. A different ordering could be better for any of the three reasons above.

Algorithm 3: constraint satisfaction with forward checking

The incremental search algorithm prunes the search tree down to size by cutting off a branch as soon as we detect that there is no possible value for the next square. But we can prune more (and thus make the search tree smaller, which will allow us to find the solution faster) with a process known as forward checking.

Consider this branch of the search tree from above:

As soon as we place a 1 in the upper-left corner, there is an inconsistency: there is no value that can go in the square in the second row, second column (because 1 is in the same box, 2 is in the same column, and 3 and 4 are in the same row). But our incremental search does not discover that until three more levels in the tree, when it gets around to selecting the offending square. The idea of forward checking is that we don't wait: every time we place a digit, we check all the unfilled squares and if one of them has no possible values we fail immediately rather than waiting to select the offending square.

In this case incremental search is not too bad; it generates two nodes in the search tree that could have been pruned. But in other cases this could be millions of nodes--forward checking is definitely worth it for Sudoku.

However, there is a cost to forward checking: we need to figure out if any non-selected square has no possible values left. To do that efficiently we will keep a collection of remaining possible values for every square; we can store the collection right in the grid. That is, grid[s] will hold a collection of all the remaining possible digits for square s. The collection could be represented by a Python set or list, but I chose instead to use a string of digits. So the value for the upper-left square in the tiny puzzle (before it is assigned a value) would be '123'.

We'll change solve so that it calls initialize to set all the unfilled squares to their initial value according to this convention. We'll also change assign(grid, s, d) so that it eliminates d from all the peers of s and fails if any peer has no remaining possible values. The eliminate function removes a digit from the possibilities for a square, and also does forward checking--it checks whether a square has been reduced to no values. (We add an extra level of indirection--the function check calls forward_check--because we will soon introduce other kinds of checks.) With these added complications one function becomes easier: possible_values can now just look up the collection of possible values rather than having to compute it. The function search remains unchanged, and select_square needs only a one-line change: under the new representation a square is unfilled when it has more than one remaining value.

def solve(gridstring):
    "Parse gridstring, handle initial constraints, and search for a solution."
    return search(initialize(parse(gridstring)))

def initialize(puzzle):
    """First set each square in the result grid to have the set of all possible values.
    Then assign each filled square to the specified value."""
    grid = N * N * [digits]
    for s in squares:
        if puzzle[s] in digitset: grid = assign(grid, s, puzzle[s])
    return grid

def possible_values(grid, s): return grid[s]

def select_square(grid):
    "Return the first square with more than one remaining value; or None if all are decided."
    for s in squares:
        if len(grid[s]) > 1: return s
    return None

def assign(grid, s, d):
    """Assign grid[s] = d and eliminate d from the peers of s.
    Return the updated grid, or False if inconsistency detected."""
    if d not in grid[s]: return False # d is not among the possibilities
    grid[s] = d
    if not all(eliminate(grid, p, d) for p in peers[s]): return False
    return grid

def eliminate(grid, s, d):
    "Remove d from possibilities for grid[s]. If checking finds an inconsistency return False."
    if d not in grid[s]: return grid # Already eliminated d
    grid[s] = grid[s].replace(d, '')
    return check(grid, s, d)

def check(grid, s, d): return forward_check(grid, s)

def forward_check(grid, s):
    "Fail if s has no possible value, or one value which is the same as a peer."
    nvals = len(grid[s])
    return not (nvals==0 or (nvals==1 and any(grid[p]==grid[s] for p in peers[s])))
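
To see the whole of Algorithm 3 in motion, here is a self-contained sketch that combines the pieces above at 4x4 scale (Nbox = 2); select_square treats a square as unfilled when it has more than one remaining value, as the possible-values representation requires:

```python
def grouper(n, items): return [items[i:i+n] for i in range(0, len(items), n)]
def cross(Rows, Cols): return [N*r+c for r in Rows for c in Cols]

Nbox = 2; N = Nbox*Nbox
rows = range(N); cols = rows
squares = range(N*N)
digits = '123456789'[:N]
digitset = set(digits)
blocks = grouper(Nbox, rows)
allunits = ([cross([r], cols) for r in rows] +
            [cross(rows, [c]) for c in cols] +
            [cross(rb, cb) for rb in blocks for cb in blocks])
units = [[u for u in allunits if s in u] for s in squares]
peers = [set(s2 for u in units[s] for s2 in u if s2 != s) for s in squares]

def parse(gridstring):
    grid = [c for c in gridstring.replace('0', '.') if c in digits or c == '.']
    assert len(grid) == N*N
    return grid

def initialize(puzzle):
    "Start every square at all digits; then assign the given clues."
    grid = N*N*[digits]
    for s in squares:
        if puzzle[s] in digitset:
            grid = assign(grid, s, puzzle[s])
            if grid is False: return False
    return grid

def assign(grid, s, d):
    if d not in grid[s]: return False
    grid[s] = d
    return grid if all(eliminate(grid, p, d) for p in peers[s]) else False

def eliminate(grid, s, d):
    if d not in grid[s]: return grid   # already eliminated
    grid[s] = grid[s].replace(d, '')
    return forward_check(grid, s)

def forward_check(grid, s):
    nvals = len(grid[s])
    return not (nvals == 0 or
                (nvals == 1 and any(grid[p] == grid[s] for p in peers[s])))

def select_square(grid):
    "First square with more than one remaining value; None if all decided."
    return next((s for s in squares if len(grid[s]) > 1), None)

def search(grid):
    if grid is False: return False
    s = select_square(grid)
    if s is None: return grid
    for d in grid[s]:
        result = search(assign(grid[:], s, d))
        if result: return result
    return False

tiny = '...4 4.3. .4.. .241'
solution = search(initialize(parse(tiny)))
assert ''.join(solution) == '2314413214233241'
```

On this puzzle the initial constraint propagation does most of the work; the search hits only a handful of dead ends.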

How much does forward checking help? For the easy puzzles, not much. One reason they are called "easy" is that they don't require forward checking. For the harder puzzle1 puzzle the solving speed is improved by a factor of 5. But that's still 42 seconds. Let's keep improving.

Algorithm 4: arc consistency

With forward checking, every time we eliminate a possible value from a square we peek ahead to peer squares to see if they will cause a failure. We can improve that idea by looking even farther ahead to the peers of peers. You might think that eliminating, say, the digit 1 from square s would effect only the peers of s and could have no effect on the peers of the peers of s. However, suppose that the peer s2 had only the possibilities 1 and 2. If we eliminate 1, then s2 must be 2, and we can then eliminate 2 from all the peers of s2. If some peer of s2 had only the possibilities 2 and 3, then it must become 3, and we can keep on going from there.

This idea is called arc consistency; it gets its name from the practice of describing a constraint satisfaction problem as a graph where the nodes are the variables (in Sudoku, the squares) and the arcs are the constraints. We make the arcs (the peer constraints) consistent one by one, and we can end up traversing multiple links in the graph (whereas the forward checking algorithm only looks ahead one arc in the graph).

The implementation of arc consistency for Sudoku is a simple change to what we already have:

def check(grid, s, d): return arc_consistent(grid, s)

def arc_consistent(grid, s):
    "Return true if s has multiple values left, or one that we can consistently assign."
    nvals = len(grid[s])
    return nvals >= 2 or (nvals == 1 and assign(grid, s, grid[s]))

Compared to forward checking, arc consistency reduces the time to solve puzzle1 from 42 seconds down to 0.16 seconds (a 265-fold speedup). The 50 easy problems are sped up from a rate of 5 per second to 285 per second, a 57-fold speedup. In fact, 70% of the easy problems have their search tree pruned to a single node with this approach. Clearly, arc consistency is very effective on Sudoku problems (as it is on many constraint satisfaction problems), and one might be tempted to stop here. But let's try just a few more of the strategies that have proven popular for Sudoku (they're easy to implement).

Algorithm 5: dual consistency

So far we have thought of each square as a variable that can take on one of 9 possible values (a digit). An alternative, or dual representation is to think of each digit in a unit as a variable that can take on one of 9 possible values (a square). If we reach the point where a digit has no possible place in a unit to go, then we can fail. If there is only one possible place for a digit within a unit, then we can assign it there. So we have:
def dual_consistent(grid, s, d):
    """After eliminating d from grid[s], check each unit of s and make sure there is some
    position in the unit for d. If only one possible place left for d, assign it."""
    for u in units[s]:
        places_for_d = [s2 for s2 in u if d in grid[s2]]
        nplaces = len(places_for_d)
        if nplaces==0 or (nplaces==1 and not assign(grid, places_for_d[0], d)):
            return False
    return True
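
To make the dual view concrete, here is a toy illustration on a single unit, with made-up candidate strings (places_for is a hypothetical helper for demonstration, not part of the solver):

```python
def places_for(unit_values, d):
    "Indices within a unit where digit d can still go."
    return [i for i, vals in enumerate(unit_values) if d in vals]

# Hypothetical remaining candidates for the nine squares of one row:
row = ['27', '27', '3', '4', '127', '6', '89', '89', '5']
assert places_for(row, '1') == [4]    # only one place left: square 4 must be 1
assert places_for(row, '8') == [6, 7] # two places: no conclusion yet
assert places_for(['23', '23', '4'], '1') == []  # no place at all: inconsistency
```

In the full solver, the check hook would presumably combine the two checks, along the lines of having check(grid, s, d) return arc_consistent(grid, s) and dual_consistent(grid, s, d).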

Algorithm 6: naked pairs

The literature on Sudoku solving strategies includes some colorful names; one of them is the so-called naked pairs strategy. It involves finding two squares in a unit which each have the same two possible values. For example, suppose two squares in a box have both been whittled down to the possible values 3 and 5. We don't know which is 3 and which is 5, but either way, we know that no other square in the box can be either 3 or 5, so 3 and 5 can be eliminated from every other square. Here is the code:
def naked_pairs(grid, s):
    """If two squares s and s2 in a unit have the same two values,
    eliminate those values from every other square s3 in the unit."""
    vals = grid[s]
    if len(vals) != 2: return True
    for u in units[s]:
        for s2 in u:
            if s2 != s and grid[s2] == vals: # Found naked pair (s, s2)
                for s3 in u:
                    if s != s3 != s2:
                        if not all(eliminate(grid, s3, v) for v in vals):
                            return False
    return True
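
A toy illustration of spotting a naked pair in one unit, again with made-up candidate strings (naked_pair is a hypothetical helper for demonstration, not the solver's function):

```python
def naked_pair(unit_values):
    "Return (pair, other_indices) for the first naked pair found, else None."
    n = len(unit_values)
    for i in range(n):
        if len(unit_values[i]) == 2:
            for j in range(i+1, n):
                if unit_values[j] == unit_values[i]:
                    return unit_values[i], [k for k in range(n) if k not in (i, j)]
    return None

# Hypothetical remaining candidates for the nine squares of one box:
box = ['35', '1359', '6', '35', '2', '13459', '7', '8', '1459']
pair, others = naked_pair(box)
assert pair == '35' and others == [1, 2, 4, 5, 6, 7, 8]
# 3 and 5 can now be eliminated from squares 1, 5, and 8 (the others containing them).
```

If a third square held exactly the same pair, that would be a contradiction; in the solver, eliminate catches it when that square's candidates are emptied.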

Benchmarks

I have collected some benchmark problems from around the web that will be used to measure the performance of algorithms. Thanks particularly to Gordon Royle, Peter Cock, Guenter Stertenbrink and Jean Charles Meyrignac. Here is a description of the files:

Filename     N puzzles  Description
puzzle1.txt          1  hard puzzle, called puzzle1 above
easy50.txt          50  easy puzzles
hardest.txt         11  puzzles from around the web advertised as "hardest"
top95.txt           95  hard puzzles
top870.txt         870  hard puzzles
top2365.txt       2365  hard puzzles
subig20.txt     17,445  hard puzzles
min17.txt       49,151  hard puzzles with 17 filled squares
ALL.txt         69,091  all of the above (duplicates eliminated)

The code to gather statistics on the runtime of our algorithms on these puzzle sets can be seen in sudoku.py.

I ran the benchmark code on the five serious algorithms: incremental search (incr), forward checking (FC), and arc consistency by itself (AC), with dual consistency (AC+Dual), and with dual plus naked pairs (AC+D+NP). For each benchmark file we can see the name of the file, the number of puzzles, the average speed of solving (expressed two ways, as a frequency in Hertz and as the average number of seconds), the standard deviation of the results, and then, to give an idea of the distribution of times, the minimum, the first percentile, 25th percentile, median, 75th percentile, 99th percentile, and maximum times. Here are the results:

version     name        N       Hz      avg   stddev      min       1%      25%   median      75%      99%      max
   incr  puzzle1        1        0  225.729    0.000  225.729  225.729  225.729  225.729  225.729  225.729  225.729

     FC  puzzle1        1        0   42.132    0.000   42.132   42.132   42.132   42.132   42.132   42.132   42.132
     FC   easy50       50        5    0.172    0.272    0.004    0.004    0.010    0.050    0.161    0.987    0.987

     AC  puzzle1        1        6    0.159    0.000    0.159    0.159    0.159    0.159    0.159    0.159    0.159
     AC   easy50       50      285    0.004    0.005    0.001    0.001    0.001    0.002    0.003    0.025    0.025
     AC  hardest       11       22    0.044    0.113    0.001    0.001    0.002    0.010    0.014    0.401    0.401
     AC    top95       95        0    1.030    2.668    0.001    0.001    0.024    0.072    0.329   16.211   16.211
     AC   top870      870        4    0.221    1.111    0.001    0.001    0.006    0.020    0.060    6.943   16.197

AC+Dual  puzzle1        1       62    0.016    0.000    0.016    0.016    0.016    0.016    0.016    0.016    0.016
AC+Dual   easy50       50      188    0.005    0.001    0.005    0.005    0.005    0.005    0.005    0.009    0.009
AC+Dual  hardest       11       67    0.015    0.010    0.005    0.005    0.006    0.011    0.018    0.035    0.035
AC+Dual    top95       95       15    0.067    0.128    0.005    0.005    0.010    0.021    0.055    0.890    0.890
AC+Dual   top870      870       33    0.030    0.074    0.005    0.005    0.008    0.014    0.028    0.298    1.273

AC+D+NP  puzzle1        1      147    0.007    0.000    0.007    0.007    0.007    0.007    0.007    0.007    0.007
AC+D+NP   easy50       50      183    0.005    0.001    0.005    0.005    0.005    0.005    0.005    0.007    0.007
AC+D+NP  hardest       11       83    0.012    0.007    0.005    0.005    0.007    0.009    0.015    0.031    0.031
AC+D+NP    top95       95       30    0.033    0.063    0.005    0.005    0.007    0.012    0.025    0.489    0.489
AC+D+NP   top870      870       58    0.017    0.033    0.005    0.005    0.006    0.010    0.017    0.137    0.596
Note that for the easy50 file, the arc consistency algorithm is fastest. That is because it is the simplest algorithm (it has no unnecessary checks or bookkeeping), yet it does enough to solve all the easy puzzles. For all the other benchmarks, it is worth the effort to incorporate arc consistency plus dual consistency plus naked pairs.

Algorithms 7-9: variable ordering

Note that our select_square function does not do much in the way of selection: it just returns the first square it finds that has not been filled yet. Could being more clever about selecting a square speed things up? We'll stick with arc consistency plus dual consistency plus naked pairs, but we'll compare the square selection strategy we've been using so far (which we will call firstV) with three alternative selection strategies: randomV, which chooses a random unfilled square; MinRV, which chooses a square with the Minimum number of Remaining Values (that is, if there is a square with only 2 possible values we would choose that over one with 3 or more values); and MaxRV, which reverses that choice. MinRV is generally held to be a good strategy; the idea is that if there are only two possibilities you have a 50% chance of guessing right.
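
A sketch of how MinRV selection might look over the possible-values representation (select_square_minrv is a hypothetical name; the grid here is a made-up list of candidate strings):

```python
def select_square_minrv(grid):
    "Pick an unfilled square with the fewest remaining values; None if all decided."
    unfilled = [s for s in range(len(grid)) if len(grid[s]) > 1]
    return min(unfilled, key=lambda s: len(grid[s])) if unfilled else None

grid = ['1', '1234', '34', '123']
assert select_square_minrv(grid) == 2          # only two candidates left at square 2
assert select_square_minrv(['1', '2']) is None # everything already decided
```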

Here are the results. The benchmarks uphold the idea that MinRV is a good strategy: it performs the best on all the benchmarks except easy50, where it falls just slightly behind firstV. This is true for the average time to solution, and even more so for the maximum time.

version     name        N       Hz      avg   stddev      min       1%      25%   median      75%      99%      max
 firstV   easy50       50      183    0.005    0.001    0.005    0.005    0.005    0.005    0.005    0.007    0.007
 firstV  hardest       11       83    0.012    0.007    0.005    0.005    0.007    0.009    0.015    0.031    0.031
 firstV    top95       95       30    0.033    0.063    0.005    0.005    0.007    0.012    0.025    0.489    0.489
 firstV   top870      870       58    0.017    0.033    0.005    0.005    0.006    0.010    0.017    0.137    0.596

randomV   easy50       50      165    0.006    0.003    0.005    0.005    0.005    0.005    0.006    0.019    0.019
randomV  hardest       11      113    0.009    0.004    0.005    0.005    0.006    0.006    0.010    0.018    0.018
randomV    top95       95        7    0.127    0.203    0.005    0.005    0.021    0.047    0.120    1.090    1.090
randomV   top870      870       25    0.039    0.101    0.005    0.005    0.010    0.018    0.032    0.490    1.762

  MinRV   easy50       50      182    0.005    0.001    0.005    0.005    0.005    0.005    0.005    0.008    0.008
  MinRV  hardest       11      142    0.007    0.002    0.005    0.005    0.005    0.006    0.010    0.011    0.011
  MinRV    top95       95       57    0.017    0.017    0.005    0.005    0.007    0.012    0.021    0.132    0.132
  MinRV   top870      870       73    0.014    0.012    0.005    0.005    0.007    0.010    0.016    0.055    0.185

  MaxRV   easy50       50      164    0.006    0.002    0.005    0.005    0.005    0.005    0.005    0.017    0.017
  MaxRV  hardest       11       25    0.038    0.058    0.005    0.005    0.007    0.013    0.058    0.210    0.210
  MaxRV    top95       95        0    1.161    2.647    0.005    0.005    0.075    0.172    0.913   19.633   19.633
  MaxRV   top870      870        4    0.223    0.983    0.005    0.005    0.025    0.050    0.109    5.129   19.664
Here is the code to implement the three new variable selection strategies and produce the benchmark results:

import random

def select_square(values):
    "randomV: Return a random unfilled square; or None if no unfilled squares."
    def numvalues(s): return len(values[s])
    unfilled = [s for s in squares if numvalues(s) > 1]
    return None if unfilled == [] else random.choice(unfilled)

def select_square(values):
    "MinRV: Return an unfilled square with the fewest possible values; or None if no unfilled squares."
    def numvalues(s): return len(values[s])
    unfilled = [s for s in squares if numvalues(s) > 1]
    return None if unfilled == [] else min(unfilled, key=numvalues)

def select_square(values):
    "MaxRV: Return an unfilled square with the most possible values; or None if no unfilled squares."
    def numvalues(s): return len(values[s])
    unfilled = [s for s in squares if numvalues(s) > 1]
    return None if unfilled == [] else max(unfilled, key=numvalues)
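To see the difference on a concrete position, consider a hypothetical four-square toy grid (not the real 81-square geometry) where the candidate strings are as below; MinRV picks the square with two candidates, while MaxRV picks the one with four:

```python
squares = [0, 1, 2, 3]                        # hypothetical toy grid
values = {0: '7', 1: '25', 2: '2589', 3: '579'}

def numvalues(s): return len(values[s])
unfilled = [s for s in squares if numvalues(s) > 1]   # square 0 is already filled

min_choice = min(unfilled, key=numvalues)     # MinRV: square 1 (two candidates)
max_choice = max(unfilled, key=numvalues)     # MaxRV: square 2 (four candidates)
```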

Algorithms 10-12: value ordering

The next question is: given that we have selected a variable, in what order should we consider the possible values? The current definition of possible just goes in increasing order of digits: 1,2,3, etc. In the literature on constraint satisfaction there is the concept of the least constraining value: the value that rules out the fewest choices for the neighbors (peers). We will benchmark least constraining value, most constraining value, and random ordering. Here is the code:
def constraining_value(values, s, d):
    "Count the number of peers of s that would have 0 or 1 value left without d."
    return sum((d in values[s2] and len(values[s2]) <= 2) for s2 in peers[s])

def possible(values, s):
    "LeastCV: try the least constraining values first."
    return sorted(values[s], key=lambda d: constraining_value(values, s, d))


def possible(values, s):
    "MostCVl: try the most constraining values first."
    return sorted(values[s], key=lambda d: -constraining_value(values, s, d))

def possible(values, s):
    "RandVal: try the values in random order."
    vals = list(values[s])
    random.shuffle(vals)
    return vals

benchmark([1,2,3,4], "RandVal")
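To make the ordering concrete, here is a toy position with hypothetical peers (not the real grid geometry): digit 5 would squeeze two peers down to a single value, while digit 2 squeezes only one, so least-constraining-value ordering tries 2 first:

```python
peers = {0: [1, 2, 3]}                          # hypothetical toy geometry
values = {0: '25', 1: '25', 2: '2789', 3: '57'}

def constraining_value(values, s, d):
    "Count the number of peers of s that would have 0 or 1 value left without d."
    return sum((d in values[s2] and len(values[s2]) <= 2) for s2 in peers[s])

def possible(values, s):
    "LeastCV: try the least constraining values first."
    return sorted(values[s], key=lambda d: constraining_value(values, s, d))

order = possible(values, 0)   # digit 2 constrains 1 peer, digit 5 constrains 2
```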
And here are the results. We see that value ordering makes almost no difference.
 version    name        N       Hz      avg   stddev      min       1%      25%   median      75%      99%      max
MostCVl   easy50       50      184    0.005    0.000    0.005    0.005    0.005    0.005    0.005    0.008    0.008
MostCVl  hardest       11      113    0.009    0.006    0.005    0.005    0.006    0.006    0.009    0.029    0.029
MostCVl    top95       95       57    0.018    0.018    0.005    0.005    0.008    0.012    0.021    0.137    0.137
MostCVl   top870      870       72    0.014    0.013    0.005    0.005    0.007    0.011    0.016    0.057    0.193

LeastCV   easy50       50      181    0.006    0.001    0.005    0.005    0.005    0.005    0.005    0.009    0.009
LeastCV  hardest       11      124    0.008    0.004    0.005    0.005    0.006    0.006    0.010    0.018    0.018
LeastCV    top95       95       54    0.018    0.018    0.005    0.005    0.008    0.012    0.022    0.135    0.135
LeastCV   top870      870       72    0.014    0.013    0.005    0.005    0.008    0.011    0.016    0.058    0.194

RandVal   easy50       50      185    0.005    0.000    0.005    0.005    0.005    0.005    0.005    0.008    0.008
RandVal  hardest       11       91    0.011    0.007    0.005    0.005    0.006    0.007    0.014    0.029    0.029
RandVal    top95       95       52    0.019    0.018    0.005    0.005    0.008    0.013    0.023    0.123    0.123
RandVal   top870      870       75    0.013    0.012    0.005    0.005    0.007    0.010    0.015    0.056    0.209

Robustness

The benchmark results above cover about 1000 puzzles, and every one was solved in a quarter second or less. Does that mean that every puzzle can be solved in a quarter second? I think the answer is clearly no. There must be some puzzles in which the solution is way down in the lower-right corner of the search tree, and thus the search has to backtrack many times to get there. How long could that take? One way to estimate this is to sample bigger and bigger numbers of puzzles and watch how the maximum time goes up. Here it is:
version     name        N       Hz      avg   stddev      min       1%      25%   median      75%      99%      max
 Normal    top95       95       38    0.026    0.025    0.008    0.008    0.011    0.018    0.032    0.196    0.196
 Normal   top870      870       49    0.020    0.018    0.008    0.008    0.011    0.016    0.023    0.082    0.275
 Normal  top2365     2365       54    0.018    0.018    0.008    0.008    0.010    0.014    0.021    0.076    0.435
 Normal  subig20    17445       83    0.012    0.022    0.007    0.007    0.008    0.008    0.010    0.066    1.583
 Normal    min17    49151       49    0.020    0.052    0.007    0.007    0.008    0.010    0.016    0.179    3.385
 Normal     200K   200000       47    0.021    0.051    0.007    0.007    0.008    0.011    0.017    0.178    3.732

1 . . |. . . |. 6 4        1 3 2 |8 5 7 |9 6 4        Solution in 3.732 seconds
. . . |3 2 . |. . .        9 6 4 |3 2 1 |5 7 8        
. 5 . |. . . |. . .        7 5 8 |9 6 4 |1 3 2        
------+------+------       ------+------+------        
. 2 . |6 4 . |. . .        5 2 1 |6 4 8 |7 9 3        
6 . . |. . . |. . 1        6 8 9 |5 7 3 |2 4 1        
. . . |. . . |8 . .        3 4 7 |2 1 9 |8 5 6        
------+------+------       ------+------+------        
. . . |. . 5 |3 2 .        8 1 6 |4 9 5 |3 2 7        
4 . . |7 . . |. . .        4 9 3 |7 8 2 |6 1 5        
. . . |. . . |. . .        2 7 5 |1 3 6 |4 8 9    
Each time we multiply the number of puzzles by some factor (the N-ratio column, relative to the previous row), the maximum time also grows, but by a much smaller factor (the max-ratio column):

version     name        N     max  N-ratio max-ratio
 Normal    top95       95   0.196
 Normal   top870      870   0.275  9.2      1.4
 Normal  top2365     2365   0.435  2.7      1.6
 Normal  subig20    17445   1.583  7.4      3.6
 Normal    min17    49151   3.385  2.8      2.1
 Normal     200K   200000   3.732  4.0      1.1
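The ratio columns are simply each row's N and max divided by the previous row's; a few lines reproduce them (up to rounding) from the (name, N, max) triples:

```python
# (name, N, max seconds) triples copied from the table above.
rows = [("top95", 95, 0.196), ("top870", 870, 0.275), ("top2365", 2365, 0.435),
        ("subig20", 17445, 1.583), ("min17", 49151, 3.385), ("200K", 200000, 3.732)]

# Each entry: (growth in N, growth in max time) relative to the previous row.
ratios = [(rows[i][1] / rows[i-1][1], rows[i][2] / rows[i-1][2])
          for i in range(1, len(rows))]
```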

Algorithm 13: concurrency

One thing struck me as I was running these benchmarks: my CPU would be reporting 100% usage (or very close to it), which may sound good, but my computer actually has six cores, so it could be reporting 600%. Python does have facilities for multithreading: the good news is you can run multiple threads; the bad news is that you can't run them at the same time. There is a global interpreter lock that makes it difficult to take advantage of all six cores.

Therefore I decided to port the program to Java, which has good facilities for concurrency. The translation is fairly straightforward, but there are some changes; some because of the nature of Java, and some because I wanted to take advantage of this rewrite opportunity to make some choices that will enhance efficiency. Here are the main differences between the Python version and the Java version:

We won't go over the Java version line by line, since it is similar to the Python version. Instead we'll skip to the conclusion: on the ALL.txt benchmark file, the Java version with just one thread solves 5,625 puzzles per second, 100 times faster than the Python version.

When we start adding threads, the performance increases almost linearly with the number of threads up to 6 (we reach speedups of 5.66 and 5.36 in two runs with 6 threads). That is what I would expect with 6 CPUs. However, as we add more threads the gains grow at about half that slope up to 10 threads (where we have a speedup of 7.20 or 7.26 in two runs). From there we have a long flat plateau where the speedup stays between 7 and 7.5 no matter how many threads we add. (This is 700 times faster than the Python program.) The top measured performance happens to be at 40 threads, with 43,873 puzzles solved per second, but this is more due to random variation than to any special property of the number 40. (If we repeated the experiment, the peak might come at 20, or 12.) There is variation because the computer is doing other things (I didn't shut down all other processes) and because the Java JIT compiler performs slightly differently each run. I did let the JIT "warm up" by solving a few hundred thousand puzzles before I started the measurements.
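The parallelization itself is simple because the puzzles are independent: partition the puzzle list into one chunk per worker and let each thread solve its own chunk. One reasonable partitioning is round-robin, sketched here in Python for illustration (the concurrent solver itself is the Java version, and this is not necessarily how it divides the work):

```python
def partition(puzzles, nworkers):
    "Split puzzles into nworkers round-robin chunks, one per worker thread."
    return [puzzles[i::nworkers] for i in range(nworkers)]

# Ten puzzles split among three workers: every puzzle lands in exactly
# one chunk, and chunk sizes differ by at most one.
chunks = partition(list(range(10)), 3)
```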

Alternative approaches

At 40,000 puzzles per second I've exceeded my expectations for this project, and I'm not tempted to try other approaches. But for completeness, I will mention three alternative approaches and describe why I decided not to try each one:

Don't let my reluctance stop you: it would be a great project to implement any of these techniques for Sudoku.

Why?

Why did I do this? As computer security expert Ben Laurie has stated, Sudoku is "a denial of service attack on human intellect". My wife was infected by the virus, and I wanted to convince her that the problem had been solved and didn't need any more of her time. It didn't work for her (although she has since independently substantially cut down on the habit), but at least one other person has told me it worked for them, so I've made the world more productive. And perhaps along the way I've taught something about Python, constraint satisfaction, incremental search, Java, and multithreading.

Peter Norvig