Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium.
Code copyright (c) 2008-2009 by Peter Norvig. You are free to use this code under the MIT license.
To run this code, download either the zip file (and unzip it) or all the files listed below. Then, from a shell, execute python -i ngrams.py (or start a Python IDE and import ngrams). To check that everything works, call test(). Note that the hillclimbing function has a random component, so with bad luck some of the tests may fail even if everything is correctly installed. (It is unlikely that they will fail twice in a row.)
| Size | File | Description |
|---|---|---|
| 4,800KB | ngrams.zip | A zip file of all the files below. Get this or the files below. |
| 8KB | ngrams.py | The Python code for everything in the chapter. |
| 7KB | ngrams-test.txt | Unit tests; run by the Python function test(). |
| 5,000KB | count_1w.txt | The 1/3 million most frequent words, all lowercase, with counts. (Called vocab_common in the chapter, but I changed file names here.) |
| 5,500KB | count_2w.txt | The 1/4 million most frequent two-word (lowercase) bigrams, with counts. |
| 10KB | count_2l.txt | Counts for all 2-letter (lowercase) bigrams. |
| 200KB | count_3l.txt | Counts for all 3-letter (lowercase) trigrams. |
| 10KB | count_1edit.txt | Counts for all single-edit spelling correction edits, from the file spell-errors.txt. |
| 450KB | spell-errors.txt | A collection of "right: wrong1, wrong2" spelling mistakes, collected from Wikipedia and Roger Mitton. |
| 320KB | count_big.txt | Not from the chapter: a word count file for the big.txt file from my spell correction article. |
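
As a rough illustration of how the data files above might be read, here is a minimal sketch. It is not the code in ngrams.py. It assumes each line of a count file is a key and an integer count separated by a tab (the separator is an assumption; check the files you download), and it reads the "right: wrong1, wrong2" layout described for spell-errors.txt. The helper names parse_counts and parse_spell_errors are hypothetical, and the sample lines are made up rather than taken from the corpus.

```python
from collections import Counter


def parse_counts(lines, sep="\t"):
    """Parse 'key<sep>count' lines into a Counter.

    The tab separator is an assumption about the count_*.txt files.
    Splitting from the right keeps multi-word keys such as bigrams intact.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, count = line.rsplit(sep, 1)
        counts[key] += int(count)
    return counts


def parse_spell_errors(lines):
    """Parse 'right: wrong1, wrong2' lines into {right: [wrong1, wrong2]}."""
    errors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        right, wrongs = line.split(":", 1)
        errors[right.strip()] = [w.strip() for w in wrongs.split(",") if w.strip()]
    return errors


# Made-up sample lines for illustration only, not real corpus data.
sample_counts = ["the\t100", "of\t50", "and\t50"]
sample_errors = ["raining: rainning, raning"]

unigrams = parse_counts(sample_counts)
total = sum(unigrams.values())
p_the = unigrams["the"] / total  # relative frequency of "the" in the sample

misspellings = parse_spell_errors(sample_errors)
```

With the real files, the same helpers could be handed an open file object, for example parse_counts(open("count_1w.txt")), since iterating over a file yields its lines.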