| | pre-1800 | 1800-99 | 1900-24 | 1925-49 | 1950-74 | 1975-1999 | 2000-now | overall |
|---|---|---|---|---|---|---|---|---|
| words | 87,217 | 97,734 | 98,425 | 99,525 | 99,968 | 100,000 | 100,000 | 100,000 |
| occurrences | 3 B | 79 B | 53 B | 47 B | 112 B | 248 B | 203 B | 744 B |

The resulting data file (5MB) is sorted by overall word frequency, with each line containing a single word followed by tab-separated occurrence counts for each of the seven time periods. The first and last 15 lines look like this:

    THE 214386567 6205478005 4136227396 3596808601 8434878977 17095905562 13413716353
    OF 139080842 3835444951 2455783120 2141178652 5039228492 9911392356 7443965819
    AND 104735110 2636959817 1659391866 1411410347 3297450997 7362373093 6159703274
    TO 92589834 2236903043 1384523293 1209170285 2841681262 6260231443 5322298917
    IN 61663464 1727940862 1202136632 1102307412 2695341106 5710080790 4391594997
    A 52666775 1518556104 1067315757 965781133 2278228256 5149986234 4277553636
    IS 30222523 845922593 630814098 548758334 1335987058 2818372092 2174169987
    THAT 42207807 919231597 570562753 495588435 1148001772 2560771539 2264404325
    FOR 22709571 587562031 427461710 407327855 988526132 2270067239 1841627493
    IT 30586232 757131617 483135394 402426501 860369525 1702985300 1503450800
    AS 25127245 649084129 428121174 366815239 850033321 1835148950 1546315200
    WAS 21836236 679586883 445264203 397816862 852480827 1659824046 1445904911
    WITH 22650143 600104567 377661407 325318042 755940286 1682707617 1418415187
    BE 26016783 591061444 381343530 328060974 773748230 1541412667 1177221157
    BY 24913613 624134890 379710635 322171395 767001993 1494625868 1090547690
    . . .
    PREUVES 978 19027 6292 5140 19964 28651 15500
    TARANTULAS 172 5254 6071 6385 9801 35034 32834
    SPIRULINA 0 848 791 413 2524 50940 40033
    CORNEY 113 12993 10726 7575 12067 26828 25245
    PATHBREAKING 0 26 190 190 3218 47158 44764
    LITTORINA 0 6275 4960 9937 21117 42326 10926
    HOOKAH 106 13108 5647 4900 14804 29150 27823
    AUSLAND 0 5224 3435 5333 19312 38004 24227
    ROUMANIE 0 1157 4181 6801 26821 42310 14263
    IVAS 6159 22550 6591 6779 12868 18666 21918
    ALANS 1305 17402 7660 8129 18759 21509 20765
    GORDIE 6 118 760 3680 5721 45385 39859
    THATCHERITE 0 0 2 12 4 53353 42157
    EXCOMMUNICATING 2362 30983 10239 6014 13512 16522 15895
    HEUSER 2 761 3149 10533 17964 41368 21750
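Reading a file in this layout takes only a few lines. The following is a minimal sketch, not the code used to build the file: the names `PERIODS` and `parse_counts` are my own, and I assume each line holds one word followed by exactly seven integer counts. The two sample lines are copied from the listing above.

```python
# Period labels, in the column order described in the text (names are illustrative).
PERIODS = ["pre-1800", "1800-99", "1900-24", "1925-49",
           "1950-74", "1975-99", "2000-now"]

def parse_counts(lines):
    """Map each word to its list of seven per-period occurrence counts."""
    table = {}
    for line in lines:
        fields = line.split()  # whitespace split also handles tabs
        if len(fields) != len(PERIODS) + 1:
            continue  # skip blank or malformed lines
        word, *counts = fields
        table[word] = [int(n) for n in counts]
    return table

# Two lines taken from the end of the file, with tab separators restored.
sample = ["THATCHERITE\t0\t0\t2\t12\t4\t53353\t42157\n",
          "HOOKAH\t106\t13108\t5647\t4900\t14804\t29150\t27823\n"]
counts = parse_counts(sample)

# The overall frequency used for sorting is the sum across the seven periods.
overall = {word: sum(c) for word, c in counts.items()}
```

Because `parse_counts` accepts any iterable of lines, an open file handle can be passed in place of `sample`.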

In the table below, you can see the percentages for each letter, by time period. Overall, there has not been much change over time. The biggest change is that, as the Scrabblists have noted, there has been a steady increase in the frequency of "Z", doubling since pre-1800 (although the change in the 75 years since the invention of Scrabble has been smaller, from 0.08% to 0.10%).

In each column, the letters are ordered by frequency. When there is an exchange of frequency order for a time period (compared to the overall frequency) I have placed a horizontal line between the two exchanged letters (for example, "O" is more common than "A" in pre-1800). We see that 1950-74 is the most average time period (no letter exchanges), and 1975-99, which contains the so-called "me" decade, is the only period where "I" surpasses "O" (but the word counts for "me", "my", and "I" are not unusual in that time period).

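| pre-1800 | 1800-99 | 1900-24 | 1925-49 | 1950-74 | 1975-99 | 2000-now | overall |
|---|---|---|---|---|---|---|---|
| E: 12.79 | E: 12.78 | E: 12.67 | E: 12.59 | E: 12.52 | E: 12.41 | E: 12.40 | E: 12.49 |
| T: 9.76 | T: 9.50 | T: 9.42 | T: 9.36 | T: 9.33 | T: 9.19 | T: 9.20 | T: 9.28 |
| | A: 7.78 | A: 7.93 | A: 7.99 | A: 8.03 | A: 8.11 | A: 8.11 | A: 8.04 |
| | O: 7.67 | O: 7.66 | O: 7.66 | O: 7.64 | | O: 7.64 | O: 7.64 |
| | I: 7.25 | I: 7.32 | I: 7.44 | I: 7.64 | | I: 7.61 | I: 7.57 |
| | N: 7.10 | N: 7.12 | N: 7.16 | N: 7.24 | | N: 7.25 | N: 7.23 |
| | S: 6.43 | S: 6.47 | S: 6.47 | S: 6.51 | | S: 6.52 | S: 6.51 |
| | R: 6.15 | R: 6.19 | R: 6.24 | R: 6.29 | | R: 6.27 | R: 6.28 |
| | H: 5.94 | H: 5.63 | H: 5.36 | H: 5.05 | | H: 4.88 | H: 5.05 |
| | | L: 3.97 | L: 4.02 | L: 4.06 | | L: 4.12 | L: 4.07 |
| | | D: 3.89 | D: 3.85 | D: 3.76 | | D: 3.84 | D: 3.82 |
| | | C: 3.09 | C: 3.21 | C: 3.38 | | C: 3.38 | C: 3.34 |
| | | U: 2.70 | U: 2.71 | U: 2.71 | | U: 2.76 | U: 2.73 |
| | | | | M: 2.51 | | M: 2.53 | M: 2.51 |
| | | | | F: 2.46 | | F: 2.29 | F: 2.40 |
| | | | | P: 2.15 | | P: 2.16 | P: 2.14 |
| | | | | G: 1.81 | | G: 1.94 | G: 1.87 |
| | | | | W: 1.64 | | | W: 1.68 |
| | | | | Y: 1.63 | | | Y: 1.66 |
| | | | | B: 1.50 | | | B: 1.48 |
| | | | | V: 1.05 | | | V: 1.05 |
| | | | | K: 0.49 | | | K: 0.54 |
| | | | | X: 0.24 | | | X: 0.23 |
| | | | | J: 0.15 | | | J: 0.16 |
| | | | | Q: 0.12 | | | Q: 0.12 |
| | | | | Z: 0.09 | | | Z: 0.09 |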
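One plausible way to get from word counts to letter percentages is sketched below. It assumes that every occurrence of a word contributes each of its letters once per appearance in the word; the function name `letter_percentages` and the toy input are my own, not part of the original analysis.

```python
from collections import Counter

def letter_percentages(word_counts):
    """Given {word: occurrences} for one period, return {letter: percent}."""
    totals = Counter()
    for word, n in word_counts.items():
        for letter in word:
            totals[letter] += n  # a repeated letter in a word counts each time
    grand_total = sum(totals.values())
    return {letter: 100.0 * c / grand_total for letter, c in totals.items()}

# Toy data: "AB" seen once and "A" seen twice gives 3 A's and 1 B.
toy = letter_percentages({"AB": 1, "A": 2})
```

Running the same function once per period column would give one column of the table.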

When Alfred Butts invented Scrabble in 1938, he determined the
point values based on a frequency
analysis of English letters (done by hand, not by computer). In the
**letter frequency**
column of the table below, we see that point value does indeed vary
roughly inversely with letter frequency in the English books corpus. (In every column of the table, letter
frequency is normalized against the letter "Q". That is, by definition "Q"
has a frequency score of 1, and the score of 104 for "E" means it
is 104 times more frequent. The Scrabble point value of each letter is shown in
parentheses.)

| letter frequency | words | weighted words | first play | second play | "mid" proposal | "radical" proposal |
|---|---|---|---|---|---|---|
| E: 104 (1) | E: 48 (1) | A: 54 (1) | E: 186 (1) | E: 321 (1) | E: 321 (1) | E: 321 (1) = 321 |
| T: 77 (1) | S: 41 (1) | E: 54 (1) | A: 134 (1) | A: 202 (1) | A: 202 (1) | A: 202 (1) = 202 |
| A: 67 (1) | I: 40 (1) | S: 44 (1) | O: 94 (1) | S: 169 (1) | S: 169 (1) | S: 169 (1) = 169 |
| O: 64 (1) | A: 37 (1) | O: 42 (1) | S: 89 (1) | R: 162 (1) | R: 162 (1) | R: 162 (1) = 162 |
| I: 63 (1) | R: 36 (1) | I: 36 (1) | R: 86 (1) | T: 141 (1) | T: 141 (1) | T: 141 (1) = 141 |
| N: 60 (1) | N: 33 (1) | R: 32 (1) | I: 84 (1) | I: 133 (1) | I: 133 (1) | I: 133 (1) = 133 |
| S: 54 (1) | T: 33 (1) | T: 29 (1) | T: 83 (1) | N: 130 (1) | N: 130 (1) | N: 130 (1) = 130 |
| R: 52 (1) | O: 31 (1) | L: 27 (1) | N: 72 (1) | O: 128 (1) | O: 128 (1) | O: 128 (1) = 128 |
| H: 42 (4) | L: 27 (1) | N: 26 (1) | L: 58 (1) | L: 93 (1) | L: 93 (1) | L: 93 (1) = 93 |
| L: 34 (1) | C: 22 (3) | U: 24 (1) | D: 52 (2) | D: 92 (2) | D: 92 (2) | D: 92 |
| D: 32 (2) | D: 19 (2) | D: 23 (2) | G: 31 (2) | M: 47 (3) | M: 47 (3) | |
| C: 28 (3) | U: 18 (1) | P: 20 (3) | U: 31 (1) | P: 45 (3) | P: 45 (3) | |
| U: 23 (1) | P: 16 (3) | M: 20 (3) | P: 30 (3) | G: 44 (2) | G: 44 | |
| M: 21 (3) | M: 16 (3) | H: 18 (4) | M: 29 (3) | U: 36 (1) | | |
| F: 20 (4) | G: 15 (2) | Y: 17 (4) | B: 24 (3) | C: 34 (3) | | |
| P: 18 (3) | H: 13 (4) | B: 16 (3) | H: 23 (4) | B: 34 (3) | | |
| G: 16 (2) | B: 11 (3) | G: 16 (2) | C: 22 (3) | H: 32 (4) | | |
| W: 14 (4) | Y: 10 (4) | C: 16 (3) | Y: 20 (4) | F: 22 (4) | | |
| Y: 14 (4) | F: 7 (4) | K: 12 (5) | W: 19 (4) | W: 22 (4) | | |
| B: 12 (3) | V: 6 (4) | W: 12 (4) | F: 18 (4) | Y: 21 (4) | | |
| V: 9 (4) | K: 5 (5) | F: 11 (4) | K: 14 (5) | V: 15 (4) | | |
| K: 5 (5) | W: 5 (4) | V: 6 (4) | V: 12 (4) | K: 14 (5) | | |
| X: 2 (8) | Z: 3 (10) | X: 4 (8) | X: 5 (8) | X: 5 (8) | | |
| J: 1 (8) | X: 2 (8) | Z: 3 (10) | J: 4 (8) | Z: 4 (10) | | |
| Q: 1 (10) | J: 1 (8) | J: 3 (8) | Z: 4 (10) | J: 3 (8) | | |
| Z: 1 (10) | Q: 1 (10) | Q: 1 (10) | Q: 1 (10) | Q: 1 (10) | | |
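The normalization against "Q" is just a division. As a small illustration (the helper name `normalize` is mine), here are a few of the overall percentages from the earlier table divided by the percentage for "Q":

```python
# A few values from the "overall" column of the letter percentage table.
overall_pct = {"E": 12.49, "T": 9.28, "A": 8.04, "Q": 0.12}

def normalize(freqs, base="Q"):
    """Express every frequency as a multiple of the base letter's frequency."""
    return {letter: f / freqs[base] for letter, f in freqs.items()}

scores = normalize(overall_pct)
rounded = {letter: round(s) for letter, s in scores.items()}
```

For these four letters the rounded results match the letter frequency column: 104 for "E", 77 for "T", 67 for "A", and 1 for "Q" by definition.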

To play a letter in Scrabble, you must form a word. The
**words** column above shows the relative numbers of distinct words in the
Scrabble word list that contain each letter. Of the 178,691 words in
the Tournament Word List TWL06, 124,243 (or 70%) contain an "E", but only 2,576 (1.4%) contain
a "Q". (Does that mean the "Q" should be worth 124243/2576 = 48 points?
I don't think so, but you can decide what you think it means.) It
does seem that there is an inequity in that there are 3 times as many
words containing a "Z" as a "Q", but "Z" and "Q" have the
same point value (10). Note also that "S" has moved up from the 7th
spot to the 2nd -- in part because there are so many nouns that have a
plural form ending in "S".
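Counting the distinct words that contain each letter can be sketched as follows. This is an illustration under my own naming (`words_containing`), run on a toy word list rather than TWL06; the last three lines just redo the arithmetic quoted above.

```python
from collections import Counter

def words_containing(words):
    """Count, for each letter, how many distinct words contain it at least once."""
    counts = Counter()
    for word in set(words):
        counts.update(set(word.upper()))  # set(): a word counts once per letter
    return counts

toy_counts = words_containing(["EAT", "TEE", "QI"])

# Arithmetic from the text: share of words with "E" and "Q", and their ratio.
e_share = 100 * 124243 / 178691
q_share = 100 * 2576 / 178691
e_to_q = 124243 / 2576
```

Dividing every count by the count for "Q" would put the result on the same scale as the words column.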

Not all Scrabble words are equally easy to play. You are more likely to be able to make
"AT" than "SYZYGY." The **weighted words** column above compares the weighted sum
of words that contain each letter. The weighting is by the number of
letters: two-letter words are deemed easiest to make; a three-letter
word is weighted as 4 times harder to make, a four-letter word as 4
times harder than a three-letter, and so on. (Why 4 times? It is
somewhat arbitrary but based on the idea that 26 letters divided by 7
letters in a rack is approximately 4.)
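Here is one way the length weighting could be written down. It is a sketch under assumptions: a two-letter word gets weight 1, each extra letter divides the weight by 4, and a word contributes its weight once to every distinct letter it contains. The function name and the `factor` parameter are mine.

```python
from collections import defaultdict

def weighted_letter_scores(words, factor=4):
    """Sum length-based weights of the words containing each letter."""
    scores = defaultdict(float)
    for word in words:
        # Weight 1 for two letters, 1/4 for three, 1/16 for four, and so on.
        weight = factor ** -(len(word) - 2)
        for letter in set(word.upper()):
            scores[letter] += weight
    return dict(scores)

toy_scores = weighted_letter_scores(["AT", "EAT", "SYZYGY"])
```

With this weighting "SYZYGY" adds only 1/256 to the scores for "S", "Y", "Z", and "G", while "AT" adds a full 1 to "A" and "T".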

Not all three-letter words are equally easy to play.
It is hard to make "ZAX" because there is only one "Z" and one "X",
and easy to make "EAT". In the **first play** column, I report the
relative frequencies of being able to play a letter, based on the
actual probability of being able to play each possible word
as the first play of the game. For example, the probability of being
able to play "THE" turns out to be 9.4%, based on the probability of
drawing a "T", "H", and "E" (or blanks to make up for these letters)
out of the seven letters in a hand.
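The method behind the probabilities is not spelled out here, so the following is only one possible formulation: an exact count over all 7-tile draws from the standard 100-tile English set, where a blank can stand in for any missing letter. The function name is mine, and this sketch may well give a different figure for "THE" than the 9.4% quoted above.

```python
from collections import Counter
from itertools import product
from math import comb

# Standard English-language Scrabble tile counts; the two blanks are kept separate.
TILES = dict(A=9, B=2, C=2, D=4, E=12, F=2, G=3, H=2, I=9, J=1, K=1, L=4, M=2,
             N=6, O=8, P=2, Q=1, R=6, S=4, T=6, U=4, V=2, W=2, X=1, Y=2, Z=1)
BLANKS = 2
RACK = 7
TOTAL = sum(TILES.values()) + BLANKS  # 100 tiles

def first_play_probability(word):
    """Probability that a random 7-tile rack can spell word, using blanks if needed."""
    need = Counter(word.upper())
    if sum(need.values()) > RACK:
        return 0.0
    letters = list(need)
    others = TOTAL - BLANKS - sum(TILES[c] for c in letters)
    ranges = [range(min(TILES[c], RACK) + 1) for c in letters]
    ranges.append(range(BLANKS + 1))
    favorable = 0
    # Enumerate how many of each needed letter, and how many blanks, are drawn.
    for draw in product(*ranges):
        *have, blanks = draw
        rest = RACK - sum(draw)  # tiles that are irrelevant to this word
        if rest < 0:
            continue
        shortfall = sum(max(0, need[c] - h) for c, h in zip(letters, have))
        if shortfall > blanks:
            continue  # not enough blanks to cover the missing letters
        ways = comb(BLANKS, blanks) * comb(others, rest)
        for c, h in zip(letters, have):
            ways *= comb(TILES[c], h)
        favorable += ways
    return favorable / comb(TOTAL, RACK)
```

Summing such probabilities over every word that contains a given letter, then normalizing against "Q", would be one route to a column like first play. The enumeration is exact but slow for words with many distinct letters.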

Words longer than 7 letters are impossible on the first turn, but
possible on subsequent turns. In the **second play** column of the table above, I show the letter
frequencies based on the probability of playing a word as the
second play of the game. That is, the word must either
intersect the first-played word at one letter, or it must use all the
letters of the first word. (That way, we can make words up to 14
letters.) I didn't attempt to model plays beyond the second, but I
think the numbers would not change too much from the second play.

**Conclusion:** Based on the data above, I will make three
possible proposals for Scrabble letter values:

**Conservative Proposal:** Keep letter values as they are. The game ain't broke, so don't fix it. Don't make millions of Scrabble sets obsolete.

**Mid Proposal:** Make the minimal number of changes so that there are no inversions in letter scores compared to the ease of play (as measured by the second play column). This would mean increasing "G" from 2 to 3; "U" from 1 to 3; and decreasing "Z" from 10 to 8.

**Radical Proposal:** Make more changes so that the product of letter score and ease of use is roughly constant. In the final column above, I show a proposal that gives all the letters except for the first 4 and last 4 a combined (ease × points) score around 100 (ranging from 84 to 141). (I think it is ok that the first and last 4 are outliers; better than making "Q" worth 100 points and "E" 1/3 points.)
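The inversion check behind the mid proposal is easy to reproduce. The sketch below copies the ease-of-play numbers and current point values from the second play column; the helper names are my own. An inversion here means a letter that is harder to play than its neighbor in the list yet scores fewer points.

```python
# (letter, ease of play, current points), copied from the second play column.
SECOND_PLAY = [
    ("E", 321, 1), ("A", 202, 1), ("S", 169, 1), ("R", 162, 1), ("T", 141, 1),
    ("I", 133, 1), ("N", 130, 1), ("O", 128, 1), ("L", 93, 1), ("D", 92, 2),
    ("M", 47, 3), ("P", 45, 3), ("G", 44, 2), ("U", 36, 1), ("C", 34, 3),
    ("B", 34, 3), ("H", 32, 4), ("F", 22, 4), ("W", 22, 4), ("Y", 21, 4),
    ("V", 15, 4), ("K", 14, 5), ("X", 5, 8), ("Z", 4, 10), ("J", 3, 8),
    ("Q", 1, 10),
]

def inversions(rows):
    """Adjacent pairs where the harder-to-play letter scores fewer points."""
    return [(a[0], b[0]) for a, b in zip(rows, rows[1:]) if b[2] < a[2]]

def with_changes(rows, changes):
    """Return a copy of rows with some point values replaced."""
    return [(letter, ease, changes.get(letter, pts)) for letter, ease, pts in rows]

current = inversions(SECOND_PLAY)
mid = inversions(with_changes(SECOND_PLAY, {"G": 3, "U": 3, "Z": 8}))

# Ease times points for the current values, the quantity the radical proposal evens out.
products = {letter: ease * pts for letter, ease, pts in SECOND_PLAY}
```

With the current values the check flags the pairs P/G, G/U, and Z/J; with the three mid-proposal changes applied it flags none.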

Peter Norvig