Letter Frequencies for Scrabble

Recently, I published a post on English Letter Frequency Counts. My intended audience of computational linguists took notice, and some wondered about the change in letter frequency over time. But there was another group of readers I hadn't anticipated: Scrabble enthusiasts. Some of them wanted support for their theory that the letter point values in the game could stand to be adjusted. In this short note, I track letter frequencies over time, and speculate on which letters are easier or harder to play in Scrabble.

Letter Frequency over Time

I broke out the Google Books Ngram data into seven time periods (pre-19th century, the 19th century, the four quarters of the 20th century, and the 21st century) and kept the 100,000 most frequent words overall. Here is the breakdown of the number of distinct words and the occurrence counts (in billions) for each time period:

              pre-1800   1800-99   1900-24   1925-49   1950-74   1975-1999   2000-now   overall
words           87,217    97,734    98,425    99,525    99,968     100,000    100,000   100,000
occurrences        3 B      79 B      53 B      47 B     112 B       248 B      203 B     744 B
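As a rough illustration of the bucketing step, here is a minimal sketch that maps a publication year to one of the seven periods. The function name `period_of` and the list names are hypothetical, and the boundary years are simply read off the period labels used in this post; the actual preprocessing code is not shown here.

```python
from bisect import bisect_right

# Period labels as used in this post.
PERIODS = ["pre-1800", "1800-99", "1900-24", "1925-49",
           "1950-74", "1975-99", "2000-now"]
# First year of every period after the first one (assumed from the labels).
BOUNDARIES = [1800, 1900, 1925, 1950, 1975, 2000]

def period_of(year):
    """Return the label of the time period that contains the given year."""
    return PERIODS[bisect_right(BOUNDARIES, year)]

for year in (1776, 1800, 1938, 1999, 2008):
    print(year, period_of(year))
```

Using `bisect_right` means a boundary year such as 1900 falls into the period that starts with it, not the one that ends just before it.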

The resulting data file (5MB) is sorted by overall word frequency, with each line containing a single word followed by tab-separated occurrence counts for each of the seven time periods. The first and last 15 lines look like this:

THE	214386567	6205478005	4136227396	3596808601	8434878977	17095905562	13413716353
OF	139080842	3835444951	2455783120	2141178652	5039228492	9911392356	7443965819
AND	104735110	2636959817	1659391866	1411410347	3297450997	7362373093	6159703274
TO	92589834	2236903043	1384523293	1209170285	2841681262	6260231443	5322298917
IN	61663464	1727940862	1202136632	1102307412	2695341106	5710080790	4391594997
A	52666775	1518556104	1067315757	965781133	2278228256	5149986234	4277553636
IS	30222523	845922593	630814098	548758334	1335987058	2818372092	2174169987
THAT	42207807	919231597	570562753	495588435	1148001772	2560771539	2264404325
FOR	22709571	587562031	427461710	407327855	988526132	2270067239	1841627493
IT	30586232	757131617	483135394	402426501	860369525	1702985300	1503450800
AS	25127245	649084129	428121174	366815239	850033321	1835148950	1546315200
WAS	21836236	679586883	445264203	397816862	852480827	1659824046	1445904911
WITH	22650143	600104567	377661407	325318042	755940286	1682707617	1418415187
BE	26016783	591061444	381343530	328060974	773748230	1541412667	1177221157
BY	24913613	624134890	379710635	322171395	767001993	1494625868	1090547690
.
.
.
PREUVES	978	19027	6292	5140	19964	28651	15500
TARANTULAS	172	5254	6071	6385	9801	35034	32834
SPIRULINA	0	848	791	413	2524	50940	40033
CORNEY	113	12993	10726	7575	12067	26828	25245
PATHBREAKING	0	26	190	190	3218	47158	44764
LITTORINA	0	6275	4960	9937	21117	42326	10926
HOOKAH	106	13108	5647	4900	14804	29150	27823
AUSLAND	0	5224	3435	5333	19312	38004	24227
ROUMANIE	0	1157	4181	6801	26821	42310	14263
IVAS	6159	22550	6591	6779	12868	18666	21918
ALANS	1305	17402	7660	8129	18759	21509	20765
GORDIE	6	118	760	3680	5721	45385	39859
THATCHERITE	0	0	2	12	4	53353	42157
EXCOMMUNICATING	2362	30983	10239	6014	13512	16522	15895
HEUSER	2	761	3149	10533	17964	41368	21750
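
A file in this format can be turned into per-period letter percentages with a few lines of code. The following is a minimal sketch, not necessarily the code used for this post: the function name `letter_percentages` is hypothetical, and it assumes that each letter of a word is weighted by the word's occurrence count in the period, with repeated letters counted every time they appear.

```python
from collections import Counter

PERIODS = ["pre-1800", "1800-99", "1900-24", "1925-49",
           "1950-74", "1975-99", "2000-now"]

def letter_percentages(lines):
    """Return one {letter: percent} dict per period.

    Each line is a word followed by tab-separated occurrence counts,
    one count per period, as in the sample above.
    """
    totals = [Counter() for _ in PERIODS]
    for line in lines:
        word, *counts = line.rstrip("\n").split("\t")
        for total, count in zip(totals, counts):
            for letter in word:            # repeated letters count each time
                total[letter] += int(count)
    percentages = []
    for total in totals:
        n = sum(total.values())
        percentages.append(
            {letter: 100 * k / n for letter, k in total.items()} if n else {})
    return percentages

# Toy input in the same format (made-up counts, for illustration only).
toy_lines = ["AB\t1\t0\t0\t0\t0\t0\t0",
             "A\t2\t4\t0\t0\t0\t0\t0"]
print(letter_percentages(toy_lines)[0])
```

With the real file, `lines` would simply be the open file object.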

In the table below, you can see the percentages for each letter, by time period. Overall, there has not been much change over time. The biggest change is that, as the Scrabblists have noted, there has been a steady increase in the frequency of "Z", which has doubled since pre-1800 (although the change in the 75 years since the invention of Scrabble has been smaller, from 0.08% to 0.10%).

In each column, the letters are ordered by frequency. When there is an exchange of frequency order for a time period (compared to the overall frequency) I have placed a horizontal line between the two exchanged letters (for example, "O" is more common than "A" in pre-1800). We see that 1950-74 is the most average time period (no letter exchanges), and 1975-99, which contains the so-called "me" decade, is the only period where "I" surpasses "O" (but the word counts for "me", "my", and "I" are not unusual in that time period).

pre-1800   1800-99    1900-24    1925-49    1950-74    1975-99    2000-now   overall
E: 12.79   E: 12.78   E: 12.67   E: 12.59   E: 12.52   E: 12.41   E: 12.40   E: 12.49
T:  9.76   T:  9.50   T:  9.42   T:  9.36   T:  9.33   T:  9.19   T:  9.20   T:  9.28
O:  7.73   A:  7.78   A:  7.93   A:  7.99   A:  8.03   A:  8.11   A:  8.11   A:  8.04
A:  7.69   O:  7.67   O:  7.66   O:  7.66   O:  7.64   I:  7.68   O:  7.64   O:  7.64
I:  7.19   I:  7.25   I:  7.32   I:  7.44   I:  7.64   O:  7.63   I:  7.61   I:  7.57
N:  7.07   N:  7.10   N:  7.12   N:  7.16   N:  7.24   N:  7.29   N:  7.25   N:  7.23
S:  6.18   S:  6.43   S:  6.47   S:  6.47   S:  6.51   S:  6.55   S:  6.52   S:  6.51
H:  6.26   R:  6.15   R:  6.19   R:  6.24   R:  6.29   R:  6.35   R:  6.27   R:  6.28
R:  6.16   H:  5.94   H:  5.63   H:  5.36   H:  5.05   H:  4.74   H:  4.88   H:  5.05
D:  3.93   D:  3.96   L:  3.97   L:  4.02   L:  4.06   L:  4.15   L:  4.12   L:  4.07
L:  3.52   L:  3.80   D:  3.89   D:  3.85   D:  3.76   D:  3.76   D:  3.84   D:  3.82
C:  2.84   C:  3.01   C:  3.09   C:  3.21   C:  3.38   C:  3.48   C:  3.38   C:  3.34
U:  2.75   U:  2.70   U:  2.70   U:  2.71   U:  2.71   U:  2.74   U:  2.76   U:  2.73
F:  2.70   F:  2.61   F:  2.57   F:  2.52   M:  2.51   M:  2.55   M:  2.53   M:  2.51
M:  2.48   M:  2.42   M:  2.43   M:  2.46   F:  2.46   F:  2.35   F:  2.29   F:  2.40
P:  1.92   P:  1.95   P:  1.98   P:  2.06   P:  2.15   P:  2.22   P:  2.16   P:  2.14
W:  1.91   W:  1.90   W:  1.85   G:  1.84   G:  1.81   G:  1.88   G:  1.94   G:  1.87
G:  1.72   G:  1.77   G:  1.83   W:  1.77   W:  1.64   Y:  1.64   Y:  1.69   W:  1.68
Y:  1.74   Y:  1.70   Y:  1.69   Y:  1.66   Y:  1.63   W:  1.57   W:  1.67   Y:  1.66
B:  1.58   B:  1.54   B:  1.53   B:  1.52   B:  1.50   B:  1.47   B:  1.45   B:  1.48
V:  1.07   V:  1.04   V:  1.00   V:  1.02   V:  1.05   V:  1.07   V:  1.06   V:  1.05
K:  0.45   K:  0.47   K:  0.52   K:  0.52   K:  0.49   K:  0.54   K:  0.60   K:  0.54
X:  0.21   X:  0.21   X:  0.21   X:  0.22   X:  0.24   X:  0.25   X:  0.24   X:  0.23
J:  0.18   J:  0.15   J:  0.14   J:  0.14   J:  0.15   J:  0.16   J:  0.17   J:  0.16
Q:  0.13   Q:  0.12   Q:  0.11   Q:  0.12   Q:  0.12   Q:  0.12   Q:  0.12   Q:  0.12
Z:  0.04   Z:  0.05   Z:  0.07   Z:  0.08   Z:  0.09   Z:  0.10   Z:  0.10   Z:  0.09

Relative Ease of Playing Letters in Scrabble

Now let's look at applications to the game of Scrabble. To make Scrabble a fair game, we would like the point value of letters to be arranged to minimize the luck of the draw. For example, suppose there were only one "S" tile, and it was worth 100 points. The player who was lucky enough to draw that tile would have a huge advantage, because it is relatively easy to play an "S", possibly on a double or triple letter square, and possibly forming words in both directions (because many nouns form a plural by adding "S"). On the other hand, a player who draws a "Q" does not have a huge advantage, even though "Q" is worth 10 points, because it is very difficult to play a "Q" at all, let alone on a double or triple square. So, ideally the point value of a letter should be inversely proportional to its ease of playing.

When Alfred Butts invented Scrabble in 1938, he determined the point values based on a frequency analysis of English letters (done by hand, not by computer). In the letter frequency column of the table below, we see that point value does indeed vary roughly inversely with letter frequency in the English books corpus. (In every column of the table, letter frequency is normalized against the letter "Q". That is, by definition "Q" has a frequency score of 1, and the score of 104 for "E" means it is 104 times more frequent. The Scrabble point value of each letter is shown in parentheses.)
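
The normalization against "Q" is a simple division. Here is a small sketch using a few of the overall percentages from the earlier table; the helper name `normalize` is hypothetical, and because the percentages are rounded to two decimals the results only approximate scores computed from the raw counts.

```python
# A few overall letter percentages copied from the earlier table.
OVERALL_PERCENT = {"E": 12.49, "T": 9.28, "A": 8.04, "S": 6.51,
                   "Z": 0.09, "Q": 0.12}

def normalize(scores, base="Q"):
    """Express every score as a multiple of the base letter's score."""
    return {letter: score / scores[base] for letter, score in scores.items()}

relative = normalize(OVERALL_PERCENT)
# Round to whole numbers, as in the table.
print({letter: round(value) for letter, value in relative.items()})
```

For "E" this gives 12.49 / 0.12, which rounds to the 104 shown in the table.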

letter frequency   words         weighted words   first play    second play   "mid" proposal   "radical" proposal
E: 104 ( 1)        E:  48 ( 1)   A:  54 ( 1)      E: 186 ( 1)   E: 321 ( 1)   E: 321 ( 1)      E: 321 ( 1) = 321
T:  77 ( 1)        S:  41 ( 1)   E:  54 ( 1)      A: 134 ( 1)   A: 202 ( 1)   A: 202 ( 1)      A: 202 ( 1) = 202
A:  67 ( 1)        I:  40 ( 1)   S:  44 ( 1)      O:  94 ( 1)   S: 169 ( 1)   S: 169 ( 1)      S: 169 ( 1) = 169
O:  64 ( 1)        A:  37 ( 1)   O:  42 ( 1)      S:  89 ( 1)   R: 162 ( 1)   R: 162 ( 1)      R: 162 ( 1) = 162
I:  63 ( 1)        R:  36 ( 1)   I:  36 ( 1)      R:  86 ( 1)   T: 141 ( 1)   T: 141 ( 1)      T: 141 ( 1) = 141
N:  60 ( 1)        N:  33 ( 1)   R:  32 ( 1)      I:  84 ( 1)   I: 133 ( 1)   I: 133 ( 1)      I: 133 ( 1) = 133
S:  54 ( 1)        T:  33 ( 1)   T:  29 ( 1)      T:  83 ( 1)   N: 130 ( 1)   N: 130 ( 1)      N: 130 ( 1) = 130
R:  52 ( 1)        O:  31 ( 1)   L:  27 ( 1)      N:  72 ( 1)   O: 128 ( 1)   O: 128 ( 1)      O: 128 ( 1) = 128
H:  42 ( 4)        L:  27 ( 1)   N:  26 ( 1)      L:  58 ( 1)   L:  93 ( 1)   L:  93 ( 1)      L:  93 ( 1) =  93
L:  34 ( 1)        C:  22 ( 3)   U:  24 ( 1)      D:  52 ( 2)   D:  92 ( 2)   D:  92 ( 2)      D:  92 ( 1) =  92
D:  32 ( 2)        D:  19 ( 2)   D:  23 ( 2)      G:  31 ( 2)   M:  47 ( 3)   M:  47 ( 3)      M:  47 ( 2) =  94
C:  28 ( 3)        U:  18 ( 1)   P:  20 ( 3)      U:  31 ( 1)   P:  45 ( 3)   P:  45 ( 3)      P:  45 ( 2) =  90
U:  23 ( 1)        P:  16 ( 3)   M:  20 ( 3)      P:  30 ( 3)   G:  44 ( 2)   G:  44 ( 3)      G:  44 ( 2) =  88
M:  21 ( 3)        M:  16 ( 3)   H:  18 ( 4)      M:  29 ( 3)   U:  36 ( 1)   U:  36 ( 3)      U:  36 ( 3) = 108
F:  20 ( 4)        G:  15 ( 2)   Y:  17 ( 4)      B:  24 ( 3)   C:  34 ( 3)   C:  34 ( 3)      C:  34 ( 3) = 102
P:  18 ( 3)        H:  13 ( 4)   B:  16 ( 3)      H:  23 ( 4)   B:  34 ( 3)   B:  34 ( 3)      B:  34 ( 3) = 102
G:  16 ( 2)        B:  11 ( 3)   G:  16 ( 2)      C:  22 ( 3)   H:  32 ( 4)   H:  32 ( 4)      H:  32 ( 3) =  96
W:  14 ( 4)        Y:  10 ( 4)   C:  16 ( 3)      Y:  20 ( 4)   F:  22 ( 4)   F:  22 ( 4)      F:  22 ( 4) =  88
Y:  14 ( 4)        F:   7 ( 4)   K:  12 ( 5)      W:  19 ( 4)   W:  22 ( 4)   W:  22 ( 4)      W:  22 ( 4) =  88
B:  12 ( 3)        V:   6 ( 4)   W:  12 ( 4)      F:  18 ( 4)   Y:  21 ( 4)   Y:  21 ( 4)      Y:  21 ( 4) =  84
V:   9 ( 4)        K:   5 ( 5)   F:  11 ( 4)      K:  14 ( 5)   V:  15 ( 4)   V:  15 ( 4)      V:  15 ( 6) =  90
K:   5 ( 5)        W:   5 ( 4)   V:   6 ( 4)      V:  12 ( 4)   K:  14 ( 5)   K:  14 ( 5)      K:  14 ( 6) =  84
X:   2 ( 8)        Z:   3 (10)   X:   4 ( 8)      X:   5 ( 8)   X:   5 ( 8)   X:   5 ( 8)      X:   5 ( 8) =  40
J:   1 ( 8)        X:   2 ( 8)   Z:   3 (10)      J:   4 ( 8)   Z:   4 (10)   Z:   4 ( 8)      Z:   4 ( 8) =  32
Q:   1 (10)        J:   1 ( 8)   J:   3 ( 8)      Z:   4 (10)   J:   3 ( 8)   J:   3 ( 8)      J:   3 ( 8) =  24
Z:   1 (10)        Q:   1 (10)   Q:   1 (10)      Q:   1 (10)   Q:   1 (10)   Q:   1 (10)      Q:   1 (10) =  10

To play a letter in Scrabble, you must form a word. The words column above shows the relative numbers of distinct words in the Scrabble word list that contain each letter. Of the 178,691 words in the Tournament Word List TWL06, 124,243 (or 70%) contain an "E", but only 2,576 (1.4%) contain a "Q". (Does that mean the "Q" should be worth 124243/2576 = 48 points? I don't think so, but you can decide what you think it means.) It does seem that there is an inequity in that there are 3 times as many words containing a "Z" as containing a "Q", but "Z" and "Q" have the same point value (10). Note also that "S" has moved up from the 7th spot to the 2nd -- in part because there are so many nouns that have a plural form ending in "S".
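
Counting the distinct words that contain each letter might look like the sketch below. The function name `words_containing` is hypothetical, and the five-word list is a toy stand-in for a real word list such as TWL06, so the numbers it prints are not the ones in the table.

```python
from collections import Counter

def words_containing(words):
    """Count, for each letter, the number of distinct words that contain it."""
    counts = Counter()
    for word in set(words):        # ignore duplicate entries
        counts.update(set(word))   # a letter counts once per word
    return counts

# Toy word list, for illustration only.
toy_words = ["AT", "EAT", "QUIZ", "ZAX", "ZEE"]
counts = words_containing(toy_words)

# Normalize against "Q", as in the table.
relative = {letter: n / counts["Q"] for letter, n in counts.items()}
print(sorted(relative.items(), key=lambda item: -item[1]))
```

Converting each word to a set is what makes "ZEE" count once for "E" rather than twice.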

Not all Scrabble words are equally easy to play. You are more likely to be able to make "AT" than "SYZYGY." The weighted words column above compares the weighted sum of words that contain each letter. The weighting is by the number of letters: two-letter words are deemed easiest to make; a three-letter word is weighted as 4 times harder to make than a two-letter word, a four-letter word as 4 times harder than a three-letter word, and so on. (Why 4 times? It is somewhat arbitrary but based on the idea that 26 letters divided by 7 letters in a rack is approximately 4.)
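
The length weighting can be written down directly. This is a sketch under stated assumptions: the names `word_weight` and `weighted_letter_scores` are hypothetical, a two-letter word gets weight 1, each extra letter divides the weight by 4, and a letter is credited once per word that contains it.

```python
from collections import defaultdict

def word_weight(word, factor=4):
    """Weight 1 for a two-letter word, divided by `factor` per extra letter."""
    return factor ** (2 - len(word))

def weighted_letter_scores(words, factor=4):
    """Sum the weights of the distinct words that contain each letter."""
    scores = defaultdict(float)
    for word in set(words):
        for letter in set(word):   # assumption: a letter counts once per word
            scores[letter] += word_weight(word, factor)
    return dict(scores)

print(word_weight("AT"), word_weight("EAT"), word_weight("SYZYGY"))
print(weighted_letter_scores(["AT", "EAT", "ZAX"]))
```

Under this weighting "SYZYGY" contributes 1/256 as much as "AT" to each of its letters.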

Not all three-letter words are equally easy to play. It is hard to make "ZAX" because there is only one "Z" and one "X", and easy to make "EAT". In the first play column, I report the relative frequencies of being able to play a letter, based on the actual probability of being able to play each possible word as the first play of the game. For example, the probability of being able to play "THE" turns out to be 9.4%, based on the probability of drawing a "T", "H", and "E" (or blanks to make up for these letters) out of the seven letters in a hand.
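
One way to compute such a probability exactly is to count the 7-tile racks that contain the needed letters, letting blanks fill any gaps. The sketch below assumes the standard 100-tile English distribution and a uniformly random opening rack; the function name `first_play_probability` is hypothetical, and since the exact model behind the 9.4% figure is not spelled out here, its result for "THE" need not match that number.

```python
from collections import Counter
from itertools import product
from math import comb

# Standard English-language tile counts; "?" stands for a blank.
TILES = {"A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2,
         "I": 9, "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2,
         "Q": 1, "R": 6, "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1,
         "Y": 2, "Z": 1, "?": 2}
RACK = 7

def first_play_probability(word, tiles=TILES, rack=RACK):
    """Probability that a random opening rack can spell `word` (uppercase A-Z)."""
    if len(word) > rack:
        return 0.0
    need = Counter(word)
    letters = list(need)
    blanks = tiles["?"]
    others = sum(tiles.values()) - blanks - sum(tiles[c] for c in letters)
    favorable = 0
    # Enumerate how many tiles of each needed letter, and how many blanks, are drawn.
    for drawn in product(*(range(min(tiles[c], rack) + 1) for c in letters)):
        for b in range(blanks + 1):
            used = sum(drawn) + b
            if used > rack:
                continue
            missing = sum(max(0, need[c] - k) for c, k in zip(letters, drawn))
            if missing > b:        # not enough blanks to cover the gaps
                continue
            ways = comb(blanks, b) * comb(others, rack - used)
            for c, k in zip(letters, drawn):
                ways *= comb(tiles[c], k)
            favorable += ways
    return favorable / comb(sum(tiles.values()), rack)

for w in ("THE", "EAT", "ZAX"):
    print(w, first_play_probability(w))
```

The count is a multivariate hypergeometric sum: tiles of the word's letters and blanks are enumerated explicitly, and the rest of the rack is filled from all other tiles.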

Words longer than 7 letters are impossible on the first turn, but possible on subsequent turns. In the second play column of the table above, I show the letter frequencies based on the probability of playing a word as the second play of the game. That is, the word must either intersect the first-played word at one letter, or it must use all the letters of the first word. (That way, we can make words up to 14 letters.) I didn't attempt to model plays beyond the second, but I think the numbers would not change too much from the second play.
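
If point value were exactly inversely proportional to ease of play, the product of the two would be the same for every letter. The sketch below checks this against the table, multiplying the second play scores by the current point values and by the values in the "radical" proposal column; the dictionary and function names are hypothetical, and the numbers are copied from the table above.

```python
# Second play scores from the table above.
SECOND_PLAY = {"E": 321, "A": 202, "S": 169, "R": 162, "T": 141, "I": 133,
               "N": 130, "O": 128, "L": 93, "D": 92, "M": 47, "P": 45,
               "G": 44, "U": 36, "C": 34, "B": 34, "H": 32, "F": 22,
               "W": 22, "Y": 21, "V": 15, "K": 14, "X": 5, "Z": 4,
               "J": 3, "Q": 1}
# Current point values (the numbers in parentheses in the second play column).
CURRENT_VALUE = {"E": 1, "A": 1, "S": 1, "R": 1, "T": 1, "I": 1, "N": 1,
                 "O": 1, "L": 1, "D": 2, "M": 3, "P": 3, "G": 2, "U": 1,
                 "C": 3, "B": 3, "H": 4, "F": 4, "W": 4, "Y": 4, "V": 4,
                 "K": 5, "X": 8, "Z": 10, "J": 8, "Q": 10}
# Point values from the "radical" proposal column.
RADICAL_VALUE = {"E": 1, "A": 1, "S": 1, "R": 1, "T": 1, "I": 1, "N": 1,
                 "O": 1, "L": 1, "D": 1, "M": 2, "P": 2, "G": 2, "U": 3,
                 "C": 3, "B": 3, "H": 3, "F": 4, "W": 4, "Y": 4, "V": 6,
                 "K": 6, "X": 8, "Z": 8, "J": 8, "Q": 10}

def products(ease, value):
    """Ease score times point value, per letter."""
    return {letter: ease[letter] * value[letter] for letter in ease}

current = products(SECOND_PLAY, CURRENT_VALUE)
radical = products(SECOND_PLAY, RADICAL_VALUE)
for letter in SECOND_PLAY:
    print(f"{letter}: current {current[letter]:3d}   radical {radical[letter]:3d}")
```

The radical products reproduce the rightmost numbers in the table, and for the middle of the alphabet (from "L" down to "K") they stay within a fairly narrow band.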

Conclusion: Based on the data above, I will make three possible proposals for Scrabble letter values:


Peter Norvig