World's Longest Palindrome? 21,012 words

See also: comments, program

At 8:02 PM on the 20th of February 2002 it was 20:02 02/20 2002 (if you live in the US), or 20:02 20/02 2002 (if you live in the rest of the world). Either way, it was the best of times, it was the tseb of times, it was a palindromic time. In honor of the event, I searched for world's longest palindrome and found that "In 1980, Giles Selig Hales claimed to have written the world's longest palindrome, which consisted of 58,795 letters." I also remembered that Dan Hoey had generated a long palindrome beginning with "A man, a plan" and ending with "Panama": a 540 word version created in 1984. Hoey writes "This was done with the Unix spelling dictionary and a fairly simple-minded program. With a better word list and a smarter program I'm sure the palindrome could be ten times as long."

Luckily for me, I had a better word list: the very nice Moby Word list from Grady Ward , from which I was able to extract the 126,342-word npdict.txt. Also, very importantly, I had a laptop computer which was at least 1,000 times more powerful than the minicomputer Hoey had in 1984, allowing me to do all computation in memory, rather than having to manage disk data files. That made it easy for me to create a program that appears smarter, even though Hoey's program was probably more difficult to author.

Version 1: 15,139 Word Palindrome

Created: 20-02-2002
Words: 15,319
Letters: 63,647
Phrases: 12,400
Palindrome:A man, a plan, a caddy, Ore, Lee, tsuba, Thaine, a lair, ... (more) ..., Aeniah, Tabu, Steele, Roydd, a canal, Panama.
A man, a plan, a caddy,

Commentary: Cognoscenti such as Mark Saltveit, editor of The Palindromist, rightfully point out that my creation should not be called a true palindrome, because it makes no sense. But Saltveit says that I am probably safe in calling this "the world's longest palindromic sentence, or the world's longest parody of `A man, a plan.' " I agree with that assessment.

Maybe I'm biased, but I think the sentence starts out quite strong. "A man, a plan, a caddy" is the basic premise of a classic piece of storytelling. Unfortunately, things go downhill from there rather quickly. It contains truths, but it does not have a plot. It has Putnam, but no logic; Tesla, but no electricity; Pareto, but no optimality; Ebert, but no thumbs up. It has an ensemble cast including Tim Allen, Ed Harris and Al Pacino, but they lack character development. It has Sinatra and Pink, but it doesn't sing. It has Monet and Goya, but no artistry. It has Slovak, Inuit, Creek, and Italian, but it's all Greek to me. It has exotic locations like Bali, Maui, Uranus, and Canada, but it jumps around needlessly. It has Occam, but it is the antithesis of his maxim "Entia non sunt multiplicanda praeter necessitatem." If you tried to read the whole thing, you'd get to "a yawn" and stop. Or you might be overcome by the jargon, such as PETN, ILGWU, PROM, UNESCO, and MYOB. Most serendipitous of all is that Steele, who collected several shorter versions of the Panama oeuvre in a book about a Lisp, shows up in the very last line. Steele and some others have some comments. You can read the results from top to bottom (if you don't get bored) or you can start in the middle; the letter "y" in "Moray".

Speed: The program increases the length of the longest palindrome found by roughly 200 words every second; so in 3 or 4 seconds it breaks Hoey's record, and in 30 seconds it is over 6000 words. At around 8000 words progress slows to about 1,000 words per minute, and by around 10,000 to 12,000 words progress is sporadic. This is because we are running out of good words: there are 126,000 words in the dictionary, but only about 10% of them are easily reversible. For example, there are 426 words that contain "eq" or "sq", but these are hard to use in a palindrome because there are no words containing "qe" or "qs", and only a few words that end in "q" (and could then be followed by a word starting with "e" or "s".

Goals: This almost doubles Hales' letter count, and quadruples Hoey's aspiration for word count.

Version 2: 17,826 Word Palindrome

Created: 7:00-2-11-2007
Words: 17,826
Letters: 74,663
Phrases: 14,382
Palindrome: A man, a plan, a cameo, Zena, Bird, Mocha, ... (more) ..., Comdr, Ibanez, OEM, a canal, Panama!
A man, a plan, a cameo, Zena,

Commentary: After a reader sent in a suggestion, I made 4 changes to the program:

  1. I added the function reversible_words, which finds all pairs of words in the dictionary that are palindromes of other words, such as "Camus" and "Sumac". There all 1100 of these, and adding them all at once helps a little, but not much, because they tend to be found by the search routine anyways.
  2. The old program would only add a new word that is equal or longer than the missing part on the other side. I added a capability called tryharder to add words that are shorter as well. That is, when I'm looking for a word that starts with "aca", I consider "a caddy" and "a canoe", etc., but this change allows me to also consider "A/C". This helps a little, but it also slows the program down a lot.
  3. I made the program faster. Profiling showed that reverse was a bottleneck, so I used the [::-1] idiom. I also dump the results to file every 1000 words rather than every 200, until we get near the end.
  4. I added unit tests, because that's the way things are done these days.

Speed: The program now starts out increasing by 1000 phrases per second, consistently gets over 10,000 phrases in under 30 seconds, and usually gets to 12,000 phrases in the first minute. Things slow down from there, but in ten runs all but one were over 17,000 words, which takes about an hour.

Version 3: 21,012 Word Palindrome

Created: 6-10-2016
Words: 21,012
Letters: 90,439
Phrases: 16,111
Program: or pal3.ipynb.
palindrome:A man, a plan, a caretaker, ... (more) ..., Komarek, a ter, a canal, Panama!

Commentary: When I saw a meme proclaiming the 6-10-2016 palindromic date, I went back to an old idea: maybe I could find a longer palindrome by advancing letter-by-letter rather than phrase-by-phrase. There usually will be a letter that advances both left and right, so I might not have to back up as much. Even better, I could choose first the letter that leads to the most completions of words on both left and right.

It worked! I got a longer palindrome. The resulting program is quite similar to the original version, but I get to use more modern Python data structures (like a Counter, replacing the bisect class.) See the IPython notebook for more on this version.

Speed: This program is about the same speed, but it makes more steady progress.

The Algorithm, Versions 1 and 2

The algorithm I used for versions 1 and 2 is as follows:

(1) Start off with the following seed text, divided into a left and right half:
A man, a plan, a canal, Panama

(2) Find the bit that is not palindromic; that is, not matched by text on the other side. Call that the remainder and color it red: aca. Since the remainder is itself a palindrome, the whole enchilada must be a palindrome. Record it as such, but keep going to try and find a longer one.
A man, a plan, a canal, Panama

(3) Look in the dictionary for words or phrases that begin with aca. Randomly choose one and add it to the left. Remember the others for later. In my example, the program chose a caddy. Identify the remainder again, namely ddy:
A man, a plan, a caddy, a canal, Panama

(4) Now find in the dictionary a word that ends with ydd (the reverse of ddy). The program found Roydd, which is a male first name (although far from a common one). Add it to the right, leaving the remainder Ro:
A man, a plan, a caddy, Roydd, a canal, Panama

(5) Now we need a word starting with or (the reverse of Ro) on the left. The program chose Ore. The remainder, e, is a palindrome, so again we've got a winner. Print it to a file and move on.
A man, a plan, a caddy, Ore, Roydd, a canal, Panama

(6) Keep going back and forth in this fashion, until you have 10 or 15 thousand words. However, at some point we may need to backtrack. For example, suppose the program next chose Belize:
A man, a plan, a caddy, Ore, Belize, Roydd, a canal, Panama

(7) We need a word on the left that starts with zileb, but there is no such word in my dictionary. So we backtrack, erasing Belize and trying another word that ends in e. Fortunately, there are many of these. If there were not, we would have to backtrack another step and erase Ore, replacing it with another word that starts with or.

There are a few more technical details. For example, whenever I choose a word, I need to record it so that I don't use it again (but un-record it when backtracking over it). For details, see the program listing.

The Algorithm, Version 3

For version 3, I switched to a letter-by-letter approach, rather than phrase-by-phrase. I keep track, on both left and right, of the phrases that have been completed, as well as a currently-being-worked-on phrase (one on the left and one on the right) that are not yet complete. These are the two middlemost phrases. So, it might go like this:
(1) Start off with the following seed text, divided into left and right halves, with incomplete phrases in the middle (in this case, the incomplete "aca" on the left, and no incomplete letters on the right). We always maintain the invariant that the concatenation of the letters on the left equals the reverse of the concatenation of the letters on the right (including the letters in any incomplete phrase).
A man, a plan, aca a canal, Panama

(2) Choose a letter to add to both the left and right. I will choose the word that gives the highest number of completed prefixes on the left and suffixes on the right. It turns out this is the letter r:
A man, a plan, acar r a canal, Panama

(3) Again, add the letter that completes the most prefixes and suffixes; in this case e:
A man, a plan, acare er a canal, Panama

(4) At this point we could add another letter, or we could mark the phrase "acare" ("a care") as complete, moving it the completed part of the left-hand-side:
A man, a plan, a care er a canal, Panama

(5) Keep going like this, adding letters, and sometimes moving a current string of letters to the completed phrase list (on either left or right). Just as in the previous version of the algorithm, at some point we may have to backtrack, undoing some moves we have made, and making an alternative choice instead.

Aside: The Problem of Language Induction

When I showed Hoey my palindrome, he said that he was really thinking of sticking with noun phrases of the form "<indefinite article> <noun>."

This is an example of the venerable language induction problem: given the single sentence "A man, a plan, a canal--Panama" as evidence, what language does it define? It seems clear that it consists of a series of any number of noun phrases. That is,

	    S => NP*
But from one example, you can't say much more. Hoey assumed that every noun phrase (except "Panama") must have an indefinite article:
	    NP => IndefArt Noun
	    IndefArt => "a" | "an"
while I was allowing proper nouns to appear anywhere, not just in the final "Panama":
	    NP => ProperNoun
	    NP => IndefArt Noun
	    IndefArt => "a" | "an"
Guy Steele suggested I should also try allowing other noun phrases, such as "two Xs". Gold said the language induction problem can't be solved in general, but others (e.g. Horning, Mooney, Muggleton, de Marcken) have shown that it can be solved if you use a "probably approximately correct" approach rather than a strictly logical formalism. With Hoey in mind, I also generated a solution with all indefinite articles:

Version 2b: 2,211 Word Palindrome with only indefinite articles

Created: 3:00-2003 (I forget what day, exactly)
Words: 2,211
Letters: 5,842
Phrases: 1,106
Program: with a restricted dictionary
Palindrome:A man, a plan, a casa, a bait, a lag, a malt, ... (more) ..., a natl, a mag, a lati, a baas, a canal, Panama!
A man, a plan, a casa, a bait, a lag, a malt,

Commentary: Hoey said he thought that a better word list and a smarter program could get to ten times his 540-word palindrome, using only noun phrases with indefinite articles. I'm pretty sure that will never happen. The problem is a dirth of "a"s. According to Hoey's rules, every phrase must start with the letter "a". That means that either the rest of the word must be an exact reverse of another word (and we know there are 1100 of these) or the phrase must have another "a" in it somewhere, and it must be matched by two or more other phrases. Phrases such as "a man", "a plan" and "a canal" work well because they contain multiple "a"s. Now consider a phrase such as "a biologist". If that appears in the palindrome, then somewhere else the letters "tsigoloib" must appear. But note that those letters must all appear in one word/phrase, because there is no "a", and we only get word boundaries at "a"s. And of course, there is no single word that contains those letters. In general, take a word (such as "an asparagus" or "a biologist"), split it into components around the "a"s (yielding ["n", "sp", "r", "gus"] and ["biologist"]). Collect the set of all such segments, from all the phrases in the dictionary. Now go back through the dictionary, and for each word, see if the reverse of each of its components is in this set. So "an asparagus" is good, because its reversed components all appear in the set: "n" appears in many places (including "an asparagus" itself), "ps" appears as a component in "a psalm", "r" appears in many places (such as "a karat"), and "sug" appears in "a sugar". On the other hand, "a biologist" is no good, because the component "tsigoloib" does not appear.

When I first applied this test, I started with the 69,241-word anpdict.txt (containing only noun phrases starting with "a" or "an"). I checked to see whether each reversed component of a phrase appeared anywhere in any other phrase. That eliminated "a biologist", but it let "a zoom" stick around, because the reversed component "mooz" appears in "a schmooze". Doing this level of reduction gets us down to a dictionary of 11,065 phrases. But on reflection I realized that not only does each reversed component have to appear in some word, it must appear as a component of some word. The fact that "mooz" appears in "a schmooze" is not good enough. That would only work if "a zoom" could be followed by the letters "hcsa", which of course it cannot; it must be follwed by the letter "a". Using that stricter test, we get down to only 4,528 noun phrases that are actually usable. So to get a 5,400 word palindrome we would need to start with a bigger dictionary.

Speed: The program consistently generates palindromes of over 2000 words in under 10 seconds using the 4,528 word dictionary. It doesn't go much beyond that.

Version 3b: 2,473 Word Palindrome with only indefinite articles

Created: 6-10-2016
Words: 2,473
Letters: 6,787
Phrases: 1,237
Program: or pal3.ipynb.
Palindrome:A man, a plan, a caret, a rec, ... (more) ..., a banana, a nan, a macerater, a canal, Panama!

Peter Norvig, 20:02 02/20 2002 See some comments on this page.