Alexa Toolbar and the Problem of Experiment Design

I wrote elsewhere about the problem of innumeracy: making gross errors in the use of statistics and other numbers due to lack of common sense. But even if you understand the numbers well, you still have to worry about garbage in, garbage out. And sometimes garbage results from even the best ingredients.

Consider the problem of comparing traffic to internet sites. Most sites keep their traffic numbers secret, so you need to rely on third parties that monitor a sampling of traffic. One such third party is Alexa, with its traffic rankings. If you download a toolbar from Alexa, your visits are tracked anonymously, and the aggregate statistics are available for all to see. As Alexa explains, there are some biases inherent in this process: sites associated with Alexa are overrepresented, sites that use the https protocol are underrepresented, and so on. But one bias they don't really comment on is selection bias: the data would be good if it truly represented a random sample of internet users, but in fact it represents only those who have installed the Alexa toolbar, and that sample is not random. The samplees must be sophisticated enough to know how to install the toolbar, and they must have some reason to want it. It turns out that the toolbar tells you things about web sites, which makes it useful to people in the SEO (Search Engine Optimization) industry, so the sample overrepresents those people.
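To see how a non-random sample can distort a comparison, here is a minimal sketch in Python. Every number in it is made up for illustration (the group sizes, the install rates, the visit rates, and the two sites "A" and "B" are all hypothetical); it is not a model of Alexa's actual user base.

```python
# Hypothetical two-group population. All figures are invented for illustration.
groups = {
    "seo_enthusiasts": {"share": 0.01, "install_rate": 0.50,
                        "visits_a": 10.0, "visits_b": 0.5},
    "everyone_else":   {"share": 0.99, "install_rate": 0.005,
                        "visits_a": 0.05, "visits_b": 1.0},
}

def traffic(site_key, toolbar_only):
    """Average visits per user to a site, over all users or toolbar users only."""
    total = 0.0
    for g in groups.values():
        # Toolbar sampling weights each group by how likely it is to install.
        weight = g["share"] * (g["install_rate"] if toolbar_only else 1.0)
        total += weight * g[site_key]
    return total

# Ratio of site A traffic to site B traffic, measured two ways.
true_ratio = traffic("visits_a", False) / traffic("visits_b", False)
sampled_ratio = traffic("visits_a", True) / traffic("visits_b", True)

print(f"A/B among all users:     {true_ratio:.2f}")
print(f"A/B among toolbar users: {sampled_ratio:.2f}")
print(f"distortion factor:       {sampled_ratio / true_ratio:.1f}")
```

With these invented inputs, site A looks far more popular than site B among toolbar users even though B gets more traffic overall. The sampled visits themselves are counted correctly; the distortion comes entirely from who is in the sample.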

For example, let's look at the log stats for my site and for the sites of some friends who have recently published their stats for 2006. We list the actual number of visits and pageviews, and the Alexa numbers for reach and pageviews. Now I realize that this is not a very scientific study: I selected a few of my friends who were willing to share numbers; I didn't select sites at random. The numbers we are sharing are not the same: do RSS feeds count as page views? What exactly counts as a visit? Each of our log packages may have different answers to these questions. (Alexa has more on the perils of interpreting and comparing log data.) You should consider the numbers below as suspect, with an error range of maybe ±30% or even ±50%.

Despite these caveats, the difference in the stats is quite profound. For example, I get about twice the pageviews of Matt Cutts, but his Alexa pageview ranking is about 25 times higher than mine. (I got this by looking at the 1 year, most highly smoothed graph on Alexa, and then squinting to guess at the mean, so that's yet another source of error in my study.) What that means is that people with the Alexa toolbar installed are 25 times more likely to view a page on Matt's site than on mine, but overall, users view twice as many pages on my site. That's a 50 to 1 difference introduced by the selection bias of Alexa. Presumably this is because Matt's site is really appealing to a core group of SEO enthusiasts, many of whom also like the Alexa toolbar.
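The 50 to 1 figure is just the product of the two ratios. A quick check in Python, using the rounded pageview figures from the table at the end of this article (which carry the error ranges noted above); the variable names are mine:

```python
# Rounded figures from the table; both are approximate.
my_pageviews = 5.7e6      # actual pageviews, my site
matt_pageviews = 2.9e6    # actual pageviews, Matt's site
my_alexa = 1              # Alexa pageviews figure, my site
matt_alexa = 25           # Alexa pageviews figure, Matt's site

actual_ratio = my_pageviews / matt_pageviews   # about 2 in my favor
alexa_ratio = matt_alexa / my_alexa            # 25 in Matt's favor

# The two ratios point in opposite directions, so the combined
# discrepancy between Alexa and the logs is their product.
bias = alexa_ratio * actual_ratio
print(f"actual ratio {actual_ratio:.2f}, Alexa ratio {alexa_ratio:.0f}, bias {bias:.0f} to 1")
```

Given the ±30% to ±50% uncertainty in the inputs, "about 50 to 1" is as precise as this estimate deserves to be.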

The point of this is not that Alexa is a bad tool: for many uses it is a very good one. The point is certainly not to quantify the error in Alexa sampling. This is a small, unscientific, anecdotal look at the data; nothing like a reliable experiment if your goal is to estimate how reliable Alexa is in general. The point is to keep in mind, the next time you see a statistic on web usage (or any statistic), that the results are only as good as the selection process that brings in the data.

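Site                   Actual visits  Alexa reach  Ratio   Actual pageviews  Alexa pageviews  Ratio
(this site)            2.1M           25           11.9    5.7M              1                0.17
(site name missing)    4.1M           125          30.5    9.8M              4                0.41
Matt Cutts             1.7M           1100         647     2.9M              25               8.6
Jeremy Zawodny         ?              200          ?       3.6M              5                1.4
Geeking with Greg      0.26M          40           153     0.4M              1                2.5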

Addendum: Ironically, after Matt Cutts blogged about this article, Alexa was quick to pick it up, and again, over-estimated the impact. If you look at my Alexa traffic ranking (click on the "Page Views" tab), you see a big spike on March 5 when Matt mentioned this article. According to my logs, I got 1999 pageviews of this page that day (thanks, Matt!) for a total of 26,719 page views. There's another smaller spike, about 1/4 the size, around December 10. According to my logs, I got 14,356 page views of my Teach Yourself Programming essay that day, due to a reference on digg (thanks, Kevin Rose!), for a total of 26,506. If Alexa were sampling accurately, the March spike should be 1% higher than the December spike, but in fact it is about four times as high.
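The comparison of the two spikes can be checked with a few lines of Python. The page view totals are the ones from my logs quoted above; the factor of 4 is my rough reading of the Alexa graph, and the variable names are just for this sketch:

```python
march_total = 26_719      # total page views on March 5, from my logs
december_total = 26_506   # total page views around December 10, from my logs

# How much bigger the March spike should look if Alexa sampled accurately.
expected_increase = march_total / december_total - 1
print(f"expected increase: {expected_increase:.1%}")

# Rough reading of the Alexa graph: December spike is about 1/4 the March spike.
observed_ratio = 4.0
overstatement = observed_ratio / (march_total / december_total)
print(f"Alexa overstates the March spike by a factor of about {overstatement:.1f}")
```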

I also noticed that Alexa provides a neat tool for comparing sites. Here are the pageview graphs for all the sites mentioned above, and a second graph that leaves one site out so that you can see the others more clearly.

Peter Norvig