Alexa: Web Search Platform

Alexa's opening up of its Web Search Platform is an interesting gambit, but unlike John Battelle, I'm not sure it's a game changer. At the very least, the jury has yet to be impaneled, much less sent out.

Browsing the Alexa Web Search User Guide, I see lots of serious support for distributed/parallel programming in there. Unfortunately, there's a good bit to learn, and only a small number of people in the world are really equipped to make even decent use of it, especially since your mistakes cost real money. Real distributed/parallel programming at these scales is dang hard!

Also, a major impediment is the reliance on Alexa's web crawler, whose operation is opaque as far as I can tell. Innovators are thus at the whim of whatever the crawler brings back. Maybe a first set of projects to be commissioned on the Alexa indices and repositories should be an empirical study of what's actually in them. For example, it might be sort of irritating in some circles if the crawler doesn't attempt to crawl URLs ending in .php. That would not be an unreasonable limitation for a planetary-scale web crawler. And if you think there's even a large (> 20%) portion of the Web's content in there, you're delusional.
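To make the "what's actually in there" question a bit more concrete, here is a minimal sketch of the kind of tally such a study might start with: the share of URLs per path extension in a sample. Everything here is an assumption for illustration. The function name `extension_shares` and the sample URLs are made up, and nothing below uses the Alexa platform's actual API; a real study would pull its URLs from the crawl itself.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit


def extension_shares(urls):
    """Return the fraction of URLs per path extension ('' means no extension)."""
    counts = Counter()
    for url in urls:
        # Look only at the path, so query strings like ?t=42 are ignored.
        path = urlsplit(url).path
        ext = splitext(path)[1].lower()
        counts[ext] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ext: n / total for ext, n in counts.items()}


# Hypothetical sample, standing in for URLs read out of a crawl.
sample = [
    "http://example.com/index.html",
    "http://example.com/forum/viewtopic.php?t=42",
    "http://example.org/about",
    "http://example.org/news/story.PHP",
]
shares = extension_shares(sample)
print(shares.get(".php", 0.0))
```

If the .php share in a large sample came back near zero, that would be a decent hint that the crawler is skipping those URLs, though it wouldn't say why.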

On the other hand, the project is a highly visible rent-a-grid project with access to some major non-scientific datasets. The platform is probably a bit of a boon to information retrieval and web search researchers, especially those who can't round up the grant money for their own grid clusters, or pinch some colleagues for access. Not to mention the system administration has been outsourced.

One other minor nuisance: I suspect there'll be some hair on a platform application that starts redistributing material from the crawls wholesale. I'm pretty sure this is a copyright violation, and if you draw the gaze of the wrong folks, let's just say you'd better have some good lawyers.

© Brian M. Dennis.