Coates: Blogdump

Tom Coates is providing a complete archival dump of his long-running blog. If enough folks do the same, it could provide a lot of grist for blog tool developers. Think what kinds of interesting studies would be enabled by having clean captures of a significant number of the top 100 blogs, however you decide to rank them.

One of the issues facing blog researchers who want to do empirical studies is simply getting raw data, especially of a historical sort. Writing crawlers seems simple on the surface, but let me tell you, once you decide to scale to any significant number of blogs, things get hairy fast. If you want to get full archives, you'd better have IBM-class skills and resources.

The real scotcher, though, is redistribution. It should be obvious that if a researcher collects their own archives of publicly available pages for private analysis, things are fine. If the copyright isn't clear, though, you could potentially get into hot water giving out someone else's content. Maybe there are enough Creative Commons sites out there, though, to make a start.

I wonder how the TREC folks get around this issue? I know they have a Web track and sets of data.

Anyhoo, coordinating the collection, verification, and redistribution of volunteered blogdumps would be a good task for some enterprising academic.

© Brian M. Dennis.