Forgot to mention in passing that Intelliseek has made a largish dump of blog crawl data available. It's about 3 weeks of data from July 2005, a period that includes the London bombings.
Terms of access (PDF) aren't too heinous, other than the "stop using this stuff after the workshop" clause. That sort of puts the screws to using the dataset for any cool demos: as soon as you show off at the workshop and get people hot and bothered about how your system really does scale, you'll be stuck saying, "It's really cool on the Intelliseek dataset, but all I can show you now are screenshots," for the rest of WWW2006. Bummer.