Menezes: Interactive Focused Crawling

Posted on: Wed 06 December 2006

While I enjoyed reading Soumen Chakrabarti's papers on focused crawling, I never got a sense of the dirty details needed to implement such a crawler. I couldn't quite grok the iVia Nalanda source either. The MTech thesis (PDF) of Roger Menezes, a Chakrabarti student, revealed a little more to me. The thesis also explores the potential for "desktop scale" focused crawlers. This interests me because I have a hunch that tagging and aggregators could serve as good mechanisms for interacting with a personal focused crawler.

Johnny's Brain: GReader Tips

Posted on: Wed 06 December 2006

Being a new GReader convert, I'm on the lookout for various hacks to improve the reading experience. Johnny's Brain has some useful tips and tricks, including a slick GreaseMonkey hack that adds a little indicator to sites if you've subscribed to them in GReader.

NMH: GooTube + Ads

Posted on: Tue 05 December 2006

Just thinking out loud based on the allegation that pre and post roll advertising on Web videos won't fly. Wouldn't a Google class problem be, using organic metadata only, place new forms of advertising within the actual video? Automatically identify good cut points, come up with a new unobtrusive interface to alert viewers to sexy comm...er...potentially useful information, invent a few compatible, yet effective ad forms, and develop a new economy for contextual ad placement.

And of course, do it all at Web scale.

Think of it as AdSense for Web videos.

I'm not sayin', I'm just sayin'.

NMH: GReader v Bloglines

Posted on: Tue 05 December 2006

The battle is over. In this theater, Google Reader has won the day.

In Bloglines' favor:

Visible feed subscriber counts
Automatically picks up redirected feeds better (I think)
Autorefresh works better, the subs panel and items view stay in sync automatically
Integrated search

In GReader's favor:

No phantom saved items
Starring is sweet
Reading feels a hell of a lot faster; collapsed headlines mitigate my discomfort with River of News style reading.
Better at false duplicates
Saving items is easy, saved items are out of the way (although I get a demerit for never getting into Bloglines keyboard shortcuts)
Tagging of feeds and items
Holding out hope for a GData based GReader API.

I find GReader suits my information foraging needs much better.

fwicki: Feed Mastering

Posted on: Tue 05 December 2006

fwicki sounded good at first. Digging deeper, I'm trying to find out what the win is over tagging a group of feeds in Google Reader and having a custom feed generated from the tag. In fwicki there's a whizzy AJAX interface for grouping feeds, you have some control on the output flow of the feed "mashup", and there's a way to inject ads into/around the resulting feed. That's about it as far as I can tell.

Still waiting for a feed aggregator geared towards information foragers.

McGough: All Things IMAP

Posted on: Tue 05 December 2006

A few weeks ago, I was investigating sophisticated IMAP hosting solutions. I now stand in awe of Nancy McGough's comprehensive coverage of IMAP providers, clients, and servers.

NMH: Verizon +1

Posted on: Mon 04 December 2006

I generally view choosing services from Big Telecom (TM) as deciding amongst the least worst, but I have to +1 a couple of services from Verizon. First off, if it's offered in your community, get Verizon FiOS (fiber optic) service. Big pipes, mucho gusto. Second, Verizon's Wireless BroadbandAccess works pretty well and eliminates wardriving for Wi-Fi. I didn't want to get locked in to a two year contract, so my tab runs about $80 a month, although you can chop the price considerably. But mostly I've been surprised at the coverage in the Eastern corridor, which has been pretty reliable. At worst, the feel is like old dialup connectivity, and at best it seems like low end DSL. This is from the middle of the street, on Amtrak, in that coffeehouse that doesn't offer Wi-Fi, the food court, the back of the car, the United terminal at Dulles, etc. etc. Bonus: having a PC card means not having to figure out how to pair your cell phone with your laptop.

N. b. "all you can eat" means 5 GB per month to Verizon. Like I said it's Big Telecom.

PyCon: Talks 2007

Posted on: Mon 04 December 2006

The accepted talks at PyCon 2007, Feb 22 through Mar 1 in Dallas TX, show a pretty interesting range of topics. Everything from compilers to testing to web frameworks to women in CS. Just scanning the talks, I'd bin them in the following rough categories: education, Python internals and infrastructure, web programming, programming language design and implementation, software engineering, application case studies. Definitely looks like good value for the money.

Yahoo! Research: Berkeley Blog

Posted on: Wed 22 November 2006

The Berkeley based arm of Yahoo! Research has a blog. Their current public projects focus on geotagged images and movie remixes. Despite the small posting sample, there's quite a few interesting folks there, and it'll be interesting to see how their work surfaces over time.

Rajaraman & Ullman: Data Mining Lectures

Posted on: Mon 20 November 2006

Link parkin': Stanford's CS345, Autumn 2006: Data Mining handouts. Vetted by Greg Linden, who knows a little about this stuff.

NMH: Aggregator Clustering

Posted on: Mon 20 November 2006

These past few days have highlighted for me why a little bit of item clustering would be useful in my aggregator. I read a goodly number of feeds, but I'm not an outlier. There were pretty distinct bursts of: Zune, Wii, PS3, and Peanut Butter. Just pull 'em all together, let me survey them in one fell swoop, and then I can get to the other distinctive stuff.

C'mon it can't be that hard.
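
To make the "can't be that hard" claim a bit more concrete, here is a rough sketch of burst clustering over item titles. It is deliberately naive: it groups items under any keyword that shows up in at least two titles, and everything else falls through as leftovers. The function names, the stopword list, and the sample titles are all made up for illustration; a real aggregator would want something sturdier (TF-IDF, item bodies, time windows).

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "and", "the", "of", "on", "in", "to", "for", "is", "at", "with"}


def keywords(title):
    """Lowercased word set for a title, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def cluster_by_burst(titles, min_items=2):
    """Group titles under the most widespread keyword they contain.

    A keyword counts as a burst if it appears in at least min_items titles.
    Returns (clusters, leftovers).
    """
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(keywords(title))

    clusters = defaultdict(list)
    leftovers = []
    for title in titles:
        candidates = [w for w in keywords(title) if doc_freq[w] >= min_items]
        if candidates:
            # Most widespread term wins; ties break alphabetically.
            top = min(candidates, key=lambda w: (-doc_freq[w], w))
            clusters[top].append(title)
        else:
            leftovers.append(title)
    return dict(clusters), leftovers
```

Feeding it a handful of Zune and Wii headlines plus one unrelated item should yield one bundle per product and leave the odd item out for normal reading.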

MIT VI: Curriculum Revamp

Posted on: Fri 17 November 2006

I'm really late to the party on this one, but MIT Course VI, the EECS major, is revamping the curriculum. The nugget that caught my eye, in a posting from Lambda The Ultimate, was the potential removal of Scheme from the intro programming course. As a VI-III alum, I have but one thing to say:

SACRILEGE!! Even worse the charge is being led by Hal Abelson, one of the authors of the Wizard book.

Actually, the changes aren't that bad. Looks like they're putting two new intro courses, which integrate EE and CS more tightly, at the start of the program. The programming for these courses is done in Python, and that's a good thing. Meanwhile, good old 6.001 through 6.004 will be substantially unchanged, but folded into a bundle of 7 courses that serves as the second phase of the degree. Finally, advanced design / capstone classes, approaching graduate level difficulty, complete the studies. And this type of curriculum design is the kind of thing Abelson is really good at.

[Via Lambda The Ultimate]

Bryant: Motionbox Review

Posted on: Wed 15 November 2006

Motionbox is a web based video editing application targeted at casual video users. One of the neat aspects is the use of tagging to mark selected segments of video, helping search, discovery, and navigation. Their player also supports really rapid scanning/preview of a video reducing exploratory costs. Steve Bryant has more information on Motionbox, including some quotes from the CEO, Chris O'Brien.

[Via Lost Remote]

Frakes: Scripting iTunes

Posted on: Tue 14 November 2006

Dan Frakes overviews the comprehensive Doug's AppleScripts for iTunes archive, picking out some highlights of the over 400 scripts available. A decent AppleScript interface lets prosumers overcome a number of shortcomings with iTunes.

NewsCloud: APIs

Posted on: Tue 14 November 2006

If NewsCloud fails, it won't be because its web services APIs aren't thorough, with good usage examples to boot.

Bloglines: Playlists & Glimpses

Posted on: Mon 13 November 2006

Bloglines has actually been working on their user interface recently. They have two new features, Playlists and Glimpses. Playlists look to be lightweight groupings of feeds and Glimpses give you quick, tooltip style, overviews of Playlists. I'm not sure how much of a UI advance this actually is, but the features are welcome experimentation in feed aggregators.

The irony is that I read the notice of Playlists and Glimpses in Google Reader, which is starting to grow on me. I'm feeling like my Bloglines days are numbered, but the jury is still out.

Snell: Getting to Know APP

Posted on: Sat 11 November 2006

Link parkin': James Snell's "Getting to know the Atom Publishing Protocol", Part 1 and Part 2. A good intermediate overview of APP, including concrete examples using various systems such as Blogger, Roller, GCal and GBase. Nice emphasis of how to write into APP stores.

[Via the Google Code Blog]

Briggs: WSGI Overview

Posted on: Fri 10 November 2006

WSGI is Python's answer to Java's servlets, just not as "enterprisey". I've been doing some tinkering developing a WSGI web app, but was highly confused about which servers support WSGI conformant apps and how they did so. Uche Ogbuji's introduction to WSGI is quite good, especially about the underlying design principles, but doesn't quite connect the deployment dots.

I found Jason Briggs' "Getting Started with WSGI" to be a nice complement as a slightly more basic introduction. And it touches a little on the webservers that support WSGI.
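
For anyone else connecting the dots, here is a minimal sketch of what a WSGI conformant app boils down to: a callable that takes `environ` and `start_response` and returns an iterable of byte strings. It uses only the standard library's `wsgiref`. The names `hello_app` and `call_app` are mine, purely for illustration, and `call_app` is just a stand-in for what a real server does when it invokes the app.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    """A WSGI app: a callable taking (environ, start_response)."""
    body = ("Hello from %s\n" % environ.get("PATH_INFO", "/")).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # an iterable of byte strings


def call_app(app, path="/"):
    """Invoke a WSGI app in-process, roughly the way a server would."""
    environ = {}
    setup_testing_defaults(environ)  # fill in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

To serve the same callable over HTTP, `wsgiref.simple_server.make_server("localhost", 8000, hello_app).serve_forever()` should do for tinkering; any other WSGI server would be handed the same `hello_app` object.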

Gruber: Stikkit Review

Posted on: Fri 10 November 2006

If you're wondering about Stikkit, a new, well publicized, Web PIM tool, John Gruber has a thorough review. Summarizing his final analysis, the overall premise of the tool is flawed, but it does have some nice execution in places.

Getty Images: 10ways

Posted on: Thu 09 November 2006

This set of neat interactive takes on digital images appeared over the summer, as opposed to my thought that it surfaced a few years ago, which stopped me posting about it. Didn't seem to get much play in the blogosphere, so what the heck. Sponsored by Getty Images, Sumona's information interactive is the most intriguing, striking me as a particularly interesting way to navigate a pile of digital images. Wonder what it would look like with a) custom collections and b) the additional goal of surfacing social information about the photos? Presuming of course, you had social information for the photos.

Carmo: Bayesian Aggregation

Posted on: Thu 09 November 2006

Rui Carmo has integrated Bayesian filtering with RSS aggregation to improve his feed reading. So far the results look pretty good. His setup involves receiving new RSS items as SMTP e-mail messages, (e-mail, good old e-mail) which supports highly tuned and customized user interfaces along with automated scripts to process the messages in the background. At the same time it sounds like it was pretty easy to whip out a web interface to label messages if needed.

Any web or desktop aggregator with sophisticated APIs and a dirt cheap labeling mechanism (probably tagging) could replicate this to good effect. Then again I can't really think of any such combinations.

But I would be glad to be enlightened.
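
To give a feel for how little machinery the Bayesian part needs, here is a toy naive Bayes filter over item text. This is not Rui Carmo's setup, just a sketch of the general technique; the class name, the "keep"/"skip" labels, and the add-one smoothing are my own choices for illustration.

```python
import math
import re
from collections import Counter


class ItemFilter:
    """Toy naive Bayes classifier for feed items (labels: keep / skip)."""

    LABELS = ("keep", "skip")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.item_counts = Counter()

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Record one labeled item."""
        self.word_counts[label].update(self.tokens(text))
        self.item_counts[label] += 1

    def score(self, text, label):
        """Log probability of the label given the text, up to a constant."""
        counts = self.word_counts[label]
        vocab = set()
        for c in self.word_counts.values():
            vocab.update(c)
        total = sum(counts.values())
        n_items = sum(self.item_counts.values())
        # Smoothed log prior for the label.
        logp = math.log((self.item_counts[label] + 1) / (n_items + len(self.LABELS)))
        for tok in self.tokens(text):
            # Add-one smoothing, with one extra slot for unseen words.
            logp += math.log((counts[tok] + 1) / (total + len(vocab) + 1))
        return logp

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.score(text, label))
```

The dirt cheap labeling mechanism is then whatever calls `train`: a tag, a star, or a message moved to a folder.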

Rosenberg: Launch Pad

Posted on: Wed 08 November 2006

Watching the Web 2.0 Summit coverage in the blogosphere has been entertaining. I particularly liked Scott Rosenberg's summary of the Launch Pad, a runway show of recent startups. Short and to the point.

What's up with Sphere making an appearance though? Isn't the company a little mature for these types of things? I sense they have little to no traction, and they even nuked their little Ajaxy time widget. Weird.

Rao: REAP

Posted on: Tue 07 November 2006

REAP stands for

Retrieve — collect information from a variety of sources
Extract — extract data, facts, examples
Arrange — arrange documents and facts for use now or later
Present — compose information into artifacts of value

Which is Ramana Rao's model of what needs to be addressed "beyond search". In the comments, Mark Stefik chimes in with some tantalizing hints at work PARC has done with intelligence analysts. I wonder if these are business or security analysts? In any event, he suggests that tools for analysts fit within some broad strokes but differ significantly in the details. Contrast with Yahoo!'s consumer oriented FUSE vision.

[Via Knowledge Jolt with Jack]

Grudin & Russell: Using Information

Posted on: Tue 07 November 2006

Jonathan Grudin and Dan Russell, two CHI and CSCW giants, are running the second edition of their mini-track at HICSS 40. The mini-track title is Using Information: New Technologies, Ways & Means, and they've got a blog up which includes the accepted papers for the mini-track (along with the Persistent Conversation papers) as well as pointers on how to have a productive time at HICSS. This is from Grudin, who is a HICSS vet, so it's well worth the read.

In case it's not clear, I highly recommend HICSS because you get access to some bigwigs in a relatively relaxed environment, and the acceptance rate isn't ridiculously discouraging. Granted, it's expensive, but it's a nice starting point for beginning academics.

Adobe: Open ActionScript VM

Posted on: Tue 07 November 2006

Interesting. Adobe open sourced the virtual machine for ActionScript, the programming language of the Flash 9 player. The Mozilla Foundation has picked it up and started the Tamarin project around the vm, possibly to use it as a JavaScript engine in future versions of Firefox.

What I'm wondering, does this make it easier to target the Flash player from compilers for other languages? For example, could you compile a limited version of Python to this VM? Or a pedagogical language in a compilers class?

Also, this vm might be useful as a restricted execution environment within other languages.

Hurst: On Powerset

Posted on: Mon 06 November 2006

I can't share Matthew Hurst's enthusiasm for Powerset, simply because I haven't been tracking the new NLP based search engine at all. But I did like how he teased a new direction for search innovation:

There are two key things here: the use of NLP and the disruption to the search interface. Finally, information retrieval will actually mean information retrieval, not document retrieval. One of the fundamental models of search that may be challenged in this new world is the fact that search engines are designed to take people to pages. The more we can understand and summarize the information on those pages, the weaker this model becomes and consequently advanced methods may herald a fundamental change in the 'search' business (which will need a new name pretty soon).

I don't know if the keyword/document search model is dead, as some folks are claiming, but I know there's a whole lot of ways to organize the world's data that haven't been tried yet. Some of them might work.

NMH: Google Reader Microreview

Posted on: Sun 05 November 2006

Following up on some events that happened during my hiatus, I decided to start kicking the tires on the new version of Google Reader. I'm sort of digging it. Being absolutely stuck in the "River of News" format would drive me up a wall, but the ability to collect subscriptions using tags still means I can prioritize bundles of new items. If item marking and saving is markedly better than Bloglines, I'll probably switch over, especially if a Google Reader API ever materializes. I, and a fair number of other folks, use our aggregators as information foraging tools as well as news readers.

Weird. Take two months off, subscribers go up, according to Bloglines.

Lundh: Python with & for/in

Posted on: Sat 04 November 2006

Nice overviews of how Python's with statement and for/in statement work, including their extensible nature, by Fredrik Lundh.
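
As a quick reminder of the extensible part, here is a small sketch of the two protocols: `for`/`in` calls `__iter__` on the object and pulls values until the iterator is exhausted, while `with` calls `__enter__` on the way in and `__exit__` on the way out. The class names are invented for this example and aren't taken from Lundh's articles.

```python
class Countdown:
    """Iterable: for/in calls __iter__ and then pulls values from it."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        n = self.start
        while n > 0:
            yield n  # a generator function gives us the iterator for free
            n -= 1


class Logged:
    """Context manager: with calls __enter__ on entry, __exit__ on exit."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.log.append("exit")
        return False  # do not swallow exceptions


def demo():
    """Run both protocols and return the order in which things happened."""
    log = []
    with Logged(log):
        for n in Countdown(3):
            log.append(n)
    return log
```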

Chi & Pirolli: Social Information Foraging

Posted on: Fri 03 November 2006

Ed Chi and Peter Pirolli, the latter of information foraging theory fame, are starting to investigate how the new plethora of social navigation and media tools intersect with the information foraging model. "Social Information Foraging and Collaborative Search" (PDF) doesn't have a whole lot of meat on it, but hints at some interesting ways to model how groups of people search for information together.

Kennedy: Docuticker

Posted on: Thu 02 November 2006

+1 on Shirl Kennedy's Docuticker

DocuTicker offers a hand-picked selection of resources, reports and publications from government agencies, NGOs, think tanks and other public interest organizations.

I've been subscribed to the webfeed for a while and it's the definition of serendipity. Where else can you get rigorous documents on the impact of sports in the public sector?

Viégas & Wattenberg: Communication-Minded Visualization

Posted on: Wed 01 November 2006

Back in May, at the SIGCHI Social Viz Workshop, I cornered Martin Wattenberg for a few moments and bounced the following idea off of him. One of the interesting results of the Baby NameVoyager was that the visualizations had become social media: artifacts to be created, discussed, and shared in groups. This despite zero support for collaboration built into the system. I also posited that there could be a lot of interesting research issues to follow pursuing this line of thinking.

Wattenberg and his colleague Fernanda Viégas seem to be picking up the ball, and have issued a manifesto on Communication-Minded Visualization (PDF). I don't know if I planted a seed (actually I highly doubt it since Martin and Fernanda are both twice as smart as me and were probably already headed in this direction), but I'll be interested to see if this coalesces into a research community. There's lots of fun challenges, from the systems aspects of just getting something to work, to the visual and interactive design issues, to the investigation of social effects given such visualizations.

In addition to reading the paper, you can also get an audio recording and slides from Fernanda's presentation on "Democratizing Visualization" at idea2006.

I know what you're thinking. All I can say is Life Changes (TM). Big Time. Things seem to be easing up a bit but no guarantees.

Amazon: Elastic Compute Cloud

Posted on: Thu 24 August 2006

Well I wasn't too far off a month ago when I hypothesized Amazon providing a commodity virtual machine service.

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

...
You have complete control of your instances. You have root access to each one, and you can interact with them as you would any machine. Each instance predictably provides the equivalent of a system with a 1.7Ghz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network bandwidth.

I was off a couple of factors on the memory, an order of magnitude on disk, and probably a factor of 2 on CPU speed. I got the pricing just about right and even hit on the bandwidth discount for communicating with S3. Simple web based specing and ordering of VMs got lost a bit in the need for security, but I can't fault Amazon as they have to make sure the service doesn't become the spawn of horrifically spamming botnets.

I didn't realize it at the time, and many in the blogosphere are having the same incorrect interpretation, but EC2 isn't supposed to knock out private virtual and dedicated server solutions. EC2 is really for folks who need to build a cluster of machines but don't have the sysadmin staff, rackspace, and funds for hardware, available. For a mom and pop LAMP server running a lightly used Web site, EC2 is probably overkill. For a small Web startup short on cash and people, EC2 might be an attractive alternative to building your own data center.

Put it all together and you've got cheap, powerful computational units (VMs), reliable distributed messaging (SQS), and inexpensive massive storage (S3), all from one vendor, with reasonable (not easy, but reasonable) programmatic APIs. Interesting times.

Lundh: ElementSoup

Posted on: Wed 23 August 2006

Fredrik Lundh combines the elementtree module with BeautifulSoup to make grokking arbitrary HTML a bit more Pythonic.

Krohn, Kohler, Kaashoek: Events Can Make Sense

Posted on: Tue 22 August 2006

Link parkin': Events Can Make Sense, by Max Krohn, Eddie Kohler, and Frans Kaashoek.

Events have earned a reputation for confounding programmers. This work advocates new language features (called tame extensions) that simplify event programming in languages such as C++. Like multithreaded programs but unlike typical event programs, tamed programs use standard control flow constructs, and automatically-managed local variables. In practice, tame is implemented with a C++ source-to-source translator, without compiler modifications, platform-specific support, significant runtime overhead or major semantic compromises. The implementation is backwards-compatible with an existing event library and has already found application in class projects, open-source projects, and production Web sites. Developers on these projects report that the tame approach is a significant improvement over the event-driven status quo.

[Via Hack the Planet]

Kochanski: Tunesafe

Posted on: Mon 21 August 2006

Martin Kochanski's hypothetical tunesafe application is a pretty interesting thought exercise on how to make money off of Amazon's S3 by backing up people's iPod tunes.

[Via Jeff Barr]

Gafter: Java Closures

Posted on: Sun 20 August 2006

If Neal Gafter et al.'s proposal has any legs, looks like Java will eventually get closures, although the static typing doesn't make them look like a whole lot of fun to use.

[Via Lambda the Ultimate]

del.icio.us: New Features

Posted on: Sun 20 August 2006

For a time there, del.icio.us seemed to be pretty moribund. Recently, there's been a spate of changes including: a new frontpage, secure API access, advertising, and promoting active users. Joshua Schachter, the head del.icio.us honcho, is pretty conservative, but maybe we'll see some interesting new innovations in the tagging arena. Other than developers baking tagging into applications for buzzword compliance, I haven't seen much new in terms of how tagging is designed as a user interface element.

NMH: Web App Details Please

Posted on: Sat 19 August 2006

Here's a few web application features that I think would be fairly useful to have a shared body of knowledge about. Put another way, I'd like to see how someone else designed and implemented them so I can rip them off:

Invitation systems à la Gmail when it first started. To my eye, invitations have a surface triviality that probably goes away when you have to manage the rationing of invites. Wonder what it really looks like inside. Any small web startup could benefit from having such a system.
Flickr's authentication system for allowing third party applications to perform operations, some dangerous, on a user's behalf. Their scheme also supports revocation of permissions.
Yahoo! style rate limiting, to prevent morons from taking down your shiny new web services API.
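
On the rate limiting item, I don't know what Yahoo! actually runs, but the textbook starting point is a token bucket: each API key gets a bucket that refills at a steady rate, and a request goes through only if there's a token left. A minimal sketch, with the class name and the injectable `clock` parameter being my own inventions for illustration:

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, without overfilling.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a real service you'd keep one bucket per API key or client IP (a dict would do for a single process) and answer with an HTTP error when `allow()` comes back False. The hard parts I'd still like to see written up are the shared state across servers and the policy, not the arithmetic.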

Horowitz: On Interestingness

Posted on: Fri 18 August 2006

Bradley Horowitz writes a bit on Flickr's interestingness. While Findory's Greg Linden focuses on "getting the crap out", Horowitz notes that Flickr embraces the junk:

“So Flickr is a system that accommodates taking a ‘worthless’ picture of a hangnail, or a breathtaking Ansel Adams-like landscape. The cool thing is that while creating a frictionless environment that serves both scenarios, we can also determine which of the two is likely more ‘interesting’ to the community at large.”

Of course this is a false dichotomy as Findory attempts to surface personalized interesting bits too and does it with a lot less user information than Flickr. Either way I'm more and more convinced that for systems at Web scale, the real problems involve dealing with noise, which neatly captures cheaters, morons, and minimal information.

MacAskill: smugmug + S3

Posted on: Wed 16 August 2006

I may not have to eat my words regarding S3, but the scent of them cooking is definitely starting to waft. Don MacAskill, the CEO of smugmug, describes another case study of commercial usage of Amazon's Simple Storage Service. Lots of business advantages therein, but I will point out that smugmug doesn't completely bet the farm and dispense with its own storage. Maybe that's the right tradeoff though. Develop an in-house storage architecture and carefully use S3 as a "redundant secondary storage for use in cases of outages, data loss, or other catastrophe."

Still waiting for the game changing, highly interactive, AJAXy, Web 2.0 application built using S3 as its main storage system.