Link parkin': Yahoo!'s visualization of the "top 100 Web moments" according to them.
[Via infosthetics]
In addition to his entertaining series of posts on the early Amazon days, Greg Linden also had a couple of other good dispatches. One, on RSS suckage, points out that aggregators need to proactively help folks cut through the noise. Another is just a nice pile of links to the information retrieval research literature, a couple of which were new to me. I've been working my way through Managing Gigabytes. I don't really need all the compression, but the stuff on indexing and retrieval is great.
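The book covers far more sophisticated compressed index structures, but the core idea is small enough to sketch. A toy inverted index in Python, purely for flavor; this is my own strawman, not anything lifted from the book:

    from collections import defaultdict

    def build_index(docs):
        """Map each term to the set of document ids containing it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def query(index, *terms):
        """Conjunctive query: ids of docs containing every term."""
        sets = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    docs = {1: "managing gigabytes of text",
            2: "indexing and retrieval of text"}
    idx = build_index(docs)
    print(query(idx, "text", "retrieval"))  # -> {2}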
So I've complained before about open data sets for weblog researchers and hackers. Be careful what you wish for...
I received the challenge data for the 2006 Weblogging Ecosystem Workshop. The first day of the set is a 200-odd MB compressed file that expands to over 900 MB. Yowsa!! And there are 17 days of this stuff.
Suffice it to say you can't just slurp the whole mess into an in-memory data structure and start noodling about. Gonna have to creatively splice this iceberg into somewhat more manageable chunks, along the lines of the sketch below.
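For instance, Python's gzip module will happily stream a file like this line by line, so you can carve it into slices without ever holding 900 MB in RAM. A minimal sketch; the filename and slice size are made up, and the real records may not even be line-oriented:

    import gzip

    # Hypothetical filename; the workshop set's actual naming will differ.
    INFILE = "weblog-day01.gz"
    CHUNK = 100000  # lines per output slice, tune to taste

    def slices(path, size):
        """Yield lists of at most `size` lines, never loading the whole file."""
        batch = []
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                batch.append(line)
                if len(batch) == size:
                    yield batch
                    batch = []
        if batch:
            yield batch

    for i, chunk in enumerate(slices(INFILE, CHUNK)):
        with open("slice-%03d.txt" % i, "w", encoding="utf-8") as out:
            out.writelines(chunk)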
Apropos of absolutely nothing, I'm back to #1 in the Google rankings for "media hack" and "Brian Dennis". This despite competition from Wired News on the first front, and a prominent English composer on the second. Apparently, Wired News columnist Adam Penenberg has actually moved all of his web writing over to Slate.
Score one for frills free blogging. Or something like that.
In other news, a few deadlines and commitments have passed, so expect a bit of blogorrhea from your humble writer as I flush a bunch of links locked in Bloglines "Keep New" purgatory. Really, I wish they'd come up with a less click-intensive way to clip an item.
As someone who hacked an Xlib library for OS/2 on the 286, I salute the sheer moxie of Jamie Zawinski essentially porting Xlib to Quartz, the Mac OS X native graphics library. This should make life nice for all of those legacy xscreensaver modules. Yowsa!!
[Via The Tao of Mac]
Link parkin': Rashmi Sinha's A social analysis of tagging.
[Via Jack Vinson who's been on a roll with a lot of thoughtful posts that outlink to interesting stuff]
NewsVine, reddit, digg, memeorandum, TailRank, News Alloy, CommonTimes, not to mention old greybeards Slashdot and Plastic. Is there really that much demand for socially customized news out there?
More importantly, are there enough ad dollars to support them all? I think I'm with Steve Rubel on this one: shakeout approaching.
A particular theme I'm waiting to see play out is responses to The Tyranny of the Crowd. As far as I know, all of these sites operate over a single large-to-vast pool of participants.
I still haven't figured out why no one has gone meta and let users define and manage smaller communities, with their own sets of sources, restricting the algorithms to those smaller groups, along the lines of the sketch below. This would probably make the personalization algorithms more tractable, at the risk of not having enough participant data, and it would demand decent designs for managing social boundaries. Talk about targeted ad buys though!!
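To be concrete, here's a back-of-the-napkin sketch of the idea. Every name and data structure below is mine, not anybody's shipping system: a community is just a set of sources, and the vote tally only ever looks inside that set.

    from collections import Counter

    # Hypothetical data model: a community is a named set of sources.
    communities = {
        "chicago-hackers": {"gapersblock.com", "chicagoist.com"},
    }

    # (source, item, voter) triples collected site-wide.
    votes = [
        ("gapersblock.com", "el-delays", "alice"),
        ("chicagoist.com", "el-delays", "bob"),
        ("slashdot.org", "kernel-flamewar", "carol"),
    ]

    def top_items(community, n=10):
        """Rank items by vote count, counting only the community's sources."""
        sources = communities[community]
        tally = Counter(item for src, item, _ in votes if src in sources)
        return tally.most_common(n)

    print(top_items("chicago-hackers"))  # -> [('el-delays', 2)]

The interesting design work is all in who gets to edit that source set, which is the social boundary problem again.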
O'Reilly has a "Baseball Hacks" title, authored by Joseph Adler, scheduled to appear in February 2006. Hearkening back to my sports data vizzing post, it looks like Baseball Hacks might have pointers to the raw data, which I've always known was out there but didn't quite know how to find.
[Via Brian Carnell]
One two, one two!
Is this thing still on?!
Here's to three years of cranking out new media hackery; comment free, ad free, spam free, trackback free, blogroll free, tagcloud free, and some would say content free ;-)
Ciao, ciao for now!
Link parkin': Matthew Hurst's Data Mining, weblog. Hurst is a senior research scientist for Intelliseek, the folks behind Blogpulse.
expialidocio.us is a neat visualization of one's del.icio.us tags. Cooked up by Oliver Steele using OpenLaszlo, the widget lets you interact with a sliding temporal window over your del.icio.us tags. A tag cloud is generated from the bookmarks that fall within the window.
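The underlying computation is pleasantly simple. Here's a sketch of the windowing step in Python; the bookmark structure is my invention, since expialidocio.us presumably pulls the real thing from the del.icio.us API:

    from collections import Counter
    from datetime import date

    # Hypothetical bookmarks: (date posted, list of tags).
    bookmarks = [
        (date(2006, 1, 3), ["python", "visualization"]),
        (date(2006, 1, 9), ["tagging", "python"]),
        (date(2006, 2, 1), ["music"]),
    ]

    def window_cloud(start, end):
        """Tag frequencies for bookmarks falling inside [start, end]."""
        counts = Counter()
        for posted, tags in bookmarks:
            if start <= posted <= end:
                counts.update(tags)
        return counts  # render with font size proportional to count

    print(window_cloud(date(2006, 1, 1), date(2006, 1, 31)))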
A nice prototype (?) that leads the mind in various directions. The tag cloud takes up a lot of real estate, and it could be put to more use. For example, if I hover over a tag the slices of the timeline that cover that tag could be highlighted. Also, I might provide some chunking in the timeline, so I could discern patterns in bookmarking behavior. Maybe I'm a regular midweek poster or can identify increased usage close to some major life event.
Good stuff though!!
Paraphrasing Barry Wellman, "The only people measured more than academics are ball players." I wonder if sports data has ever hit the information visualization community?
This line of thought was motivated through getting stuck in my office working on some research papers over the weekend. National Football League playoffs started on Saturday, and I have a rooting interest in the Washington, D. C. franchise who played that day. I didn't have a radio, and really couldn't afford -- the time or the money -- to subscribe to the online game video. So I pulled up the real time game tracker at CBS SportsLine. It worked great. Not only do score updates get pushed to the browser, but a complete log of every play is recorded, numerous stats compiled, and realtime running commentary from a stringer is provided. Yowsa!!
A realtime viz of all the data generated would be interesting, but it also appears that the URL structure for the entire season is relatively easy to understand. A crawler to grab the entire season's pages, something like the sketch below, could be knocked off in a day or two, and you'd have a nice set of real-world data to try and visualize. Of course, availability on the Web means the raw data is somewhere to be had, but I'm sure someone (read: NFL) will charge you an arm and a leg for it.
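Something like the following, assuming the URLs really do follow a predictable pattern. The URL template here is invented for illustration, not SportsLine's actual scheme, and you'd want to check their terms of service first:

    import time
    import urllib.request

    # Invented URL template; the real SportsLine scheme would need a look.
    TEMPLATE = "http://www.sportsline.com/nfl/gametracker/%s/%s"

    weeks = ["week%02d" % w for w in range(1, 18)]
    games = ["game%02d" % g for g in range(1, 17)]

    for week in weeks:
        for game in games:
            url = TEMPLATE % (week, game)
            try:
                html = urllib.request.urlopen(url, timeout=30).read()
            except OSError:
                continue  # missing page, bye week, etc.
            with open("%s-%s.html" % (week, game), "wb") as f:
                f.write(html)
            time.sleep(1)  # be polite to their servers

Call it an afternoon of work, plus another to parse the play-by-play logs out of the HTML.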
Why would anybody want to do this? Major PR and commercial possibilities aside, the pool of folks who would want to participate in rigorous experiments would be pretty good. Plus, you could try out expert vs. novice style designs, with a relative plethora of (self-identified ;-) subject matter experts available. And à la Martin Wattenberg's experience with the Baby Name Voyager (PDF), I could see hordes of Monday Morning Quarterbacks socially working their way through complex data analysis.
By the by, the Wattenberg paper is a quick, good read.
WebJay, the social media playlist site, has been bought by Yahoo!. I'm mainly an observant bystander, although I do have a languishing Y! Unlimited subscription, but Lucas Gonze seems like a guy who deserves to hit it big. Tristan Louis seems to have an early, post announcement interview with Gonze.
Then again, Hawaii to Santa Monica might seem like punishment to some folks.
[Via O'Reilly Radar and Kevin Burton]
Yeah, you can get news of the Google Video announcement from a horde of the usual suspects, but one photo by Cory Bergman crystallized why this is right in Google's wheelhouse. Keep in mind that I translate the Google philosophy of "organize the world's data" into "work on fscking hard data wrangling problems, at global scale."
There's a little mention by Bergman about breaking up videos into scenes. I don't know the state of video shot detection research, but I have to guess it's not ridiculously easy on "any old video" (see the naive baseline sketched below). Hard problem 1.
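For a sense of the problem, the textbook baseline is embarrassingly simple, and also embarrassingly fragile on fades, pans, and flashbulbs: compare color histograms of adjacent frames and call any big jump a cut. A sketch using OpenCV, with the threshold hand-waved and the filename hypothetical; real systems do far more:

    import cv2  # OpenCV

    def shot_cuts(path, threshold=0.5):
        """Yield frame indices where adjacent-frame histograms diverge."""
        cap = cv2.VideoCapture(path)
        prev = None
        i = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Coarse 8x8x8 color histogram of the frame.
            hist = cv2.calcHist([frame], [0, 1, 2], None,
                                [8, 8, 8], [0, 256] * 3)
            cv2.normalize(hist, hist)
            if prev is not None:
                sim = cv2.compareHist(prev, hist, cv2.HISTCMP_CORREL)
                if sim < threshold:  # big jump => likely a hard cut
                    yield i
            prev = hist
            i += 1
        cap.release()

    print(list(shot_cuts("some_old_video.avi")))  # hypothetical file

Gradual dissolves and wipes sail right past a detector like this, which is exactly why it's Hard problem 1.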
Now that you've got the shots selected, and any other metadata you might have at hand, build up a good queryable index. Make it easy enough for television people to use. (I kid!) Hard problem 2.
Did I mention storing all those videos and making them easily accessible anywhere on the net? While video has become friendlier with the Web, it still doesn't exactly play well. Partially because the files are huge and partially because of the bandwidth needed, both of which are being eroded by Moore's Law. But not at a pace that makes dealing with video at a global scale easy. Ask Akamai. Hard problem 3.
I could see Larry and Sergey rounding up a bunch of those systems PhDs along with a few top web UI folks and saying, "We need to build a platform for the world's biggest video content management system. Every file's a couple of gigabytes, they have practically no metadata, and we can't tolerate any latency delivering the media. What can you guys do?"
"And if it works, we all get rich again, brokering television advertising."
Yup, right up their alley.
The Hypergene Media Blog guys have an overview of social news reading site NewsVine, which is still in invitation-only beta. Their take is mostly from a high-level, strategic perspective. Brian Benzinger of SolutionWatch has been using NewsVine for a while and drills down on the details.
I've never been a huge fan of explicit, vote-based recommendation systems, but there seem to be enough people willing to contribute enough free work to make sites like Slashdot, kuro5hin, Digg, and now NewsVine work. If nothing else, someone needs to call up the folks at ACM SIGCHI and have them set up a workshop on "media voting recommender systems" or somesuch. Maybe there have been some fundamental changes or discoveries since GroupLens that make a difference.
l.m. orchard just gets to the tip of the iceberg in discussing potential applications of exposing webfeed aggregator internal data. There are lots of fun things that could be done if aggregators made what you're subscribed to and what you've read available in more programmatic fashion. About the best we've got so far is feedroll export, NetNewsWire's AppleScript interface, and a few aggregators that support plugins, mainly for pulling data in. The sketch below gives a flavor of what I'm imagining.
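To make that concrete, imagine an aggregator exposing its internals over plain local HTTP. Everything below, endpoint names included, is hypothetical; no aggregator I know of actually ships this:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Stand-in for the aggregator's real internal state.
    STATE = {
        "subscriptions": ["http://example.org/feed.xml"],
        "read": ["http://example.org/2006/01/some-post"],
    }

    class AggregatorAPI(BaseHTTPRequestHandler):
        """Hypothetical read-only endpoints: /subscriptions and /read."""
        def do_GET(self):
            key = self.path.strip("/")
            if key in STATE:
                body = json.dumps(STATE[key]).encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    # Bound to localhost only, so nothing leaves your machine.
    HTTPServer(("localhost", 8123), AggregatorAPI).serve_forever()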
If you're worried about privacy, the things I'm thinking about don't even require sharing that data with others. However, I think it would be an interesting research project to make sharing possible, and then design some mechanisms to support the construction of social boundaries.
As an aside, a colleague objected to usage of the term "blogroll" to refer to a list of webfeed subscriptions. Blogrolls allegedly referred to the (poorly maintained) lists of sources bloggers posted on the front pages of their blogs. I think I'll just compromise by using feedroll instead. Same connotation, less confusion. Thanks 0xdecafbad!!
Just some curmudgeonly thinking, inspired by trying to understand "results" from various "blog search" (and I use that term guardedly) engines. What if that dreamy long tail, the one with the cute, clean little infographic implying vast commercial potential, is really the equivalent of Florida swampland? A thriving ecosystem indeed, but not exactly human hospitable. Really messy, somewhat shallow, dirty, hard to slog through, easy to get lost in, and chock full of hazardous critters and parasites.
As Greg Linden notes, it's tough work to get the crap out.
Speaking of Aaron Swartz, web.py is his horse in the crowded Python Web Application Framework derby. Looks elegant, but no time to kick the tires.
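For the curious, the hello-world pattern from the web.py docs is about as small as these things get. Caveat: the API has been a moving target, so this may not match whatever version you pick up:

    import web

    # Route table: URL pattern -> handler class name.
    urls = ('/', 'hello')

    class hello:
        def GET(self):
            return "Hello, world!"

    if __name__ == "__main__":
        app = web.application(urls, globals())
        app.run()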
Over the holidays I collected three interesting links regarding time, a.k.a. life, expenditure.
Paul Graham takes a high-level view of procrastination and how to avoid it, even lauding some forms of putting things off. Aaron Swartz takes a different tack, digging into the dirty details of procrastination in an effort to be more productive. They both have some good points, but tend toward an M.I.T.-genius-influenced "look for the home run pitch and swing away."
Scott Berkun zags. Look to passionately make a small difference, let it drive you, and wait for a snowball. At worst you still feel good about what you've done.
The varied philosophies aren't incompatible, but the third seems much more attainable by mere mortals.
Despite a number of misgivings, namely Sony's bad media track record and apparent hostility toward independent developers, I asked for and received a PlayStation Portable as a Christmas present. So far I'm pretty hooked on the dang thing, and I only have two games, Madden '06 and Lumines. Not having been invested in video games for the better part of a decade or more, Madden is a long-term investment. Jiminy, the number of game options makes an airplane cockpit look easy. Lumines, a Tetris-style game with a lot of polished chrome, is dirt simple and fills in when I don't want to think hard. I've only had the PSP for about a week, but I'm pretty impressed, despite Sony trying to blow its foot off.
Today I was struck by how much computing and communication power is packed into these portable packages. Check the PSP specs versus the seminal PARCTAB. In about 15 years, we've gained a low-cost, mass-produced PARCTAB, with a high-powered graphics engine and a minimum persistent storage of 32 MB, with gigabytes easily available. Plus multiple communications options: IR, Bluetooth, and Wi-Fi. My off-the-cuff estimate is that you've got a factor of 30 speed-up on the processor, about two orders of magnitude communications speed-up (38.4 kbaud IR vs. 10 Mbit/second Wi-Fi), and over three orders of magnitude on storage. With relatively decent development environments available, this would seem to be a cornucopia for ubicomp researchers.
And I haven't even mentioned cellular phones!
People beat Vannevar Bush's "As We May Think" into the ground, but Mark Weiser's "The Computer for the 21st Century" was pretty doggone prophetic too.
Bloglines has been having a hard time the past couple of weeks. That plumber guy has been showing up quite often :-(
Maybe he should go to the same masseuse that Flickr visits.
Speaking of CS curriculum, many folks hold rather inflammatory positions on what should and shouldn't be in the mix, and are quite willing to rant at a moment's notice. I like a good rant as much as the next person. Joel Spolsky bathes in gasoline then promptly applies a blowtorch to himself in arguing that 100% Java-based curriculums don't make for good programmers. Heck, they're not even really CS programs. Spectacular!!
N.b. Spolsky is ranting against completely Java based curriculums, not introducing Java as one part of the program.
And for the record, I agree with 95% of what he said. I disagree on weeding out for weeding out's sake, and on not preparing students for PhDs as a means of job security. On the first point, I've met zero faculty who took glee in weeders, although in any degree program such courses naturally arise. Second, by the time an undergraduate finishes their degree and completes a PhD, even straight out of school, you're up for tenure, have tenure, or have moved on. Having potential PhD undergrads work with you, do interesting stuff, and successfully go on to grad school does more for your job security than making sure they can't do a PhD. What's a good rant without a little hyperbole though?
Besides, I was under the illusion that everyone got an A in 6.001. Didn't seem that hard to me, although my head exploded when Sussman used the last lecture to discuss how to build a Turing complete computer out of cellular automata.
Lambda the Ultimate had two interesting posts recently related to teaching introductory programming. This interests me because, in the spring quarter, I'll be going into my 6th edition of our second quarter of intro programming (jeez, has it been that long?). I've never been really happy with the course, partially because it winds up being overconstrained by the courses directly in front of and behind it. Be that as it may, I'm always interested in how others approach such courses.
Peter van Roy posted a bit about his experience introducing programming semantics early in students' careers. Due to the formal rigor involved, this would seem to work at schools with heavier mathematical prerequisites in their CS programs, which I think is typical of European schools. Good discussion ensues in the comments.
Similarly, a post by Chung-chieh Shan ignites some interesting commentary on the utility of Structure and Interpretation of Computer Programs, MIT's ur-book on programming. That was my introduction to rigorous thinking about programming and is the textbook we use here in our first course. Bonus: a link to Joseph Bergin, who seems to have done quite a bit of thinking about CS pedagogy.
Niall Kennedy spent some of his holiday time reverse engineering the API behind Google's Webfeed Reader. Apparently, it's a fairly straightforward REST-based API that supports easily retrieving and managing subscription blogrolls and unread items. You can also leverage the feed and item tagging that the reader supports.
The big bonus is that Kennedy got the Google Reader team to confirm that they intend to open up the API for independent developers. Timeframe looks like Q1 2006 for something official. Lots of potential here for interesting new aggregator experiments, especially if blog search were eventually made available, although I realize that last bit is a tricky design and implementation issue.
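Going from Kennedy's write-up, fetching your subscription list might look roughly like this. The endpoint path is my paraphrase of his notes and the auth handling is hand-waved, so treat it as pseudocode that happens to run:

    import urllib.request

    # Endpoint as described in Kennedy's reverse-engineering notes;
    # subject to change until Google ships the official API.
    URL = "http://www.google.com/reader/api/0/subscription/list?output=json"

    # Auth is hand-waved: you'd first obtain a session cookie by logging
    # in to your Google account, then pass it along on each request.
    SESSION_COOKIE = "SID=...your-session-id..."  # placeholder

    req = urllib.request.Request(URL, headers={"Cookie": SESSION_COOKIE})
    print(urllib.request.urlopen(req).read().decode("utf-8"))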
Barbara Dybwad's post on the Performancing Firefox extension, provoked me to give it a whirl. Performancing embeds a weblog editing tool directly within Firefox. I'm using it to create this very post.
Performancing isn't going to push me off of MarsEdit on the home Mac. However, it might be an upgrade for me on Windows, where I haven't found a satisfactory weblog editor. It's a little inefficient in terms of screen real estate, and I'm sure there are some glitches under power use, but it has to be better than editing within an HTML form.
Minor correction: although the AIS SIGSEMIS Bulletin was interesting, I was actually thinking of The American Society for Information Science & Technology as Avis to ACM's Hertz. While not having nearly as many different activities as ACM, I often seem to run across interesting stuff appearing in the ASIS&T Journal or Annual Conference Proceedings.
Very entertaining predictions from Greg Linden of Findory. Choice ones to my mind are
BoingBoing is a craptastic flood of news of the weird and IP/telecom/computing policy. Every now and then a really useful nugget floats by.
Mark Frauenfelder recently pointed out the current top 10 torrents on publicdomaintorrents.com. I didn't realize that stuff like "Night of the Living Dead" and "Plan 9 From Outer Space" is in the public domain. What's the big deal about these films, at least from my perspective? Well, bits and pieces could probably be quite humorously applied to lectures and presentations.
C'mon, you just know there's gotta be a way to use a drop from "Night of the Living Dead" for that lecture on UNIX zombies. ;-)
The Association for Information Systems is a group I need to take a closer look at. The organization seems to play Avis to the ACM's Hertz, but I keep running across interesting little papers that appear in proceedings of their conferences or AIS journals.
Witness this recent bulletin (big PDF) from the special interest group on the Semantic Web and Information Systems. I'm not a huge Semantic Web advocate but there's a couple of interesting titles amongst the research papers. Bonus, an interview with Terry Winograd.
Just got around to digesting Tom Coates' essay on "Amazon, excess, and the future of navigation...". Sparked by Amazon's recent deployment of per-user, freeform labeling on items, Coates riffs on how to deal with navigation when there are copious objects to deal with. Overall a good piece that captures a rapidly developing trend. I'll just add a couple of cents' worth of my own.
Navigation in an environment of excess is a starting point. People want to get things done as they're noodling about on Amazon. While tagging is a good way to socially find new stuff, it's not much help in executing a particular task other than generic browsing. There's a lot of room in web systems for deploying and improving task specific interface mechanisms. As an example, for Amazon making the shopping cart smarter would be a win, but they also have authoring tools in there for making lists, writing reviews, writing guides, and managing registries. How could these new navigation mechanisms make those authoring tools better?
One other thing about dealing with excess. I was just scanning Tufte's "Envisioning Information" again and as usual was jazzed by the sections on Micro/Macro Readings and Small Multiples. Those general ways of thinking seem particularly appropriate for navigating excessive corpuses. Unfortunately, it's really hard to apply these techniques in web based systems. The graphical control needed to pull such visualizations off is well beyond current browsers' capabilities. Also, interaction can provide a major assist here, but even pushing the limits of AJAX and DHTML the best infoviz techniques can't be applied. Maybe this is the appropriate role for Flash and Java applets.
In short, all of these systems that are providing access to huge piles of stuff will force the further development of web-centric interface techniques.
If Recommended Reading for IR Research Students (PDF) is any good, and it probably is, then the paper should be required reading for anyone working on blog search.
[Via ResourceShelf]
Forgot in passing that Intelliseek has made a largish data dump of blog crawling available. It's about 3 weeks of data from July 2005, covering the London bombing.
Terms of access (PDF) aren't too heinous, other than the "stop using this stuff after the workshop" clause. Sort of puts the screws to using the dataset for any cool demos because as soon as you show off at the workshop and get people hot and bothered about how your system really does scale, you'll be stuck saying, "It's really cool on the Intelliseek data set, but all I can show you now are screen shots," for the rest of WWW2006. Bummer.
Ignore the man behind the mirror. I'm just trying to claim this blog on Technorati.
Heck, this might wind up being the first post here I actually delete.
In the thinking out loud department, what would happen if a traditional news organization went balls out with a snarky, online, WebBeat edition? The sole focus would be to have blanket coverage of the intersection of a metro area, its citizens, and the Web. This was partially inspired by Chicago's RedStreak giving up the ghost.
Of course you're saying, but I've already got Chicagoist, Gapers Block, Change of Subject, etc. etc. Those sites provide general coverage of Chicago, not a particular focus on what Chicagoans are doing on the Web or how the Web is impacting Chicago. The WebBeat would be one level up, getting behind the scenes of things on the Web people routinely rely upon or should know about. Also, when I say balls out, I mean cranking out regular posts 24/7, tossing in routine lengthy features, pushing the envelope on media distribution (podcasting/vlogging) and maybe supporting some discussion but with really heavy gardening.
Why do this?
Alexa's opening up of its Web Search Platform is an interesting gambit, but unlike John Battelle's thinking, I'm not sure it's a game changer. At the very least, the jury has yet to be impaneled, much less still out.
Browsing the Alexa Web Search User Guide, there's lots of serious support for distributed/parallel programming in there. Unfortunately, there's a good bit to learn and a small number of people in the world who are really equipped to make even decent use of it, especially since your mistakes actually cost real money. Real distributed/parallel programming at these scales is dang hard!
Also, a major impediment is a reliance on Alexa's web crawler, whose operation is opaque as far as I can tell. Innovators are thus at the whim of whatever the crawler brings back. Maybe a first set of projects to be commissioned on the Alexa indices and repositories is an empirical study of what's actually in them. For example, it might be sort of irritating in some circles if the crawler doesn't attempt to crawl URLs ending in .php. This is not an unreasonable limitation for a planetary scale web crawler. And if you think there's even a large (> 20%) portion of the Web's content in there, you're delusional.
On the other hand, the project is a highly visible rent-a-grid project with access to some major non-scientific datasets. The platform is probably a bit of a boon to information retrieval and web search researchers, especially those who can't round up the grant money for their own grid clusters, or pinch some colleagues for access. Not to mention the system administration has been outsourced.
One other minor nuisance: I suspect there'll be some hair in a platform application that starts redistributing material from the crawls wholesale. I'm pretty sure this is a copyright violation, and if you draw the gaze of the wrong folks, let's just say you better have some good lawyers.
Greasemonkey is a great tool for extending your browser...if you're a hardcore programmer. Chickenfoot is an alternative end-user programming environment, embedded in Firefox, developed by Michael Bolin, Robert Miller, and crew at M.I.T.'s User Interface Design Group. The major wins of Chickenfoot are a focus on naive programmers, an interactive development environment, and a high-level representation of the rendered Web page as the major object of manipulation.
[Via Lambda the Ultimate]
SessionSaver is a lifesaver. It's a Firefox extension that saves your windows and tabs across shutdowns and even crashes. I've been using it for about a month now, and it's cut my cursing at the computer by 50%. Here's hoping it makes someone else's life a little bit better.
Really!!
I don't actually use processing, but admire how the environment and community keep on trucking. The changes in the latest release, 0098, indicate more robustness, and even a save-to-application function. Previously, processing projects were delivered as applets, which can often be a less than compelling experience. Just try getting a browser to reload a jar file if you need to publish fixes!!
Anyhoo, I think processing can be declared a positive case study on how a domain specific language, with a solid dose of evangelism, can make computing attractive to folks not normally inclined to it. Good enough to get you on the cover of Nature and in the Whitney is good enough for me.
VisAware (PDF link) is the generalization of a visualization technique for situational awareness, developed by Yarden Livnat, Jim Agutter, Shaun Moon, and Stefano Foresti, all of the University of Utah. I'll attempt a bad capsule summary. A spatial map is ringed with concentric circles. The circles capture the time and type of event occurrences. Lines connect urgent events with locations on the map. Add color and interactivity and you get an effective display for keeping on top of a large number of events that have no a priori obvious correlation, but three potential correlation vectors: time, type, and location. While the screen captures are exceedingly seductive, the paper describes the rigorous construction of a formal framework for generating them.
The authors aren't particularly clear on what this technique isn't good for, but an extrapolation to webfeed aggregation is probably worth a prototype (a strawman sketch follows). Take a relatively large blogroll and use some clustering techniques to generate a two-dimensional map of the feeds. Then blog posts become the events, which conveniently have a time element, and can point back to the feeds they're related to. Event typing could be hardwired to some levels of "importance", or user-controlled, e.g. the result of standing searches against aggregated items.
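Here's a crude cut at the plumbing, strictly my own strawman rather than anything from the VisAware paper: reduce each feed to a bag of words, project the feeds onto a 2D map, then treat posts as timestamped events anchored to their feeds. The feed data is placeholder junk.

    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer

    # One blob of recent text per feed in the blogroll (placeholder data).
    feeds = {
        "datamining": "blog analytics text mining intelliseek",
        "0xdecafbad": "aggregators plugins webfeeds hacking",
        "infosthetics": "visualization infographics design data",
    }

    names = list(feeds)
    vectors = TfidfVectorizer().fit_transform(feeds[n] for n in names)

    # Project feeds to 2D: this becomes the spatial map the rings encircle.
    coords = PCA(n_components=2).fit_transform(vectors.toarray())
    feed_xy = dict(zip(names, coords))

    # A post is an event: (timestamp, type, originating feed). Time and
    # type pick its spot on the concentric rings; the feed anchors the
    # line back into the map.
    post = ("2006-01-20T09:00", "standing-search-hit", "datamining")
    print(feed_xy[post[2]])

Any dimensionality reduction would do for the map; PCA over TF-IDF is just the laziest thing that runs.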