Speroni: Tags, Clouds vs Sets

Posted on: Mon 30 May 2005

I'm not dead yet.

Pietro Speroni has written a long piece on the difference between tag sets and tag clouds. The former is just a list of labels applied to an item (e.g., a URL); the latter is the same list with each label's frequency factored in. Tag clouds therefore carry more information. Speroni also speculates on the rank/frequency distribution in del.icio.us tag clouds, invoking power law distributions, while at least retaining some skepticism that they really behave that way.
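To make the distinction concrete, here is a minimal Python sketch. The `postings` data and the function names `tag_cloud` and `tag_set` are made up for illustration; they are not taken from Speroni's piece or from del.icio.us itself.

```python
from collections import Counter

# Hypothetical data: each inner list holds the tags one user applied to the same URL.
postings = [
    ["python", "programming"],
    ["python", "tutorial"],
    ["python", "programming", "reference"],
]

def tag_cloud(postings):
    """Tag cloud: each label mapped to the number of users who applied it."""
    # set(tags) guards against a single user repeating a tag.
    return Counter(tag for tags in postings for tag in set(tags))

def tag_set(postings):
    """Tag set: the same labels with the frequencies thrown away."""
    return set(tag_cloud(postings))

cloud = tag_cloud(postings)
labels = tag_set(postings)
```

The set can always be recovered from the cloud, but not the other way around, which is the sense in which the cloud carries more information.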

The interesting leap is that tag clouds can be considered feature vectors, which are then amenable to well-known clustering and information retrieval techniques. I'm not sure the distinction between sets and clouds actually makes that much of a difference. I'd have to do the math, but if you assume something approaching a power law distribution, almost all of the information is in the first couple of labels. The rest will be sufficiently non-discriminating that a good clustering algorithm will ignore them anyway.
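A rough way to poke at both ideas is sketched below in Python. Everything in it is an assumption for illustration: the two example clouds are invented, cosine similarity is just one common choice of vector comparison, and `top_k_share` uses an idealized Zipf curve (frequency proportional to 1/rank**s) with raw frequency mass standing in as a crude proxy for "information".

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tag, 0) for tag, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def as_set_vector(cloud):
    """Flatten a cloud to a binary vector: 1 for every tag present."""
    return {tag: 1 for tag in cloud}

def top_k_share(n_labels, k, s=1.0):
    """Share of total frequency held by the top k ranks under idealized Zipf."""
    weights = [1.0 / rank ** s for rank in range(1, n_labels + 1)]
    return sum(weights[:k]) / sum(weights)

# Invented tag clouds for two URLs (tag -> number of users).
url_a = Counter({"python": 40, "programming": 12, "tutorial": 3, "snake": 1})
url_b = Counter({"python": 35, "programming": 9, "reference": 2, "zoo": 1})

cloud_sim = cosine(url_a, url_b)                               # frequencies kept
set_sim = cosine(as_set_vector(url_a), as_set_vector(url_b))   # frequencies dropped
share = top_k_share(n_labels=100, k=2)                         # mass in the top two labels
```

With these made-up numbers the cloud and set similarities come out different, because the rare tags count as much as the dominant ones in the binary vectors. How much of the total sits in the top few labels depends heavily on the exponent `s` and the number of labels, so real del.icio.us data would be needed to settle the question either way.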

Based upon these thoughts, Terrell Russell has cooked up a charting engine to see tag set behavior for del.icio.us URLs. However, it strikes me that some kind of rigorous study is in order.