
Bar-Yossef & Rajagopalan: Template Detection

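If you do any web indexing or information retrieval on HTML, templates can easily screw things up: the same navigation bars and boilerplate links repeat on page after page and skew term and link statistics. Results from WWW 2002 indicate that there's hope for detecting and leveraging template elements.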

The key nuggets: use tree structure, administrative authority, and link occurrence counts to find the recurring elements. It can also be done reasonably fast with standard RDBMS technology.
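To make the idea concrete, here is a rough sketch of how those three signals might fit together. This is not the paper's algorithm, just a toy under loud assumptions: block-level tags stand in for "tree structure", the URL's hostname stands in for "administrative authority", a block must contain at least one link to count, and an in-memory SQLite table does the counting with a plain GROUP BY. The names `BLOCK_TAGS`, `BlockCollector`, and `template_candidates` are made up for illustration.

```python
import hashlib
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: these tags mark block boundaries in the page tree.
BLOCK_TAGS = {"div", "table", "td", "ul", "p"}


class BlockCollector(HTMLParser):
    """Collect (normalized_text, link_count) for each block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open blocks, each [text_parts, link_count]
        self.blocks = []  # closed blocks, each (text, link_count)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([[], 0])
        elif tag == "a" and self.stack:
            self.stack[-1][1] += 1  # credit the link to the innermost block

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            parts, links = self.stack.pop()
            text = " ".join(" ".join(parts).split())  # collapse whitespace
            if text:
                self.blocks.append((text, links))


def template_candidates(pages, min_pages=2, min_links=1):
    """Return block texts that recur on at least min_pages pages of one host.

    pages maps URL -> HTML source. Hostname is a stand-in (an assumption)
    for administrative authority.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE blocks (host TEXT, fp TEXT, url TEXT, body TEXT)")
    for url, html in pages.items():
        collector = BlockCollector()
        collector.feed(html)
        collector.close()
        host = urlparse(url).netloc
        for text, links in collector.blocks:
            if links >= min_links:
                fp = hashlib.sha1(text.encode("utf-8")).hexdigest()
                db.execute(
                    "INSERT INTO blocks VALUES (?, ?, ?, ?)",
                    (host, fp, url, text),
                )
    rows = db.execute(
        "SELECT MIN(body) FROM blocks GROUP BY host, fp "
        "HAVING COUNT(DISTINCT url) >= ?",
        (min_pages,),
    ).fetchall()
    db.close()
    return {row[0] for row in rows}
```

Feed it a dictionary of URL to HTML for a handful of pages and it returns the block texts that show up, links and all, on several pages of the same host. Exact-match fingerprints are the crudest possible choice here; anything that tolerates small per-page differences would need a fuzzier hash.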

© Brian M. Dennis.