May 19, 2006

Tweaking Link Discovery

The original Link Apprentice, designed for the original Macintosh, had to run in very little memory and on a processor that was slow indeed by today's standards. It turns out that even my laptop can index the full text of the 3200 notes in this Tinderbox -- about 389,000 words -- in seconds. That surprised me, but it means we can dispense with a lot of the performance hacks and heuristics the old Apprentice had to use.

There are still a lot of moving parts in link discovery, even if you just want to link to similar notes. What's "similar"? We start with the text: if two notes use the same unusual words, that's an interesting clue.
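
To make "the same unusual words" concrete, here is a minimal sketch in Python. It is not the Apprentice's actual code. It assumes that "unusual" can be approximated by inverse document frequency, so a word that appears in few notes counts for more than one that appears everywhere. The function names (`tokenize`, `idf_weights`, `text_similarity`) and the toy notes are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Return the set of distinct lowercase words in a note."""
    return set(re.findall(r"[a-z']+", text.lower()))

def idf_weights(notes):
    """Weight each word by log(N / number of notes containing it).

    A word found in every note gets weight 0; rarer words get more.
    (Assumption: rarity across notes stands in for "unusual".)
    """
    n = len(notes)
    doc_freq = Counter()
    for text in notes.values():
        doc_freq.update(tokenize(text))
    return {word: math.log(n / count) for word, count in doc_freq.items()}

def text_similarity(text_a, text_b, weights):
    """Sum the weights of the words the two notes share."""
    shared = tokenize(text_a) & tokenize(text_b)
    return sum(weights.get(word, 0.0) for word in shared)

# Hypothetical notes, just to exercise the functions.
notes = {
    "a": "the hypertext garden",
    "b": "the hypertext links",
    "c": "the weather today",
}
weights = idf_weights(notes)
score_ab = text_similarity(notes["a"], notes["b"], weights)
score_ac = text_similarity(notes["a"], notes["c"], weights)
```

In this toy set, notes "a" and "b" share "hypertext", which is uncommon, so they score above zero; "a" and "c" share only "the", which is in every note and so contributes nothing.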

But because we're in Tinderbox, we have other clues, too. We know how long each weblog post is; we boost the similarity of notes that have roughly the same length. We know what kind of note each note is, because we have prototypes; if two notes have the same prototype, they're more similar than if they have two different but related prototypes, and even more similar than two notes that have completely different prototypes.
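
One way these extra clues might be folded into a single score is sketched below in Python. Everything specific here is an assumption for illustration, not how Tinderbox does it: the multiplicative form, the weights `w_len` and `w_proto`, the 1.0 / 0.5 / 0.0 prototype tiers, and the idea of passing "related" prototypes as a set of pairs.

```python
def length_boost(len_a, len_b):
    """Return 1.0 for equal lengths, falling toward 0.0 as they diverge."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0  # two empty notes are trivially the same length
    return min(len_a, len_b) / longest

def prototype_boost(proto_a, proto_b, related):
    """Same prototype > related prototypes > unrelated prototypes.

    `related` is a hypothetical set of frozenset pairs naming
    prototypes that count as related to each other.
    """
    if proto_a == proto_b:
        return 1.0
    if frozenset((proto_a, proto_b)) in related:
        return 0.5
    return 0.0

def combined_similarity(text_score, len_a, len_b, proto_a, proto_b,
                        related, w_len=0.2, w_proto=0.3):
    """Scale the text score up by the length and prototype clues.

    The weights are arbitrary placeholders that would need tuning.
    """
    boost = (1.0
             + w_len * length_boost(len_a, len_b)
             + w_proto * prototype_boost(proto_a, proto_b, related))
    return text_score * boost

# Hypothetical prototypes: "Post" and "Essay" are declared related.
related = {frozenset(("Post", "Essay"))}
same = combined_similarity(1.0, 300, 300, "Post", "Post", related)
near = combined_similarity(1.0, 300, 300, "Post", "Essay", related)
far = combined_similarity(1.0, 300, 300, "Post", "Book", related)
```

With the text score and lengths held equal, the three calls rank as the paragraph describes: same prototype highest, related prototypes next, unrelated prototypes last.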