Building a Smarter Feedreader

I ran across Leonard Richardson’s Ultra Gleeper again yesterday. I hadn’t seen it for a year, and it’s been good looking at it with new eyes since I’ve begun hacking in earnest on machine learning problems and measuring “interestingness” of RSS posts.

The project is interesting because he aims to solve the same problem I do (automatically find interesting/relevant things on the net), but goes about it in a COMPLETELY different way–different in the “automatic finding” and also different in the “deriving interestingness”.

For automatic harvesting of possible material, Richardson looks to a number of resources: not only whatever is pointed at by his RSS feeds, but also whatever is pointed at by what his RSS feeds point at. Basically, he wants to search not only his information sources, but also what his own information sources treat as sources themselves. He also harvests from Technorati, some custom Google queries, delicious (until Joshua got mad =) ), etc.
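To make the "sources of my sources" idea concrete, here is a minimal sketch of a two-hop harvest over a link graph. It is not Ultra Gleeper's actual code; the `harvest` function, the `links` dictionary, and the feed and page names are all made up for illustration, and a real harvester would fetch and parse pages rather than read a hand-written dictionary.

```python
def harvest(links, subscriptions, depth=2):
    """Collect pages reachable from the subscribed feeds within `depth` hops.

    `links` maps a page (or feed) name to the names it points at.
    This is a toy stand-in for real fetching and link extraction.
    """
    subscribed = set(subscriptions)
    seen = set()
    frontier = set(subscribed)
    for _ in range(depth):
        next_frontier = set()
        for page in frontier:
            for target in links.get(page, ()):
                # Skip pages we already collected and the feeds themselves.
                if target not in seen and target not in subscribed:
                    seen.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return seen


# Hypothetical link graph: feeds point at posts, posts point at other pages.
links = {
    "feed_a": ["post_1", "post_2"],
    "feed_b": ["post_2", "post_3"],
    "post_1": ["page_x"],
    "post_3": ["page_y", "page_z"],
}

first_degree = harvest(links, ["feed_a", "feed_b"], depth=1)
second_degree = harvest(links, ["feed_a", "feed_b"], depth=2)
print(sorted(first_degree))
print(sorted(second_degree - first_degree))
```

With `depth=1` only the posts the feeds link to come back; with `depth=2` the pages those posts link to are added as well, which is where the candidate pool starts to grow quickly.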

Good stuff. This harvesting technique definitely broadens the search space. If I look at it in natural language processing terms, I would say it increases “recall”: out of the set of all possible interesting articles, looking at a larger pool of articles increases our chances of finding interesting stuff that a more conservative algorithm would miss. However, if we take this approach then we need much stronger algorithms for guessing interestingness; otherwise precision will suffer. Now, I want to say that the RSS feeds that I already subscribe to are nearly guaranteed to point to “interesting” pages; otherwise I wouldn’t subscribe to them.

In other words, first-degree information sources (what I treat as a source) are close to guaranteed high precision but low recall. Second-degree sources (what my sources treat as a source), by contrast, give much higher recall (the quantity is on the order of n-squared rather than n), but take a hit to precision.
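The precision/recall trade-off is easy to see with a toy calculation. The sketch below uses the standard definitions (precision = hits / retrieved, recall = hits / relevant); the article sets are invented purely for illustration and are not measurements of any real feed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four articles out there are actually interesting.
interesting = {"a", "b", "c", "d"}

# First-degree harvest: small, but everything in it is interesting.
first_degree = {"a", "b"}

# Second-degree harvest: finds one more interesting article, plus noise.
second_degree = {"a", "b", "c", "v", "w", "x", "y", "z"}

print(precision_recall(first_degree, interesting))   # (1.0, 0.5)
print(precision_recall(second_degree, interesting))  # (0.375, 0.75)
```

In this invented case the wider harvest raises recall from 0.5 to 0.75 while precision drops from 1.0 to 0.375, which is the gap a better interestingness ranker would have to close.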

With an order of magnitude (or more!) more information, we’d need big changes: changes in the GUI, changes in what we expect from the program, changes in the ranking algorithm.

Richardson also addresses the last of these points: the change in ranking algorithm. To compute interestingness, he does something smart that’s almost like a “reverse PageRank”. He says “things are interesting if they point to interesting things” (contrast this with Google’s PageRank, which says “things are interesting if they are pointed to by interesting things”).

The good thing about this is that, once you have some initial seed data, it becomes a sort of passive goodness metric. It bootstraps off of the initial data you provide so that it can continue learning even after you stop providing it data. Something close to unsupervised machine learning, in other words (strictly speaking semi-supervised, since it starts from a labeled seed).
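Here is one way the “things are interesting if they point to interesting things” idea could be sketched. This is my own toy version, not Richardson’s implementation: the `propagate_interest` function, the `damping` factor, the averaging rule, and all page names are assumptions made for illustration. Seed pages keep the scores I gave them, and every other page gets a damped average of the scores of the pages it links to.

```python
def propagate_interest(links, seeds, rounds=20, damping=0.85):
    """Score pages by the interestingness of what they link TO.

    `links` maps a page to the pages it points at; `seeds` maps
    hand-rated pages to scores in [0, 1]. Both are hypothetical inputs.
    """
    # Every page that appears anywhere in the graph starts at zero.
    scores = {page: 0.0 for page in links}
    for targets in links.values():
        for target in targets:
            scores.setdefault(target, 0.0)
    scores.update(seeds)

    for _ in range(rounds):
        updated = {}
        for page in scores:
            if page in seeds:
                # Hand-rated pages are held fixed.
                updated[page] = seeds[page]
                continue
            targets = links.get(page, [])
            if targets:
                average = sum(scores[t] for t in targets) / len(targets)
                updated[page] = damping * average
            else:
                updated[page] = 0.0
        scores = updated
    return scores


# Made-up graph: three blogs and one page that links to a blog.
links = {
    "blog_a": ["good_1", "good_2"],
    "blog_b": ["good_1", "dull_1"],
    "blog_c": ["dull_1"],
    "meta": ["blog_a"],
}
seeds = {"good_1": 1.0, "good_2": 1.0, "dull_1": 0.0}

scores = propagate_interest(links, seeds)
for page in ("blog_a", "blog_b", "blog_c", "meta"):
    print(page, round(scores[page], 4))
```

Note that `meta` was never rated and links to nothing I rated, yet it still picks up a score through `blog_a`. That is the bootstrapping effect: the seed ratings keep spreading to pages I never looked at.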

Lots of things to think on.

One Comment

  1. Hello,

    You know what is really frustrating sometimes with RSS? Some readers will render an RSS feed one way and others will render the same feed another way. I have found it very tough these days to make all the RSS readers happy, due to the quick and recent evolution of the technology.

    Matt

    Posted on 06-Apr-07 at 09:45