[Advanced users:] In part 2 of this series, I discussed prettying up a “river of news”, which is essentially a mashup of several RSS feeds but is typically displayed on a web page instead of in an RSS reader. What I built was a clone of Techmeme River, though my goal is to build a clone of Techmeme or Megite. Moving from a river of news to a Techmeme clone requires a way to group, or cluster, related articles. Now, I’ve been racking the old noggin for about eight months to produce a suitable clustering algorithm, so I’m not going to give away what I’ve got. But I will give you some things to think about.
(Note: in the discussion below, “news item” is synonymous with “blog article”. Also, there’s no need to do this in WordPress, but that’s my preferred publishing platform.)
- Define a list of keywords and phrases that you’ll want to cluster news items around. For example, in the automotive niche, you’ll have cars, trucks, motorcycles and accessories. Under cars, you might subdivide topics by manufacturer and model, but under motorcycles you might subdivide by engine size. It all depends on what you want your monitor to cover.
- Prioritize your topic list, from most important to least.
- Partition your stream of news items by tagging each with the most relevant topic.
- A news item should only appear once in your Techmeme clone – that is, in only one cluster. So once you tag it with a topic or sub-topic, it cannot be tagged for anything else.
- Cluster items for each topic.
- Rank items in each cluster by choosing the oldest item for the main headline. Display links to the remaining clustered headlines. (If you look at any Techmeme item, you’ll see a “Discussion” section. Those are the news items that were published later.)
This is simply one possible high-level algorithm. I’m intentionally not giving away all the details that I’ve come up with, but I’ve provided some clues. Overall, your niche monitor is only as good as your clustering algorithm. Spend time on a good division of topics, and list various clustering criteria before deciding on an algorithm. There’s no rule stating that your algorithm has to mimic Techmeme exactly.
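For readers who want a concrete starting point, here is a minimal Python sketch of the steps listed above. It is not the clustering algorithm hinted at in this post, just one naive reading of the list. Everything named in it is an assumption made for illustration: the `NewsItem` fields, the `TOPICS` list and its keywords, and the choice to treat "most relevant topic" as "first topic in priority order whose keyword appears in the title".

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class NewsItem:
    # Hypothetical shape of one item pulled from the river of news.
    title: str
    url: str
    published: datetime


# Hypothetical topic list, ordered from most to least important.
TOPICS = [
    ("cars", ["car", "cars", "sedan"]),
    ("trucks", ["truck", "trucks", "pickup"]),
    ("motorcycles", ["motorcycle", "motorcycles"]),
]


def tag_item(item, topics=TOPICS):
    """Return the highest-priority topic with a keyword in the title, else None."""
    # Normalize to space-padded lowercase words so "car" does not match "scary".
    text = " " + " ".join(re.findall(r"[a-z0-9]+", item.title.lower())) + " "
    for name, keywords in topics:
        if any(f" {kw} " in text for kw in keywords):
            return name  # first match wins, so each item gets exactly one topic
    return None


def build_clusters(items, topics=TOPICS):
    """Group items by topic; the oldest item in each group becomes the headline."""
    grouped = defaultdict(list)
    for item in items:
        topic = tag_item(item, topics)
        if topic is not None:
            grouped[topic].append(item)

    clusters = []
    for name, _ in topics:  # emit clusters in priority order
        members = sorted(grouped.get(name, []), key=lambda i: i.published)
        if members:
            clusters.append({
                "topic": name,
                "headline": members[0],     # oldest item leads the cluster
                "discussion": members[1:],  # later items are listed beneath it
            })
    return clusters
```

Calling `build_clusters(items)` on a list of `NewsItem` objects returns one dictionary per non-empty topic. Items that match no keyword are simply dropped here; a real monitor might collect them in a catch-all cluster instead. Matching on titles alone is crude, so treat the keyword test as the part to replace with your own clustering criteria.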
Steve: I thought about alternate tools such as Netvibes. I guess it depends on the end use. I want to set up niche monitors that other people might use, not just myself. So presenting the river on a web page seems the only way, short of sharing, say, my Yahoo Pipes mashups.
I have been doing something like this with my Netvibes setup, pulling relevant information for a niche together on one page to get a daily snapshot.
Stephan: An excellent idea. I hadn’t thought of that. The only thing to contend with is the fact that you would be pulling in items from multiple sources, so you might find that category/tag counts are sparse. They could be, but not necessarily. Still, that might be a good starting point for a clustering algorithm.
What if your list of keywords and phrases were pulled from your WordPress tags and categories to make it more automated? I am working on a similar system but mainly to cluster my bookmarks and tags from other feeds around my blog posts. Great subject. I will be staying tuned in.