[Advanced users:] In part 2 of this series, I discussed prettying up a “river of news”, which is essentially a mashup of several RSS feeds that is typically displayed on a web page instead of in an RSS reader. What I built was a clone of Techmeme River, though my goal is to build a clone of Techmeme or Megite. Moving from a river of news to a Techmeme clone requires a way to group, or cluster, related articles. Now, I’ve been racking the old noggin for about eight months to produce a suitable clustering algorithm, so I’m not going to give away what I’ve got. But I will give you some things to think about.
(Note: in the discussion below, “news item” is synonymous with “blog article”. Also, there’s no need to do this in WordPress, but that’s my preferred publishing platform.)
- Define a list of keywords and phrases that you’ll want to cluster news items around. For example, in the automotive niche, you’ll have cars, trucks, motorcycles and accessories. Under cars, you might subdivide topics by manufacturer and model, but under motorcycles you might subdivide by engine size. It all depends on what you want your niche monitor to cover.
- Prioritize your topic list, from most important to least.
- Partition your stream of news items by tagging each with the most relevant topic.
- A news item should only appear once in your Techmeme clone – that is, in only one cluster. So once you tag it with a topic or sub-topic, it cannot be tagged for anything else.
- Cluster items for each topic.
- Rank items in each cluster by choosing the oldest item for the main headline. Display links to the remaining clustered headlines. (If you look at any Techmeme item, you’ll see a “Discussion” section. Those are the news items that were published later.)
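To make the steps above a bit more concrete, here is a minimal Python sketch. It is not the algorithm I’m keeping to myself, just one naive way to wire the steps together. Everything named in it is an assumption made up for illustration: the `TOPICS` list and its keywords, the `tag_item` and `build_clusters` helpers, and the idea that each news item is a dict with a `title` and a `published` timestamp. In particular, “most relevant topic” is approximated here by “first topic in priority order whose keyword appears in the title”, which is a crude stand-in.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical topic list, ordered from most to least important.
# Each topic maps to the keywords that should trigger it.
TOPICS = [
    ("motorcycles", {"motorcycle", "motorcycles"}),
    ("trucks", {"truck", "trucks", "pickup"}),
    ("cars", {"car", "cars", "sedan"}),
]


def tag_item(item):
    """Return the highest-priority topic matching the item's title, or None."""
    words = set(re.findall(r"[a-z0-9]+", item["title"].lower()))
    for topic, keywords in TOPICS:
        if words & keywords:
            # Stop at the first match so an item gets exactly one topic.
            return topic
    return None


def build_clusters(items):
    """Partition items by topic, then rank each cluster oldest-first."""
    by_topic = defaultdict(list)
    for item in items:
        topic = tag_item(item)
        if topic is not None:  # untagged items are simply left out
            by_topic[topic].append(item)

    clusters = []
    for topic, _keywords in TOPICS:  # keep the priority order for display
        members = sorted(by_topic[topic], key=lambda i: i["published"])
        if members:
            clusters.append({
                "topic": topic,
                "headline": members[0],     # oldest item leads the cluster
                "discussion": members[1:],  # later items are listed beneath it
            })
    return clusters


if __name__ == "__main__":
    sample = [
        {"title": "New pickup truck unveiled", "published": datetime(2008, 1, 2)},
        {"title": "Truck review: first drive", "published": datetime(2008, 1, 1)},
        {"title": "Sedan sales rise", "published": datetime(2008, 1, 3)},
    ]
    for cluster in build_clusters(sample):
        print(cluster["topic"], "->", cluster["headline"]["title"])
        for later in cluster["discussion"]:
            print("    discussion:", later["title"])
```

Because `tag_item` returns as soon as one topic matches, a title that mentions both a car and a truck lands only in the higher-priority cluster, which is the “appear only once” rule from the list. Matching whole words in the title is deliberately simplistic; a real monitor would probably also look at the article body and weigh several criteria.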
This is simply one possible high-level algorithm. I’m intentionally not giving away all the details that I’ve come up with, but I’ve provided some clues. Overall, your niche monitor is only as good as your clustering algorithm. Spend time on a good division of topics, and list various clustering criteria before deciding on an algorithm. There’s no rule stating that your algorithm has to mimic Techmeme exactly.