How do news outlets like Google News automatically classify and rank documents about emerging topics, like "Obama's 2011 budget"?
I've got a pile of articles tagged with baseball data like player names and their relevance to the article (thanks, OpenCalais), and would love to create a Google News-style interface that ranks and displays new posts as they come in, especially emerging topics. I suppose a naive Bayes classifier could be trained with some static categories, but that doesn't really allow for tracking trends like "this player was just traded to this team, these other players were also involved."
No doubt Google News uses other tricks (or a combination of them), but one computationally cheap way to infer topics from free text exploits the NLP notion that a word gets its meaning only when connected to other words.
An algorithm capable of discovering new topic categories from multiple documents could be outlined as follows:
- POS (part-of-speech) tag the text
We probably want to focus more on nouns and maybe even more so on named entities (such as Obama or New England)
- Normalize the text
In particular, replace inflected words with their common stem. Maybe even replace some adjectives with a corresponding named entity (e.g. Parisian ==> Paris, legal ==> law).
Also, remove noise words and noise expressions.
- Identify words from a manually maintained list of "current / recurring hot words" (Superbowl, Elections, scandal...)
This can be used in subsequent steps to provide more weight to some N-grams
- Enumerate all N-grams found in each document (where N ranges from 1 to, say, 4 or 5)
Be sure to count, separately, the number of occurrences of each N-gram within a given document and the number of documents that cite a given N-gram
- The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the Topics.
- Identify the existing topics (from a list of known topics)
- [optionally] Manually review the new topics
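The counting and ranking steps above can be sketched roughly as follows. This is only a minimal illustration, not how Google News works: it skips POS tagging and stemming, the `STOP_WORDS` set is a tiny hypothetical placeholder, and the function names (`tokenize`, `ngrams`, `count_ngrams`, `candidate_topics`) and thresholds (`min_docs`, `min_len`) are made up for the example. Note that noise words are dropped before N-grams are built, so an N-gram may span a removed word.

```python
from collections import Counter
import re

# Hypothetical noise-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "on", "for"}

def tokenize(text):
    """Lowercase, keep letters/digits/apostrophes, drop noise words.
    No stemming or POS tagging is done in this sketch."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOP_WORDS]

def ngrams(tokens, max_n=3):
    """Yield every N-gram (as a tuple of words) for N = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def count_ngrams(documents, max_n=3):
    """Return (per-document occurrence counts, document frequency per N-gram)."""
    per_doc = []
    doc_freq = Counter()
    for text in documents:
        counts = Counter(ngrams(tokenize(text), max_n))
        per_doc.append(counts)          # occurrences within this document
        doc_freq.update(counts.keys())  # each document counts once per N-gram
    return per_doc, doc_freq

def candidate_topics(doc_freq, known_topics=(), min_docs=2, min_len=2):
    """Split N-grams cited by at least min_docs documents into known and new topics."""
    ranked = [(g, df) for g, df in doc_freq.most_common()
              if df >= min_docs and len(g) >= min_len]
    known = [(g, df) for g, df in ranked if g in known_topics]
    new = [(g, df) for g, df in ranked if g not in known_topics]
    return known, new
```

Calling `count_ngrams` on a batch of articles and passing the resulting document frequencies to `candidate_topics` gives a list of new candidates that could then go to manual review.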
This general recipe can also be altered to leverage other attributes of the documents and the text therein. For example, the document origin (say cnn/sports vs. cnn/politics ...) can be used to select domain-specific lexicons. As another example, the process can more or less heavily emphasize the words/expressions from the document title (or other areas of the text with particular mark-up).
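As a rough sketch of these two variations, the snippet below picks a hot-word lexicon from the document origin and gives title words extra weight. Everything in it is an assumption for illustration: the `LEXICONS` contents, the function names, and the weights `title_weight` and `hot_weight` are hypothetical and would need tuning on real data.

```python
from collections import Counter

# Hypothetical domain-specific lexicons keyed by document origin.
LEXICONS = {
    "cnn/sports": {"trade", "playoffs"},
    "cnn/politics": {"budget", "election"},
}

def lexicon_for(origin):
    """Return the hot-word set for an origin, or an empty set if unknown."""
    return LEXICONS.get(origin, set())

def weighted_counts(title_tokens, body_tokens,
                    title_weight=3.0, hot_words=(), hot_weight=2.0):
    """Score words: each body occurrence adds 1, each title occurrence adds
    title_weight, and the total for any hot word is multiplied by hot_weight."""
    scores = Counter()
    for w in body_tokens:
        scores[w] += 1.0
    for w in title_tokens:
        scores[w] += title_weight
    for w in list(scores):
        if w in hot_words:
            scores[w] *= hot_weight
    return scores
```

The same weighting could be applied to N-grams instead of single words by passing tuples rather than strings as tokens.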
The main algorithms behind Google News have been published in the academic literature by Google researchers:
- Original paper.
- Talk: Google News Personalization: Scalable Online Collaborative Filtering
- Blog discussion.