Blindly classifying new trends in incoming data

How do news outlets like Google News automatically classify and rank documents about emerging topics, like "Obama's 2011 budget"?

I've got a pile of articles tagged with baseball data like player names and relevance to the article (thanks, OpenCalais), and would love to create a Google News-style interface that ranks and displays new posts as they come in, especially emerging topics. I suppose that a naive Bayes classifier could be trained with some static categories, but this doesn't really allow for tracking trends like "this player was just traded to this team, these other players were also involved."

-------------Problems Reply------------

No doubt, Google News may use other tricks (or even a combination thereof), but one relatively cheap trick, computationally, to infer topics from free text would exploit the NLP notion that a word gets its meaning only when connected to other words.
An algorithm capable of discovering new topic categories from multiple documents could be outlined as follows:

  • POS (part-of-speech) tag the text
    We probably want to focus on nouns, and even more so on named entities (such as Obama or New England).
  • Normalize the text
    In particular, replace inflected words with their common stem. Maybe even replace some adjectives with a corresponding named entity (e.g. Parisian ==> Paris, legal ==> law).
    Also, remove noise words and noise expressions.
  • Identify words from a manually maintained list of "current / recurring hot words" (Superbowl, Elections, scandal...)
    This can be used in subsequent steps to give more weight to some N-grams.
  • Enumerate all N-grams found in each document (where N is 1 to, say, 4 or 5)
    Be sure to count, separately, the number of occurrences of each N-gram within a given document and the number of documents which cite a given N-gram.
  • The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the topics.
  • Identify the existing topics (from a list of known topics); the remaining ones are candidate new topics
  • [optionally] Manually review the new topics
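
The counting steps of this recipe can be sketched in a few lines of Python. This is only a rough illustration under simplifying assumptions: POS tagging and stemming are skipped, the stop-word list is a tiny placeholder, and the names `tokenize`, `ngrams`, `count_ngrams` and `candidate_topics` (as well as the `min_docs` threshold) are made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "for",
              "on", "was", "is", "by", "with", "at"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words.
    Note: dropping stop words first makes the surrounding words adjacent."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]

def ngrams(tokens, max_n=3):
    """Yield every n-gram (as a tuple of words) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def count_ngrams(documents, max_n=3):
    """Return (per-document occurrence counts, document frequency)."""
    per_doc = []
    doc_freq = Counter()
    for text in documents:
        counts = Counter(ngrams(tokenize(text), max_n))
        per_doc.append(counts)          # occurrences within this document
        doc_freq.update(counts.keys())  # each document counts once per n-gram
    return per_doc, doc_freq

def candidate_topics(doc_freq, known_topics, min_docs=2):
    """N-grams cited in at least min_docs documents and not already known."""
    return [(gram, df) for gram, df in doc_freq.most_common()
            if df >= min_docs and gram not in known_topics]
```

Calling `count_ngrams` on a batch of article texts gives both counts the recipe asks for; `candidate_topics` then filters out anything already in the known-topics list, leaving the n-grams worth a manual review.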

This general recipe can also be altered to leverage other attributes of the documents and the text therein. For example, the document origin (say cnn/sports vs. cnn/politics ...) can be used to select domain-specific lexicons. As another example, the process can more or less heavily emphasize words and expressions from the document title (or other areas of the text with particular mark-up).
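
As a rough sketch of those two variations, the snippet below boosts words that appear in the title and words that belong to a lexicon chosen by the document's origin. The weights, the lexicon contents and the function name `score_terms` are all hypothetical values picked for illustration, not anything Google News is known to use.

```python
import re
from collections import Counter

TITLE_WEIGHT = 3.0    # hypothetical extra score for each title occurrence
LEXICON_WEIGHT = 2.0  # hypothetical multiplier for domain-lexicon words

# Hypothetical domain lexicons, keyed by document origin.
LEXICONS = {
    "sports": {"traded", "season", "pitcher"},
    "politics": {"budget", "election", "senate"},
}

def words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def score_terms(title, body, origin):
    """Score each word: 1 per body occurrence, TITLE_WEIGHT per title
    occurrence, multiplied by LEXICON_WEIGHT if in the origin's lexicon."""
    scores = Counter()
    for w in words(body):
        scores[w] += 1.0
    for w in words(title):
        scores[w] += TITLE_WEIGHT
    lexicon = LEXICONS.get(origin, set())  # unknown origin: no boost
    for w in list(scores):
        if w in lexicon:
            scores[w] *= LEXICON_WEIGHT
    return scores
```

The same idea extends from single words to N-grams: score each N-gram by where it occurs and which lexicon it matches before ranking by document frequency.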

The main algorithms behind Google News have been published in the academic literature by Google researchers:

  • Original paper.
  • Talk: Google News Personalization: Scalable Online Collaborative Filtering
  • Blog discussion.
Category:statistics Views:0 Time:2010-02-01
