Text Analysis Thoughts for SickCity

This is an excerpt of an email from Ken, a friend who does term extraction and text analysis professionally, about running text analysis on SickCity tweets to separate the signal from the noise.

In response to my inquiry for URLs of places to read up on this, Ken writes:

(quote)
It's a bit of an art, there is no single recipe for it. Ok, here are some details...

One way to go is to use a package that automatically does all the massaging work, for example MALLET. It's a nice package with a few good algorithms, and it should get you started quickly. The only trouble is that it doesn't have one of the most powerful algorithms, SVMs.

Most packages require that you do the massaging yourself. Places to read more:
* The LingPipe website has a useful tutorial on text classification
* A few books are helpful, e.g., "Web Data Mining" by Liu, and "Foundations of Statistical Natural Language Processing" by Manning and Schütze
* I think there are a couple of free tutorials on SVMs for text classification on the web. One of the libraries I used (LIBSVM) had a decent tutorial.

Roughly, text preprocessing involves:
- Begin with a set of positive and negative text examples
- For each individual text, filter out punctuation/numbers/most symbols, and tokenize into single words
- Filter out stopwords (frequent words like 'the', 'and', etc)
- Count the frequency of every word in the corpus, and filter out highly infrequent words
- Convert each text into a sparse vector of numbers. It's generally a list of int:float pairs, where the first number is the index for a particular term, and the second is the weighted frequency of that term in the document. For term weighting, I usually use something like TF-IDF (you can read more about that on the web).
- Every machine learning package has a slightly different input format.
(end quote)
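
The preprocessing steps Ken outlines can be sketched in a few lines of plain Python. This is only a rough illustration, not what Ken or SickCity actually runs: the stopword list, the example tweets, the min_count threshold, the log(N/df) flavor of TF-IDF, and the "label index:weight" output (modeled on the sparse text format LIBSVM reads) are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "a", "i", "is", "to", "of", "in", "at",
             "my", "me", "have", "this", "for", "with"}


def tokenize(text):
    """Lowercase, keep only alphabetic runs (drops punctuation, numbers
    and symbols), then remove stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_vocabulary(token_lists, min_count=2):
    """Count every word across the corpus, drop words seen fewer than
    min_count times, and give each surviving term a 1-based index."""
    corpus_counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted(t for t, c in corpus_counts.items() if c >= min_count)
    return {term: i for i, term in enumerate(kept, start=1)}


def tfidf_vectors(token_lists, vocab):
    """Turn each text into a sparse {index: weight} dict, where
    weight = term count in the text * log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(t for t in set(tokens) if t in vocab)
    vectors = []
    for tokens in token_lists:
        counts = Counter(t for t in tokens if t in vocab)
        vec = {}
        for term, count in counts.items():
            weight = count * math.log(n_docs / doc_freq[term])
            if weight > 0:  # terms present in every text carry no weight
                vec[vocab[term]] = weight
        vectors.append(vec)
    return vectors


def to_sparse_line(label, vector):
    """Format one example as 'label index:weight ...', indices ascending."""
    pairs = " ".join(f"{i}:{w:.4f}" for i, w in sorted(vector.items()))
    return f"{label} {pairs}".strip()


# Hypothetical labeled examples: 1 = about being ill, -1 = noise.
examples = [
    (1, "I have the flu and a fever, feeling awful"),
    (1, "Home sick with flu, fever is 102"),
    (-1, "Bieber fever at the concert tonight!"),
    (-1, "This traffic makes me sick"),
]

token_lists = [tokenize(text) for _, text in examples]
vocab = build_vocabulary(token_lists, min_count=2)
vectors = tfidf_vectors(token_lists, vocab)
for (label, _), vector in zip(examples, vectors):
    print(to_sparse_line(label, vector))
```

Each printed line is one text: a label followed by int:float pairs, which is the general shape Ken describes. The exact layout would still need adjusting for whichever machine learning package ends up consuming it.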

21 May 13:49

Hi all, I'm just going away

By danharvey

Hi all,

I'm just going away for a week's break after exams, and will be starting on my project when I get back. I'll keep you updated on the work once I start and on how I go about it, but what I will be doing is very similar to what your friend described above. Most of my project is about being creative with the "art" of doing it...

A few questions: how long have you been storing Twitter data, and what data are you storing? Also, how many people are working on this, or want to?

Dan

27 May 20:31

re: just going away

By John Geraci

Hey Dan - cool, give us a shout when you get back and we'll discuss things from there. SickCity is starting to look pretty good these days, and I'm sure with some help on the text analysis side of things it could look great.

As for how long we've been storing data, the project has been up and running since March, and we grabbed data as far back as we could when we launched. I'm not sure if it's all the same data or not though - as we've been refining our terms, I imagine the dataset has evolved some.

The number of people involved in the project varies, depending on how exciting it is at any particular moment ;). The more exciting it gets, the more people will jump in. At core, there are 4 people right now who put in the long hours on it, and then there is a wider network of people who give ideas, help out here and there, etc.

-j.

04 Jun 11:57

Not much free time

By danharvey

Hi all,

I've started my project now, and the aim has changed slightly to predicting prediction markets as opposed to lab flu levels. This means my work is not as related as I thought it was going to be. I'm also finding I won't have as much free time as I thought I would, so I probably won't be able to help out much time-wise, but I'll try to give you a few pointers to get started.

One of the research groups that seems to be doing a lot of work, and also using Twitter I think, has released a recent paper about their work (http://www.jmir.org/2009/1/e11) which would be good to read to get an idea of what they are doing. It might be worth contacting them to see if you could work together on something or use their data. Here's a short presentation of their work too: http://www.slideshare.net/eysen/eysenbach-infodemiology-and-infoveillanc...

If you want to go further with the analysis, it would be worth reading up on regression, as your friend said, and then using something like LIBSVM to try a range of features. This is probably quite a steep learning curve, though, if you don't know much about statistics.

Hope this helps,
Dan

04 Jun 15:41

re: not much time

By John Geraci

> One of the research groups who seem to be doing a lot of
> work, and also using twitter I think, have released a recent
> paper about their work http://www.jmir.org/2009/1/e11

Yes, I've read that paper - it actually references SickCity. But the author has reacted negatively toward us on more than one occasion (see the comment here: http://radar.oreilly.com/2009/04/trying-to-track-swine-flu-acro.html) and, quite honestly, I don't think he would be open to working with DIYcity, or anyone else for that matter. He believes that anyone else working in this area is going to somehow steal his research dollars, despite the fact that we aren't looking for research dollars.

This is what he has so far: http://infovigil.com/ - a single home page with no content. And the only source that references it in Google is his own, non-peer-reviewed paper. Hardly a closed case for figuring this stuff out.

Anyway, thanks for the pointers. We'll continue to plug ahead with SickCity. It's doing much better now than it was just a few weeks ago, and I think with another round of improvements we will have something actually worth paying attention to!