to Daniel Greenblatt
cc John Geraci
date Thu, Apr 2, 2009 at 9:18 PM
subject Re: making SickCity more accurate
I don't know that we absolutely need to know the total number of tweets in a city on a given day. What if we just looked at historical mentions of keywords per day? So, for example, if we knew that an average Monday in NYC had 15 mentions of flu, then 25 mentions would be high.
Or we could do it by looking at what the graph of a typical week's mentions looks like. Saturday is typically 45% below Friday, etc., then use that to derive a "truer" number, so to speak.
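To make the per-weekday idea concrete, here is a minimal Python sketch. It assumes the history is available as a list of (date, mention count) pairs; the function names, the sample counts, and the "mean plus k standard deviations" cutoff are illustrative assumptions, not something already in our system.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def weekday_baseline(history):
    """history: list of (date, mention_count) pairs.
    Returns {weekday: (mean, stdev)} with Monday == 0."""
    by_day = defaultdict(list)
    for day, count in history:
        by_day[day.weekday()].append(count)
    return {wd: (mean(c), stdev(c) if len(c) > 1 else 0.0)
            for wd, c in by_day.items()}

def is_high(day, count, baseline, k=2.0):
    """True if count exceeds that weekday's mean by more than k stdevs.
    The cutoff k is an assumed tuning knob."""
    avg, sd = baseline[day.weekday()]
    return count > avg + k * sd

# Made-up flu mentions for four consecutive Mondays in NYC.
history = [(date(2009, 3, 2), 14), (date(2009, 3, 9), 16),
           (date(2009, 3, 16), 15), (date(2009, 3, 23), 15)]
baseline = weekday_baseline(history)
print(is_high(date(2009, 3, 30), 25, baseline))
```

With an average Monday of 15 mentions, 25 is flagged as high while 16 is not.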
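A rough Python sketch of the "truer number" idea: divide each day's count by that weekday's usual share of the weekly average. The typical-week numbers are invented for illustration (only the Saturday-45%-below-Friday ratio comes from the example above), and the function names are hypothetical.

```python
def weekday_factors(typical_week):
    """typical_week: 7 average counts, Monday first.
    Each factor is that day's average divided by the weekly average."""
    weekly_avg = sum(typical_week) / len(typical_week)
    return [c / weekly_avg for c in typical_week]

def adjusted(count, weekday, factors):
    """Rescale a raw count so that days of the week become comparable."""
    return count / factors[weekday]

# Invented typical week: Saturday (11) is 45% below Friday (20).
typical_week = [20, 20, 20, 20, 20, 11, 15]
factors = weekday_factors(typical_week)

# A normal Friday and a normal Saturday map to the same adjusted level.
print(adjusted(20, 4, factors), adjusted(11, 5, factors))
```

After the correction, a Saturday count only stands out if it is unusual for a Saturday.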
Seems like we have enough data in our own system to be able to correct discrepancies. Of course I'm no statistician ;). Mathieu, your thoughts on this?
I think you're right. The distribution of the number of sick tweets over a week should be the same for all cities (when there is no real epidemic). It's surely worth checking, but I don't see any a priori reason to think otherwise. So we should build a graph of the typical week using data from all cities, to make sure it is as accurate as possible. Do we have enough data? We will never have enough data :-) but we will be able to put error bars on the graph.
That was the practical answer, ready to be implemented. But I don't think it's the right direction in the long term. I think we will run into a similar problem on special non-working days, or on a particularly rainy Sunday in NYC. I suggest that we look closely at the correlation between the total # of tweets and the total # of sick tweets. The results could then be presented this way: Today, in NYC, x% of the tweets were considered "sick", and that differs from our model by y.
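One possible way to build the pooled typical week with error bars, sketched in Python. It assumes one week of counts per city, converts each city to shares of its own weekly total so that big and small cities are comparable, and uses the standard error across cities as the error bar. The city data and the function name are made up.

```python
from math import sqrt
from statistics import mean, stdev

def pooled_week(city_weeks):
    """city_weeks: {city: [7 counts, Monday first]}.
    Returns a list of (mean_share, standard_error), one pair per weekday."""
    # Normalise each city by its own weekly total.
    shares = [[c / sum(week) for c in week] for week in city_weeks.values()]
    result = []
    for wd in range(7):
        col = [s[wd] for s in shares]
        se = stdev(col) / sqrt(len(col)) if len(col) > 1 else float("nan")
        result.append((mean(col), se))
    return result

# Invented sick-tweet counts for one week in three cities.
city_weeks = {
    "NYC": [20, 21, 19, 20, 20, 11, 15],
    "SF": [9, 10, 10, 11, 10, 6, 7],
    "Boston": [5, 6, 5, 5, 6, 3, 4],
}
for wd, (share, se) in enumerate(pooled_week(city_weeks)):
    print(wd, round(share, 3), "+/-", round(se, 3))
```

If the error bars stay small as we add cities, that supports the assumption that the weekly shape is the same everywhere.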
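A small Python sketch of that presentation. It assumes the "model" is a plain least-squares line of sick tweets against total tweets; this is only one possible choice, and the counts and function names are invented.

```python
def fit_line(totals, sick):
    """Ordinary least squares for sick = slope * total + intercept."""
    n = len(totals)
    mx, my = sum(totals) / n, sum(sick) / n
    sxx = sum((x - mx) ** 2 for x in totals)
    sxy = sum((x - mx) * (y - my) for x, y in zip(totals, sick))
    slope = sxy / sxx
    return slope, my - slope * mx

def report(city, total, sick, slope, intercept):
    """Format today's sick percentage and its gap from the fitted model."""
    pct = 100 * sick / total
    expected_pct = 100 * (slope * total + intercept) / total
    return (f'Today, in {city}, {pct:.1f}% of the tweets were considered '
            f'"sick"; that differs from our model by {pct - expected_pct:+.1f} points.')

# Invented history: total tweets and sick tweets on three past days.
slope, intercept = fit_line([1000, 2000, 3000], [10, 20, 30])
print(report("NYC", 2000, 30, slope, intercept))
```

Because the model is a function of total volume, a holiday or a rainy Sunday that changes how much people tweet is absorbed by the total instead of showing up as a false alarm.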
The other insight was that detecting the initial surge is much more important than detecting peaks in an outbreak. (Google, btw, is accurate for peaks, inaccurate for initial surges.) By the time there is a peak, everyone knows it. What you want is the canary in the coal mine that tells health officials something is happening.
Statistics are widely (mis)used to predict peaks in finance. I know some people in that area; I will talk to them.
do we personally know any hard-core twitter users? if we know them personally (i don't) and can ask them when they were last sick, we can go through their twitter history and look at the kinds of things they tweeted while sick. this will, of course, change from person to person. but if we look at enough people i think we could start building some pretty effective regular expressions to determine 'sick' tweets...
In addition, we could also use our database, pick some tweets at random and, via a simple and quick interface, decide whether they are really sick tweets or false positives. That way, we can build two lists of word frequencies which will hopefully be different enough to make a better decision algorithm. I'm currently reading some books on more advanced techniques in natural language statistics.
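Here is a rough Python sketch of how the two word-frequency lists could feed a decision rule. It assumes we already have hand-labelled tweets, and it scores a new tweet by summing per-word log-odds with add-one smoothing (a naive-Bayes-style rule, chosen only as an illustration). The tweets and function names are invented.

```python
import re
from collections import Counter
from math import log

def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(tweets):
    """Word-frequency list for one group of labelled tweets."""
    counts = Counter()
    for t in tweets:
        counts.update(tokens(t))
    return counts

def sick_score(tweet, sick_counts, other_counts):
    """Sum of per-word log-odds with add-one smoothing.
    Positive leans 'really sick', negative leans 'false positive'."""
    vocab = len(set(sick_counts) | set(other_counts))
    n_sick = sum(sick_counts.values())
    n_other = sum(other_counts.values())
    score = 0.0
    for w in tokens(tweet):
        p_sick = (sick_counts[w] + 1) / (n_sick + vocab)
        p_other = (other_counts[w] + 1) / (n_other + vocab)
        score += log(p_sick / p_other)
    return score

# Invented hand-labelled examples.
sick_counts = word_counts(["home with the flu and a fever",
                           "i have the flu, staying in bed"])
other_counts = word_counts(["sick beat on this track",
                            "so sick of this rain"])
print(sick_score("stuck in bed with a fever", sick_counts, other_counts))
print(sick_score("sick of this track", sick_counts, other_counts))
```

With real labelled data the same two counters would also show which words separate the groups best, which could help when writing the regular expressions.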
For the technical part, I'm still writing R functions to interact with the database.