emails on SickCity development [2]

John Geraci
to Daniel Greenblatt
cc LeChfeck
Paul Watson
date Thu, Apr 2, 2009 at 11:07 AM
subject Re: making SickCity more accurate

I don't know that we absolutely need to know the total number of tweets in a city on a given day. What if we just looked at historical mentions of keywords per day? So, for example, if we knew that an average Monday in NYC had 15 mentions of flu, then 25 mentions would be high.
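
To make that concrete, here is a rough Python sketch of the per-weekday baseline idea. The function names and the history numbers are made up for illustration (only the "15 on an average Monday, 25 is high" figures come from the example above), and this is not how SickCity currently stores or queries its data.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_baselines(daily_counts):
    """Average keyword mentions per weekday (0 = Monday) from a {date: count} history."""
    buckets = defaultdict(list)
    for day, count in daily_counts.items():
        buckets[day.weekday()].append(count)
    return {weekday: mean(counts) for weekday, counts in buckets.items()}


def ratio_to_baseline(day, count, baselines):
    """Today's count as a multiple of the usual count for that weekday."""
    return count / baselines[day.weekday()]


# Made-up history: "flu" mentions in NYC on three consecutive Mondays.
history = {
    date(2009, 3, 9): 14,
    date(2009, 3, 16): 15,
    date(2009, 3, 23): 16,
}

baselines = weekday_baselines(history)  # the Monday average works out to 15
ratio = ratio_to_baseline(date(2009, 3, 30), 25, baselines)
print(f"Monday baseline: {baselines[0]}, today is {ratio:.2f}x the usual level")
```

What ratio counts as "high" is a separate decision; the sketch only reports how far a day sits from its own weekday's average.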

Or we could do it by looking at what the graph of a typical week's mentions looks like. Saturday is typically 45% below Friday, etc., then use that to derive a "truer" number, so to speak.
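
A minimal sketch of that correction, in Python, assuming we already have a weekly profile expressed relative to Friday. The only factor taken from the example above is Saturday at 45% below Friday (so 0.55); the table name, the function name, and the sample count are hypothetical, and the other weekdays would have to be filled in from our own data.

```python
# Hypothetical weekly profile: typical mentions on each weekday as a
# fraction of a typical Friday. Only the Saturday figure comes from the
# example in this email; other weekdays are left out on purpose.
WEEKDAY_FACTOR = {
    "Fri": 1.00,
    "Sat": 0.55,  # "Saturday is typically 45% below Friday"
}


def adjusted_mentions(raw_count, weekday, factors=WEEKDAY_FACTOR):
    """Scale a raw count to its Friday-equivalent, the "truer" number."""
    return raw_count / factors[weekday]


# 11 raw mentions on a Saturday correspond to roughly 20 Friday-equivalent mentions.
saturday_adjusted = adjusted_mentions(11, "Sat")
friday_adjusted = adjusted_mentions(20, "Fri")
print(round(saturday_adjusted, 1), round(friday_adjusted, 1))
```

Once every day is on the same scale, a quiet Saturday and a busy Friday can be compared directly.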

Seems like we have enough data in our own system to be able to correct discrepancies. Of course I'm no statistician ;). Mathieu, your thoughts on this?

On another note, I came away with some interesting insights from the health researcher I talked with on Tuesday.

One insight was that city health agencies are less interested in a tool that can tell, say, flu outbreaks apart from smallpox outbreaks, and are more interested in just a good general first alert system that tells them something is happening. Part of this is driven by the fact that lots of very serious diseases will be misdiagnosed by people as flu initially ("flu" could turn out to be flu, or bird flu, or cholera, or a dozen other things).

The other insight was that detecting the initial surge is much more important than detecting the peak of an outbreak. (Google, btw, is accurate for peaks, inaccurate for initial surges.) By the time there is a peak, everyone knows it. What you want is the canary in the coal mine that, again, tells health officials something is happening.
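
To illustrate the surge-versus-peak distinction, here is one possible first-alert rule sketched in Python: flag the first day whose count is well above the preceding week. The rule itself (mean plus two standard deviations over a 7-day window), the function name, and the counts are all assumptions for illustration, not something the researcher proposed, and the sketch ignores the weekday effect entirely.

```python
from statistics import mean, stdev


def first_alert_day(counts, window=7, k=2.0):
    """Index of the first day whose count exceeds mean + k * stdev
    of the preceding `window` days, or None if no day does."""
    for i in range(window, len(counts)):
        recent = counts[i - window:i]
        threshold = mean(recent) + k * stdev(recent)
        if counts[i] > threshold:
            return i
    return None


# Made-up daily keyword mentions: a quiet stretch, then an outbreak.
counts = [15, 14, 16, 15, 13, 15, 14, 16, 24, 35, 50, 48]

alert_day = first_alert_day(counts)
peak_day = counts.index(max(counts))
print(f"alert on day {alert_day}, peak on day {peak_day}")
```

With these numbers the alert fires two days before the peak, which is the kind of head start a first-alert system is after.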

Those two things, taken together, make me think we should hone our keyword list quite a bit. But I don't know what we should hone it to, exactly. Hoping to get input on that from the person I met with.