SickCity Development

Welcome! This is a group for a DIYcity project currently in development. To participate in discussion about the project, join the group. To help develop the project, visit the to-do list, where you can check out any task that isn't already checked out and work on it. When that task is complete, post a link to the code in the Development Group and it will be reviewed and then merged with the main code.
-
Project Name: SickCity
-
Description: SickCity is an application that monitors status messages from social networking sites like Twitter at the local (metro area) level for mentions of sickness, plots them over time, and acts as an alert system for disease outbreaks at the city level.
-
Current version in development: 1.1
-
To-do list for development: http://diycity.org/wiki/index.php?title=SickCity_to_do_list
-
Other product documents: http://diycity.org/wiki/index.php?title=SickCity_Documentation
-
Code: http://github.com/paulmwatson/sickcity/tree/master
-

Travelsharing.netsons.org

Cityleft has worked with Travelsharing.netsons.org to develop an open source website for carpooling.

Carpooling (also known as car-sharing, ride-sharing, lift-sharing and covoiturage) is the shared use of a car by the driver and one or more passengers, usually for commuting (Wikipedia).

Travelsharing.netsons.org, however, extends this approach to other forms of mobility, such as biking and hiking.

The website is still in beta. Users can join the community to help translate the content into local languages.

To take part in the Travelsharing.netsons.org project, visit:
www.Travelsharing.netsons.org

Cityleft

Text Analysis Thoughts for SickCity

This is an excerpt of an email from Ken, a friend who does term extraction and text analysis professionally, about doing text analysis on SickCity tweets to separate the signal from the noise.

In response to my inquiry for URLs of places to read up on this, Ken writes:

(quote)
It's a bit of an art, there is no single recipe for it. Ok, here are some details...

One way to go is to use a package that automatically does all the massaging work, for example MALLET. It's a nice package with a few good algorithms, and it should get you started quickly. The only trouble is that it doesn't have one of the most powerful algorithms, SVMs.

Most packages require that you do the massaging yourself. Places to read more:
* The LingPipe website has a useful tutorial on text classification
* A few books are helpful, e.g., "Web Data Mining" by Liu and "Foundations of Statistical Natural Language Processing" by Manning and Schütze
* I think there are a couple of free tutorials on SVMs for text classification on the web. One of the libraries I used (LIBSVM) had a decent tutorial.

Roughly, text preprocessing involves:
- Begin with a set of positive and negative text examples
- For each individual text, filter out punctuation/numbers/most symbols, and tokenize into single words
- Filter out stopwords (frequent words like 'the', 'and', etc)
- Count the frequency of every word in the corpus, and filter out highly infrequent words
- Convert each text into a sparse vector of numbers. It's generally a list of int:float pairs, where the first number is the index for a particular term, and the second is the weighted frequency of that term in the document. For term weighting, I usually use something like TF-IDF (you can read more about that on the web).
- Every machine learning package has a slightly different input format.
(end quote)
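
As a rough illustration of the preprocessing steps Ken lists, here is a minimal Python sketch using only the standard library. The stopword list, the min_count cutoff, the function names and the example tweets are all made up for illustration. The weighting is one common TF-IDF variant (raw term count times log(N / document frequency)), not necessarily the one Ken uses.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "a", "i", "to", "of", "is", "my", "have", "with", "in"}

def tokenize(text):
    """Lowercase, keep only alphabetic words, then drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def build_vocabulary(token_lists, min_count=2):
    """Give an integer index to every term seen at least min_count times in the corpus."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}

def tfidf_vectors(token_lists, vocab):
    """Return one sparse vector per text: a sorted list of (term index, weight) pairs."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update({w for w in tokens if w in vocab})
    vectors = []
    for tokens in token_lists:
        tf = Counter(w for w in tokens if w in vocab)
        vec = [(vocab[w], n * math.log(n_docs / doc_freq[w])) for w, n in tf.items()]
        vectors.append(sorted(vec))
    return vectors

# Made-up example tweets (first two "sick", last two noise).
texts = [
    "I have the flu and a fever",
    "Home sick with the flu",
    "Swine flu news is everywhere",
    "Sick of the swine flu news",
]
tokens = [tokenize(t) for t in texts]
vocab = build_vocabulary(tokens)
for text, vec in zip(texts, tfidf_vectors(tokens, vocab)):
    print(text, "->", vec)
```

Note that a word appearing in every text (here "flu") gets a weight of zero under this variant, which is one reason a single keyword is a weak signal on its own. The int:float pairs would still need to be written out in whatever input format the chosen machine learning package expects.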

SickCity Needs Someone To Do Text Analysis

Hey all,

SickCity is really in need of some text analytics work (entity extraction, classification) to make it really seaworthy. The team working on it has gotten as far as it can by regular means of searching keywords, omitting bad words, etc. We need to step it up and do some professional-level term analysis.

Do you, or someone you know, know how to do this?

If so, let us know. Or just show up in the SickCity Dev Group and say hi.

SickCity Update 5/4/09

So, we went through a big learning curve last week with SickCity, in the face of the swine flu hysteria that swept around the world. The tool went from being marginally useful, though still a bit noisy, to totally drowned in noise and hence useless, in the space of a day. The team spent the better part of the week trying to come up with ways to combat this, but in the face of the growing storm of tweets about flu and everything sickness-related, we eventually realized our attempts at beating swine flu were useless for now.

Overall, the experience gave us a lot to reflect on and will ultimately make SickCity a much more robust and useful tool. It was sort of a trial by fire, which SickCity failed, but which also positioned us to pass our next trial.

Things we played with during the week were:

- creating a blacklist of words that would cause SickCity to skip particular tweets, and letting anyone visiting the site add to that list (see http://sickcity.org/badwords). We got lots of submissions, but it didn't stem the tide.

- letting anyone remove a tweet from the system that wasn't really related to being sick (see http://sickcity.org/USA/Seattle/phrase/flu). This also worked a bit, but not thoroughly enough in the face of the huge onslaught of noisy tweets.

At one point SickCity was processing over 1500 tweets a minute related to flu (almost none of them by people who actually had flu).

So we stopped for the week, threw in the towel, and came up with a new search strategy which we're implementing now. I think this will be much more reliable.

Other improvements that were made to SickCity along the way:

- the top ten sickest cities list is now based on a "sickness quotient" derived by dividing the number of "sick" tweets by the total number of daily tweets for that city. (Formerly it was purely based on total number of sick tweets, which meant that the bigger cities tended to show up as the "sickest cities").

- this top ten sickest list is based on today's data and is updated regularly throughout the day. (this is actually interrupted right now, but will be back soon).

- now you can read the full text of each tweet on the SickCity page w/o clicking through to Twitter. This allows visitors to easily see which tweets are signal and which are noise, and make their own conclusions about the data.

- cities now have overall "sickness" graphs for the past 30 days, showing you, in sum, how much "sick tweet" activity there has been in that city over the past month.
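
To make the "sickness quotient" from the list above concrete, here is a small Python sketch. The function names and the numbers are invented for illustration and this is not the code running on the site; it only shows the division and the ranking as described.

```python
def sickness_quotient(sick_tweets, total_tweets):
    """Share of a city's daily tweets that were flagged as sick."""
    if total_tweets == 0:
        return 0.0
    return sick_tweets / total_tweets

def sickest_cities(daily_counts, top_n=10):
    """Rank cities by quotient, highest first. daily_counts maps city -> (sick, total)."""
    ranked = sorted(
        daily_counts.items(),
        key=lambda item: sickness_quotient(*item[1]),
        reverse=True,
    )
    return [city for city, _ in ranked[:top_n]]

# Made-up numbers for illustration only.
counts = {"New York": (120, 60000), "Seattle": (45, 9000), "Austin": (10, 4000)}
print(sickest_cities(counts))
```

With these made-up numbers Seattle ranks first even though New York has the most sick tweets in absolute terms, which is the bias the quotient is meant to remove.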

Once we get our better search strategy in place, we should have a pretty workable, maybe even reliable, system.

We have several other improvements to make once the new search strategy is in place. Will post on those later.

BTW, still trying to work it out such that developers communicate directly through this group when developing. For now though it seems preferable for them to talk directly via email or in Campfire for group chat. If you want to join in on the development process, drop a note.

SickCity Upgrades?

Hey all - what's the likelihood of us getting SickCity upgraded along the lines we've been talking about by next week? Would love to have the new (and much better) version to show the people from the Department of Health. Think it would really impress.

Approach for Getting Total # Tweets

yeah - i'm not sure that there's any elegant way to do it, just a brute force approach of querying every so often and incrementing a count. i think i'll just keep track of the count and not the content of each tweet. all in one db sounds good to me...

d

On Sun, Apr 5, 2009 at 4:48 PM, Paul Watson wrote:
> OK, I should be able to get total number of tweets / city / day (on UTC
> time) working within the next couple of days.

Good stuff. I was racking my brain trying to think of a reasonable way
of getting a tweet count for a city. Didn't think of anything.

Are you recording each tweet or just counts?

> Paul, from a db perspective, do you think it makes sense to keep this all in
> one database or have a different database for each city we're tracking
> (where each record would essentially be the total number of tweets for a
> given day)?

I'd keep it all in one database. Rails can work across multiple
databases, but it isn't very easy. It would also make running
SickCity.com with 300+ cities harder.

cheers,
Paul
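
For anyone following along, here is a rough sketch of the single-table approach discussed in this thread: one row per city per UTC day, holding only a running count. It uses Python with SQLite purely for illustration (the project itself runs on Rails), and the table name, column names and function names are assumptions, not the actual schema.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open the database and create the counts table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_tweet_counts (
               city_id     INTEGER NOT NULL,
               day         TEXT    NOT NULL,  -- UTC date, e.g. '2009-04-05'
               tweet_count INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (city_id, day)
           )"""
    )
    return conn

def add_tweets(conn, city_id, day, new_tweets):
    """Add the tweets seen in the latest poll to that city's total for the day."""
    # Create the row on the first poll of the day, then increment it.
    conn.execute(
        "INSERT OR IGNORE INTO daily_tweet_counts (city_id, day, tweet_count) "
        "VALUES (?, ?, 0)",
        (city_id, day),
    )
    conn.execute(
        "UPDATE daily_tweet_counts SET tweet_count = tweet_count + ? "
        "WHERE city_id = ? AND day = ?",
        (new_tweets, city_id, day),
    )
    conn.commit()

def total_for(conn, city_id, day):
    """Return the stored count, or 0 if nothing was recorded for that city and day."""
    row = conn.execute(
        "SELECT tweet_count FROM daily_tweet_counts WHERE city_id = ? AND day = ?",
        (city_id, day),
    ).fetchone()
    return row[0] if row else 0

conn = open_db()
add_tweets(conn, 1, "2009-04-05", 40)   # first poll of the day
add_tweets(conn, 1, "2009-04-05", 25)   # a later poll
print(total_for(conn, 1, "2009-04-05"))
```

Even with 300+ cities this adds only a few hundred small rows per day, so a single table should stay manageable for a long time.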

emails on SickCity development [3]

LeChfeck
to Daniel Greenblatt
cc John Geraci
Paul Watson
date Thu, Apr 2, 2009 at 9:18 PM
subject Re: making SickCity more accurate

Hi,

> I don't know that we absolutely need to know the total number of tweets in a city on a given day. What about if we just looked at historical mentions of keywords per day? So, for example, if we knew that an average Monday in NYC had 15 mentions of flu, then 25 mentions would be high.
>
> Or we could do it by looking at what the graph of a typical week's mentions looks like. Saturday is typically 45% below Friday, etc. We could then use that to derive a "truer" number, so to speak.
>
> Seems like we have enough data in our own system to be able to correct discrepancies. Of course I'm no statistician ;). Mathieu, your thoughts on this?

I think you're right. The distribution of the number of sick tweets over a week should be the same for all cities (when there is no real epidemic). It's certainly worth checking, but I don't see any a priori reason to think otherwise. So we should build a graph of the typical week using data from all cities, to make sure it is as accurate as possible. Do we have enough data? We will never have enough data :-) but we will be able to put error bars on the graph.

That was the practical answer, ready to be implemented. But I don't think it's the right direction in the long term. I think we will run into a similar problem on special non-working days, or on a particularly rainy Sunday in NYC. I suggest that we look closely at the correlation between the total # of tweets and the total # of sick tweets. The results could then be presented this way: Today, in NYC, x% of the tweets were considered "sick", and that differs from our model by y.

> The other insight was that detecting the initial surge is much more important than detecting peaks in an outbreak. (Google, btw, is accurate for peaks, inaccurate for initial surges). By the time there is a peak, everyone knows it. What you want is the canary in the coal mine that, again, tells health officials something is happening.

Statistics are widely (mis)used to predict peaks in finance. I know some people in this area; I will talk to them.

> do we personally know any hard-core twitter users? if we know them personally (i don't) and can ask them the last time they were sick, we can go through their twitter history and look at the kinds of things they tweeted while sick. this will, of course, change from person to person. but if we look at enough people i think we could start building some pretty effective regular expressions to determine 'sick' tweets...

In addition, we could use our database: pick some tweets at random and, via a simple and quick interface, decide whether they are really sick tweets or false positives. That way, we can build two lists of word frequencies, which will hopefully be different enough to make a better decision algorithm. I'm currently reading some books on more advanced techniques in natural language statistics.

For the technical part, I'm still writing R functions to interact with the database.

Best,
Mathieu
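
A possible sketch of the labelling idea from the email above: take tweets that have been hand-marked as really sick or as false positives, build the two word-frequency lists, and score new tweets by how much more typical their words are of one list than the other. This is Python rather than the R mentioned in the email, and the log-ratio with add-one smoothing is a swapped-in, naive-Bayes-style choice, not something the email specifies. The function names and example tweets are made up.

```python
import math
import re
from collections import Counter

def word_counts(tweets):
    """Count how often each word occurs across a list of tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def word_weights(sick_tweets, noise_tweets):
    """Positive weight: word is more typical of sick tweets. Negative: of false positives."""
    sick, noise = word_counts(sick_tweets), word_counts(noise_tweets)
    vocab = set(sick) | set(noise)
    # Add-one smoothing so words missing from one list do not cause log(0).
    sick_total = sum(sick.values()) + len(vocab)
    noise_total = sum(noise.values()) + len(vocab)
    return {
        w: math.log((sick[w] + 1) / sick_total) - math.log((noise[w] + 1) / noise_total)
        for w in vocab
    }

def score(text, weights):
    """Sum the weights of a tweet's words; unknown words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in re.findall(r"[a-z']+", text.lower()))

# Made-up hand-labelled examples.
sick = ["ugh home sick with the flu", "fever and chills all night"]
noise = ["swine flu news everywhere", "sick of hearing about swine flu"]
weights = word_weights(sick, noise)
print(score("fever and chills", weights), score("swine flu news", weights))
```

With only a handful of labelled tweets the weights are very noisy, so a scheme like this would need a reasonably large hand-labelled sample before it could be trusted.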

emails on SickCity development [2]

John Geraci
to Daniel Greenblatt
cc LeChfeck
Paul Watson
date Thu, Apr 2, 2009 at 11:07 AM
subject Re: making SickCity more accurate

I don't know that we absolutely need to know the total number of tweets in a city on a given day. What about if we just looked at historical mentions of keywords per day? So, for example, if we knew that an average Monday in NYC had 15 mentions of flu, then 25 mentions would be high.

Or we could do it by looking at what the graph of a typical week's mentions looks like. Saturday is typically 45% below Friday, etc. We could then use that to derive a "truer" number, so to speak.

Seems like we have enough data in our own system to be able to correct discrepancies. Of course I'm no statistician ;). Mathieu, your thoughts on this?

On another note, I came away with some interesting insights from the health researcher I talked with on Tuesday.

One insight was that city health agencies are less interested in a tool that can tell, say, flu outbreaks apart from smallpox outbreaks, and are more interested in just a good general first alert system that tells them something is happening. Part of this is driven by the fact that lots of very serious diseases will be mis-diagnosed by people as flu initially ("flu" could turn out to be flu, or bird flu, or cholera, or a dozen other things).

The other insight was that detecting the initial surge is much more important than detecting peaks in an outbreak. (Google, btw, is accurate for peaks, inaccurate for initial surges). By the time there is a peak, everyone knows it. What you want is the canary in the coal mine that, again, tells health officials something is happening.

Those two things, taken together, make me think we should hone our keyword list quite a bit. But I don't know what we should hone it to, exactly. Hoping to get input on that from the person I met with.
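
The weekday-baseline idea in this email (an average Monday has about 15 mentions, so 25 would be high) could be sketched roughly as below. This is a minimal Python illustration; the function names, the made-up history and the threshold of two standard deviations are all assumptions, not something the team has settled on.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def weekday_baseline(history):
    """history: list of (date, mention_count). Returns {weekday: (mean, stdev)}, Monday = 0."""
    by_weekday = defaultdict(list)
    for day, count in history:
        by_weekday[day.weekday()].append(count)
    # stdev needs at least two observations per weekday.
    return {wd: (mean(c), stdev(c)) for wd, c in by_weekday.items() if len(c) >= 2}

def is_unusually_high(day, count, baseline, threshold=2.0):
    """True if count is more than `threshold` standard deviations above that weekday's mean."""
    avg, spread = baseline[day.weekday()]
    if spread == 0:
        return count > avg
    return (count - avg) / spread > threshold

# Made-up flu mentions for five consecutive Mondays in one city.
history = [
    (date(2009, 3, 2), 14),
    (date(2009, 3, 9), 16),
    (date(2009, 3, 16), 15),
    (date(2009, 3, 23), 13),
    (date(2009, 3, 30), 17),
]
baseline = weekday_baseline(history)
print(is_unusually_high(date(2009, 4, 6), 25, baseline))
```

Comparing each day only against the same weekday is what absorbs the regular weekly pattern, such as Saturdays running well below Fridays.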

emails on SickCity development [1]

Daniel Greenblatt
to John Geraci
cc LeChfeck
Paul Watson
date Wed, Apr 1, 2009 at 11:02 PM
subject Re: making SickCity more accurate

I agree that in order to get any kind of normalized data we need to be talking about 'sick' tweets as a percentage of all tweets (on a given day) and not an absolute number. So I will work (this weekend, I hope) on putting in some code to fetch the total number of tweets in a city on any given day. Some questions:

1) Is it okay if we just do this from the current time forward, or do we want the backdated information as well? i realize that ideally we want this info for all tracked cities for all days, but perhaps i'll start by getting the total count for current days onwards.

2) Paul - any idea on how to do this from a database perspective? anything more elegant than having a single table that tracks city_id, date and tweet_count? (would have a looooot of records, but very little data for each record).

Dan
