So, we went through a big learning curve last week with SickCity, in the face of the swine flu hysteria that swept around the world. The tool went from being marginally useful, though still a bit noisy, to totally drowned in noise and hence useless, in the space of a day. The team spent the better part of the week trying to come up with ways to combat this, but in the face of the growing storm of tweets about flu and everything sickness-related, we eventually realized our attempts at beating swine flu were useless for now.
Overall, the experience gave us a lot to reflect on and will ultimately make SickCity a much more robust and useful tool. It was sort of a trial by fire, which SickCity failed, but which also positioned us to pass our next trial.
Things we played with during the week were:
- creating a blacklist of words that would cause SickCity to skip particular tweets, and letting anyone visiting the site add to that list. See here: http://sickcity.org/badwords We got lots of submissions, but they didn't stem the tide.
- letting anyone remove a tweet from the system that wasn't really related to being sick. See: http://sickcity.org/USA/Seattle/phrase/flu This also worked a bit, but not thoroughly enough in the face of the huge onslaught of noisy tweets.
At one point SickCity was processing over 1500 tweets a minute related to flu (almost none of them by people who actually had flu).
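For anyone curious what the blacklist idea looks like in practice, here is a minimal sketch in Python. It is not SickCity's actual code: the function names, the word-splitting rule, and the sample tweets are all made up for illustration. It simply skips any tweet that contains a blacklisted word.

```python
import re

def tokenize(text):
    """Lowercase a tweet and split it into a set of words (hypothetical helper)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def filter_tweets(tweets, blacklist):
    """Keep only tweets that contain none of the blacklisted words."""
    blocked = {word.lower() for word in blacklist}
    return [tweet for tweet in tweets if not (tokenize(tweet) & blocked)]

# Made-up sample data, not real SickCity tweets or the real blacklist.
tweets = [
    "Stuck in bed with the flu, feeling awful",
    "Swine flu hysteria is all over the news",
]
kept = filter_tweets(tweets, ["swine", "news"])
```

A word-level filter like this is cheap, but it also shows why the approach struggled: during a news storm the noisy tweets use the same words as the genuine ones, so no blacklist separates them cleanly.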
So we stopped for the week, threw in the towel, and came up with a new search strategy which we're implementing now. I think this will be much more reliable.
Other improvements that were made to SickCity along the way:
- the top ten sickest cities list is now based on a "sickness quotient" derived by dividing the number of "sick" tweets by the total number of daily tweets for that city. (Formerly it was purely based on total number of sick tweets, which meant that the bigger cities tended to show up as the "sickest cities").
- this top ten sickest list is based on today's data and is updated regularly throughout the day. (this is actually interrupted right now, but will be back soon).
- now you can read the full text of each tweet on the SickCity page without clicking through to Twitter. This allows visitors to easily see which tweets are signal and which are noise, and draw their own conclusions about the data.
- cities now have overall "sickness" graphs for the past 30 days, showing you, in sum, how much "sick tweet" activity there has been in that city over the past month.
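To make the "sickness quotient" ranking concrete, here is a rough sketch in Python. It assumes we already have, for each city, today's count of "sick" tweets and total tweets. The function names and the numbers are invented for illustration and are not taken from the SickCity code or data.

```python
def sickness_quotient(sick_tweets, total_tweets):
    """Share of a city's daily tweets that are "sick" tweets."""
    if total_tweets == 0:
        return 0.0  # assumption: a city with no tweets ranks at the bottom
    return sick_tweets / total_tweets

def sickest_cities(counts, limit=10):
    """Rank cities by sickness quotient, highest first.

    counts maps city name -> (sick_tweets, total_tweets).
    """
    ranked = sorted(
        counts.items(),
        key=lambda item: sickness_quotient(*item[1]),
        reverse=True,
    )
    return [city for city, _ in ranked[:limit]]

# Invented numbers: the big city has the most sick tweets but the lowest quotient.
counts = {
    "New York": (400, 100000),
    "Seattle": (90, 9000),
    "Boise": (5, 1000),
}
top = sickest_cities(counts)
```

With these made-up counts, ranking by raw sick tweets would put New York first, while ranking by quotient puts it last, which is the big-city bias the change was meant to remove.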
Once we get our better search strategy in place, we should have a pretty workable, maybe even reliable, system.
We have several other improvements to make once the new search strategy is in place, and will post on those later.
BTW, still trying to work it out such that developers communicate directly through this group when developing. For now though it seems preferable for them to talk directly via email or in Campfire for group chat. If you want to join in on the development process, drop a note.
Help out with sickcity
By danharvey
Hi all,
For my MSc project this summer I'm going to be working on tracking influenza trends through blog posts. I'm at the School of Informatics at the University of Edinburgh and will be working on a large dataset of blog posts from the last few years. My course was in machine learning, so I'm well covered with statistics and data mining techniques.
I would quite like to help out with this project, as quite a lot of my work will probably also relate to what needs to be done for Twitter, or any other noisy user-produced text.
I've got a few ideas for disease tracking in general as well, so it would be great to discuss them here.
Let me know where you need help!
Dan