SickCity Needs Someone To Do Text Analysis

Hey all,

SickCity is really in need of some text analytics work (entity extraction, classification) to make it really seaworthy. The team working on it has gotten as far as it can by regular means of searching keywords, omitting bad words, etc. We need to step it up and do some professional-level term analysis.

Do you, or someone you know, know how to do this?

If so, let us know. Or just show up in the SickCity Dev Group and say hi.

14 May 18:18

The Calais module?

By reikiman

I've been looking at the Calais module and also attended a presentation at the Drupal SF users group a couple weeks ago. It interfaces with the Calais service run by Thomson-Reuters and does some interesting recognition of key stuff in an article. It also arranges for several taxonomy vocabularies to spring into existence and fills those vocabularies with related terms.

- David Herron, http://www.7gen.com

14 May 19:50

Re: The Calais module?

By John Geraci

Thanks David - I've heard about Calais but haven't ever looked closely at it.

Anyone have any hands-on experience with it?

14 May 19:52

Somewhere to start

By danharvey

Hi all,

I commented on the last post about SickCity saying that I'd like to help with this. I'm an AI student at Edinburgh, and my M.Sc. dissertation this summer is basically about figuring out how to do this kind of problem well.

I spoke to my supervisor about this a few months ago: things like entity extraction don't work well on tasks like this, as they're not very reliable yet, and simple word statistics like you've done here work best. Well, simple word features to start with, though the models might be complex...

This is a regression problem where we're trying to predict the level of flu from word-based features, so the best way to go about it is to get flu data from local health agencies and fit the regression function to that data. As there are so many features (words), picking the right ones with the right weights is the hard bit. Google Flu gets around this by using the flu data to find words that correlate well, so you don't need to pick the keywords or make a bad word list, as this will pick out the best ones to track for you!
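A minimal Python sketch of that idea, under stated assumptions: the weekly flu levels and word frequencies below are made up, the helper names (`pearson`, `top_correlated_words`, `fit_line`) are hypothetical, and the single-variable least-squares fit is a simplification for illustration, not a description of what Google Flu actually does.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def top_correlated_words(word_freqs, flu_levels, k):
    """Rank words by correlation with the flu level and keep the best k."""
    scored = [(pearson(freqs, flu_levels), word) for word, freqs in word_freqs.items()]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("cannot fit a line to a constant signal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x

# Made-up weekly numbers, for illustration only.
flu_levels = [1.0, 2.0, 4.0, 3.0, 1.5]
word_freqs = {
    "fever":   [0.010, 0.021, 0.039, 0.030, 0.016],
    "tissues": [0.004, 0.006, 0.011, 0.010, 0.005],
    "holiday": [0.020, 0.012, 0.005, 0.009, 0.018],
}

# Let the flu data pick the words, rather than choosing keywords by hand.
selected = top_correlated_words(word_freqs, flu_levels, k=2)

# Combine the selected words into one signal (mean frequency per week) and fit.
signal = [
    sum(word_freqs[word][week] for word in selected) / len(selected)
    for week in range(len(flu_levels))
]
slope, intercept = fit_line(signal, flu_levels)
estimate = slope * signal[-1] + intercept  # estimated flu level for the last week
```

With real data the word list would run to thousands of candidates, and the fit would need checking on weeks that were not used to pick the words.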

A good paper to read to get an idea is the Google Flu paper: http://www.nature.com/nature/journal/v457/n7232/pdf/nature07634.pdf. I've also summarised quite a few papers, with references to them, in my proposal here: http://danharvey.files.wordpress.com/2009/05/report.pdf

I've got two exams left now, and I'll be able to start helping more in a few weeks.

Dan

15 May 00:29

re: somewhere to start

By John Geraci

Hey Dan,

Yeah, saw your note from the other day and meant to respond but then I got swamped with other work and it fell off my plate. Glad you responded here.

So would a good place for us to start in on this be to get flu data for some cities to correlate words against? SickCity has had some good interest from some health services as well as from some research orgs - I could see if anyone would be willing to give us data to correlate against, if that would be useful.

"Google Flu gets around this by using the flu data to find words that correlate well, so you don't need to pick the keywords or make a bad word list, as this will pick out the best ones to track for you!"

Yes, one researcher I've talked with sounded a bit critical of Google's methodology in this regard - said the terms they picked that "correlated well" didn't necessarily bear any relation to sickness whatsoever, they just somehow correlated, so Google uses them.

Regardless, if that's what has been shown to work best, then let's give that a try. I agree that entity extraction could be problematic in this case, especially with tweets being so short.

Let me know if the city flu data would be useful to have, and I'll see if I can come by some.

15 May 14:10

Re: Re: The Calais module?

By PaulBaker

It's Thomson Reuters' implementation of fast search, a semantic search tool from a company in Michigan ($250,000 license).
http://www.opencalais.com/

It's being used by Media Cloud at the Berkman Center for a journalism project.
http://www.mediacloud.org/

Paul

15 May 15:26

Yes I agree that some of the

By danharvey

Yes, I agree that some of the terms the Google method uses might not be related, but coming from the data mining point of view you don't necessarily know what's correlated with the flu levels. It might be people talking about flu, or maybe the daytime TV they are watching as they're off work! So it might be good that the terms are not flu related, as you'll find something new. They might also be quite meaningless, but if a term helps to predict the level of flu reliably then it can't be too useless!

You could also try using that method to find good features and then pick out the flu-related ones? Things like this need trying, I guess.
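One rough way to try the "find good features, then pick out the flu-related ones" idea is sketched below in Python. The ranked term list and the vocabulary are invented for illustration, and `keep_flu_related` is a hypothetical helper, not part of SickCity.

```python
def keep_flu_related(ranked_terms, flu_vocabulary):
    """Keep, in rank order, the terms that contain a word from the vocabulary."""
    kept = []
    for term in ranked_terms:
        if set(term.lower().split()) & flu_vocabulary:
            kept.append(term)
    return kept

# Hypothetical output of a correlation ranking, best first.
ranked_terms = [
    "daytime tv",
    "flu symptoms",
    "high fever",
    "basketball scores",
    "cough medicine",
]

# Hand-picked vocabulary; an assumption, not a validated word list.
flu_vocabulary = {"flu", "fever", "cough"}

flu_terms = keep_flu_related(ranked_terms, flu_vocabulary)
```

The trade-off is the one discussed above: filtering this way drops terms like "daytime tv" that may predict well without being about illness.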

The data is very important, as you don't know how the message levels relate to flu levels on their own; you need the data to make the model predict the level of an illness rather than just the frequency of messages. You also really need lots of data to build, validate and test the model, as you can't use the same data for building and validating it. For my project I've got blog data for 2-3 years; I think they've started storing Twitter too, but only for the past 6 months or so. Maybe we can find someone who has an archive of Twitter data somewhere? Or we could use an unvalidated model whilst the data grows.
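As a small illustration of keeping the build, validate and test data separate, here is a Python sketch that splits time-ordered records chronologically. The 60/20/20 proportions and the function name `chronological_split` are assumptions made for the example.

```python
def chronological_split(records, train_frac=0.6, val_frac=0.2):
    """Split time-ordered records into train/validation/test without shuffling."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    n = len(records)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return records[:train_end], records[train_end:val_end], records[val_end:]

# Ten hypothetical weeks of data, oldest first.
weeks = list(range(1, 11))
train, val, test = chronological_split(weeks)
```

Splitting by time rather than at random means the model is always checked on weeks later than the ones it was built from, which is closer to how it would be used.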

What I thought might be handy to start with is trying to collect illness data from health labs all over the world in a uniform way, as currently it comes from different places in different formats! That would in itself be a useful service to look at and browse on the web. Then we would be able to use this data to correlate against user data from sources like Twitter, once we've got enough from there too.

We could also try to combine signals from Google Flu, Twitter and blogs to get the strengths of each, so Twitter may be more accurate in the short term, Google in the mid term, and blogs in the long term? Or something like that. I guess health agencies and other research groups would like this too, as it'll help them share data and compare research far more easily. Also, if we have web people who are better with UI and things than researchers are, it would make it a lot nicer to use for normal people.
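A rough Python sketch of what "a uniform way" and "combining signals" could look like. Everything here is hypothetical: the `IllnessRecord` fields, the two source formats (`from_lab_a`, `from_lab_b`), the sample rows and the blend weights are invented for illustration and do not describe any real health lab feed.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class IllnessRecord:
    """One uniform row, whatever the original source looked like."""
    source: str
    region: str
    week_start: date
    cases_per_100k: float

def from_lab_a(row):
    """Hypothetical source A: a dict with an ISO date string and a per-100k rate."""
    return IllnessRecord(
        source="lab_a",
        region=row["region"],
        week_start=date.fromisoformat(row["week_start"]),
        cases_per_100k=float(row["rate_per_100k"]),
    )

def from_lab_b(row):
    """Hypothetical source B: (city, 'DD/MM/YYYY', raw case count, population)."""
    city, day_month_year, cases, population = row
    day, month, year = (int(part) for part in day_month_year.split("/"))
    return IllnessRecord(
        source="lab_b",
        region=city,
        week_start=date(year, month, day),
        cases_per_100k=cases * 100_000 / population,
    )

def blend(signals, weights):
    """Weighted average of per-source estimates for the same region and week."""
    total = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total

records = [
    from_lab_a({"region": "Edinburgh", "week_start": "2009-05-11", "rate_per_100k": "12.5"}),
    from_lab_b(("New York", "11/05/2009", 840, 8_400_000)),
]

# Made-up estimates and weights; real weights would have to be fitted to data.
combined = blend(
    {"twitter": 14.0, "google": 10.0, "blogs": 12.0},
    {"twitter": 0.5, "google": 0.3, "blogs": 0.2},
)
```

Once every source is converted to the same record type, the same correlation and browsing code can run over all of them.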

I've just had an exam this morning and now have one left, next Wednesday! After that (and a short rest) I'll be able to actually start doing something.

15 May 16:57

What I thought might be handy

By John Geraci

"What I thought might be handy to start with is trying to collect illness data from health labs all over the world in a uniform way, as currently it comes from different places in different formats! That would in itself be a useful service to look at and browse on the web."

I was thinking the same thing - that getting official health data into SickCity, to co-mingle with data mined from social sites, would be very interesting and valuable. The more data points the better, IMO.

I still think some text analysis would be useful here. For example, while on Google people may search for daytime television when they are sick, people on Twitter are quite likely to just make declarative statements like "I'm sick" or "feel like I'm getting a fever" or even "not feeling well, getting into bed". In those cases it would be useful, it seems to me, to be fairly good at distinguishing tweets that mention fever but are really about having "spring fever" or "cabin fever" or "world cup fever" from tweets that are about actually having a fever of 103 degrees or whatever.
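A very simple rule-based Python sketch of that distinction, assuming a short hand-written list of figurative phrases and illness cues. The patterns and the function name `looks_like_illness` are illustrative only, and a real classifier would need far more than this.

```python
import re

# Figurative uses of "fever" to ignore; the list is a hand-picked assumption.
FIGURATIVE_FEVER = re.compile(
    r"\b(?:spring|cabin|world\s+cup)\s+fever\b", re.IGNORECASE
)

# Words and phrases treated as a sign of actual illness (also an assumption).
ILLNESS_CUES = re.compile(r"\b(?:fever|sick|not feeling well)\b", re.IGNORECASE)

def looks_like_illness(tweet):
    """True if an illness cue remains after figurative phrases are removed."""
    literal_text = FIGURATIVE_FEVER.sub(" ", tweet)
    return ILLNESS_CUES.search(literal_text) is not None

tweets = [
    "I've got world cup fever!",
    "Ugh, fever of 103 degrees, staying home",
    "not feeling well, getting into bed",
    "Cabin fever is setting in after all this rain",
]
flagged = [tweet for tweet in tweets if looks_like_illness(tweet)]
```

Rules like these will misfire on phrases such as "sick of winter", which is part of the argument for a hybrid of word statistics and text analysis.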

So maybe a hybrid approach is called for here?

I think getting the dataset beyond just Twitter would be a huge step forward, also.