Guest Blogging on O'Reilly Radar

Hi all -- I'm guest-blogging on O'Reilly Radar this month about DIYcity-related themes. The first post, 'The Future of Our Cities: Open, Crowdsourced, and Participatory', went live yesterday and seems to have been met with enthusiasm. Please check it out at:

http://radar.oreilly.com/2009/04/the-future-of-our-cities-open.html

I'll be posting once a week for the next month. If you want, you can follow via RSS here: http://radar.oreilly.com/jgeraci/

-john

"Twitchhiking" Tool For The Masses?

A friend just pointed me to http://www.twitchhiker.com/ (see twitter feed at http://twitter.com/twitchhiker)

She says "Interesting and fun, but wouldn't happen in large scale."

I'm not so sure though. I think something like this (minus the free tix from airlines for publicity) could become a common way to get around in the future, with the right tool in place, accentuating the right things for people.

What do others think?

Let's DIY Delft!

Hello,

This is my first post to DIY Delft, and I hope more will follow.

The reason I started this DIY is that I believe that through participation we can design how we want our cities to be: more human, greener, more sustainable? What do you think?

I am waiting for your participation ;)
Let's DIY Delft!

Approach for Getting Total # Tweets

yeah - i'm not sure that there's any elegant way to do it, just a brute force approach of querying every so often and incrementing a count. i think i'll just keep track of the count and not the content of each tweet. all in one db sounds good to me...

d

On Sun, Apr 5, 2009 at 4:48 PM, Paul Watson wrote:
> OK, I should be able to get total number of tweets / city / day (on UTC
> time) working within the next couple of days.

Good stuff. I was racking my brain trying to think of a reasonable way
of getting a tweet count for a city. Didn't think of anything.

Are you recording each tweet or just counts?

> Paul, from a db perspective, do you think it makes sense to keep this all in
> one database or have a different database for each city we're tracking
> (where each record would essentially be the total number of tweets for a
> given day)?

I'd keep it all in one database. Rails can work across multiple
databases, but it isn't very easy. It would also make running
SickCity.com with 300+ cities harder.

cheers,
Paul
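
For anyone following along, here is a minimal Python sketch of the brute-force idea discussed in this thread: poll every so often, and add what each pass saw to a running count for that city and UTC day, all in one table. It is only an illustration, not SickCity's actual code. The table name daily_tweet_counts, the helper names, and the use of SQLite are assumptions made for the example; how each polling pass obtains its number of new tweets is left out entirely.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Create the single shared table: one row per city per UTC day."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS daily_tweet_counts ("
        " city_id INTEGER NOT NULL,"
        " day TEXT NOT NULL,"
        " tweet_count INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (city_id, day))"
    )
    return db


def add_to_count(db, city_id, new_tweets, when=None):
    """Add the tweets seen in one polling pass to that day's running total."""
    when = when or datetime.now(timezone.utc)
    # Days are bucketed on UTC time, as proposed in the thread.
    day = when.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # Make sure the row exists, then increment it. Only counts are kept,
    # never the content of the tweets.
    db.execute(
        "INSERT OR IGNORE INTO daily_tweet_counts (city_id, day, tweet_count)"
        " VALUES (?, ?, 0)",
        (city_id, day),
    )
    db.execute(
        "UPDATE daily_tweet_counts SET tweet_count = tweet_count + ?"
        " WHERE city_id = ? AND day = ?",
        (new_tweets, city_id, day),
    )
    db.commit()
    return day


def get_count(db, city_id, day):
    """Return the stored total for a city and day, or 0 if nothing was recorded."""
    row = db.execute(
        "SELECT tweet_count FROM daily_tweet_counts WHERE city_id = ? AND day = ?",
        (city_id, day),
    ).fetchone()
    return row[0] if row else 0
```

Keeping every city in one table keyed by (city_id, day) means that adding a city adds rows, not databases, which is the point Paul makes above.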

New Group: SickCity Development

I just created a new group on the site, SickCity Development, where the discussion about making SickCity better will happen between the people working on it. Anyone who wants should join in and participate.

I did this for a couple of reasons. One, there's a very interesting conversation going on about improving SickCity right now among those developing it, but it's totally invisible to everyone except the developers. I wanted to make that conversation visible to all. Two, this is supposed to be a crowdsourced project, and if all of the conversations happen behind closed doors, there won't be much crowdsourcing going on beyond what we've already got.

So going forward we will try to have as much of our communication about development as possible go through this group, so that the process can be seen by all, and contributed to by any who care to get involved.

Check it out. To start it off I just posted 3 of the most recent emails sent regarding the improvement of SickCity.

emails on SickCity development [3]

LeChfeck
to Daniel Greenblatt
cc John Geraci
Paul Watson
date Thu, Apr 2, 2009 at 9:18 PM
subject Re: making SickCity more accurate

Hi,

I don't know that we absolutely need to know the total number of tweets in a city on a given day. What about if we just looked at historical mentions of keywords per day? So, for example, if we knew that an average monday in NYC had 15 mentions of flu, then 25 mentions would be high.


Or we could do it by looking at what the graph of a typical week's mentions looks like. Saturday is typically 45% below Friday, etc., then use that to derive a "truer" number so to speak.


Seems like we have enough data in our own system to be able to correct discrepancies. Of course I'm no statistician ;). Mathieu, your thoughts on this?

I think you're right. The distribution of the number of sick tweets over a week should be the same for all cities (when there is no real epidemic). It's certainly worth checking, but I don't see any a priori reason to think otherwise. So we should build a graph of the typical week using data from all cities, to make it as accurate as possible. Do we have enough data? We will never have enough data :-) but we will be able to put error bars on the graph.

That was the practical answer, ready to be implemented. But I don't think it's the right direction in the long term. I think we will hit a similar problem on special non-working days, or on a particularly rainy Sunday in NYC. I suggest that we look closely at the correlation between the total # of tweets and the total # of sick tweets. The results could then be presented this way: Today, in NYC, x% of the tweets were considered "sick", and that differs from our model by y.

The other insight was that detecting the initial surge is much more important than detecting peaks in an outbreak. (Google, btw, is accurate for peaks, inaccurate for initial surges). By the time there is a peak, everyone knows it. What you want is the canary in the coal mine that, again, tells health officials something is happening.

Statistics are widely (mis)used to predict peaks in finance. I know some people in that area, and I will talk to them.

do we personally know any hard-core twitter users? if we know them personally (i don't) and can ask them the last time they were sick, we can go through their twitter history and look at the kinds of things they tweeted while sick. this will, of course, change from person to person. but if we look at enough people i think we could start building some pretty effective regular expressions to determine 'sick' tweets...

As a complement, we could also use our database: pick some tweets at random and, via a simple and quick interface, decide whether they are really sick tweets or false positives. That way, we can build two lists of word frequencies which will hopefully be different enough to support a better decision algorithm. I'm currently reading some books on more advanced techniques in natural language statistics.

For the technical part, I'm still writing R functions to interact with the database.

Best,
Mathieu
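
To make the two-word-lists idea above a bit more concrete, here is a small Python sketch. It assumes someone has already hand-labelled a sample of tweets as really sick or false positive; the function names, the tokenizer, and the add-one smoothing are choices made for this example only, and the R functions mentioned above are not shown. It counts word frequencies in each group and ranks words by how much more often they appear in the sick group.

```python
import re
from collections import Counter


def word_counts(tweets):
    """Count lowercase word occurrences across a list of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def distinctive_words(sick_tweets, false_positives, min_count=2):
    """Rank words by how much more frequent they are in hand-labelled
    sick tweets than in hand-labelled false positives.

    Uses add-one smoothing so that a word missing from one list does
    not cause a division by zero. Words seen fewer than min_count
    times overall are ignored as too rare to judge.
    """
    sick = word_counts(sick_tweets)
    other = word_counts(false_positives)
    vocab = set(sick) | set(other)
    sick_total = sum(sick.values())
    other_total = sum(other.values())

    scores = {}
    for word in vocab:
        if sick[word] + other[word] < min_count:
            continue
        p_sick = (sick[word] + 1) / (sick_total + len(vocab))
        p_other = (other[word] + 1) / (other_total + len(vocab))
        scores[word] = p_sick / p_other
    # Highest ratio first: the words most typical of real sick tweets.
    return sorted(scores, key=scores.get, reverse=True)
```

With enough labelled tweets, the top of this ranking would be a candidate keyword list, and the bottom would show words such as slang uses of "sick" that mostly produce false positives.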

emails on SickCity development [2]

John Geraci
to Daniel Greenblatt
cc LeChfeck
Paul Watson
date Thu, Apr 2, 2009 at 11:07 AM
subject Re: making SickCity more accurate

I don't know that we absolutely need to know the total number of tweets in a city on a given day. What about if we just looked at historical mentions of keywords per day? So, for example, if we knew that an average monday in NYC had 15 mentions of flu, then 25 mentions would be high.

Or we could do it by looking at what the graph of a typical week's mentions looks like. Saturday is typically 45% below Friday, etc., then use that to derive a "truer" number so to speak.

Seems like we have enough data in our own system to be able to correct discrepancies. Of course I'm no statistician ;). Mathieu, your thoughts on this?

On another note, I came away with some interesting insights from the health researcher I talked with on Tuesday.

One insight was that city health agencies are less interested in a tool that can tell, say, flu outbreaks apart from smallpox outbreaks, and are more interested in just a good general first alert system that tells them something is happening. Part of this is driven by the fact that lots of very serious diseases will be mis-diagnosed by people as flu initially ("flu" could turn out to be flu, or bird flu, or cholera, or a dozen other things).

The other insight was that detecting the initial surge is much more important than detecting peaks in an outbreak. (Google, btw, is accurate for peaks, inaccurate for initial surges). By the time there is a peak, everyone knows it. What you want is the canary in the coal mine that, again, tells health officials something is happening.

Those two things, taken together, make me think we should hone our keyword list quite a bit. But I don't know what we should hone it to, exactly. Hoping to get input on that from the person I met with.
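
The weekday-baseline idea from the start of this email (an average Monday has 15 mentions, so 25 would be high) could be sketched roughly as follows in Python. This is an illustration under assumptions, not code from SickCity: the function names are made up, the history is assumed to be a simple mapping from calendar date to keyword mentions for one city, and a plain mean is used where a real model might prefer something more robust.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_baseline(history):
    """Average keyword mentions per weekday (0 = Monday ... 6 = Sunday).

    history maps a datetime.date to the number of mentions that day.
    """
    by_weekday = defaultdict(list)
    for day, mentions in history.items():
        by_weekday[day.weekday()].append(mentions)
    return {weekday: mean(values) for weekday, values in by_weekday.items()}


def ratio_to_baseline(history, day, mentions):
    """How today's mentions compare with a typical same weekday.

    A result of 1.0 means "normal for this weekday"; values well above
    1.0 suggest more mentions than usual. Raises KeyError if the
    history has no data for that weekday.
    """
    baseline = weekday_baseline(history)[day.weekday()]
    return mentions / baseline
```

Comparing each day only against the same weekday is what corrects for the regular weekly pattern, such as Saturdays running well below Fridays.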

emails on SickCity development [1]

Daniel Greenblatt
to John Geraci
cc LeChfeck
Paul Watson
date Wed, Apr 1, 2009 at 11:02 PM
subject Re: making SickCity more accurate

I agree that in order to get any kind of normalized data we need to be talking about 'sick' tweets as a percentage of all tweets (on a given day) and not an absolute number. So I will work (this weekend, I hope) on putting in some code to fetch the total number of tweets in a city on any given day. Some questions:

1) Is it okay if we just do this from the current time forward, or do we want the backdated information as well? i realize that ideally we want this info for all tracked cities for all days, but perhaps i'll start by getting the total count for current days onwards.

2) Paul - any idea on how to do this from a database perspective? anything more elegant than having a single table that tracks city_id, date and tweet_count? (would have a looooot of records, but very little data for each record).

Dan
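
As a rough illustration of the normalization described in this email, here is a short Python sketch that turns a day's sick-tweet count and total-tweet count into a percentage. The function names and the row layout (city_id, day, sick_count, total_count) are assumptions for the example; only city_id, date and tweet_count come from the email itself.

```python
def sick_share(sick_count, total_count):
    """Sick tweets as a percentage of all tweets, or None if no total exists."""
    if total_count <= 0:
        return None
    return 100.0 * sick_count / total_count


def daily_shares(rows):
    """Compute the sick percentage for each (city_id, day).

    rows is an iterable of (city_id, day, sick_count, total_count)
    tuples. Days without a usable total are skipped, not guessed.
    """
    shares = {}
    for city_id, day, sick_count, total_count in rows:
        pct = sick_share(sick_count, total_count)
        if pct is not None:
            shares[(city_id, day)] = pct
    return shares
```

For instance, 25 sick tweets out of 5000 and 15 out of 3000 both come to 0.5%, so a jump in the absolute number can simply reflect a busier day on Twitter.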

DIYcity Challenge #7 Now Online in Discussions

I've been too busy lately to even post a quick note to the site here, but DIYcity Challenge #7 is finally online in Discussions.

The challenge: help city agencies everywhere to open their data by building a site scraper and API for them.

See here for the full scoop.

DIYcity Challenge #7: Open Data!

Lots of city agencies all over the world have data online that is accessible to humans in readable format, yet isn't accessible to other computers and programs via an API. Some agencies don't have the means to turn their data into an API, others don't have the inclination to do so.

Can we help these agencies to open their data?

DIYcity Challenge #7: build a site scraper for the website of a city agency in your city that scrapes data, dumps it into a database, and offers that to everyone in API format.

Do not violate any copyrights for this challenge - please only scrape publicly accessible government data, not data from 3rd party sites.

DIYcity can help host any scraping bots, databases and APIs that come out of this challenge. Or just point us to a dataset you've scraped and we'll make a list in the wiki.
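
To give a feel for the three steps in the challenge (scrape, store, serve), here is a minimal Python sketch using only the standard library. Everything specific in it is hypothetical: it assumes the agency page publishes its data as a plain HTML table with a header row, and the table name records and the function names are made up for the example. Fetching the page and putting the JSON behind a web server are left out.

```python
import json
import sqlite3
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_to_db(html, db):
    """Parse an HTML table and store each data row as a JSON record."""
    parser = TableScraper()
    parser.feed(html)
    if not parser.rows:
        return 0
    # Assumption: the first row holds the column names.
    header, *records = parser.rows
    db.execute("CREATE TABLE IF NOT EXISTS records (data TEXT NOT NULL)")
    for record in records:
        db.execute(
            "INSERT INTO records (data) VALUES (?)",
            (json.dumps(dict(zip(header, record))),),
        )
    db.commit()
    return len(records)


def api_response(db):
    """Return all stored records as one JSON array, as an API endpoint might."""
    rows = db.execute("SELECT data FROM records ORDER BY rowid").fetchall()
    return json.dumps([json.loads(row[0]) for row in rows])
```

A real scraper would need to be written against the specific agency page, and should be re-run on a schedule so the database stays current.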

Special thanks to dpk for suggesting this as a challenge.
