Lots of city agencies all over the world have data online that is accessible to humans in readable format, yet isn't accessible to other computers and programs via an API. Some agencies don't have the means to turn their data into an API, others don't have the inclination to do so.
Can we help these agencies to open their data?
DIYcity Challenge #7: build a site scraper for the website of a city agency in your city that scrapes data, dumps it into a database, and offers that to everyone in API format.
Do not violate any copyrights for this challenge - please only scrape publicly accessible government data, not data from 3rd party sites.
DIYcity can help host any scraping bots, databases and APIs that come out of this challenge. Or just point us to a dataset you've scraped and we'll make a list in the wiki.
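For anyone wondering what the scrape, store, and serve steps of the challenge might look like end to end, here is a minimal Python sketch using only the standard library. The sample HTML, the `stops` table, and the function names are all hypothetical; a real scraper would fetch the agency's page over HTTP and would need to match that page's actual markup.

```python
import json
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample of an agency page. A real scraper would fetch the
# page instead, e.g. with urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<table>
  <tr><th>Route</th><th>Stop</th><th>Time</th></tr>
  <tr><td>A</td><td>Main St</td><td>08:05</td></tr>
  <tr><td>A</td><td>Oak Ave</td><td>08:12</td></tr>
  <tr><td>B</td><td>Main St</td><td>08:30</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape(html):
    """Turn the first row into lowercase keys and the rest into records."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    keys = [h.lower() for h in header]
    return [dict(zip(keys, row)) for row in body]


def store(records, conn):
    """Dump the scraped records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stops (route TEXT, stop TEXT, time TEXT)"
    )
    conn.executemany("INSERT INTO stops VALUES (:route, :stop, :time)", records)
    conn.commit()


def api_response(conn, route):
    """Return the JSON body an API endpoint could serve for one route."""
    cur = conn.execute(
        "SELECT route, stop, time FROM stops WHERE route = ? ORDER BY time",
        (route,),
    )
    cols = [c[0] for c in cur.description]
    return json.dumps([dict(zip(cols, r)) for r in cur.fetchall()])


conn = sqlite3.connect(":memory:")
store(scrape(SAMPLE_HTML), conn)
print(api_response(conn, "A"))
```

Wrapping `api_response` in any small web framework would turn this into a queryable API; that part is left out to keep the sketch short.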
Special thanks to dpk for suggesting this as a challenge.
MTA Budget Data
By nickyg
TOPP recently did a small project along these lines -- taking the MTA budget data, previously only available in PDF form, and scraping it and turning it into HTML and CSV. We have not built a programmatic API for it and probably won't, but it's a start.
http://data.topplabs.org/data/mtabudget.html
Nick
Re: MTA Budget Data
By John Geraci
That looks great, Nick. Has anyone done anything similar with the MTA's subway schedule data? Or any other transit agency's scheduling data?
No, we haven't done anything
By nickyg
No, we haven't done anything with scheduling data for MTA, but would be interested, potentially.
so simple it hurts
By Anthony Townsend
from @seangorman: what if you just crowdsourced the schedule. Let people submit the times their train comes - whole schedule in no time
Learning Opportunity for Municipal Governments
By dpk
Thanks for posting this challenge, and thanks to everyone working on it. This project will be useful in many ways and ought to have a positive educating effect.
I have a digressive but related question about the muni open data problem that's provoked by reading an RFP for a sizable suburban city today. This RFP amounts to a complete redo of their entire web presence and appears to have been written by people with little to no technical knowledge or even a general grasp of any relevant subjects. There is no mention of, say, XML or data format standards at all; they basically want to create web-based interfaces for citizen-governmental interaction online and that's it. It also sounds like they really need to rethink and plan their entire data infrastructure and question the assumption that it's OK for them to have no real IT staff but to farm out all their ongoing development needs to private companies.
While it's probably quite typical, this RFP seems pretty disconnected from reality on a lot of points and could not have been produced by anyone with even a passing familiarity with what larger, nearby cities have done or not done well with IT and the web.
My concern is that our root problem with the "open data" quest may be municipalities that don’t understand their own data infrastructure and don’t focus on key issues, like the platforms, formats, and standards that will shape (allow or inhibit) future growth when they need to integrate and communicate data from one area to another.
It is disturbing enough to see regressive and inefficient pseudo-planning done by a public, tax-funded entity. But it is more disturbing to see movement toward private sector control of public, governmental data systems in an era where we may have less and less watchdog activity on behalf of the public. And it's not as if the press is exactly known for understanding IT and its impact on our changing public sphere...
So I want to ask, is anyone, anywhere doing public "activism" of some sort to try to raise the bar on what cities and the public know about and want/expect from information systems that determine who has access, and to what?
Feel free to email me or discuss these topics here: http://diycity.org/local-group/milwaukee-open-government-data-work-group
crowdsourcing schedules
By John Geraci
Cool idea. Seems like the way to do it would be to let people do it on their own schedule, rather than trying to mobilize people to do it all at the same time.
So then we'd need some sort of place online where people could text their own schedule info to, yes?
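To make the crowdsourcing idea a bit more concrete, here is a rough Python sketch of how submitted arrival times could be boiled down into an estimated schedule. It assumes each report is a (stop, "HH:MM") pair, and it treats reports at the same stop that fall within a few minutes of each other as sightings of the same train. The function names, the report format, and the 5-minute gap are all assumptions made for illustration, not anything proposed in this thread.

```python
from collections import defaultdict
from statistics import median


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes):
    """Convert minutes after midnight back to 'HH:MM'."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def estimate_schedule(reports, gap=5):
    """Group reports per stop into clusters and take each cluster's median.

    Two consecutive reports at a stop belong to the same cluster when they
    are at most `gap` minutes apart (an assumed threshold).
    """
    by_stop = defaultdict(list)
    for stop, hhmm in reports:
        by_stop[stop].append(to_minutes(hhmm))

    schedule = {}
    for stop, times in by_stop.items():
        times.sort()
        clusters = [[times[0]]]
        for t in times[1:]:
            if t - clusters[-1][-1] <= gap:
                clusters[-1].append(t)
            else:
                clusters.append([t])
        schedule[stop] = [to_hhmm(round(median(c))) for c in clusters]
    return schedule


# Hypothetical reports, e.g. parsed from text messages.
REPORTS = [
    ("Main St", "08:04"),
    ("Main St", "08:06"),
    ("Main St", "08:05"),
    ("Main St", "08:21"),
    ("Main St", "08:19"),
]
print(estimate_schedule(REPORTS))
```

Because reports are grouped by time of day only, people can submit whenever they happen to ride; more reports simply tighten each cluster.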
NY City Council scraper
By nicholasbs
A few of us at the open government sprint at PyCon this week started Purple Voter, a web app/eventual API that combines a number of existing APIs to make it easy to look up your federal, state, and local representatives, as well as candidates for those positions. It's far from finished, but in the process of working on it I wrote a quick script to scrape the NY City Council site to get info (name, party, district, homepage) for all the council members. See: http://bitbucket.org/nicholasbs/purplevoter/src/tip/scrapers/ny_city_cou...
Email me at my username on this blog @openplans.org if you have any interest in the project.
re: NY City Council scraper
By John Geraci
That's very cool, Nicholas.
I'm going to stick all of the examples people send in for this challenge together on a page in the wiki for central reference. If anyone else has a data scraper they've built that they want to send in, please do so and I'll add it to that list.
And if anyone wants to create a scraper, go for it.
I'm particularly interested in data that people could use during their day, like schedule data etc. That's just me though.
crowdsourcing schedules
By John Geraci
> from @seangorman what if you just crowdsourced the schedule.
> Let people submit the times their train comes - whole schedule
> in no time
Anthony - how would you set this up? Would you do it on the MTA's schedule info?
Arrest database
By eclisham
It's not schedules, but something like this? http://mugshots.tampabay.com/.
Finding data to scrape
By Dan Lyke
I've got a half-finished script that grabs building permit data off of the city of Petaluma's Accela "Citizen Access" application, geocodes it, and dumps it into a GeoRSS feed.
It's then pretty easy to map water heater replacements across the city. This by itself probably isn't terribly useful; the real hope is that if we can start filtering a bit by type and pull out the real development projects, it might become so. I probably won't pursue this too much further until I find at least one other person who's interested in it, but a linked demo is worth a thousand words: Petaluma building permits GeoRSS feed displayed on Google Maps.
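For readers unfamiliar with GeoRSS, the last step (turning already-geocoded permits into a feed a map can display) can be sketched roughly as follows. This is Python rather than the Perl used for the actual script, the permit records and coordinates are made up, and the output is a bare-bones Atom feed with GeoRSS-Simple `georss:point` elements rather than a complete, validated feed.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
GEORSS = "http://www.georss.org/georss"
ET.register_namespace("atom", ATOM)
ET.register_namespace("georss", GEORSS)

# Hypothetical, already-geocoded permit records (not real permit data).
PERMITS = [
    {"id": "P-001", "summary": "Water heater replacement",
     "lat": 38.2324, "lon": -122.6367},
    {"id": "P-002", "summary": "New commercial building",
     "lat": 38.2411, "lon": -122.6289},
]


def permits_to_georss(permits, title="Building permits"):
    """Build an Atom feed where each entry carries a georss:point."""
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = title
    for permit in permits:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = permit["id"]
        ET.SubElement(entry, f"{{{ATOM}}}title").text = permit["summary"]
        # GeoRSS-Simple encodes a point as "latitude longitude".
        point = ET.SubElement(entry, f"{{{GEORSS}}}point")
        point.text = f'{permit["lat"]} {permit["lon"]}'
    return ET.tostring(feed, encoding="unicode")


def filter_by_keyword(permits, keyword):
    """Keep only permits whose summary mentions the keyword."""
    return [p for p in permits if keyword.lower() in p["summary"].lower()]


print(permits_to_georss(filter_by_keyword(PERMITS, "commercial")))
```

The `filter_by_keyword` helper is a stand-in for the "filtering a bit by type" idea; real permit types would depend on what fields the source application exposes.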
If anyone else is playing with scraping an Accela Citizen Access site, I'm happy to share Perl and my process for reverse-engineering how their JavaScript is submitting forms.
I've also started playing with scraping some information off of the meeting minutes, voting and streaming meetings data stored in Granicus, but haven't figured out too much useful to do with that yet.
In talking with the city information services/technologies folks here, it seems like they've been doing a lot to publish human interfaces to underlying databases, and are coming around to understanding the value of publishing raw data. The problems are three-fold:
1. Figuring out which portions of the data can/should be published.
2. Making sure that computers and computer data are actually enough of the workflow of a given process that they can get that data out there. For instance, they're working on a road closure database that all city departments would use, but until that database does everything that all of those departments need, it's a matter of double-entry (sometimes where the primary system is paper based), and if the computer version is for as-yet unimplemented features then that data is going to get dropped.
There are also apparently undocumented discussions that happen with various departments before projects get officially proposed. It'd be nice to learn about those projects early, but without an official workflow that keeps records of those sorts of contacts, it's difficult.
3. Making sure that people don't confuse derivative applications with city services. We apparently have at least one site created by rabble-rousing citizens that claims to keep track of road conditions, and the city computer services folks apparently end up fielding calls and emails about inaccuracies in that site's data.
Anyway, it seems like some of the information we're really interested in probably doesn't exist in structured form yet, and the real challenge here will be figuring out where to reach into the operations of the city government and tweak things so that the data is getting processed in some form that we can do other stuff with it. I think figuring out where we have real application for that data, where we get the best return for convincing people to change their processes, is the hard part.
A chicken and egg problem
By Dan Lyke
I recently got together with some of the folks who run our various information services, and learned a couple of things:
1. They'd much rather publish in PDF than in formats that better preserve the original content because PDF is harder to edit. It's not that they're trying to keep things secret; it's that in the past when they've published Word docs directly they've had people make edits and then bring those edited documents back as if they were the originals, causing all sorts of versionitis issues and additional work in checking things back to the master content.
2. They work really hard to keep the web site up and updated because they believe in it, but they have trouble justifying that work in the budget because the usage numbers are relatively low. In a town of 60,000, 600-800 people use the email advisory system, which is a really hard number for anyone pushing budget issues to make a case for.
3. Because they're trying to drive use, they don't see the value in publishing numbers; they want to present finished apps. They've got really cool online GIS stuff, but I haven't extracted those files as KML or something I can use myself yet.
4. Often the data isn't there in the system to be extracted because they can't get the various departments to use the apps that would let them extract data. A specific example is a road closure tracking system that might also let them do things like plan the re-paving after the work on the water lines; progress is being made, but there was obvious frustration.
5. Publishing data is a chicken and egg problem: It takes time to go through schemas of various apps they've purchased and pieced together and figure out what to extract and publish. If I can come up with an application that I'll write that I'm pretty sure will get users, I'm confident I can get someone to sneak in the hour or three to extract the data I'll need, but I have to sell them on that.
I offer these up as reasons you might not be seeing what you'd hope to see. Might be worthwhile to find some of the folks in the appropriate departments and buy 'em lunch, get some discussions going, see what their needs are in influencing these RFPs, then you can find the right way to help them pitch the things that'll help you build the tools we want.
Anyone else mining Granicus?
By Dan Lyke
I'm looking through our (Petaluma, California) City Council minutes as maintained by Granicus, and am thinking that it'd be cool to structure this database for some queries about council member attendance and voting records and such.
The city IT folks have it on their plate to ask Granicus if there's a better way to get this stuff out aside from scraping the HTML, but if that's not an option, is anyone else out there scraping HTML, or interested in the results of that?
I have a basic run-through that's grabbing (some) council attendance information (doesn't yet catch the people who couldn't make the early session but arrived for the later one) and motions, with voting information.
I'm thinking things like searching resolutions for keywords and then being able to see voting records based on that, looking at city council attendance, that sort of thing. Perhaps even just all the resolutions that a council member has voted against, or all the resolutions that failed. I realize that this kind of data mining is fraught with all sorts of potential for screw-ups if my scraper misses stuff (i.e., the special case for a council member joining the meeting late), but as long as I'm conscious of that I think there's useful insight to be gleaned.
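As a sketch of what those queries could look like once the minutes are scraped into a database, here is a small Python example using SQLite. The two-table schema, the member names, and the resolutions are entirely hypothetical; nothing here reflects how Granicus actually structures its data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resolutions (id INTEGER PRIMARY KEY, title TEXT, passed INTEGER);
CREATE TABLE votes (resolution_id INTEGER, member TEXT, vote TEXT);
""")

# Hypothetical scraped data.
conn.executemany("INSERT INTO resolutions VALUES (?, ?, ?)", [
    (1, "Repave Main Street", 1),
    (2, "Water line replacement contract", 0),
])
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)", [
    (1, "Member A", "aye"), (1, "Member B", "no"),
    (2, "Member A", "no"), (2, "Member B", "no"),
])


def voted_against(conn, member):
    """Titles of all resolutions a given member voted against."""
    rows = conn.execute(
        "SELECT r.title FROM votes v "
        "JOIN resolutions r ON r.id = v.resolution_id "
        "WHERE v.member = ? AND v.vote = 'no' ORDER BY r.id",
        (member,),
    )
    return [title for (title,) in rows]


def votes_on_keyword(conn, keyword):
    """Every recorded vote on resolutions whose title contains the keyword."""
    rows = conn.execute(
        "SELECT r.title, v.member, v.vote FROM votes v "
        "JOIN resolutions r ON r.id = v.resolution_id "
        "WHERE r.title LIKE ? ORDER BY r.id, v.member",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


def failed_resolutions(conn):
    """Titles of all resolutions that did not pass."""
    rows = conn.execute(
        "SELECT title FROM resolutions WHERE passed = 0 ORDER BY id"
    )
    return [title for (title,) in rows]


print(voted_against(conn, "Member B"))
print(votes_on_keyword(conn, "water"))
print(failed_resolutions(conn))
```

Attendance could be handled the same way with a third table of (meeting, member, session) rows, which would also give the late-arrival case a place to live.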
So: Anyone else mining Granicus? Anyone else think of interesting ways they'd like to slice and dice city council participation?