Tuesday, May 17, 2011

Search Costs for "Good Data"

I hadn't heard of the company Zanran until I read Donald Marron's blog post in the CS Monitor.  Until now, when I have been searching for some piece of data, I have gone to Google's Advanced Search and restricted the file type to ".xls".  I knew that this was an inefficient way to search.  I am hoping that Zanran makes it easier.
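For readers who haven't tried this workaround: the same restriction can be typed directly into the ordinary Google search box with the "filetype:" operator.  A hypothetical query (the keywords are just an example) looks like:

```
filetype:xls vehicle registrations Beijing
```

This returns only spreadsheet files, which is the crude version of what a dedicated data search engine would do natively.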

Proprietary data issues will arise.  I had hoped to write a paper about Beijing's new vehicle stock; in particular, I was interested in what types of vehicles were being registered within the city by calendar year.  In California, it is relatively easy to acquire such data: you can purchase data on Smog Check outcomes, or you can pay more to a company such as R.L. Polk for the exact count of vehicle registrations in a given geographic area such as the 90024 zip code or Los Angeles County. In Beijing, the price of accessing such data would have been huge.   My vision for Zanran is that it will be an "open source" means of cobbling data sets together.

In an open source, "hippie" sense -- the nascent effort of major social science journals to engage in replication could play out in Zanran, as the website would figure out which data sets are where on the web.  So, can Zanran identify the contents of a Stata file?  Or, based on the text file documentation, know that a relevant Stata data set is posted at the American Economic Review's website? Ideally, young economists are standing on the shoulders of giants and aren't merely downloading other people's data but are figuring out how to combine such "old data" with their own original data collection efforts.  This is progress. I remember back in 1990 going to the Chicago EPA library and copying individual monitoring station pollution readings into a notebook. From the perspective of today, that was a big waste of time, but I was trying to build a pollution database.
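Reading the contents of a Stata file programmatically is not hard, which makes the indexing question above plausible.  Here is a minimal sketch in Python using pandas (the file name and variables are hypothetical; a real crawler would of course work on files it found on the web rather than one it created):

```python
# Sketch: how a data search engine might peek inside a Stata (.dta)
# file to index its variable names. Uses pandas; the data here are
# made up purely for illustration.
import pandas as pd

# Create a small example .dta file so the sketch is self-contained.
pd.DataFrame(
    {"year": [2008, 2009], "registrations": [120, 135]}
).to_stata("vehicles.dta", write_index=False)

# Open the file as an iterator so we can inspect it without loading
# the whole data set, then pull variable names from the first row.
reader = pd.read_stata("vehicles.dta", iterator=True)
first_row = reader.read(1)
print(list(first_row.columns))       # variable names in the file
print(reader.variable_labels())      # variable labels, if any
reader.close()
```

A crawler that ran something like this over posted replication files could match variable names against a paper's text-file documentation, which is exactly the linkage I am hoping Zanran attempts.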