The Benefits of Open Data (part II) – Impact on Economic Research
A couple of weeks ago, I wrote the first part of a three-part series on Open Data in Economics. Drawing on examples from top research on how providing information and data can help increase the quality of public service provision, that article explored economic research on open data. In this second part, I would like to explore the impact of openness on economic research.
We live in a data-driven age
There used to be a time when data was costly: There was not much data around. Comparable GDP data, for example, has only been collected since the early to mid-20th century. Computing power was expensive, too: Data and commands were stored on punch cards, and researchers had only limited hours to run their statistical analyses on the few computers available.
Today, however, statistics and econometric analysis have arrived in every office: Open Data initiatives at the World Bank and governments have made it possible to download cross-country GDP and related data with a few mouse clicks. The availability of open source statistical packages such as R allows virtually everyone to run quantitative analyses on their own laptops and computers. Consequently, the number of empirical papers has increased substantially. The left figure (taken from Espinosa et al. 2012) plots the number of econometric (statistical) outputs per article in a given year: Quantitative research has really taken off since the 1960s. Where researchers once used datasets with a few dozen observations, modern applied econometricians now often draw upon datasets boasting millions of detailed micro-level observations.
Why we need open data and access
The main economic argument in favour of open data is gains from trade. These gains come in several dimensions: First, open data helps avoid redundancy. As a researcher, you may know that the same basic procedures (such as cleaning and merging datasets) have often been done thousands of times, by hundreds of different researchers. You may also have experienced the time wasted compiling a dataset that someone else had already put together but was unwilling to share: Open data in these cases can save a lot of time, allowing you to build upon the work of others. By feeding your additions back into the ecosystem, you in turn ensure that others can build on your data work. Just as there is no need to reinvent the wheel several times, the sharing of data allows researchers to build on existing data work and devote valuable time to genuinely new research.
Second, open data ensures the most efficient allocation of scarce resources – in this case datasets. Again, as a researcher, you may know that academics often treat their datasets as private gold mines. Indeed, entire research careers are often built on possessing a unique dataset. This hoarding often results in valuable data lying around on a forgotten hard disk, not fully used and ultimately wasted. What’s worse, the researcher who owns a unique dataset may not be the person best equipped to make full use of it, while someone else may possess the necessary skills but not the data. Only recently, I had the opportunity to talk to a group of renowned economists who – over the past decades – have compiled an incredibly rich dataset. During the conversation, it was mentioned that they themselves may have exploited only 10% of the data – and were urgently looking for fresh PhDs and talented researchers to unlock the full potential of their data. But when data is open, there is no need to search, and the data can find its way to the most skilled researcher.
Finally, and perhaps most importantly, open data – by increasing transparency – also fosters scientific rigour: When datasets and statistical procedures are made available to everyone, a curious undergraduate student may be able to replicate and possibly refute the results of a senior researcher. Indeed, journals are increasingly asking researchers to publish their datasets along with the paper. But while this is a great step forward, most journals still keep the actual publication closed, asking for horrendous subscription fees. For example, readers of my first post may have noticed that many of the research articles linked could not be downloaded without a subscription or university affiliation. Since dissemination, replication and falsification are key features of science, both open data and open access are essential to knowledge generation.
But there are of course challenges ahead: For example, while wider access to data and statistical tools is a good thing, the ease of running regressions with a few mouse clicks also results in a lot of mindless data mining and nonsensical econometric output. Quality control, hence, is and remains important. There are also barriers to data sharing, and in some cases there should be. Some researchers have invested a substantial part of their lives in constructing their datasets, so it is understandable that they are uncomfortable sharing their “baby” with just anyone. In addition, releasing (even anonymized) micro-level data often raises privacy concerns. These issues – and existing solutions – will be discussed in the next post.