On June 11-12, the Open Economics Working Group of the Open Knowledge Foundation organised the Second Open Economics International Workshop, hosted at the MIT Sloan School of Management. It was the second of two international workshops funded by the Alfred P. Sloan Foundation and brought together economists and senior academics, funders, data publishers and data curators to discuss the progress made in open data for economics and the challenges that remain. This post is an extended summary of the speakers’ input and some of the discussion. See the workshop page for more details.
Setting the Scene
The first panel addressed the current state of open data in economics research and some of the “not bad” practices in the area. Chaired by Rufus Pollock (Open Knowledge Foundation), the panel brought together senior academics and professionals from economics, science, technology and information science.
Eric von Hippel (MIT Sloan School of Management) talked about open consumer-developed innovations. Consumers innovate a lot to solve their own needs as private users, and while they are generally willing to let others adopt their innovations for free, they don’t actively invest in knowledge diffusion. As producers of findings, economists have strong incentives to diffuse them, but as users of private research methods and data they have low or even negative incentives to diffuse these to rivals. Lowering the costs of diffusion, increasing its benefits, making research processes more collaborative and mandating sharing are some of the ways to increase economists’ incentives to diffuse research methods and data as they diffuse findings. [See slides]
Micah Altman (MIT Libraries, Brookings Institution) stressed that best practices are often not “best” and rarely practised, and so preferred to discuss some probably “not bad” practices. These include policy practices for the dissemination and citation of data, e.g. that data citations should be treated as first-class objects of publication, as well as reproducibility policies under which more support should be given to publishing replications and registering studies. He emphasised that policies are often not self-enforcing or self-sustaining, and that compliance with data availability policies is very low even in some of the best journals. [See slides]
Shaida Badiee (Development Data Group, World Bank) shared the experience of setting the World Bank’s data free in 2010 and the exceptional popularity and impact that data has had since. To achieve better access, the data is legally open, with no discrimination between types of use, and comes with appropriate user support. It is available in multiple languages and on multiple platforms and devices, e.g. with API access, plug-ins for regression software, and integration with external applications and mobile phones. She reminded the audience that data is only as good as the capacity of the countries which produce it, and that working closely with countries to improve their statistical capacities is necessary for the continuous improvement of data. The World Bank works in partnership with the global open data community and provides support to countries that are willing to launch their own open data initiatives. [See slides]
Philip E. Bourne (UCSD) shared some thoughts from the biomedical sciences and indicated that while there are some success stories, many challenges still need to be addressed, e.g. the lack of reproducibility and the unsolved problem of sustainability. He highlighted that change is driven by the community and that the community should feel it owns this culture, which includes transparency and shared ownership, a reward system for individuals and teams, strategic policies on open access and data sharing plans and, critically, the notion of “trust” in the data, which is crucial to the open data initiative. Funders and institutions may not initiate change but they eventually follow suit: the structural biology community created successful data sharing plans before funders did. He emphasised that it is all about openness: no restrictions on the usage of the data beyond attribution, running on open source software and transparency about data usage. [See slides]
Knowledge Sharing in Economics
The second panel, chaired by Eric von Hippel (MIT Sloan School of Management), looked more closely at the discipline of economics: what technological and cultural challenges still exist, and what roles and initiatives are possible. [See audio page]
Joshua Gans (University of Toronto) analysed some of the motives for knowledge contribution, e.g. money, awards and recognition, ownership and control, and intrinsic motivation, and addressed design and technology problems, which could be as important as social norms. He talked about designing for contribution and the importance of managing guilt: because of the concern that data should be accurate and almost perfect, less data is contributed, so a well-designed system should make it possible to contribute imperfect pieces (as Wikipedia and open-source projects do by breaking down contributions). Ideally this should be combined with an element of usefulness for the contributors, so that they get something out of it. He called for an easy way of sharing, without the hassle and all the questions which come from data users, since there are “low hanging fruit” datasets that can be shared. [See slides]
Gert Wagner (German Institute for Economic Research DIW) spoke in his capacity as Chairman of the German Data Forum, an organisation which promotes the production, re-use and re-analysis of data. He pointed out that there is no culture of data sharing in economics: “no credit is given where credit is due”, and incentives for sharing economics data should be promoted. So far only funding organisations can enforce data sharing by data producers, and this happens only at the institutional level. For individual authors there is little incentive to share data. To change this culture, he suggested educating graduate students and early career professionals. The German Socio-Economic Panel Study, a panel study of private households in Germany, has been applying Schumpeter’s principle that producers who innovate must educate the consumers if necessary. Alongside the workshops which teach new users technical skills, users will also be taught to cite the data and give credit where credit is due. [See slides]
Daniel Feenberg (National Bureau of Economic Research) gave a brief introduction to the NBER, which publishes about a thousand working papers a year, more than a third of which are empirical economics papers on the United States. Authors have the option to upload data resources in a “data appendix”, which is put on the website and made available for free. Very few authors, however, take advantage of this option, and they are aware that they will get questions if they make their data available. He mentioned that data sharing is something only employers and funders can mandate, and that the role of the publisher is limited. Besides the issues of knowledge sharing design and incentives for individual researchers, there is also the issue of governments sharing data. Confidentiality is a big concern there, but politically motivated unscientific research may also inform policy, in which case more access and more research is better than less.
John Rust (Georgetown University) indicated that the incentives for researchers might not be the biggest problem: there is an inherent conflict between openness and confidentiality, and a lot of economics research uses data that cannot be made publicly available. While companies and organisations are often sceptical, risk-averse and unaware of the benefits of sharing their operations data with researchers, they could save money and increase profits through research insights, especially in the field of optimising rental and replacement decisions (see e.g. the seminal paper by Rust 1987). Appealing to the self-interest of firms and showing success stories where collaborative research has worked can convince firms to share more data. The process of establishing trust and getting data could be aided by trusted intermediaries who can house and police confidential data and have the expertise to work with information protected by non-disclosure agreements.
Sharing Research Data
The panel session “Sharing research data – creating incentives and scholarly structures” was chaired by Thomas Burke (European University Institute Library) and dealt with the different incentives and opportunities researchers have for sharing their data, such as storing it in a curated repository like the ICPSR or a self-service repository like DataVerse. To be citable, a dataset should obtain a DOI, a service which DataCite provides, and it can also be published with a data paper in a peer-reviewed data journal. [See audio page]
Amy Pienta (The Inter-university Consortium for Political and Social Research – ICPSR) presented some context about the ICPSR, the oldest archive for social science data in the United States, which has been supporting data archiving and dissemination for over 50 years. Among the incentives for researchers to share data, she mentioned funding agencies’ requirements to make data available, scientific openness and stimulating new research. ICPSR has been promoting data citations, helping more journal editors to understand them and, when archiving data, also capturing how the data is being used, by which users and institutions. The ICPSR is also currently developing open access data as a new product, where researchers will be able to publish their original data, tied to a data citation and DOI, with data downloads and usage statistics, and layered with levels of curation services. See slides.
Mercè Crosas (Institute for Quantitative Social Science, Harvard University) presented the background of the DataVerse network, a free and open-source service and software to publish, share and reference research data. Originally open only to social scientists, it now welcomes contributions from all universities and disciplines. It is a completely self-curated platform where authors can upload data and supporting documentation and add metadata to make the resource more discoverable. It builds on the incentives for data sharing by giving a persistent identifier, automatically generating a data citation (using the format suggested by Altman and King 2007), providing usage statistics and giving attribution to the contributing authors. Currently DataVerse is implementing closer integration with journals using OJS, where the data resources of an approved paper will be directly deposited online. She also mentioned the Amsterdam Manifesto on Data Citation Principles, which encourages different stakeholders – publishers, institutions, funders, researchers – to recognise the importance of data citations. See slides.
Joan Starr (DataCite, California Digital Library) talked about DataCite, an international organisation set up in 2009 to help researchers find, re-use and cite data. She mentioned some of the most important motivations for researchers to share and cite data, e.g. exposure and credit for the work of researchers and curators, scientific transparency and accountability for authors and data stewards, citation tracking and understanding the impact of one’s work, verification of results, and re-use for producing new research (see more at ESIP – Earth Science Information Partners). The basic service that DataCite provides is DOIs for data (see the list of international partners who can support you in your area). Other services include usage statistics and reports, content negotiation, a citation formatter and a metadata search, where one can see what kind of data is being registered in a particular field. Recently DataCite has also entered a partnership with ORCID to have all research outputs (including data) on researchers’ profiles. See slides.
Brian Hole (Ubiquity Press) talked about data journals and about encouraging data sharing and improving data citations through the publication of data and methodology in data papers. He emphasised that while in the early days of scientific publishing it was enough to share the research findings, today the data, software and methodology should be shared as well in order to enable replication and validation of the research results. Amongst the benefits of making research data available he mentioned the collective benefits for the research community, the long-term preservation of research outputs, enabling new and more research to be done more efficiently, re-use of the data in teaching, ensuring public trust in science, access to publicly-funded research outputs, and opportunities for citizen science. Publishing a data paper, where the data is stored in a repository with a DOI and linked to a short paper which describes the methodology of creating the dataset, could be a way to incentivise individual researchers to share their data, as it builds up their career record of publications. An additional benefit of data journals is having a metadata platform where data from different (sub-)disciplines can be collected and mashed up to produce new research. See slides.
The Evolving Evidence Base of Social Science
The purpose of the panel on the evolving evidence base of social science, chaired by Benjamin Mako Hill (MIT Sloan School of Management / MIT Media Lab), was to showcase examples of collecting more and better data and making more informed policy decisions based on a larger volume of evidence. See audio page.
Michael McDonald (George Mason University) presented some updates on the Public Mapping Project, which involves an open source online redistricting application that optimises redistricting according to selected criteria and allows for public participation in decision-making. Most recently there was a partnership with Mexico’s Instituto Federal Electoral (IFE), using redistricting criteria like population equality, compactness, travel distance, respect for municipality boundaries and respect for indigenous communities. A point was made about moving beyond open data to open optimisation algorithms which can be verified. This is of great importance when the algorithms are the basis of an important public policy decision like the distribution of political representation across the country. Open code in this context is essential not just for the replication of research results but also for a transparent and accountable government. See slides.
Amparo Ballivian (Development Data Group, World Bank) presented the World Bank project for the collection of high frequency survey data using mobile phones. One motivation for the pilot was the lack of recent and frequently updated data: poverty rates, for example, are calculated on the basis of household surveys, yet such surveys involve a long and costly process of data collection. The aspiration was to have comparable data every month for thousands of households, to be able to track changes in welfare and responses to crises, and to have data to support decisions in real time. Two half-year pilots were implemented in Peru and Honduras, where it was possible to test e.g. monetary incentives, different cellphone technologies and the responses of different income groups. In contrast to e.g. crowd-sourced surveys, such a probabilistic cellphone survey provides the opportunity to draw inferences about the whole population and can be implemented at a much lower cost than traditional household surveys. See slides.
Patrick McNeal (The Abdul Latif Jameel Poverty Action Lab) presented the AEA registry for randomised controlled trials (RCTs). Launched several weeks ago and sponsored by the AEA, the trials registry addresses the problem of publication bias in economics by setting up a place where a list of all ongoing RCTs in economics is available. The registry is open to researchers from around the world who want to register their randomised controlled trial. Some of the most interesting feedback from researchers concerns e.g. having an easy and fast process for registering studies (only about 17 fields are required), the fact that a lot of the information can be taken from the project documentation, the optional uploading of the pre-analysis plan, and the option to hide some fields until the trial is completed, which addresses the fear that researchers will expose their ideas publicly too early. J-PAL affiliates who are running RCTs will have to register them in the system under a new policy which mandates registration, and there are also discussions on linking required registration to the funding policies of RCT funders. Registration of ongoing and completed trials is also being pursued, and training of RAs and PhD students now includes the registration of trials. See the website.
Pablo de Pedraza (University of Salamanca) chairs Webdatanet, a network that brings together web data experts from a variety of disciplines e.g. sociologists, psychologists, economists, media researchers, computer scientists working for universities, data collection institutes, companies and statistics institutes. Funded by the European Commission, the network has the goal of fostering the scientific use of web-based data like surveys, experiments, non-reactive data collection and mobile research. Webdatanet organises conferences and meetings, supports researchers to go to other institutes and do research through short scientific missions, organises training schools, web data metrics workshops, supports early career researchers and PhD students and has just started a working paper series. The network has working groups on quality issues, innovation and implementation (working with statistical institutes to obtain representative samples) and hosts bottom-up task forces which work on collaborative projects. See slides.
Mandating data availability and open licenses
The session chaired by Mireille van Eechoud (IViR – Institute for Information Law) followed up on the discussions about making datasets available and citable, focusing on the roles of different stakeholders and how responsibility should be shared. She reminded the audience that since legal instruments like Creative Commons and open data licenses are already quite well-developed, the role of the law in this context is in managing risk aversion, and it is important to see how legal aspects are managed at the policy level. For instance, while the new EU Framework Programme for Research and Innovation – Horizon 2020 – carries the flag of open access to research publications, there are already a lot of exceptions which would allow lawyers to contest that data falls under an open access obligation. See audio page.
Carson Christiano (Center for Effective Global Action – CEGA) presented the perspective of CEGA, an inter-disciplinary network of researchers focused on global development, which employs rigorous evaluation techniques to measure the impact of large-scale social and economic development programs. The research transparency initiative of CEGA focuses on methodology and is motivated by the issues of publication bias, selective presentation of results and inadequate documentation of research projects. A number of studies in e.g. medicine, psychology, political science and economics have pointed out the fragility of research results in the absence of the methods and tools for replication. CEGA has launched an opinion series, Transparency in Social Science Research, and is looking into ways to promote examples of researchers and to support and train early career researchers and PhD students in registering studies and pre-analysis plans and in working transparently.
Daniel Goroff (Alfred P. Sloan Foundation) raised the question of what funders should require of the people they make grants to, e.g. those who undertake economics research. While some funders may require data management plans and making the research outputs entirely open, this is not a simple matter and there are trade-offs involved. The Alfred P. Sloan Foundation has funded and supported the establishment of knowledge public goods: commodities which are non-rivalrous and non-excludable, like big open access datasets with large setup costs (e.g. Sloan Digital Sky Survey, Census of Marine Life, Wikipedia, etc.). Public goods, however, are notoriously hard to finance. Other funding models, such as involving markets and commercial enterprises so that the data is openly available for free while value-added services are offered at a charge, could be some of the ways to make knowledge public goods useful and sustainable.
Nikos Askitas (Institute for the Study of Labor IZA) heads Data and Technology at the Institute for the Study of Labor (IZA), a private independent economic research institute based in Bonn, Germany, focused on the analysis of global labor markets. He challenged the notion that funders must require researchers to make data available, since researchers are already overburdened and too many restrictions may destroy creativity and result in well-documented mediocre research. Peer review of data is also a very different process from peer review of academic research. He suggested that there is a need to create a new class of professionals who will assist researchers, and who would need proper names, titles, salaries and recognition for their work.
Jean Roth (National Bureau of Economic Research – NBER) mentioned that there has been a lot of interest as well as compliance from researchers since the NSF implemented data management plans. Several years ago, she modified the NBER paper submission code to allow data to be added at submission, and researchers now curate their data themselves; about 5.5% of papers have data available with the paper. A number of the data products from the NBER are very popular in online searches, which helps people find the data in a format which is easier to use. As a Data Specialist at the NBER, she helps to make data more usable and to facilitate its re-use by other researchers. Over time the resources and time invested in making data more usable decrease both for the data curator and for the users of data.
The last session concentrated on further steps for the open economics community and ideas which should be pursued.
If you have any questions or need to get in touch with one of the presented projects, please contact us at economics[at]okfn.org.