Research Data Management in Economic Journals
By Sven Vlaeminck | ZBW – German National Library of Economics / Leibniz Information Center for Economics
In Economics, as in many other research disciplines, there is a continuous increase in the number of papers where authors have collected their own research data or used external datasets. However, so far there have been few effective means of replicating the results of economic research within the framework of the corresponding article, of verifying them, and of making them available for reuse or for use in support of the scholarly debate.
In the light of these findings B.D. McCullough pointed out: “Results published in economic journals are accepted at face value and rarely subjected to the independent verification that is the cornerstone of the scientific method. Most results published in economics journals cannot be subjected to verification, even in principle, because authors typically are not required to make their data and code available for verification.” (McCullough/McGeary/Harrison: “Lessons from the JMCB Archive”, 2006)
Harvard Professor Gary King also asked: “[I]f the empirical basis for an article or book cannot be reproduced, of what use to the discipline are its conclusions? What purpose does an article like this serve?” (King: “Replication, Replication” 1995). Therefore, the management of research data should be considered an important aspect of the economic profession.
The project EDaWaX
Several questions came up when we considered the reasons why economics papers may not be replicable in many cases:
First: what kind of data is needed for replication attempts? Second: scholarly economic journals clearly play an important role in this context. When publishing an empirical paper, do economists have to provide their data to the journal? How many scholarly journals oblige their authors to do so? Do these journals require their authors to submit only the datasets, or also the code of computation? Do they oblige their authors to provide the programs used for estimations or simulations? And what about descriptions of datasets, variables and values, or even a manual on how to replicate the results?
As part of generating the functional requirements for a publication-related data archive, the project analyzed the data (availability) policies of economic journals and developed some recommendations for these policies that could facilitate replication.
Data Policies of Economic Journals
First of all, we wanted to know how many journals in Economics require their authors to provide their empirical analysis data. Of course it was not possible to analyze all of the estimated 8,000 to 10,000 journals in Economics.
We used a sample built by Bräuninger, Haucap and Muck (paper available in German only) for examining the relevance and reputation of economic journals in the eyes of German economists. This sample was very useful for our approach because it allowed us to compare the international top journals with journals published in the German-speaking area. Using the sample’s rankings for relevance and reputation, we could also establish that the journals with data policies were also the higher-ranked ones.
To the sample of Bräuninger, Haucap and Muck we added four further journals that have a data availability policy, so that more journals were available for a detailed evaluation of data policies. We excluded some journals because they focus only on economic policy or theory and do not publish empirical articles.
The sample we used is not representative of economic journals, because it consists mainly of highly ranked journals. Furthermore, since some journals were added precisely because they have a data policy, the percentage of journals with such guidelines is much higher than we would expect for economic journals in general.
Journals with a data availability policy
In our sample, 29 journals have a data availability policy (20.6%) and 11 journals (7.8%) have a so-called “replication policy” (we examined only the websites of the journals, not the printed versions). As mentioned above, this percentage is not representative of economic journals in general. On the contrary, we assume that our sample includes the majority of economic journals with data (availability) policies.
The number of journals with a data availability policy is considerably higher than in earlier studies in which other researchers (e.g. McCullough) examined the data archives of economic journals. An additional online survey of editors of economic journals showed that most of our respondents implemented their data policies between 2004 and 2011. We therefore suppose that the number of economic journals with data policies is slightly increasing. The editors of economic scholarly journals seem to realize that the topic of data availability is becoming more important.
The largest numbers of journals with a data availability policy were published by Wiley-Blackwell (6) and Elsevier (4). We found that it is mainly university and association presses that have a high to very high percentage of journals with data availability policies, while the major scientific publishers stayed below 20%.
Of the 29 journals with data availability policies, 10 initially used the data availability policy implemented by the American Economic Review (AER). These journals used either exactly the same policy or a slightly modified version.
The journals with a “replication policy” were excluded from further analysis. The reason is that “replication policies” oblige authors to provide “sufficient data and other materials” on request only, so there are no files that authors have to provide to the journal. This approach sounds good in theory, but it does not work in practice because authors often simply refuse to honor the requirements of these policies (see “Replication in Empirical Economics: The Journal of Money, Credit and Banking Project” by Dewald, Thursby and Anderson).
Some criteria for data policies to enable replications
For a further evaluation of these data availability policies, we rated the quality of the policies against a set of criteria: we took criteria previously developed by B.D. McCullough and extended them with standards that are important from an infrastructural point of view. The criteria we used for evaluation are as follows:
Data Policies that aim to ensure the replicability of economic research results have to:
- be mandatory,
- oblige authors to provide datasets, the code of computation, programs and descriptions of the data and variables (ideally in the form of a data dictionary),
- assure that the data is provided prior to publication of an article,
- have defined rules for research based on proprietary or confidential data,
- provide the data in such a way that other researchers can access them without problems.
In addition, journals should:
- have a special section for the results of replication attempts or should at least publish results of replications in addition to the dataset(s),
- require their authors to provide the data in open formats or in ASCII-format,
- require their authors to specify the name and version of both the software and the operating system used for analysis.
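The criteria above lend themselves to a simple checklist. The following is a minimal sketch in Python, not the instrument used in the study: the `DataPolicy` class, its field names and the example policy are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class DataPolicy:
    """Hypothetical record of one journal's data availability policy."""
    # Criteria a policy has to meet
    mandatory: bool = False
    requires_datasets: bool = False
    requires_code: bool = False
    requires_programs: bool = False
    requires_descriptions: bool = False
    data_before_publication: bool = False
    rules_for_proprietary_data: bool = False
    data_accessible: bool = False
    # Criteria a journal should meet in addition
    replication_section: bool = False
    open_formats: bool = False
    software_version_and_os: bool = False


def unmet_criteria(policy: DataPolicy) -> list[str]:
    """Return the names of all criteria the policy does not satisfy."""
    return [f.name for f in fields(policy) if not getattr(policy, f.name)]


# Invented example policy, for illustration only.
example = DataPolicy(
    mandatory=True,
    requires_datasets=True,
    requires_descriptions=True,
    data_before_publication=True,
)

if __name__ == "__main__":
    print(unmet_criteria(example))
```

Keeping each criterion as a separate yes/no field makes it easy to count, across many journals, how often a single requirement is missing.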
Results of our survey
The above-mentioned requirements were used to analyze the data policies of 141 economic journals. These are some of the results we obtained:
Mandatory Data Availability Policies
We found that more than 82% of the data policies are mandatory. This is a good result, because for obtaining data it is crucial that policies oblige authors to provide them. If they do not, there is little hope that authors will provide a noteworthy number of datasets and code files, simply because preparing datasets and code is time-consuming and authors receive no reward for this work. Besides, authors often do not want to publish a dataset that they have not yet fully exploited. In the academic struggle for reputation, the last thing a researcher wants is to hand a substantial dataset to a competitor.
What data authors have to provide
We found that 26 of the 29 policies (89.7%) oblige authors to submit the datasets used for the computation of their results. The remaining journals do not oblige their authors to do so, because their focus is often oriented more towards experimental economic research.
As to what kinds of data authors have to submit, we found that 65.5% of the journals’ data policies require their authors to provide descriptions of the data submitted and some instructions on how to use the individual files. The quality of these descriptions ranges from very detailed instructions to just a few sentences that might not really help would-be replicators.
For the purpose of replication these descriptions are very important because of the structure of the data authors provide: in most cases, data is available as a zip file only, and such a zip container holds a wide variety of formats and files. Without proper documentation, it is extremely time-consuming to find out which part of the data corresponds to which results in the paper, if this is possible at all. It is therefore not sufficient that only 65.5% of the data policies in our sample oblige their authors to provide descriptions. This kind of documentation is currently the most important part of the metadata describing the research data.
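One way to make such documentation checkable is a machine-readable file list. The sketch below, in Python, assumes a hypothetical convention that none of the policies discussed here prescribes: the archive contains a `README.csv` with the columns `file` and `description`, and every other file in the archive should be listed there. All file names in the demo archive are invented.

```python
import csv
import io
import zipfile


def undocumented_files(archive: zipfile.ZipFile, readme_name: str = "README.csv") -> list[str]:
    """List archive members that have no non-empty description in the README."""
    members = archive.namelist()
    names = [n for n in members if not n.endswith("/") and n != readme_name]
    if readme_name not in members:
        return sorted(names)  # without a README nothing is documented
    with archive.open(readme_name) as fh:
        rows = csv.DictReader(io.TextIOWrapper(fh, encoding="utf-8", newline=""))
        documented = {row["file"] for row in rows if (row.get("description") or "").strip()}
    return sorted(n for n in names if n not in documented)


def build_demo_archive() -> zipfile.ZipFile:
    """Create a small in-memory archive with invented file names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("README.csv", "file,description\ndata.csv,Panel used for Table 1\n")
        zf.writestr("data.csv", "id,value\n1,2.5\n")
        zf.writestr("estimate.do", "* estimation script\n")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


if __name__ == "__main__":
    print(undocumented_files(build_demo_archive()))
```

A check of this kind only confirms that a description exists for each file; whether the description actually helps a would-be replicator still has to be judged by a person.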
The submission of (self-written) programs, used for simulation purposes for example, is mandatory under 62% of the policies. This relatively low percentage can also be considered problematic: a researcher who wants to replicate the results of a simulation has no chance of doing so if the programs used for the simulation are not available.
Of course, whether this kind of research is published depends on the journal’s focus. But if such papers are published, a journal should make sure that the programs used and the source code of the application are submitted. Only if the source code is available is it possible to check for programming errors.
Approximately half of the policies oblige their authors to provide the code of their calculations. Given the importance of code for replication purposes, this percentage may be considered low. The code of computation is crucial for replicating the findings of an empirical article. Without it, would-be replicators have to code everything from scratch, and it is uncertain whether they will be able to produce identical code. It is therefore crucial that data availability policies strictly enforce the availability of the code of computation.
The point in time for providing datasets and other materials
As to the point in time at which authors have to submit the data to the journal, we found that almost 90% of the data availability policies oblige authors to provide their data prior to the publication of an article. This is a good result. It is important to obtain the data prior to publication because, in the absence of other rewards, publication is the only incentive to submit data and code. Once an article is published, this incentive is gone.
Exemptions from the data policy and the case of proprietary data
In economic research it is quite common to use proprietary datasets. Providers such as Thomson Reuters Datastream offer datasets for purchase, and many researchers choose such options. Research based on company data or microdata is likewise always proprietary or even confidential.
Normally, researchers who want to publish an article based on such data have to request an exemption from the data policy. More than 72% of the journals we analyzed offered this possibility. One journal (Journal of the European Economic Association) discourages authors from publishing articles that rely on completely proprietary data.
But even if proprietary data were used for research, it is important that the research output is replicable in principle. Journals should therefore have a procedure in place that ensures the replicability of the results even in these cases. Accordingly, some journals request their authors to provide the code of computation, the version(s) of the dataset(s) and some additional information on how to obtain the dataset(s).
Of the 28 journals allowing exemptions from the data policy, we found that more than 60% have rules for these cases. This percentage is not really satisfactory; there is still room for improvement.
Open formats are important for two reasons. The first is that the long-term preservation of the data is much easier, because the technical specifications of open formats are known. The second is that open formats make it possible to use data and code on different platforms and in different software environments. It is useful to be able to use the data interoperably and not only in one statistical package or on one platform.
Regarding these topics only two journals made recommendations for open formats.
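As an illustration of what an open, plain-text format means in practice, the following minimal sketch writes a small table as CSV and reads it back using only the Python standard library. The variable names and values are invented, and CSV is just one example of such a format, not a recommendation taken from any of the policies.

```python
import csv
import io

# Invented example records; the variable names are hypothetical.
records = [
    {"id": 1, "group": "a", "value": 2.5},
    {"id": 2, "group": "b", "value": 3.75},
]


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def from_csv(text: str) -> list[dict]:
    """Read CSV text back; all values come back as strings."""
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    text = to_csv(records)
    print(text)
    print(from_csv(text))
```

Because the file is plain text with a header line, it can be opened in any statistical package or text editor. The trade-off is that type information is lost (every value is read back as a string), which is one more reason to ship a description of the variables with the data.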
Version of software and OS
According to McCullough and Vinod (2003) the results achieved in economic research are often influenced by the statistical package that was used for calculations. Also the operating system has a bearing on the results. Therefore both the version of the software and the OS used for calculations should be specified.
Most of the data policies in our sample do not oblige their authors to provide these specifications. But there are differences: for example, almost every journal that has adopted the data availability policy of the American Economic Review (AER) requires its authors to “document[…] the purpose and format of each file provided”.
In sharp contrast, up to now not a single policy requires the specification of the operating system used for calculations.
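Recording this information costs very little. As a minimal sketch, assuming the analysis is run from Python, the standard `platform` module can capture the operating system and interpreter version; the function name `environment_record` and the package entry in the usage line are hypothetical, and versions of other statistical software would have to be added by the author.

```python
import json
import platform
from typing import Optional


def environment_record(extra_software: Optional[dict] = None) -> dict:
    """Collect OS and interpreter details, plus any versions passed in by hand."""
    record = {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
    }
    # Versions of statistical packages are supplied by the author;
    # the entry used in the usage line below is a made-up placeholder.
    record["software"] = dict(extra_software or {})
    return record


if __name__ == "__main__":
    print(json.dumps(environment_record({"example_stat_package": "1.0"}), indent=2))
```

Saved as a small text file next to the data and code, such a record gives a would-be replicator the software and operating system details that the policies currently do not ask for.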
In the course of our study we also examined whether journals have a special section for the results of replication attempts. We found that only a very limited number of journals have such a section. In an additional online survey by the EDaWaX project, seven journals stated that they publish replication results or attempts. However, the number of these replication attempts was low: none of the respondents published more than three replication studies per year, and most published fewer than one.
A replication section is needed mainly as a means of controlling the quality of the data submitted. If a journal does not publish the results of replications, authors may submit data of poor quality.
In summary, the management of publication-related research data in economics is still in its early stages. We were able to find 29 journals with data availability policies. That is many more than other researchers found some years ago, but compared to the multitude of economic journals overall, the percentage of journals with a data availability policy is still quite low. The journals behind the 20.6% in our analysis probably make up the bulk of all journals that have a data policy.
Nevertheless, editors and journals in economics seem to be in motion: the topic of data availability appears to be becoming more and more important in economics. This is a positive signal, and it will be interesting to monitor whether and how this upward trend continues.
A large portion of the analyzed data availability policies are mandatory, which is good practice. Moreover, the finding that almost 90% of the policies oblige authors to submit the data prior to the publication of an article shows that many journals have recognized the importance of providing data at an early stage in the publication process.
When analyzing the data authors have to provide, we noticed that almost all guidelines mandate the submission of the (final) dataset(s), which is also quite positive.
But beyond that there is much room for improvement: only about two thirds of all policies require the submission of descriptions and of (self-written) software. As mentioned above, research data is often not usable when descriptions or software components are missing. In particular, the lack of requirements to submit the code of computation is a big problem for potential replication attempts. Only a small majority of all policies oblige their authors to provide it. It can therefore be expected that almost half of the data availability policies in our sample do not fully enable replications.
Another important aspect is the possibility of replicating the results of economic research that is based on proprietary or confidential data. While more than 72% of all policies allow exemptions from their regulations, only 60.7% have a procedure in place that regulates which data and descriptions authors still have to provide in these cases. On balance, much research based on proprietary or confidential data is not replicable even in principle.
Open formats are used by a small minority of journals only. This might result in difficulties for the interoperable use of research data and the long-term preservation of these important sources of science and research.
The reuse of research data is also complicated by the lack of information on which version of a software package was used for calculations. Only a little more than a third of all policies state that authors have to specify the software version or the formats of the submitted research data. Moreover, so far not a single journal requires the specification of the operating system used.
But there are also good practices: among the journals with data availability policies, the policy implemented by the American Economic Review (AER) stands out as a very good example. Journals that have adopted this policy form the largest single group in our sample. We therefore see a developing trend towards a de facto standard for data policies.
In a second part to this survey (to be published in spring 2013) we will discuss the infrastructure used by economic scholarly journals for providing datasets and other materials.