Designing a Data Reproduction Artifact

I recently submitted a data reproduction artifact accompanying our paper “Verifying the Option Type with Rely-Guarantee Reasoning.” There are things I wish I had done from the get-go, so I’m writing them down here for next time. Hopefully, you’ll find some of the things I write here useful in creating your own data artifacts!

A (Brief) Checklist for a Data Reproduction Artifact

Below is a very brief checklist for preparing a data artifact. The rest of the blog post explains each of the checklist items in detail, so read on if you’re interested.

  • Prepare your data for public access.
  • Test your automated data analysis scripts.
  • Double-check your documentation (i.e., everything should be documented).
  • Ensure your artifact is adequately anonymized.

What Is a Data Reproduction Artifact?

A data reproduction artifact comprises the raw experimental data presented in your paper and a mechanism to generate the results you derived from them.

Components of a Data Reproduction Artifact

You want to ensure that your artifact can be understood and used by as many people as possible! Here are some things your data artifact should have, at a minimum:

  • The data: you should either include the data in your artifact or make it publicly available in some other way (a link to an online archive, etc.). The data might be a single CSV file, a massive JSON database, or even a set of software repositories. Hopefully, you’ve automated the analysis and the generation of your tables and figures from it.
  • A way to interpret the data: this can be one or more scripts that take your data as input and produce the results you present in your paper (a script that generates the rows of a .tex table is one example).
  • Documentation: it’s better to err on the side of over-documenting your artifact and to explain things in more detail than you might think necessary. A README file that lists all the files in the artifact and their purpose is the minimum amount of documentation.

I’ll go over each of these components.

Data

If you’re working with data that is already publicly available, such as public software repositories (e.g., a public repo on GitHub), you should be able to include it without much trouble.

If your dataset is large (i.e., more than a couple of GB), you may not want to distribute it directly in the artifact. Users may not have access to a reliable internet connection, or their connection might be slow or metered. You may instead choose to upload the data separately to an archival service (e.g., Zenodo, Anonymous GitHub) and link to it from your artifact. Another alternative, especially in the case where your data are software repositories, is to include a script that clones them onto the user’s machine.
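
For example, a minimal clone script might look like the sketch below; the repository list file and the subjects/ directory are hypothetical placeholders:

#!/bin/sh
# Clone every subject repository listed in repos.txt (one URL per line)
# into the subjects/ directory.
set -e
mkdir -p subjects
while read -r url; do
  git clone "$url" "subjects/$(basename "$url" .git)"
done < repos.txt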

You should ensure that you are allowed to publish your data in a publicly available dataset. There are only a few legitimate reasons why you might not be able to: the data might expose PII (personally identifiable information), be subject to data-protection laws and regulations, or comprise proprietary information or trade secrets. In all of these cases, it’s worth investigating whether you can publish your dataset in part, or create an anonymized version that is allowed to be published.

Scripts

You should automate the generation of the tables and figures in your paper as much as possible; it should ideally be a one-step process. For example, to generate the rows of Table 1 in our paper “Verifying the Option Type with Rely-Guarantee Reasoning,” I only need to run the following command:

% ./compute-precision-recall-annotations

Not only is automating figure and table generation convenient (for example, after making any updates that affect your results), it can also be less error-prone. Auditing how you compute precision, recall, or any other result is easier when you have a program that anyone can review.

Your scripts should also be clearly documented. I prefer to have a small preamble at the top of each script describing its purpose as well as its inputs and outputs. For example, below is the documentation at the top of the compute-precision-recall-annotations script:

# This script generates the rows for Table 1.
# Specifically, it calculates the precision, recall, and
# number of false positive suppressions for SpotBugs,
# Error Prone, IntelliJ IDEA, and the Optional Checker.
# 
# We use the following standard definitions:
# Precision: TP / (TP + FP)
# Recall:    TP / (TP + FN)
#
# This script runs grep for the relevant error suppression
# patterns for each directory of subject programs that have
# been instrumented with SpotBugs, Error Prone,
# IntelliJ IDEA, and the Optional Checker.
#
# Usage:
#   compute-precision-recall-annotations
# Result:
#   eval-statistics.tex
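
To make that concrete, here is a heavily simplified sketch of the general shape such a script could take; the tool directory names, suppression patterns, and counting logic below are hypothetical placeholders, not the actual contents of our script:

#!/bin/sh
# Sketch: count suppression patterns per tool and emit one LaTeX table row each.
set -e
: > eval-statistics.tex
for tool in spotbugs error-prone intellij optional-checker; do
  tp=$(grep -r "TRUE_POSITIVE"  "subjects/$tool" | wc -l)
  fp=$(grep -r "FALSE_POSITIVE" "subjects/$tool" | wc -l)
  fn=$(grep -r "FALSE_NEGATIVE" "subjects/$tool" | wc -l)
  # Precision = TP / (TP + FP), Recall = TP / (TP + FN).
  awk -v t="$tool" -v tp="$tp" -v fp="$fp" -v fn="$fn" \
    'BEGIN { printf "%s & %.2f & %.2f & %d \\\\\n", t, tp / (tp + fp), tp / (tp + fn), fp }' \
    >> eval-statistics.tex
done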

It may also be helpful to have a top-level script that your users (and you!) can run to generate all the figures and tables in your paper.
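
For example, such a top-level script can be as small as the sketch below; the second script name is a hypothetical placeholder for whatever else your paper needs:

#!/bin/sh
# Regenerate every table and figure in the paper.
set -e
./compute-precision-recall-annotations    # Table 1 (eval-statistics.tex)
./generate-figures                        # hypothetical: remaining figures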

Documentation

At minimum, you should include a README file that covers the following information; a clearly-labelled section for each is a good way to organize your README (a skeleton appears after this list):

  • A list of dependencies and/or requirements for your software. Providing brief pointers on how to install them is also helpful.
  • An index of each top-level file in your artifact and its purpose (covering all datasets and scripts), as well as any sub-directories and their files, as appropriate.
  • An index of each file that is generated by your data reproduction scripts and which table or figure it maps to in your paper, if applicable.
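
Concretely, a README skeleton for this kind of artifact might be organized roughly like this; the file names other than the script and eval-statistics.tex are illustrative:

Requirements
  e.g., a POSIX shell, git, grep, and awk, with brief pointers on installing them

Contents
  data/                                   the raw experimental data
  compute-precision-recall-annotations    script that generates the rows of Table 1
  README                                  this file

Generated files
  eval-statistics.tex                     the rows of Table 1 in the paper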

It is reasonable and OK that your artifact may not work on every single operating system under the sun or any piece of hardware dating back to the Apple II. However, you should clearly document the limitations of your artifact so that your users aren’t caught by surprise. This is the place where you should also describe any minimum computing requirements. Perhaps your scripts require a certain amount of memory to run within a reasonable time limit, or your generated artifacts occupy a large amount of disk.

Your description of your dataset should be thorough. For example, if your dataset is composed of a set of folders, what does each folder contain? Is there a reason for the particular structure of your dataset? You should describe each column or field in your dataset (if you are distributing a CSV or JSON file) if it is not obvious from its name or context. If you’re having doubts about the clarity of something, others might be confused, too. Providing more information rather than less is usually a good idea.

It is important to describe the results of your scripts, particularly if they only generate the rows of a table or the body of a figure. If your generated files do not have names that directly correspond to table names, you should provide an unambiguous mapping.

Checklist for Preparing a Data Artifact

Many conferences have a separate artifact evaluation track, where authors of accepted papers may submit a data artifact to accompany their paper. Each artifact evaluation track is a little different and may require different things, but here’s a general checklist of things that might be helpful to remember.

Test Your Artifact

You should test your artifact in as many different environments as you have access to; all of these systems should have no prior connection to your artifact (i.e., they should be as close to a clean machine as possible, without any special dependencies already installed). At minimum, your co-authors should be able to take your artifact and reproduce your results by following the instructions and documentation you have provided in the README.
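
One way to approximate such a clean machine, assuming you have Docker available, is to mount the artifact inside a fresh container and follow your own README from scratch; the artifact path below is a placeholder:

# Start from a bare Ubuntu image with none of your usual tools installed
# and mount the artifact read-only so the test cannot modify it.
docker run --rm -it -v "$PWD/artifact:/artifact:ro" ubuntu:22.04 bash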

Your colleagues in the lab are fantastic test subjects for your artifact; even better if they work on operating systems and hardware that differ from yours. Ask them nicely to write down anything they find confusing or ambiguous as they test your artifact. Additionally, ask them to write down anything they had to do to make your artifact work beyond what was specified in your instructions. Your colleagues are smart people; these are things you’ll want to fix before you submit! They should avoid asking you clarifying questions directly: chances are you’ll clarify on the spot and forget to update your artifact, which invalidates the hard work your colleagues are doing for you.

Anonymize Your Artifact

Most review committees are double-blind; your submitted artifact should not include any metadata that may de-anonymize you. If any component of your artifact was under version control, you may have a .git directory or two lying around. You can run the following command to recursively delete any .git directories from the root of your artifact directory:

rm -rf `find . -type d -name .git`

You should run a recursive search over your artifact directory for anything else that might de-anonymize you (e.g., usernames or IDs, institutional IDs, etc.). Don’t skip this step; your computer can create metadata and hidden files in ways that you may not have anticipated.

You should ideally have a script that automates the anonymization process, enabling you and your co-authors to audit the process via code review.
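
A minimal sketch of such a script, bundling the .git cleanup above with a final sweep for identifying strings, follows; the search terms are hypothetical placeholders for your own names and institutions:

#!/bin/sh
# Remove version-control metadata and common hidden files, then report
# anything that still looks like it could identify the authors.
set -e
find . -type d -name .git -prune -exec rm -rf {} +
find . -name .DS_Store -delete
grep -rni -e "alice" -e "example-university" . || echo "No suspicious matches found."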

Anonymous GitHub is capable of automatically anonymizing GitHub repositories, though you’ll have to sign in with your GitHub account.

Double-Check Your Documentation

Your documentation is the only way you can communicate with whoever is using or reviewing your artifact. You may have simulated this process by asking your colleagues to test out your artifact without talking to you, but it’s always good to ask yourself the following:

  • Is every file in the artifact documented?
  • Is every file that is generated documented? This includes components of the data your scripts generate (e.g., column names, JSON keys and values, etc.).
  • Are the assumptions your artifact makes about its execution environment documented?
  • Is there anything in the artifact that may de-anonymize you?

Submitting Your Artifact

Archival Services

I use Zenodo to archive my data artifacts if there are no restrictions from the conference or publisher. It is free, convenient, and widely used by my research community. Additionally, Zenodo will automatically version your artifact whenever you make updates, and generate a DOI that will resolve to the latest version without any work on your end. Here is an example of an anonymously uploaded file on Zenodo.

Alternative services that support anonymous uploads include the Open Science Framework (OSF), FigShare, Dataverse, and Anonymous GitHub.

Preparing Your Artifact for Submission

You generally do not want to upload a raw, uncompressed artifact, since it might be quite large. You can run the following command to create a .zip archive of your artifact:

zip -r <name_of_resulting_archive>.zip <path_to_artifact>

Don’t create a zip via the GUI on macOS; it will result in irritating hidden __MACOSX files being included in the zip.
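
If you do build the archive on macOS from the command line, you can also explicitly exclude the usual Finder leftovers (.DS_Store files and any stray __MACOSX entries); for example:

zip -r <name_of_resulting_archive>.zip <path_to_artifact> -x "*.DS_Store" -x "__MACOSX/*"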

Is data reproduction the same as data replication?

There is often some confusion around reproducing data and replicating data. Some fields use one to mean the other, while others use them interchangeably. I work in computer science, so I follow the ACM’s definitions, which I summarize below:

  • Reproducibility: a different team is able to obtain the same results as the original team via the original methodology and data (i.e., it’s possible for an independent group to obtain the same results using the authors’ original artifacts).
  • Replicability: a different team is able to obtain the same results as the original team via their own methodology and data (i.e., it’s possible for an independent group to obtain the same results using an independently-developed methodology and data).

An additional definition is repeatability, where a measurement (i.e., the results) can be obtained by the original team via the original methodology using the original data.

In the context of my specific data artifact, reproducibility can be summarized further as “same script, same data,” while replicability can be summarized as “different script, different data” in pursuit of my original research questions.

Acknowledgments

Thanks to Joseph Wonsil and Michael Ernst for their feedback and comments on earlier versions of this post. Check out Joe’s fantastic research, which aims to augment the methods that make reproducibility more accessible for research programmers! Mike and I work together at UW to improve programmer productivity; check out some of his latest work here and my work here.