I recently came across a video showing how a case of scientific fraud came to light as another research team tried to figure out how exactly certain published results were obtained.
One problem with published scientific studies that involve data processing is that the methods are sometimes not described in enough detail. Another investigator starting from the same raw data may not get exactly the expected output. Often the differences are inconsequential, but sometimes they are not, and this can cause serious aggravation to others trying to understand the reason for the discrepancy. Detective work may be needed to figure out what the first group actually did. Sometimes the discrepancy is due to simple carelessness, but sometimes it reflects a purposeful choice of some parameter to produce desirable results; efforts to replicate a computation have even led to the discovery of fraudulent data. The solution to these difficulties is full transparency: a thorough documentation of every step taken in performing an analysis. This is the basis of reproducible research.
Reproducible research is different from replicable research. Replicable research means that the same (or similar) data can be obtained by another researcher who follows the same experimental methodology. Reproducible research means that, starting from the same raw data, another researcher who follows the same data processing methodology will get the same final results. When an experiment is too difficult to replicate, reproducing the calculation is often the only verification method available to the scientific community.
There can be many reasons for obtaining different results with the same program. Many programs have optional, user-settable parameters that alter the output: thresholds can make an analysis more sensitive or more specific, and there may be a choice among several methods for calculating a p-value or a distance metric. Probabilistic methods can produce different results depending on the random number "seed" used to initialize an algorithm. Sometimes a simple mistake when reading in the data, such as skipping the first data row, goes unnoticed. Only when all these details are reported can two independent investigators obtain the same outcomes.
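The effect of a random seed is easy to demonstrate. The sketch below (with made-up data and a hypothetical bootstrap statistic, not taken from any specific study) shows that a probabilistic computation repeated with the same seed is identical, while a different seed yields a slightly different number:

```python
import random

# Made-up measurements, purely for illustration.
data = [4.1, 3.8, 5.2, 4.7, 4.0, 5.5, 3.9, 4.4]

def bootstrap_mean(values, seed, n_resamples=1000):
    """Mean of resampled means; deterministic for a given seed."""
    rng = random.Random(seed)  # seed fixes the whole random sequence
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(values, k=len(values))  # resample with replacement
        means.append(sum(sample) / len(sample))
    return sum(means) / n_resamples

# Same seed: identical result on every run, on every machine.
print(bootstrap_mean(data, seed=1) == bootstrap_mean(data, seed=1))  # → True
# Different seed: a (slightly) different result.
print(bootstrap_mean(data, seed=1) == bootstrap_mean(data, seed=2))  # → False
```

Unless the publication states the seed (and the resampling scheme), a second investigator has no way to reproduce this number exactly.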
A data analysis process needs to be expressible as a sequence of commands. All of these can be packaged into a single, main command that calls the sub-commands. Running this main command on the raw data should produce the final, publishable figures and tables. Once the data processing pipeline has been defined, there is no room for human intervention or ad-hoc decisions. And because everything is documented as a series of computer commands, another investigator can go through it and check the details. This is when data processing becomes reproducible.
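Such a pipeline can be sketched as follows. The step names, the raw data, and the summary statistic are all invented for illustration; the point is only that one entry point runs every step, so nothing is left to ad-hoc manual intervention:

```python
def clean(raw):
    """Step 1: drop missing values (None) from the raw measurements."""
    return [x for x in raw if x is not None]

def summarize(values):
    """Step 2: compute the summary statistic to be published."""
    return round(sum(values) / len(values), 2)

def run_pipeline(raw):
    """The single main command: raw data in, publishable result out."""
    return summarize(clean(raw))

raw_data = [4.1, None, 5.2, 4.7, None, 4.0]  # made-up raw data
print(run_pipeline(raw_data))  # → 4.5, identical for anyone who re-runs it
```

In practice the sub-commands would read and write files, and the main command might be a shell script or a Makefile rather than a Python function, but the principle is the same: every decision is recorded in code that a second investigator can run and inspect.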