Reproducible Science

I recently came across a video showing how a case of scientific fraud came to light as another research team tried to figure out how exactly certain published results were obtained.

One of the problems in published scientific studies that involve data processing is that they sometimes are not detailed enough. Another investigator starting with the same raw data may sometimes not get exactly the expected output. Often the differences are inconsequential but sometimes they are not. This may cause serious aggravation to others trying to understand the reason for the discrepancy. A detective work trying to figure out what the first group actually did may be necessary. Sometimes the reason for the discrepancy is just carelessness, but sometimes it may be purposeful choice of some parameter to produce desirable results; Efforts to replicate the computation have even resulted in the discovery of fraudulent data. The solution to these difficulties passes from full transparency, a thorough documentation of the steps taken in performing an analysis. This is the basis of reproducible research.

Reproducible research is different from replicable research. Replicable research means the same (or similar) data can be obtained by another researcher who follows the same methodology. Reproducible research means that, starting from the same raw data, another researcher who follows the same data processing methodology will get the same final results. When the experiment is too difficult to replicate, reproducing the calculation often is the only verification method available to the scientific community.

There can be many reasons for obtaining different results while using a program. For example, many programs have optional, user-settable parameters that alter the output. Thresholds can make the analysis more sensitive, or more specific. There may be the choice of several methods to calculate a p-value or some distance metric. Probabilistic methods can produce different results depending on the random number “seed” used to initiate an algorithm. Sometimes, when reading in the data, a simple mistake like skipping the first data row may go unnoticed. If these details are given, two independent investigators will obtain the same outcomes.

A data analysis process needs to be formulatable as a string of commands. You can package all these programs into a single, main command that calls all sub-commands. The main command can be run on the raw data and the final, publishable figures and tables should be produced from it. Once the data processing pipeline has been defined, there should be no room for human intervention, no ad-hoc decisions. And because everything is documented as a series of computer commands, another investigator can go through it and check the details. This is when data processing becomes reproducible.

Summer Course on Bioinformatics

I taught a 7-week course on Bioinformatics at Bosphorus University in Istanbul, Turkey, this summer. Although I had previously given guest lectures at Yale, this was the first time I had the responsibility for a whole course. It was an interesting experience at several different levels. 

One interesting challenge was the lab setup. Most bioinformatics tools are designed to run on a Linux/Unix environment but the PCs at the training labs had Windows as their OS. For bureaucratic reasons, it was going to be too complicated to make these machines dual-boot with a Linux option. I ended up requiring all students to bring their own laptops to class and install BioLinux (an Ubuntu flavor with a variety of bioinformatics software pre-installed) on their machine. I gave the students three choices:

  1. A dual boot system, with BioLinux as one of the selections.
  2. Install Oracle Virtual Box, then install BioLinux in it.
  3. Reboot the PC from a USB stick with BioLinux on it.

Each option had their pros and cons but every student was able to make one of these options work.

The lab exercises were designed to enable students to do an RNA-seq analysis followed by a differential expression analysis and then creating a database to store the data. This meant learning Linux commands at a level to manage data files as well as the awk and sed languages to manipulate the data within the files. They also learned the R language for statistical analysis and graphing. With this foundation, the students processed an RNA-seq dataset using Tophat/Cufflinks and graphed their data using cummeRbund. Finally, they learned to set up a relational database in MySQL database.

Half of the final grade of the course depended on writing a literature review on a topic of bioinformatics or implementing one of the algorithms learned in class. I had taught them what constitutes plagiarism and the mechanics of using a bibliography management tool (Mandalay), but the actual writing of the papers turned out to be a bigger challenge than I had expected. Most of the student had had no paper writing experience until then and they could not decide on a topic or how to write one. After I pointed them to a list of papers submitted for a Bioinformatics course similar to ours (, they were able to proceed.

All in all, I think this was a valuable experience for my students. I hope I will get the opportunity to give this course one more time in order to improve on it.

Transcriptomics on an organism with a poorly annotated genome

The most commonly used software for performing transcriptomics analysis with RNA-seq these days is the Tuxedo suite, consisting of programs called Bowtie, TopHat and Cufflinks.This works pretty well for a well annotated genome, like that of mouse and human. But what happens with a genome that is incompletely assembled and whose transcript is not well annotated? I had the opportunity to work on an insect transcriptome and encountered a number of challenges. My task was to compare the transcriptomes of this insect in two different biological states, identify the differentially expressed transcripts, and then determine whether these were enriched for any biological processes. This was easier said than done.

The insect (which shall remain nameless, until publication time; let’s call it “the bug” for now) has a genome that consists of several hundred-thousand contigs ranging in size from a few hundred to a few hundred-thousand base pairs. Its proteins have been identified by homology to other better characterized insects, like fruit fly and mosquito, but only about 1% had an unambiguous descriptions, like “beta-tubulin”. Many had descriptions that suggested a likely function, like “spectrin alpha chain, putative”. About 5% had descriptions based on their domain structure, like “PDZ domain protein,” while half were simply called “hypothetical protein”. However, the proteins were not annotated by any Gene Ontology (GO) terms, and this is what I needed to ascribe biological meaning to the differentially expressed proteins.

I ended up using the BLASTP program to map all proteins of this insect to those of Drosophila melanogaster, the well-studied fruit fly. Three quarters of the bug’s protein were highly homologous to that of D.melanogaster. Then, the GO terms associated with these fruit fly genes were assigned to these proteins. At the end, 53% of the bug’s proteins were annotated through their similarity to the fly proteins. For this pilot study this was good enough to get a high-level impression of biological processes or biochemical functions that are statistically enriched in the differentially enriched genes. There were many transcripts that connected two or more contigs of the genome assembly. This was evidenced by one of the reads of a read-pair spanning two contigs and the other read of the pair being completely within the other contig. Because TopHat treats each contig as a separate chromosome, knowing which contigs are actually part of the same chromosome improved the sensitivity of the analysis. 1% of the contigs were determined to be connected based on this data.

The analysis showed that 100% of previously identified genes were also detected in this study. Unsurprisingly, only 90% of the known alternatively spliced transcripts were confirmed, since these samples were from a different tissue. The analysis also yielded many novel genes relative to the “reference” genome annotation. Actually, there were three times asmany genes identified in just one of the biological replicates but not consistently over three replicates in the same group. Taking the intersection of genes seen in all six samples (3 replicates x 2 conditions), resulted in 10% more genes, while the union of the consistently expressed genes in the Control samples and those in the Treated samples resulted in 25% more genes. Among them were many interesting transcripts such as intronic ones, those that overlapped with another transcript on the opposite strand, those with polymerase run-on sequences, etc.

It was also interesting that this bug, although nominally the same strain as the one used to create the reference genome sequence, was rather polymorphic. The bug happened to have 0.01% of SNPs and 0.02% of short indels.

The transcripts identified by TopHat can be grouped in different ways to give different perspectives on their regulation. While you can study individual transcript isoforms, it is also possible to group them according to those that same protein coding DNA sequence (CDS), or those that have the same transcription start site (TSS), or those that are part of the same gene. Because of statistical issues, for example a CDS-group belonging to a gene may appear to be differentially expressed while the total expression from that gene may not be significantly altered. There are additional ways to slice and dice the information in order to increase sensitivity while maintaining the false discovery rate under control.

This pilot transcriptome analysis of the bug was pleasing, not only because of the detailed characterization of the transcriptome but also for the biological insights that were gleaned from the differentially expressed transcripts. Now comes the time to validate the bioinformatics analysis with targeted experiments.

Advisor for hire

Recently I had the unusual experience of being an advisor for hire. My client was a clinical scientist specialized in molecular biology. One of his post-docs had started a project that required bioinformatics expertise. She already had a computer science background and had taken some training in transcriptomics data analysis but she was inexperienced in this methodology.The PI could not assess her work and guide her. So I became her advisor.

 We met once every week or two. I helped her getting set up with the High Performance Computing environment of Yale University and the intricacies of some Unix commands. She needed to combine data from multiple files into a master table, then remove the low quality or inconsistent data. We discussed the analytic strategy, decided on the best way to validate the data. I was a combination technical consultant and academic advisor. Our meetings were partly trouble-shooting and partly planning sessions for what she needed to do.

I was glad to be able to assist the postdoc and enable her to complete the analysis. But we were determined not to take the data at face value and our effort paid off. As it turned out, the data did not validate, the genotype was not consistent with the genetic crossings that were performed. This meant that the data was not usable and the project had to be terminated. Our consolation was that it is better to fail early than rush to publish and be embarrassed later.

Of mice and bugs

Studies of host-microbiome interactions always fascinate me. The biology of the host organism affects the composition of the microbial flora and vice-versa. For example obesity and diabetes are just two diseases that lead to changes in the gut microbiome. Several studies have shown that the host immune system is modulated by the gut microbiome.

 I recently completed a microbiome analysis on mice. The data consisted of16S rRNA sequences from fecal microbes. The samples were obtained from either wild type (WT) mice or mice that had a knock-out (KO) mutation in a particular gene that shall go nameless. Because the gene is involved in immune response,the question was whether the presence of this gene product affected the bacterial population in the mouse gut. Compared to some other similar studies I was involved in previously, the biological result turned out to be quite straightforward. However, to reach this conclusion it was necessary to realize that two of the eight replicates in one group were outliers. I could demonstrate by several criteria that those two mice were unlike the other ones in the group that they were nominally part of. Once they were excluded, the difference between the microbiomes of the WT and KO groups was obvious.

 The difference was a qualitative one: the absence of the host gene leads to the complete disappearance of one family of bacteria in the population and its net replacement by another one. The disappearing family consists of several genera (plural of genus) and species. The sequence data is not specific enough to identify individual species but we can cluster sequences by similarity to generate groups that are roughly equivalent to either species, genera or families. The number of species was lower in the KO mice than in the WT, while the number of genera or families was the same. On the other hand, the”entropy” of the knockout populations was lower for genera and families but unchanged for species. This meant that the proportions of these groups is more “even” in the knockout, resulting in a less complex population composition.

I wonder what is the mechanism behind this population shift. Presumably there is a bacterial protein found among members of this family that is interacting with the host protein that is absent in the KO mice. Evidently,this lack of interaction leads to their disappearance. Perhaps bacteria of this family need to anchor themselves to the gut lining in order not to be flushed out. Or perhaps the cellular stimulation by these bacteria leads to a cellular response that is unfavorable to them. I am sure my client will be studying this in detail.

Pathway analysis software

When you compare the RNA or protein expression profile of a control and treated sample, you get a list of differentially expressed genes. It is interesting to look for interesting names in the list but there are a number of bioinformatics analysis methods that can tell you a lot more. These are pathway analysis or network analysis software.

At the simplest level, you can look for enrichment of annotation categories in your list of interest. For example a statistically significant enrichment for the Gene Ontology (GO) biological process term “apoptosis” within your list would suggest that the biological phenomenon you are investigating may involve apoptosis. There are a number of public tools that can help you do this kind of analysis.

With commercially available tools you can create networks of interactions that include members of your gene list. For example you could create a “dense” network, that contains most of your proteins of interest, and a minimum number of intermediate proteins. A random network of the same size would be unlikely to have as many of your proteins, suggesting that this network can mechanistically explain how, e.g., an increase in protein A may result in an increase in protein B.

Another type of analysis would involve the identification of transcription factors (TFs) whose targets are enriched in your list of differentially expressed genes. If a TF regulates, say 1% of the genes in the genome but its targets represent 10% of your list, the probability of this happening by chance can be calculated to be very low, and you would have good reason to get excited.

Such analyses require that you have a good database of molecular interactions. While some public domain databases exist, commercial pathway/network analysis programs have proprietary “knowledgebases” that are based on the content of hundreds of thousands of scientific publications. Some are manually curated, others use automatic text analysis on the full text of the articles. On top of these databases, these programs use various algorithms to generate networks containing your proteins of interest. Each algorithm gives a particular perspective of your data.

Because of my extensive experience with pathway analysis, a company recently commissioned me to perform a competitive analysis of the top three commercial pathway analysis programs (Ingenuity Pathway Analysis, MetaCore and Pathway Studio). In a feature-by-feature comparison,i found that none of them dominated the other two; each program had its advantages as well deficiencies. While some of these programs had convenient “one-click” analysis features, others offered additional powerful (but not necessarily obvious) capabilities. Some had apparently user-friendly features that turned out to distort the results. I also found that the p-values given by all three programs are too optimistic and require adjustments. Thus, a good understanding of the algorithms and statistical tests used by these programs is essential for a meaningful interpretation.

Just like any methodology in biological research, pathway/network analysis is best done by understanding the strengths and weaknesses of your tool(s). Ideally, the best approach to analyzing gene expression data seems to be an eclectic approach, combining public domain and commercial tools and then synthesizing their results. And, understanding the biological nature of the data enables the optimal use of such tools.