Posted in Data visualisation, Differentially expressed genes

Defining selection criteria to identify differentially expressed genes

Waterfall plot showing the number of DEGs in patients with severe COVID-19, with respect to day 0 with peak symptoms. Source: Ong EZ et al., eBioMedicine, 2021

Identifying the differentially expressed genes (DEGs) in large microarray or RNA-seq datasets is critical for understanding the molecular driving forces, or molecular biomarkers, behind biological phenotypes. For example, the reference point can be the control sample (the uninfected cells), and the gene expression levels of virus-infected cells can be compared against it to find the gene transcripts that are significantly modulated by virus infection. The inclusion of replicates (n >= 3) allows statistical comparisons to be made. To identify the DEGs between treated and control groups, the selection criteria are typically based on fold change and p-value.
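As a minimal sketch of this two-criteria selection, the snippet below filters a set of genes by both fold change and p-value. The gene names, expression means, and p-values are hypothetical, purely for illustration:

```python
# Hypothetical per-gene results: mean expression in control and infected
# samples, plus a p-value from a per-gene statistical test.
results = {
    # gene: (mean_control, mean_infected, p_value)
    "IFIT1": (10.5, 40.3, 0.001),
    "ISG15": (5.1, 15.3, 0.004),
    "ACTB": (100.0, 100.0, 0.92),
    "MX1": (8.0, 10.0, 0.03),
}

FC_CUTOFF = 2.0  # fold-change threshold (up or down)
P_CUTOFF = 0.05  # significance threshold

degs = []
for gene, (ctrl, infected, p) in results.items():
    fc = infected / ctrl  # fold change relative to the uninfected control
    # Keep genes that pass BOTH criteria, in either direction.
    if p < P_CUTOFF and (fc >= FC_CUTOFF or fc <= 1 / FC_CUTOFF):
        degs.append(gene)

print(sorted(degs))  # IFIT1 and ISG15 pass; MX1 is significant but below FC
```

Note that MX1 illustrates why both criteria matter: it is statistically significant (p = 0.03) but its 1.25-fold change falls below the cutoff.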

The fold-change cutoff is often selected arbitrarily to be either 1.5 or 2. This cutoff serves to remove genes with smaller differences, which are deemed biologically unimportant. However, depending on the treatment conditions and the research question, these default cutoffs may or may not be suitable. Consider a scenario where a fold-change cutoff of 2.0 identifies only 15 genes, whereas a cutoff of 1.5 identifies 200 DEGs. If the research question is to identify biomarkers associated with virus infection, a fold-change cutoff of 2 may be acceptable, as it ensures that only the genes most dramatically altered by infection are selected. However, if the question is to characterise the biological processes associated with infection, a cutoff of 2 would be too stringent and would leave large voids in the downstream analysis. An elegant example is shown by Dalman et al., 2012, who demonstrated that fold-change and statistical cutoffs can lead to erroneous interpretation of biological pathways.
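The sensitivity of the gene list to the cutoff choice is easy to demonstrate on simulated data. The sketch below draws hypothetical log2 fold changes for 1,000 genes and counts how many pass each cutoff; the numbers are illustrative only:

```python
import math
import random

random.seed(0)  # reproducible hypothetical data

# Simulated log2 fold changes for 1,000 hypothetical genes: most genes
# barely change, so the counts fall quickly as the cutoff tightens.
log2_fc = [random.gauss(0, 0.6) for _ in range(1000)]

def count_degs(log2_values, fc_cutoff):
    """Count genes whose absolute fold change meets the cutoff."""
    threshold = math.log2(fc_cutoff)  # work on the log2 scale
    return sum(1 for v in log2_values if abs(v) >= threshold)

n_loose = count_degs(log2_fc, 1.5)   # cutoff of 1.5
n_strict = count_degs(log2_fc, 2.0)  # cutoff of 2.0
print(n_loose, n_strict)  # the stricter cutoff always yields fewer genes
```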

Statistical significance is usually based on a p-value < 0.05. A Student's t-test can be used if the researcher is asking whether a particular gene is changed under the infection condition. However, if the research question is to investigate which transcripts across the whole dataset are differentially modulated by infection, then the t-test alone is over-simplified. This is because of the multiple-testing problem: the more inferences made on a dataset, the more likely it is that some genes appear significant by coincidence. Hence, to correct for multiple comparisons, approaches such as SAM (significance analysis of microarrays), the Bonferroni correction, or the Benjamini–Hochberg procedure can be used. However, as described in the previous section, it is critical to ensure that the statistical test used does not lead to high false-positive rates that may cause misinterpretation of the data.
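To make the multiple-testing correction concrete, here is a plain-Python implementation of the Benjamini–Hochberg procedure applied to a hypothetical list of raw per-gene p-values (the values are invented for illustration):

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR), in input order."""
    m = len(p_values)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from per-gene t-tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
adj = benjamini_hochberg(raw)
significant = [p for p in adj if p < 0.05]
print(adj)
```

With these invented inputs, several genes that look significant at a raw p < 0.05 (e.g. 0.039, 0.041, 0.042) no longer pass after adjustment, which is exactly the coincidental-significance problem the correction guards against.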

To represent upregulated and downregulated DEGs graphically, a waterfall plot can be used (See Figure at top). Alternatively, to visualise common or unique DEGs between treatment conditions, a Venn diagram can be used.
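A waterfall plot is essentially a bar chart of genes sorted by fold change, with upregulated and downregulated genes in contrasting colours. A minimal matplotlib sketch, using invented genes and log2 fold changes, might look like this:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt

# Hypothetical log2 fold changes for a handful of DEGs.
log2_fc = {"IFIT1": 3.1, "ISG15": 2.4, "MX1": 1.2, "CCL2": -1.5, "ALB": -2.8}

# Sort genes from most upregulated to most downregulated.
genes = sorted(log2_fc, key=log2_fc.get, reverse=True)
values = [log2_fc[g] for g in genes]
colors = ["firebrick" if v > 0 else "steelblue" for v in values]

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(genes, values, color=colors)
ax.axhline(0, color="black", linewidth=0.8)  # baseline between up and down
ax.set_ylabel("log2 fold change (infected vs control)")
fig.tight_layout()
fig.savefig("waterfall.png")
```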

With increasing concerns over reproducibility between microarray and RNA-seq datasets, a comprehensive comparison of microarray datasets revealed that combining a non-stringent p-value cutoff from the t-test with fold-change ranking was paramount in ensuring reproducibility between different microarray datasets (Shi et al., BMC Bioinformatics, 2008). This finding reinforces the case for using volcano plots to select DEGs, which I will cover in greater detail in subsequent posts.
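The "loose p-value filter, then rank by fold change" strategy can be sketched in a few lines. The per-gene statistics below are hypothetical:

```python
# Hypothetical per-gene statistics: (log2 fold change, raw t-test p-value).
stats = {
    "IFIT1": (3.1, 0.001),
    "ISG15": (2.4, 0.012),
    "MX1": (1.2, 0.030),
    "CCL2": (-1.5, 0.045),
    "ALB": (-0.2, 0.900),
    "ACTB": (0.1, 0.800),
}

P_CUTOFF = 0.05  # deliberately non-stringent significance filter

# Step 1: keep genes passing the loose p-value filter.
passing = [g for g, (fc, p) in stats.items() if p < P_CUTOFF]
# Step 2: rank the survivors by the magnitude of their fold change.
ranked = sorted(passing, key=lambda g: abs(stats[g][0]), reverse=True)
print(ranked)
```

Ranking by fold change after a permissive significance filter rewards large, reproducible effect sizes rather than small p-values alone, which is the behaviour the MAQC-style comparisons favoured.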