Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. The magnitude differences are usually represented as hue or colour intensity changes. However, if you want to include another parameter, this can be challenging. Imagine the scenario where you identified 5 Gene Ontology Biological Pathways (GOBP) which are significantly different between the infected and uninfected samples over a course of 3 days. To plot them on a graph, you can choose to negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you will need 2 figure panels to fully describe your pathway analysis. A more elegant way to display these results could thus be to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting data points in dots rather than rectangles in heatmaps is that you can alter the size of the dots to add another dimension to your data visualisation. For instance, in this specific example, you can choose to display the p-values to be proportional to the size of the dots and the hue of the dots to represent enrichment score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import as px

We then load a ‘test’ dataset from my desktop, into a format where columns will contain timepoints, pathway terms, negative logP values and combined scores. It is also a good habit to convert the timepoint to a “string” datatype so that the x-axis does not include the default time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")

Output is as follows:

01Defense Response To Virus3.942000e-2587.53124.404283
12Defense Response To Virus3.940000e-27875.31026.404283
23Defense Response To Virus5.000000e-022.0001.301030
31Defense Response To Symbiont2.256000e-2595.55524.646661
42Defense Response To Symbiont2.260000e-27955.55026.646661

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)

Output is a dotplot, where size is proportional to the -log p-value and the colour intensity. You can choose to customise your colours available at this website:

Because of the ability of the dot-plot to add another dimension of analysis, most pathway analysis are presented as dot-plots. However, I am sure there are other scenerios where dot plots can be appropriately used. Next time, if you decide to plot multiple heatmaps, do consider the possibility of using dot-plots as an alternative data visualisation tool!

Posted in Data visualisation, Pathway analysis

Analysing enriched pathways with Enrichr

Reactome pathways showing that increased ER stress response, sumoylation and cell cycle genes at baseline are associated with symptomatic responses to YF17D. Analysis performed with Enrichr tool. Source: Chan et al., Nature Medicine, 2019

Differentially expressed genes (DEGs) can be identified based on a pre-determined fold change and adjusted p-value cutoff. Some scientific questions can include:

  1. What are the functional roles of different DEGs and in what cellular processes do they participate?
  2. How are the DEGs regulated? Do they interact with each other to perform a common function?
  3. Are the identified DEGs also seen in other similar studies?

There are several tools that we can employ to answer some or all of these questions. In this entry, I describe the use of Enrichr, which is an integrative web-based and mobile software application with many gene-set libraries. Using a list of DEGs as data input, the Enrichr tool will query this list of DEGs against the multiple gene-set libraries, broadly categorised as: Transcription pathways, Pathways, Ontologies, Diseases/Drugs, Cell types and Miscellaneous. The Reactome database under the “Pathways” tab and the Gene Ontology Biological Pathways under the “Ontology” tab has been most useful to me, as my research interest is on host responses to virus or vaccines. However, if one is interested in epigenetics, ChIP-seq datasets from the Roadmap Epigenomics project is available to query against. The humanCyc database provides insights into the interactions of the DEGs with metabolic pathways, although I would also recommend Metaboanalyst to have a better understanding of the metabolic pathways involved.

After identifying the top pathway hits in the respective gene-set libraries, the next important feature to look out for is whether the pathways are significantly enriched. The statistical tests provided are :

  1. The Fisher exact test that calculates the p-value or adjusted p-value to determine if the pathway is statistically enriched.
  2. Z-score that calculates the deviation from the expected rank by the Fisher exact test
  3. Combined score that multiplies the log of the p-value computed with the Fisher exact test by the z-score.

By default, an adjusted p-value of <0.05 is considered to be significantly enriched. However, in some cases where the DEGs input list is small, it is common to see that the adjusted p-values will not reach statistical significance. In this case, Enrichr may not be the best tool for analysing small datasets, although it may still be used to provide some insights into the pathways that these genes are involved in. Alternatively, gene-set enrichment analysis (GSEA) could be more informative. In my experience, ranking of pathways by the combined score and filtering by adjusted p-value < 0.05 typically provides a better representation of the top key biological pathways involved. This is likely because the combined score considers both Z-score and p-value, both of which are important parameters to consider for calculating pathway enrichment.

Overall, Enrichr is a state-of-the-art gene set enrichment analysis, and I would highly recommend using Enrichr to understand the biological functions of your DEGs list. For Enrichr to be most informative, it is important to also note that the transcripts in your datasets are preferably not biased to any biological pathways. As such, Enrichr is most suitable for microarray and RNAseq datasets.