Posted in python

How to use GSEApy for pathway enrichment analysis

GSEApy can execute Enrichr commands within the Python wrapper. Figure taken from Edward Chen et al., BMC Bioinformatics, 2013

We have used volcano plots, identified the differentially expressed genes (DEGs) and plotted in clustergrams to show the relative expression of these DEGs. The next logical step now will be to query the function of these DEGs. There are a myraid of web tools, such as Database for Annotation, Visualization and Integrated Discovery (DAVID), Ingenuity Pathway Analysis (IPA), Metacore, PandaOmics, Enrichr, GSEA, STRING, Panther, Reactome and KEGG that you can consider as query databases. In this blog post, we will focus on using Enrichr to query on the functionality and localisation of DEGs.

Enrichr is user-friendly, free to use and allows querying of gene lists against a repertoire of gene-set libraries. Moreover, there is GSEApy, which is a Python wrapper for Enrichr, allowing users to query the characteristics of your DEGs within Python. I strongly recommend using JupyterLab or Jupyter Notebook for visualising the output of GSEApy, which is a web-based interactive development environment for storing and executing Python codes. You may visit here for specific instructions on how you can install them on your computer.

If you have not downloaded GSEApy, you can download by entering this command:

pip install gseapy

We first load the required packages and run matplotlib inline using the following codes:

%matplotlib inline
%config InlineBackend.figure_format='retina' 
%load_ext autoreload
%autoreload 2
import csv
import numpy as np
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt

The advantage of using Enrichr is that users can query across multiple databases. To understand the gene set libraries that are available for data query, you can type in the following command:

import gseapy
names = gseapy.get_library_name()
print(names)

This produces an output file showing the 190 different databases (and possibly more in the future) that users can potentially query against:

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0). We will filter the upregulated and downregulated DEGs based on fold-change > 1.5, q-value < 0.05 and fold-change < -1.5, q-value < 0.05 respectively at day 1 post-vaccination using the commands as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
DEGs_up_1d = (df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index.tolist()
DEGs_down_1d = (df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index.tolist()

My favourite list of databases are Gene Ontology (GO) and Reactome. GO Biological Processes (GOBP), GO Molecular Functions (GOMF) and GO Cellular Components (GOCC) usually provide users a rough idea of what the DEGs do, and their localisation within the cell. To query the upregulated DEGs against all of these 4 databases, the following commands can be executed:

enr_GOBP_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['GO_Biological_Process_2021'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_GOBP_up',
                 cutoff=0.5 
                )

enr_GOMF_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['GO_Molecular_Function_2021'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_GOMF_up',
                 cutoff=0.5 
                )

enr_GOCC_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['GO_Cellular_Component_2021'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_GOCC_up',
                 cutoff=0.5 
                )

enr_Reactome_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['Reactome_2016'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_Reactome_up',
                 cutoff=0.5 
                )

These codes may look difficult to understand, so let’s break them down individually to understand the codes. Firstly, gp.enrichr calls Enrichr to query against the gene list that is assigned to DEGs_up_1d in this case. We then specify the gene sets to query against GOBP, GOMF, GOCC or Reactome, which is why the syntaxes are broken into 4 distinct segments. Since the data is from Humans, we assign the organism as Human and add a description. Next we assign the output path in ‘outdir.’ In this example, we specify the creation of a folder named “test” and store the output with the specified file name. Finally, we set a cutoff of p=0.5 for analysis.

To have a sensing of the output table within Jupyter Notebook, we can type this command as an example (showing only the top 5 significant pathways):

enr_GOBP_up.results.head(5)

Output is as follows:

Note that the “genes” column may look truncated within Jupyter Notebook, so you might have to go to your specified path from ‘outdir’ to find out the full genes list. But nonetheless, the table provides important statistics that allows users to have a quick sensing of the pathways that are enriched.

To show the results graphically, we can key in the following code:

from gseapy.plot import barplot, dotplot

barplot(enr_GOBP_up.res2d,title='GO Biological Processes seroneg day 1 (up)',color = 'r')

We import barplot from GSEApy to allow us to see the output graphically. The barplot command then plots the top 10 enriched molecular pathways with adjusted p-value < 0.05. Since the genes are upregulated, we assigned the colour as red (‘r’). Output is as follows:

At one glance, we can quickly ascertain that the top pathways induced by the Merck Ad5/HIV are related to innate immune responses and type I IFN response. To determine the enriched pathways in from the other databases, we can use the following code:

barplot(enr_GOBP_up.res2d,title='GO Biological Processes seroneg day 1 (up)',color = 'r')
barplot(enr_GOMF_up.res2d,title='GO Molecular Function seroneg day 1 (up)',color = 'r')
barplot(enr_GOCC_up.res2d,title='GO Cellular Component seroneg day 1 (up)',color = 'r')
barplot(enr_Reactome_up.res2d,title='Reactome seroneg day 1 (up)',color = 'r')

Output files are as follows:

When we query more databases, we are beginning to see a consensus, where the chemokines, interferon and inflammation pathways are most positively enriched.

Now we will query the pathways that are negatively enriched. The concept is similar with what we did for the upregulated DEGs. We can use the following codes:

enr_GOBP_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['GO_Biological_Process_2021'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_GOBP_down',
                 cutoff=0.5 
                )

enr_GOMF_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['GO_Molecular_Function_2021'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_GOMF_down',
                 cutoff=0.5 
                )

enr_GOCC_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['GO_Cellular_Component_2021'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_GOCC_down',
                 cutoff=0.5 
                )

enr_Reactome_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['Reactome_2016'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_Reactome_down',
                 cutoff=0.5 
                )

To visualise the downregulated pathways, we can enter the following codes:

barplot(enr_GOBP_down.res2d,title='GO Biological Processes seroneg day 1 (down)',color = 'b')
barplot(enr_GOMF_down.res2d,title='GO Molecular Function seroneg day 1 (down)',color = 'b')
barplot(enr_GOCC_down.res2d,title='GO Cellular Component seroneg day 1 (down)',color = 'b')
barplot(enr_Reactome_down.res2d,title='Reactome seroneg day 1 (down)',color = 'b')

Output is as follows:

A common theme emerges. We can quickly ascertain that protein translation and genes related to T-cell receptors are the most significantly downregulated pathways after 1 day post-vaccination.

These codes should help a lot in your pathway analysis!

Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. The magnitude differences are usually represented as hue or colour intensity changes. However, if you want to include another parameter, this can be challenging. Imagine the scenario where you identified 5 Gene Ontology Biological Pathways (GOBP) which are significantly different between the infected and uninfected samples over a course of 3 days. To plot them on a graph, you can choose to negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you will need 2 figure panels to fully describe your pathway analysis. A more elegant way to display these results could thus be to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting data points in dots rather than rectangles in heatmaps is that you can alter the size of the dots to add another dimension to your data visualisation. For instance, in this specific example, you can choose to display the p-values to be proportional to the size of the dots and the hue of the dots to represent enrichment score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

We then load a ‘test’ dataset from my desktop, into a format where columns will contain timepoints, pathway terms, negative logP values and combined scores. It is also a good habit to convert the timepoint to a “string” datatype so that the x-axis does not include the default time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")
df.head(5)

Output is as follows:

timepointTermadjusted_pvalCombined_scoreneg_logP
01Defense Response To Virus3.942000e-2587.53124.404283
12Defense Response To Virus3.940000e-27875.31026.404283
23Defense Response To Virus5.000000e-022.0001.301030
31Defense Response To Symbiont2.256000e-2595.55524.646661
42Defense Response To Symbiont2.260000e-27955.55026.646661

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)

fig.show()

Output is a dotplot, where size is proportional to the -log p-value and the colour intensity. You can choose to customise your colours available at this website:

Because of the ability of the dot-plot to add another dimension of analysis, most pathway analysis are presented as dot-plots. However, I am sure there are other scenerios where dot plots can be appropriately used. Next time, if you decide to plot multiple heatmaps, do consider the possibility of using dot-plots as an alternative data visualisation tool!

Posted in Data visualisation, gProfiler

gProfiler: A useful webtool to query multiple lists of differentially expressed genes

Manhattan plot showing the enriched pathways modulated by AS01B after day 1 post-vaccination (first dose) with the Hepatitis B inactivated vaccine by gProfiler. Different adjuvants, AS01B, AS01E, AS03, AS04 and Alum were tested. No DEGs detected in AS04 and Alum. Data based on Laurane et al., 2020, Science Translational Medicine.

Comparing multiple variables with microarray or RNAseq often generates different sets of differentially expressed genes (DEGs). To examine the similarities and differences in molecular pathways modulated by these sets of DEGs, tools such as Enrichr or DAVID can be used to investigate the individual pathways involved. A Venn diagram can then be used to extract the intersecting or unique pathways involved.

While such comparisons are easy to perform if you only have a few comparisons to make, this method of analysis can be time-consuming when the query size becomes larger. Therefore, in scenarios where multiple comparison of DEGs are required, I recommend using gProfiler for analysis of enriched pathways. One major strength of gProfiler is to be able to submit multiple query lists, by simply adding a “>” symbol between your gene query lists.

To provide more context, consider that you are comparing the day 1 responses between the different adjuvants, published by Laurane et al. After determining the DEGs for the different conditions based on a defined cutoff (Fold change > 1.3, adjusted p-value < 0.05 in this case), we can input the DEGs into gProfiler for the respective conditions. Once the “Run as multiquery” box is checked, gProfiler generates manhattan plots for all the different conditions, showing the enriched pathways that are detected across various databases such as GOMF, GOBP, GOCC, KEGG, Reactome, as well as their respective p-values (See top figure for the output). The corresponding p-values for the other conditions are reflected on the y-axis.

The p-values for all enriched pathways can also be analysed in a table form (see below). Indeed, consistent with the publication findings, AS01B and AS01E behave more similarly as compared to AS03, although most of the pathways among all three adjuvants are overlapping.

Pathway analysis with Reactome using gProfiler comparing pathways modulated by AS01B, AS01E and AS03 after 1 post-vaccination

Besides gene queries against the default databases, you can also use gProfiler to query against any other custom databases, by uploading the custom GMT file. In addition, there are other functions that gProfiler can perform. For instance, g:Convert can convert gene IDs into gene symbols. g:SNPense can also provide information of the SNP queries.

Overall, gProfiler is a useful webtool that allows quick assessment of enriched pathways across multiple variables. A limitation, however, is that the genes involved in the pathways are not shown. To circumvent this, the results from gProfiler can be complemented with Enrichr to identify the gene transcripts responsible for the enriched pathways.

Posted in Malaria, Resource

Expression of genes associated with immunoproteasome processing of major histocompatibility complex peptides is indicative of protection with adjuvanted RTS,S malaria vaccine

Figure highlighting that subjects with increased expression of genes involved in formation of the immunoproteasome complex after RTS,S vaccination are associated with increased protection from malaria challenge. Specific genes involved highlighted in red. Source from MT Vahey, JID, 2010.

The RTS,S is a subunit recombinant vaccine expressed in yeast that represents the central repeat and C-terminal portion of the Plasmodium falciparum circumsporozoite protein (CSP) covalently linked with the S antigen of hepatitis B virus. The overall protective efficacy of RTS,S/ASO1B in malaria-naive adults is ∼50%.

The differences in the spectrum of the protective responses to adjuvanted RTS,S, and the ability to do an experimental challenge with the mosquito-borne falciparum malaria, offers an opportunity to decipher the host genomic mechanisms involved in vaccine efficacy against malaria.

In this study by MT Vahey et al., 2010, 39 vaccine recipients were assessed at study entry, on the day of the third vaccination, at 24 h, 72 h, and 2 weeks after vaccination, and on day 5 after challenge. Of 39 vaccine recipients, 13 were protected and 26 were not.

Most DEGs were detected at day 1 post vaccination, with the majority of these transcripts associated with proinflammatory responses. Most of these responses resolve at day 3 post-vaccination.

After day 5 of malaria challenge, prediction analysis of microarray (PAMR) identified 393 genes that are differentially expressed in subjects who are protected and unprotected. Most of these genes are related to apoptosis and cell cycle.

At 2 weeks after the third vaccination but before malaria challenge, 32 genes belonging to the proteasome degradation pathway separated protected vs unprotected subjects. Examples include PSME2 (a component of the 11S regulator), and PSMA4, PSMB6, and PSMB9 which forms the immunoproteasome involved in processesing of peptides for presentation to the MHC complex.

This study may highlight the critical role of T-cells as correlate of protection against malaria. Data deposited at Gene Expression Omnibus, with accession number GSE18323.

Posted in Data visualisation, Volcano Plot

Genoppi: A new tool to plot volcano plots with annotated functions

Volcano plot plotted using Genoppi. Data is from Vahey et al., J Infect Dis., 2010, demonstrating the types of proteins that are most significantly induced in protected individuals receiving the RTS,S malaria vaccine. Functions are annotated from HGNC, ranked from greatest to lowest number of hits.

As elaborated in my previous post, volcano plots are a great way of visualising the differentially expressed genes (DEGs) that are most affected by a particular treatment (compared to untreated control). However, a limitation is that volcano plots do not directly annotate whether the most influential genes perform similar functions. Here, I introduce Genoppi, which is an open-source computational tool (R package source code: github.com/lagelab/Genoppi) that allows plotting of interactive volcano plots with the corresponding gene functions derived from HGNC, GO or MSigDB (see example figure on top). In addition, by assigning a bait protein of interest, the tool is able to identify the interacting partners that are significant on the volcano plot. The interaction partners are compiled from InWeb_InBioMap, iRefIndex, or BioPlex which includes data from >40000 scientific articles. If interested to find out if SNPs (single-nucleotide polymorphisms) could play a role in your dataset, you may also assign the GWAS study from the NHGRI-EBI GWAS catalog to the dataset.

There are several other functions that Genoppi can do, but not elaborated in this blog post. Interested users may visit the details within the publication or in the guide provided in the website. Overall, while Genoppi is likely to be most utilised by scientists interested in proteomics and interactome research, my experience suggests that Genoppi can be potentially applied more broadly to transcriptomics analyses as well.

Posted in Data visualisation, REVIGO

REVIGO: A useful tool to summarise long lists of Gene Ontology terms

Scatterplot produced by REVIGO, showing the relatedness between the different GO terms. Data based on day2 vs day 0 analysis in subjects vaccinated with YF17D, published by Jue Hou et al., JI, 2017.

After Enrichr analysis, you could be faced with a scenario where you have >30 significant Gene Ontology (GO) pathway hits. Presenting all of these pathways in a figure panel could be potentially challenging, or even distracting if many of these significant hits belong to similar biological pathways. In this entry, I recommend the use of REVIGO, which is a Web server that summarises long lists of GO terms into a representative subset of terms. This clustering is based on an algorithm that relies on semantic similarity measures. The non-redundant GO terms are also visualised in a scatterplot, interactive graph, treemap or tag cloud to ascertain their degree of similarity or differences. As a result, REVIGO mitigates the problem of having long lists of significant GO terms belonging to highly similar pathways, allowing you to focus on unique GO pathways that are influenced by the treatment conditions. For instance, consider the following example where human subjects given the yellow fever live-attenuated vaccine results in upregulated DEGs at day 2 post-vaccination (Jue Hou et al., JI, 2017). The list of GO pathways derived from Enrichr based on adjusted p-value < 0.05 are as follows:

GO:0071357cellular response to type I interferon 
GO:0060337type I interferon signaling pathway 
GO:1903901negative regulation of viral life cycle 
GO:2000112regulation of cellular macromolecule biosynthetic process 
GO:1903506regulation of nucleic acid-templated transcription 
GO:0045071negative regulation of viral genome replication 
GO:0045069regulation of viral genome replication 
GO:0035455response to interferon-alpha 
GO:0010468regulation of gene expression 
GO:0019221cytokine-mediated signaling pathway 
GO:0060333interferon-gamma-mediated signaling pathway 
GO:0097696STAT cascade 
GO:0034340response to type I interferon 
GO:0006355regulation of transcription, DNA-templated 
GO:0006974cellular response to DNA damage stimulus 
GO:0006400tRNA modification 
GO:0071346cellular response to interferon-gamma 
GO:0030488tRNA methylation 
GO:0035456response to interferon-beta 
GO:0006281DNA repair 
GO:0032446protein modification by small protein conjugation 
GO:0048525negative regulation of viral process 
GO:0002230positive regulation of defense response to virus by host 
GO:0016567protein ubiquitination 
GO:0050691regulation of defense response to virus by host 
GO:0001510RNA methylation 
GO:0032543mitochondrial translation 
GO:0007259JAK-STAT cascade 
GO:0001817regulation of cytokine production 
GO:0006415translational termination 

This is a long redundant list of 30 GOBP pathways that is difficult to present in a scientific publication. It is also incorrect to conveniently remove significant pathways from this list without proper justification. In this case, REVIGO provides one of the best solutions to resolve these issues. By putting all these significant GO terms into REVIGO, the program summarises this long list of pathways into a smaller list of 16 non-redundant pathways:

GO:0071357cellular response to type I interferon 
GO:0035455response to interferon-alpha 
GO:0097696STAT cascade 
GO:0034340response to type I interferon 
GO:0006355regulation of transcription, DNA-templated 
GO:0006974cellular response to DNA damage stimulus 
GO:0071346cellular response to interferon-gamma 
GO:0035456response to interferon-beta 
GO:0048525negative regulation of viral process 
GO:0002230positive regulation of defense response to virus by host 
GO:0016567protein ubiquitination 
GO:0001510RNA methylation 
GO:0032543mitochondrial translation 
GO:0007259JAK-STAT cascade 
GO:0001817regulation of cytokine production 
GO:0006415translational termination 

Indeed, the scatterplot (see diagram on top) depicts that this shortlisted list of GO terms are non-overlapping, and indicate that these pathways will likely be modulated by different sets of DEGs. This allows the researcher to report on the restricted list of GO terms, and their associated DEGs more concisely, without losing biological meaning.

Overall, REVIGO is the tool of choice for users who wish to be able to quickly summarise their long list of GO terms, to better understand the distinct pathways that are differentially modulated by various infection or treatment conditions.

Posted in Data visualisation, Pathway analysis

Analysing enriched pathways with Enrichr

Reactome pathways showing that increased ER stress response, sumoylation and cell cycle genes at baseline are associated with symptomatic responses to YF17D. Analysis performed with Enrichr tool. Source: Chan et al., Nature Medicine, 2019

Differentially expressed genes (DEGs) can be identified based on a pre-determined fold change and adjusted p-value cutoff. Some scientific questions can include:

  1. What are the functional roles of different DEGs and in what cellular processes do they participate?
  2. How are the DEGs regulated? Do they interact with each other to perform a common function?
  3. Are the identified DEGs also seen in other similar studies?

There are several tools that we can employ to answer some or all of these questions. In this entry, I describe the use of Enrichr, which is an integrative web-based and mobile software application with many gene-set libraries. Using a list of DEGs as data input, the Enrichr tool will query this list of DEGs against the multiple gene-set libraries, broadly categorised as: Transcription pathways, Pathways, Ontologies, Diseases/Drugs, Cell types and Miscellaneous. The Reactome database under the “Pathways” tab and the Gene Ontology Biological Pathways under the “Ontology” tab has been most useful to me, as my research interest is on host responses to virus or vaccines. However, if one is interested in epigenetics, ChIP-seq datasets from the Roadmap Epigenomics project is available to query against. The humanCyc database provides insights into the interactions of the DEGs with metabolic pathways, although I would also recommend Metaboanalyst to have a better understanding of the metabolic pathways involved.

After identifying the top pathway hits in the respective gene-set libraries, the next important feature to look out for is whether the pathways are significantly enriched. The statistical tests provided are :

  1. The Fisher exact test that calculates the p-value or adjusted p-value to determine if the pathway is statistically enriched.
  2. Z-score that calculates the deviation from the expected rank by the Fisher exact test
  3. Combined score that multiplies the log of the p-value computed with the Fisher exact test by the z-score.

By default, an adjusted p-value of <0.05 is considered to be significantly enriched. However, in some cases where the DEGs input list is small, it is common to see that the adjusted p-values will not reach statistical significance. In this case, Enrichr may not be the best tool for analysing small datasets, although it may still be used to provide some insights into the pathways that these genes are involved in. Alternatively, gene-set enrichment analysis (GSEA) could be more informative. In my experience, ranking of pathways by the combined score and filtering by adjusted p-value < 0.05 typically provides a better representation of the top key biological pathways involved. This is likely because the combined score considers both Z-score and p-value, both of which are important parameters to consider for calculating pathway enrichment.

Overall, Enrichr is a state-of-the-art gene set enrichment analysis, and I would highly recommend using Enrichr to understand the biological functions of your DEGs list. For Enrichr to be most informative, it is important to also note that the transcripts in your datasets are preferably not biased to any biological pathways. As such, Enrichr is most suitable for microarray and RNAseq datasets.