Posted in Data visualisation, python, Venn diagram

Using python codes to study unique or shared elements in Venn diagrams

As elaborated in my previous post, Venn diagrams are useful for identifying unique and shared genes between different treatment conditions. While Venny remains to be one of my favourite tools for plotting Venn diagrams, inputting lists of genes can be cumbersome when the gene lists are long (e.g. >1,000 genes) or when gene lists are provided as .csv or .txt files.

Here, I introduce how you can use python to identify the unique and overlapping genes. The operations, & for intersection, and | for union can be used. An example is shown below:

# Assume each alphabet as individual genes
# Treatment A and treatment B serve as variables

Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}

Total = list(Treatment_A | Treatment_B)

Intersection = list(Treatment_A & Treatment_B)

Treatment_A_no_B = list(Treatment_A - (Treatment_A & Treatment_B))

Treatment_B_no_A = list(Treatment_B - Treatment_A & Treatment_B)
# Output from above commands are as follows:
['A', 'B', 'C', 'D', 'E', 'F']
['B', 'C', 'D']
['E', 'F']

Based on the counts, we can then plot a proportionate Venn diagram using the following code:

from matplotlib_venn import venn2 
from matplotlib import pyplot as plt

venn2(subsets = (1, 2, 3), set_labels = ('Treatment A', 'Treatment B'))
plt.title("Treatment A vs Treatment B")
Output file after command

Edited on 5 July 2021, with contribution from Gabrielle 🙂 You can also plot a Venn diagram with 3 different comparisons as well using the command below:

import matplotlib.pyplot as plt
from matplotlib_venn import venn3

set1 = set(["A", "B", "C", "D"])
set2 = set(["B", "C", "D", "E", "F"])
set3 = set(["C", "D", "E","G","H","J"])

venn3([set1, set2, set3], ('Treatment_A', 'Treatment_B', 'Treatment_C'))
Output file after the command
# To obtain the elements in the shared and unique treatments
Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}
Treatment_C = {"C", "D", "E","G","H","J"}

Total = list(Treatment_A | Treatment_B | Treatment_C)

Intersect_all = Treatment_A & Treatment_B & Treatment_C

Treatment_A_no_BC = list(Treatment_A - (Treatment_B | Treatment_C))

Treatment_B_no_AC = list(Treatment_B - (Treatment_A | Treatment_C))

Treatment_C_no_AB = list(Treatment_C - (Treatment_A | Treatment_B))
# Output file
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J']
['C', 'D']
['G', 'H', 'J']

To use these commands for your application, simply replace the alphabets to your gene lists. Happy coding!

Posted in Data visualisation, gProfiler

gProfiler: A useful webtool to query multiple lists of differentially expressed genes

Manhattan plot showing the enriched pathways modulated by AS01B after day 1 post-vaccination (first dose) with the Hepatitis B inactivated vaccine by gProfiler. Different adjuvants, AS01B, AS01E, AS03, AS04 and Alum were tested. No DEGs detected in AS04 and Alum. Data based on Laurane et al., 2020, Science Translational Medicine.

Comparing multiple variables with microarray or RNAseq often generates different sets of differentially expressed genes (DEGs). To examine the similarities and differences in molecular pathways modulated by these sets of DEGs, tools such as Enrichr or DAVID can be used to investigate the individual pathways involved. A Venn diagram can then be used to extract the intersecting or unique pathways involved.

While such comparisons are easy to perform if you only have a few comparisons to make, this method of analysis can be time-consuming when the query size becomes larger. Therefore, in scenarios where multiple comparison of DEGs are required, I recommend using gProfiler for analysis of enriched pathways. One major strength of gProfiler is to be able to submit multiple query lists, by simply adding a “>” symbol between your gene query lists.

To provide more context, consider that you are comparing the day 1 responses between the different adjuvants, published by Laurane et al. After determining the DEGs for the different conditions based on a defined cutoff (Fold change > 1.3, adjusted p-value < 0.05 in this case), we can input the DEGs into gProfiler for the respective conditions. Once the “Run as multiquery” box is checked, gProfiler generates manhattan plots for all the different conditions, showing the enriched pathways that are detected across various databases such as GOMF, GOBP, GOCC, KEGG, Reactome, as well as their respective p-values (See top figure for the output). The corresponding p-values for the other conditions are reflected on the y-axis.

The p-values for all enriched pathways can also be analysed in a table form (see below). Indeed, consistent with the publication findings, AS01B and AS01E behave more similarly as compared to AS03, although most of the pathways among all three adjuvants are overlapping.

Pathway analysis with Reactome using gProfiler comparing pathways modulated by AS01B, AS01E and AS03 after 1 post-vaccination

Besides gene queries against the default databases, you can also use gProfiler to query against any other custom databases, by uploading the custom GMT file. In addition, there are other functions that gProfiler can perform. For instance, g:Convert can convert gene IDs into gene symbols. g:SNPense can also provide information of the SNP queries.

Overall, gProfiler is a useful webtool that allows quick assessment of enriched pathways across multiple variables. A limitation, however, is that the genes involved in the pathways are not shown. To circumvent this, the results from gProfiler can be complemented with Enrichr to identify the gene transcripts responsible for the enriched pathways.

Posted in Data visualisation, Volcano Plot

Genoppi: A new tool to plot volcano plots with annotated functions

Volcano plot plotted using Genoppi. Data is from Vahey et al., J Infect Dis., 2010, demonstrating the types of proteins that are most significantly induced in protected individuals receiving the RTS,S malaria vaccine. Functions are annotated from HGNC, ranked from greatest to lowest number of hits.

As elaborated in my previous post, volcano plots are a great way of visualising the differentially expressed genes (DEGs) that are most affected by a particular treatment (compared to untreated control). However, a limitation is that volcano plots do not directly annotate whether the most influential genes perform similar functions. Here, I introduce Genoppi, which is an open-source computational tool (R package source code: that allows plotting of interactive volcano plots with the corresponding gene functions derived from HGNC, GO or MSigDB (see example figure on top). In addition, by assigning a bait protein of interest, the tool is able to identify the interacting partners that are significant on the volcano plot. The interaction partners are compiled from InWeb_InBioMap, iRefIndex, or BioPlex which includes data from >40000 scientific articles. If interested to find out if SNPs (single-nucleotide polymorphisms) could play a role in your dataset, you may also assign the GWAS study from the NHGRI-EBI GWAS catalog to the dataset.

There are several other functions that Genoppi can do, but not elaborated in this blog post. Interested users may visit the details within the publication or in the guide provided in the website. Overall, while Genoppi is likely to be most utilised by scientists interested in proteomics and interactome research, my experience suggests that Genoppi can be potentially applied more broadly to transcriptomics analyses as well.

Posted in Data visualisation, Venn diagram

Venn diagrams: A basic method for comparing variables

Venn diagram showing the number of upregulated DEGs under ADE, HI-ADE and DENV conditions. Source: Chan et al., mSphere, 2019

Venn diagrams are a fun and easy way to visualise the unique and overlapping differentially expressed genes between variables in a dataset. In my experience, I find Venny to be the most useful. By copying and pasting gene lists from various comparisons, Venny allows quick identification of genes that are either shared or unique between comparisons.

Another way to draw Venn diagrams is to use the Venn diagram plotter, which allows drawing of correctly proportioned Venn diagrams. The size of the Venn diagram plotted will be proportional to the size of the gene list, allowing a visual representation of the extent of overlap between comparisons (see above for example).

Lastly, it is important to note that Venn diagrams are unable to assign gene rankings, so I would recommend doing a pathway enrichment for the different segments of the Venn diagram to prioritise the genes of interest. Alternatively, you may consider Venn-diaNet for such analysis.

Posted in Data visualisation, Pathway analysis

Analysing enriched pathways with Enrichr

Reactome pathways showing that increased ER stress response, sumoylation and cell cycle genes at baseline are associated with symptomatic responses to YF17D. Analysis performed with Enrichr tool. Source: Chan et al., Nature Medicine, 2019

Differentially expressed genes (DEGs) can be identified based on a pre-determined fold change and adjusted p-value cutoff. Some scientific questions can include:

  1. What are the functional roles of different DEGs and in what cellular processes do they participate?
  2. How are the DEGs regulated? Do they interact with each other to perform a common function?
  3. Are the identified DEGs also seen in other similar studies?

There are several tools that we can employ to answer some or all of these questions. In this entry, I describe the use of Enrichr, which is an integrative web-based and mobile software application with many gene-set libraries. Using a list of DEGs as data input, the Enrichr tool will query this list of DEGs against the multiple gene-set libraries, broadly categorised as: Transcription pathways, Pathways, Ontologies, Diseases/Drugs, Cell types and Miscellaneous. The Reactome database under the “Pathways” tab and the Gene Ontology Biological Pathways under the “Ontology” tab has been most useful to me, as my research interest is on host responses to virus or vaccines. However, if one is interested in epigenetics, ChIP-seq datasets from the Roadmap Epigenomics project is available to query against. The humanCyc database provides insights into the interactions of the DEGs with metabolic pathways, although I would also recommend Metaboanalyst to have a better understanding of the metabolic pathways involved.

After identifying the top pathway hits in the respective gene-set libraries, the next important feature to look out for is whether the pathways are significantly enriched. The statistical tests provided are :

  1. The Fisher exact test that calculates the p-value or adjusted p-value to determine if the pathway is statistically enriched.
  2. Z-score that calculates the deviation from the expected rank by the Fisher exact test
  3. Combined score that multiplies the log of the p-value computed with the Fisher exact test by the z-score.

By default, an adjusted p-value of <0.05 is considered to be significantly enriched. However, in some cases where the DEGs input list is small, it is common to see that the adjusted p-values will not reach statistical significance. In this case, Enrichr may not be the best tool for analysing small datasets, although it may still be used to provide some insights into the pathways that these genes are involved in. Alternatively, gene-set enrichment analysis (GSEA) could be more informative. In my experience, ranking of pathways by the combined score and filtering by adjusted p-value < 0.05 typically provides a better representation of the top key biological pathways involved. This is likely because the combined score considers both Z-score and p-value, both of which are important parameters to consider for calculating pathway enrichment.

Overall, Enrichr is a state-of-the-art gene set enrichment analysis, and I would highly recommend using Enrichr to understand the biological functions of your DEGs list. For Enrichr to be most informative, it is important to also note that the transcripts in your datasets are preferably not biased to any biological pathways. As such, Enrichr is most suitable for microarray and RNAseq datasets.