Posted in correlation, Data visualisation, python

## Correlation matrix – A useful tool to visualise relatedness between multiple variables

Correlations are a great way to show if the variables are associated with each other. The strength of correlation is often represented by the correlation coefficient, where values range between -1.0 and 1.0. A positive value indicates positive association while a negative value indicates negative association. While there are no fixed definition, a correlation value larger than 0.7 or smaller than -0.7 is usually considered as a strong correlation. To quickly visualise the correlation between multiple variables, correlation matrixes can be used. In my opinion, the Python programming language allows individuals to most quickly plot these matrices, which is what I will elaborate in this blog entry.

First, we load the dataset and packages into python, where I have assigned headers to be variables (ab30d, ab60d, ab90d and tcell7d) and the first column to be the subject ID column. We will assign this dataframe as “test_1”

# Import packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import data with 10 subjects into python. In this case only the top 2 rows will be shown in the output

Output file is as follows:

Next, we can visualise the correlation coefficient across all variables using the command:

corrM = test_1.corr()
corrM

Output file showing all the correlation coefficient values is as follows:

We can also determine the p-values of the correlation by using the commands, as follows:

from scipy import stats
from scipy.stats import pearsonr
pvals = pd.DataFrame([[pearsonr(test_1[c], test_1[y])[1] for y in test_1.columns] for c in test_1.columns],
columns=test_1.columns, index=test_1.columns)
pvals

Output file is as follows:

Finally, to plot the correlation matrix with the correlation coefficient values within, we execute the following command:

corr = test_1.corr()
with sns.axes_style("white"):
f, ax = plt.subplots(figsize=(7, 5))

Output file is a beautiful correlation matrix:

At one glance, you can quickly tell which of the variables are positively and which are negatively correlated. Depending on your correlation matrix size, you can tailor the size of the output diagram by changing the figsize dimensions within the code. Correlation matrixes can also be complemented with pair plots, which I will elaborate more in my future blog entry.

Posted in Data visualisation, Heatmaps, python

## The art of plotting heatmaps

Bar charts provide precise measurements with standard deviation, but showing multiple bar charts to show trends for multiple genes or proteins is difficult to visualise. In such cases, a heatmap can be a useful alternative to visualise changes in multiple gene or protein expression across different conditions. In a heatmap, the variation in magnitude of expression is often visualised by changing the intensity or hue. For instance, a higher expression can be depicted by using the colour red, and a greater increase in expression by a darker intensity. On the other hand, downregulation can be depicted by using the colour blue (more colour-blind friendly than green), with a stronger intensity indicating greater downregulation. While plotting raw expression values in a heatmap is acceptable if the relative abundance between genes and proteins are similar, data normalisation is required when the baseline gene or protein abundance is different.

For instance, consider you want to plot expression levels of gene A and gene B in a single heatmap. Gene A has a baseline expression of 5 copies per cell whereas gene B has a much higher baseline expression of 50 copies per cell. Let’s assume the condition where during a virus infection, gene A expression is reduced to 2.5 copies/cell but gene B expression is increased by 2 fold to 100 copies/cell. In this case, due to the high baseline abundance of gene B, plotting the heatmap using gene abundance will mask the ability to visualise differential expression of gene A after infection.

To display the relative expression of both genes A and B in a heatmap, data normalisation is recommended. One of the most common strategies is to plot a heatmap based on log2 fold-change instead of abundance. In this specific case, the log2 fold-change for gene A and gene B after virus infection is -1 and 1 respectively, and these values can be nicely presented in a heatmap to depict the phenomena. Another common alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation converts each gene or protein across all conditions with mean = 0 and standard deviation = 1, which means that values above 0 are above average and values below 0 are below average. Hence, the Z-score transformation standardises the values between genes or proteins to the same mean and standard deviation, enabling users to easily compare expression values across genes and proteins. However, an important point to note is that Z-score can only be applied for values that are obtained within a single experiment.

To plot a heatmap, Microsoft excel (under the format cell column) or specialised software such as PRISM can be used. However, most of these tools do not perform clustering of heatmaps . From my experience, besides clustergrammer, python is one of the fastest and easiest way to plot a heatmap or clustergram. After loading the data, the heatmap can be quickly plotted with a single line of code, and with various kinds of customisation. The step by step commands and code descriptions are as follows:

# Import packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Gene symbol is assigned as the index column

Output file:

# Packages to plot clustergram
from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable
from mpl_toolkits.axes_grid1.colorbar import colorbar

# Plot clustergram
g = sns.clustermap(df, z_score=0, cmap="vlag", figsize=(10, 4), method='average', cbar_kws={"ticks":[-1.5,0,1.5]})

# Various customisations can be performed on the above code as follows:
# Z_score = 0 ; tells python to plot the Z-score values
# cmap = "vlag" ; indicates red-blue heatmap
# figsize ; indicates figure size dimensions
# method = 'average' ; default option
# robust=True ; can be added in the code to plot as robust quantiles instead of the extreme values
# cbar_kws={"ticks":[-1.5,0,1.5]} ; defines points to show in colour bar
# g.cax.set_visible(False) ; can be added to remove colour bar
# Add yticklabels=True ; can be added for big heatmaps to display all gene symbols

Output:

At a single glance, you can quickly tell that condition 2 behaves similarly with the control condition, whereas condition 1 show a marked increase in expression of IFNA4, IFNA7, HERC5, IFIT1 and IFIT2. The replicates are also clustered together. As you may already appreciate, this kind of data presentation is particularly useful for analysing big datasets.

Posted in Data visualisation, python, Venn diagram

## Using python codes to study unique or shared elements in Venn diagrams

As elaborated in my previous post, Venn diagrams are useful for identifying unique and shared genes between different treatment conditions. While Venny remains to be one of my favourite tools for plotting Venn diagrams, inputting lists of genes can be cumbersome when the gene lists are long (e.g. >1,000 genes) or when gene lists are provided as .csv or .txt files.

Here, I introduce how you can use python to identify the unique and overlapping genes. The operations, & for intersection, and | for union can be used. An example is shown below:

# Assume each alphabet as individual genes
# Treatment A and treatment B serve as variables

Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}

Total = list(Treatment_A | Treatment_B)
print(sorted(Total))

Intersection = list(Treatment_A & Treatment_B)
print(sorted(Intersection))

Treatment_A_no_B = list(Treatment_A - (Treatment_A & Treatment_B))
print(sorted(Treatment_A_no_B))

Treatment_B_no_A = list(Treatment_B - Treatment_A & Treatment_B)
print(sorted(Treatment_B_no_A))
# Output from above commands are as follows:
['A', 'B', 'C', 'D', 'E', 'F']
['B', 'C', 'D']
['A']
['E', 'F']

Based on the counts, we can then plot a proportionate Venn diagram using the following code:

from matplotlib_venn import venn2
from matplotlib import pyplot as plt

venn2(subsets = (1, 2, 3), set_labels = ('Treatment A', 'Treatment B'))
plt.title("Treatment A vs Treatment B")
plt.show()

Edited on 5 July 2021, with contribution from Gabrielle đź™‚ You can also plot a Venn diagram with 3 different comparisons as well using the command below:

import matplotlib.pyplot as plt
from matplotlib_venn import venn3

set1 = set(["A", "B", "C", "D"])
set2 = set(["B", "C", "D", "E", "F"])
set3 = set(["C", "D", "E","G","H","J"])

venn3([set1, set2, set3], ('Treatment_A', 'Treatment_B', 'Treatment_C'))

plt.show()
# To obtain the elements in the shared and unique treatments
Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}
Treatment_C = {"C", "D", "E","G","H","J"}

Total = list(Treatment_A | Treatment_B | Treatment_C)
print(sorted(Total))

Intersect_all = Treatment_A & Treatment_B & Treatment_C
print(sorted(Intersect_all))

Treatment_A_no_BC = list(Treatment_A - (Treatment_B | Treatment_C))
print(sorted(Treatment_A_no_BC))

Treatment_B_no_AC = list(Treatment_B - (Treatment_A | Treatment_C))
print(sorted(Treatment_B_no_AC))

Treatment_C_no_AB = list(Treatment_C - (Treatment_A | Treatment_B))
print(sorted(Treatment_C_no_AB))
# Output file
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J']
['C', 'D']
['A']
['F']
['G', 'H', 'J']

To use these commands for your application, simply replace the alphabets to your gene lists. Happy coding!

Posted in Data visualisation, gProfiler

## gProfiler: A useful webtool to query multiple lists of differentially expressed genes

Comparing multiple variables with microarray or RNAseq often generates different sets of differentially expressed genes (DEGs). To examine the similarities and differences in molecular pathways modulated by these sets of DEGs, tools such as Enrichr or DAVID can be used to investigate the individual pathways involved. A Venn diagram can then be used to extract the intersecting or unique pathways involved.

While such comparisons are easy to perform if you only have a few comparisons to make, this method of analysis can be time-consuming when the query size becomes larger. Therefore, in scenarios where multiple comparison of DEGs are required, I recommend using gProfiler for analysis of enriched pathways. One major strength of gProfiler is to be able to submit multiple query lists, by simply adding a â€ś>â€ť symbol between your gene query lists.

To provide more context, consider that you are comparing the day 1 responses between the different adjuvants, published by Laurane et al. After determining the DEGs for the different conditions based on a defined cutoff (Fold change > 1.3, adjusted p-value < 0.05 in this case), we can input the DEGs into gProfiler for the respective conditions. Once the â€śRun as multiqueryâ€ť box is checked, gProfiler generates manhattan plots for all the different conditions, showing the enriched pathways that are detected across various databases such as GOMF, GOBP, GOCC, KEGG, Reactome, as well as their respective p-values (See top figure for the output). The corresponding p-values for the other conditions are reflected on the y-axis.

The p-values for all enriched pathways can also be analysed in a table form (see below). Indeed, consistent with the publication findings, AS01B and AS01E behave more similarly as compared to AS03, although most of the pathways among all three adjuvants are overlapping.

Besides gene queries against the default databases, you can also use gProfiler to query against any other custom databases, by uploading the custom GMT file. In addition, there are other functions that gProfiler can perform. For instance, g:Convert can convert gene IDs into gene symbols. g:SNPense can also provide information of the SNP queries.

Overall, gProfiler is a useful webtool that allows quick assessment of enriched pathways across multiple variables. A limitation, however, is that the genes involved in the pathways are not shown. To circumvent this, the results from gProfiler can be complemented with Enrichr to identify the gene transcripts responsible for the enriched pathways.

Posted in Data visualisation, KEA3

## KEA3: A web-based tool to predict involvement of upstream kinases based on a list of gene or protein queries.

Protein kinases catalyze the transfer of a phosphate group from ATP to other proteinsâ€™ threonine, serine, or tyrosine residues. This addition of phosphate group to a protein can influence substrate protein activity, stability, localization, and interactions with other molecules. While kinases can be suitably targeted by drugs, characterization of the cell kinome is not easy, as intercellular staining with phospho-specific antibodies is required.

A plausible solution is to leverage on the transcriptomic data to down-select kinase candidates for further validation. Here, I introduce Kinase Enrichment Analysis 3 (KEA3) developed by MV Kuleshov et al., which is a webserver application that predicts upstream kinases based on a list of gene or protein queries.

The KEA3 background database contains measured and predicted kinase-substrate interactions (KSI), kinase-protein interactions (KPI), and interactions supported by co-expression and co-occurrence data. By integrating KSIs and KPIs across data sources, KEA3 produces a composite ranking that improves the recovery of the expected kinases. In addition, the relationship between the top predicted kinase are also displayed in an interactive map.

I tested the ability for KEA3 to evaluate the possible kinases involved in the host transcriptomic responses to the YF-17D vaccine published by Querec et al., Nature Immunology, 2009. Taking the up-regulated differentially expressed genes at day 7 post-YF17D administration as the query list, the top 10 kinase hits are displayed in the figure at the top of this post. Notably, these kinases appear to be highly interconnected and the predicted involvement of EIF2AK2-JAK1-JAK2-TYK axis suggests the involvement of these kinases in triggering type-I interferon responses. This finding is consistent with previous studies showing that YF-17D induces strong interferon and antiviral responses.

Overall, KEA3 is a user-friendly tool that allows users to quickly predict the upstream kinases involved, based on a list of proteins or genes. While an experimental validation will be needed to confirm the involvement of these predicted kinases, the tool provides an informed prediction on the kinases involved that can be used for future studies.

Posted in Data visualisation, Volcano Plot

## Genoppi: A new tool to plot volcano plots with annotated functions

As elaborated in my previous post, volcano plots are a great way of visualising the differentially expressed genes (DEGs) that are most affected by a particular treatment (compared to untreated control). However, a limitation is that volcano plots do not directly annotate whether the most influential genes perform similar functions. Here, I introduce Genoppi, which is an open-source computational tool (R package source code:Â github.com/lagelab/Genoppi) that allows plotting of interactive volcano plots with the corresponding gene functions derived from HGNC, GO or MSigDB (see example figure on top). In addition, by assigning a bait protein of interest, the tool is able to identify the interacting partners that are significant on the volcano plot. The interaction partners are compiled from InWeb_InBioMap, iRefIndex, or BioPlex which includes data from >40000 scientific articles. If interested to find out if SNPs (single-nucleotide polymorphisms) could play a role in your dataset, you may also assign the GWAS study from the NHGRI-EBI GWAS catalog to the dataset.

There are several other functions that Genoppi can do, but not elaborated in this blog post. Interested users may visit the details within the publication or in the guide provided in the website. Overall, while Genoppi is likely to be most utilised by scientists interested in proteomics and interactome research, my experience suggests that Genoppi can be potentially applied more broadly to transcriptomics analyses as well.

Posted in Data visualisation, Venn diagram

## Venn diagrams: A basic method for comparing variables

Venn diagrams are a fun and easy way to visualise the unique and overlapping differentially expressed genes between variables in a dataset. In my experience, I find Venny to be the most useful. By copying and pasting gene lists from various comparisons, Venny allows quick identification of genes that are either shared or unique between comparisons.

Another way to draw Venn diagrams is to use the Venn diagram plotter, which allows drawing of correctly proportioned Venn diagrams. The size of the Venn diagram plotted will be proportional to the size of the gene list, allowing a visual representation of the extent of overlap between comparisons (see above for example).

Lastly, it is important to note that Venn diagrams are unable to assign gene rankings, so I would recommend doing a pathway enrichment for the different segments of the Venn diagram to prioritise the genes of interest. Alternatively, you may consider Venn-diaNet for such analysis.

Posted in Data visualisation, REVIGO

## REVIGO: A useful tool to summarise long lists of Gene Ontology terms

After Enrichr analysis, you could be faced with a scenario where you have >30 significant Gene Ontology (GO) pathway hits. Presenting all of these pathways in a figure panel could be potentially challenging, or even distracting if many of these significant hits belong to similar biological pathways. In this entry, I recommend the use of REVIGO, which is a Web server that summarises long lists of GO terms into a representative subset of terms. This clustering is based on an algorithm that relies on semantic similarity measures. The non-redundant GO terms are also visualised in a scatterplot, interactive graph, treemap or tag cloud to ascertain their degree of similarity or differences. As a result, REVIGO mitigates the problem of having long lists of significant GO terms belonging to highly similar pathways, allowing you to focus on unique GO pathways that are influenced by the treatment conditions. For instance, consider the following example where human subjects given the yellow fever live-attenuated vaccine results in upregulated DEGs at day 2 post-vaccination (Jue Hou et al., JI, 2017). The list of GO pathways derived from Enrichr based on adjusted p-value < 0.05 are as follows:

This is a long redundant list of 30 GOBP pathways that is difficult to present in a scientific publication. It is also incorrect to conveniently remove significant pathways from this list without proper justification. In this case, REVIGO provides one of the best solutions to resolve these issues. By putting all these significant GO terms into REVIGO, the program summarises this long list of pathways into a smaller list of 16 non-redundant pathways:

Indeed, the scatterplot (see diagram on top) depicts that this shortlisted list of GO terms are non-overlapping, and indicate that these pathways will likely be modulated by different sets of DEGs. This allows the researcher to report on the restricted list of GO terms, and their associated DEGs more concisely, without losing biological meaning.

Overall, REVIGO is the tool of choice for users who wish to be able to quickly summarise their long list of GO terms, to better understand the distinct pathways that are differentially modulated by various infection or treatment conditions.

Posted in Clustergrammer, Data visualisation

## Clustergrammer: A great online tool for plotting clustergrams and heatmaps

A clustergram or a heatmap is one of several techniques that can directly visualise data without the need for dimensionality reduction. As clustergrams are easy to interpret, they are widely used to visualise biological data in print publications. Based on similarities and differences in gene expression patterns, clustergrams can also allow direct visualisation of clusters.

In this entry, I will introduce Clustergrammer, which is a user-friendly webtool for plotting clustergrams. The loading of the data into Clustergrammer can be summarised in 3 basic steps:

1. Normalise the gene expression data by performing a Z score transformation. This ensures that the grand mean of each gene will be centralised at value of 0, with standard deviation of 1.
2. Make sure that the samples are arranged in columns and the genes are arranged in rows. I recommend ordering the samples in the same way as how you would want your data to be published (e.g. controls on the extreme left and the other samples on the right), as proper ordering of the variables allows Clustergrammer to perform supervised clustering. Finally, if you have multiple conditions, you may assign the clusters beforehand by inserting additional rows at the top. You may also consider adding additional columns on the left to assign genes that perform similar functions (see detailed instructions within website).
3. Save file in .txt format and upload file in Clustergrammer.

By default, Clustergrammer performs an unsupervised clustering on both rows and columns, and clusters can be visualised by the small arrowheads at the bottom and right of the heatmap. A single-click on the arrowhead reveals the genes within the cluster, allowing you to query their functions directly in Enrichr. A double-click allows you to zoom into the heatmap within the cluster. To further examine the expression levels at the individual level, you can move your mouse cursor within the heatmap and use the mouse scroll to zoom in or zoom out.

For supervised clustering, you can choose to arrange the rows and columns according to the sample order originally assigned. The sidebar is located at the top left hand side of the website. If you have pre-assigned your clusters by adding additional rows, you may choose to click on the category you have classified.

Finally, to determine the relatedness between the different conditions, Clustergrammer also plots the co-expression matrix. The applications of Clustergrammer are not just limited to analysing gene expression studies, but can be extended to proteomics, metabolomics, virus-host interactions and cyTOF analyses. The ease of use, interactive interface and the ability to directly visualise gene expression patterns makes Clustergrammer my top choice in analysing omics datasets.

Posted in Data visualisation, Pathway analysis

## Analysing enriched pathways with Enrichr

Differentially expressed genes (DEGs) can be identified based on a pre-determined fold change and adjusted p-value cutoff. Some scientific questions can include:

1. What are the functional roles of different DEGs and in what cellular processes do they participate?
2. How are the DEGs regulated? Do they interact with each other to perform a common function?
3. Are the identified DEGs also seen in other similar studies?

There are several tools that we can employ to answer some or all of these questions. In this entry, I describe the use of Enrichr, which is an integrative web-based and mobile software application with many gene-set libraries. Using a list of DEGs as data input, the Enrichr tool will query this list of DEGs against the multiple gene-set libraries, broadly categorised as: Transcription pathways, Pathways, Ontologies, Diseases/Drugs, Cell types and Miscellaneous. The Reactome database under the â€śPathwaysâ€ť tab and the Gene Ontology Biological Pathways under the â€śOntologyâ€ť tab has been most useful to me, as my research interest is on host responses to virus or vaccines. However, if one is interested in epigenetics, ChIP-seq datasets from the Roadmap Epigenomics project is available to query against. The humanCyc database provides insights into the interactions of the DEGs with metabolic pathways, although I would also recommend Metaboanalyst to have a better understanding of the metabolic pathways involved.

After identifying the top pathway hits in the respective gene-set libraries, the next important feature to look out for is whether the pathways are significantly enriched. The statistical tests provided are :

1. The Fisher exact test that calculates the p-value or adjusted p-value to determine if the pathway is statistically enriched.
2. Z-score that calculates the deviation from the expected rank by the Fisher exact test
3. Combined score that multiplies the log of the p-value computed with the Fisher exact test by the z-score.

By default, an adjusted p-value of <0.05 is considered to be significantly enriched. However, in some cases where the DEGs input list is small, it is common to see that the adjusted p-values will not reach statistical significance. In this case, Enrichr may not be the best tool for analysing small datasets, although it may still be used to provide some insights into the pathways that these genes are involved in. Alternatively, gene-set enrichment analysis (GSEA) could be more informative. In my experience, ranking of pathways by the combined score and filtering by adjusted p-value < 0.05 typically provides a better representation of the top key biological pathways involved. This is likely because the combined score considers both Z-score and p-value, both of which are important parameters to consider for calculating pathway enrichment.

Overall, Enrichr is a state-of-the-art gene set enrichment analysis, and I would highly recommend using Enrichr to understand the biological functions of your DEGs list. For Enrichr to be most informative, it is important to also note that the transcripts in your datasets are preferably not biased to any biological pathways. As such, Enrichr is most suitable for microarray and RNAseq datasets.