Posted in Data visualisation, Heatmaps, python

The art of plotting heatmaps

Bar charts provide precise measurements with standard deviation, but showing multiple bar charts to show trends for multiple genes or proteins is difficult to visualise. In such cases, a heatmap can be a useful alternative to visualise changes in multiple gene or protein expression across different conditions. In a heatmap, the variation in magnitude of expression is often visualised by changing the intensity or hue. For instance, a higher expression can be depicted by using the colour red, and a greater increase in expression by a darker intensity. On the other hand, downregulation can be depicted by using the colour blue (more colour-blind friendly than green), with a stronger intensity indicating greater downregulation. While plotting raw expression values in a heatmap is acceptable if the relative abundance between genes and proteins are similar, data normalisation is required when the baseline gene or protein abundance is different.

For instance, consider you want to plot expression levels of gene A and gene B in a single heatmap. Gene A has a baseline expression of 5 copies per cell whereas gene B has a much higher baseline expression of 50 copies per cell. Let’s assume the condition where during a virus infection, gene A expression is reduced to 2.5 copies/cell but gene B expression is increased by 2 fold to 100 copies/cell. In this case, due to the high baseline abundance of gene B, plotting the heatmap using gene abundance will mask the ability to visualise differential expression of gene A after infection.

To display the relative expression of both genes A and B in a heatmap, data normalisation is recommended. One of the most common strategies is to plot a heatmap based on log2 fold-change instead of abundance. In this specific case, the log2 fold-change for gene A and gene B after virus infection is -1 and 1 respectively, and these values can be nicely presented in a heatmap to depict the phenomena. Another common alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation converts each gene or protein across all conditions with mean = 0 and standard deviation = 1, which means that values above 0 are above average and values below 0 are below average. Hence, the Z-score transformation standardises the values between genes or proteins to the same mean and standard deviation, enabling users to easily compare expression values across genes and proteins. However, an important point to note is that Z-score can only be applied for values that are obtained within a single experiment.

To plot a heatmap, Microsoft excel (under the format cell column) or specialised software such as PRISM can be used. However, most of these tools do not perform clustering of heatmaps . From my experience, besides clustergrammer, python is one of the fastest and easiest way to plot a heatmap or clustergram. After loading the data, the heatmap can be quickly plotted with a single line of code, and with various kinds of customisation. The step by step commands and code descriptions are as follows:

# Import packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading data into python as dataframe and check that the dataset is properly loaded
# Gene symbol is assigned as the index column
df = pd.read_csv('/Users/kuanrongchan/Desktop/Blog.csv',index_col=0)
df.head()

Output file:

Gene_SymbolControl_1Control_2Control_3Condition_1_1Condition_1_2Condition_1_3Condition_2_1Condition_2_2Condition_2_3
HERC52.352.181.903.412.993.252.672.552.46
IFIT14.504.424.586.316.266.236.085.525.21
IFIT24.344.844.765.915.965.955.565.364.97
IFNA41.421.521.441.732.052.351.141.541.61
IFNA70.881.220.561.461.231.701.230.710.64
# Packages to plot clustergram
from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable
from mpl_toolkits.axes_grid1.colorbar import colorbar

# Plot clustergram
g = sns.clustermap(df, z_score=0, cmap="vlag", figsize=(10, 4), method='average', cbar_kws={"ticks":[-1.5,0,1.5]}) 

# Various customisations can be performed on the above code as follows:
# Z_score = 0 ; tells python to plot the Z-score values
# cmap = "vlag" ; indicates red-blue heatmap
# figsize ; indicates figure size dimensions
# method = 'average' ; default option
# robust=True ; can be added in the code to plot as robust quantiles instead of the extreme values
# cbar_kws={"ticks":[-1.5,0,1.5]} ; defines points to show in colour bar
# g.cax.set_visible(False) ; can be added to remove colour bar
# Add yticklabels=True ; can be added for big heatmaps to display all gene symbols 

Output:

At a single glance, you can quickly tell that condition 2 behaves similarly with the control condition, whereas condition 1 show a marked increase in expression of IFNA4, IFNA7, HERC5, IFIT1 and IFIT2. The replicates are also clustered together. As you may already appreciate, this kind of data presentation is particularly useful for analysing big datasets.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s