Posted in Data visualisation

Visualising distributions with pie charts, donut plots and stacked graphs

Pie charts can be used to display the distribution of variables within datasets.

A pie chart uses slices to display the relative size or proportion of the different categories. In general, pie charts are easier to read if the pie has no more than 7 slices.

Consider a dataset where 10 patients are infected with virus X, 6 of whom experience no symptoms. The remaining 4 patients are symptomatic: 2 experience headaches, 1 has a fever and 1 has a rash. Here, you can simply use a pie chart to display the data:

I have used the Prism software to plot the pie chart. Note that you can use colour variants to emphasise the subjects with symptoms. It is a good habit to include the total counts when presenting pie charts, as these charts are meant to display only relative sizes. Note also that the pie chart should always start at the 12 o’clock position for easy visualisation.
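For readers without Prism, a similar chart can be drawn in Python. The following is a minimal Matplotlib sketch, not the method used for the figure above; the colours and the output file name are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt

# Counts from the example above: 10 patients infected with virus X
labels = ["No symptoms", "Headache", "Fever", "Rash"]
counts = [6, 2, 1, 1]
# Grey for the asymptomatic group, warmer colours to emphasise symptoms (arbitrary choice)
colours = ["lightgrey", "firebrick", "darkorange", "gold"]

fig, ax = plt.subplots(figsize=(4, 4))
ax.pie(
    counts,
    labels=labels,
    colors=colours,
    autopct="%1.0f%%",   # print the percentage inside each slice
    startangle=90,       # the first slice starts at the 12 o'clock position
    counterclock=False,  # slices proceed clockwise
)
# Include the total count, since the slices only show relative sizes
ax.set_title(f"Symptoms after virus X infection (n = {sum(counts)})")
fig.savefig("pie_chart.png", dpi=300, bbox_inches="tight")
```

The `startangle=90` and `counterclock=False` arguments together reproduce the 12 o’clock convention mentioned above.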

Sometimes, when pie charts are difficult to read, an alternative is a donut plot. The advantage of a donut plot is that you can provide additional information in the middle. However, it is important to make sure that the donut is not so thin that it becomes difficult to read. An example is shown below:
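A donut plot can also be sketched in Matplotlib by giving the pie wedges a width smaller than the radius. This is a minimal example under my own assumptions: the ring width of 0.4 and the file name are illustrative choices, not recommendations from a style guide.

```python
import matplotlib.pyplot as plt

labels = ["No symptoms", "Headache", "Fever", "Rash"]
counts = [6, 2, 1, 1]

fig, ax = plt.subplots(figsize=(4, 4))
ax.pie(
    counts,
    labels=labels,
    startangle=90,
    counterclock=False,
    # A width below 1 leaves a hole in the middle; 0.4 keeps the ring fairly thick
    wedgeprops={"width": 0.4, "edgecolor": "white"},
)
# Use the hole for additional information, here the total count
centre_text = ax.text(0, 0, f"n = {sum(counts)}", ha="center", va="center", fontsize=14)
fig.savefig("donut_plot.png", dpi=300, bbox_inches="tight")
```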

Finally, it is sometimes more useful to show the distribution of the data with a stacked bar chart. The stacked bar chart works well when you want to group variables with similar properties (in this case, you can group all the symptomatic patients together). An example is as follows:
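The grouping idea can be sketched in Matplotlib as follows. This is a minimal example that assumes two bars (asymptomatic and symptomatic), with the symptomatic bar split by symptom; the layout and file name are my own choices.

```python
import matplotlib.pyplot as plt

groups = ["Asymptomatic", "Symptomatic"]
# Each entry is one stack layer: counts for [asymptomatic bar, symptomatic bar]
layers = {
    "No symptoms": [6, 0],
    "Headache": [0, 2],
    "Fever": [0, 1],
    "Rash": [0, 1],
}

fig, ax = plt.subplots(figsize=(4, 4))
bottoms = [0, 0]
for label, values in layers.items():
    ax.bar(groups, values, bottom=bottoms, label=label)
    # Raise the baseline so that the next layer sits on top of this one
    bottoms = [b + v for b, v in zip(bottoms, values)]

ax.set_ylabel("Number of patients")
ax.legend(title="Symptom")
fig.savefig("stacked_bar.png", dpi=300, bbox_inches="tight")
```

After the loop, `bottoms` holds the total height of each bar, which is a quick way to check that the counts add up to 6 and 4.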

There are no fixed rules for deciding which of these charts is most suitable for data visualisation. When in doubt, there is no harm in plotting all of these graphs and then deciding later which is the most suitable for publications or presentations!

Posted in Dengue, Resource

Immunotranscriptomic profiling the acute and clearance phases of a human challenge dengue virus serotype 2 infection model

Differentially expressed genes at day 8 and 28 after rDEN2Δ30 infection. Source from Hanley JP et al., Nature Communications, 2021.

rDEN2Δ30 is a recombinant serotype 2 virus based on the American genotype 1974 Tonga DENV2 virus, which has been partially attenuated by deletion of 30 nucleotides in the 3′ untranslated region of the RNA genome (Δ30). rDEN2Δ30 infection is known to induce modest viremia in all flavivirus-naive subjects and a mild, transient non-pruritic rash in 80% of recipients.

rDEN2Δ30 infection could hence be a suitable model to evaluate molecular signatures responsible for asymptomatic or mild DENV-2 infection.

In this study by Hanley JP et al., RNA-seq was performed on whole blood collected from rDEN2Δ30-infected subjects at 0, 8, and 28 days post infection. rDEN2Δ30 induced reproducible but modest viremia, with a mild rash as the only clinically significant finding in DENV-naive subjects.

Principal component analysis revealed minimal overlap between baseline (day 0) and peak viremia (day 8). The day 28 data (post viremia) partially overlapped with the baseline (day 0) and acute (day 8) timepoints. Pathways enriched in the type I and type II interferon and antiviral responses were upregulated at day 8, whereas pathways controlling translational initiation were downregulated. NF-κB, IL-17 signaling pathways, apoptosis, toll-like receptor signaling, response to viruses, ribosomes, and defense responses were also differentially regulated at day 28.

Myeloid cells, including monocytes and activated dendritic cells, were significantly increased during acute infection and subsequently returned to baseline. In contrast, regulatory T cells (Tregs) were significantly decreased during the acute stage.

Gene ontology pathway analysis revealed that the viremia-tracking set of genes was enriched for both response to and regulation of type I and II interferon pathways, including JAK/STAT signaling. Genes encoding proteins that directly inhibit viral genome replication or are involved in protein ubiquitination and catabolism, especially the ISG15 pathway, tracked with viremia. Day 28 revealed more varied pathways, including protein ubiquitination, cell migration, cytoskeletal reorganization, and angiogenesis.

Baseline transcript signatures can potentially predict whether subjects will develop a rash after rDEN2Δ30 infection. Higher baseline expression of myeloid nuclear differentiation antigen (MNDA), of genes linked to cell surface-associated processes such as tetraspanin CD37 and integral membrane 2B (ITM2B), and of genes involved in autophagy (VMP1) was associated with protection from rash. These genes are mostly related to myeloid responses, membrane regulation, autophagy, K63 ubiquitination, and cell morphogenesis.

Transcriptomic signatures modulated by rDEN2Δ30 infection and severe dengue are distinct. Only one gene family, the guanylate-binding protein (GBP1/2) genes, was differentially regulated both in severe dengue and during mild rDEN2Δ30 infection.

Data are deposited in Gene Expression Omnibus under accession number GSE152255.

Posted in python

Executing Excel functions in Jupyter Notebooks using Mito


Data scientists often need to stratify their data using the “IF” function. For instance, how do immune responses differ between the young and the elderly after vaccination? In this case, you would select the young using a pre-defined age cut-off (less than 65 years old). Thereafter, the data can be sorted in descending order by fold change to determine the genes that are most differentially regulated. Another filter can then be applied to identify differentially expressed genes (DEGs), based on pre-determined fold change and p-value cut-offs. You may then wonder whether the dataset is influenced by gender differences, and how gender interacts with age to influence vaccine responses. This will require even more filtering of the dataset to probe into these interesting biological questions…

As you can imagine, there are many instances of filtering and comparisons to perform, especially when analysing a human dataset, due to large variations between individuals. Microsoft Excel is a useful tool for many of these filtering and aggregation functions. However, a major disadvantage is that this generates many Excel files and sheets, which can become difficult to manage. Python can perform these filtering functions easily, but this usually requires typing long blocks of code, which can be time-consuming. Hence, I recommend a Python package named Mito that can accelerate execution of these filtering functions.
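For comparison, a hand-written pandas version of the filtering steps described above might look like the sketch below. All table names, column names and values are hypothetical, and the fold change and p-value cut-offs are arbitrary; only the age cut-off of 65 comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical per-subject table; names and values are made up for illustration
subjects = pd.DataFrame({
    "subject": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "age": [24, 71, 38, 66, 59, 80],
    "gender": ["F", "M", "M", "F", "F", "M"],
    "response": [8.0, 2.0, 6.0, 3.0, 5.0, 1.5],
})

# Equivalent of an Excel IF: label each subject using the age cut-off of 65
subjects["age_group"] = np.where(subjects["age"] < 65, "young", "elderly")

# Mean response by age group and gender, similar to a pivot table
summary = subjects.groupby(["age_group", "gender"])["response"].mean()

# Hypothetical differential expression table
genes = pd.DataFrame({
    "gene": ["IFNG", "CXCL10", "RSAD2", "ISG15", "ACTB"],
    "fold_change": [2.5, 4.2, 3.1, 1.2, 1.0],
    "p_value": [0.040, 0.001, 0.020, 0.300, 0.900],
})

# Keep genes passing both cut-offs, then sort by fold change in descending order
degs = genes[(genes["fold_change"] >= 1.5) & (genes["p_value"] < 0.05)]
degs = degs.sort_values("fold_change", ascending=False)
print(degs["gene"].tolist())
```

Each additional question (another cut-off, another stratification) adds a few more lines of this kind, which is the typing effort that a spreadsheet-style interface aims to reduce.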

To install the package, execute the following command in the terminal to download the installer:

python -m pip install mitoinstaller

Ensure that you have Python 3.6 or above to utilise this package. Next, run the installer in the terminal:

python -m mitoinstaller install

Finally, launch JupyterLab and restart your Notebook Kernel:

python -m jupyter lab

To run the package in JupyterLab, import it and open a sheet:

import mitosheet
mitosheet.sheet()

Output is as such:

The features of a mitosheet include:

1. Import dataset
2. Export dataset
3. Add columns (you can insert formulas here)
4. Delete columns that are unnecessary
5. Pivot tables, which allow you to transpose and query the dataframe differently
6. Merge datasets (possible if you have 2 Excel files with a shared column)
7. Basic graph features, using Plotly
8. Filter data
9. Multiple sheets, with the option of inserting new sheets

Most excitingly, Mito will also generate the equivalent annotated Python code for each of these edits, meaning that you can easily reuse the commands for further analysis in Python. Overall, Mito is a user-friendly tool that allows you to perform Excel functions within Python. This makes the workflow much more efficient, as we do not have to import multiple Excel files into Jupyter notebooks. Critically, typing long blocks of code is not required, as the Python code is automatically generated as you work through the data.

Posted in python

Saving python files for scientific publications

High resolution image files are essential for scientific publications. I personally prefer to use Adobe Illustrator to make my scientific figures, as the created files are in a vector-based format, which is resolution-independent. To export graphs made with Matplotlib, you can execute the code below after plotting your graph in Jupyter Notebooks:

plt.savefig('destination_path.eps', format='eps')

Saving the file directly in .eps format allows you to transfer your file directly from Jupyter Notebooks into Adobe Illustrator. Moreover, you can manipulate the files easily within Adobe Illustrator, without any loss in figure resolution. Hope you like this TGIF tip! 🙂
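For a complete picture, here is a minimal end-to-end sketch that draws a toy plot and saves it in several vector formats. The data and file names are placeholders. Setting the font type to 42 (TrueType) is an optional step that, in my understanding, tends to keep text editable in vector editors; treat it as an assumption to verify with your own software.

```python
import matplotlib.pyplot as plt

# Embed text as TrueType (type 42) fonts instead of the default type 3
plt.rcParams["ps.fonttype"] = 42
plt.rcParams["pdf.fonttype"] = 42

# Toy data for illustration only
fig, ax = plt.subplots(figsize=(3, 3))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Save the same figure in several vector formats; file names are placeholders
for ext in ("eps", "pdf", "svg"):
    fig.savefig(f"figure1.{ext}", format=ext, bbox_inches="tight")
```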

Posted in Dengue, Resource

Increased adaptive immune responses and proper feedback regulation protect against clinical dengue

Genes related to antigen presentation were significantly increased in the asymptomatic compared to the symptomatic dengue individuals. Manuscript by E Simon-Lorière et al., Science Translational Medicine, 2017.

Dengue infections can be asymptomatic, symptomatic, or occasionally progress to severe dengue, a life-threatening condition characterised by a cytokine storm, vascular leakage, and shock. However, the molecular and immunological mechanisms underlying asymptomatic dengue virus (DENV) infection remain largely unknown.

In the publication, E Simon-Lorière et al. recruited DENV-infected children in Cambodia. Nine individuals remained strictly asymptomatic at the time of inclusion and during the 10-day follow-up period. PBMCs from 8 asymptomatic DENV-1 viremic individuals and 25 symptomatic dengue patients were used for further gene expression analysis.

Asymptomatic individuals have an increase in the percentage of CD4+ T cells and a decrease in CD8+ T cells compared to symptomatic dengue individuals. However, CD14+ monocytes, Lin-CD11c+ dendritic cells, CD19+ B cells, and CD335+ natural killer cells are not significantly different between asymptomatic and symptomatic individuals.

Transcriptomic signatures were distinct between asymptomatic and symptomatic individuals. The top pathways that diverge the most between asymptomatic and clinical dengue individuals were related to immune processes. Notably, the transcriptomic differences cannot be explained by differences in viral load or immune status.

The innate immune responses were not significantly different between the asymptomatic and symptomatic individuals. Instead, the most significantly activated pathway in asymptomatic individuals was related to “nuclear factor of activated T-cells (NFAT) mediated regulation of immune response.” These genes include CIITA, CD74 and various human leukocyte antigen (HLA) genes, where their expression differences were also validated at the protein levels (See figure on top).

Protein kinase Cθ (PKCθ) signaling in T lymphocytes was also highly activated in asymptomatic viremic individuals. Genes upregulated included AKT3, SOS1, PAK1, and SLAMF6, as well as T-cell costimulatory pathways such as ICOS-ICOSL, and CD28 and CTLA4 signaling in cytotoxic T-cells.

In contrast, genes related to B-cell activation, differentiation and plasma cell development (BLIMP-1, IRF4) were downregulated in asymptomatic individuals. This finding is correlated with the reduction in antibody production in the asymptomatic individuals.

Data are deposited in Gene Expression Omnibus under accession number GSE100299.

Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. The magnitude differences are usually represented as hue or colour intensity changes. However, if you want to include another parameter, this can be challenging. Imagine the scenario where you identified 5 Gene Ontology Biological Pathways (GOBP) which are significantly different between the infected and uninfected samples over a course of 3 days. To plot them on a graph, you can choose to negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you will need 2 figure panels to fully describe your pathway analysis. A more elegant way to display these results could thus be to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting data points as dots rather than as rectangles in heatmaps is that you can alter the size of the dots to add another dimension to your data visualisation. For instance, in this specific example, you can make the size of the dots proportional to the negative log-transformed p-values and use the hue of the dots to represent the enrichment score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

We then load a ‘test’ dataset from my desktop, into a format where columns will contain timepoints, pathway terms, negative logP values and combined scores. It is also a good habit to convert the timepoint to a “string” datatype so that the x-axis does not include the default time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")

Output is as follows:

   timepoint  Term                          adj. p-value  Combined_score   neg_logP
0  1          Defense Response To Virus     3.942000e-25          87.531  24.404283
1  2          Defense Response To Virus     3.940000e-27         875.310  26.404283
2  3          Defense Response To Virus     5.000000e-02           2.000   1.301030
3  1          Defense Response To Symbiont  2.256000e-25          95.555  24.646661
4  2          Defense Response To Symbiont  2.260000e-27         955.550  26.646661
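If your own file only contains the adjusted p-values, the negative log-transformed column can be derived in one line. This is a small sketch using three p-values copied from the table above; the column name `p_value` is hypothetical, since the name used in the original file is not shown.

```python
import numpy as np
import pandas as pd

# Three adjusted p-values copied from the table above ("p_value" is a hypothetical name)
df = pd.DataFrame({"p_value": [3.942e-25, 5e-2, 2.256e-25]})

# Negative log10 transform: smaller p-values become larger, easier-to-plot numbers
df["neg_logP"] = -np.log10(df["p_value"])
print(df)
```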

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)
fig.show()


The output is a dot plot in which the size of each dot is proportional to the -log p-value and the colour intensity reflects the combined score. You can customise the colours using the colour scales available at this website:

Because the dot plot can add another dimension to the analysis, most pathway analyses are presented as dot plots. However, I am sure there are other scenarios where dot plots can be appropriately used. Next time you decide to plot multiple heatmaps, do consider using dot plots as an alternative data visualisation tool!
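For readers who prefer Seaborn, which was mentioned above as an alternative, a roughly equivalent static dot plot might look like the sketch below. The rows are copied from the table above; the dot size range, figure size and file name are my own arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Rows copied from the table above; timepoint is stored as text so the axis is categorical
df = pd.DataFrame({
    "timepoint": ["1", "2", "3", "1", "2"],
    "Term": ["Defense Response To Virus"] * 3 + ["Defense Response To Symbiont"] * 2,
    "Combined_score": [87.531, 875.310, 2.000, 95.555, 955.550],
    "neg_logP": [24.404283, 26.404283, 1.301030, 24.646661, 26.646661],
})

fig, ax = plt.subplots(figsize=(6, 3))
sns.scatterplot(
    data=df, x="timepoint", y="Term",
    size="neg_logP", sizes=(40, 400),      # larger dots for larger -log p-values
    hue="Combined_score", palette="Reds",  # colour intensity encodes the combined score
    ax=ax,
)
# Place the legend outside the plotting area so it does not cover the dots
sns.move_legend(ax, "upper left", bbox_to_anchor=(1.02, 1))
fig.savefig("dotplot_seaborn.png", dpi=300, bbox_inches="tight")
```

Unlike the Plotly Express version, this figure is static, so the exact values cannot be read by hovering over the dots.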

Posted in About me

Saying a big thank you!

Thank you all for reading my blog entries. My blog has now had more than 1,000 visitors and more than 2,000 views. Hopefully I will bring more content to my entries, and the blog will eventually be a useful resource for all of us!

Posted in Resource

Correlates of Vaccine-Induced Immunity

Figure showing how the innate and adaptive immune responses can interact synergistically after vaccination to confer protection against viral infection and diseases. It is critical to define which of these components are correlates or cocorrelates of protection. Drawing by BioRender.

There are generally 4 categories of immune functions that relate to protection:

Correlate: A specific immune response to a vaccine that is closely related to protection against infection, disease or other defined end point
Absolute correlate: A quantity of a specific immune response to a vaccine that always provides near 100% protection
Relative correlate: A quantity of a specific immune response to a vaccine that usually (not always) provides protection
Cocorrelate: A quantity of a specific immune response to a vaccine that is 1 of ≥2 correlates of protection, and that must be synergistic with other correlates
Surrogate: A quantified specific immune response to a vaccine that is not itself protective but that substitutes for the true (perhaps unknown) correlate

Some important pointers that I learnt from the article published by Stanley Plotkin, CID, 2008:

1. The correlate of protection induced by vaccination may not necessarily be the same correlate that operates to close off infection. An example of this principle is the measles vaccine. Titers >200 mIU/mL of antibody after vaccination are protective against infection, whereas titers between 120 and 200 mIU/mL protect against clinical signs of disease but not against infection. Titers <120 mIU/mL are not protective at all. Another consideration is the cellular immunity to measles, which is critical in recovery from disease, as CD8+ cells are needed to control viremia and consequent infection of organs. Another example is cytomegalovirus, where antibodies are a correlate of protection against infection, whereas T cell immunity is a correlate of protection against disease.

2. Correlates of protection may be either absolute or relative. Examples of absolute correlates (situations in which a certain level of response almost guarantees protection) include diphtheria, tetanus, measles, rubella and hepatitis A. While absolute correlation is highly desired, many correlates are relative. In these cases, although protection is usually conferred at a certain level of response, breakthrough infections are possible. An example is the influenza vaccine, where a hemagglutination-inhibition antibody titer of 1/40 is associated with 70% clinical efficacy.

3. While antibodies are often used as measures of correlates of protection, not all antibodies neutralise infections in the same way. An example is the meningococcal polysaccharide vaccines, which give notoriously poor protection in young children, although children do have significant ELISA antibody responses. Other functions, including opsonophagocytosis, ADCC and complement activation, could also be important for protection.

4. In some cases, antibodies are surrogates rather than true correlates of protection. This means that the antibodies could be indirectly related to the true correlate of protection. Examples provided were the rotavirus and varicella vaccines, where cell-mediated immunity is clearly required for protection against viral infection and disease.

5. Emerging evidence suggests the possibility of organ-specific correlates. Based on experimental studies, it appears that CD4+ cells are key to the prevention of brain pathology after measles and in helping CD8+ cells to close off West Nile virus CNS infection. More work will be needed to define correlates of protection that are organ-specific.

6. Correlates of immunity may differ between different age groups. An example is the influenza vaccine, where antibody production is critical to prevent primary influenza infection in the young, but CD4+ cells may be more important for immunologically experienced individuals undergoing heterosubtypic infection.

7. Cellular responses are increasingly recognised as correlates or cocorrelates of protection. Given that CD4+ cells must be present to help antibodies to develop, and CD8+ cells are needed for virus clearance, emerging evidence now suggests that cellular responses are critical in limiting viral pathogenesis and dissemination. However, more work will be needed to uncover the parameters that are essential for cell-mediated protection.

Posted in Data visualisation, pair plots, python

Visualising multidimensional data with pair plots

Pair plots can be effectively used to visualise the relationships between multiple variables, which allows you to zoom into the interesting aspects of the dataset for further analysis.

As discussed in my previous blog entry, correlation coefficients can measure the strength of the relationship between 2 variables. In general, a correlation coefficient of 0.5 and above is considered a strong correlation, 0.3 is moderate and 0.1 is weak. To quickly visualise the relatedness between multiple variables, a correlation matrix can be used. However, there are some limitations. For instance, the correlation matrix does not provide data on the steepness of the slope between variables, nor does it display the relationship if the variables are not linearly correlated. In addition, the correlation matrix cannot show how categorical variables potentially interact with the continuous variables.

Let me provide an example to illustrate my case. Suppose you want to understand whether flower sepal length, sepal width, petal length and petal width are correlated. Plotting a correlation matrix for these 4 variables will allow you to visualise the strength of the relationship between these continuous variables. However, if you have another categorical variable, for example flower species, and you are interested in how flower species interacts with all these different parameters, then merely plotting a correlation matrix is inadequate.

To get around this problem, pair plots can be used. Pair plots allow us to see the pairwise relationships between multiple continuous variables. In Python, you can even use colours to visualise how the categorical variables interact with the continuous variables. Surprisingly, plotting pair plots in Python is relatively straightforward. Let’s use the iris dataset from Plotly Express as a proof of concept.

We first load the required packages, the iris dataset from Plotly Express, and display all the variables from the dataset with the head() function:

# Load packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load iris dataset from Plotly Express and display the first rows
df = px.data.iris()
df.head()

The output looks like this, with sepal length, sepal width, petal length and petal width as continuous variables and flower species as the categorical variable:


To visualise the correlation between continuous variables, the pair plot can be quickly plotted with a single line of code. In addition, we can annotate the different flower species with different colours. Finally, rather than just plotting the same scatterplot on both sides of the matrix, we will also plot the lower half of the matrix with the best fit straight line so we can visualise the slope of the relationship. The code used is thus as follows:

g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot, scatter=False)  # add best-fit lines to the lower half of the matrix

Output is as follows:

Isn’t it cool that you can immediately see that the variables are strongly correlated with each other, and these parameters are dependent on the flower species? As you may already appreciate, the pair plot is a useful tool to visualise and analyse multi-dimensional datasets.

Posted in python

Converting strings to list in Python

Lists are used to store multiple items in a single variable. Grouping them together in a single list tells Python that these variables should be analysed as a group, rather than as individual entries. However, converting a string of items to a single list may not be an easy task. In the past, my solution was to copy the item list into a new notebook, and use the “find and replace” function to convert delimiters to “,”. Recently, I chanced upon an article that describes how you can elegantly do this task in a single line of code. Imagine you have a list of genes from EnrichR pathway analysis, and the genes are separated by semicolons (;). You can convert these genes to a list with a single-line command:

DEGs = "IFNG;IFNA14;CXCL10;RSAD2;ISG15"  # example semicolon-separated string
Gene_list = DEGs.split(";")

Output will then be the list of genes which you can use to query in other datasets:

['IFNG', 'IFNA14', 'CXCL10', 'RSAD2', 'ISG15']
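Real gene strings are often less tidy. As a small extension, the sketch below assumes a hypothetical input with stray spaces and a repeated gene, and cleans both while keeping the original order.

```python
# Hypothetical EnrichR-style gene string with stray spaces and a repeated gene
DEGs = "IFNG; IFNA14;CXCL10 ;RSAD2;ISG15;IFNG"

# Split on the delimiter, strip whitespace and skip empty entries
genes = [gene.strip() for gene in DEGs.split(";") if gene.strip()]

# Remove duplicates while keeping the original order
unique_genes = list(dict.fromkeys(genes))
print(unique_genes)  # ['IFNG', 'IFNA14', 'CXCL10', 'RSAD2', 'ISG15']
```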

Super cool!