Posted in python

How to use GSEApy for pathway enrichment analysis

GSEApy can execute Enrichr queries from within Python. Figure taken from Edward Y. Chen et al., BMC Bioinformatics, 2013

We have used volcano plots to identify the differentially expressed genes (DEGs) and plotted clustergrams to show their relative expression. The next logical step is to query the function of these DEGs. There is a myriad of web tools, such as the Database for Annotation, Visualization and Integrated Discovery (DAVID), Ingenuity Pathway Analysis (IPA), MetaCore, PandaOmics, Enrichr, GSEA, STRING, PANTHER, Reactome and KEGG, that you can consider as query databases. In this blog post, we will focus on using Enrichr to query the functionality and localisation of DEGs.

Enrichr is user-friendly, free to use and allows querying of gene lists against a repertoire of gene-set libraries. Moreover, GSEApy provides a Python wrapper for Enrichr, allowing you to query the characteristics of your DEGs within Python. I strongly recommend visualising the output of GSEApy in JupyterLab or Jupyter Notebook, web-based interactive development environments for storing and executing Python code. You may visit here for specific instructions on how to install them on your computer.

If you have not installed GSEApy, you can do so by entering this command:

pip install gseapy

We first load the required packages and run matplotlib inline using the following codes:

%matplotlib inline
%config InlineBackend.figure_format='retina' 
%load_ext autoreload
%autoreload 2
import csv
import numpy as np
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt

The advantage of using Enrichr is that users can query across multiple databases. To understand the gene set libraries that are available for data query, you can type in the following command:

import gseapy
names = gseapy.get_library_name()
print(names)

This prints the 190 different gene-set libraries (and possibly more in the future) that users can query against:
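Since the full list is long, it can help to filter the library names for a keyword of interest. A minimal sketch, assuming the names list from the step above, to pull out the Gene Ontology libraries:

# Keep only the Gene Ontology gene-set libraries from the full list of names
go_libraries = [name for name in names if name.startswith('GO_')]
print(go_libraries)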

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed with the Partek Genomics Suite, can be found in GitHub. The fold change, ratio, p-value and adjusted p-value (q-value) are calculated with respect to baseline (timepoint = 0). We will filter the upregulated DEGs (fold-change > 1.5, q-value < 0.05) and downregulated DEGs (fold-change < -1.5, q-value < 0.05) at day 1 post-vaccination using the following commands:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
DEGs_up_1d = (df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index.tolist()
DEGs_down_1d = (df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index.tolist()

My favourite databases are Gene Ontology (GO) and Reactome. GO Biological Process (GOBP), GO Molecular Function (GOMF) and GO Cellular Component (GOCC) usually give users a rough idea of what the DEGs do and where they localise within the cell. To query the upregulated DEGs against all four of these databases, the following commands can be executed:

enr_GOBP_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['GO_Biological_Process_2021'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_GOBP_up',
                 cutoff=0.5 
                )

enr_GOMF_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['GO_Molecular_Function_2021'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_GOMF_up',
                 cutoff=0.5 
                )

enr_GOCC_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['GO_Cellular_Component_2021'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_GOCC_up',
                 cutoff=0.5 
                )

enr_Reactome_up = gp.enrichr(gene_list=DEGs_up_1d ,
                 gene_sets=['Reactome_2016'],
                 organism='Human', 
                 description='DEGs_up_1d',
                 outdir='test/enr_DEGs_Reactome_up',
                 cutoff=0.5 
                )

These commands may look difficult to understand, so let’s break them down. Firstly, gp.enrichr calls Enrichr to query the gene list, which in this case is DEGs_up_1d. We then specify the gene set to query against (GOBP, GOMF, GOCC or Reactome), which is why the command is repeated in four distinct segments. Since the data are from humans, we assign the organism as Human and add a description. Next, we assign the output path in ‘outdir’; in this example, we create a folder named “test” and store the output in the specified sub-folder. Finally, cutoff=0.5 sets the adjusted p-value cutoff used when plotting the enriched terms.

To get a sense of the output table within Jupyter Notebook, we can type this command as an example (showing only the top 5 enriched terms):

enr_GOBP_up.results.head(5)

Output is as follows:

Note that the “Genes” column may appear truncated within Jupyter Notebook, so you may have to go to the path specified in ‘outdir’ to find the full gene list. Nonetheless, the table provides the key statistics that allow users to get a quick sense of the pathways that are enriched.
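Alternatively, if you prefer to read the full gene list inside Jupyter Notebook, a hedged option is to relax the pandas display settings before printing the results table:

pd.set_option('display.max_colwidth', None)  # show the full Genes column instead of truncating it
enr_GOBP_up.results.head(5)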

To show the results graphically, we can key in the following code:

from gseapy.plot import barplot, dotplot

barplot(enr_GOBP_up.res2d,title='GO Biological Processes seroneg day 1 (up)',color = 'r')

We import barplot from GSEApy so that we can see the output graphically. The barplot command then plots the top 10 enriched pathways with adjusted p-value < 0.05. Since the genes are upregulated, we assign the colour red (‘r’). Output is as follows:

At one glance, we can quickly ascertain that the top pathways induced by the Merck Ad5/HIV vaccine are related to innate immune and type I IFN responses. To determine the enriched pathways from the other databases, we can use the following code:

barplot(enr_GOBP_up.res2d,title='GO Biological Processes seroneg day 1 (up)',color = 'r')
barplot(enr_GOMF_up.res2d,title='GO Molecular Function seroneg day 1 (up)',color = 'r')
barplot(enr_GOCC_up.res2d,title='GO Cellular Component seroneg day 1 (up)',color = 'r')
barplot(enr_Reactome_up.res2d,title='Reactome seroneg day 1 (up)',color = 'r')

Output files are as follows:

As we query more databases, a consensus begins to emerge: the chemokine, interferon and inflammation pathways are the most positively enriched.
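For an alternative view, GSEApy also provides the dotplot function that we imported earlier alongside barplot. As a hedged sketch (the exact arguments may differ slightly between GSEApy versions), a dot plot of the GOBP results can be drawn with:

# Dot plot of the enriched GOBP terms; cmap sets the colour scale used for the adjusted p-values
dotplot(enr_GOBP_up.res2d, title='GO Biological Processes seroneg day 1 (up)', cmap='viridis_r')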

Now we will query the pathways that are negatively enriched. The concept is similar to what we did for the upregulated DEGs. We can use the following commands:

enr_GOBP_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['GO_Biological_Process_2021'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_GOBP_down',
                 cutoff=0.5 
                )

enr_GOMF_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['GO_Molecular_Function_2021'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_GOMF_down',
                 cutoff=0.5 
                )

enr_GOCC_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['GO_Cellular_Component_2021'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_GOCC_down',
                 cutoff=0.5 
                )

enr_Reactome_down = gp.enrichr(gene_list=DEGs_down_1d ,
                 gene_sets=['Reactome_2016'],
                 organism='Human', 
                 description='DEGs_down_1d',
                 outdir='test/enr_DEGs_Reactome_down',
                 cutoff=0.5 
                )

To visualise the downregulated pathways, we can enter the following codes:

barplot(enr_GOBP_down.res2d,title='GO Biological Processes seroneg day 1 (down)',color = 'b')
barplot(enr_GOMF_down.res2d,title='GO Molecular Function seroneg day 1 (down)',color = 'b')
barplot(enr_GOCC_down.res2d,title='GO Cellular Component seroneg day 1 (down)',color = 'b')
barplot(enr_Reactome_down.res2d,title='Reactome seroneg day 1 (down)',color = 'b')

Output is as follows:

A common theme emerges: protein translation and T-cell receptor-related genes are the most significantly downregulated pathways at day 1 post-vaccination.

These codes should help a lot in your pathway analysis!

Posted in python

Python workflow for omics data analysis

Bioinformatic Python codes for volcano plot, DEG and heatmap analysis

Thank you all for the great interest in my blog. For the year end, I have decided to compile the blog posts I have made this year so we can revise what we have learnt before starting the next year with new content. If you have followed my posts, a lot of emphasis has been placed on the analysis of omics data, including volcano plots, DEG identification and heatmap analysis. I have thus re-written some of these posts on Medium, curating the code so that readers can implement it easily. Furthermore, a concrete example dataset is provided in GitHub to give context to the code used!

For Python codes on volcano plots, please click here.

For Python codes on DEGs, please click here.

For Python codes on clustergrams and heatmaps, please click here.

All of the above visualisation tools are important before doing any downstream analysis, including pathway enrichment analysis, interaction maps and gene networks, which I will cover more next year!

Posted in Data visualisation, python

Using Seaborn library to plot clustergrams

Unsupervised hierarchical clustering can also allow direct visualisation of clusters without the need for dimensionality reduction. The full article can be found here.

There are two common strategies for data normalisation. One way is to plot the heatmap or clustergram based on log2 fold-change instead of absolute abundance. The comparisons can be performed against time-point = 0 (baseline) for temporal studies, or against a control/placebo condition (for static studies). However, the disadvantage of this method is that the distribution of the baseline or control samples cannot be easily visualised.

Another alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation scales each gene or protein across all conditions to mean = 0 and standard deviation = 1, enabling users to easily compare expression values across multiple genes and proteins. Values above 0 mean the expression level is above the mean, whereas values below 0 mean the expression level is below the mean. However, the limitation is that Z-scores can only be compared within a single experiment. Hence, the choice of normalisation method is highly dependent on the research question and objectives.
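As a hedged sketch (df_expression here is a hypothetical genes-by-samples dataframe), the Z-score transformation can be done directly with pandas, or delegated to Seaborn later on via the z_score argument of clustermap:

# Standardise each gene (row) to mean = 0 and standard deviation = 1
df_z = df_expression.sub(df_expression.mean(axis=1), axis=0).div(df_expression.std(axis=1), axis=0)
# Seaborn can do the same internally: sns.clustermap(df_expression, z_score=0)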

Before jumping straight into plotting heatmaps or clustergrams, it is generally a good idea to first filter the dataset, especially when you have a large number of genes. The reason is that in most scenarios, the large majority of genes or proteins remain unchanged, which impairs the ability of unsupervised hierarchical clustering to separate gene expression profiles based on your experimental conditions. To circumvent this limitation, you can filter the dataset based on differentially expressed genes or enriched biological pathways before plotting the heatmaps or clustergrams.

To plot heatmaps and clustergrams using Python, we first load the required packages. In this blog entry, we will be using the Seaborn library to plot the heatmaps and clustergrams:

import numpy as np
import pandas as pd
import seaborn as sns

Similar to my previous blog entries, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

We will load and inspect the processed dataframe from GitHub. It is important to label the gene column as the index column so that gene names can be referred to in the clustergram or heatmap. The commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

As described above, it is important to normalise the dataset to ensure that the relative expression is comparable between different genes. Here, we will use log2 fold-change for normalisation and create log2 fold-change (log2FC) columns in the dataframe:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])

There are a total of 17,623 genes that are measured. To visualise the comparisons between time-points better, we will filter the dataset. Since we have previously ascertained that day 1 has the most differentially expressed genes (DEGs), we could filter the dataset based on upregulated DEGs (with fold-change > 1.5, adjusted p-value < 0.05). The filtered dataframe is saved under DEGs_up_1d. Since we are interested in plotting the log2-fold change values, we will select the log2FC columns and remove all other columns. The code is as follows:

DEGs_up_1d = df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]
DEGs_up_1d = DEGs_up_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])

To plot the clustergram, the codes are as follows:

g = sns.clustermap(DEGs_up_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)

There are a few Seaborn settings used in this code, and the documentation can be found here. The ‘vlag’ colour map was chosen so that increased expression is shown in red and reduced expression in blue. Note that I have also set the maximum and minimum values to 2 and -2 respectively, to ensure that log2FC = 0 appears white (signifying no change). yticklabels=False was chosen because it is near impossible to display all 826 gene names in the clustergram. The output is as shown:

As expected, the day 1 signature is the most distinct compared to the other time-points. The 6-hour and day 7 signatures cluster together, showing that these time-points have little or no change in gene expression profile. Interestingly, some of the DEGs show prolonged expression up to day 3, while others resolve very quickly. Do we see the same trends for the downregulated DEGs? Let’s test it out with the following commands:

DEGs_down_1d = df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]
DEGs_down_1d = DEGs_down_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])
g = sns.clustermap(DEGs_down_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)

Similar patterns can be seen. However, unlike the upregulated DEGs, where some DEGs persisted to day 3, most of the downregulated DEGs returned to baseline levels by day 3.

It’s not so hard, is it? 🙂

Posted in Data visualisation, IPyWidgets, python

Identifying differentially expressed genes with ipywidgets

Gene expression (or transcriptomics) profiling is the most common type of omics data. The identification of differentially expressed genes (DEGs) from transcriptomics data is critical to understanding the molecular driving forces or identifying the molecular biomarkers behind biological phenotypes. The full article on DEGs is fully summarised here.

Since we need to find the most appropriate cut-off to identify DEGs, interactive Python code that allows users to manipulate thresholds is the most suitable way to define the selection criteria for DEGs. In this blog entry, we will use IPython widgets (ipywidgets), which can generate sliders that allow users to set cutoffs or thresholds interactively. The output based on the selected cutoffs is then generated in real time for users to evaluate their suitability.

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.

The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

To install ipywidgets, simply type in the below command (you only need to do this once):

pip install ipywidgets

We can execute the ipywidgets within JupyterLab or Jupyter Notebooks. The instructions for downloading them can be found here. Next, we import the packages required for analysis:

import numpy as np
import pandas as pd
import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objects as go

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can refer to the column for DEG counts. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

Next, we will execute the ipywidgets command:

from ipywidgets import interact, interact_manual
 
@interact
def show_DEGs(column1=['fc_6h', 'fc_1d', 'fc_3d', 'fc_7d'], x=(1, 3, 0.1), column2=['qval_6h', 'qval_1d', 'qval_3d', 'qval_7d']):
    return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))

The above code may seem intimidating, but we can break it down into bite-sized pieces.

First, we import the tools from ipywidgets required to build the interactive features. The @interact decorator automatically creates dropdown menus for the column lists and a slider for the numeric range.

In this case, we created two dropdowns (column1 and column2) to select the fold-change and adjusted p-value (q-value) columns respectively. The argument x=(1,3,0.1) creates a slider so that the fold-change threshold x can be varied from 1 to 3 in steps of 0.1. Finally, we print the total number of DEGs (both up- and downregulated) with absolute fold-change greater than x and q-value < 0.05.

The output is as follows:

I have taken multiple snapshots so that we can appreciate the number of DEGs across the different days. In this case, the total number of DEGs for each time-point with absolute fold-change > 2 and adjusted p-value < 0.05 is shown. Consistent with the volcano plot, we see that the greatest number of DEGs is at day 1.

We can change the x value to 1.5 to visualise the number of DEGs with fold-change value > 1.5, adjusted p-values < 0.05. The output is as follows:

As expected, we see more DEGs as the fold-change cut-off becomes less stringent. Note that the number of DEGs at day 1 increased by approximately 3-fold when we reduced the fold-change cut-off from 2 to 1.5, which is quite a big difference! It is thus more reasonable to use a fold-change cut-off of 1.5 in this case, to avoid losing too much data.
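If you would also like to vary the q-value threshold interactively, a hedged sketch adds a second slider to the same @interact function:

@interact
def show_DEGs_flexible(column1=['fc_6h', 'fc_1d', 'fc_3d', 'fc_7d'], x=(1, 3, 0.1),
                       column2=['qval_6h', 'qval_1d', 'qval_3d', 'qval_7d'], q=(0.01, 0.1, 0.01)):
    # count genes passing both the fold-change (x) and q-value (q) cutoffs
    return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < q)].index))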

After deciding that a fold-change of 1.5 and q-value < 0.05 are most appropriate for our analysis, we will plot these details in a stacked bar chart. We first calculate the number of upregulated (fold-change > 1.5, adjusted p-value < 0.05) and downregulated (fold-change < -1.5, adjusted p-value < 0.05) DEGs for each time-point. The commands are as shown:

DEGs_up_6h = len((df[(df['fc_6h'] > 1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_down_6h = len((df[(df['fc_6h'] < -1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_up_1d = len((df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_down_1d = len((df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_up_3d = len((df[(df['fc_3d'] > 1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_down_3d = len((df[(df['fc_3d'] < -1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_up_7d = len((df[(df['fc_7d'] > 1.5) & (df['qval_7d'] < 0.05)]).index)
DEGs_down_7d = len((df[(df['fc_7d'] < -1.5) & (df['qval_7d'] < 0.05)]).index)

Finally, we can plot the results in a stacked bar format using the Plotly graphing library, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

days = ['6hrs', 'day 1', 'day 3', 'day 7']
fig = go.Figure(data=[
    go.Bar(name='downregulated', x=days, y=[DEGs_down_6h, DEGs_down_1d, DEGs_down_3d, DEGs_down_7d]),
    go.Bar(name='upregulated', x=days, y=[DEGs_up_6h, DEGs_up_1d, DEGs_up_3d, DEGs_up_7d])
])
fig.update_layout(
    barmode='stack',
    title='DEGs in seronegative subjects',
    yaxis_title='Number of DEGs',
    font=dict(family='Arial', size=18)
)
fig.show()

I have made a few more customisations to the graph, including adding a title and y-axis title, and changing the font to make the graph look more professional. The output is as shown:

At one glance, you can quickly appreciate that day 1 has the most DEGs compared to the other time-points, and that there are more upregulated than downregulated DEGs. You can also mouse over the data points to obtain the precise number of up- and downregulated DEGs at every time-point.

Posted in Data visualisation, python

Plotting volcano plots with Plotly

As mentioned previously, I have highlighted why volcano plots are so important in omics research. A full article summarising the main points and rationale is available here. In this entry, we will explore how we can use Plotly to build volcano plots!

We will analyse a transcriptomics dataset published by Zak et al., PNAS, 2012. In this study, seronegative and seropositive subjects were given the MRKAd5/HIV vaccine, and the transcriptomic responses in peripheral blood mononuclear cells (PBMCs) were measured at 6 hours, 1 day, 3 days and 7 days post-vaccination. On my end, I have processed and compiled the fold change, ratio, p-values and adjusted p-values (q-values) of the seronegative subjects with respect to time = 0 using the Partek Genomics Suite, and the processed data is available in GitHub.

We first load the required packages (pandas, numpy and plotly) to plot our volcano plot using Python. Note that you will need to download the packages before you can start using them. I have chosen to use the Plotly graphing library for data visualisation, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can map specific points back to gene names. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1 day, 3 day and 7 day time-points compared to baseline (timepoint = 0):

The next step will be to create new columns for log2FC and -log10(adjusted p-value). The commands are as follows:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])
df['negative_log_pval_6h'] = -np.log10(df['qval_6h'])
df['negative_log_pval_1d'] = -np.log10(df['qval_1d'])
df['negative_log_pval_3d'] = -np.log10(df['qval_3d'])
df['negative_log_pval_7d'] = -np.log10(df['qval_7d'])
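The same columns can also be generated more compactly with a loop over the time-points; a hedged equivalent of the commands above:

# Create log2FC and -log10(q-value) columns for every time-point in one pass
for tp in ['6h', '1d', '3d', '7d']:
    df['log2FC_' + tp] = np.log2(df['ratio_' + tp])
    df['negative_log_pval_' + tp] = -np.log10(df['qval_' + tp])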

Now we are ready to plot the volcano plots for the different time-points. The advantage of using Python is that we can overlay the volcano plots of the different time-points all in one graph. The commands are as follows:

fig = go.Figure()
trace1 = go.Scatter(
    x=df['log2FC_6h'],
    y=df['negative_log_pval_6h'],
    mode='markers',
    name='6hrs',
    hovertext=list(df.index)
)
trace2 = go.Scatter(
    x=df['log2FC_1d'],
    y=df['negative_log_pval_1d'],
    mode='markers',
    name='day 1',
    hovertext=list(df.index)
)
trace3 = go.Scatter(
    x=df['log2FC_3d'],
    y=df['negative_log_pval_3d'],
    mode='markers',
    name='day 3',
    hovertext=list(df.index)
)
trace4 = go.Scatter(
    x=df['log2FC_7d'],
    y=df['negative_log_pval_7d'],
    mode='markers',
    name='day 7',
    hovertext=list(df.index)
)
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.add_trace(trace4)
fig.update_layout(title='Volcano plot for seronegatives')
fig.show()

I will give a brief description of the above code. We first tell Plotly that we are going to plot a figure using the go.Figure() command. Next, we overlay the different scatterplots using the different traces (one for each time-point). In each trace, we (i) specify the columns for the x and y axes, (ii) indicate that we only want to plot points using mode='markers', (iii) indicate the figure legend entries for the different traces and (iv) indicate the text labels shown when we hover over the data points. Under fig.update_layout, we also add the title of the plot. The output graph is as follows:

At one glance, you can quickly appreciate that day 1 has the most differentially expressed genes compared to the other time-points. You can also mouse over the data points to examine which of these genes are most differentially expressed. One example is CXCL10, which is among the genes induced to the greatest extent at day 1 post-vaccination.
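If you prefer a list over hovering, a hedged sketch to print the genes at the top edge of the day 1 volcano plot is to sort by significance and then fold change:

# Ten most significant day 1 genes, ranked by -log10(q-value) and then log2 fold change
top_day1 = df.sort_values(['negative_log_pval_1d', 'log2FC_1d'], ascending=False).head(10)
print(top_day1[['log2FC_1d', 'negative_log_pval_1d']])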

Now that we have established that day 1 is the most interesting time-point, we may want to zoom into the genes at the edges of the volcano plot. We can type in the command below to specifically plot the day 1 volcano plot. Note that I have added text labels to the volcano plot this time (using the 'textposition' argument) so that we can quickly visualise the genes that are most differentially expressed:

fig_d1 = px.scatter(df, x='log2FC_1d', y='negative_log_pval_1d', text=df.index)
fig_d1.update_traces(textposition='top center')
fig_d1.update_layout(
    title_text='Volcano plot for seronegatives (day 1)'
)
fig_d1.show()

Output file is as follows:

Now, we can quickly see that the top upregulated genes at day 1 include interferon-related genes, such as IDO1, CXCL10, RSAD2, CCL2, CCL8 and LAMP3. As you can begin to appreciate, a volcano plot allows users to get a quick sense of the data and to visualise the genes and proteins that are most differentially expressed.

Posted in Data visualisation, python

Building interactive dashboards with Streamlit (II) – Plotting pairplots and scatterplots for numeric variables

In my previous blog entry, I covered the basics of using Streamlit to inspect basic attributes of the dataframe, including numeric and categorical variables.

In this blog entry, we will cover how we can use data visualisation tools in Streamlit for data dashboarding. The advantage of using Streamlit is that we can use Python graphing packages, such as Plotly and Altair, to plot interactive charts that allow users to hover over the data points to query specific data point attributes. Moreover, users can also utilise widgets or radio buttons in Streamlit to plot interactive scatterplots for data dashboarding. Specifically, we will focus on how we can plot pair plots and scatterplots in Streamlit. The code is publicly available in GitHub (saved under iris.py) and the website instance is hosted at: https://share.streamlit.io/kuanrongchan/iris_dataset/main/iris.py.

To plot pairplots, we will import Seaborn. The rationale has been described previously, and we can reuse the code we optimised in Jupyter Notebooks in Streamlit. We first need to load the additional package:

import seaborn as sns

Then we plot the pairplot using the commands (with the header and subheader descriptions) below:

st.header('Visualising relationship between numeric variables')
st.subheader('Pairplot analysis')
g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)
st.pyplot(g)

Output file is as follows:

After plotting pair plots, we may want to have a closer look at specific relationships using scatterplots. The advantage of using Streamlit is that we can use widgets to allow users to plot a scatterplot of any 2 selected variables:

st.subheader('Scatterplot analysis')
selected_x_var = st.selectbox('What do you want the x variable to be?', df.columns)
selected_y_var = st.selectbox('What about the y?', df.columns)
fig = px.scatter(df, x = df[selected_x_var], y = df[selected_y_var], color="species")
st.plotly_chart(fig)

Output is as shown below. I have annotated the widgets, which allow users to select and click on any variables they desire for scatterplot analysis. You may hover over the data points to zoom into the data point attributes, as this graph was plotted with Plotly.

We can also show the correlation coefficients and the associated p-values for the scatterplots, to indicate whether the two selected variables are linearly correlated. We first add and load the required packages:

from scipy.stats import pearsonr
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from scipy import stats

We can then calculate the Pearson and Spearman coefficients, together with the associated p-values using the following commands:

#Correlation calculations (Pearson)
st.subheader("Pearson Correlation")
def calc_corr(selected_x_var, selected_y_var):
    corr, p_val = stats.pearsonr(selected_x_var, selected_y_var)
    return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_corr(x, y)
st.write('Pearson correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)
#Correlation calculations (Spearman)
st.subheader("Spearman Correlation")
def calc_corr(selected_x_var, selected_y_var):
    corr, p_val = stats.spearmanr(selected_x_var, selected_y_var)
    return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_corr(x, y)
st.write('Spearman correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)

Output is as follows:

Pearson Correlation

Pearson correlation coefficient: -0.109

p value: 0.183

Spearman Correlation

Spearman correlation coefficient: -0.159

p value: 0.051

Hence, in this example, sepal length and sepal width are not significantly linearly correlated.

As you may appreciate, the Streamlit commands allow you to organise the data by adding headers and subheaders. In addition, they provide tools that allow users to interactively explore their dataset, avoiding the need to plot every pairwise permutation of scatterplots. The outcome is a neat and professional dashboard that can be readily deployed and shared with others.

Posted in python

Building interactive dashboards with Streamlit (Part I)

Iris flower dataset used to illustrate how we can use Streamlit to build data dashboards

Python is often used in back-end programming, which builds the functionality of web applications. For instance, Python can be used to connect the web application to a database, so that users can query information. By adding the back-end components, Python turns a static webpage into a dynamic web application, where users can interact with the webpage based on an updated and connected database. If the data provided are big, developers can even use the machine learning tools in Python to make data predictions. However, front-end development, which is the part that makes webpages beautiful and interactive, is often done with other programming languages such as HTML, CSS and JavaScript. The question remains: do I need to be a full-stack developer who masters both front-end and back-end web development? This can be time-consuming, as it means you have to learn multiple languages.

This is where Streamlit, Django and Flask come to the rescue! These frameworks are built on the Python programming language and allow building websites with minimal knowledge of CSS and JavaScript. Among these frameworks, I personally find Flask and Streamlit easier to learn and implement. With more experience, I decided to focus on Streamlit, as the syntax is easier to understand and the deployment is more rapid.

Rather than going through the Streamlit commands one by one, I will illustrate the functionality of the different commands using my favourite iris dataset, which is readily available in Plotly Express. The GitHub repository is publicly available, and the output of the code is hosted here.

First, we will create a file called iris.py using a text editor such as Sublime Text or Visual Studio Code. In this blog entry, we will focus on acquiring the basic statistics of the individual columns and presenting this information in a data dashboard for project showcasing. As with all Python scripts, we need to import the required packages:

import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
from wordcloud import WordCloud
from typing import Any, List, Tuple

After that, we type in the commands needed to read the iris dataframe. The data is available in PlotlyExpress and can be loaded into Streamlit with the following commands. The title of the dataframe is also included for data dashboarding:

st.title('Data analysis of iris dataset from PlotlyExpress')
df = px.data.iris()

Next, basic stats such as counts, mean, standard deviation, min, max and quantiles can be displayed using the command df.describe(). However, the advantage of Streamlit is the ability to add widgets, thus preventing the dashboard from looking too cluttered. In this example, we will create a widget that lets users select a numeric column and display its basic statistics:

# A numeric column must be selected first so that the metrics below have data to summarise
num_cols = df.select_dtypes(include=np.number).columns
selected_num_col = st.selectbox("Select column to display statistics", num_cols)
column = df[selected_num_col]
col1, col2, col3, col4, col5 = st.columns(5)
col1.metric("Mean", column.mean())
col2.metric("Max", column.max())
col3.metric("Min", column.min())
col4.metric("Std", column.std())
col5.metric("Count", int(column.count()))

The output will generate 5 columns, showing the mean, max, min, standard deviation and count for the selected column. The widget can be visualised here.

These statistics are appropriate for numeric or continuous variables. However, to visualise the categorical variables (in this case, the flower species), we can use a word cloud or a table showing the exact counts for each species, using the following commands:

# Build wordcloud
non_num_cols = df.select_dtypes(include=object).columns
column = st.selectbox("Select column to generate wordcloud", non_num_cols)
column = df[column]
wc = WordCloud(max_font_size=25, background_color="white", repeat=True, height=500, width=800).generate(' '.join(column.unique()))
st.image(wc.to_image())

# Build table
st.markdown('***Counts in selected column for generating WordCloud***')
unique_values = column.value_counts()
st.write(unique_values)

The aim of part I is to provide various solutions to inspect the distribution of the categorical and continuous variables within your dataset. I will cover the other topics, including data visualisation, machine learning and interactive graphical plots, in subsequent posts!

Posted in python

My experience with Voilà and Streamlit for building data dashboards

Differences between data engineer vs scientist. Source

The role of the data scientist is clear: to analyse the data, plot visualisation graphs and consolidate the findings into a report. However, with greater interest in a deeper understanding of big data and an urgent need for novel tools to gain insights from biological datasets, there is a growing interest in employing data engineers. Their roles and responsibilities can include app development, constructing pipelines, data testing and maintaining architectures such as databases and large-scale processing systems. This is unlike data scientists, who are mostly involved in data cleaning, data visualisation and big data organisation.

One aspect of the data engineer's role is to implement strategies that improve reliability, efficiency and quality. To ensure consistency, implementing data dashboards is important. This requires moving out of the comfort zone of just reporting data within Jupyter Notebooks. To build data dashboards, JavaScript is often used. Recently, however, several dashboarding packages have become available in Python (which means that you don't have to learn another language). These packages include Voilà, Panel, Dash and Streamlit. On my end, I have tried Voilà and Streamlit, as they are both easier to implement compared to Panel and Dash. This blog post will hence compare my experience with Voilà and Streamlit.

The installation of Voilà and the associated templates is relatively straightforward. You just need to execute these commands to download the packages:

pip install voila
pip install voila-gridstack
pip install voila-vuetify

Once the packages are installed in your environment, you should be able to see the extensions in the Jupyter Notebook interface (indicated by the arrow). Clicking on them will render the output of your Python code as a dashboard.

With the gridstack or vuetify templates, you can further rearrange your outputs to display your graphs in the dashboard. The dashboard can then be deployed using Heroku, or deposited in GitHub for deployment on mybinder.org.

As you can imagine, if you enjoy working within Jupyter Notebooks, Voilà can be a simple and convenient tool for making data dashboards. You can also make the dashboard interactive by using ipywidgets, Plotly, Altair or Bokeh. However, a severe limitation is that it is difficult to build multi-page dashboards. This can be an issue if you are developing multiple dashboards, or multiple graphs from different studies.

My delay in writing this blog post is because I have spent much of my time finding alternatives for building multi-page dashboards. This is where I learnt about Streamlit. I was very impressed by how simple Python code can be used to develop beautiful dashboards, and I was able to build a simple webpage/dashboard within a few hours of reading online tutorials. With more reading, I was even able to make some simple apps! Using Streamlit is as simple as:

  1. Open terminal window
  2. Install Streamlit
  3. Create a .py file using text editors such as sublime text (my preferred choice), atom or visual code
  4. And then execute file by typing the code in terminal: streamlit run filename.py

You can install streamlit by using:

pip install streamlit

In addition to these cool features, Streamlit is able to do multi-page apps, which means you can create multiple data dashboards or multiple apps within a single website. Finally, deployment is also relatively simple with Streamlit Teams, which is attractive. However, if you prefer to work within Jupyter Notebooks, this may not be a great option for you, as the commands are mostly executed via the terminal or in .py files. The other limitation, for which I haven't found a solution, is related to security: I do not know how to configure the app so that only registered users can use the website.
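To illustrate the multi-page point above, here is a minimal sketch (with hypothetical page names) of how a sidebar selector can switch between dashboards in a single Streamlit app:

import streamlit as st

# A sidebar radio button acts as a simple page switcher
page = st.sidebar.radio('Select a page', ['Volcano plot', 'DEG counts', 'Clustergram'])
if page == 'Volcano plot':
    st.title('Volcano plot dashboard')
elif page == 'DEG counts':
    st.title('DEG counts dashboard')
else:
    st.title('Clustergram dashboard')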

Overall, deciding which platform to use will depend on your personal preferences and applications. I prefer Streamlit as it is more versatile and flexible, which may explain its increasing popularity in recent years!

Comparison of Streamlit, Dash, Voilà and Panel for Python dashboarding. Image from Stephen Kilcommins, DataDrivenInvestor.
Posted in machine learning, python

Choosing the best machine learning model with LazyPredict

Predicting flower species with machine learning. Source from: https://weheartit.com/entry/14173501

The idea of using machine learning models to make predictions is attractive. But a challenging question remains: there are so many machine learning models out there, so which model is most suitable? Which model should I start exploring? Should I fit every single model that I have learnt and try my luck to see which model works best?

Indeed, these questions cannot be easily answered, especially for beginners. While I am no expert in machine learning, there are some packages that can make your life simpler. One such package is LazyPredict, which ranks the machine learning models most likely to be suitable. LazyPredict contains both LazyClassifier and LazyRegressor, which allow you to predict categorical and continuous variables respectively.

To install the lazypredict package, execute the following code:

pip install lazypredict

The lazypredict package requires the XGBoost and LightGBM extensions to be functional. Execute the following commands in the terminal:

conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm

Now you are ready to analyse the data with machine learning. First, import the essential packages:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import display

We will import the iris dataset as an example. In this exercise, we will predict sepal width from the iris dataset using machine learning:

df = px.data.iris()
df.head()

Output is as follows. The dataframe contains values for sepal length/width, petal length/width, species (setosa, versicolor, virginica) and species_id (where setosa = 1, versicolor = 2, virginica = 3):

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Before doing machine learning, it is good practice to get a “sense” of your data to know which machine learning model is likely to fit it. You can plot a pair plot in this case to understand the distribution of your dataset. Some critical considerations include: (i) detecting the presence of outliers, (ii) whether the relationships between variables are linear or non-linear, and (iii) whether the distributions of the variables are Gaussian or skewed. Plotting a pair plot in Python is easy:

from pandas.plotting import scatter_matrix

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
scatter_matrix(df[attributes], figsize = (10,8))

Output is as follows:

The data suggest that there are no big anomalies, and most of the variables follow a linear relationship. At this point, we may not know which variable best predicts “sepal width”, as it is not obvious. Notice that we have not yet considered the “species” variable in our analysis. The “species” variable contains 3 categories: setosa, versicolor and virginica. One possibility for processing the categorical data is to assign different numbers to the categories, as shown in the “species_id” column, where setosa = 1, versicolor = 2 and virginica = 3. However, in this case, assigning these values may not be appropriate, as it may make it seem that virginica is more important than versicolor, and versicolor more important than setosa. To circumvent this issue, we can use one-hot encoding of the data, which assigns binary values (1 or 0) to the individual categories. Fortunately, the conversion is relatively easy to execute in Python:

df_cat_to_array = pd.get_dummies(df)
df_cat_to_array = df_cat_to_array.drop('species_id', axis=1)
df_cat_to_array

Output is as follows:

     sepal_length  sepal_width  petal_length  petal_width  species_setosa  species_versicolor  species_virginica
0             5.1          3.5           1.4          0.2               1                   0                  0
1             4.9          3.0           1.4          0.2               1                   0                  0
2             4.7          3.2           1.3          0.2               1                   0                  0
3             4.6          3.1           1.5          0.2               1                   0                  0
4             5.0          3.6           1.4          0.2               1                   0                  0
145           6.7          3.0           5.2          2.3               0                   0                  1
146           6.3          2.5           5.0          1.9               0                   0                  1
147           6.5          3.0           5.2          2.0               0                   0                  1
148           6.2          3.4           5.4          2.3               0                   0                  1
149           5.9          3.0           5.1          1.8               0                   0                  1

Each category is given equal weight: each species column takes the value 1 (“yes”) or 0 (“no”). We also drop the species_id column, as the categories have already been encoded using the get_dummies function.

Now that all columns are converted to numerical values, we are ready to use the lazypredict package to see which machine learning model is best at predicting sepal width. We first import the required packages:

import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

Next, we use lazypredict to identify the machine learning regressor model that can potentially predict sepal width:

X = df_cat_to_array.drop(['sepal_width'], axis=1)
Y = df_cat_to_array['sepal_width']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

The above commands define the explanatory variables as X and the response variable (sepal width) as Y. Hence, we drop the sepal width column for X and keep only the sepal width column for Y. The test size in this case is 20% of the data and the training size is 80%. It is a good habit to specify the random state so that the same split is obtained if we rerun the analysis on a different day. The output is as follows:

Model                           Adjusted R-Squared   R-Squared   RMSE   Time Taken
SVR                                           0.70        0.76   0.19         0.01
NuSVR                                         0.70        0.76   0.19         0.02
KNeighborsRegressor                           0.64        0.71   0.20         0.02
RandomForestRegressor                         0.62        0.70   0.21         0.19
GradientBoostingRegressor                     0.62        0.70   0.21         0.05
LGBMRegressor                                 0.60        0.69   0.21         0.04
HistGradientBoostingRegressor                 0.60        0.68   0.21         0.12
HuberRegressor                                0.60        0.68   0.21         0.04
Ridge                                         0.60        0.68   0.22         0.01
RidgeCV                                       0.60        0.68   0.22         0.01
BayesianRidge                                 0.59        0.68   0.22         0.02
ElasticNetCV                                  0.59        0.68   0.22         0.08
LassoCV                                       0.59        0.68   0.22         0.11
LassoLarsIC                                   0.59        0.67   0.22         0.01
LassoLarsCV                                   0.59        0.67   0.22         0.04
TransformedTargetRegressor                    0.59        0.67   0.22         0.01
LinearRegression                              0.59        0.67   0.22         0.01
OrthogonalMatchingPursuitCV                   0.59        0.67   0.22         0.02
LinearSVR                                     0.59        0.67   0.22         0.01
BaggingRegressor                              0.59        0.67   0.22         0.03
AdaBoostRegressor                             0.57        0.66   0.22         0.10
Lars                                          0.53        0.62   0.23         0.02
LarsCV                                        0.53        0.62   0.23         0.04
SGDRegressor                                  0.47        0.58   0.25         0.02
ExtraTreesRegressor                           0.47        0.58   0.25         0.13
PoissonRegressor                              0.45        0.56   0.25         0.02
XGBRegressor                                  0.35        0.48   0.27         0.12
ExtraTreeRegressor                            0.34        0.47   0.28         0.02
GeneralizedLinearRegressor                    0.29        0.44   0.28         0.02
TweedieRegressor                              0.29        0.44   0.28         0.01
DecisionTreeRegressor                         0.29        0.44   0.29         0.02
MLPRegressor                                  0.29        0.44   0.29         0.18
GammaRegressor                                0.29        0.43   0.29         0.01
OrthogonalMatchingPursuit                     0.28        0.43   0.29         0.02
RANSACRegressor                               0.27        0.42   0.29         0.07
PassiveAggressiveRegressor                    0.03        0.23   0.33         0.01
DummyRegressor                               -0.31       -0.04   0.39         0.01
Lasso                                        -0.31       -0.04   0.39         0.02
ElasticNet                                   -0.31       -0.04   0.39         0.01
LassoLars                                    -0.31       -0.04   0.39         0.02
KernelRidge                                 -82.96      -65.59   3.11         0.02
GaussianProcessRegressor                   -483.87     -383.55   7.47         0.02

Here we go. We have identified SVR, NuSVR, KNeighborsRegressor, RandomForestRegressor and GradientBoostingRegressor as the top 5 models (with R-squared values of ~0.7), which is not bad for a start! Of course, we can further refine the models by hyperparameter tuning to explore whether we can further improve our predictions.
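As a hedged sketch of what such tuning might look like for the top-ranked model (SVR), scikit-learn's GridSearchCV can search over a small, illustrative grid of hyperparameters (the grid values below are assumptions, not recommendations):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Search a small grid of SVR hyperparameters with 5-fold cross-validation on the training set
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1], 'kernel': ['rbf']}
grid = GridSearchCV(SVR(), param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)  # X_train and y_train come from the earlier train_test_split
print(grid.best_params_, grid.best_score_)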

To predict categorical variables, you can use the LazyClassifier from the lazypredict package. Let's assume that this time we are interested in whether the sepal and petal length/width can be used to predict the species versicolor. First, import the necessary packages:

from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

The Lazy Classifier can then be executed with the following command:

X = df_cat_to_array.drop(['species_setosa', 'species_versicolor', 'species_virginica'], axis=1)
Y = df_cat_to_array['species_versicolor']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=55)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Note that I have dropped the columns for the other species, as we are interested in predicting versicolor from the petal and sepal length/width. This shows that the research question is critical in machine learning! The output is as follows:

Model                           Accuracy   Balanced Accuracy   ROC AUC   F1 Score   Time Taken
KNeighborsClassifier                1.00                1.00      1.00       1.00         0.02
SVC                                 1.00                1.00      1.00       1.00         0.02
AdaBoostClassifier                  0.97                0.95      0.95       0.97         0.13
XGBClassifier                       0.97                0.95      0.95       0.97         0.05
RandomForestClassifier              0.97                0.95      0.95       0.97         0.32
NuSVC                               0.97                0.95      0.95       0.97         0.06
BaggingClassifier                   0.97                0.95      0.95       0.97         0.04
LGBMClassifier                      0.97                0.95      0.95       0.97         0.05
DecisionTreeClassifier              0.97                0.95      0.95       0.97         0.02
ExtraTreesClassifier                0.97                0.95      0.95       0.97         0.14
LabelPropagation                    0.93                0.93      0.92       0.93         0.02
LabelSpreading                      0.93                0.93      0.92       0.93         0.02
GaussianNB                          0.93                0.90      0.90       0.93         0.02
ExtraTreeClassifier                 0.93                0.90      0.90       0.93         0.01
QuadraticDiscriminantAnalysis       0.93                0.90      0.90       0.93         0.03
NearestCentroid                     0.60                0.65      0.65       0.61         0.03
BernoulliNB                         0.67                0.65      0.65       0.67         0.02
CalibratedClassifierCV              0.67                0.60      0.60       0.66         0.04
RidgeClassifierCV                   0.67                0.60      0.60       0.66         0.02
RidgeClassifier                     0.67                0.60      0.60       0.66         0.03
LinearSVC                           0.67                0.60      0.60       0.66         0.02
LogisticRegression                  0.67                0.60      0.60       0.66         0.04
LinearDiscriminantAnalysis          0.67                0.60      0.60       0.66         0.02
SGDClassifier                       0.70                0.57      0.57       0.64         0.02
Perceptron                          0.63                0.55      0.55       0.61         0.06
PassiveAggressiveClassifier         0.63                0.55      0.55       0.61         0.05
DummyClassifier                     0.57                0.47      0.47       0.54         0.02

This output suggests that KNeighborsClassifier, SVC, AdaBoostClassifier, XGBClassifier and RandomForestClassifier can be explored as machine learning models to predict the versicolor species. You can further refine the models by hyperparameter tuning to improve the accuracy.

Overall, lazypredict can be a handy tool for selecting which of the many machine learning models it supports is most suitable for predicting your response variable, before tuning the hyperparameters. It is certainly a good starting point for beginners venturing into machine learning!

Posted in machine learning, python

The fundamentals of machine learning

Machine learning is a hot topic in data science, but few people understand the concepts behind it. You may be fascinated by how people land high-paying jobs because they know how to execute machine learning, and decide to delve deeper into the topic, only to be quickly intimidated by the sophisticated theorems and mathematics behind machine learning. While I am no machine learning expert, I hope to provide some basics about machine learning, and how you can potentially use Python to perform machine learning, in some of my future blog entries.

With all the machine learning tools available at your fingertips, it is often tempting to jump straight into solving a data-related problem by running your favourite algorithm. However, this is usually a bad way to begin your analysis. In fact, executing machine learning algorithms plays only a small part in data analysis and the decision-making process. To make proper use of machine learning, I recommend that you take a step back and look at the research question from a helicopter view. What are you trying to answer? Are you trying to classify severe and mild dengue cases based on cytokine levels? Are you predicting antibody responses based on gene expression levels? Do you have an end goal in mind?

Often, the research question will define the machine learning model to execute, and this is highly dependent on your explanatory and response variables. It also depends on whether you prefer supervised or unsupervised learning of your dataset. Another point to consider is the impact of using machine learning. Does it help the company save money? Would it have a positive impact on clinical practice? If you cannot envision an impact, then it may not be worth the trouble to use machine learning after all.

Once you have an end goal in mind, you are ready to use machine learning tools to analyse your data. Before that, however, it is critical to ensure that your data are of good quality. As the saying goes: garbage in, garbage out. Your machine learning model will not be able to rescue a poor-quality dataset. Next, it is also important to get a reasonable sense of your dataset. Specifically, are there any outliers in your dataset? If there are, is it worth removing or changing them before performing machine learning? Does your dataset have high variance and require a Z-score transformation for better prediction? Based on these questions, it is often a good idea to use data visualisation tools to understand your dataset before performing machine learning. A quick way to get a sense of the data is to type the following command on your dataframe (assigned as df in my example):

df.describe()

Another way to get a sense of the data is to compute the correlation matrix with the command below (but do note that correlation does not capture non-linear relationships):

df.corr()

Histograms can be used to quickly gauge the distribution of the data. To plot histograms for each variable, I recommend using the Lux package, which I have previously covered in my blog.

After preparing your dataset, it is time to use the machine learning algorithms. At this point, you hope to find a model that can best represent your data. A reasonable solution is to split your data into two sets: the training set and the test set (see figure below). Usually, approximately 70–80% of the data is used for training and the other 20–30% is used as the test set to evaluate the accuracy of the model. If you have several models that perform equally well, you may even consider using a validation set or cross-validation to ascertain which model best describes the data.
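As a hedged sketch with scikit-learn (X and y here are hypothetical arrays of explanatory and response variables), an 80/20 split can be done in one line:

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)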

This article covers the fundamentals of machine learning. Subsequently, I will expand on the different machine learning models, how to execute them effectively, and the advantages and limitations of each. I am excited to learn more and share with all of you. Meanwhile, stay tuned! 🙂