I have spent my last week building my lab webpage, so that people can better understand what our lab is doing. The webpage is built with Zyro, and hosted at kuanrongchan.com. If interested to work with us, please do not hesitate to contact us! 🙂
We have used volcano plots, identified the differentially expressed genes (DEGs) and plotted in clustergrams to show the relative expression of these DEGs. The next logical step now will be to query the function of these DEGs. There are a myraid of web tools, such as Database for Annotation, Visualization and Integrated Discovery (DAVID), Ingenuity Pathway Analysis (IPA), Metacore, PandaOmics, Enrichr, GSEA, STRING, Panther, Reactome and KEGG that you can consider as query databases. In this blog post, we will focus on using Enrichr to query on the functionality and localisation of DEGs.
Enrichr is user-friendly, free to use and allows querying of gene lists against a repertoire of gene-set libraries. Moreover, there is GSEApy, which is a Python wrapper for Enrichr, allowing users to query the characteristics of your DEGs within Python. I strongly recommend using JupyterLab or Jupyter Notebook for visualising the output of GSEApy, which is a web-based interactive development environment for storing and executing Python codes. You may visit here for specific instructions on how you can install them on your computer.
If you have not downloaded GSEApy, you can download by entering this command:
pip install gseapy
We first load the required packages and run matplotlib inline using the following codes:
%matplotlib inline %config InlineBackend.figure_format='retina' %load_ext autoreload %autoreload 2 import csv import numpy as np import pandas as pd import gseapy as gp import matplotlib.pyplot as plt
The advantage of using Enrichr is that users can query across multiple databases. To understand the gene set libraries that are available for data query, you can type in the following command:
import gseapy names = gseapy.get_library_name() print(names)
This produces an output file showing the 190 different databases (and possibly more in the future) that users can potentially query against:
As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0). We will filter the upregulated and downregulated DEGs based on fold-change > 1.5, q-value < 0.05 and fold-change < -1.5, q-value < 0.05 respectively at day 1 post-vaccination using the commands as follows:
df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0) DEGs_up_1d = (df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index.tolist() DEGs_down_1d = (df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index.tolist()
My favourite list of databases are Gene Ontology (GO) and Reactome. GO Biological Processes (GOBP), GO Molecular Functions (GOMF) and GO Cellular Components (GOCC) usually provide users a rough idea of what the DEGs do, and their localisation within the cell. To query the upregulated DEGs against all of these 4 databases, the following commands can be executed:
enr_GOBP_up = gp.enrichr(gene_list=DEGs_up_1d , gene_sets=['GO_Biological_Process_2021'], organism='Human', description='DEGs_up_1d', outdir='test/enr_DEGs_GOBP_up', cutoff=0.5 ) enr_GOMF_up = gp.enrichr(gene_list=DEGs_up_1d , gene_sets=['GO_Molecular_Function_2021'], organism='Human', description='DEGs_up_1d', outdir='test/enr_DEGs_GOMF_up', cutoff=0.5 ) enr_GOCC_up = gp.enrichr(gene_list=DEGs_up_1d , gene_sets=['GO_Cellular_Component_2021'], organism='Human', description='DEGs_up_1d', outdir='test/enr_DEGs_GOCC_up', cutoff=0.5 ) enr_Reactome_up = gp.enrichr(gene_list=DEGs_up_1d , gene_sets=['Reactome_2016'], organism='Human', description='DEGs_up_1d', outdir='test/enr_DEGs_Reactome_up', cutoff=0.5 )
These codes may look difficult to understand, so let’s break them down individually to understand the codes. Firstly, gp.enrichr calls Enrichr to query against the gene list that is assigned to DEGs_up_1d in this case. We then specify the gene sets to query against GOBP, GOMF, GOCC or Reactome, which is why the syntaxes are broken into 4 distinct segments. Since the data is from Humans, we assign the organism as Human and add a description. Next we assign the output path in ‘outdir.’ In this example, we specify the creation of a folder named “test” and store the output with the specified file name. Finally, we set a cutoff of p=0.5 for analysis.
To have a sensing of the output table within Jupyter Notebook, we can type this command as an example (showing only the top 5 significant pathways):
Output is as follows:
Note that the “genes” column may look truncated within Jupyter Notebook, so you might have to go to your specified path from ‘outdir’ to find out the full genes list. But nonetheless, the table provides important statistics that allows users to have a quick sensing of the pathways that are enriched.
To show the results graphically, we can key in the following code:
from gseapy.plot import barplot, dotplot barplot(enr_GOBP_up.res2d,title='GO Biological Processes seroneg day 1 (up)',color = 'r')
We import barplot from GSEApy to allow us to see the output graphically. The barplot command then plots the top 10 enriched molecular pathways with adjusted p-value < 0.05. Since the genes are upregulated, we assigned the colour as red (‘r’). Output is as follows:
At one glance, we can quickly ascertain that the top pathways induced by the Merck Ad5/HIV are related to innate immune responses and type I IFN response. To determine the enriched pathways in from the other databases, we can use the following code:
barplot(enr_GOBP_up.res2d,title='GO Biological Processes seroneg day 1 (up)',color = 'r') barplot(enr_GOMF_up.res2d,title='GO Molecular Function seroneg day 1 (up)',color = 'r') barplot(enr_GOCC_up.res2d,title='GO Cellular Component seroneg day 1 (up)',color = 'r') barplot(enr_Reactome_up.res2d,title='Reactome seroneg day 1 (up)',color = 'r')
Output files are as follows:
When we query more databases, we are beginning to see a consensus, where the chemokines, interferon and inflammation pathways are most positively enriched.
Now we will query the pathways that are negatively enriched. The concept is similar with what we did for the upregulated DEGs. We can use the following codes:
enr_GOBP_down = gp.enrichr(gene_list=DEGs_down_1d , gene_sets=['GO_Biological_Process_2021'], organism='Human', description='DEGs_down_1d', outdir='test/enr_DEGs_GOBP_down', cutoff=0.5 ) enr_GOMF_down = gp.enrichr(gene_list=DEGs_down_1d , gene_sets=['GO_Molecular_Function_2021'], organism='Human', description='DEGs_down_1d', outdir='test/enr_DEGs_GOMF_down', cutoff=0.5 ) enr_GOCC_down = gp.enrichr(gene_list=DEGs_down_1d , gene_sets=['GO_Cellular_Component_2021'], organism='Human', description='DEGs_down_1d', outdir='test/enr_DEGs_GOCC_down', cutoff=0.5 ) enr_Reactome_down = gp.enrichr(gene_list=DEGs_down_1d , gene_sets=['Reactome_2016'], organism='Human', description='DEGs_down_1d', outdir='test/enr_DEGs_Reactome_down', cutoff=0.5 )
To visualise the downregulated pathways, we can enter the following codes:
barplot(enr_GOBP_down.res2d,title='GO Biological Processes seroneg day 1 (down)',color = 'b') barplot(enr_GOMF_down.res2d,title='GO Molecular Function seroneg day 1 (down)',color = 'b') barplot(enr_GOCC_down.res2d,title='GO Cellular Component seroneg day 1 (down)',color = 'b') barplot(enr_Reactome_down.res2d,title='Reactome seroneg day 1 (down)',color = 'b')
Output is as follows:
A common theme emerges. We can quickly ascertain that protein translation and genes related to T-cell receptors are the most significantly downregulated pathways after 1 day post-vaccination.
These codes should help a lot in your pathway analysis!
This year, I was very interested in systems vaccinology, and have placed a lot of my efforts summarising various systems vaccinology papers which I found interesting. However, the current layout of my blog didn’t allow presentation of all these publications in a format that users can quickly access to them. I have thus compiled them in my medium publication account, and in a systematic fashion in my medium website. Feel free to visit these small compilations of my blog entries from the hyperlink. Cheers!
Bioinformatic Python codes for volcano plot, DEG and heatmap analysis
Thank you all for the great interest in my blog. For the year end, I have decided to compile the blog posts that I have made for this year so we can revise what we have learnt before starting the next year with new content. If you have followed my posts, a lot of emphasis has been placed in data analysis of omics data, including volcano plot, DEG identification and heatmap analysis. I have thus re-written some of these posts in Medium, curating the codes so that readers can also implement them easily. Furthermore, a concrete example dataset is provided in GitHub to bring context into the codes used!
For Python codes on volcano plots, please click here.
For Python codes on DEGs, please click here.
For Python codes on clustergrams and heatmaps, please click here.
All of the above visualisation tools are important before doing any downstream analysis, including pathway enrichment analysis, interaction maps, gene network etc., which I will be covering more next year!
Unsupervised hierachical clustering can also allow direct visualisation of clusters without the need for dimensionality reduction. The full article can be found here.
There are two common strategies to for data normalisation. One way is to plot the heatmap or clustergram based on log2 fold-change instead of absolute abundance. The comparisons can be performed against time-point = 0 (baseline) for temporal studies, or against a control/placebo experiment (for static studies). However, the disadvantage of using this method is that the distribution of baseline or controls cannot be easily visualised.
Another alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation converts each gene or protein across all conditions with mean = 0 and standard deviation = 1, enabling users to easily compare expression values across multiple genes and protein. Values that are above 0 means the gene expression level is above mean expression, whereas values below 0 means that the gene expression level is below mean expression. However, the limitation is that Z-scores can only be applied for experiments that are obtained within a single experiment. Hence, the choice of either method for data normalisation is highly dependent on the research question and objectives.
Before jumping straight into plotting heatmap or clustergrams for your data, it is generally a good idea to first filter the dataset, especially when you have a large number of genes. The reason is because in most scenarios, the large majority of the genes or proteins remain unchanged, which will consequently impact the ability for the unsupervised hierarchical clustering to separate gene expression profiles based on your experimental conditions. To circumvent this limitation, you can choose to filter the dataset based on differentially expressed genes or enriched biological pathways before plot the heatmaps or clustergrams.
To plot heatmaps and clustergrams using Python, we first load the required packages. In this blog entry, we will be using the Seaborn library to plot the heatmaps and clustergrams:
import numpy as np import pandas as pd import seaborn as sns
Similar to my previous blog entries, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).
We will load and inspect the processed dataframe from GitHub. It is important to label the gene column as the index column so that gene names can be referred to in the clustergram or heatmap. The commands are as follows:
df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0) df.head()
The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):
As described above, it is important to normalise the dataset to ensure that the relative expression is comparable between different genes. Here, we will use log2 fold-change for normalisation and create log2 fold-change (log2FC) columns in the dataframe:
df['log2FC_6h'] = np.log2(df['ratio_6h']) df['log2FC_1d'] = np.log2(df['ratio_1d']) df['log2FC_3d'] = np.log2(df['ratio_3d']) df['log2FC_7d'] = np.log2(df['ratio_7d'])
There are a total of 17,623 genes that are measured. To visualise the comparisons between time-points better, we will filter the dataset. Since we have previously ascertained that day 1 has the most differentially expressed genes (DEGs), we could filter the dataset based on upregulated DEGs (with fold-change > 1.5, adjusted p-value < 0.05). The filtered dataframe is saved under DEGs_up_1d. Since we are interested in plotting the log2-fold change values, we will select the log2FC columns and remove all other columns. The code is as follows:
DEGs_up_1d = df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)] DEGs_up_1d = DEGs_up_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])
To plot the clustergram, the codes are as follows:
from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable from mpl_toolkits.axes_grid1.colorbar import colorbar g = sns.clustermap(DEGs_up_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)
There are a few Seaborn settings that are displayed in the code, and the documentation can be found here. The colour map to be ‘vlag’ was chosen as it would allow us to have a heatmap where increased expression is red and reduced expression is blue. Note that I have also assigned maximum and minimum value of 2 and -2 respectively, as I wanted to ensure that log2FC = 0 is white (to signify no change). yticklabels = False was chosen because it is near impossible to see all 826 gene names in the clustergram. The output is as shown:
As expected, the day 1 signatures are the most distinct compared to other time-points. The 6 hour and day 7 signatures are clustered together, showing that these time-points have little or no changes in gene expression profile. Interestingly, some of the DEGs have prolonged expression up to day 3, while others resolve very quickly. Do we see the same trends for the downregulated DEGs? Let’s test it out with the following command:
DEGs_down_1d = df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)] DEGs_down_1d = DEGs_down_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d']) g = sns.clustermap(DEGs_down_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)
Similar patterns can be seen. However, unlike upregulated DEGs where some DEGs persisted to day 3, most of the downregulated DEGs returned to baseline levels at day 3.
It’s not so hard isnt it? 🙂
Gene expression (or transcriptomics) profiling is the most common type of omics data. The identification of differentially expressed genes (DEGs) from transcriptomics data is critical to understanding the molecular driving forces or identifying the molecular biomarkers behind biological phenotypes. The full article on DEGs is fully summarised here.
Since we need to find out the most appropriate cut-off to identify DEGs, interactive Python codes that allow users to manipulate thresholds will be the most suitable to define the selection criteria for DEGs. In this blog entry, we will use IPython widgets (ipywidgets), which can generate sliders that allow users to set cutoffs or thresholds interactively. The output based on the selected cutoffs will then be generated in real-time for users to evaluate their suitability.
As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.
The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).
To install ipywidgets, simply type in the below command (you only need to do this once):
pip install ipywidgets
We can execute the ipywidgets within JupyterLab or Jupyter Notebooks. The instructions for downloading them can be found here. Next, we import the packages required for analysis:
import numpy as np import pandas as pd import ipywidgets as widgets import plotly.express as px import plotly.graph_objects as go
Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can refer to the column for DEG counts. Commands are as follows:
df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0) df.head()
The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):
Next, we will execute the ipywidgets command:
from ipywidgets import interact, interact_manual @interact def show_DEGs(column1=['fc_6h','fc_1d','fc_3d', 'fc_7d'], x=(1,3,0.1),column2=['qval_6h','qval_1d','qval_3d','qval_7d']): return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))
The above code may seem intimidating, but you can break this down into bite sizes.
First, we import the tools from ipywidgets required to build the interactive features. The @interact decorator automatically creates a text box and a slider for choosing values from a designated column and number.
In this case, we created 2 columns (column1 and column2) to represent values in fold-change and adjusted p-values (q-values) respectively. The function x=(1,3,0.1) means that column1 values (which is fc values in this case) can be changed from 1 to 3, at intervals of 0.1. Finally, we will then print out the total number of DEGs (includes both up and downregulated DEGs) more than the value of x, and with a q-value < 0.05.
The output is as follows:
I have taken multiple snapshots so that we can appreciate the number of DEGs across different days. In this case, the total number of DEGs for each time-point with fold-change > 2 and adjusted p-values < 0.05 are shown. Consistent with the volcano plot, we see that the greatest number of DEGs are in day 1.
We can change the x value to 1.5 to visualise the number of DEGs with fold-change value > 1.5, adjusted p-values < 0.05. The output is as follows:
As expected, we see more DEGs as the fold-change cut-off is less stringent. Note that the number of DEGs in day 1 increased by approximately 3-fold when we reduced the fold-change value from 2 to 1.5, which is quite a big difference! It is thus more reasonable to use a fold-change of 1.5, in this case, to avoid losing too much data.
After we decide that a fold-change value of 1.5 and q-value < 0.05 is most appropriate for our analysis, we will plot these details in a stacked bar. We first calculate the number of upregulated (fold-change > 1.5, adjusted p-value < 0.05) and number of downregulated (fold-change < -1.5, adjusted p-value < 0.05) for each time-point. The commands are as shown:
DEGs_up_6h = len((df[(df['fc_6h'] > 1.5) & (df['qval_6h'] < 0.05)]).index) DEGs_down_6h = len((df[(df['fc_6h'] < -1.5) & (df['qval_6h'] < 0.05)]).index) DEGs_up_1d = len((df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index) DEGs_down_1d = len((df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index) DEGs_up_3d = len((df[(df['fc_3d'] > 1.5) & (df['qval_3d'] < 0.05)]).index) DEGs_down_3d = len((df[(df['fc_3d'] < -1.5) & (df['qval_3d'] < 0.05)]).index) DEGs_up_7d = len((df[(df['fc_7d'] > 1.5) & (df['qval_7d'] < 0.05)]).index) DEGs_down_7d = len((df[(df['fc_7d'] < -1.5) & (df['qval_7d'] < 0.05)]).index)
Finally, we can plot the results in a stacked bar format using the Plotly graphing library, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:
days = ['6hrs', 'day 1', 'day 3', 'day 7'] fig = go.Figure(data=[ go.Bar(name='downregulated', x=days, y=[DEGs_down_6h, DEGs_down_1d, DEGs_down_3d, DEGs_down_7d]), go.Bar(name='upregulated', x=days, y=[DEGs_up_6h, DEGs_up_1d, DEGs_up_3d, DEGs_up_7d]) ]) fig.update_layout( barmode='stack', title = 'DEGs in seronegative subjects', yaxis_title='Number of DEGs', font=dict( family='Arial', size=18)) fig.show()
I have done a few more customisations in the graph, including adding a title, y-axis titles and changing the font to make the graph look more professional. The output is as shown:
At one glance, you can quickly appreciate that day 1 has the most DEGs as compared to the other time points, and there are more upregulated compared to downregulated DEGs. You can also mouse over the data points to obtain the precise number of up and downregulated DEGs at every time point.
As mentioned previously, I have highlighted why volcano plots are so important in omics research. A full article summarising the main points and rationale are as described here. In this entry, we will explore how we can use Plotly to build volcano plots!
We will analyse a transcriptomics dataset published by Zak et al., PNAS, 2012. In this study, seronegative and seropositive subjects were given the MRKAd5/HIV vaccine, and the transcriptomic responses in peripheral blood mononuclear cells (PBMC) were measured at 6 hours, 1 day, 3 day and 7 days post-vaccination. On my end, I have processed and compiled the fold-change, ratio , p-values and adjusted p-values (q-values) of the seronegative subjects with respective to time = 0 using the Partek Genomics Suite, and the processed data is available in GitHub.
We first load the required packages (pandas, numpy and plotly) to plot our volcano plot using Python. Note that you will need to download the packages before you can start using them. I have chosen to use the Plotly graphing library for data visualisation, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:
import numpy as np import pandas as pd import plotly.graph_objects as go import plotly.express as px
Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can reference the specific points to gene names. Commands are as follows:
df = pd.read_csv(‘https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0) df.head()
The output file shows the values of p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1 day, 3 day and 7 day time-points compared to baseline (timepoint = 0):
The next step will be to create new columns for log2FC and -log10(adjusted p-value). The commands are as follows:
df[‘log2FC_6h’] = np.log2(df[‘ratio_6h’]) df[‘log2FC_1d’] = np.log2(df[‘ratio_1d’]) df[‘log2FC_3d’] = np.log2(df[‘ratio_3d’]) df[‘log2FC_7d’] = np.log2(df[‘ratio_7d’]) df[‘negative_log_pval_6h’] = np.log10(df[‘qval_6h’]) * (-1) df[‘negative_log_pval_1d’] = np.log10(df[‘qval_1d’]) * (-1) df[‘negative_log_pval_3d’] = np.log10(df[‘qval_3d’]) * (-1) df[‘negative_log_pval_7d’] = np.log10(df[‘qval_7d’]) * (-1)
Now we are ready to plot the volcano plots for the different time-points. The advantage of using Python is that we can overlay the volcano plots of the different time-points all in one graph. The commands are as follows:
fig = go.Figure() trace1 = go.Scatter( x=df[‘log2FC_6h’], y=df[‘negative_log_pval_6h’], mode=’markers’, name=’6hrs’, hovertext=list(df.index) ) trace2 = go.Scatter( x=df[‘log2FC_1d’], y=df[‘negative_log_pval_1d’], mode=’markers’, name=’day 1', hovertext=list(df.index) ) trace3 = go.Scatter( x=df[‘log2FC_3d’], y=df[‘negative_log_pval_3d’], mode=’markers’, name=’day 3', hovertext=list(df.index) ) trace4 = go.Scatter( x=df[‘log2FC_7d’], y=df[‘negative_log_pval_7d’], mode=’markers’, name=’day 7', hovertext=list(df.index) ) fig.add_trace(trace1) fig.add_trace(trace2) fig.add_trace(trace3) fig.add_trace(trace4) fig.update_layout(title=’Volcano plot for seronegatives’) fig.show()
I will give a brief description of the above code. We first tell Plotly that we are going to plot a figure using the go.Figure() command. Next, we overlay the different scatterplots using the different traces (one for each time-point). In each trace, we specify (i) the columns for the x and y-axis, (ii) indicate that we only want to plot points using the mode= ‘markers’, (iii) indicate the figure legends for the different traces and (iv) indicate the text labels when we hover over the data points. Under the fig.update_layout, we also added the title of the plot. Output for the graph is thus as follows:
At one glance, you can quickly appreciate that day 1 has the most differentially expressed genes as compared to the other time-points. You can also mouse over the data points to examine which of these genes are most differentially expressed. An example I have provided is CXCL10, which is one of the genes that is induced to a great extent after one day post-vaccination.
Now that we have established that day 1 is the most interesting time-point, we may want to zoom into the genes that are at the edges of the volcano plot. We can type in the below command to specifically plot the day 1 volcano plot. Note that I have added the text in the volcano plot this time (using ‘textposition’ command) so that we can quickly visualise the genes that are most differentially expressed:
fig_d1 = px.scatter(df, x=’log2FC_1d’, y=’negative_log_pval_1d’, text=df.index) fig_d1.update_traces(textposition=’top center’) fig_d1.update_layout( title_text=’Volcano plot for seronegatives (day 1)’ ) fig_d1.show()
Output file is as follows:
Now, we can quickly visualise that the top upregulated genes at day 1 include the interferon-related genes, such as IDO1, CXCL10, RSAD2, CCL2, CCL8 and LAMP3. As you can begin to appreciate, a volcano plot allows users to have a quick sensing of the data and visualisation of the most pivotal genes and proteins that are most differentially expressed.
In my previous blog entry, I covered the basics of using Streamlit to inspect basic attributes of the dataframe, including numeric and categorical variables.
In this blog entry, we will cover how we can use data visualisation tools in Streamlit for data dashboarding. The advantage of using Streamlit is that we can use Python graph packages, such as Plotly and Altair to plot interactive charts that allow users to hover over the data points to query specific data point attributes. Moreover, users can also utilise widgets or radio buttons in Streamlit plot interactive scatterplots for data dashboarding. Specifically, we will focus on how we can plot pair plots and scatterplots in Streamlit. The codes are publicly available in GitHub (saved under iris.py) and the website instance is hosted at: https://share.streamlit.io/kuanrongchan/iris_dataset/main/iris.py.
To plot pairplots, we will import seaborn. The rationale has been described previously, and we can use the codes we optimised from Jupyter Notebooks into Streamlit. We first need to load the additional packages:
import seaborn as sns
Then we plot the pairplot using the commands (with the header and subheader descriptions) below:
st.header('Visualising relationship between numeric variables') st.subheader('Pairplot analysis') g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde") g.map_lower(sns.regplot) st.pyplot(g)
Output file is as follows:
After plotting pair plots, we may want to have a close-up analysis using scatterplots. The advantage of using Streamlit is that we can use widgets to allow users plot a scatterplot of any 2 selected variables.
st.subheader('Scatterplot analysis') selected_x_var = st.selectbox('What do you want the x variable to be?', df.columns) selected_y_var = st.selectbox('What about the y?', df.columns) fig = px.scatter(df, x = df[selected_x_var], y = df[selected_y_var], color="species") st.plotly_chart(fig)
Output is as shown below. I have annotated the widgets, which allows users to select and click on any variables they desire for scatterplot analysis. You may hover over the data points to zoom into the data point attributes, as this graph was plotted using Plotly.
We can also show the correlation coefficients and the associated p-values for the scatterplots to indicate if their relationship is linearly correlated. We first add and load the required packages:
from scipy.stats import pearsonr from sklearn import linear_model, metrics from sklearn.metrics import r2_score from scipy import stats
We can then calculate the Pearson and Spearman coefficients, together with the associated p-values using the following commands:
#Correlation calculations (Pearson) st.subheader("Pearson Correlation") def calc_corr(selected_x_var, selected_y_var): corr, p_val = stats.pearsonr(selected_x_var, selected_y_var) return corr, p_val x = df[selected_x_var].to_numpy() y = df[selected_y_var].to_numpy() correlation, corr_p_val = calc_corr(x, y) st.write('Pearson correlation coefficient: %.3f' % correlation) st.write('p value: %.3f' % corr_p_val) #Correlation calculations (Spearman) st.subheader("Spearman Correlation") def calc_corr(selected_x_var, selected_y_var): corr, p_val = stats.spearmanr(selected_x_var, selected_y_var) return corr, p_val x = df[selected_x_var].to_numpy() y = df[selected_y_var].to_numpy() correlation, corr_p_val = calc_corr(x, y) st.write('Spearman correlation coefficient: %.3f' % correlation) st.write('p value: %.3f' % corr_p_val)
Output is as follows:
Pearson correlation coefficient: -0.109
p value: 0.183
Spearman correlation coefficient: -0.159
p value: 0.051
Hence, sepal length and sepal width are not linearly correlated.
As you may appreciate, the Streamlit commands allow you to organise the data by adding headers and subheaders. In this addition, they provide the tools that allow users to interactively explore their dataset, preventing the need to plot every pairwise permutations for scatterplots. The outcome is a neat and professional dashboard that can be readily deployed and shared with others.
It’s a great sense of achievement to know that you have finished analysing your dataset. However, it is almost always a good habit to check through your data analysis and at times, even re-analyse the data in a different way to understand the data better and make the data analysis more outstanding. In this blog entry, I will highlight some key considerations that could be taken into account when checking through your data analysis.
1) Could your variables be categorised differently, or expressed as continuous variables for data analysis?
In some instances, the explanatory variable can be expressed as a categorical variable or a continuous variable. If that is indeed the case, I would recommend analysing the data both ways. For example, consider a research question that studies the effect of age on vaccine antibody response. Based on literature, you may have classified subjects into two groups: (I) elderly subjects as 65 years of age and above; (II) young subjects as lower than 65 years of age. The antibody responses are then compared in these 2 groups. However, the analysis can also be done in a different way. You can instead plot a correlation plot of antibody response against age, and evaluate if the correlation is significant. While both analyses methods ask the research question in a slightly different way (see figure above for better clarity), where the former investigates differences between young and elderly, and the latter investigates whether antibody responses are correlated with age, it is usually worth doing both analyses to have a deeper understanding of your data.
2) Are the cutoff values in your data analysis justified?
We often choose default values as cutoff values for assigning categories, or filtering of data. However, the default values may not be applicable for every dataset. This is where a deeper knowledge of the analysis tools will be helpful, as you will be able to better appreciate the assumptions applied in the analyses. For instance, let’s consider the scenario where you are interested to find out the molecular pathways regulated by virus infection. You decide to choose a 1.5 fold-change and 0.05 p-value cutoff to identify the differentially expressed genes for pathway analysis. However, have you wondered why 1.5 fold-change was chosen? What happens if you picked a cutoff of 2 fold-change? To better understand the cutoff values and how they may affect data analysis, it is usually wise to analyse the results using different cutoff values and eventually, justify why the final chosen cutoff is most suitable for your analysis.
3) How did you manage the missing values and outliers? Can we do better?
Different individuals may manage missing values and outliers differently, which may impact interpretation of data. Hence, it may be a good practice to use different ways to handle these missing values and outliers to see how these different methods impact data analysis. Consider the situation where there is at least one missing value for different explanatory variables in 50% of the samples. If the solution to handle missing data is to remove any sample with missing data, then 50% of the data is removed. Compared to another method which imputes a mean value, the data interpretation could be very different. If unsure, it is usually a good habit to try these different strategies and then evaluate the best strategy to best represent the data.
4) Besides the primary research goal, what more can we learn more from the data?
As data scientists, we are often too engrossed with our research interest. Our interpretation of the data could thus be biased towards what we are interested to “see.” While this is may be a reasonable approach to directly answer research objectives, it may be also important to look out for other trends within the data. The reason is because some of these trends could allow us to gain new biological insights, or allow us to understand any potential confounding variables that could be involved. For instance, in a dataset, you may find that BMI is associated with increase severity of disease. However, with more exploratory analyses, you may also find that the high BMI subjects are also the elderly individuals. From this additional finding, the interpretation of the data may not be solely that BMI has an influence on disease severity, but the disease severity could be influenced by age as well. To separate effects of both BMI and age, further statistical analysis that strives to reduce the impact of confounding variables is required.
5) Look beyond statistical significance for biological meaning
As scientists, we have been trained to look at p-values. Essentially, p-value < 0.05 are deemed as significant and we can reject the null hypothesis. However, in biology, sometimes it is pertinent to look beyond statistical significance. Magnitude changes are also important. Consider a study where a researcher showed that eating drug X can significantly reduce weight loss by 1kg. Consider another drug Y which causes a mean 10kg loss, but with weak significance (p>0.05). Of the two drugs, drug Y could be more biologically meaningful in reducing weight as compared to drug X, despite not reaching significance.
In summary, inspecting and cleaning data should take up the most amount of time in data analysis, but the effort is worth it because it ensures that the downstream interpretation of the data is more complete. This is particularly true for machine learning, where the data input will consequently impact the machine learning models. From the above 5 points, I hope that I have convinced you the importance of checking and revisiting your data, before saying your analysis is “done.” Give it a shot, I am confident that you will find your time spent worthwhile.
Rather than going through Streamlit commands one-by-one, I will instead illustrate the functionality of the different commands using my favourite iris dataset which is readily available in PlotlyExpress. The GitHub repository is publicly available, and the output of the codes will be hosted here.
First, we will create a file called iris.py using Sublime text or VisualStudio. In this blog entry, we will focus on acquiring the basic statistics of the individual columns, and present this information in a data dashboard for project showcasing. As with all Python commands, we need to import the required packages:
import streamlit as st import numpy as np import pandas as pd import plotly.express as px from wordcloud import WordCloud from typing import Any, List, Tuple
After that, we type in the commands needed to read the iris dataframe. The data is available in PlotlyExpress and can be loaded into Streamlit with the following commands. The title of the dataframe is also included for data dashboarding:
st.title('Data analysis of iris dataset from PlotlyExpress') df = px.data.iris()
Next, the basic stats such as counts, mean, standard deviation, min, max and quantiles can be displayed using the command df.describe(). However, the advantage of Streamlit is the ability to add widgets, thus preventing the dashboard from looking too cluttered. In this example, we will look into creating widgets for the variables in the header columns for users to determine the basic stats in the specific columns:
col1, col2, col3, col4, col5 = st.columns(5) col1.metric("Mean", column.mean()) col2.metric("Max", column.max()) col3.metric("Min", column.min()) col4.metric("Std", column.std()) col5.metric("Count", int(column.count()))
The output will generate 5 columns, showing the mean, max, min, standard deviation and counts in each respective columns. The widget can be visualised here.
These statistics are appropriate for numeric or continuous variables. However, to visualise the categorical variables (in this case, the flower species), we can use a word cloud or a table to know the exact counts for each species using the following commands:
# Build wordcloud non_num_cols = df.select_dtypes(include=object).columns column = st.selectbox("Select column to generate wordcloud", non_num_cols) column = df[column] wc = WordCloud(max_font_size=25, background_color="white", repeat=True, height=500, width=800).generate(' '.join(column.unique())) st.image(wc.to_image()) # Build table st.markdown('***Counts in selected column for generating WordCloud***') unique_values = column.value_counts() st.write(unique_values)
The aim for part I is to provide various solutions to inspect the distribution of your categorical and continuous variables within your dataset. We will slowly cover the other topics including data visualisation, machine learning and interactive graphical plots in my subsequent posts!