Posted in Data visualisation, python

Using Seaborn library to plot clustergrams

Unsupervised hierarchical clustering can also allow direct visualisation of clusters without the need for dimensionality reduction. The full article can be found here.

There are two common strategies for data normalisation. One way is to plot the heatmap or clustergram based on log2 fold-change instead of absolute abundance. The comparisons can be performed against time-point = 0 (baseline) for temporal studies, or against a control/placebo experiment for static studies. However, the disadvantage of this method is that the distribution of the baseline or control samples cannot be easily visualised.

Another alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation scales the expression values of each gene or protein across all conditions to a mean of 0 and a standard deviation of 1, enabling users to easily compare expression values across multiple genes and proteins. Values above 0 mean that the gene is expressed above its mean expression level, whereas values below 0 mean that it is expressed below its mean expression level. However, the limitation is that Z-scores can only be compared within a single experiment. Hence, the choice of either method for data normalisation is highly dependent on the research question and objectives.
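
As a rough illustration of the Z-score transformation described above, the short sketch below standardises each gene (row) of a small, made-up expression matrix with pandas; the dataframe expr and all of its values are purely hypothetical:

import pandas as pd

# Hypothetical toy matrix purely for illustration: genes as rows, conditions as columns
expr = pd.DataFrame({'cond_A': [10.0, 5.0, 2.0],
                     'cond_B': [12.0, 4.0, 8.0],
                     'cond_C': [14.0, 3.0, 5.0]},
                    index=['gene1', 'gene2', 'gene3'])

# Z-score each gene (row): subtract the row mean and divide by the row standard deviation
z = expr.sub(expr.mean(axis=1), axis=0).div(expr.std(axis=1), axis=0)
print(z.round(2))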

Before jumping straight into plotting heatmaps or clustergrams for your data, it is generally a good idea to first filter the dataset, especially when you have a large number of genes. The reason is that in most scenarios, the large majority of genes or proteins remain unchanged, which impairs the ability of unsupervised hierarchical clustering to separate gene expression profiles based on your experimental conditions. To circumvent this limitation, you can choose to filter the dataset based on differentially expressed genes or enriched biological pathways before plotting the heatmaps or clustergrams.

To plot heatmaps and clustergrams using Python, we first load the required packages. In this blog entry, we will be using the Seaborn library to plot the heatmaps and clustergrams:

import numpy as np
import pandas as pd
import seaborn as sns

Similar to my previous blog entries, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed with Partek Genomics Suite, can be found on GitHub. The fold changes, ratios, p-values and adjusted p-values (q-values) are calculated with respect to baseline (timepoint = 0).

We will load and inspect the processed dataframe from GitHub. It is important to label the gene column as the index column so that gene names can be referred to in the clustergram or heatmap. The commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

As described above, it is important to normalise the dataset to ensure that the relative expression is comparable between different genes. Here, we will use log2 fold-change for normalisation and create log2 fold-change (log2FC) columns in the dataframe:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])

There are a total of 17,623 genes that are measured. To visualise the comparisons between time-points better, we will filter the dataset. Since we have previously ascertained that day 1 has the most differentially expressed genes (DEGs), we could filter the dataset based on upregulated DEGs (with fold-change > 1.5, adjusted p-value < 0.05). The filtered dataframe is saved under DEGs_up_1d. Since we are interested in plotting the log2-fold change values, we will select the log2FC columns and remove all other columns. The code is as follows:

DEGs_up_1d = df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]
DEGs_up_1d = DEGs_up_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])

To plot the clustergram, the code is as follows:

g = sns.clustermap(DEGs_up_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)
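
Before going through the settings, a quick aside: if you want to save the clustergram to a file for publications or slides, the ClusterGrid object returned by sns.clustermap has a savefig method (the filename and resolution below are just placeholders):

g.savefig('clustergram_up_1d.png', dpi=300)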

There are a few Seaborn settings displayed in the code, and the documentation can be found here. The 'vlag' colour map was chosen because it gives a heatmap where increased expression is red and reduced expression is blue. Note that I have also assigned maximum and minimum values of 2 and -2 respectively, to ensure that log2FC = 0 is white (signifying no change). yticklabels=False was used because it is near impossible to display all 826 gene names in the clustergram. The output is as shown:

As expected, the day 1 signatures are the most distinct compared to other time-points. The 6 hour and day 7 signatures are clustered together, showing that these time-points have little or no changes in gene expression profile. Interestingly, some of the DEGs have prolonged expression up to day 3, while others resolve very quickly. Do we see the same trends for the downregulated DEGs? Let’s test it out with the following command:

DEGs_down_1d = df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]
DEGs_down_1d = DEGs_down_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])
g = sns.clustermap(DEGs_down_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)

Similar patterns can be seen. However, unlike upregulated DEGs where some DEGs persisted to day 3, most of the downregulated DEGs returned to baseline levels at day 3.

It’s not so hard, is it? 🙂

Posted in Data visualisation, IPyWidgets, python

Identifying differentially expressed genes with ipywidgets

Gene expression (or transcriptomics) profiling is the most common type of omics data. The identification of differentially expressed genes (DEGs) from transcriptomics data is critical to understanding the molecular driving forces or identifying the molecular biomarkers behind biological phenotypes. The full article on DEGs is summarised here.

Since we need to find out the most appropriate cut-off to identify DEGs, interactive Python code that allows users to manipulate thresholds is the most suitable way to define the selection criteria for DEGs. In this blog entry, we will use IPython widgets (ipywidgets), which can generate sliders that allow users to set cutoffs or thresholds interactively. The output based on the selected cutoffs is then generated in real time for users to evaluate their suitability.

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.

The summary of the study can be found here, and the processed dataset, which was analysed with Partek Genomics Suite, can be found on GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

To install ipywidgets, simply type in the below command (you only need to do this once):

pip install ipywidgets

We can execute the ipywidgets within JupyterLab or Jupyter Notebooks. The instructions for downloading them can be found here. Next, we import the packages required for analysis:

import numpy as np
import pandas as pd
import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objects as go

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can refer to the column for DEG counts. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

Next, we will execute the ipywidgets command:

from ipywidgets import interact, interact_manual
 
@interact
def show_DEGs(column1=['fc_6h','fc_1d','fc_3d','fc_7d'], x=(1,3,0.1), column2=['qval_6h','qval_1d','qval_3d','qval_7d']):
    return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))

The above code may seem intimidating, but we can break it down into bite-sized pieces.

First, we import the tools from ipywidgets required to build the interactive features. The @interact decorator automatically creates dropdown menus for choosing the columns and a slider for choosing the numeric cutoff.

In this case, we created two arguments (column1 and column2) to select the fold-change and adjusted p-value (q-value) columns respectively. The argument x=(1,3,0.1) creates a slider that lets the fold-change cutoff vary from 1 to 3 in steps of 0.1. Finally, we print out the total number of DEGs (both up- and downregulated) with an absolute fold-change greater than x and a q-value < 0.05.

The output is as follows:

I have taken multiple snapshots so that we can appreciate the number of DEGs across the different days. In this case, the total number of DEGs for each time-point with fold-change > 2 and adjusted p-value < 0.05 is shown. Consistent with the volcano plot, we see that the greatest number of DEGs is at day 1.

We can change the x value to 1.5 to visualise the number of DEGs with fold-change value > 1.5, adjusted p-values < 0.05. The output is as follows:

As expected, we see more DEGs as the fold-change cut-off is less stringent. Note that the number of DEGs in day 1 increased by approximately 3-fold when we reduced the fold-change value from 2 to 1.5, which is quite a big difference! It is thus more reasonable to use a fold-change of 1.5, in this case, to avoid losing too much data.
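
One optional refinement, not part of the original workflow: because @interact re-runs the function every time the slider moves, the interact_manual decorator (which we imported earlier but did not use) only recomputes when a button is clicked, which can be handy for slower queries. A minimal sketch:

@interact_manual
def show_DEGs_manual(column1=['fc_6h','fc_1d','fc_3d','fc_7d'], x=(1,3,0.1), column2=['qval_6h','qval_1d','qval_3d','qval_7d']):
    # Same query as above, but only evaluated when the "Run Interact" button is pressed
    return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))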

After we decide that a fold-change value of 1.5 and q-value < 0.05 is most appropriate for our analysis, we will plot these details in a stacked bar. We first calculate the number of upregulated (fold-change > 1.5, adjusted p-value < 0.05) and number of downregulated (fold-change < -1.5, adjusted p-value < 0.05) for each time-point. The commands are as shown:

DEGs_up_6h = len((df[(df['fc_6h'] > 1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_down_6h = len((df[(df['fc_6h'] < -1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_up_1d = len((df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_down_1d = len((df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_up_3d = len((df[(df['fc_3d'] > 1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_down_3d = len((df[(df['fc_3d'] < -1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_up_7d = len((df[(df['fc_7d'] > 1.5) & (df['qval_7d'] < 0.05)]).index)
DEGs_down_7d = len((df[(df['fc_7d'] < -1.5) & (df['qval_7d'] < 0.05)]).index)
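
As an optional aside, the eight lines above can also be condensed into a loop; the sketch below computes the same counts, although the individual variables above are kept because the plotting step that follows uses them:

# Optional, equivalent refactor: count up- and downregulated DEGs per time-point in a loop
DEG_counts = {}
for tp in ['6h', '1d', '3d', '7d']:
    sig = df['qval_' + tp] < 0.05
    DEG_counts['up_' + tp] = len(df[(df['fc_' + tp] > 1.5) & sig])
    DEG_counts['down_' + tp] = len(df[(df['fc_' + tp] < -1.5) & sig])
print(DEG_counts)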

Finally, we can plot the results in a stacked bar format using the Plotly graphing library, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

days = ['6hrs', 'day 1', 'day 3', 'day 7']
fig = go.Figure(data=[
    go.Bar(name='downregulated', x=days, y=[DEGs_down_6h, DEGs_down_1d, DEGs_down_3d, DEGs_down_7d]),
    go.Bar(name='upregulated', x=days, y=[DEGs_up_6h, DEGs_up_1d, DEGs_up_3d, DEGs_up_7d])
])
fig.update_layout(
    barmode='stack',
    title='DEGs in seronegative subjects',
    yaxis_title='Number of DEGs',
    font=dict(family='Arial', size=18)
)
fig.show()

I have done a few more customisations to the graph, including adding a title and a y-axis title, and changing the font to make the graph look more professional. The output is as shown:

At one glance, you can quickly appreciate that day 1 has the most DEGs compared to the other time points, and that there are more upregulated than downregulated DEGs. You can also mouse over the data points to obtain the precise number of up- and downregulated DEGs at every time point.

Posted in Data visualisation, python

Plotting volcano plots with Plotly

As mentioned previously, I have highlighted why volcano plots are so important in omics research. A full article summarising the main points and rationale can be found here. In this entry, we will explore how we can use Plotly to build volcano plots!

We will analyse a transcriptomics dataset published by Zak et al., PNAS, 2012. In this study, seronegative and seropositive subjects were given the MRKAd5/HIV vaccine, and the transcriptomic responses in peripheral blood mononuclear cells (PBMC) were measured at 6 hours, 1 day, 3 days and 7 days post-vaccination. On my end, I have processed and compiled the fold-change, ratio, p-values and adjusted p-values (q-values) of the seronegative subjects with respect to time = 0 using the Partek Genomics Suite, and the processed data is available on GitHub.

We first load the required packages (pandas, numpy and plotly) to plot our volcano plot using Python. Note that you will need to download the packages before you can start using them. I have chosen to use the Plotly graphing library for data visualisation, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can link specific data points to gene names. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1 day, 3 day and 7 day time-points compared to baseline (timepoint = 0):

The next step will be to create new columns for log2FC and -log10(adjusted p-value). The commands are as follows:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])
df['negative_log_pval_6h'] = -np.log10(df['qval_6h'])
df['negative_log_pval_1d'] = -np.log10(df['qval_1d'])
df['negative_log_pval_3d'] = -np.log10(df['qval_3d'])
df['negative_log_pval_7d'] = -np.log10(df['qval_7d'])

Now we are ready to plot the volcano plots for the different time-points. The advantage of using Python is that we can overlay the volcano plots of the different time-points all in one graph. The commands are as follows:

fig = go.Figure()
trace1 = go.Scatter(
    x=df['log2FC_6h'],
    y=df['negative_log_pval_6h'],
    mode='markers',
    name='6hrs',
    hovertext=list(df.index)
)
trace2 = go.Scatter(
    x=df['log2FC_1d'],
    y=df['negative_log_pval_1d'],
    mode='markers',
    name='day 1',
    hovertext=list(df.index)
)
trace3 = go.Scatter(
    x=df['log2FC_3d'],
    y=df['negative_log_pval_3d'],
    mode='markers',
    name='day 3',
    hovertext=list(df.index)
)
trace4 = go.Scatter(
    x=df['log2FC_7d'],
    y=df['negative_log_pval_7d'],
    mode='markers',
    name='day 7',
    hovertext=list(df.index)
)
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.add_trace(trace4)
fig.update_layout(title='Volcano plot for seronegatives')
fig.show()

I will give a brief description of the above code. We first tell Plotly that we are going to plot a figure using the go.Figure() command. Next, we overlay the different scatterplots using different traces (one for each time-point). In each trace, we specify (i) the columns for the x- and y-axes, (ii) mode='markers' so that only points are plotted, (iii) the legend name for the trace, and (iv) the text labels shown when we hover over the data points. Under fig.update_layout, we also added the title of the plot. The output for the graph is thus as follows:

At one glance, you can quickly appreciate that day 1 has the most differentially expressed genes compared to the other time-points. You can also mouse over the data points to examine which of these genes are most differentially expressed. An example I have provided is CXCL10, which is one of the genes induced to a great extent at one day post-vaccination.
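
If you would like the usual cutoffs to be visible on the plot, recent versions of Plotly can draw reference lines with add_hline and add_vline; the thresholds below (q = 0.05 and an absolute fold-change of 1.5) are simply the ones discussed elsewhere on this blog, not part of the original figure:

# Optional: dashed reference lines at q = 0.05 and an absolute fold-change of 1.5
fig.add_hline(y=-np.log10(0.05), line_dash='dash', line_color='grey')
fig.add_vline(x=np.log2(1.5), line_dash='dash', line_color='grey')
fig.add_vline(x=-np.log2(1.5), line_dash='dash', line_color='grey')
fig.show()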

Now that we have established that day 1 is the most interesting time-point, we may want to zoom into the genes that are at the edges of the volcano plot. We can type in the below command to specifically plot the day 1 volcano plot. Note that I have added gene labels to the volcano plot this time (using the textposition argument) so that we can quickly visualise the genes that are most differentially expressed:

fig_d1 = px.scatter(df, x='log2FC_1d', y='negative_log_pval_1d', text=df.index)
fig_d1.update_traces(textposition='top center')
fig_d1.update_layout(
    title_text='Volcano plot for seronegatives (day 1)'
)
fig_d1.show()

Output file is as follows:

Now, we can quickly see that the top upregulated genes at day 1 include interferon-related genes such as IDO1, CXCL10, RSAD2, CCL2, CCL8 and LAMP3. As you can begin to appreciate, a volcano plot allows users to get a quick sense of the data and visualise the genes and proteins that are most differentially expressed.

Posted in Data visualisation, python

Building interactive dashboards with Streamlit (II) – Plotting pairplots and scatterplots for numeric variables

In my previous blog entry, I covered the basics of using Streamlit to inspect basic attributes of the dataframe, including numeric and categorical variables.

In this blog entry, we will cover how we can use data visualisation tools in Streamlit for data dashboarding. The advantage of using Streamlit is that we can use Python graph packages, such as Plotly and Altair, to plot interactive charts that allow users to hover over the data points to query specific data point attributes. Moreover, users can also utilise widgets or radio buttons in Streamlit to plot interactive scatterplots for data dashboarding. Specifically, we will focus on how we can plot pair plots and scatterplots in Streamlit. The code is publicly available on GitHub (saved as iris.py) and the website instance is hosted at: https://share.streamlit.io/kuanrongchan/iris_dataset/main/iris.py.

To plot pairplots, we will import seaborn. The rationale has been described previously, and we can reuse the code we optimised in Jupyter Notebooks within Streamlit. We first need to load the additional package:

import seaborn as sns

Then we plot the pairplot using the commands (with the header and subheader descriptions) below:

st.header('Visualising relationship between numeric variables')
st.subheader('Pairplot analysis')
g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)
st.pyplot(g)

Output file is as follows:

After plotting pair plots, we may want to have a close-up analysis using scatterplots. The advantage of using Streamlit is that we can use widgets to allow users to plot a scatterplot of any two selected variables.

st.subheader('Scatterplot analysis')
selected_x_var = st.selectbox('What do you want the x variable to be?', df.columns)
selected_y_var = st.selectbox('What about the y?', df.columns)
fig = px.scatter(df, x = df[selected_x_var], y = df[selected_y_var], color="species")
st.plotly_chart(fig)

Output is as shown below. I have annotated the widgets, which allows users to select and click on any variables they desire for scatterplot analysis. You may hover over the data points to zoom into the data point attributes, as this graph was plotted using Plotly.

We can also show the correlation coefficients and the associated p-values for the scatterplots to indicate if their relationship is linearly correlated. We first add and load the required packages:

from scipy import stats

We can then calculate the Pearson and Spearman coefficients, together with the associated p-values using the following commands:

#Correlation calculations (Pearson)
st.subheader("Pearson Correlation")
def calc_pearson(x_vals, y_vals):
    corr, p_val = stats.pearsonr(x_vals, y_vals)
    return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_pearson(x, y)
st.write('Pearson correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)
#Correlation calculations (Spearman)
st.subheader("Spearman Correlation")
def calc_spearman(x_vals, y_vals):
    corr, p_val = stats.spearmanr(x_vals, y_vals)
    return corr, p_val
correlation, corr_p_val = calc_spearman(x, y)
st.write('Spearman correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)

Output is as follows:

Pearson Correlation

Pearson correlation coefficient: -0.109

p value: 0.183

Spearman Correlation

Spearman correlation coefficient: -0.159

p value: 0.051

Hence, sepal length and sepal width are not linearly correlated.

As you may appreciate, the Streamlit commands allow you to organise the data by adding headers and subheaders. In addition, they provide tools that allow users to interactively explore their dataset, removing the need to plot every pairwise permutation of scatterplots. The outcome is a neat and professional dashboard that can be readily deployed and shared with others.

Posted in Data visualisation, python

So you think you have finished analysing your data? Think again…

It’s a great sense of achievement to know that you have finished analysing your dataset. However, it is almost always a good habit to check through your data analysis and, at times, even re-analyse the data in a different way to understand the data better and make the analysis more compelling. In this blog entry, I will highlight some key considerations that could be taken into account when checking through your data analysis.

1) Could your variables be categorised differently, or expressed as continuous variables for data analysis?

Two different ways of showing association of age with antibody responses

In some instances, the explanatory variable can be expressed as a categorical variable or a continuous variable. If that is indeed the case, I would recommend analysing the data both ways. For example, consider a research question that studies the effect of age on vaccine antibody response. Based on the literature, you may have classified subjects into two groups: (I) elderly subjects, 65 years of age and above; (II) young subjects, below 65 years of age. The antibody responses are then compared between these two groups. However, the analysis can also be done in a different way. You can instead plot a correlation plot of antibody response against age, and evaluate whether the correlation is significant. While the two analysis methods ask the research question in slightly different ways (see figure above for clarity), with the former investigating differences between young and elderly subjects and the latter investigating whether antibody responses are correlated with age, it is usually worth doing both analyses to gain a deeper understanding of your data.
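
To make the two framings concrete, here is a rough sketch with a made-up toy dataframe (the column names age and antibody_titre and all of the values are hypothetical), run once as a group comparison and once as a correlation:

import pandas as pd
from scipy import stats

# Made-up toy data purely for illustration
df_ab = pd.DataFrame({'age': [25, 32, 47, 55, 61, 66, 70, 74, 79, 83],
                      'antibody_titre': [320, 410, 280, 300, 250, 190, 210, 160, 150, 120]})

# (I) Categorical framing: compare elderly (>= 65 years) against young (< 65 years)
elderly = df_ab.loc[df_ab['age'] >= 65, 'antibody_titre']
young = df_ab.loc[df_ab['age'] < 65, 'antibody_titre']
t_stat, p_group = stats.ttest_ind(young, elderly)

# (II) Continuous framing: correlate antibody titre directly with age
r, p_corr = stats.pearsonr(df_ab['age'], df_ab['antibody_titre'])

print('Group comparison p-value: %.3f' % p_group)
print('Pearson r: %.2f, p-value: %.3f' % (r, p_corr))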

2) Are the cutoff values in your data analysis justified?

What cutoff should you try?

We often choose default values as cutoffs for assigning categories or filtering data. However, the default values may not be applicable to every dataset. This is where a deeper knowledge of the analysis tools is helpful, as you will be able to better appreciate the assumptions applied in the analyses. For instance, let’s consider the scenario where you are interested in finding out the molecular pathways regulated by virus infection. You decide to choose a 1.5 fold-change and 0.05 p-value cutoff to identify the differentially expressed genes for pathway analysis. However, have you wondered why a 1.5 fold-change was chosen? What happens if you pick a cutoff of 2-fold instead? To better understand the cutoff values and how they may affect data analysis, it is usually wise to analyse the results using different cutoff values and, eventually, justify why the final chosen cutoff is most suitable for your analysis.
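
A quick way to gauge how sensitive the downstream analysis is to this choice is to tabulate the number of differentially expressed genes at a few candidate cutoffs; the sketch below assumes a dataframe df with hypothetical 'fc' (fold-change) and 'qval' (adjusted p-value) columns, similar to the datasets used elsewhere on this blog:

# Illustrative sketch: count DEGs at several fold-change cutoffs with q < 0.05 fixed
for cutoff in [1.5, 2.0, 2.5]:
    n_degs = len(df[(abs(df['fc']) > cutoff) & (df['qval'] < 0.05)])
    print('fold-change > %s: %d DEGs' % (cutoff, n_degs))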

3) How did you manage the missing values and outliers? Can we do better?

Handling missing data and outliers is important to ensure data interpretation is sound

Different individuals may manage missing values and outliers differently, which may impact the interpretation of the data. Hence, it may be good practice to try different ways of handling these missing values and outliers to see how the different methods impact data analysis. Consider the situation where there is at least one missing value for different explanatory variables in 50% of the samples. If the solution to handling missing data is to remove any sample with missing data, then 50% of the data is removed. Compared with another method that imputes a mean value, the data interpretation could be very different. If unsure, it is usually a good habit to try these different strategies and then evaluate which strategy best represents the data.
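
As a minimal illustration of the two strategies mentioned above (complete-case removal versus mean imputation), with a made-up dataframe in pandas:

import numpy as np
import pandas as pd

# Made-up dataframe with missing values, purely for illustration
data = pd.DataFrame({'var1': [1.0, 2.0, np.nan, 4.0],
                     'var2': [10.0, np.nan, 30.0, 40.0]})

# Strategy 1: drop any sample (row) that has a missing value
complete_cases = data.dropna()

# Strategy 2: impute missing values with the column mean
mean_imputed = data.fillna(data.mean())

print(len(complete_cases), 'samples retained after complete-case removal')
print(len(mean_imputed), 'samples retained after mean imputation')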

4) Besides the primary research goal, what more can we learn from the data?

Sometimes, being too focused on what you are chasing may cause you to miss other important knowledge that the data provides

As data scientists, we are often too engrossed with our research interest. Our interpretation of the data could thus be biased towards what we are interested to “see.” While this may be a reasonable approach to directly answer the research objectives, it is also important to look out for other trends within the data. The reason is that some of these trends could allow us to gain new biological insights, or allow us to understand any potential confounding variables that could be involved. For instance, in a dataset, you may find that BMI is associated with increased severity of disease. However, with more exploratory analyses, you may also find that the high-BMI subjects are also the elderly individuals. From this additional finding, the interpretation may not be solely that BMI has an influence on disease severity; disease severity could be influenced by age as well. To separate the effects of BMI and age, further statistical analysis that strives to reduce the impact of confounding variables is required.

5) Look beyond statistical significance for biological meaning

As scientists, we have been trained to look at p-values. Essentially, p-values < 0.05 are deemed significant and we can reject the null hypothesis. However, in biology, sometimes it is pertinent to look beyond statistical significance. Magnitude changes are also important. Consider a study where a researcher showed that taking drug X significantly reduces weight by 1 kg. Consider another drug Y, which causes a mean 10 kg loss, but with weak significance (p > 0.05). Of the two drugs, drug Y could be more biologically meaningful in reducing weight than drug X, despite not reaching statistical significance.

In summary, inspecting and cleaning data should take up the largest amount of time in data analysis, but the effort is worth it because it ensures that the downstream interpretation of the data is more complete. This is particularly true for machine learning, where the data input will consequently impact the machine learning models. From the above 5 points, I hope that I have convinced you of the importance of checking and revisiting your data before saying your analysis is “done.” Give it a shot; I am confident that you will find your time spent worthwhile.

Posted in Data visualisation, python

Making interactive and executable notebooks/dashboards with Lux and Voilà packages

As described in my previous blog posts, pair plots and correlation matrices are useful for quickly analysing relationships between multiple variables. However, when analysing datasets with more than 10-20 variables, these analysis methods have their limitations. One critical limitation is the difficulty of assessing the effects of 2 or more categorical variables. Another is the difficulty of visualising a large number of variables plotted altogether in one plot.

Here, I propose two Python packages that will make graph plotting less tedious. The first is the Voilà package, which supports Jupyter interactive widgets, and the second is the Lux package, which is integrated with a Jupyter interactive widget that allows users to quickly browse through large collections of data directly within their Jupyter notebooks. These packages are useful for displaying data interactively, which makes them highly versatile for fast experimentation with data. I will share the features of these two packages as I run through the steps for using them:

First, we need to install the packages. To install Voilà, execute the following command:

pip install voila

Note that Voilà will also be installed automatically as an extension if you have version 3 of JupyterLab installed:

Next, to install Lux, enter the following command in terminal:

pip install lux-api

With these packages installed, you are now ready to use these applications. We will import all the necessary packages and use the iris dataset from Plotly Express as a proof of concept:

import csv
import lux
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df = px.data.iris()

The output file is as follows:

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Next, launch Voilà in the Jupyter Notebook extension (indicated in arrow):
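
Alternatively, if you prefer the terminal, Voilà can serve the notebook directly as a standalone dashboard (the notebook filename below is just a placeholder):

voila my_dashboard.ipynb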

This should allow you to toggle between Pandas and Lux freely (indicated in arrow):

The Lux package automatically plots pairwise relationships across all variables in your dataframe, allowing you to focus on the scatterplots that show correlated relationships. Under the “Distribution” tab, you can quickly visualise the distribution of the data:

For the correlation graph, you can click on the plots that you are interested in, and then click the magnifying glass, which indicates “intent.” This will zoom into the graph further, allowing you to see how the other variables (in this case, species) interact with your scatterplot. This immediately provides you with a 3-dimensional view of the dataset.

As you can begin to appreciate, the Voilà package is a powerful tool for serving interactive widgets directly at a local web address. You can even publish it online, allowing other users to use the interactive widgets to explore the dataset according to their research intent. Combined with the Lux package, this allows users, even those without a bioinformatics background, to zoom into the interesting scatterplots for further analysis.

Posted in Data visualisation

Visualising distributions with pie charts, donut plots and stacked graphs

Pie charts can be used to display the distribution of variables within datasets.

A pie chart uses pie slices to display the relative sizes or proportions of different categories. In general, pie charts are easier to read if the pie has no more than 7 slices.

Consider a dataset where 10 patients are infected with virus X, 6 of whom experience no symptoms. The remaining 4 patients are symptomatic: 2 of them experience headaches, 1 has fever and 1 has rashes. Here, you can simply use a pie chart to display the data:

I have used the Prism software to plot the pie chart. Note that you can use colour variants to emphasise the subjects with symptoms. It is a good habit to include the total counts when presenting pie charts, as these charts are meant to display only relative sizes. Note also that the pie chart starts at the 12 o’clock position for easy visualisation.

Sometimes, when pie charts are difficult to visualise, another option is to plot a donut plot. The advantage of a donut plot is that you can provide additional information in the middle. However, it is important to make sure that the donut is not so thin that it becomes difficult to read. An example is shown below:

Finally, sometimes it might be more useful to use a stacked bar chart to show the distribution of the data. The stacked bar chart may be more useful when you want to group variables with similar properties (in this case, you can group all the symptomatic patients together). An example is as follows:

There are no fixed rules for deciding which of these charts is most suitable for data visualisation. When in doubt, there is no harm in plotting all of these graphs and deciding later which is the most suitable for publications or presentations!
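
Although I used Prism for the charts above, similar figures can be sketched in Python with Plotly; the counts below are the ones from the virus X example, and the styling is purely illustrative:

import plotly.graph_objects as go

symptoms = ['no symptoms', 'headache', 'fever', 'rashes']
counts = [6, 2, 1, 1]

# Donut plot: a pie chart with a hole in the middle that leaves space for annotation
donut = go.Figure(go.Pie(labels=symptoms, values=counts, hole=0.5, sort=False))
donut.update_layout(title='Symptoms in patients infected with virus X (n = 10)')
donut.show()

# Stacked bar chart: group all symptomatic patients into a single bar
stacked = go.Figure([
    go.Bar(name='no symptoms', x=['asymptomatic'], y=[6]),
    go.Bar(name='headache', x=['symptomatic'], y=[2]),
    go.Bar(name='fever', x=['symptomatic'], y=[1]),
    go.Bar(name='rashes', x=['symptomatic'], y=[1]),
])
stacked.update_layout(barmode='stack', title='Symptoms in patients infected with virus X (n = 10)')
stacked.show()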

Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. The magnitude differences are usually represented as hue or colour intensity changes. However, if you want to include another parameter, this can be challenging. Imagine the scenario where you identified 5 Gene Ontology Biological Pathways (GOBP) which are significantly different between the infected and uninfected samples over a course of 3 days. To plot them on a graph, you can choose to negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you will need two figure panels to fully describe your pathway analysis. A more elegant way to display these results is to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting data points as dots rather than the rectangles used in heatmaps is that you can alter the size of the dots to add another dimension to your data visualisation. For instance, in this specific example, you can make the dot size proportional to the p-values and use the dot hue to represent the enrichment score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

We then load a ‘test’ dataset from my desktop, in a format where the columns contain time-points, pathway terms, negative logP values and combined scores. It is also a good habit to convert the timepoint column to a “string” datatype so that the x-axis does not include intermediate time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")
df.head(5)

Output is as follows:

   timepoint  Term                          adjusted_pval  Combined_score   neg_logP
0          1  Defense Response To Virus      3.942000e-25          87.531  24.404283
1          2  Defense Response To Virus      3.940000e-27         875.310  26.404283
2          3  Defense Response To Virus      5.000000e-02           2.000   1.301030
3          1  Defense Response To Symbiont   2.256000e-25          95.555  24.646661
4          2  Defense Response To Symbiont   2.260000e-27         955.550  26.646661

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)

fig.show()

The output is a dot plot, where the dot size is proportional to the -log p-value and the colour intensity is proportional to the combined score. You can customise your colours using the palettes available at this website:

Because of the ability of the dot plot to add another dimension to the analysis, most pathway analyses are presented as dot plots. However, I am sure there are other scenarios where dot plots can be appropriately used. Next time, if you decide to plot multiple heatmaps, do consider the possibility of using dot plots as an alternative data visualisation tool!

Posted in Data visualisation, pair plots, python

Visualising multidimensional data with pair plots

Pair plots can be used effectively to visualise relationships between multiple variables, which allows you to zoom into the interesting aspects of the dataset for further analysis

As discussed in my previous blog entry, correlation coefficients can measure the strength of the relationship between two variables. In general, correlation coefficients of 0.5 and above are considered strong, around 0.3 moderate and around 0.1 weak. To quickly visualise the relatedness between multiple variables, a correlation matrix can be used. However, there are some limitations. For instance, the correlation matrix does not provide information on the steepness of the slope between variables, nor does it display the relationship if the variables are not linearly correlated. In addition, the correlation matrix cannot show how categorical variables might interact with these other continuous variables.
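
For reference, a correlation matrix itself can be computed and visualised in a couple of lines; this is just a generic sketch, using the same iris dataset that is loaded later in this post:

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

# Compute pairwise correlations between the continuous variables and plot them as a heatmap
df = px.data.iris()
corr = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].corr()
sns.heatmap(corr, annot=True, cmap='vlag', vmin=-1, vmax=1)
plt.show()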

Let me provide an example to illustrate my case. Consider that you want to understand whether flower sepal length, sepal width, petal length and petal width are correlated. Plotting a correlation matrix for these 4 variables will allow you to visualise the strength of the relationships between these continuous variables. However, if you have another categorical variable, for example flower species, and you are interested in understanding how flower species interacts with all these different parameters, then merely plotting a correlation matrix is inadequate.

To get around this problem, pair plots can be used. Pair plots allow us to see the pairwise relationships and distributions of multiple continuous variables. In Python, you can even use colours to visualise how the categorical variables interact with the other continuous variables. Surprisingly, plotting pair plots in Python is relatively straightforward. Let’s use the iris dataset from Plotly Express as a proof of concept.

We first load the required packages, the iris dataset from Plotly Express, and display all the variables from the dataset with the head() function:

# Load packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load iris dataset from plotlyexpress
df = px.data.iris()
df.head()

The output file looks like this, with sepal length, sepal width, petal length and petal width as continuous variables and flower species as the categorical variable:

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

To visualise the correlation between continuous variables, the pair plot can be quickly plotted with a single line of code. In addition, we can annotate the different flower species with different colours. Finally, rather than just plotting the same scatterplot on both sides of the matrix, we will also plot the lower half of the matrix with the best fit straight line so we can visualise the slope of the relationship. The code used is thus as follows:

g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)

Output is as follows:

Isn’t it cool that you can immediately see that the variables are strongly correlated with each other, and these parameters are dependent on the flower species? As you may already appreciate, the pair plot is a useful tool to visualise and analyse multi-dimensional datasets.

Posted in Data visualisation, IPyWidgets, python

Making Jupyter notebooks interactive with IPyWidgets

There are many instances where scientists and data scientists have to identify cut-offs before deeper data analyses. For instance, what is the fold-change and adjusted p-value threshold that is most appropriate for the identification of DEGs? For pre-ranked pathway analysis, what is an appropriate cut-off for enrichment scores to be considered significant? For RNA-seq, what is the minimum count threshold we should filter by without trimming away too many transcripts? Executing these tasks can take a long time, even with coding, because you will need multiple rounds of code to identify the most suitable cut-off.

Consider that you are unsure what an appropriate cut-off for the identification of DEGs is. The usual way would probably be to query the number of DEGs identified when the fold-change is at least 1.5 with an adjusted p-value < 0.05. Imagine you identified only 30 DEGs amongst 20,000 genes. This may indicate that your cutoff is too stringent, so this time you decide to relax it to a fold-change of at least 1.2 with an adjusted p-value < 0.05. Again, you will require another line of code. Imagine this time you found 1,000 genes that satisfy the criteria, and now you think this number is too many for pathway analysis. This means you need yet another query at a fold-change of at least 1.3, and the cycle repeats until you find a suitable cutoff.

To circumvent this issue, I introduce IPython widgets (IPyWidgets), which turn Jupyter Notebooks from static documents into interactive dashboards, perfect for exploring and visualizing data. To install, simply type the command:

pip install ipywidgets

We first import the packages and load a dataset. Here, I load a dataset from Zak DE et al. from my BENEATH account. The study looks into the transcriptomic responses in human subjects across time after receiving the MRKAd5/HIV vaccine, and the details have been elaborated previously in my blog:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import beneath
df = await beneath.load_full("kuanrongchan/vaccine-repository/ad5-seroneg-daniel-2012")
df.info()
df.head(2)

The output for the header columns is as follows. fc indicates the fold-change, pval indicates the p-value with respect to day 0 and qval indicates the adjusted p-value with respect to day 0. The time-points considered are 6 hrs, 1, 3 and 7 days after vaccination.

   gene    fc006h  pval006h  qval006h    fc001d      pval001d  qval001d    fc003d  pval003d  qval003d
0  A1BG  -1.13408  0.157781  0.500370  -1.03742  6.099120e-01  0.804213  -1.13944  0.143099  0.678669
1  A1CF  -1.04199  0.383750  0.638880   1.07475  6.557900e-02  0.170118   1.01920  0.686074  0.929992

     fc007d  pval007d  qval007d                   @meta.timestamp
0  -1.04868  0.589913  0.997579  2021-07-16 00:24:39.972000+00:00
1  -1.01369  0.772500  0.997579  2021-07-16 00:24:39.972000+00:00

We will use the following command to make an interactive slider. Assuming we are fixing adjusted p-value to be less than 0.05 and want to find out the number of DEGs based on a fold-change cut-off:

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

@interact
def show_DEGs(column1=['fc006h','fc001d','fc003d','fc007d'], x=(0,3,0.1), column2=['qval006h','qval001d','qval003d','qval007d']):
    return len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index)

Output is as follows:

With the @interact decorator, the IPyWidgets library automatically gives us dropdown menus and a slider for choosing the columns and the cutoff value! In this case, we can move the slider from 0 to 3, and switch between the different time-points (6 hrs, 1, 3 and 7 days post-vaccination) to see the number of DEGs at different fold-change cut-offs. The output shown has the slider moved to 1.3 and 1.5, which yielded 1510 and 905 DEGs respectively.

You can also combine the @interact decorator with the .loc function to call up the genes and expression values that satisfy the criteria set by your slider:
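
A minimal sketch of how this could look (not necessarily the exact code behind the screenshot in the original post):

@interact
def show_DEG_table(column1=['fc001d','fc003d','fc007d'], x=(0,3,0.1), column2=['qval001d','qval003d','qval007d']):
    # Return the rows (genes) that pass the selected cut-offs, showing the chosen columns
    return df.loc[(abs(df[column1]) > x) & (df[column2] < 0.05), [column1, column2]]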

Finally, you can also use IPyWidgets to make multiple comparisons, especially using the dropdown widget. For example, if you want to quickly find out whether the transcript expression values at two time-points are correlated, you can use IPyWidgets to create dropdown menus for multiple comparisons:

@interact
def correlations(column1=list(df.select_dtypes('number').columns),
                 column2=list(df.select_dtypes('number').columns)):
    print(f"Correlation:{df[column1].corr(df[column2])}")

The output is two dropdown lists that allow you to quickly evaluate the correlation between any two numeric variables in the dataframe:

The potential of IPyWidgets is limitless: it allows you to scan through data very quickly and brings some life to an otherwise static Jupyter Notebook. You can also use it for graph plotting, scanning through files and more. The possibilities are endless… 🙂