Posted in python

Python workflow for omics data analysis

Bioinformatic Python codes for volcano plot, DEG and heatmap analysis

Thank you all for the great interest in my blog. To close the year, I have decided to compile the blog posts I have written this year so we can revise what we have learnt before starting the next year with new content. If you have followed my posts, a lot of emphasis has been placed on the analysis of omics data, including volcano plots, DEG identification and heatmap analysis. I have thus re-written some of these posts on Medium, curating the code so that readers can implement it easily. Furthermore, a concrete example dataset is provided on GitHub to give context to the code used!

For Python codes on volcano plots, please click here.

For Python codes on DEGs, please click here.

For Python codes on clustergrams and heatmaps, please click here.

All of the above visualisation tools are important before doing any downstream analysis, including pathway enrichment analysis, interaction maps, gene networks and more, which I will be covering next year!

Posted in Data visualisation, python

Using Seaborn library to plot clustergrams

Unsupervised hierarchical clustering also allows direct visualisation of clusters without the need for dimensionality reduction. The full article can be found here.

There are two common strategies for data normalisation. One is to plot the heatmap or clustergram based on log2 fold-change instead of absolute abundance. The comparisons can be performed against time-point = 0 (baseline) for temporal studies, or against a control/placebo experiment for static studies. However, the disadvantage of this method is that the distribution of the baseline or control samples cannot be easily visualised.

Another alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation scales each gene or protein across all conditions to a mean of 0 and a standard deviation of 1, enabling users to easily compare expression values across multiple genes and proteins. Values above 0 mean the gene is expressed above its mean expression, whereas values below 0 mean the gene is expressed below its mean expression. The limitation is that Z-scores can only be compared among samples obtained within a single experiment. Hence, the choice of normalisation method is highly dependent on the research question and objectives.
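
For those who prefer to see it in code, here is a minimal sketch of a Z-score transformation with pandas, assuming a hypothetical expression matrix with genes as rows and samples as columns (note that Seaborn's clustermap also accepts a z_score argument that can perform this transformation for you):

import pandas as pd

# Hypothetical expression matrix for illustration: genes as rows, samples as columns
expr = pd.DataFrame({'sample_1': [10.2, 5.1, 8.3],
                     'sample_2': [12.5, 4.8, 9.1],
                     'sample_3': [11.0, 5.5, 7.9]},
                    index=['GeneA', 'GeneB', 'GeneC'])

# Z-score each gene (row) across samples: subtract the row mean, divide by the row standard deviation
zscores = expr.sub(expr.mean(axis=1), axis=0).div(expr.std(axis=1), axis=0)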

Before jumping straight into plotting heatmaps or clustergrams for your data, it is generally a good idea to first filter the dataset, especially when you have a large number of genes. The reason is that in most scenarios the large majority of genes or proteins remain unchanged, which impairs the ability of unsupervised hierarchical clustering to separate gene expression profiles based on your experimental conditions. To circumvent this limitation, you can filter the dataset based on differentially expressed genes or enriched biological pathways before plotting the heatmaps or clustergrams.

To plot heatmaps and clustergrams using Python, we first load the required packages. In this blog entry, we will be using the Seaborn library to plot the heatmaps and clustergrams:

import numpy as np
import pandas as pd
import seaborn as sns

Similar to my previous blog entries, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed with Partek Genomics Suite, can be found on GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

We will load and inspect the processed dataframe from GitHub. It is important to label the gene column as the index column so that gene names can be referred to in the clustergram or heatmap. The commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

As described above, it is important to normalise the dataset to ensure that the relative expression is comparable between different genes. Here, we will use log2 fold-change for normalisation and create log2 fold-change (log2FC) columns in the dataframe:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])

There are a total of 17,623 genes that are measured. To visualise the comparisons between time-points better, we will filter the dataset. Since we have previously ascertained that day 1 has the most differentially expressed genes (DEGs), we could filter the dataset based on upregulated DEGs (with fold-change > 1.5, adjusted p-value < 0.05). The filtered dataframe is saved under DEGs_up_1d. Since we are interested in plotting the log2-fold change values, we will select the log2FC columns and remove all other columns. The code is as follows:

DEGs_up_1d = df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]
DEGs_up_1d = DEGs_up_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])

To plot the clustergram, the codes are as follows:

g = sns.clustermap(DEGs_up_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)

There are a few Seaborn settings used in the code, and the documentation can be found here. The ‘vlag’ colour map was chosen so that increased expression is shown in red and reduced expression in blue. Note that I have also assigned maximum and minimum values of 2 and -2 respectively, as I wanted to ensure that log2FC = 0 is white (to signify no change). yticklabels=False was set because it is near impossible to see all 826 gene names in the clustergram. The output is as shown:

As expected, the day 1 signatures are the most distinct compared to other time-points. The 6 hour and day 7 signatures are clustered together, showing that these time-points have little or no changes in gene expression profile. Interestingly, some of the DEGs have prolonged expression up to day 3, while others resolve very quickly. Do we see the same trends for the downregulated DEGs? Let’s test it out with the following command:

DEGs_down_1d = df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]
DEGs_down_1d = DEGs_down_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])
g = sns.clustermap(DEGs_down_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)

Similar patterns can be seen. However, unlike upregulated DEGs where some DEGs persisted to day 3, most of the downregulated DEGs returned to baseline levels at day 3.

It’s not so hard, isn’t it? 🙂

Posted in Data visualisation, IPyWidgets, python

Identifying differentially expressed genes with ipywidgets

Gene expression (or transcriptomics) profiling is the most common type of omics data. The identification of differentially expressed genes (DEGs) from transcriptomics data is critical to understanding the molecular driving forces or identifying the molecular biomarkers behind biological phenotypes. The full article on DEGs is fully summarised here.

Since we need to find out the most appropriate cut-off to identify DEGs, interactive Python code that allows users to manipulate thresholds is well suited for defining the selection criteria. In this blog entry, we will use IPython widgets (ipywidgets), which can generate sliders that allow users to set cutoffs or thresholds interactively. The output based on the selected cutoffs is then generated in real time for users to evaluate their suitability.

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.

The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

To install ipywidgets, simply type in the below command (you only need to do this once):

pip install ipywidgets

We can execute the ipywidgets within JupyterLab or Jupyter Notebooks. The instructions for downloading them can be found here. Next, we import the packages required for analysis:

import numpy as np
import pandas as pd
import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objects as go

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can refer to the column for DEG counts. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

Next, we will execute the ipywidgets command:

from ipywidgets import interact, interact_manual
 
@interact
def show_DEGs(column1=['fc_6h', 'fc_1d', 'fc_3d', 'fc_7d'], x=(1, 3, 0.1),
              column2=['qval_6h', 'qval_1d', 'qval_3d', 'qval_7d']):
    return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))

The above code may seem intimidating, but you can break it down into bite-sized pieces.

First, we import the tools from ipywidgets required to build the interactive features. The @interact decorator automatically creates dropdown menus for the designated columns and a slider for the numeric threshold.

In this case, we created two column arguments (column1 and column2) to represent the fold-change and adjusted p-value (q-value) columns respectively. The argument x=(1,3,0.1) creates a slider that can be moved from 1 to 3 at intervals of 0.1, which sets the fold-change threshold. Finally, we print out the total number of DEGs (both up- and downregulated) with absolute fold-change greater than x and q-value < 0.05.

The output is as follows:

I have taken multiple snapshots so that we can appreciate the number of DEGs across different days. In this case, the total number of DEGs for each time-point with fold-change > 2 and adjusted p-value < 0.05 is shown. Consistent with the volcano plot, we see that the greatest number of DEGs is at day 1.

We can change the x value to 1.5 to visualise the number of DEGs with fold-change value > 1.5, adjusted p-values < 0.05. The output is as follows:

As expected, we see more DEGs as the fold-change cut-off is less stringent. Note that the number of DEGs in day 1 increased by approximately 3-fold when we reduced the fold-change value from 2 to 1.5, which is quite a big difference! It is thus more reasonable to use a fold-change of 1.5, in this case, to avoid losing too much data.

After we decide that a fold-change cutoff of 1.5 and q-value < 0.05 is most appropriate for our analysis, we will plot these details in a stacked bar chart. We first calculate the number of upregulated (fold-change > 1.5, adjusted p-value < 0.05) and downregulated (fold-change < -1.5, adjusted p-value < 0.05) DEGs for each time-point. The commands are as shown:

DEGs_up_6h = len((df[(df['fc_6h'] > 1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_down_6h = len((df[(df['fc_6h'] < -1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_up_1d = len((df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_down_1d = len((df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_up_3d = len((df[(df['fc_3d'] > 1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_down_3d = len((df[(df['fc_3d'] < -1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_up_7d = len((df[(df['fc_7d'] > 1.5) & (df['qval_7d'] < 0.05)]).index)
DEGs_down_7d = len((df[(df['fc_7d'] < -1.5) & (df['qval_7d'] < 0.05)]).index)
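
If you prefer not to repeat the filter for every time-point, a short loop gives the same counts (a sketch assuming the same fc_* and qval_* column names as above); the resulting lists could then be passed directly to the y arguments of go.Bar below:

timepoints = ['6h', '1d', '3d', '7d']
DEGs_up, DEGs_down = [], []
for tp in timepoints:
    sig = df[df[f'qval_{tp}'] < 0.05]                   # genes significant at this time-point
    DEGs_up.append(len(sig[sig[f'fc_{tp}'] > 1.5]))     # upregulated DEGs
    DEGs_down.append(len(sig[sig[f'fc_{tp}'] < -1.5]))  # downregulated DEGs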

Finally, we can plot the results in a stacked bar format using the Plotly graphing library, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

days = ['6hrs', 'day 1', 'day 3', 'day 7']
fig = go.Figure(data=[
    go.Bar(name='downregulated', x=days, y=[DEGs_down_6h, DEGs_down_1d, DEGs_down_3d, DEGs_down_7d]),
    go.Bar(name='upregulated', x=days, y=[DEGs_up_6h, DEGs_up_1d, DEGs_up_3d, DEGs_up_7d])
])
fig.update_layout(
    barmode='stack',
    title='DEGs in seronegative subjects',
    yaxis_title='Number of DEGs',
    font=dict(family='Arial', size=18)
)
fig.show()

I have made a few more customisations to the graph, including adding a title and a y-axis title, and changing the font to make the graph look more professional. The output is as shown:

At one glance, you can quickly appreciate that day 1 has the most DEGs as compared to the other time points, and there are more upregulated compared to downregulated DEGs. You can also mouse over the data points to obtain the precise number of up and downregulated DEGs at every time point.

Posted in Data visualisation, python

Plotting volcano plots with Plotly

As mentioned previously, I have highlighted why volcano plots are so important in omics research. A full article summarising the main points and rationale can be found here. In this entry, we will explore how we can use Plotly to build volcano plots!

We will analyse a transcriptomics dataset published by Zak et al., PNAS, 2012. In this study, seronegative and seropositive subjects were given the MRKAd5/HIV vaccine, and the transcriptomic responses in peripheral blood mononuclear cells (PBMC) were measured at 6 hours, 1 day, 3 days and 7 days post-vaccination. On my end, I have processed and compiled the fold-change, ratio, p-values and adjusted p-values (q-values) of the seronegative subjects with respect to time = 0 using the Partek Genomics Suite, and the processed data is available on GitHub.

We first load the required packages (pandas, numpy and plotly) to plot our volcano plot using Python. Note that you will need to download the packages before you can start using them. I have chosen to use the Plotly graphing library for data visualisation, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can reference the specific points to gene names. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1 day, 3 day and 7 day time-points compared to baseline (timepoint = 0):

The next step will be to create new columns for log2FC and -log10(adjusted p-value). The commands are as follows:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])
df['negative_log_pval_6h'] = np.log10(df['qval_6h']) * (-1)
df['negative_log_pval_1d'] = np.log10(df['qval_1d']) * (-1)
df['negative_log_pval_3d'] = np.log10(df['qval_3d']) * (-1)
df['negative_log_pval_7d'] = np.log10(df['qval_7d']) * (-1)

Now we are ready to plot the volcano plots for the different time-points. The advantage of using Python is that we can overlay the volcano plots of the different time-points all in one graph. The commands are as follows:

fig = go.Figure()
trace1 = go.Scatter(
    x=df['log2FC_6h'],
    y=df['negative_log_pval_6h'],
    mode='markers',
    name='6hrs',
    hovertext=list(df.index)
)
trace2 = go.Scatter(
    x=df['log2FC_1d'],
    y=df['negative_log_pval_1d'],
    mode='markers',
    name='day 1',
    hovertext=list(df.index)
)
trace3 = go.Scatter(
    x=df['log2FC_3d'],
    y=df['negative_log_pval_3d'],
    mode='markers',
    name='day 3',
    hovertext=list(df.index)
)
trace4 = go.Scatter(
    x=df['log2FC_7d'],
    y=df['negative_log_pval_7d'],
    mode='markers',
    name='day 7',
    hovertext=list(df.index)
)
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.add_trace(trace4)
fig.update_layout(title='Volcano plot for seronegatives')
fig.show()

I will give a brief description of the above code. We first tell Plotly that we are going to plot a figure using the go.Figure() command. Next, we overlay the different scatterplots using the different traces (one for each time-point). In each trace, we (i) specify the columns for the x- and y-axes, (ii) indicate that we only want to plot points using mode='markers', (iii) set the legend names for the different traces and (iv) set the text labels shown when we hover over the data points. Under fig.update_layout, we also added the title of the plot. The output graph is as follows:

At one glance, you can quickly appreciate that day 1 has the most differentially expressed genes as compared to the other time-points. You can also mouse over the data points to examine which of these genes are most differentially expressed. An example I have provided is CXCL10, which is one of the genes that is induced to a great extent after one day post-vaccination.

Now that we have established that day 1 is the most interesting time-point, we may want to zoom into the genes that are at the edges of the volcano plot. We can type in the command below to specifically plot the day 1 volcano plot. Note that I have added text to the volcano plot this time (using the textposition argument) so that we can quickly visualise the genes that are most differentially expressed:

fig_d1 = px.scatter(df, x='log2FC_1d', y='negative_log_pval_1d', text=df.index)
fig_d1.update_traces(textposition='top center')
fig_d1.update_layout(
    title_text='Volcano plot for seronegatives (day 1)'
)
fig_d1.show()

Output file is as follows:

Now, we can quickly visualise that the top upregulated genes at day 1 include the interferon-related genes, such as IDO1, CXCL10, RSAD2, CCL2, CCL8 and LAMP3. As you can begin to appreciate, a volcano plot allows users to have a quick sensing of the data and visualisation of the most pivotal genes and proteins that are most differentially expressed.

Posted in Data visualisation, python

So you think you have finished analysing your data? Think again…

It’s a great sense of achievement to know that you have finished analysing your dataset. However, it is almost always a good habit to check through your data analysis and, at times, even re-analyse the data in a different way to understand it better and strengthen the analysis. In this blog entry, I will highlight some key considerations to take into account when checking through your data analysis.

1) Could your variables be categorised differently, or expressed as continuous variables for data analysis?

Two different ways of showing association of age with antibody responses

In some instances, the explanatory variable can be expressed as either a categorical or a continuous variable. If that is the case, I would recommend analysing the data both ways. For example, consider a research question that studies the effect of age on vaccine antibody response. Based on the literature, you may have classified subjects into two groups: (I) elderly subjects, 65 years of age and above; (II) young subjects, below 65 years of age. The antibody responses are then compared between these two groups. However, the analysis can also be done differently: you can instead plot a correlation of antibody response against age and evaluate whether the correlation is significant. The two approaches ask the research question in slightly different ways (see figure above for clarity): the former investigates differences between young and elderly subjects, while the latter investigates whether antibody responses are correlated with age. It is usually worth doing both analyses to gain a deeper understanding of your data.
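
As a minimal sketch of the two approaches using SciPy (the dataframe, column names and random numbers below are purely hypothetical; the 65-year cutoff follows the example above):

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical dataset for illustration only: 100 subjects with random ages and antibody titres
rng = np.random.default_rng(0)
data = pd.DataFrame({'age': rng.integers(20, 90, 100),
                     'antibody_titre': rng.normal(100, 20, 100)})

# Approach 1: categorise subjects into young (<65) and elderly (>=65), then compare the two groups
young = data.loc[data['age'] < 65, 'antibody_titre']
elderly = data.loc[data['age'] >= 65, 'antibody_titre']
stat, p_group = stats.mannwhitneyu(young, elderly)

# Approach 2: keep age as a continuous variable and test its correlation with antibody titre
r, p_corr = stats.pearsonr(data['age'], data['antibody_titre'])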

2) Are the cutoff values in your data analysis justified?

What cutoff should you try?

We often choose default values as cutoffs for assigning categories or filtering data. However, the default values may not be applicable to every dataset. This is where a deeper knowledge of the analysis tools is helpful, as you will be able to better appreciate the assumptions applied in the analyses. For instance, consider the scenario where you are interested in finding out which molecular pathways are regulated by virus infection. You decide to choose a 1.5 fold-change and 0.05 p-value cutoff to identify the differentially expressed genes for pathway analysis. However, have you wondered why a 1.5 fold-change was chosen? What happens if you picked a cutoff of 2 fold-change? To better understand the cutoff values and how they may affect the analysis, it is usually wise to analyse the results using different cutoff values and, eventually, justify why the final chosen cutoff is most suitable for your analysis.
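
A simple way to see how sensitive your DEG list is to the cutoff is to count the DEGs across a range of fold-change thresholds before committing to one. A sketch, assuming a dataframe df with a fold-change column 'fc' and an adjusted p-value column 'qval':

# Assumed columns: 'fc' (fold-change) and 'qval' (adjusted p-value)
for cutoff in [1.2, 1.5, 2.0, 3.0]:
    n_degs = len(df[(abs(df['fc']) > cutoff) & (df['qval'] < 0.05)])
    print(f'fold-change > {cutoff}: {n_degs} DEGs')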

3) How did you manage the missing values and outliers? Can we do better?

Handling missing data and outliers is important to ensure data interpretation is sound

Different individuals may manage missing values and outliers differently, which can affect the interpretation of the data. Hence, it is good practice to handle these missing values and outliers in different ways and see how the different methods impact the analysis. Consider a situation where 50% of the samples have at least one missing value across the explanatory variables. If the strategy for handling missing data is to remove any sample with a missing value, then 50% of the data is removed. Compared with another method that imputes the mean value, the interpretation of the data could be very different. If unsure, it is usually a good habit to try these different strategies and then evaluate which best represents the data.
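
A minimal pandas sketch of the two strategies, using a small hypothetical dataframe with missing values:

import numpy as np
import pandas as pd

# Hypothetical dataframe for illustration only
df = pd.DataFrame({'age': [34, 56, np.nan, 45],
                   'bmi': [22.1, np.nan, 27.5, 24.3],
                   'titre': [110, 95, np.nan, 130]})

# Strategy 1: drop any sample (row) with at least one missing value
df_dropped = df.dropna()

# Strategy 2: impute each missing value with the column mean
df_imputed = df.fillna(df.mean())

# Compare how many samples each strategy retains
print(len(df_dropped), len(df_imputed))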

4) Besides the primary research goal, what more can we learn from the data?

Sometimes, being too focused on what you are chasing may cause you to miss other important knowledge that the data provides

As data scientists, we are often engrossed in our research interest. Our interpretation of the data could thus be biased towards what we are interested to “see.” While this may be a reasonable approach to directly answer the research objectives, it is also important to look out for other trends within the data. The reason is that some of these trends could allow us to gain new biological insights, or to identify potential confounding variables. For instance, in a dataset you may find that BMI is associated with increased disease severity. However, with more exploratory analyses, you may also find that the high-BMI subjects are also the elderly individuals. From this additional finding, the interpretation is no longer simply that BMI influences disease severity; disease severity could be influenced by age as well. To separate the effects of BMI and age, further statistical analysis that reduces the impact of confounding variables is required.

5) Look beyond statistical significance for biological meaning

As scientists, we have been trained to look at p-values. Essentially, a p-value < 0.05 is deemed significant and we can reject the null hypothesis. However, in biology it is sometimes pertinent to look beyond statistical significance, because the magnitude of the change also matters. Consider a study where a researcher shows that taking drug X significantly reduces weight by 1 kg. Consider another drug, Y, which causes a mean 10 kg loss but with weak significance (p > 0.05). Of the two drugs, drug Y could be more biologically meaningful in reducing weight than drug X, despite not reaching statistical significance.
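
To make this concrete, here is a small simulated example (the numbers are invented purely for illustration): a very consistent 1 kg loss can reach statistical significance, while a noisier 10 kg loss measured in fewer subjects may not, even though the latter is arguably more meaningful biologically.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Drug X: small (1 kg) but very consistent weight loss across 50 subjects
weight_change_x = rng.normal(loc=-1, scale=1, size=50)
# Drug Y: large (10 kg) but highly variable weight loss across 8 subjects
weight_change_y = rng.normal(loc=-10, scale=15, size=8)

# One-sample t-tests against zero weight change
print(stats.ttest_1samp(weight_change_x, 0))  # small effect, likely p < 0.05
print(stats.ttest_1samp(weight_change_y, 0))  # large effect, may not reach p < 0.05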

In summary, inspecting and cleaning data should take up most of the time in data analysis, but the effort is worth it because it ensures that the downstream interpretation of the data is more complete. This is particularly true for machine learning, where the data input directly impacts the resulting models. With the above 5 points, I hope that I have convinced you of the importance of checking and revisiting your data before declaring your analysis “done.” Give it a shot; I am confident that you will find your time spent worthwhile.

Posted in machine learning, python

Choosing the best machine learning model with LazyPredict

Predicting flower species with machine learning. Source from: https://weheartit.com/entry/14173501

The idea of using machine learning models to make predictions is attractive. But a challenging question remains: there are so many machine learning models out there, so which model is most suitable? Which model should I start exploring? Should I fit every single model I have learnt, try my luck, and see which one works best for me?

Indeed, these questions cannot be easily answered, especially for beginners. While I am no expert in machine learning, there are some packages that can make your life simpler. One such package is LazyPredict, which ranks the machine learning models that are most likely to be suitable. LazyPredict provides both LazyClassifier and LazyRegressor, which allow you to predict categorical and continuous variables respectively.

To install the lazypredict package, execute the following code:

pip install lazypredict

The lazypredict package will require XGBoost and LightGBM extensions to be functional. Execute the following commands in terminal:

conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm

Now you are ready to analyse with machine learning. First, import the essential packages:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import display

We will import the iris dataset as an example. In this exercise, we will predict sepal width from the iris dataset using machine learning:

df = px.data.iris()
df.head()

Output is as follows. The dataframe contains values for sepal length/width, petal length/width, species (setosa, versicolour, virginica) and species id (where setosa = 1, versicolour = 2, virginica = 3):

|   | sepal_length | sepal_width | petal_length | petal_width | species | species_id |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |

Before doing machine learning, it is a good practice to “sense” your data to know which machine learning model will likely fit your data. You can perform a pair plot in this case to understand the distribution of your dataset. Some critical considerations include (i) detecting the presence of outliers, (ii) whether the relationships between variables are linear or non-linear, and (iii) whether the distributions of the variables are Gaussian or skewed. Plotting a pair plot in Python is easy:

from pandas.plotting import scatter_matrix

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
scatter_matrix(df[attributes], figsize = (10,8))

Output is as follows:

The data suggest that there are no big anomalies, and most of the variables follow a linear relationship. At this point, we may not know which variable best predicts “sepal width”, as it is not obvious. Notice that we have not yet considered the “species” variable in our analysis. The “species” variable contains 3 categories, namely setosa, versicolour and virginica. One way of processing the categorical data is to assign a different number to each category, as shown in the “species_id” column, where setosa = 1, versicolour = 2 and virginica = 3. However, in this case, assigning these numbers may not be appropriate, as it implies that virginica is somehow greater than versicolour, and versicolour greater than setosa. To circumvent this issue, we can use one-hot encoding, which assigns binary values (1 or 0) to the individual categories. Fortunately, the conversion is easy to execute in Python:

df_cat_to_array = pd.get_dummies(df)
df_cat_to_array = df_cat_to_array.drop('species_id', axis=1)
df_cat_to_array

Output is as follows:

|   | sepal_length | sepal_width | petal_length | petal_width | species_setosa | species_versicolor | species_virginica |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 1 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 1 | 0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 1 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 0 | 0 | 1 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 0 | 0 | 1 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 0 | 0 | 1 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 0 | 0 | 1 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 0 | 0 | 1 |

Each species category is now given equal weightage and encoded as 1 or 0, where 1 indicates “yes” and 0 indicates “no”. We also drop the species_id column, as the categories have already been encoded by the get_dummies function.

Now that all columns are converted to numerical values, we are ready to use the lazypredict package to see which machine learning model is best in predicting sepal width. We first import the required packages:

import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

Next, we use lazypredict to identify the machine learning regressor model that can potentially predict sepal width:

X = df_cat_to_array.drop(['sepal_width'], axis=1)
Y = df_cat_to_array['sepal_width']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

The above commands define the explanatory variables as X and the response variable (sepal width) as Y. Hence, we drop the sepal width column to form X and select only the sepal width column for Y. The test set is 20% of the data and the training set is 80%. It is a good habit to specify the random state in case we want to re-run the same analysis on a different day. The output is as follows:

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| SVR | 0.70 | 0.76 | 0.19 | 0.01 |
| NuSVR | 0.70 | 0.76 | 0.19 | 0.02 |
| KNeighborsRegressor | 0.64 | 0.71 | 0.20 | 0.02 |
| RandomForestRegressor | 0.62 | 0.70 | 0.21 | 0.19 |
| GradientBoostingRegressor | 0.62 | 0.70 | 0.21 | 0.05 |
| LGBMRegressor | 0.60 | 0.69 | 0.21 | 0.04 |
| HistGradientBoostingRegressor | 0.60 | 0.68 | 0.21 | 0.12 |
| HuberRegressor | 0.60 | 0.68 | 0.21 | 0.04 |
| Ridge | 0.60 | 0.68 | 0.22 | 0.01 |
| RidgeCV | 0.60 | 0.68 | 0.22 | 0.01 |
| BayesianRidge | 0.59 | 0.68 | 0.22 | 0.02 |
| ElasticNetCV | 0.59 | 0.68 | 0.22 | 0.08 |
| LassoCV | 0.59 | 0.68 | 0.22 | 0.11 |
| LassoLarsIC | 0.59 | 0.67 | 0.22 | 0.01 |
| LassoLarsCV | 0.59 | 0.67 | 0.22 | 0.04 |
| TransformedTargetRegressor | 0.59 | 0.67 | 0.22 | 0.01 |
| LinearRegression | 0.59 | 0.67 | 0.22 | 0.01 |
| OrthogonalMatchingPursuitCV | 0.59 | 0.67 | 0.22 | 0.02 |
| LinearSVR | 0.59 | 0.67 | 0.22 | 0.01 |
| BaggingRegressor | 0.59 | 0.67 | 0.22 | 0.03 |
| AdaBoostRegressor | 0.57 | 0.66 | 0.22 | 0.10 |
| Lars | 0.53 | 0.62 | 0.23 | 0.02 |
| LarsCV | 0.53 | 0.62 | 0.23 | 0.04 |
| SGDRegressor | 0.47 | 0.58 | 0.25 | 0.02 |
| ExtraTreesRegressor | 0.47 | 0.58 | 0.25 | 0.13 |
| PoissonRegressor | 0.45 | 0.56 | 0.25 | 0.02 |
| XGBRegressor | 0.35 | 0.48 | 0.27 | 0.12 |
| ExtraTreeRegressor | 0.34 | 0.47 | 0.28 | 0.02 |
| GeneralizedLinearRegressor | 0.29 | 0.44 | 0.28 | 0.02 |
| TweedieRegressor | 0.29 | 0.44 | 0.28 | 0.01 |
| DecisionTreeRegressor | 0.29 | 0.44 | 0.29 | 0.02 |
| MLPRegressor | 0.29 | 0.44 | 0.29 | 0.18 |
| GammaRegressor | 0.29 | 0.43 | 0.29 | 0.01 |
| OrthogonalMatchingPursuit | 0.28 | 0.43 | 0.29 | 0.02 |
| RANSACRegressor | 0.27 | 0.42 | 0.29 | 0.07 |
| PassiveAggressiveRegressor | 0.03 | 0.23 | 0.33 | 0.01 |
| DummyRegressor | -0.31 | -0.04 | 0.39 | 0.01 |
| Lasso | -0.31 | -0.04 | 0.39 | 0.02 |
| ElasticNet | -0.31 | -0.04 | 0.39 | 0.01 |
| LassoLars | -0.31 | -0.04 | 0.39 | 0.02 |
| KernelRidge | -82.96 | -65.59 | 3.11 | 0.02 |
| GaussianProcessRegressor | -483.87 | -383.55 | 7.47 | 0.02 |

Here we go. We have identified SVR, NuSVR, K-Neighbors Regressor, Random Forest Regressor and Gradient Boosting Regressor as the top 5 models (with R2 values of ~0.7), which is not bad for a start! Of course, we can further refine the models by hyperparameter tuning to explore if we can further improve our predictions.
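
As a rough sketch of what such tuning could look like for the top-ranked SVR model with scikit-learn's GridSearchCV (the parameter grid below is an arbitrary starting point, not a recommendation):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scale the features before SVR, then search over a small, arbitrary parameter grid
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {'svr__C': [0.1, 1, 10],
              'svr__gamma': ['scale', 0.1, 1],
              'svr__epsilon': [0.05, 0.1, 0.2]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # R-squared on the held-out test set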

To predict categorical samples, you can use the Lazy Classifier from Lazy Predict package. Let’s assume this time we are interested in whether the sepal and petal length/width can be used to predict the species, versicolor. First, import the necessary packages:

from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

The Lazy Classifier can then be executed with the following command:

X = df_cat_to_array.drop(['species_setosa', 'species_versicolor', 'species_virginica'], axis=1)
Y = df_cat_to_array['species_versicolor']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=55)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Note that I have dropped all three species columns from X and used the versicolor column as the response variable Y, as we are interested in predicting versicolor from the petal and sepal length/width. This shows that the research question is critical for machine learning! The output is as follows:

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| KNeighborsClassifier | 1.00 | 1.00 | 1.00 | 1.00 | 0.02 |
| SVC | 1.00 | 1.00 | 1.00 | 1.00 | 0.02 |
| AdaBoostClassifier | 0.97 | 0.95 | 0.95 | 0.97 | 0.13 |
| XGBClassifier | 0.97 | 0.95 | 0.95 | 0.97 | 0.05 |
| RandomForestClassifier | 0.97 | 0.95 | 0.95 | 0.97 | 0.32 |
| NuSVC | 0.97 | 0.95 | 0.95 | 0.97 | 0.06 |
| BaggingClassifier | 0.97 | 0.95 | 0.95 | 0.97 | 0.04 |
| LGBMClassifier | 0.97 | 0.95 | 0.95 | 0.97 | 0.05 |
| DecisionTreeClassifier | 0.97 | 0.95 | 0.95 | 0.97 | 0.02 |
| ExtraTreesClassifier | 0.97 | 0.95 | 0.95 | 0.97 | 0.14 |
| LabelPropagation | 0.93 | 0.93 | 0.92 | 0.93 | 0.02 |
| LabelSpreading | 0.93 | 0.93 | 0.92 | 0.93 | 0.02 |
| GaussianNB | 0.93 | 0.90 | 0.90 | 0.93 | 0.02 |
| ExtraTreeClassifier | 0.93 | 0.90 | 0.90 | 0.93 | 0.01 |
| QuadraticDiscriminantAnalysis | 0.93 | 0.90 | 0.90 | 0.93 | 0.03 |
| NearestCentroid | 0.60 | 0.65 | 0.65 | 0.61 | 0.03 |
| BernoulliNB | 0.67 | 0.65 | 0.65 | 0.67 | 0.02 |
| CalibratedClassifierCV | 0.67 | 0.60 | 0.60 | 0.66 | 0.04 |
| RidgeClassifierCV | 0.67 | 0.60 | 0.60 | 0.66 | 0.02 |
| RidgeClassifier | 0.67 | 0.60 | 0.60 | 0.66 | 0.03 |
| LinearSVC | 0.67 | 0.60 | 0.60 | 0.66 | 0.02 |
| LogisticRegression | 0.67 | 0.60 | 0.60 | 0.66 | 0.04 |
| LinearDiscriminantAnalysis | 0.67 | 0.60 | 0.60 | 0.66 | 0.02 |
| SGDClassifier | 0.70 | 0.57 | 0.57 | 0.64 | 0.02 |
| Perceptron | 0.63 | 0.55 | 0.55 | 0.61 | 0.06 |
| PassiveAggressiveClassifier | 0.63 | 0.55 | 0.55 | 0.61 | 0.05 |
| DummyClassifier | 0.57 | 0.47 | 0.47 | 0.54 | 0.02 |

This output suggests that K-Neighbours Classifier, SVC, Ada Boost Classifier, XGB Classifier and Random Forest Classifiers can be potentially explored as machine learning models to predict the versicolor species. You can further refine the model by hyperparameter tuning to improve the accuracy.
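
Again as an illustration only, a similar grid search could be run on one of the top classifiers, for example KNeighborsClassifier (the grid here is arbitrary):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Arbitrary grid over the number of neighbours and the weighting scheme
param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on the held-out test set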

Overall, LazyPredict can be a handy tool for selecting which of the many machine learning models is most suitable for predicting your response variable before hyperparameter tuning. Certainly a good starting point for beginners venturing into machine learning!

Posted in Data visualisation

Visualising distributions with pie charts, donut plots and stacked graphs

Pie charts can be used to display distribution of variables within datasets.

A pie chart uses pie slices to display the relative size or proportion of the different variables. In general, pie charts are easier to visualise if the pie has no more than 7 slices.

Consider a dataset where 10 patients are infected with virus X, 6 of whom experience no symptoms. The remaining 4 patients are symptomatic: 2 experience headaches, 1 fever and 1 rashes. Here, you can simply use a pie chart to display the data:

I have used the Prism software to plot the pie chart. Note that you can use colour variants to emphasise the subjects with symptoms. It is a good habit to include the total counts when presenting pie charts, as these charts are meant to display only relative sizes. Note that the pie chart always starts at the 12 o’clock position for easy visualisation.

Sometimes, when pie charts are difficult to visualise, an alternative is to plot a donut plot. The advantage of a donut plot is that you can provide additional information in the middle. However, it is important to make sure the donut is not so thin that visualisation becomes difficult. An example is shown below:

Finally, it might sometimes be more useful to use a stacked bar chart to show the distribution of the data. The stacked bar chart may be more useful when you want to group variables with similar properties (in this case, you can group all the symptomatic patients together). An example is as follows:

There are no fixed rules for deciding which of these charts is most suitable for data visualisation. When in doubt, there is no harm in plotting all of these graphs and deciding later which is the most suitable for publications or presentations!
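
The charts above were made in Prism, but if you would rather stay in Python, here is a minimal matplotlib sketch of all three views for the hypothetical 10-patient dataset described above:

import matplotlib.pyplot as plt

labels = ['No symptoms', 'Headache', 'Fever', 'Rashes']
counts = [6, 2, 1, 1]  # 10 patients in total

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Pie chart, drawn clockwise from the 12 o'clock position
axes[0].pie(counts, labels=labels, startangle=90, counterclock=False)
axes[0].set_title('Pie chart (n=10)')

# Donut plot: a pie chart with a wedge width smaller than the radius
axes[1].pie(counts, labels=labels, startangle=90, counterclock=False,
            wedgeprops=dict(width=0.4))
axes[1].set_title('Donut plot (n=10)')

# Stacked bar: the symptomatic patients are grouped on top of the asymptomatic ones
axes[2].bar('Patients', 6, label='No symptoms')
axes[2].bar('Patients', 2, bottom=6, label='Headache')
axes[2].bar('Patients', 1, bottom=8, label='Fever')
axes[2].bar('Patients', 1, bottom=9, label='Rashes')
axes[2].set_ylabel('Number of patients')
axes[2].legend()
axes[2].set_title('Stacked bar (n=10)')

plt.tight_layout()
plt.show()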

Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. The magnitude differences are usually represented as hue or colour intensity changes. However, if you want to include another parameter, this can be challenging. Imagine the scenario where you identified 5 Gene Ontology Biological Pathways (GOBP) which are significantly different between the infected and uninfected samples over a course of 3 days. To plot them on a graph, you can choose to negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you will need 2 figure panels to fully describe your pathway analysis. A more elegant way to display these results is to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting data points as dots rather than as rectangles in heatmaps is that you can alter the size of the dots to add another dimension to your data visualisation. For instance, in this specific example, you can make the dot size proportional to the -log10 adjusted p-value and use the hue of the dots to represent the combined (enrichment) score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

We then load a ‘test’ dataset from my desktop, in a format where the columns contain time-points, pathway terms, negative log p-values and combined scores. It is also a good habit to convert the timepoint column to a “string” datatype so that the x-axis is treated as categorical and does not include interpolated time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")
df.head(5)

Output is as follows:

|   | timepoint | Term | adjusted_pval | Combined_score | neg_logP |
|---|---|---|---|---|---|
| 0 | 1 | Defense Response To Virus | 3.942000e-25 | 87.531 | 24.404283 |
| 1 | 2 | Defense Response To Virus | 3.940000e-27 | 875.310 | 26.404283 |
| 2 | 3 | Defense Response To Virus | 5.000000e-02 | 2.000 | 1.301030 |
| 3 | 1 | Defense Response To Symbiont | 2.256000e-25 | 95.555 | 24.646661 |
| 4 | 2 | Defense Response To Symbiont | 2.260000e-27 | 955.550 | 26.646661 |

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)

fig.show()

The output is a dot plot, where the dot size is proportional to the -log p-value and the colour intensity reflects the combined score. You can customise your colours with the colour scales available at this website:

Because of the ability of dot plots to add another dimension to the analysis, most pathway analyses are presented as dot plots. However, I am sure there are other scenarios where dot plots can be appropriately used. Next time, if you decide to plot multiple heatmaps, do consider the possibility of using dot plots as an alternative data visualisation tool!

Posted in Data visualisation, pair plots, python

Visualising multidimensional data with pair plots

Pair plots can be used effectively to visualise relationships between multiple variables, allowing you to zoom into the interesting aspects of the dataset for further analysis

As discussed in my previous blog entry, correlation coefficients measure the strength of the relationship between 2 variables. In general, a correlation coefficient of 0.5 and above is considered strong, 0.3 is moderate and 0.1 is weak. To quickly visualise the relatedness between multiple variables, a correlation matrix can be used. However, there are some limitations. For instance, the correlation matrix does not provide information on the steepness of the slope between variables, nor does it display the relationship if the variables are not linearly correlated. In addition, the correlation matrix cannot show how categorical variables may interact with the other continuous variables.

Let me provide an example to illustrate my case. Consider that you want to understand whether flower sepal length, sepal width, petal length and petal width are correlated. Plotting a correlation matrix for these 4 variables will allow you to visualise the strength of the relationships between these continuous variables. However, if you have another categorical variable, for example flower species, and you are interested in understanding how flower species interacts with all these parameters, then merely plotting a correlation matrix is inadequate.
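
For comparison, this is roughly what the correlation-matrix view would look like for the four continuous iris variables (a sketch using pandas and Seaborn); it captures the correlation strengths but carries no information about the flower species:

import plotly.express as px
import seaborn as sns

# Correlation matrix of the four continuous variables, shown as a heatmap
df = px.data.iris()
corr = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].corr()
sns.heatmap(corr, annot=True, cmap='vlag', vmin=-1, vmax=1)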

To get around this problem, pair plots can be used. Pair plots allow us to see the distributions of and relationships between multiple continuous variables. In Python, you can even use colours to visualise how the categorical variable interacts with the other continuous variables. Surprisingly, plotting pair plots in Python is relatively straightforward. Let’s use the iris dataset from Plotly Express as a proof of concept.

We first load the required packages, the iris dataset from Plotly Express, and display all the variables from the dataset with the head() function:

# Load packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load iris dataset from plotlyexpress
df = px.data.iris()
df.head()

The output file looks like this, with sepal length, sepal width, petal length and petal width as continuous variables and flower species as the categorical variable:

|   | sepal_length | sepal_width | petal_length | petal_width | species | species_id |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |

To visualise the correlation between continuous variables, the pair plot can be quickly plotted with a single line of code. In addition, we can annotate the different flower species with different colours. Finally, rather than just plotting the same scatterplot on both sides of the matrix, we will also plot the lower half of the matrix with the best fit straight line so we can visualise the slope of the relationship. The code used is thus as follows:

g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)

Output is as follows:

Isn’t it cool that you can immediately see that the variables are strongly correlated with each other, and these parameters are dependent on the flower species? As you may already appreciate, the pair plot is a useful tool to visualise and analyse multi-dimensional datasets.

Posted in Data visualisation, IPyWidgets, python

Making Jupyter notebooks interactive with IPyWidgets

There are many instances where scientists and data scientists have to identify cut-offs before deeper data analyses. For instance, what are the fold-change and adjusted p-value thresholds most appropriate for identifying DEGs? For pre-ranked pathway analysis, what is an appropriate cut-off for enrichment scores to be considered significant? For RNA-seq, what is the minimum count threshold we should use to filter out low-abundance transcripts without trimming away too many? Executing these tasks can take a long time, even with code, because you need to re-run multiple lines of code to identify the most suitable cut-off.

Consider that you are unsure what an appropriate cut-off for identifying DEGs is. The usual way is to query the number of DEGs identified when the fold-change is at least 1.5 with adjusted p-value < 0.05. Imagine you identified only 30 DEGs amongst 20,000 genes. This may indicate that your cutoff is too stringent, so this time you decide to relax it to a fold-change of at least 1.2 with adjusted p-value < 0.05. Again, you will require another line of code. Imagine this time you found 1,000 genes that satisfy the criteria, and now you think this number is too many for pathway analysis. Again, this means you might need yet another query at a fold-change of at least 1.3, and the cycle repeats until you find the suitable cutoff.

To circumvent this issue, I introduce IPython widgets (IPyWidgets), which turn Jupyter Notebooks from static documents into interactive dashboards, perfect for exploring and visualizing data. To install, simply type the command:

pip install ipywidgets

We first import the packages and load a dataset. Here, I load a dataset from Zak DE et al. from my BENEATH account. The study looks into the transcriptomic responses in human subjects across time after receiving the MRKAd5/HIV vaccine, and the details have been elaborated previously in my blog:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import beneath
df = await beneath.load_full("kuanrongchan/vaccine-repository/ad5-seroneg-daniel-2012")
df.info()
df.head(2)

The output for the header columns is as follows. fc indicates fold-change, pval indicates p-value with respect to day 0, and qval indicates adjusted p-value with respect to day 0. The time-points considered are 6 hrs, 1, 3 and 7 days after vaccination.

|   | gene | fc006h | pval006h | qval006h | fc001d | pval001d | qval001d | fc003d | pval003d | qval003d | fc007d | pval007d | qval007d | @meta.timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1BG | -1.13408 | 0.157781 | 0.500370 | -1.03742 | 6.099120e-01 | 0.804213 | -1.13944 | 0.143099 | 0.678669 | -1.04868 | 0.589913 | 0.997579 | 2021-07-16 00:24:39.972000+00:00 |
| 1 | A1CF | -1.04199 | 0.383750 | 0.638880 | 1.07475 | 6.557900e-02 | 0.170118 | 1.01920 | 0.686074 | 0.929992 | -1.01369 | 0.772500 | 0.997579 | 2021-07-16 00:24:39.972000+00:00 |

We will use the following command to make an interactive slider. Assuming we are fixing adjusted p-value to be less than 0.05 and want to find out the number of DEGs based on a fold-change cut-off:

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

@interact
def show_DEGs(column1=['fc001d','fc003d','fc007d'], x=(0,3,0.1),column2=['qval001d','qval003d','qval007d']):
    return len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index)

Output is as follows:

With the @interact decorator, the IPyWidgets library automatically gives us dropdown menus for choosing the columns and a slider for the fold-change cut-off! In this case, we can move the slider from 0 to 3 for the different time-points (6 hrs, 1, 3 and 7 days post-vaccination) to see the number of DEGs at different fold-change cut-offs. The output shown has the slider moved to 1.3 and 1.5, which yielded 1510 and 905 DEGs respectively.

You can also combine the @interact decorator with the .loc accessor to call out the genes and expression values that satisfy the criteria designated by your slider, as sketched below:
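
A possible sketch, reusing the same dropdowns and slider as above and returning the matching rows with .loc (the displayed columns are just an example):

@interact
def show_DEG_table(column1=['fc001d', 'fc003d', 'fc007d'], x=(0, 3, 0.1),
                   column2=['qval001d', 'qval003d', 'qval007d']):
    # Return the genes (rows) that pass the fold-change and adjusted p-value criteria
    return df.loc[(abs(df[column1]) > x) & (df[column2] < 0.05), ['gene', column1, column2]]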

Finally, you can also use IPyWidgets for multiple comparisons, especially using the dropdown widget. For example, if you want to quickly find out whether the transcript expression at two time-points is correlated, you can use IPyWidgets to create dropdown menus for multiple comparisons:

@interact
def correlations(column1=list(df.select_dtypes('number').columns),
                 column2=list(df.select_dtypes('number').columns)):
    print(f"Correlation:{df[column1].corr(df[column2])}")

The output is two dropdown lists, that would allow you to quickly evaluate the correlation of any 2 variables in the dataframe:

The potential of IPyWidgets is limitless, allowing you to scan through data very quickly and bringing some life to the static Jupyter Notebook. You can also use it for graph plotting, scanning through files, etc. The possibilities are endless… 🙂