Posted in python

Building interactive dashboards with Streamlit (Part I)

Iris flower dataset used to illustrate how we can use Streamlit to build data dashboards

Python is often used in back-end programming, which builds the functionality of web applications. For instance, Python can connect a web application to a database so that users can query information. By adding back-end components, Python turns a static webpage into a dynamic web application, where users interact with content drawn from an updated, connected database. If the data provided is large, developers can even use Python's machine learning tools to make predictions. However, front-end development, the part that makes webpages beautiful and interactive, is typically done with other languages such as HTML, CSS and JavaScript. The question remains: do I need to be a full-stack developer and master both front-end and back-end web development? That can be time-consuming, as it means learning multiple languages.

This is where Streamlit, Django and Flask come to the rescue! These frameworks are built on the Python programming language and allow you to build websites with minimal knowledge of CSS and JavaScript. Among them, I personally found Flask and Streamlit easier to learn and implement. With more experience, I decided to focus on Streamlit, as its syntax is easier to understand and deployment is faster.

Rather than going through Streamlit commands one by one, I will illustrate the functionality of the different commands using my favourite iris dataset, which is readily available in Plotly Express. The GitHub repository is publicly available, and the output of the code will be hosted here.

First, we will create a file called iris.py using a text editor such as Sublime Text or Visual Studio Code. In this blog entry, we will focus on acquiring the basic statistics of the individual columns and presenting this information in a data dashboard for project showcasing. As with all Python scripts, we first import the required packages:

import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
from wordcloud import WordCloud
from typing import Any, List, Tuple

After that, we type in the commands needed to read the iris dataframe. The data is available in Plotly Express and can be loaded into Streamlit with the following commands, together with a title for the data dashboard:

st.title('Data analysis of iris dataset from PlotlyExpress')
df = px.data.iris()
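To preview the dashboard while you build it, the script can be served locally from the terminal (a quick sketch, assuming the file was saved as iris.py as described above):

streamlit run iris.py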

Next, basic statistics such as count, mean, standard deviation, min, max and quantiles can be displayed with the command df.describe(). However, an advantage of Streamlit is the ability to add widgets, which keeps the dashboard from looking cluttered. In this example, we create a widget that lets users pick a column and then displays the basic statistics for that specific column:

# Let the user choose a numeric column, then show its summary statistics in 5 widgets
num_cols = df.select_dtypes(include=np.number).columns
column = df[st.selectbox("Select column to display basic statistics", num_cols)]

col1, col2, col3, col4, col5 = st.columns(5)
col1.metric("Mean", column.mean())
col2.metric("Max", column.max())
col3.metric("Min", column.min())
col4.metric("Std", column.std())
col5.metric("Count", int(column.count()))

The output will show 5 columns, displaying the mean, max, min, standard deviation and count of the selected column. The widget can be visualised here.

These statistics are appropriate for numeric or continuous variables. However, to visualise the categorical variables (in this case, the flower species), we can use a word cloud or a table to know the exact counts for each species using the following commands:

# Build wordcloud
non_num_cols = df.select_dtypes(include=object).columns
column = st.selectbox("Select column to generate wordcloud", non_num_cols)
column = df[column]
wc = WordCloud(max_font_size=25, background_color="white", repeat=True, height=500, width=800).generate(' '.join(column.unique()))
st.image(wc.to_image())

# Build table
st.markdown('***Counts in selected column for generating WordCloud***')
unique_values = column.value_counts()
st.write(unique_values)

The aim of Part I is to provide various ways to inspect the distribution of the categorical and continuous variables within your dataset. I will cover other topics, including data visualisation, machine learning and interactive graphical plots, in subsequent posts!

Posted in python

My experience with Voilà and Streamlit for building data dashboards

Differences between a data engineer and a data scientist. Source

The role of a data scientist is clear: to analyse the data, plot visualisation graphs and consolidate the findings into a report. However, with growing interest in understanding big data more deeply and an urgent need for novel tools to gain insights from biological datasets, there is increasing demand for data engineers. Their roles and responsibilities can include app development, constructing pipelines, data testing, and maintaining architectures such as databases and large-scale processing systems. This is unlike data scientists, who are mostly involved in data cleaning, data visualisation and big data organisation.

One aspect of this work is to implement strategies that improve reliability, efficiency and quality. To ensure consistency, implementing data dashboards is important, and this requires moving out of the comfort zone of only reporting data within Jupyter Notebooks. Data dashboards are often built with JavaScript. Recently, however, several packages implemented in Python (which means you don't have to learn another language) have become available, including Voilà, Panel, Dash and Streamlit. On my end, I have tried Voilà and Streamlit, as they are both easier to implement than Panel and Dash. This blog will hence compare my experience with Voilà and Streamlit.

The installation of Voilà and its associated templates is relatively straightforward. You just need to execute these commands to download the packages:

pip install voila
pip install voila-gridstack
pip install voila-vuetify

Once the packages are installed in your environment, you should be able to see the extensions in the Jupyter notebook (indicated by the arrow). Clicking on them will render the output of the Python code as a dashboard.

With the gridstack or vuetify templates, you can further rearrange your outputs to display your graphs in the dashboard. The dashboard can then be deployed using Heroku, or deposited in GitHub for deployment with mybinder.org.
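As a rough sketch, serving a notebook with one of these templates from the terminal could look like the command below (the notebook name is hypothetical, and the exact template names may differ between versions):

voila iris_dashboard.ipynb --template=gridstack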

As you can imagine, if you enjoy working within Jupyter Notebooks, Voilà can be a simple and convenient tool for making data dashboards. You can also make the dashboard interactive by using iPywidgets, Plotly, Altair or Bokeh. However, a major limitation is that it is difficult to build multi-page dashboards. This can be an issue if you are developing multiple dashboards, or multiple graphs from different studies.
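To illustrate the interactivity mentioned above, here is a minimal sketch of how iPywidgets and Plotly could be combined in a notebook that Voilà then renders; the column names assume the iris dataset used elsewhere in this blog:

import plotly.express as px
from ipywidgets import interact

df = px.data.iris()

# The decorator builds a dropdown; Voilà renders the widget and the plot it drives
@interact(column=["sepal_length", "sepal_width", "petal_length", "petal_width"])
def plot_histogram(column):
    fig = px.histogram(df, x=column, color="species")
    fig.show()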

My delay in publishing this blog post is because I spent much of my time looking for alternatives that support multi-page dashboards. This is where I learnt about Streamlit. I was very impressed at how simple Python code can produce beautiful dashboards, and I was able to build a simple webpage/dashboard within a few hours of reading online tutorials. With more reading, I was even able to make some simple apps! Using Streamlit is as simple as:

  1. Open a terminal window
  2. Install Streamlit
  3. Create a .py file using a text editor such as Sublime Text (my preferred choice), Atom or Visual Studio Code
  4. Execute the file by typing in the terminal: streamlit run filename.py

You can install streamlit by using:

pip install streamlit
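For illustration, a minimal .py file could look like the sketch below (the file name and text are arbitrary); running streamlit run app.py in the terminal then opens the page in your browser:

import streamlit as st

st.title('My first Streamlit dashboard')
st.write('Hello world! This page was built entirely in Python.')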

In addition to these cool features, Streamlit can handle multiple pages, which means you can create multiple data dashboards or apps within a single website. Finally, deployment is also relatively simple with Streamlit Teams, which is attractive. However, if you prefer to work within Jupyter Notebooks, this may not be a great option for you, as the commands are mostly executed via the terminal or in .py files. The other limitation, for which I haven't found a solution, is security: I do not know how to configure the website so that only registered users can access it.
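As a sketch of the multi-page idea mentioned above, one common pattern is to switch between page functions with a sidebar widget (the page names and functions here are hypothetical, and newer Streamlit versions may offer built-in alternatives):

import streamlit as st

def dashboard_page():
    st.title('Data dashboard')
    st.write('Plots and metrics go here')

def about_page():
    st.title('About this project')
    st.write('Background and contact details go here')

# Sidebar widget that switches between the two pages
page = st.sidebar.radio('Select page', ['Dashboard', 'About'])
if page == 'Dashboard':
    dashboard_page()
else:
    about_page()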

Overall, deciding which platform to use will depend on your personal preferences and applications. I prefer Streamlit as it is more versatile and flexible, which may explain its increasing popularity in recent years!

Further reading: Streamlit vs Dash vs Voilà vs Panel — Battle of The Python Dashboarding Giants, by Stephen Kilcommins (DataDrivenInvestor).
Posted in machine learning, python

Choosing the best machine learning model with LazyPredict

Predicting flower species with machine learning. Source from: https://weheartit.com/entry/14173501

The idea of using machine learning models to make predictions is attractive, but a challenging question remains: there are so many machine learning models out there, so which model is most suitable? Which model should I start exploring? Should I fit every single model I have learnt and try my luck to see which works best?

Indeed, these questions cannot be easily answered, especially by beginners. While I am no expert in machine learning, there are some packages that can make your life simpler. One such package is LazyPredict, which ranks the machine learning models most likely to be suitable. LazyPredict provides both a Lazy Classifier and a Lazy Regressor, which predict categorical and continuous variables respectively.

To install the lazypredict package, execute the following code:

pip install lazypredict

The lazypredict package requires the XGBoost and LightGBM libraries to be functional. Execute the following commands in the terminal:

conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm

Now you are ready to analyse with machine learning. First, import the essential packages:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import display

We will import the iris dataset as an example. In this exercise, we will predict sepal width from the iris dataset using machine learning:

df = px.data.iris()
df.head()

Output is as follows. The dataframe contains values for sepal length/width, petal length/width, species (setosa, versicolor, virginica) and species ID (where setosa = 1, versicolor = 2, virginica = 3):

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Before doing machine learning, it is good practice to "sense" your data to know which machine learning model is likely to fit it. You can perform a pair plot in this case to understand the distribution of your dataset. Some critical considerations include: (i) detecting the presence of outliers, (ii) whether the relationships between variables are linear or non-linear, and (iii) whether the distributions of the variables are Gaussian or skewed. Plotting a pair plot in Python is easy:

from pandas.plotting import scatter_matrix

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
scatter_matrix(df[attributes], figsize = (10,8))

Output is as follows:

The data suggest that there are no big anomalies within the data, and most of the variables follow a linear relationship. At this point, it is not obvious which variable best predicts "sepal width". Notice that we have not yet considered the "species" variable, which contains 3 categories, namely setosa, versicolor and virginica. One possibility for processing the categorical data is to assign a different number to each category, as shown in the "species_id" column, where setosa = 1, versicolor = 2 and virginica = 3. However, in this case, assigning ordered numbers may not be appropriate, as it implies that virginica is more important than versicolor, and versicolor more important than setosa. To circumvent this issue, we can use one-hot encoding of the data, which assigns a binary value (1 or 0) to each individual category. Fortunately, the conversion is relatively easy to execute in Python:

df_cat_to_array = pd.get_dummies(df)
df_cat_to_array = df_cat_to_array.drop('species_id', axis=1)
df_cat_to_array

Output is as follows:

     sepal_length  sepal_width  petal_length  petal_width  species_setosa  species_versicolor  species_virginica
0             5.1          3.5           1.4          0.2               1                   0                  0
1             4.9          3.0           1.4          0.2               1                   0                  0
2             4.7          3.2           1.3          0.2               1                   0                  0
3             4.6          3.1           1.5          0.2               1                   0                  0
4             5.0          3.6           1.4          0.2               1                   0                  0
...
145           6.7          3.0           5.2          2.3               0                   0                  1
146           6.3          2.5           5.0          1.9               0                   0                  1
147           6.5          3.0           5.2          2.0               0                   0                  1
148           6.2          3.4           5.4          2.3               0                   0                  1
149           5.9          3.0           5.1          1.8               0                   0                  1

Each category is now given equal weighting: each species column is assigned 1 or 0, where 1 indicates "yes" and 0 indicates "no". We also drop the species_id column, since we have already encoded the categories with the get_dummies function.

Now that all columns are converted to numerical values, we are ready to use the lazypredict package to see which machine learning model is best in predicting sepal width. We first import the required packages:

import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

Next, we use lazypredict to identify the machine learning regressor model that can potentially predict sepal width:

X = df_cat_to_array.drop(['sepal_width'], axis=1)
Y = df_cat_to_array['sepal_width']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

The above commands define the explanatory variables as X and the response variable (sepal width) as Y. Hence, we drop the sepal width column to form X and select only the sepal width column for Y. The test size in this case is 20% of the data and the training size is 80%. It is a good habit to specify the random state in case we want to re-run the same analysis on a different day. The output is as follows:

Model                           Adjusted R-Squared  R-Squared  RMSE  Time Taken
SVR                                           0.70       0.76  0.19        0.01
NuSVR                                         0.70       0.76  0.19        0.02
KNeighborsRegressor                           0.64       0.71  0.20        0.02
RandomForestRegressor                         0.62       0.70  0.21        0.19
GradientBoostingRegressor                     0.62       0.70  0.21        0.05
LGBMRegressor                                 0.60       0.69  0.21        0.04
HistGradientBoostingRegressor                 0.60       0.68  0.21        0.12
HuberRegressor                                0.60       0.68  0.21        0.04
Ridge                                         0.60       0.68  0.22        0.01
RidgeCV                                       0.60       0.68  0.22        0.01
BayesianRidge                                 0.59       0.68  0.22        0.02
ElasticNetCV                                  0.59       0.68  0.22        0.08
LassoCV                                       0.59       0.68  0.22        0.11
LassoLarsIC                                   0.59       0.67  0.22        0.01
LassoLarsCV                                   0.59       0.67  0.22        0.04
TransformedTargetRegressor                    0.59       0.67  0.22        0.01
LinearRegression                              0.59       0.67  0.22        0.01
OrthogonalMatchingPursuitCV                   0.59       0.67  0.22        0.02
LinearSVR                                     0.59       0.67  0.22        0.01
BaggingRegressor                              0.59       0.67  0.22        0.03
AdaBoostRegressor                             0.57       0.66  0.22        0.10
Lars                                          0.53       0.62  0.23        0.02
LarsCV                                        0.53       0.62  0.23        0.04
SGDRegressor                                  0.47       0.58  0.25        0.02
ExtraTreesRegressor                           0.47       0.58  0.25        0.13
PoissonRegressor                              0.45       0.56  0.25        0.02
XGBRegressor                                  0.35       0.48  0.27        0.12
ExtraTreeRegressor                            0.34       0.47  0.28        0.02
GeneralizedLinearRegressor                    0.29       0.44  0.28        0.02
TweedieRegressor                              0.29       0.44  0.28        0.01
DecisionTreeRegressor                         0.29       0.44  0.29        0.02
MLPRegressor                                  0.29       0.44  0.29        0.18
GammaRegressor                                0.29       0.43  0.29        0.01
OrthogonalMatchingPursuit                     0.28       0.43  0.29        0.02
RANSACRegressor                               0.27       0.42  0.29        0.07
PassiveAggressiveRegressor                    0.03       0.23  0.33        0.01
DummyRegressor                               -0.31      -0.04  0.39        0.01
Lasso                                        -0.31      -0.04  0.39        0.02
ElasticNet                                   -0.31      -0.04  0.39        0.01
LassoLars                                    -0.31      -0.04  0.39        0.02
KernelRidge                                 -82.96     -65.59  3.11        0.02
GaussianProcessRegressor                   -483.87    -383.55  7.47        0.02

Here we go. We have identified SVR, NuSVR, K-Neighbors Regressor, Random Forest Regressor and Gradient Boosting Regressor as the top 5 models (with R² values of ~0.7), which is not bad for a start! Of course, we can further refine these models by hyperparameter tuning to see if we can improve the predictions further.
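As a sketch of what such hyperparameter tuning could look like for the top-ranked SVR model (the parameter grid below is an arbitrary example, not a recommended setting):

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Search a small, arbitrary grid of SVR hyperparameters with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10], 'epsilon': [0.01, 0.1, 0.5], 'kernel': ['rbf', 'linear']}
grid = GridSearchCV(SVR(), param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)

print(grid.best_params_)   # best hyperparameter combination found
print(grid.best_score_)    # cross-validated R2 of that combination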

To predict categorical samples, you can use the Lazy Classifier from Lazy Predict package. Let’s assume this time we are interested in whether the sepal and petal length/width can be used to predict the species, versicolor. First, import the necessary packages:

from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

The Lazy Classifier can then be executed with the following command:

X = df_cat_to_array.drop(['species_setosa', 'species_versicolor', 'species_virginica'], axis=1)
Y = df_cat_to_array['species_versicolor']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=55)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Note that I have dropped the columns for the other flower species, as we are interested in predicting versicolor from the petal and sepal length/width. This shows that the research question is critical for machine learning! The output is as follows:

Model                          Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
KNeighborsClassifier               1.00               1.00     1.00      1.00        0.02
SVC                                1.00               1.00     1.00      1.00        0.02
AdaBoostClassifier                 0.97               0.95     0.95      0.97        0.13
XGBClassifier                      0.97               0.95     0.95      0.97        0.05
RandomForestClassifier             0.97               0.95     0.95      0.97        0.32
NuSVC                              0.97               0.95     0.95      0.97        0.06
BaggingClassifier                  0.97               0.95     0.95      0.97        0.04
LGBMClassifier                     0.97               0.95     0.95      0.97        0.05
DecisionTreeClassifier             0.97               0.95     0.95      0.97        0.02
ExtraTreesClassifier               0.97               0.95     0.95      0.97        0.14
LabelPropagation                   0.93               0.93     0.92      0.93        0.02
LabelSpreading                     0.93               0.93     0.92      0.93        0.02
GaussianNB                         0.93               0.90     0.90      0.93        0.02
ExtraTreeClassifier                0.93               0.90     0.90      0.93        0.01
QuadraticDiscriminantAnalysis      0.93               0.90     0.90      0.93        0.03
NearestCentroid                    0.60               0.65     0.65      0.61        0.03
BernoulliNB                        0.67               0.65     0.65      0.67        0.02
CalibratedClassifierCV             0.67               0.60     0.60      0.66        0.04
RidgeClassifierCV                  0.67               0.60     0.60      0.66        0.02
RidgeClassifier                    0.67               0.60     0.60      0.66        0.03
LinearSVC                          0.67               0.60     0.60      0.66        0.02
LogisticRegression                 0.67               0.60     0.60      0.66        0.04
LinearDiscriminantAnalysis         0.67               0.60     0.60      0.66        0.02
SGDClassifier                      0.70               0.57     0.57      0.64        0.02
Perceptron                         0.63               0.55     0.55      0.61        0.06
PassiveAggressiveClassifier        0.63               0.55     0.55      0.61        0.05
DummyClassifier                    0.57               0.47     0.47      0.54        0.02

This output suggests that the K-Neighbors Classifier, SVC, AdaBoost Classifier, XGB Classifier and Random Forest Classifier could be explored as machine learning models to predict the versicolor species. You can further refine the models by hyperparameter tuning to improve the accuracy.

Overall, Lazy Predict can be a handy tool for selecting which of the dozens of available machine learning models is most suitable for predicting your response variable, before tuning the hyperparameters. Certainly a good starting point for beginners venturing into machine learning!

Posted in machine learning, python

The fundamentals of machine learning

Machine learning is a hot topic in data science, but few people understand the concepts behind it. You may be fascinated by how people land high-paying jobs because they know how to execute machine learning, and decide to delve deeper into the topic, only to be quickly intimidated by the sophisticated theorems and mathematics behind it. While I am no machine learning expert, I hope to provide some basics about machine learning and show how you can use Python to perform machine learning in some of my future blog entries.

With all the machine learning tools available at your fingertips, it is often tempting to jump straight into solving a data-related problem by running your favourite algorithm. However, this is usually a bad way to begin your analysis. In fact, executing machine learning algorithms plays only a small part of data analysis and the decision-making process. To make proper use of machine learning, I would recommend taking a step back and looking at the research question from a helicopter view. What are you trying to answer? Are you trying to classify severe and mild dengue cases based on cytokine levels? Are you predicting antibody responses based on gene expression levels? Do you have an end goal in mind?

Often, the research question will define the machine learning model to execute, and this depends heavily on your explanatory and response variables. It also depends on whether supervised or unsupervised learning suits your dataset. Another point to consider is the impact of using machine learning. Does it help the company save money? Would it have a positive impact on clinical practice? If you cannot envision an impact, then it may not be worth the trouble to use machine learning after all.

Once you have an end goal in mind, you are ready to use machine learning tools to analyse your data. Before that, however, it is critical to ensure that your data is of good quality. As the saying goes: garbage in, garbage out. No machine learning model can resurrect or learn from a poor-quality dataset. It is also important to get a reasonable sense of your dataset. Specifically, are there any outliers? If there are, is it worth removing or adjusting them before performing machine learning? Does your dataset have high variance and require a Z-score transformation for better prediction? Based on these questions, it is often a good idea to use data visualisation tools to understand your dataset before performing machine learning. A quick way to get a sense of the data is to run the following code on your dataframe (assigned as df in my example):

df.describe()

Another way to get a sense of the data is to compute a correlation matrix with the following command (but do note that correlation does not capture non-linear relationships):

df.corr()

Histograms can be used to quickly inspect the distribution of the data. To plot histograms for each variable, I recommend using the Lux package, which I have covered in a previous blog post.
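If you prefer a quick one-liner without installing additional packages, pandas can also draw a histogram for every numeric column (a minimal sketch, assuming your dataframe is named df as above):

import matplotlib.pyplot as plt

df.hist(figsize=(10, 8))   # one histogram per numeric column
plt.show()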

After preparing your dataset, it is time to use the machine learning algorithms. At this point, you hope to find a model that best represents your data. A reasonable approach is to split your data into two sets: the training set and the test set (see figure below). Usually, approximately 70%–80% of the data is used for training and the remaining 20%–30% is used as a test set to evaluate the accuracy of the model. If you have several models that perform equally well, you may even consider using a validation set or cross-validation to ascertain which model best describes the data.
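A minimal sketch of such a split with scikit-learn, assuming your explanatory variables are stored in X and your response variable in y:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)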

This article covers the fundamentals of machine learning. Subsequently, I will expand on the different machine learning models, how to execute them effectively, and the advantages and limitations of each. Excited to learn more and share with all of you. Meanwhile, stay tuned! 🙂

Posted in Data visualisation, python

Making interactive and executable notebooks/dashboards with Lux and Voilà packages

As described in my previous blog posts, pair plots and correlation matrices are useful for quickly analysing relationships between multiple variables. However, when analysing datasets with more than 10–20 variables, these methods have their limitations. One critical limitation is the difficulty of assessing the effects of 2 or more categorical variables. Another is the difficulty of visualising a large number of variables plotted together in one figure.

Here, I propose two Python packages that will make graph plotting less tedious. The first is the Voilà package, which supports Jupyter interactive widgets, and the second is the Lux package, which integrates a Jupyter widget that allows users to quickly browse through large collections of data directly within their notebooks. These packages display data interactively, which makes them highly versatile for fast experimentation with data. I will share the features of these 2 packages as I run through the procedure for using them.

First, we need to install the packages. To install Voilà, execute the following command:

pip install voila

Note that this command will also automatically install Voilà as an extension if you have version 3 of JupyterLab installed:

Next, to install Lux, enter the following command in terminal:

pip install lux-api

With these packages installed, you are now ready to use these applications. We will import all the necessary packages and use the iris dataset from Plotly Express as a proof of concept:

import csv
import lux
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df = px.data.iris()

The output is as follows:

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Next, launch Voilà using the Jupyter Notebook extension (indicated by the arrow):

This should allow you to toggle freely between the Pandas and Lux views (indicated by the arrow):

The Lux package automatically performs pairwise plots against all variables in your dataframe, allowing you to focus on the scatterplots that show correlated relationships. Under the "Distribution" tab, you can quickly visualise the distribution of the data:

For the correlation graphs, you can click on the plots you are interested in and then click the magnifying glass, which indicates "intent". This zooms into the graph further, allowing you to see how the other variables (in this case, species) interact with your scatterplot. This immediately provides you with a 3-dimensional view of the dataset.
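If you prefer to specify the intent in code rather than clicking within the widget, Lux also lets you set it directly on the dataframe (a sketch based on my understanding of the lux-api package; the column names are from the iris dataset):

# Steer the recommendations towards the columns of interest, then display the dataframe again
df.intent = ["sepal_length", "sepal_width"]
df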

As you can begin to appreciate, the Voilà package is a powerful tool for serving interactive widgets directly from your local address. You can even publish the dashboard online, allowing other users to explore the dataset according to their research intent. Combined with the Lux package, this allows users, even those without a bioinformatics background, to zoom into the interesting scatterplots for further analysis.

Posted in python

Executing Excel functions in Jupyter Notebooks using Mito


Data scientists often need to stratify their data using "IF"-like conditions. For instance, how do immune responses after vaccination differ between the young and the elderly? In this case, you would filter the young using a pre-defined age cut-off (less than 65 years old). Thereafter, the data can be sorted in descending order by fold change, so as to determine the genes that are most differentially regulated. Further filtering can then identify differentially expressed genes (DEGs) based on pre-determined fold change and p-value cut-offs. Then another possibility may arise, where you wonder whether the dataset is influenced by gender differences, and how gender interacts with age to influence vaccine responses. This requires even more filtering of the dataset to probe these interesting biological questions…

As you can imagine, there are many instances of filtering and comparison to perform, especially when analysing a human dataset, due to the large variation between individuals. Microsoft Excel is a useful tool for many of these filtering and aggregation functions, but a major disadvantage is that it generates many Excel files and sheets, which quickly become difficult to manage. Python can perform these filtering functions easily, but this usually requires typing long blocks of code, which can be time-consuming. Hence, I recommend a Python package named Mito that can accelerate the execution of these filtering functions.

To install the package, execute the following command in the terminal to download the installer:

python -m pip install mitoinstaller

Ensure that you have Python 3.6 or above to utilise this package. Next, run the installer in the terminal:

python -m mitoinstaller install

Finally, launch JupyterLab and restart your Notebook Kernel:

python -m jupyter lab

To run the package in Jupyter Lab:

import mitosheet
mitosheet.sheet()

Output is as such:

The features of a mitosheet include:

1. Import datasets
2. Export datasets
3. Add columns (you can insert formulas here)
4. Delete columns that are unnecessary
5. Pivot tables, which allow you to transpose and query the dataframe differently
6. Merge datasets (possible if you have 2 Excel files with a shared column)
7. Basic graph features, using Plotly
8. Filter data
9. Toggle between different Excel sheets and insert new sheets

Most excitingly, Mito also generates the equivalent annotated Python code for each of these edits, meaning you can easily reuse the commands for further analysis in Python. Overall, Mito is a user-friendly tool that allows you to perform Excel functions within Python. This makes the workflow much more efficient, as we do not have to import multiple Excel files into Jupyter notebooks. Critically, typing long blocks of code is not required, as the Python code is automatically generated as you plough through the data.

Posted in python

Saving Python figures for scientific publications

High-resolution image files are essential for scientific publications. I personally prefer to use Adobe Illustrator to make my scientific figures, as the created files are in a vector-based format that scales without any loss of resolution. To export graphs made with Matplotlib, you can execute the code below after plotting your graph in Jupyter Notebooks:

plt.savefig('destination_path.eps', format='eps')
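If you also need to trim excess white space, or keep a high-resolution raster copy for journals that ask for one, savefig accepts a few extra arguments (a sketch; the file names are arbitrary):

plt.savefig('figure.eps', format='eps', bbox_inches='tight')   # vector file with trimmed margins
plt.savefig('figure.png', format='png', dpi=600)               # high-resolution raster copy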

Saving the file directly in .eps format allows you to transfer it straight from Jupyter Notebooks into Adobe Illustrator. Moreover, you can manipulate the file easily within Adobe Illustrator without any loss in figure resolution. Hope you like this TGIF tip! 🙂

Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. The magnitude differences are usually represented as changes in hue or colour intensity. However, if you want to include another parameter, this can be challenging. Imagine the scenario where you identified 5 Gene Ontology Biological Process (GOBP) terms that are significantly different between the infected and uninfected samples over a course of 3 days. To plot them on a graph, you can negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you need 2 figure panels to fully describe the pathway analysis. A more elegant way to display these results is to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting data points as dots rather than as rectangles in heatmaps is that you can alter the size of the dots to add another dimension to your visualisation. For instance, in this specific example, you can make the size of the dots proportional to the p-values and use the hue of the dots to represent the enrichment score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

We then load a 'test' dataset from my desktop, in a format where the columns contain the timepoints, pathway terms, negative logP values and combined scores. It is also a good habit to convert the timepoint to a "string" datatype so that the x-axis does not include default intermediate time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")
df.head(5)

Output is as follows:

  timepoint                          Term  adjusted_pval  Combined_score   neg_logP
0         1     Defense Response To Virus   3.942000e-25          87.531  24.404283
1         2     Defense Response To Virus   3.940000e-27         875.310  26.404283
2         3     Defense Response To Virus   5.000000e-02           2.000   1.301030
3         1  Defense Response To Symbiont   2.256000e-25          95.555  24.646661
4         2  Defense Response To Symbiont   2.260000e-27         955.550  26.646661

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)

fig.show()

Output is a dot plot, where dot size is proportional to the -log p-value and colour intensity to the combined score. You can customise the colours using the palettes available at this website:

Because of the ability of the dot plot to add another dimension to the analysis, most pathway analyses are presented as dot plots. However, I am sure there are other scenarios where dot plots can be appropriately used. Next time, before you plot multiple heatmaps, do consider the possibility of using dot plots as an alternative data visualisation tool!

Posted in Data visualisation, pair plots, python

Visualising multidimensional data with pair plots

Pair plots can be used effectively to visualise relationships between multiple variables, allowing you to zoom into the interesting aspects of the dataset for further analysis

As discussed in my previous blog entry, correlation coefficients measure the strength of the relationship between 2 variables. In general, correlation coefficients of 0.5 and above are considered strong, around 0.3 moderate, and around 0.1 weak. To quickly visualise the relatedness between multiple variables, a correlation matrix can be used. However, there are some limitations. For instance, the correlation matrix does not provide information on the steepness of the slope between variables, nor does it capture the relationship if the variables are not linearly correlated. In addition, the correlation matrix cannot show how categorical variables might interact with the other continuous variables.

Let me provide an example to illustrate my case. Say you want to understand whether flower sepal length, sepal width, petal length and petal width are correlated. Plotting a correlation matrix for these 4 variables will allow you to visualise the strength of the relationships between these continuous variables. However, if you have another, categorical variable, for example flower species, and you are interested in how flower species interacts with all these parameters, then merely plotting a correlation matrix is inadequate.

To get around this problem, pair plots can be used. Pair plots allow us to see the distribution of, and relationships between, multiple continuous variables. In Python, you can even use colours to visualise how categorical variables interact with the other continuous variables. Plotting pair plots in Python is surprisingly straightforward. Let's use the iris dataset from Plotly Express as a proof of concept.

We first load the required packages, the iris dataset from Plotly Express, and display all the variables from the dataset with the head() function:

# Load packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load iris dataset from plotlyexpress
df = px.data.iris()
df.head()

The output looks like this, with sepal length, sepal width, petal length and petal width as continuous variables and flower species as the categorical variable:

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

To visualise the correlations between the continuous variables, the pair plot can be plotted with a single line of code. In addition, we can annotate the different flower species with different colours. Finally, rather than plotting the same scatterplot on both sides of the matrix, we will plot the lower half of the matrix with the best-fit straight line so we can visualise the slope of each relationship. The code is as follows:

g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)

Output is as follows:

Isn't it cool that you can immediately see that the variables are strongly correlated with each other, and that these relationships depend on the flower species? As you may already appreciate, the pair plot is a useful tool for visualising and analysing multidimensional datasets.

Posted in python

Converting strings to list in Python

Lists are used to store multiple items in a single variable. Grouping items together in a single list tells Python that these variables should be analysed as a group, rather than as individual entries. However, converting a string of items into a list may not be an easy task. In the past, my solution was to copy the item list into a new notebook and use the "find and replace" function to convert delimiters to ",". Recently, I chanced upon an article that describes how you can do this task elegantly in a single line of code. Imagine you have a list of genes from an EnrichR pathway analysis, and the genes are separated by semicolons (;). You can convert these genes to a list with a single command:

DEGs = "IFNG;IFNA14;CXCL10;RSAD2;ISG15"
Gene_list = DEGs.split(";")
print(Gene_list)

The output is the list of genes, which you can then use to query other datasets:

['IFNG', 'IFNA14', 'CXCL10', 'RSAD2', 'ISG15']
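For instance, the resulting list can be used directly to filter a dataframe of results (a sketch; the dataframe and column name here are hypothetical):

# Keep only the rows whose gene symbol appears in Gene_list
filtered = results_df[results_df['gene_symbol'].isin(Gene_list)]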

Super cool!