Posted in python

Building interactive dashboards with Streamlit (Part I)

Iris flower dataset used to illustrate how we can use Streamlit to build data dashboards

Python is often used in back-end programming, which builds the functionality of web applications. For instance, Python can connect a website to a database so that users can query information. By adding back-end components, Python turns a static webpage into a dynamic web application, where users can interact with content backed by an up-to-date, connected database. If the data is large, developers can even use Python's machine learning tools to make predictions. However, front-end development, the part that makes webpages attractive and interactive, is usually done with other languages such as HTML, CSS and JavaScript. The question remains: do I need to be a full-stack developer who has mastered both front-end and back-end web development? That can be time-consuming, as it means learning multiple languages.

This is where Streamlit, Django and Flask come to the rescue! These frameworks are built on the Python programming language and allow you to build websites with minimal knowledge of CSS and JavaScript. Among them, I personally found Flask and Streamlit easier to learn and implement. With more experience, I decided to focus on Streamlit, as its syntax is easier to understand and deployment is more rapid.

Rather than going through Streamlit commands one by one, I will illustrate the functionality of the different commands using my favourite iris dataset, which is readily available in Plotly Express. The GitHub repository is publicly available, and the output of the code is hosted here.

First, we will create a file called iris.py using a text editor such as Sublime Text or Visual Studio Code. In this blog entry, we will focus on acquiring the basic statistics of the individual columns and presenting this information in a data dashboard for project showcasing. As with all Python scripts, we first import the required packages:

import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
from wordcloud import WordCloud
from typing import Any, List, Tuple

After that, we type in the commands needed to read the iris dataframe. The data is available in PlotlyExpress and can be loaded into Streamlit with the following commands. The title of the dataframe is also included for data dashboarding:

st.title('Data analysis of iris dataset from PlotlyExpress')
df = px.data.iris()

Next, basic statistics such as count, mean, standard deviation, min, max and quantiles can be displayed with df.describe(). However, one advantage of Streamlit is the ability to add widgets, which prevents the dashboard from looking too cluttered. In this example, we will create a widget that lets users pick a numeric column and then display the basic statistics for that column:

# Widget to choose which numeric column to summarise
num_cols = df.select_dtypes(include=np.number).columns
column = df[st.selectbox("Select column to display statistics", num_cols)]

# Display the basic statistics in five side-by-side metric widgets
col1, col2, col3, col4, col5 = st.columns(5)
col1.metric("Mean", column.mean())
col2.metric("Max", column.max())
col3.metric("Min", column.min())
col4.metric("Std", column.std())
col5.metric("Count", int(column.count()))

The output will generate five columns showing the mean, max, min, standard deviation and count for the selected column. The widget can be visualised here.

These statistics are appropriate for numeric or continuous variables. However, to visualise the categorical variable (in this case, the flower species), we can use a word cloud, or a table showing the exact counts for each species, using the following commands:

# Build wordcloud
non_num_cols = df.select_dtypes(include=object).columns
column = st.selectbox("Select column to generate wordcloud", non_num_cols)
column = df[column]
wc = WordCloud(max_font_size=25, background_color="white", repeat=True, height=500, width=800).generate(' '.join(column.unique()))
st.image(wc.to_image())

# Build table
st.markdown('***Counts in selected column for generating WordCloud***')
unique_values = column.value_counts()
st.write(unique_values)
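
With the script complete, you can preview the dashboard locally by running the app from the terminal (assuming the file is saved as iris.py, as above):

streamlit run iris.py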

The aim of Part I is to provide various ways to inspect the distribution of the categorical and continuous variables within your dataset. I will cover other topics, including data visualisation, machine learning and interactive graphical plots, in subsequent posts!

Posted in Demographics signature

The transcriptional landscape of age in human peripheral blood

Figure 1: Molecular pathways that were most differentially regulated by age. Source: Marjolein J. Peters et al., 2016.

Chronological age is a major risk factor for many common diseases including heart disease, cancer and stroke, three of the leading causes of death.

Previously, APOE, FOXO3 and 5q33.3 were the only identified loci consistently associated with longevity.

The discovery stage included six European-ancestry studies (n=7,074 samples) with whole-blood gene expression levels (11,908 genes). The replication stage included 7,909 additional whole-blood samples. A total of 1,497 genes were found to be associated with age, of which 897 are negatively correlated and 600 are positively correlated.

Among the negatively age-correlated genes, three major clusters were identified. The largest group, Cluster #1, consisted of three sub-clusters enriched for (1a) RNA metabolism functions, ribosome biogenesis and purine metabolism; (1b) multiple mitochondrial and metabolic pathways, including 10 mitochondrial ribosomal protein (MRP) genes; and (1c) DNA replication, elongation and repair, and mismatch repair. Cluster #2 contained factors related to immunity, including T- and B-cell signalling genes and genes involved in hematopoiesis. Cluster #3 included cytosolic ribosomal subunits.

The positively age-correlated genes revealed four major clusters. Cluster #1: Innate and adaptive immunity. Cluster #2: Actin cytoskeleton, focal adhesion, and tight junctions. Cluster #3: Fatty acid metabolism and peroxisome activity. Cluster #4: Lysosome metabolism and glycosaminoglycan degradation.

DNA methylation, measured by CpG methylation, was not associated with chronological age but was associated with gene expression levels. This result hints at the possibility that DNA methylation could affect the regulation of gene expression.

Transcriptomic age and epigenetic age (both Hannum and Horvath) were positively correlated, with r-squared values varying between 0.10 and 0.33.

Posted in python

Streamlit: My recommended tool for data science app development

Streamlit is a web application framework that helps in building Python-based web applications for sharing data, creating data dashboards, developing apps and even deploying machine learning models. In addition, developing and deploying Streamlit apps is amazingly quick and versatile, allowing simple apps to be built in a few hours. In fact, I was able to make a simple app within a day after reading this book:

Getting Started with Streamlit for Data Science | Packt
My favourite book for Streamlit

This book summarises the basics of Streamlit and provides realistic examples of executing Python code and developing simple apps. Streamlit is relatively new, so there are very few books available to learn from. Nonetheless, from the examples provided in the book, I was able to make a simple scatterplot app. We first load the dataset to make sure we are analysing the correct data:

The commands that make up the above section are as follows:

st.title() # Gives the header title
st.markdown() # Can provide descriptions of the title in smaller fonts
st.file_uploader() # Creates a file uploader for users to drag and drop
st.write() # Loads the details of the dataset 
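
For illustration, here is a minimal sketch of how these commands could fit together (the uploader label and CSV assumption are mine, not the exact code behind the screenshot):

import streamlit as st
import pandas as pd

st.title("Scatterplot app")
st.markdown("Upload a CSV file to explore the dataset")

# File uploader widget; returns None until a file is dropped in
uploaded_file = st.file_uploader("Upload your CSV file", type="csv")
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write(df.head())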

Next, we want to create widgets that allow users to select their x and y-axis variables:

To make the above section, the relevant commands are as follows:

st.selectbox() #To create widgets for users to select
px.scatter() # For plotting scatterplot with plotly express
st.plotly_chart() # To display the figure
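
Again, a minimal sketch of how these pieces could be combined, continuing from the dataframe df loaded in the previous sketch (the widget labels are assumptions):

import streamlit as st
import plotly.express as px

# Widgets for users to pick the x- and y-axis variables (df comes from the uploaded file above)
x_axis = st.selectbox("Select your x-axis variable", df.columns)
y_axis = st.selectbox("Select your y-axis variable", df.columns)

# Interactive scatterplot built with Plotly Express
fig = px.scatter(df, x=x_axis, y=y_axis)
st.plotly_chart(fig)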

Why use Plotly? Because Plotly lets you interact with the graph: zooming in and out, mousing over data points to inspect their attributes, and selecting regions to crop the range of the x- and y-axes.

Deploying in Streamlit is fast, but the initial setup can be time-consuming, especially the first time you try it. The essential steps are as follows:
1. Create a GitHub account
2. Contact the Streamlit Team to allow the developers to connect Streamlit to your GitHub account
3. Create the GitHub repository. You can choose to make it public or private. To generate the requirements.txt file, install the pipreqs package and run it in your project folder (see the example commands after this list).
4. Create apps within the Streamlit account by adding the Github repository address and specifying the Python file to execute.
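
As a rough guide, the requirements.txt step might look like this in the terminal (the project path is a placeholder):

pip install pipreqs
pipreqs /path/to/your/project   # writes requirements.txt based on the imports in your scripts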

And that’s it! Your web address will start with https://share.streamlit.io, and the deployed website can be shared publicly with anyone. Furthermore, any changes you make within the GitHub repository are immediately reflected in the deployed website. I appreciate the responsiveness and the speed of deployment once everything is set up. Finally, as icing on the cake, you can even convert the weblink into a desktop app with these specific instructions! If you are into app development and you want to stick to the Python language, I strongly recommend Streamlit. Its simplicity from coding to execution and deployment is just so attractive!

Posted in python

My experience with Voilà and Streamlit for building data dashboards

Differences between data engineer vs scientist. Source

The role of a data scientist is clear: to analyse the data, plot visualisation graphs and consolidate the findings into a report. However, with growing interest in a deeper understanding of big data and an urgent need for novel tools to gain insights from biological datasets, there is increasing demand for data engineers. Their roles and responsibilities can include app development, constructing pipelines, data testing and maintaining architectures such as databases and large-scale processing systems. This is unlike data scientists, who are mostly involved in data cleaning, data visualisation and big data organisation.

One aspect of this role is implementing strategies to improve reliability, efficiency and quality, and building data dashboards is an important part of ensuring consistency. This requires moving out of the comfort zone of simply reporting data within Jupyter Notebooks. Data dashboards are often built with JavaScript, but recently several packages have emerged that can be implemented in Python (which means you don't have to learn another language). These packages include Voilà, Panel, Dash and Streamlit. On my end, I have tried Voilà and Streamlit, as they are both easier to implement than Panel and Dash. This blog post will therefore compare my experience with Voilà and Streamlit.

The installation of Voilà and the associated templates is relatively straightforward. You just need to execute these commands to download the packages:

pip install voila
pip install voila-gridstack
pip install voila-vuetify

Once the packages are installed in your environment, you should be able to see the extensions in the Jupyter Notebook toolbar (indicated by the arrow). Clicking on them will render the output of the Python code.

With the gridstack or vuetify templates, you can further rearrange your outputs to display the graphs in your dashboard. The dashboard can then be deployed using Heroku, or deposited in GitHub for deployment on mybinder.org.
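
For reference, Voilà can also be launched from the terminal, and a template can be selected with the --template flag (the notebook name here is a placeholder):

voila my_dashboard.ipynb                        # render the notebook as a standalone dashboard
voila my_dashboard.ipynb --template=gridstack   # render it with the gridstack layout template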

As you can imagine, if you enjoy working within Jupyter Notebooks, Voilà can be a simple and convenient tool for making data dashboards. You can also make the dashboard interactive using ipywidgets, Plotly, Altair or Bokeh. However, a severe limitation is that it is difficult to build multi-page dashboards. This can be an issue if you are developing multiple dashboards, or multiple graphs from different studies.

My delay in writing this blog post is because I spent much of my time finding alternatives for building multi-page dashboards. This is where I learnt about Streamlit. I was very impressed at how simple Python code can be used to develop beautiful dashboards, and I was able to build a simple webpage/dashboard within a few hours of reading online tutorials. With more reading, I was even able to make some simple apps! Using Streamlit is as simple as:

  1. Open terminal window
  2. Install Streamlit
  3. Create a .py file using a text editor such as Sublime Text (my preferred choice), Atom or Visual Studio Code (a minimal example file is shown below)
  4. Execute the file by typing the following in the terminal: streamlit run filename.py

You can install streamlit by using:

pip install streamlit
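
To make the workflow concrete, here is a minimal sketch of what such a filename.py could contain (the content is purely illustrative, not a specific app from this post):

import streamlit as st
import plotly.express as px

st.title("My first Streamlit dashboard")

# Load a small example dataset and show an interactive scatterplot
df = px.data.iris()
st.plotly_chart(px.scatter(df, x="sepal_length", y="sepal_width", color="species"))

Running streamlit run filename.py from the same folder will then open the dashboard in your browser.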

In addition to these cool features, Streamlit supports multi-page apps, which means you can serve multiple data dashboards or apps within a single website. Finally, deployment is also relatively simple with Streamlit Teams, which is attractive. However, if you prefer to work within Jupyter Notebooks, this may not be a great option for you, as the commands are mostly executed via the terminal or in .py files. The other limitation, for which I haven't found a solution, is security: I do not know how to configure the app so that only registered users can access the website.

Overall, deciding on which platform to use will depend on your personal preferences and applications. I prefer Streamlit as it is more versatile and flexible, which may explain its increasing popularity in recent years!

Posted in machine learning, python

Choosing the best machine learning model with LazyPredict

Predicting flower species with machine learning. Source from: https://weheartit.com/entry/14173501

The idea of using machine learning models to make predictions is attractive. But a challenging question remains: there are so many machine learning models out there, so which is most suitable? Which model should I start exploring? Should I fit every single model I have learnt, try my luck, and see which works best for me?

Indeed, these questions cannot be easily answered, especially for beginners. While I am no expert in machine learning, there are some packages that can make your life simpler. One such package is LazyPredict, which ranks the machine learning models most likely to be suitable. LazyPredict contains both LazyClassifier and LazyRegressor, which allow you to predict categorical and continuous variables respectively.

To install the lazypredict package, execute the following code:

pip install lazypredict

The lazypredict package requires the XGBoost and LightGBM extensions to be functional. Execute the following commands in the terminal:

conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm

Now you are ready to analyse with machine learning. First, import the essential packages:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import display

We will import the iris dataset as an example. In this exercise, we will predict sepal width from the iris dataset using machine learning:

df = px.data.iris()
df.head()

Output is as follows. The dataframe contains values for sepal length/width, petal length/width, species (setosa, versicolor, virginica) and species_id (where setosa = 1, versicolor = 2 and virginica = 3):

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Before doing machine learning, it is good practice to "sense" your data to know which machine learning model is likely to fit it. You can perform a pair plot in this case to understand the distribution of your dataset. Some critical considerations include: (i) detecting the presence of outliers, (ii) whether the relationships between variables are linear or non-linear, and (iii) whether the distributions of the variables are Gaussian or skewed. Plotting a pair plot in Python is easy:

from pandas.plotting import scatter_matrix

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
scatter_matrix(df[attributes], figsize = (10,8))

Output is as follows:

The data suggest that there are no big anomalies, and most of the variables follow a linear relationship. At this point, it is not obvious which variable best predicts "sepal width". Notice that we have not yet considered the "species" variable in our analysis. The "species" variable contains three categories: setosa, versicolor and virginica. One way of processing this categorical data is to assign a different number to each category, as in the "species_id" column, where setosa = 1, versicolor = 2 and virginica = 3. However, assigning numbers this way may not be appropriate here, as it implies that virginica is somehow greater than versicolor, and versicolor greater than setosa. To circumvent this issue, we can one-hot encode the data, which assigns a binary value (1 or 0) to each category. Fortunately, the conversion is easy to execute in Python:

df_cat_to_array = pd.get_dummies(df)
df_cat_to_array = df_cat_to_array.drop('species_id', axis=1)
df_cat_to_array

Output is as follows:

     sepal_length  sepal_width  petal_length  petal_width  species_setosa  species_versicolor  species_virginica
0             5.1          3.5           1.4          0.2               1                   0                  0
1             4.9          3.0           1.4          0.2               1                   0                  0
2             4.7          3.2           1.3          0.2               1                   0                  0
3             4.6          3.1           1.5          0.2               1                   0                  0
4             5.0          3.6           1.4          0.2               1                   0                  0
...           ...          ...           ...          ...             ...                 ...                ...
145           6.7          3.0           5.2          2.3               0                   0                  1
146           6.3          2.5           5.0          1.9               0                   0                  1
147           6.5          3.0           5.2          2.0               0                   0                  1
148           6.2          3.4           5.4          2.3               0                   0                  1
149           5.9          3.0           5.1          1.8               0                   0                  1

Each category is now given equal weight: each species column is assigned 1 or 0, where 1 indicates "yes" and 0 indicates "no". We also drop the species_id column, as the categories are already encoded by the get_dummies function.

Now that all columns are converted to numerical values, we are ready to use the lazypredict package to see which machine learning model is best in predicting sepal width. We first import the required packages:

import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

Next, we use lazypredict to identify the machine learning regressor model that can potentially predict sepal width:

X = df_cat_to_array.drop(['sepal_width'], axis=1)
Y = df_cat_to_array['sepal_width']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

The above commands define the explanatory variables in X and the response variable (sepal width) in Y. Hence, we drop the sepal_width column for X and keep only that column for Y. The test set here is 20% of the data and the training set is 80%. It is good practice to specify the random state so that the same split can be reproduced on a different day. The output is as follows:

Model                          Adjusted R-Squared  R-Squared   RMSE  Time Taken
SVR                                          0.70       0.76   0.19        0.01
NuSVR                                        0.70       0.76   0.19        0.02
KNeighborsRegressor                          0.64       0.71   0.20        0.02
RandomForestRegressor                        0.62       0.70   0.21        0.19
GradientBoostingRegressor                    0.62       0.70   0.21        0.05
LGBMRegressor                                0.60       0.69   0.21        0.04
HistGradientBoostingRegressor                0.60       0.68   0.21        0.12
HuberRegressor                               0.60       0.68   0.21        0.04
Ridge                                        0.60       0.68   0.22        0.01
RidgeCV                                      0.60       0.68   0.22        0.01
BayesianRidge                                0.59       0.68   0.22        0.02
ElasticNetCV                                 0.59       0.68   0.22        0.08
LassoCV                                      0.59       0.68   0.22        0.11
LassoLarsIC                                  0.59       0.67   0.22        0.01
LassoLarsCV                                  0.59       0.67   0.22        0.04
TransformedTargetRegressor                   0.59       0.67   0.22        0.01
LinearRegression                             0.59       0.67   0.22        0.01
OrthogonalMatchingPursuitCV                  0.59       0.67   0.22        0.02
LinearSVR                                    0.59       0.67   0.22        0.01
BaggingRegressor                             0.59       0.67   0.22        0.03
AdaBoostRegressor                            0.57       0.66   0.22        0.10
Lars                                         0.53       0.62   0.23        0.02
LarsCV                                       0.53       0.62   0.23        0.04
SGDRegressor                                 0.47       0.58   0.25        0.02
ExtraTreesRegressor                          0.47       0.58   0.25        0.13
PoissonRegressor                             0.45       0.56   0.25        0.02
XGBRegressor                                 0.35       0.48   0.27        0.12
ExtraTreeRegressor                           0.34       0.47   0.28        0.02
GeneralizedLinearRegressor                   0.29       0.44   0.28        0.02
TweedieRegressor                             0.29       0.44   0.28        0.01
DecisionTreeRegressor                        0.29       0.44   0.29        0.02
MLPRegressor                                 0.29       0.44   0.29        0.18
GammaRegressor                               0.29       0.43   0.29        0.01
OrthogonalMatchingPursuit                    0.28       0.43   0.29        0.02
RANSACRegressor                              0.27       0.42   0.29        0.07
PassiveAggressiveRegressor                   0.03       0.23   0.33        0.01
DummyRegressor                              -0.31      -0.04   0.39        0.01
Lasso                                       -0.31      -0.04   0.39        0.02
ElasticNet                                  -0.31      -0.04   0.39        0.01
LassoLars                                   -0.31      -0.04   0.39        0.02
KernelRidge                                -82.96     -65.59   3.11        0.02
GaussianProcessRegressor                  -483.87    -383.55   7.47        0.02

Here we go. We have identified SVR, NuSVR, KNeighborsRegressor, RandomForestRegressor and GradientBoostingRegressor as the top five models (with R-squared values of ~0.7), which is not bad for a start! Of course, we can further refine these models by hyperparameter tuning to explore whether we can improve our predictions.
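
As an example of what that tuning step could look like, here is a minimal sketch using scikit-learn's GridSearchCV on the top-ranked SVR model (the parameter grid is an assumption for illustration, not part of the original analysis):

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid; adjust the ranges to your dataset
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.1, 1],
    "kernel": ["rbf"],
}

grid = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)            # cross-validated R-squared on the training set
print(grid.score(X_test, y_test))  # R-squared on the held-out test set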

To predict categorical variables, you can use LazyClassifier from the lazypredict package. Let's assume this time we are interested in whether sepal and petal length/width can be used to predict the species versicolor. First, import the necessary packages:

from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

The Lazy Classifier can then be executed with the following command:

X = df_cat_to_array.drop(['species_setosa', 'species_versicolor', 'species_virginica'], axis=1)
Y = df_cat_to_array['species_versicolor']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=55)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Note that I have dropped the columns for the other flowers, as we are interested in predicting versicolor from the petal and sepal length/width. This shows how critical the research question is for setting up machine learning! The output is as follows:

Model                          Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
KNeighborsClassifier               1.00               1.00     1.00      1.00        0.02
SVC                                1.00               1.00     1.00      1.00        0.02
AdaBoostClassifier                 0.97               0.95     0.95      0.97        0.13
XGBClassifier                      0.97               0.95     0.95      0.97        0.05
RandomForestClassifier             0.97               0.95     0.95      0.97        0.32
NuSVC                              0.97               0.95     0.95      0.97        0.06
BaggingClassifier                  0.97               0.95     0.95      0.97        0.04
LGBMClassifier                     0.97               0.95     0.95      0.97        0.05
DecisionTreeClassifier             0.97               0.95     0.95      0.97        0.02
ExtraTreesClassifier               0.97               0.95     0.95      0.97        0.14
LabelPropagation                   0.93               0.93     0.92      0.93        0.02
LabelSpreading                     0.93               0.93     0.92      0.93        0.02
GaussianNB                         0.93               0.90     0.90      0.93        0.02
ExtraTreeClassifier                0.93               0.90     0.90      0.93        0.01
QuadraticDiscriminantAnalysis      0.93               0.90     0.90      0.93        0.03
NearestCentroid                    0.60               0.65     0.65      0.61        0.03
BernoulliNB                        0.67               0.65     0.65      0.67        0.02
CalibratedClassifierCV             0.67               0.60     0.60      0.66        0.04
RidgeClassifierCV                  0.67               0.60     0.60      0.66        0.02
RidgeClassifier                    0.67               0.60     0.60      0.66        0.03
LinearSVC                          0.67               0.60     0.60      0.66        0.02
LogisticRegression                 0.67               0.60     0.60      0.66        0.04
LinearDiscriminantAnalysis         0.67               0.60     0.60      0.66        0.02
SGDClassifier                      0.70               0.57     0.57      0.64        0.02
Perceptron                         0.63               0.55     0.55      0.61        0.06
PassiveAggressiveClassifier        0.63               0.55     0.55      0.61        0.05
DummyClassifier                    0.57               0.47     0.47      0.54        0.02

This output suggests that KNeighborsClassifier, SVC, AdaBoostClassifier, XGBClassifier and RandomForestClassifier could be explored as machine learning models to predict the versicolor species. You can further refine these models by hyperparameter tuning to improve accuracy.

Overall, LazyPredict can be a handy tool for selecting which of the dozens of machine learning models it covers is most suitable for predicting your response variable, before investing time in hyperparameter tuning. Certainly a good starting point for beginners venturing into machine learning!

Posted in Dengue, Resource

Gene Expression Patterns of Dengue Virus-Infected Children from Nicaragua Reveal a Distinct Signature of Increased Metabolism

Pathways and genes that are differentially regulated in DF, DHF and DSS patients. Source from Loke et al., PLOS NTD, 2010.

Identifying signatures of host genome-wide transcriptional patterns can be a tool for biomarker discovery as well as for understanding molecular mechanisms and pathophysiological signatures of disease states.

In this study by Loke et al., transcriptional profiling of pediatric patients from Nicaragua with predominantly DENV-1 infection was performed, and gene signatures were compared between healthy individuals and patients with dengue fever (DF), dengue haemorrhagic fever (DHF) and dengue shock syndrome (DSS). Enrolment criteria consisted of hospitalised patients younger than 15 years of age. Whole blood was collected during acute illness (days 3-6).

Unsupervised clustering revealed that DHF and DF patients cluster distinctly from DSS patients. Interestingly, many of the genes that separate these groups are involved in ‘protein biosynthesis’ and ‘protein metabolism and modification’. A large number of mitochondrial ribosomal protein genes and ‘nucleic acid binding’ genes were also flagged (see figure above).

Genes related to metabolism, oxidative phosphorylation, protein targeting, nucleic acid metabolism, purine and pyrimidine metabolism, electron transport, DNA metabolism and replication, and protein metabolism and modification were differentially regulated in DF, DHF and DSS patients, reflecting a shared signature of DENV-1 infection.

On the other hand, the biological processes differentially expressed in DSS patients included protein metabolism and modification, intracellular protein traffic, pre-mRNA processing, mRNA splicing, nuclear transport, protein-lipid modification and protein folding.

Of note, the changes in metabolism genes cannot be seen in vitro. Instead, interferon signatures were upregulated.

Data is deposited in Gene Expression Omnibus (GEO) under GSE25226.

Posted in machine learning, python

The fundamentals of machine learning

Machine learning is a hot topic in data science, but few people understand the concepts behind it. You may be fascinated by how people land high-paying jobs because they know how to execute machine learning, and decide to delve deeper into the topic, only to be quickly intimidated by the sophisticated theorems and mathematics behind it. While I am no machine learning expert, I hope to provide some basics about machine learning and how you can use Python to perform machine learning in future blog entries.

With all the machine learning tools available at your fingertips, it is often tempting to jump straight into solving a data-related problem by running your favourite algorithm. However, this is usually a bad way to begin your analysis. In fact, executing machine learning algorithms plays only a small part in data analysis and the decision-making process. To make proper use of machine learning, I recommend taking a step back and looking at the research question from a helicopter view. What are you trying to answer? Are you trying to classify severe and mild dengue cases based on cytokine levels? Are you predicting antibody responses based on gene expression levels? Do you have an end goal in mind?

Often, the research question will define the machine learning model to execute, and this is highly dependent on your explanatory and response variables. It also depends on whether supervised or unsupervised learning suits your dataset. Another point to consider is the impact of using machine learning. Does it help the company save money? Would it have a positive impact on clinical practice? If you cannot envision an impact, then it may not be worth the trouble to use machine learning after all.

Once you have an end goal in mind, you are ready to use machine learning tools to analyse your data. Before that, however, it is critical to ensure that your data is of good quality. As the saying goes: garbage in, garbage out. Machine learning cannot resurrect a poor-quality dataset. Next, it is important to get a reasonable sense of your dataset. Specifically, are there any outliers? If there are, is it worth removing or adjusting them before performing machine learning? Does your dataset have high variance and require a Z-score transformation for better prediction? With these questions in mind, it is often a good idea to use data visualisation tools to understand your dataset before performing machine learning. A quick way to get a feel for the data is to run the following on your dataframe (assigned as df in my example):

df.describe()

Another way to get a sense of the data is to compute the correlation matrix with the following command (but do note that correlation does not capture non-linear relationships):

df.corr()

Histograms can be used to quickly check the distribution of the data. To plot histograms for each variable, I recommend the Lux package, which I have previously covered on my blog.

After preparing your dataset, it is time to apply the machine learning algorithms. At this point, you hope to find a model that best represents your data. A reasonable approach is to split your data into two sets: the training set and the test set (see figure below). Usually, approximately 70-80% of the data is used for training and the remaining 20-30% is used as the test set to evaluate the accuracy of the model. If several models perform equally well, you may even consider a validation set or cross-validation to ascertain which model best describes the data.
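
As a minimal sketch of that split (an 80/20 split on a generic feature matrix X and target y, both assumed to be defined already; the linear model is just a placeholder candidate):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a candidate model on the training set and evaluate it on the test set
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))            # R-squared on unseen data

# Optional: 5-fold cross-validation on the training set to compare candidate models
print(cross_val_score(LinearRegression(), X_train, y_train, cv=5).mean())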

This article covers the fundamentals of machine learning. Subsequently, I will expand on the different machine learning models, how to execute them effectively, and the advantages and limitations of each. Excited to learn more and share with all of you. Meanwhile, stay tuned! 🙂

Posted in Resource, VSV vectors

Systems Vaccinology Identifies an Early Innate Immune Signature as a Correlate of Antibody Responses to the Ebola Vaccine rVSV-ZEBOV

Immunologic parameters that are correlated with antibody responses to rVSV-EBOV. Source from Rechtien et al., Cell Reports, 2017.

Predicting and achieving vaccine efficacy remains a major challenge. Here, Rechtien et al. used a systems vaccinology approach to disentangle the early innate immune responses elicited by the Ebola vaccine rVSV-Zaire Ebola virus (ZEBOV), in order to identify innate immune responses correlating with Ebola virus (EBOV) glycoprotein (GP)-specific antibody induction. Of note, this replication-competent recombinant vaccine candidate is based on the vesicular stomatitis virus (rVSV) vaccine vector, which has been shown to be safe and immunogenic in a number of phase I trials.

The rVSV-ZEBOV vaccine induced a rapid and robust increase in cytokine levels, peaking at day 1, especially for CXCL10, MCP-1 and MIP-1β. Assessment of PBMCs revealed significant induction of co-stimulatory molecules, monocyte/DC activation and NK cell activation at day 1 post-vaccination. The expression of these molecules began to decline by day 3.

Interestingly, CXCL10 plasma levels and the frequency of activated NK cells at day 3 were positively correlated with antibody responses, whereas CD86+ expression on monocytes and mDCs at day 3 was negatively correlated with antibody responses (see figure above).

The largest number of upregulated genes was detected at day 1 post-vaccination. Critically, the early gene signature linked to the CXCL10 pathway, including TIFA (TRAF-interacting protein with forkhead-associated domain) on day 1, SLC6A9 (solute carrier family 6 member 9) on day 3, NFKB1 and NFKB2, was most predictive of antibody responses.

Data is stored under NCBI GEO: GSE97590.

Posted in Data visualisation, python

Making interactive and executable notebooks/dashboards with Lux and Voilà packages

As described in my previous blog posts, pair plots and correlation matrices are useful for quickly analysing relationships between multiple variables. However, when analysing datasets with more than 10-20 variables, these methods have their limitations. One critical limitation is the difficulty of assessing the effects of two or more categorical variables. Another is the difficulty of visualising a large number of variables in a single plot.

Here, I propose two Python packages that make graph plotting less tedious. The first is Voilà, which supports Jupyter interactive widgets, and the second is Lux, which integrates a Jupyter widget that allows users to quickly browse through large collections of data directly within their notebooks. These packages display data interactively, which makes them highly versatile for fast experimentation with data. I will share the features of these two packages as I run through the steps for using them:

First, we need to install the packages. To install Voilà, execute the following command:

pip install voila

Note that Voilà will also be automatically installed as an extension if you have JupyterLab version 3 installed:

Next, to install Lux, enter the following command in terminal:

pip install lux-api

With these packages, you are now ready to use these applications. We will import all the necessary packages and use the iris dataset from Plotly Express as a proof of concept:

import csv
import lux
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df = px.data.iris()

The output is as follows:

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Next, launch Voilà using the Jupyter Notebook extension (indicated by the arrow):

This should allow you to toggle freely between the Pandas and Lux views (indicated by the arrow):

The Lux package automatically performs a pairwise plot against all variables in your dataframe, allowing you to focus on the scatterplots that show correlated relationships. Under the “Distribution” tab, you can quickly visualise the distribution of the data:

For the correlation graphs, you can click on the plots you are interested in and then click the magnifying glass, which indicates “intent”. This zooms into the graph further, allowing you to see how another variable (in this case, species) interacts with your scatterplot, immediately providing a three-dimensional view of the dataset.
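
If you prefer to set the intent in code rather than by clicking, Lux also lets you attach it to the dataframe directly. A minimal sketch (the chosen columns are just an example):

import lux
import plotly.express as px

df = px.data.iris()

# Declare which columns you are interested in; the Lux widget then
# organises its recommended visualisations around this intent
df.intent = ["sepal_length", "sepal_width"]
df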

As you can begin to appreciate, the Voilà package is a powerful tool for serving interactive widgets directly from your local address. You can even publish the notebook online, allowing other users to explore the dataset with the interactive widgets according to their research intent. Combined with the Lux package, this allows users, even those without a bioinformatics background, to zoom into interesting scatterplots for further analysis.

Posted in Dengue, Resource

A 20-Gene Set Predictive of Progression to Severe Dengue

Methodology employed by Robinson et al., Cell Reports, 2019. The 20-gene set was used to distinguish between individuals with severe and mild dengue

The gene signatures predictive of severe dengue disease progression is poorly understood.

The study by Robinson et al. utilised 10 publicly available datasets, divided into 7 “discovery” and 3 “validation” datasets. In the discovery datasets, a total of 59 differentially expressed genes (FDR < 10%, effect size > 1.3-fold) were detected between patients who progressed to DHF and/or DSS (DHF/DSS) and patients with an uncomplicated course (dengue fever).

An iterative greedy forward search applied to the 59 genes revealed a final set of 20 differentially expressed genes (3 over-expressed, 17 under-expressed) in DHF/DSS (gene list shown in the figure above). A dengue score for each sample was obtained by subtracting the geometric mean expression of the 17 under-expressed genes from the geometric mean expression of the 3 over-expressed genes.

The 20-gene dengue severity score distinguished DHF/DSS from dengue fever upon presentation, prior to the onset of severe complications, with a summary area under the curve (AUC) of 0.79 in the discovery datasets. The 20-gene dengue score also accurately identified dengue patients who went on to develop DHF/DSS in all three validation datasets.

To further validate this signature, the authors tested a cohort of prospectively enrolled dengue patients in Colombia. The 20-gene dengue score, measured by qPCR, distinguished severe dengue from dengue with or without warning signs (AUC = 0.89) and even severe dengue from dengue with warning signs (AUC = 0.85).

Finally, the 20-gene set is significantly downregulated in natural killer (NK) and NK T (NKT) cells, indicating the role of NK and NKT cells in modulating severe disease outcome.

Dataset deposited under Gene Expression Omnibus (GEO): GSE124046