Featured
Posted in Data visualisation, python

Plotting volcano plots with Plotly

As mentioned previously, I have highlighted why volcano plots are so important in omics research; a full article summarising the main points and rationale is available here. In this entry, we will explore how we can use Plotly to build volcano plots!

We will analyse a transcriptomics dataset published by Zak et al., PNAS, 2012. In this study, seronegative and seropositive subjects were given the MRKAd5/HIV vaccine, and the transcriptomic responses in peripheral blood mononuclear cells (PBMC) were measured at 6 hours, 1 day, 3 days and 7 days post-vaccination. On my end, I have processed and compiled the fold-change, ratio, p-values and adjusted p-values (q-values) of the seronegative subjects with respect to time = 0 using the Partek Genomics Suite, and the processed data is available in GitHub.

We first load the required packages (pandas, numpy and plotly) to plot our volcano plot using Python. Note that you will need to install the packages before you can start using them. I have chosen the Plotly graphing library for data visualisation, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

Next, we will load and inspect the processed dataframe from GitHub. It is also important to set the gene column as the index column so that we can link specific data points to their gene names. The commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv', index_col=0)
df.head()

The output shows the values of p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for the 6-hour, 1-day, 3-day and 7-day time-points compared to baseline (timepoint = 0):

The next step will be to create new columns for log2FC and -log10(adjusted p-value). The commands are as follows:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])
df['negative_log_pval_6h'] = -np.log10(df['qval_6h'])
df['negative_log_pval_1d'] = -np.log10(df['qval_1d'])
df['negative_log_pval_3d'] = -np.log10(df['qval_3d'])
df['negative_log_pval_7d'] = -np.log10(df['qval_7d'])

Now we are ready to plot the volcano plots for the different time-points. The advantage of using Python is that we can overlay the volcano plots of the different time-points all in one graph. The commands are as follows:

fig = go.Figure()
trace1 = go.Scatter(
    x=df['log2FC_6h'],
    y=df['negative_log_pval_6h'],
    mode='markers',
    name='6hrs',
    hovertext=list(df.index)
)
trace2 = go.Scatter(
    x=df['log2FC_1d'],
    y=df['negative_log_pval_1d'],
    mode='markers',
    name='day 1',
    hovertext=list(df.index)
)
trace3 = go.Scatter(
    x=df['log2FC_3d'],
    y=df['negative_log_pval_3d'],
    mode='markers',
    name='day 3',
    hovertext=list(df.index)
)
trace4 = go.Scatter(
    x=df['log2FC_7d'],
    y=df['negative_log_pval_7d'],
    mode='markers',
    name='day 7',
    hovertext=list(df.index)
)
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.add_trace(trace4)
fig.update_layout(title='Volcano plot for seronegatives')
fig.show()

A brief description of the above code: we first create an empty figure with the go.Figure() command. Next, we overlay the different scatterplots using separate traces (one for each time-point). In each trace, we specify (i) the columns for the x- and y-axes, (ii) mode='markers' so that only points are plotted, (iii) the legend label for the trace and (iv) the text shown when we hover over the data points. Finally, fig.update_layout() adds the title of the plot. The output graph is as follows:

At one glance, you can quickly appreciate that day 1 has the most differentially expressed genes compared to the other time-points. You can also mouse over the data points to examine which of these genes are most differentially expressed. An example is CXCL10, one of the genes that is strongly induced at one day post-vaccination.

Now that we have established that day 1 is the most interesting time-point, we may want to zoom into the genes at the edges of the volcano plot. We can use the command below to plot the day 1 volcano plot on its own. Note that I have added gene labels to the plot this time (via the textposition argument in update_traces) so that we can quickly identify the genes that are most differentially expressed:

fig_d1 = px.scatter(df, x='log2FC_1d', y='negative_log_pval_1d', text=df.index)
fig_d1.update_traces(textposition='top center')
fig_d1.update_layout(
    title_text='Volcano plot for seronegatives (day 1)'
)
fig_d1.show()

Output file is as follows:

Now we can quickly see that the top upregulated genes at day 1 include interferon-related genes such as IDO1, CXCL10, RSAD2, CCL2, CCL8 and LAMP3. As you can begin to appreciate, a volcano plot gives users a quick feel for the data and highlights the genes and proteins that are most differentially expressed.
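As an optional extension (not part of the original figures), we could also mark which genes pass typical differential-expression cut-offs. The sketch below uses an illustrative 1.5 fold-change and q-value < 0.05 threshold; both cut-offs are arbitrary choices for demonstration:

# Optional sketch: flag genes passing illustrative cut-offs (|log2FC| > log2(1.5), q < 0.05)
df['DEG_1d'] = np.where(
    (df['log2FC_1d'].abs() > np.log2(1.5)) & (df['qval_1d'] < 0.05),
    'significant', 'not significant'
)

fig_deg = px.scatter(df, x='log2FC_1d', y='negative_log_pval_1d',
                     color='DEG_1d', hover_name=df.index)
fig_deg.add_hline(y=-np.log10(0.05), line_dash='dash')   # q-value threshold
fig_deg.add_vline(x=np.log2(1.5), line_dash='dash')      # fold-change thresholds
fig_deg.add_vline(x=-np.log2(1.5), line_dash='dash')
fig_deg.show()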

Featured
Posted in Data visualisation, python

Building interactive dashboards with Streamlit (II) – Plotting pairplots and scatterplots for numeric variables

In my previous blog entry, I covered the basics of using Streamlit to inspect basic attributes of the dataframe, including numeric and categorical variables.

In this blog entry, we will cover how we can use data visualisation tools in Streamlit for data dashboarding. The advantage of using Streamlit is that we can use Python graphing packages, such as Plotly and Altair, to plot interactive charts that allow users to hover over the data points to query specific data point attributes. Moreover, users can also utilise widgets or radio buttons in Streamlit to plot interactive scatterplots for data dashboarding. Specifically, we will focus on how we can plot pair plots and scatterplots in Streamlit. The code is publicly available on GitHub (saved as iris.py) and the website instance is hosted at: https://share.streamlit.io/kuanrongchan/iris_dataset/main/iris.py.

To plot pair plots, we will import seaborn. The rationale has been described previously, and we can port the code we optimised in Jupyter Notebooks into Streamlit. We first need to load the additional package:

import seaborn as sns

Then we plot the pairplot using the commands (with the header and subheader descriptions) below:

st.header('Visualising relationship between numeric variables')
st.subheader('Pairplot analysis')
g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)
st.pyplot(g)

Output file is as follows:

After plotting pair plots, we may want a close-up analysis using scatterplots. The advantage of using Streamlit is that we can use widgets to allow users to plot a scatterplot of any 2 selected variables.

st.subheader('Scatterplot analysis')
selected_x_var = st.selectbox('What do you want the x variable to be?', df.columns)
selected_y_var = st.selectbox('What about the y?', df.columns)
fig = px.scatter(df, x=selected_x_var, y=selected_y_var, color="species")
st.plotly_chart(fig)

Output is as shown below. I have annotated the widgets, which allow users to select and click on any variables they desire for scatterplot analysis. You may hover over the data points to examine the data point attributes, as this graph was plotted using Plotly.
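If you want a quick visual check of the linear relationship before computing the coefficients below, Plotly Express can also overlay an ordinary least squares trendline. This is a small optional sketch, not part of the original app, and it assumes the statsmodels package is installed:

# Optional: overlay an OLS trendline on the selected scatterplot
# (trendline="ols" requires the statsmodels package)
fig_trend = px.scatter(
    df,
    x=selected_x_var,
    y=selected_y_var,
    color="species",
    trendline="ols",   # fits one trendline per species
)
st.plotly_chart(fig_trend)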

We can also show the correlation coefficients and the associated p-values for the scatterplots to indicate if their relationship is linearly correlated. We first add and load the required packages:

from scipy.stats import pearsonr
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from scipy import stats

We can then calculate the Pearson and Spearman coefficients, together with the associated p-values using the following commands:

# Correlation calculations (Pearson)
st.subheader("Pearson Correlation")
def calc_pearson(x_vals, y_vals):
    corr, p_val = stats.pearsonr(x_vals, y_vals)
    return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_pearson(x, y)
st.write('Pearson correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)

# Correlation calculations (Spearman)
st.subheader("Spearman Correlation")
def calc_spearman(x_vals, y_vals):
    corr, p_val = stats.spearmanr(x_vals, y_vals)
    return corr, p_val
correlation, corr_p_val = calc_spearman(x, y)
st.write('Spearman correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)

Output is as follows:

Pearson Correlation

Pearson correlation coefficient: -0.109

p value: 0.183

Spearman Correlation

Spearman correlation coefficient: -0.159

p value: 0.051

Hence, sepal length and sepal width are not significantly correlated.

As you may appreciate, the Streamlit commands allow you to organise the dashboard by adding headers and subheaders. In addition, they provide tools that allow users to explore their dataset interactively, removing the need to plot every pairwise permutation of scatterplots. The outcome is a neat and professional dashboard that can be readily deployed and shared with others.

Featured
Posted in Data visualisation, python

So you think you have finished analysing your data? Think again…

It gives a great sense of achievement to know that you have finished analysing your dataset. However, it is almost always a good habit to check through your data analysis and, at times, even re-analyse the data in a different way to understand it better and strengthen the analysis. In this blog entry, I will highlight some key considerations to take into account when checking through your data analysis.

1) Could your variables be categorised differently, or expressed as continuous variables for data analysis?

Two different ways of showing association of age with antibody responses

In some instances, the explanatory variable can be expressed as either a categorical variable or a continuous variable. If that is the case, I would recommend analysing the data both ways. For example, consider a research question that studies the effect of age on vaccine antibody response. Based on the literature, you may have classified subjects into two groups: (i) elderly subjects, 65 years of age and above; (ii) young subjects, below 65 years of age. The antibody responses are then compared between these 2 groups. However, the analysis can also be done in a different way: you can instead plot a correlation of antibody response against age and evaluate whether the correlation is significant. While both analysis methods ask the research question in slightly different ways (see figure above for clarity), where the former investigates differences between young and elderly and the latter investigates whether antibody responses are correlated with age, it is usually worth doing both analyses to gain a deeper understanding of your data.
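As a minimal sketch of the two approaches (assuming a hypothetical pandas dataframe df with 'age' and 'antibody_titre' columns; both column names are placeholders), the comparison could look like this:

from scipy import stats

# Hypothetical dataframe df with 'age' and 'antibody_titre' columns

# Approach 1: categorise age and compare the two groups
elderly = df.loc[df['age'] >= 65, 'antibody_titre']
young = df.loc[df['age'] < 65, 'antibody_titre']
t_stat, p_group = stats.ttest_ind(young, elderly)

# Approach 2: keep age continuous and test the correlation
r, p_corr = stats.pearsonr(df['age'], df['antibody_titre'])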

2) Are the cutoff values in your data analysis justified?

What cutoff should you try?

We often choose default values as cutoffs for assigning categories or for filtering data. However, the default values may not be applicable to every dataset. This is where a deeper knowledge of the analysis tools is helpful, as you will better appreciate the assumptions applied in the analyses. For instance, consider the scenario where you are interested in finding out the molecular pathways regulated by virus infection. You decide to choose a 1.5 fold-change and 0.05 p-value cutoff to identify the differentially expressed genes for pathway analysis. However, have you wondered why 1.5 fold-change was chosen? What happens if you pick a cutoff of 2 fold-change? To better understand the cutoff values and how they may affect the data analysis, it is usually wise to analyse the results using different cutoff values and, eventually, justify why the final chosen cutoff is most suitable for your analysis.
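As an illustration (a sketch assuming hypothetical 'fc' and 'pval' columns holding fold changes and p-values in a dataframe df), counting how many differentially expressed genes survive each cut-off is a quick sensitivity check:

# Hypothetical dataframe df with 'fc' (fold change) and 'pval' columns
for fc_cutoff in [1.5, 2.0]:
    degs = df[(df['fc'].abs() >= fc_cutoff) & (df['pval'] < 0.05)]
    print(f'fold change >= {fc_cutoff}: {len(degs)} differentially expressed genes')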

3) How did you manage the missing values and outliers? Can we do better?

Handling missing data and outliers is important to ensure that the interpretation of the data is sound

Different individuals may manage missing values and outliers differently, which may impact the interpretation of the data. Hence, it is good practice to handle missing values and outliers in several different ways and see how each method affects the analysis. Consider a situation where 50% of the samples have at least one missing value across the explanatory variables. If the solution is to remove any sample with missing data, then 50% of the data is discarded; compared with a method that imputes the mean value instead, the data interpretation could be very different. If unsure, it is usually a good habit to try these different strategies and then evaluate which strategy best represents the data.
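A short sketch of how the two strategies could be compared in pandas (assuming a generic dataframe df with missing values) follows:

# Strategy 1: drop any sample (row) with at least one missing value
df_dropped = df.dropna()

# Strategy 2: impute missing values in numeric columns with the column mean
df_imputed = df.fillna(df.mean(numeric_only=True))

# Compare how much data each strategy retains
print(f'Original samples: {len(df)}')
print(f'After dropping rows with missing values: {len(df_dropped)}')
print(f'After mean imputation: {len(df_imputed)}')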

4) Besides the primary research goal, what more can we learn from the data?

Sometimes, being too focused on what you are chasing may cause you to miss other important knowledge that the data provides

As data scientists, we are often too engrossed in our research interest, and our interpretation of the data could thus be biased towards what we want to “see.” While this may be a reasonable approach to directly answer the research objectives, it is also important to look out for other trends within the data. Some of these trends could allow us to gain new biological insights, or help us identify potential confounding variables. For instance, in a dataset you may find that BMI is associated with increased severity of disease. However, with more exploratory analysis, you may also find that the high-BMI subjects are also the elderly individuals. With this additional finding, the interpretation is no longer simply that BMI influences disease severity; disease severity could be influenced by age as well. To separate the effects of BMI and age, further statistical analysis that reduces the impact of confounding variables is required.

5) Look beyond statistical significance for biological meaning

As scientists, we have been trained to look at p-values: essentially, a p-value < 0.05 is deemed significant and we can reject the null hypothesis. However, in biology it is sometimes pertinent to look beyond statistical significance, because the magnitude of the change also matters. Consider a study where a researcher shows that taking drug X significantly reduces weight by 1 kg. Consider another drug Y which causes a mean 10 kg loss, but with weak significance (p > 0.05). Of the two drugs, drug Y could be more biologically meaningful in reducing weight than drug X, despite not reaching statistical significance.

In summary, inspecting and cleaning data should take up the largest share of time in data analysis, but the effort is worth it because it ensures that the downstream interpretation of the data is more complete. This is particularly true for machine learning, where the quality of the data input directly impacts the resulting models. With the above 5 points, I hope I have convinced you of the importance of checking and revisiting your data before declaring your analysis “done.” Give it a shot; I am confident you will find the time spent worthwhile.

Featured
Posted in python

Building interactive dashboards with Streamlit (Part I)

Iris flower dataset used to illustrate how we can use Streamlit to build data dashboards

Python is often used in back-end programming, which builds the functionality of web applications. For instance, Python can be used to connect a web application to a database so that users can query information. By adding the back-end components, Python turns a static webpage into a dynamic web application, where users can interact with the page based on an updated and connected database. If the dataset is large, developers can even use the machine learning tools in Python to make predictions. However, front-end development, which makes webpages attractive and interactive, is often done with other programming languages such as HTML, CSS and JavaScript. The question remains: do you need to be a full-stack developer who masters both front-end and back-end web development? This can be time-consuming, as it means you have to learn multiple languages.

This is where Streamlit, Django and Flask come to the rescue! These frameworks are built on the Python programming language and allow you to build websites with minimal knowledge of CSS and JavaScript. Among them, I personally find Flask and Streamlit easier to learn and implement. With more experience, I decided to focus on Streamlit, as its syntax is easier to understand and deployment is more rapid.

Rather than going through Streamlit commands one by one, I will instead illustrate the functionality of the different commands using my favourite iris dataset, which is readily available in Plotly Express. The GitHub repository is publicly available, and the output of the code is hosted here.

First, we will create a file called iris.py using Sublime Text or Visual Studio. In this blog entry, we will focus on acquiring the basic statistics of the individual columns and presenting this information in a data dashboard for project showcasing. As with all Python scripts, we need to import the required packages:

import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
from wordcloud import WordCloud
from typing import Any, List, Tuple

After that, we type in the commands needed to read the iris dataframe. The data is available in Plotly Express and can be loaded into Streamlit with the following commands. The title of the dashboard is also included here:

st.title('Data analysis of iris dataset from PlotlyExpress')
df = px.data.iris()

Next, basic stats such as counts, mean, standard deviation, min, max and quantiles can be displayed using the command df.describe(). However, the advantage of Streamlit is the ability to add widgets, which prevents the dashboard from looking too cluttered. In this example, we will create a select box widget so that users can choose a column and view the basic stats of that specific column:

# Let the user pick a numeric column to summarise
num_cols = df.select_dtypes(include=np.number).columns
column = df[st.selectbox("Select column to display basic statistics", num_cols)]

col1, col2, col3, col4, col5 = st.columns(5)
col1.metric("Mean", column.mean())
col2.metric("Max", column.max())
col3.metric("Min", column.min())
col4.metric("Std", column.std())
col5.metric("Count", int(column.count()))

The output generates 5 columns, showing the mean, max, min, standard deviation and count of the selected column. The widget can be visualised here.

These statistics are appropriate for numeric or continuous variables. However, to visualise the categorical variables (in this case, the flower species), we can use a word cloud or a table to know the exact counts for each species using the following commands:

# Build wordcloud
non_num_cols = df.select_dtypes(include=object).columns
column = st.selectbox("Select column to generate wordcloud", non_num_cols)
column = df[column]
wc = WordCloud(max_font_size=25, background_color="white", repeat=True, height=500, width=800).generate(' '.join(column.unique()))
st.image(wc.to_image())

# Build table
st.markdown('***Counts in selected column for generating WordCloud***')
unique_values = column.value_counts()
st.write(unique_values)

The aim of Part I is to provide various ways to inspect the distribution of the categorical and continuous variables within your dataset. We will cover the other topics, including data visualisation, machine learning and interactive graphical plots, in subsequent posts!

Featured
Posted in Demographics signature

The transcriptional landscape of age in human peripheral blood

Figure 1: Molecular pathways that were most differentially regulated by age. Source: Marjolein J Peters et al., 2016.

Chronological age is a major risk factor for many common diseases including heart disease, cancer and stroke, three of the leading causes of death.

Previously, APOE, FOXO3 and 5q33.3 were the only identified loci consistently associated with longevity.

The discovery stage included six European-ancestry studies (n=7,074 samples) with whole-blood gene expression levels (11,908 genes). The replication stage included 7,909 additional whole-blood samples. A total of 1,497 genes were found to be associated with age, of which 897 are negatively correlated and 600 are positively correlated.

Among the negatively age-correlated genes, three major clusters were identified. The largest group, cluster #1, consisted of three sub-clusters enriched for (1a) RNA metabolism functions, ribosome biogenesis and purine metabolism; (1b) multiple mitochondrial and metabolic pathways, including 10 mitochondrial ribosomal protein (MRP) genes; and (1c) DNA replication, elongation and repair, and mismatch repair. Cluster #2 contained factors related to immunity, including T- and B-cell signalling genes and genes involved in hematopoiesis. Cluster #3 included cytosolic ribosomal subunits.

The positively age-correlated genes revealed four major clusters. Cluster #1: Innate and adaptive immunity. Cluster #2: Actin cytoskeleton, focal adhesion, and tight junctions. Cluster #3: Fatty acid metabolism and peroxisome activity. Cluster #4: Lysosome metabolism and glycosaminoglycan degradation.

DNA methylation, measured by CpG methylation, was not associated with chronological age but was associated with the gene expression levels. This result hints at the possibility that DNA methylation could be affecting the regulation of gene expression.

Transcriptomic age and epigenetic age (both Hannum and Horvath) were positively correlated, with r-squared values varying between 0.10 and 0.33.

Featured
Posted in python

Streamlit: My recommended tool for data science app development

Streamlit is a web application framework that helps you build and develop Python-based web applications to share data, build data dashboards and even deploy machine learning models. In addition, developing and deploying Streamlit apps is amazingly quick and versatile, and simple apps can be developed in a few hours. In fact, I was able to make a simple app within a day after reading this book:

Getting Started with Streamlit for Data Science | Packt
My favourite book for Streamlit

This book summarises the basics of Streamlit and provides realistic solutions for executing Python code and developing simple apps. Streamlit is fairly new, so there are very few books available to learn from. Nonetheless, from the examples provided in the book, I was able to make a simple scatterplot app. We first load the dataset to make sure we are analysing the correct one:

The commands that make the above section are as follows:

st.title() # Gives the header title
st.markdown() # Provides a description of the title in smaller font
st.file_uploader() # Creates a file uploader for users to drag and drop files
st.write() # Displays the details of the dataset

Next, we want to create widgets that allow users to select their x and y-axis variables:

To make the above section, the relevant commands are as follows:

st.selectbox() # Creates widgets for users to select variables
px.scatter() # Plots a scatterplot with Plotly Express
st.plotly_chart() # Displays the figure in the app

Why use Plotly? Because Plotly allows you to interact with the graph, including zooming in and out, mousing over data points to determine their attributes, and selection tools to crop the range of the x- and y-axes.
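Putting these commands together, a minimal version of the scatterplot app could look something like the sketch below (the titles and labels are placeholders, not the exact code from my app):

import streamlit as st
import pandas as pd
import plotly.express as px

st.title('Scatterplot explorer')                            # header title
st.markdown('Upload a CSV file and pick two columns to plot.')

uploaded_file = st.file_uploader('Upload your CSV file')    # drag-and-drop uploader
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write(df.head())                                     # show the first rows of the dataset

    x_var = st.selectbox('Select the x-axis variable', df.columns)
    y_var = st.selectbox('Select the y-axis variable', df.columns)

    fig = px.scatter(df, x=x_var, y=y_var)                  # interactive Plotly scatterplot
    st.plotly_chart(fig)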

Deploying in Streamlit is fast, but the initial steps of setting up can be time-consuming, especially if this is the first time you are trying out. The essential steps are as follows:
1. Create a GitHub account
2. Contact the Streamlit Team to allow the developers to connect Streamlit to your GitHub account
3. Create the GitHub repository. You can choose to make it public or private. To generate the requirements.txt file, install pipreqs (pip install pipreqs) and run it in your project folder.
4. Create apps within the Streamlit account by adding the Github repository address and specifying the Python file to execute.

And that’s it! Your web address will start with https://share.streamlit.io, and the deployed website can be shared publicly with anyone. Furthermore, any changes you make within the GitHub repository are immediately reflected in the deployed website. I appreciate the responsiveness and the speed of deploying the website once everything is set up. Finally, as icing on the cake, you can even convert the weblink into a desktop app with these specific instructions! If you are into app development and you want to stick to the Python language, I would strongly recommend Streamlit. Its simplicity from coding to execution and deployment is just so attractive!

Featured
Posted in python

My experience with Voilà and Streamlit for building data dashboards

Differences between a data engineer and a data scientist. Source

The role of the data scientist is clear: to analyse the data, plot visualisation graphs and consolidate the findings into a report. However, with greater interest in a deeper understanding of big data and an urgent need for novel tools to gain insights from biological datasets, there is growing interest in employing data engineers. Their roles and responsibilities can include app development, constructing pipelines, data testing and maintaining architectures such as databases and large-scale processing systems. This is unlike data scientists, who are mostly involved in data cleaning, data visualisation and big data organisation.

One aspect of this role is to implement strategies to improve reliability, efficiency and quality. To ensure consistency, implementing data dashboards is important, and this requires moving out of the comfort zone of just reporting data within Jupyter Notebooks. Data dashboards are often built with JavaScript; however, there are now packages implemented in Python (which means that you don’t have to learn another language), including Voilà, Panel, Dash and Streamlit. On my end, I have tried Voilà and Streamlit, as they are both easier to implement compared with Panel and Dash. This blog will hence compare my experience with Voilà and Streamlit.

The installation of Voilà and the associated templates is relatively straightforward. You just need to execute these commands to download the packages:

pip install voila
pip install voila-gridstack
pip install voila-vuetify

Once the packages are installed in your environment, you should be able to see the extensions in the Jupyter notebook (indicated by the arrow). Clicking on them will render the outputs of your Python code.

With the gridstack or vuetify templates, you can further manipulate and reorder your outputs to display your graphs in your dashboard. The dashboard can then be deployed using Heroku or deposited in GitHub for deployment on mybinder.org.

As you can imagine, if you enjoy working within Jupyter Notebooks, Voilà can be a simple and convenient tool for making data dashboards. You can also make the dashboard interactive by using ipywidgets, Plotly, Altair or Bokeh. However, a severe limitation is that it is difficult to build multi-page dashboards. This can be an issue if you are developing multiple dashboards, or multiple graphs from different studies.

My delay in writing this blog post is because I have spent much of my time finding alternatives for building multi-page dashboards. This is where I learnt about Streamlit. I was very impressed by how simple Python code can be used to develop beautiful dashboards, and I was able to build a simple webpage/dashboard within a few hours of reading online tutorials. With more reading, I was even able to make some simple apps! Using Streamlit is as simple as:

  1. Open terminal window
  2. Install Streamlit
  3. Create a .py file using a text editor such as Sublime Text (my preferred choice), Atom or Visual Studio Code
  4. And then execute file by typing the code in terminal: streamlit run filename.py

You can install streamlit by using:

pip install streamlit
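As a minimal sketch of steps 3 and 4 above, a tiny app saved as filename.py (the file name is just a placeholder) can be launched with streamlit run filename.py:

# filename.py -- a minimal Streamlit app to confirm the setup works
import streamlit as st

st.title('Hello, Streamlit!')
st.write('If you can see this in your browser, the installation works.')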

In addition to these cool features, Streamlit supports multiple pages, which means you can create multiple data dashboards or multiple apps within a single website. Finally, deployment is also relatively simple with Streamlit Teams, which is attractive. However, if you prefer to work within Jupyter Notebooks, this may not be a great option for you, as the commands are mostly executed via the terminal or in .py files. The other limitation, for which I haven’t found a solution, is related to security: I do not know how to configure the app so that only registered users can access the website.

Overall, deciding on which platform to use will depend on your personal preferences and applications. I prefer Streamlit as it is more versatile and flexible, which may explain its increasing popularity in recent years!

Featured
Posted in machine learning, python

Choosing the best machine learning model with LazyPredict

Predicting flower species with machine learning. Source from: https://weheartit.com/entry/14173501

The idea of using machine learning models to make predictions is attractive. But a challenging question remains: there are so many machine learning models out there, so which model is most suitable? Which model should I start exploring? Should I fit every single model that I have learnt, try my luck, and see which model works best for me?

Indeed, these questions cannot be easily answered, especially for beginners. While I am no expert in machine learning, there are some packages that can make your life simpler. One such package is LazyPredict, which ranks the machine learning models that are most likely to be suitable. LazyPredict contains both LazyClassifier and LazyRegressor, which allow you to predict categorical and continuous variables respectively.

To install the lazypredict package, execute the following code:

pip install lazypredict

The lazypredict package requires the XGBoost and LightGBM extensions to be functional. Execute the following commands in the terminal:

conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm

Now you are ready to analyse with machine learning. First, import the essential packages:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import display

We will import the iris dataset as an example. In this exercise, we will predict sepal width from the iris dataset using machine learning:

df = px.data.iris()
df.head()

Output is as follows. The dataframe contains values for sepal length/width, petal length/width, species (setosa, versicolour, virginica) and species id (where setosa = 1, versicolour = 2, virginica = 3):

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Before doing machine learning, it is good practice to get a “sense” of your data to know which machine learning model is likely to fit it. You can plot a pair plot in this case to understand the distribution of your dataset. Some critical considerations include (i) detecting the presence of outliers, (ii) whether the relationships between variables are linear or non-linear and (iii) whether the distributions of the variables are Gaussian or skewed. Plotting a pair plot in Python is easy:

from pandas.plotting import scatter_matrix

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
scatter_matrix(df[attributes], figsize = (10,8))

Output is as follows:

The data suggest that there are no big anomalies, and most of the variables follow a linear relationship. At this point, it is not obvious which variable best predicts “sepal width.” Notice that we have not yet considered the “species” variable in our analysis. The “species” variable contains 3 categories, namely setosa, versicolour and virginica. One way of processing this categorical data is to assign a different number to each category, as shown in the “species_id” column, where setosa = 1, versicolour = 2 and virginica = 3. However, this encoding may not be appropriate, as it may imply that virginica is more important than versicolour, and versicolour more important than setosa. To circumvent this issue, we can one-hot encode the data, which assigns binary values (1 or 0) to the individual categories. Fortunately, the conversion is easy to execute in Python:

df_cat_to_array = pd.get_dummies(df)
df_cat_to_array = df_cat_to_array.drop('species_id', axis=1)
df_cat_to_array

Output is as follows:

     sepal_length  sepal_width  petal_length  petal_width  species_setosa  species_versicolor  species_virginica
0             5.1          3.5           1.4          0.2               1                   0                  0
1             4.9          3.0           1.4          0.2               1                   0                  0
2             4.7          3.2           1.3          0.2               1                   0                  0
3             4.6          3.1           1.5          0.2               1                   0                  0
4             5.0          3.6           1.4          0.2               1                   0                  0
145           6.7          3.0           5.2          2.3               0                   0                  1
146           6.3          2.5           5.0          1.9               0                   0                  1
147           6.5          3.0           5.2          2.0               0                   0                  1
148           6.2          3.4           5.4          2.3               0                   0                  1
149           5.9          3.0           5.1          1.8               0                   0                  1

Each category is now given equal weightage: each species column is assigned 1 or 0, where 1 indicates “yes” and 0 indicates “no.” We also drop the species_id column, as the categories have already been encoded using the get_dummies function.

Now that all columns are converted to numerical values, we are ready to use the lazypredict package to see which machine learning model is best in predicting sepal width. We first import the required packages:

import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

Next, we use lazypredict to identify the machine learning regressor model that can potentially predict sepal width:

X = df_cat_to_array.drop(['sepal_width'], axis=1)
Y = df_cat_to_array['sepal_width']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

The above commands place the explanatory variables in X and the response variable (sepal width) in Y. Hence, we drop the sepal width column for X and keep only the sepal width column for Y. The test size in this case is 20% of the data and the training size is 80%. It is a good habit to specify the random state so we can reproduce the same analysis on a different day. The output is as follows:

Model  Adjusted R-Squared  R-Squared  RMSE  Time Taken
SVR  0.70  0.76  0.19  0.01
NuSVR  0.70  0.76  0.19  0.02
KNeighborsRegressor  0.64  0.71  0.20  0.02
RandomForestRegressor  0.62  0.70  0.21  0.19
GradientBoostingRegressor  0.62  0.70  0.21  0.05
LGBMRegressor  0.60  0.69  0.21  0.04
HistGradientBoostingRegressor  0.60  0.68  0.21  0.12
HuberRegressor  0.60  0.68  0.21  0.04
Ridge  0.60  0.68  0.22  0.01
RidgeCV  0.60  0.68  0.22  0.01
BayesianRidge  0.59  0.68  0.22  0.02
ElasticNetCV  0.59  0.68  0.22  0.08
LassoCV  0.59  0.68  0.22  0.11
LassoLarsIC  0.59  0.67  0.22  0.01
LassoLarsCV  0.59  0.67  0.22  0.04
TransformedTargetRegressor  0.59  0.67  0.22  0.01
LinearRegression  0.59  0.67  0.22  0.01
OrthogonalMatchingPursuitCV  0.59  0.67  0.22  0.02
LinearSVR  0.59  0.67  0.22  0.01
BaggingRegressor  0.59  0.67  0.22  0.03
AdaBoostRegressor  0.57  0.66  0.22  0.10
Lars  0.53  0.62  0.23  0.02
LarsCV  0.53  0.62  0.23  0.04
SGDRegressor  0.47  0.58  0.25  0.02
ExtraTreesRegressor  0.47  0.58  0.25  0.13
PoissonRegressor  0.45  0.56  0.25  0.02
XGBRegressor  0.35  0.48  0.27  0.12
ExtraTreeRegressor  0.34  0.47  0.28  0.02
GeneralizedLinearRegressor  0.29  0.44  0.28  0.02
TweedieRegressor  0.29  0.44  0.28  0.01
DecisionTreeRegressor  0.29  0.44  0.29  0.02
MLPRegressor  0.29  0.44  0.29  0.18
GammaRegressor  0.29  0.43  0.29  0.01
OrthogonalMatchingPursuit  0.28  0.43  0.29  0.02
RANSACRegressor  0.27  0.42  0.29  0.07
PassiveAggressiveRegressor  0.03  0.23  0.33  0.01
DummyRegressor  -0.31  -0.04  0.39  0.01
Lasso  -0.31  -0.04  0.39  0.02
ElasticNet  -0.31  -0.04  0.39  0.01
LassoLars  -0.31  -0.04  0.39  0.02
KernelRidge  -82.96  -65.59  3.11  0.02
GaussianProcessRegressor  -483.87  -383.55  7.47  0.02

Here we go. We have identified SVR, NuSVR, KNeighborsRegressor, RandomForestRegressor and GradientBoostingRegressor as the top 5 models (with R-squared values of ~0.7), which is not bad for a start! Of course, we can further refine the models by hyperparameter tuning to explore whether we can improve our predictions further.
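As a sketch of what that refinement could look like, scikit-learn’s GridSearchCV can tune the top-ranked SVR model; the parameter grid below is only an illustrative guess, not a tuned recommendation:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Illustrative parameter grid -- the values are guesses, not tuned recommendations
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1],
    'kernel': ['rbf', 'linear'],
}

grid = GridSearchCV(SVR(), param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Test set R2:', grid.score(X_test, y_test))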

To predict categorical variables, you can use the LazyClassifier from the LazyPredict package. Let’s assume this time we are interested in whether the sepal and petal length/width can be used to predict the species versicolor. First, import the necessary packages:

from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

LazyClassifier can then be executed with the following commands:

X = df_cat_to_array.drop(['species_setosa', 'species_versicolor', 'species_virginica'], axis=1)
Y = df_cat_to_array['species_versicolor']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=55)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Note that I have dropped the columns for the other flowers as we are interested in predicting versicolor from the petal and sepal length/width. This shows that the research question is critical for machine learning! The output is as follows:

Model  Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
KNeighborsClassifier  1.00  1.00  1.00  1.00  0.02
SVC  1.00  1.00  1.00  1.00  0.02
AdaBoostClassifier  0.97  0.95  0.95  0.97  0.13
XGBClassifier  0.97  0.95  0.95  0.97  0.05
RandomForestClassifier  0.97  0.95  0.95  0.97  0.32
NuSVC  0.97  0.95  0.95  0.97  0.06
BaggingClassifier  0.97  0.95  0.95  0.97  0.04
LGBMClassifier  0.97  0.95  0.95  0.97  0.05
DecisionTreeClassifier  0.97  0.95  0.95  0.97  0.02
ExtraTreesClassifier  0.97  0.95  0.95  0.97  0.14
LabelPropagation  0.93  0.93  0.92  0.93  0.02
LabelSpreading  0.93  0.93  0.92  0.93  0.02
GaussianNB  0.93  0.90  0.90  0.93  0.02
ExtraTreeClassifier  0.93  0.90  0.90  0.93  0.01
QuadraticDiscriminantAnalysis  0.93  0.90  0.90  0.93  0.03
NearestCentroid  0.60  0.65  0.65  0.61  0.03
BernoulliNB  0.67  0.65  0.65  0.67  0.02
CalibratedClassifierCV  0.67  0.60  0.60  0.66  0.04
RidgeClassifierCV  0.67  0.60  0.60  0.66  0.02
RidgeClassifier  0.67  0.60  0.60  0.66  0.03
LinearSVC  0.67  0.60  0.60  0.66  0.02
LogisticRegression  0.67  0.60  0.60  0.66  0.04
LinearDiscriminantAnalysis  0.67  0.60  0.60  0.66  0.02
SGDClassifier  0.70  0.57  0.57  0.64  0.02
Perceptron  0.63  0.55  0.55  0.61  0.06
PassiveAggressiveClassifier  0.63  0.55  0.55  0.61  0.05
DummyClassifier  0.57  0.47  0.47  0.54  0.02

This output suggests that KNeighborsClassifier, SVC, AdaBoostClassifier, XGBClassifier and RandomForestClassifier can potentially be explored as machine learning models to predict the versicolor species. You can further refine the model by hyperparameter tuning to improve the accuracy.

Overall, LazyPredict can be a handy tool for shortlisting which of the dozens of machine learning models it screens is most suitable for predicting your response variable, before moving on to hyperparameter tuning. Certainly a good starting point for beginners venturing into machine learning!

Featured
Posted in Dengue, Resource

Gene Expression Patterns of Dengue Virus-Infected Children from Nicaragua Reveal a Distinct Signature of Increased Metabolism

Pathways and genes that are differentially regulated in DF, DHF and DSS patients. Source from Loke et al., PLOS NTD, 2010.

Identifying signatures of host genome-wide transcriptional patterns can be a tool for biomarker discovery as well as for understanding molecular mechanisms and pathophysiological signatures of disease states.

In this study by Loke et al., transcriptional profiling analysis of pediatric patients from Nicaragua with a predominantly DENV-1 infection was performed, and the gene signatures of healthy subjects and patients with dengue fever (DF), dengue haemorrhagic fever (DHF) and dengue shock syndrome (DSS) were compared. Enrolment criteria consisted of hospitalised patients younger than 15 years of age. Whole blood was collected during acute illness (days 3-6).

Unsupervised clustering reveals that DHF and DF patients cluster distinctly from the DSS patients. Interestingly, many of the genes that separate these two groups are involved in ‘protein biosynthesis’ and ‘protein metabolism and modification’. A large number of mitochondrial ribosomal protein and ‘nucleic acid binding’ genes were also flagged (see figure above).

Genes related to metabolism, oxidative phosphorylation, protein targeting, nucleic acid metabolism, purine and pyrimidine metabolism, electron transport, DNA metabolism and replication, and protein metabolism and modification were differentially regulated in DF, DHF and DSS patients, reflecting a shared signature of DENV-1 infection.

On the other hand, the biological processes differentially regulated in DSS patients were protein metabolism and modification, intracellular protein traffic, pre-mRNA processing, mRNA splicing, nuclear transport, protein-lipid modification and protein folding.

Of note, the changes in metabolism genes were not seen in vitro; instead, interferon signatures were upregulated.

Data is deposited in Gene Expression Omnibus (GEO) under GSE25226.

Featured
Posted in machine learning, python

The fundamentals of machine learning

Machine learning is a hot topic in data science, but few people understand the concepts behind it. You may be fascinated by how people land high-paying jobs because they know how to execute machine learning, and decide to delve deeper into the topic, only to be quickly intimidated by the sophisticated theorems and mathematics behind it. While I am no machine learning expert, I hope to provide some basics about machine learning, and in some of my future blog entries show how you can use Python to perform it.

With all the machine learning tools available at your fingertips, it is often tempting to jump straight into solving a data-related problem by running your favourite algorithm. However, this is usually a bad way to begin your analysis. In fact, executing machine learning algorithms plays only a small part in data analysis and the decision-making process. To make proper use of machine learning, I would recommend taking a step back and looking at the research question from a helicopter view. What are you trying to answer? Are you trying to classify severe and mild dengue cases based on cytokine levels? Are you predicting antibody responses based on gene expression levels? What end-goal do you have in mind?

Often, the research question will define the machine learning model to use, and this is highly dependent on your explanatory and response variables. It also depends on whether you prefer supervised or unsupervised learning of your dataset. Another point to consider is the impact of using machine learning. Does it help the company save money? Would it have a positive impact on clinical practice? If you cannot envision an impact, then it may not be worth the trouble of using machine learning after all.

Once you have an end-goal in mind, you are ready to use machine learning tools to analyse your data. Before that, however, it is critical to ensure that your data is of good quality. As the saying goes: garbage in, garbage out. Your machine learning model will not be able to resurrect and learn from a poor-quality dataset. Next, it is also important to get a reasonable feel of your dataset. Specifically, are there any outliers in your dataset? If there are, is it worth removing or changing them before performing machine learning? Does your dataset have high variance and require a Z-score transformation for better prediction? Based on these questions, it is often a good idea to use data visualisation tools to understand your dataset before performing machine learning. A quick way to get a feel of the data is to run the following on your dataframe (assigned as df in my example):

df.describe()

Another way to get a feel of the data is to compute the correlation matrix with the command below (but do note that correlation does not capture non-linear relationships):

df.corr()

Histograms can be used to quickly inspect the distribution of the data. To plot histograms for each variable, I recommend using the Lux package, which I have previously covered in my blog.

After preparing your dataset, it is time to use the machine learning algorithms. At this point, you hope to find a model that best represents your data. A reasonable approach is to split your data into two sets: the training set and the test set (see figure below). Usually, approximately 70–80% of the data is used for training and the remaining 20–30% is used as the test set to evaluate the accuracy of the model. If you have several models that perform equally well, you may even consider using a validation set or cross-validation to ascertain which model best describes the data.
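A minimal sketch of such a split with scikit-learn (assuming your explanatory variables are already in X and the response variable in y) could be:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)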

This article provides the fundamentals of machine learning. Subsequently, I will expand on the different machine learning models, how to execute them effectively, and the advantages and limitations of each of these models. I am excited to learn more and share with all of you. Meanwhile, stay tuned! 🙂

Featured
Posted in Resource, VSV vectors

Systems Vaccinology Identifies an Early Innate Immune Signature as a Correlate of Antibody Responses to the Ebola Vaccine rVSV-ZEBOV

Immunologic parameters that are correlated with antibody responses to rVSV-EBOV. Source from Rechtien et al., Cell Reports, 2017.

Predicting and achieving vaccine efficacy remains a major challenge. Here, Rechtien et al. used a systems vaccinology approach to disentangle the early innate immune responses elicited by the Ebola vaccine rVSV-Zaire Ebola virus (ZEBOV), and to identify innate immune responses correlating with Ebola virus (EBOV) glycoprotein (GP)-specific antibody induction. Of note, this replication-competent recombinant vaccine candidate is based on the vesicular stomatitis virus (rVSV) vaccine vector, which has been shown to be safe and immunogenic in a number of phase I trials.

The vaccine rVSV-ZEBOV induced a rapid and robust increase in cytokine levels, with a maximum peak at day 1, especially for CXCL10, MCP-1 and MIP-1β. Assessment of PBMCs revealed significant induction of co-stimulatory molecules, monocyte/DC activation and NK cell activation at day 1 post-vaccination. The expression of these molecules begins to decline by day 3.

Interestingly, CXCL10 plasma levels and the frequency of activated NK cells at day 3 were found to be positively correlated with antibody responses. CD86+ expression in monocytes and mDCs at day 3 is negatively correlated with antibody responses (see figure above).

The largest number of upregulated genes was detected at day 1 post-vaccination. Critically, the early gene signature linked to the CXCL10 pathway, including TIFA (TRAF-interacting protein with forkhead-associated domain) on day 1, SLC6A9 (solute carrier family 6 member 9) on day 3, NFKB1 and NFKB2, was most predictive of antibody responses.

Data is stored under NCBI GEO: GSE97590.

Featured
Posted in Data visualisation, python

Making interactive and executable notebooks/dashboards with Lux and Voilà packages

As described in my previous blog posts, pair plots and correlation matrices are useful for quickly analysing relationships between multiple variables. However, when analysing datasets with more than 10-20 variables, these analysis methods have their limitations. One critical limitation is the difficulty in assessing the effects of 2 or more categorical variables. Another is the difficulty in visualising a large number of variables plotted together in one plot.

Here, I propose two Python packages that will make graph plotting less tedious. The first is the Voilà package, which supports Jupyter interactive widgets, and the second is the Lux package, which is integrated with a Jupyter interactive widget that allows users to quickly browse through large collections of data directly within their Jupyter notebooks. Both packages are useful for displaying data interactively, which makes them highly versatile for fast experimentation with data. I will share the features of these 2 packages as I run through the procedures for using them:

First, we need to install the packages. To install Voilà, execute the following command:

pip install voila

Note that Voilà will also automatically be installed as a JupyterLab extension if you have version 3 of JupyterLab installed:

Next, to install Lux, enter the following command in terminal:

pip install lux-api

With these packages installed, you are now ready to use them. We will import all the necessary packages and use the iris dataset from Plotly Express as a proof of concept:

import csv
import lux
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df = px.data.iris()

The output is as follows:

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Next, launch Voilà from the Jupyter Notebook extension (indicated by the arrow):

This should allow you to toggle between Pandas and Lux freely (indicated by the arrow):

The Lux package automatically performs pairwise plots of all variables in your dataframe, allowing you to focus on the scatterplots that show correlated relationships. Under the “Distribution” tab, you can quickly visualise the distribution of the data:

For the correlation graphs, you can click on the plot you are interested in and then click the magnifying glass, which indicates “intent.” This zooms into the graph further, allowing you to see how the other variables (in this case, species) interact with your scatterplot, immediately providing a 3-dimensional view of the dataset.
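If you prefer to set the intent in code rather than clicking the magnifying glass, Lux also lets you assign it directly on the dataframe. This is a short sketch using the iris columns loaded above; the attribute name follows the Lux documentation:

# Tell Lux which variables you are most interested in;
# the recommended visualisations are then steered towards them
df.intent = ['sepal_length', 'sepal_width']
df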

As you can begin to appreciate, the Voilà package is a powerful tool for serving interactive widgets directly from a local address. You can even publish it online, allowing other users to use the interactive widgets to explore the dataset according to their research intent. Combined with the Lux package, this allows users, even those without a bioinformatics background, to zoom into the interesting scatterplots for further analysis.

Featured
Posted in Dengue, Resource

A 20-Gene Set Predictive of Progression to Severe Dengue

Methodology employed by Robinson et al., Cell Reports, 2019. The 20-gene set was used to distinguish between individuals with severe and mild dengue

The gene signatures predictive of severe dengue disease progression is poorly understood.

The study by Robinson et al. utilised 10 publicly available datasets and divided them into 7 “discovery” and 3 “validation” datasets. In the discovery datasets, a total of 59 differentially expressed genes (FDR < 10%, effect size > 1.3-fold) were detected between patients who progressed to DHF and/or DSS (DHF/DSS) and patients with an uncomplicated course (dengue fever).

An iterative greedy forward search applied to the 59 genes revealed a final set of 20 differentially expressed genes (3 over-expressed, 17 under-expressed) in DHF/DSS (gene list shown in the figure above). A dengue score for each sample was obtained by subtracting the geometric mean expression of the 17 under-expressed genes from the geometric mean expression of the 3 over-expressed genes.

The 20-gene dengue severity scores distinguished DHF/DSS from dengue fever upon presentation and prior to the onset of severe complications, with a summary area under the curve (AUC) = 0.79 in the discovery datasets. The 20-gene dengue scores also accurately identified dengue patients who would develop DHF/DSS in all three validation datasets.

To further validate this signature, the authors tested a cohort of prospectively enrolled dengue patients in Colombia. The 20-gene dengue score, measured by qPCR, distinguished severe dengue from dengue with or without warning signs (AUC = 0.89) and even severe dengue from dengue with warning signs (AUC = 0.85).

Finally, the 20-gene set is significantly downregulated in natural killer (NK) and NK T (NKT) cells, indicating the role of NK and NKT cells in modulating severe disease outcome.

Dataset deposited under Gene Expression Omnibus (GEO): GSE124046

Featured
Posted in Data visualisation

Visualising distributions with pie charts, donut plots and stacked graphs

Pie charts can be used to display distribution of variables within datasets.

A pie chart uses pie slices to display the relative size or proportion of different categories. In general, pie charts are easier to visualise if the pie has no more than 7 slices.

Consider a dataset where 10 patients are infected with virus X, 6 of whom experience no symptoms. The remaining 4 patients are symptomatic: 2 experience headaches, 1 fever and 1 rashes. Here, you can simply use a pie chart to display the data:

I have used the Prism software to plot the pie chart. Note that you can use colour variants to emphasise the subjects with symptoms. It is a good habit to include the total counts when presenting pie charts, as these charts are meant to display only relative sizes. Note also that the pie chart should start at the 12 o’clock position for easy visualisation.

When pie charts are difficult to read, another option is a donut plot. The advantage of a donut plot is that you can provide additional information in the middle. However, it is important to make sure that the donut is not so thin that visualisation becomes difficult. An example is shown below:

Finally, it can sometimes be more useful to use a stacked bar chart to show the distribution of the data, particularly when you want to group categories with similar properties (in this case, all the symptomatic patients can be grouped together). An example is as follows:

There are no fixed rules on which of these charts is more suitable for data visualisation. When in doubt, there is no harm in plotting all of these graphs and deciding later which is most suitable for publications or presentations!

Featured
Posted in Dengue, Resource

Immunotranscriptomic profiling the acute and clearance phases of a human challenge dengue virus serotype 2 infection model

Differentially expressed genes at day 8 and 28 after rDEN2Δ30 infection. Source from Hanley JP et al., Nature Communications, 2021.

rDEN2Δ30 is a recombinant serotype 2 virus based on the American genotype 1974 Tonga DENV2 virus, which has been partially attenuated by deletion of 30 nucleotides in the 3′ untranslated region of the RNA genome (Δ30). rDEN2Δ30 infection is known to induce modest viremia in all flavivirus-naive subjects and a mild, transient non-pruritic rash in 80% of recipients.

rDEN2Δ30 infection could hence be a suitable model to evaluate molecular signatures responsible for asymptomatic or mild DENV-2 infection.

In this study by Hanley JP et al., RNA-seq was performed on whole blood collected from rDEN2Δ30-infected subjects at 0, 8, and 28 days post infection. rDEN2Δ30 induced reproducible but modest viremia, with a mild rash as the only clinically significant finding in DENV-naive subjects.

Principal component analysis reveals minimal overlap between baseline (day 0) and peak viremia (day 8). The day 28 data (post viremia) partially overlapped with the baseline (day 0) and acute (day 8) timepoints. Pathways enriched in the type I and type II interferon and antiviral responses were upregulated at day 8, whereas pathways controlling translational initiation were downregulated. NF-κB, IL-17 signaling pathways, apoptosis, toll-like receptor signaling, response to viruses, ribosomes, and defense responses were also differentially regulated at day 28.

Myeloid cells, including monocytes and activated dendritic cells, were significantly increased during acute infection before returning to baseline. In contrast, regulatory T cells (Tregs) were significantly decreased during the acute stage.

Gene ontology pathway analysis revealed that the viremia-tracking set of genes was enriched for both response to and regulation of type I and II interferon pathways, including JAK/STAT signaling. Genes encoding proteins that directly inhibit viral genome replication, and genes involved in protein ubiquitination and catabolism (especially the ISG15 pathway), tracked with viremia. Day 28 revealed more varied pathways, including protein ubiquitination, cell migration, cytoskeletal reorganization, and angiogenesis.

Baseline transcript signatures can potentially predict whether subjects will develop rash after rDEN2Δ30 infection. Higher baseline expression of myeloid nuclear differentiation antigen (MNDA), cell-surface-associated genes such as tetraspanin CD37 and integral membrane 2B (ITM2B), and genes involved in autophagy (VMP1) was associated with protection from rash. These genes are mostly related to myeloid responses, membrane regulation, autophagy, K63 ubiquitination, and cell morphogenesis.

Transcriptomic signatures modulated by rDEN2Δ30 infection and severe dengue are distinct. Only one gene family, the guanylate-binding protein (GBP1/2) genes, was differentially regulated in both severe dengue and mild rDEN2Δ30 infection.

Data are deposited in the Gene Expression Omnibus under accession number GSE152255.

Featured
Posted in python

Executing Excel functions in Jupyter Notebooks using Mito


Data scientists often need to stratify their data using the “IF” function. For instance, how do immune responses after vaccination differ between the young and the elderly? In this case, you would filter for the young using a pre-defined age cut-off (less than 65 years old). Thereafter, the data can be sorted in descending order by fold change to determine the genes that are most differentially regulated. Another filter can then be applied to identify differentially expressed genes (DEGs), based on pre-determined fold change and p-value cut-offs. Then another possibility may arise: you may wonder whether the dataset is influenced by gender differences, and how gender interacts with age to influence vaccine responses. This requires even more filtering of the dataset to probe these interesting biological questions…

As you can imagine, there are many filtering operations and comparisons to perform, especially when analysing a human dataset with large variation between individuals. Microsoft Excel is a useful tool for many of these filtering and aggregation functions. However, a major disadvantage is that it generates many Excel files and sheets, which can become difficult to manage. Python can perform these filtering functions easily, but this usually requires typing long blocks of code, which can be time-consuming (see the pandas sketch below for a flavour of what is involved). Hence, I recommend a Python package named Mito that can accelerate execution of these filtering functions.
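For context, a minimal pandas sketch of the kind of manual filtering described above might look like this; the file name and the column names (age, gender, fold_change, pval) are my own illustration:

import pandas as pd

# Hypothetical dataframe with age, gender, fold_change and pval columns
df = pd.read_csv('vaccine_data.csv')

# Stratify: keep only the young subjects (< 65 years old)
young = df[df['age'] < 65]

# Sort by fold change in descending order
young_sorted = young.sort_values('fold_change', ascending=False)

# Identify DEGs with absolute fold change >= 1.5 and p-value < 0.05
degs = young_sorted[(young_sorted['fold_change'].abs() >= 1.5) & (young_sorted['pval'] < 0.05)]

# Further stratify by gender to probe age-gender interactions
degs_female = degs[degs['gender'] == 'F']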

To install the package, execute the following command in the terminal to download the installer:

python -m pip install mitoinstaller

Ensure that you have Python 3.6 or above to utilise this package. Next, run the installer in the terminal:

python -m mitoinstaller install

Finally, launch JupyterLab and restart your Notebook Kernel:

python -m jupyter lab

To run the package in Jupyter Lab:

import mitosheet
mitosheet.sheet()

Output is as such:

The features of a mitosheet include:

1. Import datasets
2. Export datasets
3. Add columns (you can insert formulas here)
4. Delete columns that are unnecessary
5. Pivot tables, which allow you to transpose and query the dataframe differently
6. Merge datasets (possible if you have two Excel files with a shared column)
7. Basic graph features, using Plotly
8. Filter data
9. Multiple sheets, with the option of inserting new sheets

Most excitingly, Mito also generates the equivalent annotated Python code for each of these edits, meaning you can easily reuse the commands for further analysis in Python. Overall, Mito is a user-friendly tool that allows you to perform Excel functions within Python. This makes the workflow much more efficient, as we do not have to import multiple Excel files into Jupyter Notebooks. Critically, typing long code is not required, as the Python code is automatically generated as you plough through the data.

Featured
Posted in python

Saving python files for scientific publications

High-resolution image files are essential for scientific publications. I personally prefer to use Adobe Illustrator to make my scientific figures, as the created files are in a vector-based format that is resolution-independent. To export graph files made by Matplotlib, you can execute the code below after plotting your graph in Jupyter Notebooks:

plt.savefig('destination_path.eps', format='eps')

Saving the file directly in .eps format allows you to transfer your file from Jupyter Notebooks into Adobe Illustrator. Moreover, you can manipulate the files easily within Adobe Illustrator without any loss in figure resolution. Hope you like this TGIF tip! 🙂
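For completeness, here is a minimal, self-contained sketch of the full workflow; the data points, axis labels and file name are placeholders of my own:

import matplotlib.pyplot as plt

# Hypothetical data for illustration
x = [0, 1, 3, 7]
y = [1.0, 2.5, 1.8, 1.2]

plt.plot(x, y, marker='o')
plt.xlabel('Days post-vaccination')
plt.ylabel('Fold change')

# bbox_inches='tight' trims excess whitespace around the figure
plt.savefig('my_figure.eps', format='eps', bbox_inches='tight')
plt.show()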

Featured
Posted in Dengue, Resource

Increased adaptive immune responses and proper feedback regulation protect against clinical dengue

Genes related to antigen presentation were significantly increased in the asymptomatic compared to the symptomatic dengue individuals. Manuscript by E Simon-Lorière et al., Science Translational Medicine, 2017.

Dengue infections can be asymptomatic, symptomatic, or occasionally progress to severe dengue, a life-threatening condition characterised by a cytokine storm, vascular leakage, and shock. However, the molecular and immunological mechanisms underlying asymptomatic dengue virus (DENV) infection remain largely unknown.

In this publication, E Simon-Lorière et al. recruited DENV-infected children in Cambodia. Nine individuals remained strictly asymptomatic at the time of inclusion and during the 10-day follow-up period. PBMCs from 8 asymptomatic DENV-1 viremic individuals and 25 symptomatic dengue patients were used for further gene expression analysis.

Asymptomatic individuals had an increase in the percentage of CD4+ T cells and a decrease in CD8+ T cells compared to symptomatic dengue individuals. However, CD14+ monocytes, Lin-CD11c+ dendritic cells, CD19+ B cells, and CD335+ natural killer cells were not significantly different between asymptomatic and symptomatic individuals.

Transcriptomic signatures were distinct between asymptomatic and symptomatic individuals. The top pathways that diverged the most between asymptomatic and clinical dengue individuals were related to immune processes. Notably, the transcriptomic differences could not be explained by differences in viral load or immune status.

The innate immune responses were not significantly different between the asymptomatic and symptomatic individuals. Instead, the most significantly activated pathway in asymptomatic individuals was related to “nuclear factor of activated T-cells (NFAT) mediated regulation of immune response.” These genes include CIITA, CD74 and various human leukocyte antigen (HLA) genes, and their expression differences were also validated at the protein level (see figure on top).

Protein kinase Cθ (PKCθ) signaling in T lymphocytes was also highly activated in asymptomatic viremic individuals. Upregulated genes included AKT3, SOS1, PAK1, and SLAMF6, as well as T-cell costimulatory pathways such as ICOS-ICOSL, and CD28 and CTLA4 signaling in cytotoxic T-cells.

In contrast, genes related to B-cell activation, differentiation and plasma cell development (BLIMP-1, IRF4) were downregulated in asymptomatic individuals. This finding correlates with the reduced antibody production in asymptomatic individuals.

Data are deposited in the Gene Expression Omnibus under accession number GSE100299.

Featured
Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. The magnitude differences are usually represented as changes in hue or colour intensity. However, if you want to include another parameter, this can be challenging. Imagine the scenario where you identified 5 Gene Ontology Biological Pathways (GOBP) that are significantly different between the infected and uninfected samples over the course of 3 days. To plot them on a graph, you can choose to negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you will need 2 figure panels to fully describe your pathway analysis. A more elegant way to display these results is to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting dots rather than the rectangles of a heatmap is that you can alter the size of the dots to add another dimension to your data visualisation. For instance, in this specific example, you can display the p-values as the size of the dots and use the hue of the dots to represent the enrichment score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

We then load a ‘test’ dataset from my desktop, formatted so that the columns contain the timepoints, pathway terms, negative logP values and combined scores. It is also good practice to convert the timepoint column to a “string” datatype so that the x-axis does not include interpolated time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")
df.head(5)

Output is as follows:

   timepoint  Term                          adjusted_pval  Combined_score  neg_logP
0  1          Defense Response To Virus     3.942000e-25   87.531          24.404283
1  2          Defense Response To Virus     3.940000e-27   875.310         26.404283
2  3          Defense Response To Virus     5.000000e-02   2.000           1.301030
3  1          Defense Response To Symbiont  2.256000e-25   95.555          24.646661
4  2          Defense Response To Symbiont  2.260000e-27   955.550         26.646661

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)

fig.show()

Output is a dot plot, where the dot size is proportional to the -log p-value and the colour intensity represents the combined score. You can also customise the colours using the continuous colour scales available on the Plotly website.

Because dot plots can add another dimension to the analysis, most pathway analyses are presented as dot plots. However, I am sure there are other scenarios where dot plots can be appropriately used. Next time, if you decide to plot multiple heatmaps, do consider using dot plots as an alternative data visualisation tool!

Featured
Posted in About me

Saying a big thank you!

Thank you all for reading my blog entries. My blog now has more than 1,000 visitors and more than 2,000 views. I hope to bring more content to my entries and eventually make this a useful resource for all of us!

Featured
Posted in Resource

Correlates of Vaccine-Induced Immunity

Figure showing how the innate and adaptive immune responses can interact synergistically after vaccination to confer protection against viral infection and diseases. It is critical to define which of these components are correlates or cocorrelates of protection. Drawing by BioRender.

There are generally 4 categories of immune functions that relate to protection:

Correlate: A specific immune response to a vaccine that is closely related to protection against infection, disease or another defined end point.
Absolute correlate: A quantity of a specific immune response to a vaccine that always provides near 100% protection.
Relative correlate: A quantity of a specific immune response to a vaccine that usually (but not always) provides protection.
Cocorrelate: A quantity of a specific immune response to a vaccine that is one of two or more correlates of protection, and that must be synergistic with the other correlates.
Surrogate: A quantified specific immune response to a vaccine that is not itself protective but that substitutes for the true (perhaps unknown) correlate.

Some important pointers that I learnt from the article published by Stanley Plotkin, CID, 2008:

1. The correlate of protection induced by vaccination may not necessarily be the same correlate that operates to close off infection. An example of this principle is the measles vaccine. Titers >200 mIU/mL of antibody after vaccination are protective against infection, whereas titers between 120 and 200 mIU/mL protect against clinical signs of disease but not against infection. Titers <120 mIU/mL are not protective at all. Another consideration is cellular immunity to measles, which is critical in recovery from disease, as CD8+ cells are needed to control viremia and the consequent infection of organs. Another example is cytomegalovirus, where antibodies are a correlate of protection against infection, whereas T cell immunity is a correlate of protection against disease.

2. Correlates of protection may be either absolute or relative. Examples of absolute correlates (situations in which a certain level of response almost guarantees protection) include diphtheria, tetanus, measles, rubella and hepatitis A. While absolute correlates are highly desired, many correlates are relative. In these cases, although protection is usually conferred at a certain level of response, breakthrough infections are possible. An example is the influenza vaccine, where a hemagglutination-inhibition antibody titer of 1/40 is associated with 70% clinical efficacy.

3. While antibodies are often used as measures of correlates of protection, not all antibodies neutralise infections in the same way. An example is the meningococcal polysaccharide vaccines, which give notoriously poor protection in young children even though children do mount significant ELISA antibody responses. Other functions, including opsonophagocytosis, ADCC and complement activation, could also be important for protection.

4. In some cases, antibodies are surrogates rather than true correlates of protection. This means that the antibodies may be only indirectly related to the true correlate of protection. Examples provided were the rotavirus and varicella vaccines, where cell-mediated immunity is clearly required for protection against viral infection and disease.

5. Emerging evidence suggests the possibility of organ-specific correlates. Based on experimental studies, it appears that CD4+ cells are key to the prevention of brain pathology after measles and to helping CD8+ cells close off West Nile virus CNS infection. More work will be needed to define correlates of protection that are organ-specific.

6. Correlates of immunity may differ between different age groups. An example is the influenza vaccine, where antibody production is critical to prevent primary influenza infection in the young, but CD4+ cells may be more important for immunologically experienced individuals undergoing heterosubtypic infection.

7. Cellular responses are increasingly recognised as correlates or cocorrelates of protection. Given that CD4+ cells must be present to help antibodies develop, and CD8+ cells are needed for virus clearance, emerging evidence now suggests that cellular responses are critical in limiting viral pathogenesis and dissemination. However, more work will be needed to uncover the parameters that are essential for cell-mediated protection.

Featured
Posted in Data visualisation, pair plots, python

Visualising multidimensional data with pair plots

Pair plots can be used to visualise relationships between multiple variables, allowing you to zoom in on the interesting aspects of the dataset for further analysis.

As discussed in my previous blog entry, correlation coefficients can measure the strength of the relationship between 2 variables. In general, a correlation coefficient of 0.5 and above is considered strong, 0.3 moderate and 0.1 weak. To quickly visualise the relatedness between multiple variables, a correlation matrix can be used. However, there are some limitations. For instance, the correlation matrix does not show the steepness of the slope between variables, nor does it display the relationship if the variables are not linearly correlated. In addition, the correlation matrix cannot show how categorical variables might interact with the other continuous variables.

Let me provide an example to illustrate my case. Consider that you want to understand whether flower sepal length, sepal width, petal length and petal width are correlated. Plotting a correlation matrix for these 4 variables will allow you to visualise the strength of the relationships between these continuous variables. However, if you have another categorical variable, for example flower species, and you are interested in understanding how flower species interacts with all these different parameters, then merely plotting a correlation matrix is inadequate.

To get around this problem, pair plots can be used. Pair plots allow us to see the pairwise relationships between multiple continuous variables. In Python, you can even use colours to visualise how a categorical variable interacts with the other continuous variables. Surprisingly, plotting pair plots in Python is relatively straightforward. Let’s use the iris dataset from Plotly Express as a proof of concept.

We first load the required packages, the iris dataset from Plotly Express, and display all the variables from the dataset with the head() function:

# Load packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load iris dataset from plotlyexpress
df = px.data.iris()
df.head()

The output file looks like this, with sepal length, sepal width, petal length and petal width as continuous variables and flower species as the categorical variable:

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0  5.1           3.5          1.4           0.2          setosa   1
1  4.9           3.0          1.4           0.2          setosa   1
2  4.7           3.2          1.3           0.2          setosa   1
3  4.6           3.1          1.5           0.2          setosa   1
4  5.0           3.6          1.4           0.2          setosa   1

To visualise the correlations between the continuous variables, the pair plot can be quickly plotted with a single line of code. In addition, we can annotate the different flower species with different colours. Finally, rather than just plotting the same scatterplot on both sides of the matrix, we will also plot the lower half of the matrix with the best-fit straight line so we can visualise the slope of each relationship. The code used is thus as follows:

g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)

Output is as follows:

Isn’t it cool that you can immediately see that the variables are strongly correlated with each other, and these parameters are dependent on the flower species? As you may already appreciate, the pair plot is a useful tool to visualise and analyse multi-dimensional datasets.
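If you prefer an interactive version where you can hover over individual data points, a similar pair plot can be sketched with Plotly Express’s scatter_matrix function; this is a minimal sketch, not a replacement for the Seaborn version above:

# Interactive alternative using Plotly Express
import plotly.express as px

df = px.data.iris()
fig = px.scatter_matrix(
    df,
    dimensions=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    color="species"
)
fig.show()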

Featured
Posted in python

Converting strings to list in Python

Lists are used to store multiple items in a single variable. Grouping items together in a single list tells Python that these variables should be analysed as a group, rather than as individual entries. However, converting a string of items into a list may not be an easy task. In the past, my solution was to copy the item list into a new notebook, and use the “find and replace” function to convert delimiters to “,”. Recently, I chanced upon an article that describes how you can do this task elegantly in a single line of code. Imagine you have a list of genes from EnrichR pathway analysis, and the genes are separated by semicolons (;). You can convert these genes into a list with a single command:

DEGs = "IFNG;IFNA14;CXCL10;RSAD2;ISG15"
Gene_list = DEGs.split(";")
print(Gene_list)

Output is the list of genes, which you can then use to query other datasets:

['IFNG', 'IFNA14', 'CXCL10', 'RSAD2', 'ISG15']

Super cool!
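As a small extension of my own, if the string contains stray spaces around the delimiters, stripping each item keeps the list clean, and join() reverses the operation:

DEGs = "IFNG; IFNA14 ;CXCL10; RSAD2;ISG15"

# Strip whitespace from each gene after splitting
Gene_list = [gene.strip() for gene in DEGs.split(";")]
print(Gene_list)   # ['IFNG', 'IFNA14', 'CXCL10', 'RSAD2', 'ISG15']

# join() converts the list back into a delimited string
print(";".join(Gene_list))   # IFNG;IFNA14;CXCL10;RSAD2;ISG15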

Featured
Posted in HIV, Resource

Molecular Signatures of a TLR4 Agonist-Adjuvanted HIV-1 Vaccine Candidate in Humans

Molecular signatures associated with early (A) and late (B) humoral responses. Source from Anderson et al., 2018, Frontiers in Immunology

Immunisation with the stable trimeric recombinant HIV-1 envelope glycoprotein, CN54gp140, has been shown to induce potent humoral immune responses, especially when adjuvanted with TLR4 agonist adjuvants such as monophosphoryl lipid A or GLA-AF (glucopyranosyl lipid adjuvant-aqueous formulation). These adjuvants exert their adjuvanticity, at least in part, by activating the myeloid differentiation factor 88 (MyD88) and toll-interleukin 1 receptor domain-containing adapter inducing interferon-β (TRIF) pathways. However, the clinical response to CN54gp140 adjuvanted with GLA-AF is variable between individuals. Anderson et al. characterised the host responses after vaccinating subjects with CN54gp140 adjuvanted with GLA-AF, and examined the gene signatures linked to vaccine immunogenicity.

Healthy male (n = 8) or female (n = 6) volunteers aged between 18 and 45 and with no history of HIV-1 and HIV-2 infection received the vaccine, and whole blood was collected from these subjects at 6 hours, 1, 3 and 7 days after vaccination. 

The majority of DEGs were observed within 24 h post-vaccination, compared to the later time points of 3 and 7 days post-immunisation.

The DEGs revealed an enrichment of blood transcriptional modules (BTMs) related to cell cycle regulation and signaling, as well as those related to innate and adaptive immune responses.

NK cell-related BTMs (M7.2, M61.0, and S1) were significantly repressed in the gene expression profiles of individuals with high late serum IgA or IgG4 responses (see figure on top).

In particular, the authors identified a repression of BTMs related to NK cells, especially at 3 and 7 days post-vaccination, in high serum IgM, IgA, and IgG4 antibody responders.

Flow cytometry was performed to determine whether these changes were due to changes in NK cell numbers or in protein expression levels upon vaccination.

In the limited number of analyzed samples, the frequency of the CD3–CD56dim NK cell population in the blood of high antibody responder subjects was increased at 14 days post-vaccination compared to the 0 h baseline. While more studies need to be done, the authors speculate that the repression of BTMs related to NK cells observed in the first 7 days post-vaccination reflects NK cells leaving the circulation early in the response. Given that NK cells are short-lived, the enhanced frequency of NK cells at 14 days post-vaccination is presumably attributable to secondary induction of NK cell differentiation processes in response to vaccination.

Featured
Posted in Data visualisation, IPyWidgets, python

Making Jupyter notebooks interactive with IPyWidgets

There are many instances where scientists and data scientists have to identify cut-offs before deeper data analyses. For instance, what fold change and adjusted p-value threshold are most appropriate for identification of DEGs? For pre-ranked pathway analysis, what is an appropriate cut-off for enrichment scores to be considered significant? For RNA-seq, what is the minimum count threshold we should filter on without trimming away too many transcripts? Executing these tasks can take a long time, even with coding, because you need multiple lines of code to identify the most suitable cut-off.

Consider that you are unsure what an appropriate cut-off for identification of DEGs is. The usual way would be to query the number of DEGs identified when the fold-change is at least 1.5 with adjusted p-value < 0.05. Imagine you identified only 30 DEGs amongst 20,000 genes. This may indicate that your cut-off is too stringent, so this time you decide to relax it to a fold-change of at least 1.2 with adjusted p-value < 0.05. Again, you will require another line of code. Imagine this time you found 1,000 genes that satisfy the criteria, and now you think this number is too many for pathway analysis. Again, this means you might need yet another line of code to query a fold-change of at least 1.3, and the cycle repeats until you find that suitable cut-off.

To circumvent this issue, I introduce IPython widgets (IPyWidgets), which turn Jupyter Notebooks from static documents into interactive dashboards, perfect for exploring and visualizing data. To install, simply type the command:

pip install ipywidgets

We first import the packages and load a dataset. Here, I load a dataset from Zak DE et al. from my BENEATH account. The study examines the transcriptomic responses in human subjects across time after receiving the MRKAd5/HIV vaccine, and the details have been previously elaborated in my blog:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import beneath
df = await beneath.load_full("kuanrongchan/vaccine-repository/ad5-seroneg-daniel-2012")
df.info()
df.head(2)

The output for the header columns is as follows: fc indicates fold-change, pval indicates the p-value with respect to day 0, and qval indicates the adjusted p-value with respect to day 0. The time-points considered are 6 hrs, 1, 3 and 7 days after vaccination.

  gene  fc006h    pval006h  qval006h  fc001d    pval001d      qval001d  fc003d    pval003d  qval003d  fc007d    pval007d  qval007d  @meta.timestamp
0 A1BG  -1.13408  0.157781  0.500370  -1.03742  6.099120e-01  0.804213  -1.13944  0.143099  0.678669  -1.04868  0.589913  0.997579  2021-07-16 00:24:39.972000+00:00
1 A1CF  -1.04199  0.383750  0.638880   1.07475  6.557900e-02  0.170118   1.01920  0.686074  0.929992  -1.01369  0.772500  0.997579  2021-07-16 00:24:39.972000+00:00

We will use the following command to make an interactive slider. Assume we fix the adjusted p-value to be less than 0.05 and want to find out the number of DEGs at a given fold-change cut-off:

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

@interact
def show_DEGs(column1=['fc001d','fc003d','fc007d'], x=(0,3,0.1),column2=['qval001d','qval003d','qval007d']):
    return len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index)

Output is as follows:

With the @interact decorator, the IPyWidgets library automatically gives us dropdown menus and a slider for choosing the columns and the cut-off value! In this case, we can move the slider from 0 to 3, and switch between the different time-points (6 hrs, 1, 3 and 7 days post-vaccination), to see the number of DEGs at different fold-change cut-offs. The output shown has the slider moved to 1.3 and 1.5, which yielded 1510 and 905 DEGs respectively.

You can also combine the @interact decorator with the .loc function to list the genes and expression values that satisfy the criteria set by your slider:
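A minimal sketch of how this might look, reusing the same dataframe and column names as above (my own illustration, not the exact code from the original screenshot):

# @interact relies on the earlier import: from ipywidgets import interact, interact_manual
@interact
def show_DEG_table(column1=['fc001d','fc003d','fc007d'], x=(0,3,0.1), column2=['qval001d','qval003d','qval007d']):
    # Return the genes (rows) passing the fold-change and adjusted p-value cut-offs
    return df.loc[(abs(df[column1]) > x) & (df[column2] < 0.05), [column1, column2]]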

Finally, you can also use IPyWidgets to make multiple comparisons, especially with the dropdown widget. For example, if you want to quickly find out whether the transcript expression values at two time-points are correlated, you can use IPyWidgets to create dropdown menus for multiple comparisons:

@interact
def correlations(column1=list(df.select_dtypes('number').columns),
                 column2=list(df.select_dtypes('number').columns)):
    print(f"Correlation:{df[column1].corr(df[column2])}")

The output is two dropdown lists that allow you to quickly evaluate the correlation between any 2 variables in the dataframe:

The potential of IPyWidgets is huge: it allows you to scan through data very quickly and brings some life to the static Jupyter Notebook. You can also use it for plotting graphs, scanning through files and more. The possibilities are endless… 🙂

Featured
Posted in About me

Animal Coronavirus Diseases: Parallels with COVID-19 in Humans

Figure showing the comparisons between animal and human coronaviruses that are discussed in our review paper

Happy to share that we have just published a review paper on “Animal Coronavirus Diseases: Parallels with COVID-19 in Humans.” This review was written with Chao Nan Lin, with whom I have worked very closely since I was a PhD student. It is hence a great privilege to collaborate with him on his topic of expertise: animal coronaviruses. The manuscript describes the similarities and differences between animal and human coronaviruses with regard to genome organization and recombination, immunopathogenesis, transmission, viral shedding, diagnosis, treatment, and prevention. We hope that the synthesised knowledge can help us better manage coronavirus epidemics and design interventions that reduce the spread of these viruses.

Featured
Posted in influenza, Resource

Systems Analysis of Immunity to Influenza Vaccination across Multiple Years and in Diverse Populations Reveals Shared Molecular Signatures

Figure showing that the number of DEGs related to innate immune responses are more highly expressed in the young compared to the elderly subjects after receiving the Influenza TIV. Source from Nakaya et al., 2015, Immunity.

Although vaccination is considered the most effective method for preventing influenza, it shows limited efficacy in the elderly. Here, Nakaya et al. used a systems vaccinology approach to understand the mechanisms behind the poor efficacy of the trivalent inactivated influenza vaccine (TIV) in the elderly.

212 individuals from the current study and 218 individuals from a previously published study (Franco et al., 2013) were included in the analyses; 54 of these were elderly (>65 years old). As expected, antibody responses to the influenza vaccine (TIV) decreased with age.

Blood Transcriptomic Modules related to the induction of interferons and activation of dendritic cells were enriched on days 1 and 3 after TIV vaccination, whereas modules related to T cells at these time points were negatively associated with the antibody response. On day 7, there was a robust enrichment of antibody secreting cells (ASC) and cell cycle-related modules. The enriched modules in young and elderly were similar, particularly those related to the interferon response and activation of dendritic cells on day 1. However, the magnitude of expression of interferon-related genes was significantly higher in young individuals (See top figure).

Combining all datasets, several B-cell- and T-cell-related modules at pre-vaccination were positively correlated with an increased antibody response to vaccination. In contrast, modules related to monocytes were negatively correlated with antibody responses, supporting the concept that inflammatory responses at baseline might be detrimental to the induction of vaccine-induced antibody responses.

Consistent with the neutralizing antibody responses, B-cell and plasmablast modules (BTM S3) showed reduced expression in the elderly compared to the young on day 7. However, Natural killer (NK) cell and monocyte modules were enriched in the elderly at day 3 and day 7 after vaccination.

To examine if the transcriptional changes were due to changes in these specific cell types, flow cytometry was performed. Proportions of total NK cells in elderly subjects were higher than those of young subjects at baseline and all time points studied (days 0–14). The NK-cell activation markers were also more prominent in the elderly. Similarly, increased quantities of monocytes were seen in the elderly, with higher CCR5 expression at all time-points tested.

Differential expression of miRNAs was also evident between the elderly and the young, suggesting that miRNAs could be important regulators of the immune response to influenza vaccination.

Data are deposited in GEO under accession number GSE74817.

Featured
Posted in python, Research tools

Introducing BENEATH as a data warehouse platform

Data warehousing improves access to multiple datasets, increasing the speed and efficiency of data analysis. Storing and categorising multiple processed datasets in one specified location allows data scientists to quickly query and analyse them. It also facilitates advanced analyses such as predictive modelling and machine learning.

Here, I introduce BENEATH as a data warehouse platform to store multiple datasets. There are several features that make this platform cool:

  1. A user-friendly interface that makes it easy to deposit dataframes
  2. Ability to use the SQL programming language for various kinds of data query
  3. Python commands are sufficient to import and export files, meaning that no background in other programming languages is required
  4. Easy to set permissions. One command allows other users to create or view dataframes, and you also have the option to make your database public
  5. Free of charge (unless your datasets are huge)
  6. A responsive community to accelerate troubleshooting

I will break down the individual steps involved in depositing a dataframe in BENEATH:

First, install BENEATH by executing the below command in the Jupyter Notebook (you only need to do this once):

pip install --upgrade beneath

Next, set up an account at https://beneath.dev/. Thereafter, in the command line (Terminal), execute the command below and follow the instructions to authenticate your account.

beneath auth

Thus far, we have installed the important packages and completed the authentication procedures. We are now ready to deposit any dataframe into BENEATH. In this example, we will import a .csv file (named test) into Python using the command below in a Jupyter Notebook. The dataframe is saved under the variable “df”:

import csv
import numpy as np
import pandas as pd
df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df.head(2)

Output file is as follows:

   subject  ab30d     ab60d     ab90d     tcell7d
0  1        0.284720  0.080611  0.997133  0.151514
1  2        0.163844  0.818704  0.459975  0.143680

Depending on your username (in my case, kuanrongchan), we can create a project to store this dataframe. In this example, we will create a project named vaccine_repository. You will have to execute the code below in the command line (Terminal):

beneath project create kuanrongchan/vaccine_repository

Finally, to deposit your dataset, execute the command below. The table_path directs the dataframe to the assigned project. You can then set the index with the “key” argument. In this case, I have assigned subject as the index key, which will allow you to query subject IDs quickly in the future. You can also add a description to include any other details of the dataset.

import beneath
await beneath.write_full(
    table_path="kuanrongchan/vaccine_repository/test",
    records=df,
    key=["subject"],
    description="my first test file"
)

Now, for the cool part: You can execute a simple command to quickly import this dataframe into Jupyter Notebook for data analysis:

df = await beneath.load_full("kuanrongchan/vaccine-repository/test")

To share this project with a friend, execute the command below. In this case, assume my friend’s BENEATH username is abc.

beneath project update-permissions kuanrongchan/vaccine_candidates abc --view --create

Multiple datasets can be easily deposited in the platform using the code described above. However, the current limitation of BENEATH is the lack of data visualisation tools, which means that graphs still have to be plotted in Jupyter Notebooks. The developers are currently working on this aspect, which should make BENEATH a great data warehouse platform for data scientists.

Featured
Posted in Resource, VSV vectors

Human Transcriptomic Response to the VSV-Vectored Ebola Vaccine

Principal Component Analysis plot showing the transcriptomic differences between days 0, 1, 2 and 3 after rVSV∆G-ZEBOV-GP vaccination. Source from F Santoro et al., 2021

rVSV∆G-ZEBOV-GP is a recombinant vaccine based on the Vesicular Stomatitis Virus (VSV), where the original VSV glycoprotein encoding gene was deleted and replaced with the surface glycoprotein (GP) encoding gene from the Ebolavirus Zaire strain (ZEBOV).

The vaccine was shown to be safe, although occasionally associated with transient reactogenicity. However, the host response to the vaccine has not been thoroughly investigated.

In this manuscript by F Santoro et al., 2021, the blood transcriptomic response to high-dose vaccination (10^7 and 5 × 10^7 pfu) with rVSV∆G-ZEBOV-GP was analysed in 51 volunteers. Whole blood samples were taken on days 0, 1, 2, 3, 7, 14, 21 and 28.

Vaccination resulted in the greatest host transcriptomic changes at day 1, which lasted until day 3 (see top figure). Notably, the massive transcriptomic changes on days 1-2 correspond to the timing of mild to moderate reactogenicity events (chills, fever, headache, fatigue or myalgia) in 50 out of 51 vaccinees.

Viral load differences did not affect host responses to the vaccine, except for the MZB1 gene, which codes for the marginal zone B and B1 cell-specific protein.

Most blood transcriptomic module correlations with anti-ZEBOV GP IgGs were detected at day 14 post-vaccination. As expected, B cell activation and BCR signaling modules correlated with vaccine immunogenicity. Other modules that were significantly correlated at day 14 involved pathways such as calcium signalling, cell adhesion and activating transcription factor networks, which are possibly related to signal transduction.

Transcriptomic data are available in the Zenodo database.

Featured
Posted in correlation, Data visualisation, python

Correlation matrix – A useful tool to visualise relatedness between multiple variables

Correlations are a great way to show whether variables are associated with each other. The strength of correlation is often represented by the correlation coefficient, with values ranging between -1.0 and 1.0. A positive value indicates a positive association while a negative value indicates a negative association. While there is no fixed definition, a correlation value larger than 0.7 or smaller than -0.7 is usually considered a strong correlation. To quickly visualise the correlations between multiple variables, correlation matrices can be used. In my opinion, Python allows you to plot these matrices most quickly, which is what I will elaborate on in this blog entry.

First, we load the packages and dataset into Python, where I have assigned the headers as variables (ab30d, ab60d, ab90d and tcell7d) and the first column as the subject ID column. We will assign this dataframe to “test_1”:

# Import packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import data with 10 subjects into python. In this case only the top 2 rows will be shown in the output
test_1 = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv',index_col=0)
test_1.head(2)

Output file is as follows:

         ab30d     ab60d     ab90d     tcell7d
subject
1        0.284720  0.080611  0.997133  0.151514
2        0.163844  0.818704  0.459975  0.143680

Next, we can visualise the correlation coefficient across all variables using the command:

corrM = test_1.corr()
corrM

Output file showing all the correlation coefficient values is as follows:

         ab30d      ab60d      ab90d      tcell7d
ab30d    1.000000   0.378926   0.042423  -0.324900
ab60d    0.378926   1.000000  -0.489996  -0.374259
ab90d    0.042423  -0.489996   1.000000  -0.392458
tcell7d -0.324900  -0.374259  -0.392458   1.000000

We can also determine the p-values of the correlation by using the commands, as follows:

from scipy import stats 
from scipy.stats import pearsonr
pvals = pd.DataFrame([[pearsonr(test_1[c], test_1[y])[1] for y in test_1.columns] for c in test_1.columns],
                     columns=test_1.columns, index=test_1.columns)
pvals

Output file is as follows:

         ab30d         ab60d     ab90d         tcell7d
ab30d    6.646897e-64  0.280214  9.073675e-01  0.359673
ab60d    2.802137e-01  0.000000  1.505307e-01  0.286667
ab90d    9.073675e-01  0.150531  6.646897e-64  0.261956
tcell7d  3.596730e-01  0.286667  2.619556e-01  0.000000

Finally, to plot the correlation matrix with the correlation coefficient values within, we execute the following command:

corr = test_1.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
# Define a diverging colour palette for the heatmap (any seaborn/matplotlib palette works here)
cmap = sns.diverging_palette(220, 20, as_cmap=True)
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, cmap=cmap, mask=mask, center=0, square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5})

Output file is a beautiful correlation matrix:

At one glance, you can quickly tell which of the variables are positively correlated and which are negatively correlated. Depending on the size of your correlation matrix, you can tailor the size of the output diagram by changing the figsize dimensions within the code. Correlation matrices can also be complemented with pair plots, which I will elaborate on in a future blog entry.

Featured
Posted in About me

Work hard, play hard!

It’s good to pace yourself and take breaks when necessary. As the saying goes: slow and steady wins the race. This week, I had the privilege of taking a short break from work to have lunch with my lab mates.

The best part: I solved a few Python coding problems (that I had trouble with previously) right after lunch. Grateful for all the support from my lab mates!

Featured
Posted in BNT162b2 mRNA vaccine, Resource

Systems vaccinology of the BNT162b2 mRNA vaccine in humans

Interferon responses were most evident after 1 day post BNT162b2 vaccination, especially after the second vaccination. Source from Arunachalam et al., Nature, 2021

The Pfizer-BioNTech vaccine, BNT162b2, has ~95% efficacy, but little is known about the host immune responses involved. In this study by Arunachalam et al., the authors examined the host immune responses to this mRNA vaccine.

Immune responses were examined in 56 volunteers, each receiving two doses of the vaccine. Neutralizing antibodies were increased after the primary vaccination and boosted significantly after the second vaccination. Spike-specific CD4 and CD8 responses were more evident after the second vaccination. No significant correlations were noted between T-cell responses, neutralizing antibodies and age.

Phosphorylation of STAT1 and STAT3 in B-cells, T-cells, monocytes and pDCs was seen at 1 day post-vaccination, especially after the second vaccination. This observation correlated with an induction of IFN-gamma and CXCL-10 protein expression at 1-2 days post-vaccination. Similarly, the induction of these cytokines was greater after the second vaccination.

Consistent with the results from the protein analysis, the transcriptional responses were greater after the second vaccination than after the first (see top figure). Most of these responses were related to monocyte and inflammatory modules.

Single-cell transcriptomics revealed that the interferon signatures were broadly induced across cell types, much of which was driven by monocytes and dendritic cells. NK cell activation was only apparent at 1 day post second vaccination and disappeared by the following day.

Comparing transcriptomic responses with other vaccines, the mRNA vaccine responses behaved most similarly to adjuvanted vaccines and live viral vectors at day 1 post-vaccination. However, the authors did not detect any B-cell signatures at any of the time-points tested, despite seeing a significant increase in plasmablasts. Instead, the pathways related to interferon and inflammation were most correlated with neutralizing antibody and CD8 T-cell responses.

Featured
Posted in Data visualisation, Heatmaps, python

The art of plotting heatmaps

Bar charts provide precise measurements with standard deviations, but using multiple bar charts to show trends for many genes or proteins is difficult to visualise. In such cases, a heatmap can be a useful alternative for visualising changes in the expression of multiple genes or proteins across different conditions. In a heatmap, the variation in magnitude of expression is often visualised by changing the intensity or hue. For instance, higher expression can be depicted using the colour red, with a greater increase in expression shown by a darker intensity. On the other hand, downregulation can be depicted using the colour blue (more colour-blind friendly than green), with a stronger intensity indicating greater downregulation. While plotting raw expression values in a heatmap is acceptable if the relative abundances of the genes and proteins are similar, data normalisation is required when the baseline gene or protein abundances differ.

For instance, consider that you want to plot the expression levels of gene A and gene B in a single heatmap. Gene A has a baseline expression of 5 copies per cell whereas gene B has a much higher baseline expression of 50 copies per cell. Let’s assume that during a virus infection, gene A expression is reduced to 2.5 copies/cell while gene B expression is increased 2-fold to 100 copies/cell. In this case, due to the high baseline abundance of gene B, plotting the heatmap using raw gene abundance will mask the differential expression of gene A after infection.

To display the relative expression of both genes A and B in a heatmap, data normalisation is recommended. One of the most common strategies is to plot a heatmap based on log2 fold-change instead of abundance. In this specific case, the log2 fold-changes for gene A and gene B after virus infection are -1 and 1 respectively, and these values can be nicely presented in a heatmap to depict the phenomenon. Another common alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation scales each gene or protein across all conditions to mean = 0 and standard deviation = 1, which means that values above 0 are above average and values below 0 are below average. Hence, the Z-score transformation standardises the values between genes or proteins to the same mean and standard deviation, enabling users to easily compare expression values across genes and proteins. An important point to note, however, is that the Z-score transformation should only be applied to values obtained within a single experiment.
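A minimal sketch of these two normalisation strategies, using the hypothetical gene A and gene B values from the example above:

import numpy as np
import pandas as pd

# Hypothetical abundances (copies per cell) before and after infection
abundance = pd.DataFrame(
    {'baseline': [5.0, 50.0], 'infected': [2.5, 100.0]},
    index=['gene_A', 'gene_B']
)

# Strategy 1: log2 fold-change relative to baseline
log2fc = np.log2(abundance['infected'] / abundance['baseline'])
print(log2fc)  # gene_A: -1.0, gene_B: 1.0

# Strategy 2: row-wise Z-score (each gene scaled to mean 0 and standard deviation 1)
z = abundance.sub(abundance.mean(axis=1), axis=0).div(abundance.std(axis=1, ddof=0), axis=0)
print(z)       # gene_A: 1, -1; gene_B: -1, 1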

To plot a heatmap, Microsoft Excel (under the format cell option) or specialised software such as Prism can be used. However, most of these tools do not perform clustering of heatmaps. From my experience, besides Clustergrammer, Python is one of the fastest and easiest ways to plot a heatmap or clustergram. After loading the data, the heatmap can be plotted with a single line of code, with various kinds of customisation. The step-by-step commands and code descriptions are as follows:

# Import packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading data into python as dataframe and check that the dataset is properly loaded
# Gene symbol is assigned as the index column
df = pd.read_csv('/Users/kuanrongchan/Desktop/Blog.csv',index_col=0)
df.head()

Output file:

Gene_Symbol  Control_1  Control_2  Control_3  Condition_1_1  Condition_1_2  Condition_1_3  Condition_2_1  Condition_2_2  Condition_2_3
HERC5        2.35       2.18       1.90       3.41           2.99           3.25           2.67           2.55           2.46
IFIT1        4.50       4.42       4.58       6.31           6.26           6.23           6.08           5.52           5.21
IFIT2        4.34       4.84       4.76       5.91           5.96           5.95           5.56           5.36           4.97
IFNA4        1.42       1.52       1.44       1.73           2.05           2.35           1.14           1.54           1.61
IFNA7        0.88       1.22       0.56       1.46           1.23           1.70           1.23           0.71           0.64

# Packages to plot clustergram
from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable
from mpl_toolkits.axes_grid1.colorbar import colorbar

# Plot clustergram
g = sns.clustermap(df, z_score=0, cmap="vlag", figsize=(10, 4), method='average', cbar_kws={"ticks":[-1.5,0,1.5]}) 

# Various customisations can be performed on the above code as follows:
# Z_score = 0 ; tells python to plot the Z-score values
# cmap = "vlag" ; indicates red-blue heatmap
# figsize ; indicates figure size dimensions
# method = 'average' ; default option
# robust=True ; can be added in the code to plot as robust quantiles instead of the extreme values
# cbar_kws={"ticks":[-1.5,0,1.5]} ; defines points to show in colour bar
# g.cax.set_visible(False) ; can be added to remove colour bar
# Add yticklabels=True ; can be added for big heatmaps to display all gene symbols 

Output:

At a single glance, you can quickly tell that condition 2 behaves similarly to the control condition, whereas condition 1 shows a marked increase in the expression of IFNA4, IFNA7, HERC5, IFIT1 and IFIT2. The replicates are also clustered together. As you may already appreciate, this kind of data presentation is particularly useful for analysing big datasets.

Featured
Posted in About me

Dataquest certificates for python programming

Had the chance to complete my training on the fundamentals of Python, NumPy and pandas, all of which are essential for data science research. The knowledge is particularly relevant for analysing dataframes, and provides the essential building blocks to execute more sophisticated commands found in other online resources. As the saying goes, learning never stops, and I certainly hope you carry this positive attitude wherever you go!

Featured
Posted in influenza, Resource

High-Resolution Temporal Response Patterns to Influenza Vaccine Reveal a Distinct Human Plasma Cell Gene Signature

Antibody titers, cell subset numbers, and gene transcript expression across the different subjects. Source from Henn AD. et al., Scientific Reports, 2013

Influenza vaccines produce highly variable B cell responses among individuals, making it difficult to predict who will achieve protective antibody titers after vaccination.

In this paper by Henn AD et al., 2013, daily sampling of serum, peripheral blood mononuclear cells (PBMC), B cells and plasma cells from 14 human subjects was performed over 11 days after administration of the inactivated influenza vaccine. Peripheral blood was drawn during the week prior to vaccination (pre-V), immediately before vaccination (day 0), daily for days 1–10 and on day 21 post-vaccination.

Most differentially expressed genes were detected at days 5-6 post vaccination, and this corresponded with the heightened IgM, IgG, plasmablast and activated ASC responses in most subjects. Many of these transcripts were validated to be B cell differentiation genes.

742 genes were differentially regulated over time, and the majority of these genes were significantly correlated with CD27hiCD38hiCD138− plasmablasts. These genes are termed the plasma cell gene signature (PCgs).

Ten of the top 30 categories of functionally related genes in the PCgs involved ER function and protein production. These findings are consistent with involvement of the PCgs in program-level upregulation of antibody production machinery and the unfolded protein response seen during plasma cell development.

Of interest, the other genes involved in the PCgs were expressed by the myeloid/DC lineages, many of which peaked at day 1. This is consistent with the notion that the magnitude of the innate immune response is also associated with antibody responses from influenza vaccination.

Featured
Posted in .loc function, python

My personal experience with the .loc function for gene queries

Gene query can be made easier with programming tools such as Python and SQL

With more scientific journals mandating the deposition of datasets into public data repositories, data scientists can now analyse these deposited datasets to derive new biological insights, or leverage them to cross-check their own research findings. However, searching a long list of genes across multiple datasets can be challenging, especially if you are querying hundreds to thousands of genes.

Imagine another situation where 100 genes are involved in a biological pathway. You are interested in finding out whether your gene expression changes are influenced by just a small subset of genes in the pathway or whether the majority of genes in the pathway are affected. At this point, you may think that “Control-F” is your solution, but you will have to click 100 times to find the expression levels of all 100 genes involved. While achievable with pure resilience, such manual methods are slow and error-prone.

Here, I recommend using Python to streamline these query processes, allowing you to quickly query gene expression levels with just a few lines of code. To provide a relevant example, we will use the dataset published by Zak DE et al., PNAS, where I have previously described the research design and outcomes. In short, the authors investigated the gene expression profiles of seropositive and seronegative subjects receiving the MRKAd5/HIV vaccine.

I will describe my personal journey towards the final code that I now use in my own applications. First, I assigned the variable Ad5_seroneg to the data containing the fold-change (FC), p-value (pval) and adjusted p-value for the seronegative subjects, using the command below:

import csv
import numpy as np
import pandas as pd
Ad5_seroneg = pd.read_csv('/Users/kuanrongchan/Desktop/Ad5_days_seroneg_Daniel_2012.csv')
Ad5_seroneg.head(2)

Output file is as follows:

  Gene_Symbol  FC_6h_vs_0  pval_6h_vs_0  adj_pval_6h_vs_0  FC_1d_vs_0  pval_1d_vs_0  adj_pval_1d_vs_0  FC_3d_vs_0  pval_3d_vs_0  adj_pval_3d_vs_0  FC_7d_vs_0  pval_7d_vs_0  adj_pval_7d_vs_0
0 A1BG         -1.13408    0.157781      0.50037           -1.03742    0.609912      0.804213          -1.13944    0.143099      0.678669          -1.04868    0.589913      0.997579
1 A1CF         -1.04199    0.383750      0.63888            1.07475    0.065579      0.170118           1.01920    0.686074      0.929992          -1.01369    0.772500      0.997579

Now that we are sure the data is properly loaded, we will use the .loc command to find the expression of interferon-related genes, after setting the Gene_Symbol column as the index using the “set_index” command. In this case, I used IFNA14, IFNB1 and IFNG as a proof of concept and saved these genes under IFN_keys. IFN_keys is then used to query the dataset previously saved under Ad5_seroneg.

IFN_keys = ["IFNA14", "IFNB1", "IFNG"]
Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

Output file is as follows:

Gene_Symbol  FC_6h_vs_0  pval_6h_vs_0  adj_pval_6h_vs_0  FC_1d_vs_0  pval_1d_vs_0  adj_pval_1d_vs_0  FC_3d_vs_0  pval_3d_vs_0  adj_pval_3d_vs_0  FC_7d_vs_0  pval_7d_vs_0  adj_pval_7d_vs_0
IFNA14        1.13902    0.244650      0.542742          -1.03740    0.685729      0.852202           1.05765    0.614147      0.909833          -1.03569    0.752260      0.997579
IFNB1        -1.04120    0.660108      0.821655           1.04045    0.597005      0.794862           1.01744    0.850486      0.970073          -1.13391    0.175099      0.997579
IFNG          1.01809    0.927340      0.968282          -1.18567    0.291696      0.518150           1.00124    0.994982      0.998432           1.16644    0.435068      0.997579

Fantastic! The .loc function solved the problem! You could theoretically replace the gene names saved under IFN_keys with any query gene list you desire.

At this point, you may think that everything is solved and we can proceed to use this command freely. However, I was curious to find out what would happen if the query list contained a gene that is not found in the microarray. Would this command still work? To illustrate this point, I replaced IFNA14 with zzz, since we know there is no gene called zzz.

IFN_keys = ["zzz", "IFNB1", "IFNG"]
Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

Output file is as follows:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-54-01a2a359e521> in <module>
      1 IFN_keys = ["zzz", "IFNB1", "IFNG"]
----> 2 Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    893 
    894             maybe_callable = com.apply_if_callable(key, self.obj)
--> 895             return self._getitem_axis(maybe_callable, axis=axis)
    896 
    897     def _is_scalar_access(self, key: Tuple):

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1111                     raise ValueError("Cannot index with multidimensional key")
   1112 
-> 1113                 return self._getitem_iterable(key, axis=axis)
   1114 
   1115             # nested tuple slicing

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1051 
   1052         # A collection of keys
-> 1053         keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1054         return self.obj._reindex_with_indexers(
   1055             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1264             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1265 
-> 1266         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1267         return keyarr, indexer
   1268 

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1319 
   1320             with option_context("display.max_seq_items", 10, "display.width", 80):
-> 1321                 raise KeyError(
   1322                     "Passing list-likes to .loc or [] with any missing labels "
   1323                     "is no longer supported. "

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['zzz'], dtype='object', name='Gene_Symbol'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

To me, this was a disaster! Entering a single gene that is not found in the dataset means that I get no useful output from Python at all. It also means that querying genes across datasets generated on different platforms could be challenging. If I wanted to make this code work for me, I knew I needed to find a solution.

I took some time to find a solution, and eventually realised that the reindex function in pandas can be used to circumvent this issue. I tried this code instead:

IFN_keys2 = ["zzz", "IFNB1", "IFNG"]
Ad5_seroneg.set_index('Gene_Symbol').reindex(IFN_keys2)

Output is as follows:

             FC_6h_vs_0  pval_6h_vs_0  adj_pval_6h_vs_0  FC_1d_vs_0  pval_1d_vs_0  adj_pval_1d_vs_0  FC_3d_vs_0  pval_3d_vs_0  adj_pval_3d_vs_0  FC_7d_vs_0  pval_7d_vs_0  adj_pval_7d_vs_0
Gene_Symbol
zzz                 NaN           NaN               NaN         NaN           NaN               NaN         NaN           NaN               NaN         NaN           NaN               NaN
IFNB1          -1.04120      0.660108          0.821655     1.04045      0.597005          0.794862     1.01744      0.850486          0.970073    -1.13391      0.175099          0.997579
IFNG            1.01809      0.927340          0.968282    -1.18567      0.291696          0.518150     1.00124      0.994982          0.998432     1.16644      0.435068          0.997579

Amazing outcome! Now, genes that are not measured in the gene expression dataset are simply returned as NaN, which allows me to use the dropna function later to drop these rows. Most importantly, I still get the output for the remaining genes that I am interested in querying. With this command, querying genes or gene sets across different datasets becomes much simpler, faster and more reproducible.
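
For completeness, here is a minimal sketch of chaining reindex with dropna (reusing the Ad5_seroneg DataFrame and the IFN_keys2 list from above; how='all' drops only the rows that are entirely NaN):

# Query a gene list and drop genes that are not present in the dataset
IFN_keys2 = ["zzz", "IFNB1", "IFNG"]
queried = Ad5_seroneg.set_index('Gene_Symbol').reindex(IFN_keys2)
queried_clean = queried.dropna(how='all')   # removes rows that are all NaN, e.g. 'zzz'
print(queried_clean)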

Featured
Posted in art

Worth spending the time?

Sometimes, one may wonder why we spend so much time and effort writing or troubleshooting code. I thought this diagram best illustrates the point. If a task is done repetitively or takes a long time to do, it is probably worth the effort to automate it, not only to save time but also to increase reproducibility. So when coding fails, do not give up, because in the larger scheme of things, you are actually saving precious time, and even helping others save theirs 🙂

Featured
Posted in influenza, Resource

Systems biology of immunity to MF59-adjuvanted versus nonadjuvanted trivalent seasonal influenza vaccines in early childhood

Figure showing that the differentially expressed genes (DEGs) were significantly higher in adults compared to children under 2 years of age. Overall, the adjuvanted inactivated influenza vaccine generated more DEGs than the unadjuvanted counterpart. Source from Nakaya et al., PNAS, 2016.

The trivalent inactivated influenza vaccine (TIV) is poorly immunogenic and has low effectiveness in children under 2 years of age. In this study by Nakaya et al., 2016, the investigators studied the innate, adaptive and molecular responses to the seasonal TIV and the MF59-adjuvanted TIV (ATIV) in 90 children 14 to 24 months of age.

At day 28 post-boost, HAI geometric mean titers were higher in ATIV vaccinees compared with TIV. However, the magnitude of the plasmablast response was much lower in children than in adults. TIV and ATIV induced a similar magnitude of IgM- and IgG-secreting plasmablast cells in children.

MF59-adjuvanted TIV vaccine induced a higher expansion of multicytokine-producing vaccine-specific CD4+ T cells, mostly producing TNF-α and IL-2.

ATIV induced more alterations in gene expression at days 1, 3, and 7 post-boost compared with TIV. However, the numbers of DEGs were much smaller than in an adult cohort. There was high heterogeneity in the individual host responses, which could have accounted for the fewer DEGs detected.

Gene set enrichment analysis (GSEA) on individual subjects’ responses revealed that at day 1 post-vaccination, the positively enriched modules included M75 “antiviral interferon signature,” S5 “dendritic cell signature,” and M16 “Toll-like receptor (TLR) and inflammatory signaling.” Among the negatively enriched modules, several modules related to T-cell function, NK cells and the cell cycle were found, including M7.1 “T cell activation,” M7.2 “enriched in NK cells,” and M4.0 “cell cycle and transcription.”

GSEA was also applied to genes ranked by their correlation with HAI titers. The enriched modules include M75 “antiviral interferon signature,” M165 “enriched in activated dendritic cells,” and several others, which were positively correlated with the HAI response at days 1, 7, and 28 following the booster shot.

The kinetics of enrichment of two blood transcriptomic modules associated with antibody-secreting cells (M156.1 and S3) show that enrichment is higher at days 7 and 28 for ATIV. The M156.1 module “plasma cell, immunoglobulins” was only significant at day 28 post-boost, suggesting that unlike in adults, the expansion of antibody-secreting cells may occur after day 7.

Overall, these findings highlight the differences in host immune responses between children and adults.

Featured
Posted in Data visualisation, python, Venn diagram

Using python codes to study unique or shared elements in Venn diagrams

As elaborated in my previous post, Venn diagrams are useful for identifying unique and shared genes between different treatment conditions. While Venny remains one of my favourite tools for plotting Venn diagrams, inputting lists of genes can be cumbersome when the gene lists are long (e.g. >1,000 genes) or when they are provided as .csv or .txt files.

Here, I introduce how you can use Python to identify the unique and overlapping genes, using the set operations & (intersection), | (union) and - (difference). An example is shown below:

# Assume each letter represents an individual gene
# Treatment_A and Treatment_B are the sets of genes from two conditions

Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}

Total = list(Treatment_A | Treatment_B)
print(sorted(Total))

Intersection = list(Treatment_A & Treatment_B)
print(sorted(Intersection))

Treatment_A_no_B = list(Treatment_A - (Treatment_A & Treatment_B))
print(sorted(Treatment_A_no_B))

Treatment_B_no_A = list(Treatment_B - (Treatment_A & Treatment_B))
print(sorted(Treatment_B_no_A))
# Output from the above commands is as follows:
['A', 'B', 'C', 'D', 'E', 'F']
['B', 'C', 'D']
['A']
['E', 'F']

Based on the counts (1 gene unique to treatment A, 2 unique to treatment B, and 3 shared), we can then plot a proportionate Venn diagram using the following code:

from matplotlib_venn import venn2 
from matplotlib import pyplot as plt

venn2(subsets = (1, 2, 3), set_labels = ('Treatment A', 'Treatment B'))
plt.title("Treatment A vs Treatment B")
plt.show()
Output file after plt.show() command
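
Alternatively, rather than hard-coding the counts, matplotlib_venn can also accept the two sets directly and work out the subset sizes itself. A minimal sketch, reusing the Treatment_A and Treatment_B sets defined above:

from matplotlib_venn import venn2
from matplotlib import pyplot as plt

Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}

# Passing the sets directly lets matplotlib_venn compute the subset sizes (1, 2, 3) automatically
venn2([Treatment_A, Treatment_B], set_labels=('Treatment A', 'Treatment B'))
plt.title("Treatment A vs Treatment B")
plt.show()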

Edited on 5 July 2021, with contribution from Gabrielle 🙂 You can also plot a Venn diagram with 3 different comparisons using the commands below:

import matplotlib.pyplot as plt
from matplotlib_venn import venn3

set1 = set(["A", "B", "C", "D"])
set2 = set(["B", "C", "D", "E", "F"])
set3 = set(["C", "D", "E","G","H","J"])

venn3([set1, set2, set3], ('Treatment_A', 'Treatment_B', 'Treatment_C'))

plt.show()
Output file after the plt.show() command
# To obtain the elements in the shared and unique treatments
Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}
Treatment_C = {"C", "D", "E","G","H","J"}

Total = list(Treatment_A | Treatment_B | Treatment_C)
print(sorted(Total))

Intersect_all = list(Treatment_A & Treatment_B & Treatment_C)
print(sorted(Intersect_all))

Treatment_A_no_BC = list(Treatment_A - (Treatment_B | Treatment_C))
print(sorted(Treatment_A_no_BC))

Treatment_B_no_AC = list(Treatment_B - (Treatment_A | Treatment_C))
print(sorted(Treatment_B_no_AC))

Treatment_C_no_AB = list(Treatment_C - (Treatment_A | Treatment_B))
print(sorted(Treatment_C_no_AB))
# Output file
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J']
['C', 'D']
['A']
['F']
['G', 'H', 'J']

To use these commands for your own application, simply replace the letters with your own gene lists. Happy coding!
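
If your gene lists are supplied as .csv or .txt files rather than typed into the script, they can be read into sets first. A minimal sketch, assuming hypothetical files treatment_A_genes.txt and treatment_B_genes.txt with one gene symbol per line:

import pandas as pd

# Read one gene symbol per line into a set (file names are hypothetical)
Treatment_A = set(pd.read_csv('treatment_A_genes.txt', header=None)[0])
Treatment_B = set(pd.read_csv('treatment_B_genes.txt', header=None)[0])

# The same set operations as above then apply
print(sorted(Treatment_A & Treatment_B))   # shared genes
print(sorted(Treatment_A - Treatment_B))   # genes unique to treatment A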

Featured
Posted in About me

1,000 all-time views!

Happy to announce that my blog has reached 1,000 all-time views. I would like to take this opportunity to thank everyone for reading my entries, and I hope that you have enjoyed the content thus far!

Featured
Posted in MMR, Resource

Transcriptomic profiling of different responder types in adults after a Priorix vaccination

Study design by Bartholomeus et al., Vaccine, 2020

Measles, mumps and rubella (MMR) used to be common childhood diseases until the advent of the MMR vaccine, which significantly reduced MMR cases worldwide. According to the CDC, a single dose of the MMR vaccine is 93% effective against measles, 78% effective against mumps and 97% effective against rubella. However, unprotected individuals, either vaccine non-responders or those whose immunological memory has waned over time, can still be at risk of measles, mumps or rubella infection.

In this report, Bartholomeus et al. study the transcriptomic signatures underlying differences in vaccine responses between individuals. 40 healthy individuals (between 20 and 30 years old), who had received at least one MMR vaccine dose earlier in life, were recruited. Whole blood was collected at days 0, 3 and 7 for transcriptomic analysis, while later time points of days 21, 150 and 365 were used to evaluate humoral responses to the vaccines (see schematic at top).

Subjects were categorised into four groups:

(a) High Ab: individuals with a relatively high Ab titer before vaccination that remained stable or further increased after vaccination
(b) Low Ab: individuals with a relatively low Ab titer before vaccination that remained low after vaccination
(c) Long response: individuals with a relatively low Ab titer before vaccination that increased after vaccination and stayed stable within the first year
(d) Peak response: individuals with a relatively low Ab titer before vaccination that increased at day 21, and then decreased by days 150 and 365

Since the 3 live-attenuated viruses are administered together, the analysis was also stratified according to the responses to each virus.

Comparisons between categories yielded few differentially expressed genes (DEGs). Interestingly, the time comparisons within each response group showed that individuals in the long response group had the most DEGs compared to the other categories, especially at day 7 post-vaccination. Pathway analyses revealed that most of these upregulated genes are related to the innate and humoral immune responses.

Featured
Posted in Data visualisation, gProfiler

gProfiler: A useful webtool to query multiple lists of differentially expressed genes

Manhattan plot from gProfiler showing the enriched pathways modulated by AS01B at day 1 post-vaccination (first dose) with the inactivated hepatitis B vaccine. Different adjuvants (AS01B, AS01E, AS03, AS04 and Alum) were tested; no DEGs were detected for AS04 and Alum. Data based on Laurane et al., 2020, Science Translational Medicine.

Comparing multiple variables with microarray or RNAseq often generates different sets of differentially expressed genes (DEGs). To examine the similarities and differences in molecular pathways modulated by these sets of DEGs, tools such as Enrichr or DAVID can be used to investigate the individual pathways involved. A Venn diagram can then be used to extract the intersecting or unique pathways involved.

While such comparisons are easy to perform if you only have a few to make, this method of analysis becomes time-consuming as the number of queries grows. Therefore, in scenarios where multiple comparisons of DEGs are required, I recommend using gProfiler for analysis of enriched pathways. One major strength of gProfiler is the ability to submit multiple query lists at once, simply by adding a “>” symbol between your gene lists.

To provide more context, consider that you are comparing the day 1 responses between the different adjuvants, published by Laurane et al. After determining the DEGs for the different conditions based on a defined cutoff (fold change > 1.3 and adjusted p-value < 0.05 in this case), we can input the DEGs for the respective conditions into gProfiler. Once the “Run as multiquery” box is checked, gProfiler generates Manhattan plots for all the different conditions, showing the enriched pathways detected across various databases such as GOMF, GOBP, GOCC, KEGG and Reactome, as well as their respective p-values (see top figure for the output). The corresponding p-values for the other conditions are reflected on the y-axis.

The p-values for all enriched pathways can also be analysed in table form (see below). Indeed, consistent with the publication findings, AS01B and AS01E behave more similarly to each other than to AS03, although most of the pathways are shared among all three adjuvants.

Pathway analysis with Reactome using gProfiler, comparing pathways modulated by AS01B, AS01E and AS03 at day 1 post-vaccination

Besides gene queries against the default databases, you can also use gProfiler to query against other custom databases by uploading a custom GMT file. In addition, gProfiler offers other functions: for instance, g:Convert can convert gene IDs into gene symbols, and g:SNPense can provide information on SNP queries.

Overall, gProfiler is a useful webtool that allows quick assessment of enriched pathways across multiple variables. A limitation, however, is that the genes involved in the pathways are not shown. To circumvent this, the results from gProfiler can be complemented with Enrichr to identify the gene transcripts responsible for the enriched pathways.

Featured
Posted in About me

Certificate of completion: Data visualization fundamentals with python

Certification of completion for data visualisation fundamentals by Dataquest

Happy to have completed the course on data visualisation fundamentals by Dataquest. Key packages required are:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The numpy and pandas packages are important for storing and manipulating datasets, as well as for performing correlation calculations. The matplotlib.pyplot package is useful for plotting scatterplots, histograms and cumulative frequency distributions. Finally, the seaborn package can plot multi-dimensional data involving multiple variables. The nitty-gritty details of how to adjust graph settings, such as adding titles, changing x- or y-axis ticks, rotating labels, and adjusting symbol sizes and colours, are also covered in this course. I am satisfied with the learning experience, which is definitely much more important than the certificate itself. To all others trying to learn coding like me, Gambatte!
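
To illustrate the kind of plots covered, here is a minimal sketch using seaborn's built-in 'tips' example dataset (my own choice of dataset, not part of the course material):

import matplotlib.pyplot as plt
import seaborn as sns

# Load one of seaborn's bundled example datasets
tips = sns.load_dataset("tips")

# Scatterplot with a third variable encoded as colour, plus a title and axis labels
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.title("Tips versus total bill")
plt.xlabel("Total bill ($)")
plt.ylabel("Tip ($)")
plt.show()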

Featured
Posted in Malaria, Resource

Expression of genes associated with immunoproteasome processing of major histocompatibility complex peptides is indicative of protection with adjuvanted RTS,S malaria vaccine

Figure highlighting that subjects with increased expression of genes involved in formation of the immunoproteasome complex after RTS,S vaccination are associated with increased protection from malaria challenge. Specific genes involved highlighted in red. Source from MT Vahey, JID, 2010.

RTS,S is a recombinant subunit vaccine expressed in yeast, representing the central repeat and C-terminal portion of the Plasmodium falciparum circumsporozoite protein (CSP) covalently linked to the S antigen of hepatitis B virus. The overall protective efficacy of RTS,S/AS01B in malaria-naive adults is ∼50%.

The differences in the spectrum of protective responses to adjuvanted RTS,S, together with the ability to perform an experimental challenge with mosquito-borne falciparum malaria, offer an opportunity to decipher the host genomic mechanisms involved in vaccine efficacy against malaria.

In this study by MT Vahey et al., 2010, 39 vaccine recipients were assessed at study entry, on the day of the third vaccination, at 24 h, 72 h, and 2 weeks after vaccination, and on day 5 after challenge. Of 39 vaccine recipients, 13 were protected and 26 were not.

Most DEGs were detected at day 1 post-vaccination, with the majority of these transcripts associated with proinflammatory responses. Most of these responses resolved by day 3 post-vaccination.

At day 5 after malaria challenge, prediction analysis of microarray (PAMR) identified 393 genes that were differentially expressed between protected and unprotected subjects. Most of these genes are related to apoptosis and the cell cycle.

At 2 weeks after the third vaccination but before malaria challenge, 32 genes belonging to the proteasome degradation pathway separated protected from unprotected subjects. Examples include PSME2 (a component of the 11S regulator), as well as PSMA4, PSMB6 and PSMB9, which form the immunoproteasome involved in processing peptides for presentation by the MHC complex.

This study may highlight the critical role of T cells as a correlate of protection against malaria. Data are deposited at the Gene Expression Omnibus under accession number GSE18323.

Featured
Posted in Introduction, python

Five reasons why python is better than excel in processing big omics datasets

Big data analysis with python from pandas package can be easier than excel. Picture source from https://images.app.goo.gl/wq34AegEWn73BcCQ8

My self-learning with Dataquest has allowed me to use Python to build pandas DataFrames that support basic data queries, creating columns based on pre-defined formulas, and graph plotting. Despite my basic knowledge of Python, I have begun to appreciate that Python allows more efficient data processing than Excel. Some of the main advantages of using Python are:

1. Processed DataFrames are assigned to variables in code, removing the need to store multiple files or spreadsheets.

For instance, a data scientist will usually want to convert their values to Z-scores before doing principal component analysis, hierarchical clustering or plotting heatmaps. The conversion in Excel is easy if you are analysing a small dataset. However, when analysing a dataset of 20 variables by 20,000 genes (400,000 numerical values in total), the user usually has to save the result as a separate Excel file to prevent lagging issues. In Python, the converted values can simply be assigned to a variable, for example:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Toy dataset standing in for an expression matrix
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

# Z-score transform: each column is centred to mean 0 and scaled to unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

In this case, the Z-score transformed data is assigned to scaled_data, and you can subsequently run a PCA or plot a heatmap from this variable.
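
As an example of that follow-up step, here is a minimal sketch of running PCA on scaled_data with scikit-learn (the choice of two components is mine, for illustration):

from sklearn.decomposition import PCA

# Reduce the z-scored data to two principal components
pca = PCA(n_components=2)
pc_scores = pca.fit_transform(scaled_data)

print(pc_scores)                       # coordinates of each sample in PC space
print(pca.explained_variance_ratio_)   # proportion of variance captured by each PC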

2. Python is less prone to errors than Excel

Using Excel to analyse small datasets is convenient and remains my preferred choice. However, with large datasets, dragging formulas across cells is not ideal: a lot of time and effort is needed when you have a large number of variables, and dragging also behaves poorly when there are empty cells in your dataset. Finally, combining “filter” or “sort” functions with formulas is cumbersome in Excel, especially for large datasets. In Python, the results of a formula can easily be stored in a new column. For instance, to add new columns of log2FC and -logP values to a dataset (assigned as dataset in this case), we just need the commands below:

import numpy as np

# Commands to generate the log2FC and -logP columns
dataset['log2FC'] = np.log2(dataset['FC'])
dataset['negative_log_pvalue'] = np.log10(dataset['p_value'])*-1

We can then use these new columns to plot graphs, such as volcano plots and p-value histograms.

3. Python is more time- and space-efficient in processing large datasets

The maximum size of an Excel worksheet is 1,048,576 rows by 16,384 columns, which means that you will have to store all your genes from microarray or RNAseq in rows. Even then, performing Excel functions on large datasets can take a long time (sometimes several minutes). In some scenarios, due to the long processing time, Excel may shut down, causing you to lose your updated file. This means saving your files regularly, and each save can itself take a few minutes, so a lot of time is wasted on waiting. In Python, particularly with JupyterLab, the code is saved automatically and DataFrame dimensions are limited only by your computer's memory. This also means you can manipulate and transpose the data whenever you like, within seconds.

Another point to note is that the code can be saved and shared, which means the same code can be quickly applied to any other large dataset. This is unlike Excel, which requires you to repeat the dragging of formulas for every dataset you analyse.

4. Searching for variables or genes across different datasets is quicker in Python than in Excel

Within one dataset, Excel can do multiple gene queries with the INDEX function. However, for multiple large datasets, this is not efficient: you have to open multiple spreadsheets and then apply the INDEX function to each and every one, which takes a long time if you have many spreadsheets. In Python, once you assign your gene names as the index column, the same query code can be applied to every dataset. All the files can be stored neatly alongside your JupyterLab or Jupyter Notebook, which makes them easily accessible from Python code.
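
As a rough sketch of what this could look like in practice (the file names, column name and gene list below are hypothetical):

import pandas as pd

genes_of_interest = ["IFNB1", "IFNG"]                  # hypothetical query list
files = ["study1.csv", "study2.csv", "study3.csv"]     # hypothetical dataset files

# Apply the same query to every dataset; missing genes simply come back as NaN
for f in files:
    df = pd.read_csv(f, index_col="Gene_Symbol")
    print(f)
    print(df.reindex(genes_of_interest))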

5. Python can draw more sophisticated graphs than Excel

With more people sharing their code and developing packages, Python can draw more sophisticated graphs, which is important for visualising large datasets. For instance, Python can perform hierarchical clustering and plot heatmaps more easily than Excel, and the complex functions required for principal component analysis are also easier to run in Python.
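
For example, a clustered heatmap takes a single call with seaborn. A minimal sketch, using random numbers to stand in for a z-scored genes-by-samples matrix:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy z-scored data standing in for a genes x samples matrix
df = pd.DataFrame(np.random.randn(20, 6),
                  index=[f"gene_{i}" for i in range(20)],
                  columns=[f"sample_{j}" for j in range(6)])

# Hierarchical clustering of rows and columns plus a heatmap, in one call
sns.clustermap(df, cmap="coolwarm")
plt.show()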

Overall, based on these advantages, I believe that Python is an essential programming language for data scientists. I would hence strongly recommend learning Python if you are considering a career in data science.

Featured
Posted in python, Research tools

Introducing Jupyterlab for efficient bioinformatic workflows

In future blog entries, I will be explaining how the Python programming language can be used for data visualisation. To run some of the commands and code, I recommend downloading JupyterLab, a web-based interactive development environment for Jupyter notebooks, code, and data. The advantages of JupyterLab are that it is free to download, open-source, flexible, and supports multiple languages, including Julia, Python and R. If needed, you can even install an SQL plug-in to execute SQL commands in JupyterLab.

The process of downloading is simple:

  1. Visit the Anaconda website, and click on the download icon. Download the 64-bit Graphical Installer based on your computer OS.
  2. Open the package after download, and follow the instructions to install Anaconda on your computer.
  3. Launch the Anaconda-Navigator by clicking the icon. For Mac users, the icon should appear under the “Applications” tab.
  4. Launch JupyterLab and choose a Python 3 notebook, which will open at the notebook server’s URL.

You can import your .csv or .txt data files directly into JupyterLab to start analysing your dataset in Python. You can also export your notebook as a Jupyter Interactive Notebook (.ipynb file format) if you would like to share the code with another person. I believe that JupyterLab will enable more efficient workflows, regardless of tool or language.

Featured
Posted in Adenovirus, Resource

Merck Ad5/HIV induces broad innate immune activation that predicts CD8+ T-cell responses but is attenuated by preexisting Ad5 immunity

Figure (top) showing the protein–protein interaction network linking CD8+ T-cell response-associated genes and constituents of functional gene modules differentially regulated by MRKAd5/HIV. Study by Zak DE et al., PNAS, 2012.

The Step Study revealed that the MRKAd5/HIV vaccine had poor efficacy in protecting against HIV-1 infection, especially in subjects with pre-existing Ad5 immunity. However, no clear mechanisms have been identified to date to explain why the seropositive subjects showed reduced efficacy. In this manuscript, Zak DE et al. describe the possible mechanisms of vaccine failure using a systems biology approach.

35 healthy HIV-1-uninfected adults, 20–50 years old, were recruited for the study. PBMCs from 10 subjects (3 seropositive and 7 seronegative) were isolated, and transcript levels were analysed by microarray at 4–6, 24, 72, and 168 h post-vaccination.

Strong induction of innate immune responses was detected at the gene and protein level, peaking at 24 hours post-vaccination and resolving by 72 hours post-vaccination. In addition, protein levels of CXCL-10, ITAC, monocyte chemoattractant protein-1 (MCP-1) and MCP-2, as well as the immunoregulatory IL-10 and IL-1Ra, were upregulated.

A 92% concordance between the in vivo and in vitro induction of IFN response genes was observed, indicating that the immune responses observed were modulated by the viral infection. Genes not captured in vitro were related to cell lineages, possibly reflecting the lack of cell trafficking in vitro.

When comparing subjects with Ad5 nAb titers ≤ 200 and >200, significant attenuation of the innate immune responses was seen in seropositive subjects compared to seronegative subjects.

Two chemokines, MCP-1 and MCP-2, discriminated strong and moderate CD8 responders from subjects with poor or no CD8 responses.

Interestingly, the gene signatures that correlated with CD8 T-cell responses were seen at day 3, where members of the “Cytotoxicity,” “T cells,” and “Lymphoid lineage” modules were positively correlated (see above figure). This finding may support the notion that prolonged upregulation of the innate immune gene transcripts could influence CD8 responses.

Data are deposited in the Gene Expression Omnibus (GEO) database under accession number GSE22822.

Featured
Posted in Research tools

Beware of excel autocorrect feature when analysing gene lists

Excel is a great way of handling big datasets, but beware of this problem: the software autocorrects some gene symbols into dates (SEPT2, for example, can be converted into a date). Without a gene description column in your datasheet, these conversions are irreversible, as there is no way to recover the gene name from the date.

One way to circumvent this issue is to always open a blank Excel sheet and import the .txt or .csv file via “File -> Open.” At the last step of the import wizard (step 3; see figure at top), you have to assign the appropriate columns as text. If your gene names are not in the first column, you will have to manually click on the column containing them.

If you have accidentally forgotten to open your Excel file in this manner, the other option is to reassign the gene symbols based on the gene descriptions. To do this, sort your gene symbols in ascending order, which will quickly surface the gene names that have been converted to dates. Then format the entire column as text. Finally, based on the table below, assign the gene symbols back to the correct ones.
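
A related safeguard is to handle gene lists in Python instead, since pandas does not silently convert gene symbols into dates when reading a file. A minimal sketch, assuming a hypothetical gene_expression.csv with a Gene_Symbol column:

import pandas as pd

# Gene symbols are read in as plain strings, so symbols like SEPT2 stay intact
df = pd.read_csv('gene_expression.csv', dtype={'Gene_Symbol': str})
print(df['Gene_Symbol'].head())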