Posted in python

Python workflow for omics data analysis

Bioinformatics Python code for volcano plots, DEG identification and heatmap analysis

Thank you all for the great interest in my blog. For the year end, I have decided to compile the blog posts I have made this year so we can revise what we have learnt before starting the next year with new content. If you have followed my posts, a lot of emphasis has been placed on the analysis of omics data, including volcano plots, DEG identification and heatmap analysis. I have thus re-written some of these posts on Medium, curating the code so that readers can implement it easily. Furthermore, a concrete example dataset is provided on GitHub to bring context to the code used!

For Python code on volcano plots, please click here.

For Python code on DEGs, please click here.

For Python code on clustergrams and heatmaps, please click here.

All of the above visualisation tools are important before doing any downstream analysis, including pathway enrichment analysis, interaction maps, gene networks, etc., which I will be covering next year!

Posted in Data visualisation, IPyWidgets, python

Identifying differentially expressed genes with ipywidgets

Gene expression (or transcriptomics) profiling is the most common type of omics data. The identification of differentially expressed genes (DEGs) from transcriptomics data is critical to understanding the molecular driving forces or identifying the molecular biomarkers behind biological phenotypes. The full article on DEGs is summarised here.

Since we need to find out the most appropriate cut-off to identify DEGs, interactive Python code that allows users to manipulate thresholds will be the most suitable for defining the selection criteria for DEGs. In this blog entry, we will use IPython widgets (ipywidgets), which can generate sliders that allow users to set cutoffs or thresholds interactively. The output based on the selected cutoffs is then generated in real time for users to evaluate their suitability.

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.

The summary of the study can be found here, and the processed dataset, which was analysed with Partek Genomics Suite, can be found on GitHub. The fold change, ratio, p-value and adjusted p-value (q-value) are calculated with respect to baseline (timepoint = 0).

To install ipywidgets, simply type the command below (you only need to do this once):

pip install ipywidgets

We can run ipywidgets within JupyterLab or Jupyter Notebook. The instructions for downloading them can be found here. Next, we import the packages required for analysis:

import numpy as np
import pandas as pd
import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objects as go

Next, we will load and inspect the processed dataframe from GitHub. It is also important to set the gene column as the index, so that we can count DEGs by row later on. The commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output shows the p-values (pval), adjusted p-values (qval), ratios, and fold changes (fc) for the 6-hour, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

Next, we will execute the ipywidgets command:

from ipywidgets import interact, interact_manual

@interact
def show_DEGs(column1=['fc_6h', 'fc_1d', 'fc_3d', 'fc_7d'],
              x=(1, 3, 0.1),
              column2=['qval_6h', 'qval_1d', 'qval_3d', 'qval_7d']):
    # Count genes passing both the fold-change and q-value cutoffs
    print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))

The above code may seem intimidating, but we can break it down into bite-sized pieces.

First, we import the tools from ipywidgets required to build the interactive features. The @interact decorator automatically creates a dropdown menu for each list of column names, and a slider for the numeric range.

In this case, we created 2 columns (column1 and column2) to represent the fold-change and adjusted p-value (q-value) columns respectively. The argument x=(1,3,0.1) means that the cutoff applied to column1 (the fold-change values in this case) can be varied from 1 to 3, at intervals of 0.1. Finally, we print out the total number of DEGs (both up- and downregulated) with absolute fold change greater than x and q-value < 0.05.
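As an aside, if the tuple-to-widget behaviour is unfamiliar, here is a minimal sketch, independent of our dataset, showing how @interact turns a (min, max, step) tuple into a slider and a list into a dropdown (the column names here are just placeholders):

from ipywidgets import interact

# A (min, max, step) tuple becomes a slider; a list becomes a dropdown
@interact(column=['fc_6h', 'fc_1d'], cutoff=(1, 3, 0.1))
def preview(column, cutoff):
    print(f'Selected column: {column}, cutoff: {cutoff:.1f}')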

Returning to our show_DEGs function, the output is as follows:

I have taken multiple snapshots so that we can appreciate the number of DEGs across the different days. In this case, the total number of DEGs for each time point with fold change > 2 and adjusted p-value < 0.05 is shown. Consistent with the volcano plot, we see that the greatest number of DEGs is at day 1.

We can change the x value to 1.5 to visualise the number of DEGs with fold-change > 1.5 and adjusted p-value < 0.05. The output is as follows:

As expected, we see more DEGs as the fold-change cut-off becomes less stringent. Note that the number of DEGs at day 1 increased by approximately 3-fold when we reduced the fold-change cutoff from 2 to 1.5, which is quite a big difference! It is thus more reasonable to use a fold-change of 1.5 in this case, to avoid losing too much data.

After deciding that a fold-change cutoff of 1.5 and q-value < 0.05 is most appropriate for our analysis, we will plot these details in a stacked bar chart. We first calculate the number of upregulated (fold-change > 1.5, adjusted p-value < 0.05) and downregulated (fold-change < -1.5, adjusted p-value < 0.05) DEGs for each time point. The commands are as shown:

DEGs_up_6h = len((df[(df['fc_6h'] > 1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_down_6h = len((df[(df['fc_6h'] < -1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_up_1d = len((df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_down_1d = len((df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_up_3d = len((df[(df['fc_3d'] > 1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_down_3d = len((df[(df['fc_3d'] < -1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_up_7d = len((df[(df['fc_7d'] > 1.5) & (df['qval_7d'] < 0.05)]).index)
DEGs_down_7d = len((df[(df['fc_7d'] < -1.5) & (df['qval_7d'] < 0.05)]).index)
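
As a side note, the eight assignments above can be written more compactly with a loop; a minimal sketch, assuming the same column naming scheme as our dataframe:

# Compact alternative: loop over time points to collect up/downregulated counts
timepoints = ['6h', '1d', '3d', '7d']
DEGs_up = [len(df[(df[f'fc_{tp}'] > 1.5) & (df[f'qval_{tp}'] < 0.05)]) for tp in timepoints]
DEGs_down = [len(df[(df[f'fc_{tp}'] < -1.5) & (df[f'qval_{tp}'] < 0.05)]) for tp in timepoints]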

Finally, we can plot the results in a stacked bar format using the Plotly graphing library, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

days = ['6hrs', 'day 1', 'day 3', 'day 7']
fig = go.Figure(data=[
    go.Bar(name='downregulated', x=days, y=[DEGs_down_6h, DEGs_down_1d, DEGs_down_3d, DEGs_down_7d]),
    go.Bar(name='upregulated', x=days, y=[DEGs_up_6h, DEGs_up_1d, DEGs_up_3d, DEGs_up_7d])
])
fig.update_layout(
    barmode='stack',
    title='DEGs in seronegative subjects',
    yaxis_title='Number of DEGs',
    font=dict(family='Arial', size=18))
fig.show()

I have made a few more customisations to the graph, including adding a title and a y-axis title, and changing the font to make the graph look more professional. The output is as shown:

At one glance, you can quickly appreciate that day 1 has the most DEGs compared to the other time points, and that there are more upregulated than downregulated DEGs. You can also mouse over the data points to obtain the precise number of up- and downregulated DEGs at every time point.

Posted in Data visualisation, python

So you think you have finished analysing your data? Think again…

It’s a great sense of achievement to know that you have finished analysing your dataset. However, it is almost always a good habit to check through your analysis and, at times, even re-analyse the data in a different way to understand it better and make the analysis more robust. In this blog entry, I will highlight some key considerations to take into account when checking through your data analysis.

1) Could your variables be categorised differently, or expressed as continuous variables for data analysis?

Two different ways of showing association of age with antibody responses

In some instances, the explanatory variable can be expressed as either a categorical or a continuous variable. If that is the case, I would recommend analysing the data both ways. For example, consider a research question that studies the effect of age on vaccine antibody response. Based on the literature, you may have classified subjects into two groups: (I) elderly subjects, 65 years of age and above; (II) young subjects, below 65 years of age. The antibody responses are then compared between these two groups. However, the analysis can also be done in a different way: you can instead plot antibody response against age, and evaluate whether the correlation is significant. While the two analysis methods ask the research question in slightly different ways (see figure above for clarity), where the former investigates differences between young and elderly subjects and the latter investigates whether antibody responses are correlated with age, it is usually worth doing both analyses to gain a deeper understanding of your data. A minimal sketch of both approaches is shown below.
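To make this concrete, here is a minimal sketch of both approaches in Python. The data below are simulated purely for illustration, and the Mann-Whitney U test is just one reasonable choice for the group comparison:

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu, pearsonr

# Simulated data for illustration: 50 subjects with age and antibody titre
rng = np.random.default_rng(0)
age = rng.integers(20, 90, size=50)
antibody = 10 - 0.05 * age + rng.normal(0, 1, size=50)
data = pd.DataFrame({'age': age, 'antibody': antibody})

# Approach 1: age as a categorical variable (young vs elderly, cutoff at 65)
young = data.loc[data['age'] < 65, 'antibody']
elderly = data.loc[data['age'] >= 65, 'antibody']
stat, p_group = mannwhitneyu(young, elderly)
print(f'Young vs elderly: p = {p_group:.3f}')

# Approach 2: age as a continuous variable (correlation)
r, p_corr = pearsonr(data['age'], data['antibody'])
print(f'Correlation with age: r = {r:.2f}, p = {p_corr:.3f}')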

2) Are the cutoff values in your data analysis justified?

What cutoff should you try?

We often choose default values as cutoffs for assigning categories or for filtering data. However, the default values may not be appropriate for every dataset. This is where a deeper knowledge of the analysis tools is helpful, as you will better appreciate the assumptions applied in the analyses. For instance, consider a scenario where you are interested in finding out which molecular pathways are regulated by virus infection. You decide to choose a 1.5 fold-change and 0.05 p-value cutoff to identify the differentially expressed genes for pathway analysis. But have you wondered why a 1.5 fold-change was chosen? What happens if you pick a 2 fold-change cutoff instead? To better understand the cutoff values and how they may affect the analysis, it is usually wise to analyse the results using different cutoff values and, eventually, justify why the final chosen cutoff is most suitable for your analysis, as in the sketch below.
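
A minimal sketch of this idea, assuming a hypothetical results table with 'fc' and 'pval' columns (the values below are simulated for illustration):

import numpy as np
import pandas as pd

# Hypothetical results table with fold changes and p-values
rng = np.random.default_rng(1)
results = pd.DataFrame({'fc': rng.normal(0, 2, size=1000),
                        'pval': rng.uniform(0, 1, size=1000)})

# Tabulate DEG counts at several fold-change cutoffs before committing to one
for cutoff in [1.5, 2.0, 3.0]:
    n = len(results[(results['fc'].abs() > cutoff) & (results['pval'] < 0.05)])
    print(f'fold-change > {cutoff}: {n} DEGs')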

3) How did you manage the missing values and outliers? Can we do better?

Handling missing data and outliers is important to ensure that data interpretation is sound

Different individuals may manage missing values and outliers differently, which may impact the interpretation of the data. Hence, it is good practice to handle these missing values and outliers in several ways and see how the different methods impact the analysis. Consider a situation where at least one explanatory variable is missing in 50% of the samples. If the solution is to remove any sample with missing data, then 50% of the data is removed; compared with a method that imputes the mean value, the interpretation could be very different. If unsure, it is usually a good habit to try these different strategies, as in the sketch below, and then evaluate which strategy best represents the data.
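
A minimal sketch contrasting the two strategies on a small hypothetical dataframe:

import numpy as np
import pandas as pd

# Small hypothetical dataframe with missing values
values = pd.DataFrame({'var1': [1.0, np.nan, 3.0, 4.0],
                       'var2': [2.0, 5.0, np.nan, 8.0]})

# Strategy 1: drop any sample (row) with a missing value
dropped = values.dropna()
print(f'Samples retained after dropping: {len(dropped)} of {len(values)}')

# Strategy 2: impute each column's mean instead of discarding samples
imputed = values.fillna(values.mean())
print(imputed)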

4) Besides the primary research goal, what more can we learn from the data?

Sometimes, being too focused on what you are chasing may cause you to miss other important knowledge that the data provides

As data scientists, we are often deeply engrossed in our research interests. Our interpretation of the data could thus be biased towards what we are interested to “see.” While this may be a reasonable approach to directly answer the research objectives, it is also important to look out for other trends within the data. Some of these trends could allow us to gain new biological insights, or help us understand potential confounding variables. For instance, in a dataset, you may find that BMI is associated with increased severity of disease. However, with more exploratory analysis, you may also find that the high-BMI subjects are also the elderly individuals. With this additional finding, the interpretation may not be solely that BMI influences disease severity; disease severity could be influenced by age as well. To separate the effects of BMI and age, further statistical analysis that reduces the impact of confounding variables is required, as sketched below.
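One common approach is a multiple regression that adjusts for the confounder. Here is a minimal sketch on simulated data (the variable names and effect sizes are invented for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration: age drives both BMI and severity
rng = np.random.default_rng(2)
age = rng.integers(20, 90, size=100)
bmi = 18 + 0.1 * age + rng.normal(0, 2, size=100)
severity = 0.05 * age + rng.normal(0, 1, size=100)
clinical = pd.DataFrame({'age': age, 'bmi': bmi, 'severity': severity})

# Regress severity on BMI alone, then on BMI while adjusting for age
unadjusted = smf.ols('severity ~ bmi', data=clinical).fit()
adjusted = smf.ols('severity ~ bmi + age', data=clinical).fit()
# BMI's apparent effect shrinks once age is included in the model
print(unadjusted.params['bmi'], adjusted.params['bmi'])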

5) Look beyond statistical significance for biological meaning

As scientists, we have been trained to look at p-values: essentially, p-values < 0.05 are deemed significant and we can reject the null hypothesis. However, in biology, it is sometimes pertinent to look beyond statistical significance, as magnitude changes are also important. Consider a study where a researcher showed that drug X significantly reduces weight by 1 kg. Consider another drug Y that causes a mean 10 kg loss, but with weak significance (p > 0.05). Of the two drugs, drug Y could be more biologically meaningful in reducing weight than drug X, despite not reaching significance. A small simulated illustration is shown below.
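
A minimal sketch of this scenario with simulated weight changes (all numbers are invented for illustration):

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
# Drug X: small but very consistent 1 kg loss in a large trial
drug_x = rng.normal(loc=-1, scale=0.5, size=100)
# Drug Y: large but variable 10 kg loss in a small trial
drug_y = rng.normal(loc=-10, scale=12, size=8)

# Drug X reaches significance easily; drug Y may not, despite the larger effect
for name, change in [('X', drug_x), ('Y', drug_y)]:
    t, p = ttest_1samp(change, 0)
    print(f'Drug {name}: mean change = {change.mean():.1f} kg, p = {p:.4f}')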

In summary, inspecting and cleaning data should take up the largest share of time in data analysis, but the effort is worth it because it ensures that the downstream interpretation of the data is more complete. This is particularly true for machine learning, where the data input directly impacts the resulting models. With the above 5 points, I hope I have convinced you of the importance of checking and revisiting your data before declaring your analysis “done.” Give it a shot; I am confident you will find the time spent worthwhile.

Posted in correlation, Data visualisation, python

Correlation matrix – A useful tool to visualise relatedness between multiple variables

Correlations are a great way to show whether variables are associated with each other. The strength of correlation is often represented by the correlation coefficient, with values ranging between -1.0 and 1.0. A positive value indicates a positive association, while a negative value indicates a negative association. While there is no fixed definition, a correlation value larger than 0.7 or smaller than -0.7 is usually considered a strong correlation. To quickly visualise the correlations between multiple variables, correlation matrices can be used. In my opinion, the Python programming language allows individuals to plot these matrices most quickly, which is what I will elaborate on in this blog entry.

First, we load the dataset and packages into Python. I have assigned the headers to be the variables (ab30d, ab60d, ab90d and tcell7d) and the first column to be the subject ID column. We will assign this dataframe as “test_1”.

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import data with 10 subjects into Python. Only the top 2 rows will be shown in the output
test_1 = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv', index_col=0)
test_1.head(2)

Output file is as follows:

subject     ab30d     ab60d     ab90d   tcell7d
1        0.284720  0.080611  0.997133  0.151514
2        0.163844  0.818704  0.459975  0.143680

Next, we can visualise the correlation coefficient across all variables using the command:

corrM = test_1.corr()
corrM

Output file showing all the correlation coefficient values is as follows:

            ab30d     ab60d     ab90d   tcell7d
ab30d    1.000000  0.378926  0.042423 -0.324900
ab60d    0.378926  1.000000 -0.489996 -0.374259
ab90d    0.042423 -0.489996  1.000000 -0.392458
tcell7d -0.324900 -0.374259 -0.392458  1.000000

We can also determine the p-values of the correlations using the following commands:

from scipy.stats import pearsonr

# Pairwise Pearson p-values for every pair of columns
pvals = pd.DataFrame([[pearsonr(test_1[c], test_1[y])[1] for y in test_1.columns] for c in test_1.columns],
                     columns=test_1.columns, index=test_1.columns)
pvals

Output file is as follows:

               ab30d     ab60d         ab90d   tcell7d
ab30d   6.646897e-64  0.280214  9.073675e-01  0.359673
ab60d   2.802137e-01  0.000000  1.505307e-01  0.286667
ab90d   9.073675e-01  0.150531  6.646897e-64  0.261956
tcell7d 3.596730e-01  0.286667  2.619556e-01  0.000000

Finally, to plot the correlation matrix with the correlation coefficient values within, we execute the following command:

corr = test_1.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
# Define a diverging palette so positive and negative correlations are distinct
# (any diverging colormap works here)
cmap = sns.diverging_palette(220, 10, as_cmap=True)
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, cmap=cmap, mask=mask, center=0, square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5})

Output file is a beautiful correlation matrix:

At one glance, you can quickly tell which variables are positively correlated and which are negatively correlated. Depending on your correlation matrix size, you can tailor the size of the output diagram by changing the figsize dimensions within the code. Correlation matrices can also be complemented with pair plots, which I will elaborate on in a future blog entry.

Posted in Data visualisation, KEA3

KEA3: A web-based tool to predict involvement of upstream kinases based on a list of gene or protein queries.

(Left) Top 10 kinases that could be involved in transcriptional responses to the yellow fever live-attenuated vaccine (YF-17D). (Right) Interactive map showing the relationship of the different kinases identified in the left panel. Query list is based on the upregulated differentially expressed genes at day 7 post YF-17D vaccination from Querec et al., Nature Immunology, 2009.

Protein kinases catalyze the transfer of a phosphate group from ATP to the threonine, serine, or tyrosine residues of other proteins. This addition of a phosphate group can influence substrate protein activity, stability, localization, and interactions with other molecules. While kinases can be suitably targeted by drugs, characterization of the cell kinome is not easy, as intracellular staining with phospho-specific antibodies is required.

A plausible solution is to leverage transcriptomic data to down-select kinase candidates for further validation. Here, I introduce Kinase Enrichment Analysis 3 (KEA3), developed by MV Kuleshov et al., which is a webserver application that predicts upstream kinases from a list of gene or protein queries.

The KEA3 background database contains measured and predicted kinase-substrate interactions (KSI), kinase-protein interactions (KPI), and interactions supported by co-expression and co-occurrence data. By integrating KSIs and KPIs across data sources, KEA3 produces a composite ranking that improves the recovery of the expected kinases. In addition, the relationships between the top predicted kinases are displayed in an interactive map.

I tested the ability of KEA3 to evaluate the possible kinases involved in the host transcriptomic responses to the YF-17D vaccine published by Querec et al., Nature Immunology, 2009. Taking the upregulated differentially expressed genes at day 7 post-YF-17D administration as the query list, the top 10 kinase hits are displayed in the figure at the top of this post. Notably, these kinases appear to be highly interconnected, and the predicted EIF2AK2-JAK1-JAK2-TYK2 axis suggests that these kinases are involved in triggering type-I interferon responses. This finding is consistent with previous studies showing that YF-17D induces strong interferon and antiviral responses.

Overall, KEA3 is a user-friendly tool that allows users to quickly predict upstream kinases based on a list of proteins or genes. While experimental validation will be needed to confirm the involvement of these predicted kinases, the tool provides informed predictions that can guide future studies.

Posted in About me

Hiring a bioinformatics or computer scientist who specialises in database query systems


My lab, from the Emerging Infectious Diseases (EID) Programme, Singapore, is seeking to employ one postdoctoral researcher. This position presents an exciting opportunity to be a global leader in the application of systems biology and omics technologies to guide the development of vaccines and therapeutics against emerging infectious diseases. The applicant will work with a team of scientists and clinicians to use data science to streamline clinical trials. The job scope will include, but is not limited to, curating existing omics datasets from our clinical trials and other published clinical trials, creating a resource database documenting biological data, and designing algorithms/Jupyter notebooks for query optimisation. The position comes with the potential for further career development, as well as engagement in translational research.

Applicants must have a PhD in computer science, bioinformatics, biological sciences or any other related discipline. Each applicant should have a strong background in statistical computer languages, especially R, Python and SQL, to enable them to build database systems of high availability and quality. Knowledge of query optimisation, analysis and troubleshooting is also highly preferred. With support from a network of scientist and clinician collaborators, applicants should be highly motivated researchers who wish to make an impact in research.

The post will be subject to the Duke-NUS Medical School terms and conditions of service, and the salary will be based on the applicant’s experience and track record. To apply, please visit the following address: https://careers.nus.edu.sg/DukeNUS/job/Outram-Park-Research-Fellow-%28OREIDCKR%29-Outr/4110044/

Posted in Data visualisation, Volcano Plot

Genoppi: A new tool to plot volcano plots with annotated functions

Volcano plot generated using Genoppi. Data is from Vahey et al., J Infect Dis., 2010, demonstrating the types of proteins most significantly induced in protected individuals receiving the RTS,S malaria vaccine. Functions are annotated from HGNC, ranked from the greatest to the lowest number of hits.

As elaborated in my previous post, volcano plots are a great way of visualising the differentially expressed genes (DEGs) that are most affected by a particular treatment (compared with an untreated control). However, a limitation is that volcano plots do not directly annotate whether the most influential genes perform similar functions. Here, I introduce Genoppi, an open-source computational tool (R package source code: github.com/lagelab/Genoppi) that allows plotting of interactive volcano plots with the corresponding gene functions derived from HGNC, GO or MSigDB (see example figure on top). In addition, by assigning a bait protein of interest, the tool is able to identify the interacting partners that are significant on the volcano plot. The interaction partners are compiled from InWeb_InBioMap, iRefIndex, or BioPlex, which together include data from >40,000 scientific articles. If you are interested in finding out whether SNPs (single-nucleotide polymorphisms) could play a role in your dataset, you may also assign a GWAS study from the NHGRI-EBI GWAS catalog to the dataset.

There are several other functions that Genoppi offers which are not elaborated in this blog post. Interested users may visit the publication or the guide provided on the website for details. Overall, while Genoppi is likely to be most utilised by scientists interested in proteomics and interactome research, my experience suggests that it can potentially be applied more broadly to transcriptomics analyses as well.

Posted in Hepatitis B, Resource

Transcriptional profiles of adjuvanted hepatitis B vaccines display variable interindividual homogeneity but a shared core signature

Clustergram showing similarities and differences in gene expression for different adjuvants. The day 1 and day 31 (day 1 after the 2nd dose) responses for the AS01 and AS03 adjuvants cluster together, whereas AS04 and Alum cluster together with the late responses (days 3-7). AS01 and AS03 induce a greater number of DEGs than AS04 and Alum. Source: Laurane et al., Science Translational Medicine, 2020

Adjuvants are an important component of inactivated vaccines, as they trigger the activation of innate immune responses that are essential for eliciting adaptive immune responses. In this study, Laurane et al., Science Translational Medicine, 2020, investigated how the transcriptional profile in humans is altered by 5 different types of adjuvants, namely:

  1. AS01B (MPL + QS-21)
  2. AS01E (half dose of MPL + QS-21)
  3. AS03 (α-tocopherol + squalene)
  4. AS04 (MPL+Alum)
  5. Alum

They also examined the inter-individual variations in vaccine responses to the different adjuvants. The antigen used is the Hepatitis B virus (HBV) surface antigen (HBs).

112 healthy subjects were given two intramuscular injections of the HBs antigen (18 for AS01B, 23 for AS01E, 28 for AS03, 22 for AS04, 21 for Alum). The whole blood transcriptome was profiled at 0 and 3-6 hrs, and 1 and 14 days after the first dose, and at 0 and 3-6 hrs, and 1, 3 and 7 days after the second dose. The second dose was given 30 days after the first dose.

Most DEGs were detected at day 1 and day 31 post-vaccination. The highest numbers of DEGs were seen with AS01B and AS01E, followed by AS03; far fewer DEGs were seen with AS04 and Alum. The upregulated DEGs for AS01B, AS01E and AS03 comprised mostly genes related to innate immunity and interferon responses. This is interesting because the AS01 and AS03 components are different, and yet they induce convergent transcriptional responses.

Interestingly, heterogeneity among individual responses was higher for AS03 recipients than for AS01 recipients.

Some of the day 1 genes related to the innate response, interferon pathway and NK cells correlated with humoral and cellular responses at day 44. However, these accounted for <40% of the correlated genes; genes from other pathways, such as cell cycle and metabolism, were also associated with the humoral and cellular responses at day 44.

Posted in influenza, Resource

Temporal Dynamics of Host Molecular Responses Differentiate Symptomatic and Asymptomatic Influenza A Infection

Figure describing the 8 distinct clusters of genes that are temporally modulated (0-108hrs) in the symptomatic (Sx) and the asymptomatic (Asx) volunteers after challenge with influenza virus. Source: Yongsheng H et al., PLOS Genetics, 2011.

A deep understanding of the molecular mechanisms underlying severe viral disease in humans is critical for the development of drugs and therapeutics. Controlled human infection studies, in which volunteers are intentionally infected with a pathogen, can advance our understanding of disease pathogenesis, as the incubation time, time-course of disease progression, symptomatic rates and immune responses can be closely monitored. In this manuscript published by Yongsheng H et al., PLOS Genetics, 2011, the authors investigated how transcripts are differentially modulated in symptomatic and asymptomatic subjects after challenge with a live influenza (H3N2/Wisconsin) strain.

17 healthy adults were inoculated with a live influenza (H3N2/Wisconsin) strain at 4 different doses (1:10, 1:100, 1:1000, 1:10000). 9 subjects were symptomatic, whereas 8 were asymptomatic. Changes in host peripheral blood gene expression were measured at -12, 0, 12, 21, 29, 36, 45, 53, 60, 69, 77, 84, 93, 101 and 108 hrs.

Increasing doses of virus did not correlate with increased symptomatic outcomes. This finding is congruent with our previous findings showing that vaccine viremia does not influence symptomatic rates (Chan et al., JCI Insight, 2017). In contrast, gene signature patterns were strongly associated with disease severity.

Using EDGE with a false discovery rate (FDR) significance level (q-value) < 0.01, 5,076 genes were identified as temporally changed between the symptomatic and asymptomatic phenotypes. Self-organizing maps (SOM) identified eight distinct classes with differential expression dynamics (see figure above).

Cluster 3 reveals genes that are uniquely increased in symptomatic subjects. These include PRR genes such as Toll-like receptor 7 (TLR7), the RNA helicase RIG-I, and interferon induced with helicase C domain 1 (IFIH1), in addition to 11 genes from the TLR signaling pathway, including MyD88, TRAF6 and STAT1. NOD1, RIPK2, NOD2, NLRP3, CASP5, CASP1 and IL1B were also increased in symptomatic subjects but not in the asymptomatic subjects.

Cluster 6 identified genes that are uniquely increased in the asymptomatic subjects. The enriched pathways are related to the cellular response to oxidative stress. These include superoxide dismutase (SOD1) and serine/threonine kinase 25 (STK25 or SOK1), which have been linked to the anti-oxidant/stress response and reduced concentrations of ROS. This cluster also contains genes related to ribosomal synthesis, suggesting that protein biosynthesis could be protective against severe disease.

Both raw and normalized gene expression data are available at GEO (GSE30550).