Posted in Data visualisation, python

Making interactive and executable notebooks/dashboards with Lux and Voilà packages

As described in my previous blog posts, pair plots and correlation matrices are useful for quickly analysing relationships between multiple variables. However, when analysing datasets with more than 10-20 variables, these methods have their limitations. One critical limitation is the difficulty of assessing the effects of 2 or more categorical variables. Another is the difficulty of visualising a large number of variables plotted together in one plot.

Here, I propose two Python packages that make graph plotting less tedious. The first is the Voilà package, which supports Jupyter interactive widgets, and the second is the Lux package, which integrates a Jupyter interactive widget that allows users to quickly browse through large collections of data directly within their Jupyter notebooks. Both packages display data interactively, which makes them highly versatile for fast experimentation with data. I will share the features of these 2 packages as I run through the procedures for using them:

First, we need to install the packages. To install Voilà, execute the following command:

pip install voila

Note that if you have version 3 of JupyterLab installed, Voilà will also be automatically installed as a JupyterLab extension:

Next, to install Lux, enter the following command in terminal:

pip install lux-api

With these packages installed, you are now ready to use these applications. We will import all the necessary packages and use the iris dataset from Plotly Express as a proof of concept:

import csv
import lux
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df = px.data.iris()
df.head()

The output file is as follows:

   sepal_length  sepal_width  petal_length  petal_width species  species_id
0           5.1          3.5           1.4          0.2  setosa           1
1           4.9          3.0           1.4          0.2  setosa           1
2           4.7          3.2           1.3          0.2  setosa           1
3           4.6          3.1           1.5          0.2  setosa           1
4           5.0          3.6           1.4          0.2  setosa           1

Next, launch Voilà using the Jupyter Notebook extension (indicated by the arrow):

This should allow you to toggle freely between Pandas and Lux (indicated by the arrow):

The Lux package automatically generates pairwise plots of all variables in your dataframe, allowing you to focus on the scatterplots that show correlated relationships. Under the “Distribution” tab, you can quickly visualise the distribution of the data:

For the correlation graphs, you can click on the plots you are interested in and then click the magnifying glass, which indicates “intent.” This zooms into the graph further, allowing you to see how the other variables (in this case, species) interact with your scatterplot. This immediately provides you with a three-dimensional view of the dataset.

As you can begin to appreciate, the Voilà package is a powerful tool for serving interactive widgets directly from a local address. You can even publish the dashboard online, allowing other users to explore the dataset interactively according to their research intent. Combined with the Lux package, this allows users, even those without a bioinformatics background, to zoom into interesting scatterplots for further analysis.

Posted in Data visualisation

Visualising distributions with pie charts, donut plots and stacked graphs

Pie charts can be used to display distribution of variables within datasets.

A pie chart uses pie slices to display the relative size or proportion of different variables. In general, pie charts are easiest to read when the pie has no more than 7 slices.

Consider a dataset where 10 patients are infected with virus X, 6 of whom experience no symptoms. The remaining 4 patients are symptomatic: 2 experience headaches, 1 fever and 1 rashes. Here, you can simply use a pie chart to display the data:

I have used the Prism software to plot the pie chart. Note that you can use colour variants to emphasise the subjects with symptoms. It is a good habit to include the total counts when presenting pie charts, as these charts are meant to display only relative sizes. Note also that the pie chart starts at the 12 o’clock position for easy visualisation.
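If you prefer to stay in Python rather than Prism, a minimal matplotlib sketch of the same pie chart might look like this (the counts are taken from the virus X example above; labels and formatting choices are mine):

```python
import matplotlib.pyplot as plt

# Counts from the virus X example: 6 asymptomatic, 2 headache, 1 fever, 1 rash
labels = ["No symptoms", "Headache", "Fever", "Rash"]
counts = [6, 2, 1, 1]

fig, ax = plt.subplots()
# startangle=90 with counterclock=False starts the first slice at 12 o'clock
# and draws clockwise; autopct prints the relative size on each slice
ax.pie(counts, labels=labels, startangle=90, counterclock=False,
       autopct=lambda pct: f"{pct:.0f}%")
ax.set_title("Symptoms of patients infected with virus X (n = 10)")
plt.show()
```

Including the total count in the title follows the habit mentioned above, since the slices themselves only convey relative sizes.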

Sometimes, when pie charts are difficult to read, another option is to plot a donut plot. The advantage of a donut plot is that you can provide additional information in the middle. However, it is important to make sure the donut is not so thin that visualisation becomes difficult. An example is as shown below:
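In matplotlib, a donut is simply a pie whose wedges have a reduced width; a hedged sketch, again using the virus X counts, with the total displayed in the hole:

```python
import matplotlib.pyplot as plt

labels = ["No symptoms", "Headache", "Fever", "Rash"]
counts = [6, 2, 1, 1]

fig, ax = plt.subplots()
# wedgeprops width < 1 hollows out the centre; keep the ring thick enough to read
ax.pie(counts, labels=labels, startangle=90, counterclock=False,
       wedgeprops={"width": 0.4})
ax.text(0, 0, "n = 10", ha="center", va="center")  # annotation in the hole
plt.show()
```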

Finally, it can sometimes be more useful to use a stacked bar chart to show the distribution of the data. A stacked bar chart is particularly handy when you want to group variables with similar properties (in this case, you can group all the symptomatic patients together). An example is as follows:
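A stacked bar can be sketched in matplotlib by stacking each symptomatic group on top of the previous one with the `bottom` argument (counts again from the virus X example):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Asymptomatic patients form the base of the bar
ax.bar("Patients", 6, label="No symptoms")

# Stack the symptomatic groups on top, tracking the running height
bottom = 6
for label, count in [("Headache", 2), ("Fever", 1), ("Rash", 1)]:
    ax.bar("Patients", count, bottom=bottom, label=label)
    bottom += count

ax.set_ylabel("Number of patients")
ax.legend()
plt.show()
```

Stacking makes the symptomatic group visually contiguous, which is the grouping advantage mentioned above.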

There are no fixed rules on which of these charts is most suitable for data visualisation. When in doubt, there is no harm in plotting all of them and deciding later which is the most suitable for publications or presentations!

Posted in Data visualisation, Pathway analysis, python

Dot plots as a visualisation tool for pathway analysis

As described in my previous blog post, heatmaps allow quick visualisation of various measurements between samples. Magnitude differences are usually represented as changes in hue or colour intensity. However, if you want to include another parameter, this can be challenging. Imagine a scenario where you identified 5 Gene Ontology Biological Process (GOBP) pathways that are significantly different between the infected and uninfected samples over a course of 3 days. To plot them on a graph, you can choose to negative log-transform the adjusted p-values and then plot a heatmap as shown below:

However, if you want to also display the combined score from your EnrichR analysis, you will have to plot another heatmap:

As shown in the example above, you will need 2 figure panels to fully describe your pathway analysis. A more elegant way to display these results is to use a dot plot. In simple terms, dot plots are a form of data visualisation that plots data points as dots on a graph. The advantage of plotting data points as dots rather than the rectangles of a heatmap is that you can alter the size of the dots to add another dimension to your data visualisation. For instance, in this specific example, you can make the dot size proportional to the negative log p-value and use the hue of the dots to represent the enrichment score. This also means you only need one graph to fully represent the pathway analysis!

Dot plots can be easily plotted in Python, using either the Seaborn package or the Plotly Express package. I personally prefer the Plotly Express package as the syntax is simpler and you can mouse over the dots to display the exact values. To plot the dot plot, we first load the standard packages:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

We then load a ‘test’ dataset from my desktop, in a format where the columns contain timepoints, pathway terms, negative logP values and combined scores. It is also a good habit to convert the timepoint column to a “string” datatype so that the x-axis does not include interpolated time-points such as 1.5 and 2.5.

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df['timepoint'] = pd.Series(df['timepoint'], dtype="string")
df.head(5)

Output is as follows:

  timepoint                          Term  adjusted_pval  Combined_score   neg_logP
0         1     Defense Response To Virus   3.942000e-25          87.531  24.404283
1         2     Defense Response To Virus   3.940000e-27         875.310  26.404283
2         3     Defense Response To Virus   5.000000e-02           2.000   1.301030
3         1  Defense Response To Symbiont   2.256000e-25          95.555  24.646661
4         2  Defense Response To Symbiont   2.260000e-27         955.550  26.646661

Finally, the dot plot can be plotted using the following syntax:

fig = px.scatter(df, x="timepoint", y="Term", color="Combined_score", size="neg_logP", color_continuous_scale=px.colors.sequential.Reds)

fig.show()

The output is a dot plot, where dot size is proportional to the negative log p-value and colour intensity to the combined score. You can customise the colours using the scales available at this website:

Because dot plots can add another dimension of analysis, most pathway analyses are presented as dot plots. However, I am sure there are other scenarios where dot plots can be appropriately used. Next time, before you plot multiple heatmaps, do consider the possibility of using dot plots as an alternative data visualisation tool!

Posted in Data visualisation, pair plots, python

Visualising multidimensional data with pair plots

Pair plots can be effectively used to visualise relationships between multiple variables, allowing you to zoom into the interesting aspects of the dataset for further analysis.

As discussed in my previous blog entry, correlation coefficients measure the strength of the relationship between 2 variables. In general, a correlation coefficient of 0.5 and above is considered strong, 0.3 moderate and 0.1 weak. To quickly visualise the relatedness between multiple variables, a correlation matrix can be used. However, there are some limitations. For instance, the correlation matrix does not convey the steepness of the slope between variables, nor does it display the relationship if the variables are not linearly correlated. In addition, the correlation matrix cannot show how categorical variables might interact with the other continuous variables.
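A quick numeric illustration of the first limitation: the correlation coefficient is blind to the steepness of the slope (toy data, made up for this demonstration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_steep = 10.0 * x    # slope 10
y_shallow = 0.1 * x   # slope 0.1

# Pearson r is identical for both, even though the slopes differ 100-fold
r_steep = np.corrcoef(x, y_steep)[0, 1]
r_shallow = np.corrcoef(x, y_shallow)[0, 1]
print(r_steep, r_shallow)
```

Both coefficients come out as 1.0, which is exactly why the scatterplots in a pair plot add information that the matrix alone cannot.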

Let me provide an example to illustrate my case. Suppose you want to understand whether flower sepal length, sepal width, petal length and petal width are correlated. Plotting a correlation matrix for these 4 variables allows you to visualise the strength of the relationships between these continuous variables. However, if you have another, categorical variable, for example flower species, and you are interested in how flower species interacts with all these parameters, then merely plotting a correlation matrix is inadequate.

To get around this problem, pair plots can be used. Pair plots allow us to see the distributions and pairwise relationships between multiple continuous variables. In Python, you can even use colours to visualise how a categorical variable interacts with the other continuous variables. Conveniently, plotting pair plots in Python is relatively straightforward. Let’s use the iris dataset from Plotly Express as a proof of concept.

We first load the required packages, the iris dataset from Plotly Express, and display all the variables from the dataset with the head() function:

# Load packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load iris dataset from plotlyexpress
df = px.data.iris()
df.head()

The output looks like this, with sepal length, sepal width, petal length and petal width as continuous variables and flower species as the categorical variable:

   sepal_length  sepal_width  petal_length  petal_width species  species_id
0           5.1          3.5           1.4          0.2  setosa           1
1           4.9          3.0           1.4          0.2  setosa           1
2           4.7          3.2           1.3          0.2  setosa           1
3           4.6          3.1           1.5          0.2  setosa           1
4           5.0          3.6           1.4          0.2  setosa           1

To visualise the correlations between the continuous variables, the pair plot can be quickly plotted with a single line of code. In addition, we can annotate the different flower species with different colours. Finally, rather than plotting the same scatterplot on both sides of the matrix, we will plot the lower half with the best-fit straight line so we can visualise the slope of each relationship. The code is as follows:

g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)

Output is as follows:

Isn’t it cool that you can immediately see that the variables are strongly correlated with each other, and these parameters are dependent on the flower species? As you may already appreciate, the pair plot is a useful tool to visualise and analyse multi-dimensional datasets.

Posted in Data visualisation, IPyWidgets, python

Making Jupyter notebooks interactive with IPyWidgets

There are many instances where scientists and data scientists have to identify cut-offs before deeper data analyses. For instance, what fold-change and adjusted p-value threshold is most appropriate for identifying DEGs? For pre-ranked pathway analysis, what is an appropriate cut-off for enrichment scores to be considered significant? For RNAseq, what is the minimum count we should filter out without trimming away too many transcripts? Executing these tasks can take a long time, even with code, because multiple lines are needed to identify the most suitable cut-off.

Consider that you are unsure of an appropriate cut-off for identifying DEGs. The usual way is to query the number of DEGs at a fold-change of at least 1.5 with adjusted p-value < 0.05. Imagine you identify only 30 DEGs among 20,000 genes. Your cut-off may be too stringent, so this time you relax it to a fold-change of at least 1.2 with adjusted p-value < 0.05, which requires another line of code. Imagine you now find 1,000 genes that satisfy the criteria, too many for pathway analysis. Again, this means another query at a fold-change of at least 1.3, and the cycle repeats until you find a suitable cut-off.
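The back-and-forth described above can be captured in a small helper, so that each new cut-off is a function call rather than a rewritten query. A sketch with simulated data (the column names and numbers are hypothetical, not from the vaccine dataset used below):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for a DEG table: fold-changes and adjusted p-values
df = pd.DataFrame({
    "fold_change": rng.lognormal(mean=0.0, sigma=0.5, size=20000),
    "adj_pval": rng.uniform(0.0, 1.0, size=20000),
})

def count_degs(df, fc_cutoff, q_cutoff=0.05):
    """Number of genes passing an absolute fold-change and adjusted p-value cut-off."""
    passes = (df["fold_change"].abs() >= fc_cutoff) & (df["adj_pval"] < q_cutoff)
    return int(passes.sum())

# One call per candidate cut-off, instead of one rewritten query each time
for fc in (1.2, 1.3, 1.5):
    print(fc, count_degs(df, fc))
```

IPyWidgets, introduced next, goes one step further by binding this kind of function to a slider.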

To circumvent this issue, I introduce IPython widgets (IPyWidgets), which turn Jupyter Notebooks from static documents into interactive dashboards, perfect for exploring and visualizing data. To install, simply type the command:

pip install ipywidgets

We first import the packages and load a dataset. Here, I load a dataset from Zak DE et al. from my BENEATH account. The study looks into the transcriptomic responses in human subjects across time after receiving the MRKAd5/HIV vaccine, and the details have been previously elaborated in my blog:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import beneath
df = await beneath.load_full("kuanrongchan/vaccine-repository/ad5-seroneg-daniel-2012")
df.info()
df.head(2)

The output for the header columns is as follows: fc indicates fold-change, pval indicates p-value with respect to day 0, and qval indicates adjusted p-value with respect to day 0. The time-points considered are 6 hours and 1, 3 and 7 days after vaccination.

   gene   fc006h  pval006h  qval006h   fc001d      pval001d  qval001d   fc003d  pval003d  qval003d   fc007d  pval007d  qval007d                   @meta.timestamp
0  A1BG -1.13408  0.157781  0.500370 -1.03742  6.099120e-01  0.804213 -1.13944  0.143099  0.678669 -1.04868  0.589913  0.997579  2021-07-16 00:24:39.972000+00:00
1  A1CF -1.04199  0.383750  0.638880  1.07475  6.557900e-02  0.170118  1.01920  0.686074  0.929992 -1.01369  0.772500  0.997579  2021-07-16 00:24:39.972000+00:00

We will use the following command to make an interactive slider. Assuming we are fixing adjusted p-value to be less than 0.05 and want to find out the number of DEGs based on a fold-change cut-off:

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

@interact
def show_DEGs(column1=['fc001d','fc003d','fc007d'], x=(0,3,0.1),column2=['qval001d','qval003d','qval007d']):
    return len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index)

Output is as follows:

With the @interact decorator, the IPyWidgets library automatically gives us dropdown menus and a slider for choosing the columns and the cut-off! In this case, we can move the slider from 0 to 3 and switch between the different time-points (6 hours and 1, 3 and 7 days post-vaccination) to see the number of upregulated DEGs at different fold-change cut-offs. The output shown has the slider at 1.3 and 1.5, which yielded 1,510 and 905 DEGs respectively.

You can also combine @interact with the .loc function to call out the genes and expression values that satisfy the criteria set by your slider:
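A hedged sketch of what that function might look like, shown here without the decorator so it runs outside Jupyter (the three-gene dataframe is made up for illustration; in a notebook you would decorate the function with @interact):

```python
import pandas as pd

# Hypothetical slice of the vaccine table, following the post's column naming
df = pd.DataFrame({
    "gene": ["A1BG", "A1CF", "A2M"],
    "fc001d": [1.6, -1.1, 2.3],
    "qval001d": [0.01, 0.60, 0.002],
}).set_index("gene")

# Under @interact, column1/column2 would become dropdowns and x a slider;
# here it is a plain function so the sketch runs as an ordinary script
def show_genes(column1="fc001d", x=1.5, column2="qval001d"):
    return df.loc[(df[column1].abs() > x) & (df[column2] < 0.05)]

print(show_genes())
```

Returning the filtered dataframe (rather than just its length) displays the gene symbols and values directly under the widgets.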

Finally, you can also use the IPyWidgets to do multiple comparisons, especially using the dropdown widget. For example, if you want to quickly find out if the transcript expression at two time-points are correlated, you can use IPyWidgets to create the dropdown menus for multiple comparisons:

@interact
def correlations(column1=list(df.select_dtypes('number').columns),
                 column2=list(df.select_dtypes('number').columns)):
    print(f"Correlation:{df[column1].corr(df[column2])}")

The output is two dropdown lists, that would allow you to quickly evaluate the correlation of any 2 variables in the dataframe:

The potential of IPyWidgets is limitless: it lets you scan through data very quickly and brings some life to the static Jupyter Notebook. You can also use it for graph plotting, scanning through files and more. The possibilities are endless… 🙂

Posted in correlation, Data visualisation, python

Correlation matrix – A useful tool to visualise relatedness between multiple variables

Correlations are a great way to show whether variables are associated with each other. The strength of association is often represented by the correlation coefficient, with values ranging between -1.0 and 1.0. A positive value indicates positive association, while a negative value indicates negative association. While there is no fixed definition, a correlation coefficient larger than 0.7 or smaller than -0.7 is usually considered a strong correlation. To quickly visualise the correlations between multiple variables, correlation matrices can be used. In my opinion, the Python programming language allows you to plot these matrices most quickly, which is what I will elaborate on in this blog entry.
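As a minimal illustration of the coefficient's range and sign (toy numbers, not from the dataset below):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]  # perfect positive association
r_neg = np.corrcoef(x, -x + 5)[0, 1]     # perfect negative association
print(r_pos, r_neg)
```

A perfectly increasing relationship gives r = 1.0 and a perfectly decreasing one gives r = -1.0; real data falls somewhere in between.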

First, we load the dataset and packages into Python, where I have assigned the headers as variables (ab30d, ab60d, ab90d and tcell7d) and the first column as the subject ID column. We will assign this dataframe as “test_1”:

# Import packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import data with 10 subjects into python. In this case only the top 2 rows will be shown in the output
test_1 = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv',index_col=0)
test_1.head(2)

Output file is as follows:

            ab30d     ab60d     ab90d   tcell7d
subject
1        0.284720  0.080611  0.997133  0.151514
2        0.163844  0.818704  0.459975  0.143680

Next, we can visualise the correlation coefficient across all variables using the command:

corrM = test_1.corr()
corrM

Output file showing all the correlation coefficient values is as follows:

            ab30d     ab60d     ab90d   tcell7d
ab30d    1.000000  0.378926  0.042423 -0.324900
ab60d    0.378926  1.000000 -0.489996 -0.374259
ab90d    0.042423 -0.489996  1.000000 -0.392458
tcell7d -0.324900 -0.374259 -0.392458  1.000000

We can also determine the p-values of the correlation by using the commands, as follows:

from scipy import stats 
from scipy.stats import pearsonr
pvals = pd.DataFrame([[pearsonr(test_1[c], test_1[y])[1] for y in test_1.columns] for c in test_1.columns],
                     columns=test_1.columns, index=test_1.columns)
pvals

Output file is as follows:

                ab30d     ab60d         ab90d   tcell7d
ab30d    6.646897e-64  0.280214  9.073675e-01  0.359673
ab60d    2.802137e-01  0.000000  1.505307e-01  0.286667
ab90d    9.073675e-01  0.150531  6.646897e-64  0.261956
tcell7d  3.596730e-01  0.286667  2.619556e-01  0.000000

Finally, to plot the correlation matrix with the correlation coefficient values within, we execute the following command:

corr = test_1.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # red-blue diverging palette
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, cmap=cmap, mask=mask, center=0, square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5})

Output file is a beautiful correlation matrix:

At one glance, you can quickly tell which variables are positively correlated and which are negatively correlated. Depending on the size of your correlation matrix, you can tailor the size of the output diagram by changing the figsize dimensions within the code. Correlation matrices can also be complemented with pair plots, which I will elaborate on in a future blog entry.

Posted in Data visualisation, Heatmaps, python

The art of plotting heatmaps

Bar charts provide precise measurements with standard deviations, but using multiple bar charts to show trends for multiple genes or proteins is difficult to visualise. In such cases, a heatmap can be a useful alternative for visualising changes in the expression of multiple genes or proteins across different conditions. In a heatmap, the variation in magnitude of expression is often visualised by changing the intensity or hue. For instance, higher expression can be depicted using the colour red, and a greater increase in expression by a darker intensity. On the other hand, downregulation can be depicted using the colour blue (more colour-blind friendly than green), with a stronger intensity indicating greater downregulation. While plotting raw expression values in a heatmap is acceptable if the relative abundance of the genes and proteins is similar, data normalisation is required when the baseline gene or protein abundance differs.

For instance, consider that you want to plot expression levels of gene A and gene B in a single heatmap. Gene A has a baseline expression of 5 copies per cell, whereas gene B has a much higher baseline expression of 50 copies per cell. Let’s assume that during a virus infection, gene A expression is reduced to 2.5 copies/cell while gene B expression is increased 2-fold to 100 copies/cell. In this case, due to the high baseline abundance of gene B, plotting the heatmap using gene abundance will mask the differential expression of gene A after infection.

To display the relative expression of both genes A and B in a heatmap, data normalisation is recommended. One of the most common strategies is to plot a heatmap based on log2 fold-change instead of abundance. In this specific case, the log2 fold-changes for gene A and gene B after virus infection are -1 and 1 respectively, and these values can be nicely presented in a heatmap to depict the phenomenon. Another common alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation rescales each gene or protein across all conditions to mean = 0 and standard deviation = 1, which means that values above 0 are above average and values below 0 are below average. Hence, the Z-score transformation standardises the values between genes or proteins to the same mean and standard deviation, enabling users to easily compare expression values across genes and proteins. However, an important point to note is that the Z-score can only be applied to values obtained within a single experiment.
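The two normalisations described above can be sketched numerically for the gene A/gene B example (the abundances come from the scenario; NumPy is used for brevity):

```python
import numpy as np

# Raw abundance (copies/cell): baseline vs infected, for genes A and B
abundance = np.array([
    [5.0, 2.5],     # gene A: halves on infection
    [50.0, 100.0],  # gene B: doubles on infection
])

# log2 fold-change relative to baseline: symmetric around 0
log2fc = np.log2(abundance[:, 1] / abundance[:, 0])
print(log2fc)  # [-1.  1.]

# Row-wise Z-score: each gene rescaled to mean 0, SD 1 across its conditions
z = (abundance - abundance.mean(axis=1, keepdims=True)) / abundance.std(axis=1, keepdims=True)
print(z)
```

Either transformation puts the two genes on a comparable scale, so gene A's downregulation is no longer drowned out by gene B's higher baseline.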

To plot a heatmap, Microsoft Excel (under the format cell options) or specialised software such as PRISM can be used. However, most of these tools do not perform clustering of heatmaps. From my experience, besides Clustergrammer, Python is one of the fastest and easiest ways to plot a heatmap or clustergram. After loading the data, the heatmap can be quickly plotted with a single line of code, with various kinds of customisation. The step-by-step commands and code descriptions are as follows:

# Import packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading data into python as dataframe and check that the dataset is properly loaded
# Gene symbol is assigned as the index column
df = pd.read_csv('/Users/kuanrongchan/Desktop/Blog.csv',index_col=0)
df.head()

Output file:

Gene_Symbol  Control_1  Control_2  Control_3  Condition_1_1  Condition_1_2  Condition_1_3  Condition_2_1  Condition_2_2  Condition_2_3
HERC5             2.35       2.18       1.90           3.41           2.99           3.25           2.67           2.55           2.46
IFIT1             4.50       4.42       4.58           6.31           6.26           6.23           6.08           5.52           5.21
IFIT2             4.34       4.84       4.76           5.91           5.96           5.95           5.56           5.36           4.97
IFNA4             1.42       1.52       1.44           1.73           2.05           2.35           1.14           1.54           1.61
IFNA7             0.88       1.22       0.56           1.46           1.23           1.70           1.23           0.71           0.64


# Plot clustergram
g = sns.clustermap(df, z_score=0, cmap="vlag", figsize=(10, 4), method='average', cbar_kws={"ticks":[-1.5,0,1.5]}) 

# Various customisations can be performed on the above code as follows:
# z_score=0 ; tells seaborn to z-score each row and plot the Z-score values
# cmap="vlag" ; indicates a red-blue diverging colour map
# figsize ; indicates figure size dimensions
# method='average' ; default clustering linkage option
# robust=True ; can be added to set the colour range from robust quantiles instead of the extreme values
# cbar_kws={"ticks":[-1.5,0,1.5]} ; defines the tick points shown in the colour bar
# g.cax.set_visible(False) ; can be added to remove the colour bar
# yticklabels=True ; can be added for big heatmaps to display all gene symbols

Output:

At a single glance, you can quickly tell that condition 2 behaves similarly to the control condition, whereas condition 1 shows a marked increase in expression of IFNA4, IFNA7, HERC5, IFIT1 and IFIT2. The replicates are also clustered together. As you may already appreciate, this kind of data presentation is particularly useful for analysing big datasets.

Posted in Data visualisation, python, Venn diagram

Using python codes to study unique or shared elements in Venn diagrams

As elaborated in my previous post, Venn diagrams are useful for identifying unique and shared genes between different treatment conditions. While Venny remains one of my favourite tools for plotting Venn diagrams, inputting gene lists can be cumbersome when they are long (e.g. >1,000 genes) or when the gene lists are provided as .csv or .txt files.
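When the lists arrive as .csv files, the copy-pasting step can be skipped entirely by reading each file straight into a set. A sketch using throwaway files and made-up gene symbols (the helper name and file layout are my own assumptions):

```python
import csv
import os
import tempfile

# Write two throwaway one-column gene lists to mimic exported .csv files
paths = {}
for name, genes in {"A": ["TP53", "IFIT1", "MX1"], "B": ["MX1", "IFIT1", "OAS1"]}.items():
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as fh:
        csv.writer(fh).writerows([[g] for g in genes])
    paths[name] = path

def load_gene_set(path):
    """Read a one-gene-per-row .csv into a set, skipping blank rows."""
    with open(path, newline="") as fh:
        return {row[0].strip() for row in csv.reader(fh) if row and row[0].strip()}

set_A = load_gene_set(paths["A"])
set_B = load_gene_set(paths["B"])
print(sorted(set_A & set_B))
```

Once the lists are sets, the & and | operations described below apply directly, no matter how many thousands of genes each file holds.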

Here, I introduce how you can use Python to identify the unique and overlapping genes. The set operators & (intersection) and | (union) can be used. An example is shown below:

# Assume each alphabet as individual genes
# Treatment A and treatment B serve as variables

Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}

Total = list(Treatment_A | Treatment_B)
print(sorted(Total))

Intersection = list(Treatment_A & Treatment_B)
print(sorted(Intersection))

Treatment_A_no_B = list(Treatment_A - (Treatment_A & Treatment_B))
print(sorted(Treatment_A_no_B))

Treatment_B_no_A = list(Treatment_B - (Treatment_A & Treatment_B))
print(sorted(Treatment_B_no_A))
# Output from above commands are as follows:
['A', 'B', 'C', 'D', 'E', 'F']
['B', 'C', 'D']
['A']
['E', 'F']

Based on the counts, we can then plot a proportionate Venn diagram using the following code:

from matplotlib_venn import venn2 
from matplotlib import pyplot as plt

# subsets = (A only, B only, shared) = (1, 2, 3), matching the counts above
venn2(subsets = (1, 2, 3), set_labels = ('Treatment A', 'Treatment B'))
plt.title("Treatment A vs Treatment B")
plt.show()
Output file after plt.show() command
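Rather than hardcoding the subset sizes, they can be derived from the sets themselves; matplotlib_venn's venn2 expects them in (A only, B only, shared) order:

```python
# Derive the venn2 subset sizes directly from the sets instead of hardcoding them
Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}

subset_sizes = (len(Treatment_A - Treatment_B),   # in A only
                len(Treatment_B - Treatment_A),   # in B only
                len(Treatment_A & Treatment_B))   # shared
print(subset_sizes)
```

These can then be passed as venn2(subsets=subset_sizes, ...), so the diagram always stays in sync with the gene lists.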

Edited on 5 July 2021, with contribution from Gabrielle 🙂 You can also plot a Venn diagram with 3 different comparisons using the command below:

import matplotlib.pyplot as plt
from matplotlib_venn import venn3

set1 = set(["A", "B", "C", "D"])
set2 = set(["B", "C", "D", "E", "F"])
set3 = set(["C", "D", "E","G","H","J"])

venn3([set1, set2, set3], ('Treatment_A', 'Treatment_B', 'Treatment_C'))

plt.show()
Output file after the plt.show() command
# To obtain the elements in the shared and unique treatments
Treatment_A = {"A", "B", "C", "D"}
Treatment_B = {"B", "C", "D", "E", "F"}
Treatment_C = {"C", "D", "E","G","H","J"}

Total = list(Treatment_A | Treatment_B | Treatment_C)
print(sorted(Total))

Intersect_all = Treatment_A & Treatment_B & Treatment_C
print(sorted(Intersect_all))

Treatment_A_no_BC = list(Treatment_A - (Treatment_B | Treatment_C))
print(sorted(Treatment_A_no_BC))

Treatment_B_no_AC = list(Treatment_B - (Treatment_A | Treatment_C))
print(sorted(Treatment_B_no_AC))

Treatment_C_no_AB = list(Treatment_C - (Treatment_A | Treatment_B))
print(sorted(Treatment_C_no_AB))
# Output file
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J']
['C', 'D']
['A']
['F']
['G', 'H', 'J']

To use these commands for your application, simply replace the alphabets with your gene lists. Happy coding!

Posted in Data visualisation, gProfiler

gProfiler: A useful webtool to query multiple lists of differentially expressed genes

Manhattan plot showing the enriched pathways modulated by AS01B after day 1 post-vaccination (first dose) with the Hepatitis B inactivated vaccine by gProfiler. Different adjuvants, AS01B, AS01E, AS03, AS04 and Alum were tested. No DEGs detected in AS04 and Alum. Data based on Laurane et al., 2020, Science Translational Medicine.

Comparing multiple variables with microarray or RNAseq often generates different sets of differentially expressed genes (DEGs). To examine the similarities and differences in molecular pathways modulated by these sets of DEGs, tools such as Enrichr or DAVID can be used to investigate the individual pathways involved. A Venn diagram can then be used to extract the intersecting or unique pathways involved.

While such comparisons are easy to perform when you only have a few to make, this method of analysis becomes time-consuming as the query size grows. Therefore, in scenarios where multiple comparisons of DEGs are required, I recommend using gProfiler for analysis of enriched pathways. One major strength of gProfiler is the ability to submit multiple query lists, simply by adding a “>” symbol between your gene lists.

To provide more context, consider that you are comparing the day 1 responses to the different adjuvants, published by Laurane et al. After determining the DEGs for the different conditions based on a defined cut-off (fold change > 1.3, adjusted p-value < 0.05 in this case), we can input the DEGs for the respective conditions into gProfiler. Once the “Run as multiquery” box is checked, gProfiler generates Manhattan plots for all the conditions, showing the enriched pathways detected across various databases such as GOMF, GOBP, GOCC, KEGG and Reactome, together with their respective p-values (see the top figure for the output). The corresponding p-values for the other conditions are reflected on the y-axis.

The p-values for all enriched pathways can also be analysed in a table form (see below). Indeed, consistent with the publication findings, AS01B and AS01E behave more similarly as compared to AS03, although most of the pathways among all three adjuvants are overlapping.

Pathway analysis with Reactome using gProfiler, comparing pathways modulated by AS01B, AS01E and AS03 at day 1 post-vaccination

Besides gene queries against the default databases, you can also use gProfiler to query any other custom database by uploading a custom GMT file. In addition, gProfiler performs other functions: for instance, g:Convert converts gene IDs into gene symbols, and g:SNPense provides information on SNP queries.

Overall, gProfiler is a useful webtool that allows quick assessment of enriched pathways across multiple variables. A limitation, however, is that the genes involved in the pathways are not shown. To circumvent this, the results from gProfiler can be complemented with Enrichr to identify the gene transcripts responsible for the enriched pathways.

Posted in Data visualisation, KEA3

KEA3: A web-based tool to predict involvement of upstream kinases based on a list of gene or protein queries.

(Left) Top 10 kinases that could be involved in transcriptional responses to the yellow fever live-attenuated vaccine (YF-17D). (Right) Interactive map showing the relationship of the different kinases identified in the left panel. Query list is based on the upregulated differentially expressed genes at day 7 post YF-17D vaccination from Querec et al., Nature Immunology, 2009.

Protein kinases catalyse the transfer of a phosphate group from ATP to the threonine, serine or tyrosine residues of other proteins. This addition of a phosphate group can influence substrate protein activity, stability, localisation and interactions with other molecules. While kinases can be suitably targeted by drugs, characterisation of the cell kinome is not easy, as intracellular staining with phospho-specific antibodies is required.

A plausible solution is to leverage transcriptomic data to down-select kinase candidates for further validation. Here, I introduce Kinase Enrichment Analysis 3 (KEA3), developed by MV Kuleshov et al., a webserver application that predicts upstream kinases from a list of gene or protein queries.

The KEA3 background database contains measured and predicted kinase-substrate interactions (KSI), kinase-protein interactions (KPI), and interactions supported by co-expression and co-occurrence data. By integrating KSIs and KPIs across data sources, KEA3 produces a composite ranking that improves the recovery of the expected kinases. In addition, the relationship between the top predicted kinase are also displayed in an interactive map.

I tested the ability of KEA3 to evaluate the possible kinases involved in the host transcriptomic response to the YF-17D vaccine published by Querec et al., Nature Immunology, 2009. Taking the up-regulated differentially expressed genes at day 7 post-YF-17D administration as the query list, the top 10 kinase hits are displayed in the figure at the top of this post. Notably, these kinases appear highly interconnected, and the predicted EIF2AK2-JAK1-JAK2-TYK2 axis suggests the involvement of these kinases in triggering type-I interferon responses. This finding is consistent with previous studies showing that YF-17D induces strong interferon and antiviral responses.

Overall, KEA3 is a user-friendly tool that allows users to quickly predict upstream kinases from a list of proteins or genes. While experimental validation will be needed to confirm the involvement of these predicted kinases, the tool provides an informed prediction of the kinases involved that can be used for future studies.