Posted in Data visualisation, IPyWidgets, python

Identifying differentially expressed genes with ipywidgets

Gene expression (or transcriptomics) profiling is the most common type of omics data. The identification of differentially expressed genes (DEGs) from transcriptomics data is critical to understanding the molecular driving forces or identifying the molecular biomarkers behind biological phenotypes. The full article on DEGs is fully summarised here.

Since we need to find out the most appropriate cut-off to identify DEGs, interactive Python codes that allow users to manipulate thresholds will be the most suitable to define the selection criteria for DEGs. In this blog entry, we will use IPython widgets (ipywidgets), which can generate sliders that allow users to set cutoffs or thresholds interactively. The output based on the selected cutoffs will then be generated in real-time for users to evaluate their suitability.

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.

The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

To install ipywidgets, simply type in the below command (you only need to do this once):

pip install ipywidgets

We can execute the ipywidgets within JupyterLab or Jupyter Notebooks. The instructions for downloading them can be found here. Next, we import the packages required for analysis:

import numpy as np
import pandas as pd
import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objects as go

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can refer to the column for DEG counts. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

Next, we will execute the ipywidgets command:

from ipywidgets import interact, interact_manual
 
@interact
def show_DEGs(column1=['fc_6h','fc_1d','fc_3d', 'fc_7d'], x=(1,3,0.1),column2=['qval_6h','qval_1d','qval_3d','qval_7d']):
 return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))

The above code may seem intimidating, but you can break this down into bite sizes.

First, we import the tools from ipywidgets required to build the interactive features. The @interact decorator automatically creates a text box and a slider for choosing values from a designated column and number.

In this case, we created 2 columns (column1 and column2) to represent values in fold-change and adjusted p-values (q-values) respectively. The function x=(1,3,0.1) means that column1 values (which is fc values in this case) can be changed from 1 to 3, at intervals of 0.1. Finally, we will then print out the total number of DEGs (includes both up and downregulated DEGs) more than the value of x, and with a q-value < 0.05.

The output is as follows:

I have taken multiple snapshots so that we can appreciate the number of DEGs across different days. In this case, the total number of DEGs for each time-point with fold-change > 2 and adjusted p-values < 0.05 are shown. Consistent with the volcano plot, we see that the greatest number of DEGs are in day 1.

We can change the x value to 1.5 to visualise the number of DEGs with fold-change value > 1.5, adjusted p-values < 0.05. The output is as follows:

As expected, we see more DEGs as the fold-change cut-off is less stringent. Note that the number of DEGs in day 1 increased by approximately 3-fold when we reduced the fold-change value from 2 to 1.5, which is quite a big difference! It is thus more reasonable to use a fold-change of 1.5, in this case, to avoid losing too much data.

After we decide that a fold-change value of 1.5 and q-value < 0.05 is most appropriate for our analysis, we will plot these details in a stacked bar. We first calculate the number of upregulated (fold-change > 1.5, adjusted p-value < 0.05) and number of downregulated (fold-change < -1.5, adjusted p-value < 0.05) for each time-point. The commands are as shown:

DEGs_up_6h = len((df[(df['fc_6h'] > 1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_down_6h = len((df[(df['fc_6h'] < -1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_up_1d = len((df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_down_1d = len((df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_up_3d = len((df[(df['fc_3d'] > 1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_down_3d = len((df[(df['fc_3d'] < -1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_up_7d = len((df[(df['fc_7d'] > 1.5) & (df['qval_7d'] < 0.05)]).index)
DEGs_down_7d = len((df[(df['fc_7d'] < -1.5) & (df['qval_7d'] < 0.05)]).index)

Finally, we can plot the results in a stacked bar format using the Plotly graphing library, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

days = ['6hrs', 'day 1', 'day 3', 'day 7']
fig = go.Figure(data=[
 go.Bar(name='downregulated', x=days, y=[DEGs_down_6h, DEGs_down_1d, DEGs_down_3d, DEGs_down_7d]),
 go.Bar(name='upregulated', x=days, y=[DEGs_up_6h, DEGs_up_1d, DEGs_up_3d, DEGs_up_7d])
])
fig.update_layout(
 barmode='stack',
title = 'DEGs in seronegative subjects',
yaxis_title='Number of DEGs',
font=dict(
 family='Arial', size=18))
fig.show()

I have done a few more customisations in the graph, including adding a title, y-axis titles and changing the font to make the graph look more professional. The output is as shown:

At one glance, you can quickly appreciate that day 1 has the most DEGs as compared to the other time points, and there are more upregulated compared to downregulated DEGs. You can also mouse over the data points to obtain the precise number of up and downregulated DEGs at every time point.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s