Posted in Data visualisation, python

Plotting volcano plots with Plotly

As mentioned previously, I have highlighted why volcano plots are so important in omics research. A full article summarising the main points and rationale are as described here. In this entry, we will explore how we can use Plotly to build volcano plots!

We will analyse a transcriptomics dataset published by Zak et al., PNAS, 2012. In this study, seronegative and seropositive subjects were given the MRKAd5/HIV vaccine, and the transcriptomic responses in peripheral blood mononuclear cells (PBMC) were measured at 6 hours, 1 day, 3 day and 7 days post-vaccination. On my end, I have processed and compiled the fold-change, ratio , p-values and adjusted p-values (q-values) of the seronegative subjects with respective to time = 0 using the Partek Genomics Suite, and the processed data is available in GitHub.

We first load the required packages (pandas, numpy and plotly) to plot our volcano plot using Python. Note that you will need to download the packages before you can start using them. I have chosen to use the Plotly graphing library for data visualisation, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can reference the specific points to gene names. Commands are as follows:

df = pd.read_csv(‘https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1 day, 3 day and 7 day time-points compared to baseline (timepoint = 0):

The next step will be to create new columns for log2FC and -log10(adjusted p-value). The commands are as follows:

df[‘log2FC_6h’] = np.log2(df[‘ratio_6h’])
df[‘log2FC_1d’] = np.log2(df[‘ratio_1d’])
df[‘log2FC_3d’] = np.log2(df[‘ratio_3d’])
df[‘log2FC_7d’] = np.log2(df[‘ratio_7d’])
df[‘negative_log_pval_6h’] = np.log10(df[‘qval_6h’]) * (-1)
df[‘negative_log_pval_1d’] = np.log10(df[‘qval_1d’]) * (-1)
df[‘negative_log_pval_3d’] = np.log10(df[‘qval_3d’]) * (-1)
df[‘negative_log_pval_7d’] = np.log10(df[‘qval_7d’]) * (-1)

Now we are ready to plot the volcano plots for the different time-points. The advantage of using Python is that we can overlay the volcano plots of the different time-points all in one graph. The commands are as follows:

fig = go.Figure()
trace1 = go.Scatter(
 x=df[‘log2FC_6h’],
 y=df[‘negative_log_pval_6h’],
 mode=’markers’,
 name=’6hrs’,
 hovertext=list(df.index)
)
trace2 = go.Scatter(
 x=df[‘log2FC_1d’],
 y=df[‘negative_log_pval_1d’],
 mode=’markers’,
 name=’day 1',
 hovertext=list(df.index)
)
trace3 = go.Scatter(
 x=df[‘log2FC_3d’],
 y=df[‘negative_log_pval_3d’],
 mode=’markers’,
 name=’day 3',
 hovertext=list(df.index)
)
trace4 = go.Scatter(
 x=df[‘log2FC_7d’],
 y=df[‘negative_log_pval_7d’],
 mode=’markers’,
 name=’day 7',
 hovertext=list(df.index)
)
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.add_trace(trace4)
fig.update_layout(title=’Volcano plot for seronegatives’)
fig.show()

I will give a brief description of the above code. We first tell Plotly that we are going to plot a figure using the go.Figure() command. Next, we overlay the different scatterplots using the different traces (one for each time-point). In each trace, we specify (i) the columns for the x and y-axis, (ii) indicate that we only want to plot points using the mode= ‘markers’, (iii) indicate the figure legends for the different traces and (iv) indicate the text labels when we hover over the data points. Under the fig.update_layout, we also added the title of the plot. Output for the graph is thus as follows:

At one glance, you can quickly appreciate that day 1 has the most differentially expressed genes as compared to the other time-points. You can also mouse over the data points to examine which of these genes are most differentially expressed. An example I have provided is CXCL10, which is one of the genes that is induced to a great extent after one day post-vaccination.

Now that we have established that day 1 is the most interesting time-point, we may want to zoom into the genes that are at the edges of the volcano plot. We can type in the below command to specifically plot the day 1 volcano plot. Note that I have added the text in the volcano plot this time (using ‘textposition’ command) so that we can quickly visualise the genes that are most differentially expressed:

fig_d1 = px.scatter(df, x=’log2FC_1d’, y=’negative_log_pval_1d’, text=df.index)
fig_d1.update_traces(textposition=’top center’)
fig_d1.update_layout(
 title_text=’Volcano plot for seronegatives (day 1)’
)
fig_d1.show()

Output file is as follows:

Now, we can quickly visualise that the top upregulated genes at day 1 include the interferon-related genes, such as IDO1, CXCL10, RSAD2, CCL2, CCL8 and LAMP3. As you can begin to appreciate, a volcano plot allows users to have a quick sensing of the data and visualisation of the most pivotal genes and proteins that are most differentially expressed.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s