
Correlation matrix – A useful tool to visualise relatedness between multiple variables

Correlations are a great way to show whether variables are associated with each other. The strength of a correlation is represented by the correlation coefficient, whose values range between -1.0 and 1.0. A positive value indicates a positive association, while a negative value indicates a negative association. Although there is no fixed definition, a correlation coefficient larger than 0.7 or smaller than -0.7 is usually considered a strong correlation. To quickly visualise the correlations between multiple variables, correlation matrices can be used. In my opinion, the Python programming language allows you to plot these matrices most quickly, which is what I will elaborate on in this blog entry.
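As a quick illustration of the coefficient itself, the short snippet below computes the Pearson correlation between two small, made-up number series (these values are invented purely for demonstration and are not part of the dataset used later):

# Quick illustration with made-up numbers
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]          # roughly increases with x

r, p = pearsonr(x, y)
print(r)                      # about 0.85, i.e. a strong positive association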

First, we load the packages and the dataset into Python. The headers are assigned to the variables (ab30d, ab60d, ab90d and tcell7d), and the first column is the subject ID column. We will assign this dataframe to “test_1”.

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import the data for the 10 subjects into Python. Only the top 2 rows will be shown in the output
test_1 = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv', index_col=0)
test_1.head(2)

The output is as follows:

subject   ab30d     ab60d     ab90d     tcell7d
1         0.284720  0.080611  0.997133  0.151514
2         0.163844  0.818704  0.459975  0.143680
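
If you do not have this CSV file on hand, a toy dataframe with the same column names can be generated instead. The sketch below uses random values, so the numbers will not match the ones shown above:

# Optional: build a toy dataframe with random values instead of reading a CSV
rng = np.random.default_rng(0)
test_1 = pd.DataFrame(rng.random((10, 4)),
                      columns=['ab30d', 'ab60d', 'ab90d', 'tcell7d'],
                      index=pd.RangeIndex(1, 11, name='subject'))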

Next, we can visualise the correlation coefficients across all variables using the command:

corrM = test_1.corr()
corrM

The output showing all the correlation coefficient values is as follows:

          ab30d      ab60d      ab90d      tcell7d
ab30d      1.000000   0.378926   0.042423  -0.324900
ab60d      0.378926   1.000000  -0.489996  -0.374259
ab90d      0.042423  -0.489996   1.000000  -0.392458
tcell7d   -0.324900  -0.374259  -0.392458   1.000000
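
Note that .corr() computes the Pearson correlation by default. If your variables are not normally distributed, or the relationship is monotonic rather than linear, pandas can also compute rank-based coefficients:

# Rank-based alternatives to the default Pearson coefficient
test_1.corr(method='spearman')
test_1.corr(method='kendall')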

We can also determine the p-values of the correlations using the following commands:

# Compute the p-value of the Pearson correlation for every pair of variables
from scipy.stats import pearsonr
pvals = pd.DataFrame([[pearsonr(test_1[c], test_1[y])[1] for y in test_1.columns] for c in test_1.columns],
                     columns=test_1.columns, index=test_1.columns)
pvals

The output is as follows:

          ab30d         ab60d     ab90d         tcell7d
ab30d     6.646897e-64  0.280214  9.073675e-01  0.359673
ab60d     2.802137e-01  0.000000  1.505307e-01  0.286667
ab90d     9.073675e-01  0.150531  6.646897e-64  0.261956
tcell7d   3.596730e-01  0.286667  2.619556e-01  0.000000
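
As a quick sanity check, this p-value table can be turned into a True/False table flagging which pairs fall below a conventional 0.05 cut-off (this is only a sketch; choose the threshold appropriate for your own analysis):

# Flag correlations with p < 0.05 (True = nominally significant)
sig = pvals < 0.05
sig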

Finally, to plot the correlation matrix with the correlation coefficient values displayed within it, we execute the following commands:

corr = test_1.corr()
mask = np.zeros_like(corr, dtype=bool)                 # mask to hide the redundant upper triangle
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)    # diverging colour map; any diverging palette works here
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, cmap=cmap, mask=mask, center=0, square=True,
                     linewidths=.5, annot=True, cbar_kws={"shrink": .5})

The output is a beautiful correlation matrix:

At a glance, you can quickly tell which of the variables are positively correlated and which are negatively correlated. Depending on the size of your correlation matrix, you can tailor the size of the output diagram by changing the figsize dimensions within the code. Correlation matrices can also be complemented with pair plots, which I will elaborate on in a future blog entry.
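
As a brief preview, seaborn can draw a pair plot directly from the same dataframe in a single call:

# Preview: one scatter plot for every pair of variables in the dataframe
sns.pairplot(test_1)
plt.show()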
