Correlations are a great way to show whether variables are associated with each other. The strength of the association is often represented by the correlation coefficient, which ranges between -1.0 and 1.0. A positive value indicates a positive association, while a negative value indicates a negative association. While there is no fixed definition, a correlation coefficient larger than 0.7 or smaller than -0.7 is usually considered strong. To quickly visualise the correlations between multiple variables, correlation matrices can be used. In my opinion, Python makes it easiest to plot these matrices, which is what I will elaborate on in this blog entry.
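As a quick illustration of what the sign of the coefficient means, here is a minimal sketch using made-up numbers (the arrays below are hypothetical, not from the dataset used later):

```python
import numpy as np

# Hypothetical measurements: y_up rises with x, y_down falls with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2*x
y_down = -y_up                                # inverted trend

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the off-diagonal entry
r_pos = np.corrcoef(x, y_up)[0, 1]
r_neg = np.corrcoef(x, y_down)[0, 1]
print(r_pos, r_neg)  # near 1.0 and near -1.0
```

Both values comfortably clear the 0.7 rule of thumb, so these would count as strong correlations.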

First, we load the packages and dataset into Python. I have assigned the headers to be the variables (ab30d, ab60d, ab90d and tcell7d) and the first column to be the subject ID column. We will assign this dataframe to "test_1":

```
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import data with 10 subjects into Python. Only the top 2 rows will be shown in the output
test_1 = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv', index_col=0)
test_1.head(2)
```

Output is as follows:

subject | ab30d | ab60d | ab90d | tcell7d
---|---|---|---|---
1 | 0.284720 | 0.080611 | 0.997133 | 0.151514
2 | 0.163844 | 0.818704 | 0.459975 | 0.143680
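If you do not have a CSV file on hand, a small random dataframe with the same column names can stand in while following along (the column names match this post; the values are made up, so your numbers will differ from the outputs shown here):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: 10 subjects, 4 variables, random values
rng = np.random.default_rng(0)
test_1 = pd.DataFrame(
    rng.random((10, 4)),
    columns=['ab30d', 'ab60d', 'ab90d', 'tcell7d'],
)
test_1.index = range(1, 11)       # subject IDs 1 to 10
test_1.index.name = 'subject'
print(test_1.head(2))
```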

Next, we can compute the correlation coefficients across all variables using the command:

```
corrM = test_1.corr()
corrM
```

Output showing all the correlation coefficient values is as follows:

 | ab30d | ab60d | ab90d | tcell7d
---|---|---|---|---
ab30d | 1.000000 | 0.378926 | 0.042423 | -0.324900
ab60d | 0.378926 | 1.000000 | -0.489996 | -0.374259
ab90d | 0.042423 | -0.489996 | 1.000000 | -0.392458
tcell7d | -0.324900 | -0.374259 | -0.392458 | 1.000000
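With the matrix in hand, one way to pull out only the strong pairs, using the |r| > 0.7 rule of thumb from the introduction, is to keep the upper triangle and filter it. This is a sketch using a small hypothetical correlation matrix in place of `test_1.corr()`:

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix standing in for test_1.corr()
corrM = pd.DataFrame(
    [[1.0,  0.9,  0.1],
     [0.9,  1.0, -0.8],
     [0.1, -0.8,  1.0]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c'],
)

# Keep each pair once (upper triangle, excluding the diagonal), then filter on |r| > 0.7
keep = np.triu(np.ones(corrM.shape, dtype=bool), k=1)
pairs = corrM.where(keep).stack()   # stack() drops the NaN-masked entries
strong = pairs[pairs.abs() > 0.7]
print(strong)
```

Here `strong` keeps the (a, b) and (b, c) pairs and drops the weak (a, c) pair.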

We can also determine the p-values of the correlations using the following commands:

```
from scipy.stats import pearsonr

# Compute the Pearson p-value for every pair of columns
pvals = pd.DataFrame(
    [[pearsonr(test_1[c], test_1[y])[1] for y in test_1.columns] for c in test_1.columns],
    columns=test_1.columns, index=test_1.columns)
pvals
```

Output is as follows:

 | ab30d | ab60d | ab90d | tcell7d
---|---|---|---|---
ab30d | 6.646897e-64 | 0.280214 | 9.073675e-01 | 0.359673
ab60d | 2.802137e-01 | 0.000000 | 1.505307e-01 | 0.286667
ab90d | 9.073675e-01 | 0.150531 | 6.646897e-64 | 0.261956
tcell7d | 3.596730e-01 | 0.286667 | 2.619556e-01 | 0.000000
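The p-value table can also be used to blank out coefficients that are not statistically significant before plotting. This is a sketch with a 0.05 cutoff, using small hypothetical stand-ins for the `corrM` and `pvals` tables above:

```python
import numpy as np
import pandas as pd

cols = ['ab30d', 'ab60d']
# Hypothetical stand-ins for test_1.corr() and the pvals table
corrM = pd.DataFrame([[1.00, 0.38], [0.38, 1.00]], index=cols, columns=cols)
pvals = pd.DataFrame([[0.00, 0.28], [0.28, 0.00]], index=cols, columns=cols)

# Keep a coefficient only where its p-value is below 0.05; the rest become NaN
sig_corr = corrM.where(pvals < 0.05)
print(sig_corr)
```

The resulting `sig_corr` can be passed to the heatmap in place of `corr`, so only significant cells are coloured and annotated.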

Finally, to plot the correlation matrix with the correlation coefficient values within, we execute the following command:

```
corr = test_1.corr()

# Mask the upper triangle so each pair of variables is shown only once
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

# Note: cmap was not defined earlier; a seaborn diverging palette is one common choice,
# colouring negative and positive correlations in contrasting hues
cmap = sns.diverging_palette(220, 10, as_cmap=True)

with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, cmap=cmap, mask=mask, center=0, square=True,
                     linewidths=.5, annot=True, cbar_kws={"shrink": .5})
```

The output is a beautiful correlation matrix:

At one glance, you can quickly tell which of the variables are positively and which are negatively correlated. Depending on the size of your correlation matrix, you can tailor the size of the output diagram by changing the figsize dimensions in the code. Correlation matrices can also be complemented with pair plots, which I will elaborate on in a future blog entry.
