As discussed in my previous blog entry, correlation coefficients measure the strength of the relationship between two variables. As a rule of thumb, an absolute coefficient of about 0.5 or above is considered strong, about 0.3 moderate, and about 0.1 weak. To quickly visualise the relatedness between multiple variables, a correlation matrix can be used. However, it has some limitations. For instance, a correlation matrix does not tell you how steep the slope between two variables is, nor does it reveal a relationship if the variables are not linearly correlated. In addition, it cannot show how categorical variables might interact with the continuous variables.
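The first two limitations are easy to check with a few lines of NumPy. The sketch below uses made-up synthetic data (the names `shallow`, `steep`, `curved` and the helper `pearson_r` are my own, purely for illustration): two straight lines with very different slopes, and one perfectly predictable but non-linear curve.

```python
import numpy as np

# Synthetic x values, symmetric around zero (illustrative data only)
x = np.linspace(-5, 5, 101)

shallow = 2 * x    # straight line with a gentle slope
steep = 10 * x     # straight line with a slope five times steeper
curved = x ** 2    # fully determined by x, but not linear


def pearson_r(a, b):
    """Return the Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_shallow = pearson_r(x, shallow)
r_steep = pearson_r(x, steep)
r_curved = pearson_r(x, curved)

print(f"gentle slope: r = {r_shallow:.3f}")
print(f"steep slope:  r = {r_steep:.3f}")
print(f"parabola:     r = {r_curved:.3f}")
```

Both straight lines give the same coefficient even though their slopes differ, and the parabola gives a coefficient close to zero even though `curved` depends entirely on `x`.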
Let me provide you with an example to illustrate my case. Suppose you want to understand whether flower sepal length, sepal width, petal length and petal width are correlated. Plotting a correlation matrix for these 4 variables will allow you to visualise the strength of the relationship between these continuous variables. However, if you also have a categorical variable, for example flower species, and you are interested in how species interacts with all these different parameters, then merely plotting a correlation matrix is inadequate.
To get around this problem, pair plots can be used. Pair plots allow us to see the pairwise relationships between multiple continuous variables, together with the distribution of each variable. In Python, you can even use colours to visualise how a categorical variable interacts with the continuous variables. Surprisingly, plotting pair plots in Python is relatively straightforward. Let’s use the iris dataset from Plotly Express as a proof of concept.
We first load the required packages, the iris dataset from Plotly Express, and display all the variables from the dataset with the head() function:
# Load packages
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load iris dataset from Plotly Express
df = px.data.iris()
df.head()
The output looks like this, with sepal length, sepal width, petal length and petal width as continuous variables and flower species as the categorical variable:
To visualise the correlation between continuous variables, the pair plot can be quickly plotted with a single line of code. In addition, we can colour the points by flower species. Finally, rather than just plotting the same scatterplot on both sides of the matrix, we will also overlay a best-fit straight line on the lower half of the matrix so we can visualise the slope of each relationship. The code used is thus as follows:
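# Pair plot of the four continuous variables, coloured by species
g = sns.pairplot(
    df,
    vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    dropna=True,
    hue="species",
    diag_kind="kde",
)

# Overlay a best-fit straight line on each panel in the lower half
g.map_lower(sns.regplot)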
Output is as follows:
Isn’t it cool that you can immediately see which variables are strongly correlated with each other (petal length and petal width, for example), and that these relationships depend on the flower species? As you may already appreciate, the pair plot is a useful tool to visualise and analyse multi-dimensional datasets.