Correlation Analysis: Multivariable [Under Construction]
May 23, 2019
Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
Introduction
Correlation analysis is a statistical evaluation method used to study the strength of the relationship between two numerical variables. This type of analysis is useful when we want to check whether there are any positive or negative connections between the variables.
Setup
We will start by loading the wine_v2, tips and questions_data datasets.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
wine_data = load_commonlounge_dataset('wine_v2')
tips_data = load_commonlounge_dataset('tips')
qns = load_commonlounge_dataset('content-creators/swarnabha/questions_data')
Let’s set the pandas display width to 200 characters, and the maximum number of columns to display to 20.
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)
Correlation Matrix and Heatmap
Let’s dive into the Wine dataset.
wine_data.info()
There are 13 numerical variables. If we were to study the correlation between all the pairs, there would be 78 correlation coefficients, one for each pair!
${{13}\choose{2}} = 78$ (combinations)
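As a quick check of this count, here is a minimal sketch using only the Python standard library. The variable names are chosen for illustration.

```python
from math import comb

n_variables = 13

# Number of unordered pairs of variables: n * (n - 1) / 2
n_pairs = comb(n_variables, 2)

# The closed-form expression gives the same count
assert n_pairs == n_variables * (n_variables - 1) // 2

print(n_pairs)  # 78
```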
Instead of listing out all the correlation coefficients (between every pair of variables) separately, we can represent them by a correlation matrix.
Correlation Matrix
An $n \times n$ correlation matrix can be formed for a set of $n$ numerical variables named $X_1, X_2, \ldots, X_n$, such that the $(i,j)$ element of the matrix is the correlation coefficient between $X_i$ and $X_j$.
Hence, we can form a $13 \times 13$ correlation matrix.
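To make the definition concrete, here is a small sketch on made-up data. The toy DataFrame and its columns a, b and c are invented for illustration and are not part of the wine dataset. It checks that the $(i,j)$ element of the matrix equals the pairwise correlation coefficient, that the diagonal is 1, and that the matrix is symmetric.

```python
import numpy as np
import pandas as pd

# Made-up data: b rises with a, c falls as a rises
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.5, size=200),
    "c": -x + rng.normal(scale=0.5, size=200),
})

# 3 variables give a 3 x 3 correlation matrix (Pearson is the default method)
toy_corr = toy.corr()
print(toy_corr)

# The (a, b) element is the correlation coefficient between a and b
r_ab = toy["a"].corr(toy["b"])
print(np.isclose(toy_corr.loc["a", "b"], r_ab))

# Each variable is perfectly correlated with itself
print(np.allclose(np.diag(toy_corr), 1.0))

# corr(X_i, X_j) equals corr(X_j, X_i), so the matrix is symmetric
print(np.allclose(toy_corr, toy_corr.T))
```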
To form the matrix, we will call the corr() method of the pandas DataFrame object.
This is the syntax:
correlation_matrix = DataFrame.corr(method="")
For the method parameter, we pass one of the following arguments to decide how the correlation coefficient is calculated:
- "pearson": Pearson’s product-moment correlation coefficient
- "spearman": Spearman’s rank correlation coefficient
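The difference between the two methods is easiest to see on a relationship that is monotonic but not linear. The sketch below uses made-up data (y is the cube of x), not any of the tutorial's datasets. Spearman's coefficient works on ranks, so it should come out as 1, while Pearson's coefficient measures linear association and stays below 1.

```python
import pandas as pd

# Made-up data: y always increases with x, but not along a straight line
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson_r = df.corr(method="pearson").loc["x", "y"]
spearman_r = df.corr(method="spearman").loc["x", "y"]

print(round(pearson_r, 3))   # below 1: the relationship is not linear
print(round(spearman_r, 3))  # 1.0: the ranks of x and y agree exactly
```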
Before we create the correlation matrix, we make a copy of the wine dataset that contains only the numerical input variables, by dropping the categorical variable Wine with the pandas method DataFrame.drop(labels="", axis=1).
# creating a subset of Wine DataFrame with only numerical variables.
wine_data_num = wine_data.drop(labels=["Wine"], axis=1)
# creating correlation matrix
corr_matrix = wine_data_num.corr(method="pearson")
# display correlation matrix
print(corr_matrix)
As you can see, although the information here is useful, it is difficult to infer anything from this matrix and analyse the information.
It would be very cumbersome to even recognise which variables are negatively correlated and which are positively correlated!
Heatmaps
To take care of this we will use a visual tool called heatmap from the seaborn library.
The heatmap()
function will create a two-dimensional graphical representation of data where the individual values that are contained in a matrix are mapped to colors.
A colourmap is used for this purpose , where a continuous spectrum of colours is used to represent the numerical values of the correlation coefficient.
We will use this function to create a heatmap by passing the correlation matrix as the argument. Let’s also use the parameters linewidths, vmin and vmax.
sns.heatmap(corr_matrix, linewidths=0, vmin=None, vmax=None)
Parameters:
- linewidths: the width of the lines that separate the cells. Accepts float values; the default is 0.
- vmin, vmax: the values to anchor the colourmap. If nothing is passed, the limits are inferred from the data. Accepts float values. (For our purpose here, we want the colourmap to extend from -1 to +1.)
Since our correlation matrix is a DataFrame object, the heatmap function uses the Index/Column names to label the columns and rows.
Let us plot and see how it looks.
sns.heatmap(corr_matrix, linewidths=0.2, vmin=-1, vmax=1)
plt.show()
Observe the colour bar on the right side of the plot, which shows how the colours of the colourmap correspond to correlation values.
The X-axis and Y-axis labels have been arranged according to the column names from our dataset.
Notice the extremely light and extremely dark colours and recognise which variables have very high positive and negative correlations!
Analysis:
We can see that all the diagonal elements have extremely light colours. That’s because they contain the correlation coefficient of each variable with itself, so the value is 1.
- We can also see light colours, which imply a high positive correlation, between OD and Flavanoids, and between Phenols and Flavanoids.
- We can also see dark colours, which imply a high negative correlation, between Nonflavanoid.phenols and Flavanoids, and between Hue and Malic.acid.
Note: You can also see that the cells above and below the diagonal mirror each other. This is due to symmetry in the correlation matrix: the correlation between $X_i$ and $X_j$ is the same as the correlation between $X_j$ and $X_i$, so every pair of labels appears on both sides of the diagonal.
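Because of this symmetry, only one triangle of the matrix carries unique information. The following sketch, on made-up data (the toy DataFrame and its columns are invented for illustration), keeps the cells strictly above the diagonal and lists each pair once, strongest correlation first. This is one possible way to read off the strongest pairs, not something the tutorial's datasets require.

```python
import numpy as np
import pandas as pd

# Made-up data: b follows a closely, c moves against a, d is unrelated
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.3, size=300),
    "c": -base + rng.normal(scale=1.0, size=300),
    "d": rng.normal(size=300),
})
toy_corr = toy.corr(method="pearson")

# True strictly above the diagonal: every pair once, no self-correlations
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

# Blank out the other cells, then flatten to one value per (row, column) pair
pairs = toy_corr.where(upper).stack().dropna()

# Order the pairs by absolute correlation, strongest first
order = pairs.abs().sort_values(ascending=False).index
pairs = pairs.reindex(order)
print(pairs)
```

The same boolean array can also be passed to the mask parameter of sns.heatmap(), which hides the cells where the mask is True, if you prefer to draw only one triangle of the heatmap.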
Other parameters:
You can also use the parameter cmap to modify the plot.
This parameter controls the mapping from data values to colour space. For example, you can pass "RdYlGn" as an argument to get a Red-Yellow-Green colourmap.
sns.heatmap(corr_matrix, linewidths=0.2, vmin=-1, vmax=1, cmap="RdYlGn")
plt.show()
PairGrid plots
Although the correlation coefficient plays a central role in correlation analysis, the information it provides is often not enough.
As explained earlier, the correlation coefficient alone leaves out a lot about the relationship between the variables.
There is a way to visualise the relationship between all the pairs of variables using instances of the PairGrid class.
PairGrid allows us to draw multiple plots within a single figure. Thus we can draw scatterplots between each pair of variables within a single figure. An instance of the PairGrid class will be our multi-plot figure.
pairplot() function
We will draw these figures using the function pairplot(), which returns a PairGrid instance.
The figure consists of rows and columns of subplots, one for each pair of variables.
This is the syntax:
PairGrid_object = sns.pairplot(data, vars=column_features)
- Here data refers to the DataFrame containing all the variables.
- vars refers to the list of columns we want to show in our plot. We will use only 5 selected numerical features from the wine_data DataFrame, because using all 13 numerical features would make the figure too crowded and hinder our analysis.
Note: The pairplot() function and the PairGrid class can only handle tidy data, i.e., a DataFrame where each column is a variable and each row is an observation.
Let us plot our figure!
# list of column names to use
c=["Alcohol", "Phenols", "Flavanoids", "Proanth", "OD"]
# plot the figures
ax = sns.pairplot(wine_data, vars=c)
plt.show()
Explanation
The variable name at the left of each row gives the variable on the Y-axis for every subplot in that row.
The variable name at the bottom of each column gives the variable on the X-axis for every subplot in that column.
As we can see, this figure gives a lot of insight into the data!
Diagonals
Notice that the diagonal and off-diagonal subplots are very different. Each off-diagonal subplot has a different variable on the X-axis and the Y-axis, so the scatterplots give meaningful visuals.
The diagonal subplots, on the other hand, would have the same variable on both axes. If we simply plotted a variable against itself, all the points would lie on a straight line at $45^\circ$ (a correlation coefficient of 1). So, instead, a histogram of each variable is drawn along the diagonal.
You can also see that the subplots above and below the diagonal mirror each other: each pair of variables appears twice, once with the axes swapped. This is due to symmetry in the way the whole figure is designed.
Analysis
Pair plots help us draw meaningful conclusions about the data:
- We can see from the first row of plots that Alcohol seems to have a more or less symmetric distribution, and that it has very little correlation with any other variable on our plot.
- From the second row of plots we can see that Phenols has a very strong positive correlation with Flavanoids, and a weaker positive correlation with Proanth and OD.
Other parameters
We can use the parameter hue to further improve the subplots.
The hue parameter takes a categorical variable and plots the data points in different colours according to the labels in that variable.
For example, let us pass the target variable "Wine" as an argument to the hue parameter and see how the figure improves.
# list of column names to use
c=["Alcohol", "Phenols", "Flavanoids", "Proanth", "OD"]
ax = sns.pairplot(wine_data, vars=c, hue="Wine")
plt.show()
FacetGrid Plots
Let us now look at another interesting type of plot that helps us in correlation analysis.
We have seen how the hue parameter helps us analyse the relationship between a pair of variables while distinguishing the points according to a third (categorical) variable.
But what if we wanted to further distinguish the points with respect to other categorical variables in the data?
For example, in the tips dataset, we might want to separate the points according to sex (Male, Female) and time (Lunch, Dinner) in the same figure!
Let’s construct plots which allow us to visualise the relationship between variables separately within subsets of our dataset.
We will do so using plots created by FacetGrid objects.
relplot() function
We will use the function relplot(), which returns an instance of the FacetGrid class and plots the desired figure.
Unlike PairGrid plots, in FacetGrid plots the horizontal and vertical axes denote the same variable in every subplot. In each subplot we can differentiate between the subsets of the data by passing arguments to the relevant parameters.
This is the syntax:
FacetGrid_Object = sns.relplot(data=data, x="", y="", hue="")
Parameters:
- In the data parameter, we pass the DataFrame as the argument.
- In the x and y parameters, we pass the column labels from the DataFrame to be plotted on the x-axis and y-axis respectively.
- In the hue parameter, we pass the name of the column whose values determine the colour of each point.
Let us use the relplot() function to explore the tips dataset. We will analyse the relationship between the total_bill and tip variables.
Let’s plot one subplot:
- Where the colour (hue) is determined by the column day
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", hue="day")
plt.show()
As we can see, although the hue parameter gives more information, it is hard to understand much from this graph.
Let us see if we can separate the points according to the values of the variable time.
col parameter
Let us draw separate subplots, side by side in a row, each containing the points for one label of time.
For this we will use the parameter col and pass the argument "time".
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", hue="day")
The parameters are explained after the figure.
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", hue="day")
plt.show()
Note: Since we are analysing the relationship between the variables total_bill and tip, the x-axis will denote total_bill and the y-axis will denote tip in all subplots drawn by the relplot() function.
Think of this as a table of subplots, with one row and two columns.
- The subplot in the first column only has data points from “time=Dinner”, and the second column has data points from “time=Lunch”.
- Thus each column shows a different facet, according to the argument (variable) passed to the col parameter.
row parameter
Let us see if we can improve it further.
Let us add more rows of subplots, where every row will contain data points according to the values in the sex variable.
We will use a parameter called row and pass the name of the variable, "sex", as an argument.
The rest of the syntax remains the same.
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", row="sex", hue="day")
plt.show()
So in the row and col parameters we pass the names of the variables according to which the grid is split into facets.
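To see how row and col shape the grid, here is a minimal sketch on made-up data. The toy_tips DataFrame and its values are invented for illustration and are not the tips dataset. It assumes seaborn and matplotlib are installed, and it selects the non-interactive Agg backend only so that the sketch can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed

import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data with two categorical columns to facet on
rng = np.random.default_rng(2)
n = 80
toy_tips = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=n),
    "time": rng.choice(["Lunch", "Dinner"], size=n),
    "sex": rng.choice(["Male", "Female"], size=n),
})
toy_tips["tip"] = 0.15 * toy_tips["total_bill"] + rng.normal(scale=1.0, size=n)

# One subplot per (sex, time) combination: rows follow sex, columns follow time
g = sns.relplot(data=toy_tips, x="total_bill", y="tip", col="time", row="sex")

# The returned FacetGrid keeps its subplots in a 2-D array of axes
print(g.axes.shape)  # (2, 2)
```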
Analysis:
We can see some interesting patterns here.
- The correlation between tip and total_bill is quite strong if the sex is Female and time is Lunch.
- The lunch data comes almost entirely from Thursdays and Fridays.
- The correlation between tip and total_bill seems to be stronger for the lunch data than for the dinner data, although we should do further analysis to verify this.
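One way to follow up on such visual impressions is to compute the correlation coefficient separately within each facet. The sketch below does this on made-up data (the toy_tips DataFrame and its values are invented for illustration, so the numbers say nothing about the real tips dataset). The same loop could be run on tips_data, assuming it has the columns total_bill, tip, time and sex used above.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the tips dataset
rng = np.random.default_rng(3)
n = 200
toy_tips = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=n),
    "time": rng.choice(["Lunch", "Dinner"], size=n),
    "sex": rng.choice(["Male", "Female"], size=n),
})
toy_tips["tip"] = 0.15 * toy_tips["total_bill"] + rng.normal(scale=1.0, size=n)

# Pearson correlation between total_bill and tip within each (time, sex) subset
facet_corr = pd.Series({
    key: group["total_bill"].corr(group["tip"])
    for key, group in toy_tips.groupby(["time", "sex"])
})
print(facet_corr)
```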
Summary
Let’s summarize the syntax for creating a correlation matrix and the various plots.
Correlation Matrix and Heatmap
- Correlation Matrix: correlation_matrix = DataFrame.corr(method="")
- Heatmap: sns.heatmap(corr_matrix, linewidths=float, vmin=float, vmax=float)
PairGrid Plots
Using the pairplot() function
ax = sns.pairplot(data, vars=list, hue="column")
FacetGrid Plots
Using the relplot() function
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", row="sex", hue="day")