Correlation Analysis: Multivariable [Under Construction]

May 23, 2019

Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.

Introduction

Correlation analysis is a statistical evaluation method used to study the strength of the relationship between two numerical variables. This type of analysis is useful when we want to check whether there is any positive or negative connection between the variables.

Setup

We will start by loading the wine_v2, tips and questions_data datasets.

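# libraries used throughout this tutorial
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

wine_data = load_commonlounge_dataset('wine_v2')
tips_data = load_commonlounge_dataset('tips')
qns = load_commonlounge_dataset('content-creators/swarnabha/questions_data')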

Let’s set the pandas display width to 200 characters, and the maximum number of columns to display to 20.

pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)

Correlation Matrix and Heatmap

Let’s dive into the Wine dataset.

wine_data.info()

There are 13 numerical variables. If we were to study the correlation between every pair of them, there would be 78 correlation coefficients, one for each pair:

$\binom{13}{2} = \frac{13 \times 12}{2} = 78$ (combinations)
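The count above can be checked with a few lines of Python. This is only a sketch: the names X1 to X13 are placeholders, not the actual column names of the wine dataset.

```python
from itertools import combinations
from math import comb

n_variables = 13

# Number of unordered pairs that can be chosen from 13 variables
n_pairs = comb(n_variables, 2)

# Placeholder variable names (not the real wine columns)
names = [f"X{i}" for i in range(1, n_variables + 1)]

# Explicitly enumerate every unordered pair of variables
pairs = list(combinations(names, 2))

print(n_pairs, len(pairs))
```

Both numbers agree because combinations() lists each pair exactly once, regardless of order.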

Instead of listing out all the correlation coefficients (between every pair of variables) separately, we can represent them by a correlation matrix.

Correlation Matrix

An $n \times n$ correlation matrix can be formed for a set of $n$ numerical variables named $X_1, X_2, \dots, X_n$, such that the $(i,j)$ element of the matrix is the correlation coefficient between $X_i$ and $X_j$.

Hence, we can form a $13 \times 13$ correlation matrix.

To form the matrix, we will call the corr() method of the pandas DataFrame object.

This is the syntax:

correlation_matrix = DataFrame.corr(method="")

For the method parameter, we pass one of the following arguments to decide how the correlation coefficient is calculated:

"pearson" : Pearson’s product-moment correlation coefficient
"spearman" : Spearman’s rank correlation coefficient
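To get a feel for how the two methods differ, here is a small sketch on made-up data (the toy DataFrame below is an assumption for illustration, not one of the tutorial datasets). The variable y grows with x but not along a straight line, so the rank-based Spearman coefficient is exactly 1 while the Pearson coefficient stays below 1.

```python
import pandas as pd

# Made-up data: y increases with x, but the relationship is cubic rather than linear
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = toy["x"] ** 3

# Same DataFrame, two different ways of measuring correlation
pearson_matrix = toy.corr(method="pearson")
spearman_matrix = toy.corr(method="spearman")

# Pick out the coefficient between x and y from each matrix
pearson_xy = pearson_matrix.loc["x", "y"]
spearman_xy = spearman_matrix.loc["x", "y"]

print("pearson :", round(pearson_xy, 3))
print("spearman:", round(spearman_xy, 3))
```

Pearson measures how well a straight line fits the points, whereas Spearman only looks at whether the ordering of the values is preserved.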

Before we create the correlation matrix, we make a copy of the wine dataset that contains only the numerical input variables. We do this by dropping the categorical variable Wine with the pandas method DataFrame.drop(labels="", axis=1).

# creating a subset of Wine DataFrame with only numerical variables.  
wine_data_num = wine_data.drop(labels=["Wine"], axis=1)
# creating correlation matrix
corr_matrix = wine_data_num.corr(method="pearson")
# display correlation matrix
print(corr_matrix)

As you can see, although the information here is useful, it is difficult to infer anything from this matrix and analyse the information.

It would be very cumbersome to even recognise which variables are negatively correlated and which are positively correlated!

Heatmaps

To take care of this, we will use a visual tool called a heatmap, from the seaborn library.

The heatmap() function will create a two-dimensional graphical representation of data where the individual values that are contained in a matrix are mapped to colors.

A colourmap is used for this purpose, where a continuous spectrum of colours represents the numerical values of the correlation coefficient.

We will use this function to create a heatmap, by passing the correlation matrix as the argument. Let’s use the parameters linewidths, vmin and vmax.

sns.heatmap(corr_matrix, linewidths=0, vmin=None, vmax=None)

Parameters:

linewidths : draws lines separating every cell. The argument defines the width of the lines. Accepts float values; the default is 0.
vmin, vmax : values of the correlation coefficient used to anchor the colourmap. If nothing is passed, the limits are inferred from the matrix. Accepts float values. (For our purpose here, we want the colourmap to extend from -1 to +1.)

Since our correlation matrix is a DataFrame object, the heatmap function uses the Index/Column names to label the columns and rows.

Let us plot and see how it looks.

sns.heatmap(corr_matrix, linewidths=0.2, vmin=-1, vmax=1)
plt.show()

Observe the colour bar on the right side of the plot, which shows how the values are mapped to colours.

The X-axis and Y-axis labels have been arranged according to the column names from our dataset.

Notice the extremely light and extremely dark colours and recognise which variables have very high positive and negative correlations!

Analysis:

We can see that all the diagonal elements have extremely light colours. That’s because they contain the correlation coefficient of each variable with itself, so the value is 1.

We can also see light colours, which imply a high positive correlation, between:
OD and Flavanoids
Phenols and Flavanoids
We can also see dark colours, which imply a high negative correlation, between:
Nonflavanoid.phenols and Flavanoids
Hue and Malic.acid

Note: You can also see that the cells above and below the diagonal mirror each other. This is due to the symmetry of the correlation matrix: the correlation between $X_i$ and $X_j$ is the same as the correlation between $X_j$ and $X_i$, so each pair of labels from the X and Y axes appears on both sides of the diagonal.
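Because of this symmetry, one triangle of the matrix carries all the information. As a rough sketch (on made-up data, not the wine dataset), we can verify the symmetry numerically and hide the repeated cells using the mask parameter of heatmap(). The toy DataFrame and the variable names here are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data standing in for a numerical DataFrame such as wine_data_num
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
toy_corr = toy.corr(method="pearson")

# The matrix equals its own transpose, and each variable has correlation 1 with itself
is_symmetric = np.allclose(toy_corr.values, toy_corr.values.T)
diagonal_is_one = np.allclose(np.diag(toy_corr.values), 1.0)

# True marks the cells to hide: everything strictly above the diagonal
upper_mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

# Cells where the mask is True are left blank in the heatmap
ax = sns.heatmap(toy_corr, mask=upper_mask, linewidths=0.2, vmin=-1, vmax=1)
# In a notebook or interactive session, call plt.show() from matplotlib.pyplot here
```

With the upper triangle hidden, each pair of variables appears only once in the figure.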

Other parameters:

You can also use the parameter cmap to modify the plot.

This parameter controls the mapping from data values to colour space. For example, you can pass "RdYlGn" as an argument to get a Red-Yellow-Green colourmap.

sns.heatmap(corr_matrix, linewidths=0.2, vmin=-1, vmax=1, cmap="RdYlGn")
plt.show()

PairGrid plots

Although the correlation coefficient plays a central role in correlation analysis, the information it provides is often not enough.

As explained earlier, the correlation coefficient alone does not tell us much about the nature of the relationship between the variables.

There is a way to visualise the relationship between all the pairs of variables, using instances of the PairGrid class.

PairGrid allows us to draw multiple plots within a single figure. Thus we can draw scatterplots between each pair of variables within a single figure. An instance of the PairGrid class will be our multi-plot figure.

`pairplot()` function

We will draw these figures using the function pairplot() which returns a PairGrid instance.

The figure drawn consists of rows and columns of subplots, one for each pair of variables.

This is the syntax:

PairGrid_object = sns.pairplot(data, vars=column_features)

Here data refers to the DataFrame containing all the variables.
vars refers to the list of columns we want to show in our plot. We will use only 5 selected numerical features from the wine_data DataFrame, because using all 13 features would make the figure visually crowded and hinder our analysis.

Note: pairplot() function or PairGrid class can only handle tidy data, i.e., dataframe where each column is a variable and each row is an observation.
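If your data is not tidy, it usually has to be reshaped first. The sketch below uses a hypothetical "wide" table (the column names wine_id, alcohol_2018 and alcohol_2019 are invented for illustration) and reshapes it with the pandas melt() method so that each row is one observation and each column is one variable.

```python
import pandas as pd

# Hypothetical wide table: the same variable (alcohol) is spread over two columns
wide = pd.DataFrame({
    "wine_id": [1, 2, 3],
    "alcohol_2018": [12.5, 13.1, 12.9],
    "alcohol_2019": [12.7, 13.0, 13.2],
})

# Reshape so that each row is a single observation of a single wine
tidy = wide.melt(id_vars="wine_id", var_name="measurement", value_name="alcohol")

print(tidy)
```

After reshaping, the alcohol values live in one column, and the measurement column records where each value came from.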

Let us plot our figure!

# list of column names to use
c=["Alcohol", "Phenols", "Flavanoids", "Proanth", "OD"] 
# plot the figures 
ax = sns.pairplot(wine_data, vars=c)
plt.show()

Explanation

The variable name at the left of each row tells us which variable is on the Y-axis along that row.

The variable name at the bottom of each column tells us which variable is on the X-axis along that column.

As we can see, this figure gives a lot of insight into the data!

Diagonals

Notice that the diagonal and non-diagonal subplots look very different. This is because every non-diagonal subplot has a different variable on the X-axis and the Y-axis, so the scatterplots give meaningful visuals.

The diagonal subplots, on the other hand, have the same variable on the X-axis and the Y-axis. If we were to simply plot a variable against itself, all the points would fall on a straight line with a $45^\circ$ slope (a correlation coefficient of 1). So instead we plot the histogram of each variable along the diagonal.

You can also see that the subplots above and below the diagonal mirror each other: each pair of variables appears once on each side, with the X and Y axes swapped. This is due to the symmetry in the way the whole figure is designed.

Analysis

Pair plots help us draw meaningful conclusions from the data:

We can see from the first few plots that Alcohol seems to have a more or less symmetric distribution, and that it has very little correlation with any other variable on our plot.
From the second row of plots we can see that Phenols has a very strong positive correlation with Flavanoids, and a weaker positive correlation with Proanth and OD.

Other parameters

We can use the parameter hue to further improve the subplots.

The hue parameter takes in a categorical variable, and uses the information to plot the datapoints with different colours according to the labels in that variable.

For example, let us pass the target variable "Wine" as an argument to the hue parameter and see how it improves the plot.

# list of column names to use
c=["Alcohol", "Phenols", "Flavanoids", "Proanth", "OD"] 
ax = sns.pairplot(wine_data, vars=c, hue="Wine")
plt.show()

FacetGrid Plots

Let us now look at another interesting type of plot that helps us in correlation analysis.

We have seen how the hue parameter helps us analyse the relationship between a pair of variables, while demarcating the points according to a third (categorical) variable.

But what if we wanted to further distinguish the points with respect to other categorical variables in the data?

For example in the tips dataset, we might want to demarcate the points according to Sex (Male, Female) and Time (Lunch, Dinner) in the same figure!

Let’s construct plots which allow us to visualise the relationship between multiple variables separately within subsets of our dataset.

We will do so using plots created by the FacetGrid objects.

`relplot()` function

We will use the function relplot(), which creates and returns an instance of the FacetGrid class and plots the desired figure.

Unlike PairGrid plots, in FacetGrid plots, the horizontal and vertical axes denote the same variable in every subplot. In each subplot we can differentiate between the subsets of the data by passing arguments to relevant parameters.

This is the syntax:

FacetGrid_Object = sns.relplot(data=data, x="", y="", hue="")

Parameters:

In the data parameter, we pass the DataFrame as the argument.
In the x and y parameters, we pass the column labels from the DataFrame to be plotted on the x-axis and y-axis respectively.
In the hue parameter, we pass the name of the column whose values determine the colour of each point.

Let us use the relplot() function to explore the tips dataset. We will analyse the relationship between the total_bill and tip variables.

Let’s plot one subplot, where the colour (hue) is determined by the column day:

ax = sns.relplot(data=tips_data, x="total_bill", y="tip", hue="day")
plt.show()

As we can see, although the hue parameter gives more information, it is hard to understand much from this graph.

Let us see if we can separate the points according to the values from the variable time.

`col` parameter

Let us separately draw subplots, side by side in a row, each containing points according to labels from time.

For this we will use the parameter col and pass the argument "time".

ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", hue="day")

We will explain the parameters after the figure.

ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", hue="day")
plt.show()

Note: Since we are analysing the relationship between the variables total_bill and tip, the x-axis will denote total_bill and the y-axis will denote tip in all subplots drawn by the relplot() function.

Think of this as a table of subplots, with one row and two columns.

The subplot in the first column only has data points from “time=Dinner”, and the subplot in the second column only has data points from “time=Lunch”.
Thus each column shows a different facet, according to the variable passed to the col parameter.

`row` parameter

Let us see if we can improve it further.

Let us add more rows of subplots, where every row will contain data points according to values in the sex variable.

We will use a parameter called row and pass the name of the variable sex as an argument.

The rest of the syntax remains the same.

ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", row="sex", hue="day")
plt.show()

So in the row and col parameters, we pass the names of the variables according to which the grid is split into facets.
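To make the faceting concrete, here is a small sketch on made-up data (the toy DataFrame is an assumption that only borrows the column names of the tips dataset). It inspects the grid of subplots through the axes attribute of the returned FacetGrid: two values of sex and two values of time give a 2 by 2 grid.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import pandas as pd
import seaborn as sns

# Made-up data with the same column names as the tips dataset
toy = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0],
    "tip": [1.0, 2.0, 2.5, 3.0, 3.5, 5.0, 4.0, 6.0],
    "sex": ["Female", "Male"] * 4,
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"] * 2,
})

# One row of subplots per value of sex, one column per value of time
grid = sns.relplot(data=toy, x="total_bill", y="tip", col="time", row="sex")

# The FacetGrid stores its subplots in a 2D array of matplotlib Axes
n_rows, n_cols = grid.axes.shape
titles = [ax.get_title() for ax in grid.axes.flat]

print(n_rows, n_cols)
print(titles)
```

Each title names the subset of the data drawn in that subplot, which is a handy way to confirm which facet is which.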

Analysis:

We can see some interesting patterns here.

The correlation between tip and total_bill is quite strong when sex is Female and time is Lunch.
The lunch data comes almost entirely from Thursdays and Fridays.
The correlation between tip and total_bill seems to be stronger for the lunch data than for the dinner data, although we should do further analysis to verify this.
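One possible starting point for such further analysis is to compute the correlation coefficient separately within each subset. The sketch below does this with pandas groupby() on made-up numbers (the toy DataFrame is an assumption and its values say nothing about the real tips data); the same pattern could be applied to tips_data.

```python
import pandas as pd

# Made-up data: a perfectly linear relationship at lunch, no linear relationship at dinner
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0, 3.0, 1.0, 4.0, 2.0],
    "time": ["Lunch"] * 4 + ["Dinner"] * 4,
})

# Pearson correlation between total_bill and tip, computed within each time group
per_time = {
    name: group["total_bill"].corr(group["tip"])
    for name, group in toy.groupby("time")
}

for name, value in per_time.items():
    print(name, round(value, 3))
```

Grouping by more than one column, for example ["sex", "time"], would give one coefficient per facet of the figure.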

Summary

Let’s summarize the syntax for creating a correlation matrix and the various plots.

Correlation Matrix and Heatmap

Correlation Matrix:

correlation_matrix = DataFrame.corr(method="")

Heatmap:

sns.heatmap(corr_matrix, linewidths=float, vmin=float, vmax=float)

PairGrid Plots

Using the pairplot() function

ax = sns.pairplot(data, vars=list, hue="column")

FacetGrid Plots

Using the relplot() function

ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", row="sex", hue="day")
