Correlation Analysis: Two Variables
May 23, 2019
Introduction
Correlation analysis can help us understand whether, and how strongly, a pair of variables are related.
In data science and machine learning, this can help us understand relationships between features/predictor variables and outcomes. It can also help us understand dependencies between different feature variables.
For example:
- How strong is the correlation between mental stress and cardiac issues?
- Is there a correlation between literacy rate and frequency of criminal activities?
This tutorial will help you learn the different techniques and approaches used to understand correlations that exist between features in any dataset.
Correlation coefficient
In correlation analysis, we estimate a sample correlation coefficient for a pair of variables. The correlation coefficient helps us understand how strongly one variable changes with respect to the other.
The correlation coefficient ranges from -1 to +1.
The graphs below show different pairs of variables with different correlation coefficient $\rho$:
Graphs of different variables with different Correlation Coefficients. [Source: Wikipedia]
The magnitude signifies the strength of relationship between the variables.
- So, a value of -1 or +1 indicates that the two variables are perfectly related.
- A value of 0 indicates that the variables have no relation.
The sign (positive / negative) signifies the direction of relationship:
- A positive value indicates that higher values of one variable are accompanied with higher values of the other variable
- A negative value indicates that higher values of one variable are accompanied with lower values of the other variable
Types of Correlation Coefficients
The two most common types of correlation coefficient are Pearson’s and Spearman’s.
Pearson’s correlation coefficient is a measure of linear correlation between two variables. The graphs you saw above show Pearson’s correlation coefficient.
Spearman’s rank correlation coefficient assesses monotonic relationships, whether they are linear or not. That is to say, it only measures whether an increase in one variable is consistently accompanied by an increase (or decrease) in the other.
Let us take an example to understand the difference:
Example of a non-linear monotonically related pair of variables. [Source: Wikipedia]
Here we can see that when there is an increase in X, there is always an increase in Y. However, the increase is not linear.
Thus Pearson’s correlation is 0.88 while Spearman’s correlation is a perfect 1.
Also note that correlation is a symmetric relationship. That is to say, if A correlates with B, then B correlates with A.
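As a quick numeric check of the difference, here is a small sketch (using a made-up monotonic but non-linear series, not one of the tutorial datasets) that computes both coefficients with pandas:
import numpy as np
import pandas as pd

# a monotonic but non-linear relationship: y always increases with x, but not linearly
x = pd.Series(np.arange(1, 21))
y = x ** 3

# Pearson measures linear correlation; Spearman measures monotonic correlation
print(x.corr(y, method="pearson"))   # less than 1, since the relationship is not linear
print(x.corr(y, method="spearman"))  # exactly 1.0, since the relationship is perfectly monotonic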
Correlation vs Causation
Before we proceed further, we should clarify that although correlation shows relationships between variables and helps in predictive analysis, it does not imply causation.
For example, the sale of sunglasses and the sale of ice cream are highly correlated. But an increase in ice cream sales does not cause an increase in the sales of sunglasses, or vice versa. They are correlated because they both depend on a third variable: how hot and sunny it is.
Limitations of Correlation Coefficient
Although correlation coefficients give us an idea about the strength of the relationship between two variables, it is not possible for a single number to give us the full picture.
Importance of visualisation
Since correlation coefficients can only give us limited information, visualisation can be of great aid in understanding the relationship between variables.
For example, the following graphs show the famous Anscombe’s Quartet. It consists of four data sets that have exactly the same correlation coefficient, even though the data sets are very different from one another.
Anscombe’s Quartet [Source: Wikipedia]
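If you would like to verify this yourself, seaborn ships a copy of Anscombe’s Quartet as a sample dataset (loading it fetches the data from seaborn’s data repository, so it needs internet access). A minimal sketch:
import seaborn as sns

# load Anscombe's Quartet from seaborn's sample data repository
anscombe = sns.load_dataset("anscombe")

# Pearson correlation of x and y within each of the four datasets
print(anscombe.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))
# all four correlations come out to roughly 0.816, despite the very different scatter plots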
Set-up
For the rest of this tutorial (and the next tutorial as well), we will see how to use visualization to understand the relationship between two variables. To do this, we will be using the seaborn Python library and a couple of datasets.
Importing
We will be using the standard data science libraries — NumPy, Pandas, Matplotlib and Seaborn. So let’s start by importing them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# change pandas display options
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)
We’ve also asked pandas to increase the display width to 200 characters, and the maximum number of columns it should display to 20. This will make sure we can view our data properly.
Datasets
We will be using the wine dataset and the tips dataset available on CommonLounge. We’ll load these datasets using the load_commonlounge_dataset() function, which returns a DataFrame.
Let’s first load the wine dataset (we’ll load a modified version for this tutorial):
wine_data = load_commonlounge_dataset('wine_v2')
Let’s also load the tips dataset:
tips_data = load_commonlounge_dataset('tips')
Below, we’ve included a brief description of these two datasets.
Wine Dataset
This dataset is the result of a chemical analysis of wines grown (in the same region) in Italy.
The following are the variables in the dataset:
Wine: This is the target variable to be predicted. It is a categorical variable divided into a set of three classes denoting three different types of wines. The classes are labelled as 1, 2 and 3.
All other attributes are continuous numerical variables:
- Alcohol: alcohol content
- Malic.acid: one of the principal organic acids
- Ash: inorganic matter left
- Acl: the alkalinity of ash
- Mg: magnesium
- Phenols: phenols
- Flavanoids: a particular type of phenol
- Nonflavanoid.phenols: a particular type of phenol
- Proanth: a particular type of phenol
- Color.int: color intensity
- Hue: hue of the wine
- OD: protein content measurements
- Proline: an amino acid
Let’s take a glimpse of the first few instances in the dataset using the head() function:
print(wine_data.head())
The data is taken from UCI Machine Learning Repository.
Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Tips Dataset
The tips dataset contains information about the tips collected at a restaurant.
Here’s a description of the attributes:
- tip: (continuous) Tip paid. The target variable to be predicted.
- total_bill: (continuous) Total bill
- sex: (categorical) Male / Female
- smoker: (categorical) Smoker or not
- day: (categorical) Day of the week
- time: (categorical) Dinner / Lunch
- size: (integer) Size of the party
This data is taken from Seaborn’s data repository.
Again, let’s take a glimpse of the first few instances in the dataset:
print(tips_data.head())
Questions dataset
We will also use a third dataset, but only for the exercises in this article. For now, we will load it and keep it aside.
This is a synthetic dataset composed of 5 feature variables called feature1, feature2, feature3, feature4 and feature5. It has two target variables called target1 and target2.
qns = load_commonlounge_dataset('corr_qns')
Correlation analysis between a pair of Numerical variables
To explore correlation between two numerical variables, we will use two different kinds of plots: Scatter plots and Hexbin plots.
The tips dataset has an output variable which is continuous, and some input variables which are also continuous. Let us plot them against each other and see if there is a trend.
Scatter Plots
We’ll start by plotting the tip variable against the total_bill variable.
To do so, we will use the jointplot() function from the seaborn library.
Here’s the syntax for the function:
sns.jointplot(xdata, ydata)
Let’s give it a try:
sns.jointplot(tips_data["total_bill"], tips_data["tip"])
plt.show()
As you can see, apart from plotting a scatter plot, the jointplot() function also plots the univariate distribution of x and y along their respective axes.
Analysis:
- We can infer from this graph that there is indeed a positive correlation between tip and total_bill. However, the correlation looks stronger for lower values of total_bill than for higher values.
- For lower values of total_bill (between about 0 and 20), tip rises steadily as total_bill increases.
- But as total_bill increases (between 30 and 50), the variance in tip increases. Some values of tip are high for a high total_bill, while some are very low. So the correlation in this range is not very strong.
- We can also see that total_bill is more skewed than tip, which is relatively more symmetric, centered roughly around 3 dollars.
These insights would have been hard to extract without visualisation.
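As a complement to the plot, we can also compute the Pearson correlation coefficient directly. The exact value depends on the copy of the dataset, but it should come out as a moderate positive correlation:
# Pearson correlation between total_bill and tip
print(tips_data["total_bill"].corr(tips_data["tip"]))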
Hexbin plots
As we saw in the graph above, the data-points in a scatter plot may overlap each other. This makes it harder to gauge the density of points in certain regions of the plot.
The hexbin plot helps overcome this problem.
Let us use the same variables to plot a hexbin plot. We will use the same jointplot() function, but this time we will pass the value "hex" to the kind parameter. Here’s the syntax:
sns.jointplot(xdata, ydata, kind="hex")
Note: The default argument for kind is "scatter". That is why it plots a scatter plot when no argument is passed for this parameter.
sns.jointplot(tips_data["total_bill"], tips_data["tip"], kind="hex")
plt.show()
Hexbin plots are the two-variable equivalent of histograms. The plot divides both the x and y axes into intervals, creating hexagonal bins. The darker the colour of a hexagon, the higher the number of data points in that bin.
Analysis
- As we can again see, there is a strong cluster of points at lower values of tip and total_bill, while the data is much more spread out at higher values.
- This means that the tip is very likely to be $2-$4 if the total_bill is between $10 and $20. But if the total_bill is between $40 and $50, the tip could range anywhere between $3 and $10, with all values being more or less equally likely.
Seaborn Syntax
So far for plotting, we have been using the following syntax:
sns.jointplot(xdata, ydata)
Since we have a dataframe with all the data, this usually translates to:
sns.jointplot(df["x column name"], df["y column name"])
When the x and y data are stored in the same DataFrame, seaborn also supports another syntax, which is:
sns.jointplot(x="x column name", y="y column name", data=df)
These two syntaxes are equivalent, and all the seaborn functions discussed in this tutorial support both.
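For example, with the tips data loaded earlier, the following two calls should produce the same plot (assuming a seaborn version that still accepts positional x and y data, as used throughout this tutorial):
# positional syntax: pass the two Series directly
sns.jointplot(tips_data["total_bill"], tips_data["tip"])

# column-name syntax: pass column names plus the DataFrame
sns.jointplot(x="total_bill", y="tip", data=tips_data)
plt.show()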
Correlation analysis between a Numerical and a Categorical variable
To explore correlation between a categorical and a numerical variable, we will use Strip plots, Swarm plots, Boxplots and Violin Plots. All of these plots will be drawn using functions from the seaborn library.
For this section, we will use both the datasets.
In the wine dataset, the output variable Wine is categorical, while all the input features are numerical.
In the tips dataset, the output variable tip is numerical, while several of the input variables are categorical.
Let us explore these cases visually and see if we can draw some inferences.
Strip Plots
Strip plots are similar to scatter plots, but for the situation when one of the variables is categorical.
We will be using the stripplot() function. The syntax is:
sns.stripplot(x="column name", y="column name", data=DataFrame)
Let us plot the output variable tip against the input variable time from tips_data.
sns.stripplot(x="time", y="tip", data=tips_data)
plt.show()
Although we can see the data points, they are not very clear because so many of them overlap. This makes it difficult to gauge the density of the data points.
Solution: We can fix this by using the jitter parameter.
jitter can take a float value specifying the amount of jitter, or a boolean value, where True or 1 applies a default amount of jitter and False or 0 means no jitter.
The default value of jitter is True.
Let’s re-draw the plot with the added parameter:
sns.stripplot(x="time", y="tip", data=tips_data, jitter=0.25)
plt.show()
The plot looks much better now.
Analysis
- Although the difference isn’t drastic, it looks like dinner tips on average are higher than lunch tips.
- We can also see that there are more data points for dinner than for lunch.
Strip Plot for wine dataset
Let us now plot the output variable Wine against the input variable Alcohol from wine_data.
sns.stripplot(x="Alcohol", y="Wine", data=wine_data, jitter=0.25)
plt.show()
We can see a big problem in the graph.
Since we plotted the categorical variable on the y-axis and our category labels are integers, the stripplot() function is not able to infer which variable is actually categorical, and what the orientation should be.
Solution: To fix this, we can use the orient parameter to explicitly tell the function to plot the graph horizontally.
Valid values for orient are "v" (orient vertically) and "h" (orient horizontally).
Let us re-draw the plot with the added parameter:
sns.stripplot(x="Alcohol", y="Wine", data=wine_data, orient="h", jitter=0.25)
plt.show()
That’s much better!
Analysis:
We can very clearly see that there is a relationship between Alcohol levels and the category of Wine.
- Wines of category 2 have much lower Alcohol levels (average around 12%).
- Wines of category 1 have the highest Alcohol levels (average around 13.5% - 14%).
- Wines of category 3 are somewhere in the middle (average Alcohol level around 13% - 13.5%).
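We can back up this reading with a quick numeric check (assuming the column names described above for the wine_v2 dataset):
# average Alcohol level for each Wine category
print(wine_data.groupby("Wine")["Alcohol"].mean())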
Swarm plots
One of the major problems with strip plots was the overlap of data points, which made it difficult to understand the density of points in some places.
Swarm plots are similar to strip plots, but the points are adjusted so that they don’t overlap.
We will be using the swarmplot() function. The syntax is:
sns.swarmplot(x="column name", y="column name", data=DataFrame)
Let us plot the variable tip against day from tips_data:
sns.swarmplot(x="day", y="tip", data=tips_data)
plt.show()
As we can see, the plot is much cleaner than the strip plot. The breadth of the swarm gives us a good sense of where most of the data points lie.
Analysis
We can draw the following inferences:
- The majority of the data seems to be from the weekend.
- On average, tips on weekends seem to be higher than tips on weekdays.
- We can see a lot of points on the same horizontal line. That’s probably because people like paying tips in round values such as $2, $3, $4, or $2.50, $3.50, and so forth.
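The first two observations are easy to verify numerically as well, for example:
# number of records for each day of the week
print(tips_data["day"].value_counts())

# average tip for each day of the week
print(tips_data.groupby("day")["tip"].mean())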
Boxplots
You should have already come across boxplots in a separate tutorial. But in case you skipped it, we’ve provided a short recap.
Here’s a sample boxplot for a toy data set having just 10 observations – 144, 147, 153, 154, 156, 157, 161, 164, 170, 181.
Box plot for the toy dataset
A box plot has the following components:
- 1st Quartile (25th percentile)
- Median (2nd Quartile or 50th percentile)
- 3rd Quartile (75th percentile)
- Interquartile Range (IQR – difference between 3rd and 1st Quartile)
- Whiskers — marks the lowest data point which lies within 1.5 IQR of the 1st quartile, and the highest data point which lies within 1.5 IQR of the 3rd quartile
- Outliers — any data points beyond 1.5 IQR of the 1st or 3rd quartile, i.e. values which are greater than 3rd Quartile + 1.5 * IQR or less than 1st Quartile – 1.5 * IQR.
In other words, whiskers extend from the quartiles to the rest of the distribution, except for points that are determined to be “outliers”.
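If you want to reproduce this recap in code, here is a small sketch for the toy data above (note that pandas’ quantile() uses linear interpolation, so the quartiles may differ slightly from hand-drawn textbook conventions):
# the toy dataset from the recap above
toy = pd.Series([144, 147, 153, 154, 156, 157, 161, 164, 170, 181])

# quartiles and interquartile range
q1, median, q3 = toy.quantile([0.25, 0.5, 0.75])
print(q1, median, q3, q3 - q1)

# draw the corresponding boxplot
sns.boxplot(x=toy)
plt.show()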
The syntax for plotting a boxplot is:
sns.boxplot(x="column name", y="column name", data=DataFrame, orient="orientation")
The orient parameter is used for exactly the same reason as we saw in the Strip plot section: to explicitly provide information about which variable is categorical.
Let us plot a graph between Proline and Wine from wine_data:
sns.boxplot(x="Proline", y="Wine", data=wine_data, orient="h")
plt.show()
Analysis:
As we can see, the plot gives us a good view of the relation between proline and wine category, without actually plotting the individual data points.
- For category 1 Wine, the median Proline level is higher than any value of Proline from category 2 and 3 Wine. Thus, high values of Proline directly imply that the wine is from category 1.
- For low levels of Proline, the overlap between category 2 and 3 is quite high, so we can’t say much conclusively about them from this plot.
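To back up the boxplot reading with numbers, we can also look at per-category summary statistics of Proline:
# count, mean, quartiles, min and max of Proline for each Wine category
print(wine_data.groupby("Wine")["Proline"].describe())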
Violinplots
Violinplots are similar to boxplots, but they have the added advantage that they also show the shape of the underlying distribution of the data.
Here’s what a sample violin plot looks like:
Note: A violin plot displays an estimate of the distribution of the data using a method called kernel density estimation. This estimation procedure is influenced by the sample size, and violins for relatively small samples might look misleadingly smooth.
The syntax is as follows:
sns.violinplot(x="column name", y="column name", data=DataFrame, orient="orientation")
Let us again plot the Proline variable against Wine, but this time as a violinplot:
sns.violinplot(x="Proline", y="Wine", data=wine_data, orient="h")
plt.show()
Analysis
- From this plot, we could make all the conclusions we made earlier from the boxplot.
- In addition, we can see the Proline levels for category 2 vs category 3 Wines much more clearly, and although there is an overlap, the modes are slightly separated.
Ending note
The orient parameter can be used with all the plots mentioned in this section: Strip plots, Swarm plots, Box plots and Violin plots.
Summary
- Correlation coefficient (ranges from -1 to +1) tells us about the relationship between two variables.
- Its magnitude signifies how strongly related the variables are.
- Its sign signifies the direction of relationship (positive vs negative).
- Scatterplots and hexbin plots visualize two numerical variables.
- Stripplots and swarmplots visualize a numerical variable with a categorical variable.
- Boxplots and violinplots also visualize a numerical variable with a categorical variable, but they don’t plot every data point, and instead display quartiles, density, etc.
- The orient parameter is used to explicitly specify which variable is categorical.
Reference
Two numerical variables:
Scatterplot:
sns.jointplot(xdata, ydata)
Hexbin plots:
sns.jointplot(xdata, ydata, kind="hex")
One numerical variable and one categorical variable:
Stripplot:
sns.stripplot(xdata, ydata, jitter=float, orient="h" or "v")
Swarmplot:
sns.swarmplot(xdata, ydata, orient="h")
Boxplot:
sns.boxplot(xdata, ydata, orient="h")
Violinplot:
sns.violinplot(xdata, ydata, orient="h")