Data Transformations [Under Construction]
May 23, 2019
Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
As part of data cleaning, we often need to transform data into a more usable form for data science and machine learning.
We will start by discussing data transformation methods for numerical variables first, and then move on to categorical variables.
Let’s get started!
Set-up
We will need the pandas, NumPy, and Matplotlib libraries for this tutorial, so let us load them at the beginning.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In this tutorial, we’ll be looking at some tables. So let’s also ask pandas to increase the display width to 120 characters, and the maximum number of columns it should display to 10:
pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 10)
Finally, we will be using the student_transformations dataset to illustrate the various concepts in this tutorial. Let’s load the dataset:
# load the dataset and see first few rows
df = load_commonlounge_dataset('student_transformations')
print(df.head())
As in the previous tutorial, we started by importing the libraries, changing the pandas display options, and loading the dataset. Now we are ready to transform the data.
Numerical Variables
Let’s address numerical variable transformations first.
Skewed distributions: Log transformation
Often in the real world, we encounter exponentially distributed data. Exponential distributions are extremely skewed, and many machine learning models don’t work well with extremely skewed data.
One way to fix this is to take the log of all values for that variable. In our dataset, the variables Entrance Rank and Family Income have skewed distributions. Let’s verify that using histograms:
# histograms of Entrance Rank and Family Income before the log transformation
plt.subplots_adjust(hspace=1)
plt.subplot(1, 2, 1)
plt.hist(df['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(1, 2, 2)
plt.hist(df['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.show()
Performing the log transformation on these variables reduces the skewness. We’ll use the np.log() function to do this:
df['Family Income'] = np.log(df['Family Income'])
df['Entrance Rank'] = np.log(df['Entrance Rank'])
print(df.head())
Notice the changed values of both variables.
Let’s visualize the new distributions by plotting histograms.
# histograms of the two variables after the log transformation
plt.subplots_adjust(hspace=1)
plt.subplot(1, 2, 1)
plt.hist(df['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(1, 2, 2)
plt.hist(df['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.show()
Now, the data for these two variables looks much closer to a normal distribution, which many machine learning models handle far better than heavily skewed data.
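If you want a numeric check rather than a visual one, pandas provides a skew() method that reports the skewness of each column. As a quick sketch, after the log transformation the values should be much closer to 0:
# skewness after the log transformation (values near 0 indicate a roughly symmetric distribution)
print(df[['Entrance Rank', 'Family Income']].skew())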
Data Re-scaling
Many algorithms require numerical data to be re-scaled before fitting the model. Here are some reasons why data re-scaling is important:
- It makes the results meaningful (especially for distance based algorithms).
- It ensures variables with high magnitude don’t dominate the data analysis, models and predictions.
- Many machine learning algorithms converge faster on re-scaled data.
The two most common methods for re-scaling are normalization and standardization. Let’s see them one by one.
Normalization
In normalization, the data is re-scaled to be in the range 0 to 1. So after rescaling, the minimum value of the variable is 0 and maximum value is 1.
The formula for normalizing is:
$$ x_{new} = \frac{x - min}{max - min} $$

Before we do normalization, let’s make a copy of our data and take a look at the first few rows.
df_temp = df.copy()
print(df.head())
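Before applying it to the whole table, here is a minimal sketch of what the formula does to a single column (using the Maths column from our dataset; nothing is stored back into the dataframe):
# apply the min-max formula by hand to one column
maths = df_temp['Maths']
print(((maths - maths.min()) / (maths.max() - maths.min())).head())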
Now, let’s perform the normalization on all the numeric columns at once using scikit-learn.
from sklearn import preprocessing

# numeric columns in our dataset
numeric_labels = ['Entrance Rank', 'Matriculation', 'Family Income', 'Physics', 'Chemistry', 'Maths']
# normalize the numeric variables on the copy (df_temp), so the original df stays unchanged
df_temp.loc[:, numeric_labels] = preprocessing.MinMaxScaler().fit_transform(df_temp.loc[:, numeric_labels])
# first five observations
print(df_temp.iloc[:, :-1].head())
Variables with different scales and units can be compared easily after normalization. Let’s plot histograms to visualize the effect of normalization.
# histograms of the normalized numeric variables
plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
plt.hist(df_temp['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(3, 2, 2)
plt.hist(df_temp['Matriculation'], ec='black', bins=20)
plt.title('Matriculation')
plt.subplot(3, 2, 3)
plt.hist(df_temp['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.subplot(3, 2, 4)
plt.hist(df_temp['Physics'], ec='black', bins=20)
plt.title('Physics')
plt.subplot(3, 2, 5)
plt.hist(df_temp['Chemistry'], ec='black', bins=20)
plt.title('Chemistry')
plt.subplot(3, 2, 6)
plt.hist(df_temp['Maths'], ec='black', bins=20)
plt.title('Maths')
plt.show()
All numerical variables now have values in the range [0, 1].
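As a quick sanity check (a minimal sketch using the numeric_labels list defined above), the minimum of every rescaled column should be 0 and the maximum should be 1:
# verify the normalized range of each numeric column
print(df_temp[numeric_labels].min())
print(df_temp[numeric_labels].max())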
Standardization
Here the data (x) is re-scaled so that the mean becomes 0 (μ = 0) and the standard deviation becomes 1 (σ = 1). The standardization formula is:

$$ x_{new} = \frac{x - \mu}{\sigma} $$
Consider our data above, with Entrance Rank, Family Income, and marks in various subjects, all of which have different scales and units. Once we have standardized these variables, we can compare them and use them in our models, and many algorithms perform better (or converge faster) on standardized data.
Here is the Python code for standardizing all the numerical variables with scikit-learn:
# standardize the numeric variables (this time on the original df)
from sklearn import preprocessing
df.loc[:, numeric_labels] = preprocessing.scale(df.loc[:, numeric_labels])
# first five observations
print(df.iloc[:, :-1].head())
Let’s plot the histograms again to see the effect of standardization:
# histograms of the standardized numeric variables
plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
plt.hist(df['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(3, 2, 2)
plt.hist(df['Matriculation'], ec='black', bins=20)
plt.title('Matriculation')
plt.subplot(3, 2, 3)
plt.hist(df['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.subplot(3, 2, 4)
plt.hist(df['Physics'], ec='black', bins=20)
plt.title('Physics')
plt.subplot(3, 2, 5)
plt.hist(df['Chemistry'], ec='black', bins=20)
plt.title('Chemistry')
plt.subplot(3, 2, 6)
plt.hist(df['Maths'], ec='black', bins=20)
plt.title('Maths')
plt.show()
We have implemented standardization for all our numerical variables.
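To confirm the effect, the mean of every standardized column should be approximately 0 and the standard deviation approximately 1 (a quick sketch):
# check the mean and standard deviation of the standardized columns
print(df[numeric_labels].mean().round(2))
print(df[numeric_labels].std().round(2))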
Standardization vs Normalization
The rule of thumb for choosing which method to use for data scaling is:
- Normalize when the variables have different units.
- Standardize when a variable has a Gaussian distribution.
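In scikit-learn, these two choices correspond to the MinMaxScaler and StandardScaler classes. Here is a minimal sketch comparing them on a small hypothetical column of values (not from our dataset):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0], [100.0]])  # hypothetical values, just for illustration
print(MinMaxScaler().fit_transform(x).ravel())    # rescaled into the range [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # mean 0, standard deviation 1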
Categorical Variable Transformation
One hot encoding
Many machine learning algorithms cannot work with categorical data directly. Categorical variables need to be converted to a numerical representation before processing.
In one hot encoding, each possible label of the categorical variable becomes its own dummy/indicator variable. For every row, all of the dummy variables are 0 except the one corresponding to the category that row belongs to, which is set to 1.
Take the example of encoding the categorical variable SeatType, which has 9 different categories (labels). Nine dummy variables are created, each named after one of the category labels. In every row, exactly one of them is 1 and the rest are 0.
Then the original categorical variable in the dataset is replaced with the new dummy/indicator variables. While replacing, one dummy variable is intentionally dropped to avoid collinearity.
# one hot encoding - all categorical variables at once
# categoric_labels lists the categorical columns; we assume SeatType is the only one here
categoric_labels = ['SeatType']
# create dummy/indicator variables, dropping the first category of each to avoid collinearity
dummies = pd.get_dummies(df[categoric_labels], prefix=categoric_labels, drop_first=True)
# replace the original categorical columns with the encoded ones
df.drop(categoric_labels, axis=1, inplace=True)
df = pd.concat([df, dummies], axis=1)
Observe the one hot encoded categorical variables: for each category except one, a new indicator variable has been created.
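To see the result, we can print the first few rows again; the SeatType column has been replaced by 0/1 indicator columns:
# the original categorical column is gone, replaced by dummy/indicator columns
print(df.head())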
Summary
- Log => np.log (or sklearn’s FunctionTransformer / PowerTransformer)
- Scaling => (MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler)
- One-hot encoding => (OneHotEncoder / get_dummies)
- Binarize / Bin => Numerical thresholding (Binarizer, KBinsDiscretizer)
- Normalization => (Normalizer) — full row has unit norm
- Sklearn preprocessing functions vs classes (see the sketch below)
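On the last point, here is a minimal sketch of the difference between scikit-learn’s preprocessing functions and classes: a function such as preprocessing.scale() transforms the data in one shot, while a class such as StandardScaler remembers the fitted parameters so the same transformation can later be applied to new data.
from sklearn import preprocessing
import numpy as np

x_train = np.array([[1.0], [2.0], [3.0]])
x_new = np.array([[4.0]])

# function: one-shot transformation, nothing is remembered
print(preprocessing.scale(x_train))

# class: fit once on the training data, then re-use the learned mean/std on new data
scaler = preprocessing.StandardScaler().fit(x_train)
print(scaler.transform(x_train))
print(scaler.transform(x_new))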