Handling Duplicates, Missing Values and Outliers [Under Construction]
May 23, 2019
Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
Duplicate Observations
The data may contain duplicate observations. We have to analyze the data set to decide whether to drop duplicates or not.
In our data set, there is no chance of two observations (students) being identical on all variables such as Entrance Rank, Admn Yr, etc. Such duplicates occur due to data entry errors, and this type of duplicate must be dropped.
Alternatively, some data sets may contain duplicate values that reflect the pattern of the process under measurement. For example, the iris dataset in sklearn has 4 variables measuring the length and width of the petals and sepals of 150 iris flowers. This data set contains duplicates, which need not be removed.
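As a quick check, here is a minimal sketch of counting duplicate rows in the iris data set, assuming scikit-learn and pandas are available:
from sklearn.datasets import load_iris
import pandas as pd

# load the 150 iris measurements into a dataframe
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# count rows that duplicate an earlier row on all four variables
print('Duplicates in iris:', iris_df.duplicated().sum())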
To remove the duplicates in our data set, we use df.drop_duplicates(). The index of the dataframe has to be reset after dropping duplicates, which is done by reset_index().
print('No of duplicates present:', df.duplicated().sum())
df = df.drop_duplicates()
# reset index.
df = df.reset_index(drop=True)
print('Shape:', df.shape)
54 duplicates were present in the data set and have been removed, leaving 481 observations.
Missing Values
Missing values have to be addressed during data cleaning. Common methods of dealing with missing values are
- Deleting the observations with missing values.
- Imputing the missing values by median/mode, kNN, etc.
Checking for missing values
Let’s check whether our data suffers from missing values.
print('Data set shape :', df.shape)
#missing values
df.isnull().sum()
Yes, seven variables have missing values. The number of missing values is listed to the right of the corresponding variable in the output above. Missing values occur in 6 numerical variables and 1 categorical variable.
A backup copy of the dataframe df is saved as df_missing for later use.
#backup copy of df with missing values
df_missing = df.copy()
The methods to deal with missing values are explained below.
Delete observations with missing values
The simplest approach is to delete the observations having missing values. But the major disadvantage is that it reduces the power of the model as it reduces the sample size. If an observation has too many missing values across variables, then it’s wise to delete that observation, as it doesn’t add any value to the model.
print('Data set shape before dropping missing values :', df.shape)
# dropna() returns a new dataframe; df itself is left unchanged here
print('Data set shape after dropping missing values :', df.dropna().shape)
This step results in dropping 40 observations, which is close to 8% of our data size. That isn’t too bad, but it isn’t ideal either. Let’s try some other approaches.
Imputation by Median / Mode
We can impute the missing values of numerical variables by the median. For categorical variables, missing values are imputed by the mode. See the code below.
# fill missing values of numerical variables with the median
# (note: the result is not assigned back to df here)
df.fillna(df[numeric_labels].median())
# fill missing values of categorical variables with the mode
df.fillna(df[categoric_labels].mode().iloc[0])
Imputation by Median / Mode of Subgroup
We can impute with the median/mode of a subgroup with respect to a few selected variables. If the data values of a variable are known to depend upon some other variables in the dataset, this method of imputation is ideal. The code for imputing numerical variables by the median grouped by Seat Type, and categorical variables by the mode grouped by Seat Type, is:
df_temp = df.copy()
for var in numeric_labels:
    missing = df_temp[var].isnull()
    df_temp.loc[missing, [var]] = df_temp.groupby('Seat Type')[var].transform('median')
for var in categoric_labels:
    missing = df_temp[var].isnull()
    df_temp.loc[missing, [var]] = df_temp.groupby('Seat Type')[var].transform(lambda S: S.mode()[0])
df_temp.isnull().sum()
kNN Imputation
The missing values of a variable may be imputed using kNN. For each observation with a missing value, kNN finds its k nearest neighbors by some distance function and imputes a value derived from them. We will discuss kNN imputation in more detail after one hot encoding the categorical variables, which is a recommended step before kNN.
Recommended treatment for missing values
Our data set, too, suffers from missing values. Missing values exist in 6 numerical variables, namely Entrance Rank, Family Income, Matriculation, Physics, Chemistry and Maths. It is known that these variables are highly correlated with the variable Seat Type. Plotting boxplots grouped by Seat Type helps verify this assumption.
plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
sns.boxplot(x='Seat Type', y='Entrance Rank',data=df)
plt.subplot(3, 2, 2)
sns.boxplot(x='Seat Type', y='Matriculation',data=df)
plt.subplot(3, 2, 3)
sns.boxplot(x='Seat Type', y='Family Income',data=df)
plt.subplot(3, 2, 4)
sns.boxplot(x='Seat Type', y='Physics',data=df)
plt.subplot(3, 2, 5)
sns.boxplot(x='Seat Type', y='Chemistry',data=df)
plt.subplot(3, 2, 6)
sns.boxplot(x='Seat Type', y='Maths',data=df)
Therefore, for all numerical variables, missing values are imputed by the median grouped by Seat Type.
for var in numeric_labels:
    missing = df[var].isnull()
    df.loc[missing, [var]] = df.groupby('Seat Type')[var].transform('median')
The categorical variable Proposed Residence also contains missing values, which can be imputed by the mode grouped by Seat Type.
#categoric variable Proposed Residence imputed by mode of subgroup Seat Type
missing = df['Proposed Residence'].isnull()
df.loc[missing, ['Proposed Residence']] = df.groupby('Seat Type')['Proposed Residence'].transform(lambda S: S.mode()[0])
Final check
Yes, now we have completed the missing value analysis. Let’s check if we left anything out!
df.isnull().sum()
We can see that no missing values exist in our data set. All missing values have been addressed.
Outliers
An outlier is an observation that appears far away and diverges from the overall pattern of the data set. Outliers need close attention; otherwise they can result in wildly wrong estimations.
Let’s see this with a simple example. Suppose we have a sample data set of 10 numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
The mean for the above data is 5.5.
Now let’s replace the number 10 with 1000. So the new data set will be: 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000.
The mean is 104.5.
As you can see, a single outlier had a significant effect on the mean value, which results in a wrong estimate of the distribution. Outliers drastically influence the results of data analysis and statistical modeling, which has a negative impact on prediction.
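A minimal sketch verifying this arithmetic with numpy:
import numpy as np

data = np.arange(1, 11)  # 1, 2, ..., 10
print('Mean without outlier:', data.mean())  # 5.5

data_outlier = np.append(np.arange(1, 10), 1000)  # 1, ..., 9, 1000
print('Mean with outlier:', data_outlier.mean())  # 104.5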
Outliers are of two types.
- Artificial Outliers - Outliers that arise due to data entry, experimental or other errors.
- Natural Outliers - Outliers which are not because of any error, but which occur naturally and throw some light on the internal processes, and hence require close attention. These observations have to be segmented and analyzed separately.
Box plot is a handy tool for detection of outliers. See more about box plots here.
Let’s visualize the outliers by drawing boxplots.
plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
sns.boxplot(df['Entrance Rank'])
plt.subplot(3, 2, 2)
sns.boxplot(df['Matriculation'])
plt.subplot(3, 2, 3)
sns.boxplot(df['Family Income'])
plt.subplot(3, 2, 4)
sns.boxplot(df['Physics'])
plt.subplot(3, 2, 5)
sns.boxplot(df['Chemistry'])
plt.subplot(3, 2, 6)
sns.boxplot(df['Maths'])
We see that outliers exist for Matriculation, Physics, Chemistry and Maths. The code below finds the number of outliers in each of these variables.
for var in numeric_labels:
    print(var)
    q1, q3 = np.percentile(df.loc[:, var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df[var] > upper_bound
    below_lower = df[var] < lower_bound
    outliers = above_upper | below_lower
    print(outliers.sum())  # number of outliers
Therefore, Matriculation has 21 outliers, Physics has 6, Chemistry has 1 and Maths has 14.
The common techniques to deal with outliers are
Deleting Observations
We delete observations having outlier values if they are due to data entry or other errors (artificial outliers). Again, it depends on the size of the data set; for small data sets, this method is not recommended. However, the code for the same is:
print('Shape before dropping outliers:', df.shape)
outliers = pd.Series(False, index=df.index)  # start with no rows flagged
for var in numeric_labels:
    q1, q3 = np.percentile(df.loc[:, var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df[var] > upper_bound
    below_lower = df[var] < lower_bound
    outliers = outliers | above_upper | below_lower
print('No of rows with outliers:', outliers.sum())
# drop outliers (result not assigned back to df, since we keep them for later steps)
print('Shape after dropping outliers:', df.drop(df[outliers].index).shape)
There are 33 rows having outliers. When they are dropped, the number of observations reduces to 448. Alternatively, we can impute the outlier values.
Imputing
As in the case of missing values, outliers can also be imputed by median or kNN. Imputation is normally done for
- artificial outliers, or
- outliers which arise due to some random cause that has a negligible chance of repeating and doesn’t affect our analysis.
We can impute the numeric variables by the median. The following code shows imputing outliers by the median grouped by Seat Type.
df_temp = df.copy()
for var in numeric_labels:
    print(var)
    q1, q3 = np.percentile(df_temp.loc[:, var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df_temp[var] > upper_bound
    below_lower = df_temp[var] < lower_bound
    outliers = above_upper | below_lower
    print(outliers.sum())  # number of outliers
    print(df_temp.loc[outliers, [var]])  # before replacing
    df_temp.loc[outliers, [var]] = df_temp.groupby('Seat Type')[var].transform('median')
    print(df_temp.loc[outliers, [var]])  # after replacing
Segment and Analyze Separately
If the outliers present are natural outliers and are in significant number, one method is to treat them separately while building models. For example, we can split the data into two different groups and build individual models for each group.
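A minimal sketch of such a split; the rule flagging the segment below is made up purely for illustration and is not from the original analysis:
# illustrative rule only: flag the top 1% of Matriculation marks as a separate segment
segment_mask = df['Matriculation'] > df['Matriculation'].quantile(0.99)

# analyze / model each group separately
df_main = df[~segment_mask]
df_segment = df[segment_mask]
print('Main group:', df_main.shape, 'Segment:', df_segment.shape)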
Transforming and binning values
Another method to reduce the effect of outliers is to transform the variables. Taking the natural log of a value reduces the variation caused by extreme values.
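For example, a minimal sketch of a log transform; Family Income is used here only as an illustration of a variable that may have extreme values:
import numpy as np

df_temp = df.copy()
# log1p computes log(1 + x), so zero values are handled gracefully
df_temp['Family Income'] = np.log1p(df_temp['Family Income'])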
Binning is also a form of variable transformation. Here we assign the values to different bins determined by the class intervals. These variables can then be treated as categorical variables.
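A minimal sketch of binning with pd.cut; the bin edges and labels below are made-up values for illustration only:
import pandas as pd

df_temp = df.copy()
# illustrative class intervals and labels for Matriculation marks
bins = [0, 60, 80, 90, 100]
labels = ['low', 'medium', 'high', 'very high']
df_temp['Matriculation Band'] = pd.cut(df_temp['Matriculation'], bins=bins, labels=labels)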
No processing
Natural outliers which do not deviate much from the mainstream data can be left without any processing. We often use this approach.
Recommended method for Outliers
For our dataset, the outliers of Matriculation, Physics, Chemistry and Maths are natural outliers, and for our analysis it is recommended not to do any processing on them.
Categorical Variable Transformation
One hot encoding
Many machine learning algorithms cannot work with categorical data directly. They need to be converted to numerical representation before processing.
Each possible label of the categorical variable is converted into a dummy/indicator variable. For every observation, all dummy variables are assigned the value 0, except the one corresponding to the category the observation belongs to, which is assigned 1.
Take the example of encoding the categorical variable Seat Type. It has 9 different categories or labels, so a dataframe of nine dummy variables is created, each having one category label as its name. Every row will have exactly one 1, while all other entries are 0.
Then the original categorical variable in the data set is replaced with new encoded dummy/indicator variables. While replacing, one dummy variable is omitted intentionally to avoid collinearity.
# one hot encoding - all categorical variables at once
# create dummy variables to convert categorical into numeric values
dummies = pd.get_dummies(df[categoric_labels], prefix=categoric_labels, drop_first=True)
df.drop(categoric_labels, axis=1, inplace=True)
df = pd.concat([df, dummies], axis=1)
Observe the one hot encoded categorical variables. For each category except one, a new variable is created.
kNN Imputation of Missing Values
kNN finds the k nearest neighbors of each incomplete observation by some distance function and imputes the missing entries from their values. For kNN, all variables should be numeric. As we have one hot encoded the categorical variables and converted them to numeric, we can apply kNN imputation to the missing data. The code for imputing all missing values at once based on kNN is:
from fancyimpute import KNN
print('Missing values:\n', df_missing.isnull().sum())
# one hot encoding
dummies = pd.get_dummies(df_missing[categoric_labels], prefix=categoric_labels, drop_first=True)
df_missing.drop(categoric_labels, axis=1, inplace=True)
df_missing = pd.concat([df_missing, dummies], axis=1)
# fit and impute all missing values in one step
df_missing.loc[:, :] = KNN(k=5).fit_transform(df_missing)
print('Missing values:\n', df_missing.isnull().sum())