Handling Duplicates, Missing Values and Outliers [Under Construction]
May 23, 2019
Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
Duplicate Observations
The data may contain duplicate observations. We have to analyze the data set to decide whether to drop duplicates or not.
In our data set, there is no chance of two observations (students) being identical on all variables such as Entrance Rank, Admn Yr, etc. Such duplicates occur due to data entry errors, and this type of duplicate must be dropped.
Alternatively, some data sets may contain duplicate values that reflect the pattern of the process under measurement. For example, the iris dataset in sklearn has 4 variables measuring the length and width of the petals and sepals of 150 iris flowers. This data set contains duplicates, which need not be removed.
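As a quick check, here is a minimal sketch of counting duplicate rows in the iris data set, assuming scikit-learn and pandas are available:
from sklearn.datasets import load_iris
import pandas as pd

# load the 150 iris measurements into a dataframe
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# count rows that duplicate an earlier row on all four variables
print('Duplicates in iris:', iris_df.duplicated().sum())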
To remove the duplicates in our data set, we use df.drop_duplicates(). The index of the dataframe has to be reset after dropping duplicates, which is done by reset_index().
print('No of duplicates present:', df.duplicated().sum())
df = df.drop_duplicates()
# reset index.
df = df.reset_index(drop=True)
print('Shape:', df.shape)
54 duplicates were present in the data set and have been removed, leaving 481 observations.
Missing Values
Missing values have to be addressed during data cleaning. Common methods of dealing with missing values are
- Deleting the observations with missing values.
- Imputing the missing values by median/mode, kNN, etc.
Checking for missing values
Let’s check whether our data suffers from missing values.
print('Data set shape :', df.shape)
#missing values
df.isnull().sum()
Yes, seven variables have missing values. The number of missing values is listed to the right of the corresponding variable in the output above. Missing values occur in 6 numerical variables and 1 categorical variable.
A backup copy of the dataframe df is saved as df_missing for later use.
#backup copy of df with missing values
df_missing = df.copy()
The methods to deal with missing values are explained below.
Delete observations with missing values
The simplest approach is to delete the observations having missing values. But the major disadvantage is that it reduces the power of the model as it reduces the sample size. If an observation has too many missing values across variables, then it’s wise to delete that observation, as it doesn’t add any value to the model.
print('Data set shape before dropping missing values :', df.shape)
# dropna() returns a new dataframe; df itself is left unchanged here
print('Data set shape after dropping missing values :', df.dropna().shape)
This step results in dropping 40 observations, which is close to 8% of our data size. That isn’t too bad, but it isn’t ideal either. Let’s try some other approaches.
Imputation by Median / Mode
We can impute the missing values of numerical variables by the median. For categorical variables, missing values are imputed by the mode. See the code below.
# fill missing values of numerical variables with the median
# (note: the result is not assigned back to df here)
df.fillna(df[numeric_labels].median())
# fill missing values of categorical variables with the mode
df.fillna(df[categoric_labels].mode().iloc[0])
Imputation by Median / Mode of Subgroup
We can impute with the median/mode of a subgroup with respect to a few selected variables. If the data values of a variable are known to depend upon some other variables in the dataset, this method of imputation is ideal. The code for imputing numerical variables by the median grouped by Seat Type, and categorical variables by the mode grouped by Seat Type, is:
df_temp = df.copy()
for var in numeric_labels:
    missing = df_temp[var].isnull()
    df_temp.loc[missing, [var]] = df_temp.groupby('Seat Type')[var].transform('median')
for var in categoric_labels:
    missing = df_temp[var].isnull()
    df_temp.loc[missing, [var]] = df_temp.groupby('Seat Type')[var].transform(lambda S: S.mode()[0])
df_temp.isnull().sum()
kNN Imputation
The missing values of a variable may be imputed using kNN. For each observation with a missing value, kNN finds its k nearest neighbors by some distance function and imputes a value derived from them. We will discuss kNN imputation in more detail after one hot encoding the categorical variables, which is a recommended step before kNN.
Recommended treatment for missing values
Our data set, too, suffers from missing values. Missing values exist in 6 numerical variables, namely Entrance Rank, Family Income, Matriculation, Physics, Chemistry and Maths. It is known that these variables are highly correlated with the variable Seat Type. Plotting boxplots grouped by Seat Type helps verify this assumption.
plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
sns.boxplot(x='Seat Type', y='Entrance Rank',data=df)
plt.subplot(3, 2, 2)
sns.boxplot(x='Seat Type', y='Matriculation',data=df)
plt.subplot(3, 2, 3)
sns.boxplot(x='Seat Type', y='Family Income',data=df)
plt.subplot(3, 2, 4)
sns.boxplot(x='Seat Type', y='Physics',data=df)
plt.subplot(3, 2, 5)
sns.boxplot(x='Seat Type', y='Chemistry',data=df)
plt.subplot(3, 2, 6)
sns.boxplot(x='Seat Type', y='Maths',data=df)
Therefore, for all numerical variables, missing values are imputed by the median grouped by Seat Type.
for var in numeric_labels:
    missing = df[var].isnull()
    df.loc[missing, [var]] = df.groupby('Seat Type')[var].transform('median')
The categorical variable Proposed Residence also contains missing values, which can be imputed by the mode grouped by Seat Type.
#categoric variable Proposed Residence imputed by mode of subgroup Seat Type
missing = df['Proposed Residence'].isnull()
df.loc[missing, ['Proposed Residence']] = df.groupby('Seat Type')['Proposed Residence'].transform(lambda S: S.mode()[0])
Final check
Yes, now we have completed the missing value analysis. Let’s check if we left anything out!
df.isnull().sum()
We can see that no missing values exist in our data set. All missing values have been addressed.
Outliers
An outlier is an observation that appears far away and diverges from the overall pattern of the data set. Outliers need close attention; otherwise they can result in wildly wrong estimations.
Let’s see this with a simple example. Suppose we have a sample data set of 10 numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
The mean for the above data is 5.5.
Now let’s replace the number 10 with 1000. So the new data set will be: 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000.
The mean is 104.5.
As you can see, a single outlier had a significant effect on the mean value, which results in a wrong estimate of the distribution. Outliers drastically influence the results of data analysis and statistical modeling, which has a negative impact on prediction.
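A minimal sketch verifying this arithmetic with numpy:
import numpy as np

data = np.arange(1, 11)  # 1, 2, ..., 10
print('Mean without outlier:', data.mean())  # 5.5

data_outlier = np.append(np.arange(1, 10), 1000)  # 1, ..., 9, 1000
print('Mean with outlier:', data_outlier.mean())  # 104.5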
Outliers are of two types.
- Artificial Outliers - Outliers that arise due to data entry, experimental or other errors.
- Natural Outliers - Outliers which are not because of any error, but which occur naturally and throw some light on the internal processes, and hence require close attention. These observations have to be segmented and analyzed separately.
Box plot is a handy tool for detection of outliers. See more about box plots here.
Let’s visualize the outliers by drawing boxplots.
plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
sns.boxplot(df['Entrance Rank'])
plt.subplot(3, 2, 2)
sns.boxplot(df['Matriculation'])
plt.subplot(3, 2, 3)
sns.boxplot(df['Family Income'])
plt.subplot(3, 2, 4)
sns.boxplot(df['Physics'])
plt.subplot(3, 2, 5)
sns.boxplot(df['Chemistry'])
plt.subplot(3, 2, 6)
sns.boxplot(df['Maths'])
We see that outliers exist for Matriculation, Physics, Chemistry and Maths. The code below finds the number of outliers in each of these variables.
for var in numeric_labels:
    print(var)
    q1, q3 = np.percentile(df.loc[:, var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df[var] > upper_bound
    below_lower = df[var] < lower_bound
    outliers = above_upper | below_lower
    print(outliers.sum())  # number of outliers
Therefore, Matriculation has 21 outliers, Physics has 6, Chemistry has 1 and Maths has 14.
The common techniques to deal with outliers are
Deleting Observations
We delete observations having outlier values if they are due to data entry or other errors (artificial outliers). Again, it depends on the size of the data set; for small data sets, this method is not recommended. However, the code for the same is:
print('Shape before dropping outliers:', df.shape)
outliers = pd.Series(False, index=df.index)  # start with no rows flagged
for var in numeric_labels:
    q1, q3 = np.percentile(df.loc[:, var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df[var] > upper_bound
    below_lower = df[var] < lower_bound
    outliers = outliers | above_upper | below_lower
print('No of rows with outliers:', outliers.sum())
# drop outliers (result not assigned back to df, since we keep them for later steps)
print('Shape after dropping outliers:', df.drop(df[outliers].index).shape)
There are 33 rows having outliers. When they are dropped, the number of observations reduces to 448. Alternatively, we can impute the outlier values.
Imputing
As in the case of missing values, outliers can also be imputed by median or kNN. Imputation is normally done for
- artificial outliers, or
- outliers which arise due to some random cause that has a negligible chance of repeating and doesn’t affect our analysis.
We can impute the numeric variables by the median. The following code shows imputing outliers by the median grouped by Seat Type.
df_temp = df.copy()
for var in numeric_labels:
    print(var)
    q1, q3 = np.percentile(df_temp.loc[:, var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df_temp[var] > upper_bound
    below_lower = df_temp[var] < lower_bound
    outliers = above_upper | below_lower
    print(outliers.sum())  # number of outliers
    print(df_temp.loc[outliers, [var]])  # before replacing
    df_temp.loc[outliers, [var]] = df_temp.groupby('Seat Type')[var].transform('median')
    print(df_temp.loc[outliers, [var]])  # after replacing
Segment and Analyze Separately
If the outliers present are natural outliers and are in significant number, one method is to treat them separately while building models. For example, we can split the data into two different groups and build individual models for each group.
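A minimal sketch of such a split; the rule flagging the segment below is made up purely for illustration and is not from the original analysis:
# illustrative rule only: flag the top 1% of Matriculation marks as a separate segment
segment_mask = df['Matriculation'] > df['Matriculation'].quantile(0.99)

# analyze / model each group separately
df_main = df[~segment_mask]
df_segment = df[segment_mask]
print('Main group:', df_main.shape, 'Segment:', df_segment.shape)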
Transforming and binning values
Another method to reduce the effect of outliers is to transform the variables. Taking the natural log of a value reduces the variation caused by extreme values.
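For example, a minimal sketch of a log transform; Family Income is used here only as an illustration of a variable that may have extreme values:
import numpy as np

df_temp = df.copy()
# log1p computes log(1 + x), so zero values are handled gracefully
df_temp['Family Income'] = np.log1p(df_temp['Family Income'])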
Binning is also a form of variable transformation. Here we assign the values to different bins determined by the class intervals. These variables can then be treated as categorical variables.
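A minimal sketch of binning with pd.cut; the bin edges and labels below are made-up values for illustration only:
import pandas as pd

df_temp = df.copy()
# illustrative class intervals and labels for Matriculation marks
bins = [0, 60, 80, 90, 100]
labels = ['low', 'medium', 'high', 'very high']
df_temp['Matriculation Band'] = pd.cut(df_temp['Matriculation'], bins=bins, labels=labels)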
No processing
Natural outliers which do not deviate much from the mainstream data can be left without any processing. We often use this approach.
Recommended method for Outliers
For our dataset, the outliers of Matriculation, Physics, Chemistry and Maths are natural outliers, and for our analysis it is recommended not to do any processing on them.
Categorical Variable Transformation
One hot encoding
Many machine learning algorithms cannot work with categorical data directly. They need to be converted to numerical representation before processing.
Each possible label of the categorical variable is converted into a dummy/indicator variable. For every observation, all dummy variables are assigned the value 0, except the one corresponding to the category the observation belongs to, which is assigned 1.
Take the example of encoding the categorical variable Seat Type. It has 9 different categories or labels, so a dataframe of nine dummy variables is created, each having one category label as its name. Every row will have exactly one 1, while all other entries are 0.
Then the original categorical variable in the data set is replaced with new encoded dummy/indicator variables. While replacing, one dummy variable is omitted intentionally to avoid collinearity.
# one hot encoding - all categorical variables at once
# create dummy variables to convert categorical into numeric values
dummies = pd.get_dummies(df[categoric_labels], prefix=categoric_labels, drop_first=True)
df.drop(categoric_labels, axis=1, inplace=True)
df = pd.concat([df, dummies], axis=1)
Observe the one hot encoded categorical variables. For each category except one, a new variable is created.
kNN Imputation of Missing Values
kNN finds the k nearest neighbors of each incomplete observation by some distance function and imputes the missing entries from their values. For kNN, all variables should be numeric. As we have one hot encoded the categorical variables and converted them to numeric, we can apply kNN imputation to the missing data. The code for imputing all missing values at once based on kNN is:
from fancyimpute import KNN
print('Missing values:\n', df_missing.isnull().sum())
# one hot encoding
dummies = pd.get_dummies(df_missing[categoric_labels], prefix=categoric_labels, drop_first=True)
df_missing.drop(categoric_labels, axis=1, inplace=True)
df_missing = pd.concat([df_missing, dummies], axis=1)
# fit and impute all missing values in one step
df_missing.loc[:, :] = KNN(k=5).fit_transform(df_missing)
print('Missing values:\n', df_missing.isnull().sum())