Outlier treatment and Feature Engineering are the next steps after Missing Data Treatment.

Outliers Definition:

An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. In other words, it is an observation that deviates from the overall pattern of the sample.

Outliers can arise for multiple reasons, such as:

  • Data Entry Errors
  • Intermediate Data processing Errors
  • Instrument recording errors
  • Natural variation, i.e. an extreme but genuine value that does not denote an error

Outlier Detection and Treatment:

Outliers can be univariate or multivariate, i.e. detected on a single feature or on a combination of features respectively. Boxplots and scatter plots are an easy way of spotting outliers in a sample, and the Interquartile Range (IQR) is the most commonly used technique to flag them. Below is an example to understand the same:

Sample: 4, 7, 9, 11, 12, 20 (arranged in ascending order)

Divide the sample into two halves: the lower half is 4, 7, 9 and the upper half is 11, 12, 20.

Find the median of each half: 7 for the lower half and 12 for the upper half.

So Q1 = 7 and Q3 = 12

IQR = Q3 – Q1 = 12 – 7 = 5

Fences:  a = Q1 – (1.5*IQR) = 7 – 7.5 = –0.5;  b = Q3 + (1.5*IQR) = 12 + 7.5 = 19.5

Any number < a or > b is an outlier.

Hence in our sample 20 is an outlier.
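As a quick check, here is a minimal sketch of the same calculation in Python (assuming the sample has an even number of values so it splits cleanly into two halves, as above):

# IQR outlier check for the sample above
import numpy as np

sample = np.sort(np.array([4, 7, 9, 11, 12, 20]))
lower_half, upper_half = np.array_split(sample, 2)

q1, q3 = np.median(lower_half), np.median(upper_half)
iqr = q3 - q1
a, b = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sample[(sample < a) | (sample > b)]
print(q1, q3, iqr, a, b, outliers)  # 7.0 12.0 5.0 -0.5 19.5 [20]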

Outlier Treatment:

One way of getting rid of outliers is to remove the observation (i.e. delete the row). But by doing so we reduce the size of the sample or dataset, which may hurt modelling. A minimal sketch of this approach is shown below.
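Here, as an illustration only, the IQR fences are applied to a single column (assuming df is a pandas DataFrame with a numeric GrLivArea column; the same rule works for any continuous feature):

# Drop rows whose GrLivArea falls outside the IQR fences
q1, q3 = df['GrLivArea'].quantile([0.25, 0.75])
iqr = q3 - q1
within_fences = df['GrLivArea'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[within_fences]
print('Rows dropped:', (~within_fences).sum())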

Another commonly used technique is transforming or binning the variable. Transformation means taking the natural log of the values, which greatly reduces the variation and is especially useful when we have very extreme values. Binning, as the name suggests, classifies all the values into a set of defined bins (like 0-10; 10-20; 20-30 and so on). A small sketch of both is shown below.
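Both ideas on a toy series (the values and bin edges here are purely illustrative):

# Toy example of log transformation and binning
import numpy as np
import pandas as pd

values = pd.Series([3, 8, 14, 22, 95])  # 95 is an extreme value

log_values = np.log(values)             # compresses the long right tail

bins = [0, 10, 20, 30, 100]             # illustrative bin edges
labels = ['0-10', '10-20', '20-30', '30+']
binned = pd.cut(values, bins=bins, labels=labels)
print(binned.tolist())                  # ['0-10', '0-10', '10-20', '20-30', '30+']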

Feature Engineering:

This is the art of extracting more meaningful information from the existing data without adding anything new. In housing data, for example, the total square footage of a house is more relevant than the square footage of each individual floor, so deriving that total is Feature Engineering.

Feature Engineering should be performed after completing Exploratory Data Analysis (EDA), missing value treatment and outlier treatment.

Now we will see it in action using the same Ames housing dataset. I will be using the output of my missing value treatment script discussed in another post. You can find the script here.

Import Libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,10
import seaborn as sb

Read Dataset:

df = pd.read_csv('C:/../Ames_cleaned.csv')
df.shape

(2919, 70)
Let us try to find outliers in continuous variables like '1stFlrSF', '2ndFlrSF' and 'TotalBsmtSF':

sb.boxplot(data=df[['1stFlrSF','2ndFlrSF','TotalBsmtSF']])

Clearly there are some extreme values for 1stFlrSF and TotalBsmtSF. However, we will not treat the outliers just yet, because we will first do some Feature Engineering on these columns: we will combine them into TotalSF and see how the boxplot looks.

#Feature Engineering
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']
sb.boxplot(x=df['TotalSF'], orient='v')

Now let's look at GrLivArea and LotFrontage:

sb.boxplot(x=df['GrLivArea'], orient='v')
plt.show()
sb.boxplot(x=df['LotFrontage'], orient='v')
plt.show()

Clearly there are some extreme outliers in the data. We will use a log transformation to reduce the variation; this does not remove the outliers completely, but it pulls the extreme values much closer to the rest of the data.

Log transforming TotalSF, GrLivArea and LotFrontage:

#Log transforming
df['GrLivArea'] = np.log(df['GrLivArea'])
df['LotFrontage'] = np.log(df['LotFrontage'])
df['TotalSF'] = np.log(df['TotalSF'])

#plotting after transformation
sb.boxplot(x=df['GrLivArea'], orient='v')
plt.show()
sb.boxplot(x=df['LotFrontage'], orient='v')
plt.show()
sb.boxplot(x=df['TotalSF'], orient='v')

 

It is evident from the above plots that the extreme values are now far less pronounced.
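One side note: np.log is undefined at zero, so if a column being transformed could ever contain zero values, np.log1p (which computes log(1 + x)) is a safer drop-in replacement, e.g.:

# Drop-in alternative to the np.log calls above when zeros are possible
# df['LotFrontage'] = np.log1p(df['LotFrontage'])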

We can also combine BedroomAbvGr and TotRmsAbvGrd into TotalRooms. Following that, we will drop all the source variables that went into the engineered features (including the individual basement area columns, whose sum TotalBsmtSF is already folded into TotalSF).

df['TotalRooms'] = df['BedroomAbvGr'] + df['TotRmsAbvGrd']

# Drop Feature engineered columns
col_fe = ['1stFlrSF','2ndFlrSF','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','BedroomAbvGr','TotRmsAbvGrd']
df.drop(col_fe, axis=1, inplace=True)
print('Total Features after removing engineered features: ', (df.shape))

Total Features after removing engineered features: (2919, 64)

Split into train and test datasets

train = df.loc[df['source'] == 'train']
test = df.loc[df['source'] == 'test']
print(train.shape, test.shape)

(1460, 64) (1459, 64)

For the categorical variables we will find the percentage of the most frequent value, i.e. the share of the mode.

#identifying numerical and categorical features
num_feat = train.dtypes[train.dtypes != 'object'].index
print('Total of numeric features: ', len(num_feat))
cat_feat = train.dtypes[train.dtypes == 'object'].index
print('Total of categorical features: ', len(cat_feat))

# Highest value frequency percentage in categorical variables
for i in list(cat_feat):
    pct = df[i].value_counts().iloc[0] / len(df)
    print('Highest value Percentage of {}: {:.6f}'.format(i, pct))

Highest value Percentage of BldgType: 0.830764
Highest value Percentage of BsmtCond: 0.892771
Highest value Percentage of BsmtExposure: 0.652278
Highest value Percentage of BsmtFinType1: 0.318602
Highest value Percentage of BsmtFinType2: 0.881466
Highest value Percentage of BsmtQual: 0.439534
Highest value Percentage of CentralAir: 0.932854
Highest value Percentage of Condition1: 0.860226
Highest value Percentage of Condition2: 0.989723
Highest value Percentage of Electrical: 0.915382
Highest value Percentage of ExterCond: 0.869476
Highest value Percentage of ExterQual: 0.615964
Highest value Percentage of Exterior1st: 0.351490
Highest value Percentage of Exterior2nd: 0.347722
Highest value Percentage of FireplaceQu: 0.486468
Highest value Percentage of Foundation: 0.448099
Highest value Percentage of Functional: 0.931483
Highest value Percentage of GarageCond: 0.909215
Highest value Percentage of GarageFinish: 0.421377
Highest value Percentage of GarageQual: 0.892086
Highest value Percentage of GarageType: 0.590271
Highest value Percentage of Heating: 0.984584
Highest value Percentage of HeatingQC: 0.511477
Highest value Percentage of HouseStyle: 0.503940
Highest value Percentage of KitchenQual: 0.511477
Highest value Percentage of LandContour: 0.898253
Highest value Percentage of LandSlope: 0.951696
Highest value Percentage of LotConfig: 0.730730
Highest value Percentage of LotShape: 0.636862
Highest value Percentage of MSZoning: 0.777321
Highest value Percentage of MasVnrType: 0.606029
Highest value Percentage of Neighborhood: 0.151764
Highest value Percentage of PavedDrive: 0.904762
Highest value Percentage of RoofMatl: 0.985269
Highest value Percentage of RoofStyle: 0.791367
Highest value Percentage of SaleCondition: 0.822885
Highest value Percentage of SaleType: 0.865365
Highest value Percentage of Street: 0.995889
Highest value Percentage of Utilities: 0.999657
Highest value Percentage of source: 0.500171

We will drop the variables whose mode accounts for more than 80% of the values. These variables have very little variability and hence are unlikely to contribute significantly to the model.

# Drop columns which have frequency of value more than 80% of all values 
col_drop = ['BldgType','BsmtCond','BsmtFinType2','CentralAir','Condition1','Condition2','Electrical','ExterCond',
           'Functional','GarageCond','GarageQual','Heating','LandContour','LandSlope','PavedDrive','RoofMatl',
           'SaleCondition','SaleType','Street','Utilities']
df.drop(col_drop, axis=1, inplace=True)
print('Total features after dropping categorical features: ', df.shape)

Total features after dropping categorical features: (2919, 44)

With this we have covered some basics of Outlier Treatment and Feature Engineering. Feature Engineering is often the most critical step in determining the success of a model, so an in-depth understanding of the dataset's domain is a huge advantage when deriving the features needed to build a predictive model.

-Hari Mindi

