Outlier treatment and feature engineering are the next steps after missing data treatment.

Outliers Definition:

An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. In other words, it is an observation that deviates from the overall pattern of the sample.

Outliers can arise for several reasons, such as:

• Data Entry Errors
• Intermediate Data processing Errors
• Instrument recording errors
• Natural variation (a genuine extreme value that is not an error)

Outlier Detection and Treatment:

Outliers can be univariate or multivariate, meaning they occur in a single feature or in a combination of features, respectively. Box plots and scatter plots are an easy way to spot outliers in a sample. The interquartile range (IQR) is the most commonly used technique for flagging an outlier. The example below illustrates it:

Sample: 4, 7, 9, 11, 12, 20 (arranged in ascending order)

Divide the sample into two halves: the lower half is 4, 7, 9 and the upper half is 11, 12, 20

Find the median of each half, which is 7 and 12 respectively

So Q1 = 7 and Q3 = 12

IQR = Q3 – Q1 = 12 – 7 = 5

Outlier fences: a = Q1 – (1.5*IQR) = 7 – 7.5 = –0.5; b = Q3 + (1.5*IQR) = 12 + 7.5 = 19.5

Any number < a or > b is an outlier.

Hence, in our sample, 20 is an outlier because it exceeds b = 19.5.
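The worked example above can be reproduced in a few lines of Python. This is a minimal sketch that uses the same median-of-halves rule for the quartiles; the function name `iqr_fences` and the parameter `k` are illustrative, not part of any library. Note that library routines such as `numpy.percentile` interpolate by default and may return slightly different quartiles for small samples.

```python
from statistics import median

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences using the median-of-halves quartile rule."""
    data = sorted(values)
    half = len(data) // 2
    # For an odd-sized sample the middle value is left out of both halves
    lower_half, upper_half = data[:half], data[-half:]
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

sample = [4, 7, 9, 11, 12, 20]
low, high = iqr_fences(sample)
outliers = [v for v in sample if v < low or v > high]
print(low, high, outliers)
```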

Outlier Treatment:

One way of dealing with outliers is to remove the affected observations (that is, delete the rows). However, doing so reduces the size of the sample or dataset, which can hurt modelling.
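Row removal might look like the sketch below, which assumes the IQR fences are used as the cut-off. The helper `drop_iqr_outliers` and the toy `area` column are made up for illustration. pandas' `quantile` interpolates between observations, so its quartiles can differ slightly from the median-of-halves method.

```python
import pandas as pd

def drop_iqr_outliers(frame, column, k=1.5):
    """Return a copy of frame without rows whose column value lies outside the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep].copy()

# Toy data (hypothetical): one value is far larger than the rest
toy = pd.DataFrame({'area': [900, 1000, 1100, 1200, 1300, 9000]})
trimmed = drop_iqr_outliers(toy, 'area')
print(len(toy), len(trimmed))
```

The drop in row count printed at the end is exactly the cost mentioned above: every removed outlier is one less observation to model with.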

Another commonly used technique is to transform or bin the variable. A common transformation is the natural log of the value, which greatly reduces the variation and is especially useful when there are very extreme values. Binning, as the name suggests, classifies the values into a set of defined bins (such as 0-10, 10-20, 20-30 and so on).
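Both ideas can be sketched on a small made-up series. The values, the bin edges and the extra open-ended `30+` bin are assumptions chosen for illustration; `pd.cut` uses right-closed intervals by default, so a value of exactly 10 would fall into the `0-10` bin.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical) with one extreme observation
values = pd.Series([5, 12, 18, 25, 29, 3000])

# Log transformation: compresses the gap between typical and extreme values
logged = np.log(values)

# Binning: assign each value to a labelled interval
bins = [0, 10, 20, 30, np.inf]
labels = ['0-10', '10-20', '20-30', '30+']
binned = pd.cut(values, bins=bins, labels=labels)

print(pd.DataFrame({'value': values, 'log': logged.round(2), 'bin': binned}))
```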

Feature Engineering:

This is the art of extracting more meaningful information from the existing data without adding anything new. In housing data, for example, the total square footage is often more relevant than the square footage of each individual floor, so deriving that total is a form of feature engineering.

The feature engineering process should be performed after completing Exploratory Data Analysis (EDA), missing value treatment and outlier treatment.

Now we will see it in action using the same Ames housing data set. I will be using the output of my missing value treatment script discussed in another post. You can find the script here .

Import Libraries:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic: render plots inside the notebook
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,10
import seaborn as sb
```

```
df = pd.read_csv('C:/../Ames_cleaned.csv')
df.shape
```

`(2919, 70)`

Let us try to find outliers in continuous variables such as '1stFlrSF', '2ndFlrSF' and 'TotalBsmtSF':

```
# Wide-form input: one box per column
sb.boxplot(data=df[['1stFlrSF','2ndFlrSF','TotalBsmtSF']])
```

Clearly, there are some extreme values for 1stFlrSF and TotalBsmtSF. However, we will not treat these outliers yet, because we will first apply some feature engineering to them: we will combine all three into TotalSF and see how the box plot looks.

```
# Feature engineering: total square footage across both floors and the basement
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']
sb.boxplot(y=df['TotalSF'])
```

Now let's look at SalePrice, GrLivArea and LotFrontage.

```
sb.boxplot(y=df['GrLivArea'])
plt.show()
sb.boxplot(y=df['LotFrontage'])
plt.show()
# SalePrice as named in the text above (assumes the column is present in df)
sb.boxplot(y=df['SalePrice'])
```

Clearly, there are some extreme outliers in the data. We will use a log transformation to reduce the variation. This does not remove the outliers completely, but it brings the extreme values into a more reasonable range.

Log transforming TotalSF, GrLivArea, LotFrontage

```
# Log transforming (np.log requires strictly positive values)
df['GrLivArea'] = np.log(df['GrLivArea'])
df['LotFrontage'] = np.log(df['LotFrontage'])
df['TotalSF'] = np.log(df['TotalSF'])

# Plotting after transformation
sb.boxplot(y=df['GrLivArea'])
plt.show()
sb.boxplot(y=df['LotFrontage'])
plt.show()
sb.boxplot(y=df['TotalSF'])
```

It is evident from the above graphs that the extreme outliers have been treated.
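The visual impression can also be checked numerically by counting how many points fall outside the IQR fences before and after the transformation. The sketch below uses a small made-up series rather than the Ames columns, and the helper `count_iqr_outliers` is hypothetical.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside the Q1 - k*IQR and Q3 + k*IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())

# Toy living-area values (hypothetical) with one very large house
living_area = pd.Series([800, 1000, 1200, 1500, 1800, 2200, 2600, 5600])

before = count_iqr_outliers(living_area)
after = count_iqr_outliers(np.log(living_area))
print('Outliers before log:', before, '| after log:', after)
```

On real data the count after the transformation will not necessarily reach zero; the point of the check is only to see whether it goes down.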

We can combine BedroomAbvGr and TotRmsAbvGrd into TotalRooms.
Following that, we will drop all the variables that were used for feature engineering.

```
df['TotalRooms'] = df['BedroomAbvGr'] + df['TotRmsAbvGrd']

# Drop the source columns that were used for feature engineering
col_fe = ['1stFlrSF','2ndFlrSF','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','BedroomAbvGr','TotRmsAbvGrd']
df.drop(col_fe, axis=1, inplace=True)
print('Total Features after removing engineered features: ', (df.shape))
```

`Total Features after removing engineered features: (2919, 64)`

Split into train and test datasets

```
train = df.loc[df['source'] == 'train']
test = df.loc[df['source'] == 'test']
print(train.shape, test.shape)
```

`(1460, 64) (1459, 64)`

For categorical variables, we will find the percentage share of the most frequent value, in other words the percentage of the mode.

```
# Identifying numerical and categorical features
num_feat = train.dtypes[train.dtypes != 'object'].index
print('Total of numeric features: ', len(num_feat))
cat_feat = train.dtypes[train.dtypes == 'object'].index
print('Total of categorical features: ', len(cat_feat))

# Share of the most frequent value (the mode) in each categorical variable
for i in list(cat_feat):
    # value_counts() sorts in descending order, so the first entry is the mode
    pct = df[i].value_counts().iloc[0] / len(df)
    print('Highest value Percentage of {}: {:.6f}'.format(i, pct))
```

```
Highest value Percentage of BldgType: 0.830764
Highest value Percentage of BsmtCond: 0.892771
Highest value Percentage of BsmtExposure: 0.652278
Highest value Percentage of BsmtFinType1: 0.318602
Highest value Percentage of BsmtFinType2: 0.881466
Highest value Percentage of BsmtQual: 0.439534
Highest value Percentage of CentralAir: 0.932854
Highest value Percentage of Condition1: 0.860226
Highest value Percentage of Condition2: 0.989723
Highest value Percentage of Electrical: 0.915382
Highest value Percentage of ExterCond: 0.869476
Highest value Percentage of ExterQual: 0.615964
Highest value Percentage of Exterior1st: 0.351490
Highest value Percentage of Exterior2nd: 0.347722
Highest value Percentage of FireplaceQu: 0.486468
Highest value Percentage of Foundation: 0.448099
Highest value Percentage of Functional: 0.931483
Highest value Percentage of GarageCond: 0.909215
Highest value Percentage of GarageFinish: 0.421377
Highest value Percentage of GarageQual: 0.892086
Highest value Percentage of GarageType: 0.590271
Highest value Percentage of Heating: 0.984584
Highest value Percentage of HeatingQC: 0.511477
Highest value Percentage of HouseStyle: 0.503940
Highest value Percentage of KitchenQual: 0.511477
Highest value Percentage of LandContour: 0.898253
Highest value Percentage of LandSlope: 0.951696
Highest value Percentage of LotConfig: 0.730730
Highest value Percentage of LotShape: 0.636862
Highest value Percentage of MSZoning: 0.777321
Highest value Percentage of MasVnrType: 0.606029
Highest value Percentage of Neighborhood: 0.151764
Highest value Percentage of PavedDrive: 0.904762
Highest value Percentage of RoofMatl: 0.985269
Highest value Percentage of RoofStyle: 0.791367
Highest value Percentage of SaleCondition: 0.822885
Highest value Percentage of SaleType: 0.865365
Highest value Percentage of Street: 0.995889
Highest value Percentage of Utilities: 0.999657
Highest value Percentage of source: 0.500171
```

We will drop the variables whose mode accounts for more than 80% of the values. Such variables have very little variability and hence might not contribute significantly to the model.

```
# Drop columns in which the most frequent value makes up more than 80% of all values
col_drop = ['BldgType','BsmtCond','BsmtFinType2','CentralAir','Condition1','Condition2','Electrical','ExterCond',
            'Functional','GarageCond','GarageQual','Heating','LandContour','LandSlope','PavedDrive','RoofMatl',
            'SaleCondition','SaleType','Street','Utilities']
df.drop(col_drop, axis=1, inplace=True)
print('Total features after dropping categorical features: ', df.shape)
```

`Total features after dropping categorical features: (2919, 44)`
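The list of columns to drop was typed out by hand above. As an alternative, it could be derived from the threshold directly. The following is a sketch on a toy frame, not the script used in this post; the helper name `dominant_columns` and the sample values are assumptions.

```python
import pandas as pd

def dominant_columns(frame, threshold=0.8):
    """Return non-numeric columns whose most frequent value exceeds the threshold share."""
    selected = []
    for name in frame.select_dtypes(exclude='number').columns:
        # normalize=True gives shares of the non-missing values, sorted descending
        top_share = frame[name].value_counts(normalize=True).iloc[0]
        if top_share > threshold:
            selected.append(name)
    return selected

# Toy data (hypothetical values): Street is dominated by a single category
toy = pd.DataFrame({
    'Street': ['Pave'] * 9 + ['Grvl'],
    'LotShape': ['Reg'] * 6 + ['IR1'] * 4,
    'LotArea': range(10),
})
to_drop = dominant_columns(toy)
print(to_drop)
```

Deriving the list this way keeps it in sync with the data if the threshold or the dataset changes.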

With this, we have covered some basics of outlier treatment and feature engineering. Feature engineering is a critical step that strongly influences the success of the model. Hence, an in-depth understanding of the dataset's domain is a huge advantage when deriving the relevant features needed to build a predictive model.

-Hari Mindi

