Outlier treatment and Feature Engineering are the next steps after Missing Data Treatment.

**Outliers Definition:**

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In other words, it is an observation that deviates from the overall pattern of the sample.

Outliers can arise for several reasons, such as:

- Data Entry Errors
- Intermediate Data processing Errors
- Instrument recording errors
- Natural variation (a genuine extreme value that does not denote an error)

**Outlier Detection and Treatment:**

Outliers can be univariate or multivariate, meaning they occur in a single feature or in a combination of features, respectively. Boxplots and scatter plots are an easy way of spotting outliers in a sample. The Interquartile Range (IQR) is the most commonly used technique to flag an outlier. Below is an example:

Sample: 4, 7, 9, 11, 12, 20 (arranged in ascending order)

Divide the sample into two halves: the lower half is 4, 7, 9 and the upper half is 11, 12, 20

Find the median of the lower and upper halves, which is 7 and 12 respectively

So Q1 = 7 and Q3 = 12

IQR = Q3 – Q1 = 12 – 7 = 5

Outlier fences: a = Q1 – (1.5*IQR) = 7 – 7.5 = -0.5; b = Q3 + (1.5*IQR) = 12 + 7.5 = 19.5

Any number < a or > b is an outlier.

Hence in our sample 20 is an outlier, because 20 > 19.5.
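The worked example above can be sketched in a few lines of Python. This is only an illustration that assumes the same median-of-halves definition of Q1 and Q3; the helper name `iqr_fences` is made up for this post, and library routines such as `numpy.percentile` interpolate quartiles differently, so they may give slightly different fences.

```python
from statistics import median

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) using median-of-halves quartiles.

    Assumes at least two values; for an odd count the middle value
    is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]     # smallest half of the sample
    upper_half = data[-half:]    # largest half of the sample
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

sample = [4, 7, 9, 11, 12, 20]
low, high = iqr_fences(sample)
outliers = [x for x in sample if x < low or x > high]
print(low, high, outliers)  # -0.5 19.5 [20]
```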

**Outlier Treatment:**

One way of getting rid of outliers is to remove those observations (that is, delete the rows). But doing so reduces the size of the sample or dataset, which can hurt modelling.

Other commonly used techniques are transforming or binning the variable. **Transformation** usually means taking the natural log of the values, which greatly reduces the variation and is especially useful when there are very extreme values. **Binning**, as the name suggests, classifies all the values into a set of defined bins (like 0-10, 10-20, 20-30 and so on).
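As a rough sketch of both ideas, the snippet below applies a log transformation and binning to a handful of made-up values. The numbers, bin edges and labels are assumptions chosen only for illustration. Note that `np.log` needs strictly positive inputs.

```python
import numpy as np
import pandas as pd

# Made-up values with one extreme observation
values = pd.Series([5, 12, 18, 25, 1000])

# Transformation: the natural log compresses the extreme value
logged = np.log(values)

# Binning: assign each value to a labelled range
# (each bin includes its right edge, e.g. 10 falls in '0-10')
edges = [0, 10, 20, 30, np.inf]
labels = ['0-10', '10-20', '20-30', '30+']
binned = pd.cut(values, bins=edges, labels=labels)

print(logged.round(2).tolist())
print(binned.tolist())  # ['0-10', '10-20', '10-20', '20-30', '30+']
```

After the log, the largest value is about 6.9 instead of 1000, so it sits much closer to the rest of the data.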

**Feature Engineering:**

This is the art of extracting more meaningful information from the existing data without adding anything new. In housing data, for example, the total square footage is often more relevant than the square footage of each individual floor, and deriving that total is Feature Engineering.

Feature Engineering should be performed after completing Exploratory Data Analysis (EDA), Missing value treatment and Outlier treatment.

Now we will see it in action using the same Ames housing data set. I will be using the output of my missing value treatment script discussed in another post. You can find the script here.

Import Libraries:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from matplotlib import rcParams
    rcParams['figure.figsize'] = 12,10
    import seaborn as sb

Read Dataset:

    df = pd.read_csv('C:/../Ames_cleaned.csv')
    df.shape

`(2919, 70)`

Let us try to find outliers in continuous variables like:

'1stFlrSF', '2ndFlrSF', 'TotalBsmtSF'

    # Pass the columns through `data` so seaborn draws one box per column
    sb.boxplot(data=df[['1stFlrSF','2ndFlrSF','TotalBsmtSF']])

Clearly we can see there are some extreme values for 1stFlrSF and TotalBsmtSF. However, we will not treat these outliers yet, because we will first do some Feature Engineering on them. We will combine all three into TotalSF and see how the boxplot looks.

    # Feature Engineering
    df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']
    sb.boxplot(x=df['TotalSF'], orient='v')

Now let's look at SalePrice, GrLivArea and LotFrontage

    sb.boxplot(x=df['GrLivArea'], orient='v')
    plt.show()
    sb.boxplot(x=df['LotFrontage'], orient='v')
    plt.show()
    # The original repeated LotFrontage here; SalePrice is the third variable named in the text
    sb.boxplot(x=df['SalePrice'], orient='v')

Clearly there are some extreme outliers in the data. We will use a log transformation to reduce the variation. This does not remove the outliers completely, but the extreme values are pulled much closer to the rest of the data.

Log transforming TotalSF, GrLivArea, LotFrontage

    # Log transforming
    df['GrLivArea'] = np.log(df['GrLivArea'])
    df['LotFrontage'] = np.log(df['LotFrontage'])
    df['TotalSF'] = np.log(df['TotalSF'])
    # Plotting after transformation
    sb.boxplot(x=df['GrLivArea'], orient='v')
    plt.show()
    sb.boxplot(x=df['LotFrontage'], orient='v')
    plt.show()
    sb.boxplot(x=df['TotalSF'], orient='v')

It is evident from the above graphs that extreme outliers are treated.
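Besides reading the boxplots, the effect can also be checked numerically by counting the points that fall outside the IQR fences before and after the transformation. The sketch below uses synthetic right-skewed data as a stand-in for a column such as GrLivArea, not the Ames data itself; `count_iqr_outliers` is a hypothetical helper. Pandas' `quantile` interpolates quartiles, so its fences can differ slightly from the hand calculation earlier.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())

# Synthetic right-skewed data (an assumption, not the real dataset)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=1000))

before = count_iqr_outliers(skewed)
after = count_iqr_outliers(np.log(skewed))
print('Outliers before log:', before, '| after log:', after)
```

On skewed data like this, the count after the log transformation should be far lower, although a few points usually remain flagged.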

We can combine BedroomAbvGr and TotRmsAbvGrd into TotalRooms.

Following that, we will drop all the variables that went into the engineered features.

    df['TotalRooms'] = df['BedroomAbvGr'] + df['TotRmsAbvGrd']
    # Drop Feature engineered columns
    col_fe = ['1stFlrSF','2ndFlrSF','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','BedroomAbvGr','TotRmsAbvGrd']
    df.drop(col_fe, axis=1, inplace=True)
    print('Total Features after removing engineered features: ', (df.shape))

`Total Features after removing engineered features: (2919, 64)`

Split into train and test datasets

    train = df.loc[df['source'] == 'train']
    test = df.loc[df['source'] == 'test']
    print(train.shape, test.shape)

`(1460, 64) (1459, 64)`

For categorical variables we will find the share of rows taken by the most frequent value, in other words the percentage of the mode.

    # Identifying numerical and categorical features
    num_feat = train.dtypes[train.dtypes != 'object'].index
    print('Total of numeric features: ', len(num_feat))
    cat_feat = train.dtypes[train.dtypes == 'object'].index
    print('Total of categorical features: ', len(cat_feat))
    # Highest value frequency percentage in categorical variables
    for i in list(cat_feat):
        # value_counts() sorts descending, so .iloc[0] is the count of the mode
        pct = df[i].value_counts().iloc[0] / len(df)
        print('Highest value Percentage of {}: {:.6f}'.format(i, pct))

    Highest value Percentage of BldgType: 0.830764
    Highest value Percentage of BsmtCond: 0.892771
    Highest value Percentage of BsmtExposure: 0.652278
    Highest value Percentage of BsmtFinType1: 0.318602
    Highest value Percentage of BsmtFinType2: 0.881466
    Highest value Percentage of BsmtQual: 0.439534
    Highest value Percentage of CentralAir: 0.932854
    Highest value Percentage of Condition1: 0.860226
    Highest value Percentage of Condition2: 0.989723
    Highest value Percentage of Electrical: 0.915382
    Highest value Percentage of ExterCond: 0.869476
    Highest value Percentage of ExterQual: 0.615964
    Highest value Percentage of Exterior1st: 0.351490
    Highest value Percentage of Exterior2nd: 0.347722
    Highest value Percentage of FireplaceQu: 0.486468
    Highest value Percentage of Foundation: 0.448099
    Highest value Percentage of Functional: 0.931483
    Highest value Percentage of GarageCond: 0.909215
    Highest value Percentage of GarageFinish: 0.421377
    Highest value Percentage of GarageQual: 0.892086
    Highest value Percentage of GarageType: 0.590271
    Highest value Percentage of Heating: 0.984584
    Highest value Percentage of HeatingQC: 0.511477
    Highest value Percentage of HouseStyle: 0.503940
    Highest value Percentage of KitchenQual: 0.511477
    Highest value Percentage of LandContour: 0.898253
    Highest value Percentage of LandSlope: 0.951696
    Highest value Percentage of LotConfig: 0.730730
    Highest value Percentage of LotShape: 0.636862
    Highest value Percentage of MSZoning: 0.777321
    Highest value Percentage of MasVnrType: 0.606029
    Highest value Percentage of Neighborhood: 0.151764
    Highest value Percentage of PavedDrive: 0.904762
    Highest value Percentage of RoofMatl: 0.985269
    Highest value Percentage of RoofStyle: 0.791367
    Highest value Percentage of SaleCondition: 0.822885
    Highest value Percentage of SaleType: 0.865365
    Highest value Percentage of Street: 0.995889
    Highest value Percentage of Utilities: 0.999657
    Highest value Percentage of source: 0.500171

We will drop the variables whose mode accounts for more than 80% of the values. These variables have very little variability and hence are unlikely to contribute significantly to the model.

    # Drop columns which have frequency of value more than 80% of all values
    col_drop = ['BldgType','BsmtCond','BsmtFinType2','CentralAir','Condition1','Condition2','Electrical','ExterCond',
                'Functional','GarageCond','GarageQual','Heating','LandContour','LandSlope','PavedDrive','RoofMatl',
                'SaleCondition','SaleType','Street','Utilities']
    df.drop(col_drop, axis=1, inplace=True)
    print('Total features after dropping categorical features: ', df.shape)

`Total features after dropping categorical features: (2919, 44)`
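The list of columns to drop was typed out by hand from the printed percentages. As an alternative sketch, the same selection could be derived programmatically. The helper `dominant_columns` and the toy frame below are hypothetical and only illustrate the idea; the helper assumes no column is entirely missing.

```python
import pandas as pd

def dominant_columns(frame, threshold=0.8):
    """Return non-numeric columns whose most frequent value covers more than `threshold` of rows."""
    selected = []
    for col in frame.select_dtypes(exclude='number').columns:
        # value_counts() sorts descending, so the first entry is the mode's count
        top_share = frame[col].value_counts().iloc[0] / len(frame)
        if top_share > threshold:
            selected.append(col)
    return selected

# Toy frame: 'Street' is 90% one value, 'Style' is only 50% one value
toy = pd.DataFrame({
    'Street': ['Pave'] * 9 + ['Grvl'],
    'Style': ['1Story'] * 5 + ['2Story'] * 5,
    'Area': range(10),
})
print(dominant_columns(toy))  # ['Street']
```

Deriving the list this way keeps it in sync with the data if the threshold or the cleaned dataset changes.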

With this we have tried to understand some basics of **Outlier Treatment** and **Feature Engineering**. Feature Engineering is one of the most critical steps in determining the success of the model. Hence an in-depth understanding of the domain of the dataset is a huge advantage when deriving the relevant features needed to build a Predictive Model.

**-Hari Mindi**