In my previous post i briefly discussed about different techniques of handling missing data for building Machine Learning model.

Below is Python script for treating missing data in Ames dataset.

(Download the dataset from here)

As usual open Jupyter notebook and import our libraries to start

import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline from matplotlib import rcParams rcParams['figure.figsize'] = 12,10 import seaborn as sb

**Read the dataset**

train = pd.read_csv('C:/../houseprices_train.csv') test = pd.read_csv('C:/../houseprices_test.csv') print(train.shape, test.shape)

`(1460, 81) (1459, 80)`

**Adding source column** to each dataset for identification and then combine them

#Adding source train['source'] = 'train' test['source'] = 'test' print(train.shape, test.shape)

`(1460, 82) (1459, 81)`

**Combine datasets**

df = pd.concat([train, test]) df.shape

`(1460, 82) (1459, 81)`

(2919, 82)

**Find Missing Value Columns**

- Lets find the percentage of missing values for all the columns that have missing values.
- Python represents missing data a NaN and are identified by isnull() function.

#percentage of missing values null_df = df.columns[df.isnull().any()] num = df[null_df].isnull().sum() round(num/2919, 2)

`Alley 0.93`

BsmtCond 0.03

BsmtExposure 0.03

BsmtFinSF1 0.00

BsmtFinSF2 0.00

BsmtFinType1 0.03

BsmtFinType2 0.03

BsmtFullBath 0.00

BsmtHalfBath 0.00

BsmtQual 0.03

BsmtUnfSF 0.00

Electrical 0.00

Exterior1st 0.00

Exterior2nd 0.00

Fence 0.80

FireplaceQu 0.49

Functional 0.00

GarageArea 0.00

GarageCars 0.00

GarageCond 0.05

GarageFinish 0.05

GarageQual 0.05

GarageType 0.05

GarageYrBlt 0.05

KitchenQual 0.00

LotFrontage 0.17

MSZoning 0.00

MasVnrArea 0.01

MasVnrType 0.01

MiscFeature 0.96

PoolQC 1.00

SalePrice 0.50

SaleType 0.00

TotalBsmtSF 0.00

Utilities 0.00

dtype: float64

That is a long list of columns that have missing values.

In certain cases columns will have zero values which can be considered as missing values. Hence we will also look at columns having data as zero and express that as percentage of total data.

#percentage of zero values for each numeric variable zero_df = df.columns[(df == 0).any()] num = (df[zero_df] == 0).sum() round(num/2919, 2)

`2ndFlrSF 0.57`

3SsnPorch 0.99

BedroomAbvGr 0.00

BsmtFinSF1 0.32

BsmtFinSF2 0.88

BsmtFullBath 0.58

BsmtHalfBath 0.94

BsmtUnfSF 0.08

EnclosedPorch 0.84

Fireplaces 0.49

FullBath 0.00

GarageArea 0.05

GarageCars 0.05

HalfBath 0.63

KitchenAbvGr 0.00

LowQualFinSF 0.99

MasVnrArea 0.60

MiscVal 0.96

OpenPorchSF 0.44

PoolArea 1.00

ScreenPorch 0.91

TotalBsmtSF 0.03

WoodDeckSF 0.52

dtype: float64

Now that we have all the columns with missing and zero values data, we will use some basic techniques to impute them with some appropriate values.

**Drop unwanted columns**

- First i would like drop the features that have more than 90% missing data because their contribution to the model is very insignificant.
- Alley, MiscFeature, PoolQC – these have more than 90% missing data
- BsmtHalfBath, LowQualFinSF, MiscVal, PoolArea – These are numeric features which have more than 90% data as zeros.
- We will drop all these columns except for BsmtHalfBath, 3SsnPorch, ScreenPorch which we will look at later stages.

drop_col = ['Alley','LowQualFinSF', 'MiscFeature', 'MiscVal','PoolArea', 'PoolQC'] df.drop(drop_col, axis=1, inplace=True) df.shape

`(2919, 73)`

**Imputing missing values for –** **BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual**

- These features seem to have equal number of missing values
- Description for above variables is given as ‘NA’ = ‘No Basement’.
- First we will change NA for the above fields to something Python can read instead of considering them as missing values

col = ['BsmtCond','BsmtExposure','BsmtFinType1', 'BsmtFinType2', 'BsmtQual'] for i in col: Nan_rows = df[i].isnull() df.loc[Nan_rows, i] = 'None' print ('Null values of {} is: {:d}' .format(i, df[i].isnull().sum()))

`Null values of BsmtCond is: 0`

Null values of BsmtExposure is: 0

Null values of BsmtFinType1 is: 0

Null values of BsmtFinType2 is: 0

Null values of BsmtQual is: 0

**Lets examine BsmtFinSF1 and BsmtFinSF2**

zero_bfsf1 = (df['BsmtFinSF1'] == 0) print (df.loc[zero_bfsf1, ].groupby('BsmtFinType1').BsmtFinSF1.count()) zero_bfsf2 = (df['BsmtFinSF2'] == 0) print (df.loc[zero_bfsf2, ].groupby('BsmtFinType2').BsmtFinSF1.count())

`BsmtFinType1`

None 78

Unf 851

Name: BsmtFinSF1, dtype: int64

BsmtFinType2

BLQ 1

None 78

Unf 2492

Name: BsmtFinSF1, dtype: int64

- BsmtFinSF1 and BsmtFinSF2 can have zero values where BsmtFinType1 or BsmntFinType2 are NA or Unf (unfinished)
- I would assume that to be treated as no basement
- Hence changing unf values to None for bsmtFinType1 and BsmntFinType2

Unf_rows = (df['BsmtFinType1'] == 'Unf') df.loc[Unf_rows,'BsmtFinType1'] = 'None' print('Valuecount of BsmtFinType1: ', df['BsmtFinType1'].value_counts()) Unf_rows = (df['BsmtFinType2'] == 'Unf') df.loc[Unf_rows,'BsmtFinType2'] = 'None' print('Valuecount of BsmtFinType2: ', df['BsmtFinType2'].value_counts())

`Valuecount of BsmtFinType1: None 930`

GLQ 849

ALQ 429

Rec 288

BLQ 269

LwQ 154

Name: BsmtFinType1, dtype: int64

Valuecount of BsmtFinType2: None 2573

Rec 105

LwQ 87

BLQ 68

ALQ 52

GLQ 34

Name: BsmtFinType2, dtype: int64

- We will look at zero values for BsmtUnfSF

zero_df = (df['BsmtUnfSF'] == 0) df.loc[zero_df, ].groupby('BsmtCond').BsmtUnfSF.count()

`BsmtCond`

None 79

Name: BsmtUnfSF, dtype: int64

This shows that 79 are genuine zero values as there is no basement and we need to impute the remaining

print('Zero Values for BsmtUnfSF Before: ', df.loc[zero_df,'BsmtUnfSF'].count()) zero_df = ((df['BsmtUnfSF'] == 0) & (df['BsmtCond'] != 'None')) df.loc[zero_df,'BsmtUnfSF'] = df['BsmtUnfSF'].mean() print('Zero Values for BsmtUnfSF After: ', df.loc[zero_df,'BsmtUnfSF'].count())

`Zero Values for BsmtUnfSF Before: 241`

Zero Values for BsmtUnfSF After: 162

**Garage Parameters**

- Description says ‘NA’ or null values on GarageCond mean ‘No Garage’ which is same for other garage features.
- Numeric Garage fields ‘GarageArea’, GarageCars and GarageYrBlt should be made zero for rows where GarageCond is null

null_g = df['GarageCond'].isnull() df.loc[null_g, ['GarageCond','GarageFinish','GarageQual','GarageType']] = 'None' df.loc[null_g, ['GarageArea','GarageCars','GarageYrBlt']] = 0

**Imputing FireplaceQu**

- Null values mean no Fireplace, hence we will fill the with ‘None’.
- Also zero values in Fireplaces correspond to null value rows of FireplaceQu which is correct.

null_f = df['FireplaceQu'].isnull() df.loc[null_f, 'FireplaceQu'] = 'None'

**Imputing Fence and LotFrontage**

- Null values in fence means No Fence according to description, hence imputing with ‘None’
- LotFrontage has null values which we will impute with mean

df['Fence'].fillna('None', inplace=True) df['LotFrontage'].fillna(df['LotFrontage'].mean(), inplace=True) print ('Null values of Fence: ', df['Fence'].isnull().sum()) print ('Null values of LotFrontage: ', df['LotFrontage'].isnull().sum())

Null values of Fence: 0

Null values of LotFrontage: 0

**Imputing MasVnrArea and MasVnrType**

- All Null values of MasVnrType will be filled with ‘None’ and corresponding MasVnrArea will be zero
- MasVnrType with MasVnrArea are related. If area is zero then type should be ‘None’ according to description
- Similarly of type is ‘None’ then Area should be zero
- Hence we will update MasVnrType as ‘None’ where MasVnrArea is zero
- Also all Null values of Type fill be filled with ‘None’ and corresponding Area will be zero

null_m = df['MasVnrType'].isnull() df.loc[null_m, 'MasVnrArea'] = 0 df.loc[null_m, 'MasVnrType'] = 'None' null_m = (df['MasVnrArea'] == 0) df.loc[null_m, 'MasVnrType'] = 'None' null_m = (df['MasVnrType'] == 'None') df.loc[null_m, 'MasVnrArea'] = 0

**Now we will do some FEATURE ENGINEERING try to impute missing values**

- We will combine OpenPorch, EnclosedPorch, 3SsnPorch and Screenporch into a single variable.
- Instead of porch area, we will mark the new variable as 1 for Porch exists and 0 for No Porch

df['Porch'] = df['OpenPorchSF']+df['EnclosedPorch']+df['3SsnPorch']+df['ScreenPorch'] df['Porch'] = df['Porch'].astype(bool).astype(int) df['Porch'].value_counts()

`1 2046`

0 873

Name: Porch, dtype: int64

- We will now combine FullBath, BsmtFullBath and HalfBath, BsmtHalfBath to show the total number of Full and Half Bathrooms

df['FullBath'] = df['FullBath'] + df['BsmtFullBath'] df['HalfBath'] = df['HalfBath'] + df['BsmtHalfBath']

Since we created new features from existing ones, lets drop them

drop_col = ['BsmtFullBath','BsmtHalfBath', '3SsnPorch', 'EnclosedPorch', 'OpenPorchSF','ScreenPorch'] df.drop(drop_col, axis=1, inplace=True) df.shape

`(2919, 71)`

Now we are sure we have addressed the zero value features and most of the missing values in other features.

lets see what else is remaining in completing our missing data exercise

null_df = df.columns[df.isnull().any()] num = df[null_df].isnull().sum() num

`BsmtFinSF1 1`

BsmtFinSF2 1

BsmtUnfSF 1

Electrical 1

Exterior1st 1

Exterior2nd 1

FullBath 2

Functional 2

HalfBath 2

KitchenQual 1

MSZoning 4

SalePrice 1459

SaleType 1

TotalBsmtSF 1

Utilities 2

dtype: int64

- SalePrice is the feature we need to predict, hence ignore that
- There are some continuous & categorical features left with missing data.
- For Continuous variables, we will impute with MEAN and for categorical data we will impute with MODE

cont_col = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF'] cat_col = ['Electrical', 'Exterior1st', 'Exterior2nd', 'FullBath', 'Functional', 'HalfBath', 'KitchenQual', 'MSZoning','SaleType','Utilities'] for i in cont_col: df[i].fillna(df[i].mean(), inplace=True) print('Null values left for {} is: {:d}'.format(i, df[i].isnull().sum())) for j in cat_col: df[j].fillna(df[j].mode()[0], inplace=True) print('Null values left for {} is: {:d}'.format(j, df[j].isnull().sum()))

`Null values left for BsmtFinSF1 is: 0`

Null values left for BsmtFinSF2 is: 0

Null values left for BsmtUnfSF is: 0

Null values left for TotalBsmtSF is: 0

Null values left for Electrical is: 0

Null values left for Exterior1st is: 0

Null values left for Exterior2nd is: 0

Null values left for FullBath is: 0

Null values left for Functional is: 0

Null values left for HalfBath is: 0

Null values left for KitchenQual is: 0

Null values left for MSZoning is: 0

Null values left for SaleType is: 0

Null values left for Utilities is: 0

Now we have a cleaned data without any missing data which is ready to be processed further building the ML model. Lets save the cleaned data which we will use in the next steps of building a Machine Learning model.

The dataset is not yet completely ready as there are some outliers (extreme values) in the data and might also need more Feature Engineering which i will save it for another post on another day…!

**-Hari Mindi**