In my previous post i briefly discussed about different techniques of handling missing data for building Machine Learning model.

Below is Python script for treating missing data in Ames dataset.

(Download the dataset from here)

As usual open Jupyter notebook and import our libraries to start

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,10
import seaborn as sb

Read the dataset

train = pd.read_csv('C:/../houseprices_train.csv')
test = pd.read_csv('C:/../houseprices_test.csv')
print(train.shape, test.shape)

(1460, 81) (1459, 80)

Adding source column to each dataset for identification and then combine them

#Adding source
train['source'] = 'train'
test['source'] = 'test'
print(train.shape, test.shape)

(1460, 82) (1459, 81)

Combine datasets

df = pd.concat([train, test])
df.shape

(1460, 82) (1459, 81)
(2919, 82)

Find Missing Value Columns

  • Lets find the percentage of missing values for all the columns that have missing values.
  • Python represents missing data a NaN and are identified by isnull() function.
#percentage of missing values
null_df = df.columns[df.isnull().any()]
num = df[null_df].isnull().sum()
round(num/2919, 2) 

Alley 0.93
BsmtCond 0.03
BsmtExposure 0.03
BsmtFinSF1 0.00
BsmtFinSF2 0.00
BsmtFinType1 0.03
BsmtFinType2 0.03
BsmtFullBath 0.00
BsmtHalfBath 0.00
BsmtQual 0.03
BsmtUnfSF 0.00
Electrical 0.00
Exterior1st 0.00
Exterior2nd 0.00
Fence 0.80
FireplaceQu 0.49
Functional 0.00
GarageArea 0.00
GarageCars 0.00
GarageCond 0.05
GarageFinish 0.05
GarageQual 0.05
GarageType 0.05
GarageYrBlt 0.05
KitchenQual 0.00
LotFrontage 0.17
MSZoning 0.00
MasVnrArea 0.01
MasVnrType 0.01
MiscFeature 0.96
PoolQC 1.00
SalePrice 0.50
SaleType 0.00
TotalBsmtSF 0.00
Utilities 0.00
dtype: float64

That is a long list of columns that have missing values.
In certain cases columns will have zero values which can be considered as missing values. Hence we will also look at columns having data as zero and express that as percentage of total data.

#percentage of zero values for each numeric variable
zero_df = df.columns[(df == 0).any()]
num = (df[zero_df] == 0).sum()
round(num/2919, 2) 

2ndFlrSF 0.57
3SsnPorch 0.99
BedroomAbvGr 0.00
BsmtFinSF1 0.32
BsmtFinSF2 0.88
BsmtFullBath 0.58
BsmtHalfBath 0.94
BsmtUnfSF 0.08
EnclosedPorch 0.84
Fireplaces 0.49
FullBath 0.00
GarageArea 0.05
GarageCars 0.05
HalfBath 0.63
KitchenAbvGr 0.00
LowQualFinSF 0.99
MasVnrArea 0.60
MiscVal 0.96
OpenPorchSF 0.44
PoolArea 1.00
ScreenPorch 0.91
TotalBsmtSF 0.03
WoodDeckSF 0.52
dtype: float64

Now that we have all the columns with missing and zero values data, we will use some basic techniques to impute them with some appropriate values.

Drop unwanted columns

  • First i would like drop the features that have more than 90% missing data because their contribution to the model is very insignificant.
  • Alley, MiscFeature, PoolQC – these have more than 90% missing data
  • BsmtHalfBath, LowQualFinSF, MiscVal, PoolArea – These are numeric features which have more than 90% data as zeros.
  • We will drop all these columns except for BsmtHalfBath, 3SsnPorch, ScreenPorch which we will look at later stages.
drop_col = ['Alley','LowQualFinSF', 'MiscFeature', 'MiscVal','PoolArea', 'PoolQC']
df.drop(drop_col, axis=1, inplace=True)
df.shape

(2919, 73)

Imputing missing values for – BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual
  • These features seem to have equal number of missing values
  • Description for above variables is given as  ‘NA’   = ‘No Basement’.
  • First we will change NA for the above fields to something Python can read instead of considering them as missing values
col = ['BsmtCond','BsmtExposure','BsmtFinType1', 'BsmtFinType2', 'BsmtQual']
for i in col:
    Nan_rows = df[i].isnull()
    df.loc[Nan_rows, i] = 'None'
    print ('Null values of {} is: {:d}' .format(i, df[i].isnull().sum()))

Null values of BsmtCond is: 0
Null values of BsmtExposure is: 0
Null values of BsmtFinType1 is: 0
Null values of BsmtFinType2 is: 0
Null values of BsmtQual is: 0

Lets examine BsmtFinSF1 and BsmtFinSF2

zero_bfsf1 = (df['BsmtFinSF1'] == 0)
print (df.loc[zero_bfsf1, ].groupby('BsmtFinType1').BsmtFinSF1.count())

zero_bfsf2 = (df['BsmtFinSF2'] == 0)
print (df.loc[zero_bfsf2, ].groupby('BsmtFinType2').BsmtFinSF1.count())

BsmtFinType1
None 78
Unf 851
Name: BsmtFinSF1, dtype: int64
BsmtFinType2
BLQ 1
None 78
Unf 2492
Name: BsmtFinSF1, dtype: int64

  • BsmtFinSF1 and BsmtFinSF2 can have zero values where BsmtFinType1 or BsmntFinType2 are NA or Unf (unfinished)
  • I would assume that to be treated as no basement
  • Hence changing unf values to None for bsmtFinType1 and BsmntFinType2
Unf_rows = (df['BsmtFinType1'] == 'Unf')
df.loc[Unf_rows,'BsmtFinType1'] = 'None'
print('Valuecount of BsmtFinType1: ', df['BsmtFinType1'].value_counts())

Unf_rows = (df['BsmtFinType2'] == 'Unf')
df.loc[Unf_rows,'BsmtFinType2'] = 'None'
print('Valuecount of BsmtFinType2: ', df['BsmtFinType2'].value_counts())

Valuecount of BsmtFinType1: None 930
GLQ 849
ALQ 429
Rec 288
BLQ 269
LwQ 154
Name: BsmtFinType1, dtype: int64
Valuecount of BsmtFinType2: None 2573
Rec 105
LwQ 87
BLQ 68
ALQ 52
GLQ 34
Name: BsmtFinType2, dtype: int64

  • We will look at zero values for BsmtUnfSF
zero_df = (df['BsmtUnfSF'] == 0)
df.loc[zero_df, ].groupby('BsmtCond').BsmtUnfSF.count()

BsmtCond
None 79
Name: BsmtUnfSF, dtype: int64

This shows that 79 are genuine zero values as there is no basement and we need to impute the remaining

print('Zero Values for BsmtUnfSF Before: ', df.loc[zero_df,'BsmtUnfSF'].count())
zero_df = ((df['BsmtUnfSF'] == 0) & (df['BsmtCond'] != 'None'))
df.loc[zero_df,'BsmtUnfSF'] = df['BsmtUnfSF'].mean()
print('Zero Values for BsmtUnfSF After: ', df.loc[zero_df,'BsmtUnfSF'].count())

Zero Values for BsmtUnfSF Before: 241
Zero Values for BsmtUnfSF After: 162

Garage Parameters

  • Description says  ‘NA’ or null values on GarageCond mean ‘No Garage’ which is same for other garage features.
  • Numeric Garage fields ‘GarageArea’, GarageCars and GarageYrBlt should be made zero for rows where GarageCond is null
null_g = df['GarageCond'].isnull()
df.loc[null_g, ['GarageCond','GarageFinish','GarageQual','GarageType']] = 'None'
df.loc[null_g, ['GarageArea','GarageCars','GarageYrBlt']] = 0

Imputing FireplaceQu

  • Null values mean no Fireplace, hence we will fill the with ‘None’.
  • Also zero values in Fireplaces correspond to null value rows of FireplaceQu which is correct.
null_f = df['FireplaceQu'].isnull()
df.loc[null_f, 'FireplaceQu'] = 'None'

Imputing Fence and LotFrontage

  • Null values in fence means No Fence according to description, hence imputing with ‘None’
  • LotFrontage has null values which we will impute with mean
df['Fence'].fillna('None', inplace=True)
df['LotFrontage'].fillna(df['LotFrontage'].mean(), inplace=True)
print ('Null values of Fence: ', df['Fence'].isnull().sum())
print ('Null values of LotFrontage: ', df['LotFrontage'].isnull().sum())


Null values of Fence: 0
Null values of LotFrontage: 0

Imputing MasVnrArea and MasVnrType

  • All Null values of MasVnrType will be filled with ‘None’ and corresponding MasVnrArea will be zero
  • MasVnrType with MasVnrArea are related. If area is zero then type should be ‘None’ according to description
  • Similarly of type is ‘None’ then Area should be zero
  • Hence we will update MasVnrType as ‘None’ where MasVnrArea is zero
  • Also all Null values of Type fill be filled with ‘None’ and corresponding Area will be zero
null_m = df['MasVnrType'].isnull()
df.loc[null_m, 'MasVnrArea'] = 0
df.loc[null_m, 'MasVnrType'] = 'None'

null_m = (df['MasVnrArea'] == 0)
df.loc[null_m, 'MasVnrType'] = 'None'

null_m = (df['MasVnrType'] == 'None')
df.loc[null_m, 'MasVnrArea'] = 0

Now we will do some FEATURE ENGINEERING try to impute missing values

  • We will combine OpenPorch, EnclosedPorch, 3SsnPorch and Screenporch into a single variable.
  • Instead of porch area, we will mark the new variable as 1 for Porch exists and 0 for No Porch
df['Porch'] = df['OpenPorchSF']+df['EnclosedPorch']+df['3SsnPorch']+df['ScreenPorch']
df['Porch'] = df['Porch'].astype(bool).astype(int)

df['Porch'].value_counts()

1 2046
0 873
Name: Porch, dtype: int64

  • We will now combine FullBath, BsmtFullBath and HalfBath, BsmtHalfBath to show the total number of Full and Half Bathrooms
df['FullBath'] = df['FullBath'] + df['BsmtFullBath']
df['HalfBath'] = df['HalfBath'] + df['BsmtHalfBath']

Since we created new features from existing ones, lets drop them

drop_col = ['BsmtFullBath','BsmtHalfBath', '3SsnPorch', 'EnclosedPorch', 'OpenPorchSF','ScreenPorch']
df.drop(drop_col, axis=1, inplace=True)
df.shape

(2919, 71)

Now we are sure we have addressed the zero value features and most of the missing values in other features.
lets see what else is remaining in completing our missing data exercise

null_df = df.columns[df.isnull().any()]
num = df[null_df].isnull().sum()
num

BsmtFinSF1 1
BsmtFinSF2 1
BsmtUnfSF 1
Electrical 1
Exterior1st 1
Exterior2nd 1
FullBath 2
Functional 2
HalfBath 2
KitchenQual 1
MSZoning 4
SalePrice 1459
SaleType 1
TotalBsmtSF 1
Utilities 2
dtype: int64

  • SalePrice is the feature we need to predict, hence ignore that
  • There are some continuous & categorical features left with missing data.
  • For Continuous variables, we will impute with MEAN and for categorical data we will impute with MODE
cont_col = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF']
cat_col = ['Electrical', 'Exterior1st', 'Exterior2nd', 'FullBath', 'Functional', 'HalfBath', 'KitchenQual',
           'MSZoning','SaleType','Utilities']
for i in cont_col:
    df[i].fillna(df[i].mean(), inplace=True)
    print('Null values left for {} is: {:d}'.format(i, df[i].isnull().sum()))
    
for j in cat_col:
    df[j].fillna(df[j].mode()[0], inplace=True)
    print('Null values left for {} is: {:d}'.format(j, df[j].isnull().sum()))

Null values left for BsmtFinSF1 is: 0
Null values left for BsmtFinSF2 is: 0
Null values left for BsmtUnfSF is: 0
Null values left for TotalBsmtSF is: 0
Null values left for Electrical is: 0
Null values left for Exterior1st is: 0
Null values left for Exterior2nd is: 0
Null values left for FullBath is: 0
Null values left for Functional is: 0
Null values left for HalfBath is: 0
Null values left for KitchenQual is: 0
Null values left for MSZoning is: 0
Null values left for SaleType is: 0
Null values left for Utilities is: 0

Now we have a cleaned data without any missing data which is ready to be processed further building the ML model. Lets save the cleaned data which we will use in the next steps of building a Machine Learning model.

The dataset is not yet completely ready as there are some outliers (extreme values) in the data and might also need more Feature Engineering which i will save it for another post on another day…!

-Hari Mindi

 


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.