This post continues my previous posts on the Data Science process. We will try to predict housing prices from a training dataset, using the Ames Housing dataset and building on the steps covered earlier.

In my previous posts, I covered the following steps:

There are different types of machine learning algorithms used for different purposes, and the choice of algorithm depends on the type of dataset and problem we are working on. I will write a separate post on the most common types of machine learning algorithms for beginners.

For the housing prices dataset we need to predict house prices based on the given features. For this purpose we will use the Linear Regression algorithm.

What is Linear Regression?

I would assume most of you are like me and go mad looking at mathematical formulas, so I will try to explain it the way I understand it.

“Using Linear Regression we can quantify the strength of the relationship between independent and dependent variables”

Again, that is a lot of jargon…!

Work_Experience is an independent variable and Salary is a dependent variable. If we have good data about these 2 variables, Linear Regression can identify the relation between them. We might find that Salary increases as experience increases, and based on that data we can predict the Salary of an employee with some ‘x’ years of work experience.
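
As a minimal sketch with made-up experience and salary numbers (purely illustrative, not real data), fitting that relationship with scikit-learn's LinearRegression would look something like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of work experience vs annual salary (illustrative numbers only)
experience = np.array([[1], [2], [3], [5], [7], [10]])
salary = np.array([40000, 48000, 55000, 70000, 85000, 110000])

model = LinearRegression()
model.fit(experience, salary)

# Predict the salary of an employee with 'x' = 4 years of experience
print(model.predict([[4]]))

# The slope (coef_) quantifies how strongly salary depends on experience
print(model.coef_, model.intercept_)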

There are many other websites which do a better job of explaining this, and I would strongly recommend reading them.

How are house prices predicted?

  • I will use the feature-engineered Ames Housing dataset from my previous post
  • We will split the training data into 70:30
  • We will train the algorithm on the 70% training data and test it against the remaining 30%
  • We will compare predicted vs actual values and use a model evaluation technique to get a score

Let's get started..

Feature Selection:

From my last post, after outlier treatment and feature engineering we are left with 64 variables. We also saw each feature's correlation with the target variable ‘SalePrice’.

  • We will select the features which have more than 40% correlation with the target variable, i.e. drop features that show less than 40% correlation.
  • We will also drop categorical features whose most frequent value accounts for more than 80% of the rows.

(Script continued from previous post)

# Convert YearBuilt and YearRemodAdd into ages relative to the year of sale
df['YearBltAge'] = df['YrSold'] - df['YearBuilt']
df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']
df.shape

(2919, 66)

df.drop(['YearBuilt','YearRemodAdd'], axis=1, inplace=True)
#identifying numerical and categorical features
num_feat = train.dtypes[train.dtypes != 'object'].index
print('Total of numeric features: ', len(num_feat))
cat_feat = train.dtypes[train.dtypes == 'object'].index
print('Total of categorical features: ', len(cat_feat))

# Share of rows taken by the most frequent value in each categorical variable
for i in list(cat_feat):
    pct = df[i].value_counts().iloc[0] / len(df)   # len(df) == 2919 rows
    print('Highest value Percentage of {}: {:3f}'.format(i, pct))

Total of numeric features: 24
Total of categorical features: 40
Highest value Percentage of BldgType: 0.830764
Highest value Percentage of BsmtCond: 0.892771
Highest value Percentage of BsmtExposure: 0.652278
Highest value Percentage of BsmtFinType1: 0.318602
Highest value Percentage of BsmtFinType2: 0.881466
Highest value Percentage of BsmtQual: 0.439534
Highest value Percentage of CentralAir: 0.932854
Highest value Percentage of Condition1: 0.860226
Highest value Percentage of Condition2: 0.989723
Highest value Percentage of Electrical: 0.915382
Highest value Percentage of ExterCond: 0.869476
Highest value Percentage of ExterQual: 0.615964
Highest value Percentage of Exterior1st: 0.351490
Highest value Percentage of Exterior2nd: 0.347722
Highest value Percentage of FireplaceQu: 0.486468
Highest value Percentage of Foundation: 0.448099
Highest value Percentage of Functional: 0.931483
Highest value Percentage of GarageCond: 0.909215
Highest value Percentage of GarageFinish: 0.421377
Highest value Percentage of GarageQual: 0.892086
Highest value Percentage of GarageType: 0.590271
Highest value Percentage of Heating: 0.984584
Highest value Percentage of HeatingQC: 0.511477
Highest value Percentage of HouseStyle: 0.503940
Highest value Percentage of KitchenQual: 0.511477
Highest value Percentage of LandContour: 0.898253
Highest value Percentage of LandSlope: 0.951696
Highest value Percentage of LotConfig: 0.730730
Highest value Percentage of LotShape: 0.636862
Highest value Percentage of MSZoning: 0.777321
Highest value Percentage of MasVnrType: 0.606029
Highest value Percentage of Neighborhood: 0.151764
Highest value Percentage of PavedDrive: 0.904762
Highest value Percentage of RoofMatl: 0.985269
Highest value Percentage of RoofStyle: 0.791367
Highest value Percentage of SaleCondition: 0.822885
Highest value Percentage of SaleType: 0.865365
Highest value Percentage of Street: 0.995889
Highest value Percentage of Utilities: 0.999657
Highest value Percentage of source: 0.500171

#Find correlation for numeric variables

target = 'SalePrice'

corr = train.corr()
corr_abs = corr.abs()

nr_num_cols = len(num_feat)

ser_corr = corr_abs.nlargest(nr_num_cols, target)[target]
print(ser_corr)

SalePrice 1.000000
OverallQual 0.790982
TotalSF 0.761573
GrLivArea 0.695118
GarageCars 0.640409
GarageArea 0.623431
FullBath 0.582934
YearBltAge 0.523350
RemodAge 0.509079
MasVnrArea 0.473461
Fireplaces 0.466929
TotalRooms 0.444828
LotFrontage 0.331692
WoodDeckSF 0.324413
Porch 0.296678
LotArea 0.263843
GarageYrBlt 0.261366
HalfBath 0.250628
KitchenAbvGr 0.135907
MSSubClass 0.084284
OverallCond 0.077856
MoSold 0.046432
YrSold 0.028923
Id 0.021917
Name: SalePrice, dtype: float64

#Drop columns based on corr < 40% 
col_drop_corr = ['GarageYrBlt','LotFrontage','WoodDeckSF','Porch','LotArea','HalfBath','KitchenAbvGr','MSSubClass',
                 'OverallCond','MoSold','YrSold']
df.drop(col_drop_corr, axis=1, inplace=True)
print('Total features: ', df.shape)

Total features: (2919, 53)

#Drop variables with percentage frequent value > 80%
col_drop_mode = ['BldgType','BsmtCond','BsmtFinType2','CentralAir','Condition1','Condition2','Electrical',
                'ExterCond','Functional','GarageCond','GarageQual','Heating','LandContour','LandSlope','PavedDrive',
                'RoofMatl','SaleCondition','SaleType','Street','Utilities']
df.drop(col_drop_mode, axis=1, inplace=True)
print('Total features: ', df.shape)

Total features: (2919, 33)
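As a side note, the two drop lists above could also have been built programmatically from the same thresholds before performing the drops, instead of being typed out by hand. A rough sketch, assuming the ser_corr series and cat_feat index computed above:

# Numeric columns with less than 40% absolute correlation to SalePrice
# (note: this would also catch 'Id', which this post instead excludes later)
col_drop_corr = list(ser_corr[ser_corr < 0.40].index)

# Categorical columns whose most frequent value covers more than 80% of the rows
col_drop_mode = [c for c in cat_feat
                 if df[c].value_counts(normalize=True).iloc[0] > 0.80]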

Hence we will be using 31 features in our model (excluding the ‘source’ and ‘Id’ variables).

Since we have a mix of numeric and non-numeric features, we need to convert the non-numeric features to numbers so the algorithm can interpret them. Hence we will use encoding and create dummy features. Let us understand with a simple example:

We have a variable ‘BsmtFinType1’ which has 6 types of values: ‘None’, ‘GLQ’, ‘ALQ’, ‘Rec’, ‘BLQ’, ‘LwQ’

Encoding assigns numeric values to each of them as:

None – 0, GLQ – 1, ALQ – 2, Rec – 3, BLQ – 4, LwQ – 5

Creation of dummy variables – this column is then transformed into 6 columns:

BsmtFinType1_0, BsmtFinType1_1, BsmtFinType1_2, and so on. The value is 1 when that category is present in a row and 0 when it is not.
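
As a small sketch on a hypothetical toy column (not the real dataset), the two steps look like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with a few BsmtFinType1-style categories (illustrative only)
toy = pd.DataFrame({'BsmtFinType1': ['GLQ', 'ALQ', 'None', 'Rec', 'GLQ', 'LwQ']})

# Label encoding: each category becomes an integer code
toy['BsmtFinType1'] = LabelEncoder().fit_transform(toy['BsmtFinType1'])

# Dummy variables: one 0/1 column per encoded category
print(pd.get_dummies(toy, columns=['BsmtFinType1']))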

Let's get to the code.

cat_cols = df.dtypes[df.dtypes == 'object'].index
cat_cols

Index(['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual', 'Exterior1st',
'Exterior2nd', 'FireplaceQu', 'Foundation', 'GarageFinish',
'GarageType', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LotConfig',
'LotShape', 'MSZoning', 'MasVnrType', 'Neighborhood', 'RoofStyle',
'source'],
dtype='object')

#Integer conversions (Label Encoder)
from sklearn.preprocessing import LabelEncoder
lc = LabelEncoder()

cat_cols = ['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual', 'Exterior1st','Exterior2nd', 'FireplaceQu', 
           'Foundation','GarageFinish', 'GarageType', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LotConfig', 'LotShape',
           'MSZoning','MasVnrType', 'Neighborhood','RoofStyle'] 
for i in cat_cols:
     df[i] = lc.fit_transform(df[i])
    
df.shape

(2919, 33)

#One hot encoding

col_encod = ['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual', 'Exterior1st','Exterior2nd', 'FireplaceQu', 
             'Fireplaces', 'Foundation', 'FullBath', 'GarageCars', 'GarageFinish', 'GarageType', 'HeatingQC', 
             'HouseStyle', 'KitchenQual', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'Neighborhood', 
             'OverallQual', 'RoofStyle', 'TotalRooms']
df = pd.get_dummies(df, columns=col_encod)

df.shape

(2919, 193)

As you can see from the result, the number of columns has increased due to encoding. (We could have had fewer by grouping the YearBuilt and YearRemodAdd information into a smaller number of buckets.)
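
For illustration only, a hedged sketch of what that bucketing could look like using the YearBltAge column created earlier (the bin edges are assumptions, and this is not part of the pipeline in this post):

# Group house age into a handful of ranges so that encoding produces only a few columns
age_bins = [-1, 10, 25, 50, 100, 200]              # assumed bin edges, for illustration
age_labels = ['0-10', '11-25', '26-50', '51-100', '100+']
df['YearBltAgeBin'] = pd.cut(df['YearBltAge'], bins=age_bins, labels=age_labels)

# One-hot encoding the bucketed column would add at most 5 dummy columns
df = pd.get_dummies(df, columns=['YearBltAgeBin'])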

We will divide the dataset back into our original Train and Test sets.

Train – has values for SalePrice

Test – without SalePrice values; these are the values we need to predict.

#Dividing back into test and train dataset
train = df.loc[df['source'] == 'train'].copy()   # .copy() avoids SettingWithCopyWarning on the drops below
test = df.loc[df['source'] == 'test'].copy()
print(train.shape, test.shape)

(1460, 193) (1459, 193)

test.drop(['source'], axis = 1, inplace=True)
train.drop(['source'], axis = 1, inplace=True)
#Importing algorithm libraries

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

We will now split our training set into 2 sets: one to train the model on, and the other to compare predicted against actual values and see how accurately the model predicts.

#Split-out validation dataset
pred_col = [x for x in train.columns if x not in ['SalePrice', 'Id']]
X = train[pred_col]
y = train['SalePrice']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state= 0)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((1022, 1688), (438, 1688), (1022,), (438,))

#Declare algorithm
lin_reg = LinearRegression()

#Fit the training set
lin_reg.fit(X_train, y_train)

score = lin_reg.score(X_test,y_test)
print ('Model Score: {:4f}'.format(score))

Model Score: 0.804574
The score returned above is the R-squared value (coefficient of determination), and it shows that our model explains about 80.4% of the variation in the sale prices, which is not bad for a beginner..!
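
For reference, the same number can be computed directly from the predictions with sklearn.metrics (equivalent to the score() call above):

from sklearn.metrics import r2_score

y_pred = lin_reg.predict(X_test)

# R-squared = 1 - (sum of squared residuals / total sum of squares)
print('R-squared: {:4f}'.format(r2_score(y_test, y_pred)))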

Hope you enjoyed it so far. I will try to write another post implementing each ML algorithm, this time using much simpler datasets.

-Hari Mindi


2 Comments

Srivatsa · May 26, 2018 at 4:59 pm

Good model prediction and that too getting a 80% prediction is good. As this is training data I’m sure the results are pretty good. But with real data getting a prediction of 70% is fantastic IMHO. Keep sharing your thoughts. Good job my dear friend

Vatsa

    harimindi · June 4, 2018 at 6:48 pm

    Thanks buddy..!
