This post is a continuation of my previous posts on the Data Science process. We will try to predict housing prices from a training dataset, using the Ames Housing dataset introduced in those posts.

In my previous posts, I covered the earlier steps of the process.

There are different types of machine learning algorithms, each suited to different purposes, and the choice of algorithm depends on the type of dataset we are working with. I will write a separate post on the most common types of machine learning algorithms for beginners.

For the housing prices dataset we need to predict house prices from the given features. For this purpose we will use the Linear Regression algorithm.

**What is Linear Regression:**

I would assume most of you are like me and go mad looking at mathematical formulas, so I will try to explain it the way I understand it.

*“Using Linear Regression we can quantify the strength of the relationship between independent and dependent variables”*

Again, that is a lot of jargon!

Consider Work_Experience as an independent variable and Salary as a dependent variable. If we have good data about these two variables, Linear Regression can identify the relationship between them. We might find that Salary increases as experience increases, and based on that data we can predict the Salary of an employee with some ‘x’ years of work experience.
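A tiny sketch of the Work_Experience/Salary idea, with made-up numbers (the values below are purely illustrative, not from any real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs salary (in thousands)
experience = np.array([[1], [2], [3], [5], [7], [10]])
salary = np.array([40, 48, 55, 70, 85, 110])

# Fit a straight line through the points
model = LinearRegression()
model.fit(experience, salary)

print('Salary increase per year of experience:', model.coef_[0])
print('Predicted salary at 4 years:', model.predict([[4]])[0])
```

The positive slope quantifies the strength of the relationship, and `predict` gives us the salary estimate for the unseen ‘x’ years of experience.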

There are many other websites which do a better job of explaining this, and I would strongly recommend reading them.

**How are house prices predicted?**

- I will use the feature-engineered Ames Housing dataset from my previous post
- We will split the training data 70:30
- We will train the algorithm on 70% of the training data and test it against the remaining 30%
- We will compare predicted vs actual values and use a model evaluation technique to get a score

Let’s get started!

**Feature Selection:**

From my last post, after finishing outlier treatment and Feature Engineering, we are left with 64 variables. We also saw each feature’s correlation with the target variable ‘SalePrice’.

- We will select the features which have more than 40% correlation with the target variable, i.e. we will drop features that show less than 40% correlation.
- We will also drop categorical features whose most frequent value accounts for more than 80% of the rows.
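The two selection rules above can also be derived programmatically rather than by hand. A sketch, assuming a DataFrame with the target column present (the function name and thresholds here are my own, not from the script below):

```python
import pandas as pd

def features_to_drop(df, target='SalePrice', corr_min=0.40, mode_max=0.80):
    """Return (low-correlation numeric cols, dominated categorical cols)."""
    # Rule 1: numeric features with |correlation to target| below the threshold
    num = df.select_dtypes(exclude='object')
    corr = num.corr()[target].abs()
    low_corr = corr[corr < corr_min].index.tolist()

    # Rule 2: categorical features where one value dominates the column
    cat = df.select_dtypes(include='object')
    dominant = cat.apply(lambda s: s.value_counts(normalize=True).iloc[0])
    high_mode = dominant[dominant > mode_max].index.tolist()
    return low_corr, high_mode
```

Calling `features_to_drop(train)` would give two lists ready to pass to `df.drop`, instead of typing the column names manually as we do below.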

(Script continued from previous post)

```python
# We will modify YearBuilt and YearRemodAdd into their respective
# ages from the year of sale
df['YearBltAge'] = df['YrSold'] - df['YearBuilt']
df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']
df.shape
```

`(2919, 66)`

```python
df.drop(['YearBuilt', 'YearRemodAdd'], axis=1, inplace=True)
```

```python
# Identifying numerical and categorical features
num_feat = train.dtypes[train.dtypes != 'object'].index
print('Total of numeric features: ', len(num_feat))
cat_feat = train.dtypes[train.dtypes == 'object'].index
print('Total of categorical features: ', len(cat_feat))

# Highest value frequency percentage in categorical variables
for i in list(cat_feat):
    pct = df[i].value_counts()[0] / 2919
    print('Highest value Percentage of {}: {:3f}'.format(i, pct))
```

```
Total of numeric features:  24
Total of categorical features:  40
Highest value Percentage of BldgType: 0.830764
Highest value Percentage of BsmtCond: 0.892771
Highest value Percentage of BsmtExposure: 0.652278
Highest value Percentage of BsmtFinType1: 0.318602
Highest value Percentage of BsmtFinType2: 0.881466
Highest value Percentage of BsmtQual: 0.439534
Highest value Percentage of CentralAir: 0.932854
Highest value Percentage of Condition1: 0.860226
Highest value Percentage of Condition2: 0.989723
Highest value Percentage of Electrical: 0.915382
Highest value Percentage of ExterCond: 0.869476
Highest value Percentage of ExterQual: 0.615964
Highest value Percentage of Exterior1st: 0.351490
Highest value Percentage of Exterior2nd: 0.347722
Highest value Percentage of FireplaceQu: 0.486468
Highest value Percentage of Foundation: 0.448099
Highest value Percentage of Functional: 0.931483
Highest value Percentage of GarageCond: 0.909215
Highest value Percentage of GarageFinish: 0.421377
Highest value Percentage of GarageQual: 0.892086
Highest value Percentage of GarageType: 0.590271
Highest value Percentage of Heating: 0.984584
Highest value Percentage of HeatingQC: 0.511477
Highest value Percentage of HouseStyle: 0.503940
Highest value Percentage of KitchenQual: 0.511477
Highest value Percentage of LandContour: 0.898253
Highest value Percentage of LandSlope: 0.951696
Highest value Percentage of LotConfig: 0.730730
Highest value Percentage of LotShape: 0.636862
Highest value Percentage of MSZoning: 0.777321
Highest value Percentage of MasVnrType: 0.606029
Highest value Percentage of Neighborhood: 0.151764
Highest value Percentage of PavedDrive: 0.904762
Highest value Percentage of RoofMatl: 0.985269
Highest value Percentage of RoofStyle: 0.791367
Highest value Percentage of SaleCondition: 0.822885
Highest value Percentage of SaleType: 0.865365
Highest value Percentage of Street: 0.995889
Highest value Percentage of Utilities: 0.999657
Highest value Percentage of source: 0.500171
```

```python
# Find correlation for numeric variables
target = 'SalePrice'
corr = train.corr()
corr_abs = corr.abs()
nr_num_cols = len(num_feat)
ser_corr = corr_abs.nlargest(nr_num_cols, target)[target]
print(ser_corr)
```

```
SalePrice       1.000000
OverallQual     0.790982
TotalSF         0.761573
GrLivArea       0.695118
GarageCars      0.640409
GarageArea      0.623431
FullBath        0.582934
YearBltAge      0.523350
RemodAge        0.509079
MasVnrArea      0.473461
Fireplaces      0.466929
TotalRooms      0.444828
LotFrontage     0.331692
WoodDeckSF      0.324413
Porch           0.296678
LotArea         0.263843
GarageYrBlt     0.261366
HalfBath        0.250628
KitchenAbvGr    0.135907
MSSubClass      0.084284
OverallCond     0.077856
MoSold          0.046432
YrSold          0.028923
Id              0.021917
Name: SalePrice, dtype: float64
```

```python
# Drop columns with correlation < 40%
col_drop_corr = ['GarageYrBlt', 'LotFrontage', 'WoodDeckSF', 'Porch', 'LotArea',
                 'HalfBath', 'KitchenAbvGr', 'MSSubClass', 'OverallCond',
                 'MoSold', 'YrSold']
df.drop(col_drop_corr, axis=1, inplace=True)
print('Total features: ', df.shape)
```

`Total features: (2919, 53)`

```python
# Drop variables whose most frequent value percentage > 80%
col_drop_mode = ['BldgType', 'BsmtCond', 'BsmtFinType2', 'CentralAir',
                 'Condition1', 'Condition2', 'Electrical', 'ExterCond',
                 'Functional', 'GarageCond', 'GarageQual', 'Heating',
                 'LandContour', 'LandSlope', 'PavedDrive', 'RoofMatl',
                 'SaleCondition', 'SaleType', 'Street', 'Utilities']
df.drop(col_drop_mode, axis=1, inplace=True)
print('Total features: ', df.shape)
```

`Total features: (2919, 33)`

Hence we will use 31 features in our model (excluding the ‘source’ and ‘Id’ variables).

Since we have a mix of numeric and non-numeric features, we need to convert the non-numeric ones for the algorithm to interpret them. We will use label encoding and then create dummy features. Let us understand this with a simple example:

We have a variable ‘BsmtFinType1’ which takes 6 values: ‘None’, ‘GLQ’, ‘ALQ’, ‘Rec’, ‘BLQ’, ‘LwQ’.

Label encoding assigns a numeric code to each of them, for example:

None – 0, GLQ – 1, ALQ – 2, Rec – 3, BLQ – 4, LwQ – 5

Creation of dummy variables – this single column is then transformed into 6 columns:

BsmtFinType1_0, BsmtFinType1_1, BsmtFinType1_2, and so on. The value is 1 for a row where that category is present and 0 when it is not.
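The two steps can be seen on a toy one-column frame (note that `LabelEncoder` actually assigns codes in sorted alphabetical order, not in the order the values appear):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.DataFrame({'BsmtFinType1': ['None', 'GLQ', 'ALQ', 'Rec', 'BLQ', 'LwQ']})

# Step 1: label encoding replaces each category with an integer code
lc = LabelEncoder()
s['BsmtFinType1'] = lc.fit_transform(s['BsmtFinType1'])
print(s['BsmtFinType1'].tolist())

# Step 2: one-hot (dummy) encoding turns the single column into
# one 0/1 column per code
dummies = pd.get_dummies(s, columns=['BsmtFinType1'])
print(dummies.columns.tolist())
```

The single column becomes six columns, `BsmtFinType1_0` through `BsmtFinType1_5`, exactly as described above.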

Let’s get to the code:

```python
cat_cols = df.dtypes[df.dtypes == 'object'].index
cat_cols
```

```
Index(['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual', 'Exterior1st',
       'Exterior2nd', 'FireplaceQu', 'Foundation', 'GarageFinish',
       'GarageType', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LotConfig',
       'LotShape', 'MSZoning', 'MasVnrType', 'Neighborhood', 'RoofStyle',
       'source'],
      dtype='object')
```

```python
# Integer conversions (LabelEncoder)
from sklearn.preprocessing import LabelEncoder

lc = LabelEncoder()
cat_cols = ['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual',
            'Exterior1st', 'Exterior2nd', 'FireplaceQu', 'Foundation',
            'GarageFinish', 'GarageType', 'HeatingQC', 'HouseStyle',
            'KitchenQual', 'LotConfig', 'LotShape', 'MSZoning',
            'MasVnrType', 'Neighborhood', 'RoofStyle']
for i in cat_cols:
    df[i] = lc.fit_transform(df[i])
df.shape
```

`(2919, 33)`

```python
# One-hot encoding
col_encod = ['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual',
             'Exterior1st', 'Exterior2nd', 'FireplaceQu', 'Fireplaces',
             'Foundation', 'FullBath', 'GarageCars', 'GarageFinish',
             'GarageType', 'HeatingQC', 'HouseStyle', 'KitchenQual',
             'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType',
             'Neighborhood', 'OverallQual', 'RoofStyle', 'TotalRooms']
df = pd.get_dummies(df, columns=col_encod)
df.shape
```

`(2919, 193)`

As you can see from the result, the number of columns has increased due to encoding. (We could have had fewer columns by grouping the age variables derived from YearBuilt and YearRemodAdd into a handful of buckets.)
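The bucketing idea mentioned above could look like this; the bucket edges and labels are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical house ages (years since construction)
ages = pd.Series([0, 3, 12, 27, 45, 80, 130], name='YearBltAge')

# Group raw ages into a few coarse buckets; one-hot encoding then
# produces only as many columns as there are buckets
buckets = pd.cut(ages, bins=[-1, 10, 30, 60, 200],
                 labels=['new', 'recent', 'mid', 'old'])
print(pd.get_dummies(buckets).columns.tolist())
```

Instead of one dummy column per distinct age, we would get just four, at the cost of some precision.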

We will divide the dataset back into our original Train and Test sets.

Train – has values for SalePrice

Test – without SalePrice values which we need to predict.

```python
# Dividing back into train and test datasets
# (.copy() avoids pandas' SettingWithCopyWarning on the later drops)
train = df.loc[df['source'] == 'train'].copy()
test = df.loc[df['source'] == 'test'].copy()
print(train.shape, test.shape)
```

`(1460, 193) (1459, 193)`

```python
test.drop(['source'], axis=1, inplace=True)
train.drop(['source'], axis=1, inplace=True)
```

```python
# Importing algorithm libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
```

We will now split our training set into two parts: one to train the model, and one held out to check how accurately it predicts unseen values.

```python
# Split out a validation dataset
pred_col = [x for x in train.columns if x not in ['SalePrice', 'Id']]
X = train[pred_col]
y = train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

`((1022, 1688), (438, 1688), (1022,), (438,))`

```python
# Declare the algorithm
lin_reg = LinearRegression()

# Fit the training set and score against the held-out 30%
lin_reg.fit(X_train, y_train)
score = lin_reg.score(X_test, y_test)
print('Model Score: {:4f}'.format(score))
```

`Model Score: 0.804574`

The metric returned by `score` above is called R-squared (the coefficient of determination). A value of 0.80 means our model explains about 80.5% of the variance in the sale prices, which is not bad for a beginner!
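R-squared can also be computed by hand from the residuals, which makes it less mysterious. A small check with toy numbers (not our housing predictions):

```python
import numpy as np

y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 140, 210, 240])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot
print('R-squared:', r2)
```

A value of 1 means the predictions match perfectly (zero residuals), while 0 means the model does no better than always predicting the mean; this is exactly what `lin_reg.score` computes internally.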

Hope you have enjoyed it so far. I will try to write another post implementing other ML algorithms, this time using much simpler datasets.

**-Hari Mindi**

## 2 Comments

## Srivatsa · May 26, 2018 at 4:59 pm

Good model prediction and that too getting a 80% prediction is good. As this is training data I’m sure the results are pretty good. But with real data getting a prediction of 70% is fantastic IMHO. Keep sharing your thoughts. Good job my dear friend

Vatsa

## harimindi · June 4, 2018 at 6:48 pm

Thanks buddy..!