This post is a continuation of my previous posts on the Data Science process. We will try to predict housing prices from a training dataset; as before, I am using the Ames Housing dataset.

In my previous posts, I covered the following steps:

There are different types of machine learning algorithms, each suited to different purposes, and the choice of algorithm depends on the type of dataset we are working with. I will write a separate post on the most common types of machine learning algorithms for beginners.

For the housing prices dataset, we need to predict house prices from the given features. For this purpose we will use the Linear Regression algorithm.

What is Linear Regression:

I would assume most of you are like me and go mad looking at mathematical formulas, so I will try to explain it the way I understand it.

“Using Linear Regression we can quantify the strength of the relationship between independent and dependent variables.”

Again, that is a lot of jargon…!

Consider Work_Experience as an independent variable and Salary as a dependent variable. If we have good data on these two variables, Linear Regression can identify the relationship between them. We would likely find that Salary increases as experience increases, and from that data we can predict the Salary of an employee with some ‘x’ years of work experience.
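As a quick sketch of this idea, here is a toy example using scikit-learn’s LinearRegression. The salary figures are entirely made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: years of experience vs. salary (illustrative only)
experience = np.array([[1], [2], [3], [5], [7], [10]])
salary = np.array([40000, 48000, 55000, 70000, 85000, 110000])

# Fit a straight line through the points
model = LinearRegression()
model.fit(experience, salary)

# Predict the salary of an employee with 4 years of experience
predicted = model.predict(np.array([[4]]))
print(round(predicted[0]))
```

The fitted coefficient is positive, confirming the intuition that salary rises with experience, and the prediction for 4 years lands between the observed salaries at 3 and 5 years.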

There are many other websites that do a better job of explaining this, and I would strongly recommend reading them.

How are house prices predicted?

• I will use the feature-engineered Ames Housing dataset from my previous post
• We will split the training data 70:30
• We will train the algorithm on 70% of the training data and test it against the remaining 30%
• We will compare predicted vs. actual values and use a model evaluation technique to get a score

Let’s get started…

Feature Selection:

From my last post, after finishing outlier treatment and Feature Engineering we are left with 64 variables. We also saw each feature’s correlation with the target variable ‘SalePrice’.

• We will select the features that have more than 40% correlation with the target variable, and drop those with less than 40% correlation.
• We will also drop categorical features whose most frequent value accounts for more than 80% of observations.
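Before applying these filters to the real data below, here is a minimal sketch of both criteria on a toy DataFrame (the column names and values are illustrative, not from the Ames data):

```python
import pandas as pd

# Toy DataFrame: one strongly correlated numeric feature, one weak one,
# and one categorical feature dominated by a single value
toy = pd.DataFrame({
    'SalePrice':  [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'StrongFeat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'WeakFeat':   [5, 1, 8, 2, 9, 3, 10, 4, 6, 7],
    'MostlyOne':  ['A'] * 9 + ['B'],
})

# Filter 1: keep numeric features with |correlation| to SalePrice above 0.40
corr_abs = toy.corr(numeric_only=True)['SalePrice'].abs()
keep_numeric = corr_abs[corr_abs > 0.40].index.tolist()

# Filter 2: flag categorical features whose top value exceeds 80% frequency
top_freq = toy['MostlyOne'].value_counts(normalize=True).iloc[0]
drop_cat = top_freq > 0.80

print(keep_numeric, drop_cat)
```

Here `WeakFeat` fails the correlation threshold and `MostlyOne` (90% a single value) would be dropped by the frequency rule.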

(Script continued from previous post)

```# We will convert YearBuilt and YearRemodAdd into ages relative to the year of sale
df['YearBltAge'] = df['YrSold'] - df['YearBuilt']
df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']
df.shape
```

`(2919, 66)`

```df.drop(['YearBuilt','YearRemodAdd'], axis=1, inplace=True)
```
```#Identifying numerical and categorical features
num_feat = train.dtypes[train.dtypes != 'object'].index
print('Total of numeric features: ', len(num_feat))
cat_feat = train.dtypes[train.dtypes == 'object'].index
print('Total of categorical features: ', len(cat_feat))

# Highest value frequency percentage in categorical variables
for i in list(cat_feat):
    pct = df[i].value_counts().iloc[0] / len(df)
    print('Highest value Percentage of {}: {:.6f}'.format(i, pct))
```

```Total of numeric features:  24
Total of categorical features:  40
Highest value Percentage of BldgType: 0.830764
Highest value Percentage of BsmtCond: 0.892771
Highest value Percentage of BsmtExposure: 0.652278
Highest value Percentage of BsmtFinType1: 0.318602
Highest value Percentage of BsmtFinType2: 0.881466
Highest value Percentage of BsmtQual: 0.439534
Highest value Percentage of CentralAir: 0.932854
Highest value Percentage of Condition1: 0.860226
Highest value Percentage of Condition2: 0.989723
Highest value Percentage of Electrical: 0.915382
Highest value Percentage of ExterCond: 0.869476
Highest value Percentage of ExterQual: 0.615964
Highest value Percentage of Exterior1st: 0.351490
Highest value Percentage of Exterior2nd: 0.347722
Highest value Percentage of FireplaceQu: 0.486468
Highest value Percentage of Foundation: 0.448099
Highest value Percentage of Functional: 0.931483
Highest value Percentage of GarageCond: 0.909215
Highest value Percentage of GarageFinish: 0.421377
Highest value Percentage of GarageQual: 0.892086
Highest value Percentage of GarageType: 0.590271
Highest value Percentage of Heating: 0.984584
Highest value Percentage of HeatingQC: 0.511477
Highest value Percentage of HouseStyle: 0.503940
Highest value Percentage of KitchenQual: 0.511477
Highest value Percentage of LandContour: 0.898253
Highest value Percentage of LandSlope: 0.951696
Highest value Percentage of LotConfig: 0.730730
Highest value Percentage of LotShape: 0.636862
Highest value Percentage of MSZoning: 0.777321
Highest value Percentage of MasVnrType: 0.606029
Highest value Percentage of Neighborhood: 0.151764
Highest value Percentage of PavedDrive: 0.904762
Highest value Percentage of RoofMatl: 0.985269
Highest value Percentage of RoofStyle: 0.791367
Highest value Percentage of SaleCondition: 0.822885
Highest value Percentage of SaleType: 0.865365
Highest value Percentage of Street: 0.995889
Highest value Percentage of Utilities: 0.999657
Highest value Percentage of source: 0.500171```

```#Find correlation for numeric variables

target = 'SalePrice'

corr = train.corr()
corr_abs = corr.abs()

nr_num_cols = len(num_feat)

ser_corr = corr_abs.nlargest(nr_num_cols, target)[target]
print(ser_corr)
```

```SalePrice       1.000000
OverallQual     0.790982
TotalSF         0.761573
GrLivArea       0.695118
GarageCars      0.640409
GarageArea      0.623431
FullBath        0.582934
YearBltAge      0.523350
RemodAge        0.509079
MasVnrArea      0.473461
Fireplaces      0.466929
TotalRooms      0.444828
LotFrontage     0.331692
WoodDeckSF      0.324413
Porch           0.296678
LotArea         0.263843
GarageYrBlt     0.261366
HalfBath        0.250628
KitchenAbvGr    0.135907
MSSubClass      0.084284
OverallCond     0.077856
MoSold          0.046432
YrSold          0.028923
Id              0.021917
Name: SalePrice, dtype: float64```

```#Drop columns based on corr < 40%
col_drop_corr = ['GarageYrBlt','LotFrontage','WoodDeckSF','Porch','LotArea','HalfBath','KitchenAbvGr','MSSubClass',
'OverallCond','MoSold','YrSold']
df.drop(col_drop_corr, axis=1, inplace=True)
print('Total features: ', df.shape)
```

`Total features: (2919, 53)`

```#Drop variables with percentage frequent value > 80%
col_drop_mode = ['BldgType','BsmtCond','BsmtFinType2','CentralAir','Condition1','Condition2','Electrical',
'ExterCond','Functional','GarageCond','GarageQual','Heating','LandContour','LandSlope','PavedDrive',
'RoofMatl','SaleCondition','SaleType','Street','Utilities']
df.drop(col_drop_mode, axis=1, inplace=True)
print('Total features: ', df.shape)
```

`Total features: (2919, 33)`

Hence we will be using 31 features in our model (excluding the ‘source’ and ‘Id’ variables).

Since we have a mix of numeric and non-numeric features, we need to convert the non-numeric ones into numbers for the algorithm to interpret. We will use encoding and create dummy features. Let us understand with a simple example:

We have a variable ‘BsmtFinType1’ which has 6 types of values: ‘None’, ‘GLQ’, ‘ALQ’, ‘Rec’, ‘BLQ’, ‘LwQ’

Encoding assigns numeric values to each of them as:

None – 0, GLQ – 1, ALQ – 2, Rec – 3, BLQ – 4, LwQ – 5

Creation of dummy variables – this column is then transformed into 6 columns:

BsmtFinType1_0, BsmtFinType1_1, BsmtFinType1_2, and so on. The value is 1 when that category is present and 0 when it is not.
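Here is a minimal sketch of this two-step transformation on a toy column (using a few of the BsmtFinType1 values from the example above):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with a few of the BsmtFinType1 categories described above
bsmt = pd.DataFrame({'BsmtFinType1': ['None', 'GLQ', 'ALQ', 'Rec', 'GLQ']})

# Step 1: label-encode. Note that LabelEncoder assigns codes in alphabetical
# order of the category names, so the exact numbers may differ from the
# illustration above.
lc = LabelEncoder()
bsmt['BsmtFinType1'] = lc.fit_transform(bsmt['BsmtFinType1'])

# Step 2: one-hot encode the integer codes into 0/1 dummy columns
dummies = pd.get_dummies(bsmt, columns=['BsmtFinType1'])
print(dummies.columns.tolist())
```

Each row of `dummies` has exactly one column set to 1, marking which category that row belongs to.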

Let’s get to the code:

```cat_cols = df.dtypes[df.dtypes == 'object'].index
cat_cols
```

```Index(['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'FireplaceQu', 'Foundation', 'GarageFinish', 'GarageType', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'Neighborhood', 'RoofStyle', 'source'], dtype='object')```

```#Integer conversions (Label Encoder)
from sklearn.preprocessing import LabelEncoder
lc = LabelEncoder()

cat_cols = ['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'FireplaceQu',
            'Foundation', 'GarageFinish', 'GarageType', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LotConfig',
            'LotShape', 'MSZoning', 'MasVnrType', 'Neighborhood', 'RoofStyle']
for i in cat_cols:
    df[i] = lc.fit_transform(df[i])

df.shape
```

`(2919, 33)`

```#One hot encoding

col_encod = ['BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'ExterQual', 'Exterior1st','Exterior2nd', 'FireplaceQu',
'Fireplaces', 'Foundation', 'FullBath', 'GarageCars', 'GarageFinish', 'GarageType', 'HeatingQC',
'HouseStyle', 'KitchenQual', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'Neighborhood',
'OverallQual', 'RoofStyle', 'TotalRooms']
df = pd.get_dummies(df, columns=col_encod)

df.shape
```

`(2919, 193)`

As you can see from the result, the number of columns has increased due to encoding. (We could have ended up with fewer columns by bucketing features such as YearBuilt and YearRemodAdd into a smaller number of bins.)

We will divide the dataset back into our original Train and Test sets.

Train – has values for SalePrice

Test – without SalePrice values which we need to predict.

```#Dividing back into test and train dataset
train = df.loc[df['source'] == 'train']
test = df.loc[df['source'] == 'test']
print(train.shape, test.shape)
```

`(1460, 193) (1459, 193)`

```#Drop the helper 'source' column (reassigning avoids SettingWithCopyWarning on slices)
test = test.drop(['source'], axis=1)
train = train.drop(['source'], axis=1)
```
```#Importing algorithm libraries

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
```

We will now split our training set into two parts: one to train the model and one held out to compare against, so we can see how accurately it predicts unseen values.

```#Split-out validation dataset
pred_col = [x for x in train.columns if x not in ['SalePrice', 'Id']]
X = train[pred_col]
y = train['SalePrice']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state= 0)
X_train.shape,X_test.shape,y_train.shape,y_test.shape
```

`((1022, 190), (438, 190), (1022,), (438,))`

```#Declare algorithm
lin_reg = LinearRegression()

#Fit the training set
lin_reg.fit(X_train, y_train)

score = lin_reg.score(X_test,y_test)
print('Model Score: {:.6f}'.format(score))
```

`Model Score: 0.804574`
The metric returned above is called R-squared (the coefficient of determination). It shows that our model explains about 80.5% of the variance in the actual values, which is not bad for a beginner..!
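For the curious, the `score()` method of a scikit-learn regressor computes R² = 1 − SS_res/SS_tot. A small sketch verifying this by hand against `r2_score`, using made-up actual and predicted values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual vs. predicted values (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 6), round(r2_score(y_true, y_pred), 6))
```

A perfect model gives R² = 1, while a model no better than always predicting the mean gives R² = 0.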

Hope you have enjoyed it so far. I will try to write another post implementing each ML algorithm, this time using much simpler datasets.

-Hari Mindi 