How to experiment with housing price prediction using machine learning? Part 4 – Find the best machine learning prediction model

Let us get back to our problem of predicting house prices in California.

A quick recap of where we stand, and what is ahead.

In this post, I will first introduce the basic use of a machine learning algorithm:

  • a little bit of theory,
  • how to invoke a machine learning algorithm for training,
  • how to perform predictions with the trained model,
  • how to measure the prediction performance of the trained model.

Then, back to the machine learning experimentation approach for the prediction, we will:

  • try out different categories of machine learning algorithms on a fixed training dataset,
  • shortlist the best performing ones,
  • fine-tune the algorithm to find the parameters that give the best performance.

First, a bit of theory on machine learning training and test mechanics

Since we have to predict a continuous value, the house price, based on the various attributes in the dataset, we have identified our problem as a regression problem. Let us train a model based on a regression algorithm.

The goal of a prediction

One of the most basic regression algorithms is linear regression. The idea is to estimate the linear relation (a parameter A) between the input x and the output y.

In one dimension, it is a line to estimate; y, A and x are scalar values:

y = A*x

In multiple dimensions, y and x are vectors and A is a matrix.

The goal is to obtain a robust A, the prediction model, able to predict y for an unknown input x.

The training of the prediction model

We learn the relation from a training set where both the inputs x and the outcomes y are known; for this, we use part of our dataset. To give an intuition, the algorithm iteratively refines, over the training examples, an estimate of the form:

A_trained = y_training/x_training

Our machine learning algorithm is just the method that updates the prediction model A_trained over the training examples from the dataset.
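To make this concrete, here is a minimal sketch of estimating A from known training pairs. It is a toy one-dimensional example using numpy, not the housing data, and the variable names are chosen to match the formulas above:

import numpy as np

# Toy training data following roughly y = 3*x
x_training = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_training = np.array([3.1, 5.9, 9.2, 11.8, 15.1])

# Least-squares estimate of A over all training examples
# (a batch view of what an iterative update converges to)
A_trained = np.sum(x_training * y_training) / np.sum(x_training ** 2)
print(A_trained)  # close to 3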

The performance of the model

Then, on a test set (the remainder of our dataset, not used for training), we can measure how far our estimation is from reality:

y_prediction = A_trained*x_test

error = distance metric between y_test and y_prediction

Ideally, to predict perfectly, we would need to reach an error distance of 0.
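Continuing the toy sketch above, prediction and error measurement on held-out data would look like this (using the root mean squared error as the distance metric):

# Unseen test data from the same underlying relation
x_test = np.array([6.0, 7.0])
y_test = np.array([18.2, 20.9])

# Predict with the trained model
y_prediction = A_trained * x_test

# Root mean squared error between reality and prediction
error = np.sqrt(np.mean((y_test - y_prediction) ** 2))
print(error)  # 0 would mean perfect prediction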

 

Machine learning prediction put into practice

You do not need to code the machine learning algorithms yourself: the machine learning libraries already implement them.

But you do need to learn, through experimentation, which categories of algorithms solve which kinds of problems (regression, classification, …) and what is feasible given the dataset available.

Scikit-learn exposes machine learning algorithms through an estimator object. An estimator wraps a machine learning model and provides the methods to update it. It has an input X and an output Y.

To train the model, we call the fit method:

estimator.fit(X, Y)

To predict with the trained model, we call the predict method:

estimator.predict(X)

Let us train the housing price prediction model using linear regression

First, I prepare my training data housing_tr by running the raw data through the data preparation pipeline.
Second, I fit an estimator called lin_reg on housing_tr and the corresponding labels, which are the house prices in the dataset:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_tr, housing_labels)

 

We need to test the prediction performance of our trained linear model.

We extract a sample set of the first 5 examples (the sample features go through the same data preparation pipeline, giving some_data_tr):

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

 

Perform the prediction on the sample set:

print("Predictions:", lin_reg.predict(some_data_tr))

Predictions: [204148.3860687  327883.69051085 203077.8489566   77646.24130012
 190761.15058625]

While the correct housing prices should have been:

print("Labels:", list(some_labels))

Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

We notice that our model can be improved:

  • on the first and fourth examples, our prediction is quite a bit off!
  • but on the other 3 examples, our model gives a fair estimate.

Let's run our prediction on the whole training set:

from sklearn.metrics import mean_squared_error
import numpy as np

housing_predictions = lin_reg.predict(housing_tr)

We use the mean squared error, a Euclidean distance metric, to measure the distance between the labels and the predictions. It is quite a popular measure in machine learning.

lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

This error score will serve as a reference to compare against other machine learning models:

70111.96013979446

Quickly estimate the overall performance across training data configurations using K-fold cross-validation

On a given machine learning model, we try various splits of the dataset to estimate a robust score.
K-fold is a powerful approach to split the data and cross-validate the score:

  • divide the dataset into K equal subsets,
  • reserve one of the K subsets for testing,
  • train the model on the other K-1 subsets,
  • validate the prediction on the held-out subset, rotating over all K subsets.

Of course, the process takes more time: we run the algorithm K times.
Scikit-learn performs the test in one line of code. Here is a 10-fold cross-validation on the linear regression model:

from sklearn.model_selection import cross_val_score

lin_scores = cross_val_score(lin_reg, housing_tr, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)

Let us define a function to display the performance scores from the 10 folds, along with some distribution statistics: the mean and the standard deviation.

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(lin_rmse_scores)

Scores: [68963.61688194 68839.63684476 69785.60651571 75243.21769363 68500.75844944 72362.83262839 66676.37467564 69577.30476395 73475.67650846 69361.20374155]
Mean: 70278.622870347
Standard deviation: 2464.2599212914047

K-fold cross-validation lets us refine the average score we can expect across various training subsets.

 

Benchmark the machine learning algorithms to shortlist the best approach

Which algorithm to select?

We decide to try out other regression algorithms and measure their scores.

Scikit-learn provides a cheat sheet suggesting the most promising algorithms to try for your problem, based on your dataset:

http://scikit-learn.org/stable/tutorial/machine_learning_map/

Try out other algorithms

We try out a decision tree and a random forest and measure the results with 10-fold cross-validation.

The code follows the same pattern as linear regression; a sketch is given below.
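As a sketch (assuming the same housing_tr and housing_labels variables as before, and score variable names chosen to match the output below), the decision tree and random forest versions look like this:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

# Decision tree: same fit/score pattern as linear regression
tree_reg = DecisionTreeRegressor()
tree_scores = cross_val_score(tree_reg, housing_tr, housing_labels,
                              scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)

# Random forest: an ensemble of decision trees
forest_reg = RandomForestRegressor()
forest_scores = cross_val_score(forest_reg, housing_tr, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)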

We obtain the following scores:

  • The decision tree is roughly on par with linear regression (a comparable mean score, with a slightly higher spread):
    display_scores(tree_rmse_scores)
Scores: [65579.74729653 68637.79207294 72535.50114775 70784.0669092
 70806.43900039 72087.81338119 68394.03592069 66393.79175012
 73434.62187504 71884.96831115]
Mean: 70053.8777664995
Standard deviation: 2536.762569699643
  • Random forest gives a better average score and more consistent performance (lower standard deviation):
Scores: [51199.23575575 49583.94814821 51892.80626221 53710.85756483
 53472.15770125 54375.46300265 52022.81519684 52703.47098501
 54221.51944773 52010.72009322]
Mean: 52519.299415769674
Standard deviation: 1412.8362788759223

Random forest is a promising machine learning algorithm. We shortlist it and move on to searching for and fine-tuning its parameters.

 

Fine-tune the chosen machine learning algorithm

Scikit-learn provides default parameters when a machine learning model is instantiated.

For example, when we build the random forest model:

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_tr, housing_labels)

We want to find the parameters that improve the prediction score.

Scikit-learn lets us define a grid of parameters and run the algorithm over every combination:

from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_tr, housing_labels)

At the end of the run, we can explore:

  • the best parameters found:

grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}
  • the best model:

grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
  • the score obtained per parameter combination:

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

64487.67051811388 {'max_features': 2, 'n_estimators': 3}
56278.89406547513 {'max_features': 2, 'n_estimators': 10}
54185.34290654665 {'max_features': 2, 'n_estimators': 30}
...

Finally, we can evaluate the performance score of our best random forest model on the test dataset (X_test_prepared being the held-out test features run through the same data preparation pipeline):

final_model = grid_search.best_estimator_

final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

48145.52861415705

 

This concludes the experimentation on house price prediction using machine learning. We have learned that:

  • You don't need to know the theory of each machine learning algorithm in detail, but rather to understand which one works best in a given situation.
  • The minimum mathematics required to run a model is limited to linear algebra and statistics.
  • The machine learning libraries facilitate and minimize the code needed to train, evaluate and select the best algorithm.
  • The quality and the volume of labelled data are key to obtaining reliable predictions.
