Let's get back to our problem of predicting house prices in California.

## Let us recap so far

- We framed our problem and confirmed that it can be tackled with machine learning, using regression to predict the price based on the other factors in our dataset: https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-1-frame-your-experiment/
- We acquired a dataset on California house prices
- We installed the tools required for the experimentation (the Anaconda Python distribution, Jupyter Notebook): https://learn-ai-tech.com/how-to-start-with-machine-learning/
- We analyzed and visualized the housing data and identified the key features (location, income, distance to the ocean…) that influence the price: https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-2-analyze-your-data/
- In the last post, we prepared our data through a **series of transformations (a data pipeline)** to obtain, from the raw dataset, a "prepared" dataset with no missing values and only numeric values. The prepared dataset can be used as robust input to the standard machine learning algorithms implemented in the libraries: https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-3-prepare-your-data/

**In this post,**

I will first introduce the basic use of a machine learning algorithm:

- a little bit of theory,
- how to invoke a machine learning algorithm for training,
- how to perform predictions with the trained model,
- how to measure the prediction performance of the trained model.

Back to the machine learning experimentation approach for the prediction, we will then:

- try out different categories of machine learning algorithms on a fixed training dataset,
- shortlist the best performing ones,
- fine-tune the shortlisted algorithm to find the parameters that give the best performance.

**First, a bit of theory on the mechanics of training and testing a machine learning algorithm**

Since we have to predict a continuous value, the house price, from the various parameters in the dataset, we have identified our problem as a regression problem. Let us train a model based on a regression algorithm.

**The goal of prediction**

One of the most basic regression algorithms is called linear regression. The idea is to estimate the linear relation (parameter A) between the input x and the output y.

In one dimension it is a line to estimate; y, A, and x are scalar values:

y = A * x

In multiple dimensions, y and x are vectors and A is a matrix.

**The goal is to obtain a robust A, our prediction model, able to predict y for an unknown input x.**
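As a minimal illustration with made-up numbers (not the housing data), the relation y = A * x looks like this in code:

```python
import numpy as np

# One dimension: A is a single number relating input to output.
A = 2.5
x = 4.0
y = A * x
print(y)  # 10.0

# Multiple dimensions: x is a vector of input features, A a matrix,
# and y the vector of predicted outputs.
A_matrix = np.array([[1.0, 0.5, -2.0]])  # 1 output, 3 input features
x_vector = np.array([3.0, 2.0, 0.5])     # one example with 3 features
y_vector = A_matrix @ x_vector
print(y_vector)  # [3.]
```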

**The training of the prediction model**

We learn the relation from a training set where both the inputs x and the outcomes y are known; for this we use part of our dataset. To give an intuition, the algorithm iterates over the training examples and updates the model so that, roughly,

A_trained = y_training / x_training

**Our machine learning algorithm is just the method that updates the prediction model A_trained over the training examples from the dataset.**
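To make that iterative update concrete, here is a toy sketch (a simple gradient step in one dimension, not the exact method used by the libraries) that refines A_trained over the training examples:

```python
import numpy as np

# Toy training set generated from a known relation y = 3*x,
# so we can check that A_trained converges towards 3.
rng = np.random.default_rng(42)
x_training = rng.uniform(1.0, 2.0, size=100)
y_training = 3.0 * x_training

A_trained = 0.0      # start from an arbitrary model
learning_rate = 0.1
for _ in range(50):  # several passes over the training set
    for x, y in zip(x_training, y_training):
        error = A_trained * x - y               # how far the prediction is off
        A_trained -= learning_rate * error * x  # nudge A to reduce the error

print(A_trained)  # close to 3.0
```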

## The performance of the model

Then, on a test set (the remainder of our dataset, not used for the training), we can measure how far our estimation is from reality:

y_prediction = A_trained * x_test

error = distance metric between y_test and y_prediction

Ideally, to predict perfectly, we would need to reach an error distance of 0.
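A common choice for that distance metric is the root mean squared error (RMSE), used later in this post. A toy example with made-up prices:

```python
import numpy as np

y_test = np.array([100.0, 200.0, 300.0])        # true values
y_prediction = np.array([110.0, 190.0, 300.0])  # model output

# RMSE: square the differences, average them, take the square root.
error = np.sqrt(np.mean((y_test - y_prediction) ** 2))
print(error)  # about 8.16
```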

**Machine learning prediction put to practice**

You do not need to code the machine learning algorithms yourself: they are already implemented in the machine learning libraries.

But you do need to learn, through experimentation, which categories of algorithms fit which problems (regression, classification…), and their feasibility given the dataset available.

Scikit-learn exposes each machine learning algorithm through an estimator object. The estimator wraps a machine learning model and provides the methods to update it; it takes an input X and an output y. To train the model we call its fit method:

## Let us train the housing price prediction model using linear regression

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_tr, housing_labels)
```

We need to test the prediction performance of our trained linear model.

We extract a sample of the first 5 examples:

```python
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
```

## Perform the prediction on a sample set

```python
# some_data_tr is some_data run through the same preparation
# pipeline used to produce housing_tr
print("Predictions:", lin_reg.predict(some_data_tr))
```

While the correct housing price should have been:

```python
print("Labels:", list(some_labels))
```

We notice that our model can be improved:

- on the first and fourth examples our prediction is a little bit off!
- but on the other 3 examples our model gives a fair estimation

## Let’s run our prediction on the training set

We use the mean squared error, a Euclidean distance metric, to measure the distance between the labels and the predictions; it is quite a popular measure in machine learning.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_tr)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
```

This error score will serve as a reference to compare against other machine learning models.

## Quickly estimate the overall performance on various training configurations using K-fold cross-validation

- Divide the training data into K equal subsets,
- Reserve one of the K subsets for testing,
- Train the model on the remaining K-1 subsets,
- Validate the prediction on the held-out subset, repeating with each subset in turn.
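Scikit-learn's cross_val_score does all these steps for us. To make them concrete, here is an equivalent manual sketch, with dummy data standing in for housing_tr and housing_labels:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Dummy stand-ins for the prepared housing data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

rmse_scores = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])      # train on K-1 subsets
    predictions = model.predict(X[test_idx])   # predict on the held-out subset
    rmse_scores.append(np.sqrt(mean_squared_error(y[test_idx], predictions)))

print(np.mean(rmse_scores))  # average RMSE over the 10 folds
```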

```python
import numpy as np
from sklearn.model_selection import cross_val_score

lin_scores = cross_val_score(lin_reg, housing_tr, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
```

Let us define a function to display the 10-fold performance scores together with some distribution statistics, the mean and the standard deviation.

```python
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(lin_rmse_scores)
```

```
Scores: [68963.61688194 68839.63684476 69785.60651571 75243.21769363
 68500.75844944 72362.83262839 66676.37467564 69577.30476395
 73475.67650846 69361.20374155]
Mean: 70278.622870347
Standard deviation: 2464.2599212914047
```

K-fold cross-validation lets us refine the average score we can expect over various training subsets.

**Benchmark the machine learning algorithms to shortlist the best approach**

### Which algorithm to select?

We decide to try out other regression algorithms and measure their scores.

Scikit-learn provides a cheat sheet suggesting the most promising algorithms to try for your problem, based on your dataset:

http://scikit-learn.org/stable/tutorial/machine_learning_map/

### Try out other algorithms

We try out the decision tree and random forest algorithms and measure the results with 10-fold cross-validation.

The code is similar to the linear regression case.
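As a sketch of what that code could look like (with dummy data standing in for the housing_tr and housing_labels of the earlier steps):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Dummy stand-ins for the prepared housing data.
rng = np.random.default_rng(0)
housing_tr = rng.normal(size=(200, 5))
housing_labels = housing_tr @ rng.normal(size=5)

# Decision tree, evaluated with 10-fold cross-validation.
tree_reg = DecisionTreeRegressor(random_state=0)
tree_scores = cross_val_score(tree_reg, housing_tr, housing_labels,
                              scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)

# Random forest, same evaluation.
forest_reg = RandomForestRegressor(random_state=0)
forest_scores = cross_val_score(forest_reg, housing_tr, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
```

The display_scores function defined above can then be applied to tree_rmse_scores and forest_rmse_scores.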

We obtain the following scores

**The decision tree does not perform as well as linear regression:**

```python
display_scores(tree_rmse_scores)
```

**Random forest gives a better average score and more consistent performance (lower standard deviation):**

```
Scores: [51199.23575575 49583.94814821 51892.80626221 53710.85756483
 53472.15770125 54375.46300265 52022.81519684 52703.47098501
 54221.51944773 52010.72009322]
Mean: 52519.299415769674
Standard deviation: 1412.8362788759223
```

Random forest is a promising machine learning algorithm. We choose to shortlist it and to fine-tune its parameters.

## Fine tune the chosen machine learning algorithm

Scikit-learn provides default parameters when a machine learning model is built.

For example, when we build the random forest model:

```python
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_tr, housing_labels)
```

We want to find the parameters that improve the prediction score.

Scikit-learn lets us define a grid of parameters and run the machine learning algorithm on each combination:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_tr, housing_labels)
```

At the end of the run we can explore:

**The best parameters found**

```python
grid_search.best_params_
# {'max_features': 8, 'n_estimators': 30}
```

**Display the best model**

```python
grid_search.best_estimator_
```

**Display the score obtained for each parameter combination**

```python
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```

```
64487.67051811388 {'max_features': 2, 'n_estimators': 3}
56278.89406547513 {'max_features': 2, 'n_estimators': 10}
54185.34290654665 {'max_features': 2, 'n_estimators': 30}
...
```

Finally, we can evaluate the performance score of our best random forest model on the test dataset:

```python
final_model = grid_search.best_estimator_

# X_test_prepared is the test set run through the same data
# preparation pipeline; y_test holds its true prices
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
```

**This concludes the experimentation on house price prediction using machine learning. We have learned that:**

- **You don't need to know the theory of the machine learning algorithms in detail, but rather to understand which one works best in a given situation.**
- **The minimum mathematics required to run a model is limited to linear algebra and statistics.**
- **The machine learning libraries facilitate and minimize the code needed to train, evaluate, and select the best algorithm.**
- **The quality and the volume of labelled data are key to obtaining reliable predictions.**