How to experiment prediction on housing price using machine learning? Part 3 – Prepare your Data

Let's get back to our problem of predicting house prices in California.

Let us recap: so far we have collected the data and analyzed it (see the previous posts in this series).

In this post, we will address the next step of the experimentation: data preparation. Just like cooking, where ingredients must be cleaned and chopped before use, data must be cleansed and normalized.

It consists of transforming the raw dataset into a prepared dataset:

  • inconsistent data (outliers, measurement errors) and incomplete data (missing values in a record) are removed or treated
  • the data is shaped into the standard form expected by the machine learning algorithm functions as implemented in libraries (e.g. word tags converted into numbers)
  • the dataset is trimmed (only the features relevant for training are kept) to be as light as possible, in order to experiment efficiently with various machine learning algorithms

The output of the data preparation is an automated sequence of transformations, called a pipeline, that can prepare the data for training as well as for evaluation. Thus, upon selection and evaluation of the model, we can reuse the pipeline as a function.

Clean the dataset

Machine learning algorithms expect a dataset where every field is filled with a numerical value. Humans can cope with some unknown values; computers cannot.

Let's check our dataset information on the file structure (number of records, null fields):
housing.info()

Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object

We notice that all the features are filled (“non-null”) with numerical (“float64”) values on the 20640 records, except for 2 features:

  1. total_bedrooms has only 20433 non-null values, meaning that on 207 records there is no value;
  2. ocean_proximity is not a number but a word tag.

Unfortunately, those two features influence the house price we want to predict (refer to the correlation test in https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-2-analyze-your-data/). We need to take those fields into account for our prediction engine.

Treat incomplete data

Let us first handle the empty value problem with “total_bedrooms”. There are 3 approaches to cure it:

  1. remove the records with an empty value from the dataset: housing.dropna(subset=["total_bedrooms"])
  2. drop the whole field: housing.drop("total_bedrooms", axis=1)
  3. replace the missing values with the median value of total_bedrooms (the “middle” number of total_bedrooms in California; check a statistics definition for more details); the median will not unbalance the dataset:
    median = housing["total_bedrooms"].median()
    housing["total_bedrooms"].fillna(median, inplace=True)

I choose option 3 to keep all my records, and I store the output dataset in a dedicated copy.
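A minimal sketch of that option, assuming the housing DataFrame from the previous posts (the copy name housing_clean is just for illustration):

housing_clean = housing.copy()  # dedicated copy: the raw dataset stays untouched
median = housing_clean["total_bedrooms"].median()  # the "middle" value of the column
housing_clean["total_bedrooms"].fillna(median, inplace=True)  # fills the 207 missing values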

Encode category data into numbers

If we zoom in on the feature “ocean_proximity”, we can identify 5 categories:
housing["ocean_proximity"].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5

To transform it into numerical values, we can replace ocean_proximity with 5 new features in the prepared dataset, one per category label: “<1H OCEAN”, “INLAND”, “NEAR OCEAN”, “NEAR BAY”, “ISLAND”, with value 1 (if the tag was there) or 0 (if not).

Consider the vector of the label “ocean_proximity” from the dataset separately:

housing_cat = housing["ocean_proximity"]

If we display the first 10 records of the vector:
housing_cat.head(10)

17606     <1H OCEAN
18632     <1H OCEAN
14650    NEAR OCEAN
3230         INLAND
3555      <1H OCEAN
19480        INLAND
8879      <1H OCEAN
13685        INLAND
4937      <1H OCEAN
4861      <1H OCEAN

We encode the label categories into integer indexes using the pandas factorize() function, which returns the encoded array housing_cat_encoded together with the list of categories:
housing_cat_encoded, housing_categories = housing_cat.factorize()

If we display the encoded vector on the first 10 records:

housing_cat_encoded[:10]
array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

“<1H OCEAN”, “NEAR OCEAN”, “INLAND”, “NEAR BAY”, “ISLAND” become respectively labels 0, 1, 2, 3, 4 (the order in which factorize() first encounters each category):
housing_categories

Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

Now let us shape the data into one binary column per category label (using the Scikit-Learn encoder function). We obtain an array with a lot of zeros, stored in a sparse representation; this can facilitate the processing of machine learning algorithms in terms of performance:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# factorize() returned a 1-D array; the encoder expects a 2-D column
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

We can display it as a dense array:

housing_cat_1hot.toarray()

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

We can add this array to the prepared dataset in place of the ocean_proximity vector.
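As a side note: in recent Scikit-Learn versions (0.20 and later), OneHotEncoder can encode string categories directly, so the factorize() step can be skipped. A minimal sketch under that assumption:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# pass the column as a 2-D DataFrame so the encoder sees a single feature
housing_cat_1hot = encoder.fit_transform(housing[["ocean_proximity"]])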

Automate the data preparation with a pipeline

We have identified two transformations to perform on the data to make it compatible with the machine learning algorithms provided as standard in the libraries:

  1. replace the missing values of “total_bedrooms” with the median value
  2. replace the ocean_proximity word tag with extra columns, one per category label: “<1H OCEAN”, “INLAND”, “NEAR OCEAN”, “NEAR BAY”, “ISLAND”

When testing the machine learning algorithms to find the best algorithm and the best parameters, we will need to repeat those steps again and again, on various training datasets and on the test dataset.

To automate the process, we create a data pipeline that combines standard transformations (from the library) with custom functions (your own). The concept of a pipeline is standard in machine learning libraries and in data science.

Let us illustrate with the first transformation (replace missing values with the median) in Scikit-Learn:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler
# (in recent Scikit-Learn versions, Imputer has been replaced by SimpleImputer from sklearn.impute)

num_pipeline = Pipeline([
    ('imputer', Imputer(strategy="median")),
    # ... further steps can be added here, e.g. ('std_scaler', StandardScaler()),
])

We can apply the transformation to any dataset sharing the structure of the housing dataset by calling num_pipeline:
housing_num_tr = num_pipeline.fit_transform(housing_num)
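Here housing_num is assumed to be the housing dataset restricted to its numerical columns, for instance:

housing_num = housing.drop("ocean_proximity", axis=1)  # drop the only non-numerical column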

Again, we can create another pipeline for the category label encoding, with a custom function this time:

cat_attribs = ["ocean_proximity"]
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # ... e.g. an encoding step such as the one-hot encoder above
])
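DataFrameSelector is not a Scikit-Learn class; it is a small custom transformer you write yourself. A minimal sketch of what it could look like (the exact implementation is an assumption):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Selects the given columns of a pandas DataFrame and returns them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return X[self.attribute_names].values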

Those two transformations can be combined into an overall data preparation pipeline using the Scikit-Learn “FeatureUnion”, which joins the two pipelines we have previously defined:

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
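As a side note: recent Scikit-Learn versions (0.20 and later) provide ColumnTransformer, which dispatches columns to each pipeline directly, so no custom selector is needed. A minimal sketch under that assumption (with SimpleImputer swapped in for Imputer in num_pipeline, as noted above):

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)  # names of the numerical columns
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])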

We can call our overall pipeline on the training data or the test data:

housing_prepared = full_pipeline.fit_transform(housing_train)
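Note that fit_transform() should only be called on the training data; on the test data, call transform() alone, so that the statistics learned on the training set (e.g. the median of total_bedrooms) are reused rather than recomputed. A sketch, where housing_test stands for a hypothetical test set:

housing_test_prepared = full_pipeline.transform(housing_test)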

Conclusion

The data preparation might not be as sexy as the machine learning algorithm selection, but it is a must-have: real data is quite messy, and not all of it is relevant for the experimentation. In practice, the data preparation step can take a large share of your time in a machine learning project. In the next post, we will tackle the most awaited step: the algorithm selection for our prediction application.
