Let's get back to our problem of predicting house prices in California.
Let us recap what we have done so far:
- We framed our problem and confirmed that it can be tackled with machine learning, using regression to predict the price from the other factors in our dataset: https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-1-frame-your-experiment/
- We acquired a dataset on California house prices.
- We installed the tools required for the experimentation (the Anaconda Python distribution, Jupyter Notebook): https://learn-ai-tech.com/how-to-start-with-machine-learning/
- In the previous post, we analyzed and visualized the housing data and identified the key features (location, median income, proximity to the ocean…) that influence the price: https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-2-analyze-your-data/
In this post, we will address the next step of the experimentation: data preparation. Just like cooking, where the ingredients must be cleaned and chopped, the data must be cleansed and normalized.
It consists in transforming the raw dataset into a prepared dataset that is:
- cleaned of inconsistent data (outliers, measurement errors) and incomplete data (missing values in a record), which is either removed or treated
- shaped into the standard form expected by the machine learning algorithm functions as implemented in the libraries (e.g. converting word tags into numbers)
- trimmed (keeping only the features relevant for training) to stay as light as possible, so we can experiment efficiently with various machine learning algorithms
The output of the data preparation is an automated sequence of transformations, called a pipeline, that can prepare data both for training and for evaluation. Thus, upon the selection and evaluation of the model, we can reuse the pipeline as a function.
Clean the dataset
Machine learning algorithms expect a dataset of numerical values, with every field filled in. Humans can cope with some unknown values; computers cannot.
Let's check our dataset information on the file structure (number of records, null fields):
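This summary can be obtained with pandas' info() method. Here is a minimal sketch; the tiny DataFrame below is a hypothetical extract standing in for the full dataset, which in the real experiment would come from loading the CSV file:

```python
import numpy as np
import pandas as pd

# Hypothetical extract mimicking the housing file structure;
# in the real experiment the DataFrame comes from pd.read_csv
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],  # one missing value
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Prints the number of records and, per column, the non-null count and dtype
housing.info()
```

The non-null counts in the printed summary are what reveal the missing values.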
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
We notice that all the features are filled ("non-null") with numerical ("float64") values on the 20,640 records, except for 2 features:
- total_bedrooms has only 20433 non-null values, meaning that 207 records have no value
- ocean_proximity is not a number but a word tag
Too bad: those two features influence the house price we want to predict (refer to the correlation test in https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-2-analyze-your-data/). We need to take those fields into account for our prediction engine.
Treat incomplete data
Let us first handle the empty-value problem of "total_bedrooms". There are 3 approaches to cure it:
- remove the records with an empty value from the dataset:
housing.dropna(subset=["total_bedrooms"])
- drop the field entirely:
housing.drop("total_bedrooms", axis=1)
- replace the missing data with the median value of total_bedrooms (the "middle" number of total_bedrooms in California; check the statistics definition for more details), which will not unbalance the dataset:
median = housing["total_bedrooms"].median()
I chose option 3 to keep all my records. I store the output dataset in a dedicated copy.
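Option 3 can be sketched as follows with pandas (the values below are hypothetical toy numbers, not the real records):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the real total_bedrooms column
housing = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0]})

# Work on a dedicated copy so the raw dataset stays untouched
housing_prepared = housing.copy()
median = housing_prepared["total_bedrooms"].median()
housing_prepared["total_bedrooms"] = housing_prepared["total_bedrooms"].fillna(median)
```

fillna replaces every missing value with the computed median, while the raw dataset keeps its gaps for reference.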
Encode category data into numbers
If we zoom in on the feature "ocean_proximity", we can identify 5 categories:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
To transform it into numerical values, we can replace ocean_proximity in the prepared dataset with 5 new features, one per category label ("<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"), with value 1 (if the tag was there) or 0 (if not).
First, consider the vector of the "ocean_proximity" labels from the dataset separately:
housing_cat = housing["ocean_proximity"]
If we display the vector:
17606     <1H OCEAN
18632     <1H OCEAN
14650    NEAR OCEAN
3230         INLAND
3555      <1H OCEAN
19480        INLAND
8879      <1H OCEAN
13685        INLAND
4937      <1H OCEAN
4861      <1H OCEAN
We encode the label categories into integer labels using factorize (a pandas function), obtaining an array housing_cat_encoded:
housing_cat_encoded, housing_categories = housing_cat.factorize()
If we display the first 10 records of the encoded vector:
array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)
Each category receives an integer label in order of first appearance in the data: "<1H OCEAN" becomes 0, "NEAR OCEAN" becomes 1, "INLAND" becomes 2, "NEAR BAY" becomes 3 and "ISLAND" becomes 4, as confirmed by the returned index:
Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')
Now let us shape the data into an array with one 0/1 column per label (using a Scikit-Learn encoder function). We obtain an array with a lot of zeros; stored as a sparse matrix, it can be processed efficiently by the machine learning algorithms:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
We can display it as a dense array with housing_cat_1hot.toarray():
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
We can add these columns to the prepared dataset in place of the ocean_proximity vector.
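The whole encoding step can be sketched end to end on a toy category column (hypothetical values, using the same labels as the dataset):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy category column standing in for ocean_proximity
housing_cat = pd.Series(["<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND"])

# factorize assigns an integer to each label, in order of first appearance
housing_cat_encoded, housing_categories = housing_cat.factorize()

# One 0/1 column per category; fit_transform returns a sparse matrix
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
dense = housing_cat_1hot.toarray()
```

Each row of the dense array contains exactly one 1, in the column of the record's category.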
Automate the data preparation with pipeline
We have identified two transformations to perform on the data to make it compatible with the machine learning algorithms offered as standard in the libraries:
- replace the missing values of "total_bedrooms" with the median value
- replace the ocean_proximity label tag with extra columns, one per category label: "<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"
When testing machine learning algorithms to find the best algorithm and the best parameters, we will need to repeat those steps again and again, on various training datasets and on test datasets.
To automate the process, we create a data pipeline that combines standard transformations (from the library) with custom functions (your own). The concept of a pipeline is standard in machine learning libraries and in data science.
Let us illustrate with the first transformation, replacing missing values with the median, in Scikit-Learn:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values with the median
    ("std_scaler", StandardScaler()),               # normalize the numerical features
])
We can then apply the transformation to any dataset sharing the structure of the housing dataset by calling num_pipeline:
housing_num_tr = num_pipeline.fit_transform(housing_num)
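To check that such a pipeline behaves as intended, we can run it on a toy version of the numerical subset housing_num (the column values below are hypothetical, not the real records):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values with the median
    ("std_scaler", StandardScaler()),               # normalize the numerical features
])

# Hypothetical toy numerical subset standing in for housing_num
housing_num = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0]})
housing_num_tr = num_pipeline.fit_transform(housing_num)
```

The output no longer contains any missing value, and each feature is centered and scaled.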
Similarly, we can create another pipeline for the category label encoding, this time with a custom function:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Our custom function: select the listed columns from the DataFrame."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

cat_attribs = ["ocean_proximity"]
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),  # pick the category column
    ("cat_encoder", OneHotEncoder()),              # one 0/1 column per category
])
Those two transformations can be combined into an overall data preparation pipeline, using Scikit-Learn's FeatureUnion to join the two pipelines we have previously defined:

from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
We can then call our overall pipeline on the training data or the test data:
housing_prepared = full_pipeline.fit_transform(housing)
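Putting everything together, here is a self-contained sketch of the full preparation pipeline on a hypothetical two-column extract (the real experiment uses all the numerical features; DataFrameSelector is our own custom transformer, not a Scikit-Learn built-in):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Custom transformer: keep only the listed columns of the DataFrame."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Hypothetical two-column extract standing in for the housing dataset
housing = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(["total_bedrooms"])),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(["ocean_proximity"])),
    ("cat_encoder", OneHotEncoder()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

# One call prepares the whole dataset: imputed, scaled and one-hot encoded
housing_prepared = full_pipeline.fit_transform(housing)
```

We fit the pipeline once on the training data with fit_transform; on the test data, only transform should be called, so that the medians and scaling factors learned on the training set are reused.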
Data preparation might not be as sexy as machine learning algorithm selection, but it is a must-have: real data is quite messy, and not all of it is relevant for the experimentation. In practice, the data preparation step can take a large share of your time in a machine learning project. In the next post, we will tackle the most awaited step: selecting the algorithms for our prediction application.