How to experiment with housing price prediction using machine learning? Part 2 – Analyze your Data

Recap: in the previous post, I framed the prediction of the median housing price by district as a machine learning problem. In this post I will cover the data analysis. The goal is to get acquainted with the dataset and explore the features that are relevant to determining the house price. Data analysis is typically the first step in the experimentation, before the data preparation and the ML model selection.


Framing Sanity Check

Business sanity check: as a consultant, it is good to ask about the business goal of a machine learning system and to assess how the problem is currently tackled. Suppose:

  • Our business goal is to create a system that predicts whether it is worth investing in an area or not.
  • As for the current approach, pricing is done through a tedious process that requires the expertise of real estate agents. This can serve as a future performance reference for our system.

The chosen Dataset is the California Housing price:

  • zip can be downloaded on:
  • import Dataset with scikit-learn:

Now let us specify the machine learning problem:

  • A supervised learning problem: the housing price can be learned from a labeled dataset, whereas an unsupervised problem would learn from unlabeled data.
  • A regression problem, since we want to predict the house price as a number, whereas a classification formulation would sort house prices into categories such as cheap, medium, expensive.
  • Laptop infrastructure is fine: the dataset has 20,640 records and 9 features. We can even “display the file in Excel”. We do not need a Big Data infrastructure (storage and processing on multiple machines).
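To make the regression versus classification distinction concrete, here is a minimal sketch with pandas. The sample prices, the bin edges (200,000 and 400,000 dollars) and the label names are assumptions picked only for illustration; they are not part of the original framing.

```python
import pandas as pd

# Illustrative median house values in dollars (a tiny made-up sample)
prices = pd.Series([452600.0, 358500.0, 179700.0, 119600.0, 264725.0])

# Regression target: the continuous price itself
regression_target = prices

# Classification target: the same prices bucketed into categories.
# The bin edges are hypothetical thresholds chosen for this sketch.
classification_target = pd.cut(
    prices,
    bins=[0, 200_000, 400_000, float("inf")],
    labels=["cheap", "medium", "expensive"],
)

print(classification_target.tolist())
```

A regression model would be trained on the numbers, while a classifier would be trained on the three labels and could never tell two "medium" districts apart.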


All the steps of the data analysis can run from a Jupyter Notebook. Let’s call our notebook “Housing”. I want to highlight the main steps rather than the coding itself:

  • I’ll mention the main functions and libraries used
  • I’ll mention the math and statistics concepts involved

Fetch your Data: CSV file

Extract your data into a dataset folder containing dataset.csv.

The dataset should then be loaded into a dataframe (an object that stores tabular data) using pandas (a Python library for data tables). You can refer to linear algebra basics if you forgot about matrix and vector manipulation.

The advantage of a dataframe is that it is processed really fast, because it lives in the computer's memory.

The pandas command to read a CSV file and load it into a dataframe named housing is quite simple: “import pandas as pd … housing = pd.read_csv(csv_path)”
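As a runnable sketch of this loading step, the snippet below writes a two-row sample file and reads it back with pd.read_csv. The temporary folder and the reduced set of columns are assumptions made so the example runs without the real download; with the actual dataset, csv_path would simply point at your extracted file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical location: a temporary folder stands in for your dataset folder
dataset_dir = tempfile.mkdtemp()
csv_path = os.path.join(dataset_dir, "dataset.csv")

# Write a two-row sample so the sketch runs without the real file
with open(csv_path, "w") as f:
    f.write("longitude,latitude,median_house_value,ocean_proximity\n")
    f.write("-122.23,37.88,452600.0,NEAR BAY\n")
    f.write("-122.22,37.86,358500.0,NEAR BAY\n")

# Load the CSV into an in-memory DataFrame
housing = pd.read_csv(csv_path)
print(housing.shape)  # (rows, columns)
```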

Explore your dataset

You can play with the dataframe to get some characteristics:

  • Display the first 5 rows of the table, including the header, with housing.head():


   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY

  • Get information on the file structure (number of records, null fields) with housing.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

We can notice that “total_bedrooms” has only 20,433 non-null values, so 207 records are missing this field.

  • Check the categories of a text field with value_counts(). For example, here we count the number of records per category of ocean_proximity:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
  • We can get some statistics on all the numerical fields (count, mean, min, max, standard deviation, quartiles) with housing.describe(). You need some basics in statistics here, e.g. the standard deviation measures how spread out the values are around the mean.

          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000

  • You can display a histogram of each numerical field with the matplotlib library (Python plotting), through the pandas hist() method:

housing.hist(bins=50, figsize=(20,15))

We can notice that the house prices have been capped at about 500,000 dollars (the maximum value is 500,001).
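The exploration steps in this section, plus two quick checks for missing values and for a capped target, can be sketched on a toy dataframe. The five rows below are illustrative values, not the real dataset, and counting the records that sit exactly at the maximum is just one simple way to spot a cap.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (values are illustrative only)
housing = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan],
    "median_house_value": [452600.0, 500001.0, 352100.0, 500001.0, 500001.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

print(housing.head())                             # first rows
housing.info()                                    # column types and non-null counts
print(housing["ocean_proximity"].value_counts())  # records per category
print(housing.describe())                         # summary statistics of numeric columns

# Number of missing values per column
missing = housing.isna().sum()

# Many records sitting exactly at the maximum price hint at a capped target
cap = housing["median_house_value"].max()
n_capped = int((housing["median_house_value"] == cap).sum())
print(missing["total_bedrooms"], cap, n_capped)
```

On the real dataset, a large count at the maximum would show up as the spike at the right edge of the median_house_value histogram.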

Analyze your Data: Visualization and Correlation Matrix

You can explore your data through visualization. Let’s plot the districts by geographical coordinates (latitude, longitude), then add the house price.

We use the Python library matplotlib through pandas: “housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)”

We recognize the map of California. We can see that the districts are concentrated in high-density areas: Silicon Valley, San Francisco, San Diego.

Let us display the population as the circle size and the price as the color:

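import matplotlib.pyplot as plt

# circle size (s) follows the population, circle color (c) follows the price
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)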

This confirms our intuition: the higher the urban density, the higher the price, and the closer to the ocean, the higher the price.

Check the correlation

Which features influence the house price the most?

Let us look at the correlation between each pair of features. The correlation coefficient captures linear relationships and misses non-linear ones. It ranges from -1 to 1:

  • +1: strong positive correlation,
  • -1: strong negative correlation,
  • 0: no linear correlation.
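To illustrate these three cases, here is a small sketch that computes the Pearson correlation coefficient by hand with NumPy. The helper name pearson and the sample points are assumptions made for this example. The last case shows why a coefficient of 0 only means "no linear relation": y depends exactly on x, yet the coefficient is 0.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables on their mean
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

r_up = pearson(x, 3 * x + 1)     # perfectly increasing line
r_down = pearson(x, -2 * x + 5)  # perfectly decreasing line
r_curve = pearson(x, x ** 2)     # exact but non-linear dependence

print(r_up, r_down, r_curve)
```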

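We use the pandas corr() function to build the correlation matrix (in pandas 2.0 and later, pass numeric_only=True so that the text column ocean_proximity is skipped):
corr_matrix = housing.corr()

We can display the correlation between the house value and each other feature, sorted in decreasing order: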


median_house_value    1.000000
median_income         0.687160
income_cat            0.642274
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
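A ranking like the one above can be produced in a couple of lines. The sketch below runs on a five-row toy dataframe with illustrative values, so its numbers will differ from the real output. It assumes pandas 1.5 or later, where corr() accepts numeric_only.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame (values are illustrative only)
housing = pd.DataFrame({
    "median_income": [8.3, 7.3, 5.6, 3.8, 2.5],
    "latitude": [37.9, 34.1, 37.8, 38.5, 40.2],
    "median_house_value": [452600.0, 352100.0, 341300.0, 342200.0, 119600.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# numeric_only=True skips the text column ocean_proximity
corr_matrix = housing.corr(numeric_only=True)

# Correlation of every numeric feature with the target, strongest positive first
ranking = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranking)
```

The target always comes first with a correlation of 1 with itself; the interesting part is the order of the features below it.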

House value is mostly

  • positively correlated with median income: the richer the district, the higher the price
  • and, more weakly, negatively correlated with latitude: the north of California is cheaper

We can also display the linear correlation between price and income:

housing.plot(kind="scatter", x="median_income", y="median_house_value")

The points follow a roughly diagonal trend, as opposed to a random scatter.

In the next posts, I will present the data preparation and the machine learning model selection.
