Machine learning is an engineering craft. You approach a business problem through experimentation on data: a test-and-learn approach to determine which machine learning techniques work best.
Our first machine learning experiment will consist in predicting housing prices on the market. Two virtues of this example are that it illustrates the experimentation approach simply, and that the input data can be a simple spreadsheet.
Facing any business problem, you need to go through the following steps:
First, Frame your experiment
- What is the problem I want to tackle: Experiment objectives?
- Can my problem best be solved with machine learning?
- Do I have enough Data available to solve my problem?
- What are the tools I can use to host my experiment?
Then, Run the experiment
- Ingest the Data
- Preprocess the Data
- Try approaches and test them, saving your results as models
- Benchmark the models and select the best one
Let’s illustrate with the housing price problem.
What is the problem I want to tackle: Experiment objectives?
I can formulate my problem as: “I want to be able to predict the selling/renting price of a property based on its characteristics.” When browsing real estate agency listings, there is no fixed rule to price a house: the answer is “it depends” on many factors: the environment, the size of the house, supply and demand, proximity to transportation…
Can my problem best be solved with machine learning?
So the pricing model of a house is unknown, with many parameters that can change. Hand-building a pricing engine would require interviewing many real estate experts, with no guarantee of results.
But we have plenty of advertisements for houses on sale. From these advertisements, we can infer a house price prediction model! This is a machine learning problem.
Do I have enough Data available to solve my problem?
Although we have plenty of advertisements, the data is unstructured. Indeed, each online real estate platform structures its ads with images and free text, with different levels of information. Collecting this data automatically and transposing it to a normalized format would be a hassle.
Instead, let us turn to already-prepared spreadsheets that can be found through Open Data initiatives. Indeed, every large city or government wants to promote innovation fueled by data, so they publish free-to-use datasets about their administration. For example:
- UK government: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
- France, tax: https://cfspart.impots.gouv.fr/patrimelweb/flux.ex#ancreDuHaut
- Los Angeles: https://data.lacity.org/
- New York City: https://opendata.cityofnewyork.us/
Other data sources can be found on:
- Machine learning forums, e.g. Reddit: https://www.reddit.com/r/MachineLearning/
- University academic portals, e.g. http://archive.ics.uci.edu/ml/datasets.html
- Kaggle portal for machine learning competitions: https://www.kaggle.com/
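These portals typically serve their datasets as CSV files. As a taste of what ingestion looks like in Python (introduced in the tools section below), here is a minimal sketch: the column names “surface”, “rooms” and “price” are invented for illustration, and `io.StringIO` stands in for a file downloaded from one of the portals above.

```python
# Minimal CSV ingestion sketch. The columns are hypothetical;
# io.StringIO stands in for a downloaded open-data file.
import io
import numpy as np

csv_file = io.StringIO(
    "surface,rooms,price\n"
    "45,2,150000\n"
    "70,3,230000\n"
    "120,5,410000\n"
)

# names=True reads the header row and gives named access to columns
ads = np.genfromtxt(csv_file, delimiter=",", names=True)
print(ads["price"].mean())  # average price of the three listings
```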
What are the tools I can use to host my experiment?
You need 4 basic tools:
- A programming engine
- Libraries that provide ready-to-use functions:
  - Data manipulation libraries
  - ML libraries
  - Plotting libraries
- A notebook manager to log the code and comments of your experiment
Yes, you need to code, but you don’t need to be an expert. We will start with Python, since it is the most widely used language in the machine learning community.
You can install the Anaconda Python distribution: https://www.anaconda.com/download/. It is open source and free to use.
The installation is really click-and-run. Anaconda also packages by default the Python libraries for:
- Machine learning: Scikit-learn, http://scikit-learn.org/stable/index.html
- Data manipulation (import/transform/export data): NumPy
- Plot visualization: Matplotlib
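Once Anaconda is installed, a quick sanity check in a first cell confirms the three libraries are available (the version numbers printed will vary with your installation):

```python
# Sanity check: confirm the bundled libraries import correctly.
import numpy
import sklearn
import matplotlib

print(numpy.__version__, sklearn.__version__, matplotlib.__version__)
```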
Moreover, Anaconda provides Jupyter as a notebook manager (http://jupyter.org/). In this environment, you will be able to call library functions and save your experiment.
You can test Jupyter online: http://jupyter.org/try
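To give a preview of where we are heading, the four steps of the experiment loop can already be sketched end to end with these tools. This is a minimal sketch on synthetic data: the surface/rooms features and prices are invented, and in a real experiment step 1 would read one of the open-data spreadsheets listed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# 1. Ingest: here we generate a toy dataset (surface in m², number of
#    rooms -> price) instead of reading a downloaded spreadsheet.
rng = np.random.default_rng(0)
surface = rng.uniform(30, 200, 300)
rooms = rng.integers(1, 6, 300)
X = np.column_stack([surface, rooms])
y = 3000 * surface + 10000 * rooms + rng.normal(0, 5000, 300)

# 2. Preprocess: e.g. keep only plausible rows (all pass here).
keep = (surface > 0) & (y > 0)
X, y = X[keep], y[keep]

# 3. Try approaches: define several candidate models.
candidates = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# 4. Bench: compare cross-validated R² scores and keep the best model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

On this toy dataset the prices are generated by a linear formula plus noise, so linear regression should come out on top of the bench; with real listings, the comparison is the whole point of the experiment.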
You’re all set to run your experiment. To be continued in the next blog post.