Previously, we worked on machine learning predictions using structured, numerical data (e.g. prices, number of bedrooms, X-Y coordinates, …).
In this new experiment, I would like to tackle a machine learning problem based on text data. Fortunately, the methodology remains the same:
- Frame the experiment,
- Prepare the data,
- Search for the best machine learning algorithms,
- Fine-tune the algorithm with the best parameters.
For this new post series, I chose sentiment analysis of movie reviews. The purpose is to show how to tackle a text data problem with machine learning.
On natural language processing
Because the input data is “human language” and not a structured form, the artificial intelligence area that deals with this data is called natural language processing. The goal is for machines to be able to understand and respond in human language (chatbots are one application). Alan Turing, an English mathematician, formulated a test for machine “intelligence”, the Turing test: a human should be able to chat with the robot without noticing that it is a robot.
As of July 2018, we are not there yet, but the Google Duplex demo brings us closer to that goal.
First, Frame your experiment
What is the problem I want to tackle? (Experiment objectives)
The problem is to build a machine reader that can judge whether a published movie review is positive or negative. One application is a robot that can automatically moderate forum or social media comments. For example, Facebook uses the same idea to detect pornographic or racist content. The robot can then alert a human moderation team.
Business Sanity Check
As a reference, the manual way would be almost unfeasible: a team could not review all the comments with a decent response time. So the business case seems fair.
Can my problem be best solved with machine learning?
The problem with comments is that, unlike a like/unlike button, it can be tricky to model what makes a review positive or negative. We can guess that a list of keywords is a possible solution, but we cannot be sure our list is exhaustive, and we potentially have the whole English dictionary to cover.
The data is there to save the day: we can have tons of comments or reviews rated by the users themselves (e.g. movie ratings) or tagged by moderation teams. This problem is best solved with machine learning.
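To make the limits of the keyword idea concrete, here is a minimal sketch of a keyword-based judge. The two word lists and the function name keyword_sentiment are hypothetical, made up for this illustration; they are not part of the experiment itself.

```python
import re

# Hypothetical, hand-picked keyword lists (illustration only, far from exhaustive).
POSITIVE_WORDS = {"good", "great", "excellent", "funny", "brilliant"}
NEGATIVE_WORDS = {"bad", "boring", "awful", "dull", "terrible"}


def keyword_sentiment(review):
    """Label a review by counting positive and negative keywords."""
    words = re.findall(r"[a-z']+", review.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "unknown"  # no keyword matched, or the counts cancel out


print(keyword_sentiment("A great cast and a brilliant script."))         # positive
print(keyword_sentiment("Two hours of my life I will never get back."))  # unknown
print(keyword_sentiment("This movie is not good at all."))               # positive (wrong!)
```

The second review contains none of the listed words, and the third one fools the counter with a simple negation. Extending the lists by hand does not scale, which is why learning from labeled reviews looks more promising.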
Choose your dataset
We choose the movie review dataset from Cornell University: http://www.cs.cornell.edu/people/pabo/movie-review-data/
It contains 2000 reviews classified as positive or negative.
Let us specify the machine learning problem type:
- a supervised learning problem: the positive/negative judgement can be learned from a labeled dataset.
- a classification problem: we want to classify an unknown movie review into the positive or negative category.
- a laptop infrastructure is fine: the dataset is 2000 records with X features (X being the number of words in the reviews). We do not need Big Data infrastructure (storage and processing on multiple machines).
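To give an idea of what "2000 records on X features" could look like, here is a small sketch that turns labeled reviews into a table of word counts. It assumes a simple bag-of-words representation (one feature per distinct word), which is my assumption at this stage and not a choice made in this post. The three reviews are a hypothetical stand-in for the real dataset.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the 2000 labeled reviews.
reviews = [
    ("a great and funny movie", "positive"),
    ("a boring and dull movie", "negative"),
    ("great acting great story", "positive"),
]


def tokenize(text):
    """Split a review into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


# The vocabulary holds every distinct word: each one becomes a feature (a column).
vocabulary = sorted({word for text, _ in reviews for word in tokenize(text)})


def to_features(text):
    """Count how often each vocabulary word appears in the review."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


X = [to_features(text) for text, _ in reviews]                  # one row per review
y = [1 if label == "positive" else 0 for _, label in reviews]   # labels for supervised learning

print(len(X), "records x", len(vocabulary), "features")  # 3 records x 9 features
```

With the real dataset, the number of rows would be 2000 and the number of columns would be the size of the vocabulary, which still fits comfortably in a laptop's memory.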
What are the tools I can use to host my experiment?
For the tooling, we keep the same setup as for the housing price prediction experiment (see: https://learn-ai-tech.com/how-to-experiment-prediction-on-housing-price-using-machine-learning-part-1-frame-your-experiment/):
- Programming engine: Python
- Libraries that provide ready-to-use functions:
  - Data manipulation libraries: NumPy
  - ML libraries: scikit-learn
  - Plotting libraries: Matplotlib
- Notebook manager: Jupyter
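Before starting, it can help to check that the tools listed above are installed. The sketch below is one possible way to do it with the standard library only. The import names in REQUIRED are assumptions on my side (for example, scikit-learn is imported as sklearn), and check_environment is a hypothetical helper name.

```python
import importlib.util

# Import names assumed for the tools listed above (display name -> module name).
REQUIRED = {
    "NumPy": "numpy",
    "scikit-learn": "sklearn",
    "Matplotlib": "matplotlib",
    "Jupyter": "jupyter",
}


def check_environment(required=REQUIRED):
    """Return {display name: True/False} depending on whether the module can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in required.items()
    }


for name, installed in check_environment().items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any tool reported as missing can be installed with your usual package manager before moving on.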
We are all set to run our experiment. To be continued in the next blog post.