How to Analyze Sentiment on Movie Reviews Using Machine Learning? Part 2 – Prepare the Text Data & Find the Best Machine Learning Algorithm

Let us get back to our problem of analyzing sentiment on movie reviews.

Let us recap what we have done so far.

In this post,

I will first prepare the text data and identify a series of transformations (a data pipeline) to obtain, from the raw dataset, a "prepared" dataset of numeric values. The prepared dataset can then be used as robust input to the standard machine learning algorithms implemented in the libraries.

Then we will find the best machine learning classification model to detect sentiment in movie reviews:

  • On a fixed training dataset, try out different categories of machine learning algorithms,
  • Shortlist the best performing ones,
  • Fine-tune the shortlisted algorithm's parameters to improve its performance.

For the basic theory, you can refer to the prediction approach with supervised learning (training a prediction model with labeled examples), whether it is a classification problem such as sentiment analysis or a regression problem such as housing price prediction.

Fetch & check out the dataset

Fetch Data

We load the list of files into the working memory of the laptop using the scikit-learn function "load_files":


from sklearn.datasets import load_files
dataset = load_files(movie_reviews_data_folder, shuffle=False)

Check out the data: we have exactly 2000 movie reviews.

print("n_samples: %d" % len(dataset.data))

n_samples: 2000

All the movie reviews are stacked in a holder object, a "bunch", with standard fields that can be accessed as dictionary keys or Python attributes.

For example, the categories (labels) can be accessed as:

dataset.target_names

['neg', 'pos']

The content of the text data can be accessed as dataset.data[0].

Our dataset is composed of text, and the labels are text too:

b'plot : two teen couples go to a church party , drink and then drive ...

Prepare the text dataset

Split into train and test data

We can save 25% (500 reviews) of our dataset as a test set and keep 75% (1500 reviews) for training. We let scikit-learn perform a random split of our dataset:

from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.25, random_state=None)

We notice that the first example of the training vector does not correspond to the first example in the dataset, since the split is random:

docs_train[0]

b'back in february at the monthly los angeles comic book and science fiction convention...


Encode category data into numbers

For speed and computation purposes, the labels are stored as indexes, 0 for negative and 1 for positive:

dataset.target

array([0, 0, 0, ..., 1, 1, 1])

To convert back into text labels, we can refer to the index:

for t in dataset.target[:10]:
    print(dataset.target_names[t])


Encode text data into numbers: tokenizing text

We need to map each text to a vector of numbers. For this we use the bag-of-words method:

  1. Each word of the reviews is assigned to a dictionary, a corpus feature with more than 100 000 words,
  2. For each document i, our dataset is represented as a collection of vectors X[i, j], where j indexes word w in the dictionary,
  3. Each document has fewer than 1000 words,
  4. We have a huge-dimensional but sparse dataset (lots of zeros, since each review uses only a small subset of the dictionary),
  5. We can optimize the memory storage of the vectors using scipy.sparse.
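As a toy illustration of the bag-of-words idea (the two mini "reviews" below are made up, not from our dataset), each document becomes a sparse row of word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up "reviews"
docs = ["the movie was great", "the plot was boring boring"]

vect = CountVectorizer()
X = vect.fit_transform(docs)  # scipy.sparse matrix: rows = docs, cols = words

print(sorted(vect.vocabulary_.keys()))
# → ['boring', 'great', 'movie', 'plot', 'the', 'was']
print(X.toarray())  # dense view: each cell counts one word in one document
```

Even on this toy corpus, only 8 of the 12 cells are non-zero; on the real reviews the sparse storage pays off far more dramatically.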

In scikit-learn, the class "CountVectorizer" creates the corpus dictionary and transforms a document into a feature vector:

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs_train)

From the collection of 1500 documents, we obtain a 1500 x 35418 dataset array:

X_train_counts.shape

(1500, 35418)

We can verify that each word is assigned to an index in the corpus dictionary, for example (the chosen word is arbitrary):

count_vect.vocabulary_.get('actor')

Normalize text data

Some words do not discriminate movie reviews because they appear regardless of the sentiment, such as: I, and, besides, walk, the, actor…

We need to give those frequently appearing words less weight.

We use a transformer to normalize occurrences across texts into term frequencies (word count divided by the length of the text):

  • Downscale less informative words that appear in many documents (inverse document frequency),
  • Fit the estimator to the data, then apply the transformation.

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(1500, 35418)
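To see the idf downscaling at work on a toy corpus (the three documents below are made up): a word that appears in every document, like "the", receives a lower idf weight than a rarer, more informative word like "great":

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up toy corpus: 'the' appears in every document, 'great' in only one
docs = ["the movie was great", "the plot", "the actor"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)
tfidf = TfidfTransformer().fit(counts)

# idf_ holds one weight per vocabulary word: ubiquitous words get a lower idf
print(tfidf.idf_[vect.vocabulary_['the']])    # lowest possible weight (1.0)
print(tfidf.idf_[vect.vocabulary_['great']])  # higher weight
```

After the transform, each document vector is also normalized, so long reviews do not dominate short ones.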

Let us train a sentiment analysis model

Naive Bayes estimates the word distributions of positive and negative reviews (conditional probabilities) based on the examples provided in the training set. It is a simple algorithm to start with for classification problems, and it provides a baseline at cheap computation cost.
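As a back-of-the-envelope sketch of that idea (the word counts below are invented, not from our dataset): the classifier scores a review by combining the per-class word probabilities estimated from training counts, with a little smoothing for unseen words:

```python
import math

# Made-up word counts per class from a hypothetical training set
counts = {
    'pos': {'great': 30, 'boring': 5, 'plot': 15},
    'neg': {'great': 5, 'boring': 40, 'plot': 15},
}

def class_log_score(words, label, prior=0.5, alpha=1.0):
    """log P(label) + sum of log P(word | label), with Laplace smoothing."""
    total = sum(counts[label].values())
    vocab_size = len(counts[label])
    score = math.log(prior)
    for w in words:
        p = (counts[label].get(w, 0) + alpha) / (total + alpha * vocab_size)
        score += math.log(p)
    return score

review = ['boring', 'plot']
best = max(['pos', 'neg'], key=lambda c: class_log_score(review, c))
print(best)  # → 'neg': 'boring' is far more frequent in negative reviews
```

MultinomialNB in scikit-learn implements this same principle on the full sparse vectors.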

In scikit-learn, we call the estimator MultinomialNB (multinomial Naive Bayes) on the normalized training vector:

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

Perform the prediction on a sample set

Let us test our trained sentiment analysis model on simple examples. I create two movie reviews:

docs_new = ['This movie was horrible', 'I like the story']

I apply the same vectorization and normalization, using the dictionary fitted on the training set:

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

We try to predict the sentiment, and the test seems OK:

predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, dataset.target_names[category]))

'This movie was horrible' => neg
'I like the story' => pos

Automate the data preparation & training with a pipeline

You will have noticed that we need to apply the same transformations (vectorization, normalization, classification) to the test set as to the training set. To chain those transformations, we create a pipeline:

from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

We can train the model in a single command:

text_clf.fit(docs_train, y_train)
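A quick sanity check of the pipeline idea on a tiny made-up corpus (texts and labels are invented for illustration): raw text goes in, a prediction comes out, with all the intermediate transformations handled internally:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up mini corpus: 1 = positive, 0 = negative
toy_docs = ["great movie great acting", "boring plot boring scenes",
            "great story", "boring and slow"]
toy_y = [1, 0, 1, 0]

clf = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])
clf.fit(toy_docs, toy_y)                 # raw text in, one command
print(clf.predict(["a boring movie"]))   # no manual transform needed
```

The same object can later be fed directly to GridSearchCV, which is exactly what we do in the fine-tuning section.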

Bench the machine algorithms to shortlist the best approach

Try simple Machine learning algorithm for classification

Let us evaluate the Naive Bayes model on the test examples (the 25% of the dataset we saved earlier):

import numpy as np
predicted = text_clf.predict(docs_test)
np.mean(predicted == y_test)


With Naive Bayes, we reach an accuracy of 77.8%: the proportion of sentiments correctly detected.

Let us try out another algorithm

We can try a Support Vector Machine, in short SVM. The idea is to find a relevant boundary between the examples of a known dataset. Once learned, those boundaries can discriminate new examples into categories.

It is an iterative algorithm and usually takes more computation resources than Naive Bayes.
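To make the "boundary" intuition concrete on a tiny made-up 2-D dataset (nothing to do with the review vectors, just an illustration), the classifier learns a line separating two clouds of points and uses it to classify new ones:

```python
from sklearn.linear_model import SGDClassifier

# Two linearly separable clouds of made-up points
X = [[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]]
y = [0, 0, 0, 1, 1, 1]

# Hinge loss + SGD iterations: a linear SVM trained incrementally
svm = SGDClassifier(loss='hinge', random_state=42, max_iter=1000, tol=1e-3)
svm.fit(X, y)

# New points fall on one side of the learned boundary or the other
print(svm.predict([[0.5, 0.5], [4.5, 4.5]]))
```

On our reviews, the "points" live in the 35418-dimensional tf-idf space instead of the plane, but the principle is identical.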

In scikit-learn, a linear SVM can be trained with the SGDClassifier (stochastic gradient descent with hinge loss):

from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None)),
])
text_clf.fit(docs_train, y_train)

predicted = text_clf.predict(docs_test)
np.mean(predicted == y_test)


With SVM, we reach a better result: 81.6% of correct sentiment detection on the test reviews.

So let us continue with the SVM model for this experiment.

Display the performance metrics

We can display the confusion matrix, which represents:

  • the negatives (0) correctly recognized as negative (0) or incorrectly recognized as positive (1),
  • the same for the positives.

from sklearn import metrics
cm = metrics.confusion_matrix(y_test, predicted)
print(cm)

[[191  59]
 [ 33 217]]

Ideally the diagonal of the confusion matrix should be maximal, and the off-diagonal entries should be zeros.
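The 81.6% accuracy reported above can be recovered directly from this matrix: the diagonal holds the correct predictions, and the total is the 500 test reviews.

```python
import numpy as np

# Confusion matrix from the run above
cm = np.array([[191, 59],
               [33, 217]])

accuracy = np.trace(cm) / cm.sum()  # (191 + 217) / 500
print(accuracy)  # → 0.816
```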

By displaying the confusion matrix with matplotlib, we obtain a heatmap:

import matplotlib.pyplot as plt
plt.matshow(cm)
plt.show()

We can obtain more details on the performance by calling the per-category metrics report in scikit-learn:

from sklearn import metrics
print(metrics.classification_report(y_test, predicted,
    target_names=dataset.target_names))

             precision    recall  f1-score   support

        neg       0.86      0.83      0.84       252
        pos       0.83      0.86      0.85       248

avg / total       0.85      0.85      0.85       500


Fine tune the chosen machine learning algorithm

Our next and final step will be to optimize the parameters of the chosen SVM model.

The steps are similar to the housing prediction experiment: we define a set of parameters to explore and measure the training performance for each combination.

Because we have text data, it is interesting to consider whether a representation per word (unigram) or per pair of words (bigram) is more relevant to the learning. So we set the parameters to explore that aspect:

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)]}

We configure the infrastructure for the grid search (n_jobs=-1 uses all the CPU cores):

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

With text, the computation duty is tremendous due to the high dimension of words, so we restrict the search to a subset of 200 examples:

gs_clf = gs_clf.fit(docs_train[:200], y_train[:200])

Grid search returns the best accuracy score and the best parameters for the model:

gs_clf.best_score_

for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
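The whole procedure can be sketched end to end on a small made-up corpus (texts, labels, and the grid are illustrative only; the real run uses the 200-review subset above):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Made-up mini corpus: 1 = positive, 0 = negative
docs = ["great movie", "boring plot", "great story", "boring scenes",
        "great acting", "boring slow", "great fun", "boring dull"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', SGDClassifier(loss='hinge', random_state=42))])

# One grid entry per pipeline step we want to tune: 'step__parameter'
params = {'vect__ngram_range': [(1, 1), (1, 2)]}
gs = GridSearchCV(pipe, params, n_jobs=-1, cv=2)
gs.fit(docs, labels)

print(gs.best_score_, gs.best_params_)
```

Each grid entry is named 'step__parameter', so the same mechanism can tune the vectorizer, the tf-idf transformer, and the classifier in one search.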


This concludes the experiment on sentiment analysis of movie reviews using machine learning. We have learned that:

  • Text data requires some normalization and transformation into sparse vectors during data preparation,
  • The same methodology as for classic numerical data applies:
    • framing,
    • analyzing the data,
    • preparation,
    • testing,
    • selecting the machine learning algorithms,
    • fine-tuning.
  • The machine learning libraries facilitate and minimize the code needed to train, evaluate and select the best algorithm,
  • Text data requires more computation resources than classic numerical data; to work around this, you can:
    • optimize on a smaller subset, or
    • work with a big data cluster.
