Sentiment Analysis of Amazon Reviews using Python - NLP

Sentiment analysis in Python using natural language processing. This Python project is built using the nltk library.

Introduction

Have you ever bought a product from a shopping site like Amazon or eBay and left a review, or shared your opinion in a discussion on social media? Every day, millions of such comments are posted on shopping sites, social media platforms, and other online services. All of these reviews and comments are very important for strengthening the service of any company or organization.

But most of this user-generated data is unstructured, which makes it very difficult to analyze. Fortunately, modern information technology has made the task much easier. From users' reviews on a shopping site to users' comments on social media, Artificial Intelligence now makes it possible to understand the underlying intent.

In this tutorial, we will try to understand the positive and negative intent of users' comments on a shopping site (in our case, Amazon) using Natural Language Processing. We will create a project (Sentiment Analysis in Python using NLP) that analyzes the sentiment (Positive or Negative) of a comment we enter.

That was the introduction; let's move on to the project details and start discussing our main topic: Sentiment Analysis of users' comments or reviews using NLP.

See also: Six Degrees of Kevin Bacon in Python: AI Project

The Project Details

Since NLP (Natural Language Processing) is the main topic of this project, let us first discuss this topic briefly.

What is NLP?

Natural language processing, or NLP, is a crucial part of Artificial Intelligence in which we train computers, through programming, to understand human languages and make correct decisions or answers based on them.

How to build this Project?

Hearing terms like Artificial Intelligence and NLP (Natural Language Processing), you might think this is going to be a complicated topic, but trust me, after completing this tutorial you won't think so. As I said earlier, to build this Sentiment Analysis System, we need to train our computer first. But how, right?

To keep the overall discussion simple, I have collected some positive and negative reviews (a text corpus) made by users of one of the most popular shopping websites in the world, Amazon, and stored them in two separate files ('positives.txt' and 'negatives.txt').

Next, we will create a Python Program where only the Python library called NLTK (Natural Language Toolkit) will do the rest very easily.

What is the role of the nltk library in this Project?

NLTK, or Natural Language Toolkit, is a very powerful Python library for analyzing unstructured data (the amount of data can be small or huge). It offers several classes and methods to perform various operations on such data and to train computers on it, so that we can create an AI model that can read and understand human languages.

To build this Sentiment Analysis project, we will go step-by-step through several processes and take the help of some pre-defined algorithms that are already implemented in this beautiful Python library, NLTK.

There is no need to take this burden on your head; we will leave that task to the nltk library. Instead, let's look step-by-step at exactly how we can develop the program.

The steps we will go through

1. Import Modules: First, we will import the necessary modules into our program.

2. Loading Data: The next task is to load the corpus data (those two text files, 'negatives.txt' and 'positives.txt') and read the text data from them.

3. Data Extraction: The data will be extracted in the form of sentences and stored in a Python list.

4. Tokenization: This is the first important step we need to perform, called tokenization (the process of breaking sentences and paragraphs into smaller chunks, often called words). In this project, I used RegexpTokenizer, which gives an additional advantage when tokenizing a paragraph or sentence. Since it tokenizes with a regular expression (here, r'\w+'), it also strips punctuation from the given paragraph or sentence.

This will be very helpful when you are dealing with a large amount of data and want to reduce the computation as much as possible.
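As a quick illustration of what this step produces: the pattern r'\w+' matches runs of word characters, so punctuation never makes it into the token list. The sketch below uses the standard library's re module, which behaves the same way as RegexpTokenizer(r'\w+') for this pattern (the example sentence is my own):

```python
import re

# The pattern r'\w+' matches runs of letters, digits, and underscores,
# so punctuation is dropped rather than emitted as separate tokens.
sentence = "This product is great, isn't it?"
tokens = re.findall(r'\w+', sentence)
print(tokens)  # ['This', 'product', 'is', 'great', 'isn', 't', 'it']
```

With NLTK installed, RegexpTokenizer(r'\w+').tokenize(sentence) returns the same list.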

5. Removing Stopwords: After tokenizing our data, we will remove the stopwords from it. Stopwords are words that occur very frequently in a language; for example: the, a, an, of, etc. Removing them also helps when processing large amounts of data.

To accomplish this task, we will use the stopwords corpus from nltk.corpus. The English stopwords are already listed there, so we don't have to supply them to the program ourselves.
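Here is a minimal sketch of the stopword-removal step. The stopword set below is a tiny hand-picked subset just for illustration; in the actual project it comes from stopwords.words("english"), which covers far more words:

```python
# A tiny, hand-picked subset of English stopwords for illustration;
# in the project the full list comes from nltk.corpus.stopwords.
stop_words = {"the", "a", "an", "of", "is", "it"}

tokens = ["the", "delivery", "of", "the", "product", "is", "great"]
# Keep only the tokens that are not stopwords.
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['delivery', 'product', 'great']
```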

6. Lemmatization: The last stage of our data filtering will be lemmatization. Lemmatization is the process of extracting the base word (lemma) from a given word. For example, the base word of 'running' is 'run', and likewise the lemmatized version of 'playing' is 'play'.
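Here is a toy sketch of the idea, using a small hand-written lookup table in place of the real WordNet database (the project itself uses nltk's WordNetLemmatizer). One caveat worth knowing: WordNetLemmatizer.lemmatize treats words as nouns by default, so turning a verb like 'running' into 'run' requires passing pos='v':

```python
# A toy lookup table standing in for WordNet; the project uses
# nltk's WordNetLemmatizer, which consults the real WordNet database.
lemmas = {"running": "run", "playing": "play", "reviews": "review"}

def lemmatize(word):
    # Fall back to the word itself when no base form is known,
    # mirroring WordNetLemmatizer's behaviour for unknown words.
    return lemmas.get(word, word)

print([lemmatize(w) for w in ["running", "playing", "camera"]])
# ['run', 'play', 'camera']
```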

7. Text Classification: We have successfully tokenized and filtered our data. Now we need to perform text classification, where we organize our text data into groups. Here we will take the help of the Naive Bayes classifier algorithm to automatically analyze the text and assign one of a set of pre-defined tags (sentiment tags; in our project: 'Positive' and 'Negative'). It helps categorize the entire dataset based on its content.
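To make the idea concrete, here is a tiny Bernoulli-style Naive Bayes sketch in plain Python, with made-up one-line "documents". It scores each label as P(label) times a product of per-word probabilities with add-one smoothing; nltk's NaiveBayesClassifier fits the same kind of model from our feature dictionaries, though it uses a slightly different smoothing estimator internally:

```python
# Two tiny labelled "documents" per class, as bags of words; the real
# project builds these from the Amazon review files.
training = [
    ({"great", "love"}, "Positive"),
    ({"great", "price"}, "Positive"),
    ({"broken", "bad"}, "Negative"),
    ({"bad", "refund"}, "Negative"),
]
vocab = {w for doc, _ in training for w in doc}

def prob(label, document):
    # P(label) * product over the vocabulary of P(word present | label),
    # with add-one (Laplace) smoothing so unseen words never zero out
    # the score.
    docs = [d for d, l in training if l == label]
    p = len(docs) / len(training)
    for word in vocab:
        present = sum(word in d for d in docs)
        p_word = (present + 1) / (len(docs) + 2)
        p *= p_word if word in document else (1 - p_word)
    return p

scores = {l: prob(l, {"great", "price"}) for l in ("Positive", "Negative")}
print(max(scores, key=scores.get))  # Positive
```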

8. Printing the Result: Lastly, we will print whether the comment entered at run time is more likely to be 'Positive' or 'Negative'.

Requirements

Before we get to the source code, let's see what you need to install beforehand.

Install NLTK

Command: pip install nltk

Install NLTK Data

Since we are going to perform tokenization and lemmatization, we have to install some additional resources before we proceed. Let's do all these tasks first so that no errors occur when running the program later.

Open your Python interpreter and run the following lines of code (here we will download two resources, 'punkt' and 'wordnet').


import nltk
nltk.download('punkt')
nltk.download('wordnet')

Source Code

Don't start copying the source code yet; take a moment and do the following first.

1. Download the project file directly through the Download Button provided below.

2. Unzip the zip file

3. Now create a python file in the project folder with this name: 'sentiment.py'

Now you are ready to copy the entire code. Do that and paste it into the 'sentiment.py' program file. To run the program, type the command: "python sentiment.py corpus", where 'sentiment.py' is the program file and 'corpus' is the directory where the text files are stored. [If you are using Linux, use python3 instead of python.]

The entire program is quite simple, and I have added comments at various places in it. Presenting the source code as one commented listing, rather than splitting it into several parts, should help you understand the whole program at once.


# Sentiment Analysis Project using Natural Language Processing
# The code is written in the Python programming language

import nltk
import os
import sys
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Declaring an object of the WordNetLemmatizer class
lema = WordNetLemmatizer()
# Getting the English stopwords and storing them in a set variable.
stop_words = set(stopwords.words("english"))
# Object of the RegexpTokenizer class; r'\w+' keeps only word characters
tokenizer = RegexpTokenizer(r'\w+')

# The main function
def main():
    # Read data from files
    if len(sys.argv) != 2:
        sys.exit("Usage: python sentiment.py corpus")
    positives, negatives = load_data(sys.argv[1])

    # Create a set of all words
    words = set()
    for document in positives:
        words.update(document)
    for document in negatives:
        words.update(document)

    # Filter the data (remove stopwords, then lemmatize)
    words = filter_data(words)

    # Extract features from the text
    training = []
    training.extend(generate_features(positives, words, "Positive"))
    training.extend(generate_features(negatives, words, "Negative"))

    # Text classification using the Naive Bayes classifier
    classifier = nltk.NaiveBayesClassifier.train(training)
    s = input("Please input your comment here: ")
    result = classify(classifier, s, words)

    # Getting the probability value of each label:
    # 'Positive' and 'Negative' (compare as floats, not strings)
    pos_value = result.prob("Positive")
    neg_value = result.prob("Negative")

    print()

    # Printing the result
    if pos_value > neg_value:
        print("Review Status: Positive🙂")
    else:
        print("Review Status: Negative🙁")

# Function to filter the pre-defined dataset
def filter_data(data):
    filtered_data1 = set()
    filtered_data2 = set()

    # Removing stopwords
    for word in data:
        if word not in stop_words:
            filtered_data1.add(word)

    # Performing lemmatization
    for word in filtered_data1:
        filtered_data2.add(lema.lemmatize(word))

    return filtered_data2

# Function to tokenize the text data
def extract_words(document):
    return set(
        word.lower() for word in tokenizer.tokenize(document)
        if any(c.isalpha() for c in word)
    )

# Read the text data from the two text files, 'positives.txt' and
# 'negatives.txt', which are already stored in the corpus directory.
def load_data(directory):
    result = []
    for filename in ["positives.txt", "negatives.txt"]:
        with open(os.path.join(directory, filename)) as f:
            result.append([
                extract_words(line)
                for line in f.read().splitlines()
            ])
    return result

# Build a (features, label) pair for every document
def generate_features(documents, words, label):
    features = []
    for document in documents:
        features.append(({
            word: (word in document)
            for word in words
        }, label))
    return features

# Classify the data
def classify(classifier, document, words):
    document_words = extract_words(document)
    features = {
        word: (word in document_words)
        for word in words
    }
    return classifier.prob_classify(features)

if __name__ == "__main__":
    main()

Output

Here, I have entered several review comments into the program. Let's see what results it gave us.

Output 1

Please input your comment here: My mobile was broken.

Review Status: Negative🙁

Output 2

Please input your comment here: Customer care service is not so good.

Review Status: Negative🙁

Output 3

Please input your comment here: Very cheap price

Review Status: Positive🙂

Output 4

Please input your comment here: Delivery service is great.

Review Status: Positive🙂

Output Video

Watch the entire video to understand how the project actually works.

Download the Project File

Download the required corpus data via the Download Button below and then place the program file named 'sentiment.py' there.

Summary

In this lesson, we built a Sentiment Analysis project in Python with the help of Natural Language Processing, an important part of Artificial Intelligence. We trained our model on pre-collected positive and negative review comments gathered from the Amazon shopping site. We went through several steps to build this project, and each of them is important to understand.

Not just Amazon reviews; you can analyze the intent of any comment taken from anywhere using this Sentiment Analysis System, with one condition: it works only for the English language.

If you have any queries related to this project, please do not hesitate to let me know in the comment section about your issue. You will get a reply soon.

Thanks for reading!

PySeek

Subhankar Rakshit

Meet Subhankar Rakshit, a Computer Science postgraduate (M.Sc.) and the creator of PySeek. Subhankar is a programmer who specializes in the Python language. With several years of experience under his belt, he has developed a deep understanding of software development. He enjoys writing blogs on various topics related to Computer Science, Python programming, and software development.
