Question Answering System in Python using NLP

question answering system in python using natural language processing

Introduction

Imagine a machine answering all your questions, you throw different kinds of questions at that machine and it answers every one of your questions without getting tired. In this case, the machine is nothing but a computer. This is not unrealistic at all. This has also been made possible by the advanced technology today and the contribution of modern information technology is outstanding in this field where Artificial Intelligence continues to play one of the key roles.

Here, we will create a simple Question Answering System in Python using Natural Language Processing (NLP) which will be able to answer our questions using its own intelligence within a certain range.

The reason I said, "a certain range" is because here we will train our AI model with some data on some specific topic collected from the web. The AI will then take the query from us, understand that and provide the most relevant result based on that pre-collected data.

This was an Introductory part, let's move on to the Project Details and start talking about our main topic (Question Answering System in Python using NLP).

Visit Also: Sentiment Analysis of Amazon Reviews using Python - NLP

The Project Details

Since NLP (Natural Language Processing) is the main topic of this project, let us first discuss this topic briefly.

What is NLP?

Natural language processing or NLP is a crucial part of Artificial Intelligence through which we train computers through programming to understand Human Languages and make correct decisions or answers based on that.

How to build this project?

Hearing the word Artificial Intelligence, NLP (Natural Language Processing), etc. you might think it is going to be a complicated topic but trust me after covering the complete tutorial, you won't think so. Here, we need to train our AI model with some data first. But how, right?

To keep the overall discussion simple, I've chosen some popular topics, collected data related to that topic from Wikipedia pages, and stored them in separate text files by topic. This is our Corpus data.

For example, in the main project file, you will find these text files as the corpus of documents: 'kevin_mitnick.txt', 'linux.txt', 'microsoft.txt', 'python.txt', and 'trump.txt'. You can add more data or modify the existing data here but remember one thing, the document file must be a text file. These files need to be placed in a folder and the name of that folder must be specified when running the program (you will get more information on this later).

Next, we'll create a Python Program. First, we will perform the Document Retrieval operation. Here, we will access the data of those text documents and try to find out which document(s) are most relevant to the user's query (the query must be in English). We will use tf-idf to find the most relevant documents.

Our next operation will be Passage (sentences) Retrieval where the most relevant document(s) will be split into several passages to determine the most relevant passage to the user's query. To do this, we will use a combination of idf (inverse document frequency) and query term density.

The entire program is so transparent and well organized (Comment lines will help you to understand the objective of each important segment of the code) but before moving there it's important to understand the driver or main function of our program briefly.

The Main Function

We will go through the following steps in the main function.

Step 1: We will load the text file from the directory (where those files are located. In our case the directory name is 'corpus') using the load_files function.

Step 2: Each of the files will be tokenized (using tokenize function) into a list of words.

Step 3: We will compute inverse document frequency (idf) for each of the words using the compute_idfs function.

Step 4: Now we have to take the input query from the user.

Step 5: The top_files function will find the most relevant file that matches the user's query.

Step 6: At last, the top sentences from those top matches files will be extracted which are the best match with the query.

We have to define these functions separately: load_files, tokenize, compute_idfs, top_files, and top_sentences. Don't be stressed. As I said before, I tried to keep the overall code as simple as possible. Functions are easy to understand. You just need a little patience and attention.

Requirements

We're so close to our source code, but before we get there let's see what you need to install beforehand.

NLTK

NLTK or Natural Language Toolkit is a very powerful tool provided by Python to analyze unstructured data (The amount of data could be small or huge too). It offers several classes or methods to perform various operations on such given data and trains computers to work on that to create a beautiful AI Model that can read or understand Human Languages.

Install NLTK: pip install nltk

Install NLTK Data

Since we are going to perform tokenization and lemmatization, we have to install some additional resources before we proceed. Let's do all these tasks first so that no errors occur when running the program later.

Open your python interpreter and run the following lines of code (Here we will download 'punkt' and 'wordnet', these two resources).


import nltk
nltk.download('punkt')
nltk.download('wordnet')

Source Code

Don't start copying the Source Code first, take a break here and do the following.

1. Download the project file directly through the Download Button provided below.

2. Unzip the zip file.

3. Now create a python file in the project folder with this name: 'question.py'.

Now you are ready for copying the entire code. So, do that and paste it into the 'question.py' program file.

To run the program, you type the command: "python question.py corpus", where 'question.py' is the program file and 'corpus' is the directory where the text document files are stored [If you are using Linux, instead of python use python3].

I left another directory called 'small' in there which contains two text files with small amount of text data. It may help you determine how the code is handling the data.


import nltk
import sys
import os
import string
import math
from nltk.corpus import stopwords


FILE_MATCHES = 1
SENTENCE_MATCHES = 1

# Getting the stopwords and storing them into 
# a set variable.
stop_words = set(stopwords.words("english"))


def main():
    # Check command-line arguments
    if len(sys.argv) != 2:
        sys.exit("Usage: python questions.py corpus")

    # Calculate IDF values across files
    files = load_files(sys.argv[1])
    file_words = {
        filename: tokenize(files[filename])
        for filename in files
    }
    file_idfs = compute_idfs(file_words)


    # Prompt user for query
    query = set(tokenize(input("Query: ")))

    # Determine top file matches according to TF-IDF
    filenames = top_files(query, file_words, file_idfs, \
    n=FILE_MATCHES)


    # Extract sentences from top files
    sentences = dict()
    for filename in filenames:
        for passage in files[filename].split("\n"):
            for sentence in nltk.sent_tokenize(passage):
                tokens = tokenize(sentence)
                if tokens:
                    sentences[sentence] = tokens

    # Compute IDF values across sentences
    idfs = compute_idfs(sentences)

    # Determine top sentence matches
    matches = top_sentences(query, sentences, idfs, \
    n=SENTENCE_MATCHES)
    print()
    for match in matches:
        print(match)


def load_files(directory):
    """
    Given a directory name, return a dictionary mapping 
    the filename of each `.txt` file inside that 
    directory to the file's contents as a string.
    """
    file_dict = {}

    for file in os.listdir(directory):
        with open(os.path.join(directory, file), \
        encoding="utf-8") as f:
            file_dict[file] = f.read()

    return file_dict


def tokenize(document):
    """
    Given a document (represented as a string), return 
    a list of all of the words in that document, in order.

    Process the document by converting all words in 
    lowercase, and removing any punctuation or English stopwords.
    """
    tokenized_data=nltk.tokenize.word_tokenize(document.lower())

    final_data = list()
    
    for item in tokenized_data:
        if item not in stop_words and item not in string.punctuation:
            final_data.append(item)

    return final_data


def compute_idfs(documents):
    """
    Given a dictionary of `documents` that maps names of
    documents to a list of words, return a dictionary that
    maps words to their IDF values.

    Any word that appears in at least one of the documents 
    should be in the resulting dictionary.
    """
    idf = dict()

    document_len = len(documents)

    all_words = set(sum(documents.values(), []))

    for word in all_words:
        count = 0
        for doc_values in documents.values():
            if word in doc_values:
                count += 1

        idf[word] = math.log(document_len/count)

    return idf

def top_files(query, files, idfs, n):
    """
    Given a `query` (a set of words), `files` 
    (a dictionary mapping names of files to a 
    list of their words), and `idfs` (a dictionary 
    mapping words to their IDF values), return a 
    list of the filenames of the `n` top files that 
    match the query, ranked according to tf-idf.
    """
    scores_dict = dict()
    for filename, filedata in files.items():
        file_score = 0
        for word in query:
            if word in filedata:
                file_score += filedata.count(word) * idfs[word]
        if file_score != 0:
            scores_dict[filename] = file_score

    sorted_list = list()

    for key, value in sorted(scores_dict.items(), \
    key=lambda v: v[1], reverse=True):
        sorted_list.append(key)
    
    return sorted_list[:n]


def top_sentences(query, sentences, idfs, n):
    """
    Given a `query` (a set of words), `sentences` 
    (a dictionary mapping sentences to a list of their words), 
    and `idfs` (a dictionary mapping words to their IDF values), 
    return a list of the `n` top sentences that match
    the query, ranked according to idf. If there are ties, 
    preference should be given to sentences that have a higher 
    query term density.
    """
    top_sentences = dict()
    for sentence, words in sentences.items():
        sent_score = 0
        for word in query:
            if word in words:
                sent_score += idfs[word]

        count = 0
        if sent_score != 0:
            for word in query:
                count += words.count(word)
            density = count / len(words)
            top_sentences[sentence] = (sent_score, density)
    
    sorted_list = list()

    for key in sorted(top_sentences.keys(), \
    key = lambda v: top_sentences[v], reverse=True):
        sorted_list.append(key)
    
    return sorted_list[:n]


if __name__ == "__main__":
    main()

Output

Here, I asked the AI a series of questions. Let's see what our AI gave us as a result.

Output 1

Query: Who is Kevin Mitnick?

Kevin David Mitnick (born August 6, 1963) is an American computer security consultant, author, and convicted hacker.

Output 2

Query: Who founded Microsoft?

Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800.

Output 3

Query: What are popular Linux distributions?

Popular Linux distributions include Debian, Fedora Linux, and Ubuntu, which in itself has many different distributions and modifications, including Lubuntu and Xubuntu.

Output 4

Query: Who created Python Programming Language?

Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.

Output 5

Query: In which year python 2.0 released?

Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.

Output 6

Query: Who is Donald Trump?

Donald Trump (full name Donald John Trump) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021.

Output 7

Query: Trump's children.

He had three children: Donald Jr. (born 1977), Ivanka (born 1981), and Eric (born 1984).

Output Video

Watch the entire video to understand how the project actually works.

Download the project file

Download the required corpus data via the Download Button below and then place the program file named 'question.py' there.

Important Note

This project is a solution to project6 CS50's Artificial Intelligence. I have defined the program in the question.py file on my own. It was a task that was given in lecture 6, Language. This course is totally free and everyone can take this course from there.

Summary

In this lesson, we built a Question Answering System using Python where we have taken the help of Natural Language Processing which is an important part of Artificial Intelligence. We trained our AI Model through some pre-collected topics collected from Wikipedia.

You are welcome to add more topics to this Question Answering System or modify the existing. Just keep in mind one thing, the document file must be a text file in the corpus directory. For any query related to this topic, please don't hesitate to leave your comment below. You will get a reply soon.

Have a nice day ahead, cheers!💙

PySeek

Question Answering System in Python using NLP - AI Project