Chatbot using NLTK, NLP and a TF-IDF vectorizer

The demand for virtual chatbots is increasing day by day. They serve many purposes, but a major one is reducing manpower. There are various approaches to building a chatbot: 1. Intent and entity based (RASA). 2. Text and NLP based. 3. Dialogue based (GPT and GPT-2).

We will demonstrate the text and NLP based approach, as it does not require much data and provides satisfying results in a very short time.

Python Code and explanation

Import the necessary libraries/modules.

Provide the raw data, represented in a question ::: answer format.
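As a minimal sketch of this step, the snippet below parses text in the question ::: answer format into two parallel lists. The sample pairs and the helper name parse_pairs are assumptions made for illustration, not the original data or code.

```python
# Hypothetical sample data: one "question ::: answer" pair per line.
RAW_DATA = """
What is your name? ::: I am a demo chatbot.
What are your working hours? ::: We are open from 9 am to 5 pm.
How can I reset my password? ::: Use the 'Forgot password' link on the login page.
"""

def parse_pairs(raw, sep=":::"):
    """Split raw text into parallel lists of questions and answers."""
    questions, answers = [], []
    for line in raw.strip().splitlines():
        if sep not in line:
            continue  # skip blank or malformed lines
        question, answer = line.split(sep, 1)
        questions.append(question.strip())
        answers.append(answer.strip())
    return questions, answers

questions, answers = parse_pairs(RAW_DATA)
print(questions[0], "->", answers[0])
```

Keeping the two lists index-aligned means that the position of the best-matching question can later be used directly to look up its answer.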

Perform contraction mapping: contraction mapping means expanding terms like I’ll to I will.
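One possible way to implement contraction mapping is a dictionary lookup driven by a regular expression. The mapping below is a small hand-picked sample and expand_contractions is a hypothetical helper name. The expanded forms come out in lower case, which assumes the text is lower-cased later anyway.

```python
import re

# A small, hand-picked mapping for illustration; a real one would be longer.
CONTRACTIONS = {
    "i'll": "i will",
    "i'm": "i am",
    "can't": "cannot",
    "don't": "do not",
    "what's": "what is",
    "it's": "it is",
}

# One alternation pattern that matches any known contraction as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded (lower-case) form."""
    # Normalise the typographic apostrophe so that "I’ll" matches "i'll".
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I\u2019ll check what's wrong"))
```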

Prepare separate lists of questions and answers, and word-tokenize the questions. Here we tokenize and vectorize only the questions, because in this scenario the user will ask a question. We will compute the cosine similarity between the user's question and the list of questions used to build the chatbot. We will return the best match if one is found; otherwise we will return a generic string.

Tokenization is the process of splitting a string of text into a list of tokens. One can think of a token as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.

The word_tokenize() function is a wrapper that calls tokenize() on an instance of a Treebank-style word tokenizer (the TreebankWordTokenizer class in older NLTK releases). These tokenizers work by separating words using punctuation and spaces. Note that they do not discard the punctuation, which lets the user decide what to do with it during pre-processing.
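The behaviour described above can be seen in a short sketch. It uses TreebankWordTokenizer directly rather than word_tokenize(), an assumption made here only because this class is rule-based and needs no extra NLTK data download. The sample sentence is illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Hello, how are you?")
print(tokens)  # the comma and the question mark appear as separate tokens

# Punctuation is kept, so removing it is an explicit pre-processing choice.
words_only = [token for token in tokens if token.isalnum()]
print(words_only)
```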

Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. In layman's terms, we are finding the base word:
rocks : rock
corpora : corpus
better : good

Define a function that replies to a greeting.
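A greeting function of this kind might look like the sketch below. The word lists and the function name greeting are assumptions chosen for illustration.

```python
import random

# Hypothetical greeting vocabulary and replies, chosen for illustration.
GREETING_INPUTS = {"hello", "hi", "hey", "greetings"}
GREETING_RESPONSES = ["Hello!", "Hi there!", "Hey, how can I help?"]

def greeting(sentence):
    """Return a random greeting if the sentence contains a greeting word, else None."""
    for word in sentence.lower().split():
        # Strip trailing punctuation so that "Hi," still counts as "hi".
        if word.strip(".,!?") in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None

print(greeting("Hi, is anyone there?"))
```

Returning None for non-greetings lets the caller fall through to the question-matching step.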
Define a function to perform TF-IDF vectorization and compute cosine similarity. TF-IDF was invented for document search and information retrieval. The weight of a word increases proportionally to the number of times it appears in a document, but is offset by the number of documents that contain the word.

Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space.
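The matching step can be sketched with scikit-learn's TfidfVectorizer and cosine_similarity. The sample questions and answers, the fallback string, the 0.3 threshold and the name best_answer are all assumptions for illustration. This sketch also relies on the vectorizer's default tokenization rather than a custom lemmatizing tokenizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge base: index-aligned questions and answers.
QUESTIONS = [
    "What is your name?",
    "What are your working hours?",
    "How can I reset my password?",
]
ANSWERS = [
    "I am a demo chatbot.",
    "We are open from 9 am to 5 pm.",
    "Use the 'Forgot password' link on the login page.",
]
FALLBACK = "Sorry, I did not understand that."

# Fit the vocabulary and IDF weights on the known questions only.
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(QUESTIONS)

def best_answer(user_question, threshold=0.3):
    """Return the answer of the most similar known question, or the fallback."""
    user_vector = vectorizer.transform([user_question])
    scores = cosine_similarity(user_vector, question_vectors)[0]
    best = scores.argmax()
    if scores[best] < threshold:
        return FALLBACK  # nothing is similar enough to the user's question
    return ANSWERS[best]

print(best_answer("how do I reset my password"))
```

The threshold is a tuning choice: a higher value makes the bot fall back to the generic string more often, while a lower value lets it answer on weaker matches.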
Putting everything together and calling the functions according to the user's question/input.
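A rough sketch of this final step is a small dispatcher that checks for an exit word first, then a greeting, and only then tries to match a question. The function respond, the exit words and the two stand-in handlers are hypothetical. In a full bot the handlers would be the greeting and TF-IDF matching functions.

```python
EXIT_WORDS = {"bye", "quit", "exit"}  # hypothetical exit commands

def respond(user_text, greet, match):
    """Route one user message to a handler and return (reply, keep_going)."""
    text = user_text.strip().lower()
    if text in EXIT_WORDS:
        return "Bye! Take care.", False
    greeting_reply = greet(text)
    if greeting_reply is not None:
        return greeting_reply, True
    return match(text), True

# Stand-in handlers so that this sketch runs on its own.
def demo_greet(text):
    return "Hello!" if text in {"hi", "hello"} else None

def demo_match(text):
    return "Sorry, I did not understand that."

for message in ["hi", "what are your hours?", "bye"]:
    reply, keep_going = respond(message, demo_greet, demo_match)
    print(f"You: {message}")
    print(f"Bot: {reply}")
    if not keep_going:
        break
```

In an interactive session the for loop would be replaced by a while loop that reads each message with input().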
Sample Output: