There are various classical machine learning algorithms, such as Naive Bayes, Logistic Regression, and Support Vector Machines, that we can use for text classification. Most of these algorithms assume that the words in a text are independent of each other, so they cannot capture the dependencies between words. Text is sequential data, where any given word depends on several of the words around it. Recurrent Neural Networks (RNNs) can handle sequential data, but RNNs have two major drawbacks:
- There is a vanishing gradient problem in RNNs.
- There is an exploding gradient problem in RNNs.
Due to these two drawbacks, RNNs are sometimes unable to handle long-term dependencies. To address these problems, there is a special variant of the RNN called the LSTM (Long Short-Term Memory) network. It can handle long-term as well as short-term dependencies without suffering from the vanishing or exploding gradient problem. By the end of this article, you should be able to classify a text dataset using an LSTM.
How to feed the data to an LSTM:
We have to feed the data to the LSTM in a particular format. First, we count all the unique words in the dataset and build a dictionary that records how many times each word occurs. We then sort this dictionary by the number of occurrences. A word's position in this sorted dictionary (a numerical value) becomes the representation of that word. Suppose our text dataset contains three sentences:
```python
data = ["this dog is really fast", "dog barks on the strangers", "I like dogs who barks on strangers"]
```
First, we will store all the words and all the unique words in two different lists.
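A minimal sketch of this step, assuming plain whitespace tokenization (the sample output below counts "dogs" together with "dog", which suggests the original code applied some normalization this sketch does not reproduce):

```python
# Split every sentence on whitespace and collect all the words
all_words = []
for sentence in data:
    all_words.extend(sentence.split())

# The unique words, in no particular order
unique_words = list(set(all_words))
```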
Now we will create a dictionary that maps each word (key) to the number of times it occurs (value), and sort it by that value.
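One way to build and sort that dictionary is with `collections.Counter`; note that the order of words with equal counts is arbitrary, so ties may appear in a different order than shown below:

```python
from collections import Counter

# Count how many times each word occurs in the corpus
word_counts = Counter(all_words)

# Sort (word, count) pairs by count, highest first
sorted_words = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
print(sorted_words)
```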
Output:
```
[('dog', 3),
 ('strangers', 2),
 ('on', 2),
 ('barks', 2),
 ('who', 1),
 ('like', 1),
 ('I', 1),
 ('the', 1),
 ('fast', 1),
 ('really', 1),
 ('is', 1),
 ('this', 1)]
```
After running the code below, each sentence gets converted into its numerical representation.
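A sketch of that conversion: each word is replaced by its 1-based position in the frequency-sorted list built above (the exact numbers depend on how ties were ordered):

```python
# Map each word to its 1-based rank in the frequency-sorted list
word_to_rank = {word: rank for rank, (word, _) in enumerate(sorted_words, start=1)}

# Replace every word in every sentence with its rank
new_data = [[word_to_rank[word] for word in sentence.split()] for sentence in data]
print(new_data)
```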
new_data contains the numerical representation of the text data, and it will look like:

```
[[12, 1, 11, 10, 9], [1, 4, 3, 8, 2], [7, 6, 1, 5, 4, 3, 2]]
```
We can see that "dog" occurs the most times, so it has been assigned 1; "strangers" occurs the second-most times, so it gets an encoding of 2; and so on.
Next, we will break the dataset into train and test sets:
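A sketch of the split using scikit-learn's `train_test_split`; here `labels` is a hypothetical array with one class label per document (the toy three-sentence example has no labels, so this step applies to a real labelled dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# `labels` is assumed: one class label (e.g., 0 or 1) per document
X_train, X_test, y_train, y_test = train_test_split(
    new_data, np.array(labels), test_size=0.2, random_state=42
)
```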
In the dataset, the sentences will not all have the same length, so the code below makes every sentence the same length: it pads with extra (zero) encodings if a sentence is shorter than 600 words and removes the least frequent words if it is longer than 600 words.
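A sketch of that step, assuming 0 is reserved as the padding value (it is never used as a word rank) and that truncation drops the words with the largest rank values, i.e. the least frequent ones, as described above:

```python
import numpy as np

MAX_LEN = 600  # fixed sentence length used in this article

def pad_or_trim(seq, max_len=MAX_LEN):
    if len(seq) > max_len:
        # Keep the max_len most frequent words (smallest ranks),
        # preserving their original order within the sentence
        keep = sorted(sorted(range(len(seq)), key=lambda i: seq[i])[:max_len])
        return [seq[i] for i in keep]
    # Left-pad shorter sequences with the reserved value 0
    return [0] * (max_len - len(seq)) + seq

X_train = np.array([pad_or_trim(s) for s in X_train])
X_test = np.array([pad_or_trim(s) for s in X_test])
```

Keras's `pad_sequences` could replace this helper, but note that it truncates by position in the sentence rather than by word frequency.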
Now we will classify our sentences. First, we import all the required libraries:
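Assuming the Keras API from TensorFlow (the original article's framework is not shown here), the imports might look like:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
```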
Define the LSTM Architecture:
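A minimal sketch of such an architecture for binary classification; the embedding size, the number of LSTM units, and the dropout rate are all assumptions to tune:

```python
VOCAB_SIZE = len(word_to_rank) + 1  # +1 because 0 is reserved for padding
EMBEDDING_DIM = 64                  # assumed embedding size; tune as needed

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),
    LSTM(100),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),  # binary labels assumed
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```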
Train the Model and Save the History:
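A sketch of the training call; the epoch count, batch size, and validation split are assumptions:

```python
history = model.fit(
    X_train, y_train,
    validation_split=0.2,  # hold out part of the training data for validation
    epochs=5,
    batch_size=64,
)
```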
Evaluate the Trained Model:
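Finally, evaluate the model on the held-out test set (a sketch):

```python
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}")
```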
Exercise:
- Download the Amazon Fine Food Reviews dataset: https://www.kaggle.com/snap/amazon-fine-food-reviews
- Perform all the above steps on this dataset.
- After performing the above steps, let us know the accuracy score of your model in the comments section.