Amazon Review Text Classification using Logistic Regression (Python sklearn)

Overview: Logistic Regression is one of the most commonly used classical machine learning algorithms. Although its name contains "regression", it is used for classification. In its basic form it handles binary classification, but extensions such as one-vs-rest or multinomial (softmax) Logistic Regression can also handle multiclass classification.

It has various advantages over other algorithms such as:
  1. It has a clear probabilistic interpretation, as well as a geometric one.
  2. It is a parametric algorithm, so we only need to store the weights learned during training to make predictions on the test data.
  3. It is essentially a linear function passed through the Sigmoid Function, which squashes large values into the range (0, 1) and so limits the influence of outliers.
    1. Linear Regression: Y = f(x)
    2. Logistic Regression: Y = sigmoid(f(x))
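
To make the relationship between the two formulas concrete, here is a minimal sketch in plain Python. The weights, bias, and input points are made up for illustration and do not come from the review dataset.

```python
import math

def linear(x, weights, bias):
    """Linear model output f(x) = w . x + b (unbounded)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, chosen only for illustration.
weights, bias = [0.8, -0.4], 0.1

# The second and third points have extreme feature values.
for x in ([1.0, 2.0], [50.0, 1.0], [-50.0, 1.0]):
    z = linear(x, weights, bias)
    print(f"f(x) = {z:8.2f}  ->  sigmoid(f(x)) = {sigmoid(z):.4f}")
```

However large f(x) becomes in either direction, the sigmoid output stays between 0 and 1, which is why it can be read as a probability.
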
Logistic Regression makes several assumptions about the dataset:
  1. The features are not multicollinear. This can be checked with a perturbation test: add a small amount of noise to the features, retrain, and see whether the weights change significantly.
  2. The dependent variable should be binary (for standard Logistic Regression).
  3. The dataset should be large enough.
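
A perturbation test for the first assumption can be sketched as follows. This is only an illustration on synthetic data, not the code used in this post: the feature names, noise scales, and the choice of a large C (weak regularization) are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: x2 is almost an exact copy of x1, so the two are collinear.
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=1e-3, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
y = (x1 + x3 + rng.normal(scale=0.5, size=500) > 0).astype(int)

def fit_weights(features, labels):
    # Large C means weak regularization, so unstable weights are easier to see.
    model = LogisticRegression(C=1e4, max_iter=5000)
    return model.fit(features, labels).coef_.ravel()

# Fit once on the original data and once on slightly perturbed data.
w_before = fit_weights(X, y)
w_after = fit_weights(X + rng.normal(scale=1e-2, size=X.shape), y)

change = np.abs(w_after - w_before)
print("weights before:", w_before)
print("weights after: ", w_after)
print("absolute change:", change)
```

If the weights of some features change much more than the others after such a small perturbation, those features are likely collinear and their individual weights should not be trusted for interpretation.
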
Logistic Regression Implementation on the Text Dataset (Using Sklearn):

You can download the data from here:

First, we will clean the dataset. I have written a detailed post on text data cleaning. You can read it here:

After cleaning, we will split the dataset into three parts: train, validation, and test sets. Using the validation set, we will search for the optimal hyperparameters for the model. Once we have the optimal hyperparameters, we will evaluate the model on unseen data, i.e., the test set.
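
A three-way split can be sketched with two calls to sklearn's train_test_split. The toy reviews, the 60/20/20 proportions, and the random_state below are assumptions for illustration; the post does not state the actual split sizes.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned reviews; the real dataset is not reproduced here.
reviews = [f"review number {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First hold out 20% as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    reviews, labels, test_size=0.2, stratify=labels, random_state=42)

# ... then split the remaining 80% into train (75%) and validation (25%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Passing stratify keeps the ratio of positive to negative reviews the same in every part, which matters when the classes are imbalanced.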

Now we vectorize the dataset using CountVectorizer (Bag of Words), one of the most straightforward ways to convert text data into numerical vectors.
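
Here is a minimal sketch of Bag of Words vectorization with CountVectorizer. The four short texts are invented for the example. The point to notice is that the vocabulary is learned from the training texts only and then reused for the other sets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example texts, standing in for the cleaned reviews.
train_texts = ["great product works great", "bad product broke quickly"]
test_texts = ["great value", "broke again"]

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(train_texts)  # learn the vocabulary on train only
X_test_bow = vectorizer.transform(test_texts)        # reuse the same vocabulary

# Vocabulary words ordered by their column index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(X_train_bow.toarray())
print(X_test_bow.toarray())
```

Words that appear only in the test texts (here "value" and "again") are not in the vocabulary, so they are simply ignored.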

Now we will import all the required libraries that will be useful for the analysis.

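The regularization strength (set in sklearn's LogisticRegression through C, the inverse of the regularization strength) and the penalty are hyperparameters of Logistic Regression (there are others as well). We will try to find the optimal values of these hyperparameters.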
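
The search over C can be sketched as a simple loop that scores each candidate on the validation set. This is not the original code from this post: the toy corpus and labels are invented, and the l1 penalty with the liblinear solver is an assumption based on the exercise at the end. Because the data is a toy corpus, the scores it prints will not match the output shown below.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy corpus standing in for the cleaned reviews (1 = positive, 0 = negative).
train_texts = ["great product love it", "works great very happy",
               "terrible waste of money", "broke quickly very bad",
               "love it highly recommend", "bad quality do not buy"]
y_train = [1, 1, 0, 0, 1, 0]
val_texts = ["very happy great quality", "waste of money bad product",
             "highly recommend", "terrible do not buy"]
y_val = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

best_c, best_auc = None, -1.0
for c in [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]:
    # liblinear supports both the l1 and l2 penalties.
    clf = LogisticRegression(C=c, penalty="l1", solver="liblinear")
    clf.fit(X_train, y_train)
    # AUC needs a score for the positive class: column 1 of predict_proba.
    auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    print(c, "------>", auc)
    # Strict ">" keeps the smallest C when several values tie.
    if auc > best_auc:
        best_c, best_auc = c, auc

print("Optimal AUC score:", best_auc)
print("Optimal C:", best_c)
```

A very small C means very strong regularization. With the l1 penalty this can push every weight to zero, in which case the model predicts the same probability for every review and the AUC drops to 0.5.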

Output:

0.0001 ------> 0.5
0.001 ------> 0.7293641972138972
0.01 ------> 0.8886922437232533
0.1 ------> 0.9374969316048458
1 ------> 0.9399004712804476
10 ------> 0.9113632222156819
100 ------> 0.8794308252229597
Optimal AUC score: 0.9399004712804476
Optimal C: 1

We can see that C = 1 gives the highest AUC score, so we will use it for the final model.

Our dataset has two classes, so predict_proba() gives us the probability of each class. For a single point it returns one probability per class, with the columns in the same order as the model's classes_ attribute. With labels 0 (negative) and 1 (positive), the result looks like [1-p, p], where p is the probability that the point is positive and 1-p is the probability that it is negative. We assign the test point to whichever class has the higher probability.
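
The sketch below shows how predict_proba() output lines up with classes_ and how the positive-class column feeds into the AUC score. The small numeric arrays are invented stand-ins for vectorized reviews, so the numbers it prints are unrelated to the scores reported in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Tiny invented stand-in for vectorized reviews (1 = positive, 0 = negative).
X = np.array([[0, 2], [1, 3], [3, 0], [4, 1]])
y = np.array([1, 1, 0, 0])
X_new = np.array([[0, 3], [4, 0]])

clf = LogisticRegression(C=1).fit(X, y)

proba = clf.predict_proba(X_new)
print("column order:", clf.classes_)
print("probabilities:\n", proba)  # each row sums to 1

# Look up the positive-class column instead of hard-coding its position.
pos_col = list(clf.classes_).index(1)
p_positive = proba[:, pos_col]

# predict() returns the class with the higher probability.
print("predicted classes:", clf.predict(X_new))

# AUC is computed from the positive-class probabilities, not the hard labels.
print("AUC on the training data:",
      roc_auc_score(y, clf.predict_proba(X)[:, pos_col]))
```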

Output:

AUC score on test data: 0.8258208984684994

AUC score on the training data: 0.8909678471639081

Exercise for You:

  1. Download the same data from Kaggle:
  2. Apply Logistic Regression to that data using bag of words (BOW) only, as I have done in this post.
  3. Change the penalty from l1 to l2 and share your AUC score in the comments.
  4. If you face any difficulty with this analysis, please comment below and I will share the full working code.
