Introduction to Logistic Regression

Logistic Regression

  • A popular Binary Classification algorithm based on Supervised Learning.
  • Based on a given set of independent variable(s), the algorithm is used to predict the probability of a categorical (binary) dependent variable. The dependent variable holds the value 0 (failure) or 1 (success).
  • The logistic regression model predicts P(Y=1) as a function of X. The probability of outcome Y being a success or failure is the result.
  • Instead of fitting a best line through the data (like linear regression), we fit an “S” shaped logistic function through the data. The curve tells you the likelihood of the outcome.
  • Logistic regression works better on large sample sizes.
  • Can also be used for solving multi-class classification problems.
  • Trains a model on known input and output data so that it can predict future outputs.

Logistic Regression Equation

We use the same regression equation, Y = mX + C, but with some modifications to how the Y value is calculated. Exponentiating the linear output and normalizing it gives

P(Y) = e^(mX + C) / (1 + e^(mX + C))

We know the exponential of any value is always a positive number, and any n divided by n + 1 will always be lower than 1, so this expression is guaranteed to lie between 0 and 1. For simplicity, we can just write P(Y). Rearranging the equation gives

log(P(Y) / (1 - P(Y))) = mX + C

The RHS above depicts the linear combination of independent variables. The LHS is known as the log-odds (also called the odds ratio or logit function) and is the link function for Logistic Regression.

Interpreting the Link Function

  • The link function follows a sigmoid curve, which limits the range of predicted probabilities to between 0 and 1 (see the sketch below).
  • We can interpret the regression equation as: a unit increase in variable x multiplies the odds ratio by e to the power β.
  • In other words, the regression coefficients explain the change in log(odds) of the response for a unit change in the predictor.
  • However, since the relationship between p(X) and X is not a straight line, a unit change in an input feature doesn't affect the model output directly; it affects the odds ratio.
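
To make this concrete, here is a minimal Python sketch (the intercept and coefficient values are made-up numbers for illustration) showing that a unit increase in x multiplies the odds by e to the power β:

import numpy as np

# Hypothetical coefficients, for illustration only
beta0, beta1 = -1.0, 0.8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def odds(x):
    p = sigmoid(beta0 + beta1 * x)   # P(Y=1) for this x
    return p / (1 - p)               # odds of success

# A unit increase in x multiplies the odds by e**beta1
print(odds(2.0) / odds(1.0))  # ~2.2255
print(np.exp(beta1))          # ~2.2255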

R squared in logistic regression

The objective is to find a measure that inherits the properties of the familiar R squared from linear regression. McFadden's R squared is one popular metric.
If the predictor x values are real numbers, a mean-score R squared works fine. When the x values are themselves categorical, the logarithm-based McFadden (pseudo) R squared is generally used.
Logistic regression models are fitted using the method of maximum likelihood, i.e. the parameter estimates are those values which maximize the likelihood of the data which have been observed. McFadden's R squared measure is defined as

McFadden's R squared = 1 - ln(L_C) / ln(L_null)

where L_C denotes the (maximized) likelihood value from the current fitted model, and L_null denotes the corresponding value for the null model, the model with only an intercept and no covariates.
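
As a rough sketch of how this could be computed with sklearn (the helper name is mine; log_loss with normalize=False returns the total negative log-likelihood):

import numpy as np
from sklearn.metrics import log_loss

def mcfadden_r2(model, X, y):
    # Log-likelihood of the current fitted model
    ll_model = -log_loss(y, model.predict_proba(X)[:, 1], normalize=False)
    # Null model: intercept only, i.e. predict the base rate for everyone
    p_null = np.full(len(y), np.mean(y))
    ll_null = -log_loss(y, p_null, normalize=False)
    return 1 - ll_model / ll_null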

Algorithm Implementation

Since logistic regression is a method of supervised learning, we first feed the algorithm a training data set (historical data of dependent 'y' and independent 'x' variable values).
Run the logistic regression function (in R or Python) on the training data set. The result of the function is the fitted probability function for y.
Next, prepare a test data set (again with both x and y values). This data set is used to test the accuracy of prediction.
For all the x values in the test set, the algorithm finds the predicted y value (0 or 1) based on the fitted probability function for y.
Finally, compare the predicted y values with the test data y values to check how closely they match. The accuracy score ranges between 0 and 1; a value close to 1 means a good match.
Given the probability function, the y value can now be computed for any data set of x values.
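
The whole workflow can be sketched in Python with sklearn; the synthetic data here is just a stand-in for the historical x and y values:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the historical x / y values
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)            # fit the probability function for y

y_pred = model.predict(X_test)         # predicted y (0 or 1) for each x
print(accuracy_score(y_test, y_pred))  # between 0 and 1; close to 1 means a good match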

Applications

  • Medical preventive diagnosis: predicting mortality in injured patients or the risk of developing a disease.
  • Election prediction based on voter characteristics.
  • Preventive maintenance and prediction of failure in automotive industry.
  • Geographic image processing.
  • Measuring success rates of marketing campaigns based on customer feedback.
  • Prediction of earthquakes and weather abnormalities based on atmospheric parameters.

Thank you and Keep Learning

Text Classification using LSTM in Keras (Review Classification using LSTM)

There are various classical machine learning algorithms, such as Naive Bayes, Logistic Regression, Support Vector Machine, etc., that we can use for text classification. Most of these algorithms assume that the words in the text are independent of each other, so they are not able to handle the dependencies between different words in the dataset. Text data is sequential data, where one particular word depends on several other words. Recurrent Neural Networks can handle sequential data, but RNNs have two major drawbacks:

  1. There is a vanishing gradient problem in RNNs.
  2. There is an exploding gradient problem in RNNs.

Due to the above two drawbacks, RNNs are sometimes not able to handle long-term dependencies. To handle these problems, there is a special variant of RNN called LSTM (Long Short Term Memory). It can handle long-term as well as short-term dependencies without facing the vanishing or exploding gradient problems. By the end of this article, you should be able to classify a text dataset using LSTM.

How we can feed the data to the LSTM:

We have to feed the data to the LSTM in a particular format. First, we will count all the unique words in the dataset and, according to the number of times each word has occurred in the dataset, we will make a dictionary. We will sort this dictionary according to the number of times a word has occurred. Then we will check at what position a word occurs in the sorted dictionary and assign that position (a numerical value) as the representation of that word. Suppose our text dataset contains three sentences:

data = ["this dog is really fast", "dog barks on the strangers", "I like dogs who barks on strangers"]

First, we will store all the words and all the unique words in two different lists, as in the sketch below.
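
A minimal sketch of that step (the list names are mine):

all_words = []
for sentence in data:
    all_words.extend(sentence.split())

unique_words = list(set(all_words))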

Now we will create a dictionary that contains each word as a key and the number of times the word has occurred as the value. We will sort this dictionary according to the value, as sketched below.
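
A rough sketch of that step. Note that to reproduce the output shown below exactly, "dogs" has to be folded into "dog"; presumably the original post did this during cleaning, so the normalize() helper here is an assumption standing in for a proper stemmer:

def normalize(word):
    # Assumption: fold the plural "dogs" into "dog"
    return 'dog' if word == 'dogs' else word

word_counts = {}
for word in all_words:
    word = normalize(word)
    word_counts[word] = word_counts.get(word, 0) + 1

# Sort by count (ascending), then reverse so the most frequent word comes first
sorted_counts = sorted(word_counts.items(), key=lambda kv: kv[1])
sorted_counts.reverse()
print(sorted_counts)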

Output:

[('dog', 3),
('strangers', 2),
('on', 2),
('barks', 2),
('who', 1),
('like', 1),
('I', 1),
('the', 1),
('fast', 1),
('really', 1),
('is', 1),
('this', 1)]

After the code below runs, each sentence gets converted into its numerical representation.
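
A sketch of that conversion, reusing sorted_counts and normalize() from above:

# Rank of each word in the sorted frequency list, starting from 1
word_to_rank = {word: i + 1 for i, (word, count) in enumerate(sorted_counts)}

new_data = [[word_to_rank[normalize(w)] for w in sentence.split()] for sentence in data]
print(new_data)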

new_data contains the numerical representation of the text data, and it will look like:

[[12, 1, 11, 10, 9], [1, 4, 3, 8, 2], [7, 6, 1, 5, 4, 3, 2]]

We can see that "dog" has occurred the highest number of times, so it has been assigned 1; "strangers" has occurred the second-highest number of times, so it has got an encoding of 2, and so on.

We will break the dataset into train and test sets, as sketched below:
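
A sketch of the split, assuming a labels list with one 0/1 label per encoded review (for the real Amazon dataset these would come from the review scores):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    new_data, labels, test_size=0.2, random_state=42)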

In the dataset, not all the sentences will have the same length, so the code below will help us make the length of each sentence equal. It will pad extra encodings if a sentence is shorter than 600 words and truncate it if it is longer than 600 words.
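
A sketch using Keras' pad_sequences (Keras 2.x import path):

from keras.preprocessing.sequence import pad_sequences

max_review_length = 600
X_train = pad_sequences(X_train, maxlen=max_review_length)
X_test = pad_sequences(X_test, maxlen=max_review_length)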

Now we will classify our sentences or dataset. First, we will import all the required libraries:
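
The imports might look like this (assuming the standalone keras package; with TensorFlow 2 you would import from tensorflow.keras instead):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM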

Define the LSTM Architecture:
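
One plausible architecture; the vocabulary size, embedding dimension, and number of LSTM units below are my choices for illustration, not prescribed values:

top_words = 5000         # assumed vocabulary size; must exceed the largest word index
embedding_length = 32    # dimension of the learned word vectors

model = Sequential()
model.add(Embedding(top_words, embedding_length, input_length=max_review_length))
model.add(LSTM(100))                        # 100 LSTM units
model.add(Dense(1, activation='sigmoid'))   # binary review classification
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())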

Train the Model and Save the History:
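
For example (3 epochs and a batch size of 64 are arbitrary choices here):

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=3, batch_size=64)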

Evaluate the Trained Model:
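
A minimal evaluation sketch:

scores = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy: %.2f%%" % (scores[1] * 100))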

Exercise:

  1. Download the Amazon review dataset: https://www.kaggle.com/snap/amazon-fine-food-reviews
  2. Perform all the above steps on this dataset.
  3. After performing the above steps, let us know the accuracy score of your model in the comment section.

Amazon Review Text Classification using Logistic Regression (Python sklearn)

Overview: Logistic Regression is one of the most commonly used classical machine learning algorithms. Although its name contains "regression", it is used only for classification. Plain Logistic Regression handles only binary classification, but modified versions (e.g., one-vs-rest) can also be used for multiclass classification.

It has various advantages over other algorithms such as:
  1. It has a really nice probabilistic interpretation, as well as a geometric interpretation.
  2. It is a parametric algorithm: we only need to store the weights learned during the training process to make predictions on the test data.
  3. It is nothing but a linear regression function on which the sigmoid function has been applied to treat outliers (or large values) in a better way:
    1. Linear Regression: Y = f(x)
    2. Logistic Regression: Y = sigmoid(f(x))
There are several assumptions while applying Logistic Regression on any dataset:
  1. The features should not be multicollinear; this can be tested using a perturbation test.
  2. The dependent variable should be binary.
  3. The dataset should be large enough.
Logistic Regression Implementation on the Text Dataset (Using Sklearn):

You can download the data from here: https://www.kaggle.com/snap/amazon-fine-food-reviews

First, we will clean the dataset. I have written a detailed post on text data cleaning, which you can read here: https://kkaran0908.github.io/Text_Data_Preprocessing.html

After cleaning, we will divide the dataset into three parts, i.e., train, test, and validation sets. Using the validation set, we will try to find the optimal hyperparameters for the model. After getting the optimal hyperparameters, we will test the model on the unseen data, i.e., the test set.

Now we vectorize the dataset using CountVectorizer (Bag of Words); it is one of the most straightforward methods to convert text data into numerical vector form.
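
A minimal sketch, assuming the cleaned review texts are already split into train/validation/test parts (the variable names are mine):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(train_text)  # learn the vocabulary on train only
X_val_bow = vectorizer.transform(val_text)
X_test_bow = vectorizer.transform(test_text)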

Now we will import all the required libraries that will be useful for the analysis.
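
For instance:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score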

C and penalty are hyperparameters of Logistic Regression (there are others as well); C is the inverse of the regularization strength. We will try to find the optimal values of these hyperparameters.
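
A sketch of the validation loop (the l1 penalty matches the exercise at the end of this post; y_train and y_val are the labels for the corresponding splits):

c_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
best_auc, best_c = 0, None

for c in c_values:
    clf = LogisticRegression(C=c, penalty='l1', solver='liblinear')
    clf.fit(X_train_bow, y_train)
    auc = roc_auc_score(y_val, clf.predict_proba(X_val_bow)[:, 1])
    print(c, '------>', auc)
    if auc > best_auc:
        best_auc, best_c = auc, c

print('Optimal AUC score:', best_auc)
print('Optimal C:', best_c)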

Output:

0.0001 ------> 0.5
0.001 ------> 0.7293641972138972
0.01 ------> 0.8886922437232533
0.1 ------> 0.9374969316048458
1 ------> 0.9399004712804476
10 ------> 0.9113632222156819
100 ------> 0.8794308252229597
Optimal AUC score: 0.9399004712804476
Optimal C: 1

We can see that for C = 1 we are getting an optimal AUC score, so we will use it for the final model.

In our dataset we have two classes, so predict_proba() gives us the probability of each category. For example, predict_proba() for a point will return values like [p, 1-p], where p is the probability of the point being positive and 1-p is the probability of the point being negative. Whichever category has the higher probability is assigned to the test point.
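
A sketch of the final model under the same assumptions as above:

final_clf = LogisticRegression(C=1, penalty='l1', solver='liblinear')
final_clf.fit(X_train_bow, y_train)

test_probs = final_clf.predict_proba(X_test_bow)[:, 1]
train_probs = final_clf.predict_proba(X_train_bow)[:, 1]
print('AUC score on test data:', roc_auc_score(y_test, test_probs))
print('AUC score on the training data:', roc_auc_score(y_train, train_probs))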

Output:

AUC score on test data: 0.8258208984684994

AUC score on the training data: 0.8909678471639081

Exercise for You:

  1. Download the same data from kaggle: https://www.kaggle.com/snap/amazon-fine-food-reviews
  2. Apply logistic regression on top of that data using bag of words (BOW) only, as I have done in this post.
  3. Change the penalty from l1 to l2 and comment down your AUC score.
  4. If you are facing any difficulty in doing this analysis, please comment below and I will share the full working code.

Additional Articles:

  1. https://www.statisticssolutions.com/what-is-logistic-regression/
  2. https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
  3. http://www.robots.ox.ac.uk/~az/lectures/ml/2011/lect4.pdf