Basic Introduction to Random Forest

The Decision Tree is one of the most widely used classical machine learning algorithms, but one of its biggest drawbacks is that it is highly prone to overfitting. To reduce this problem, we use Random Forest. In this article, we will study the basics of Random Forest and the terminology related to it.

  1. Random Forest is an ensemble of decision trees. It builds on the principle of bagging and adds random feature selection to create more diverse trees.
  2. When a node is split during the construction of a tree, the chosen split is no longer the best split among all the features. Instead, it is the best split among a random subset of the features.
  3. As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree).
  4. Due to averaging, its variance decreases, usually more than compensating for the increase in bias, and hence yields a better overall result.
  5. It copes better with the curse of dimensionality, since each split considers only a small random portion of the full feature set, and it is less prone to overfitting.
  6. A common choice is to consider sqrt(P) features at each split, where P is the number of features.
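
The random feature selection in points 2 and 6 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `sample_split_features` and the choice of 16 features are hypothetical, and a real implementation would go on to search for the best split within the sampled subset.

```python
import math
import random

def sample_split_features(n_features, rng):
    """Pick the random subset of feature indices considered at one split."""
    k = max(1, int(math.sqrt(n_features)))  # the sqrt(P) rule from the list above
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed so the run is repeatable
subset = sample_split_features(16, rng)
print(len(subset), subset)  # 4 distinct indices drawn from 0..15
```

Because a fresh subset is drawn at every split, different trees end up relying on different features, which is where the extra diversity comes from.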

TRAINING PHASE

Algorithm to Train Random Forest

Let's explore the hyperparameters of Random Forest:

Random Forest has many hyperparameters, so the cross-validation phase needed to find an optimal set of them can take a very long time.

  1. n_estimators: integer, optional (default=10). The number of trees in the forest. (The default of 10 applies to older scikit-learn releases; it has been 100 since version 0.22.)
  2. criterion: string, optional (default="gini"). The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.
  3. max_depth: integer or None, optional (default=None). The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  4. max_features: int, float, string or None, optional (default="auto"). The number of features to consider when looking for the best split.
  5. bootstrap: boolean, optional (default=True). Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
  6. oob_score: bool (default=False). Whether to use out-of-bag samples to estimate the generalization accuracy.
  7. warm_start: bool, optional (default=False). When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
  8. n_jobs: int or None, optional (default=None). The number of jobs to run in parallel for both fit and predict. None means 1 and -1 means using all processors.

Feature Sampling in Random Forest:

Advantages of a Random Forest

1. A random forest is more stable than any single decision tree because the results are averaged; it is much less affected by the instability of an individual tree.
2. A random forest is less affected by the curse of dimensionality, since only a subset of the features is considered when splitting a node.
3. You can parallelize the training of the forest, since each tree is constructed independently.
4. You can calculate the OOB (out-of-bag) error using the training set, which gives a good estimate of the performance of the forest on unseen data. Hence there is often no need to hold out a separate validation set; you can use all the data to train the forest.

The OOB error:

The OOB error is calculated by using each observation of the training set as a test observation. Since each tree is built on a bootstrap sample, each observation can be used as a test observation by those trees which did not have it in their bootstrap sample.

All these trees predict this observation, and you get an error for that single observation. The final OOB error is obtained by calculating the error on each observation and aggregating the results. In practice, the OOB error is usually a close approximation of the cross-validation error.
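
The split between in-bag and out-of-bag observations for one tree can be sketched as follows. The helper names `bootstrap_indices` and `oob_indices` and the sample size of 1000 are assumptions made for illustration; no tree is actually trained here.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement: the bootstrap sample for one tree."""
    return [rng.randrange(n) for _ in range(n)]

def oob_indices(n, in_bag):
    """Observations the tree never saw; they can serve as its test set."""
    seen = set(in_bag)
    return [i for i in range(n) if i not in seen]

rng = random.Random(42)
n = 1000
in_bag = bootstrap_indices(n, rng)
oob = oob_indices(n, in_bag)
oob_fraction = len(oob) / n
print(f"out-of-bag fraction: {oob_fraction:.3f}")  # close to 1/e (about 0.368) on average
```

Each tree therefore leaves out roughly a third of the training set, so every observation is out-of-bag for a sizeable share of the trees in the forest.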

Thank you and Keep Learning 🙂

Text Classification using LSTM in Keras (Review Classification using LSTM)

There are various classical machine learning algorithms, such as Naive Bayes, Logistic Regression, and Support Vector Machine, that we can use for text classification. Most of these algorithms assume that the words in a text are independent of each other, so they cannot handle the dependencies between different words. Text is sequential data, where one word depends on several other words. Recurrent Neural Networks (RNNs) can handle sequential data, but RNNs have two major drawbacks:

  1. There is a vanishing gradient problem in RNNs.
  2. There is an exploding gradient problem in RNNs.

Because of these two drawbacks, RNNs are sometimes unable to handle long-term dependencies. To address these problems, there is a special variant of the RNN called LSTM (Long Short-Term Memory). It can handle long-term as well as short-term dependencies and is far less affected by the vanishing gradient problem. By the end of this article, you should be able to classify a text dataset using an LSTM.

How we can feed the data to LSTM:

We have to feed the data to the LSTM in a particular format. First, we count how many times each unique word occurs in the dataset and store the counts in a dictionary. We then sort this dictionary by the number of occurrences. Finally, we look up the position of each word in the sorted dictionary and use that position (a numerical value) as the representation of the word. Suppose our text dataset contains three sentences:

data = ["this dog is really fast", "dog barks on the strangers", "I like dog who barks on strangers"]

First, we will store all the unique words and all the words in two different lists.

Now we will create the dictionary that will contain word as a key and number of times a word has occurred as the value. We will sort this dictionary according to the value. 

Output:

[('dog', 3),
('strangers', 2),
('on', 2),
('barks', 2),
('who', 1),
('like', 1),
('I', 1),
('the', 1),
('fast', 1),
('really', 1),
('is', 1),
('this', 1)]

After the code below runs, each sentence is converted into its numerical representation.

new_data contains the numerical representation of the text data, and it will look like:

[[12, 1, 11, 10, 9], [1, 4, 3, 8, 2], [7, 6, 1, 5, 4, 3, 2]]

We can see that “dog” occurs the highest number of times, so it has been assigned 1; “strangers” has the second-highest count, so it gets an encoding of 2, and so on.
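
The counting, sorting, and encoding steps described above can be sketched with the standard library. This is a sketch and not necessarily the author's original code: it assumes the singular "dog" in all three sentences (which is what the counts above imply), and it breaks ties between equally frequent words by sorting in ascending order and then reversing the list.

```python
from collections import Counter

# The three example sentences, with "dog" in singular form throughout (assumed).
data = [
    "this dog is really fast",
    "dog barks on the strangers",
    "I like dog who barks on strangers",
]

# Count every word; Counter keeps words in first-seen order.
counts = Counter(word for sentence in data for word in sentence.split())

# Sort ascending by count, then reverse, so the most frequent word comes first.
ranked = sorted(counts.items(), key=lambda kv: kv[1])[::-1]

# The position in the ranked list (starting at 1) becomes the word's code.
word_to_id = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

new_data = [[word_to_id[w] for w in sentence.split()] for sentence in data]
print(new_data)
```

With a different tie-breaking rule the words that share a count would swap codes, so only the relative order of words with different counts is meaningful.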

We will break the dataset into train and test sets:

Not all sentences in the dataset have the same length, so the code below makes the length of each sentence equal. It pads the sentence with extra encoding values if it is shorter than 600 words and removes the less frequently occurring words if it is longer than 600 words.
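
A minimal pure-Python sketch of bringing encoded sentences to a fixed length is shown here. The helper name `pad_or_truncate`, the pad value 0, and the target length of 6 (instead of 600) are assumptions for illustration. Note that this sketch truncates by position, keeping the last codes of a long sentence, which is simpler than the frequency-based removal described above.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Return a copy of `sequence` with exactly `max_len` entries."""
    if len(sequence) >= max_len:
        return sequence[-max_len:]            # keep the last max_len codes
    padding = [pad_value] * (max_len - len(sequence))
    return padding + sequence                 # pad at the front

encoded = [[12, 1, 11, 10, 9], [7, 6, 1, 5, 4, 3, 2]]
padded = [pad_or_truncate(seq, max_len=6) for seq in encoded]
print(padded)
```

Reserving 0 for padding works here because the word codes start at 1.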

Now we will classify the sentences in our dataset. First, we will import all the required libraries:

Define the LSTM Architecture:

Train the Model and Save the History:

Evaluate the Trained Model:

Exercise:

  1. Download the Amazon Review Data-Set : https://www.kaggle.com/snap/amazon-fine-food-reviews
  2. Perform all the above steps on this dataset.
  3. After performing the above steps just comment in the comment section and let us know the accuracy score of your model.

Introduction to Machine Learning (Supervised, Unsupervised, Reinforcement Learning)

Depending upon the business use cases, there are different kinds of machine learning algorithms. In this post we are going to learn about three basic machine learning approaches:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

In any machine learning project, we generally follow a fixed pattern. We describe its steps below; they will help us understand the different kinds of machine learning algorithms.

Gathering Data: The quality and quantity of the data that you gather directly determine how good your predictive model can be. Some models require a continuous live data feed. It is the most important step in any machine learning project.

Data Visualization (Exploratory Data Analysis): With the help of exploratory data analysis, we try to get various insights about the data. It helps us with feature engineering/data preparation, appropriate model selection, the choice of evaluation metric, etc.

Data Preparation: In real life, data is almost always noisy and messy. We need to prepare the data before building any machine learning model on top of it. This process is called data preprocessing. At the end, we split the data into training and test sets to train and test our model.

Choice of the ML algorithm: We choose the appropriate algorithm depending on factors such as the nature of the data (labeled or unlabeled), the type of data (numerical, audio-visual, categorical), the measures of accuracy, and the cost of human intervention/correction. The choice is itself a kind of hyperparameter, but a good knowledge of the maths behind a particular algorithm helps in making it.

Continuous Learning of the model: Incrementally improve the model’s performance by adjusting output parameters or rewards in each iteration. Evaluate model accuracy.

Prediction/Result Analysis: Predict the expected results by running the model. Present the output in meaningful human-readable forms (tables, graphs, images, etc.).

Why is it important?

  • It is often said that data scientists spend around 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it or building models on top of it.
  • Administratively, incorrect/inconsistent data can lead to false conclusions and misdirected investments.
  • In real-world businesses, incorrect data can be costly. Many companies use customer databases that record data like contact information, addresses, and preferences.

Types of Machine Learning:

Supervised Learning:

Supervised Learning is a process of inferring a function from labeled training data. A supervised machine learning algorithm analyses the training data and produces an inferred function, which can be used for mapping new examples.

Supervised learning problems can be further grouped into Two Parts:

Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight” or “price of some home”.

Classification: A classification problem is when the output variable is categorical, such as “red” or “blue”, or “disease” or “no disease”.

Examples of Supervised Machine Learning Algorithms: Naive Bayes, KNN, SVM, Logistic Regression, Decision Tree, Linear Regression, Random Forest, etc.

NOTE: A vast majority of practical machine learning uses supervised learning.  

Steps in Supervised Learning:
  • We have training data in which each data point has a corresponding continuous value or label.
  • We train and validate the model by showing it the examples available in the training data.
  • Once we have validated the model’s performance, we test it on unseen data, i.e. the test dataset.

If the model’s predictions closely match the actual outputs that you hid from it, you are ready to deploy the model. This check is also called the validation stage.
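
The steps above rely on keeping part of the labeled data hidden from the model. A minimal sketch of such a split is shown here, assuming a hypothetical helper `train_val_test_split` and illustrative fractions of 20% each for validation and test.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle labeled rows and cut them into train, validation and test parts."""
    shuffled = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Toy labeled data: (feature, label) pairs.
rows = [(x, x % 2) for x in range(10)]
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

The model is fitted on `train`, tuned against `val`, and only evaluated once on `test`, so the final score reflects data it has never seen.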

Real-World Application of Supervised Learning:
  1. Predict flight ticket prices around Diwali to maximize the profit of the airline company.
  2. Given an image, identify whether it has been modified by software such as Adobe Photoshop.
  3. Identify whether a website has obscene images or not.
  4. Predict unusual behavior in an internet banking transaction.

[Figure: Supervised Algorithms Working Cycle]

Unsupervised Learning:

Unsupervised Learning is an ML technique for finding patterns in data in an exploratory manner. The data is not labeled, which means only the input variables (X) are given, with no corresponding output variables. Algorithms are left to themselves to discover interesting patterns in the given dataset. Since the data is unlabeled, there is no easy way to evaluate the accuracy of the algorithm; this is one feature that distinguishes unsupervised learning from supervised learning and reinforcement learning.

Unsupervised Learning problems can be further grouped into two parts:

Clustering: Grouping of similar data into groups or clusters.

Example: K-Means, K-Means++, K-Medoid, etc.

Dimensionality Reduction: Compression of the data to reduce its complexity without altering its structure.

Example: Principal Component Analysis, Singular Value Decomposition, etc

Steps in Unsupervised Learning:
  1. We have training data in which the data points do not have corresponding labels.
  2. We train the model on the training data. During training, we explore the data without any idea of which variables are the output targets.
  3. We simplify and group the data so that it can be categorized into distinct sets.

If the model helps to identify useful real-world patterns, your model is successful. Measuring the accuracy of prediction is domain-specific and highly subjective.

Real-World Application of Unsupervised Learning:

  1. Recommendation systems in e-commerce sites such as Flipkart or Amazon often rely on unsupervised learning techniques.
  2. Grouping the customers of a supermarket based on their purchasing behavior.

[Figure: Workflow of Unsupervised Learning Algorithms]

Reinforcement Learning:

The reinforcement learning algorithm (called the agent) continuously learns from the environment in an iterative fashion. It aims to use observations gathered from its interaction with the environment to take actions that maximize the reward or minimize the risk. In the process, the agent learns from its experiences of the environment until it explores the full range of possible states. The decision-making function is used to make the agent perform an action. After the action is performed, the agent receives a reward or reinforcement from the environment. The state-action pair, together with information about the reward, is stored.

Steps in Reinforcement Learning:
  1. We start from an initial state, and each state can lead to more than one next state. The input state is fed into the model and observed by the agent.
  2. Based on the input state, the decision-making function makes the agent perform an action. After the action is performed, the agent receives a reward or reinforcement from the environment/user based on the outcome. The state-action pair, together with information about the reward, is stored. This process continues in iterations, and the model keeps learning from live data. At every step, it maps states to actions. The agent’s choice of the right step at each iteration is modeled as a Markov Decision Process.

EXAMPLE — “I don’t know how to act in this environment. Can you find a good policy/behavior and meanwhile I’ll give you feedback.”

Real-World Applications:

  1. Self-driving cars can make use of reinforcement learning.
  2. Game-playing systems such as AlphaGo and chess engines are really nice examples of reinforcement learning.
  3. Robots are another good example of reinforcement learning.

All You Need to Know about Activation Functions (Sigmoid, Tanh Relu, Leaky Relu, Softmax)

Activation functions are a crucial part of any neural network. A deep learning model without activation functions is nothing but a simple linear model. Activation functions map the input to the output in a particular fashion and help the network learn the intricate structure in the dataset. Before non-linear activation functions came into wide use, neural networks could not be used efficiently. In this article, we will learn about the following activation functions and their advantages and drawbacks.

  1. Linear Activation Function
  2. Binary Step Activation Function
  3. Sigmoid
  4. Tanh
  5. Relu
  6. Leaky Relu
  7. Softmax

[Figure: Activation Functions in Neural Networks. Image Source: Wikipedia]

Linear Activation Function:

In the linear activation function, the output is the same as the input that we provide to the function. We can understand this using the formula below.

F(x) = x (No Change in the Output)

The problem with the linear activation function is that, no matter how many layers we use in our neural network, the output of the system will still be linear, which means the network will not be able to learn the non-linear structure in the dataset.
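
The claim that stacked linear layers stay linear can be checked with a tiny numeric sketch. The single-neuron layers and the weights 2, 1, 3 and -4 below are arbitrary values chosen for illustration.

```python
def linear_layer(x, weight, bias):
    """One neuron with the linear activation F(x) = x."""
    return weight * x + bias

def two_linear_layers(x):
    hidden = linear_layer(x, weight=2.0, bias=1.0)
    return linear_layer(hidden, weight=3.0, bias=-4.0)

def single_equivalent_layer(x):
    # 3 * (2x + 1) - 4 simplifies to 6x - 1: still a straight line.
    return linear_layer(x, weight=6.0, bias=-1.0)

for x in [-2.0, 0.0, 0.5, 10.0]:
    print(x, two_linear_layers(x), single_equivalent_layer(x))
```

Both functions print the same value for every input, so the second layer adds no expressive power.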

Binary Step Activation Function:

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \geq 0 \end{cases}$$

[Figure: Visualization of Binary Step Function]

In this activation function, there are only two states: the output is either 0 or 1. When the input is greater than or equal to zero, the output is 1; otherwise, the output is 0.

The threshold of the binary step function can be changed; in the case above, we have taken it as 0.

One of the significant problems with the binary step activation function is that it is non-differentiable at x = 0 and its gradient is zero everywhere else. So it cannot be used with classical backpropagation to update the weights of a neural network.
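
A minimal sketch of the binary step function with an adjustable threshold follows; the function name and the `threshold` parameter are chosen here for illustration.

```python
def binary_step(x, threshold=0.0):
    """Return 1 when x reaches the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

print([binary_step(x) for x in [-2.0, -0.1, 0.0, 0.1, 3.0]])
print(binary_step(0.3, threshold=0.5))
```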

Sigmoid: 

[Figure: Graph of Sigmoid]
It is one of the commonly used activation functions in neural networks, also called the logistic activation function. It has an S shape and squeezes all values into the range (0, 1). The sigmoid activation function is differentiable, so we can optimize our model using simple backpropagation. One of its most significant drawbacks is that it can cause the vanishing gradient problem in the network, because its gradient is never larger than 0.25 and shrinks towards zero for large positive or negative inputs. We will cover the vanishing gradient problem in detail in another post.
$$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$$
Sigmoid Activation Function

$$f'(x) = f(x)(1 - f(x))$$
Derivative of Sigmoid
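
The two formulas above translate directly into Python. This is a plain sketch that does not guard against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function: squeezes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x)), as in the formula above."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, 0.0, 10.0]:
    print(x, sigmoid(x), sigmoid_derivative(x))
```

The derivative peaks at x = 0 and is tiny at x = -10 and x = 10, which is the behavior behind the vanishing gradient problem.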



Tanh/Hyperbolic Tangent Activation Function:

[Figure: Graph of Tanh]
It is similar to Sigmoid; we can even say that it is a scaled version of Sigmoid. Like Sigmoid, it is differentiable at all points. Its range is (-1, 1), which means that it maps any given value into the interval (-1, 1). As it is a non-linear activation function, it can learn some of the complex structures in the dataset. One of its major drawbacks is that, like Sigmoid, it suffers from the vanishing gradient problem because its gradients become very small for large inputs. In most cases, we prefer Tanh over Sigmoid.

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Tanh Activation Function


$$f'(x) = 1 - f(x)^{2}$$
Derivative of Tanh
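
The sketch below writes Tanh out from its definition and checks the "scaled Sigmoid" remark numerically, using the identity tanh(x) = 2 * sigmoid(2x) - 1. The test point 0.7 is arbitrary.

```python
import math

def tanh(x):
    """Hyperbolic tangent written out from its definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_derivative(x):
    """f'(x) = 1 - f(x)^2, as in the formula above."""
    return 1.0 - tanh(x) ** 2

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
print(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)  # the two values agree up to rounding
print(tanh_derivative(0.0), tanh_derivative(5.0))
```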

Relu:

It was one of the most used activation functions as of 2020 and is one of the state-of-the-art activation functions in deep learning. From the function, we can see that when we provide a negative value to Relu, it changes it to zero; otherwise, it does not change the value. As it does not activate all the neurons at once, the output of some neurons is zero, which makes the network sparse and computationally efficient.
[Figure: Graph of Relu]
$$f(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ x & \text{for } x > 0 \end{cases}$$
Relu Activation Function

$$f'(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ 1 & \text{for } x > 0 \end{cases}$$
Derivative of Relu
But there are some problems with Relu as well. One of the major drawbacks is that it is not differentiable at x = 0, and it does not have any upper bound. For neurons that receive a negative value as input, the gradient is zero, so the weights of those neurons do not get updated. This may create some dead neurons, which can be mitigated by reducing the learning rate and the bias. As the mean activation in the network is not zero, there is always a positive bias in the system.
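
A short sketch of Relu and the derivative convention given above (0 at and below zero, 1 above it):

```python
def relu(x):
    """Pass positive values through unchanged and clamp the rest to zero."""
    return x if x > 0 else 0.0

def relu_derivative(x):
    """Gradient used in practice: 0 for x <= 0 and 1 for x > 0."""
    return 1.0 if x > 0 else 0.0

inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
print([relu(x) for x in inputs])
print([relu_derivative(x) for x in inputs])
```

Every negative input produces a zero gradient, which is why a neuron that only ever sees negative inputs stops learning.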

Leaky Relu:

Relu has the drawbacks that it is non-differentiable at x = 0 and that, if many of its inputs are negative, many gradients will be zero and the error will not be able to propagate, which makes the affected neurons dead. To avoid the dead-neuron problem, we use Leaky Relu.
[Figure: Leaky Relu Graph]
In Leaky Relu, instead of setting a negative value directly to zero, we multiply it by a small number, generally 0.01. In this way, we avoid the problem of dead Relu, because even with many negative inputs some error still gets propagated.
$$f(x) = \begin{cases} 0.01x & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$$
Leaky Relu
$$f'(x) = \begin{cases} 0.01 & \text{for } x < 0 \\ 1 & \text{for } x \geq 0 \end{cases}$$
Derivative Leaky Relu
But Leaky Relu does not necessarily outperform Relu; its results are not consistent. Leaky Relu should only be considered as an alternative to Relu.
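
Leaky Relu and its derivative can be sketched as follows; the `slope` parameter name is chosen here, with 0.01 as the default mentioned above.

```python
def leaky_relu(x, slope=0.01):
    """Keep non-negative values and scale negative ones by a small slope."""
    return x if x >= 0 else slope * x

def leaky_relu_derivative(x, slope=0.01):
    """Gradient is 1 for x >= 0 and the small slope for x < 0."""
    return 1.0 if x >= 0 else slope

inputs = [-2.0, 0.0, 3.0]
print([leaky_relu(x) for x in inputs])
print([leaky_relu_derivative(x) for x in inputs])
```

Because the gradient for negative inputs is small but not zero, the weights behind such a neuron can still be updated.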


Relu-6:


In normal Relu and Leaky Relu, there is no upper bound on the positive values produced by the function. In Relu-6, there is an upper limit: once the value goes beyond six, we clip it to 6. The value six was chosen after a lot of experiments. The upper bound encourages the model to learn sparse features early.
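
Relu-6 is Relu with the output clipped at 6, which can be written in one line:

```python
def relu6(x):
    """Relu with the output clipped to the range [0, 6]."""
    return min(max(0.0, x), 6.0)

print([relu6(x) for x in [-1.0, 3.0, 6.0, 8.5]])
```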

[Figure: Some Basic Activation Functions. Image Source: Google]


More About Relu: https://medium.com/@chinesh4/why-relu-tips-for-using-relu-comparison-between-relu-leaky-relu-and-relu-6-969359e48310 


Softmax:


In some real-life applications, instead of directly getting a binary prediction, we want to know the probability of each predicted category. Softmax is not a classical activation function; it is generally used in the last layer of the network to provide the probabilities of the classes.

$$f_{i}(\vec{x}) = \frac{e^{x_{i}}}{\sum_{j=1}^{J} e^{x_{j}}}$$
Function for Softmax Activation
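
A small sketch of the Softmax formula follows. Subtracting the largest score before exponentiating is an addition made here for numerical stability; it is not part of the formula above, but it leaves the result unchanged. The example scores are arbitrary.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score gets the largest probability, and the outputs can be read as the chances of each class.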

There are many other activation functions as well. There is no formula that can tell you which activation function will work best on your dataset. It is like a hyperparameter that you need to choose through hyperparameter tuning. Here we have tried to explain the most widely used activation functions. If there is any mistake in the explanation of any activation function, please let us know in the comment section, and we will try to improve it.

Scratch Implementation of Stochastic Gradient Descent using Python

Stochastic Gradient Descent, also called SGD, is one of the most used classical machine learning optimization algorithms. It is a variation of Gradient Descent. In Gradient Descent, we iterate through the entire dataset to update the weights. Because each iteration uses the whole dataset, Gradient Descent becomes too expensive in terms of computation time when the dataset is very large.

To reduce the time, we make a slight change to Gradient Descent, and the new algorithm is called Stochastic Gradient Descent. In SGD, at each iteration, we pick a single data point at random from the dataset and update the weights based on the error of that data point only. The following are the steps that we use in SGD:

  1. Randomly initialize the coefficients/weights for the first iteration. These could be some small random values.
  2. Set the number of epochs and the learning rate for the algorithm. These are hyperparameters, so they can be tuned using cross-validation.
  3. Make the predictions using the coefficients calculated up to this point.
  4. Calculate the error at this point.
  5. Update the weights according to the formula given in Image 1.
  6. Go back to step 3 unless the number of epochs is over or the algorithm has converged.
[Figure: Image 1, Weight Update in SGD]
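
The steps above can be sketched for a linear model trained with squared error. This is an illustrative sketch and not the article's original code: the function names `predict`, `sgd_update` and `fit_sgd`, the toy data, and the update rule (weight minus learning rate times error times input) are assumptions, each "epoch" here is a single-point update, and the range r mentioned later in the article is not modeled.

```python
import random

def predict(row, coefficients):
    """Linear prediction: bias plus the weighted sum of the features."""
    return coefficients[0] + sum(w * x for w, x in zip(coefficients[1:], row))

def sgd_update(row, target, coefficients, learning_rate):
    """One SGD step on squared error for a single data point."""
    error = predict(row, coefficients) - target
    updated = [coefficients[0] - learning_rate * error]
    updated += [w - learning_rate * error * x for w, x in zip(coefficients[1:], row)]
    return updated

def fit_sgd(X, y, learning_rate=0.05, n_epochs=2000, seed=0):
    """Start from small random weights and apply one random-point update per epoch."""
    rng = random.Random(seed)
    coefficients = [rng.uniform(-0.01, 0.01) for _ in range(len(X[0]) + 1)]
    for _ in range(n_epochs):
        i = rng.randrange(len(X))            # pick one data point at random
        coefficients = sgd_update(X[i], y[i], coefficients, learning_rate)
    return coefficients

# Toy data generated from y = 1 + 2x, so the fit should approach [1, 2].
X = [[float(x)] for x in range(5)]
y = [1.0 + 2.0 * row[0] for row in X]
coefficients = fit_sgd(X, y)
print(coefficients)
```

A fixed epoch count is used as the only stopping rule here; a convergence check on the change in the coefficients could replace it.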

Below is the python implementation of SGD from Scratch:

Given a data point and the old coefficients, this block of code will update the weights.

Given some unknown data points along with the calculated coefficient, this part of the code will make predictions.

This part of the code takes various parameters, such as the training data, learning rate, number of epochs, and range r, and returns the optimal values of the coefficients. The learning rate, range r, and number of epochs are hyperparameters and are tuned using cross-validation.

Finally, after calculating the optimal set of coefficients, we will make the predictions on the test dataset.

You can execute the code by copy-pasting it into an IPython notebook. You need to provide X_train, X_test, the learning rate, r, and the number of epochs. If you are not able to run the code, do let me know in the comment section. I will reply as soon as possible. You can find the full working code on GitHub: https://github.com/kkaran0908/Stochastic-Gradient-Descent-From-Scratch

Simple Exercise : 

  1. Download the dataset from Kaggle: https://www.kaggle.com/c/boston-housing
  2. Perform all the above steps on this dataset.
  3. After performing the above steps just comment in the comment section and let us know the Root Mean Squared Error of your model. 

Feedback on Your Preparation for Data Science or Machine Learning Jobs (Mock Interview)

In the first place, it is challenging to get an interview call for a Machine Learning profile. But if you get a call, it is essential to convert that call into an offer. Sometimes we feel that our preparation is good enough to crack a machine learning interview, but actually, that is not the case. So in this process of interview feedback, we will conduct a telephonic/hangout/skype interview and provide you with feedback on your preparation for machine learning jobs. We are a group of IITians working as machine learning engineers in various top-notch product-based companies, and we have worked extensively on real-life machine learning use cases. This process is free; fill out the form given below, and we will get back to you as soon as possible.

Everything You Need to Know about Machine Learning Syllabus to Become a Data Scientist?

Data science or machine learning is a field in which many people want to build a career. But many people do not know what to study to become a great machine learning engineer. There are tons of machine learning algorithms you can learn, but in today’s world, you need not learn all of them. In this article, we are going to discuss the machine learning algorithms that we need to know to become a good machine learning engineer. Here we will discuss every algorithm very briefly.



Before diving deeper into the topic, we will first learn some basic terminology that will help us understand the syllabus better:

Classification: 

It is a technique where we are given a fixed number of classes. Given a data point, we have to predict which category it belongs to. For example, suppose we train a model to predict whether a given image shows a cat or a dog. This is called classification.

Regression: 

It is a technique where, given a data point, we have to predict some real value corresponding to that data point, e.g., given the location and area of a house, predict the price of the house (a continuous variable).

Supervised Algorithms:

In this set of algorithms, each data point has a corresponding label, e.g., for each image we know whether it shows a cat or a dog.

Unsupervised Algorithms:

In this set of algorithms, the data points do not have corresponding labels, e.g., given an image, we do not have a label telling us whether it shows a cat or a dog.

Semi-Supervised Algorithms:

These are a special set of algorithms where a small amount of the data is labeled and the rest is unlabeled. Now that we have understood some terminology, we will explore the machine learning algorithms according to their nature.

Classical Machine Learning Algorithms:

In this section of the post, we will talk about conventional machine learning algorithms. For this class of algorithms, we first need to extract features from the raw data and then feed them to the algorithms. These are older algorithms that have been around since the 80s-90s.

Naive Bayes: 

It is one of the most straightforward classical machine learning algorithms; it works on the principle of Bayes’ theorem. It is mainly used for classification.

K-Nearest Neighbors: 

K-NN is an easy-to-implement machine learning algorithm. It can be used for both Classification and Regression.

Logistic Regression: 

It is among the most used classical machine learning algorithms and is closely related to linear Regression. Although its name contains Regression, it is used for classification. It has a beautiful probabilistic interpretation.

Linear Regression: 

It is a classical machine learning algorithm that is only used for Regression.

Decision Tree: 

Decision Tree is the classical machine learning algorithm that is based on the core principles of simple if-else statements. It is highly interpretable.

Random Forest: 

Random Forest is nothing but a combination of various decision trees. These are less interpretable as compared to simple decision trees as we are taking the decision based on the prediction of a bunch of decision trees. It can be used for both Classification as well as Regression.

Support Vector Machine:

Currently, it is among the most used classical machine learning algorithms. SVM can be used for classification as well as Regression. What makes it different from other algorithms is the kernel trick (you will learn about it when you study the math behind the support vector machine).

Boosting/XGboost:

Among classical machine learning algorithms, it is state of the art, and it is highly useful in most competitions. One of its drawbacks is that it has lots of hyperparameters. XGBoost can also be trained on GPUs, unlike many other classical machine learning algorithms. It can be used for both Regression and Classification.

Unsupervised Algorithms

These algorithms are mainly used to extract structure from data when we don't have a label for any data point. You should know the following algorithms for this part of machine learning, which is also called data mining:

  1. K-Means++
  2. Hierarchical Clustering
  3. K-Medoids
  4. DBSCAN clustering
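
As a rough sketch of how three of these look in sklearn (K-Medoids is not part of core sklearn, so it is left out here), the snippet below clusters a toy dataset. The blob centers, eps and min_samples values are assumptions chosen for illustration only.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Toy data (assumption): three well-separated blobs, labels are thrown away.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# K-Means with k-means++ initialization; the number of clusters must be given.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# Hierarchical (agglomerative) clustering, cut at three clusters.
hierarchical = AgglomerativeClustering(n_clusters=3).fit(X)

# DBSCAN finds the number of clusters itself; label -1 marks noise points.
dbscan = DBSCAN(eps=1.0, min_samples=5).fit(X)

kmeans_clusters = len(set(kmeans.labels_))
hierarchical_clusters = len(set(hierarchical.labels_))
dbscan_clusters = len(set(dbscan.labels_) - {-1})
print(kmeans_clusters, hierarchical_clusters, dbscan_clusters)
```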

Time Series Algorithms: 

This is the set of algorithms used for prediction on data that varies with time, such as stock prices. You can learn the following algorithms for this part of machine learning, but these are older approaches that are less commonly used in production today.

  1. Auto-Regressive algorithm
  2. Moving Average Algorithm
  3. Auto-Regressive Moving Average Algorithm
  4. Auto-Regressive Integrated Moving Average Algorithm
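
To make the simplest of these concrete, here is a minimal sketch of an auto-regressive model of order 1, written with plain NumPy. The simulated series, the coefficient 0.8 and the least-squares fit are assumptions for illustration; real projects would normally use a dedicated time series library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series (assumption): x[t] = phi * x[t-1] + noise.
true_phi = 0.8
n = 2000
x = np.zeros(n)
for t in range(1, n):
    x[t] = true_phi * x[t - 1] + rng.normal(scale=1.0)

# Estimate phi by least squares on (previous value, current value) pairs.
prev, curr = x[:-1], x[1:]
phi_hat = float(prev @ curr / (prev @ prev))

# One-step-ahead forecast from the last observed value.
forecast = phi_hat * x[-1]
print(f"estimated phi: {phi_hat:.3f}, next-step forecast: {forecast:.3f}")
```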

Optimization Techniques:

In every machine learning algorithm, there is a loss function that we need to optimize. The optimal point is the set of parameters at which the loss on the training data is as small as possible, with the goal of also having as little error as possible on the test dataset. Below are the algorithms that are used for optimization:

  1. Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini Batch Stochastic Gradient Descent
  4. Adagrad (mainly used for neural networks)
  5. Adadelta
  6. RMSProp
  7. Adam
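
The first three items differ only in how many data points are used for each update. Here is a minimal NumPy sketch on a toy linear regression problem; the data, learning rate, batch size and epoch count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): y = 3x + 2 plus a little noise.
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)

def gradient(w, b, Xb, yb):
    """Gradient of the mean squared error for the model y = w*x + b."""
    err = w * Xb[:, 0] + b - yb
    return 2 * np.mean(err * Xb[:, 0]), 2 * np.mean(err)

def fit(batch_size, lr=0.1, epochs=100):
    """Gradient descent where each update uses `batch_size` points."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            gw, gb = gradient(w, b, X[idx], y[idx])
            w -= lr * gw
            b -= lr * gb
    return w, b

w_gd, b_gd = fit(batch_size=len(X))  # full-batch gradient descent
w_mb, b_mb = fit(batch_size=32)      # mini-batch stochastic gradient descent
print(f"full batch: w={w_gd:.2f}, b={b_gd:.2f}")
print(f"mini batch: w={w_mb:.2f}, b={b_mb:.2f}")
```

Setting batch_size to 1 would give plain stochastic gradient descent; Adagrad, Adadelta, RMSProp and Adam keep the same loop but adapt the step size per parameter.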

Dimensionality Reduction Algorithms:

In real life, most of the time, we have datasets with very high dimensionality. This has various drawbacks, such as the curse of dimensionality, high training and testing time, and heavy memory requirements to fit the data in memory. These algorithms help us reduce the dimension of each data point in the dataset without losing much information. Below are some of the algorithms in this category:

  1. Principal Component Analysis
  2. t-SNE (can also be used for nice data visualization)
  3. Truncated SVD
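
As a small sketch of what "without losing much information" means, the snippet below applies PCA from sklearn to toy data. The data is an assumption: 10 observed features that are really generated from only 2 hidden factors plus a little noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data (assumption): 10-dimensional points driven by 2 hidden factors.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# Reduce every data point from 10 dimensions to 2.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the original variance kept by the 2 components.
explained = float(pca.explained_variance_ratio_.sum())
print(X.shape, "->", X_reduced.shape, f"variance kept: {explained:.4f}")
```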

Deep Learning Approaches:

Nowadays, most large organizations have access to huge amounts of historical data, and at the same time they have the computational power to process that data. For these two reasons, in many real-life scenarios, deep learning approaches work much better than the classical machine learning algorithms discussed in section 1. Although deep learning is a hot area of research, learning the topics below will help you in most tasks:

Convolutional Neural Network:

These are state of the art for various computer vision tasks such as image classification. Under this category, you can study the algorithms and pre-trained architectures mentioned below:

  1. VGG 16, VGG 19, ResNet 152, etc 
  2. RCNN, FRCNN, YOLO, etc

Recurrent Neural Network:

In real life, we see a huge amount of sequential data, where the current point in the data depends on some previous points. Take any English sentence as an example: almost always, the current word depends on the previous words. RNNs work well on such sequential data.
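
To show how an RNN carries information from previous points to the current one, here is a minimal NumPy sketch of the forward pass of a plain RNN cell. The sizes, the random weights and the tanh activation are assumptions for illustration; no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5

# Randomly initialized parameters (assumption: small normal values).
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

# A toy sequence of 5 input vectors.
inputs = rng.normal(size=(seq_len, input_size))

h = np.zeros(hidden_size)  # hidden state before the first time step
hidden_states = []
for x_t in inputs:
    # The new state depends on the current input AND the previous state.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

hidden_states = np.array(hidden_states)
print(hidden_states.shape)
```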

Long Short Term Memory: 

LSTM is a type of RNN designed to remember information over long sequences. A plain RNN struggles when the current point depends on a point far back in the sequence (the vanishing gradient problem); LSTM handles this with a memory cell controlled by gates.

Gated Recurrent Unit:

It is similar to LSTM, with slight differences. If you know how a GRU works, it will help you a lot in understanding various other algorithms as well.

Encoder-Decoder: 

In some real-world applications, the length of the input and output is not fixed. Take language translation as an example. Suppose we want to convert an English sentence to its corresponding Hindi sentence. For different English sentences, the Hindi translation will have a different length. To handle such variable-length inputs and outputs, encoder-decoder models are used. There are tons of other applications of encoder-decoders as well. Below are some other concepts that you need to know to consider yourself a deep learning expert.

  1. Dropout
  2. Batch Normalization
  3. Weight Initialization Techniques (Usage and drawbacks)
  4. Activation Functions (the drawbacks of particular activation functions and why to use a particular one)
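
As an example of the first concept, here is a minimal NumPy sketch of dropout in its common "inverted" form. The function name, the drop probability and the toy activations are assumptions for illustration.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero out units at random during training and
    rescale the survivors so the expected activation stays the same."""
    if not training or drop_prob == 0.0:
        return activations  # at test time the layer does nothing
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))            # toy activations, all equal to 1
dropped = dropout(a, drop_prob=0.5, rng=rng)

print("values after dropout:", np.unique(dropped))
print("mean after dropout:", dropped.mean())
```

Roughly half of the units become 0 and the rest are scaled up to 2, so the mean stays close to 1.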

There is no fixed syllabus for deep learning. It’s a massive area of research. Every day new topics are getting included in deep learning. So try to update yourself with all the latest advancements that are taking place every day. The best way to learn about all these latest tools and techniques is by reading the latest research papers in that particular field. 

In this article, we have discussed the machine learning and deep learning syllabus. If you are comfortable with all the techniques described above, along with the maths behind them, you can consider yourself a good data scientist.

If you think you are comfortable with all these things, you can fill in this Google form. We will then interview you and, based on your performance in the discussion, give you feedback on your machine learning or deep learning skills.

If I have assigned any algorithm to the wrong group, please do let me know in the comment section.


Which Laptop to Buy for Machine Learning and Deep Learning


Good hardware is essential for a hassle-free data science journey. I bought my laptop back in 2014, during the first year of my B.Tech. It had 4GB RAM. By early 2018, whenever I trained a machine learning model, it took a very long time, even with a minimal dataset, and most of the time I faced memory errors because of the small RAM. Machine learning and deep learning require high-performance computing resources, but due to financial constraints, a student can't buy a costly system. So in this post, we are going to talk about the configuration a system must have to train machine learning and deep learning models in a reasonable time on a fair amount of data.
Here I will not focus on any specific laptop or brand; instead, I will describe which configuration is sufficient for machine learning and deep learning tasks.


RAM Size: 

At least 8GB RAM, expandable to at least 16GB, is a must-have. With more RAM, you can train and test your model on a larger number of data points. Some classical machine learning algorithms, such as K-Nearest Neighbors, need more RAM because the training data must be kept in memory to make predictions. Having the option to expand the RAM is beneficial when you do not want to change your system; you can simply add more RAM later.
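
To get a rough idea of how much RAM a dataset needs, you can do a back-of-the-envelope calculation like the sketch below. The dataset size (one million rows, 100 features) is an assumption, and the estimate only holds for dense numeric arrays; sparse text features usually take far less.

```python
import numpy as np

# Hypothetical dataset size (assumption for illustration).
n_rows, n_features = 1_000_000, 100

def dense_size_gb(dtype):
    """Memory needed to hold the whole dataset as a dense array of `dtype`."""
    bytes_per_value = np.dtype(dtype).itemsize
    return n_rows * n_features * bytes_per_value / 1024**3

size_float64_gb = dense_size_gb(np.float64)
size_float32_gb = dense_size_gb(np.float32)
print(f"float64: {size_float64_gb:.2f} GB, float32: {size_float32_gb:.2f} GB")
```

Keep in mind that training usually needs extra memory on top of this for copies and intermediate results.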

Hard Disk: 

Sometimes your dataset may be as large as 200 GB (in college, datasets are not that big, but when you work in an organization, they can be huge), so you should have at least a 500GB hard disk. As hard disks are cheap, you should go for 1 TB. It is even better if you go for an SSD, as it will drastically reduce the boot time and make your system much faster. With a larger disk, you can store games, datasets, and movies on your system without worrying about storage.

Processor:

At least an Intel i5 8th Generation; go higher if possible, but this is sufficient as of now. During your machine learning journey, you will often need multiple CPU cores, because many machine learning models can be trained using all the cores by changing a few keywords/lines in the code, which drastically reduces the training time. I suggest you go with Intel. The faster the processor, the lower the training and testing time of your machine learning or deep learning model.
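
As an example of using all the cores by changing one keyword, many sklearn estimators accept an n_jobs argument. The sketch below is only an illustration; the toy dataset and the choice of RandomForestClassifier are my assumptions.

```python
import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

print("CPU cores available:", os.cpu_count())

# Toy data (assumption): 2000 points with 20 features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 asks sklearn to use all available cores for building the trees.
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)

train_acc = model.score(X, y)
print(f"training accuracy: {train_acc:.3f}")
```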

Graphics Card

If you are interested in gaming/deep learning, a graphics card is a must-have part of your system. As deep learning models take a very long time to train, GPUs help reduce the training time. GPUs mainly help with models trained using backpropagation, such as neural networks; most classical machine learning libraries, such as scikit-learn, run on the CPU. You should go with at least an NVIDIA GeForce GTX 1050 with 4GB of memory. You can go with a more advanced graphics card as well, but the cost of the system will increase drastically. Only go with an NVIDIA graphics card, since deep learning frameworks have the best support for NVIDIA's CUDA.

Screen Size: 

Most laptops with these configurations have a screen size of at least 15.6 inches. I think that is sufficient for a good user experience.

Battery Life: 

Most laptops with these configurations do not have a very long battery life, as the powerful hardware and its cooling system drain the battery very fast. I suggest going with the one that has the longest battery life; it will be helpful.

Almost all companies try to provide good service to their customers. You can go with any brand that offers the above configuration (according to your budget). Dell and HP, with the same configuration, will be slightly more expensive than ASUS, Acer, MSI, etc. If you want my opinion, I suggest you go with MSI, ASUS, or Acer. It will cost you around 65k.

My Recommendation: 

Here I will recommend some laptops; most of them are being used by some of my friends for various machine learning and deep learning work.

Price Range (50-70k): 

  1. ASUS TUF Gaming FX505DT
  2. MSI Gaming GL63 9RCX-222IN (My current laptop)
  3. (Renewed) MSI Gaming GL63 9RCX
  4. Acer Nitro 7 Intel Core i5-9300H
  5. Acer Predator Helios 300 PH315-51
  6. Acer Nitro Core i5 8th Gen 15.6-inch Laptop
  7. Lenovo Legion Y540 9th Gen Intel Core i5 15.6 inch FHD Gaming Laptop

You can do your own market research before buying any of these laptops. But one thing I highly recommend if you are a student: don't go beyond the 70k price range. In the comment section, do let me know which laptop you are currently using.

Amazon Review Text Classification using Logistic Regression (Python sklearn)

Overview: Logistic Regression is one of the most commonly used classical machine learning algorithms. Although its name contains regression, it is used only for classification. Plain Logistic Regression handles binary classification, but modified versions (for example one-vs-rest or multinomial Logistic Regression) can also be used for multiclass classification.

It has various advantages over other algorithms, such as:
  1. It has a really nice probabilistic interpretation, as well as a geometric interpretation.
  2. It is a parametric algorithm: we only need to store the weights learned during training to make predictions on the test data.
  3. It is nothing but a linear regression function to which the sigmoid function has been applied, which squashes large values (outliers) into the range (0, 1).
    1. Linear Regression: Y = f(x)
    2. Logistic Regression: Y = sigmoid(f(x))
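
The relation between the two formulas above can be sketched in a few lines of NumPy. The weights, bias and input point are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and bias (assumption for illustration).
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])

linear_output = float(w @ x + b)             # f(x) = 2*1 - 1*3 + 0.5 = -0.5
probability = float(sigmoid(linear_output))  # sigmoid(-0.5), about 0.3775

print(f"f(x) = {linear_output}, sigmoid(f(x)) = {probability:.4f}")

# Very large inputs are squashed towards 0 or 1 instead of growing without bound.
print(sigmoid(np.array([-100.0, 0.0, 100.0])))
```
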
There are several assumptions when applying Logistic Regression to a dataset:
  1. The features are not multicollinear; this can be checked using a perturbation test.
  2. The dependent variable should be binary.
  3. The dataset size should be large enough.
Logistic Regression Implementation on the Text Dataset (Using Sklearn):

You can download the data from here: https://www.kaggle.com/snap/amazon-fine-food-reviews

First, we will clean the dataset. I have written a detailed post on text data cleaning. You can read it here: https://kkaran0908.github.io/Text_Data_Preprocessing.html

After cleaning, we will divide the dataset into three parts, i.e., train, validation, and test sets. Using the validation set, we will try to find the optimal hyperparameters for the model. After finding the optimal hyperparameters, we will test the model on unseen data, i.e., the test set.
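
A minimal sketch of such a three-way split with sklearn is shown below. The placeholder reviews, the labels and the 60/20/20 proportions are assumptions for illustration; the original post does not state the exact split sizes.

```python
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 100 cleaned reviews with binary labels.
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First cut off 20% as the test set, keeping the class balance (stratify).
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Then split the remaining 80% into train (60% overall) and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```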

Now we vectorize the dataset using CountVectorizer (Bag of Words); it is one of the most straightforward methods to convert text data into numerical vectors.
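
Here is a small sketch of how CountVectorizer works. The three training sentences and two test sentences are made up for illustration. Note that the vocabulary is learned from the training text only and then reused for the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews (assumption for illustration).
train_texts = ["good tasty food", "bad food", "good good service"]
test_texts = ["tasty service", "unknown word"]

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(train_texts)  # learn vocabulary + count
X_test_bow = vectorizer.transform(test_texts)        # reuse the same vocabulary

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)             # one column per word, in alphabetical order
print(X_train_bow.toarray())  # word counts per training review
print(X_test_bow.toarray())   # words unseen during training are ignored
```

Each row is one review and each column counts how often one vocabulary word appears in it.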

Now we will import all the required libraries that will be useful for the analysis.

C and penalty are hyperparameters of Logistic Regression in sklearn (there are others as well); C is the inverse of the regularization strength. We will try to find the optimal values of these hyperparameters.

Output :

0.0001 ------> 0.5
0.001  ------>  0.7293641972138972
0.01  ------>  0.8886922437232533
0.1  ------>  0.9374969316048458
1  ------>  0.9399004712804476
10  ------>  0.9113632222156819
100  ------>  0.8794308252229597
Optimal AUC score: 0.9399004712804476
Optimal C: 1

We can see that for C=1 we get the best AUC score, so we will use it for the final model.
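
A search like the one that produces this kind of table can be sketched as follows. This is not the original code from the post: it uses synthetic data from make_classification instead of the review dataset, and the l1 penalty and liblinear solver are assumptions, so the numbers it prints will differ from the output listed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption) for the vectorized reviews.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=2,
    n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidate_C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
val_auc = {}
for C in candidate_C:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_train, y_train)
    # AUC needs scores, so use the probability of the positive class.
    scores = model.predict_proba(X_val)[:, 1]
    val_auc[C] = roc_auc_score(y_val, scores)
    print(C, " ------> ", val_auc[C])

best_C = max(val_auc, key=val_auc.get)
print("Optimal AUC score:", val_auc[best_C])
print("Optimal C:", best_C)
```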

Our dataset has two classes, so predict_proba() gives us the probability of both classes. For example, for a single point predict_proba() returns values like [1-p, p], where p is the probability of the point being positive and 1-p is the probability of it being negative (the columns follow the order of the model's classes_ attribute, so with labels 0 and 1 the negative class comes first). We assign the test point to whichever class has the higher probability.
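
Here is a tiny sketch of predict_proba() on made-up one-feature data (an assumption for illustration, not the review dataset). In sklearn, the columns of the result follow the order of model.classes_.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data (assumption): negative points on the left, positive on the right.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

query = np.array([[1.5]])
proba = model.predict_proba(query)

print("class order:", model.classes_)  # column order of predict_proba
print("probabilities:", proba)         # one row per query point, rows sum to 1

# Pick the class with the higher probability; this matches model.predict().
predicted = model.classes_[np.argmax(proba, axis=1)]
print("predicted class:", predicted)
```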

Output:

AUC score on test data: 0.8258208984684994

AUC score on the training data: 0.8909678471639081

Exercise for You:

  1. Download the same data from kaggle: https://www.kaggle.com/snap/amazon-fine-food-reviews
  2. Apply logistic regression on top of that data using bag of words (BOW) only, as I have done in this post.
  3. Change the penalty from l1 to l2 and comment down your AUC score.
  4. If you face any difficulty in doing this analysis, please comment below and I will share the full working code.

Additional Articles:

  1. https://www.statisticssolutions.com/what-is-logistic-regression/
  2. https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
  3. http://www.robots.ox.ac.uk/~az/lectures/ml/2011/lect4.pdf

Back Propagation Algorithm in Deep Learning

It is one of the most useful concepts in all of deep learning. Most neural networks are trained using the backpropagation algorithm. In this article, we are going to talk about the various steps in training a model using backpropagation.

Steps in Backpropagation Algorithm:

We are given a dataset to train the model. It is in the form (Xi, Yi), where Xi is the input and Yi is the corresponding true (target) value.

  1. First, initialize the weights using one of various methods such as random_uniform, random_normal, glorot_normal, glorot_uniform, he_normal, etc.
  2. Pass each data point Xi into the network (also called forward propagation).
  3. Calculate the loss using Yi and Ypredicted.
  4. Compute all the derivatives using the chain rule; to reduce the training time, use memoization so that each derivative is computed only once.
  5. Update the weights using one of the available algorithms such as SGD, Adagrad, Adam, Adadelta, etc.
  6. Repeat steps 2 to 5 until convergence.
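
The steps above can be sketched with plain NumPy for a tiny network with one hidden layer. Everything specific here is an assumption for illustration: the toy dataset, the 2-4-1 architecture, tanh and sigmoid activations, cross-entropy loss, plain gradient descent and a fixed number of epochs instead of a real convergence check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (Xi, Yi): the label is 1 when the two inputs sum to a positive number.
X = rng.normal(size=(200, 2))
Y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Step 1: initialize the weights.
W1, b1 = glorot_uniform(2, 4), np.zeros((1, 4))
W2, b2 = glorot_uniform(4, 1), np.zeros((1, 1))

lr, losses = 0.5, []
for epoch in range(500):  # Step 6: repeat steps 2 to 5
    # Step 2: forward propagation.
    h = np.tanh(X @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)

    # Step 3: binary cross-entropy loss between Yi and Ypredicted.
    eps = 1e-12
    loss = -np.mean(Y * np.log(y_pred + eps) + (1 - Y) * np.log(1 - y_pred + eps))
    losses.append(loss)

    # Step 4: chain rule. d_out is computed once and reused (memoization).
    d_out = (y_pred - Y) / len(X)             # gradient at the output pre-activation
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * (1 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0, keepdims=True)

    # Step 5: update the weights (plain gradient descent).
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

accuracy = float(np.mean((y_pred > 0.5) == (Y == 1)))
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}, accuracy: {accuracy:.2f}")
```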

An important thing about backpropagation is that it works only when the activation function is differentiable. If the derivative is cheap to compute, we can train our model faster.
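
As an example of a derivative that is cheap to compute, the sigmoid function's derivative can be written using the sigmoid output itself. The sketch below (with my own choice of test points and step size) checks that formula against a numerical derivative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid, reusing the forward value: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-3, 3, 7)

# Numerical derivative (central difference) for comparison.
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

max_error = float(np.max(np.abs(numeric - sigmoid_grad(z))))
print("analytic:", sigmoid_grad(z))
print("max difference from numerical derivative:", max_error)
```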