Benefits of using Numpy

Why is Numpy Better than Lists?


As a Python programmer, you should know why NumPy arrays are better than lists and how to demonstrate it.

As a beginner you can skip this question, but if you work with large data sets, you should be aware of the benefits NumPy arrays have over lists.

In this article, we will discuss the ways in which NumPy arrays are better and more suitable than lists.
NumPy arrays perform better in the following areas:




NumPy arrays come with built-in functions, such as linear algebra operations, which lists do not offer.
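
As a small illustration of these built-in operations, the sketch below multiplies a matrix by itself, solves a linear system, and computes a determinant. The matrix `a` and vector `b` are made-up values chosen only for the example.

```python
import numpy as np

# Illustrative values, not taken from the article.
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

product = a @ a              # matrix multiplication
x = np.linalg.solve(a, b)    # solve the linear system a @ x = b
det = np.linalg.det(a)       # determinant of a

print(product)
print(x)
print(det)
```

Doing the same with plain lists would require writing the nested loops by hand.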


2) SIZE :


A NumPy array takes up less memory than a list holding the same values.

Let us understand this with the help of an example:

  • First, we will import the numpy library, time, and getsizeof from the sys module.

  • Then, we will create an empty list and get its size.

  • We will fill the list with a range of n elements, i.e. 3000000, and get the size again.

 >> We see that the size has increased considerably.

  • Now, let’s convert the list to a NumPy array and take its size:

  • To prove our point, we print out the difference between the two sizes:

Hence, the above example proves our point that a NumPy array takes up less space than the list data structure.
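
The comparison can be sketched roughly as follows. This is not the original code from the post. One assumption worth stating: `sys.getsizeof` on a list reports only the list object and its references, so this sketch also adds the size of the integer objects the list points to. The `int64` dtype is likewise an assumption.

```python
import sys
import numpy as np

n = 3_000_000
numbers = list(range(n))

# getsizeof(list) covers the list object and its references only,
# so the int objects it points to are counted separately.
list_bytes = sys.getsizeof(numbers) + sum(sys.getsizeof(x) for x in numbers)

array = np.array(numbers, dtype=np.int64)
array_bytes = array.nbytes   # 8 bytes per int64 element, stored contiguously

print("list :", list_bytes, "bytes")
print("array:", array_bytes, "bytes")
print("difference:", list_bytes - array_bytes, "bytes")
```

The exact numbers depend on the Python version and platform, but the array stores raw 8-byte values while the list stores a reference plus a full Python object per element.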




NumPy arrays are also faster than lists.

Let’s understand with the help of an example:

  • First, create a list of n elements and print the first and last 5 items:

  • Modify the list by multiplying each item by 10, and print some of the items:

>> We see that all elements are now multiplied by 10.

  • Now, let’s measure the processing time taken by this procedure:

  • Next, convert the list into a NumPy array:

  • Modify the array and print the elements:

  • Then measure the time taken by this process:

  • To prove our point, find the difference between the two times:

>> In this run, NumPy is faster than the list by 0.53125 s.

This is a small amount of time, but when you work with a large dataset, the difference becomes huge.
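
A minimal sketch of such a timing experiment is shown below. It is not the original code: the value of `n` and the use of `time.perf_counter` are assumptions, and the measured times will differ from machine to machine.

```python
import time
import numpy as np

n = 3_000_000                      # assumed size for illustration
numbers = list(range(n))

# Multiply every list item by 10 with a Python-level loop.
start = time.perf_counter()
scaled_list = [x * 10 for x in numbers]
list_seconds = time.perf_counter() - start

# Do the same on a NumPy array with one vectorized expression.
array = np.array(numbers)
start = time.perf_counter()
scaled_array = array * 10
array_seconds = time.perf_counter() - start

print(scaled_list[:5], scaled_list[-5:])
print(scaled_array[:5], scaled_array[-5:])
print("list :", list_seconds, "s")
print("array:", array_seconds, "s")
print("difference:", list_seconds - array_seconds, "s")
```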


NumPy is designed to be efficient with matrix operations; more specifically, most processing in NumPy is vectorized.
Vectorization means expressing a mathematical operation, such as the multiplication we used above, as acting on the entire array rather than on a single element at a time.
With vectorization, the loop over the elements runs inside NumPy's compiled code, so an operation on a NumPy array can process many elements at once instead of stepping through them one at a time in the Python interpreter.
As long as the operation can be applied to the elements independently of the order in which they are processed, as in the case of elementwise arithmetic or matrix multiplication, vectorization will give you large speedups.
Looping over Python lists and dictionaries can be slow.
Vectorized operations in NumPy are mapped to highly optimized C code, which makes them much faster than their standard Python counterparts.

Hence, I recommend using NumPy arrays rather than lists in Python.

Thank you for reading!            



FLASK API to calculate WER, MER for text comparison in Python

Advanced research is under way in the field of text analytics and NLP. Deep learning approaches such as Seq2Seq models and BERT have been established for tasks like language translation, abstractive text summarization, and image captioning.
To determine how well models are performing, we need performance metrics such as ROUGE score, WER, and BLEU score. WER, or word error rate, gives a good picture by providing an in-depth analysis: the words substituted, inserted, or deleted.
Terminologies :

The original text is the reference text or gold standard text; the machine- or model-generated text is the hypothesis text.

WER is formalized as (number of words present in the machine-generated text but not in the original text + number of words present in the original text but not in the machine-generated text + number of words substituted) / (number of words in the original text). In other words, WER = (I + D + S) / N, where I, D, and S are the numbers of inserted, deleted, and substituted words and N is the number of words in the reference.

MER (match error rate) is formalized as (I + D + S) / (H + S + D + I), where H is the number of correctly matched words (hits). This is the definition used by the jiwer library.
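
To make the two definitions concrete, here is a minimal, dependency-free sketch that counts hits, substitutions, deletions, and insertions with a word-level edit-distance alignment. It is an illustration, not the jiwer implementation. The function names are hypothetical, the reference is assumed to be non-empty, and ties between equally cheap alignments are broken in favour of more hits (a choice made here, which can affect MER in ambiguous cases).

```python
def error_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) between two texts."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            cost, h, s, d, ins = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, h + 1, s, d, ins)
            else:
                diagonal = (cost + 1, h, s + 1, d, ins)
            cost, h, s, d, ins = prev[j]
            deletion = (cost + 1, h, s, d + 1, ins)
            cost, h, s, d, ins = curr[j - 1]
            insertion = (cost + 1, h, s, d, ins + 1)
            # Lowest cost wins; ties go to the alignment with more hits.
            curr.append(min(diagonal, deletion, insertion,
                            key=lambda cell: (cell[0], -cell[1])))
        prev = curr
    return prev[-1][1:]


def wer(reference, hypothesis):
    h, s, d, i = error_counts(reference, hypothesis)
    return (s + d + i) / (h + s + d)        # h + s + d = words in the reference


def mer(reference, hypothesis):
    h, s, d, i = error_counts(reference, hypothesis)
    return (s + d + i) / (h + s + d + i)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on the big mat"
print(error_counts(reference, hypothesis))
print(wer(reference, hypothesis), mer(reference, hypothesis))
```

In this example one word is substituted ("sat" becomes "sit") and one is inserted ("big"), so WER is 2/6 and MER is 2/7.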

HTML template :
Python FLASK Code:

How to run the code:

Folder Structure:
   — templates
           — index.html

Make sure you have Anaconda or Python installed on your system, along with the libraries flask, jiwer, and numpy (re is part of the Python standard library). Open the Anaconda prompt or a terminal, navigate to the WER folder, and execute ‘python’. By default it will run on http://localhost:8018/, but you can change the port in the file.

Test Result:

Basic Introduction to Random Forest

Decision Tree is one of the most used classical machine learning algorithms, but one of the biggest drawbacks of decision trees is that they are highly prone to overfitting. To minimize this problem, we use Random Forest. In this article, we will study the basics of Random Forest and the various terminologies related to it.

  1. Random Forest is an ensemble of decision trees. It uses the base principle of bagging with random feature selection to create more diverse trees.
  2. When splitting a node during the construction of a tree, the split that is chosen is no longer the best split among all the features. Instead, the split picked is the best split among a random subset of the features.
  3. As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree).
  4. Due to averaging, its variance decreases, usually more than compensating for the increase in bias, hence yielding a better result overall.
  5. It can handle the curse of dimensionality, since each split uses only a small random portion of the full feature set, and it is less prone to overfitting.
  6. A common choice is to consider sqrt(P) features at each split, where P is the number of features.


Algorithm to Train Random Forest

Let’s explore the hyperparameters of Random Forest:

Random Forest has a lot of parameters, so the cross-validation phase needed to find an optimal set of parameters takes a very long time.


  1. n_estimators: integer, optional (default=10 in older scikit-learn releases, 100 since version 0.22). The number of trees in the forest.
  2. criterion: string, optional (default=“gini”). The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
  3. max_depth: integer or None, optional (default=None). The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  4. max_features: int, float, string or None, optional (default=“auto” in older releases; newer scikit-learn releases use “sqrt” for the classifier). The number of features to consider when looking for the best split.
  5. bootstrap: boolean, optional (default=True). Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
  6. oob_score: bool (default=False). Whether to use out-of-bag samples to estimate the generalization accuracy.
  7. warm_start: bool, optional (default=False). When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
  8. n_jobs: int or None, optional (default=None). The number of jobs to run in parallel for both fit and predict. None means 1 and -1 means using all processors.
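
As a rough illustration of how these hyperparameters are passed, the sketch below fits scikit-learn's `RandomForestClassifier` on a synthetic dataset. The dataset from `make_classification` and all parameter values are assumptions picked for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so that the example is self-contained.
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    criterion="gini",      # split quality measure
    max_depth=None,        # grow each tree until its leaves are pure
    max_features="sqrt",   # consider sqrt(P) features at each split
    bootstrap=True,        # build each tree on a bootstrap sample
    oob_score=True,        # estimate accuracy from out-of-bag samples
    warm_start=False,      # fit a whole new forest on each call to fit
    n_jobs=-1,             # use all processors
    random_state=0,
)
forest.fit(X, y)

print("trees:", len(forest.estimators_))
print("OOB accuracy estimate:", forest.oob_score_)
```

In practice these values would be searched with cross-validation, which is where the long tuning time mentioned above comes from.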

Feature Sampling in Random Forest:

Advantages of a Random Forest

1. A random forest is more stable than any single decision tree because the results get averaged out; it is not affected by the instability and bias of an individual tree.
2. A random forest is less affected by the curse of dimensionality, since only a subset of features is considered when splitting a node.
3. You can parallelize the training of the forest, since each tree is constructed independently.

4. You can calculate the OOB (Out-of-Bag) error using the training set which gives a really good estimate of the performance of the forest on unseen data.
Hence there is no need to split the data into training and validation; you can use all the data to train the forest.

The OOB error:

The OOB error is calculated by using each observation of the training set as a test observation. Since each tree is built on a bootstrap sample, each observation can be used as a test observation by those trees which did not have it in their bootstrap sample.

All these trees predict this observation and you get an error for a single observation. The final OOB error is calculated by calculating the error on each observation and aggregating it. It turns out that the OOB error is as good as a cross-validation error.
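
To see why enough trees are available to test each observation, the sketch below simulates only the bootstrap step (no trees are trained). With sampling with replacement, each observation is left out of roughly 37% of the bootstrap samples. The sample count, tree count, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees = 1000, 200     # arbitrary sizes for illustration

# in_bag[t, i] is True when observation i was drawn for tree t.
in_bag = np.zeros((n_trees, n_samples), dtype=bool)
for t in range(n_trees):
    drawn = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
    in_bag[t, drawn] = True

oob = ~in_bag                               # trees that never saw the observation
oob_fraction = oob.mean()
oob_trees_per_observation = oob.sum(axis=0)

print("average OOB fraction:", oob_fraction)
print("fewest OOB trees for any observation:", oob_trees_per_observation.min())
```

Each observation is then predicted only by its own out-of-bag trees, and the errors are aggregated over all observations.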

Thank you and Keep Learning 🙂


Text Classification using LSTM in Keras (Review Classification using LSTM)

There are various classical machine learning algorithms, such as Naive Bayes, Logistic Regression, and Support Vector Machine, that we can use for text classification. Most of these classification algorithms assume that the words in the text are independent of each other, so they cannot handle the dependency between different words in the dataset. Text data is sequential data, where one particular word depends on several other words. Recurrent Neural Networks (RNNs) can handle sequential data, but RNNs have two major drawbacks:

  1. There is a vanishing gradient problem in RNNs.
  2. There is an exploding gradient problem in RNNs.

Due to the above two drawbacks, RNNs are sometimes not able to handle long-term dependencies. To handle these problems, there is a special variant of RNN called LSTM (Long Short-Term Memory). It can handle long-term as well as short-term dependencies and is much less affected by the vanishing and exploding gradient problems. By the end of this article, you should be able to classify a text dataset using LSTM.

How we can feed the data to LSTM:

We have to feed the data to LSTM in a particular format. First, we count how many times each unique word occurs in the dataset and store the counts in a dictionary. We sort this dictionary by the number of times a word has occurred. Then we check the position at which a word occurs in the sorted dictionary and assign that position (a numerical value) as the representation of the word. Let us suppose our text dataset contains three sentences:

data = ["this dog is really fast", "dog barks on the strangers", "I like dog who barks on strangers"]

 First, we will store all the unique words and all the words in two different lists.

Now we will create a dictionary that contains each word as a key and the number of times the word has occurred as the value. We will sort this dictionary by value.


[('dog', 3),
 ('strangers', 2),
 ('on', 2),
 ('barks', 2),
 ('who', 1),
 ('like', 1),
 ('I', 1),
 ('the', 1),
 ('fast', 1),
 ('really', 1),
 ('is', 1),
 ('this', 1)]

After the code below runs, each sentence is converted into its numerical representation.

 new_data contains the numerical representation of the text data, and it will look like this:

[[12, 1, 11, 10, 9], [1, 4, 3, 8, 2], [7, 6, 1, 5, 4, 3, 2]]

We can see that “dog” has occurred the highest number of times, so it has been assigned 1; “strangers” has the second-highest count, so it gets an encoding of 2, and so on.
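
The encoding described above can be sketched in a few lines of plain Python. This is a reconstruction, not the original code: it assumes the third sentence uses "dog" (so that "dog" is counted three times, as in the listing), and it breaks ties between equally frequent words by reversing a stable ascending sort, which happens to reproduce the order shown.

```python
from collections import Counter

# Assumption: the third sentence uses "dog", matching the counts listed above.
data = [
    "this dog is really fast",
    "dog barks on the strangers",
    "I like dog who barks on strangers",
]

# Count every word; Counter keeps words in first-seen order.
counts = Counter(word for sentence in data for word in sentence.split())

# Stable ascending sort by count, then reversed: most frequent word first.
ranked = sorted(counts.items(), key=lambda item: item[1])[::-1]

# The rank (starting at 1) becomes the numerical code of the word.
word_to_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

new_data = [[word_to_index[word] for word in sentence.split()] for sentence in data]
print(ranked)
print(new_data)
```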

We will split the dataset into a train set and a test set:

In the dataset, the sentences will not all have the same length, so the code below makes the length of each sentence equal. It pads the sentence with an extra encoding if its length is smaller than 600 words and removes the less frequently occurring words if its length is more than 600 words.
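
The idea of forcing every encoded sentence to a fixed length can be sketched as below. This is a simplification and not the original code: the hypothetical `pad_sequence` helper pads with zeros at the front and, for long inputs, simply keeps the last `max_len` entries without looking at word frequency. That mirrors the default behaviour of Keras `pad_sequences` (pre-padding and pre-truncating), which is an assumption about what the post used.

```python
def pad_sequence(sequence, max_len=600, pad_value=0):
    """Return a copy of `sequence` with exactly `max_len` entries."""
    if len(sequence) >= max_len:
        return list(sequence[-max_len:])            # keep the last max_len codes
    padding = [pad_value] * (max_len - len(sequence))
    return padding + list(sequence)                 # pad at the front


encoded = [[12, 1, 11, 10, 9], [7, 6, 1, 5, 4, 3, 2]]
padded = [pad_sequence(sentence, max_len=6) for sentence in encoded]
print(padded)
```

A short `max_len` of 6 is used here only so the effect is visible on the toy sentences.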

Now we will classify the sentences in our dataset. First, we will import all the required libraries:

Define the LSTM Architecture:

Train the Model and Save the History:

Evaluate the Trained Model:


  1. Download the Amazon Review Data-Set :
  2. Perform all the above steps on this dataset.
  3. After performing the above steps just comment in the comment section and let us know the accuracy score of your model.

All You Need to Know About Sampling Distribution in Statistics

Sampling is the process of drawing a predetermined number of observations from a larger population. It is very difficult to make predictions on the whole population when the data is very large, so we take samples and make predictions on sample data that represents the population.
A sample is a smaller, manageable version of a larger group: a subset containing the characteristics of the larger population. A good maximum sample size is usually around 10% of the population. For example, suppose you want to know the literacy rate of India. It is very difficult to collect data from each and every person in the country, so we collect samples randomly. Choosing a correct sample from the population is one of the most important tasks.

Entire Population

In this case, we must ensure that the data is highly random and is not selected on any single basis, such as a particular state or gender, to avoid bias towards one category.

The sampling distributions are of two types:

Probability Distribution:

In this distribution, with randomization, every element gets an equal chance to be picked up.

Non-Probability Distribution:

In this distribution, every element does not get an equal chance to be selected.

Type of Distributions

Probability Distribution:

Probability sampling gives you the best chance to create a sample that is truly representative of the population. Using probability sampling for finding sample sizes means that you can employ statistical techniques like confidence intervals and margins of error to validate your results. There are various types of probability distribution sampling discussed below:

a) Simple Random Sampling :

Simple Random Sampling is mainly used when we don’t have any prior knowledge about the target variable. In this type of sampling, all the elements have an equal chance of being selected.
Simple Random Sampling
An example of a simple random sample would be the names of 50 employees being chosen out of a hat from a company of 500 employees. A simple random sample is meant to be an unbiased representation of a group.

How do you do simple random sampling?

  1. Define the population.
  2. Choose your sample size.
  3. List the population.
  4. Assign numbers to the units.
  5. Find random numbers.
  6. Select your sample.
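
The steps above can be sketched with the standard library. The employee names are made up to match the 50-out-of-500 example, and the seed is fixed only to make the run repeatable.

```python
import random

random.seed(42)   # fixed seed so the example is repeatable

# Steps 1-4: define and list the population (hypothetical employee names).
population = [f"employee_{i}" for i in range(1, 501)]

# Steps 5-6: draw 50 units without replacement; every unit is equally likely.
sample = random.sample(population, k=50)
print(sample[:5])
```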

    b) Systematic Sampling:

    Here the elements for the sample are chosen at regular intervals from the population. First, all the elements are put together in a sequence. The selection of elements is systematic and not random, except for the first element.

    It is popular with researchers because of its simplicity. Researchers select items from an ordered population using a skip or sampling interval. For example, Saurabh can give a survey to every fourth customer that comes into the movie theatre.

    How do you do systematic sampling?

    1. Calculate the sampling interval (the number of households in the population divided by the number of households needed for the sample).
    2. Select a random start between 1 and the sampling interval.
    3. Repeatedly add the sampling interval to select subsequent households.
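
A minimal sketch of these three steps is shown below. The `systematic_sample` helper and the numbered household list are hypothetical, and the sketch assumes the population size is a multiple of the sample size.

```python
import random


def systematic_sample(population, sample_size, rng=random):
    """Pick every k-th unit after a random start (positions are 1-based)."""
    interval = len(population) // sample_size        # step 1: sampling interval
    start = rng.randint(1, interval)                 # step 2: random start
    positions = range(start, len(population) + 1, interval)   # step 3
    return [population[p - 1] for p in positions][:sample_size]


households = list(range(1, 101))                     # hypothetical household ids
sample = systematic_sample(households, 25, random.Random(0))
print(sample)
```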

    c) Stratified Sampling:

    In stratified sampling, we divide the elements of the population into strata (small groups) based on a similarity measure. The elements are homogeneous within one group and heterogeneous across groups.

    How do you do stratified sampling?

    1. Divide the population into smaller subgroups, or strata, based on the members’ shared attributes and characteristics.
    2. Take a random sample from each stratum in a number that is proportional to the size of the stratum.

    Advantages of  Stratified Sampling:

    • A stratified sample can provide greater precision than a simple random sample of the same size.
    • Because it provides greater precision, a stratified sample often requires a smaller sample, which saves money.
    For example, one might divide a sample of adults into subgroups by age, like 18–29, 30–39, 40–49, 50–59, and 60 and above.

    The sample size for each stratum (layer) is proportional to the size of the layer:

     Sample size of the stratum = (size of the entire sample / population size) * layer size.
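
The proportional allocation formula can be sketched as follows. The `stratified_sample` helper and the age-group sizes are invented for illustration. Because each stratum size is rounded, the per-stratum counts may not add up exactly to the requested total in general.

```python
import random


def stratified_sample(strata, total_sample_size, rng=random):
    """Draw from each stratum in proportion to its share of the population."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # (size of entire sample / population size) * stratum size
        k = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population of 1000 adults split by age group.
strata = {
    "18-29": list(range(0, 400)),
    "30-39": list(range(400, 700)),
    "40-49": list(range(700, 900)),
    "50+": list(range(900, 1000)),
}
sample = stratified_sample(strata, 100, random.Random(0))
print({name: len(units) for name, units in sample.items()})
```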

    d) Cluster Sampling:

    In one-stage cluster sampling, the entire population is divided into different clusters, some clusters are selected at random, and every element of the selected clusters is included in the sample.
    In two-stage cluster sampling, we first randomly select the clusters, combine those clusters, and then randomly select samples from them.

    Cluster Sampling

    How do you do cluster sampling?
    1. Estimate a population parameter.
    2. Compute sample variance within each cluster (for two-stage cluster sampling).
    3. Compute standard error.
    4. Specify a confidence level.
    5. Find the critical value (often z-score or a t-score).
    6. Compute margin of error.
    NOTE: Cluster sampling is less expensive and quicker.
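
The one-stage and two-stage selection schemes can be sketched as below. Both helper functions and the village/household data are hypothetical, and the sketch covers only the selection of units, not the margin-of-error calculation.

```python
import random


def one_stage_cluster_sample(clusters, n_clusters, rng=random):
    """Select whole clusters at random and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


def two_stage_cluster_sample(clusters, n_clusters, n_units, rng=random):
    """Select clusters at random, pool them, then sample units from the pool."""
    pooled = one_stage_cluster_sample(clusters, n_clusters, rng)
    return rng.sample(pooled, n_units)


# Hypothetical population: 10 villages with 20 households each.
clusters = {
    f"village{v}": [f"village{v}_house{h}" for h in range(20)]
    for v in range(10)
}
rng = random.Random(0)
one_stage = one_stage_cluster_sample(clusters, 3, rng)
two_stage = two_stage_cluster_sample(clusters, 3, 15, rng)
print(len(one_stage), len(two_stage))
```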

    e) Multi-Stage Sampling: 

    In multi-stage sampling, the clusters are divided into groups, and the groups are divided into subgroups, until they cannot be divided further. For example, states are divided into districts, which are further divided into villages and then into households.

    Multi-Stage Sampling
    How do you do multi-stage sampling?
    1. Choose a sampling frame, considering the population of interest.
    2. Select a sampling frame of relevant separate sub-groups.
    3. Repeat the second step if necessary.
    4. Using some variation of probability sampling, choose the members of the sample group from the sub-groups.
    Advantages: cost, speed, and convenience (you only need a list of the clusters and of the individuals in the selected clusters). It is also usually more accurate than cluster sampling for the same total sample size.

    2) Non-Probability Distribution types:

    Nonprobability sampling is a sampling technique where the odds of any member being selected for a sample cannot be calculated. Nonprobability sampling is defined as a sampling technique in which the researcher selects samples based on the subjective judgment of the researcher rather than random selection.

    Types of Non-Probability Sampling:


    a) Convenience Sampling: 

    Convenience sampling, also known as availability sampling, is a specific type of non-probability sampling method. The sample is taken from a group of people who are easy to contact or reach. For example, standing at a mall or a grocery store and asking people to answer questions would be a convenience sample.

    Convenience Sampling
    The relative cost and time required to carry out a convenience sample are small in comparison to probability sampling techniques. This enables you to achieve the sample size you want in a relatively fast and inexpensive way. Its limitations include data bias and inaccurate parameter estimates. Perhaps the biggest problem with convenience sampling is dependence, meaning that the sample items are all connected to each other in some way.

    b) Judgment Sampling:

    Judgment sampling is a common non-probability method. It is also called the purposive method. The researcher selects the sample based on his or her own judgment. This is usually an extension of convenience sampling.

    Judgment Sampling

    Judgment sampling may be used for a variety of reasons. In general, the goal of judgment sampling is to deliberately select units (e.g., individual people, events, objects) that are best suited to enable researchers to address their research questions. This is often done when the population of interest is very small, or when the desired characteristics of units are very rare, making probability sampling infeasible.

    c) Quota Sampling:

    Quota sampling is a method of gathering representative data from a group. As opposed to random sampling, quota sampling requires that representative individuals are chosen out of a specific subgroup. For example, a researcher might ask for a sample of 50 females or 50 individuals between the ages of 32 and 43.

    Quota Sampling

    Quota sampling is used when the company is short of time or the researcher's budget is limited. Quota sampling can also be used when detailed accuracy is not important. To create a quota sample, the population and the objective should be well understood.

    d) Snowball Sampling:

    As described in Leo Goodman’s (2011) comment, snowball sampling was developed by Coleman (1958-1959) and Goodman (1961) as a means of studying the structure of social networks.
    Snowball sampling (also called chain sampling, chain-referral sampling, or referral sampling) is a non-probability sampling technique in which existing study subjects recruit future subjects from among their acquaintances. The analysis is conducted once the respondents submit their feedback and opinions. It is used where potential participants are hard to find.

    Snowball Sampling

    Advantage of Snowball Sampling:

    The chain referral process allows the researcher to reach populations that are difficult to sample when using other sampling methods. The process is simple and cost-efficient. This sampling technique needs little planning and a smaller workforce compared to other sampling techniques.

    Disadvantages of Snowball Sampling:

    • The researcher has little control over the sampling method.
    • The representativeness of the sample is not guaranteed.
    • Sampling bias is also a fear of researchers when using this sampling technique.

    All You Need to Know about Activation Functions (Sigmoid, Tanh Relu, Leaky Relu, Softmax)

    Activation functions are a crucial part of any neural network. A deep learning model without an activation function is nothing but a simple linear regression model. Activation functions map the input to the output in a particular fashion and help the network learn the intricate structure in the dataset. Earlier, when researchers were not aware of activation functions, we were not able to make efficient use of neural networks. In this article, we will learn about the following activation functions, along with their advantages and drawbacks.

    1. Linear Activation Function
    2. Binary Step Activation Function
    3. Sigmoid
    4. Tanh
    5. Relu
    6. Leaky Relu
    7. Softmax

    Linear Activation Function:

    In the linear activation function, the output is the same as the input we provide to the function. We can understand this using the formula below.

    F(x) = x (No Change in the Output)

    The problem with the linear activation function is that, no matter how many layers we use in our neural network, the output of the system will still be linear. This means that our neural network will not be able to learn the non-linear structure in the dataset.
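
The claim that stacked linear layers stay linear can be checked numerically. In the sketch below, two layers with the identity activation are collapsed into a single layer with weights `W2 @ W1` and bias `W2 @ b1 + b2`. The layer sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with the linear activation F(x) = x (arbitrary sizes).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```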

    Binary Step Activation Function:

    $$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \geq 0 \end{cases}$$

    Visualization of Binary Step Function

    In this activation function, there are only two states: the output will be either 0 or 1. When the input is greater than or equal to zero, the output is 1; otherwise, the output is 0.

    In the binary step function, we can change the threshold. In the above case, we have taken the threshold as 0, but we can change it.

    One of the significant problems with the binary step activation function is that it is non-differentiable at x = 0, and its gradient is zero everywhere else. So it cannot be used with classical backpropagation to update the weights of a neural network.


    Sigmoid:

    Graph of Sigmoid
    It is one of the commonly used activation functions in neural networks, also called the logistic activation function. It has an S shape and squeezes all values into the range (0, 1). The sigmoid activation function is differentiable, so we can optimize our model using simple backpropagation. One of its most significant drawbacks is that it can create the vanishing gradient problem in the network, because its gradient is never larger than 0.25 and shrinks towards zero for inputs of large magnitude. We will learn about the vanishing gradient problem in detail in another post.
    $$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$$
    Sigmoid Activation Function

    $$f'(x) = f(x)\,(1 - f(x))$$
    Derivative of Sigmoid
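
A small NumPy sketch of the two formulas is given below, mainly to show how small the gradient is. The function names and the evaluation grid are arbitrary choices for this example.

```python
import numpy as np


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


x = np.linspace(-10.0, 10.0, 2001)
values = sigmoid(x)
gradients = sigmoid_derivative(x)

print("output range:", values.min(), values.max())
print("largest gradient:", gradients.max())          # reached near x = 0
print("gradient at x = 10:", sigmoid_derivative(10.0))
```

The gradient peaks at 0.25 around x = 0 and is close to zero at the ends of the grid, which is the source of the vanishing gradient problem.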

    Tanh/Hyperbolic Tangent Activation Function:

    Graph of Tanh
    It is similar to Sigmoid; we can even say that it is a scaled and shifted version of Sigmoid. Like Sigmoid, it is differentiable at all points. Its range is (-1, 1), which means that it maps any given value into the interval (-1, 1). As it is a non-linear activation function, it can learn some of the complex structures in the dataset. One of its major drawbacks is that, like Sigmoid, it has a vanishing gradient problem because of the small gradient values. In most cases, we prefer Tanh over Sigmoid.

    $$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
    Tanh Activation Function

    $$f'(x) = 1 - f(x)^{2}$$
    Derivative of Tanh
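
The relationship to Sigmoid can be verified numerically: tanh(x) equals 2·sigmoid(2x) − 1. The sketch below checks this identity and evaluates the derivative formula on an arbitrary grid of inputs.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-5.0, 5.0, 101)

tanh_values = np.tanh(x)
scaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0      # scaled and shifted sigmoid
tanh_gradients = 1.0 - tanh_values ** 2            # f'(x) = 1 - f(x)^2

print(np.allclose(tanh_values, scaled_sigmoid))
print("largest gradient:", tanh_gradients.max())
```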


    Relu:

    It is one of the most used activation functions in 2020 and one of the state-of-the-art activation functions in deep learning. From the function, we can see that when we provide a negative value to Relu, it changes it to zero; otherwise, it does not change the value. As it does not activate all the neurons at once, and the output of some neurons is zero, it makes the network sparse and computationally efficient.
    Graph of Relu
    $$f(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ x & \text{for } x > 0 \end{cases}$$
    Relu Activation Function

    $$f'(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ 1 & \text{for } x > 0 \end{cases}$$
    Derivative of Relu
    But there are some problems with Relu as well. One of the major drawbacks is that it is not differentiable at x = 0, and at the same time it does not have any upper bound. For neurons that receive a negative input, the gradient is always zero, so the weights of those neurons do not get updated. This may create some dead neurons, but it can be handled by reducing the learning rate and the bias. As the mean activation in the network is not zero, there is always a positive bias in the system.
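
A minimal NumPy sketch of Relu and its derivative follows, using the convention from the formulas above that the derivative is 0 at x = 0. The input values are arbitrary.

```python
import numpy as np


def relu(x):
    """Replace negative values with zero and keep the rest unchanged."""
    return np.maximum(0.0, x)


def relu_derivative(x):
    """0 for x <= 0 and 1 for x > 0 (the convention used above)."""
    return (x > 0).astype(float)


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(relu_derivative(x))
```

The zero gradient for every non-positive input is what can leave a neuron "dead".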

    Leaky Relu:

    One of the drawbacks of Relu is that it is non-differentiable at x = 0, and if there are lots of negative inputs, lots of gradients will be zero and the error will not be able to propagate. This makes Relu neurons dead. To avoid these problems, we use Leaky Relu.
    Leaky Relu Graph
    In Leaky Relu, instead of turning a negative value directly into zero, we multiply it by some small number. Generally, this small number is 0.01. In this way, we can avoid the problem of dead Relu, because even if there are lots of negative inputs, some error will still get propagated.
    $$f(x) = \begin{cases} 0.01x & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$$
    Leaky Relu
    $$f'(x) = \begin{cases} 0.01 & \text{for } x < 0 \\ 1 & \text{for } x \geq 0 \end{cases}$$
    Derivative Leaky Relu
    But Leaky Relu does not necessarily outperform Relu. Its results are not consistent, so Leaky Relu should only be considered as an alternative to Relu.
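
The same kind of sketch for Leaky Relu is shown below. The slope of 0.01 comes from the text, while the function names and input values are arbitrary.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """Scale negative values by a small slope instead of zeroing them."""
    return np.where(x < 0, slope * x, x)


def leaky_relu_derivative(x, slope=0.01):
    """The gradient is `slope` for negative inputs and 1 otherwise."""
    return np.where(x < 0, slope, 1.0)


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))
print(leaky_relu_derivative(x))
```

Because the gradient for negative inputs is small but not zero, some error signal still reaches the weights.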


    Relu-6:

    In normal Relu and Leaky Relu, there is no upper bound on the positive values given to the function. But in Relu-6, there is an upper limit: once the value goes beyond six, we clip it to 6. This limit was chosen after a lot of experiments. The upper bound encourages the model to learn sparse features early.
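
Relu-6 can be written as a clip of the input to the interval [0, 6], as in this short sketch (the function name and inputs are arbitrary).

```python
import numpy as np


def relu6(x):
    """Relu with its output capped at 6."""
    return np.clip(x, 0.0, 6.0)


x = np.array([-3.0, 2.0, 6.0, 9.0])
print(relu6(x))
```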



    Softmax:

    In some real-life applications, instead of directly getting a binary prediction, we want to know the probability of each predicted category. Softmax is not a classical activation function. It is generally used in the last layer of the network to provide the probabilities of the classes.

    $$f_{i}(\vec{x}) = \frac{e^{x_{i}}}{\sum_{j=1}^{J} e^{x_{j}}}$$
    Function for Softmax Activation
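
A minimal NumPy sketch of the formula is given below. Subtracting the maximum before exponentiating is a common numerical-stability trick added here; it does not change the result. The scores are made-up values.

```python
import numpy as np


def softmax(x):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = x - np.max(x)          # avoids overflow in exp; result is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()


scores = np.array([2.0, 1.0, 0.1])   # hypothetical outputs of the last layer
probabilities = softmax(scores)
print(probabilities, probabilities.sum())
```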

    There are tons of other activation functions as well. There is no math that can tell you which activation function will work on your dataset. It is like a hyperparameter that you need to choose through hyperparameter tuning. Here we have tried to explain the most widely used activation functions. If there is any mistake in the explanation of any activation function, please let us know in the comment section, and we will try to improve it.


    Scratch Implementation of Stochastic Gradient Descent using Python

    Stochastic Gradient Descent, also called SGD, is one of the most used classical machine learning optimization algorithms. It is a variation of Gradient Descent. In Gradient Descent, we iterate through the entire dataset to update the weights. Because each iteration uses the whole dataset, Gradient Descent becomes too expensive in terms of computation time when the dataset is very large.

    To reduce the time, we make a slight variation to Gradient Descent, and this new algorithm is called Stochastic Gradient Descent. In SGD, at each iteration, we pick a single data point at random from the large dataset and update the weights based on that data point only. Following are the steps that we use in SGD:

    1. Randomly initialize the coefficients/weights for the first iteration. These could be some small random values.
    2. Initialize the number of epochs and the learning rate of the algorithm. These are hyperparameters, so they can be tuned using cross-validation.
    3. Make the predictions using the coefficients calculated up to this point.
    4. Calculate the error at this point.
    5. Update the weights according to the formula given in image 1.
    6. Repeat from step 3 until the number of epochs is over or the algorithm has converged.
    Image:1 Weight Update in SGD

    Below is the python implementation of SGD from Scratch:

    Given a data point and the old coefficients, this block of code will update the weights.

    Given some unknown data points along with the calculated coefficient, this part of the code will make predictions.

    This part of the code takes various parameters, such as the training data, the learning rate, the number of epochs, and the range r, and returns the optimal values of the coefficients. The learning rate, the range r, and the number of epochs are hyperparameters and can be tuned using cross-validation.

    Finally, after calculating the optimal set of coefficients, we will make the predictions on the test dataset.
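
The original code blocks are not reproduced here, so the following is only a sketch of the three parts described above, under stated assumptions: a linear regression model, a squared-error loss (so the update is w ← w − learning_rate · (prediction − y) · x), and a fixed number of epochs with no convergence check. The range parameter r is not modelled because the text does not define it, and the training data is synthetic.

```python
import numpy as np


def predict(X, coefficients):
    """Linear prediction; coefficients[0] is the intercept."""
    return coefficients[0] + X @ coefficients[1:]


def update_weights(x, y, coefficients, learning_rate):
    """One SGD step on a single data point (squared-error loss assumed)."""
    error = predict(x, coefficients) - y
    gradient = error * np.concatenate(([1.0], x))
    return coefficients - learning_rate * gradient


def sgd(X, y, learning_rate=0.1, n_epochs=50, seed=0):
    """Fit the coefficients, visiting one random data point per update."""
    rng = np.random.default_rng(seed)
    coefficients = rng.normal(scale=0.01, size=X.shape[1] + 1)  # small random start
    for _ in range(n_epochs):
        for _ in range(len(X)):
            i = rng.integers(len(X))                            # pick a random point
            coefficients = update_weights(X[i], y[i], coefficients, learning_rate)
    return coefficients


# Synthetic, noise-free data: y = 0.5 + 3 * x (illustrative only).
data_rng = np.random.default_rng(1)
X_train = data_rng.uniform(-1.0, 1.0, size=(200, 1))
y_train = 0.5 + 3.0 * X_train[:, 0]

coefficients = sgd(X_train, y_train)
X_test = np.array([[0.0], [1.0]])
predictions = predict(X_test, coefficients)
print(coefficients, predictions)
```

On this toy problem the learned intercept and slope end up close to 0.5 and 3.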

    You can execute the code by copy-pasting it into an IPython notebook. You need to provide X_train, X_test, the learning rate, r, and the number of epochs. If you are not able to run the code, do let me know in the comment section; I will reply as soon as possible.

    Simple Exercise:

    1. Download the dataset from Kaggle:
    2. Perform all the above steps on this dataset.
    3. After performing the above steps, comment in the comment section and let us know the Root Mean Squared Error of your model.


    Top Skills You Must Not Avoid to Become a Great Data Scientist

    These days, data scientist is one of the most sought-after jobs among young people, and everybody is trying to pick up data science skills. Some skills, however, do not look crucial for becoming a good data scientist but are equally important. In this article, we will talk about the skills that will make you stand out from the crowd.

    Data Structure and Algorithms:

    In software engineering, data structures and algorithms (DSA) are among the most critical skills. As a data scientist, being good at DSA will make you something of a superhero. Most of the time, data scientists work with software engineering professionals, so this part of computer science is vital. Sometimes, instead of using a library implementation of an ML algorithm, you need to write the code from scratch. In such cases, DSA plays a crucial role.

    Maths and Statistics:

    Machine learning is, at its core, an application of mathematics and probability. If you are good at maths, it will help you understand an algorithm intuitively and choose which algorithm to use on a particular dataset. Statistics, in turn, lets us draw different insights from the dataset, which is especially beneficial during the pre-processing stage of a machine learning project.

    Core Computer Science Skills:

    As a data scientist, knowing the basics of computer science, such as operating systems, computer networks, and DBMS, will not only be highly helpful during interviews but also make you stand out from the crowd at your workplace. If you have time, I highly encourage you to learn at least how all these things work.

    Good Communication Skills:

    If you aspire to become a data scientist and you are not good at communication, it is a big red flag. As a data scientist, you will almost always be working with people whose expertise lies in some other domain and who do not understand technical terms. In that case, your communication skills are going to help you a lot. If you think your communication skills are not very good, try to improve them as soon as possible.

    A Good Story Teller:

    As a data scientist, being good at storytelling is the icing on the cake for your career. Whatever you are doing, you should be able to explain it in an interesting way.

    There are tons of data science and machine learning courses available on the internet. Most of them focus on machine learning algorithms and their library implementations. If you also have the skills mentioned above, it will give your career an excellent boost. If I have missed anything in this article, do let me know in the comment section.


    Feedback on Your Preparation for Data Science or Machine Learning Jobs (Mock Interview)

    In the first place, it is challenging to get an interview call for a machine learning profile. If you do get a call, it is essential to convert it into an offer. Sometimes we feel that our preparation is good enough to crack a machine learning interview when that is not actually the case. So in this interview feedback process, we will conduct a telephonic/Hangouts/Skype interview and give you feedback on your preparation for machine learning jobs. We are a group of IITians working at various top-notch product-based companies as machine learning engineers, and we have worked extensively on real-life machine learning use cases. This process is free; fill out the form given below, and we will get back to you as soon as possible.

    Everything You Need to Know about Machine Learning Syllabus to Become a Data Scientist?

    Data science and machine learning are fields in which many people want to build a career. But many do not know what to study to become a great machine learning engineer. There are tons of machine learning algorithms you can learn, but you need not learn all of them. In this article, we discuss the machine learning algorithms that you need to know to become a good machine learning engineer. We will cover each algorithm only briefly.

    Before diving deeper into the topic, we will first learn some basic terminology that will help us understand the syllabus better:


    Classification:

    It is a technique where we are given a fixed number of classes and, given a data point, have to predict which class it belongs to. For example, suppose we train a model to predict whether a given image is of a cat or a dog. This is called classification.


    Regression:

    It is a technique where, given a data point, we have to predict a real value corresponding to that data point, e.g., given the location and area of a house, predict the price of the house (a continuous variable).

    Supervised Algorithms:

    In this set of algorithms, each data point has a corresponding label, e.g., for each image we know whether it shows a cat or a dog.

    Unsupervised Algorithms:

    In this set of algorithms, the data points do not have labels, e.g., given an image, we do not know whether it shows a cat or a dog.

    Semi-Supervised Algorithms:

    This is a special set of algorithms where a small amount of the data is labeled and the rest is unlabeled. Now that we have covered some terminology, we will explore the machine learning algorithms according to their nature.

    Classical Machine Learning Algorithms:

    In this section of the post, we will talk about conventional machine learning algorithms. For this class of algorithms, we first need to extract features from the raw data and then feed them to the algorithm. These are long-established algorithms; many have been around since the 1980s-90s or earlier.

    Naive Bayes: 

    It is one of the most straightforward classical machine learning algorithms and works on the principle of Bayes' theorem. It is primarily used for classification.

    K-Nearest Neighbors: 

    K-NN is an easy-to-implement machine learning algorithm. It can be used for both classification and regression.
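
    To show how little code K-NN needs, here is a minimal sketch using Euclidean distance. The function names and the toy data are hypothetical; classification takes a majority vote among the k nearest neighbours, while regression averages their target values.

```python
import numpy as np
from collections import Counter

def nearest_indices(X_train, x, k):
    """Indices of the k training points closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(distances)[:k]

def knn_classify(X_train, y_train, x, k=3):
    """Majority vote among the labels of the k nearest neighbours."""
    votes = Counter(y_train[i] for i in nearest_indices(X_train, x, k))
    return votes.most_common(1)[0][0]

def knn_regress(X_train, y_train, x, k=3):
    """Average of the target values of the k nearest neighbours."""
    idx = nearest_indices(X_train, x, k)
    return float(np.mean(np.asarray(y_train, dtype=float)[idx]))

# Toy data: two well-separated groups of points.
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(X_train, labels, np.array([2.0, 2.0])))
print(knn_classify(X_train, labels, np.array([7.5, 8.0])))
print(knn_regress(X_train, values, np.array([2.0, 2.0])))
```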

    Logistic Regression: 

    It is among the most used classical machine learning algorithms. It can be seen as a variant of linear regression whose output is passed through a sigmoid function. Although its name contains "regression", it is used for classification. It has a beautiful probabilistic interpretation.
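
    The probabilistic interpretation can be sketched in a few lines: a linear score is squashed by the sigmoid into a probability between 0 and 1, which is then thresholded to get a class. The weights, bias, and data below are hypothetical values chosen for illustration, not a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of the positive class for each row of X."""
    return sigmoid(X @ w + b)

# Hypothetical weights and bias; in practice these are learned from data.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 2.0], [3.0, 0.5], [-2.0, 1.0]])

proba = predict_proba(X, w, b)
predicted_class = (proba >= 0.5).astype(int)  # threshold at 0.5
print(proba, predicted_class)
```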

    Linear Regression: 

    It is a classical machine learning algorithm that is used only for regression.

    Decision Tree: 

    A Decision Tree is a classical machine learning algorithm based on the core principle of simple if-else statements. It is highly interpretable.
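
    As a rough illustration of why a decision tree is easy to interpret, a small learned tree can be written out by hand as nested if-else statements. The features and thresholds below are made up for the example; a real tree learns them from the training data.

```python
def classify_animal(weight_kg, barks):
    """A hand-written stand-in for a tiny decision tree (hypothetical rules)."""
    if barks:                # first split: does the animal bark?
        return "dog"
    if weight_kg < 6.0:      # second split: a made-up weight threshold
        return "cat"
    return "dog"

print(classify_animal(4.0, barks=False))
print(classify_animal(20.0, barks=True))
```

    Each prediction can be explained by reading off the conditions that were true along the path.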

    Random Forest: 

    A Random Forest is a combination of many decision trees. It is less interpretable than a single decision tree because the final decision is based on the predictions of a bunch of trees. It can be used for both classification and regression.

    Support Vector Machine:

    Currently, it is among the most used classical machine learning algorithms. SVM can be used for classification as well as regression. What makes it different from other algorithms is the kernel trick (you will learn about it when you study the math behind support vector machines).


    In the series of classical machine learning algorithms, it is state of the art. In most of the competitions, it is highly useful. One of the drawbacks with it is that it has lots of hyperparameters. It can be trained using backpropagation, so we can use GPUs to train the model, unlike other classical machine learning algorithms. It can be used for both Regression as well as Classification.

    Unsupervised Algorithms

    These algorithms are mainly used to extract patterns from data; corresponding to each data point, we don't have any label. To cover this part of machine learning, also called data mining, we must know the following algorithms:
    1. K-Means++
    2. Hierarchical Clustering
    3. K-Medoids
    4. DBSCAN clustering

    Time Series Algorithms: 

    This is the set of algorithms used for prediction on data that varies with time, such as stock prices. We can learn the following algorithms to cover this part of machine learning, though these are older approaches that are often not used in production.

    1. Auto-Regressive algorithm
    2. Moving Average Algorithm
    3. Auto-Regressive Moving Average Algorithm
    4. Auto-Regressive Integrated Moving Average Algorithm
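
    To give a feel for the simplest of these, here is a minimal sketch of a first-order auto-regressive model, AR(1), which predicts the next value as `c + phi * previous_value`. The least-squares fit, the function name `fit_ar1`, and the toy series are assumptions for illustration; dedicated time series libraries offer more complete implementations.

```python
import numpy as np

def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_{t-1}."""
    series = np.asarray(series, dtype=float)
    x_prev, x_next = series[:-1], series[1:]
    design = np.column_stack([np.ones_like(x_prev), x_prev])
    coef, *_ = np.linalg.lstsq(design, x_next, rcond=None)
    c, phi = coef
    return float(c), float(phi)

# Toy series generated exactly by x_t = 2 + 0.5 * x_{t-1}, starting at 10.
series = [10.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
next_value = c + phi * series[-1]  # one-step-ahead forecast
print(c, phi, next_value)
```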

    Optimization Techniques:

    In every machine learning algorithm, there is a loss function that we need to optimize to reach the optimal point. The optimal point is the point at which the loss is as small as possible, with the aim that the model also has as little error as possible on the test dataset. Below is the set of algorithms used for optimization:

    1. Gradient Descent
    2. Stochastic Gradient Descent
    3. Mini Batch Stochastic Gradient Descent
    4. Adagrad (mainly used for neural networks)
    5. Adadelta
    6. RMSProp
    7. Adam

    Dimensionality Reduction Algorithms:

    In real life, we often have datasets with very high dimensions. This has various drawbacks, such as the curse of dimensionality, high training and testing time, and heavy memory requirements to fit the data into memory. Using this set of algorithms helps us reduce the dimension of each data point in the dataset without losing much information. Below are some of the algorithms in this category:

    1. Principal Component Analysis
    2. t-SNE (can also be used for nice data visualizations)
    3. Truncated SVD
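
    As a small illustration of reducing dimensions without losing much information, here is a sketch of Principal Component Analysis built on NumPy's SVD. The function name `pca` and the toy data (three features that all depend on one hidden variable) are assumptions for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]              # principal directions
    explained = (S ** 2) / np.sum(S ** 2)       # share of variance per component
    return X_centered @ components.T, components, explained[:n_components]

# Toy data: 3 features that are almost entirely driven by one hidden variable t.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=200), -t])

X_reduced, components, ratio = pca(X, n_components=1)
print(X_reduced.shape, ratio)
```

    Here a single component keeps almost all of the variance, so each 3-dimensional point can be replaced by one number.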

    Deep Learning Approaches:

    Today, most large organizations have access to huge amounts of historical data and also have the computational power to process it. For these two reasons, in many real-life scenarios, deep learning approaches work far better than the classical machine learning algorithms discussed above. Although deep learning is a hot area of research, learning the topics below will help you in most tasks:

    Convolutional Neural Network:

    These are state of the art for various computer vision tasks such as image classification. Under this category, you can study the various algorithms and pre-trained architectures mentioned below:

    1. VGG 16, VGG 19, ResNet 152, etc 
    2. RCNN, FRCNN, YOLO, etc

    Recurrent Neural Network:

    In real life, we see a huge amount of sequential data, where the current point in the data depends on some previous point. Take any English sentence as an example: almost all the time, the current word depends on the previous words. RNNs work well on such sequential data.
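
    The dependence on previous points can be sketched with the forward pass of a basic (Elman-style) RNN cell, where the hidden state at each step is computed from the current input and the previous hidden state. The weight names, sizes, and random values below are hypothetical; a real network would learn the weights during training.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])                   # initial hidden state
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # new state depends on input and old state
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)
```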

    Long Short Term Memory: 

    LSTM is a variant of the RNN designed for the same kind of sequential data. It adds a memory cell controlled by input, forget, and output gates, which helps it retain information over longer sequences than a plain RNN.

    Gated Recurrent Unit:

    GRU is similar to LSTM, with slight differences. If you know how a GRU works, it will help you a lot in understanding various other algorithms as well.


    Encoder-Decoder:

    In some real-world applications, the length of the input and output in the dataset is not fixed. Take language translation as an example: suppose we want to convert an English sentence into its corresponding Hindi sentence. For different English sentences, the Hindi translation will have a different length. Encoder-decoder models are used to handle such variable-length inputs and outputs. There are tons of other applications of encoder-decoders as well. Below are some other concepts that you need to know to consider yourself a deep learning expert.

    1. Dropout
    2. Batch Normalization
    3. Weight Initialization Techniques (Usage and drawbacks)
    4. Activation Functions (the drawbacks of particular activation functions and when to use each)

    There is no fixed syllabus for deep learning. It's a massive area of research, and new topics are added every day. So try to keep yourself updated with the latest advancements. The best way to learn about the latest tools and techniques is to read the latest research papers in that particular field.

    In this article, we have discussed the machine learning and deep learning syllabus. If you are comfortable with all the techniques described above, along with the maths behind them, you can consider yourself a good data scientist.

    If you think you are comfortable with all these things, you can fill out this Google form. We will interview you and, based on your performance in the discussion, give you feedback on your machine learning or deep learning skills.

    If we have assigned any algorithm to the wrong group, please do let me know in the comment section.
