Whenever we are doing some data analysis, probability is one of the most important tools.

Take the example of an organization. Suppose 100k employees work in the organization and we have to order T-shirts for all of them. The problem is that we don’t know every employee’s T-shirt size, and collecting the size from each employee is very difficult. Suppose the T-shirt sizes are S, M, L, and XL.

Now we have the following questions:

How many S size t-shirts should we order?

How many M size t-shirts should we order?

One of the simplest solutions is to go to each employee and ask for their T-shirt size. But this process would take a very long time, so in such scenarios distributions come in very handy.

From domain knowledge, we know that T-shirt size depends on the height of the person. So at the entry gate, we can measure the height of 500 randomly chosen employees. From this sample, we can calculate the mean and standard deviation of height.

Height ~ N ( Mean, Variance)

We can extend this distribution of 500 employees to all 100k employees and calculate the number of people in each height range, and hence their T-shirt size (by mapping height to T-shirt size). We are making a lot of assumptions here, but in general the results are still mostly correct. As the sample size increases, the accuracy of the result also increases.
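The estimate described above can be sketched in a few lines of Python. This is only an illustration: the sample mean, standard deviation, height cutoffs, and size labels below are assumed values, not figures from a real survey, and `estimate_shirt_orders` is a hypothetical helper name.

```python
from statistics import NormalDist

def estimate_shirt_orders(mean_cm, stdev_cm, cutoffs_cm, sizes, headcount):
    """Estimate how many shirts of each size to order.

    cutoffs_cm holds the (assumed) height boundaries between consecutive
    sizes, so len(sizes) must equal len(cutoffs_cm) + 1.
    """
    heights = NormalDist(mu=mean_cm, sigma=stdev_cm)
    # Cumulative probability at each boundary, padded with 0 and 1
    # for the open-ended smallest and largest sizes.
    edges = [0.0] + [heights.cdf(c) for c in cutoffs_cm] + [1.0]
    orders = {}
    for size, low, high in zip(sizes, edges, edges[1:]):
        # Share of employees in this height band, scaled to the headcount.
        orders[size] = round(headcount * (high - low))
    return orders

# Assumed statistics from the 500 sampled employees (illustrative only).
orders = estimate_shirt_orders(
    mean_cm=170.0,
    stdev_cm=8.0,
    cutoffs_cm=[160.0, 170.0, 180.0],   # hypothetical height-to-size mapping
    sizes=["S", "M", "L", "XL"],        # illustrative size labels
    headcount=100_000,
)
print(orders)
```

The design choice here is to work with the cumulative distribution function: the share of people in a height band is the difference between the CDF values at its two boundaries.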

If you find any mistakes in this article, do provide your valuable suggestions in the comment section. We will be happy to correct them.

By looking at the shape of the data, we can learn a lot about the dataset.

Imagine a Histogram of heights (of many people).

Now imagine the bars getting thinner and thinner as the bins get smaller and smaller, until they are so thin that the outline of the histogram looks like a smooth line. This is possible because height is a continuous quantity, so there are infinitely many possible values.

If we let the bars become infinitely thin, we get a smooth curve, also known as the DISTRIBUTION OF DATA.

In simple words, DISTRIBUTION represents all possible values for a set of data, and how often those values occur.
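The idea of shrinking the bins can be tried out with a small sketch. It uses simulated heights (an assumption, drawn from a normal distribution with mean 170 and standard deviation 8) and scales each bar so that the total area of the histogram is 1, which lets the bar heights be compared with the smooth curve. `density_histogram` is a hypothetical helper written for this example.

```python
import random
from statistics import NormalDist

def density_histogram(samples, bin_width):
    """Map each bin's left edge to a bar height scaled so all bar areas sum to 1."""
    counts = {}
    for x in samples:
        left_edge = (x // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    total = len(samples)
    return {edge: c / (total * bin_width) for edge, c in sorted(counts.items())}

random.seed(0)  # fixed seed so the run is repeatable
heights = [random.gauss(170.0, 8.0) for _ in range(50_000)]  # simulated data
curve = NormalDist(mu=170.0, sigma=8.0)

for width in (10.0, 5.0, 1.0):
    bars = density_histogram(heights, width)
    left_edge = (170.0 // width) * width   # the bin that starts at the mean
    centre = left_edge + width / 2
    print(f"bin width {width:>4}: bar = {bars.get(left_edge, 0.0):.4f}, "
          f"curve at bin centre = {curve.pdf(centre):.4f}")
```

As the bin width shrinks, each bar covers a narrower slice of the curve, so its height tracks the curve more closely, at the cost of more sampling noise per bar.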

TYPES OF DISTRIBUTION:

We will see various distributions and their properties in detail:

BERNOULLI DISTRIBUTION

Any event where we have 1 trial and two possible outcomes follows Bernoulli Distribution.

Such events include a coin flip, a single true-or-false quiz question, or a vote between two candidates.

Usually, when we are dealing with the Bernoulli Distribution, we either:

Know the probability of one of the events occurring, or

Have past data indicating some experimental probability.

In either case, the graph of Bernoulli is very simple. It consists of only two bars, one for each of the possible outcomes. One bar would rise up to its associated probability of P, and the other one would only reach 1-P.

So we can define the Bernoulli Distribution using the below equation:

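Probability using Bernoulli Distribution: P(X = x) = p^x * (1 - p)^(1 - x), where x is 1 for the outcome that has probability p and 0 for the other outcome.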

Bernoulli Graph
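As a minimal sketch, the standard Bernoulli probability mass function, P(X = x) = p^x * (1 - p)^(1 - x), can be written directly in Python. The function name `bernoulli_pmf` and the fair-coin value of p are assumptions made for illustration.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a single trial whose outcome 1 has probability p."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    # x = 1 selects p, x = 0 selects (1 - p).
    return p**x * (1 - p) ** (1 - x)

p_heads = 0.5  # assumed fair coin
# The two bars of the Bernoulli graph: one per outcome.
bars = {outcome: bernoulli_pmf(outcome, p_heads) for outcome in (0, 1)}
print(bars)
```

The two bar heights always add up to 1, whatever value of p is chosen.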

UNIFORM DISTRIBUTION

The Uniform Distribution can be understood as an extension of the Bernoulli Distribution. Here, any number of outcomes is allowed, and all outcomes have equal probability.

One such event is rolling a die. When we roll a six-sided die, we have equal chances of getting any value from 1 to 6.

The graph of this distribution would have 6 equally tall bars, all reaching up to one sixth.

Uniform Distribution Graph

Many events in gambling provide such odds, where each individual outcome is equally likely.

One drawback of the Uniform Distribution is that the expected value gives us no relevant information, because all outcomes have the same probability.

The Probability Density Function, Mean, and Variance for Continuous Uniform Distribution:

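In the dice example, the output lies in the range [1, 6]. Similarly, if a continuous outcome lies in the range [a, b], the probability density function is:

PDF of Uniform Distribution: f(x) = 1/(b - a) for a <= x <= b, and 0 otherwise

Mean of the Distribution: E[X] = (a + b)/2

Variance of the Distribution: Var(X) = (b - a)^2 / 12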

For random variable X:

X~U(0,23)

Find P(2 < X < 18):

P(2 < X < 18) = (18-2)*(1/(23-0)) = 16/23.
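The same calculation can be checked with a short sketch. It relies on the standard results for a continuous uniform distribution on [a, b]: density 1/(b - a), mean (a + b)/2, and variance (b - a)^2/12. The function names are hypothetical and chosen only for this example.

```python
def uniform_probability(a, b, low, high):
    """P(low < X < high) for X ~ U(a, b)."""
    if a >= b:
        raise ValueError("need a < b")
    # Only the part of (low, high) that overlaps [a, b] carries probability.
    low, high = max(low, a), min(high, b)
    return max(high - low, 0) / (b - a)

def uniform_mean(a, b):
    """Mean of U(a, b): the midpoint of the interval."""
    return (a + b) / 2

def uniform_variance(a, b):
    """Variance of U(a, b)."""
    return (b - a) ** 2 / 12

# The worked example above: X ~ U(0, 23), P(2 < X < 18) = 16/23.
print(uniform_probability(0, 23, 2, 18))
print(uniform_mean(0, 23), uniform_variance(0, 23))
```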

BINOMIAL DISTRIBUTION

In essence, Binomial events are a sequence of identical Bernoulli events.

We can also express a Bernoulli Distribution as a Binomial Distribution with a single trial.

To better understand the difference between the two, consider the following scenario:

You go to class, and your teacher gives the class a surprise quiz that you have not prepared for.

Luckily, the quiz contains 10 true or false questions.

In this case, guessing a single true or false question is a Bernoulli event, and guessing the whole test is a Binomial event.

The Binomial Distribution is the probability distribution of a sequence of experiments where each experiment produces a binary outcome and where each of the outcomes is independent of all the others.

Binomial Distribution can help you answer a question like this:

“If you flip a coin 20 times, what is the probability of getting 8 heads?”

The parameters of a binomial distribution are n and p where n is the total number of trials and p is the probability of success in each trial.

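The formula for the probability of an event occurring k times in a total of n trials is given below (the probability of the event occurring in one trial is p, so the chance of it not occurring is (1 - p)):

P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), where C(n, k) = n! / (k! * (n - k)!) is the number of ways to choose which k of the n trials are successes.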

The graph of the Binomial Distribution represents the likelihood of attaining our desired outcome a specific number of times. If we run n trials, our graph would consist of n + 1 bars, one for each value from 0 to n.

Binomial graph for 10 trials

So from the above plot, we can see that event A occurring five times and event B occurring five times is the most probable of the eleven possible results, as the bar corresponding to 5 is the tallest.
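The coin-flip question and the 10-trial graph can both be reproduced with a small sketch of the standard binomial formula, P(X = k) = C(n, k) * p^k * (1 - p)^(n - k). It assumes a fair coin (p = 0.5), and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# "If you flip a coin 20 times, what is the probability of getting 8 heads?"
p_8_heads = binomial_pmf(8, 20, 0.5)
print(f"P(8 heads in 20 flips) = {p_8_heads:.4f}")

# The n + 1 bars of the graph for 10 trials, and the tallest one.
bars = [binomial_pmf(k, 10, 0.5) for k in range(11)]
most_likely = max(range(11), key=lambda k: bars[k])
print(f"most likely count in 10 trials: {most_likely}")
```

With a fair coin, the answer to the 20-flip question comes out at roughly 0.12, and the tallest bar for 10 trials sits at 5.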

NORMAL DISTRIBUTION

It is one of the most commonly found continuous distributions. The Normal Distribution frequently occurs in nature, as well as in daily life, in various shapes and forms.

Some Examples of Normal Distribution:

Pizza delivery time follows the Normal Distribution.

The height of people in the world follows the Normal Distribution.

Marks in our university exams follow the Normal Distribution.

The weight of an animal species, say lions, follows the Normal Distribution.

If we analyze the marks of an entire class in some exam, we find that very few students get either very high or very low marks. Most students get average marks, so the marks of most students lie around the class mean. The few students who get extremely high or extremely low marks are considered outliers, because they do not follow the trend.

Normal Distribution is defined as N( mu, sigma ), where mu is the mean of the Distribution, and sigma is the standard deviation of the Distribution.

Probability Density Function of Normal Distribution:

PDF of Normal Distribution: f(x) = (1 / (sigma * sqrt(2 * pi))) * e^(-(x - mu)^2 / (2 * sigma^2))

Now that you know what types of events follow a Normal Distribution, let us examine some of its distinct characteristics.

For starters, the graph of Normal Distribution is bell-shaped. Therefore, the majority of data is centered around the mean. Thus, values further away from the mean are less likely to occur.

NORMAL DISTRIBUTION GRAPH

Furthermore, we can see that the graph is symmetric about the mean. This means that values equally far from the mean in opposite directions are equally likely.

Another peculiarity of the Normal Distribution is the “68, 95, 99.7” law.

This law states that for any normally distributed event, about 68% of all outcomes fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations, and about 99.7% fall within 3.

To understand this, let us take an example. Suppose the maximum and minimum marks in an exam are 240 and 60, respectively, the class mean is m = 150, and the standard deviation is sigma = 30. Then about 68% of the students who took the exam will have marks in the range (m - sigma, m + sigma), i.e. (120, 180), approximately 95% will have marks in the range (m - 2*sigma, m + 2*sigma), i.e. (90, 210), and so on. Points outside (m - 3*sigma, m + 3*sigma), here (60, 240), are generally considered outliers, so such points are often removed during data preprocessing.
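The percentages in this example can be verified with Python's standard library. This is a sketch that assumes the marks are exactly normally distributed with mean 150 and standard deviation 30; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

marks = NormalDist(mu=150, sigma=30)  # class mean and standard deviation

def share_within(dist, k):
    """Fraction of outcomes within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high = marks.mean - k * marks.stdev, marks.mean + k * marks.stdev
    print(f"within {k} sigma, ({low:.0f}, {high:.0f}): {share_within(marks, k):.4f}")
```

The three printed shares should be close to 0.68, 0.95, and 0.997, and they do not depend on the particular mean and standard deviation chosen.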

The last part emphasizes the fact that outliers are extremely rare in a Normal Distribution. It also shows how much we know about a dataset just from knowing that it is Normally Distributed.

POISSON DISTRIBUTION

The Poisson Distribution deals with the frequency with which the events occur in a specific interval.

Instead of the probability of an event, the Poisson Distribution requires knowing how often the event occurs in a specific period of time or interval.

For example, a firefly might light up 3 times in 10 seconds on average. We should use a Poisson Distribution if we want to determine the likelihood of it lighting up 8 times in 20 seconds.

So the Poisson distribution may be useful to model events such as:

The number of meteorites greater than 1-meter diameter that strike Earth in a year

The number of patients arriving in an emergency room between 10 and 11 pm

The number of laser photons hitting a detector in a particular time interval

The graph of Poisson distribution plots the number of instances the event occurs in a standard interval of time and the probability for each one.

Poisson Distribution Graph

An event can occur a whole number of times, such as 0, 1, 2, …, in a given time interval. Suppose the average number of events in the time interval is lambda, which is also called the rate parameter. Then the probability of observing k events in the interval is given by the equation:

P(k events in the interval) = (lambda^k * e^(-lambda)) / k!

where

Lambda is the average number of events per interval

e is Euler's number, approximately 2.71828

k takes values 0, 1, 2, …

k! = k × (k − 1) × (k − 2) × … × 2 × 1 is the factorial of k.
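The firefly example from earlier can be worked through with the standard Poisson formula, P(k) = lambda^k * e^(-lambda) / k!. The sketch below assumes the rate scales linearly with time, so 3 flashes per 10 seconds becomes lambda = 6 for a 20-second interval; `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average is lam."""
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    return lam**k * exp(-lam) / factorial(k)

# 3 flashes per 10 seconds on average, so 6 per 20 seconds (assumed scaling).
rate_per_20s = 3 * (20 / 10)
p_8_flashes = poisson_pmf(8, rate_per_20s)
print(f"P(8 flashes in 20 s) = {p_8_flashes:.4f}")
```

Under these assumptions, the chance of exactly 8 flashes in 20 seconds is a little over 10%.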

Thus, our graph would always start from 0, since no event can happen a negative number of times.

If you find any discrepancy in the post, please let us know in the comment section. Your suggestions are highly welcome.