Whenever we are doing some data analysis, probability is one of the most important tools.

Take the example of an organization. Suppose 100k employees work there and we have to order T-shirts for all of them. The problem is that we don’t know every employee’s T-shirt size, and collecting it from all of them is very difficult. Let the T-shirt sizes be S, M, L, and XL.

Now we have the following questions:

How many S size t-shirts should we order?

How many M size t-shirts should we order?

One of the simplest solutions is to go to each employee and ask for their T-shirt size. But this process would take a very long time, so in such scenarios distributions come in very handy.

From domain knowledge, we know that T-shirt size depends on the height of the person. So at the entry gate, we can measure the height of 500 randomly chosen employees. From this sample, we can calculate the mean and standard deviation of height.

Height ~ N ( Mean, Variance)

We can extend this distribution of 500 employees to all 100k employees and estimate the number of people in each height range, and hence each T-shirt size (by mapping height to T-shirt size). We are making several assumptions here, for example that height is normally distributed and that the sample is representative, but in general the results are still mostly correct. As the sample size increases, the accuracy of the estimate also increases.
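The estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 500 heights are synthetic (generated from a seeded random generator), and the height cut-offs between sizes are hypothetical values, not figures from the article.

```python
import random
from statistics import NormalDist, mean, stdev

# Synthetic stand-in for the 500 heights (cm) measured at the entry gate.
rng = random.Random(0)
sample_heights = [rng.gauss(170, 8) for _ in range(500)]

# Fit Height ~ N(mean, variance) from the sample.
height_dist = NormalDist(mean(sample_heights), stdev(sample_heights))

# Hypothetical height cut-offs (cm) separating the sizes S | M | L | XL.
cutoffs = [162, 172, 182]
sizes = ["S", "M", "L", "XL"]
total_employees = 100_000

# Share of employees in each height band, from differences of the CDF.
edges = [0.0] + [height_dist.cdf(c) for c in cutoffs] + [1.0]
orders = {
    size: round(total_employees * (hi - lo))
    for size, lo, hi in zip(sizes, edges, edges[1:])
}
print(orders)
```

Changing the cut-offs or the sample only changes the inputs; the mapping from fitted distribution to order quantities stays the same.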


Whenever we need to find out the type of relationship between two variables (columns) in a dataset, the concept of covariance comes into the picture. It measures how two random variables vary together.

No.   X     Y
1     x1    y1
2     x2    y2
3     x3    y3
4     x4    y4
5     x5    y5
6     x6    y6

We can calculate the covariance between X and Y using the formula below:
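The formula image is not available here, so the sketch below uses the standard definition: the sum of (x_i − mean of X) × (y_i − mean of Y), divided by n − 1 for a sample (or by n for a whole population). Which of the two denominators the original formula used is not known, so the choice is exposed as a parameter. The function name and the six data points are illustrative assumptions.

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired observations.

    Divides by n - 1 (sample covariance) by default, or by n when
    sample=False (population covariance).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

# Hypothetical values standing in for x1..x6 and y1..y6.
X = [1, 2, 3, 4, 5, 6]
Y = [2, 4, 5, 4, 7, 8]
print(covariance(X, Y))  # 3.8
```

A positive value means X and Y tend to increase together, a negative value means one tends to fall as the other rises, and a value near zero suggests no linear relationship.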

By looking at the shape of the data, we can learn a lot about the dataset.

Imagine a Histogram of heights (of many people).

Now, imagine the bars getting thinner and thinner and the bins getting smaller and smaller, until they are so thin that the outline of the histogram looks like a smooth line. This is possible because height is a continuous quantity, so there are infinitely many possible values.

If we let the bars become infinitely thin, we get a smooth curve (like the last image), also known as the DISTRIBUTION OF DATA.

In simple words, DISTRIBUTION represents all possible values for a set of data, and how often those values occur.

TYPES OF DISTRIBUTION:

We will see various distributions and their properties in detail:

BERNOULLI DISTRIBUTION

Any event where we have 1 trial and two possible outcomes follows Bernoulli Distribution.

Such events include a coin flip, a single true or false quiz question, or a vote between two candidates.

Usually, when we are dealing with the Bernoulli Distribution, we either:

Know the probability of one of the events occurring, OR

Have past data indicating some experimental probability.

In either case, the graph of Bernoulli is very simple. It consists of only two bars, one for each of the possible outcomes. One bar would rise up to its associated probability of P, and the other one would only reach 1-P.

So we can define the Bernoulli Distribution using the below equation:

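Probability using Bernoulli Distribution: P(X = x) = p^x × (1 − p)^(1 − x), where x is 1 for the outcome with probability p and 0 for the other outcome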

Bernoulli Graph
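The two-bar picture can be reproduced with a few lines of Python. This is a minimal sketch: the function name `bernoulli_pmf` and the value p = 0.3 are illustrative assumptions.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for a single trial; x is 0 or 1."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return p**x * (1 - p) ** (1 - x)

p_success = 0.3  # assumed probability of the outcome labelled 1

# Heights of the two bars of the Bernoulli graph.
bars = {outcome: bernoulli_pmf(outcome, p_success) for outcome in (0, 1)}
print(bars)
```

The two bar heights always add up to 1, because one of the two outcomes must occur.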

UNIFORM DISTRIBUTION

The Uniform Distribution can be understood as an extension of the Bernoulli Distribution. Here any number of outcomes is allowed, and all outcomes have equal probability.

One such event is rolling a die. When we roll a six-sided die, we have equal chances of getting any value from 1 to 6.

The graph of this distribution would have six equally tall bars, all reaching up to one sixth.

Uniform Distribution Graph

Many events in gambling provide such odds, where each individual outcome is equally likely.

One drawback of the Uniform Distribution is that the expected value gives us no relevant information, because all outcomes have the same probability.

The Probability Density Function, Mean, and Variance for Continuous Uniform Distribution:

In the dice example, the outcomes lie in the range [1, 6]. Similarly, if the outcomes lie in a continuous range [a, b], the probability density function is:

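PDF of Uniform Distribution: f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise

Mean of the Distribution: (a + b)/2

Variance of the Distribution: (b − a)^2 / 12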

For random variable X:

X~U(0,23)

Find P(2 < X < 18):

P(2 < X < 18) = (18-2)*(1/(23-0)) = 16/23.
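The worked example above can be checked with exact fractions. This is a small sketch; the helper name `uniform_prob` is an assumption, and the mean and variance use the standard results (a + b)/2 and (b − a)^2/12 for a continuous uniform distribution.

```python
from fractions import Fraction

def uniform_prob(lo, hi, a, b):
    """P(lo < X < hi) for X ~ U(a, b), with integer endpoints."""
    lo, hi = max(lo, a), min(hi, b)  # only the overlap with [a, b] counts
    if hi <= lo:
        return Fraction(0)
    return Fraction(hi - lo, b - a)  # interval width times the density 1/(b - a)

a, b = 0, 23
prob = uniform_prob(2, 18, a, b)
mean = Fraction(a + b, 2)
variance = Fraction((b - a) ** 2, 12)

print(prob)      # 16/23
print(mean)      # 23/2
print(variance)  # 529/12
```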

BINOMIAL DISTRIBUTION

In essence, Binomial events are a sequence of identical Bernoulli events.

We can also express a Bernoulli Distribution as a Binomial Distribution with a single trial.

To better understand the difference between the two, consider the following scenario:

You go to class, and your teacher gives the class a surprise quiz, which you have not prepared for.

Luckily, the quiz contains 10 true or false questions.

In this case, guessing a single true or false question is a Bernoulli event, and guessing the whole quiz is a Binomial event.

The Binomial Distribution is the probability distribution of a sequence of experiments where each experiment produces a binary outcome and where each of the outcomes is independent of all the others.

Binomial Distribution can help you answer a question like this:

“If you flip a coin 20 times, what is the probability of getting 8 heads?”

The parameters of a binomial distribution are n and p where n is the total number of trials and p is the probability of success in each trial.

The formula to calculate the probability of an event occurring k times in n trials is given below (the probability of the event occurring in a single trial is p, so the probability of it not occurring is 1 − p):

P(X = k) = C(n, k) × p^k × (1 − p)^(n − k), where C(n, k) = n! / (k! × (n − k)!)

The graph of the Binomial Distribution represents the likelihood of attaining our desired outcome a specific number of times. If we run n trials, the graph consists of n + 1 bars, one for each value from 0 to n.

Binomial graph for 10 trials

From the above plot, we can see that the desired outcome occurring exactly five times (and the other outcome occurring the remaining five times) has the highest probability among the eleven possible counts, as the bar corresponding to 5 is the tallest.
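Both the coin-flip question and the ten-trial graph can be reproduced with a short sketch. It assumes a fair coin (p = 0.5) in both cases, which the text does not state explicitly; the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Coin question: 8 heads in 20 flips, assuming a fair coin.
p_8_heads = binomial_pmf(8, 20, 0.5)
print(round(p_8_heads, 4))  # 0.1201

# Bars of the graph for 10 trials (p = 0.5 assumed): one bar per count 0..10.
bars = [binomial_pmf(k, 10, 0.5) for k in range(11)]
most_likely = max(range(11), key=lambda k: bars[k])
print(most_likely)  # 5
```

With a different p the tallest bar moves away from 5, so the shape in the plot depends on that assumption.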

NORMAL DISTRIBUTION

It is one of the most commonly found continuous distributions. The Normal Distribution frequently occurs in nature, as well as in daily life, in various shapes and forms.

Some Examples of Normal Distribution:

Pizza delivery time follows the Normal Distribution.

The Height of the people in the world follows the Normal Distribution.

Marks in our university exams follow the Normal Distribution.

The weight of some animal, say the lion, follows the Normal Distribution.

If we analyze the marks of an entire class in some exam, we find that very few students get either very high or very low marks. Most students get marks close to the class mean. The students who get extremely high or extremely low marks are treated as outliers, because they do not follow the trend.

The Normal Distribution is written as N(mu, sigma^2), where mu is the mean of the distribution and sigma is its standard deviation (so sigma^2 is the variance).

Probability Density Function of Normal Distribution:

PDF of Normal Distribution: f(x) = (1 / (sigma × sqrt(2 × pi))) × e^(−(x − mu)^2 / (2 × sigma^2))
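As a minimal sketch, the density can be evaluated directly in Python. The function name `normal_pdf` and the example parameters (mean 170, standard deviation 8) are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x:
    exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The curve peaks at the mean and is symmetric around it.
mu, sigma = 170, 8  # hypothetical values
peak = normal_pdf(mu, mu, sigma)
left = normal_pdf(mu - 10, mu, sigma)
right = normal_pdf(mu + 10, mu, sigma)
print(peak > left, abs(left - right) < 1e-15)
```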

Now that you know what types of events follow a Normal Distribution, let us examine some of its distinct characteristics.

For starters, the graph of Normal Distribution is bell-shaped. Therefore, the majority of data is centered around the mean. Thus, values further away from the mean are less likely to occur.

NORMAL DISTRIBUTION GRAPH

Furthermore, we can see that the graph is symmetric about the mean. This means that values equally far from the mean in opposite directions are equally likely.

Another peculiarity of the Normal Distribution is the “68, 95, 99.7” rule.

This rule states that for any normally distributed variable, about 68% of all outcomes fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations, and about 99.7% within 3.

To understand this, let us take an example. From the above plot, suppose the maximum and minimum marks in an exam are 240 and 60, respectively, the class mean is m = 150, and the standard deviation is sigma = 30. Then about 68% of the students who appeared in the exam will have marks in the range (m – sigma, m + sigma), i.e. (120, 180), approximately 95% will have marks in the range (m – 2*sigma, m + 2*sigma), i.e. (90, 210), and so on. Points outside (m – 3*sigma, m + 3*sigma), i.e. (60, 240), are generally considered outliers, so such points are often removed during data preprocessing.

The last part emphasizes the fact that outliers are extremely rare in a Normal Distribution. It also shows how much we know about a dataset simply from knowing that it is Normally Distributed.
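The percentages in the rule can be verified numerically. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)); the sketch below applies that to the marks example (mean 150, standard deviation 30). The function name `prob_within` is an assumption.

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for any normally distributed X."""
    return math.erf(k / math.sqrt(2))

mean_marks, sigma = 150, 30
for k in (1, 2, 3):
    lo, hi = mean_marks - k * sigma, mean_marks + k * sigma
    print(f"within {k} sigma, marks in ({lo}, {hi}): {prob_within(k):.4f}")

# Share of points expected outside 3 sigma, i.e. flagged as outliers.
outlier_share = 1 - prob_within(3)
```

The three printed probabilities are roughly 0.6827, 0.9545, and 0.9973, so only about 0.27% of values are expected outside the 3-sigma range.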

POISSON DISTRIBUTION

The Poisson Distribution deals with the frequency with which the events occur in a specific interval.

Instead of the probability of a single event, the Poisson Distribution requires knowing how often the event occurs in a specific period of time or interval.

For example, a firefly might light up 3 times in 10 seconds on average. We should use a Poisson Distribution if we want to determine the likelihood of it lighting up 8 times in 20 seconds.

So the Poisson distribution may be useful to model events such as:

The number of meteorites greater than 1-meter diameter that strike Earth in a year

The number of patients arriving in an emergency room between 10 and 11 pm

The number of laser photons hitting a detector in a particular time interval

The graph of Poisson distribution plots the number of instances the event occurs in a standard interval of time and the probability for each one.

Poisson Distribution Graph

An event can occur any whole number of times, 0, 1, 2, …, in a given time interval. Suppose the average number of events in the interval is lambda, also called the rate parameter. Then the probability of observing k events in the interval is given by the equation:

P(k events in interval) = (lambda^k × e^(−lambda)) / k!

where

Lambda is the average number of events per interval

e is Euler’s number, approximately 2.71828

k takes values 0, 1, 2, …

k! = k × (k − 1) × (k − 2) × … × 2 × 1 is the factorial of k.

Thus, our graph would always start from 0, since no event can happen a negative number of times.
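The firefly example can be worked through with a short sketch. An average of 3 flashes per 10 seconds corresponds to lambda = 6 for a 20-second interval, assuming the rate stays constant; the function name `poisson_pmf` is illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(k events in an interval) = lam**k * exp(-lam) / k!."""
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    if lam <= 0:
        raise ValueError("lam must be positive")
    return lam**k * math.exp(-lam) / math.factorial(k)

rate_per_10s = 3
lam_20s = rate_per_10s * 2  # expected flashes in 20 seconds

p_8_flashes = poisson_pmf(8, lam_20s)
print(round(p_8_flashes, 4))  # 0.1033
```

So there is roughly a 10% chance of seeing exactly 8 flashes in 20 seconds.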


Sampling is the process of drawing a predetermined number of observations from a larger population. When the data is very large, it is difficult to work with the whole population, so we take a sample that represents the population and make predictions from it.

A sample is a smaller, manageable version of a larger group: a subset that carries the characteristics of the larger population. As a rule of thumb, the maximum sample size is usually taken to be around 10% of the population. For example, if you want to know the literacy rate of India, it is very difficult to collect data from each and every person in the country, so we collect samples randomly. Choosing a correct sample from the population is one of the most important tasks.

Entire Population

In this case, we must ensure that the data is highly random and not selected on any one ground, such as a particular state or gender, to avoid bias towards one category.

Sampling methods are of two types:

Probability Sampling:

In this method, with randomization, every element gets a known chance of being picked (an equal chance in the simplest case).

Non-Probability Sampling:

In this method, elements do not all get an equal (or even a known) chance of being selected.

Types of Sampling

1) Probability Sampling:

Probability sampling gives you the best chance to create a sample that is truly representative of the population. Using probability sampling means that you can employ statistical techniques like confidence intervals and margins of error to validate your results. The various types of probability sampling are discussed below:

a) Simple Random Sampling :

Simple Random Sampling is mainly used when we don’t have any prior knowledge about the target variable. In this type of sampling, all the elements have an equal chance of being selected.

Simple Random Sampling

An example of a simple random sample would be the names of 50 employees being chosen out of a hat from a company of 500 employees. A simple random sample is meant to be an unbiased representation of a group.

How do you do simple random sampling?

Define the population.

Choose your sample size.

List the population.

Assign numbers to the units.

Find random numbers.

Select your sample.
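The steps above can be sketched with Python’s standard `random` module, using the 50-out-of-500 employee example. The employee IDs and the seed are hypothetical; the seed is fixed only so the run is repeatable.

```python
import random

# Define and list the population: 500 hypothetical employee IDs.
employees = [f"employee_{i:03d}" for i in range(1, 501)]

# Choose the sample size.
sample_size = 50

# Draw without replacement; every employee has the same chance of selection.
rng = random.Random(42)
simple_random_sample = rng.sample(employees, k=sample_size)

print(simple_random_sample[:5])
```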

b) Systematic Sampling:

Here the elements for the sample are chosen at regular intervals from the population. First, all the elements are put together in a sequence. The selection is then systematic rather than random, except for the first element.

It is popular with researchers because of its simplicity. Researchers select items from an ordered population using a skip or sampling interval. For example, Saurabh can give a survey to every fourth customer that comes into the movie theatre.

How do you do systematic sampling?

Calculate the sampling interval (the number of households in the population divided by the number of households needed for the sample).

Select a random start between 1 and the sampling interval.

Repeatedly add the sampling interval to select subsequent households.
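A minimal sketch of these three steps follows. The population of 1,000 numbered households and the sample size of 100 are hypothetical, and the function name `systematic_sample` is an assumption.

```python
import random

def systematic_sample(population, sample_size, rng):
    """Pick every interval-th element after a random start."""
    interval = len(population) // sample_size  # sampling interval
    if interval < 1:
        raise ValueError("sample_size cannot exceed the population size")
    start = rng.randrange(interval)  # random start within the first interval (0-based)
    return [population[start + i * interval] for i in range(sample_size)]

households = list(range(1, 1001))  # hypothetical ordered list of households
rng = random.Random(1)
picked = systematic_sample(households, sample_size=100, rng=rng)

print(picked[:5])
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined by the interval.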

c) Stratified Sampling:

In stratified sampling, we divide the elements of the population into strata (small groups) based on a similarity measure. The elements are homogeneous within a group and heterogeneous across groups.

How do you do stratified sampling?

Divide the population into smaller subgroups, or strata, based on the members’ shared attributes and characteristics.

Take a random sample from each stratum in a number that is proportional to the size of the stratum.

Advantages of Stratified Sampling:

A stratified sample can provide greater precision than a simple random sample of the same size.

Because it provides greater precision, a stratified sample often requires a smaller sample, which saves money.

For example, one might divide a sample of adults into subgroups by age, like 18–29, 30–39, 40–49, 50–59, and 60 and above.

The sample size for each stratum (layer) is proportional to the size of the layer:

Sample size of a stratum = (size of the entire sample / population size) × stratum size.
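Proportional allocation can be sketched as follows. The age-group strata and their sizes are hypothetical, and the function name `stratified_sample` is an assumption. Because each stratum’s share is rounded to a whole number, the total can differ slightly from the requested sample size in general.

```python
import random

def stratified_sample(strata, total_sample_size, rng):
    """Draw from each stratum in proportion to its share of the population."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # stratum sample size = total sample size * stratum size / population size
        n = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical population of 1,000 adults split into age-group strata.
sizes = {"18-29": 300, "30-39": 250, "40-49": 200, "50-59": 150, "60+": 100}
strata = {
    group: [f"{group}_person_{i}" for i in range(count)]
    for group, count in sizes.items()
}

rng = random.Random(3)
sample = stratified_sample(strata, total_sample_size=100, rng=rng)
print({group: len(people) for group, people in sample.items()})
```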

d) Cluster Sampling:

In one-stage cluster sampling, the entire population is divided into clusters, and whole clusters are then randomly selected for the sample.

In two-stage cluster sampling, we first randomly select the clusters, combine them, and then randomly select samples from the combined clusters.

Cluster Sampling

How do you do cluster sampling?

Estimate a population parameter.

Compute sample variance within each cluster (for two-stage cluster sampling).

Compute standard error.

Specify a confidence level.

Find the critical value (often z-score or a t-score).

Compute margin of error.

NOTE: Cluster sampling is less expensive and quicker.
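The one-stage and two-stage selection described above can be sketched as follows. The villages, household labels, and function names are hypothetical, and the sketch covers only the selection step, not the estimation steps in the list.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

def two_stage_cluster_sample(clusters, n_clusters, n_units, rng):
    """Pick clusters at random, pool them, then sample units from the pool."""
    pooled = one_stage_cluster_sample(clusters, n_clusters, rng)
    return rng.sample(pooled, n_units)

# Hypothetical population: 10 villages (clusters) of 20 households each.
clusters = {
    f"village{v}": [f"village{v}_house{h}" for h in range(1, 21)]
    for v in range(1, 11)
}

rng = random.Random(7)
one_stage = one_stage_cluster_sample(clusters, n_clusters=3, rng=rng)
two_stage = two_stage_cluster_sample(clusters, n_clusters=3, n_units=15, rng=rng)

print(len(one_stage), len(two_stage))  # 60 15
```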

e) Multi-Stage Sampling:

Here, we can see an example where states are divided into districts, which are further divided into villages and then households. In multi-stage sampling, the clusters are divided into groups and the groups into subgroups until they cannot be divided further.

Multi-Stage Sampling

How do you do multi-stage sampling?

Choose a sampling frame, considering the population of interest.

Select a sampling frame of relevant separate sub-groups.

Repeat the second step if necessary.

Using some variation of probability sampling, choose the members of the sample group from the sub-groups.

Advantages: cost, speed, and convenience (only a list of clusters and of individuals in the selected clusters is needed); it is also usually more accurate than cluster sampling for the same total sample size.

2) Non-Probability Sampling:

Non-probability sampling is a sampling technique in which the researcher selects samples based on subjective judgment rather than random selection, so the odds of any member being selected for the sample cannot be calculated.

Types of Non-Probability Sampling:

Type of Non-Probability Sampling

a) Convenience Sampling:

Convenience sampling, also known as availability sampling, is a specific type of non-probability sampling method. The sample is taken from a group of people who are easy to contact or reach. For example, standing at a mall or a grocery store and asking people to answer questions produces a convenience sample.

Convenience Sampling

The relative cost and time required to carry out a convenience sample are small in comparison to probability sampling techniques. This enables you to achieve the sample size you want in a relatively fast and inexpensive way. Limitations include data bias and inaccurate parameter estimates. Perhaps the biggest problem with convenience sampling is dependence, meaning that the sample items are all connected to each other in some way.

b) Judgment Sampling:

Judgment sampling is a common non-probability method, also called the purposive method. The researcher selects the sample based on their own judgment. It is usually an extension of convenience sampling.

Judgment Sampling

Judgment sampling may be used for a variety of reasons. In general, the goal of judgment sampling is to deliberately select units (e.g., individual people, events, objects) that are best suited to enable researchers to address their research questions. This is often done when the population of interest is very small, or the desired characteristics of units are very rare, making probability sampling infeasible.

c) Quota Sampling:

Quota sampling is a method of gathering representative data from a group. As opposed to random sampling, quota sampling requires that representative individuals are chosen out of a specific subgroup. For example, a researcher might ask for a sample of 50 females or 50 individuals between the ages of 32 and 43.

Quota Sampling

Quota sampling is used when time is short or the research budget is limited. It can also be used when detailed accuracy is not important. To create a quota sample, the population and the objective should be well understood.

d) Snowball Sampling:

As described in Leo Goodman’s (2011) comment, snowball sampling was developed by Coleman (1958-1959) and Goodman (1961) as a means for studying the structure of social networks.

Snowball sampling (also called chain sampling, chain-referral sampling, or referral sampling) is a non-probability sampling technique where existing study subjects recruit future subjects from among their acquaintances. Snowball sampling analysis is conducted once the respondents submit their feedback and opinions. It is used where potential participants are hard to find.

Snowball Sampling

Advantages of Snowball Sampling:

The chain referral process allows the researcher to reach populations that are difficult to sample when using other sampling methods. The process is simple and cost-efficient. This sampling technique needs little planning and a smaller workforce compared to other sampling techniques.

Disadvantages of Snowball Sampling:

The researcher has little control over the sampling method.

The representativeness of the sample is not guaranteed.

Sampling bias is also a concern for researchers using this technique.