In data analysis, probability is one of the most important tools.
Take the example of an organization. Suppose 100k employees work there and we have to order t-shirts for all of them. The problem is that we don't know every employee's t-shirt size, and collecting it from all of them is very difficult. Let the t-shirt sizes be S, M, L, and XL.
Now we have the following questions:
- How many S size t-shirts should we order?
- How many M size t-shirts should we order?
One of the simplest solutions is to go to each employee and ask for their t-shirt size. But this process would take a very long time, and in such scenarios distributions come in very handy.
From domain knowledge, we know that t-shirt size depends on a person's height. So at the entry gate, we can measure the height of 500 randomly chosen employees. From this sample, we can calculate the mean and standard deviation of their heights and model height as normally distributed:
Height ~ N(Mean, Variance)
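For reference, the two parameters are usually estimated from the sample with the standard formulas below. The symbols are notation introduced here for illustration: n = 500 is the sample size and x_1, ..., x_n are the measured heights.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
\text{Height} \sim N\!\left(\bar{x},\, s^2\right)
```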
We can extend this distribution from the 500 sampled employees to all 100k employees and calculate the number of people in each height range, and hence each t-shirt size (by mapping height to t-shirt size). We are making several assumptions here, for example that the sample is representative and that height is roughly normally distributed, but in general the results are still close enough to be useful. As the sample size increases, the accuracy of the estimate also increases.
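The idea can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a definitive method: the size labels, the height cut-offs in centimetres, and the simulated sample (drawn with an assumed mean of 172 cm and standard deviation of 8 cm) are all hypothetical stand-ins for real measurements and a real size chart.

```python
from statistics import NormalDist

# Hypothetical size chart: heights below 165 cm map to S, 165-175 to M,
# 175-185 to L, and 185 and above to XL. These cut-offs are assumptions.
SIZES = ["S", "M", "L", "XL"]
CUT_OFFS_CM = [165.0, 175.0, 185.0]


def estimate_orders(sample_heights, total_employees):
    """Estimate the number of shirts per size from a sample of heights."""
    # Fit a normal distribution with the sample mean and standard deviation.
    fitted = NormalDist.from_samples(sample_heights)
    # Cumulative probability at each cut-off, padded with 0 and 1 at the ends.
    cdfs = [0.0] + [fitted.cdf(c) for c in CUT_OFFS_CM] + [1.0]
    # Share of people in each band = difference of neighbouring CDF values.
    shares = [high - low for low, high in zip(cdfs, cdfs[1:])]
    orders = {size: round(share * total_employees)
              for size, share in zip(SIZES, shares)}
    return fitted, orders


# Simulated stand-in for the 500 heights collected at the entry gate.
sample = NormalDist(mu=172.0, sigma=8.0).samples(500, seed=42)
fitted, orders = estimate_orders(sample, total_employees=100_000)

print(f"estimated mean = {fitted.mean:.1f} cm, stdev = {fitted.stdev:.1f} cm")
print(orders)
```

Because each count is rounded separately, the totals may differ from 100,000 by a shirt or two. In practice one would replace the simulated sample with the measured heights and the cut-offs with the supplier's actual size chart.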