Probability Mass Function (PMF) in an Intuitive way

Probability is a topic that we all learned in high-school maths. Since you are reading this blog, those concepts have probably gathered some rust. Here, we will try to remove that rust and understand (or relearn) them from the machine learning point of view. There is just one prerequisite for this blog: basic probability. If you know why the probability of getting heads from tossing a fair coin is 1/2, then you are ready to learn about the Probability Mass Function!


By the end of this blog, you will know:

  1. What is probability? What is a probability distribution?
  2. What is a probability mass function?
  3. What is the expectation of a probability mass function (PMF)? How do we calculate it?
  4. What is the variance of a probability mass function? How do we calculate it?
  5. Why do we learn probability distributions for machine learning?

Feel free to skip ahead to any of these sections.

What is probability?

Take an example: tossing a coin! What happens when you flip a coin? It will either give you heads or tails (you are not a trick master who can make it stand on its edge). Now, how can you predict beforehand whether you will get heads? How can we know the future of an event? Probability can help us find that.

By definition,

Probability is a measure of the likelihood of an event to occur.

Now, if you toss a coin and you want to know the probability of getting heads for the next toss, how would you calculate it?

You would count the number of times the coin has been tossed, how many of those tosses came up heads, and how many came up tails. Then you would argue that the next toss will be heads or tails based on whichever outcome occurred more frequently.

So suppose you have a coin and you have tossed it 100 times, getting heads 64 times and tails 36 times. What do you think will happen on the 101st toss?

You now know the probability of getting heads: 64%. That doesn’t mean the coin will give you 64% of a head, right? It’s just a measure, the probability! If it were 100%, that would mean you are sure the coin will come up heads on the next toss. We could toss that coin billions of times and get the same result: in the end, (almost) 64% of all outcomes would be heads. (It doesn’t always work out this way for small numbers of tosses; can you think why?)
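If you want to see this for yourself, here is a small sketch (my own illustration, not from the original post, assuming the same 64%-biased coin) that estimates the frequency of heads for different numbers of tosses:

import random

random.seed(0)  # fix the seed so the run is reproducible

def heads_frequency(p, n_tosses):
    # Fraction of tosses that come up heads for a coin with bias p
    heads = sum(1 for _ in range(n_tosses) if random.random() < p)
    return heads / n_tosses

for n in (10, 100, 10_000, 1_000_000):
    print(n, heads_frequency(0.64, n))

# With only 10 tosses the frequency can land far from 0.64;
# with a million tosses it is almost exactly 0.64.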

Code for plotting Probability Mass Function

import matplotlib.pyplot as plt
import seaborn as sns
import random


def flip(p):
    # Return 'H' with probability p, 'T' otherwise.
    # random.random() gives a uniform value in [0, 1).
    return 'H' if random.random() < p else 'T'


def pmf(P, N, M):
    # Simulate the distribution of the number of heads.
    # P is the probability of heads for a single coin (the bias)
    # N is the number of coins in one set (here, just 1 coin)
    # M is the number of times we toss the set of coins
    l = []
    for m in range(M):
        flips = [flip(P) for i in range(N)]
        # Count the total number of heads in this sample
        l.append(flips.count('H'))
    return l

#################################################################
# Play with these numbers and observe the probability distributions
#################################################################

P = .64     # Probability of getting heads (bias)
N = 1       # No of coins in one set
M = 100     # No of times the set is tossed

plt.ylabel('PMF')
# set stat = 'count' to get the actual counts
# Set bins = no of outcomes (optional)
sns.histplot(pmf(P, N, M), stat = 'probability', bins = 2)

# Here we get a bar graph: the first bar is the probability of getting
# 0 heads (tails) and the second bar is the probability of getting 1 head
Probability Distribution for single coin

That’s all good and fine, but what if we toss 10 such coins simultaneously and want to know how many heads we would get? Simple probability becomes a little more interesting. For that, we need something more.

What is a probability distribution?

A probability distribution is a mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. In simple terms, this function will help us find the probability of all the events.

So, back to our previous question: how are we going to find the number of heads from tossing 10 coins?

It’s the same as before. Instead of tossing one coin at a time, we will flip 10 coins at a time, and this we will do multiple times (let’s say 1000 times).

In the end, to predict what we would get from tossing 10 coins, we will look at those 1000 results we have just collected.

Distribution for multiple outcomes

But here’s the catch: for our 1 coin, it was easy to calculate the number of heads/tails from 100 tosses. There were just two outcomes for that event, heads and tails. We got 64 heads out of 100 tosses, so the probability is 0.64. But now we have 11 different outcomes: we can get anywhere from 0 heads to all 10 heads out of 10 coin tosses. So here we will count the number of samples for each outcome. For example, suppose 20 of our samples (1 sample = 10 coins tossed) come up with 4 heads and 6 tails; then the probability of getting exactly 4 heads is 20/1000 = 0.02.

We can show this with code:


P = .64    # Keeping the same biased coin
N = 10     # No of coins in one sample is 10
M = 1000   # Total no of samples (no of times we tossed sample of 10)

plt.ylabel('PMF')
# Set bins = no of possible outcomes, here 11 (0 to 10 heads) (optional)
sns.histplot(pmf(P, N, M), stat='probability', bins=11)

# Here we get a bar graph denoting...
# the probability of getting a certain number of heads in a sample
Probability distribution for 10 coins

Probability Mass Function

A probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value.

The probability distribution for the 10-tosses-per-sample example is actually called a probability mass function. It is a special kind of probability distribution in which the random variable is discrete.
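To make the definition concrete, here is a tiny sketch (my own illustration, not part of the original post) of a PMF for a single toss of our biased coin, written as a plain mapping from each possible value to its probability. The 0.64/0.36 split comes from the toss counts we saw earlier.

# PMF of a single biased coin toss, built from the 64 heads / 36 tails observed earlier.
# The keys are the possible values of the random variable (number of heads in one toss),
# and the values are their probabilities.
coin_pmf = {0: 0.36, 1: 0.64}

def p_of(x):
    # Probability that the random variable is exactly equal to x
    return coin_pmf.get(x, 0.0)

print(p_of(1))  # 0.64
print(p_of(5))  # 0.0 -- five heads is impossible with a single coin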

There’s one more type of variable, called a continuous random variable. We will see that in detail later.

But for now, we need to know more about the PMF. There are a few properties we need to understand before going further. Why should you learn these properties? We will answer that question once we know these concepts.

What is the expected value of Probability Mass Function?

The expected value (or the expectation) of a random variable is the theoretical mean of the random variable.

To understand the expected value, we must learn about the mean itself. What is the mean of a set of numbers? We want a number that can describe all the numbers in that set. One way to find that number is to take the mean (average).

Let’s say there’s a set {1,1,1,1,2,2,3,3,3,3} and you want to know the mean of these numbers. One way to do that is to add them all up, 1+1+1+1+2+2+3+3+3+3, and divide by the total count. So, our mean would be 20/10 = 2.

We can arrange this differently: 1 repeats 4 times, 2 repeats 2 times and 3 repeats 4 times. To put it in a probabilistic perspective, we can say 1 appears 4/10 of the time, 2 appears 2/10 of the time and 3 appears 4/10 of the time.

    \[mean = \frac{1 + 1 + 1 + 1 + 2 + 2 + 3 + 3 + 3 + 3}{10} = \frac{1 \cdot 4 + 2 \cdot 2 + 3 \cdot 4}{10} = 1 \cdot \frac{4}{10} + 2 \cdot \frac{2}{10} + 3 \cdot \frac{4}{10} = 2\]

That is, to calculate the mean, we multiplied each number by its probability and then added up the products. (Hold that thought.)
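As a quick sanity check, here is a short sketch (mine, not part of the original text) that computes the mean of that set both ways: as a plain average and as a probability-weighted sum.

from collections import Counter

numbers = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]

# Plain average: sum everything and divide by the count
plain_mean = sum(numbers) / len(numbers)

# Probability-weighted sum: multiply each value by its probability and add up
counts = Counter(numbers)                      # {1: 4, 2: 2, 3: 4}
weighted_mean = sum(x * (c / len(numbers)) for x, c in counts.items())

print(plain_mean, weighted_mean)  # both print 2.0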

Now, let’s come back to our coin-tossing example: tossing a set of 10 coins 100,000 times.


import collections
import pandas as pd

P = .64
N = 10
M = 100000

l = pmf(P, N, M)
counter = collections.Counter(l)

df = pd.DataFrame(list(counter.items()), columns=['Heads', 'Count'])
df = df.sort_values('Heads').reset_index(drop=True)
df['Probability'] = df['Count'] / M

print(df)
distribution for 10 coin flips

What is the mean number of heads across all the trials? It is simple to calculate: we just have to sum the number of heads and then divide by the number of trials!

If we sum the number of heads across all the samples in the above data (multiplying each heads value by its count), we get 639901. If we divide by the number of samples (100,000), we get a mean of 6.39901. And that is our expectation!

Let’s put it more formally,


    \[E(X) = \frac{\sum_x Count(x) \cdot x}{M} = \sum_x \frac{Count(x)}{M} \cdot x\]

Finally,


    \[E(X) = \sum_x x \cdot P(x)\]

It’s easier to calculate the expectation from the above equation, as we just have to plug in the probability values (and the PMF already gives us those probabilities).
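Using the data frame df built in the previous snippet (so this sketch assumes df and M are still in scope), both forms of the formula become one-liners:

# E(X) = sum over x of x * P(x), using the empirical probabilities in df
expectation = (df['Heads'] * df['Probability']).sum()

# Equivalent: sum the raw head counts and divide by the number of samples
expectation_from_counts = (df['Heads'] * df['Count']).sum() / M

print(expectation, expectation_from_counts)  # both come out around 6.4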

In the end,

The expected value is a mean

The expected value is the weighted average of the possible values in the random variable (with weights given by their respective theoretical probabilities).

As you can see, expected value/expectation is another term for the mean of the distribution. Just like a mean, which describes a set of numbers, an expectation gives us an idea of the distribution. Because the expectation is the mean!

Now, let’s understand one more concept that gives us a different perspective about the distribution.

What is the variance of Probability Mass Function?

A measure of spread for the distribution of a random variable that determines the degree to which the values of the random variable differ from the expected value.

Variance is a measure of the spread of our data. In the data above, we have 2 samples where the heads count is 0 and 1151 samples with all 10/10 heads. To know how widely spread our data is, we could calculate the distance between the extremes. In this case, we would calculate the distance between the total heads in the 10/10-heads samples and the 0/10-heads samples.

(10 x 1151) - (0 x 2) = 11510

But this doesn’t really give us an idea about the data. What is in between these two values? We don’t know.

Calculate spread of the data

Another way to calculate the spread is to take the distance from the mean that we have just calculated, our expectation! But can you see the problem with that? The distances from values that are less than the mean will cancel out the distances from values that are more than the mean (negatives cancel positives). If the mean is in the middle and the data is symmetrically distributed, this value can come out as zero. That is a problem, because it won’t give us any information about the data.

Squared distance or absolute distance?

So, instead, we can take the square of the distances, add them up and take the mean. This way, we won’t have negative values that cancel out. But then why not take absolute distances and add them up? In theory, we could do that, but using squared distances gives a better idea of the spread itself. To read more about this argument, read this.

To make it clearer, we take the mean (average) of squared distances from the mean (expectation of the data).
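Here is a quick numerical illustration (my own, using the same small set from the mean example above) of why the signed distances are useless while the squared distances are not:

numbers = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
mu = sum(numbers) / len(numbers)               # the mean we computed earlier: 2.0

signed = sum(x - mu for x in numbers)           # negatives cancel positives
squared = sum((x - mu) ** 2 for x in numbers)   # every term is >= 0

print(signed)                  # 0.0 -- tells us nothing about the spread
print(squared / len(numbers))  # 0.8 -- the variance of this small set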

Remember, we are taking the mean of the squared distances, that is, the expectation of the squared distances.


    \[Var(X) = E[(X-\mu)^2]\]

If we expand it, using \mu = E(X),

    \[Var(X) = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2\]

Finally,


    \[Var(X) = E[X^2] - [E(X)]^2\]

These are the formulas for the expectation and the variance of a discrete random variable.
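Putting these formulas to work on the simulated data from earlier (again a sketch that assumes df is still in scope from the previous snippet):

# E(X) and E(X^2) from the empirical PMF
e_x = (df['Heads'] * df['Probability']).sum()
e_x2 = (df['Heads'] ** 2 * df['Probability']).sum()

variance = e_x2 - e_x ** 2
print(e_x, variance)  # roughly 6.4 and 2.3 for this biased 10-coin example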

Why do we need expectation and variance?

In simple words, we need the expectation and the variance to describe the distribution. What is our distribution? It is the data itself! We can learn a lot about our data: How spread out is it? How diverse? What is the mean? Are the data points spread widely around the mean, or are they very close to each other? We can answer most of these questions by simply applying the above concepts.

There’s one more question we need to answer before we close this discussion.

What is the relation between probability distribution and machine learning?

We try to predict the probability distribution of the data using any machine learning algorithm.

What is machine learning? We provide our model with a set of data and then make it understand that data. Now, what is probability? We ‘predict’ (scientifically) the future based on past events. So, if we build a basic model that predicts the toss of the coin for our dataset above, we are essentially telling our model to learn the distribution. This might seem a little tricky at first, but think about it: if our model predicts values like the expectation and the variance from the 10-coin, 1000-sample dataset, we can easily predict the distribution for 100,000 samples or even more, because the distribution will not change (it becomes more and more stable as you increase the number of events). Obviously, in the real world we won’t have such clean distributions, and hence our models are more complex. The concept you are looking for is called statistical inference.

To read about probability distribution of continuous variables, read this blog!

In the end, it is crucial to understand these distributions. There’s a special name for the above distribution, the Binomial Distribution, but that was not the point of this discussion. What we wanted to know about was just the probability mass function!

Hope you liked it.

Some interesting blogs,

  1. Precision, Recall and F1 Score
  2. ROC Curve and AUC from Scratch

Is there any ambiguity? Do you have some suggestions? Hit me anytime on [email protected]!
