Know the Normal Distribution and Z-score as a Data Scientist

In this blog we will learn about normal distributions. You might have heard of the normal distribution before (also known as the Gaussian distribution). It is a basic concept for any mathematics student, but I found it a little tricky to understand as an engineering student. Here, we will work through these concepts and build some intuition behind them. Hopefully this blog will help you understand them. Rather than just giving you answers, it will motivate you to learn more statistics (if I'm successful).


By the end of this blog you will know:

  1. Basic understanding of discrete and continuous variables
  2. What is the probability distribution of a continuous variable?
  3. What are expectation and variance? How are they defined for a continuous variable?
  4. What is the standard deviation itself?
  5. The equation of the normal distribution and a basic understanding of it
  6. The 68-95-99.7 rule
  7. What is a z score?
  8. Why do we need to standardise data?
  9. How can we remove outliers from the data?
  10. How can we compare one dataset with another?
  11. What's the relation between the normal distribution and machine learning?

Introduction

Have you ever seen the show Brain Games? (I used to love it as a teenager.) That show had a mind-blowing mathematics trick. They played a game where they gathered a bunch of contestants and asked them to guess the number of marbles (actually candies) in a jar in front of them.

As you might expect, different people had different guesses. Some said 100, some said 10,000, some said other random numbers. The guesses were fairly spread out, and even though there were many participants, nobody got it right! Then Jason Silva (the host of the show) brought in a kid with a calculator. The kid literally just added up all the guesses and took the mean (average). Surprisingly, the average of these numbers was very close to the actual number of candies in the jar.

So how did that happen? How can the mean be so close to the actual number?

The basic answer is the normal distribution!

Note: this blog is a continuation of the one on the probability mass function (PMF). If you are a beginner, I would suggest you read that blog first.

Probability distribution of a continuous variable

The normal distribution, or Gaussian distribution, is a probability distribution of a continuous variable.

Consider the example of plotting the heights of students in a school. As we know, height is a continuous variable (we measure heights rather than count them). If there are 100 students, we can plot a probability distribution of their heights.


import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math

'''
mu = mean of the data
variance = variance of the data
no_people = number of people in the data (sample size)
sigma = standard deviation of the data
'''

########################################
# Feel free to play with numbers
########################################
mu = 175
variance = 10
no_people = 100

sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, no_people)

plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()

# reference https://stackoverflow.com/questions/10138085/how-to-plot-normal-distribution
height distribution (a normal distribution)

Just like we saw before, the height distribution has some spread to it. A few students will be under 165 cm, whereas a few others will be over 185 cm. On average, the students will be around 175 cm tall (the overall mean height).

Normal distribution is everywhere!

There are many things you can apply this to. For example, scientists have yet to measure the exact width of an atom (or any subatomic particle, for that matter). They can only approximate it, since it is practically impossible to measure the width of an atom with modern physics/computation power.

A second example is approximating time itself. In our GPS satellites, it is crucial to have an accurate time signature. Otherwise the entire GPS system would break down: a slight time drift and we would be seeing errors of hundreds of kilometres between places.

Thanks to the theory of relativity, we now know that time runs faster in the orbits in which these satellites revolve than it does on Earth. (Because of the Earth's mass, time flows more slowly on Earth compared to space. Anyway, enough physics.) So we need to recalibrate the clocks in these "super-accurate" satellites. We need to approximate time itself!

You can see countless examples of the normal distribution in real life. Why? To be honest, I don't know why so many things follow this distribution; tell me when you find the answer. But I can assure you of one thing: once you understand the concept of the normal distribution, it's hard to stop seeing it.

Variance

The variance of any distribution is a measure of the spread of the data. I will assume that you already know the basic concept of variance; if not, check out this link and come back here. For now, let's build up to the formula for the variance.

Expectation of the normal distribution

The expected value (or expectation) of a random variable is the theoretical mean of the random variable.

Here, the expectation is the mean of the normal distribution. Recall that for a discrete random variable we defined the expectation to be,


    \[E(X) = \sum x \cdot P(x)\]

Now we are dealing with continuous variables. What is the sum for continuous values? An integral!

So, the expectation for a continuous variable becomes,


    \[E(X) = \int_{-\infty}^{\infty} x \cdot P(x)\, dx\]

Here we have infinite limits on either end of the integral. This is because the normal distribution is spread across the entire number line. Whereas the PMF (the probability distribution of a discrete variable) sums over a countable set of values, the PDF of a continuous variable is defined over all real values. So, to generalise things, we need to integrate over the whole real line.

We already know how to find the variance from the expectation.


    \[Var(X) = E(X^2) - [E(X)]^2\]
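
To make these definitions concrete, here is a minimal sketch that evaluates both integrals numerically for our height distribution (carrying over the assumed mu = 175 and variance = 10 from the example above):

import numpy as np
import scipy.stats as stats
from scipy import integrate

mu, sigma = 175, np.sqrt(10)
pdf = lambda x: stats.norm.pdf(x, mu, sigma)

# integrate over mu +/- 10*sigma, which captures essentially all of the mass
lo_lim, hi_lim = mu - 10*sigma, mu + 10*sigma
E_x, _ = integrate.quad(lambda x: x * pdf(x), lo_lim, hi_lim)
E_x2, _ = integrate.quad(lambda x: x**2 * pdf(x), lo_lim, hi_lim)

print(E_x)            # ~175.0, the mean
print(E_x2 - E_x**2)  # ~10.0, the variance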

Standard Deviation

Once we understand the concept of variance, we can look at its square root: the standard deviation! The standard deviation is denoted by σ, and it is the square root of the variance σ².

(Ha!) But what is this term? And what does it actually signify? Honestly, I don't know the deep answer. The standard deviation denotes the spread of the data. But why do we need a term like the standard deviation? Probably the answer lies in deeper mathematics. The normal distribution has no bounds; it is spread across the whole number line, so instead of working with ginormous quantities like infinity we can work with a finite yardstick like the standard deviation (?).

So, keeping the deeper questions about the standard deviation aside for a moment and just accepting that the SD measures the spread of the data, we can now see another striking thing about the normal distribution.

The 68-95-99.7 rule

the 68-95-99.7 rule (Wikipedia)

To understand how the standard deviation works, let's look at this diagram. For any normal distribution, the range within 1 standard deviation of the mean contains 68.2% of the total samples! The range within 2 standard deviations contains 95.4%, within 3 standard deviations 99.7% of the data, and so on. And don't forget that the normal distribution shows up almost everywhere in the universe.

To understand how amazing this is, assume some quantity is normally distributed. Without even knowing the value of each sample, we can say how much of the data lies within a given number of standard deviations of the mean. This is like knowing the universe without even seeing it (try to think of it from the machine learning perspective). We will look into this concept more when we get to the z score. A quick empirical check is sketched below.
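
Here is a minimal sketch that checks the rule empirically by sampling from our (assumed) height distribution; the sample size of 100,000 is arbitrary, chosen just to make the fractions stable:

import numpy as np

mu, sigma = 175, np.sqrt(10)
samples = np.random.normal(mu, sigma, 100_000)

for k in (1, 2, 3):
    # fraction of samples within k standard deviations of the mean
    frac = np.mean(np.abs(samples - mu) <= k * sigma)
    print(k, frac)  # ~0.682, ~0.954, ~0.997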

The Normal Equation

How did we get the exact percentages for those counts? The answer lies in the normal distribution equation.

The Normal equation is given as,

    \[f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

To understand what it is, you can watch a beautiful explanation here.

From it, we can find the area under the curve, something called the cumulative distribution function, which has the formula,


    \[F_X(x) = P(X \leq x)\]

To grasp this concept, you can check out this explanation. In fact, the exact percentages from the 68-95-99.7 rule fall straight out of this CDF, as sketched below.
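
As a minimal sketch using scipy's built-in normal CDF, the exact probability within k standard deviations of the mean is:

import scipy.stats as stats

for k in (1, 2, 3):
    # P(-k <= Z <= k) for a standard normal variable Z
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(p, 4))  # 0.6827, 0.9545, 0.9973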

Standard Normal Curve

Before we go any further, we need to define some things. We need a standard to compare against.

The standard normal curve is the base curve for any normal distribution: any normal curve can be converted into the standard normal curve.

A few properties of the standard normal curve:

  1. The standard deviation of the curve should be 1.
  2. The mean should be zero.
  3. The kurtosis should be 3.

We will see these cases one by one. But for now, we can convert our height distribution into the standard normal distribution by shifting it to mean 0 and scaling it to SD 1. (By shifting we mean subtracting the mean; by scaling we mean dividing by the standard deviation.)

'''
mean of the standard normal curve is zero.
variance is 1.
kurtosis is 3.
'''
mu = 0
variance = 1
no_samples = 100
kurtosis = 3  # a property of every normal distribution (see below)

sigma = math.sqrt(variance)
# plotting over mu +/- 3*sigma covers ~99.7% of the mass;
# the 3 here is a plotting convention, not the kurtosis
x = np.linspace(mu - 3*sigma, mu + 3*sigma, no_samples)

plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
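
The code above draws the standard normal curve directly. To actually standardise data, we apply the shift-and-scale just described. Here is a minimal sketch (the simulated heights are an assumption, standing in for real measurements):

import math
import numpy as np

heights = np.random.normal(175, math.sqrt(10), 100)
# shift by the mean, scale by the standard deviation
standardised = (heights - heights.mean()) / heights.std()

print(standardised.mean())  # ~0
print(standardised.std())   # ~1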

As for the kurtosis mentioned in the comments above: in simple words, kurtosis gives us a value that shows how pointy (peaked) the distribution is.

Here’s a little example,

a pointy normal distribution (kurtosis)

We will see how and why it is important to convert distributions into standard distributions, but before that, let's understand the z score.

z score

The most used aspect of the normal curve in machine learning (for feature selection/engineering) is the z score. The z score is a quantity that tells us how far our sample (or point) is from the mean. The rudimentary thing we need to understand, obviously, is the standard deviation: the z score tells us how many standard deviations away our point is (from the mean, obviously).

Formulating the z score is easy once we know what it tells us.


    \[z = \frac{x - \mu}{\sigma}\]

Here, x represents the value we are testing (it can be anything), \mu is the mean of the distribution, and \sigma is the standard deviation. As you can see, the numerator just gives us the distance between the point and the mean, and dividing it by the standard deviation tells us how many standard deviations the point is away from the mean.
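
As a minimal sketch, with the height example from before (mu = 175 and sigma = sqrt(10) are the assumed parameters):

import math

def z_score(x, mu, sigma):
    # number of standard deviations x lies from the mean
    return (x - mu) / sigma

print(z_score(185, 175, math.sqrt(10)))  # ~3.16: more than 3 SDs above the mean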

You might then wonder why we need yet another score for finding the distance from the mean. The answer is simple! The z score is expressed on the standard scale. Hence, whenever we find the z score of some quantity, we can compare it with the z score of something totally different.

Why do we need the z score in machine learning?

For various machine learning algorithms (almost all except tree-based algorithms) we need to standardise the data. What do we mean by standardisation? We want to transform the data to be as close as possible to the standard normal distribution. Standardised data helps our machine learning model converge better (this leads to the concepts of gradient descent and optimisation). For now, we can agree that there are various applications, beyond machine learning models, where standardisation of the data is necessary.

The other advantage (use) of finding a z score is that we can compare two totally different values with each other.

In summary, the z score is used for:

  1. Standardising data
  2. Finding outliers
  3. Comparing values between two distributions

Comparing values between two distributions using z score

For example, suppose we want to find out how person A's height compares with person B's. A is 175 cm tall and B is 180 cm. But person B lives in a western country, where people are generally taller than in the eastern country where person A lives. Now, what can we tell about A's and B's heights relative to their populations? We can find the height distribution of the US population (a western country) and of India's population (an eastern country). Then we can find the means and where A and B lie on their respective distributions. We can calculate the z scores for these values and compare them.

From a purely non-statistical perspective, we can say 175 < 180. But by comparing z scores, we might find the two heights are similar relative to their populations, as sketched below.
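
Here is a minimal sketch of that comparison. The population means and SDs below are made-up illustrative numbers, not real census data:

# hypothetical population parameters (assumptions for illustration)
mu_india, sigma_india = 165, 7  # person A's population
mu_us, sigma_us = 178, 7        # person B's population

z_a = (175 - mu_india) / sigma_india
z_b = (180 - mu_us) / sigma_us

print(z_a)  # ~1.43: A is well above their population's mean
print(z_b)  # ~0.29: B is only slightly above theirs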

Standardisation of the data using z score

Let's take another example to show the power of the z score (a little treat for the physics enthusiast in you). Suppose you are a NASA scientist building a machine learning model that detects whether a star will collapse into a black hole at the end of its life cycle. (If you don't know: stars, among other things, collapse into themselves under the force of gravity at the end of their lifetime. And this can be a problem for us. Surprise, surprise!) So, let's look at the closest star to our sun:

Proxima Centauri data (Wikipedia)

Proxima Centauri is the closest star to our sun, and above are our parameters. We can see that the scales of the data points are vastly different (as are the units). The temperature is in the thousands, whereas the luminosity is far below 1, on the scale of 10^{-5}. Handling such a variety of scales in the data is hard for any machine learning algorithm (why? read about gradient descent here). We can use the z score to scale our data. In fact, scikit-learn's StandardScaler uses this exact technique to scale the data.

from sklearn.preprocessing import StandardScaler
'''
StandardScaler uses the z score to standardise the data.
df = pandas dataframe
col = [column name] (a continuous variable, although it works for any numeric values)
'''
scaler = StandardScaler().fit(df[col].values)
scaled = scaler.transform(df[col].values)

df[col] = scaled

Here's an example of how we can standardise the data. Notice the change in the mean and SD.

from sklearn.preprocessing import StandardScaler
col = ['f_26']

print('Mean of the column before: ', df[col].mean())
print('SD of the column before: ', df[col].std())

scaler = StandardScaler().fit(df[col])
scaled = scaler.transform(df[col].values)
df[col] = scaled

print('Mean of the column after: ', df[col].mean())
print('SD of the column after: ', df[col].std())

Output:

Mean of the column before:  0.357591
SD of the column before:    2.47602

Mean of the column after:   7.342275e-18
SD of the column after:     1.000001

Note that standardisation only shifts and rescales the data; it does not change its shape, so handling skewness needs separate transforms (a log transform, for example).

Removing Outliers using z score

Another application of the z score is removing outliers from the data. As we already know, the z score is nothing but the number of standard deviations a given data point is from the mean. If the data follows a normal distribution, it follows the 68.2-95.4-99.7 rule. So removing outliers is easy once we know the z scores of the values: we simply drop the points with high absolute z scores (larger than 2 or 3, let's say), as sketched below. In fact, this exact technique gets used in most data science projects during data cleaning.
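
Here is a minimal sketch of z-score-based removal; df['AVG'] is the same hypothetical column used in the IQR snippet below, and the threshold of 3 is a common convention rather than a fixed rule:

z = (df['AVG'] - df['AVG'].mean()) / df['AVG'].std()
df_clean = df[z.abs() <= 3]  # keep only points within 3 SDs of the mean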


# A related alternative: the interquartile range (IQR) method,
# which underlies the box plot
Q1 = df['AVG'].quantile(0.25)
Q3 = df['AVG'].quantile(0.75)
IQR = Q3 - Q1  # IQR is the interquartile range

# keep only the points within 1.5 IQRs of the quartiles
mask = (df['AVG'] >= Q1 - 1.5 * IQR) & (df['AVG'] <= Q3 + 1.5 * IQR)
df = df.loc[mask]

# reference: https://datascience.stackexchange.com/questions/54808/how-to-remove-outliers-using-box-plot

To take this example a little further, we can draw a box plot, which is nothing but the distribution summarised in box form; a little introductory explanation can be found in the reference linked in the snippet above.

Conclusion

Understanding the normal distribution is like opening Pandora's box! It makes you think about how the world works, literally. If you want to go further from here, you should look at the central limit theorem, which takes this concept to a whole new level. On the other hand, if you want to understand the practical side of it, you should look at the concepts of p-values and t-values (most important for feature selection).

By now you should understand the normal distribution and z score concepts. If this blog was successful, you have become fascinated by the world of statistics! And if you are a beginner-to-intermediate machine-learning/statistics practitioner, you will want to learn more stats too. I am in that same phase, so why don't you join me? Hit me up at [email protected] or on LinkedIn any time. Or subscribe to the newsletter!
