While studying machine learning, I encountered the term "hacker statistics" in different sources, such as DataCamp's "Statistical Thinking in Python" classes and a PyCon talk by Jake Vanderplas. It was a real eye-opener for me. I had little statistics background from my university degree, so the results I found online when searching for statistical terms were not that intuitive and took me a while to understand.

This post is meant to solidify my understanding of the concept and to share what I learned, in the hope that it helps others who struggle with the intuition behind some statistical concepts as well.

The coin toss

Let's try tossing a coin with a Python program. We can use the random module of NumPy to generate our random numbers, treating 1 as heads and 0 as tails for our coin toss.

In [1]:
import numpy as np
np.random.seed(123)
np.random.randint(2,size=10) #let us toss our 2-sided (duh) coin 10 times
Out[1]:
array([0, 1, 0, 0, 0, 0, 0, 1, 1, 0])

Cool, we have 3 heads and 7 tails.

In the real world, we may encounter a situation where we want to ask: is this coin we are tossing fair or not? Or is this die fair or not? We can run a simulation to answer this question.
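As a quick sketch of that idea (with a hypothetical seed and sample size, not code from the original notebook), we can simulate many batches of 10 fair-coin tosses and ask how often a fair coin gives 3 or fewer heads, like our toss above:

```python
import numpy as np

np.random.seed(42)  # hypothetical seed, just for reproducibility
n_batches = 100000

# simulate 100,000 batches of 10 fair-coin tosses and count heads in each
heads = np.random.randint(2, size=(n_batches, 10)).sum(axis=1)

# fraction of batches with 3 or fewer heads, as in our run above
p_low = np.mean(heads <= 3)
print("P(3 or fewer heads in 10 fair tosses) ~ {:.3f}".format(p_low))  # roughly 0.17
```

A fair coin shows 3 or fewer heads about 17% of the time, so getting 3 heads in 10 tosses is unremarkable and gives us no reason to suspect the coin.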

The die toss

In particular, the question may be: what is the probability that a fair die will show the number "1" at least 6 times in 20 rolls?

In [2]:
def roll_dice(n):
    """Roll a die n times and return the counts of each side in a dictionary."""
    die_sides = np.arange(1, 7)
    rolls = np.random.randint(1, 7, size=n)
    return {side: np.sum(rolls == side) for side in die_sides}

#let's roll a die 20 times
counts = roll_dice(20)

for key, val in counts.items():
    print('Die side {} occurrences: {}'.format(key, val))
Die side 1 occurrences: 5
Die side 2 occurrences: 6
Die side 3 occurrences: 2
Die side 4 occurrences: 3
Die side 5 occurrences: 3
Die side 6 occurrences: 1

Great, we succeeded in simulating 20 die rolls. To estimate the probability of rolling the side "1" at least six times, I can repeat this exercise, say, 10,000 times, and check the instances when die side "1" shows up 6 times or more.

In [3]:
n_trials = 10000
n_at_least_6 = 0
for i in range(n_trials):
    counts = roll_dice(20)
    n_at_least_6 += counts[1] >= 6

probability = n_at_least_6 / n_trials
print("The probability of the die showing \"1\" at least 6 times is {:.2%}".format(probability))
The probability of the die showing "1" at least 6 times is 9.95%

OK, so we fail to reject the null hypothesis, i.e. we don't have enough evidence to say that this is an unfair die.
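As a sanity check on the simulated 9.95%, we can also compute the exact tail probability from the binomial formula using only the standard library (a sketch, not part of the original notebook):

```python
from math import comb

n, p = 20, 1 / 6

# P(X >= 6) = sum of binomial probabilities for k = 6, 7, ..., 20
p_tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6, n + 1))
print("Exact P(at least six 1s in 20 rolls) = {:.2%}".format(p_tail))  # about 10.2%
```

The exact value is about 10.2%, close to the 9.95% our 10,000-trial simulation produced.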

It turns out that this whole die-rolling problem is actually a story that can be told using the binomial distribution: the number of successes ($r$) in $n$ Bernoulli trials, each with probability $p$ of success, is binomially distributed. In our case, each roll is a trial, and "success" means the die shows a particular side, e.g. "1", with probability $p = \frac{1}{6}$.

This might still be vague, so let's draw a plot to make it clear.

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style()

#we use the np.random.binomial to sample from binomial distribution
np.random.binomial(20,1./6,size=5) #roll a 6-sided die 20 times and count the
                                   #times it shows a particular side, e.g. "1",
                                   #and repeat this exercise 5 times.
Out[4]:
array([6, 5, 1, 3, 1])

Notice that this performs the same task as the roll_dice function defined above (restricted to counting a single side), just with NumPy's optimizations, of course. We can draw a bigger sample and plot a histogram to show what the binomial distribution looks like.
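To see the equivalence, here is a quick sketch (with a hypothetical seed and sample sizes) comparing the count of "1"s from raw die rolls against np.random.binomial; both means should hover near $20 \times \frac{1}{6} \approx 3.33$:

```python
import numpy as np

np.random.seed(0)  # hypothetical seed, just for reproducibility
n_experiments, n_rolls = 10000, 20

# count the "1"s in each batch of 20 raw die rolls
rolls = np.random.randint(1, 7, size=(n_experiments, n_rolls))
counts_from_rolls = (rolls == 1).sum(axis=1)

# sample the same quantity directly from the binomial distribution
counts_from_binomial = np.random.binomial(n_rolls, 1 / 6, size=n_experiments)

print(counts_from_rolls.mean(), counts_from_binomial.mean())  # both near 3.33
```

The two approaches are statistically interchangeable; np.random.binomial just skips the intermediate per-roll bookkeeping.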

In [5]:
%matplotlib inline

binomial_samples = np.random.binomial(20,1./6,size=10000)

fig, ax=plt.subplots()

bins = np.arange(0, max(binomial_samples) + 2) - 0.5       # bin edges centered on the integers
ax.hist(binomial_samples,bins=bins,density=True,alpha=0.5) # generate histogram
ax.margins(0.02)                                           # set margins
ax.set_xlabel('number of times the "1" face shows up')     # label axes
ax.xaxis.grid()
ax.axvline(x=5.5,linewidth=3)
ax.annotate('{:.2%}'.format(probability), 
             xytext=(7, 0.11), 
             xy=(5.5, 0.11),
             va='center',
             color='k',
             size=11,
             arrowprops={'arrowstyle': '<|-', 
                         'lw': 2, 
                         'color': 'k'});

Indeed, we see here that the area to the right of the line, i.e. counts of 6 and above, is roughly 10% of the area of the whole distribution.

We saw that we can use simulation to get a good estimate of the binomial distribution. Though the NumPy method gives us faster results, the intuition behind this statistical concept can be grasped by writing simple code. That can give us the 'aha' moment when we look back at the formal, classical definition of the binomial distribution and the likes of it.

