While studying machine learning, I kept encountering the term "hacker statistics" in different sources, such as DataCamp's "Statistical Thinking in Python" courses and a PyCon talk by Jake VanderPlas. It was a real eye-opener for me. I had very little statistics background from my university degree, so the results I found online when searching for statistical terms were not that intuitive and took me a while to understand.
This post is meant to solidify my understanding of the concept and to share what I learned, hoping it helps others who have trouble building intuition for some statistical concepts as well.
The coin toss
Let's try tossing a coin using a Python program. We can use the random
module of numpy
to generate our random numbers, setting 1 for heads and 0 for tails in our coin toss.
import numpy as np
np.random.seed(123)
np.random.randint(2,size=10) #let us toss our 2-sided (duh) coin 10 times
Cool, we have 3 heads and 7 tails.
In the real world, we may encounter a situation where we want to ask: is the coin we are tossing fair or not? Or is this die fair? We can run a simulation to answer such questions.
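As a quick sketch of the idea (the seed and number of flips here are arbitrary choices of mine, not from the original exercise), we can flip a simulated fair coin many times and check that the fraction of heads settles near 0.5:

```python
import numpy as np

rng = np.random.default_rng(42)           # arbitrary seed for reproducibility
n_flips = 100_000
flips = rng.integers(0, 2, size=n_flips)  # 1 = heads, 0 = tails
heads_fraction = flips.mean()
print("Fraction of heads: {:.4f}".format(heads_fraction))
```

A coin whose long-run heads fraction drifts far from 0.5 would be suspect; the die example that follows makes this kind of reasoning quantitative.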
The die toss
In particular, the question may be: what is the probability that a fair die shows the number "1" six or more times in 20 rolls?
def roll_dice(n):
    """Roll a die n times and return the counts of each side in a dictionary."""
    die_sides = np.arange(1, 7)
    rolls = np.random.randint(6, size=n) + 1  # draw n sides uniformly from 1-6
    counts_dict = {side: sum(rolls == side) for side in die_sides}
    return counts_dict
# let's roll a die 20 times
counts = roll_dice(20)
for key, val in counts.items():
    print('Die side {} occurrences: {}'.format(key, val))
Great, we have successfully simulated 20 die rolls. To estimate the probability of rolling the side "1" six times, I can repeat this exercise many times, say 10,000, and count the instances when die side "1" shows up 6 times or more.
at_least_6 = 0
for i in range(10000):
    counts = roll_dice(20)
    at_least_6 += counts[1] >= 6
probability = at_least_6 / 10000
print("The probability of showing \"1\" at least six times is {:.2%}".format(probability))
OK, so we fail to reject the null hypothesis, i.e. we don't have enough evidence to say that this is an unfair die.
It turns out that this whole die problem is a story that can be told using the binomial distribution, which says that the number of successes ($r$) in $n$ Bernoulli trials, each with probability of success $p$, is binomially distributed. In our case a "success" is the die showing a particular side, e.g. "1", which happens with probability $p = \frac{1}{6}$.
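As a sanity check (this uses scipy.stats, which is not imported anywhere in the original post), the tail probability we simulated also has a closed form under the binomial distribution:

```python
from scipy.stats import binom

# P(X >= 6) for X ~ Binomial(n=20, p=1/6): the chance that a fair die
# shows "1" at least six times in 20 rolls. sf(5) = P(X > 5) = P(X >= 6).
p_tail = binom.sf(5, 20, 1/6)
print("Exact tail probability: {:.2%}".format(p_tail))
```

This should land close to the ~10% estimate from the 10,000-run simulation above.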
This might still be vague, so let's draw a plot to make it clear.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style()

# we use np.random.binomial to sample from the binomial distribution:
# toss a 6-sided die 20 times, count how often it shows a particular
# side, e.g. "1", and repeat this exercise 5 times
np.random.binomial(20, 1./6, size=5)
Notice that this does the same thing as the roll_dice
function defined above (for a single side), just with the numpy optimizations, of course. We can draw a bigger sample and plot a histogram to show what the binomial distribution looks like.
%matplotlib inline
binomial_samples = np.random.binomial(20, 1./6, size=10000)
fig, ax = plt.subplots()
bins = np.arange(max(binomial_samples) + 2) - 0.5  # bin edges centered on each integer count
ax.hist(binomial_samples, bins=bins, density=True, alpha=0.5)  # generate histogram
ax.margins(0.02)  # set margins
ax.set_xlabel('number of times the "1" face shows up')  # label axes
ax.xaxis.grid()
ax.axvline(x=5.5, linewidth=3)  # boundary between "fewer than 6" and "6 or more"
ax.annotate('{:.2%}'.format(probability),
            xytext=(7, 0.11),
            xy=(5.5, 0.11),
            va='center',
            color='k',
            size=11,
            arrowprops={'arrowstyle': '<|-',
                        'lw': 2,
                        'color': 'k'});
Indeed, we see here that the area from 6 upward (to the right of the vertical line) is roughly 10% of the total area of the distribution.
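The same tail area can also be read off directly from the samples, without a plot: the fraction of samples at or above 6 is the simulated probability (the seed below is my own choice, for reproducibility):

```python
import numpy as np

np.random.seed(123)  # arbitrary seed so the estimate is reproducible
samples = np.random.binomial(20, 1./6, size=10000)
tail_area = np.mean(samples >= 6)  # fraction of samples with six or more "1"s
print("Simulated tail area: {:.2%}".format(tail_area))
```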
We saw that we can use simulation to get a good estimate of the binomial distribution. Though the numpy method gives us faster results, the intuition behind this statistical concept can be built by writing simple code. This can then give us the 'aha' moment when we look back at the formal, classical definition of the binomial distribution and the like.