I am fascinated by all things statistics, which is the foundation of machine learning algorithms. I strongly believe that a solid grounding in statistics makes us not just better data scientists, but also more intuitive in everyday life. I often say that machine learning is just a glorified form of statistics.
In this notebook I explain a statistical concept called sampling and the sampling distribution, which is also essential to machine learning: ML models are built on some form of sample data, not on all the data available in the system.
Statistics is the science of inference. To infer a value, you can use either the entire population or a fraction of the population, which is called a sample.
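As a toy illustration of this idea (a minimal sketch using synthetic numbers, not the wine data loaded later), the mean of a reasonably sized sample tends to land close to, but rarely exactly on, the mean of the full population:
import numpy as np
np.random.seed(0)
population = np.random.normal(loc=50, scale=5, size=100000)    # a synthetic "population" of 100,000 values
sample = np.random.choice(population, size=100, replace=False)  # a random sample of 100 drawn from it
print("Population mean:", population.mean())
print("Sample mean:    ", sample.mean())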
The main reasons for sampling are cost, time, and the fact that examining every member of a population is often not feasible. As we might already know, though, sampling is not helpful in cases like electing a country's representative (a sample of voters cannot vote on behalf of the entire population) or collecting a country's census data.
Sample statistics as estimators of population parameters:
We are trying to make an estimate about a large population (all of the students in a university, all of the workers in an industry, or in this case all of the wine produced) based on a sample.
So in this case, let us assume that we are the quality control team of a wine manufacturing plant and we have to make an inference about the whole population based on the data we have in hand. Obviously, we cannot test every wine bottle, because the wine is being aged and we cannot pop open all the bottles. So we select a sample of the wine and say: if those bottles are okay, then everything else should be okay as well.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")
data.head()
So here is the sample of 10 wines (quality 7 and above) that the manager took:
np.random.seed(11)
sample_7_and_above = data[data['quality'].isin([7, 8])].sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
sample_7_and_above
What are the mean and standard deviation of this sample?
x_bar = np.mean(sample_7_and_above["alcohol"])
print("The mean of the sample is: ", str(x_bar))
s = np.std(sample_7_and_above["alcohol"], ddof=1)  # ddof=1 gives the sample standard deviation
print("The standard deviation of the sample is: ", str(s))
The sample is going to have its own distribution:
sns.distplot(sample_7_and_above["alcohol"], hist=False)
title = "X_bar_1 = " + str(x_bar) + ", s1 = "+ str(s)
plt.title(title)
Point estimates are never perfect; they always have an error component, commonly referred to as the "margin of error".
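As a minimal sketch of that idea (assuming a 95% confidence level and using the t-distribution from scipy, since the sample has only 10 observations), the margin of error for the sample mean computed above could be estimated like this:
from scipy import stats
n = 10
t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% critical value for n - 1 degrees of freedom
margin_of_error = t_crit * s / np.sqrt(n)  # s and x_bar come from the cells above
print("Margin of error (95%):", margin_of_error)
print("Interval estimate:", (x_bar - margin_of_error, x_bar + margin_of_error))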
seed = np.arange(0, 9)
x_bar = []
std_dev = []
for s in seed:
    np.random.seed(s)
    sample_7_and_above = data[data['quality'].isin([7, 8])].sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
    x_bar.append(np.mean(sample_7_and_above["alcohol"]))
    std_dev.append(np.std(sample_7_and_above["alcohol"], ddof=1))
samples = pd.DataFrame(columns = ["Sample Means (X_bar)", "Sample Standard Deviation (s)"], data= list(zip(x_bar, std_dev)))
samples
sns.distplot(samples["Sample Means (X_bar)"])
plt.title("Distribution of the sample means")
The expected value of the sample mean is equal to the population mean:
$$ E[\bar{X}] = \mu $$
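As a quick sanity check of that statement (a rough sketch, reusing the `data` and `samples` objects from the cells above), the average of our sample means should land reasonably close to the mean alcohol level of the wines we sampled from (quality 7 and above):
population_mean = data[data['quality'].isin([7, 8])]['alcohol'].mean()  # mean of the sub-population we sampled from
mean_of_sample_means = samples["Sample Means (X_bar)"].mean()           # average of the sample means
print("Population mean:      ", population_mean)
print("Mean of sample means: ", mean_of_sample_means)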
Recall that the standard error of the sample mean is

$$ \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} $$

where $\sigma$ is the population standard deviation and $n$ is the sample size. Remember that we are given the population standard deviation as 1.55, so for our sample of 10:

$$ \sigma_{\bar{x}} = \dfrac{1.55}{\sqrt{10}} = 0.49 $$

If we had taken a sample of 20 instead of 10, what would the value of $\sigma_{\bar{x}}$ be?

$$ \sigma_{\bar{x}} = \dfrac{1.55}{\sqrt{20}} = 0.35 $$

It goes down by about 0.14. What if we took a sample of 100?

$$ \sigma_{\bar{x}} = \dfrac{1.55}{\sqrt{100}} = 0.155 $$

A quick numerical check of these values is sketched below.
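A small sketch of how this standard error shrinks as the sample size grows, using the given population standard deviation of 1.55:
sigma = 1.55  # given population standard deviation
for n in [10, 20, 100]:
    print(f"n = {n:3d}  ->  standard error = {sigma / np.sqrt(n):.3f}")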
So we saw above that the distribution of the sample means already looks similar to the familiar normal distribution. Let us watch this shape emerge more clearly as we take more and more samples of 10 wines each, this time from the whole dataset.
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(3, 2, 1)
sns.distplot(data["alcohol"])
plt.title("Original distribution of Alcohol level in the entire dataset")
seed = np.arange(0, 2)
x_bar = []
for s in seed:
    np.random.seed(s)
    sample_7_and_above = data.sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
    x_bar.append(np.mean(sample_7_and_above["alcohol"]))
ax = fig.add_subplot(3, 2, 2)
sns.distplot(x_bar)
plt.title("Sample means of 2 samples of 10 each")
seed = np.arange(0, 7)
x_bar = []
for s in seed:
    np.random.seed(s)
    sample_7_and_above = data.sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
    x_bar.append(np.mean(sample_7_and_above["alcohol"]))
ax = fig.add_subplot(3, 2, 3)
sns.distplot(x_bar)
plt.title("Sample means of 7 samples of 10 each")
seed = np.arange(0, 20)
x_bar = []
for s in seed:
    np.random.seed(s)
    sample_7_and_above = data.sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
    x_bar.append(np.mean(sample_7_and_above["alcohol"]))
ax = fig.add_subplot(3, 2, 4)
sns.distplot(x_bar)
plt.title("Sample means of 20 samples of 10 each")
seed = np.arange(0, 100)
x_bar = []
for s in seed:
    np.random.seed(s)
    sample_7_and_above = data.sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
    x_bar.append(np.mean(sample_7_and_above["alcohol"]))
ax = fig.add_subplot(3, 2, 5)
sns.distplot(x_bar)
plt.title("100 samples of 10 each")
seed = np.arange(0, 500)
x_bar = []
for s in seed:
    np.random.seed(s)
    sample_7_and_above = data.sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
    x_bar.append(np.mean(sample_7_and_above["alcohol"]))
ax = fig.add_subplot(3, 2, 6)
sns.distplot(x_bar)
plt.title("500 samples of 10 each")
fig.tight_layout()
plt.show()
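As a rough check (assuming the `x_bar` list from the last loop, i.e. the 500 sample means of 10 wines each, is still in scope), the spread of those sample means should be close to the population standard deviation of alcohol divided by the square root of the sample size:
print("Std. dev. of the 500 sample means:", np.std(x_bar))
print("Population std. dev. / sqrt(10):  ", np.std(data['alcohol']) / np.sqrt(10))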
The sample proportion is the fraction of observations in the sample that meet our criterion. It is represented by $\hat{p}$:

$$ \hat{p} = \frac{x}{n} $$

where x is the number of observations in the sample that meet the criterion, and n is the sample size (the total number of observations in the sample).
np.random.seed(11)
sample_7_and_above = data[data['quality'].isin([7, 8])].sample(13)[['alcohol']].reset_index().drop(columns = ['index'])
sample_7_and_above
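As a hedged example of a sample proportion (the 11% alcohol threshold below is just an illustrative choice, not part of the original analysis), we could compute the fraction of wines in this sample whose alcohol level exceeds 11%:
x = (sample_7_and_above['alcohol'] > 11).sum()  # observations in the sample meeting the criterion
n = len(sample_7_and_above)                     # sample size (13)
p_hat = x / n
print("Sample proportion p_hat =", p_hat)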