Introduction:

I am fascinated by all things statistics, which is the foundation of all machine learning algorithms. I strongly believe that a strong foundation in statistics will make us not just better data scientists, but also more intuitive in real life. I often say that machine learning is just a glorified form of statistics.

In this notebook I explain a concept in statistics called sampling and the sampling distribution, which is also essential to machine learning, since ML models are built on some form of sample data and not on all the data available in the system.

Motivation:

Sampling

Statistics is the science of inference. To infer a value, you can use the entire population or a fraction of the population, which is called a sample.

The reasons for sampling are:

  • In some cases it might be impossible to collect data on the entire population
  • It saves time and money

As we might already know, sampling is not appropriate in cases like electing a representative of a country (where a sample of voters would vote instead of the entire population) or collecting census data of a country.

Sample statistics as estimators of population parameters:

  • Whenever we refer to a population parameter, we denote it with a Greek letter (for example μ and σ).
  • When we refer to a sample estimate, we denote it with a Latin (English) letter (for example x̄ and s).

Point estimate

We are trying to estimate something about a large population (like all of the students in a university, all of the workers in an industry, or in this case all of the wine) based on a sample.

So in this case, let us assume that we are in charge of quality control at a wine manufacturing plant and have to make inferences about the whole population based on the data that we have in hand. Obviously, we cannot test every bottle, because wines are aged and we cannot pop open all the bottles. So we select a sample of the wine and say that if those bottles are okay, then everything else should be okay as well.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
data = pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")
data.head()
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

Question 1:

A wine can be labeled as quality 7 and above only if the alcohol level is greater than 11.3. If the alcohol level is lower than 11.3, the wine won't taste sour enough and might be bitter.

The price difference between quality 7 & 8 wine and quality 5 & 6 wine will be as much as 125%. So the company needs to be very particular to maintain this standard.

The quality control person is only allowed to open 10 bottles of wine in quality 7 & 8 to make sure the levels are sufficient.

Points to note:

  • There is no way that the company will be able to open all the bottles of wine
  • The company must take samples and then make inferences about the entire batch (population)

So here are the 10 samples that the manager took:

In [3]:
np.random.seed(11)
sample_7_and_above = data[data['quality'].isin([7, 8])].sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
sample_7_and_above
Out[3]:
alcohol
0 12.0
1 12.5
2 11.7
3 14.0
4 12.7
5 12.1
6 11.0
7 10.0
8 10.8
9 9.7

What is the mean and standard deviation of the samples?

In [4]:
x_bar = np.mean(sample_7_and_above["alcohol"])
print("The mean of the sample is: ", str(x_bar))
# Note: np.std defaults to ddof=0, i.e. it divides by n rather than n - 1
s = np.std(sample_7_and_above["alcohol"])
print("The standard deviation of the sample is: ", str(s))
The mean of the sample is:  11.65
The standard deviation of the sample is:  1.2387493693237548
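
One detail worth flagging: `np.std` divides by n by default (`ddof=0`), which is the population formula. When s is meant as an estimate of σ from a sample, the usual convention is to divide by n - 1 (`ddof=1`). Below is a minimal sketch on the ten values printed above; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# The ten alcohol readings shown in Out[3]
alcohol = np.array([12.0, 12.5, 11.7, 14.0, 12.7, 12.1, 11.0, 10.0, 10.8, 9.7])

x_bar_check = alcohol.mean()
s_population = alcohol.std(ddof=0)  # divides by n (the np.std default)
s_sample = alcohol.std(ddof=1)      # divides by n - 1 (Bessel's correction)

print("mean:", round(float(x_bar_check), 2))
print("std with ddof=0:", round(float(s_population), 4))
print("std with ddof=1:", round(float(s_sample), 4))
```

With only 10 observations the two versions differ noticeably, so it helps to state which one is being reported.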

The sample is going to have its own distribution:

In [5]:
sns.kdeplot(sample_7_and_above["alcohol"])  # KDE only; replaces the deprecated distplot(..., hist=False)
title = "X_bar_1 = " + str(x_bar) + ", s1 = "+ str(s)
plt.title(title)
Out[5]:
Text(0.5, 1.0, 'X_bar_1 = 11.65, s1 = 1.2387493693237548')

Question 2:

Now remember, our goal was to get a mean of 11.3, but we got 11.65 for our samples. Since this is a sample and we do not expect it to be exactly 11.3, is 11.65 close enough to our acceptable goal?

Does this sample accurately reflect the alcohol content of the entire population?

How can we determine that?

Point estimates are never perfect; they always have an error component. This is commonly referred to as the "margin of error".

Sampling distribution

Take multiple samples from our population.

Let us take 9 samples, each of size 10.

In [6]:
seed = np.arange(0, 9)

x_bar = []
std_dev = []

for s in seed:
    np.random.seed(s)
    sample_7_and_above = data[data['quality'].isin([7, 8])].sample(10)[['alcohol']].reset_index().drop(columns = ['index'])
    x_bar.append(np.mean(sample_7_and_above["alcohol"]))
    std_dev.append(np.std(sample_7_and_above["alcohol"]))
    
samples = pd.DataFrame(columns = ["Sample Means (X_bar)", "Sample Standard Deviation (s)"], data= list(zip(x_bar, std_dev)))
samples
Out[6]:
Sample Means (X_bar) Sample Standard Deviation (s)
0 11.02 0.669029
1 11.14 0.971802
2 11.21 1.013361
3 11.63 0.888876
4 11.45 0.908020
5 11.24 0.935094
6 11.32 1.231909
7 11.39 0.773886
8 11.30 0.507937

Now the question is, what is the distribution of the sample means? The distribution of a statistic (here the sample mean) across repeated samples is called its sampling distribution.

In [7]:
sns.histplot(samples["Sample Means (X_bar)"], kde=True, stat="density")  # replaces the deprecated distplot
plt.title("Distribution of the sample means")
Out[7]:
Text(0.5, 1.0, 'Distribution of the sample means')

Expected value (mean) of X̄:

$$ E(\bar X _{n}) = \mu $$

The expected value of X̄ equals the population mean μ.

Points to consider from this:

  • The expected value of X̄ is the best estimate of μ. But to obtain it we would have to take all the possible samples from the population, which is not our goal.
  • The best we can do is to come up with a range (or interval) for the value of μ.
  • Our interval estimate will be affected by the sample size and the degree of "confidence".
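
The first point above can be checked with a small simulation. This is only a sketch on a hypothetical right-skewed population (synthetic gamma-distributed values, not the wine data), and all names in it are illustrative. It draws many samples of size 10 and compares the average of the sample means with the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed population (an assumption, not the wine data)
population = rng.gamma(shape=2.0, scale=1.5, size=50_000)
mu = population.mean()

# Draw many samples of size 10 and record each sample mean
sample_means = np.array([
    rng.choice(population, size=10, replace=False).mean()
    for _ in range(5_000)
])

print("population mean:             ", round(float(mu), 3))
print("average of the sample means: ", round(float(sample_means.mean()), 3))
print("one individual sample mean:  ", round(float(sample_means[0]), 3))
```

Any single sample mean can be well off the mark, while the average over many samples sits close to μ. That is why, with one sample in hand, an interval is a more honest summary than a single number.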

Question 3:

What is the error in the mean, given that the population standard deviation is 𝜎 = 1.55?

Standard error of the mean:

$$ \sigma_{\bar x} = \dfrac{\sigma}{\sqrt{n}} $$

Remember that we are given the population standard deviation as 1.55, and our sample size is n = 10:

$$ \sigma_{\bar x} = \dfrac{1.55}{\sqrt{10}} = 0.49 $$

Points to note:

  • Note that as the sample size n increases, the denominator of the fraction above increases, and so the standard error σ_x̄ decreases.

Influence of sample size.

If we had taken a sample of 20 bottles instead of 10, what would the value of our σ_x̄ be?

$$ \sigma_{\bar x} = \dfrac{1.55}{\sqrt{20}} = 0.35 $$

It goes down by about 0.14. What if we took a sample of 100?

$$ \sigma_{\bar x} = \dfrac{1.55}{\sqrt{100}} = 0.155 $$
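
The same arithmetic can be reproduced in a few lines. This is a small sketch; `standard_error` is a helper name introduced here for illustration, not something defined elsewhere in the notebook.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 1.55  # population standard deviation given in the text

for n in (10, 20, 100):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.3f}")
```

Because n sits under a square root, going from 10 to 100 bottles (ten times the effort) shrinks the standard error only by a factor of about 3.2.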

Another way to think about this is that as the sample size increases, X̄ tends to fall closer to the actual population mean μ.

Important implication of that: The standard error of the mean will remain the SAME for ANY sample of size 10.

In our case, for any sample of 10 bottles, the standard error of the mean will always remain 0.49 (assuming the population standard deviation remains the same).
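
A quick simulation can make this concrete. This is a sketch under stated assumptions: the population is taken to be normal with μ = 11.3 (an assumed value for illustration) and σ = 1.55, and the batches are synthetic rather than drawn from the wine data.

```python
import numpy as np

rng = np.random.default_rng(42)

mu, sigma, n = 11.3, 1.55, 10  # mu is assumed; sigma and n follow the text

# 20,000 hypothetical batches of 10 bottles each, drawn from a normal population
batches = rng.normal(loc=mu, scale=sigma, size=(20_000, n))
batch_means = batches.mean(axis=1)

empirical_se = batch_means.std(ddof=1)  # spread of the simulated sample means
theoretical_se = sigma / np.sqrt(n)     # sigma / sqrt(n)

print("empirical standard error:  ", round(float(empirical_se), 3))
print("theoretical standard error:", round(float(theoretical_se), 3))
```

The spread of the simulated sample means should land close to σ/√n, whichever individual batch we happen to look at.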

Central limit theorem:

So we saw above that the distribution of the sample means looked similar to the familiar normal distribution.

If we take many samples of a sufficiently large size from a population of almost any shape (with finite variance), the distribution of the sample means will be approximately normal!

We can see from the graphs below that though the original distribution is right skewed, as we take more and more samples of size 10, the distribution of the sample means gets pretty close to a normal shape!

In [8]:
def sample_means(n_samples, sample_size=10):
    """Means of n_samples random samples of the alcohol column (seeds 0..n_samples-1)."""
    means = []
    for s in range(n_samples):
        np.random.seed(s)  # same seeding scheme as before, so draws are reproducible
        means.append(data["alcohol"].sample(sample_size).mean())
    return means

# plt.subplots already creates the 3x2 grid, so we draw on those axes directly
# instead of calling fig.add_subplot, which would stack new axes on top of them.
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 10))
axes = axes.flatten()

sns.histplot(data["alcohol"], kde=True, stat="density", ax=axes[0])
axes[0].set_title("Original distribution of Alcohol level in the entire dataset")

for ax, n_samples in zip(axes[1:], [2, 7, 20, 100, 500]):
    # histplot(kde=True) replaces the deprecated distplot
    sns.histplot(sample_means(n_samples), kde=True, stat="density", ax=ax)
    ax.set_title(f"Sample means of {n_samples} samples of 10 each")

fig.tight_layout()
plt.show()
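
The theorem is usually stated in terms of the sample size n: the larger each sample, the closer the distribution of the sample mean gets to normal. As a complementary sketch, the snippet below uses a hypothetical exponential population (synthetic, not the wine data) and tracks the skewness of the sample means as n grows. The `skewness` helper is defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Hypothetical right-skewed population: exponential with mean 1
results = {}
for n in (1, 10, 50, 200):
    # 20,000 samples of size n; one mean per sample
    means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    results[n] = skewness(means)
    print(f"sample size {n:>3}: skewness of the sample means = {results[n]:.2f}")
```

The skewness should shrink toward 0 as n increases, which is the numerical counterpart of the histograms looking more and more bell-shaped.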

Question 4:

Given that the quality checker is drawing a sample of 10 wines of quality 7 & 8, he assumes that the batch meets the alcohol level of 11.3, but he knows that it is acceptable until the batch's mean alcohol level drops to 11.15. What is the probability that the sample mean will be less than 11.15?

Assumption: Population mean μ = 11.3 and population standard deviation 𝜎 = 1.55

Here our random variable X̄ is approximately normal, as per the central limit theorem, and has a mean of μ.

Also, the standard deviation of X̄ is equal to σ / √n.

$$ P(\bar X < 11.15) = P\left(Z < \frac{11.15 - \mu}{\sigma / \sqrt{n}}\right) $$
$$ = P\left(Z < \frac{11.15 - 11.3}{1.55 / \sqrt{10}}\right) $$
$$ = P(Z < -0.30603) \approx 0.3798 $$

So if the population mean is indeed 11.3 and the population standard deviation is 1.55, then there is roughly a 38% probability that the sample mean alcohol level is less than 11.15.
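
The lookup in the standard normal table can be double-checked in code. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative, and μ and σ are the assumed values stated above.

```python
from math import sqrt
from statistics import NormalDist

mu = 11.3          # assumed population mean
sigma = 1.55       # assumed population standard deviation
n = 10             # bottles per sample
threshold = 11.15  # lowest acceptable mean alcohol level

standard_error = sigma / sqrt(n)
z = (threshold - mu) / standard_error

# P(X_bar < threshold) under the normal approximation
probability = NormalDist().cdf(z)

print("z =", round(z, 5))
print("P(X_bar < 11.15) =", round(probability, 4))
```

Computing the probability directly avoids rounding z before reading the table, which is an easy place to slip.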

Sample Proportions

The sample proportion is simply the fraction of observations in the sample that meet our criterion. It is represented by p̂:

$$ \hat p = \frac {x} {n} $$

where x is the number of observations in the sample that meet the criterion, and n is the total number of observations in the sample.

Question 5:

Taking a random sample of 13 observations, what is the sample proportion of wines with an alcohol level greater than 11.3?

In [9]:
np.random.seed(11)
sample_7_and_above = data[data['quality'].isin([7, 8])].sample(13)[['alcohol']].reset_index().drop(columns = ['index'])
sample_7_and_above
Out[9]:
alcohol
0 12.0
1 12.5
2 11.7
3 14.0
4 12.7
5 12.1
6 11.0
7 10.0
8 10.8
9 9.7
10 12.9
11 12.5
12 12.9

In the above output, 9 out of 13 observations have an alcohol level greater than 11.3. Hence the sample proportion is:

$$ \hat p = \frac {x} {n} = \frac {9} {13} = 0.69 $$
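
The count can also be done programmatically rather than by eye. Here is a short sketch on the 13 values printed above; the variable names are illustrative.

```python
# The 13 alcohol readings shown in Out[9]
alcohol = [12.0, 12.5, 11.7, 14.0, 12.7, 12.1, 11.0, 10.0, 10.8, 9.7, 12.9, 12.5, 12.9]
threshold = 11.3

x = sum(value > threshold for value in alcohol)  # observations meeting the criterion
n = len(alcohol)
p_hat = x / n

print(f"{x} of {n} readings exceed {threshold}: sample proportion = {p_hat:.2f}")
```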
