Statistics Exercises Part 4

This is the fourth of several exercise pages on statistical programming prepared for use in the course ITSE 1302 at Austin Community College.

See the document titled Statistics Exercises Part 1 for more introductory information.

In previous exercise pages you learned a little about how to use the hist method to compute and display a histogram for a dataset. We will revisit the hist method in this exercise page and you will learn how to take advantage of several keyword parameters that are available to customize the behavior of the method.

After that, we will look at some examples that demonstrate the central limit theorem in action.

Utility functions

Normal probability density function

Let's begin by defining a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.

However, this function will be more general than the similar function presented in the earlier exercise pages. That function assumed that mu was equal to zero and sigma was equal to 1, assumptions that produce what is commonly called the standard normal distribution. The following function makes no such assumptions and computes the more general probability density function.
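
For reference, the equation implemented by the normalProbabilityDensity function defined below is the normal probability density function

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

which reduces to the standard normal density when mu = 0 and sigma = 1.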

As is often the case, we will begin by importing some libraries that will be needed here and at other locations within this tutorial.

In [1]:
#Import some libraries that will be needed for this function and for other examples later in the tutorial
import math
from math import pow
import matplotlib.pyplot as plt
from statistics import mean
from statistics import median
from statistics import stdev
import random
import numpy as np

The code for the normalProbabilityDensity function follows.

In [2]:
'''
Computes and returns values for the normal probability density function

required input: x-axis value
optional input: mu
optional input: sigma

returns: The computed value of the function for the given x, mu, and sigma. If mu and sigma are not provided as
input arguments, the function defaults to returning values for the standard normal distribution curve with mu = 0
and sigma = 1.0
'''
def normalProbabilityDensity(x,mu=0,sigma=1.0):

    #e, the base of the natural logarithm
    eVal = 2.718281828459045
    #Exponent of the normal density equation for the given x, mu, and sigma
    exp = -((x-mu)**2)/(2*(sigma**2))
    #Evaluate the numerator and denominator of the density equation
    numerator = pow(eVal,exp)
    denominator = sigma*(math.sqrt(2*math.pi))
    return numerator/denominator
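
A quick sanity check of the function against known standard normal values is sketched below. With the default mu and sigma, the peak value at x = 0 should be 1 divided by the square root of 2*pi, roughly 0.3989, and the value at x = 1 should be roughly 0.2420.

#Sanity check normalProbabilityDensity against known standard normal values.
print(normalProbabilityDensity(0))               #approximately 0.3989
print(normalProbabilityDensity(1))               #approximately 0.2420
#A wider curve centered at 1 peaks at 1/(2*sqrt(2*pi)), approximately 0.1995.
print(normalProbabilityDensity(1,mu=1,sigma=2))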

Cumulative distribution function

Now define a utility function that will return the value of the cumulative normal distribution function. This function takes x as a parameter and returns the area under the standard normal distribution curve from minus infinity up to the value of x.

A call to this function is equivalent to finding a value in a Z-table.

The method for computing the cumulative probability in the following code came from Stand-alone Python implementation of phi(x) by John D. Cook.

See Standard Normal Probabilities for tables showing the values that should be returned by this function.

In [3]:
'''
Computes and returns values for the cumulative distribution function of the standard normal 
probability density function. Stated differently, computes and returns area under the standard 
normal curve from minus infinity to an input x value.

This is a closed-form approximation (Abramowitz and Stegun formula 7.1.26). A numerical 
integration version appears below for comparison.

input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value

Examples

x = -2, output = 0.022750062887256395
x = -1, output = 0.15865526383236372
x = 0, output = 0.5000000005
x = 1, output = 0.8413447361676363
x = 2, output = 0.9772499371127437

'''
def cumulativeDistribution(x):
    # constants
    a1 =  0.254829592
    a2 = -0.284496736
    a3 =  1.421413741
    a4 = -1.453152027
    a5 =  1.061405429
    p  =  0.3275911

    # Save the sign of x
    sign = 1
    if x < 0:
        sign = -1
    x = abs(x)/math.sqrt(2.0)

    # A&S formula 7.1.26
    t = 1.0/(1.0 + p*x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)

    return 0.5*(1.0 + sign*y)
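
As a cross-check, the sketch below compares the approximation above with a simple rectangular numerical integration of normalProbabilityDensity from -8 up to x. The two should agree to about three decimal places, and the value at x = 1.96 should be close to the familiar Z-table entry of 0.975. The helper numericCumulative and its step parameter are only illustrative and are not part of the exercise code.

#Compare the closed-form approximation with a crude numerical
# integration of the standard normal density from -8 up to x.
def numericCumulative(x,step=0.001):
    area = 0.0
    val = -8.0
    while val < x:
        area += normalProbabilityDensity(val)*step
        val += step
    return area

for x in [-2,-1,0,1,1.96,2]:
    print(x, cumulativeDistribution(x), numericCumulative(x))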

Test the new functions

The following code plots output from both the normalProbabilityDensity function and the cumulativeDistribution function to confirm that they operate properly.

In [4]:
#Plot the normal probability density and the normal probability 
# for the specified values of mu and sigma.
xmu = -2
print("xmu =",xmu)
xsigma = 0.59
print("xsigma =",xsigma)

fig,ax = plt.subplots(1,1)

x = np.arange(-4,4.15,0.15)
y = [normalProbabilityDensity(val,mu=xmu,sigma=xsigma) for val in x]
ax.plot(x,y,label='normal probability density',marker='.')

y = [cumulativeDistribution((val-xmu)/xsigma) for val in x]
ax.plot(x,y,label='normal probability',marker = '.')
ax.grid(True)
ax.legend()
plt.show()
xmu = -2
xsigma = 0.59

A random dataset generator

Now define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed when the keyword parameter numberSamples is set to 1, and it tends more and more toward a normal distribution as the value of numberSamples is increased.

In [5]:
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
    '''
    Create a new dataset of dataLength values consisting of the average of numberSamples 
    samples taken from a population of uniformly distributed values between lowLim 
    and highLim generated with a seed of seed.

    Input keyword parameters and their default values:

    seed = 1 seed used to generate the uniformly distributed values
    dataLength = 10000 number of samples in the returned list of values
    numberSamples = 50 number of samples taken from the uniformly distributed population
                       and averaged into the output
    lowLim = 0 lower limit value of the uniformly distributed population
    highLim = 100 high limit value of the uniformly distributed population

    returns: a list containing the dataset

    '''
    data = []
    random.seed(seed)

    for cnt in range(dataLength):
        theSum = 0
        for cnt1 in range(numberSamples):
            theSum += random.uniform(lowLim,highLim)
        data.append(theSum/numberSamples)
        
    return data

The matplotlib.axes.Axes.hist method

You will probably be using the matplotlib.axes.Axes.hist method from time to time. Therefore, it will be good to learn more about that method.

Let's begin by creating some normally distributed data to experiment with.

In [6]:
# Create a new dataset of 10000 values consisting of the average of 50 samples taken from a population of uniformly
# distributed values between 0 and 100.

data = normalRandomGenerator(dataLength=10000,numberSamples=50)
print("data length =",len(data))
print("mean =",mean(data))
print("standard deviation =",stdev(data))
data length = 10000
mean = 50.00627810855217
standard deviation = 4.106854658585184

A basic histogram

Now let's compute and display a histogram of the data using the most basic call to the hist method.

In [7]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist(data)
plt.show()

We have a histogram all right, but it is pretty crude: there are only ten bins. The hist method lets us specify the number of bins; if we don't, the default is to compute the histogram with ten bins.

Increase bins to 100

Let's tell the hist method to compute and display 100 bins for the same data through the use of the keyword parameter named bins.

In [8]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
plt.show()

That's more like it.

The vertical axis scale

Can we ascribe any meaning to the numbers on the vertical axis? Let's see what happens if we double the number of data values without materially changing the shape of the distribution. We can do that by concatenating the list containing the data values with an exact copy of itself.

In [9]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist((data + data),bins=100)
plt.show()

What we see is that the values on the vertical axis depend on the length of the dataset in addition to the shape of the distribution. That usually isn't what we want. Most of the time we are interested in seeing the shape of the distribution independent of the length of the dataset.

The normed keyword parameter

The hist method has a boolean keyword parameter named normed that is designed to take care of that issue for us. When this parameter is set to True, the results are normalized so that the total area under the histogram is equal to 1 (more on this later). (Note: newer versions of matplotlib have renamed this parameter to density; if normed raises an error in your installation, substitute density=True in the calls that follow.) Let's try it with the original dataset.

In [10]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100,normed=True)
plt.show()

Now let's try it again with the double length dataset and see what happens.

In [11]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist((data + data),bins=100,normed=True)
plt.grid(True)
plt.show()

Good! Both the shape of the distribution and the scale on the vertical axis are now independent of the length of the dataset, provided that changing the length doesn't change the distribution.

The horizontal axis scale

Now let's discuss the meaning of the horizontal scale. As you may recall, the dataset contains 10000 values consisting of the average of 50 samples taken from a population of uniformly distributed values between 0 and 100. Although a lot of averaging was done, it is theoretically possible that the dataset could still contain values near 0 or near 100. However, the hist method didn't find any values below about 35 or any values above about 65 for this dataset. (See the tails on the histogram shown above.)

By default, the method spreads the specified number of bins across the range from the smallest value to the largest value actually present in the data, producing the horizontal scale that you see above. However, we can instruct the method to compute the histogram over any range that we choose through use of the keyword parameter named range. This is illustrated in the following example, which shows the histogram for the same data computed and plotted over the full range from 0 to 100 using 100 uniformly spaced bins. Note that the peak still shows approximately the same value of about 0.1 on the vertical axis.

In [12]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)
ax.hist((data + data),bins=100,normed=True,range=(0,100))
plt.grid(True)
plt.show()

The area under the curve

Now we are prepared to discuss the meaning of the vertical scale that results when the normed parameter is set to True. Setting that parameter to True causes the output to be normalized such that the area under the entire curve from 0 through 100 in the above histogram is equal to 1.

We could write code to compute that area if we had access to the values that are plotted in the histogram. As it turns out, we do have access to those values. The hist method returns a tuple containing three things:

  1. An array (n) containing the heights of the bars that were plotted.
  2. An array (bins) containing the bin edges on the horizontal axis; there is one more edge than there are bars.
  3. A list of patches, which are the rectangle objects used to draw the bars.

(At this point, we aren't interested in the patches, but if you are, you can read about them in the matplotlib documentation.)

We are interested in the first item because, in this case, we can compute the area under the curve by simply adding the values in n. That works here because we will ask for 100 bins over the range from 0 to 100, so each bin is exactly one unit wide and the area (height times width, summed over the bins) reduces to the sum of the bar heights. This is illustrated by the following code, which shows the area under the curve to be 1.0.

In [13]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)

#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True,range=(0,100))

#Compute the area under the curve by computing the sum of the values for each of the bins.
areaUnderTheCurve = 0
for val in n:
    areaUnderTheCurve += val
print("areaUnderTheCurve =",areaUnderTheCurve)

plt.show()
areaUnderTheCurve = 1.0

The situation is a little more involved when we allow the method to compute and display only the portion of the range that actually contains data. In that case each bin is narrower than one unit, so simply summing the bar heights over-counts the area by a factor of one over the bin width. To recover an area of 1.0, the sum must be multiplied by the bin width, which is the span of the displayed range (bins[-1] minus bins[0], stored in the variable binsDisplayed below) divided by the number of bins specified. This is illustrated in the following code, which again shows the area under the curve to be 1.0.
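
In symbols, the normed normalization guarantees that the bar heights n_i and the common bin width Delta always satisfy

$$\sum_i n_i\,\Delta = 1, \qquad \Delta = \frac{\text{bins[-1]} - \text{bins[0]}}{\text{number of bins specified}}$$

so the sum of the bar heights alone equals 1/Delta, and multiplying that sum by Delta restores the area of 1.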

In [14]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)

#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True)

areaUnderTheCurve  = 0
binsSpecified = 100

#Get the span of the displayed range in data units (bins[-1] - bins[0])
binsDisplayed = bins[len(bins)-1] - bins[0]
print("binsDisplayed =",binsDisplayed)

for val in n:
    areaUnderTheCurve  += val
areaUnderTheCurveForTheFullHistogram = areaUnderTheCurve * binsDisplayed/binsSpecified
print("areaUnderTheCurveForTheFullHistogram  =",areaUnderTheCurveForTheFullHistogram)

plt.show()
binsDisplayed = 33.0210885575
areaUnderTheCurveForTheFullHistogram  = 1.0
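
An alternative that avoids tracking the bin count is to multiply each bar height by the width of its own bin and sum the products. Here is a minimal sketch of that approach, assuming only the numpy import made earlier; np.diff(bins) gives the individual bin widths.

data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)

#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True)

#Multiply each bar height by the width of its bin and sum the products.
#For a normed (density) histogram the result should be 1.0.
areaUnderTheCurve = np.sum(n * np.diff(bins))
print("areaUnderTheCurve =",areaUnderTheCurve)

plt.show()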

Plot a normal probability density curve on the histogram

Earlier we developed a function named normalProbabilityDensity that can be called to return the values for a normal probability density curve for a given value of mu and a given value of sigma.

The next example creates such a curve for the same mu and sigma as the data and then superimposes that curve on the histogram of the data. As you can see, the histogram of the data is a reasonably good fit to the normal probability density curve.

In [15]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)

fig,ax = plt.subplots(1,1)

#Get and print mean and standard deviation of the dataset
xmu = mean(data)
xsigma = stdev(data)
print("mu =",xmu)
print("sigma =",xsigma)

#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True)

#Compute the values for a normal probability density curve 
# for the same mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.

x = np.arange(bins[0],bins[len(bins)-1],1.0)
y = [normalProbabilityDensity(val,mu=xmu,sigma=xsigma) for val in x]
ax.plot(x,y,label='normal probability density',marker='o')

#Add some cosmetics and show the plot
ax.legend()
ax.grid(True)
plt.show()
mu = 50.00627810855217
sigma = 4.106854658585184

Cumulative probability distribution

The hist method can also be called to compute and plot a cumulative probability distribution by setting the keyword parameter named cumulative to True.

The following example plots the cumulative probability distribution of the data and superimposes a cumulative normal probability distribution curve on the data. As you can see, the fit between the two is quite good.

In [16]:
data = normalRandomGenerator(dataLength=10000,numberSamples=50)
fig,ax = plt.subplots(1,1)

#Get and print mean and standard deviation of the dataset
xmu = mean(data)
xsigma = stdev(data)
print("mu =",xmu)
print("sigma =",xsigma)

#Plot the histogram for the data and unpack the tuple that is returned.
n,bins,patches = ax.hist(data,bins=100,normed=True,cumulative=True)

#Compute the values for a cumulative normal distribution curve for 
# the same mu and sigma across the same range of values and
# superimpose it on the histogram.

x = np.arange(bins[0],bins[len(bins)-1],1.0)
y = [cumulativeDistribution((val-xmu)/xsigma) for val in x]
ax.plot(x,y,label='normal probability',marker = '.')

#Add some cosmetics and show the plot
ax.legend()
ax.grid(True)
plt.show()
mu = 50.00627810855217
sigma = 4.106854658585184

The central limit theorem in action

Now let's look at some examples that show the central limit theorem in action.

Note: In order to copy this code and run it under the Python interpreter, you will need to copy and paste the next several sections of code into the same Python source code file. You will also need to dismiss each plot from the screen to cause the next plot to appear.

The plotIt function

We will begin by creating a convenience function named plotIt that can produce the desired plot for a given dataset as an input parameter. Then we will apply that function to a series of datasets created by averaging progressively increasing numbers of samples taken from the same population with a uniform distribution.

In [17]:
def plotIt(data):

    fig,ax = plt.subplots(1,1)

    #Get and print mean and standard deviation of the dataset
    xmu = mean(data)
    xsigma = stdev(data)
    print("mu =",xmu)
    print("sigma =",xsigma)

    #Plot the histogram for the data and unpack the tuple that is returned.
    n,bins,patches = ax.hist(data,bins=100,normed=True)

    #Compute the values for a normal probability density curve 
    # for the same mu and sigma across the same range of values and
    # superimpose the normal probability density curve on the histogram.
    
    x = np.arange(bins[0],bins[len(bins)-1],0.25)
    y = [normalProbabilityDensity(val,mu=xmu,sigma=xsigma) for val in x]
    ax.plot(x,y,label='normal probability density')
    
    #Add some cosmetics and show the plot
    ax.legend(loc='lower left')
    ax.grid(True)
    plt.show()

Uniformly distributed data

We will begin by applying the plotIt function to a dataset of 10000 values taken from a population of uniformly distributed values between 0 and 100. As you can see, the uniform distribution of the data is not a good fit for the normal probability density curve.

Also make note of the sigma for this population as we will be discussing it further later on.

In [18]:
# Create a new dataset of 10000 values taken from a population of uniformly distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=1)
    
plotIt(data)
mu = 49.801785158479724
sigma = 29.049719939753082

The average of two samples

Now let's apply the plotIt function to a new dataset of 10000 values containing the average of 2 samples taken from the same population of uniformly distributed values between 0 and 100.

As you can see, the sigma decreased by a factor of approximately the square root of two, as it should. As you can also see, the distribution of the new dataset has taken on a decidedly normal characteristic after averaging only two samples.

In [19]:
# Create a new dataset of 10000 values consisting of the average of 2 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=2)
    
plotIt(data)
mu = 49.984351971052014
sigma = 20.553291369934513

The average of four samples

The next eight examples show the results of progressively doubling the number of samples averaged into each value of the dataset, until we end up with a dataset of 10000 values, each containing the average of 512 samples taken from the same population of uniformly distributed values between 0 and 100.

As you scroll down through these examples, note that sigma decreases by a factor of approximately the square root of two with each doubling of the number of samples averaged into each value. This is evidenced by the numeric output and also by the fact that the hist method elects to display a narrower and narrower portion of the full 0-to-100 range with each doubling. The short check that follows shows the values we should expect.
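
For a uniform distribution between lowLim and highLim, the population standard deviation is (highLim - lowLim) divided by the square root of 12, or about 28.87 for limits of 0 and 100, and the standard deviation of the average of numberSamples independent samples is that value divided by the square root of numberSamples. The following sketch, using only the math import made earlier, prints the theoretical sigma for each of the sample counts used below; the printed values should track the sigma values reported by plotIt (for example, about 28.87 and 20.41 for one and two samples, compared with the observed 29.05 and 20.55).

#Theoretical sigma for the average of n samples drawn from a
# uniform distribution between 0 and 100.
populationSigma = 100/math.sqrt(12)   #about 28.87
for n in [1,2,4,8,16,32,64,128,256,512]:
    print("numberSamples =",n,"theoretical sigma =",populationSigma/math.sqrt(n))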

In [20]:
# Create a new dataset of 10000 values consisting of the average of 4 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=4)
    
plotIt(data)
mu = 49.96087303836365
sigma = 14.514947640161623

The average of eight samples

In [21]:
# Create a new dataset of 10000 values consisting of the average of 8 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=8)
    
plotIt(data)
mu = 50.06010563283642
sigma = 10.235855014631102

The average of sixteen samples

In [22]:
# Create a new dataset of 10000 values consisting of the average of 16 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=16)
    
plotIt(data)
mu = 50.04858389060665
sigma = 7.259046779352402

The average of 32 samples

In [23]:
# Create a new dataset of 10000 values consisting of the average of 32 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=32)
    
plotIt(data)
mu = 50.01598009926033
sigma = 5.149117700629042

The average of 64 samples

In [24]:
# Create a new dataset of 10000 values consisting of the average of 64 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=64)
    
plotIt(data)
mu = 50.02487106929253
sigma = 3.597019328061025

The average of 128 samples

In [25]:
# Create a new dataset of 10000 values consisting of the average of 128 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=128)
    
plotIt(data)
mu = 50.00256987947287
sigma = 2.5439179950418715

The average of 256 samples

In [26]:
# Create a new dataset of 10000 values consisting of the average of 256 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=256)
    
plotIt(data)
mu = 50.00397299111193
sigma = 1.7904633206250304

The average of 512 samples

In [27]:
# Create a new dataset of 10000 values consisting of the average of 512 samples taken from a population of uniformly
# distributed values between 0 and 100.
data = normalRandomGenerator(dataLength=10000,numberSamples=512)
    
plotIt(data)
mu = 49.98389212033013
sigma = 1.264458435829136

Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX

File: StatisticsPart04.ipynb

Revised: 04/21/18

Copyright 2018 Richard G. Baldwin

-end-