This is the third of several exercise pages on statistical programming prepared for use in the course ITSE 1302 at Austin Community College.
See the document titled Statistics Exercises Part 1 for more introductory information.
Let's begin by importing the libraries required by the code later in the tutorial.
import math
from math import pow
import matplotlib.pyplot as plt
from statistics import mean
from statistics import median
from statistics import stdev
from statistics import pstdev
from statistics import mode
from statistics import StatisticsError
import random
from random import randint
from collections import Counter
import numpy as np
We will begin by defining and testing a utility function from which you can obtain values for the standard normal probability density function. See Normal Distribution Density for a table of the expected values See Engineering Statistics Handbook for a definition of the equation for the standard normal probability density function.
'''
Computes and return values for the standard normal probabilty density function
input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value
Examples
x = -2, output = 0.05399096651318806
x = -1, output = 0.24197072451914337
x = 0, output = 0.3989422804014327
x = 1, output = 0.24197072451914337
x = 2, output = 0.05399096651318806
'''
def standardNormalDistribution(x):
eVal = 2.718281828459045
exp = (-x**2)/2
numerator = pow(eVal,exp)
denominator = math.sqrt(2*math.pi)
return numerator/denominator
Now let's define a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.
This function is more general than the similar function defined above. That function assumed that mu was equal to zero and sigma was equal to 1. Those assumptions result in a function commonly called the standard normal distribution. This function makes no such assumption and results in the more general probability density function.
'''
Computes and return values for the normal probabilty density function
required input: x-axis value
optional input: mu
optional input: sigma
returns: The computed value of the function for the given x, mu, and sigma. If mu and sigma are not provided as
input arguments, the method reverts to returning values for the standard normal distribution curve with mu = 0
and sigma = 1.0
'''
def normalProbabilityDensity(x,mu=0,sigma=1.0):
eVal = 2.718281828459045
exp = -((x-mu)**2)/(2*(sigma**2))
numerator = pow(eVal,exp)
denominator = sigma*(math.sqrt(2*math.pi))
return numerator/denominator
Now define a utility function that will return the value of the cumulative normal distribution function. This function takes x as a parameter and returns the area under the standard normal distribution curve from minus infinity up to the value of x.
A call to this function is equivalent to finding a value in a Z-table.
The method for computing the cumulative probability in the following code came from Stand-alone Python implementation of phi(x) by John D. Cook.
See Standard Normal Probabilities for tables showing the values that should be returned by this function.
'''
Computes and returns values for the cumulative distribution function of the standard normal
probability density function. Stated differently, computes and returns area under the standard
normal curve from minus infinity to an input x value.
This is a series implementation. A digital numeric integration version is provided later in
this document for comparison.
input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value
Examples
x = -2, output = 0.022750062887256395
x = -1, output = 0.15865526383236372
x = 0, output = 0.5000000005
x = 1, output = 0.8413447361676363
x = 2, output = 0.9772499371127437
'''
def cumulativeDistribution(x):
# constants
a1 = 0.254829592
a2 = -0.284496736
a3 = 1.421413741
a4 = -1.453152027
a5 = 1.061405429
p = 0.3275911
# Save the sign of x
sign = 1
if x < 0:
sign = -1
x = abs(x)/math.sqrt(2.0)
# A&S formula 7.1.26
t = 1.0/(1.0 + p*x)
y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)
return 0.5*(1.0 + sign*y)
Now define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword paramater value of numberSamples=1. The resulting dataset tends more towards normal as the value of numberSamples is increased.
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
'''
Create a new dataset of dataLength values consisting of the average of numberSamples
samples taken from a population of uniformly distributed values between lowLim
and highLim generated with a seed of seed.
Input keyword parameters and their default values:
seed = 1 seed used to generate the uniformly distributed values
dataLength = 10000 number of samples in the returned list of values
numberSamples = 50 number of samples taken from the uniformly distributed population
and averaged into the output
lowLim = 0 lower limit value of the uniformly distributed population
highLim = 100 high limit value of the uniformly distributed population
returns: a list containing the dataset
'''
data = []
random.seed(seed)
for cnt in range(dataLength):
theSum = 0
for cnt1 in range(numberSamples):
theSum += random.uniform(lowLim,highLim)
data.append(theSum/numberSamples)
return data
'''
Plots a relative frequency histogram for an incoming dataset on a figure having one row
and one column.
Also creates and plots a normal prpobability density curve on the histogram based on the
mean and standard deviation of the dataset.
'''
def plotSimpleHistogram(data):
dataBar = mean(data)
dataStd = stdev(data)
#Create a figure with one row and one column
fig,axes = plt.subplots(1,1)
#Plot and label histogram
dataN,dataBins,dataPat = axes.hist(data,bins=136,normed=True,range=(min(data),max(data)))
axes.set_title('data')
axes.set_xlabel('x')
axes.set_ylabel('Relative Freq')
#Compute the values for a normal probability density curve for the
# data mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
x = np.arange(dataBins[0],dataBins[len(dataBins)-1],0.1)
y = [normalProbabilityDensity(val,mu=dataBar,sigma=dataStd) for val in x]
axes.plot(x,y,label='normal probability density')
axes.grid(True)
plt.tight_layout()
plt.show()
Test the function named plotSimpleHistogram.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=100,lowLim=0,highLim=1000)
plotSimpleHistogram(data)
We can modify the mean, median, mode, and standard deviation of a dataset in a controlled manner through simple addition, subtraction, and multiplication. The next few examples will demonstrate this on a step-by-step basis.
(Note that these code samples all relate back to the original dataset.)
We will begin by creating a new dataset of 10,000 values containing the average of 512 samples taken from a population of uniformly distributed values between 0 and 1000. A histogram for the dataset is shown below along with printed values for the mean and the standard deviation.
Note: In order to copy this code and run it under the Python interpreter, you will need to copy and paste the next four sections of code into the same Python source code file. You will also need to dismiss each plot from the screen to cause the next plot to appear.
# Create a new dataset of 10000 values consisting of the average of 512 samples taken from a population of uniformly
# distributed values between 0 and 1000.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=512,lowLim=0,highLim=1000)
print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
All that is necessary to change the mean of a dataset is to add a constant to or subtract a constant from every value in the dataset.
The following code shifts the mean for our dataset to 0 by subtracting the mean value from every value in the dataset. A histogram for the modified dataset is shown below with a new mean of 0 and no change in the standard deviation.
#Get the mean for the original dataset and subtract it from every value in the dataset.
mu = mean(data)
for cnt in range(len(data)):
data[cnt] -= mu
print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
We can modify the standard deviation for a dataset by shifting the mean to zero and then multiplying every value in the dataset by a constant. After that, you can restore the original mean by adding the original mean value to every value in the dataset.
The following code will reduce the standard deviation by a factor of 2 by multiplying every value by 0.5. (In this case, the mean had already been shifted to zero.) You can see a histogram of the modified dataset below with the new standard deviation and no change in the current value of the mean.
#Multiply each value by 0.5 to reduce sigma by a factor of two.
for cnt in range(len(data)):
data[cnt] *= 0.5
print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
At this point, the mean for the dataset is still zero. We can change it to any value we choose by adding a constant to every value in the dataset. We will change the mean to -250 by subtracting 250 from every value in the dataset.
#Shift mu to -4250.
for cnt in range(len(data)):
data[cnt] -= 250
print("mean =",mean(data))
print("standard deviation =",stdev(data))
plotSimpleHistogram(data)
By performing these simple steps, we began with a reasonably normal dataset having a mean of 499.83 and a standard deviation of 12.64 and transformed it into a reasonably normal dataset having a mean of -250.0 and a standard deviation of 6.32.
Previous examples have mentioned a "normal" distribution. This section shows plots of a theoretical continuous distribution commonly referred to as a "normal" distribution or a "gaussian" distribution.
For purposes of review, the following code calls one of the utility methods defined earlier to plot a standard normal distribution from -3 standard deviations to +3 standard deviations. Students are not expected to understand the annotation code in this example at this point in the course, but will learn about annotating plots later in the course.
Note that the standard normal distribution has a mean value of zero and a standard deviation has a value of 1.0 as indicated in the following plot.
x = np.arange(-3,3,0.1)
y = [standardNormalDistribution(val) for val in x]
#Use matplotlib to plot the data.
plt.plot(x,y)
plt.grid(True)
plt.ylabel('Probability Density')
plt.xlabel('x')
plt.title('Standard Normal Distribution')
#Now add some annotation to the plot. Make things interesting by
# rendering the annotation in LaTeX.
plt.annotate('$+1 \sigma$',
xy=[1,0.23],xycoords='data',
xytext=(+10, +30), textcoords='offset pixels', fontsize=12,
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.annotate('$-2 \sigma$',
xy=[-2,0.05],xycoords='data',
xytext=(-50, +15), textcoords='offset pixels', fontsize=12,
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.annotate('$\mu$',
xy=[0,0.4],xycoords='data',
xytext=(-20, -30), textcoords='offset pixels', fontsize=12,
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.show()
We can standardize a score on the x-axis of a normal distribution as follows:
Z = (x - mu)/sigma
Create a dataset of 1000 normally distributed values in the range from 0 to 1000 and plot a histogram for the data.
Compute the Z-score for the value 450. Use the Z-score to find and print the probability that a value in the dataset will be less than 450. Express the probability as a decimal value, not as a percentage.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
x = 450
print("x =",x)
#Compute the Z-score
z = (x - xBar)/stDev
print("z =",z)
#Use the cumulative distribution to find the proportion of the area under the standard
#normal distribution curve that is to the left of the value 450.
probability = cumulativeDistribution(z)
print("probability that x is < 450 =",probability)
plotSimpleHistogram(data)
Confirm the probability that x is less than 450 by iterating through the actual data and counting the number of values that are less than 450.
The results match very well with the theoretical value given above. This indicates that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
lt = 0
gte = 0
for val in data:
if val < 450:
lt += 1
else:
gte += 1
probability = lt/(lt + gte)
print("probability that x is < 450 =",probability)
plotSimpleHistogram(data)
Create a dataset of 1000 normally distributed values in the range from 0 to 1000 and plot a histogram for the data.
Compute the Z-score for the value 450. Use the Z-score to find and print the probability that a value in the dataset will be greater than 450. Express the probability as a decimal value, not as a percentage.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
x = 450
print("x =",x)
#Compute the Z-score
z = (x - xBar)/stDev
print("z =",z)
#Use the cumulative distribution to find the proportion of the area under the standard
#normal distribution curve that is to the left of the value 450.
probability = cumulativeDistribution(z)
print("probability that x is > 450 =",1 - probability)
plotSimpleHistogram(data)
Confirm the probability that x is greater than 450 by iterating through the actual data and counting the number of values that are greater than 450.
The results match very well with the theoretical value given above. This indicates that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
lt = 0
gte = 0
for val in data:
if val < 450:
lt += 1
else:
gte += 1
probability = gte/(lt + gte)
print("probability that x is < 450 =",probability)
plotSimpleHistogram(data)
Create a dataset of 1000 uniformly distributed values in the range from 0 to 1000 and plot a histogram for the data.
Compute the Z-score for the value 450. Use the Z-score to find and print the probability that a value in the dataset will be less than 450. Express the probability as a decimal value, not as a percentage.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=1,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
x = 450
print("x =",x)
#Compute the Z-score
z = (x - xBar)/stDev
print("z =",z)
#Use the cumulative distribution to find the proportion of the area under the standard
#normal distribution curve that is to the left of the value 450.
probability = cumulativeDistribution(z)
print("probability that x is < 450 =",probability)
plotSimpleHistogram(data)
For the same dataset of 1000 uniformly distributed values in the range from 0 to 1000, confirm the probability that x is less than 450 by iterating through the actual data and counting the number of values that are less than 450.
The probability results match surprisingly well with the theoretical value given above considering that the dataset does not have a normal distribution.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=1,lowLim=0,highLim=1000)
lt = 0
gte = 0
for val in data:
if val < 450:
lt += 1
else:
gte += 1
probability = lt/(lt + gte)
print("probability that x is < 450 =",probability)
plotSimpleHistogram(data)
Find and print the probability that a value in the dataset will be within one sigma of the mean. Express the probability as a decimal value, not as a percentage.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
xHigh = xBar + stDev
print("xHigh =",xHigh)
zHigh = (xHigh - xBar)/stDev
print("zHigh =",zHigh)
xLow = xBar - stDev
print("xLow =",xLow)
zLow = (xLow - xBar)/stDev
print("zLow =",zLow)
probability = cumulativeDistribution(zHigh) - cumulativeDistribution(zLow)
print("probability that x is within one sigma of the mean =",probability)
plotSimpleHistogram(data)
Confirm the probability that a value in the dataset will be within one sigma of the mean by iterating through the data and counting the number of values that are within plus or minus one sigma. The results match very well with those given above and indicate that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
inside = 0
outside = 0
for val in data:
if (val > (xBar - stDev)) and (val < (xBar + stDev)):
inside += 1
else:
outside += 1
probability = inside/(inside + outside)
print("probability that x is within one sigma of the mean =",probability)
plotSimpleHistogram(data)
Find and print the probability that a value in the dataset will be greater one sigma above the mean. Express the probability as a decimal value, not as a percentage.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
x = xBar + stDev
print("x =",x)
z = (x - xBar)/stDev
print("z =",z)
probability = 1 - cumulativeDistribution(z)
print("probability that x is greater than one sigma above the mean =",probability)
plotSimpleHistogram(data)
Confirm the probability that a value in the dataset will be greater than one sigma above the mean by iterating through the data and counting the number of values that are greater than one sigma above the mean. The results match those given above very well and indicate that the normal probabilty density function can be used to predict the frequency of occurrence of values in this dataset.
data = []
random.seed(1)
data = normalRandomGenerator(dataLength=10000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
gt = 0
lte = 0
for val in data:
if val > (xBar + stDev):
gt += 1
else:
lte += 1
probability = gt/(gt + lte)
print("probability that x is greater than one sigma above the mean =",probability)
plotSimpleHistogram(data)
When you roll a tetrahedral die, the outcome can be 1, 2, 3, or 4.
When you roll two tetrahedral dice and add the individual outcomes, the sum can be one of 16 values ranging from 2 through 8.
What is the probability that the sum will be 6 or greater when you roll two dice?
The following code approaches a solution to this question in three different ways, all of which produce approximately the same result.
print("First approach")
oneDie = [1,2,3,4]
print("Get and print all possible sums of rolling two dice")
allPosSums = []
for valA in oneDie:
for valB in oneDie:
allPosSums.append(valA + valB)
allPosSums.sort()
print(allPosSums)
print("Compute the probability by counting values")
cnt = 0
for val in allPosSums:
if val >= 6:
cnt += 1
print("Probability by counting values =",cnt/len(allPosSums))
print()
print("Second approach")
print("Estimate the probability using the Z-score and an assumption of a normal distribution")
print("on the population of all sums")
print("Find the probability that the sum will be greater than 5.5")
meanOfSums = mean(allPosSums)
popStd = pstdev(allPosSums)
sum = 5.5
z = (sum - meanOfSums)/popStd
probability = 1 - cumulativeDistribution(z)
print("probability using Z-score =",probability)
plotSimpleHistogram(allPosSums)
print()
print("Third approach")
print("Estimate the probability using the Z-score and 10000 samples taken from")
print("uniformly distributed random integer data in the range of 1 through 4.")
dataA = []
random.seed(1)
for cnt in range(1,10000):
dataA.append(randint(1,4))
dataB = []
random.seed(2)
for cnt in range(1,10000):
dataB.append(randint(1,4))
#Add the random values in pairs.
data = []
for cnt in range(len(dataA)):
data.append(dataA[cnt] + dataB[cnt])
xBar = mean(data)
stDev = stdev(data)
print("Find the probability that the sum will be greater than 5.5")
x = 5.5
z = (x - xBar)/stDev
probability = 1 - cumulativeDistribution(z)
print("probability using random data =",probability)
plotSimpleHistogram(data)
Create a population dataset consisting of 49,000 uniformly distributed values ranging from 0 to 1000. Compute the mean, the median, and the population standard deviation of the data and plot a histogram of the data using matplotlib.
pop = normalRandomGenerator(dataLength=49000,numberSamples=1,lowLim=0,highLim=1000)
print("data length =",len(pop))
print("mean =",mean(pop))
print("median =",median(pop))
print("pstdev =", pstdev(pop))
plotSimpleHistogram(pop)
Create the same dataset as before consisting of 49,000 uniformly distributed values ranging from 0 to 1000. Break the data into 49 samples of 1000 values each and average the samples to produce a single sample dataset of 1000 values. Compute the mean, the median, and the standard deviation of the data and plot a histogram of the data. Note that the histogram tends to be normal and the standard deviation is reduced by approximately 7, which is the square root of the number of samples (49). Also note that the mean and the median remain approximately the same as before. Thus the central limit theorem is confirmed.
This is the basis of the function named normalRandomGenerator that we have been using since early in the course to produce relatively normal datasets that were guaranteed to fall within a specified range of values.
samp = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
print("data length =",len(samp))
print("mean =",mean(samp))
print("median =",median(samp))
print("stdev =", stdev(samp))
plotSimpleHistogram(samp)
Note: In order to copy this code and run it under the Python interpreter, you will need to copy and paste the next two sections of code into the same Python source code file. You will also need to dismiss each plot from the screen to cause the next plot to appear.
Problem: We have a population containing 49,000 values that is known to have a population sigma of 289.69. We need to extract a sample from that population containing at least 1000 values that will have a sigma in the range of 47 to 49. Write a program to accomplish that and display the mean, median, sigma, and a histogram of the original population and the resulting sample.
First gather statistics and display a histogram of the raw data.
#Create the population
pop = normalRandomGenerator(dataLength=49000,numberSamples=1,lowLim=0,highLim=1000)
#Get and display statistics
print("population data length =",len(pop))
print("mean =",mean(pop))
print("median =",median(pop))
print("pstdev =", pstdev(pop))
print()
print("Display the histogram of the population")
plotSimpleHistogram(pop)
Now create the sample and display its statistics.
#Compute the number of values that must be averaged to produce the sample dataset
#with the required statistics.
targetSigma = 48
print("targetSigma =",targetSigma)
numberVal = (round(pstdev(pop)/targetSigma))**2
print("numberVal =",numberVal)
sampleLength = len(pop)//numberVal
sampAvg = []
print("Extract and average the required number of values from the population")
sampAvg = normalRandomGenerator(dataLength=sampleLength,numberSamples=numberVal,lowLim=0,highLim=1000)
#Display the statistics of the new dataset
print("data length =",len(sampAvg))
print("mean =",mean(sampAvg))
print("median =",median(sampAvg))
print("stdev =", stdev(sampAvg))
#Display the histagram
plotSimpleHistogram(sampAvg)
--To be continued in the tutorial titled Statistics Exercises Part 4--
Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX
File: StatisticsPart03.ipynb
Revised: 04/21/18
Copyright 2018 Richard G. Baldwin
-end-