This is the second of several exercise pages on statistical programming prepared for use in the course ITSE 1302 at Austin Community College.
See the document titled Statistics Exercises Part 1 for more introductory information.
Let's begin by importing the libraries required by the code later in the tutorial.
import math
from math import pow
import matplotlib.pyplot as plt
from statistics import mean
from statistics import median
from statistics import stdev
from statistics import mode
from statistics import StatisticsError
import random
from collections import Counter
import numpy as np
This section defines several utility functions used by code later in the tutorial.
We will begin by defining and testing a utility function from which you can obtain values for the standard normal probability density function. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the standard normal probability density function.
'''
Computes and returns values for the standard normal probability density function
input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value
Examples
x = -2, output = 0.05399096651318806
x = -1, output = 0.24197072451914337
x = 0, output = 0.3989422804014327
x = 1, output = 0.24197072451914337
x = 2, output = 0.05399096651318806
'''
def standardNormalDistribution(x):
    eVal = 2.718281828459045
    exp = (-x**2)/2
    numerator = pow(eVal,exp)
    denominator = math.sqrt(2*math.pi)
    return numerator/denominator
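As a quick sanity check, the function's output can be compared against the tabulated values quoted in the docstring above. This is a self-contained sketch: the function is restated here in a compact form using math.exp, which computes the same quantity as raising e to a power.

```python
import math

def standardNormalDistribution(x):
    # Standard normal pdf: exp(-x**2/2) / sqrt(2*pi)
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

# Compare against the tabulated values quoted in the docstring.
for x, expected in [(-2, 0.05399096651318806),
                    (0, 0.3989422804014327),
                    (2, 0.05399096651318806)]:
    assert abs(standardNormalDistribution(x) - expected) < 1e-12
print("standardNormalDistribution matches the tabulated values")
```

Note that the curve is symmetric about zero, which is why the outputs for x = -2 and x = 2 are identical.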
Get and plot the standard normal probability density function from -3 sigma to +3 sigma.
#Create an array of uniformly spaced values
x = np.arange(-3,3,0.1)
#Create a list containing a distribution value for each value in x
# using list comprehension.
y = [standardNormalDistribution(val) for val in x]
#Use matplotlib to plot the data.
plt.plot(x,y)
plt.ylabel('Probability Density')
plt.xlabel('x')
plt.show()
Now let's define a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.
This function is more general than the similar function defined above. That function assumed that mu was equal to zero and sigma was equal to 1. Those assumptions result in a function commonly called the standard normal distribution. This function makes no such assumption and results in the more general probability density function.
'''
Computes and returns values for the normal probability density function
required input: x-axis value
optional input: mu
optional input: sigma
returns: The computed value of the function for the given x, mu, and sigma. If mu and sigma are not provided as
input arguments, the method reverts to returning values for the standard normal distribution curve with mu = 0
and sigma = 1.0
'''
def normalProbabilityDensity(x,mu=0,sigma=1.0):
    eVal = 2.718281828459045
    exp = -((x-mu)**2)/(2*(sigma**2))
    numerator = pow(eVal,exp)
    denominator = sigma*(math.sqrt(2*math.pi))
    return numerator/denominator
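Two properties of this function are worth confirming: with the default arguments it reduces to the standard normal distribution, and for any mu and sigma the total area under the curve is 1. The following self-contained sketch (the function is restated so the snippet runs on its own) checks both, approximating the area with a simple Riemann sum.

```python
import math

def normalProbabilityDensity(x, mu=0, sigma=1.0):
    # General normal pdf: exp(-(x-mu)**2 / (2*sigma**2)) / (sigma*sqrt(2*pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# With the defaults it reduces to the standard normal pdf at x = 0.
assert abs(normalProbabilityDensity(0) - 0.3989422804014327) < 1e-12

# For any mu and sigma the area under the curve is 1.
# Approximate it with a Riemann sum over +/- 8 sigma.
mu, sigma, step = 1.0, 0.5, 0.001
xs = [mu - 8 * sigma + i * step for i in range(int(16 * sigma / step))]
area = sum(normalProbabilityDensity(x, mu, sigma) * step for x in xs)
assert abs(area - 1.0) < 1e-3
print("area under the curve =", round(area, 6))
```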
Get and plot the normal probability density function from -3 sigma to +3 sigma for mu = 1 and sigma = 0.5.
x = np.arange(-2,4,0.1)
y = [normalProbabilityDensity(val,mu=1,sigma=0.5) for val in x]
#Use matplotlib to plot the data.
plt.plot(x,y)
plt.ylabel('Probability Density')
plt.xlabel('x')
plt.show()
Now define a utility function that will return the value of the cumulative normal distribution function. This function takes x as a parameter and returns the area under the standard normal distribution curve from minus infinity up to the value of x.
A call to this function is equivalent to finding a value in a Z-table.
The method for computing the cumulative probability in the following code came from Stand-alone Python implementation of phi(x) by John D. Cook.
See Standard Normal Probabilities for tables showing the values that should be returned by this function.
'''
Computes and returns values for the cumulative distribution function of the standard normal
probability density function. Stated differently, computes and returns area under the standard
normal curve from minus infinity to an input x value.
This is a series implementation. A digital numeric integration version is provided later in
this document for comparison.
input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value
Examples
x = -2, output = 0.022750062887256395
x = -1, output = 0.15865526383236372
x = 0, output = 0.5000000005
x = 1, output = 0.8413447361676363
x = 2, output = 0.9772499371127437
'''
def cumulativeDistribution(x):
    # constants
    a1 = 0.254829592
    a2 = -0.284496736
    a3 = 1.421413741
    a4 = -1.453152027
    a5 = 1.061405429
    p = 0.3275911
    # Save the sign of x
    sign = 1
    if x < 0:
        sign = -1
    x = abs(x)/math.sqrt(2.0)
    # A&S formula 7.1.26
    t = 1.0/(1.0 + p*x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)
    return 0.5*(1.0 + sign*y)
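The Abramowitz & Stegun formula 7.1.26 used above is an approximation, so it is worth checking against the exact expression for the normal CDF, which can be written in terms of the error function available in Python's standard library: Phi(x) = (1 + erf(x/sqrt(2)))/2. The following self-contained sketch (the approximation is restated so the snippet runs on its own) shows that the two agree to better than six decimal places.

```python
import math

def cumulativeDistribution(x):
    # Abramowitz & Stegun formula 7.1.26 approximation to the standard normal CDF
    a1, a2, a3, a4, a5 = (0.254829592, -0.284496736, 1.421413741,
                          -1.453152027, 1.061405429)
    p = 0.3275911
    sign = -1 if x < 0 else 1
    x = abs(x) / math.sqrt(2.0)
    t = 1.0 / (1.0 + p * x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1) * t * math.exp(-x*x)
    return 0.5 * (1.0 + sign * y)

def exactCdf(x):
    # Exact standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The approximation tracks the exact value closely across the range.
for x in (-2, -1, 0, 1, 2):
    assert abs(cumulativeDistribution(x) - exactCdf(x)) < 1e-6
print("series approximation agrees with math.erf to within 1e-6")
```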
Get and plot the cumulative distribution from -3 sigma to +3 sigma.
x = np.arange(-3,3,0.1)
y = [cumulativeDistribution(val) for val in x]
#Use matplotlib to plot the data.
plt.plot(x,y)
plt.ylabel('Cumulative Probability')
plt.xlabel('x')
plt.show()
Now define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword parameter value of numberSamples=1. The resulting dataset tends closer to normal as the value of numberSamples is increased.
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
    '''
    Create a new dataset of dataLength values consisting of the average of numberSamples
    samples taken from a population of uniformly distributed values between lowLim
    and highLim generated with a seed of seed.
    Input keyword parameters and their default values:
    seed = 1 seed used to generate the uniformly distributed values
    dataLength = 10000 number of samples in the returned list of values
    numberSamples = 50 number of samples taken from the uniformly distributed population
        and averaged into the output
    lowLim = 0 lower limit value of the uniformly distributed population
    highLim = 100 high limit value of the uniformly distributed population
    returns: a list containing the dataset
    '''
    data = []
    random.seed(seed)
    for cnt in range(dataLength):
        theSum = 0
        for cnt1 in range(numberSamples):
            theSum += random.uniform(lowLim,highLim)
        data.append(theSum/numberSamples)
    return data
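The reason averaging works here is the Central Limit Theorem: averages of independent uniform draws become approximately normal, and their spread shrinks like 1/sqrt(numberSamples). The following self-contained sketch (the generator is restated in compact form so the snippet runs on its own) demonstrates the shrinking spread: going from 1 sample to 25 samples per average should reduce the standard deviation by roughly a factor of sqrt(25) = 5.

```python
import random
from statistics import stdev

def normalRandomGenerator(seed=1, dataLength=10000, numberSamples=50,
                          lowLim=0, highLim=100):
    # Each output value is the average of numberSamples uniform draws.
    random.seed(seed)
    data = []
    for _ in range(dataLength):
        theSum = sum(random.uniform(lowLim, highLim)
                     for _ in range(numberSamples))
        data.append(theSum / numberSamples)
    return data

# Spread of the averages shrinks like 1/sqrt(numberSamples).
s1 = stdev(normalRandomGenerator(numberSamples=1, dataLength=5000))
s25 = stdev(normalRandomGenerator(numberSamples=25, dataLength=5000))
print("ratio of spreads:", round(s1 / s25, 2))  # close to sqrt(25) = 5
```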
The following function computes a least-squares best-fit line for a scatter plot.
def lmsBestFitLine(xDataIn,yDataIn):
    xBar = sum(xDataIn)/len(xDataIn)
    yBar = sum(yDataIn)/len(yDataIn)
    numerator = 0
    for cnt in range(len(xDataIn)):
        numerator += (xDataIn[cnt] - xBar) * (yDataIn[cnt] - yBar)
    denominator = 0
    for cnt in range(len(xDataIn)):
        denominator += (xDataIn[cnt] - xBar)**2
    m = numerator/denominator
    b = yBar - m*xBar
    return (m,b)#slope,y-intercept
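A simple way to verify the best-fit formula is to feed it points that lie exactly on a known line; the function should recover that line's slope and intercept exactly. The sketch below restates the function in a compact, self-contained form and checks it against y = 2x + 1.

```python
def lmsBestFitLine(xDataIn, yDataIn):
    # Least-squares slope m and intercept b for y = m*x + b
    xBar = sum(xDataIn) / len(xDataIn)
    yBar = sum(yDataIn) / len(yDataIn)
    numerator = sum((x - xBar) * (y - yBar)
                    for x, y in zip(xDataIn, yDataIn))
    denominator = sum((x - xBar) ** 2 for x in xDataIn)
    m = numerator / denominator
    return (m, yBar - m * xBar)

# Points that lie exactly on y = 2x + 1 recover slope 2 and intercept 1.
m, b = lmsBestFitLine([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)  # 2.0 1.0
```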
'''
Plots a histogram and a "box and whisker" plot for
an incoming dataset on a 1x2 row of an incoming figure
specified by axes. The row index on which to plot the
data is specified by axesRow. Also creates and plots
a normal probability density curve on the histogram
based on the mean and standard deviation of the
dataset. Set multiDim to True if axes is a
multi-dimensional array.
'''
def plotHistAndBox(data,axes,axesRow=0,multiDim=False):
    dataBar = mean(data)
    dataStd = stdev(data)
    if multiDim == True:
        #Plot and label histogram
        dataN,dataBins,dataPat = axes[axesRow,0].hist(
            data,bins=136,density=True,range=(min(data),max(data)))
        axes[axesRow,0].set_title('data')
        axes[axesRow,0].set_xlabel('x')
        axes[axesRow,0].set_ylabel('Relative Freq')
        #Plot a boxplot
        axes[axesRow,1].boxplot(data,vert=False)
        axes[axesRow,1].set_title('Box and Whisker Plot')
        axes[axesRow,1].set_xlabel('x')
    else:
        #Plot and label histogram
        dataN,dataBins,dataPat = axes[0].hist(
            data,bins=136,density=True,range=(min(data),max(data)))
        axes[0].set_title('data')
        axes[0].set_xlabel('x')
        axes[0].set_ylabel('Relative Freq')
        #Plot a boxplot
        axes[1].boxplot(data,vert=False)
        axes[1].set_title('Box and Whisker Plot')
        axes[1].set_xlabel('x')
    #Compute the values for a normal probability density curve for the
    # data mu and sigma across the same range of values.
    x = np.arange(dataBins[0],dataBins[len(dataBins)-1],0.1)
    y = [normalProbabilityDensity(
        val,mu=dataBar,sigma=dataStd) for val in x]
    #Superimpose the normal probability density curve on the histogram.
    if multiDim == True:
        axes[axesRow,0].plot(x,y,label='normal probability density')
    else:
        axes[0].plot(x,y,label='normal probability density')
Test the function named plotHistAndBox.
#Create a normal dataset with outliers.
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,
lowLim=10,highLim=70,seed=1) + [68.5] + [100] + [120]
#Create a figure with one row and two columns
fig,axes = plt.subplots(1,2)
#Call the plotHistAndBox function to plot the dataset
plotHistAndBox(g01,axes)
axes[0].grid(True)
axes[1].grid(True)
plt.tight_layout()
plt.show()
That's the end of the section containing utility functions. Let's begin the exercises with a general question and answer session regarding the interpretation of histograms. Note that you are not expected to understand the code that generated the histograms at this point, but you should fully understand that code by the end of the course.
Assume that the two histograms that you see below represent the annual salaries earned by graduates with a History major on the left and graduates with a Data Science Programmer major on the right. (Note that these histograms are based on totally fictitious data generated solely to illustrate the use of histograms in data analysis.)
The orange curves that are superimposed on the blue histogram data are normal distribution curves.
The program also shows how program code corresponds to looking up standardized z values in a Z-table.
Questions (See the answers below the histograms.)
history = normalRandomGenerator(dataLength=4362,numberSamples=3,lowLim=10,highLim=70,seed=1)
dSci = normalRandomGenerator(dataLength=5123,numberSamples=10,lowLim=30,highLim=90,seed=2)
print("Number of History values =",len(history))
print("Number of Data Science values",len(dSci))
histBar = mean(history)
print("histBar =",histBar)
dSciBar = mean(dSci)
print("dSciBar =",dSciBar)
histStd = stdev(history)
print("histStd =", histStd)
dSciStd = stdev(dSci)
print("dSciStd =", dSciStd)
#Standardize and lookup
print()
x = (((histBar + histStd)-histBar)/histStd)
y = (((histBar - histStd)-histBar)/histStd)
histMidRange = cumulativeDistribution(x)-cumulativeDistribution(y)
print('Approximately',int(100*histMidRange), 'percent of History majors earn between',\
round(histBar - histStd),'and',round(histBar + histStd),'thousand \nper year',\
'with an average of',round(histBar),'thousand per year.')
print()
x = (((dSciBar + dSciStd)-dSciBar)/dSciStd)
y = (((dSciBar - dSciStd)-dSciBar)/dSciStd)
dSciMidRange = cumulativeDistribution(x)-cumulativeDistribution(y)
print()
print("Compare the following with looking up values in a Z-table")
print("zHigh =",x)
print("cumulativeDistribution(zHi) =",cumulativeDistribution(x))
print("zLow =",y)
print("cumulativeDistribution(zLo) =",cumulativeDistribution(y))
print()
print('Approximately',int(100*dSciMidRange), 'percent of Data Science majors earn between',\
round(dSciBar - dSciStd),'and',round(dSciBar + dSciStd),'thousand \nper year',\
'with an average of',round(dSciBar),'thousand per year.')
print()
x = (round(histBar)-histBar)/histStd
ltHistMean = cumulativeDistribution(x)
print("Approximately", round(ltHistMean * len(history)),\
"history majors,",round(100*ltHistMean) ,"percent, make less than",round(histBar),"thousand per year.")
print()
x = (round(histBar)-dSciBar)/dSciStd
gtHistMean = 1 - cumulativeDistribution(x)
#gtHistMean = 1 - normalProbability(round(histBar),mu=dSciBar,sigma=dSciStd)
print("Approximately", round(gtHistMean * len(dSci)),\
"Data Science majors,",round(100*gtHistMean) ,"percent, make more than",round(histBar),"thousand per year.")
fig,axes = plt.subplots(1,2,sharey=True)
histN,histBins,histPat = axes[0].hist(history,bins=50,density=True,range=(min(history),max(dSci)))
axes[0].set_title('History')
axes[0].set_xlabel('Salary in thousands per year')
axes[0].set_ylabel('Relative Frequency')
dSciN,dSciBins,dSciPat = axes[1].hist(dSci,bins=50,density=True,range=(min(history),max(dSci)))
axes[1].set_title('Data Science Programmer')
axes[1].set_xlabel('Salary in thousands per year')
axes[0].grid(True)
axes[1].grid(True)
#Compute the values for a normal probability density curve for the
# history mu and sigma across the same range of values.
x = np.arange(histBins[0],histBins[len(histBins)-1],0.1)
y = [normalProbabilityDensity(val,mu=histBar,sigma=histStd) for val in x]
#Superimpose the normal probability density curve on the histogram.
axes[0].plot(x,y,label='normal probability density')
#Compute the values for a normal probability density curve for the
# dSci mu and sigma across the same range of values.
x = np.arange(dSciBins[0],dSciBins[len(dSciBins)-1],0.1)
y = [normalProbabilityDensity(val,mu=dSciBar,sigma=dSciStd) for val in x]
#Superimpose the normal probability density curve on the histogram.
axes[1].plot(x,y,label='normal probability density')
fig.suptitle('Salary for history and data science majors', fontsize=15)
plt.show()
Answers to questions:
Assume that salary surveys are conducted for two equal size groups of history majors. The first group is a typical group of history majors. However, the second group includes one individual who lands a football contract with the NFL and earns a huge salary his first year out of college and into the indefinite future. This large salary is what is commonly known as an outlier.
Statistics are computed and histograms are plotted for both groups. Here are some observations.
The presence of outliers in datasets can be very detrimental to statistical analysis. Unless outliers are representative of the typical state of the data being analyzed, they should probably be removed before undertaking a statistical analysis of the data.
(Note that these analyses are based on totally fictitious data generated solely to illustrate the impact of an outlier in statistical data analysis.)
#Create a normal dataset
g01 = normalRandomGenerator(dataLength=201,numberSamples=3,lowLim=10,highLim=70,seed=1)
#Create a normal dataset with an outlier
g02 = normalRandomGenerator(dataLength=200,numberSamples=3,lowLim=10,highLim=70,seed=2)+ [1000]
#Compute and print statistics for both datasets
print("Number of g01 values =",len(g01))
print("Number of g02 values =",len(g02))
g01Bar = mean(g01)
print("g01Bar =",g01Bar)
g02Bar = mean(g02)
print("g02Bar =",g02Bar)
g01Std = stdev(g01)
print("g01Std =", g01Std)
g02Std = stdev(g02)
print("g02Std =", g02Std)
g01Median = median(g01)
print("g01Median =",g01Median)
g02Median = median(g02)
print("g02Median =",g02Median)
#Plot and label histograms for both datasets for same range with a shared vertical axis.
fig,axes = plt.subplots(1,2,sharey=True)
g01N,g01Bins,g01Pat = axes[0].hist(g01,bins=50,density=True,range=(min(g01),max(g01)))
axes[0].set_title('g01')
axes[0].set_xlabel('Salary in thousands per year')
axes[0].set_ylabel('Relative Frequency')
g02N,g02Bins,g02Pat = axes[1].hist(g02,bins=50,density=True,range=(min(g01),max(g01)))
axes[1].set_title('g02')
axes[1].set_xlabel('Salary in thousands per year')
axes[0].grid(True)
axes[1].grid(True)
#Compute the values for a normal probability density curve for the
# g01 mu and sigma across the same range of values.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]
#Superimpose the normal probability density curve on the g01 histogram.
axes[0].plot(x,y,label='normal probability density')
#Compute the values for a normal probability density curve for the
# g02 mu and sigma across the same range of values.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g02Bar,sigma=g02Std) for val in x]
#Superimpose the normal probability density curve on the histogram.
axes[1].plot(x,y,label='normal probability density')
fig.suptitle('Before and after inclusion of an outlier', fontsize=12)
plt.show()
What is the relationship between the mean and the median for a skewed distribution? The following code indicates experimentally that the mean is pulled in the direction of the longer tail: for a positively skewed distribution the mean is greater than the median, and for a negatively skewed distribution the mean is less than the median.
#Create a positively skewed dataset.
g01 = [10,10,10,10,10,11,11,11,11,11,11,11,12,12,12,12,12,12,13,13,\
13,13,13,14,14,14,14,15,15,15,15,16,16,16,17,17,18,18,19]
#Create a negatively skewed dataset
g02 = [18,18,18,18,18,17,17,17,17,17,17,17,16,16,16,16,16,16,15,15,\
15,15,15,14,14,14,14,13,13,13,13,12,12,12,11,11,10,10,9]
#Compute and print statistics for both datasets
print("Number of g01 values =",len(g01))
print("Number of g02 values =",len(g02))
print()
g01Bar = mean(g01)
print("g01Bar =",g01Bar)
g01Median = median(g01)
print("g01Median =",g01Median)
print()
g02Bar = mean(g02)
print("g02Bar =",g02Bar)
g02Median = median(g02)
print("g02Median =",g02Median)
#Plot and label histograms for both datasets for same range with a shared vertical axis.
fig,axes = plt.subplots(1,2,sharey=True)
axes[0].hist(g01,bins=50,range=(min(g01),max(g01)))
axes[0].set_title('Positively Skewed')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Frequency')
axes[1].hist(g02,bins=50,range=(min(g02),max(g02)))
axes[1].set_title('Negatively Skewed')
axes[1].set_xlabel('x')
axes[0].grid(True)
axes[1].grid(True)
fig.suptitle('Positively and negatively skewed distributions', fontsize=12)
plt.show()
Theoretically, the mean, the median, and the mode are all equal for a dataset with a normal distribution. While it is probably not possible to create a dataset with a perfectly normal distribution, we can come close.
The following code creates a dataset that is very close to normally distributed and shows that the mean and the median are very close to being equal. By observation of the histogram, we can see that the mode is also very close to being equal to the mean and the median.
#Create a nearly normal dataset
g01 = normalRandomGenerator(dataLength=100000,numberSamples=20,lowLim=-100,highLim=100)
#Compute and print statistics for the dataset
print("Number of g01 values =",len(g01))
g01Bar = mean(g01)
print("g01Bar =",g01Bar)
g01Median = median(g01)
print("g01Median =",g01Median)
g01Std = stdev(g01)
print("g01Std =", g01Std)
#Plot and label a histogram for the dataset
fig,axes = plt.subplots(1,1)
g01N,g01Bins,g01Pat = axes.hist(g01,bins=round(max(g01) - min(g01)),\
density=True,range=(min(g01),max(g01)))
axes.set_title('g01')
axes.set_xlabel('x')
axes.set_ylabel('Relative Frequency')
axes.grid(True)
#Compute the values for a normal probability density curve for the
# g01 mu and sigma across the same range of values.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]
#Superimpose the normal probability density curve on the histogram.
axes.plot(x,y,label='normal probability density')
fig.suptitle('An approximately normal frequency distribution', fontsize=12)
plt.show()
One of the ways that statisticians deal with outliers is to cut the tails off of the distribution. It is apparently common practice to discard the lowest 25% of values and the highest 25% of values, keeping only the interquartile range (IQR). The IQR contains the middle two quarters, or 50%, of the data values.
The following code shows one way to do that and then plots histograms of the original dataset and the clipped dataset. Various statistics for both datasets are also computed and printed.
The area of the histogram for the clipped dataset is approximately half that of the area of the histogram for the original dataset. This is to be expected since half of the original data values at the low end and the high end were discarded.
The distribution of the clipped dataset does not appear to be normal. Two normal distribution curves are drawn on the histogram for the clipped dataset for illustration. The green curve shows the normal distribution curve for the original dataset. The orange curve shows the normal distribution curve computed using the mean and standard deviation of the clipped dataset.
#Create a normal dataset with an outlier
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [80]
#Sort the dataset to make it possible to divide the data values into quartiles.
g01.sort()
#Create a dataset by discarding the top 25% and the bottom 25% of the values in the original dataset.
#Use the median to subdivide the data into halves. Use Python list comprehension
g01Median = median(g01)
topHalf = [x for x in g01 if x >= g01Median]
bottomHalf = [x for x in g01 if x < g01Median]
#Get the median for each half
topHalfMedian = median(topHalf)
bottomHalfMedian = median(bottomHalf)
#Use the medians to divide the halves into halves, or into quartiles
firstQuartile = [x for x in bottomHalf if x < bottomHalfMedian]
secondQuartile = [x for x in bottomHalf if x >= bottomHalfMedian]
thirdQuartile = [x for x in topHalf if x <= topHalfMedian]
fourthQuartile = [x for x in topHalf if x > topHalfMedian]
#Concatenate the two center quartiles to become the new dataset
g02 = secondQuartile + thirdQuartile
#Compute and print statistics for both datasets
print()
print("sum of original dataset",sum(g01))
print("sum of clipped dataset",sum(g02))
print("sum of two center quartiles",sum(secondQuartile)+sum(thirdQuartile))
print("sum of firstQuartile",sum(firstQuartile))
print("sum of secondQuartile",sum(secondQuartile))
print("sum of thirdQuartile",sum(thirdQuartile))
print("sum of fourthQuartile",sum(fourthQuartile))
print("Number of g01 values =",len(g01))
print("Number of g02 values =",len(g02))
print("g01 range =",max(g01) - min(g01))
print("g02 range =",max(g02) - min(g02))
print()
g01Bar = mean(g01)
print("g01Bar =",g01Bar)
g02Bar = mean(g02)
print("g02Bar =",g02Bar)
g01Std = stdev(g01)
print("g01Std =", g01Std)
g02Std = stdev(g02)
print("g02Std =", g02Std)
g01Median = median(g01)
print("g01Median =",g01Median)
g02Median = median(g02)
print("g02Median =",g02Median)
#Plot and label histograms for both datasets
fig,axes = plt.subplots(1,2)
g01N,g01Bins,g01Pat = axes[0].hist(g01,bins=136,density=True,range=(min(g01),max(g01)))
axes[0].set_title('g01')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Relative Frequency')
g02N,g02Bins,g02Pat = axes[1].hist(g02,bins=136,density=True,range=(min(g01),max(g01)))
axes[1].set_title('g02')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Relative Frequency')
axes[0].grid(True)
axes[1].grid(True)
#Compute the values for a normal probability density curve for the
# g01 mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]
axes[0].plot(x,y,label='normal probability density')
#Superimpose the normal probability density curve for g01 on the histogram for g02.
axes[1].plot(x,y,label='normal probability density')
#Compute the values for a normal probability density curve for the
# g02 mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
y = [normalProbabilityDensity(val,mu=g02Bar,sigma=g02Std) for val in x]
plt.plot(x,y)
plt.tight_layout()
plt.show()
To understand the definition of an outlier, you must first understand quartiles. You can divide the dataset into quartiles with the following procedure:
1. Sort the dataset.
2. Find the median of the dataset. This is the second quartile value, Q2.
3. Find the median of the bottom half of the dataset. This is the first quartile value, Q1.
4. Find the median of the top half of the dataset. This is the third quartile value, Q3.
5. Compute the interquartile range: IQR = Q3 - Q1.
Any value x is an outlier if either of the following is true:
x < Q1 - 1.5*IQR
x > Q3 + 1.5*IQR
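The quartile-and-IQR outlier test can be packaged as a small helper function. This is an illustrative sketch (the function name iqrOutliers is not part of the tutorial's code); it uses the same convention as the code below, placing values equal to the median in the top half.

```python
from statistics import median

def iqrOutliers(data):
    # Split at the overall median, then take the median of each half
    # to get Q1 and Q3; flag values beyond 1.5*IQR of the box.
    s = sorted(data)
    q2 = median(s)
    q1 = median([x for x in s if x < q2])
    q3 = median([x for x in s if x >= q2])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in s if x < low or x > high]

print(iqrOutliers([10, 12, 12, 13, 13, 14, 14, 15, 120]))  # [120]
```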
The following code creates a relatively normal dataset with two outliers. One outlier on the high end was put there on purpose. The other outlier on the low end resulted from the averaging procedure used to create the dataset.
Then the program computes and prints several values using the above procedure including the IQR and the lower and upper outlier limits. The program also displays the mean and the standard deviation for the dataset.
Then the program plots a histogram for the dataset and plots a normal distribution curve on the histogram.
If you understand how the x-axis scale is determined for the histogram, you can tell by viewing the histogram that there is an outlier with a value of approximately 120. However, the outlier at the low end isn't obvious because it is barely below the lower outlier limit. You will see later how I knew that it was there.
#Create a normal dataset with an outlier
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [120]
#Sort the dataset to make it possible to divide the data values into quartiles.
g01.sort()
#Use the median to subdivide the data into halves. Use Python list comprehension
Q2 = median(g01)
topHalf = [x for x in g01 if x >= Q2]
bottomHalf = [x for x in g01 if x < Q2]
#Get the median for each half
Q3 = median(topHalf)
Q1 = median(bottomHalf)
IQR = Q3 - Q1
lowerOutlierLimit = Q1 - 1.5*IQR
upperOutlierLimit = Q3 + 1.5*IQR
print("Q1 =",Q1)
print("Q2 =",Q2)
print("Q3 =",Q3)
print("IQR =",IQR)
print("lowerOutlierLimit =",lowerOutlierLimit)
print("upperOutlierLimit",upperOutlierLimit)
print()
g01Bar = mean(g01)
print("g01Bar =",g01Bar)
g01Std = stdev(g01)
print("g01Std =", g01Std)
#Plot and label histogram
fig,axes = plt.subplots(1,1)
g01N,g01Bins,g01Pat = axes.hist(g01,bins=136,density=True,range=(min(g01),max(g01)))
axes.set_title('g01')
axes.set_xlabel('x')
axes.set_ylabel('Relative Frequency')
axes.grid(True)
#Compute the values for a normal probability density curve for the
# g01 mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]
axes.plot(x,y,label='normal probability density')
plt.tight_layout()
plt.show()
In an earlier exercise, I told you that the dataset being analyzed has an outlier very close to the lower outlier limit that was not obvious by viewing the histogram. I promised to tell you later how I knew it was there.
You learned how to compute the lower and upper outlier limits in that exercise.
The following code creates a relatively normal dataset with three outliers. Two outliers on the high end were put there for purposes of illustration. The third outlier on the low end resulted from the averaging procedure used to create the dataset.
Then the program computes and prints several values including the IQR and the lower and upper outlier limits. The program also displays the mean and the standard deviation for the dataset.
Then the program plots a histogram for the dataset and plots a normal distribution curve on the histogram.
Finally, and this is new to this exercise, the program plots a "box and whisker" plot for the data. Box and whisker plots are ideal for comparing distributions because the center, the spread, and the overall range are immediately apparent, as are the outliers.
Think of the horizontal center line as representing the full range of the data with low values at the left and high values at the right. (The plot can also be oriented vertically with high values at the top.) The three small circles at the right represent the three outliers that I added to the normal dataset for illustration purposes. The small circle on the left represents a low-end outlier that was a normal consequence of the manner in which the dataset was created. (This is how I knew that there is an outlier at the low end.)
The red vertical line represents the median of the dataset or Q2. The left end of the box represents Q1 and the right end of the box represents Q3. The width of the box represents the IQR.
The short vertical lines to the left and right of the box are the whiskers. By default, Matplotlib draws each whisker at the most extreme data point that falls within 1.5*IQR of the box (controlled by the whis parameter of boxplot), so the whiskers closely track the lower and upper outlier limits. This can be demonstrated experimentally. I added an outlier with a value of 68.5, and the short vertical line on the right is almost centered on the circle that represents that outlier. The computed value of the upper outlier limit is 68.43, only slightly less than the value of the outlier that I added to the dataset. Thus the position of the short vertical line is a close match to the value of the upper outlier limit.
#Create a normal dataset with outliers.
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [68.5] + [100] + [120]
#Sort the dataset to make it possible to divide the data values into quartiles.
g01.sort()
#Use the median to subdivide the data into halves. Use Python list comprehension
Q2 = median(g01)
topHalf = [x for x in g01 if x >= Q2]
bottomHalf = [x for x in g01 if x < Q2]
#Get the median for each half
Q3 = median(topHalf)
Q1 = median(bottomHalf)
IQR = Q3 - Q1
lowerOutlierLimit = Q1 - 1.5*IQR
upperOutlierLimit = Q3 + 1.5*IQR
print("Q1 =",Q1)
print("Q2 =",Q2)
print("Q3 =",Q3)
print("IQR =",IQR)
print("lowerOutlierLimit =",lowerOutlierLimit)
print("upperOutlierLimit",upperOutlierLimit)
print("min =",min(g01))
print("max =",max(g01))
print()
g01Bar = mean(g01)
print("g01Bar =",g01Bar)
g01Std = stdev(g01)
print("g01Std =", g01Std)
#Create a figure with one row and two columns
fig,axes = plt.subplots(1,2)
#Call the plotHistAndBox function to plot the dataset
plotHistAndBox(g01,axes)
axes[0].grid(True)
axes[1].grid(True)
plt.tight_layout()
plt.show()
This example is similar to the previous one except that it creates and plots histograms and box plots for three datasets having different spreads and outliers at the same values. This illustrates the usefulness of the box and whisker plot for comparing distributions where the center, the spread, and the overall range are immediately apparent, as are the outliers.
The program plots a normal distribution curve on the histogram for each dataset showing the extent to which the dataset approximates a normal distribution.
#Create three datasets with different spreads and with outliers.
g01 = normalRandomGenerator(dataLength=9999,numberSamples=1,lowLim=10,highLim=70,seed=1) + [68.5] + [80] + [90]
g02 = normalRandomGenerator(dataLength=9999,numberSamples=2,lowLim=10,highLim=70,seed=1) + [68.5] + [80] + [90]
g03 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [68.5] + [80] + [90]
#Create a figure with three rows and two columns
fig,axes = plt.subplots(3,2)
#Call the plotHistAndBox function to process the first dataset
plotHistAndBox(g01,axes,axesRow=0,multiDim=True)
#Process the second dataset
plotHistAndBox(g02,axes,axesRow=1,multiDim=True)
#Process the third dataset
plotHistAndBox(g03,axes,axesRow=2,multiDim=True)
axes[0,0].grid(True)
axes[0,1].grid(True)
axes[1,0].grid(True)
axes[1,1].grid(True)
axes[2,0].grid(True)
axes[2,1].grid(True)
plt.tight_layout()
plt.show()
Experimentally find and print the probability that a value in a data set with a near-normal distribution will be within one sigma on each side of the mean. Express the probability as a percentage with no decimal.
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
lowLimit = xBar - stDev
print("lowLimit =",lowLimit)
highLimit = xBar + stDev
print("highLimit =",highLimit)
lowZ = (lowLimit - xBar)/stDev
print("lowZ =",lowZ)
highZ = (highLimit - xBar)/stDev
print("highZ =",highZ)
probability = cumulativeDistribution(highZ) - cumulativeDistribution(lowZ)
print("probability that x is within one sigma =",int(probability*100//1),"percent")
Experimentally find and print the probability that a value in a data set with a near-normal distribution will be within two sigma on each side of the mean. Express the probability as a percentage with no decimal.
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
lowLimit = xBar - 2*stDev
print("lowLimit =",lowLimit)
highLimit = xBar + 2*stDev
print("highLimit =",highLimit)
lowZ = (lowLimit - xBar)/stDev
print("lowZ =",lowZ)
highZ = (highLimit - xBar)/stDev
print("highZ =",highZ)
probability = cumulativeDistribution(highZ) - cumulativeDistribution(lowZ)
print("probability that x is within two sigma =",int(probability*100//1),"percent")
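The experimental percentages found above can be compared with the exact theoretical values, which follow directly from the error function in Python's standard library: for a standard normal variable, P(|Z| < k) = erf(k/sqrt(2)).

```python
import math

# Exact probabilities for a standard normal variable:
# P(|Z| < k) = erf(k / sqrt(2))
for k in (1, 2):
    pct = round(100 * math.erf(k / math.sqrt(2)), 2)
    print("within", k, "sigma:", pct, "percent")
```

This prints approximately 68.27 percent for one sigma and 95.45 percent for two sigma, matching the familiar empirical rule and the experimental results above.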
--To be continued in the tutorial titled Statistics Exercises Part 3--
Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX
File: StatisticsPart02.ipynb
Revised: 04/21/18
Copyright 2018 Richard G. Baldwin
-end-