Statistics Exercises Part 2

This is the second of several exercise pages on statistical programming prepared for use in the course ITSE 1302 at Austin Community College.

See the document titled Statistics Exercises Part 1 for more introductory information.

Import the required libraries

Let's begin by importing the libraries required by the code later in the tutorial.

In [1]:
import math
from math import pow
import matplotlib.pyplot as plt
from statistics import mean
from statistics import median
from statistics import stdev
from statistics import mode
from statistics import StatisticsError
import random
from collections import Counter
import numpy as np

Utility functions

This section defines several utility functions used by code later in the tutorial.

Standard normal probability density function

We will begin by defining and testing a utility function from which you can obtain values for the standard normal probability density function. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the standard normal probability density function.

In [2]:
'''
Computes and returns values for the standard normal probability density function

input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value

Examples

x = -2, output = 0.05399096651318806
x = -1, output = 0.24197072451914337
x = 0, output = 0.3989422804014327
x = 1, output = 0.24197072451914337
x = 2, output = 0.05399096651318806
'''
def standardNormalDistribution(x):

    eVal = 2.718281828459045
    exp = (-x**2)/2
    numerator = pow(eVal,exp)
    denominator = math.sqrt(2*math.pi)
    return numerator/denominator
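As a quick sanity check, the following self-contained sketch restates the function using math.exp (equivalent to raising eVal to a power) and verifies it against the example values listed in the docstring above:

```python
import math

def standardNormalDistribution(x):
    # f(x) = e^(-x^2/2) / sqrt(2*pi)
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

# Compare against the tabulated example values from the docstring.
for xVal, expected in [(-2, 0.05399096651318806),
                       (-1, 0.24197072451914337),
                       (0,  0.3989422804014327)]:
    assert abs(standardNormalDistribution(xVal) - expected) < 1e-12

print("standardNormalDistribution matches the docstring values")
```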

Plot it

Get and plot the standard normal probability density function from -3 sigma to +3 sigma.

In [3]:
#Create an array of uniformly spaced values
x = np.arange(-3,3,0.1)
#Create a list containing a distribution value for each value in x
# using list comprehension.
y = [standardNormalDistribution(val) for val in x]

#Use matplotlib to plot the data.
plt.plot(x,y)
plt.ylabel('Probability Density')
plt.xlabel('x')
plt.show()

Normal probability density function

Now let's define a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.

This function is more general than the similar function defined above. That function assumed that mu was equal to zero and sigma was equal to 1; those assumptions yield what is commonly called the standard normal distribution. This function makes no such assumptions and produces the more general probability density function.

In [4]:
'''
Computes and returns values for the normal probability density function

required input: x-axis value
optional input: mu
optional input: sigma

returns: The computed value of the function for the given x, mu, and sigma. If mu and sigma are not provided as
input arguments, the method reverts to returning values for the standard normal distribution curve with mu = 0
and sigma = 1.0
'''
def normalProbabilityDensity(x,mu=0,sigma=1.0):
    
    eVal = 2.718281828459045
    exp = -((x-mu)**2)/(2*(sigma**2))
    numerator = pow(eVal,exp)
    denominator = sigma*(math.sqrt(2*math.pi))
    return numerator/denominator
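Since the defaults are mu=0 and sigma=1.0, calling this function with only an x value should reproduce the standard normal density. A minimal self-contained check (restating both functions) is:

```python
import math

def standardNormalDistribution(x):
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def normalProbabilityDensity(x, mu=0, sigma=1.0):
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# With the default mu and sigma, the general density reduces
# to the standard normal density.
for xVal in [-2, -1, 0, 1, 2]:
    assert abs(normalProbabilityDensity(xVal)
               - standardNormalDistribution(xVal)) < 1e-15

print("defaults reduce to the standard normal density")
```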

Plot it

Get and plot the normal probability density function from -3 sigma to +3 sigma for mu = 1 and sigma = 0.5.

In [5]:
x = np.arange(-2,4,0.1)
y = [normalProbabilityDensity(val,mu=1,sigma=0.5) for val in x]

#Use matplotlib to plot the data.
plt.plot(x,y)
plt.ylabel('Probability Density')
plt.xlabel('x')
plt.show()

Cumulative distribution function

Now define a utility function that will return the value of the cumulative normal distribution function. This function takes x as a parameter and returns the area under the standard normal distribution curve from minus infinity up to the value of x.

A call to this function is equivalent to finding a value in a Z-table.

The method for computing the cumulative probability in the following code came from Stand-alone Python implementation of phi(x) by John D. Cook.

See Standard Normal Probabilities for tables showing the values that should be returned by this function.

In [6]:
'''
Computes and returns values for the cumulative distribution function of the standard normal 
probability density function. Stated differently, computes and returns area under the standard 
normal curve from minus infinity to an input x value.

This is a series implementation. A digital numeric integration version is provided later in 
this document for comparison.

input: x-axis value as a decimal fraction of the standard deviation
returns: The computed value of the function for the given x-axis value

Examples

x = -2, output = 0.022750062887256395
x = -1, output = 0.15865526383236372
x = 0, output = 0.5000000005
x = 1, output = 0.8413447361676363
x = 2, output = 0.9772499371127437

'''
def cumulativeDistribution(x):
    # constants
    a1 =  0.254829592
    a2 = -0.284496736
    a3 =  1.421413741
    a4 = -1.453152027
    a5 =  1.061405429
    p  =  0.3275911

    # Save the sign of x
    sign = 1
    if x < 0:
        sign = -1
    x = abs(x)/math.sqrt(2.0)

    # A&S formula 7.1.26
    t = 1.0/(1.0 + p*x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)

    return 0.5*(1.0 + sign*y)
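To see how accurate the series approximation is, you can compare it against the exact value available from Python's math.erf. This sketch restates the function above so it is self-contained and checks agreement to within the approximation's documented error of about 1.5e-7:

```python
import math

def cumulativeDistribution(x):
    # A&S formula 7.1.26 approximation, as in the function above.
    a1, a2, a3 = 0.254829592, -0.284496736, 1.421413741
    a4, a5, p = -1.453152027, 1.061405429, 0.3275911
    sign = -1 if x < 0 else 1
    x = abs(x) / math.sqrt(2.0)
    t = 1.0 / (1.0 + p * x)
    y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*math.exp(-x*x)
    return 0.5 * (1.0 + sign * y)

def cumulativeDistributionExact(x):
    # Exact value via the error function: Phi(x) = (1 + erf(x/sqrt(2)))/2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The approximation agrees with the exact value to within ~1.5e-7.
for z in [-2, -1, 0, 1, 2]:
    assert abs(cumulativeDistribution(z)
               - cumulativeDistributionExact(z)) < 1.5e-7

print("approximation agrees with math.erf to within 1.5e-7")
```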

Plot it

Get and plot the cumulative distribution from -3 sigma to +3 sigma.

In [7]:
x = np.arange(-3,3,0.1)
y = [cumulativeDistribution(val) for val in x]

#Use matplotlib to plot the data.
plt.plot(x,y)
plt.ylabel('Probability Density')
plt.xlabel('x')
plt.show()

A random dataset generator

Now define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword parameter value of numberSamples=1. The resulting dataset tends more toward normal as the value of numberSamples is increased.

In [8]:
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
    '''
    Create a new dataset of dataLength values consisting of the average of numberSamples 
    samples taken from a population of uniformly distributed values between lowLim 
    and highLim generated with a seed of seed.

    Input keyword parameters and their default values:

    seed = 1 seed used to generate the uniformly distributed values
    dataLength = 10000 number of samples in the returned list of values
    numberSamples = 50 number of samples taken from the uniformly distributed population
                       and averaged into the output
    lowLim = 0 lower limit value of the uniformly distributed population
    highLim = 100 high limit value of the uniformly distributed population

    returns: a list containing the dataset

    '''
    data = []
    random.seed(seed)

    for cnt in range(dataLength):
        theSum = 0
        for cnt1 in range(numberSamples):
            theSum += random.uniform(lowLim,highLim)
        data.append(theSum/numberSamples)
        
    return data
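As a rough check on the generator, the mean of the averaged values should land near the midpoint (lowLim + highLim)/2, and the spread should shrink roughly as 1/sqrt(numberSamples). A self-contained sketch (the parameter values here are chosen only for illustration):

```python
import random
from statistics import mean, stdev

def normalRandomGenerator(seed=1, dataLength=10000, numberSamples=50,
                          lowLim=0, highLim=100):
    data = []
    random.seed(seed)
    for _ in range(dataLength):
        theSum = sum(random.uniform(lowLim, highLim)
                     for _ in range(numberSamples))
        data.append(theSum / numberSamples)
    return data

g = normalRandomGenerator(dataLength=5000, numberSamples=25,
                          lowLim=0, highLim=100)

# Mean of the averages should be near the midpoint, 50.
assert abs(mean(g) - 50) < 1
# stdev of uniform(0,100) is 100/sqrt(12) ~ 28.87; averaging 25
# samples divides that by sqrt(25), giving roughly 5.77.
assert 5 < stdev(g) < 7
```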

Least mean square best fit line

The following function will compute a least mean square best fit line for a scatter plot.

In [9]:
def lmsBestFitLine(xDataIn,yDataIn):
    xBar = sum(xDataIn)/len(xDataIn)
    yBar = sum(yDataIn)/len(yDataIn)

    numerator = 0
    for cnt in range(len(xDataIn)):
        numerator += (xDataIn[cnt] - xBar) * (yDataIn[cnt] - yBar)

    denominator = 0
    for cnt in range(len(xDataIn)):
        denominator += (xDataIn[cnt] -xBar)**2

    m = numerator/denominator

    b = yBar - m*xBar
    
    return (m,b)#slope,y-intercept
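A simple way to test this function is to feed it points that lie exactly on a known line; the recovered slope and y-intercept should match. For example:

```python
def lmsBestFitLine(xDataIn, yDataIn):
    # Least-mean-square best fit line, as defined above.
    xBar = sum(xDataIn) / len(xDataIn)
    yBar = sum(yDataIn) / len(yDataIn)
    numerator = sum((x - xBar) * (y - yBar)
                    for x, y in zip(xDataIn, yDataIn))
    denominator = sum((x - xBar) ** 2 for x in xDataIn)
    m = numerator / denominator
    b = yBar - m * xBar
    return (m, b)  # slope, y-intercept

# Points lying exactly on y = 2x + 1 should recover slope 2, intercept 1.
xData = [0, 1, 2, 3, 4]
yData = [2 * x + 1 for x in xData]
m, b = lmsBestFitLine(xData, yData)
assert abs(m - 2) < 1e-12 and abs(b - 1) < 1e-12
```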

plotHistAndBox function

In [10]:
'''
Plots a histogram and a "box and whisker" plot for 
an incoming dataset on a 1x2 row of an incoming figure
specified by axes. The row index on which to plot the 
data is specified by axesRow. Also creates and plots
a normal probability density curve on the histogram 
based on the mean and standard deviation of the 
dataset. Set multiDim to True if axes is a 
multi-dimensional array.
'''

def plotHistAndBox(data,axes,axesRow=0,multiDim=False):

    dataBar = mean(data)
    dataStd = stdev(data)

    if multiDim == True:
        #Plot and label histogram
        dataN,dataBins,dataPat = axes[axesRow,0].hist(
            data,bins=136,normed=True,range=(min(data),max(data)))
        axes[axesRow,0].set_title('data')
        axes[axesRow,0].set_xlabel('x')
        axes[axesRow,0].set_ylabel('Relative Freq')

        #Plot a boxplot
        axes[axesRow,1].boxplot(data,vert=False)
        axes[axesRow,1].set_title('Box and Whisker Plot')
        axes[axesRow,1].set_xlabel('x')
        
    else:
        #Plot and label histogram
        dataN,dataBins,dataPat = axes[0].hist(
            data,bins=136,normed=True,range=(min(data),max(data)))
        axes[0].set_title('data')
        axes[0].set_xlabel('x')
        axes[0].set_ylabel('Relative Freq')

        #Plot a boxplot
        axes[1].boxplot(data,vert=False)
        axes[1].set_title('Box and Whisker Plot')
        axes[1].set_xlabel('x')

    #Compute the values for a normal probability density curve for the 
    # data mu and sigma across the same range of values.
    x = np.arange(dataBins[0],dataBins[len(dataBins)-1],0.1)
    y = [normalProbabilityDensity(
        val,mu=dataBar,sigma=dataStd) for val in x]

    #Superimpose the normal probability density curve on the histogram.
    if multiDim == True:
        axes[axesRow,0].plot(x,y,label='normal probability density')
    else:
        axes[0].plot(x,y,label='normal probability density')

Plot it

Test the function named plotHistAndBox.

In [11]:
#Create a normal dataset with outliers.
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,
    lowLim=10,highLim=70,seed=1) + [68.5] + [100] + [120]

#Create a figure with one row and two columns
fig,axes = plt.subplots(1,2)

#Call the plotHistAndBox function to plot the dataset
plotHistAndBox(g01,axes)

axes[0].grid(True)
axes[1].grid(True)
plt.tight_layout()
plt.show()

General histogram Q & A

That's the end of the section containing utility functions. Let's begin the exercises with a general question-and-answer session on the interpretation of histograms. Note that you are not expected to understand the code that generated the histograms at this point, but you should fully understand that code by the end of the course.

Assume that the two histograms that you see below represent the annual salaries earned by graduates with a History major on the left and graduates with a Data Science Programmer major on the right. (Note that these histograms are based on totally fictitious data generated solely to illustrate the use of histograms in data analysis.)

The orange curves that are superimposed on the blue histogram data are normal distribution curves.

The program also demonstrates that the computation performed by the code is equivalent to looking up standardized z values in a Z-table.

Questions (See the answers below the histograms.)

  1. Approximately what income do most History majors and Data Science Programming majors make?
  2. Is it reasonable to draw conclusions about this data by assuming that it is normally distributed?
  3. What percentage of History majors earn between 30 and 50 thousand per year?
  4. What percentage of Data Science majors earn between 55 and 65 thousand per year?
  5. Of the population on which the histogram is based, approximately how many History majors make less than 40,000 per year?
  6. Of the population on which the histogram is based, approximately how many Data Science majors make more than 40,000 per year?
In [12]:
history = normalRandomGenerator(dataLength=4362,numberSamples=3,lowLim=10,highLim=70,seed=1)
dSci = normalRandomGenerator(dataLength=5123,numberSamples=10,lowLim=30,highLim=90,seed=2)

print("Number of History values =",len(history))
print("Number of Data Science values",len(dSci))
histBar = mean(history)
print("histBar =",histBar)
dSciBar = mean(dSci)
print("dSciBar =",dSciBar)
histStd = stdev(history)
print("histStd =", histStd)
dSciStd = stdev(dSci)
print("dSciStd =", dSciStd)


#Standardize and lookup
print()
x = (((histBar + histStd)-histBar)/histStd)
y = (((histBar - histStd)-histBar)/histStd)
histMidRange = cumulativeDistribution(x)-cumulativeDistribution(y)

print('Approximately',int(100*histMidRange), 'percent of History majors earn between',\
      round(histBar - histStd),'and',round(histBar + histStd),'thousand \nper year',\
     'with an average of',round(histBar),'thousand per year.')

print()
x = (((dSciBar + dSciStd)-dSciBar)/dSciStd)
y = (((dSciBar - dSciStd)-dSciBar)/dSciStd) 
dSciMidRange = cumulativeDistribution(x)-cumulativeDistribution(y)

print()
print("Compare the following with looking up values in a Z-table")
print("zHigh =",x)
print("cumulativeDistribution(zHi) =",cumulativeDistribution(x))
print("zLow =",y)
print("cumulativeDistribution(zLo) =",cumulativeDistribution(y))
print()

print('Approximately',int(100*dSciMidRange), 'percent of Data Science majors earn between',\
      round(dSciBar - dSciStd),'and',round(dSciBar + dSciStd),'thousand \nper year',\
     'with an average of',round(dSciBar),'thousand per year.')

print()
x = (round(histBar)-histBar)/histStd
ltHistMean = cumulativeDistribution(x)
print("Approximately", round(ltHistMean * len(history)),\
      "history majors,",round(100*ltHistMean) ,"percent, make less than",round(histBar),"thousand per year.")

print()
x = (round(histBar)-dSciBar)/dSciStd
gtHistMean = 1 - cumulativeDistribution(x)
#gtHistMean = 1 - normalProbability(round(histBar),mu=dSciBar,sigma=dSciStd)
print("Approximately", round(gtHistMean * len(dSci)),\
      "Data Science majors,",round(100*gtHistMean) ,"percent, make more than",round(histBar),"thousand per year.")

fig,axes = plt.subplots(1,2,sharey=True)
histN,histBins,histPat = axes[0].hist(history,bins=50,normed=True,range=(min(history),max(dSci)))
axes[0].set_title('History')
axes[0].set_xlabel('Salary in thousands per year')
axes[0].set_ylabel('Relative Frequency')


dSciN,dSciBins,dSciPat = axes[1].hist(dSci,bins=50,normed=True,range=(min(history),max(dSci)))
axes[1].set_title('Data Science Programmer')
axes[1].set_xlabel('Salary in thousands per year')
axes[0].grid(True)
axes[1].grid(True)


#Compute the values for a normal probability density curve for the 
# history mu and sigma across the same range of values.
x = np.arange(histBins[0],histBins[len(histBins)-1],0.1)
y = [normalProbabilityDensity(val,mu=histBar,sigma=histStd) for val in x]

#Superimpose the normal probability density curve on the histogram.
axes[0].plot(x,y,label='normal probability density')


#Compute the values for a normal probability density curve for the 
# dSci mu and sigma across the same range of values.
x = np.arange(dSciBins[0],dSciBins[len(dSciBins)-1],0.1)
y = [normalProbabilityDensity(val,mu=dSciBar,sigma=dSciStd) for val in x]

#Superimpose the normal probability density curve on the histogram.
axes[1].plot(x,y,label='normal probability density')

fig.suptitle('Salary for history and data science majors', fontsize=15)
plt.show()
Number of History values = 4362
Number of Data Science values 5123
histBar = 39.90199979054073
dSciBar = 59.99548093874471
histStd = 10.064284837901639
dSciStd = 5.479641868832745

Approximately 68 percent of History majors earn between 30 and 50 thousand 
per year with an average of 40 thousand per year.


Compare the following with looking up values in a Z-table
zHigh = 1.0000000000000009
cumulativeDistribution(zHi) = 0.8413447361676365
zLow = -0.9999999999999997
cumulativeDistribution(zLo) = 0.1586552638323639

Approximately 68 percent of Data Science majors earn between 55 and 65 thousand 
per year with an average of 60 thousand per year.

Approximately 2198 history majors, 50 percent, make less than 40 thousand per year.

Approximately 5122 Data Science majors, 100 percent, make more than 40 thousand per year.

Answers to questions:

  1. Approximately what income do most History majors and Data Science Programming majors make? (See the histBar and the dSciBar in the output for quick answers. We will see better answers later based on the standard deviation.)
  2. Is it reasonable to draw conclusions about this data by assuming that it is normally distributed? Yes. The orange normal curves that are superimposed on the histograms indicate that the distribution of the data is very close to normal.
  3. What percentage of History majors earn between 30 and 50 thousand per year? See the answer in the printed output. This is also a better answer to the first question.
  4. What percentage of Data Science majors earn between 55 and 65 thousand per year? See the answer in the printed output. This is also a better answer to the first question.
  5. Of the population on which the histogram is based, approximately how many History majors make less than 40,000 per year? See the printed output.
  6. Of the population on which the histogram is based, approximately how many Data Science majors make more than 40,000 per year? See the printed output.

Detrimental impact of an outlier

Assume that salary surveys are conducted for two equal-size groups of history majors. The first is a typical group of history majors. However, the second group includes one individual who lands a football contract with the NFL and earns a huge salary his first year out of college and into the indefinite future. This large salary is what is commonly known as an outlier.

Statistics are computed and histograms are plotted for both groups. Here are some observations.

  1. The outlier caused the mean value of the second group to increase by about 12 percent.
  2. The outlier caused the standard deviation of the second group to increase by a factor of 6.8.
  3. Replacing a value in a dataset with an outlier doesn't change the median unless the median value itself is replaced. The very small difference in the median between the two groups in this case is due simply to the fact that the two groups were different samples taken from the same population. The median can be a better representation of the salary distribution than the mean in the case of an outlier or in the case of highly skewed distributions. (See Estimates of the Center)
  4. A normal distribution curve created with the new mean and standard deviation values is a very poor match for the 99.5 percent of the data that does not include the outlier. Therefore, using the mean and standard deviation values to draw conclusions based on the assumption of a normal distribution probably wouldn't yield valid conclusions.

The presence of outliers in datasets can be very detrimental to statistical analysis. Unless outliers are representative of the typical state of the data being analyzed, they should probably be removed before undertaking a statistical analysis of the data.

(Note that these analyses are based on totally fictitious data generated solely to illustrate the impact of an outlier in statistical data analysis.)
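The contrast between the mean and the median in observation 3 above can be illustrated with a tiny fictitious example:

```python
from statistics import mean, median

# Five fictitious salaries in thousands per year, and the same group
# with the last value replaced by a huge outlier.
salaries = [38, 39, 40, 41, 42]
withOutlier = [38, 39, 40, 41, 1000]

assert mean(salaries) == 40 and median(salaries) == 40
assert mean(withOutlier) > 200   # the mean is badly distorted
assert median(withOutlier) == 40 # the median is unaffected
```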

In [13]:
#Create a normal dataset
g01 = normalRandomGenerator(dataLength=201,numberSamples=3,lowLim=10,highLim=70,seed=1)

#Create a normal dataset with an outlier
g02 = normalRandomGenerator(dataLength=200,numberSamples=3,lowLim=10,highLim=70,seed=2)+ [1000]

#Compute and print statistics for both datasets
print("Number of g01 values =",len(g01))
print("Number of g02 values =",len(g02))

g01Bar = mean(g01)
print("g01Bar =",g01Bar)

g02Bar = mean(g02)
print("g02Bar =",g02Bar)

g01Std = stdev(g01)
print("g01Std =", g01Std)

g02Std = stdev(g02)
print("g02Std =", g02Std)

g01Median = median(g01)
print("g01Median =",g01Median)

g02Median = median(g02)
print("g02Median =",g02Median)

#Plot and label histograms for both datasets for same range with a shared vertical axis.
fig,axes = plt.subplots(1,2,sharey=True)

g01N,g01Bins,g01Pat = axes[0].hist(g01,bins=50,normed=True,range=(min(g01),max(g01)))
axes[0].set_title('g01')
axes[0].set_xlabel('Salary in thousands per year')
axes[0].set_ylabel('Relative Frequency')

g02N,g02Bins,g02Pat = axes[1].hist(g02,bins=50,normed=True,range=(min(g01),max(g01)))
axes[1].set_title('g02')
axes[1].set_xlabel('Salary in thousands per year')

axes[0].grid(True)
axes[1].grid(True)


#Compute the values for a normal probability density curve for the 
# g01 mu and sigma across the same range of values.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]

#Superimpose the normal probability density curve on the g01ogram.
axes[0].plot(x,y,label='normal probability density')

#Compute the values for a normal probability density curve for the 
# g02 mu and sigma across the same range of values.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g02Bar,sigma=g02Std) for val in x]

#Superimpose the normal probability density curve on the histogram.
axes[1].plot(x,y,label='normal probability density')

fig.suptitle('Before and after inclusion of an outlier', fontsize=12)
plt.show()
Number of g01 values = 201
Number of g02 values = 201
g01Bar = 40.33801352003058
g02Bar = 44.948923653598015
g01Std = 10.330463182121866
g02Std = 68.52185410619977
g01Median = 40.3739234746022
g02Median = 41.339359378750395

Order of mean and median for skewed distributions

What is the relationship between the mean and the median for a skewed distribution? The following code indicates experimentally that:

  • The mean is greater than the median for a positively skewed distribution.
  • The mean and the median are greater than the mode for a positively skewed distribution.
  • The median is greater than the mean for a negatively skewed distribution.
  • The mean and the median are less than the mode for a negatively skewed distribution.
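The orderings listed above can be checked with a very small skewed dataset using the mean, median, and mode functions from the statistics module:

```python
from statistics import mean, median, mode

# A short positively skewed dataset: most values are small,
# with a long tail on the high side.
g = [1, 1, 1, 2, 2, 3, 4, 9]

assert mode(g) == 1          # the most frequent value
assert median(g) == 2        # the middle of the sorted data
assert mean(g) > median(g) > mode(g)  # mean > median > mode
```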
In [14]:
#Create a positively skewed dataset.
g01 = [10,10,10,10,10,11,11,11,11,11,11,11,12,12,12,12,12,12,13,13,\
       13,13,13,14,14,14,14,15,15,15,15,16,16,16,17,17,18,18,19]

#Create a negatively skewed dataset
g02 = [18,18,18,18,18,17,17,17,17,17,17,17,16,16,16,16,16,16,15,15,\
       15,15,15,14,14,14,14,13,13,13,13,12,12,12,11,11,10,10,9]

#Compute and print statistics for both datasets
print("Number of g01 values =",len(g01))
print("Number of g02 values =",len(g02))

print()
g01Bar = mean(g01)
print("g01Bar =",g01Bar)
g01Median = median(g01)
print("g01Median =",g01Median)

print()
g02Bar = mean(g02)
print("g02Bar =",g02Bar)
g02Median = median(g02)
print("g02Median =",g02Median)

#Plot and label histograms for both datasets for same range with a shared vertical axis.
fig,axes = plt.subplots(1,2,sharey=True)

axes[0].hist(g01,bins=50,range=(min(g01),max(g01)))
axes[0].set_title('Positively Skewed')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Relative Frequency')

axes[1].hist(g02,bins=50,range=(min(g02),max(g02)))
axes[1].set_title('Negatively Skewed')
axes[1].set_xlabel('x')

axes[0].grid(True)
axes[1].grid(True)

fig.suptitle('Positively and negatively skewed distributions', fontsize=12)
plt.show()
Number of g01 values = 39
Number of g02 values = 39

g01Bar = 13.256410256410257
g01Median = 13

g02Bar = 14.743589743589743
g02Median = 15

Relationship of mean, median, and mode for a normal distribution

Theoretically, the mean, the median, and the mode are all equal for a dataset with a normal distribution. While it is probably not possible to create a dataset with a perfectly normal distribution, we can come close.

The following code creates a dataset that is very close to normally distributed and shows that the mean and the median are very close to being equal. By observation of the histogram, we can see that the mode is also very close to being equal to the mean and the median.

In [15]:
#Create a nearly normal dataset
g01 = normalRandomGenerator(dataLength=100000,numberSamples=20,lowLim=-100,highLim=100)

#Compute and print statistics for the dataset
print("Number of g01 values =",len(g01))

g01Bar = mean(g01)
print("g01Bar =",g01Bar)

g01Median = median(g01)
print("g01Median =",g01Median)

g01Std = stdev(g01)
print("g01Std =", g01Std)

#Plot and label a histogram for the dataset
fig,axes = plt.subplots(1,1)

g01N,g01Bins,g01Pat = axes.hist(g01,bins=round(max(g01) - min(g01)),\
                                normed=True,range=(min(g01),max(g01)))
axes.set_title('g01')
axes.set_xlabel('x')
axes.set_ylabel('Relative Frequency')

axes.grid(True)

#Compute the values for a normal probability density curve for the 
# g01 mu and sigma across the same range of values.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]

#Superimpose the normal probability density curve on the histogram.
axes.plot(x,y,label='normal probability density')

fig.suptitle('An approximately normal frequency distribution', fontsize=12)
plt.show()
Number of g01 values = 100000
g01Bar = -0.00880052373591947
g01Median = -0.026064678991148327
g01Std = 12.923823719960344

Cutting off the tails

One of the ways that statisticians deal with outliers is to cut the tails off of the distribution. A common practice is to discard the lowest 25% of values and the highest 25% of values, keeping only the Interquartile Range (IQR). The IQR contains the middle two quarters, or 50%, of the data values.

The following code shows one way to do that and then plots histograms of the original dataset and the clipped dataset. Various statistics for both datasets are also computed and printed.

The area of the histogram for the clipped dataset is approximately half that of the area of the histogram for the original dataset. This is to be expected since half of the original data values at the low end and the high end were discarded.

The distribution of the clipped dataset does not appear to be normal. Two normal distribution curves are drawn on the histogram for the clipped dataset for illustration. The green curve shows the normal distribution curve for the original dataset. The orange curve shows the normal distribution curve computed using the mean and standard deviation of the clipped dataset.
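As an aside, NumPy's percentile function offers a more direct route to the IQR than splitting on medians by hand. A minimal sketch on a simple stand-in dataset:

```python
import numpy as np

# A simple stand-in dataset: the integers 0 through 99.
data = list(range(100))

# np.percentile locates the 25th and 75th percentiles directly.
q1, q3 = np.percentile(data, [25, 75])

# Keep only the values inside the IQR (the middle 50%).
clipped = [x for x in data if q1 <= x <= q3]

assert abs(q1 - 24.75) < 1e-9 and abs(q3 - 74.25) < 1e-9
assert len(clipped) == 50  # half of the original 100 values remain
```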

In [16]:
#Create a normal dataset with an outlier
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [80]

#Sort the dataset to make it possible to divide the data values into quartiles.
g01.sort()

#Create a dataset by discarding the top 25% and the bottom 25% of the values in the original dataset.

#Use the median to subdivide the data into halves. Use Python list comprehension
g01Median = median(g01)
topHalf = [x for x in g01 if x >= g01Median]
bottomHalf = [x for x in g01 if x < g01Median]

#Get the median for each half
topHalfMedian = median(topHalf)
bottomHalfMedian = median(bottomHalf)

#Use the medians to divide the halves into halves, or into quartiles
firstQuartile = [x for x in bottomHalf if x < bottomHalfMedian]
secondQuartile = [x for x in bottomHalf if x >= bottomHalfMedian]
thirdQuartile = [x for x in topHalf if x <= topHalfMedian]
fourthQuartile = [x for x in topHalf if x > topHalfMedian]

#Concatenate the two center quartiles to become the new dataset
g02 = secondQuartile + thirdQuartile


#Compute and print statistics for both datasets
print()
print("sum of original dataset",sum(g01))
print("sum of clipped dataset",sum(g02))
print("sum of two center quartiles",sum(secondQuartile)+sum(thirdQuartile))

print("sum of firstQuartile",sum(firstQuartile))
print("sum of secondQuartile",sum(secondQuartile))
print("sum of thirdQuartile",sum(thirdQuartile))
print("sum of fourthQuartile",sum(fourthQuartile))

print("Number of g01 values =",len(g01))
print("Number of g02 values =",len(g02))

print("g01 range =",max(g01) - min(g01))
print("g02 range =",max(g02) - min(g02))
print()

g01Bar = mean(g01)
print("g01Bar =",g01Bar)

g02Bar = mean(g02)
print("g02Bar =",g02Bar)

g01Std = stdev(g01)
print("g01Std =", g01Std)

g02Std = stdev(g02)
print("g02Std =", g02Std)

g01Median = median(g01)
print("g01Median =",g01Median)

g02Median = median(g02)
print("g02Median =",g02Median)

#Plot and label histograms for both datasets
fig,axes = plt.subplots(1,2)

g01N,g01Bins,g01Pat = axes[0].hist(g01,bins=136,normed=True,range=(min(g01),max(g01)))
axes[0].set_title('g01')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Relative Frequency')

g02N,g02Bins,g02Pat = axes[1].hist(g02,bins=136,normed=True,range=(min(g01),max(g01)))
axes[1].set_title('g02')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Relative Frequency')

axes[0].grid(True)
axes[1].grid(True)

#Compute the values for a normal probability density curve for the 
# g01 mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]
axes[0].plot(x,y,label='normal probability density')

#Superimpose the normal probability density curve for g01 on the histogram for g02.
axes[1].plot(x,y,label='normal probability density')

#Compute the values for a normal probability density curve for the 
# g02 mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
y = [normalProbabilityDensity(val,mu=g02Bar,sigma=g02Std) for val in x]
plt.plot(x,y)

plt.tight_layout()
plt.show()
sum of original dataset 400254.2526316932
sum of clipped dataset 200247.7614701245
sum of two center quartiles 200247.76147012465
sum of firstQuartile 67992.5827531311
sum of secondQuartile 91499.77733209971
sum of thirdQuartile 108747.98413802496
sum of fourthQuartile 132013.9084084383
Number of g01 values = 10000
Number of g02 values = 5000
g01 range = 68.47115817252995
g02 range = 14.149366557701335

g01Bar = 40.02542526316937
g02Bar = 40.04955229402483
g01Std = 9.996694607203745
g02Std = 4.001927446346313
g01Median = 40.01678513788222
g02Median = 40.01678513788222

Outliers and quartiles

To understand the definition of an outlier, you must first understand quartiles. You can divide the dataset into quartiles with the following procedure:

  1. Sort the dataset into ascending order.
  2. Find the median of the dataset.
  3. Divide the dataset into two halves at the median.
  4. Find the median of each half.
  5. Going from smaller to larger, label the three median values as Q1, Q2, and Q3
  6. Compute the Interquartile Range (IQR) as Q3 - Q1.

Any value x is an outlier if either of the following is true:

  • x < Q1 - 1.5*IQR
  • x > Q3 + 1.5*IQR
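Here is a minimal sketch of the procedure and the outlier test applied to a small made-up dataset, chosen so that the two extreme values qualify as outliers:

```python
from statistics import median

# 1. Sort the dataset into ascending order.
data = sorted([3, 12, 13, 14, 15, 16, 17, 18, 40])

# 2-4. Find the median, split at the median, and find the median
#      of each half (matching the splitting used later in this page).
Q2 = median(data)                          # 15
bottomHalf = [x for x in data if x < Q2]
topHalf = [x for x in data if x >= Q2]
Q1 = median(bottomHalf)                    # 12.5
Q3 = median(topHalf)                       # 17

# 5-6. Compute the IQR and the outlier limits.
IQR = Q3 - Q1                              # 4.5
lowerOutlierLimit = Q1 - 1.5 * IQR         # 5.75
upperOutlierLimit = Q3 + 1.5 * IQR         # 23.75

outliers = [x for x in data
            if x < lowerOutlierLimit or x > upperOutlierLimit]
assert outliers == [3, 40]
```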

The following code creates a relatively normal dataset with two outliers. One outlier on the high end was put there on purpose. The other outlier on the low end resulted from the averaging procedure used to create the dataset.

Then the program computes and prints several values using the above procedure, including the IQR and the lower and upper outlier limits. The program also displays the mean and the standard deviation for the dataset.

Then the program plots a histogram for the dataset and plots a normal distribution curve on the histogram.

If you understand how the x-axis scale is determined for the histogram, you can tell by viewing the histogram that there is an outlier with a value of approximately 120. However, the outlier at the low end isn't obvious because it is barely below the lower outlier limit. You will see later how I knew that it was there.

In [17]:
#Create a normal dataset with an outlier
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [120]

#Sort the dataset to make it possible to divide the data values into quartiles.
g01.sort()

#Use the median to subdivide the data into halves. Use Python list comprehension
Q2 = median(g01)
topHalf = [x for x in g01 if x >= Q2]
bottomHalf = [x for x in g01 if x < Q2]

#Get the median for each half
Q3 = median(topHalf)
Q1 = median(bottomHalf)
IQR = Q3 - Q1
lowerOutlierLimit = Q1 - 1.5*IQR
upperOutlierLimit = Q3 + 1.5*IQR

print("Q1 =",Q1)
print("Q2 =",Q2)
print("Q3 =",Q3)
print("IQR =",IQR)
print("lowerOutlierLimit =",lowerOutlierLimit)
print("upperOutlierLimit =",upperOutlierLimit)

print()

g01Bar = mean(g01)
print("g01Bar =",g01Bar)

g01Std = stdev(g01)
print("g01Std =", g01Std)

#Plot and label histogram
fig,axes = plt.subplots(1,1)

g01N,g01Bins,g01Pat = axes.hist(g01,bins=136,density=True,range=(min(g01),max(g01)))
axes.set_title('g01')
axes.set_xlabel('x')
axes.set_ylabel('Relative Frequency')

axes.grid(True)

#Compute the values for a normal probability density curve for the 
# g01 mu and sigma across the same range of values and
# superimpose the normal probability density curve on the histogram.
x = np.arange(g01Bins[0],g01Bins[len(g01Bins)-1],0.1)
y = [normalProbabilityDensity(val,mu=g01Bar,sigma=g01Std) for val in x]
axes.plot(x,y,label='normal probability density')

plt.tight_layout()
plt.show()
Q1 = 32.98235916117212
Q2 = 40.01678513788222
Q3 = 47.13997603963223
IQR = 14.15761687846011
lowerOutlierLimit = 11.745933843481954
upperOutlierLimit = 68.3764013573224

g01Bar = 40.029425263169365
g01Std = 10.020665229936043

Box and whisker plots

Box and whisker Example 1

In an earlier exercise, I told you that the dataset being analyzed has an outlier very close to the lower outlier limit that was not obvious by viewing the histogram. I promised to tell you later how I knew it was there.

You learned how to compute the lower and upper outlier limits in that exercise.

The following code creates a relatively normal dataset with three outliers. Two outliers on the high end were put there for purposes of illustration. The third outlier on the low end resulted from the averaging procedure used to create the dataset.

Then the program computes and prints several values, including the IQR and the lower and upper outlier limits. The program also displays the mean and the standard deviation for the dataset.

Then the program plots a histogram for the dataset and plots a normal distribution curve on the histogram.

Finally, and this is new to this exercise, the program plots a "box and whisker" plot for the data. Box and whisker plots are ideal for comparing distributions because the center, the spread, and the overall range are immediately apparent, as are the outliers.

Think of the horizontal center line as representing the full range of the data with low values at the left and high values at the right. (The plot can also be oriented vertically with high values at the top.) The three small circles at the right represent the three outliers that I added to the normal dataset for illustration purposes. The small circle on the left represents a low-end outlier that was a normal consequence of the manner in which the dataset was created. (This is how I knew that there is an outlier at the low end.)

The red vertical line represents the median of the dataset, or Q2. The left end of the box represents Q1 and the right end of the box represents Q3. The width of the box represents the IQR.

The short vertical lines to the left and right of the box are the ends of the whiskers. By default, Matplotlib draws each whisker out to the most extreme data value that still lies within the outlier limits, so the whisker ends fall at or just inside those limits. This is easy to demonstrate experimentally. I added an outlier with a value of 68.5, and the short vertical line on the right is almost centered on the circle that represents that outlier. The computed value of the upper outlier limit is 68.43, only slightly less than the value of the outlier that I added to the dataset. Thus the position of the short vertical line is a close match to the value of the upper outlier limit.
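The whisker behavior can also be checked programmatically: the dictionary of artists returned by boxplot exposes the caps (the short lines at the whisker ends) and the fliers (the outlier circles), so we can read off exactly where Matplotlib draws them. This is a small sketch on a made-up dataset; the default whis=1.5 corresponds to the 1.5*IQR rule:

```python
import matplotlib
matplotlib.use("Agg")          # draw off-screen; no display needed
import matplotlib.pyplot as plt

data = [10, 12, 14, 15, 16, 18, 20, 45]   # 45 lies above Q3 + 1.5*IQR
fig, ax = plt.subplots()
bp = ax.boxplot(data)          # default whis=1.5, i.e. the 1.5*IQR rule

#The caps mark the whisker ends; the fliers mark the outliers
capEnds = sorted(cap.get_ydata()[0] for cap in bp["caps"])
fliers = list(bp["fliers"][0].get_ydata())
print("whisker ends:", capEnds)   # most extreme values inside the limits
print("fliers:", fliers)
```

For this dataset Q1 = 13.5 and Q3 = 18.5, so the outlier limits are 6 and 26; the whiskers end at 10 and 20, the most extreme data values inside those limits, and 45 is drawn as a flier.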

In [18]:
#Create a normal dataset with outliers.
g01 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [68.5] + [100] + [120]

#Sort the dataset to make it possible to divide the data values into quartiles.
g01.sort()

#Use the median to subdivide the data into halves. Use Python list comprehension
Q2 = median(g01)
topHalf = [x for x in g01 if x >= Q2]
bottomHalf = [x for x in g01 if x < Q2]

#Get the median for each half
Q3 = median(topHalf)
Q1 = median(bottomHalf)
IQR = Q3 - Q1
lowerOutlierLimit = Q1 - 1.5*IQR
upperOutlierLimit = Q3 + 1.5*IQR

print("Q1 =",Q1)
print("Q2 =",Q2)
print("Q3 =",Q3)
print("IQR =",IQR)
print("lowerOutlierLimit =",lowerOutlierLimit)
print("upperOutlierLimit =",upperOutlierLimit)
print("min =",min(g01))
print("max =",max(g01))

print()

g01Bar = mean(g01)
print("g01Bar =",g01Bar)

g01Std = stdev(g01)
print("g01Std =", g01Std)

#Create a figure with one row and two columns
fig,axes = plt.subplots(1,2)

#Call the plotHistAndBox function to plot the dataset
plotHistAndBox(g01,axes)

axes[0].grid(True)
axes[1].grid(True)
plt.tight_layout()
plt.show()
Q1 = 32.98790609425962
Q2 = 40.019366586363226
Q3 = 47.164764357178534
IQR = 14.176858262918913
lowerOutlierLimit = 11.722618699881252
upperOutlierLimit = 68.4300517515569
min = 11.52884182747004
max = 120

g01Bar = 40.03826760964744
g01Std = 10.041625017758173

Box and whisker Example 2

This example is similar to the previous one except that it creates and plots histograms and box plots for three datasets having different spreads and outliers at the same values. This illustrates the usefulness of the box and whisker plot for comparing distributions: the center, the spread, and the overall range are immediately apparent, as are the outliers.

The program plots a normal distribution curve on the histogram for each dataset showing the extent to which the dataset approximates a normal distribution.

In [19]:
#Create three datasets with different spreads and with outliers.
g01 = normalRandomGenerator(dataLength=9999,numberSamples=1,lowLim=10,highLim=70,seed=1) + [68.5] + [80] + [90]
g02 = normalRandomGenerator(dataLength=9999,numberSamples=2,lowLim=10,highLim=70,seed=1) + [68.5] + [80] + [90]
g03 = normalRandomGenerator(dataLength=9999,numberSamples=3,lowLim=10,highLim=70,seed=1) + [68.5] + [80] + [90]

#Create a figure with three rows and two columns
fig,axes = plt.subplots(3,2)

#Call the plotHistAndBox function to process the first dataset
plotHistAndBox(g01,axes,axesRow=0,multiDim=True)

#Process the second dataset
plotHistAndBox(g02,axes,axesRow=1,multiDim=True)
               
#Process the third dataset
plotHistAndBox(g03,axes,axesRow=2,multiDim=True)


axes[0,0].grid(True)
axes[0,1].grid(True)
axes[1,0].grid(True)
axes[1,1].grid(True)
axes[2,0].grid(True)
axes[2,1].grid(True)
plt.tight_layout()
plt.show()

Probability of a value within one sigma of the mean

Experimentally find and print the probability that a value in a data set with a near-normal distribution will be within one sigma on each side of the mean. Express the probability as a percentage with no decimal.

In [20]:
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
lowLimit = xBar - stDev
print("lowLimit =",lowLimit)
highLimit = xBar + stDev
print("highLimit =",highLimit)
lowZ = (lowLimit - xBar)/stDev
print("lowZ =",lowZ)
highZ = (highLimit - xBar)/stDev
print("highZ =",highZ)
probability = cumulativeDistribution(highZ) - cumulativeDistribution(lowZ)
print("probability that x is within one sigma =",int(probability*100//1),"percent")
xBar = 500.1666185602021
stdev = 41.56617034378628
lowLimit = 458.6004482164158
highLimit = 541.7327889039884
lowZ = -0.9999999999999997
highZ = 1.000000000000001
probability that x is within one sigma = 68 percent
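The empirical 68 percent agrees with theory. The standard normal CDF can be written directly in terms of math.erf; this is a standalone sketch, not the tutorial's own cumulativeDistribution utility, which is defined elsewhere and may differ in implementation:

```python
import math

def standardNormalCdf(z):
    #Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return (1 + math.erf(z / math.sqrt(2))) / 2

#Probability that a standard normal value lies within one sigma of the mean
probability = standardNormalCdf(1) - standardNormalCdf(-1)
print(int(probability * 100), "percent")   # prints: 68 percent
```

The exact value is about 0.6827, which truncates to the 68 percent found experimentally above.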

Probability of a value within two sigma of the mean

Experimentally find and print the probability that a value in a data set with a near-normal distribution will be within two sigma on each side of the mean. Express the probability as a percentage with no decimal.

In [21]:
data = normalRandomGenerator(dataLength=1000,numberSamples=49,lowLim=0,highLim=1000)
xBar = mean(data)
print("xBar =",xBar)
stDev = stdev(data)
print("stdev =", stDev)
lowLimit = xBar - 2*stDev
print("lowLimit =",lowLimit)
highLimit = xBar + 2*stDev
print("highLimit =",highLimit)
lowZ = (lowLimit - xBar)/stDev
print("lowZ =",lowZ)
highZ = (highLimit - xBar)/stDev
print("highZ =",highZ)
probability = cumulativeDistribution(highZ) - cumulativeDistribution(lowZ)
print("probability that x is within two sigma =",int(probability*100//1),"percent")
xBar = 500.1666185602021
stdev = 41.56617034378628
lowLimit = 417.03427787262956
highLimit = 583.2989592477746
lowZ = -1.9999999999999993
highZ = 1.9999999999999993
probability that x is within two sigma = 95 percent
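The theoretical value for two sigma is about 95.45 percent, and extending the same calculation to three sigma gives about 99.73 percent, the familiar 68-95-99.7 rule. A quick check using math.erf (again a standalone sketch, not the tutorial's cumulativeDistribution function):

```python
import math

def standardNormalCdf(z):
    #Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return (1 + math.erf(z / math.sqrt(2))) / 2

#Print the within-k-sigma probability for k = 1, 2, 3
for k in (1, 2, 3):
    p = standardNormalCdf(k) - standardNormalCdf(-k)
    print(k, "sigma:", round(p * 100, 2), "percent")
```

The loop prints approximately 68.27, 95.45, and 99.73 percent, matching the empirical 95 percent (truncated) found above for two sigma.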

--To be continued in the tutorial titled Statistics Exercises Part 3--

Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX

File: StatisticsPart02.ipynb

Revised: 04/21/18

Copyright 2018 Richard G. Baldwin

-end-