This is the fifth of several exercise pages on visualization using Matplotlib that were prepared for use in the course ITSE 1302 at Austin Community College.
The remainder of this exercise page will concentrate on box and whisker plots. Before continuing, it is recommended that you read Box Plot: Display of Distribution to confirm that you have an overall understanding of box and whisker plots.
Pay particular attention to the statement that reads "This simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median)." on the webpage referenced above.
It is also recommended that you read Notched Box Plots to confirm that you have an understanding of that aspect of the topic as well.
Pay particular attention to the statement that reads "The Notch - displays the a confidence interval around the median which is normally based on the median +/- 1.57 x IQR/sqrt of n." on the webpage referenced above.
import numpy as np
import matplotlib.pyplot as plt
import random
from statistics import mean
from statistics import median
from statistics import stdev
import math
Define a function from which you can obtain values for the normal probability density curve for any input values of mu and sigma. See Normal Distribution Density for a table of the expected values. See Engineering Statistics Handbook for a definition of the equation for the normal probability density function.
def normalProbabilityDensity(x,mu=0,sigma=1.0):
    '''
    Computes and returns values for the normal probability density function
    required input: x-axis value
    optional input: mu
    optional input: sigma
    returns: The computed value of the function for the given x, mu, and sigma. If mu and sigma are not provided as
    input arguments, the function reverts to returning values for the standard normal distribution curve with mu = 0
    and sigma = 1.0
    '''
    eVal = 2.718281828459045
    exp = -((x-mu)**2)/(2*(sigma**2))
    numerator = pow(eVal,exp)
    denominator = sigma*(math.sqrt(2*math.pi))
    return numerator/denominator
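As a quick sanity check on the formula, the following sketch evaluates the same expression and compares it with a few well-known properties of the normal curve. The function name normal_pdf is illustrative only, and math.exp is substituted here for the explicit constant e used above.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density at x (same formula, using math.exp)."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# The peak of the standard normal curve is 1/sqrt(2*pi), roughly 0.3989.
peak = normal_pdf(0)

# The curve is symmetric about mu.
left = normal_pdf(40, mu=50, sigma=10)
right = normal_pdf(60, mu=50, sigma=10)

# A crude Riemann sum from -6 to +6 should be very close to 1.
step = 0.01
steps = 1200
area = sum(normal_pdf(-6 + i * step) * step for i in range(steps + 1))

print(round(peak, 4), math.isclose(left, right), round(area, 4))
```

If the three checks agree with expectations (a peak near 0.3989, matching values on either side of mu, and an area near 1), the function is behaving like a probability density.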
Define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword argument value of numberSamples=1. The resulting dataset tends more towards normal as the value of numberSamples is increased.
def normalRandomGenerator(seed=1,dataLength=10000,numberSamples=50,lowLim=0,highLim=100):
    '''
    Create a new dataset of dataLength values consisting of the average of numberSamples
    samples taken from a population of uniformly distributed values between lowLim
    and highLim generated with a seed of seed.
    Input keyword arguments and their default values:
    seed = 1 seed used to generate the uniformly distributed values
    dataLength = 10000 number of samples in the returned list of values
    numberSamples = 50 number of samples taken from the uniformly distributed population
    and averaged into the output
    lowLim = 0 lower limit value of the uniformly distributed population
    highLim = 100 high limit value of the uniformly distributed population
    returns: a list containing the dataset
    '''
    data = []
    random.seed(seed)
    for cnt in range(dataLength):
        theSum = 0
        for cnt1 in range(numberSamples):
            theSum += random.uniform(lowLim,highLim)
        data.append(theSum/numberSamples)
    return data
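The effect of numberSamples on the spread of the data can be checked numerically. The sketch below is a self-contained illustration rather than the function above: the helper name mean_of_uniform_samples and the use of a private random.Random generator are assumptions made for the example. It relies on two standard results: a uniform distribution on [low, high] has a standard deviation of (high - low)/sqrt(12), and averaging n independent samples divides the standard deviation by sqrt(n).

```python
import math
import random
from statistics import stdev

def mean_of_uniform_samples(n_values, n_averaged, low, high, rng):
    """Return n_values numbers, each the average of n_averaged uniform draws."""
    return [
        sum(rng.uniform(low, high) for _ in range(n_averaged)) / n_averaged
        for _ in range(n_values)
    ]

rng = random.Random(1)  # private generator, so the global seed is untouched
low, high = 0, 100

# Standard deviation of a uniform distribution on [low, high].
uniform_sigma = (high - low) / math.sqrt(12)

spreads = {}
for n_averaged in (1, 2, 4):
    values = mean_of_uniform_samples(10000, n_averaged, low, high, rng)
    spreads[n_averaged] = stdev(values)
    expected = uniform_sigma / math.sqrt(n_averaged)
    print(n_averaged, round(spreads[n_averaged], 2), round(expected, 2))
```

Each doubling of the number of averaged samples should reduce the measured standard deviation by a factor of roughly 1.414, which is the behavior used later on this page to create datasets with different spreads.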
The function defined below is more powerful than a function with the same name that was defined on an earlier exercise page. This version incorporates some, but not all, of the keyword arguments of the Axes.boxplot method that support different variations of the plot.
Define a function that plots a histogram and a "box and whisker" plot for an incoming dataset on a 1x2 row of an incoming figure specified by axes. The row index on which to plot the data is specified by axesRow. The function also creates and plots a normal probability density curve on the histogram based on the mean and standard deviation of the dataset. Set multiDim to True if axes is a multi-dimensional array.
Set the supported keyword arguments described for the Axes.boxplot method to create different variations of the plot.
def plotHistAndBox(data,axes,axesRow=0,multiDim=False,
                   showmeans=None,
                   meanline=None,
                   showbox=None,
                   showcaps=None,
                   notch=None,
                   bootstrap=None,
                   showfliers=None,
                   sym=None,
                   boxprops=None,
                   flierprops=None,
                   medianprops=None,
                   meanprops=None,
                   capprops=None,
                   whiskerprops=None
                   ):
    '''
    Plots a histogram and a "box and whisker" plot for
    an incoming dataset on a 1x2 row of an incoming figure
    specified by axes. The row index on which to plot the
    data is specified by axesRow. Also creates and plots
    a normal probability density curve on the histogram
    based on the mean and standard deviation of the
    dataset. Set multiDim to True if axes is a
    multi-dimensional array.
    '''
    dataBar = mean(data)
    dataStd = stdev(data)
    #Select the pair of axes on which to plot.
    if multiDim:
        histAxes = axes[axesRow,0]
        boxAxes = axes[axesRow,1]
    else:
        histAxes = axes[0]
        boxAxes = axes[1]
    #Plot and label histogram. The density argument replaces the older
    # normed argument, which is not accepted by recent Matplotlib versions.
    dataN,dataBins,dataPat = histAxes.hist(
        data,bins=136,density=True,range=(min(data),max(data)))
    histAxes.set_title('data')
    histAxes.set_xlabel('x')
    histAxes.set_ylabel('Relative Freq')
    #Plot a boxplot
    boxAxes.boxplot(data,vert=False,widths=0.75,
                    showmeans=showmeans,
                    meanline=meanline,
                    showbox=showbox,
                    showcaps=showcaps,
                    notch=notch,
                    bootstrap=bootstrap,
                    showfliers=showfliers,
                    sym=sym,
                    boxprops=boxprops,
                    flierprops=flierprops,
                    medianprops=medianprops,
                    meanprops=meanprops,
                    capprops=capprops,
                    whiskerprops=whiskerprops
                    )
    boxAxes.set_title('Box and Whisker Plot')
    boxAxes.set_xlabel('x')
    #Compute the values for a normal probability density curve for the
    # data mu and sigma across the same range of values.
    x = np.arange(dataBins[0],dataBins[-1],0.1)
    y = [normalProbabilityDensity(
        val,mu=dataBar,sigma=dataStd) for val in x]
    #Superimpose the normal probability density curve on the histogram.
    histAxes.plot(x,y,label='normal probability density')
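The multiDim argument is needed because plt.subplots returns a one-dimensional array of axes for a single row of plots and a two-dimensional array for multiple rows. One possible alternative, offered here as a sketch and not as a change to the function above, is to pass squeeze=False to plt.subplots so that the returned array is always two-dimensional. The Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display is required
import matplotlib.pyplot as plt

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Default behavior: one row of two plots gives a 1-D array of axes.
fig1, axes1 = plt.subplots(1, 2)
print(axes1.shape)

# With squeeze=False the array is always 2-D, even for a single row,
# so axes[row, col] indexing works for any grid size.
fig2, axes2 = plt.subplots(1, 2, squeeze=False)
print(axes2.shape)

axes2[0, 0].hist(sample, bins=5, density=True)
axes2[0, 1].boxplot(sample)

plt.close(fig1)
plt.close(fig2)
```

With this approach a caller could always pass a row index, at the cost of having to remember the squeeze=False argument whenever a figure is created.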
The Axes.boxplot method supports a large number of keyword arguments that can be used to produce different variations of the plot. We will begin by creating box plots for three different distributions of data using the default arguments for the method.
#Create three datasets with different spreads and with outliers.
g01 = normalRandomGenerator(dataLength=99,numberSamples=1,
lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
g02 = normalRandomGenerator(dataLength=9999,numberSamples=2,
lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
g03 = normalRandomGenerator(dataLength=9999,numberSamples=4,
lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
#Create a figure with three rows and two columns
fig,axes = plt.subplots(3,2,figsize=(4,4))
#Call the plotHistAndBox function to process the first dataset
plotHistAndBox(g01,axes,axesRow=0,multiDim=True)
#Process the second dataset
plotHistAndBox(g02,axes,axesRow=1,multiDim=True)
#Process the third dataset
plotHistAndBox(g03,axes,axesRow=2,multiDim=True)
axes[0,0].grid(True)
axes[0,1].grid(True)
axes[1,0].grid(True)
axes[1,1].grid(True)
axes[2,0].grid(True)
axes[2,1].grid(True)
plt.tight_layout()
plt.show()
The orange vertical line in each of the three box plots shown above represents the median of the distribution. The left and right edges of the box represent the first and third quartiles, respectively. The width of the box represents the IQR or interquartile range. The short vertical lines (caps) at the ends of the whiskers mark the most extreme values that are not considered to be outliers, or fliers in Matplotlib terminology.
The top plot shown above was created for data with a uniform distribution. The spread of this data was so large that none of the values were considered to be outliers.
The middle plot was created for data with a more nearly normal distribution and approximately half the variance, which corresponds to a reduction in the standard deviation by a factor of approximately 1.414. In this case, the two circles on the right represent outliers.
The bottom plot was created for data with another reduction in standard deviation by a factor of approximately 1.414. Eight to ten values were considered to be outliers in this case. Because the circles overlap, it is not possible to count the exact number of outliers.
The Axes.boxplot method refers to outliers as fliers.
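By default, Axes.boxplot treats a value as a flier if it lies more than 1.5 times the IQR below the first quartile or above the third quartile. The following sketch applies that rule by hand to two small made-up datasets. The function names tukey_fences and find_fliers are illustrative, and the quartiles are computed with statistics.quantiles using linear interpolation, which is assumed here to match the quartiles that Matplotlib computes.

```python
from statistics import quantiles

def tukey_fences(data, whis=1.5):
    """Return the (low, high) limits beyond which values count as outliers."""
    # method="inclusive" interpolates linearly between sorted data points.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

def find_fliers(data, whis=1.5):
    """Return the values that fall outside the fences."""
    low, high = tukey_fences(data, whis)
    return [v for v in data if v < low or v > high]

# A widely spread dataset and a tightly clustered one,
# each with the same two large values appended.
wide = list(range(10, 71)) + [80, 90]
narrow = list(range(30, 51)) + [80, 90]

print(tukey_fences(wide), find_fliers(wide))
print(tukey_fences(narrow), find_fliers(narrow))
```

The same two appended values are flagged only in the tightly clustered dataset, because the fences move inward as the IQR shrinks. This is the effect seen when moving from the top plot to the bottom plot above.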
You can change the symbol used to show outliers with the sym keyword. This is illustrated by the following code.
#Create three datasets with different spreads and with outliers.
g01 = normalRandomGenerator(dataLength=99,numberSamples=1,
lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
g02 = normalRandomGenerator(dataLength=9999,numberSamples=2,
lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
g03 = normalRandomGenerator(dataLength=9999,numberSamples=4,
lowLim=10,highLim=70,seed=2) + [68.5] + [80] + [90]
#Create a figure with three rows and two columns
fig,axes = plt.subplots(3,2,figsize=(4,4))
#Call the plotHistAndBox function to process the first dataset
plotHistAndBox(g01,axes,axesRow=0,multiDim=True)
#Process the second dataset
plotHistAndBox(g02,axes,axesRow=1,multiDim=True,sym='+')
#Process the third dataset
plotHistAndBox(g03,axes,axesRow=2,multiDim=True,sym='D')
axes[0,0].grid(True)
axes[0,1].grid(True)
axes[1,0].grid(True)
axes[1,1].grid(True)
axes[2,0].grid(True)
axes[2,1].grid(True)
plt.tight_layout()
plt.show()
This is the same data as in the previous example. There are no outliers in the top box plot shown above. The symbol for outliers in the middle box plot is set to '+'. The symbol for outliers in the bottom box plot is set to 'D' for diamond.
The following code illustrates the effect of the showmeans, meanline, and showbox arguments.
#Create a skewed dataset with outliers.
g01 = normalRandomGenerator(dataLength=10000,
numberSamples=3,lowLim=5,highLim=50,seed=1)
g02 = normalRandomGenerator(dataLength=7000,
numberSamples=2,lowLim=20,highLim=80,seed=2)
g03 = normalRandomGenerator(dataLength=4000,
numberSamples=1,lowLim=30,highLim=90,seed=3)
g04 = g01 + g02 + g03 +[100] + [110] + [120]
#Create a figure with three rows and two columns
fig,axes = plt.subplots(3,2,figsize=(4,4))
plotHistAndBox(g04,axes,axesRow=0,multiDim=True,
showmeans=True)
plotHistAndBox(g04,axes,axesRow=1,multiDim=True,
showmeans=True,meanline=True)
plotHistAndBox(g04,axes,axesRow=2,multiDim=True,
showmeans=True,meanline=True,showbox=False)
axes[0,0].grid(True)
axes[0,1].grid(True)
axes[1,0].grid(True)
axes[1,1].grid(True)
axes[2,0].grid(True)
axes[2,1].grid(True)
plt.tight_layout()
plt.show()
The top plot shown above sets the showmeans argument to True and shows the mean value as a green triangle.
The middle plot sets the showmeans and meanline arguments to True. This causes the program to display the mean value as a green dashed line.
The bottom plot sets the showbox argument to False without changing the other two arguments. As in the middle plot, the mean value is displayed as a green dashed line, but the surrounding box is hidden.
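Showing the mean in addition to the median is most informative for skewed data, because the mean is pulled toward the long tail while the median is not. The following minimal sketch uses two small made-up datasets (not the g04 dataset above) to show the difference.

```python
from statistics import mean, median

# A small right-skewed sample: most values are low, a few are much larger.
skewed = [12, 14, 15, 15, 16, 18, 20, 25, 40, 75]
# A symmetric sample for comparison.
symmetric = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

for name, values in (("skewed", skewed), ("symmetric", symmetric)):
    print(name, "mean =", mean(values), "median =", median(values))
```

For the symmetric sample the two measures coincide, so a mean marker would sit on top of the median line. For the skewed sample they separate, which is what makes the showmeans argument useful.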
The following code illustrates the effect of the showcaps, showfliers, and notch arguments.
#Create a skewed dataset with outliers.
g01 = normalRandomGenerator(dataLength=10000,
numberSamples=3,lowLim=5,highLim=50,seed=1)
g02 = normalRandomGenerator(dataLength=7000,
numberSamples=2,lowLim=20,highLim=80,seed=2)
g03 = normalRandomGenerator(dataLength=4000,
numberSamples=1,lowLim=30,highLim=90,seed=3)
g04 = g01 + g02 + g03 +[100] + [110] + [120]
#Create a figure with three rows and two columns
fig,axes = plt.subplots(3,2,figsize=(4,4))
plotHistAndBox(g04,axes,axesRow=0,multiDim=True,
showcaps=False)
plotHistAndBox(g04,axes,axesRow=1,multiDim=True,
showfliers=False)
#Create a dataset that will illustrate a notch.
g05=[99.2,99.0,100.0,111.6,122.2,117.6,121.1,136.0,
154.2,153.6,158.5,140.6,136.2,168.0,154.3,149.0]
#Process the dataset for the notch.
plotHistAndBox(g05,axes,axesRow=2,multiDim=True,notch=True)
axes[0,0].grid(True)
axes[0,1].grid(True)
axes[1,0].grid(True)
axes[1,1].grid(True)
axes[2,0].grid(True)
axes[2,1].grid(True)
plt.tight_layout()
plt.show()
The top plot sets the value of showcaps to False causing the caps (the short vertical lines) on the ends of whiskers to disappear.
The middle plot sets the value of showfliers to False causing the outliers (fliers) to disappear. This in turn causes the horizontal scale to be modified such that the whiskers occupy almost the entire width of the plot.
The bottom plot uses a small dataset chosen to illustrate a notch and sets the value of notch to True, producing a box with a notch around the median.
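The extent of the notch can be estimated by hand using the formula quoted near the top of this page, median +/- 1.57 x IQR/sqrt of n. The sketch below applies that formula to the g05 values. It assumes that the bootstrap argument is left unset and that quartiles are computed by linear interpolation, so the result should be treated as an approximation of what is drawn rather than a statement about Matplotlib internals.

```python
import math
from statistics import median, quantiles

# Same values as the g05 dataset used for the bottom plot above.
g05 = [99.2, 99.0, 100.0, 111.6, 122.2, 117.6, 121.1, 136.0,
       154.2, 153.6, 158.5, 140.6, 136.2, 168.0, 154.3, 149.0]

med = median(g05)
# Quartiles by linear interpolation between sorted data points.
q1, _, q3 = quantiles(g05, n=4, method="inclusive")
iqr = q3 - q1

# Half-width of the notch: 1.57 * IQR / sqrt(n).
half_width = 1.57 * iqr / math.sqrt(len(g05))
notch_low, notch_high = med - half_width, med + half_width

print("median:", round(med, 2))
print("IQR:", round(iqr, 2))
print("notch:", round(notch_low, 2), "to", round(notch_high, 2))
```

Because the half-width is divided by the square root of the number of values, a small dataset such as this one produces a wide, clearly visible notch, while a dataset with thousands of values produces a notch that is barely noticeable.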
In addition to the essential properties of the box and whisker plot discussed above, it is also possible to set a number of other properties that are more in the nature of cosmetics. These properties are controlled by the following keyword arguments: boxprops, flierprops, medianprops, meanprops, capprops, and whiskerprops.
Each of these arguments is a dictionary that contains specifications for one or more properties of the portion of the plot to which the argument applies.
#Create a skewed dataset with outliers.
g01 = normalRandomGenerator(dataLength=10000,
numberSamples=3,lowLim=5,highLim=50,seed=1)
g02 = normalRandomGenerator(dataLength=7000,
numberSamples=2,lowLim=20,highLim=80,seed=2)
g03 = normalRandomGenerator(dataLength=4000,
numberSamples=1,lowLim=30,highLim=90,seed=3)
g04 = g01 + g02 + g03 +[120] + [160]
#Create a figure with one row and two columns.
fig,axes = plt.subplots(1,2,figsize=(6,3))
#Create dictionaries containing specifications for
# various cosmetic properties.
myBoxProps = dict(linestyle=':',linewidth=2,color='red')
myFlierProps = dict(marker='D',markerfacecolor='green',
markeredgecolor='red',
markeredgewidth=2,
markersize=12,linestyle='none')
myMedianProps = dict(linestyle='--',linewidth=2,color='blue')
myMeanProps = dict(linestyle='-.',linewidth=2,color='purple')
myCapProps = dict(linestyle=':',linewidth=3,color='green')
myWhiskerProps = dict(linestyle=':',linewidth=3,color='green')
#Apply the cosmetic properties.
plotHistAndBox(g04,axes,
boxprops = myBoxProps,
flierprops=myFlierProps,
medianprops=myMedianProps,
showmeans=True,
meanline=True,
meanprops=myMeanProps,
capprops=myCapProps,
whiskerprops=myWhiskerProps
)
axes[0].grid(True)
axes[1].grid(True)
plt.tight_layout()
plt.show()
The box properties are set to produce a red dotted line with a width of 2.
The outlier or flier properties are set to produce rather large green diamond markers with a red border.
The median properties are set to produce a blue dashed line with a width of 2.
The mean properties are set to produce a purple dash-dot line with a width of 2.
The whisker properties and the cap properties are set to produce a green dotted line with a width of 3.
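The cosmetic properties can also be inspected or changed after the plot has been created, because Axes.boxplot returns a dictionary that groups the plotted artists by role. The following sketch uses a small made-up dataset and selects the Agg backend only so that it runs without a display. It is an illustration of the return value, not part of the plotHistAndBox function, which discards that value.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display is required
import matplotlib.pyplot as plt

data = [12, 14, 15, 15, 16, 18, 20, 25, 40, 75]

fig, ax = plt.subplots()
artists = ax.boxplot(
    data,
    showmeans=True,
    meanline=True,
    medianprops=dict(linestyle='--', linewidth=2, color='blue'),
    meanprops=dict(linestyle='-.', linewidth=2, color='purple'),
)

# The returned dictionary groups the artists by role.
for role in sorted(artists):
    print(role, len(artists[role]))

# The properties passed in medianprops are visible on the median line.
median_line = artists['medians'][0]
print(median_line.get_color(), median_line.get_linewidth())

# Individual artists can still be adjusted after the plot is drawn.
median_line.set_linewidth(3)
plt.close(fig)
```

A single box has two whiskers and two caps, so those lists each contain two entries, while the boxes, medians, and means lists contain one entry per dataset.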
Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX
File: VisualizationPart05.ipynb
Revised: 04/22/18
Copyright 2018 Richard G. Baldwin
-end-