This is the first of several exercise pages on statistical programming prepared for use in the course ITSE 1302 at Austin Community College.
These exercise pages provide a variety of statistical programming examples and were prepared using Jupyter Notebook. They will be delivered to students as HTML files. Students in this course are not expected to know how to create Jupyter Notebooks, but will learn how to use that tool later.
In order to make it possible to copy the code for the examples into a Python script and run them under the stand-alone Python interpreter, most of the code in these exercise pages was written in such a way that it can stand alone and run independently of any code that was written earlier in the notebook.
Many of the exercises use the matplotlib library for generating graphs and histograms. Most of the code was written to run properly without access to advanced libraries such as numpy, scipy, and pandas. (Students will learn about those libraries later.) However, there are a few situations where resources from the numpy library were used to simplify the code. Those resources will be fully explained the first time they are used.
These exercise pages are not intended to be used as stand-alone learning resources. Instead, they are provided as supplementary material for students using various online resources, such as the Udacity course titled Intro to Descriptive Statistics as their primary statistics learning resource.
To extract a section of code from the Jupyter notebook format and run it under the Python interpreter, copy the code into a text file with a .py extension and run that file with the stand-alone Python interpreter.
Note: you must have installed a Python version 3+ distribution such as Anaconda or WinPython that includes matplotlib, numpy, scipy, etc.
We will begin by importing libraries that may be required by code later in the notebook.
import math
from math import pow
import matplotlib.pyplot as plt
from statistics import mean
from statistics import median
from statistics import stdev
from statistics import pstdev
from statistics import mode
from statistics import StatisticsError
import random
from collections import Counter
import numpy as np
Define a utility function that returns a dataset of more or less normally distributed values. The returned dataset is uniformly distributed for an input keyword parameter value of numberSamples=1. The resulting dataset tends more toward normal as the value of numberSamples is increased.
def normalRandomGenerator(seed=1,
                          dataLength=10000,
                          numberSamples=50,
                          lowLim=0,
                          highLim=100):
    '''
    Create a new dataset of dataLength values consisting of the
    average of numberSamples samples taken from a population
    of uniformly distributed values between lowLim and highLim
    generated with a seed of seed.
    Input keyword parameters and their default values:
    seed = 1            seed used to generate the uniformly distributed values
    dataLength = 10000  number of values in the returned list
    numberSamples = 50  number of samples taken from the uniformly
                        distributed population and averaged into the output
    lowLim = 0          lower limit value of the uniformly distributed
                        population
    highLim = 100       high limit value of the uniformly distributed
                        population
    returns: a list containing the dataset
    '''
    data = []
    random.seed(seed)
    for cnt in range(dataLength):
        theSum = 0
        for cnt1 in range(numberSamples):
            theSum += random.uniform(lowLim,highLim)
        data.append(theSum/numberSamples)
    return data
Test the normalRandomGenerator function by creating and plotting the histogram for a dataset of random data produced by that function using the Axes.hist method from the matplotlib library.
As will be the case throughout these notebooks, the output is shown immediately below the code that created the output.
data = normalRandomGenerator(numberSamples=512,lowLim=100,highLim=200)
print("mean =",mean(data))
print("standard deviation =",stdev(data))
#Create a Figure object containing an Axes object into which we can
#plot a histogram.
fig,ax = plt.subplots(1,1)
print("type of fig =",type(fig))
print("type of ax =",type(ax))
#Plot and display the histogram with 100 bins.
ax.hist(data,bins=100)
plt.show()
def lmsBestFitLine(xDataIn,yDataIn):
    '''
    Compute the least mean square best fit line for the incoming
    x and y data. Returns a tuple containing the slope and the
    y-intercept of the line.
    '''
    xBar = sum(xDataIn)/len(xDataIn)
    yBar = sum(yDataIn)/len(yDataIn)
    numerator = 0
    for cnt in range(len(xDataIn)):
        numerator += (xDataIn[cnt] - xBar) * (yDataIn[cnt] - yBar)
    denominator = 0
    for cnt in range(len(xDataIn)):
        denominator += (xDataIn[cnt] - xBar)**2
    m = numerator/denominator
    b = yBar - m*xBar
    return (m,b)#slope,y-intercept
Test the function named lmsBestFitLine by drawing a best fit line on a scatter plot of negatively correlated data.
A correlation is a single numerical value that describes a relationship between two things, or variables.
In a positive correlation, both variables move in the same direction. In other words, as one variable increases, so does the other.
When two variables have a negative correlation, they have an inverse relationship.
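The strength and direction of a correlation are commonly summarized by the Pearson correlation coefficient r, which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). The following sketch computes r from scratch using only built-in resources; the function name pearsonR is our own choice for illustration and is not part of the standard library.

```python
from statistics import mean

def pearsonR(xData, yData):
    '''Compute the Pearson correlation coefficient for two
    equal-length lists of numbers. Returns a value in [-1, 1].'''
    xBar = mean(xData)
    yBar = mean(yData)
    #Sum of products of the deviations from the means
    numerator = sum((x - xBar)*(y - yBar)
                    for x, y in zip(xData, yData))
    #Square root of the product of the sums of squared deviations
    denominator = (sum((x - xBar)**2 for x in xData)
                   * sum((y - yBar)**2 for y in yData))**0.5
    return numerator/denominator

#A perfectly negative correlation: y decreases as x increases.
print(pearsonR([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```

Values near zero indicate little or no linear relationship between the two variables.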
Note that this code calls the numpy.arange function from the numpy library to create a numpy.ndarray object containing 20 uniformly spaced values ranging from 0 to 9.5.
fig,ax1 = plt.subplots(1,1)
#Create some data with a negative correlation
random.seed(1)
x = np.arange(0,10,0.5)
y = []
print("type of x =",type(x))
print("type of y =",type(y))
for val in x:
    y.append(20 - val + random.gauss(0,2))
#Plot the data in a scatter plot
ax1.scatter(x,y,c='red',s=50,edgecolor='blue')
#Compute the best fit line
m,b =lmsBestFitLine(x,y)
#Draw a best fit line on the scatter plot
x1 = min(x)
y1 = m*x1 + b
x2 = max(x)
y2 = m*x2 + b
ax1.plot([x1,x2], [y1,y2])
plt.tight_layout()
plt.grid(True)
plt.show()
The basic statistical measurements are the mean, the median, the mode, and the standard deviation.
It should come as no surprise that Python provides methods to compute and return all of these measurements on a dataset. The use of those methods will be illustrated in the code that follows.
A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively-skewed distributions. That’s because there is a long tail in the negative direction on the number line. The mean is usually to the left of the peak.
A right-skewed distribution has a long right tail. Right-skewed distributions are also called positive-skew distributions. That’s because there is a long tail in the positive direction on the number line. The mean is usually to the right of the peak.
The normal distribution is a symmetric distribution with no skew. In the ideal case, the tails are exactly the same.
See Statistics How To.
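A quick numerical check for skew is to compare the mean and the median. The following sketch, using a small hand-built dataset of our own choosing, shows how a long right tail pulls the mean above the median.

```python
from statistics import mean, median

#A small right-skewed dataset: most values are low, with a few
#large values forming a long tail to the right.
data = [1, 2, 2, 3, 3, 3, 4, 4, 10, 25]
print("median =", median(data))  # 3.0
print("mean =", mean(data))      # 5.7
#For a right-skewed (positively skewed) distribution, the mean
#exceeds the median; for a left-skewed one, the reverse holds.
```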
The median is the middle value in a dataset, or the average of the two middle values if the number of values is even. If the distribution is normal, the mean and the median are the same. However, that is not necessarily true if the distribution is not normal.
The following code finds and displays the median two different ways. It also displays histograms of the datasets.
The results illustrate the difference between the median and the mean for non-normal datasets. The results also illustrate that the median may be a better indicator of the center of the distribution than the mean in the presence of significant outliers.
Note: If you copy and run this code, or any of the other code that creates two or more output plots under the Python interpreter, you will probably need to dismiss each plot from the screen before the next plot will appear.
#Create a dataset with a skewed distribution and significant outliers.
print("First handle the case with an even number of samples")
data = [-10,-10,1,2,3,4,4,5]
print("Plot a histogram of the data with an even number of samples")
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
#Compute the median the hard way
hardMedian = 0
data.sort()
if len(data)%2 == 0:
    #Even
    hardMedian = (data[len(data)//2] + data[len(data)//2 - 1])/2
else:
    #Odd
    hardMedian = data[len(data)//2]
print("data length =",len(data))
print("median the hard way =",hardMedian)
print("median the easy way =",median(data))
print("xBar =",mean(data))
print("Dismiss the plot to continue")
plt.show()
print()
print("Create another dataset with a skewed distribution")
print("and an odd number of samples")
data = [1,2,3,4,4]
print("Plot a histogram of the data with an odd number of samples")
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
#Compute the median the hard way
hardMedian = 0
data.sort()
if len(data)%2 == 0:
    #Even
    hardMedian = (data[len(data)//2] + data[len(data)//2 - 1])/2
else:
    #Odd
    hardMedian = data[len(data)//2]
print("data length =",len(data))
print("median the hard way =",hardMedian)
print("median the easy way =",median(data))
print("xBar =",mean(data))
plt.show()
The mean is the average of the values in a sample. The mathematical equation for the mean is:
${\displaystyle {\bar {x}}={\frac {1}{n}}\left(\sum _{i=1}^{n}{x_{i}}\right)={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}}}$
In case you aren't familiar with formulas like this, the Greek letter Σ means sum or summation. Briefly, this equation means that the mean is equal to the sum of all the values in a dataset divided by the number of values in the dataset.
The following code computes the mean of a dataset using three different approaches. All three approaches produce essentially the same result; any differences appear only in the final decimal places due to floating-point rounding. The mean value doesn't depend on the method used to compute it.
Then the code plots a histogram for the dataset. Because the distribution of this dataset tends towards normal, the mean should be near the center of the histogram.
#Create a dataset
data = normalRandomGenerator(numberSamples=512,lowLim=100,highLim=200)
#Approach #1, use a for loop
total = 0
for val in data:
    total += val
print("mean from approach #1 =",total/len(data))
#Approach #2, call the built-in sum function
print("mean from approach #2 =",sum(data)/len(data))
#Approach #3, call the built-in mean function
print("mean from approach #3 =",mean(data))
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
plt.show()
The mathematical equation for the standard deviation of a sample is:
${\displaystyle s={\sqrt {\frac {\sum _{i=1}^{N}(x_{i}-{\overline {x}})^{2}}{N-1}}}}$
In other words, this is the equation that you should use to compute the standard deviation of a sample that will be used to estimate the standard deviation of the population from which the sample was taken.
The mathematical equation for the standard deviation of a population is:
${\displaystyle S={\sqrt {\frac {\sum _{i=1}^{N}(x_{i}-{\overline {x}})^{2}}{N}}}}$
The Python function named statistics.stdev computes and returns the standard deviation for a sample.
The Python function named statistics.pstdev computes and returns the standard deviation for a population.
The difference between these two equations is known as Bessel's correction, and is important mainly for samples with a small number of values. As the number of values increases, the two expressions tend toward the same limit. For datasets with 1000 values or more, the difference between the two will typically appear in the third or fourth decimal place. Therefore, most of the example code in these tutorials calls statistics.stdev without regard for whether the dataset is a population or a sample.
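The following sketch illustrates that claim numerically: for a small dataset, stdev and pstdev differ noticeably, while for a large dataset the difference shrinks toward zero. The sample sizes chosen here are our own for illustration.

```python
import random
from statistics import stdev, pstdev

random.seed(1)
small = [random.uniform(0, 100) for _ in range(5)]
large = [random.uniform(0, 100) for _ in range(10000)]
#With only 5 values, dividing by n-1 instead of n matters.
print("small: stdev - pstdev =", stdev(small) - pstdev(small))
#With 10000 values, the two results agree to several decimal places.
print("large: stdev - pstdev =", stdev(large) - pstdev(large))
```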
Computation of the standard deviation is more involved than computation of the mean. In fact, computation of the mean is part of the computation of the standard deviation. The following code walks through the computation one step at a time and displays the final results. The computations are performed in-place requiring only one list object.
The code also displays a histogram of an intermediate version of the data after the mean has been removed.
#Create a dataset with 10000 values and compute the mean
data = normalRandomGenerator(numberSamples=512,lowLim=100,highLim=200)
xBar = mean(data)
#Get the standard deviation for a sample the easy way.
print("sample standard deviation the easy way = ",stdev(data))
#Get the standard deviation for a population the easy way.
print("population standard deviation the easy way =",pstdev(data))
#Subtract the mean from every value in the dataset.
for cnt in range(len(data)):
    data[cnt] -= xBar
#Display mean of modified data. Note that the mean is now very
#close to zero. (Exact zero is not expected because of
#floating-point rounding, so the result is rounded for display.)
print("mean of modified data =",round(mean(data),6))
#Plot a histogram of the modified data but don't show it yet
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
#Square each value of the modified data.
for cnt in range(len(data)):
    data[cnt] = data[cnt]**2
#Compute the mean of the squared data
meanOfSquaredData = mean(data)
#Compute the square root
sigma = math.sqrt(meanOfSquaredData)
print("population standard deviation the hard way =",sigma)
#Now show the histogram
plt.show()
The mode is the value that occurs most often in a dataset. The following code calls the statistics.mode method to find the mode of a dataset.
The mode method will throw a StatisticsError if there is no unique mode. Therefore, the StatisticsError must be handled or the program will crash. (Note: beginning with Python 3.8, statistics.mode returns the first mode encountered instead of raising this error, but the try/except shown below remains safe in either case.)
The following code displays proper behavior for two cases: one with a unique mode and one without a unique mode. The code also displays a histogram of both datasets.
print("Create a dataset that has a unique mode")
data = [1,2,3,4,4]
#Plot a histogram of the data with a unique mode
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
print("Attempt to get and display the mode. Will throw")
print("a StatisticsError if there is no unique mode.")
try:
    print("mode =",mode(data))
except StatisticsError:
    print("mode = no unique mode")
print("Dismiss the plot to continue")
plt.show()
print()
print("Create a dataset that has no unique mode")
data = [1,2,2,3,4,4]
#Plot a histogram of the data with no unique mode
fig,ax = plt.subplots(1,1)
ax.hist(data,bins=100)
print("Attempt to get and display the mode. Will throw")
print("a StatisticsError if there is no unique mode.")
try:
    print("mode =",mode(data))
except StatisticsError:
    print("There is no unique mode, StatisticsError is handled.")
plt.show()
Definition: In statistics, sampling error is the error caused by observing a sample instead of the whole population. The sampling error is the difference between a sample statistic used to estimate a population parameter and the actual but unknown value of the parameter. An estimate of a quantity of interest, such as an average or percentage, will generally be subject to sample-to-sample variation. These variations in the possible sample values of a statistic can theoretically be expressed as sampling errors, although in practice the exact sampling error is typically unknown. Sampling error also refers more broadly to this phenomenon of random sampling variation.
The following code creates a reproducible population of 49,000 uniformly distributed values and computes the mean (mu) of the population.
A sample consisting of 1000 values is taken from the population. The mean of the sample (xBar) is used to estimate the mu of the population. The sampling error is computed as the difference between the two.
A histogram is also displayed for the population and for the sample.
#Access the entire population
population = normalRandomGenerator(
    seed=1,dataLength=49000,numberSamples=1,lowLim=0,highLim=100)
print("Plot a histogram of the population")
print("Dismiss the plot to continue")
fig,ax = plt.subplots(1,1)
ax.hist(population,bins=100,density=True)
plt.show()
#Note that it is possible to compute the mu for the population and the sampling error later
#only because we have access to the entire population in this case.
mu = mean(population)
print("mu =",mu)
#Get a sample from the population
sample = normalRandomGenerator(
    seed=1,dataLength=1000,numberSamples=1,lowLim=0,highLim=100)
#Plot a histogram of the sample
fig,ax = plt.subplots(1,1)
ax.hist(sample,bins=100,density=True)
xBar = mean(sample)
print("xBar =",xBar)
print()
print("If xBar is used to estimate mu, the sampling error \n\
is the difference between the two.")
print()
print("sampling error = mu - xBar =",mu - xBar)
plt.show()
As above, the following code creates the same reproducible population of 49,000 uniformly distributed values and computes the mean (mu) of the population.
A sample consisting of 10,000 values (as opposed to 1000 values above) is taken from the population. The mean of the sample (xBar) is used to estimate the mu of the population. The sampling error is computed as the difference between the two. A factor of ten increase in the sample size resulted in a reduction of the sampling error by a factor of 6.5.
A histogram is also displayed for the population and for the sample.
#Access the entire population
population = normalRandomGenerator(
    seed=1,dataLength=49000,numberSamples=1,lowLim=0,highLim=100)
print("Plot a histogram of the population")
print("Dismiss the plot to continue")
fig,ax = plt.subplots(1,1)
ax.hist(population,bins=100,density=True)
plt.show()
#Note that it is possible to compute the mu for the population and the sampling error later
#only because we have access to the entire population in this case.
mu = mean(population)
print("mu =",mu)
#Get a sample from the population
sample = normalRandomGenerator(
    seed=1,dataLength=10000,numberSamples=1,lowLim=0,highLim=100)
#Plot a histogram of the sample
fig,ax = plt.subplots(1,1)
ax.hist(sample,bins=100,density=True)
xBar = mean(sample)
print("xBar =",xBar)
print()
print("If xBar is used to estimate mu, the sampling error \n\
is the difference between the two.")
print()
print("sampling error = mu - xBar =",mu - xBar)
plt.show()
The following code creates two sets of random data values, extracts the values in pairs and plots them on a scatter plot. Then it draws a best fit line through the plot.
There appears to be no obvious relationship between the two sets of data, and that is as it should be, because the two sets of data were generated independently.
xData = normalRandomGenerator(
    seed=1,dataLength=100,numberSamples=1,lowLim=0,highLim=100)
yData = normalRandomGenerator(
    seed=2,dataLength=100,numberSamples=1,lowLim=0,highLim=10)
fig,(ax1) = plt.subplots(1,1)
ax1.scatter(xData,yData,c='red',s=50,edgecolor='blue')
#Compute and draw the best fit line
m,b =lmsBestFitLine(xData,yData)
#Draw a best fit line
x1 = min(xData)
y1 = m*x1 + b
x2 = max(xData)
y2 = m*x2 + b
ax1.plot([x1,x2], [y1,y2])
plt.show()
The following code creates the same two sets of random data values, sorts each set, extracts the values in pairs and plots them on a scatter plot. Then it draws a best fit line through the plot.
There appears to be a causal relationship between the two sets of data but it would be incorrect to draw that conclusion. The two sets of data continue to be unrelated. The appearance of a causal relationship resulted simply from sorting the data. Correlation does not imply causation.
xData = normalRandomGenerator(
    seed=1,dataLength=100,numberSamples=1,lowLim=0,highLim=100)
yData = normalRandomGenerator(
    seed=2,dataLength=100,numberSamples=1,lowLim=0,highLim=10)
xData.sort()
yData.sort()
fig,(ax1) = plt.subplots(1,1)
ax1.scatter(xData,yData,c='red',s=50,edgecolor='blue')
#Compute and draw the best fit line
m,b =lmsBestFitLine(xData,yData)
#Draw a best fit line
x1 = min(xData)
y1 = m*x1 + b
x2 = max(xData)
y2 = m*x2 + b
ax1.plot([x1,x2], [y1,y2])
plt.show()
The following code creates the same two sets of random data values, but the two sets of data are used differently than above.
As above, one set of random data is used as the x values for a scatter plot. Each y value for the scatter plot was created as one-tenth the value of the corresponding x value plus a noise value from the second set of random data values. After plotting the x and y values on the scatter plot, the code draws a best fit line through the plot.
There appears to be a causal relationship between the two sets of data. In this case, it would be correct to draw that conclusion. The two sets of data are definitely related, because each yData value was derived directly from the corresponding xData value.
The bottom line is that an apparent correlation between two datasets on a scatter plot may or may not indicate causation. You need to know a lot about the origins of the data before drawing conclusions on causation. Otherwise, lurking variables may cause you to draw invalid conclusions. Correlation does not imply causation.
xData = normalRandomGenerator(
    seed=1,dataLength=100,numberSamples=1,lowLim=0,highLim=100)
noise = normalRandomGenerator(
    seed=2,dataLength=100,numberSamples=1,lowLim=0,highLim=10)
yData = []
for cnt in range(len(xData)):
    yData.append(xData[cnt]/10 + noise[cnt])
fig,(ax1) = plt.subplots(1,1)
ax1.scatter(xData,yData,c='red',s=50,edgecolor='blue')
#Compute and draw the best fit line
m,b =lmsBestFitLine(xData,yData)
#Draw a best fit line
x1 = min(xData)
y1 = m*x1 + b
x2 = max(xData)
y2 = m*x2 + b
ax1.plot([x1,x2], [y1,y2])
plt.show()
Hopefully by now you are well into your recommended Udacity course titled Intro to Descriptive Statistics. If so, you will recall that Udacity provides age data for a sample of 50 students from the entire population of Udacity students.
One of the videos shows the ages of the students grouped as follows:
Age: Frequency
20: 19
40: 21
60: 5
80: 5
The following code displays a bar graph showing Frequency versus Age for the four groups listed above.
fig,ax = plt.subplots(1,1)
ax.bar([20,40,60,80],[19,21,5,5],width=19,tick_label=[20,40,60,80])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Students enrolled in the course can also download the raw age data sample of 50 values. The following code shows a non-normalized histogram of the data with four bins. Note the similarity of this histogram to the bar graph shown above.
sample =[15,19,18,14,13,27,16,65,15,31,22,15,24,22,51,24,20,45,22,33,24,27,18,66,15,18,\
39,10,30,13,19,28,53,28,65,30,20,21,20,18,20,23,18,41,52,75,19,63,14,18]
fig,ax = plt.subplots(1,1)
ax.hist(sample,bins=4,range=(0,80))
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
The following code shows a normalized histogram for the sample with 20 bins showing more detail than the histogram above. The distribution is positively skewed: most of the ages are concentrated at the low end, with a long tail extending toward higher ages. The code also displays the median, mean, standard deviation, and mode for the sample.
sample =[15,19,18,14,13,27,16,65,15,31,22,15,24,22,51,24,20,45,22,33,24,27,18,66,15,18,\
39,10,30,13,19,28,53,28,65,30,20,21,20,18,20,23,18,41,52,75,19,63,14,18]
print("sample median =",median(sample))
print("sample mean =",mean(sample))
print("sample standard deviation =",stdev(sample))
print("sample mode =",mode(sample))
fig,ax = plt.subplots(1,1)
ax.hist(sample,bins=20,range=(0,80),density=True)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
One of the videos shows the Udacity students grouped by continent as follows:
Continent: Frequency
Europe: 6
North America: 15
Asia: 29
The following code displays a bar graph showing Frequency versus continent for the three groups listed above.
fig,ax = plt.subplots(1,1)
ax.bar([10,20,30],[6,15,29],width=7,tick_label=['Europe','North America','Asia'])
plt.xlabel('Continent')
plt.ylabel('Frequency')
plt.show()
--To be continued in the tutorial titled Statistics Exercises Part 2--
Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX
File: StatisticsPart01.ipynb
Revised: 04/21/18
Copyright 2018 Richard G. Baldwin
-end-