Visualization Exercises Part 4

This is the fourth of several exercise pages on visualization using Matplotlib that were prepared for use in the course ITSE 1302 at Austin Community College.

The remainder of this exercise page will deal with scatter plots.

Import required libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt

Utility methods

Least mean square best fit line

The following function computes a least mean square (ordinary least squares) best fit line for a scatter plot and returns its slope and y-intercept.

In [2]:
def lmsBestFitLine(xDataIn,yDataIn):
    xBar = sum(xDataIn)/len(xDataIn)
    yBar = sum(yDataIn)/len(yDataIn)

    numerator = 0
    for cnt in range(len(xDataIn)):
        numerator += (xDataIn[cnt] - xBar) * (yDataIn[cnt] - yBar)

    denominator = 0
    for cnt in range(len(xDataIn)):
        denominator += (xDataIn[cnt] -xBar)**2

    m = numerator/denominator

    b = yBar - m*xBar
    
    return (m,b)#slope,y-intercept
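
As a quick sanity check, the result of this formula can be compared with NumPy's np.polyfit, which fits a degree-1 polynomial by least squares. The following is only a sketch: the sample data and the compact restatement of the formula (named best_fit here so that the block runs on its own) are assumptions made for illustration and are not part of the original exercise.

```python
import numpy as np

def best_fit(x, y):
    # Same closed-form least-squares formula as lmsBestFitLine,
    # restated with NumPy so this block is self-contained.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - m * x_bar
    return m, b

# Made-up sample data: a line with slope 2 and intercept 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 150)
y = 2.0 * x + 1.0 + rng.normal(0, 5, 150)

m, b = best_fit(x, y)
m_np, b_np = np.polyfit(x, y, 1)  # coefficients, highest power first

print("closed form:", m, b)
print("np.polyfit :", m_np, b_np)
```

The two pairs of numbers should agree to within floating-point rounding.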

Basic scatter plot

Scatter plots are similar to line plots in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose. Scatter plots show how strongly one variable is related to another. The relationship between two variables is called their correlation.

While that is the most common use of scatter plots, the Axes.scatter method takes many arguments and can be used to create visualizations with other purposes as well. Some of those purposes will be discussed on this exercise page.

The following code illustrates a basic scatter plot for three cases:

  • Totally correlated
  • Partially correlated
  • Totally uncorrelated

The scatter plot produced by this code shows the default marker in the default size with the default color.

In [3]:
fig,(ax0,ax1,ax2) = plt.subplots(3,1)

x = np.random.uniform(0,100,150)
#====================

#Totally correlated data.
# The two datasets are identical.
yFirst = [x[cnt] for cnt in range(len(x))] 
ax0.scatter(x,yFirst)
#Compute the best fit line
m,b =lmsBestFitLine(x,yFirst)

#Draw a best fit line on the scatter plot
x1 = min(x)
y1 = m*x1 + b
x2 = max(x)
y2 = m*x2 + b
ax0.plot([x1,x2], [y1,y2])
#====================

#Partially correlated data.
# ySecond is equal to x plus random noise.
ySecond = [(x[cnt] + np.random.uniform(0,100)) 
           for cnt in range(len(x))] 
ax1.scatter(x,ySecond)
#Compute the best fit line
m,b =lmsBestFitLine(x,ySecond)

#Draw a best fit line on the scatter plot
x1 = min(x)
y1 = m*x1 + b
x2 = max(x)
y2 = m*x2 + b
ax1.plot([x1,x2], [y1,y2])
#====================

#Totally uncorrelated data.
# x and yThird are uncorrelated datasets
# of random values.
yThird = np.random.uniform(0,100,150)
ax2.scatter(x,yThird)
#Compute the best fit line
m,b =lmsBestFitLine(x,yThird)

#Draw a best fit line on the scatter plot
x1 = min(x)
y1 = m*x1 + b
x2 = max(x)
y2 = m*x2 + b
ax2.plot([x1,x2], [y1,y2])

#====================
plt.tight_layout()
plt.show()
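
To put a number on each of the three cases, one option is the Pearson correlation coefficient, which NumPy exposes through np.corrcoef. The sketch below rebuilds the three datasets with a seeded generator so the block stands alone; the helper name pearson_r and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 150)

y_first = x.copy()                        # identical to x
y_second = x + rng.uniform(0, 100, 150)   # x plus random noise
y_third = rng.uniform(0, 100, 150)        # unrelated to x

def pearson_r(a, b):
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
    return np.corrcoef(a, b)[0, 1]

r_first = pearson_r(x, y_first)
r_second = pearson_r(x, y_second)
r_third = pearson_r(x, y_third)

print("totally correlated:  ", round(r_first, 3))
print("partially correlated:", round(r_second, 3))
print("uncorrelated:        ", round(r_third, 3))
```

For data built this way, the coefficient is 1 for the first case, roughly 0.7 for the second, and close to 0 for the third.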

Markers and marker size

The Axes.scatter method takes several keyword arguments. One of the keyword arguments is named marker. This argument specifies the visual symbol or marker that will be drawn at a location determined by given values for x and y. Many markers are available; they are listed in the matplotlib.markers documentation.
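
The available markers can also be inspected from code. The sketch below relies on the MarkerStyle.markers dictionary in matplotlib.markers, which maps each marker code to a short description; the exact contents depend on the installed Matplotlib version.

```python
from matplotlib.markers import MarkerStyle

# MarkerStyle.markers maps each marker code to a short description.
# Some keys are integers or None, so repr() is used when printing.
for code, description in MarkerStyle.markers.items():
    print(repr(code), "->", description)
```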

Another keyword parameter is named s, which stands for marker size. This is an area measurement specified in points squared. In other words, an s-value of 225 represents a marker with horizontal and vertical dimensions of 15 points each, because 15 squared is 225. In case you are unfamiliar with the term points, this is a term used in typography. It is generally recognized that there are 72 points per inch.
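
The arithmetic relating s to the nominal width of a square marker can be checked with a few lines of Python. This is a sketch only; the helper name side_from_area is hypothetical, and the width actually drawn can differ slightly because of the marker's edge line.

```python
import math

POINTS_PER_INCH = 72

def side_from_area(s):
    # s is a marker area in points squared; return the side length
    # in points of a square with that area.
    return math.sqrt(s)

for s in (225, 625, 1600):
    side = side_from_area(s)
    inches = side / POINTS_PER_INCH
    print(f"s={s}: {side:g} points per side = {inches:.3f} inch")
```

Going the other way, a marker that should be 25 points on a side needs s=625.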

The following code draws the first 25 markers listed in the matplotlib.markers documentation in default colors with a size of 225.

The code also draws a square marker, which seems to be one of the larger markers, next to 25-point text to give you some idea what it means to specify the size in terms of points squared. With s=225 the square is nominally 15 points on a side, and the area occupied by this marker appears to be about the same as the area occupied by the lower-case x including the space between characters.

It is worth noting that I see slightly different results between running this code inside Jupyter notebook and running it outside of the notebook under the Python interpreter. It seems that the text method behaves slightly differently in those two cases insofar as positioning the text is concerned.

In [4]:
fig,ax = plt.subplots(1,1)

markers = [ '.', ',','o','v','^','<','>','1','2',
           '3','4','8','s','p','P','*','h','H',
           '+','x','X','D','d','|','_']
print("number markers =",len(markers))

#Draw all the markers
x = np.linspace(1,10,len(markers)+1)
for cnt in range(len(markers)):
    ax.scatter(x[cnt],x[cnt],s=225,marker=markers[cnt])

#Draw the square marker next to 25-point text.
ax.scatter(1.7,8.3,s=225,marker='s')
ax.text(1.5,8,'xx=25-point text',fontsize=25)
    
plt.tight_layout()
plt.show()
number markers = 25

Color

One of the keyword arguments to the Axes.scatter method is c, which stands for color. This argument specifies the facecolor of the marker.

Another of the keyword arguments is edgecolors, which specifies the color, if any, that is applied to the edge of the markers.

Regarding the facecolor, here is a paraphrased quotation from the documentation:

The keyword argument c can be:

  • a single color format string, or
  • a sequence of color specifications of length N, or
  • a sequence of N numbers to be mapped to colors using the cmap ...

These three options will be illustrated in the sections that follow.

Single color format string

The following code illustrates the use of a single color format string to specify the color of the markers.

This code also sets the edge color on the markers to green.

In [5]:
fig,ax0 = plt.subplots(1,1)

x = np.linspace(0,10,10)
y = x
ax0.scatter(x,y,s=40**2,marker='s',c='Red',edgecolors='Green')

plt.tight_layout()
plt.show()

Sequence of color specifications

The following code illustrates the use of a sequence of color specifications to specify the color of the markers.

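Note that the colors are applied to the markers in order, starting over at the beginning of the sequence when it runs out. Older Matplotlib releases performed this cycling automatically when the sequence was shorter than the number of markers. Recent releases instead raise a ValueError unless the sequence supplies exactly one color per marker, so the following code repeats the three colors explicitly.

In [6]:
fig,ax0 = plt.subplots(1,1)

x = np.linspace(0,10,10)
y = x
#Repeat the three colors cyclically so there is one color per marker.
baseColors = ['Red','Green','Blue']
colors = [baseColors[cnt % len(baseColors)] for cnt in range(len(x))]
ax0.scatter(x,y,s=40**2,marker='s',c=colors)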

plt.tight_layout()
plt.show()

Sequence of numbers using cmap

A colormap is a set of colors taken as a group. Matplotlib provides several different colormaps, most of which progress smoothly from one color to another color with other colors in between. However, some progress from one color to another color in steps or blocks.

If you specify colors using a sequence of numbers with a colormap, the low value will be associated with one end of the colormap and the high value will be associated with the other end of the colormap. Values in between will be associated linearly with the colors between the two ends of the colormap.

Each colormap has a name. The colors in the map can be reversed by appending _r to the name of the colormap.

The following code illustrates the specification of colors using a sequence of numbers and a reversed version of the colormap named plasma.

In [7]:
fig,ax0 = plt.subplots(1,1)

x = np.linspace(0,10,10)
y = x
colors = [10,9,8,7,6,5,4,3,2,1]
ax0.scatter(x,y,s=1600,marker='s',c=colors,cmap='plasma_r')

plt.tight_layout()
plt.show()
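
The linear mapping described above can be reproduced by hand, which may help when checking which color a given number will receive. The sketch below assumes the default behavior (no norm, vmin, or vmax passed to scatter), in which the data minimum and maximum are mapped to the two ends of the colormap; it uses Normalize from matplotlib.colors and plt.get_cmap.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

values = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# Linearly rescale the values so the lowest maps to 0.0
# and the highest maps to 1.0.
norm = Normalize(vmin=values.min(), vmax=values.max())
fractions = norm(values)

# Look up one RGBA color per value in the reversed plasma colormap.
cmap_r = plt.get_cmap('plasma_r')
rgba = cmap_r(fractions)

for value, fraction, color in zip(values, fractions, rgba):
    print(value, round(float(fraction), 3), np.round(color, 3))

# Reversal check: the low end of plasma_r is the high end of plasma.
print(np.allclose(cmap_r(0.0), plt.get_cmap('plasma')(1.0)))
```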

Colorbar

A colorbar is a legend of sorts that shows the color associated with color values in an image.

The code required to add a colorbar differs depending on whether you are working with pyplot objects or axes objects.

A colorbar with a pyplot object

The following code shows how to add a colorbar to a pyplot object. By default, the colorbar appears on the right and is the same height as the plot to which it refers. Note that in this case, the colorbar function does not require any arguments. By default, it applies to the current image, which is the scatter plot in this code.

In [8]:
x = np.linspace(0,10,10)
y = x
colors = [1,2,3,4,5,6,7,8,9,10]
plt.scatter(x,y,s=1600,marker='s',c=colors,cmap='plasma')

plt.colorbar();  # add the colorbar
plt.tight_layout()
plt.show()

A colorbar with an axes object

To add a colorbar to an axes object, one approach is to create another axes object and position it next to the plot to which it refers. The following code calls the add_axes method to add the new axes to the figure. The arguments specify the location and dimensions of the new axes object as left, bottom, width, and height where all quantities are in fractions of the figure width and height.

Then you need to draw the colorbar in the new axes. In this case, the colorbar method requires two arguments. The first argument specifies the image to which the colorbar applies. The image is the scatter plot in this code. The second argument is the axes object into which the colorbar will be drawn. That is the new axes object in this code.

In [9]:
fig,ax0 = plt.subplots(1,1)
fig.set_facecolor('0.75')

x = np.linspace(0,10,10)
y = x
colors = [1,2,3,4,5,6,7,8,9,10]

#Create and save a reference to the scatter plot.
im = ax0.scatter(x,y,s=1600,marker='s',c=colors,cmap='plasma')

#Add new axes to the figure.
ax1 = fig.add_axes([0.91, 0.15, 0.03, 0.70])

#Draw a colorbar in the new axes.
fig.colorbar(im, cax=ax1)

plt.show()
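
If exact placement of the colorbar is not important, a shorter alternative is to pass the plot's axes through the ax keyword of Figure.colorbar and let Matplotlib make room for the colorbar itself. The following is a sketch of that approach; the function name draw_with_auto_colorbar and the label text are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def draw_with_auto_colorbar():
    fig, ax0 = plt.subplots(1, 1)
    x = np.linspace(0, 10, 10)
    colors = np.arange(1, 11)
    im = ax0.scatter(x, x, s=1600, marker='s', c=colors, cmap='plasma')
    # ax=ax0 asks Matplotlib to take space from ax0 for the colorbar,
    # so no explicit add_axes call is needed.
    cbar = fig.colorbar(im, ax=ax0)
    cbar.set_label('color value')
    return fig, ax0, cbar

fig, ax0, cbar = draw_with_auto_colorbar()
# Call plt.show() here to display the figure.
```

The trade-off is control: add_axes lets you choose the exact position and size of the colorbar, while the ax keyword leaves the layout to Matplotlib.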

Size

The size argument can either be a scalar or an array_like object. This includes lists as well as numpy arrays.

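Until now, all of the example code has used a scalar to specify the size. The following code specifies the size using a list.

As is the case with color, older Matplotlib releases cycled through a size sequence that was shorter than the number of markers, while recent releases raise a ValueError unless s is a scalar or has one value per marker. The following code therefore repeats the four sizes explicitly.

In [10]:
fig,ax0 = plt.subplots(1,1)

x = np.linspace(0,10,10)
y = x
#Repeat the four sizes cyclically so there is one size per marker.
baseSizes = [5**2,10**2,20**2,40**2]
sizes = [baseSizes[cnt % len(baseSizes)] for cnt in range(len(x))]
ax0.scatter(x,y,s=sizes,marker='s',c='Red')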

plt.tight_layout()
plt.show()

Using x, y, size, and color to convey meaning

The arguments x, y, size, and color can all be array-like objects. Therefore, if they are the same length, each of the arguments can be used to convey meaning in a scatter plot. Let's illustrate this with an example.

A simplified example

This example is simplified to make it easy to understand and interpret.

Assume that an agriculture research project divides a field into a grid of 16 plots. Each plot is assigned an x coordinate and a y coordinate. The coordinates of the plots are reflected as x and y in the following code.

Assume that during the growing season, careful records are kept of the amount of water applied to each plot. This is reflected in the water list in the following code.

At the end of the growing season, the yield of each plot is carefully measured. This is reflected in the cropYield list in the following code.

A scatter plot is created where cropYield is assigned to color and water is assigned to marker size.

The output from the code draws a square marker to represent the x-y coordinates of each plot.

The change in color propagates down and to the right from the upper-left corner, indicating maximum yield for the plot in the upper-left corner and minimum yield for the plots along the bottom row and the right column.

The change in marker size propagates downward from the top row, indicating that the plots in the top row received the most water while the plots in the bottom row received the least.

Therefore, this scatter plot shows the relationships among four variables: x-coordinate, y-coordinate, crop yield, and water applied.

In [11]:
fig,ax0 = plt.subplots(1,1)
fig.set_facecolor('0.75')

x = [1,1,1,1,
     2,2,2,2,
     3,3,3,3,
     4,4,4,4]

y = [4,3,2,1,
     4,3,2,1,
     4,3,2,1,
     4,3,2,1]

cropYield = [4,3,2,1,
             3,3,2,1,
             2,2,2,1,
             1,1,1,1]

water = [2400,1800,1200,600,
         2400,1800,1200,600,
         2400,1800,1200,600,
         2400,1800,1200,600]


im = ax0.scatter(x, y, c=cropYield, s=water, cmap='plasma',marker='s')

#Add new axes to the figure.
ax1 = fig.add_axes([0.95, 0.15, 0.03, 0.70])

#Draw a colorbar in the new axes.
fig.colorbar(im, cax=ax1)

#plt.colorbar();  # show color scale

plt.show()
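
In the example above, the water values happen to fall in a range that works directly as marker areas. For data in other units, one possible approach is to rescale the values linearly into a chosen range of areas before passing them as s. The sketch below uses np.interp for that; the function name scale_to_area, the area limits, and the rainfall figures are all hypothetical.

```python
import numpy as np

def scale_to_area(values, min_area=100.0, max_area=2400.0):
    # Map the smallest value to min_area and the largest to max_area,
    # with everything in between placed linearly.
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if lo == hi:
        # All values are equal, so use a single mid-range area.
        return np.full(values.shape, (min_area + max_area) / 2)
    return np.interp(values, (lo, hi), (min_area, max_area))

# Made-up measurements in millimeters.
rainfall_mm = [12.0, 30.5, 48.0, 75.0]
areas = scale_to_area(rainfall_mm)
print(areas)
```

The resulting array can be passed as the s argument of scatter. Because s is an area, a value twice as large produces a marker with twice the area, not twice the width.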

A slightly more realistic example

The previous example was simplified to make it easy to understand and interpret. The following code illustrates a slightly more realistic example. In this example, crop yield and water applied are random with respect to the x-y coordinates of the plots.

This code will produce different results each time it is run, but color and size can still be interpreted to represent crop yield and water applied respectively.

In [12]:
fig,ax0 = plt.subplots(1,1)
fig.set_facecolor('0.75')

x = [1,1,1,1,
     2,2,2,2,
     3,3,3,3,
     4,4,4,4]

y = [4,3,2,1,
     4,3,2,1,
     4,3,2,1,
     4,3,2,1]

cropYield  = np.random.uniform(0,100,16)

water = np.random.uniform(600,2400,16)

im = ax0.scatter(x, y, c=cropYield, s=water, cmap='plasma',marker='s')

#Add new axes to the figure.
ax1 = fig.add_axes([0.95, 0.15, 0.03, 0.70])

#Draw a colorbar in the new axes.
fig.colorbar(im, cax=ax1)

plt.show()

--To be continued in the tutorial titled Visualization Exercises Part 5--

Author: Prof. Richard G. Baldwin, Austin Community College, Austin, TX

File: VisualizationPart04.ipynb

Revised: 04/22/18

Copyright 2018 Richard G. Baldwin

-end-