This notebook continues the introduction to the Pandas Series object and concentrates on missing data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Enable inline plotting
%matplotlib inline
A Pandas Series object (and, as we will see later, a Pandas DataFrame object) can have data missing from one or more cells. When this happens, the cell is flagged with NaN, which technically stands for "not a number." In the case of Series and DataFrame objects, NaN means that the data is missing from the cell. We can illustrate this by creating two Pandas Series objects, one with missing data and one without.
We will begin by creating a Series object with no missing data. We will create the object using a Python dictionary for the values and a Python list for the index as shown below. In this case, the index values match the keys in the dictionary, resulting in a Series object with no missing data.
fruits = {'apple':10,'peach':20,'cherry':30,'plum':40,'orange':50}
fruitIndex = ['apple','peach','cherry','plum','orange']
fruitSeries = pd.Series(fruits,fruitIndex)
print(fruitSeries)
Next, we will create a Series object with data missing from one of the cells and three values from the dictionary missing altogether. In this case, we will use an index whose labels don't all match the keys in the dictionary. As you can see in the output shown below, the object is missing data in the cell referenced by the index label 'pear' and omits the 'peach', 'plum', and 'orange' values from the dictionary. In other words, when there is a mismatch, the Series object is defined by the index and not by the original key/value pairs in the dictionary.
fruits = {'apple':10,'peach':20,'cherry':30,'plum':40,'orange':50}
fruitIndex = ['apple','pear','cherry']
fruitSeries = pd.Series(fruits,fruitIndex)
print(fruitSeries)
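A side effect of the mismatch is worth a quick look. The following is a minimal sketch, assuming a default pandas installation; the names full and partial are illustrative only. Because NaN is a floating-point value, the Series object that contains it holds floats, while the Series object with no missing data keeps its integers.

```python
import pandas as pd

fruits = {'apple':10,'peach':20,'cherry':30,'plum':40,'orange':50}

# Every index label matches a dictionary key: no missing data
full = pd.Series(fruits, index=['apple','peach','cherry','plum','orange'])

# 'pear' is not a dictionary key, so its cell is NaN
partial = pd.Series(fruits, index=['apple','pear','cherry'])

print(full.dtype)           # an integer dtype
print(partial.dtype)        # a float dtype, because NaN is a float
print(list(partial.index))  # the index, not the dictionary, defines the labels
```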
Often, a Series object or a DataFrame object is too large to print and inspect visually as we did above. The methods isnull() and notnull() can be used to test for missing data in a Series object or DataFrame object as shown below.
fruits = {'apple':10,'peach':20,'cherry':30,'plum':40,'orange':50}
fruitIndex = ['apple','pear','cherry']
fruitSeries = pd.Series(fruits,fruitIndex)
print(fruitSeries.isnull())
print('-----')
print(fruitSeries.notnull())
print('-----')
print('Number of null (or empty) cells =',fruitSeries.isnull().sum())
print('Number of non-null cells =',fruitSeries.notnull().sum())
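Counting the empty cells tells us how many there are but not where they are. As a hedged sketch (the names mask and missingLabels are illustrative), the boolean result of isnull() can be used to select the empty cells and read their index labels.

```python
import pandas as pd

fruits = {'apple':10,'peach':20,'cherry':30,'plum':40,'orange':50}
fruitSeries = pd.Series(fruits, index=['apple','pear','cherry'])

# Boolean mask: True where the cell is missing
mask = fruitSeries.isnull()

# Use the mask to select only the missing cells, then read their labels
missingLabels = fruitSeries[mask].index.tolist()
print('Labels with missing data =', missingLabels)

# any() answers the yes/no question without counting
print('Any missing data?', mask.any())
```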
A value of None in the dictionary will also result in an empty cell indicated by NaN as shown below.
fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitIndex = ['apple','peach','cherry','plum','orange']
fruitSeries = pd.Series(fruits,fruitIndex)
print(fruitSeries)
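For comparison, the short sketch below (an illustration, not part of the original example set) places both None and np.nan in the dictionary. The isnull() method should report both cells as missing.

```python
import numpy as np
import pandas as pd

# 'peach' is explicitly NaN and 'cherry' is None
fruits = {'apple':10,'peach':np.nan,'cherry':None}
fruitSeries = pd.Series(fruits, index=['apple','peach','cherry'])

print(fruitSeries)
print('-----')
# Both kinds of missing value are flagged as null
print(fruitSeries.isnull())
```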
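According to the documentation, the top-level isnull() function returns "bool or array-like of bool". When isnull() is called as a method on a Series object, the result is a Series of boolean values with the same index as the original. The same is true for the notnull() method as shown below.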
fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitIndex = ['apple','peach','cherry','plum','orange']
fruitSeries = pd.Series(fruits,fruitIndex)
obj = fruitSeries.isnull()
print('Is Null?')
print(obj)
print('-----')
obj = fruitSeries.notnull()
print('Not Null?')
print(obj)
Note that the boolean values returned by the isnull() method are the opposite of the boolean values returned by the notnull() method.
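This relationship can be checked in code rather than by eye. The sketch below is one way to do it; the variable names isNull and notNull are illustrative.

```python
import pandas as pd

fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitSeries = pd.Series(fruits, index=['apple','peach','cherry','plum','orange'])

isNull = fruitSeries.isnull()
notNull = fruitSeries.notnull()

# The ~ operator inverts a boolean Series element by element
print((~isNull).equals(notNull))

# Every cell is either null or not null, so the two counts add up to the length
print(isNull.sum() + notNull.sum() == len(fruitSeries))
```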
The dropna() method can be used to filter cells with missing data out of a Series object. The method returns a new Series object containing only non-null data as shown below.
fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitIndex = ['apple','pear','cherry','plum','orange']
fruitSeries = pd.Series(fruits,fruitIndex)
print(fruitSeries)
print('-----')
cleanSeries = fruitSeries.dropna()
print(cleanSeries)
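Two points about dropna() can be confirmed with a short sketch (the name sameResult is illustrative): by default the original Series object is left unchanged, and the result matches what we would get by selecting cells with the notnull() mask.

```python
import pandas as pd

fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitSeries = pd.Series(fruits, index=['apple','pear','cherry','plum','orange'])

cleanSeries = fruitSeries.dropna()

# The original still has all five cells; the cleaned copy has fewer
print(len(fruitSeries), len(cleanSeries))
print(list(cleanSeries.index))

# Selecting with the notnull() mask gives the same result
sameResult = fruitSeries[fruitSeries.notnull()]
print(cleanSeries.equals(sameResult))
```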
In some, but not all, cases, the fillna() method can be used to repair a Series object that has missing data.
One approach is to use the fillna() method to insert an arbitrary constant value into every cell that has a missing data value as shown below. While not ideal, this may be the best that can be done. In this example, the constant value 999 was inserted in each of the cells that had a missing data value.
fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitIndex = ['apple','pear','cherry','plum','orange']
fruitSeries = pd.Series(fruits,fruitIndex)
print(fruitSeries)
print('-----')
repairedSeries = fruitSeries.fillna(999)
print(repairedSeries)
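An arbitrary constant such as 999 is not the only possible fill value. As one alternative, sketched here under the assumption that an average is a sensible stand-in for this data, the mean of the non-missing cells can be used instead. The name meanValue is illustrative.

```python
import pandas as pd

fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitSeries = pd.Series(fruits, index=['apple','pear','cherry','plum','orange'])

# mean() skips NaN cells by default, so only 10, 40, and 50 are averaged
meanValue = fruitSeries.mean()

# Insert the mean into every cell that has a missing data value
repairedSeries = fruitSeries.fillna(meanValue)
print(repairedSeries)
```

Whether a mean, a constant, or some other value is appropriate depends on what the data represents.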
In some cases, a dictionary object can be used in conjunction with the fillna() method to insert the known correct values in cells with missing data as shown below.
This approach can work for those cases where the cell has missing data due simply to the omission of the data, such as in the case of the 'cherry' cell below. However, it cannot by itself repair the Series object for those cases where the missing data is the result of a mismatch between the index labels and the dictionary keys. The dictionary passed to fillna() is matched against the index labels, so its 'peach' entry finds no cell to fill, and the 'pear' cell remains NaN.
fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitIndex = ['apple','pear','cherry','plum','orange']
fruitSeries = pd.Series(fruits,fruitIndex)
print(fruitSeries)
print('-----')
repairedSeries = fruitSeries.fillna({'peach':20,'cherry':30,'plum':40,'orange':50})
print(repairedSeries)
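If we know that a mismatched label is simply a mistake, one possible repair is to correct the label first and then fill. The sketch below assumes, purely for illustration, that 'pear' was meant to be 'peach'. It uses the rename() method with a dictionary that maps the old label to the new one.

```python
import pandas as pd

fruits = {'apple':10,'peach':20,'cherry':None,'plum':40,'orange':50}
fruitSeries = pd.Series(fruits, index=['apple','pear','cherry','plum','orange'])

# Assumption for this example: 'pear' is a mistyped 'peach'
relabeled = fruitSeries.rename({'pear':'peach'})

# Now both empty cells have labels that match keys in the fill dictionary
repairedSeries = relabeled.fillna({'peach':20,'cherry':30})
print(repairedSeries)
```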
Author: Prof. Richard G. Baldwin
Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.
File: PandasSeries02.html
Revised: 08/30/18
Copyright 2018 Richard G. Baldwin