This notebook will introduce you to the Pandas library and its data structures.
According to Introduction into Pandas:
Pandas is a Python module, which is rounding up the capabilities of Numpy, Scipy and Matplotlib. The word pandas is an acronym which is derived from "Python and data analysis" and "panel data".
That tutorial goes on further to say:
Pandas is a software library written for the Python programming language. It is used for data manipulation and analysis. It provides special data structures and operations for the manipulation of numerical tables and time series.
Pandas provides two important data structures: Series and DataFrame.
According to the tutorial mentioned above:
A Series is a one-dimensional labelled array-like object. It is capable of holding any data type, e.g. integers, floats, strings, Python objects, and so on. It can be seen as a data structure with two arrays: one functioning as the index, i.e. the labels, and the other one contains the actual data.
A series can be thought of as a one-dimensional vertical (column) array with an index similar to a Python dictionary. However, unlike the keys in a dictionary, the index values for a Series are not required to be unique.
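The two points above (a Series pairs an index with data, and index labels may repeat) can be checked with a short sketch. The variable names `pets` and `matches` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# A Series whose index repeats the label 'A' (allowed, unlike dictionary keys).
pets = pd.Series(['dog', 'cat', 'cow'], index=['A', 'B', 'A'])

# The two parts of the Series: the labels and the data.
print(pets.index.tolist())   # ['A', 'B', 'A']
print(pets.tolist())         # ['dog', 'cat', 'cow']

# Selecting a repeated label returns every matching entry as a Series.
matches = pets['A']
print(matches.tolist())      # ['dog', 'cow']

# The index reports whether its labels are unique.
print(pets.index.is_unique)  # False
```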
By default, the index is a zero-based numeric index. The following code uses the Series constructor to create and display a Series object with the default numeric index.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Enable inline plotting
%matplotlib inline
print(pd.Series(['dog','cat','cow','pig']))
If the default numeric index doesn't suit your needs, you can replace it with a set of index values of your choosing. The following code does this by passing a second argument to the Series constructor.
print(pd.Series(['dog','cat','cow','pig'],['A','B','C','D']))
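Once a Series has custom labels, an entry can be looked up either by its label or by its position. The sketch below assumes the `.loc` and `.iloc` accessors, which this notebook does not otherwise use; the name `animals` is illustrative.

```python
import pandas as pd

animals = pd.Series(['dog', 'cat', 'cow', 'pig'], index=['A', 'B', 'C', 'D'])

# Look up by label.
by_label = animals.loc['C']

# Look up by zero-based position; position 2 is the third entry.
by_position = animals.iloc[2]

print(by_label, by_position)  # cow cow
```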
Let's create a Series object containing numeric data and plot the data in the object.
seriesData = [2,4,8,16,32,64,128,256,512,1024]
series = pd.Series(seriesData)
series.plot()
There are many ways to construct, manipulate, and use Series objects. That will be the topic of future notebooks.
According to this tutorial:
The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.
That tutorial goes on to say:
Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data type, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.
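The point about column types can be seen by inspecting the `dtypes` attribute of a small DataFrame. This is only a sketch: it builds the DataFrame directly from a dictionary of lists, which is an assumption on my part rather than the approach this notebook uses, and the column names are made up for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list becomes that column's data.
table = pd.DataFrame({
    'count': [1, 2, 3],
    'flag': [True, False, True],
    'name': ['dog', 'cat', 'cow'],
})

# One data type per column; the types differ from column to column.
print(table.dtypes)
```

The exact type names printed can vary with the installed pandas version, but the integer, boolean, and text columns each report a different type.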
One way to think of a DataFrame is as the concatenation of two or more Series objects having a common index. Let's look at an example that illustrates this concept.
Begin by creating and displaying a Series object pointed to by the variable named series01.
index = ['A','B','C','D']
series01 = pd.Series(['dog','cat','cow','pig'],index)
series01
Now create and display a second Series object, pointed to by the variable named series02, that uses the same index.
series02 = pd.Series(['mouse','bat','squirrel','horse'],index)
series02
Now call the pandas.concat function with axis=1 to concatenate the two Series objects side by side, and display the resulting DataFrame object.
dFrame = pd.concat([series01,series02],axis=1)
dFrame
As you can see from the output shown above, each Series object becomes a column in the DataFrame object. The common index becomes the row index for the DataFrame object. This approach results in a default zero-based numeric index for the columns.
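The structure described above can also be confirmed programmatically. The following sketch rebuilds the same data under the illustrative names `first`, `second`, and `frame`, then inspects the row index, the column index, and one column.

```python
import pandas as pd

index = ['A', 'B', 'C', 'D']
first = pd.Series(['dog', 'cat', 'cow', 'pig'], index)
second = pd.Series(['mouse', 'bat', 'squirrel', 'horse'], index)
frame = pd.concat([first, second], axis=1)

# The shared index becomes the row index.
print(frame.index.tolist())    # ['A', 'B', 'C', 'D']

# The unnamed Series receive default zero-based column labels.
print(frame.columns.tolist())  # [0, 1]

# Column 0 holds the data from the first Series.
print(frame[0].tolist())       # ['dog', 'cat', 'cow', 'pig']
```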
The following code can be used to replace the default numeric column index with an index of your choosing by setting the columns attribute to a list of column names.
dFrame.columns = ['a','b']
dFrame
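As an alternative to assigning the columns attribute afterward, the column labels can be supplied at concatenation time. The sketch below shows two options under illustrative variable names: giving each Series a name, and passing the `keys` argument to `pd.concat`.

```python
import pandas as pd

index = ['A', 'B', 'C', 'D']

# Option 1: name each Series; concat uses the names as column labels.
first = pd.Series(['dog', 'cat', 'cow', 'pig'], index, name='a')
second = pd.Series(['mouse', 'bat', 'squirrel', 'horse'], index, name='b')
named = pd.concat([first, second], axis=1)
print(named.columns.tolist())  # ['a', 'b']

# Option 2: pass the labels through the keys argument.
keyed = pd.concat([first, second], axis=1, keys=['x', 'y'])
print(keyed.columns.tolist())  # ['x', 'y']
```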
Let's create a DataFrame object containing random numeric data and then plot the data in the object.
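# Series A: 101 random integers from 0 through 24, shifted up by 25
seriesA = pd.Series(np.random.randint(0,25,101),name='A')
seriesA += 25
# Series B: 101 random integers from 0 through 24, left unshifted
seriesB = pd.Series(np.random.randint(0,25,101),name='B')
# The Series names become the column labels of the DataFrame
newFrame = pd.concat([seriesA,seriesB],axis=1)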
newFrame.head(3)
newFrame.plot()
The previous discussion showing how to create a DataFrame object by concatenating Series objects was provided to help you gain insight into the structure of a DataFrame object. In reality, you will probably use the DataFrame constructor to create your DataFrame objects instead of using the process described above. The use of the constructor will be explained in a future notebook that discusses more details about DataFrame objects.
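As a brief preview, one common form of the constructor accepts a dictionary that maps column names to Series objects. This is a minimal sketch under that assumption, not a full treatment of the constructor; the name `built` is illustrative.

```python
import pandas as pd

index = ['A', 'B', 'C', 'D']
series01 = pd.Series(['dog', 'cat', 'cow', 'pig'], index)
series02 = pd.Series(['mouse', 'bat', 'squirrel', 'horse'], index)

# Dictionary keys become column labels; the shared index becomes the row index.
built = pd.DataFrame({'a': series01, 'b': series02})

print(built.columns.tolist())  # ['a', 'b']
print(built.index.tolist())    # ['A', 'B', 'C', 'D']
print(built.loc['C', 'b'])     # squirrel
```

The result has the same rows and columns as the DataFrame produced earlier by concatenation followed by renaming the columns.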
Also, as you will see later, importing a dataset typically produces a DataFrame object that has been populated with the data in the dataset. That could very well be the primary source of DataFrame objects for the work that you will be doing.
Regardless of how it comes into existence, a DataFrame object can be constructed, manipulated, and used in a large variety of ways. That will be the topic of future notebooks.
Author: Prof. Richard G. Baldwin
Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.
File: PandasIntro01.html
Revised: 09/01/18
Copyright 2018 Richard G. Baldwin