Pandas - Introduction

This notebook will introduce you to the Pandas library and its data structures.

According to Introduction into Pandas:

Pandas is a Python module, which is rounding up the capabilities of Numpy, Scipy and Matplotlib. The word pandas is an acronym which is derived from "Python and data analysis" and "panel data".

That tutorial goes on further to say:

Pandas is a software library written for the Python programming language. It is used for data manipulation and analysis. It provides special data structures and operations for the manipulation of numerical tables and time series.

Pandas provides two important data structures:

  • Series
  • DataFrame

Data structures

Series

According to the tutorial mentioned above:

A Series is a one-dimensional labelled array-like object. It is capable of holding any data type, e.g. integers, floats, strings, Python objects, and so on. It can be seen as a data structure with two arrays: one functioning as the index, i.e. the labels, and the other one contains the actual data.

A Series can be thought of as a one-dimensional vertical (column) array whose index works much like the keys of a Python dictionary. However, unlike the keys in a dictionary, the index values for a Series are not required to be unique.
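The point about non-unique index values can be illustrated with a small sketch. The variable names pets and matches are made up for this illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two entries share the label 'A'. A dictionary would not allow this,
# but a Series index does.
pets = pd.Series(['dog', 'cat', 'cow'], index=['A', 'A', 'B'])

# Selecting a duplicated label returns every matching entry as a Series.
matches = pets.loc['A']
print(matches)

# The index can report whether all of its labels are unique.
print(pets.index.is_unique)
```

Selecting the label 'B', which occurs only once, returns the single value instead of a Series.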

By default, the index is a zero-based numeric index. The following code uses the Series constructor to create and display a Series object with the default numeric index.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Enable inline plotting
%matplotlib inline
In [2]:
print(pd.Series(['dog','cat','cow','pig']))
0    dog
1    cat
2    cow
3    pig
dtype: object

If the default numeric index doesn't suit your needs, you can replace it with a set of index values of your choosing. The following code does this by passing a list of labels as the second argument to the Series constructor.

In [3]:
print(pd.Series(['dog','cat','cow','pig'],['A','B','C','D']))
A    dog
B    cat
C    cow
D    pig
dtype: object
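As a minimal sketch of the "two arrays" view of a Series, the following code rebuilds the same Series, this time passing the labels with the index keyword, and then reads the labels and the data separately. The variable names animals, labels, and data are made up for this illustration.

```python
import pandas as pd

# Same data and labels as above, with the labels passed by keyword.
animals = pd.Series(['dog', 'cat', 'cow', 'pig'], index=['A', 'B', 'C', 'D'])

labels = list(animals.index)    # the index, i.e. the labels
data = list(animals.values)     # the actual data
print(labels)
print(data)

# A value can be looked up by its label, much like a dictionary key.
print(animals['C'])
```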

Let's create a Series object containing numeric data and plot the data in the object.

In [4]:
seriesData = [2,4,8,16,32,64,128,256,512,1024]
series = pd.Series(seriesData)
series.plot()
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0xcb03bf0>

There are many ways to construct, manipulate, and use Series objects. That will be the topic of future notebooks.

DataFrame

According to this tutorial:

The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.

That tutorial goes on to say:

Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data type, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.
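The statement that each column holds a single data type, while different columns may hold different types, can be checked with the dtypes attribute. This is only a sketch. The Series names counts and flags and the variable typed are made up for this illustration.

```python
import pandas as pd

# One column of integers and one column of boolean values.
counts = pd.Series([1, 2, 3], name='counts')
flags = pd.Series([True, False, True], name='flags')

# Place the two Series side by side as columns of a DataFrame.
typed = pd.concat([counts, flags], axis=1)

# dtypes reports the data type of each column.
print(typed.dtypes)
```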

One way to think of a DataFrame is as the concatenation of two or more Series objects having a common index. Let's look at an example that illustrates this concept.

Begin by creating and displaying a Series object pointed to by the variable named series01.

In [5]:
index = ['A','B','C','D']
series01 = pd.Series(['dog','cat','cow','pig'],index)
series01
Out[5]:
A    dog
B    cat
C    cow
D    pig
dtype: object

Now create and display a second Series object, pointed to by the variable named series02, that uses the same index.

In [6]:
series02 = pd.Series(['mouse','bat','squirrel','horse'],index)
series02
Out[6]:
A       mouse
B         bat
C    squirrel
D       horse
dtype: object

Now call the pandas.concat function to concatenate the two Series objects along the column axis (axis=1) and display the resulting DataFrame object.

In [7]:
dFrame = pd.concat([series01,series02],axis=1)
dFrame
Out[7]:
     0         1
A  dog     mouse
B  cat       bat
C  cow  squirrel
D  pig     horse

As you can see from the output shown above, each Series object becomes a column in the DataFrame object, and the common index becomes the row index. Because neither Series object was given a name, the columns receive a default zero-based numeric index.

The following code can be used to replace the default numeric column index with an index of your choosing by setting the columns attribute to a list of column names.

In [8]:
dFrame.columns = ['a','b']
dFrame
Out[8]:
     a         b
A  dog     mouse
B  cat       bat
C  cow  squirrel
D  pig     horse
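Column names can also be supplied at the time of concatenation. The following sketch assumes that you prefer that approach, and uses the keys parameter of pandas.concat for the purpose. The variable names first, second, and named are made up for this illustration.

```python
import pandas as pd

index = ['A', 'B', 'C', 'D']
first = pd.Series(['dog', 'cat', 'cow', 'pig'], index)
second = pd.Series(['mouse', 'bat', 'squirrel', 'horse'], index)

# With axis=1, the keys become the column labels of the result.
named = pd.concat([first, second], axis=1, keys=['a', 'b'])
print(named)
```

Giving each Series a name when it is created, as in pd.Series(..., name='a'), has a similar effect.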

Let's create a DataFrame object containing random numeric data and then plot the data in the object.

In [9]:
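# 101 random integers in the range 0 through 24, shifted up by 25
seriesA = pd.Series(np.random.randint(0,25,101),name='A')
seriesA+=25
# 101 random integers in the range 0 through 24
seriesB = pd.Series(np.random.randint(0,25,101),name='B')
# The Series names become the column names
newFrame = pd.concat([seriesA,seriesB],axis=1)
newFrame.head(3)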
Out[9]:
    A   B
0  37  23
1  40   4
2  45  21
In [10]:
newFrame.plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xcb7e090>

Summary

The previous discussion showing how to create a DataFrame object by concatenating Series objects was provided to help you gain insight into the structure of a DataFrame object. In reality, you will probably use the DataFrame constructor to create your DataFrame objects instead of using the process described above. The use of the constructor will be explained in a future notebook that discusses more details about DataFrame objects.
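As a brief preview only, here is a minimal sketch of building the same animal table with the DataFrame constructor. It assumes a dictionary that maps each column name to a list of values, which is just one of several inputs the constructor accepts. The variable name built is made up for this illustration.

```python
import pandas as pd

index = ['A', 'B', 'C', 'D']

# Each dictionary key becomes a column name and each list becomes a column.
built = pd.DataFrame(
    {'a': ['dog', 'cat', 'cow', 'pig'],
     'b': ['mouse', 'bat', 'squirrel', 'horse']},
    index=index,
)
print(built)
```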

Also, as you will see later, importing a dataset typically produces a DataFrame object that has been populated with the data in the dataset. That could very well be the primary source of DataFrame objects for the work that you will be doing.

Regardless of how it comes into existence, a DataFrame object can be manipulated and used in a large variety of ways. That will be the topic of future notebooks.

Housekeeping material

Author: Prof. Richard G. Baldwin

Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.

File: PandasIntro01.html

Revised: 09/01/18

Copyright 2018 Richard G. Baldwin