Pandas Series part 3 - working with real data

This notebook continues the introduction to the Pandas Series object.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Enable inline plotting
%matplotlib inline

Real data

All of the previous discussions on the Pandas Series object have been based on synthetic data that was created specifically to illustrate the points under discussion. Before leaving this topic and moving on to the Pandas DataFrame object, we need to do some experiments with a Series object containing real data.

In this notebook, I will

  • Load a dataset containing price information for a particular stock from 3019 trading days into a DataFrame object.
  • Display the beginning (head) and the ending (tail) of the price data in the DataFrame.
  • Extract the 'Open' price data from the DataFrame into a Series object with the Date data as the index.
  • Create several different plots of different types using the Series object.
  • Apply some miscellaneous methods to the Series object and display the results.

You might want to look these methods up in the documentation to understand the results. Some of these methods will also apply to DataFrame objects later.

The dataset was downloaded from Kaggle at https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231 where it was listed as 'DJIA 30 Stock Time Series'.

Kaggle is an excellent source of datasets for your educational projects. You must create an account and sign in to access the datasets, but there is no charge for the account or the datasets. In addition to having access to many datasets, there are some other benefits to having a Kaggle account for members of the Data Science community.

Load and examine the dataset

Begin by loading the dataset with the file name shown in a subdirectory named 'data'. The data is loaded into a DataFrame object named TestData with the default numeric index for the rows.

In [2]:
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv')

Examine the first five records in the dataset.

In [3]:
TestData.head()
Out[3]:
Date Open High Low Close Volume Name
0 2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
1 2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2 2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA
3 2006-01-06 42.88 43.57 42.80 43.21 29422828 AABA
4 2006-01-09 43.10 43.66 42.82 43.42 16268338 AABA

By taking this first look at the data, we can see that the date information that we want to use as our row index is contained in column number 0.

With that information, we can reload the data in such a way as to cause the data in the Date column to become the row index. To save memory, we will simply overwrite the previous DataFrame object in TestData.

In [4]:
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv',index_col=0)

The index_col argument specifies that column number 0 should be used to create an index for the rows in the object.

Let's re-examine the first five rows in the DataFrame object.

In [5]:
TestData.head()
Out[5]:
Open High Low Close Volume Name
Date
2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 43.10 43.66 42.82 43.42 16268338 AABA

Examine the last 5 records in the DataFrame object.

In [6]:
TestData.tail()
Out[6]:
Open High Low Close Volume Name
Date
2017-12-22 71.42 71.87 71.22 71.58 10979165 AABA
2017-12-26 70.94 71.39 69.63 69.86 8542802 AABA
2017-12-27 69.77 70.49 69.69 70.06 6345124 AABA
2017-12-28 70.12 70.32 69.51 69.82 7556877 AABA
2017-12-29 69.79 70.13 69.43 69.85 6613070 AABA

That is exactly what we wanted to see. The data in the original Date column now serves as the row index.

Contents of the DataFrame object

As you can see, the DataFrame object now contains 3019 individual records beginning in January 2003 and ending in December 2017.

The name of the stock is shown in the Name column as AABA. (A Google search identifies this as the stock for Altaba Inc. on the NASDAQ exchange.)

The stock prices for Open, High, Low, and Close for each day are shown in the columns with the corresponding names.

The trading date for each record is reflected in the row index.

Our objective

Our objective is to extract a Series object containing the Open price data with the date for each opening price as the index. Therefore, we will need to extract the data in the column identified by Open.

Transpose the DataFrame

We will use the loc() method of the DataFrame object to extract the data in the Open column into a Series object.

There are many ways to use the loc() method. The simplest usage is to extract a single row of data. In order to use that approach, we need to transpose the DataFrame object so as to cause the columns to become rows and the rows to become columns, as shown below.

In [7]:
TestData = TestData.transpose()

View the first five rows of the transposed DataFrame to confirm that all went as planned.

In [8]:
TestData.head()
Out[8]:
Date 2006-01-03 2006-01-04 2006-01-05 2006-01-06 2006-01-09 2006-01-10 2006-01-11 2006-01-12 2006-01-13 2006-01-17 ... 2017-12-15 2017-12-18 2017-12-19 2017-12-20 2017-12-21 2017-12-22 2017-12-26 2017-12-27 2017-12-28 2017-12-29
Open 39.69 41.22 40.93 42.88 43.1 42.96 42.19 41.92 41 39.09 ... 69.35 70.54 69.7 70.6 71 71.42 70.94 69.77 70.12 69.79
High 41.22 41.9 41.73 43.57 43.66 43.34 42.31 41.99 41.08 40.39 ... 70.54 70.74 70.36 71.05 72.17 71.87 71.39 70.49 70.32 70.13
Low 38.79 40.77 40.85 42.8 42.82 42.34 41.72 40.76 39.62 38.96 ... 68.66 69.95 69.38 70.17 70.88 71.22 69.63 69.69 69.51 69.43
Close 40.91 40.97 41.53 43.21 43.42 42.98 41.87 40.89 39.9 40.11 ... 70.29 70.14 70.21 70.9 71.59 71.58 69.86 70.06 69.82 69.85
Volume 24232729 20553479 12829610 29422828 16268338 16288580 26192772 18921686 30966185 42429911 ... 19517623 6776982 14654994 12888149 9682733 10979165 8542802 6345124 7556877 6613070

5 rows × 3019 columns

Extract one row of data

Now that we have transposed the DataFrame, we can extract the Open row into a Series object. The column headers, which are now date values, will also be extracted as the index for the Open price data as shown below.

In [9]:
seriesObj = TestData.loc["Open"]

Examine the first five and the last five price elements in our new Series object.

In [10]:
seriesObj.head()
Out[10]:
Date
2006-01-03    39.69
2006-01-04    41.22
2006-01-05    40.93
2006-01-06    42.88
2006-01-09     43.1
Name: Open, dtype: object
In [11]:
seriesObj.tail()
Out[11]:
Date
2017-12-22    71.42
2017-12-26    70.94
2017-12-27    69.77
2017-12-28    70.12
2017-12-29    69.79
Name: Open, dtype: object

As you can see, the price values match up with the date values that we saw in our first look at the data. Our new Series object has the original Date data as the index. We can confirm this by extracting and printing one of the price values by referring to its index as shown below.

In [12]:
print(seriesObj['2017-12-26'])
70.94

Now for the fun part

Now that we have our Series object just the way we want it, let's experiment with it a bit.

A line plot of the price data

First we will plot the opening price values versus the date in a line plot as shown below. Note that the labels on the horizontal axis show the dates, which is why I went to so much trouble to replace the default numeric index with the more meaningful date index.

In [13]:
seriesObj.plot(kind='line',figsize=(7,3),grid=True)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0xa34dc30>

The line plot shows a very nice increase in stock price from 2006 to 2017. However, the period from 2008 through 2012 must have been very depressing for the current stockholders at that time.

A histogram

Next, lets create a histogram of the opening price data as shown below.

In [14]:
seriesObj.plot(kind='hist',bins=100,figsize=(7,3),grid=True)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0xa1701b0>

A kde plot

Finally, let's create a kde plot of the opening price data as shown below. I haven't discussed the kde plot before in this course. If you are interested, you can do a little research to learn just what it is. However, you might guess from its shape that it has something to do with a smoothed version of the histogram shown above.

In [15]:
seriesObj.plot(kind='kde',figsize=(7,3),grid=True)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0xa2b5a10>

Miscellaneous methods

There are many methods that can be called on a Pandas Series object. You can find a list of the methods at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html. A few of those methods are shown below. Some of the methods can also be called on a DataFrame object.

In [16]:
print('The number of missing data elements in this object =',seriesObj.isnull().sum())
The number of missing data elements in this object = 0
In [17]:
print('The number of non-null data elements in this object =',seriesObj.notnull().sum())
The number of non-null data elements in this object = 3019
In [18]:
seriesObj.shape
Out[18]:
(3019,)
In [19]:
seriesObj.describe()
Out[19]:
count     3019.0
unique    1913.0
top         16.0
freq        11.0
Name: Open, dtype: float64
In [20]:
seriesObj.count()
Out[20]:
3019
In [21]:
print('The mean value =',seriesObj.mean())
The mean value = 28.4264789665
In [22]:
print('The median value =',seriesObj.median())
The median value = 27.18
In [23]:
print('The standard deviation =',seriesObj.std())
The standard deviation = 13.2572419244
In [24]:
print('The maximum value =',seriesObj.max())
The maximum value = 73.02
In [25]:
print('The minimum value =',seriesObj.min())
The minimum value = 9.1

Housekeeping material

Author: Prof. Richard G. Baldwin

Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.

File: zz.html

Revised: zz

Copyright 2018 Richard G. Baldwin