This notebook continues the introduction to the Pandas Series object.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Enable inline plotting
%matplotlib inline
All of the previous discussions on the Pandas Series object have been based on synthetic data that was created specifically to illustrate the points under discussion. Before leaving this topic and moving on to the Pandas DataFrame object, we need to do some experiments with a Series object containing real data.
In this notebook, I will
You might want to look these methods up in the documentation to understand the results. Some of these methods will also apply to DataFrame objects later.
The dataset was downloaded from Kaggle at https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231 where it was listed as 'DJIA 30 Stock Time Series'.
Kaggle is an excellent source of datasets for your educational projects. You must create an account and sign in to access the datasets, but there is no charge for the account or the datasets. In addition to having access to many datasets, there are some other benefits to having a Kaggle account for members of the Data Science community.
Begin by loading the dataset from the file named below, which is stored in a subdirectory named 'data'. The data is loaded into a DataFrame object named TestData with the default numeric index for the rows.
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv')
Examine the first five records in the dataset.
TestData.head()
By taking this first look at the data, we can see that the date information that we want to use as our row index is contained in column number 0.
With that information, we can reload the data in such a way as to cause the data in the Date column to become the row index. To save memory, we will simply overwrite the previous DataFrame object in TestData.
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv',index_col=0)
The index_col argument specifies that column number 0 should be used to create an index for the rows in the object.
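To see the effect of index_col in isolation, here is a minimal sketch that reads a tiny CSV from an in-memory string instead of the Kaggle file. The three rows, the prices, and the name XYZ are made up for illustration only; they are not values from the real dataset.

```python
import io
import pandas as pd

# Made-up sample with the same general layout as a stock price file.
# None of these values come from the real dataset.
csv_text = """Date,Open,Close,Name
2006-01-03,10.0,10.2,XYZ
2006-01-04,10.5,10.9,XYZ
2006-01-05,11.0,10.8,XYZ
"""

# Without index_col, the rows get the default numeric index.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, column number 0 (Date) becomes the row index.
date_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())
print(date_index.index.tolist())
print(date_index.columns.tolist())
```

Note that Date no longer appears among the ordinary columns in the second object, because it has been moved into the index.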
Let's re-examine the first five rows in the DataFrame object.
TestData.head()
Examine the last five records in the DataFrame object.
TestData.tail()
That is exactly what we wanted to see. The data in the original Date column now serves as the row index.
As you can see, the DataFrame object now contains 3019 individual records beginning in January 2006 and ending in December 2017.
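If you would rather confirm the record count and the date range in code than read them off the displayed tables, the length of the object and its first and last index labels are enough. The following is a minimal sketch on a made-up three-row frame; the name sample and its values are hypothetical stand-ins for TestData.

```python
import pandas as pd

# Hypothetical stand-in for TestData with a date-like string index.
sample = pd.DataFrame(
    {"Open": [10.0, 10.5, 11.0]},
    index=["2006-01-03", "2006-01-04", "2006-01-05"],
)
sample.index.name = "Date"

n_records = len(sample)        # number of rows
first_date = sample.index[0]   # label of the first row
last_date = sample.index[-1]   # label of the last row

print(n_records, first_date, last_date)
```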
The name of the stock is shown in the Name column as AABA. (A Google search identifies this as the stock for Altaba Inc. on the NASDAQ exchange.)
The stock prices for Open, High, Low, and Close for each day are shown in the columns with the corresponding names.
The trading date for each record is reflected in the row index.
Our objective is to extract a Series object containing the Open price data with the date for each opening price as the index. Therefore, we will need to extract the data in the column identified by Open.
We will use the loc indexer of the DataFrame object to extract the data in the Open column into a Series object. Note that loc is an indexer rather than a method, so it is used with square brackets instead of parentheses.
There are many ways to use loc. The simplest usage is to extract a single row of data by its label. In order to use that approach, we need to transpose the DataFrame object so as to cause the columns to become rows and the rows to become columns, as shown below.
TestData = TestData.transpose()
View the first five rows of the transposed DataFrame to confirm that all went as planned.
TestData.head()
Now that we have transposed the DataFrame, we can extract the Open row into a Series object. The column headers, which are now date values, will also be extracted as the index for the Open price data as shown below.
seriesObj = TestData.loc["Open"]
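For comparison, a column can also be selected directly, without transposing, using either square brackets or loc with a row slice. The sketch below uses a small made-up frame (the name frame and its values are hypothetical) to compare the three approaches. One difference worth knowing: when the frame mixes numeric and text columns, as this one does with its Name column, the transposed row comes back with the generic object dtype, while direct column selection keeps float64. That may change what numeric summaries such as describe() report.

```python
import pandas as pd

# Made-up frame with one text column, similar in spirit to the stock data.
frame = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 11.0],
        "Close": [10.2, 10.9, 10.8],
        "Name": ["XYZ", "XYZ", "XYZ"],
    },
    index=["2006-01-03", "2006-01-04", "2006-01-05"],
)

# Approach used in this notebook: transpose, then select a row by label.
via_transpose = frame.transpose().loc["Open"]

# Direct column selection, no transpose needed.
via_brackets = frame["Open"]
via_loc = frame.loc[:, "Open"]

print(via_transpose.dtype)   # object, because the transposed frame is mixed
print(via_brackets.dtype)    # float64

# The values agree once the transposed row is converted back to float.
print((via_transpose.astype(float) == via_brackets).all())
```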
Examine the first five and the last five price elements in our new Series object.
seriesObj.head()
seriesObj.tail()
As you can see, the price values match up with the date values that we saw in our first look at the data. Our new Series object has the original Date data as the index. We can confirm this by extracting and printing one of the price values by referring to its index as shown below.
print(seriesObj['2017-12-26'])
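Because the file was read without asking Pandas to parse dates, the index labels used above are plain text that happens to look like dates. That is sufficient for an exact lookup like the one above. If you want true date behavior, such as selecting a whole month, one option is the parse_dates argument of read_csv. The sketch below is an illustration on a made-up four-row CSV, not a change to the approach used in this notebook.

```python
import io
import pandas as pd

# Made-up prices; only the layout resembles the real file.
csv_text = """Date,Open
2017-11-30,10.0
2017-12-22,11.0
2017-12-26,12.0
2017-12-29,13.0
"""

# Index labels stay as text.
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0)["Open"]

# Index labels are parsed into timestamps (a DatetimeIndex).
as_dates = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)["Open"]

# Exact lookup works in both cases.
print(as_text["2017-12-26"])
print(as_dates.loc[pd.Timestamp("2017-12-26")])

# Selecting every row in December 2017 relies on the DatetimeIndex.
december = as_dates.loc["2017-12"]
print(december)
```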
Now that we have our Series object just the way we want it, let's experiment with it a bit.
First we will plot the opening price values versus the date in a line plot as shown below. Note that the labels on the horizontal axis show the dates, which is why I went to so much trouble to replace the default numeric index with the more meaningful date index.
seriesObj.plot(kind='line',figsize=(7,3),grid=True)
The line plot shows a very nice increase in stock price from 2006 to 2017. However, the period from 2008 through 2012 must have been very depressing for anyone who held the stock at that time.
Next, let's create a histogram of the opening price data as shown below.
seriesObj.plot(kind='hist',bins=100,figsize=(7,3),grid=True)
Finally, let's create a KDE (kernel density estimate) plot of the opening price data as shown below. I haven't discussed the KDE plot before in this course. If you are interested, you can do a little research to learn just what it is. However, you might guess from its shape that it is a smoothed version of the histogram shown above.
seriesObj.plot(kind='kde',figsize=(7,3),grid=True)
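To give a rough feel for what a kernel density estimate is, the sketch below places a small Gaussian bump on each data value and averages the bumps. This is my own simplified illustration using NumPy, not the code that Pandas runs internally (Pandas relies on SciPy for this plot). The function name gaussian_kde_sketch, the made-up prices, and the bandwidth rule are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return bumps.sum(axis=1) / (samples.size * bandwidth)

# Made-up prices, not taken from the dataset.
prices = np.array([10.0, 10.5, 11.0, 15.0, 15.5, 30.0])

# A simple rule-of-thumb bandwidth (assumed for this sketch).
bandwidth = prices.std(ddof=1) * prices.size ** (-1 / 5)

# Evaluate the estimate on a grid that extends well past the data.
grid = np.linspace(prices.min() - 5 * bandwidth,
                   prices.max() + 5 * bandwidth, 2001)
density = gaussian_kde_sketch(prices, grid, bandwidth)

# A density should enclose an area of about 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual data values more closely, much like changing the number of bins in a histogram.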
There are many methods that can be called on a Pandas Series object. You can find a list of the methods at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html. A few of those methods are shown below. Some of the methods can also be called on a DataFrame object.
print('The number of missing data elements in this object =',seriesObj.isnull().sum())
print('The number of non-null data elements in this object =',seriesObj.notnull().sum())
seriesObj.shape
seriesObj.describe()
seriesObj.count()
print('The mean value =',seriesObj.mean())
print('The median value =',seriesObj.median())
print('The standard deviation =',seriesObj.std())
print('The maximum value =',seriesObj.max())
print('The minimum value =',seriesObj.min())
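The real data may or may not contain missing values, so it is worth seeing how these methods treat them. The following minimal sketch uses a made-up five-element Series with one missing value; the name prices and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing element.
prices = pd.Series([10.0, np.nan, 12.0, 14.0, 20.0])

missing = int(prices.isnull().sum())    # counts the NaN
present = int(prices.notnull().sum())   # counts everything else

print(prices.shape)     # includes the missing element
print(prices.count())   # excludes the missing element
print(missing, present)

# These statistics skip missing values by default.
print(prices.mean(), prices.median(), prices.max(), prices.min())

# std() uses the sample formula (divides by n - 1) by default.
print(prices.std())
```

The distinction between shape and count() is the one most likely to cause surprises: shape reports the total number of elements, while count() reports only the non-null ones.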
Author: Prof. Richard G. Baldwin
Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.
Copyright 2018 Richard G. Baldwin