Pandas DataFrame part 3 - operations

This notebook will illustrate various operations that you can perform on a DataFrame object.

Load a dataset

This dataset was downloaded from Kaggle at https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231

It was listed as DJIA 30 Stock Time Series. The dataset contains price information for a particular stock on the NASDAQ exchange for 3019 trading days.

In [1]:
import pandas as pd
%matplotlib inline
In [2]:
TestData01 = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv',
                         index_col='Date')

The CSV file specified in the above statement was stored in a subdirectory named 'data'.

Examine the beginning and the end of the dataset

Examine the first 3 records in the dataset.

In [3]:
TestData01.head(3)
Out[3]:
Open High Low Close Volume Name
Date
2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA

Examine the last three records in the dataset.

In [4]:
TestData01.tail(3)
Out[4]:
Open High Low Close Volume Name
Date
2017-12-27 69.77 70.49 69.69 70.06 6345124 AABA
2017-12-28 70.12 70.32 69.51 69.82 7556877 AABA
2017-12-29 69.79 70.13 69.43 69.85 6613070 AABA

Contents of the dataset

The dataset contains 3019 individual records beginning in January 2006 and ending in December 2017.

The stock prices for Open, High, Low, and Close for each day are shown in the columns with the corresponding names.

The trading date for each record is shown as the row index.

Extract and save four columns from the original dataset

In [5]:
TestData01 = TestData01.loc[:,'Open':'Close']
TestData01.head(3)
Out[5]:
Open High Low Close
Date
2006-01-03 39.69 41.22 38.79 40.91
2006-01-04 41.22 41.90 40.77 40.97
2006-01-05 40.93 41.73 40.85 41.53

Plot the dataset

In [6]:
TestData01.plot(figsize = (7,3))
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x9e82410>

Decrease all the values in the dataset

The following code will divide each of the numeric values in the dataset by a factor of two.

In [7]:
TestData02 = TestData01/2
TestData02.head(3)
Out[7]:
Open High Low Close
Date
2006-01-03 19.845 20.610 19.395 20.455
2006-01-04 20.610 20.950 20.385 20.485
2006-01-05 20.465 20.865 20.425 20.765

Plot the modified dataset

In [8]:
TestData02.plot(figsize = (7,3))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xa2f04b0>

Increase the spread of the data in the dataset

In [9]:
TestData03 = TestData02.loc[:]
TestData03['Open']+=5
TestData03['Close']+=10
TestData03['High']+=15
TestData03.head(3)
Out[9]:
Open High Low Close
Date
2006-01-03 24.845 35.610 19.395 30.455
2006-01-04 25.610 35.950 20.385 30.485
2006-01-05 25.465 35.865 20.425 30.765

Plot the modified dataset

In [10]:
TestData03.plot(figsize = (7,3))
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xadb4c30>

Increase the high values only

In [11]:
TestData04 = TestData03[:]
TestData04['High']+=5
TestData04.head(3)
Out[11]:
Open High Low Close
Date
2006-01-03 24.845 40.610 19.395 30.455
2006-01-04 25.610 40.950 20.385 30.485
2006-01-05 25.465 40.865 20.425 30.765

Plot the modified dataset

In [12]:
TestData04.plot(figsize = (7,3))
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xae9f230>

Get a new subset of the data

In [13]:
TestData10 = TestData04.loc[:,'High':'Low']
TestData10.head(3)
Out[13]:
High Low
Date
2006-01-03 40.610 19.395
2006-01-04 40.950 20.385
2006-01-05 40.865 20.425

Plot the new subset

In [14]:
TestData10.plot(figsize = (7,3))
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0xaee9df0>

Add a mean column and plot the modfied dataset

In [15]:
meanSeries = TestData10.mean(axis=1)
meanSeries.name = 'Mean'
TestData11 = pd.concat([TestData10, meanSeries], axis=1)
TestData11.head(3)
Out[15]:
High Low Mean
Date
2006-01-03 40.610 19.395 30.0025
2006-01-04 40.950 20.385 30.6675
2006-01-05 40.865 20.425 30.6450
In [16]:
TestData12 = TestData11.loc['2006-01-03':'2017-12-04','High':'Mean']
TestData12.plot(figsize = (7,3))
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0xaf13270>

Writing CSV files

The following code writes the contents of the DataFrame named TestData12 into an output CSV file named csvOutput.csv. Then it loads the data from the file twice, once without specifying the index and again by specifying the index. Then it plots the data in the DataFrame object to confirm that it matches the plot shown above.

In [17]:
TestData12.to_csv('csvOutput.csv')
TestData13 = pd.read_csv('csvOutput.csv')
TestData13.head(3)
Out[17]:
Date High Low Mean
0 2006-01-03 40.610 19.395 30.0025
1 2006-01-04 40.950 20.385 30.6675
2 2006-01-05 40.865 20.425 30.6450
In [18]:
TestData12.to_csv('csvOutput.csv')
TestData13 = pd.read_csv('csvOutput.csv',index_col='Date')
TestData13.head(3)
Out[18]:
High Low Mean
Date
2006-01-03 40.610 19.395 30.0025
2006-01-04 40.950 20.385 30.6675
2006-01-05 40.865 20.425 30.6450
In [19]:
TestData13.plot(figsize = (7,3))
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0xb0117f0>

Plot a histogram of the data

The following code plots the contents of the DataFrame object in the form of a histogram, showing the three overlapping distributions.

In [20]:
TestData13.plot(kind ='hist',bins=100,figsize = (7,3))
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0xb3b5d30>

Plot a kde plot of the data

The following code plots the contents of the DataFrame object in kde format, showing the three overlapping distributions.

In [21]:
TestData13.plot(kind ='kde',figsize = (7,3))
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0xb0f4eb0>

Housekeeping material

Author: Prof. Richard G. Baldwin

Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.

File: PandasDataFrame02.html

Revised: 09/02/18

Copyright 2018 Richard G. Baldwin