Pandas DataFrame part 1 - loading a dataset

According to this tutorial:

The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.

The tutorial goes on to say:

Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data type, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.

You learned a little about the DataFrame object in an earlier notebook. This notebook will expand on that knowledge by showing you how to load a large CSV dataset into a DataFrame object and perform some operations on the object.

Finally, this notebook will plot the 3019 data values contained in one column of the dataset as a line plot.

Load a dataset

The following code will load a dataset file into a DataFrame object.

The dataset was downloaded from Kaggle at https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231

It was listed as DJIA 30 Stock Time Series. The dataset contains price information for a particular stock on the NASDAQ exchange for 3019 trading days.

In [1]:
import pandas as pd
%matplotlib inline
In [2]:
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv')

The CSV file specified in the above statement was stored in a subdirectory named 'data'.

Examine the beginning and the end of the dataset

Examine the first three records in the dataset.

In [3]:
TestData.head(3)
Out[3]:
Date Open High Low Close Volume Name
0 2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
1 2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2 2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA

Examine the last three records in the dataset.

In [4]:
TestData.tail(3)
Out[4]:
Date Open High Low Close Volume Name
3016 2017-12-27 69.77 70.49 69.69 70.06 6345124 AABA
3017 2017-12-28 70.12 70.32 69.51 69.82 7556877 AABA
3018 2017-12-29 69.79 70.13 69.43 69.85 6613070 AABA

Contents of the dataset

As you can see, the dataset contains 3019 individual records beginning in January 2006 and ending in December 2017. That represents approximately five trading days per week, 52 weeks per year, for twelve years.

The name of the stock is shown in the Name column as AABA. (A Google search identifies this as the stock for Altaba Inc. on the NASDAQ exchange.)

The stock prices for Open, High, Low, and Close for each day are shown in the columns with the corresponding names.

The trading date for each record is shown in the Date column.

Set the row index

The above read operation produced a DataFrame object with a default zero-based numeric row index. We would prefer to have the row index be the date instead of the default numeric index. We will read the dataset file again and specify that the Date column should become the row index.

In [5]:
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv',index_col=0)
TestData.head(3)
Out[5]:
Open High Low Close Volume Name
Date
2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA
In [6]:
TestData.tail(3)
Out[6]:
Open High Low Close Volume Name
Date
2017-12-27 69.77 70.49 69.69 70.06 6345124 AABA
2017-12-28 70.12 70.32 69.51 69.82 7556877 AABA
2017-12-29 69.79 70.13 69.43 69.85 6613070 AABA

Get descriptive information

Now we will call some methods on the DataFrame object to obtain descriptive information about the dataset.

In [7]:
TestData.shape
Out[7]:
(3019, 6)

The shape() method tells us that the DataFrame object contains 3019 rows and six columns. Note that the row and column indices are not included in the row and column counts.

In [8]:
TestData.describe()
Out[8]:
Open High Low Close Volume
count 3019.000000 3019.000000 3019.000000 3019.000000 3.019000e+03
mean 28.426479 28.766532 28.066558 28.412726 2.158391e+07
std 13.257242 13.356692 13.157326 13.258163 1.926231e+07
min 9.100000 9.480000 8.940000 8.950000 1.939061e+06
25% 16.175000 16.385000 15.970000 16.130000 1.248025e+07
50% 27.180000 27.490000 26.820000 27.100000 1.732130e+07
75% 36.655000 37.035000 36.305000 36.635000 2.512757e+07
max 73.020000 73.250000 72.460000 72.930000 4.382317e+08

The describe() method provides basis statistical information about the values in each of the data columns.

In [9]:
TestData.isnull().sum()
Out[9]:
Open      0
High      0
Low       0
Close     0
Volume    0
Name      0
dtype: int64

The combination of the isnull() and sum() methods tell us that the dataset is very clean with no missing data in any of the data columns.

Plot the data from the open column

A quick way to get a feel for the data is to plot the numeric values in one or more columns. In this case, we will use slicing syntax to extract the data from the Open column and then plot it.

The following code extracts the data from the Open column and returns it as a Series object. The general syntax rule for extracting a subset of data from a DataFrame object is:

TestData.loc[startrow:endrow, startcolumn:endcolumn]

Either of the following statements will work.

In [10]:
#openSeries = TestData.loc['2006-01-03':,'Open':'Open']
openSeries = TestData.loc['0':,'Open':'Open']

Now examine the beginning and ending portions of the new Series object to confirm that it contains the correct data.

In [11]:
openSeries.head(2)
Out[11]:
Open
Date
2006-01-03 39.69
2006-01-04 41.22
In [12]:
openSeries.tail(2)
Out[12]:
Open
Date
2017-12-28 70.12
2017-12-29 69.79

Now call the plot() method on the Series object to create a line plot of the data from the Open column.

In [13]:
openSeries.plot(kind='line',figsize=(7,3),grid=True)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x9e17fd0>

Housekeeping material

Author: Prof. Richard G. Baldwin

Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.

File: PandasDataFrame01.html

Revised: 08/30/18

Copyright 2018 Richard G. Baldwin