According to this tutorial:
The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.
The tutorial goes on to say:
Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data type, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.
You learned a little about the DataFrame object in an earlier notebook. This notebook will expand on that knowledge by showing you how to load a large CSV dataset into a DataFrame object and perform some operations on the object.
Finally, this notebook will plot the 3019 data values contained in one column of the dataset as a line plot.
The following code will load a dataset file into a DataFrame object.
The dataset was downloaded from Kaggle at https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231
It was listed as DJIA 30 Stock Time Series. The dataset contains price information for a particular stock on the NASDAQ exchange for 3019 trading days.
import pandas as pd
%matplotlib inline
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv')
The CSV file specified in the above statement was stored in a subdirectory named 'data'.
Examine the first three records in the dataset.
TestData.head(3)
Examine the last three records in the dataset.
TestData.tail(3)
As you can see, the dataset contains 3019 individual records beginning in January 2006 and ending in December 2017. That is consistent with approximately five trading days per week, 52 weeks per year, for twelve years (5 × 52 × 12 = 3120), less market holidays.
The name of the stock is shown in the Name column as AABA. (A Google search identifies this as the stock for Altaba Inc. on the NASDAQ exchange.)
The stock prices for Open, High, Low, and Close for each day are shown in the columns with the corresponding names.
The trading date for each record is shown in the Date column.
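The load-and-inspect steps above can be tried without the Kaggle file by reading a small in-memory CSV. This is only a sketch: the three rows are made-up values, not real prices, and the column layout (Date, Open, High, Low, Close, Volume, Name) is an assumption about the file based on the columns described in this notebook.

```python
import io
import pandas as pd

# Made-up sample rows; the column names are assumed to match the Kaggle file.
csv_text = """Date,Open,High,Low,Close,Volume,Name
2006-01-03,10.0,11.0,9.5,10.5,1000,AABA
2006-01-04,10.5,11.5,10.0,11.0,1200,AABA
2006-01-05,11.0,12.0,10.5,11.5,900,AABA
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first two records
print(sample.tail(2))  # last two records
print(sample.index)    # default zero-based numeric row index
```

Because no index column was requested, Date is an ordinary data column here and the rows are labeled 0, 1, 2.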
The above read operation produced a DataFrame object with a default zero-based numeric row index. We would prefer to have the row index be the date instead of the default numeric index. We will read the dataset file again and specify that the Date column should become the row index.
TestData = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv', index_col=0)
TestData.head(3)
TestData.tail(3)
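With index_col=0 alone, the dates in the row index are kept as plain strings. As a sketch of one optional refinement (not something this notebook does), passing parse_dates=True to read_csv asks pandas to parse the index as dates, which yields a DatetimeIndex. The sample rows below are made up, and the column layout is assumed.

```python
import io
import pandas as pd

# Made-up sample rows; the column names are assumed to match the Kaggle file.
csv_text = """Date,Open,High,Low,Close,Volume,Name
2006-01-03,10.0,11.0,9.5,10.5,1000,AABA
2006-01-04,10.5,11.5,10.0,11.0,1200,AABA
2006-01-05,11.0,12.0,10.5,11.5,900,AABA
"""

# Date column becomes the row index, with the dates left as strings.
by_label = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Same read, but the index is parsed into timestamps.
by_date = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(by_label.index[0]))  # a plain string
print(type(by_date.index))      # a DatetimeIndex
print(by_date.loc[pd.Timestamp("2006-01-04"), "Open"])
```

A string index is enough for the label-based selection used in this notebook; a DatetimeIndex mainly helps when date arithmetic or resampling is needed later.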
Now we will call some methods on the DataFrame object to obtain descriptive information about the dataset.
TestData.shape
The shape attribute tells us that the DataFrame object contains 3019 rows and six columns. (shape is an attribute rather than a method, so it is written without parentheses.) Note that the row and column indices are not included in the row and column counts.
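As a small illustration, shape is a (rows, columns) tuple that can be unpacked directly. The frame below is hypothetical and uses made-up values.

```python
import pandas as pd

# Hypothetical three-row frame with a date-string row index.
frame = pd.DataFrame(
    {"Open": [10.0, 10.5, 11.0], "Close": [10.5, 11.0, 11.5]},
    index=["2006-01-03", "2006-01-04", "2006-01-05"],
)

# shape is read as an attribute (no parentheses) and returns a tuple.
n_rows, n_cols = frame.shape
print(n_rows, n_cols)
```

The row index does not count toward the number of columns, which is why the frame above reports two columns rather than three.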
TestData.describe()
The describe() method provides basic statistical information about the values in each of the numeric data columns.
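By default, describe() on a frame with mixed column types summarizes only the numeric columns, so a text column such as Name is left out. The sketch below uses a hypothetical frame with made-up values to show this, along with the include="all" option that brings the text column back in.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one text column.
frame = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 11.0],
        "Volume": [1000, 1200, 900],
        "Name": ["AABA", "AABA", "AABA"],
    }
)

numeric_summary = frame.describe()           # numeric columns only
full_summary = frame.describe(include="all")  # every column

print(numeric_summary)
print(full_summary.columns.tolist())
```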
TestData.isnull().sum()
The combination of the isnull() and sum() methods tells us that the dataset is very clean, with no missing data in any of the data columns.
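To see what this check reports when data is missing, here is a minimal sketch on a hypothetical frame in which one value has been deliberately left out. isnull() marks each cell as True or False, and sum() then counts the True values in each column.

```python
import pandas as pd

# Hypothetical frame; one Open value is deliberately missing.
frame = pd.DataFrame(
    {
        "Open": [10.0, float("nan"), 11.0],
        "Close": [10.5, 11.0, 11.5],
    }
)

missing_per_column = frame.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

A result of zero in every column, as reported for the stock dataset, means no cell was flagged as missing.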
A quick way to get a feel for the data is to plot the numeric values in one or more columns. In this case, we will use slicing syntax to extract the data from the Open column and then plot it.
The following code extracts the data from the Open column and returns it as a Series object. The general syntax rule for extracting a subset of data from a DataFrame object is:
TestData.loc[startrow:endrow, startcolumn:endcolumn]
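Either of the following statements will work. Note that the column is selected with the single label 'Open'; a column slice such as 'Open':'Open' would return a one-column DataFrame rather than a Series.
#openSeries = TestData.loc['2006-01-03':, 'Open']
openSeries = TestData.loc[:, 'Open']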
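The behavior of .loc can be checked on a small frame. The sketch below, which uses a hypothetical frame with made-up values, shows two points: a single column label yields a Series while a column slice yields a DataFrame, and label slices with .loc include both endpoints.

```python
import pandas as pd

# Hypothetical frame with a date-string row index.
frame = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 11.0],
        "High": [11.0, 11.5, 12.0],
        "Low": [9.5, 10.0, 10.5],
    },
    index=["2006-01-03", "2006-01-04", "2006-01-05"],
)

as_series = frame.loc[:, "Open"]        # single label -> Series
as_frame = frame.loc[:, "Open":"Open"]  # column slice -> one-column DataFrame

# Both the start and end labels are included in a .loc slice.
window = frame.loc["2006-01-03":"2006-01-04", "Open":"High"]

print(type(as_series).__name__, type(as_frame).__name__)
print(window)
```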
Now examine the beginning and ending portions of the new Series object to confirm that it contains the correct data.
openSeries.head(2)
openSeries.tail(2)
Now call the plot() method on the Series object to create a line plot of the data from the Open column.
openSeries.plot(kind='line', figsize=(7, 3), grid=True)
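Outside a notebook there is no inline display, so the same plot() call is usually followed by saving the figure. The following is a minimal sketch under that assumption, using a hypothetical four-value Series with made-up prices and the non-interactive Agg backend; the axis labels are an optional addition.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import pandas as pd

# Hypothetical Series of made-up opening prices.
open_values = pd.Series(
    [10.0, 10.5, 11.0, 10.8],
    index=["2006-01-03", "2006-01-04", "2006-01-05", "2006-01-06"],
    name="Open",
)

# plot() returns the matplotlib Axes that the line was drawn on.
ax = open_values.plot(kind="line", figsize=(7, 3), grid=True)
ax.set_xlabel("Date")
ax.set_ylabel("Open")

# Save to an in-memory buffer; a file name would work the same way.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
print(buffer.getbuffer().nbytes, "bytes written")
```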
Author: Prof. Richard G. Baldwin
Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.
File: PandasDataFrame01.html
Revised: 08/30/18
Copyright 2018 Richard G. Baldwin