You have learned quite a lot about the DataFrame object in earlier notebooks. This notebook will expand on that knowledge by illustrating various ways to access the data in a DataFrame object.
Before getting into the details regarding access, I will load a large dataset for use in illustrating the access procedures later.
The following code will load a dataset file into a DataFrame object and cause the 'Date' column to become the row index.
The dataset was downloaded from Kaggle at https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231
It was listed as DJIA 30 Stock Time Series. The dataset contains price information for a particular stock on the NASDAQ exchange for 3019 trading days.
import pandas as pd
%matplotlib inline
TestData01 = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv',index_col='Date')
The CSV file specified in the above statement was stored in a subdirectory named 'data'.
Examine the first 15 records in the dataset.
TestData01.head(15)
Examine the last three records in the dataset.
TestData01.tail(3)
The dataset contains 3019 individual records beginning in January 2006 and ending in December 2017. That represents approximately five trading days per week, 52 weeks per year, for twelve years.
The name of the stock is shown in the Name column as AABA. (A Google search identifies this as the stock for Altaba Inc. on the NASDAQ exchange.)
The stock prices for Open, High, Low, and Close for each day are shown in the columns with the corresponding names.
The trading date for each record is shown as the row index.
TestData01.isnull().sum()
The combination of the isnull() and sum() methods tell us that the dataset is very clean with no missing data in any of the data columns.
There are at least three ways to access the data in a DataFrame object:
The loc method can be used to access a group of rows and columns by label(s) or a boolean array. The method is primarily label based, but may also be used with a boolean array.
The iloc method can be used to access a group of rows and columns by integer numeric indices or a boolean array. The method is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
This approach is similar to the approach that would be used for accessing the data in a Python dictionary using keys. In this case, labels are used instead of keys.
The loc method will accept any of the following input arguments:
Note that for the third case, 'a':'f', contrary to usual python slices, both the start and the stop are included
This notebook will deal with the first four possibilities in the above list.
The allowable inputs for the iloc method are similar to those for the loc method, taking into account the difference between integers and labels. The iloc method will accept any of the following input arguments:
The .iloc method will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
This notebook will deal with the first four possibilities in the above list.
The general syntax for accessing a subset of a DataFrame object is:
dataFrameObject.loc[startrow:endrow, startcolumn:endcolumn]
TestData02 = TestData01.loc['2006-01-06':'2006-01-13','Open':'Volume']
TestData02.head(7)
The following code will access and print a single column of data.
print(TestData02.loc[:,'Open':'Open'])
The following code will access and print a single row of data.
print(TestData02.loc['2006-01-10':'2006-01-10',:])
The following code uses the slice syntax to access a single data value internal to the DataFrame object.
print(TestData02.loc['2006-01-10':'2006-01-10','High':'High'])
The iloc method may be more practical than the loc method, particularly for large datasets which are difficult to view in their entirety. It may be more practical to specify integer row and column index values than to specify labels.
The following code will extract a rectangular subset (6x5) of rows and columns from the original DataFrame object and save it in a new DataFrame object. This is the same rectangular subset as the one extracted using labels. Recall the previous note having to do with the start and stop values for integers and labels.
TestData02 = TestData01.iloc[3:9,0:5]
TestData02.head(7)
The following code will access and print a single column of data.
print(TestData02.iloc[:,0:1])
The following code will access and print a single row of data.
print(TestData02.iloc[2:3,:])
The follow code uses the slice syntax to access a single data value internal to the DataFrame object.
print(TestData02.iloc[2:3,1:2])
print(TestData02.loc['2006-01-10'])
Unfortunately, the approach used in the code above cannot be used directly to access a single column. An attempt to do so raises the follow error:
the label [Open] is not in the [index]By default, the loc method seems to expect the single label to be a row label and refuses to accept a column label. There is probably a straightforward solution for this issue, but I have yet to find it after expending a great deal of time searching the web.
There is an obvious workaround, which I will illustrate below. The workaround is to transpose the DataFrame object before calling the loc method with the single label.
While this workaround code is easy enough to write, that hardly seems worthwhile since it is so easy to access a single column using the slicing syntax shown above. Also, the transpose method may be expensive in terms of computer time used, especially for very large datasets.
print(TestData02.transpose().loc['Open'])
The following code will access and print a single row of data using the single integer approach.
print(TestData02.iloc[2])
The same problem and the same workaround mentioned earlier applies here also. The workaround is to transpose the DataFrame object before calling the iloc method with the single integer.
print(TestData02.transpose().iloc[0])
print(TestData02.loc[['2006-01-09','2006-01-11','2006-01-13']])
The following code will access and print three columns of data using a list of labels. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the loc method and then transpose the result to get back into the original format.
print( (TestData02.transpose().loc[['Open','Low','Volume']]).transpose() )
The following code will access and print three rows of data using a list of integers.
print(TestData02.iloc[[0,2,4]])
The following code will access and print three columns of data using a list of integers. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the iloc method and then transpose the result to get back into the original format.
print( (TestData02.transpose().iloc[[0,2,4]]).transpose() )
The following code will access and print four rows of data using a list of booleans.
mask =[True,False,True,False,True,True]
print(TestData02.loc[mask])
The following code will access and print three columns of data using a list of booleans. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the loc method and then transpose the result to get back into the original format.
mask =[True,False,True,False,True]
print( (TestData02.transpose().loc[mask]).transpose() )
The following code will access and print four rows of data using a list of booleans.
mask =[True,False,True,False,True,True]
print(TestData02.iloc[mask])
The following code will access and print three columns of data using a list of booleans. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the loc method and then transpose the result to get back into the original format.
mask =[True,False,True,False,True]
print( (TestData02.transpose().iloc[mask]).transpose() )
Without further comment, the following code can be used to access a single column or a single row of data from the DataFrame object using the dictionary-style approach.
print(TestData02['Open'])
print( TestData02.transpose()['2006-01-10'] )
The following code will extract a subset of data and plot the data in the subset.
TestData02 = TestData01.loc['2010-10-14':'2010-11-15','Open':'Close']
TestData02.plot(xticks=range(len(TestData02.index)),use_index=True,
rot=90)
Author: Prof. Richard G. Baldwin
Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.
File: PandasDataFrame02.html
Revised: 09/02/18
Copyright 2018 Richard G. Baldwin