Pandas DataFrame part 2 - data access

You have learned quite a lot about the DataFrame object in earlier notebooks. This notebook will expand on that knowledge by illustrating various ways to access the data in a DataFrame object.

Load a dataset

Before getting into the details regarding access, I will load a large dataset for use in illustrating the access procedures later.

The following code will load a dataset file into a DataFrame object and cause the 'Date' column to become the row index.

The dataset was downloaded from Kaggle at https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231

It was listed as DJIA 30 Stock Time Series. The dataset contains price information for a particular stock on the NASDAQ exchange for 3019 trading days.

In [1]:
import pandas as pd
%matplotlib inline
In [2]:
TestData01 = pd.read_csv('data/AABA_2006-01-01_to_2018-01-01.csv',index_col='Date')

The CSV file specified in the above statement was stored in a subdirectory named 'data'.

Examine the beginning and the end of the dataset

Examine the first 15 records in the dataset.

In [3]:
TestData01.head(15)
Out[3]:
Open High Low Close Volume Name
Date
2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 43.10 43.66 42.82 43.42 16268338 AABA
2006-01-10 42.96 43.34 42.34 42.98 16288580 AABA
2006-01-11 42.19 42.31 41.72 41.87 26192772 AABA
2006-01-12 41.92 41.99 40.76 40.89 18921686 AABA
2006-01-13 41.00 41.08 39.62 39.90 30966185 AABA
2006-01-17 39.09 40.39 38.96 40.11 42429911 AABA
2006-01-18 35.01 36.16 34.74 35.18 118670393 AABA
2006-01-19 35.82 35.84 34.24 34.33 60918436 AABA
2006-01-20 34.44 34.66 33.21 33.74 57681688 AABA
2006-01-23 34.22 34.40 33.98 34.17 30894265 AABA
2006-01-24 34.55 35.20 34.51 34.87 31686324 AABA

Examine the last three records in the dataset.

In [4]:
TestData01.tail(3)
Out[4]:
Open High Low Close Volume Name
Date
2017-12-27 69.77 70.49 69.69 70.06 6345124 AABA
2017-12-28 70.12 70.32 69.51 69.82 7556877 AABA
2017-12-29 69.79 70.13 69.43 69.85 6613070 AABA

Contents of the dataset

The dataset contains 3019 individual records beginning in January 2006 and ending in December 2017. That represents approximately five trading days per week, 52 weeks per year, for twelve years.

The name of the stock is shown in the Name column as AABA. (A Google search identifies this as the stock for Altaba Inc. on the NASDAQ exchange.)

The stock prices for Open, High, Low, and Close for each day are shown in the columns with the corresponding names.

The trading date for each record is shown as the row index.

Check for missing data

In [5]:
TestData01.isnull().sum()
Out[5]:
Open      0
High      0
Low       0
Close     0
Volume    0
Name      0
dtype: int64

The combination of the isnull() and sum() methods tell us that the dataset is very clean with no missing data in any of the data columns.

Three ways to access the data

There are at least three ways to access the data in a DataFrame object:

  • By calling the DataFrame.loc method on the DataFrame object.
  • By calling the DataFrame.iloc method on the DataFrame object.
  • By using dictionary-style indexing.

The loc method

The loc method can be used to access a group of rows and columns by label(s) or a boolean array. The method is primarily label based, but may also be used with a boolean array.

The iloc method

The iloc method can be used to access a group of rows and columns by integer numeric indices or a boolean array. The method is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Dictionary-style indexing

This approach is similar to the approach that would be used for accessing the data in a Python dictionary using keys. In this case, labels are used instead of keys.

Allowable inputs for the loc method

The loc method will accept any of the following input arguments:

  • A single label, e.g. 5, or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
  • A list or array of labels, e.g. ['a', 'b', 'c'].
  • A slice object with labels, e.g. 'a':'f'. (See note below.)
  • A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)

Note that for the third case, 'a':'f', contrary to usual python slices, both the start and the stop are included

This notebook will deal with the first four possibilities in the above list.

Allowable inputs for the iloc method

The allowable inputs for the iloc method are similar to those for the loc method, taking into account the difference between integers and labels. The iloc method will accept any of the following input arguments:

  • An integer, e.g. 5.
  • A list or array of integers, e.g. [4, 3, 0].
  • A slice object with ints, e.g. 1:7.
  • A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)

The .iloc method will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).

This notebook will deal with the first four possibilities in the above list.

Using slice notation

The general syntax for accessing a subset of a DataFrame object is:

dataFrameObject.loc[startrow:endrow, startcolumn:endcolumn]

Slice notation with the loc method

Access a rectangular subset

The following code will extract a rectangular subset (6x5) of rows and columns from the original DataFrame object and save it in a new DataFrame object.

In [6]:
TestData02 = TestData01.loc['2006-01-06':'2006-01-13','Open':'Volume']
TestData02.head(7)
Out[6]:
Open High Low Close Volume
Date
2006-01-06 42.88 43.57 42.80 43.21 29422828
2006-01-09 43.10 43.66 42.82 43.42 16268338
2006-01-10 42.96 43.34 42.34 42.98 16288580
2006-01-11 42.19 42.31 41.72 41.87 26192772
2006-01-12 41.92 41.99 40.76 40.89 18921686
2006-01-13 41.00 41.08 39.62 39.90 30966185

Access a single column

The following code will access and print a single column of data.

In [7]:
print(TestData02.loc[:,'Open':'Open'])
             Open
Date             
2006-01-06  42.88
2006-01-09  43.10
2006-01-10  42.96
2006-01-11  42.19
2006-01-12  41.92
2006-01-13  41.00

Access a single row

The following code will access and print a single row of data.

In [8]:
print(TestData02.loc['2006-01-10':'2006-01-10',:])
             Open   High    Low  Close    Volume
Date                                            
2006-01-10  42.96  43.34  42.34  42.98  16288580

Access a single value

The following code uses the slice syntax to access a single data value internal to the DataFrame object.

In [9]:
print(TestData02.loc['2006-01-10':'2006-01-10','High':'High'])
             High
Date             
2006-01-10  43.34

Slice notation with the iloc method

Access a rectangular subset

The iloc method may be more practical than the loc method, particularly for large datasets which are difficult to view in their entirety. It may be more practical to specify integer row and column index values than to specify labels.

The following code will extract a rectangular subset (6x5) of rows and columns from the original DataFrame object and save it in a new DataFrame object. This is the same rectangular subset as the one extracted using labels. Recall the previous note having to do with the start and stop values for integers and labels.

In [10]:
TestData02 = TestData01.iloc[3:9,0:5]
TestData02.head(7)
Out[10]:
Open High Low Close Volume
Date
2006-01-06 42.88 43.57 42.80 43.21 29422828
2006-01-09 43.10 43.66 42.82 43.42 16268338
2006-01-10 42.96 43.34 42.34 42.98 16288580
2006-01-11 42.19 42.31 41.72 41.87 26192772
2006-01-12 41.92 41.99 40.76 40.89 18921686
2006-01-13 41.00 41.08 39.62 39.90 30966185

Access a single column

The following code will access and print a single column of data.

In [11]:
print(TestData02.iloc[:,0:1])
             Open
Date             
2006-01-06  42.88
2006-01-09  43.10
2006-01-10  42.96
2006-01-11  42.19
2006-01-12  41.92
2006-01-13  41.00

Access a single row

The following code will access and print a single row of data.

In [12]:
print(TestData02.iloc[2:3,:])
             Open   High    Low  Close    Volume
Date                                            
2006-01-10  42.96  43.34  42.34  42.98  16288580

Access a single value

The follow code uses the slice syntax to access a single data value internal to the DataFrame object.

In [13]:
print(TestData02.iloc[2:3,1:2])
             High
Date             
2006-01-10  43.34

Using the single label or single integer approach

Access a single row using the loc method

The following code will access and print a single row of data using the single label approach.

In [14]:
print(TestData02.loc['2006-01-10'])
Open            42.96
High            43.34
Low             42.34
Close           42.98
Volume    16288580.00
Name: 2006-01-10, dtype: float64

Access a single column using the loc method

Unfortunately, the approach used in the code above cannot be used directly to access a single column. An attempt to do so raises the follow error:

the label [Open] is not in the [index]
By default, the loc method seems to expect the single label to be a row label and refuses to accept a column label. There is probably a straightforward solution for this issue, but I have yet to find it after expending a great deal of time searching the web.

There is an obvious workaround, which I will illustrate below. The workaround is to transpose the DataFrame object before calling the loc method with the single label.

While this workaround code is easy enough to write, that hardly seems worthwhile since it is so easy to access a single column using the slicing syntax shown above. Also, the transpose method may be expensive in terms of computer time used, especially for very large datasets.

In [15]:
print(TestData02.transpose().loc['Open'])
Date
2006-01-06    42.88
2006-01-09    43.10
2006-01-10    42.96
2006-01-11    42.19
2006-01-12    41.92
2006-01-13    41.00
Name: Open, dtype: float64

Access a single row using the iloc method

The following code will access and print a single row of data using the single integer approach.

In [16]:
print(TestData02.iloc[2])
Open            42.96
High            43.34
Low             42.34
Close           42.98
Volume    16288580.00
Name: 2006-01-10, dtype: float64

Access a single column using the iloc method

The same problem and the same workaround mentioned earlier applies here also. The workaround is to transpose the DataFrame object before calling the iloc method with the single integer.

In [17]:
print(TestData02.transpose().iloc[0])
Date
2006-01-06    42.88
2006-01-09    43.10
2006-01-10    42.96
2006-01-11    42.19
2006-01-12    41.92
2006-01-13    41.00
Name: Open, dtype: float64

Using the list of labels or integers approach

Access several rows using a list of labels and the loc method

The following code will access and print three rows of data using a list of labels.

In [18]:
print(TestData02.loc[['2006-01-09','2006-01-11','2006-01-13']])
             Open   High    Low  Close    Volume
Date                                            
2006-01-09  43.10  43.66  42.82  43.42  16268338
2006-01-11  42.19  42.31  41.72  41.87  26192772
2006-01-13  41.00  41.08  39.62  39.90  30966185

Access several columns using a list of labels and the loc method

The following code will access and print three columns of data using a list of labels. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the loc method and then transpose the result to get back into the original format.

In [19]:
print(  (TestData02.transpose().loc[['Open','Low','Volume']]).transpose() )
             Open    Low      Volume
Date                                
2006-01-06  42.88  42.80  29422828.0
2006-01-09  43.10  42.82  16268338.0
2006-01-10  42.96  42.34  16288580.0
2006-01-11  42.19  41.72  26192772.0
2006-01-12  41.92  40.76  18921686.0
2006-01-13  41.00  39.62  30966185.0

Access several rows using a list of labels and the iloc method

The following code will access and print three rows of data using a list of integers.

In [20]:
print(TestData02.iloc[[0,2,4]])
             Open   High    Low  Close    Volume
Date                                            
2006-01-06  42.88  43.57  42.80  43.21  29422828
2006-01-10  42.96  43.34  42.34  42.98  16288580
2006-01-12  41.92  41.99  40.76  40.89  18921686

Access several columns using a list of integers and the iloc method

The following code will access and print three columns of data using a list of integers. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the iloc method and then transpose the result to get back into the original format.

In [21]:
print(  (TestData02.transpose().iloc[[0,2,4]]).transpose() )
             Open    Low      Volume
Date                                
2006-01-06  42.88  42.80  29422828.0
2006-01-09  43.10  42.82  16268338.0
2006-01-10  42.96  42.34  16288580.0
2006-01-11  42.19  41.72  26192772.0
2006-01-12  41.92  40.76  18921686.0
2006-01-13  41.00  39.62  30966185.0

Access several rows using a list of booleans and the loc method

The following code will access and print four rows of data using a list of booleans.

In [22]:
mask =[True,False,True,False,True,True]
print(TestData02.loc[mask])
             Open   High    Low  Close    Volume
Date                                            
2006-01-06  42.88  43.57  42.80  43.21  29422828
2006-01-10  42.96  43.34  42.34  42.98  16288580
2006-01-12  41.92  41.99  40.76  40.89  18921686
2006-01-13  41.00  41.08  39.62  39.90  30966185

Access several columns using a list of booleans and the loc method

The following code will access and print three columns of data using a list of booleans. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the loc method and then transpose the result to get back into the original format.

In [23]:
mask =[True,False,True,False,True]
print(  (TestData02.transpose().loc[mask]).transpose() )
             Open    Low      Volume
Date                                
2006-01-06  42.88  42.80  29422828.0
2006-01-09  43.10  42.82  16268338.0
2006-01-10  42.96  42.34  16288580.0
2006-01-11  42.19  41.72  26192772.0
2006-01-12  41.92  40.76  18921686.0
2006-01-13  41.00  39.62  30966185.0

Access several rows using a list of booleans and the iloc method

The following code will access and print four rows of data using a list of booleans.

In [24]:
mask =[True,False,True,False,True,True]
print(TestData02.iloc[mask])
             Open   High    Low  Close    Volume
Date                                            
2006-01-06  42.88  43.57  42.80  43.21  29422828
2006-01-10  42.96  43.34  42.34  42.98  16288580
2006-01-12  41.92  41.99  40.76  40.89  18921686
2006-01-13  41.00  41.08  39.62  39.90  30966185

Access several columns using a list of booleans and the iloc method

The following code will access and print three columns of data using a list of booleans. However, the same issue discussed earlier exists here also. The workaround here is to transpose the DataFrame object before calling the loc method and then transpose the result to get back into the original format.

In [25]:
mask =[True,False,True,False,True]
print(  (TestData02.transpose().iloc[mask]).transpose() )
             Open    Low      Volume
Date                                
2006-01-06  42.88  42.80  29422828.0
2006-01-09  43.10  42.82  16268338.0
2006-01-10  42.96  42.34  16288580.0
2006-01-11  42.19  41.72  26192772.0
2006-01-12  41.92  40.76  18921686.0
2006-01-13  41.00  39.62  30966185.0

Using the dictionary-style approach

Without further comment, the following code can be used to access a single column or a single row of data from the DataFrame object using the dictionary-style approach.

In [26]:
print(TestData02['Open'])
Date
2006-01-06    42.88
2006-01-09    43.10
2006-01-10    42.96
2006-01-11    42.19
2006-01-12    41.92
2006-01-13    41.00
Name: Open, dtype: float64
In [27]:
print(  TestData02.transpose()['2006-01-10'] )
Open            42.96
High            43.34
Low             42.34
Close           42.98
Volume    16288580.00
Name: 2006-01-10, dtype: float64

Plot the data in a subset

The following code will extract a subset of data and plot the data in the subset.

In [28]:
TestData02 = TestData01.loc['2010-10-14':'2010-11-15','Open':'Close']
TestData02.plot(xticks=range(len(TestData02.index)),use_index=True, 
                rot=90)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x9d99930>

Housekeeping material

Author: Prof. Richard G. Baldwin

Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.

File: PandasDataFrame02.html

Revised: 09/02/18

Copyright 2018 Richard G. Baldwin