This notebook will introduce you to some of the ways to read a dataset into a Pandas DataFrame
Pandas provides support for reading datasets in a variety of formats, including:
This notebook will concentrate on reading delimiter separated files. Once you understand how to do that, you should be able to use the documentation to determine how to read the other formats on your own.
A delimiter separated file is simply a text file in a format where a unique character is used to separate one data value from the next. The most common types are probably comma-separated files and tab-separated files.
To avoid confusion, it is important to note that insofar as Pandas is concerned, the file extension doesn't matter. A text file containing tab-separated values could have an extension of .pdq, .txt, .csv, or no extension at all and it would still be treated the same by the pandas.read_csv() method. It is the internal structure and not the file extension that is important in the treadment of the file by the pandas.read_csv() method.
Let's begin with a very simple tab-separated text file with an arbitrary extension of .pdq. The raw text file looks like this where all of the values are separated by tab characters:
"Year" "Avg" "Min" "Months" 2016 69,342.20 42,654.11 12 2015 72,133.63 63,128.19 6 2014 55,469.15 32,336.12 9 2013 61,661.20 41,664.55 10 2012 76,359.17 67,333.25 8 2011 52,852.31 41,621.39 6
(Note that a real-world dataset file will often contain dozens of columns and hundreds, or perhaps thousands of rows.)
This dataset file could not be handled as a comma-separated file because the values contain commas. Therefore, it is necessary to use some other unique character to separate the values. The tab character is simply one of the unique characters that could be used.
The Pandas default is to consider the separator to be a comma. If it is anything else, it must be specified in one of the arguments to the pandas.read_csv() method. Many different arguments, most with default values, can be specified when the pandas.read_csv() method is called. As you will see later, in this case, the separator is specified using the '\t' escape character.
Insert the required import statements and then call the pandas.read_csv() method to read the specified data file. This data file is located in a sub-directory named 'data'. The incoming data values will be stored in a DataFrame object pointed to by the variable named TestDataA.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Enable inline plotting
%matplotlib inline
TestDataA = pd.read_csv('data/TabDelimited01.pdq',sep='\t')
Now call the head() method on the DataFrame object to display the first eight lines of data (of which there are only six in this case).
TestDataA.head(8)
By default, the pandas.read_csv() method assumes a grid structure consisting of rows and columns. The contents of the cells in the first row are assumed to be column neaders, which also serve as indices for the columns. (I will show you how to override this assumption later if you choose to do so.) A zero-based numeric row index is created on the left side as shown above.
Note that this format is the opposite of the format of the DataFrame object that was constructed by concatenating Series objects in an earlier notebook. In that case, the default column index was a zero-based numeric index and the row index was defined by the common index of the Series objects being concatenated.
Sometimes, we would prefer to use one of the columns from the dataset file as the row index in place of the default numeric index. We can accomplish that by specifying a column, either by numeric index or by header label, as an argument to the pandas.read_csv() method to serve as the row index as shown below.
TestDataA = pd.read_csv('data/TabDelimited01.pdq',sep='\t',index_col='Year',thousands=',')
TestDataA.head(8)
As you can see, adding the argument shown below caused the default numeric row index to be replaced by the values in the 'Year' column.
index_col='Year'
(Also note that the commas in the numeric values have also disappeared in the display shown above. I will have more to say about this later.)
The values in the row and column indices are not required to be unique.
There are several different ways to access the data in a DataFrame object, and this will be the topic of a future notebook. For now, in order to emphasize the importance of the row and column indices, the following three examples use the indices to access and display data from the DataFrame.
The following syntax uses the DataFrame.loc method to access the single row identified by the index value 2014 and return it as a Series object.
TestDataA.loc[2014]
The following syntax uses the DataFrame.loc method to access the single column identified as 'Min'.
TestDataA.loc[2016:,'Min']
The following syntax uses the DataFrame.loc method to access the single value at the intersection of the 2014 row and the 'Min' column.
TestDataA.loc[2014,'Min']
By default, the values internal to the original DataFrame object shown above would not be treated as numeric data for arithmetic and other purposes, (such as plotting), because of the inclusion of the comma in the original data format.
The argument shown below was added to the argument list in the call to the pd.read_csv() method shown above to rectify that situation and to cause those values to be treated as numeric data as originally intended. Note that this also caused the commas to disappear from the display of the numeric values above, and also caused the original value of 32336.12 to be displayed as 32336.119999999999.
thousands=','
Sometimes we would like to transpose the DataFrame object, causing the rows to become columns and causing the columns to become rows. The DataFrame.transpose method is used in the following code to accomplish that. The code also displays the transposed DataFrame object.
TestDataB = TestDataA.transpose()
TestDataB.head()
I mentioned earier that by default the DataFrame.read_csv method uses the values in the first row of the CSV file for column headers. However, you don't have to accept that if it doesn't work for you. You can provide different values for column headers if you choose to do so. That is accomplished by the following code.
This code replaces 'Avg' and 'Min' with 'Average' and 'Minimum' when the data is read from the original CSV file. Rather than try to explain the details, which are somewhat complicated, I will simply refer you to the header and names arguments in the documentation for the DataFrame.read_csv method.
TestDataC = pd.read_csv('data/TabDelimited01.pdq',
sep='\t',
index_col='Year',
thousands=',',
header=0,
names=['Year','Average','Minimum','Months']
)
TestDataC.head(8)
As icing on the cake, before we leave this notebook, the following code uses the DataFrame.plot() method to plot the values in the 'Avg' row of the transposed DataFrame object named TestDataB.
seriesObjB = TestDataB.loc['Avg']
seriesObjB.head(8)
seriesObjB.plot(kind='line',figsize=(10,3),grid=True,legend=True)
Author: Prof. Richard G. Baldwin
Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.
File: DF-ReadData01.html
Revised: 08/30/18
Copyright 2018 Richard G. Baldwin