Pandas DataFrame part 4 - creating DataFrame objects

There are at least four different ways to create a DataFrame object. That will be the focus of this notebook.

Four ways to create a DataFrame object

You can create a DataFrame object in any of the following four ways, and possibly other ways as well:

  • By calling the pandas.read_csv method (and other similar methods) to load a dataset file into a DataFrame object.
  • By extracting a subset of the data in an existing DataFrame object into a new DataFrame object.
  • By concatenating Series objects.
  • By calling the DataFrame constructor and passing an appropriate set of arguments to the constructor.

Calling the read_csv method

You have already seen numerous examples of this in previous notebooks so it shouldn't be necessary to explain it further. It is worth mentioning, however, that there are other similar methods such as read_excel and read_json that also create DataFrame objects.
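As a brief reminder, the following is a minimal sketch of loading CSV data into a DataFrame object. The CSV text and the names csv_text and scores are made up for illustration, and the data is read from an in-memory buffer rather than from a file so that the example needs nothing on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so that no file is needed.
csv_text = "name,score\nTom,90\nDick,85\nHarry,77\n"

# read_csv accepts a file path or a file-like object and returns a DataFrame.
scores = pd.read_csv(io.StringIO(csv_text))

print(type(scores).__name__)
print(scores.shape)
```

With a real dataset you would normally pass the file name to read_csv instead of a buffer.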

Extracting a subset of data

You have already seen numerous examples of this approach in previous notebooks, so it also shouldn't be necessary to explain this further.
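For quick reference, here is a small sketch of three common ways to extract a subset into a new DataFrame object. The data and the variable names are hypothetical and are not taken from the earlier notebooks.

```python
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame(
    {"age": [31, 45, 27, 52],
     "city": ["Austin", "Dallas", "Austin", "Houston"]},
    index=["Tom", "Dick", "Harry", "Bill"],
)

# A list of column labels returns a DataFrame (a single label returns a Series).
ages_only = people[["age"]]

# A boolean mask keeps only the rows where the condition is True.
austin_rows = people[people["city"] == "Austin"]

# .loc selects rows and columns by label; .copy() makes the result independent
# of the original frame.
subset = people.loc[["Tom", "Bill"], ["age"]].copy()
```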

Concatenating Series objects

You saw an example of this in an earlier notebook, so once again no further explanation should be necessary.
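As a short refresher, the sketch below concatenates two Series objects into a DataFrame object with pandas.concat. The Series contents and names are invented for illustration.

```python
import pandas as pd

# Two hypothetical Series that share the same row labels.
boats = pd.Series([1, 2, 3], index=["Tom", "Dick", "Harry"], name="Boats")
cars = pd.Series([5, 6, 7], index=["Tom", "Dick", "Harry"], name="Cars")

# axis=1 places the Series side by side as columns.
# The name of each Series becomes its column label.
combined = pd.concat([boats, cars], axis=1)
```

Rows are aligned on the index labels, so Series with different labels would produce missing values where a label is absent from one of them.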

Calling the DataFrame constructor

You have also seen a few examples of this approach in previous notebooks, but those examples were far from exhaustive with respect to the constructor arguments.

A good source of information on this topic can be found under Object Creation in 10 Minutes to pandas.

Constructor arguments

The positional constructor arguments are:

  • data : numpy ndarray (structured or homogeneous), dict, or DataFrame - Dict can contain Series, arrays, constants, or list-like objects
  • index : Index or array-like - Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
  • columns : Index or array-like - Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
  • dtype : dtype, default None - Data type to force. Only a single dtype is allowed. If None, infer
  • copy : boolean, default False - Copy data from inputs. Only affects DataFrame / 2d ndarray input

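The default values for the first four arguments are None. The default value for the last argument is False. (These defaults are the ones documented when this notebook was written; more recent pandas releases document copy with a default of None.)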
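The effect of the defaults, and of the dtype and copy arguments, can be seen in a small sketch like the following. The array contents and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# No index or columns supplied, so both default to a RangeIndex.
defaults = pd.DataFrame(values)

# dtype forces a single data type on every column.
as_float = pd.DataFrame(values, dtype="float64")

# copy=True gives the frame its own copy of the data, so a later change
# to the source array does not show up in the frame.
copied = pd.DataFrame(values, copy=True)
values[0, 0] = 99
```

Whether a frame built without copy=True reflects the change to the source array depends on the pandas version and its copy-on-write settings, so the sketch only relies on the explicit copy.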

As you can see, many combinations are possible. This notebook will illustrate two approaches that call the DataFrame constructor to create a new DataFrame object.

First approach

This approach constructs a DataFrame object by passing an ndarray of random numbers for the data, and passing lists of strings for the index and columns arguments. It accepts the default values for the arguments named dtype and copy.

In [1]:
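import pandas as pd
import numpy as np

# 6 rows by 4 columns of samples from the standard normal distribution
dta = np.random.randn(6, 4)
# Row labels and column labels
idx = ['Tom', 'Dick', 'Harry', 'Joe', 'Bill', 'Albert']
col = ['First', 'Second', 'Third', 'Fourth']
dataFrame01 = pd.DataFrame(data=dta,
                           index=idx,
                           columns=col)
# head(7) asks for up to 7 rows; the frame has only 6, so all are shown
dataFrame01.head(7)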
Out[1]:
           First    Second     Third    Fourth
Tom     0.886820 -1.848080  1.085854  0.268027
Dick    2.374316  0.289112 -0.540536  0.108514
Harry  -2.357538 -0.049196 -1.074118 -0.539682
Joe    -0.259197  0.474051 -0.825370  0.280995
Bill   -2.639130  0.695665 -1.954999 -1.864397
Albert -1.093242 -0.397612  0.811151 -0.031438
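Because the data is random, the values will differ each time the cell is run. If repeatable values are wanted, one option is to draw them from a seeded generator, as in the sketch below. This is a variation on the approach above, not part of the original example; the seed value 0 and the name seeded_frame are arbitrary choices, and numpy.random.default_rng requires NumPy 1.17 or later.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same values on every run.
# The seed 0 is an arbitrary choice.
rng = np.random.default_rng(0)
dta = rng.standard_normal((6, 4))

idx = ['Tom', 'Dick', 'Harry', 'Joe', 'Bill', 'Albert']
col = ['First', 'Second', 'Third', 'Fourth']
seeded_frame = pd.DataFrame(data=dta, index=idx, columns=col)
```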

Second approach

This approach uses a dictionary object to specify the data and the columns. Then it uses a list to specify the row index. It accepts the default values for the arguments named dtype and copy.

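Note that in older versions of pandas (before version 0.23, or when running on a Python version earlier than 3.6), this approach sorts the columns on the basis of column names, which may or may not be what you want. The output shown below reflects that sorted order. More recent versions keep the columns in the order in which the dictionary keys were written.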

In [2]:
dataFrame02 = pd.DataFrame({ 
    'Fruits' : ['Peach','Pair','Apple','Orange'],
    'Boats' : [1,2,3,4],
    'Cars' : [5,6,7,8]},
    index=['Tom','Dick','Harry','Bill'])

dataFrame02.head(5)
Out[2]:
       Boats  Cars  Fruits
Tom        1     5   Peach
Dick       2     6    Pair
Harry      3     7   Apple
Bill       4     8  Orange
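If a particular column order matters, one way to avoid relying on the default behavior is to pass the columns argument along with the dictionary, as sketched below. The dictionary is similar to the one above, but the data and the names data, ordered, and reordered are chosen for illustration.

```python
import pandas as pd

# Hypothetical data, similar in shape to the example above.
data = {
    'Fruits': ['Peach', 'Plum', 'Apple', 'Orange'],
    'Boats': [1, 2, 3, 4],
    'Cars': [5, 6, 7, 8],
}

# With dictionary data, the columns argument selects the keys to use
# and fixes the order in which they appear.
ordered = pd.DataFrame(data,
                       index=['Tom', 'Dick', 'Harry', 'Bill'],
                       columns=['Fruits', 'Boats', 'Cars'])

# An existing frame can also be reordered by indexing with a list of labels.
reordered = ordered[['Cars', 'Boats', 'Fruits']]
```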

Housekeeping material

Author: Prof. Richard G. Baldwin

Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.

File: PandasDataFrame04.html

Revised: 09/02/18

Copyright 2018 Richard G. Baldwin