Pandas Series part 1 - creation and use

According to Introduction into Pandas,

A Series is a one-dimensional labelled array-like object. It is capable of holding any data type, e.g. integers, floats, strings, Python objects, and so on. It can be seen as a data structure with two arrays: one functioning as the index, i.e. the labels, and the other one contains the actual data.

This notebook provides an introduction to Pandas Series along with a simple example showing how a Series can be used to create a bar chart.

Import required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Enable inline plotting
%matplotlib inline

Practical description

A Pandas Series is something like a cross between a one-dimensional NumPy array and a Python dictionary. (A NumPy structured array also falls in that mix somewhere.)

A NumPy array

The elements in a one-dimensional NumPy array are accessed by a unique numeric index, with the index ranging from 0 to one less than the number of elements in the array as shown below.

In [2]:
anArray = np.array([10,20,30])
print(anArray)
print("-----")
print(anArray[0])
print(anArray[1])
print(anArray[2])
[10 20 30]
-----
10
20
30

The default index for a Pandas Series is also a numeric index ranging from 0 to one less than the number of elements in the series. The elements in the Series can also be accessed by referring to the numeric indices.

A Python dictionary

A Python dictionary is a mapping of unique keys to values. The keys can be any immutable type with strings and numbers probably being the most common types of keys.

It is best to think of a dictionary as a set of key: value pairs, with the requirement that the keys are unique. (Since Python 3.7, a dictionary also preserves the order in which its keys were inserted.)

The values that are stored in a dictionary are accessed by referring to the keys as shown below.

In [3]:
aDict = {'Tom':32,'Dick':65,'Harry':21}
print(aDict)
print("-----")
print(aDict['Tom'])
print(aDict['Dick'])
print(aDict['Harry'])
{'Tom': 32, 'Dick': 65, 'Harry': 21}
-----
32
65
21

The indices in a Pandas Series are similar to the keys in a Python dictionary, except that they need not be unique. The data values in a Pandas Series can also be accessed by referring to the indices.
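
The difference from a dictionary can be sketched with a small example. The data below is hypothetical and is chosen only so that one label is repeated; it is a minimal illustration rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical data: the label 'Tom' appears twice in the index.
ages = pd.Series([39, 42, 15], index=['Tom', 'Dick', 'Tom'])

# The index reports whether all of its labels are distinct.
print(ages.index.is_unique)

# A label that occurs once yields a single value...
dickAge = ages['Dick']
print(dickAge)

# ...while a repeated label yields a Series holding every match.
tomAges = ages['Tom']
print(tomAges)
```

A dictionary could not hold both 'Tom' entries, because the second assignment would overwrite the first.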

Creating a Series object

We can create a Pandas Series object with default numeric indices by calling the pandas.Series constructor and passing a list of data values as an argument as shown below.

(Note that printing a Series object displays it as a two-column table with the index values in the left column and the data values in the right column.)

In [4]:
aSeries = pd.Series(['Tom','Dick','Harry'])
print(aSeries)
print("-----")
print(aSeries[0])
print(aSeries[1])
print(aSeries[2])
0      Tom
1     Dick
2    Harry
dtype: object
-----
Tom
Dick
Harry

A Pandas Series object consists of an index array and a data array. In the above example, we didn't specify the index so the index array was populated with the default integer values from 0 to the number of data elements minus one.

Display index and data independently

The print call that was executed above displayed the index and the data in the form of a table with the index in the left column and the data values in the right column.

We can also display the index and data values independently as shown below.

In [5]:
print(aSeries.index)
print(aSeries.values)
RangeIndex(start=0, stop=3, step=1)
['Tom' 'Dick' 'Harry']
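
If plain Python or NumPy objects are needed for further processing, the index and the data can be converted as in the following sketch. It assumes a reasonably recent Pandas release, because Series.to_numpy was added in version 0.24; the variable names are illustrative only.

```python
import pandas as pd

names = pd.Series(['Tom', 'Dick', 'Harry'])

# Convert the index labels to an ordinary Python list.
indexAsList = names.index.tolist()

# Convert the data values to a NumPy array.
valuesAsArray = names.to_numpy()

print(indexAsList)
print(valuesAsArray)
print(type(valuesAsArray))
```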

Specify both index and data values

The following code shows how to create a Series object with specified values for both the index and the data. In this case, we will use names for the index and numeric values for the data values. Once again, the index is displayed in the left column and the data values are displayed in the right column.

In [6]:
ageIndex = ['Tom','Dick','Harry']
ageValues = [39,42,65]
ageSeries = pd.Series(ageValues,ageIndex)
print(ageSeries)
Tom      39
Dick     42
Harry    65
dtype: int64

Also, once again, we can access and print the index values and the data values independently of one another.

In [7]:
print(ageSeries.index)
print(ageSeries.values)
Index(['Tom', 'Dick', 'Harry'], dtype='object')
[39 42 65]

Positional versus named arguments

The data and index arguments can be treated as positional arguments and passed in the order shown above. They can also be treated as named arguments and entered in a different order as shown below. (Note that it is not necessary for the index values to be unique.)

In [8]:
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [39,42,65,15]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
Tom      39
Dick     42
Harry    65
Tom      15
dtype: int64

Creating a series from a dictionary

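We can create a Series from a dictionary, in which case the dictionary keys become the index and the dictionary values become the data. The output shown below has the indices sorted, which is how older versions of Pandas behaved. Since Pandas 0.23 (running on Python 3.6 or later), the Series keeps the insertion order of the dictionary keys instead.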

In [9]:
aDict = {'Tom':32,'Dick':65,'Harry':21}
print(aDict)
print("-----")
ageSeries = pd.Series(aDict)
print(ageSeries)
{'Tom': 32, 'Dick': 65, 'Harry': 21}
-----
Dick     65
Harry    21
Tom      32
dtype: int64
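
Because the order produced by the constructor has differed between Pandas versions, it may be safer to request the order explicitly when it matters. The following sketch shows two ways to do that: sorting by label with sort_index, and passing the desired labels through the index argument. The names sortedSeries and orderedSeries are illustrative.

```python
import pandas as pd

aDict = {'Tom': 32, 'Dick': 65, 'Harry': 21}

# Sort by label explicitly rather than relying on the constructor.
sortedSeries = pd.Series(aDict).sort_index()
print(sortedSeries)
print("-----")

# Or choose the order by listing the labels in the index argument.
# Each label is looked up in the dictionary to find its value.
orderedSeries = pd.Series(aDict, index=['Harry', 'Tom', 'Dick'])
print(orderedSeries)
```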

Adding series with same index

As is the case with NumPy arrays, Pandas Series objects can be added together.

If the two objects have the same index as shown below, the resulting data values are the sums of the individual data values in each of the series.

(Note that the sum of two strings is a new string that is the concatenation of the two original strings.)

In [10]:
ageIndex = ['Tom','Dick','Harry']

ageValuesA = [1,2,'Sue']
ageSeriesA = pd.Series(ageValuesA,ageIndex)
print(ageSeriesA)
print("-----")

ageValuesB = [3,4,'Bill']
ageSeriesB = pd.Series(index=ageIndex,data=ageValuesB,)
print(ageSeriesB)
print("-----")

# The result is a Series, so name it accordingly
ageSeriesC = ageSeriesA + ageSeriesB
print(ageSeriesC)
Tom        1
Dick       2
Harry    Sue
dtype: object
-----
Tom         3
Dick        4
Harry    Bill
dtype: object
-----
Tom            4
Dick           6
Harry    SueBill
dtype: object

Adding series with different indices

Pandas Series need not have the same indices in order to add them. However, if a given index value doesn't appear in both indices, the resulting sum of the data values is NaN, which means "not a number". This is illustrated by the following code, where the labels 'Joe' and 'Dick' each appear in only one of the two indices.

(I will have a lot more to say about NaN in future notebooks.)

In [11]:
ageIndexA = ['Tom','Dick','Harry']

ageValuesA = [1,2,'Sue']
ageSeriesA = pd.Series(ageValuesA,ageIndexA)
print(ageSeriesA)
print("-----")

ageIndexB = ['Tom','Joe','Harry']
ageValuesB = [3,2,'Bill']
ageSeriesB = pd.Series(index=ageIndexB,data=ageValuesB,)
print(ageSeriesB)
print("-----")

# The result is a Series, so name it accordingly
ageSeriesC = ageSeriesA + ageSeriesB
print(ageSeriesC)
Tom        1
Dick       2
Harry    Sue
dtype: object
-----
Tom         3
Joe         2
Harry    Bill
dtype: object
-----
Dick         NaN
Harry    SueBill
Joe          NaN
Tom            4
dtype: object
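
When NaN is not the desired result for unmatched labels, one option is the Series.add method with its fill_value parameter, which substitutes a value for a label that is missing from one side before adding. The sketch below uses hypothetical numeric data rather than the mixed data above.

```python
import pandas as pd

scoresA = pd.Series([1, 2, 3], index=['Tom', 'Dick', 'Harry'])
scoresB = pd.Series([10, 20, 30], index=['Tom', 'Joe', 'Harry'])

# The + operator gives NaN for labels found in only one Series.
plainSum = scoresA + scoresB
print(plainSum)
print("-----")

# add() with fill_value=0 treats a missing label as zero instead.
filledSum = scoresA.add(scoresB, fill_value=0)
print(filledSum)
```

Note that both results have a floating-point dtype, because the alignment step introduces missing values into integer data.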

Access data values by index using loc and iloc

As with NumPy arrays and Python dictionaries, we can access individual data values in a Pandas Series by referring to the index corresponding to a particular data element.

Access a single value

The following code uses the pandas.Series.loc attribute to access a single value by an index label. Then it uses the pandas.Series.iloc attribute to access the same value using the corresponding numeric position.

In [12]:
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [39,42,65,15]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
print('Access with loc:', ageSeries.loc['Dick'])
print('-----')
print('Access with iloc:',ageSeries.iloc[1])
Tom      39
Dick     42
Harry    65
Tom      15
dtype: int64
-----
Access with loc: 42
-----
Access with iloc: 42
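
The distinction between loc and iloc is easiest to see when the labels are themselves integers that differ from the positions. The following is a minimal sketch with hypothetical data.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions.
letters = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

byLabel = letters.loc[10]      # the element whose label is 10
byPosition = letters.iloc[0]   # the element at position 0
print(byLabel, byPosition)

# There is no label 0, so a label-based lookup of 0 fails.
try:
    letters.loc[0]
except KeyError:
    print('loc[0] raised KeyError')
```

Using loc and iloc explicitly avoids any ambiguity about whether a number is meant as a label or as a position.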

Access multiple values using a list

We can also access multiple data values in a Series by providing a list of index values as shown below.

When using loc, this example accesses the data value associated with the index 'Dick' and also accesses both of the data values associated with the duplicated index 'Tom', even though 'Tom' is included only once in the access list.

However, when using iloc, all three numeric indices must be included in the access list to get the same results.

In [13]:
print('Access with loc:')
print(ageSeries.loc[['Dick','Tom']])
print('-----')
print('Access with iloc:')
print(ageSeries.iloc[[1,0,3]])
Access with loc:
Dick    42
Tom     39
Tom     15
dtype: int64
-----
Access with iloc:
Dick    42
Tom     39
Tom     15
dtype: int64

Access multiple values using slice-like indexing

We can also use slice-like indexing with both loc and iloc. Note that the Series in this example contains two more elements than the one used in the examples above. Also note that when slicing with loc, contrary to usual Python slices, both the start and the stop are included.

In [14]:
ageIndex = ['Tom','Dick','Harry','Tom','Sue','Bill']
ageValues = [39,42,65,15,23,42]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
print('Access with loc:')
print(ageSeries.loc['Dick':'Sue'])
Tom      39
Dick     42
Harry    65
Tom      15
Sue      23
Bill     42
dtype: int64
-----
Access with loc:
Dick     42
Harry    65
Tom      15
Sue      23
dtype: int64

The following example uses standard slicing syntax along with iloc to access the same elements using the numeric index.

In [15]:
print('Access with iloc:')
print(ageSeries.iloc[1:5])
Access with iloc:
Dick     42
Harry    65
Tom      15
Sue      23
dtype: int64
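
The difference in how the two kinds of slice treat the stop value can be checked directly. This sketch uses a small hypothetical Series with unique labels.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Label-based slice: both 'b' and 'c' are included.
labelSlice = numbers.loc['b':'c']
print(labelSlice)
print('-----')

# Position-based slice: position 1 is included, position 2 is not.
positionSlice = numbers.iloc[1:2]
print(positionSlice)
```

To select the same two elements by position, the stop must be one past the last wanted position, that is numbers.iloc[1:3].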

Access multiple values using a list of boolean values

A list of boolean values can also be used to select the elements in the Series on the basis of position provided that the number of boolean values in the list matches the number of elements in the Series. This works with both loc and iloc.

In [16]:
print('Access with loc:')
print(ageSeries.loc[[False,True,True,False,True,False]])
print('-----')
print('Access with iloc:')
print(ageSeries.iloc[[False,True,False,False,True,False]])
Access with loc:
Dick     42
Harry    65
Sue      23
dtype: int64
-----
Access with iloc:
Dick    42
Sue     23
dtype: int64
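
In practice the boolean values are rarely typed by hand. A common pattern, sketched below, is to build them with a comparison on the Series itself. The name olderThan40 is illustrative; for iloc the mask is converted to a NumPy array first, on the assumption that iloc expects positional rather than labelled booleans.

```python
import pandas as pd

ageIndex = ['Tom', 'Dick', 'Harry', 'Tom', 'Sue', 'Bill']
ageValues = [39, 42, 65, 15, 23, 42]
ageSeries = pd.Series(index=ageIndex, data=ageValues)

# A comparison returns a boolean Series with the same index.
olderThan40 = ageSeries > 40

# loc accepts the boolean Series directly.
selected = ageSeries.loc[olderThan40]
print(selected)

# iloc is given the underlying boolean array.
selectedByPosition = ageSeries.iloc[olderThan40.to_numpy()]
print(selectedByPosition)
```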

Scalar operations

As with NumPy arrays, we can perform scalar arithmetic operations on Pandas Series as shown below.

In [17]:
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [10,20,30,40]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = ageSeries + 2
print(ageSeries)
print('-----')
ageSeries = ageSeries * 2
print(ageSeries)
Tom      10
Dick     20
Harry    30
Tom      40
dtype: int64
-----
Tom      12
Dick     22
Harry    32
Tom      42
dtype: int64
-----
Tom      24
Dick     44
Harry    64
Tom      84
dtype: int64
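
Two details of scalar arithmetic are worth a quick check: the operation returns a new Series and leaves the original untouched, and the dtype of the result can differ from that of the original. The following sketch uses hypothetical values.

```python
import pandas as pd

original = pd.Series([10, 20, 30], index=['Tom', 'Dick', 'Harry'])

# True division produces a new Series of floating-point values.
halved = original / 2

print(original)        # still holds the integer values
print('-----')
print(halved)
print(halved.dtype)
```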

Mathematical operations

We can also apply mathematical functions to Pandas Series as shown below. This code passes the Series object to the numpy.sqrt function, which returns a new Series containing the square root of each data value. Assigning the result back to ageSeries replaces the original object.

In [18]:
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [10,20,30,40]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = np.sqrt(ageSeries)
print(ageSeries)
Tom      10
Dick     20
Harry    30
Tom      40
dtype: int64
-----
Tom      3.162278
Dick     4.472136
Harry    5.477226
Tom      6.324555
dtype: float64

Calling methods on a Series object

Once you have a Pandas Series object, there are many methods that you can call on the object such as abs, add, and apply. For example, two such methods are shown below.

The abs method returns a new Series object containing the absolute values of each of the original data values.

The apply method provides an alternative way to call a specified function on the object.

In [19]:
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [-10,20,-30,40]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = ageSeries.abs()
print(ageSeries)
print('-----')
print(ageSeries.apply(np.sqrt))
Tom     -10
Dick     20
Harry   -30
Tom      40
dtype: int64
-----
Tom      10
Dick     20
Harry    30
Tom      40
dtype: int64
-----
Tom      3.162278
Dick     4.472136
Harry    5.477226
Tom      6.324555
dtype: float64

The apply method can also be used with lambda functions as shown below.

In [20]:
ageIndex = ['Tom','Dick','Harry','Tom','Sue','Mary','Ted','Bill']
ageValues = [0,1,2,3,4,5,6,7]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = ageSeries.apply(lambda x: x if x > 3 else x*2)
print(ageSeries)
Tom      0
Dick     1
Harry    2
Tom      3
Sue      4
Mary     5
Ted      6
Bill     7
dtype: int64
-----
Tom      0
Dick     2
Harry    4
Tom      6
Sue      4
Mary     5
Ted      6
Bill     7
dtype: int64
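
For a simple conditional like this one, a vectorized alternative to apply is the Series.where method, which keeps each value where the condition is True and takes the replacement value elsewhere. The sketch below is one possible equivalent, using the same data values with a default index.

```python
import pandas as pd

ageSeries = pd.Series([0, 1, 2, 3, 4, 5, 6, 7])

# Element-by-element version using apply and a lambda.
viaApply = ageSeries.apply(lambda x: x if x > 3 else x * 2)

# Vectorized version: keep values greater than 3, double the rest.
viaWhere = ageSeries.where(ageSeries > 3, ageSeries * 2)

print(viaWhere)
print(viaWhere.tolist() == viaApply.tolist())
```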

A bar chart

The following program shows how a Pandas series can be used to create a bar chart. This example displays the individual scores of five ball players for two games, along with the sum of the scores of the individual players for the two games both in printed text and also in a bar chart.

In [21]:
playerNames = ['Alex','Bob','Don','Frank','Harry'] #player names

game01Scores = [78,62,72,60,75] #player scores for game01
game01Series = pd.Series(game01Scores,playerNames)
print('Game 1')
print(game01Series)
print('-----')

game02Scores = [34,70,28,54,36] #player scores for game02
game02Series = pd.Series(game02Scores,playerNames)
print('Game 2')
print(game02Series)
print('-----')

# Use a name that does not shadow the built-in sum function
sumSeries = game01Series + game02Series
print('Sum')
print(sumSeries)

# Display the information in a bar chart
xPosition = np.arange(5)
width = 0.25

fig,ax = plt.subplots(1,1)
ax.bar(xPosition,game01Series.values,width,label="Game 1")
ax.bar(xPosition+width,game02Series.values,width,label="Game 2"
       ,tick_label=game02Series.index)
ax.bar(xPosition+width*2,sumSeries.values,width,label="Sum")
ax.set_xlabel('Player Name')
ax.set_ylabel('Scores')
ax.set_title('Player Scores')

ax.grid()
plt.legend(loc='lower left')
plt.show()
Game 1
Alex     78
Bob      62
Don      72
Frank    60
Harry    75
dtype: int64
-----
Game 2
Alex     34
Bob      70
Don      28
Frank    54
Harry    36
dtype: int64
-----
Sum
Alex     112
Bob      132
Don      100
Frank    114
Harry    111
dtype: int64
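
A Series can also draw itself. The following sketch plots only the summed scores by calling the Series.plot.bar method, which uses the index labels as the tick labels. The name totalSeries is illustrative, and the call to matplotlib.use("Agg") is an assumption made so that the sketch also runs as a plain script; it can be omitted in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

playerNames = ['Alex', 'Bob', 'Don', 'Frank', 'Harry']
game01Series = pd.Series([78, 62, 72, 60, 75], index=playerNames)
game02Series = pd.Series([34, 70, 28, 54, 36], index=playerNames)
totalSeries = game01Series + game02Series

# Let the Series draw one bar per element on the supplied axes.
fig, ax = plt.subplots(1, 1)
totalSeries.plot.bar(ax=ax)
ax.set_xlabel('Player Name')
ax.set_ylabel('Scores')
ax.set_title('Total Player Scores')
```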

Housekeeping material

Author: Prof. Richard G. Baldwin

Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.

File: PandasSeries01.html

Revised: 08/30/18

Copyright 2018 Richard G. Baldwin