According to "Introduction into Pandas":
A Series is a one-dimensional labelled array-like object. It is capable of holding any data type, e.g. integers, floats, strings, Python objects, and so on. It can be seen as a data structure with two arrays: one functioning as the index, i.e. the labels, and the other one contains the actual data.
This notebook provides an introduction to Pandas Series along with a simple example showing how a Series can be used to create a bar chart.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Enable inline plotting
%matplotlib inline
A Pandas Series is something like a cross between a one-dimensional NumPy array and a Python dictionary. (A NumPy structured array also falls in that mix somewhere.)
The elements in a one-dimensional NumPy array are accessed by a unique numeric index. The index ranges from 0 to one less than the number of elements in the array, as shown below.
anArray = np.array([10,20,30])
print(anArray)
print("-----")
print(anArray[0])
print(anArray[1])
print(anArray[2])
The default index for a Pandas Series is also a numeric index ranging from 0 to one less than the number of elements in the series. The elements in the Series can also be accessed by referring to the numeric indices.
A Python dictionary is a mapping of unique keys to values. The keys can be any immutable type with strings and numbers probably being the most common types of keys.
It is best to think of a dictionary as a set of key: value pairs, with the requirement that the keys are unique. (Since Python 3.7, dictionaries also preserve the order in which the keys were inserted.)
The values that are stored in a dictionary are accessed by referring to the keys as shown below.
aDict = {'Tom':32,'Dick':65,'Harry':21}
print(aDict)
print("-----")
print(aDict['Tom'])
print(aDict['Dick'])
print(aDict['Harry'])
The indices in a Pandas Series are similar to the keys in a Python dictionary, except that they need not be unique. The data values in a Pandas Series can also be accessed by referring to the indices.
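The difference in how duplicates are handled can be sketched as follows. This is a minimal illustration rather than part of the original notebook, and the names dupDict and dupSeries are hypothetical.

```python
import pandas as pd

# A dictionary literal with a repeated key keeps only the last value
# that was assigned to that key.
dupDict = {'Tom': 32, 'Dick': 65, 'Tom': 15}
print(dupDict)
print("-----")

# A Series built from parallel lists keeps both 'Tom' entries.
dupSeries = pd.Series([32, 65, 15], index=['Tom', 'Dick', 'Tom'])
print(dupSeries)
print("-----")

# The index reports whether all of its labels are unique.
print(dupSeries.index.is_unique)
```

The is_unique attribute is a convenient check before relying on one label mapping to exactly one value.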
We can create a Pandas Series object with default numeric indices by calling the pandas.Series constructor and passing a list of data values as an argument as shown below.
(Note that when the print function is applied to a Series object, it prints the object as a two-column table with the index values in the left column and the data values in the right column.)
aSeries = pd.Series(['Tom','Dick','Harry'])
print(aSeries)
print("-----")
print(aSeries[0])
print(aSeries[1])
print(aSeries[2])
A Pandas Series object consists of an index array and a data array. In the above example, we didn't specify the index so the index array was populated with the default integer values from 0 to the number of data elements minus one.
The call to the print function that was executed above displayed the index and the data in the form of a table with the index in the left column and the data values in the right column.
We can also display the index and data values independently as shown below.
print(aSeries.index)
print(aSeries.values)
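To see what kinds of objects the index and values attributes return, we can inspect their types. The following is a small sketch that assumes a recent version of pandas and uses a hypothetical numeric Series named numSeries, because the type returned by values can differ for other kinds of data.

```python
import numpy as np
import pandas as pd

numSeries = pd.Series([10, 20, 30])

# With no index specified, pandas creates a RangeIndex.
print(type(numSeries.index))
print(list(numSeries.index))
print("-----")

# For numeric data, the values attribute is a NumPy array.
print(type(numSeries.values))

# to_numpy() is another way to request the data as a NumPy array.
print(numSeries.to_numpy())
```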
The following code shows how to create a Series object with specified values for both the index and the data. In this case, we will use names for the index and numeric values for the data values. Once again, the index is displayed in the left column and the data values are displayed in the right column.
ageIndex = ['Tom','Dick','Harry']
ageValues = [39,42,65]
ageSeries = pd.Series(ageValues,ageIndex)
print(ageSeries)
Also, once again, we can access and print the index values and the data values independently of one another.
print(ageSeries.index)
print(ageSeries.values)
The data and index arguments can be treated as positional arguments and passed in the order shown above. They can also be treated as named arguments and entered in a different order as shown below. (Note that it is not necessary for the index values to be unique.)
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [39,42,65,15]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
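We can create a Series from a dictionary, in which case the dictionary keys become the index as shown below. (Older versions of pandas sorted the index by key. Starting with pandas 0.23 running on Python 3.6 or later, the Series keeps the insertion order of the dictionary instead.)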
aDict = {'Tom':32,'Dick':65,'Harry':21}
print(aDict)
print("-----")
ageSeries = pd.Series(aDict)
print(ageSeries)
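Because the ordering of the index can depend on the pandas version, one option is to sort the index explicitly whenever a sorted Series is needed. The following sketch uses the sort_index method for that purpose. The names orderDict, orderSeries, and sortedSeries are hypothetical.

```python
import pandas as pd

orderDict = {'Tom': 32, 'Dick': 65, 'Harry': 21}
orderSeries = pd.Series(orderDict)

# The order shown here depends on the pandas version in use.
print(list(orderSeries.index))
print("-----")

# sort_index returns a new Series ordered by index label.
sortedSeries = orderSeries.sort_index()
print(sortedSeries)
```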
As is the case with Numpy arrays, Pandas Series objects can be added together.
If the two objects have the same index as shown below, the resulting data values are the sums of the individual data values in each of the series.
(Note that the sum of two strings is a new string that is the concatenation of the two original strings.)
ageIndex = ['Tom','Dick','Harry']
ageValuesA = [1,2,'Sue']
ageSeriesA = pd.Series(ageValuesA,ageIndex)
print(ageSeriesA)
print("-----")
ageValuesB = [3,4,'Bill']
ageSeriesB = pd.Series(index=ageIndex,data=ageValuesB,)
print(ageSeriesB)
print("-----")
ageSeriesC = ageSeriesA + ageSeriesB  # the result is a Series, not an index
print(ageSeriesC)
Pandas Series need not have the same indices in order to be added. However, if a given index value doesn't appear in both indices, the resulting sum for that index value is NaN, which means "not a number". This is illustrated by the following code, where the index values 'Dick' and 'Joe' each appear in only one of the two indices.
(I will have a lot more to say about NaN in future notebooks.)
ageIndexA = ['Tom','Dick','Harry']
ageValuesA = [1,2,'Sue']
ageSeriesA = pd.Series(ageValuesA,ageIndexA)
print(ageSeriesA)
print("-----")
ageIndexB = ['Tom','Joe','Harry']
ageValuesB = [3,2,'Bill']
ageSeriesB = pd.Series(index=ageIndexB,data=ageValuesB,)
print(ageSeriesB)
print("-----")
ageSeriesC = ageSeriesA + ageSeriesB  # the result is a Series, not an index
print(ageSeriesC)
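If NaN is not the desired result for unmatched index values, one possible alternative is the add method with its fill_value parameter, which substitutes a stand-in value for whichever side is missing. The following sketch uses purely numeric data. The names scoresA, scoresB, plainSum, and filledSum are hypothetical.

```python
import pandas as pd

scoresA = pd.Series([1, 2, 3], index=['Tom', 'Dick', 'Harry'])
scoresB = pd.Series([3, 2, 5], index=['Tom', 'Joe', 'Harry'])

# The + operator produces NaN for 'Dick' and 'Joe'.
plainSum = scoresA + scoresB
print(plainSum)
print("-----")

# add with fill_value=0 treats a missing entry as 0 instead.
filledSum = scoresA.add(scoresB, fill_value=0)
print(filledSum)
```

Whether 0 is a sensible stand-in depends on what the data represents, so this is a design choice rather than a general fix.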
As with Numpy arrays and Python dictionaries, we can access individual data values in a Pandas Series by referring to the index corresponding to a particular data element.
The following code uses the pandas.Series.loc attribute to access a single value by its index label. Then it uses the pandas.Series.iloc attribute to access the same value by its integer position.
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [39,42,65,15]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
print('Access with loc:', ageSeries.loc['Dick'])
print('-----')
print('Access with iloc:',ageSeries.iloc[1])
We can also access multiple data values in a Series by providing a list of index values as shown below.
When using loc, this example accesses the data value associated with the index 'Dick' and also accesses both of the data values associated with the duplicated index 'Tom', even though 'Tom' is included only once in the access list.
However, when using iloc, all three numeric indices must be included in the access list to get the same results.
print('Access with loc:')
print(ageSeries.loc[['Dick','Tom']])
print('-----')
print('Access with iloc:')
print(ageSeries.iloc[[1,0,3]])
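A related point is that the type of the result returned by loc for a single label depends on whether that label is duplicated. The following sketch illustrates this with a hypothetical Series named dupAges.

```python
import pandas as pd

dupAges = pd.Series([39, 42, 65, 15],
                    index=['Tom', 'Dick', 'Harry', 'Tom'])

# A unique label returns a single scalar value.
single = dupAges.loc['Dick']
print(single)
print("-----")

# A duplicated label returns a Series with one row per match.
multiple = dupAges.loc['Tom']
print(type(multiple))
print(multiple)
```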
We can also use slice-like indexing with both loc and iloc. Note that this example uses a Series with two more elements than the Series in the examples above. Also note that when slicing with loc, contrary to usual Python slices, both the start label and the stop label are included.
ageIndex = ['Tom','Dick','Harry','Tom','Sue','Bill']
ageValues = [39,42,65,15,23,42]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
print('Access with loc:')
print(ageSeries.loc['Dick':'Sue'])
The following example uses standard slicing syntax along with iloc to access the same elements using the numeric index.
print('Access with iloc:')
print(ageSeries.iloc[1:5])
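The difference between the two slicing conventions can be checked directly. The following sketch rebuilds the same data under the hypothetical name sliceSeries and compares the two results.

```python
import pandas as pd

sliceSeries = pd.Series([39, 42, 65, 15, 23, 42],
                        index=['Tom', 'Dick', 'Harry', 'Tom', 'Sue', 'Bill'])

# loc includes the stop label 'Sue', which is at position 4.
byLabel = sliceSeries.loc['Dick':'Sue']

# iloc excludes the stop position, so 5 is needed to reach position 4.
byPosition = sliceSeries.iloc[1:5]

print(byLabel.equals(byPosition))
print("-----")

# Stopping at position 4 with iloc leaves 'Sue' out.
print(sliceSeries.iloc[1:4])
```

Note that the labels used as slice bounds here are unique. Slicing with loc on a duplicated label such as 'Tom' in this unsorted index would raise an error.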
A list of boolean values can also be used to select the elements in the Series on the basis of position provided that the number of boolean values in the list matches the number of elements in the Series. This works with both loc and iloc.
print('Access with loc:')
print(ageSeries.loc[[False,True,True,False,True,False]])
print('-----')
print('Access with iloc:')
print(ageSeries.iloc[[False,True,False,False,True,False]])
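In practice, a list of boolean values is often produced by a comparison rather than typed by hand. The following sketch builds such a mask from a hypothetical Series named maskSeries. It assumes that iloc needs the mask as a plain NumPy array rather than as a boolean Series, which is why to_numpy is called.

```python
import pandas as pd

maskSeries = pd.Series([39, 42, 65, 15, 23, 42],
                       index=['Tom', 'Dick', 'Harry', 'Tom', 'Sue', 'Bill'])

# Comparing a Series to a scalar yields a boolean Series.
mask = maskSeries > 40
print(mask)
print("-----")

# loc accepts the boolean Series directly.
overForty = maskSeries.loc[mask]
print(overForty)
print("-----")

# iloc is given the underlying boolean array instead.
print(maskSeries.iloc[mask.to_numpy()])
```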
As with Numpy arrays, we can perform scalar arithmetic operations on Pandas Series as shown below.
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [10,20,30,40]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = ageSeries + 2
print(ageSeries)
print('-----')
ageSeries = ageSeries * 2
print(ageSeries)
We can also apply mathematical functions to Pandas Series as shown below. This code passes the Series object to the numpy.sqrt function, which returns a new Series in which each data value is the square root of the corresponding original value. The result is then assigned back to the same variable name.
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [10,20,30,40]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = np.sqrt(ageSeries)
print(ageSeries)
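The following sketch confirms that numpy.sqrt leaves its argument unchanged and carries the index over to the result. The names original and roots are hypothetical, and the values are perfect squares to keep the output easy to read.

```python
import numpy as np
import pandas as pd

original = pd.Series([4, 9, 16], index=['Tom', 'Dick', 'Harry'])

# np.sqrt returns a new Series; original is not modified.
roots = np.sqrt(original)

print(original)
print("-----")
print(roots)
```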
Once you have a Pandas Series object, there are many methods that you can call on the object such as abs, add, and apply. For example, two such methods are shown below.
The abs method returns a new Series object containing the absolute values of each of the original data values.
The apply method provides an alternative way to call a specified function on each of the data values in the object.
ageIndex = ['Tom','Dick','Harry','Tom']
ageValues = [-10,20,-30,40]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = ageSeries.abs()
print(ageSeries)
print('-----')
print(ageSeries.apply(np.sqrt))
The apply method can also be used with lambda functions as shown below.
ageIndex = ['Tom','Dick','Harry','Tom','Sue','Mary','Ted','Bill']
ageValues = [0,1,2,3,4,5,6,7]
ageSeries = pd.Series(index=ageIndex,data=ageValues,)
print(ageSeries)
print('-----')
ageSeries = ageSeries.apply(lambda x: x if x > 3 else x*2)
print(ageSeries)
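For a simple conditional like this one, the where method offers a possible alternative to apply that operates on the whole Series at once. The following sketch compares the two approaches. The names numberSeries, viaApply, and viaWhere are hypothetical.

```python
import pandas as pd

numberSeries = pd.Series(
    [0, 1, 2, 3, 4, 5, 6, 7],
    index=['Tom', 'Dick', 'Harry', 'Tom', 'Sue', 'Mary', 'Ted', 'Bill'])

# apply calls the lambda function once per data value.
viaApply = numberSeries.apply(lambda x: x if x > 3 else x * 2)

# where keeps values for which the condition is True and takes
# the replacement (here, the doubled value) everywhere else.
viaWhere = numberSeries.where(numberSeries > 3, numberSeries * 2)

print(viaApply.tolist())
print(viaWhere.tolist())
```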
The following program shows how a Pandas Series can be used to create a bar chart. This example displays the individual scores of five ball players for two games, along with the sum of each player's scores for the two games, both as printed text and in a bar chart.
playerNames = ['Alex','Bob','Don','Frank','Harry'] #player names
game01Scores = [78,62,72,60,75] #player scores for game01
game01Series = pd.Series(game01Scores,playerNames)
print('Game 1')
print(game01Series)
print('-----')
game02Scores = [34,70,28,54,36] #player scores for game02
game02Series = pd.Series(game02Scores,playerNames)
print('Game 2')
print(game02Series)
print('-----')
totalSeries = game01Series + game02Series #avoid shadowing the built-in sum function
print('Sum')
print(totalSeries)
# Display the information in a bar chart
xPosition = np.arange(5)
width = 0.25
fig,ax = plt.subplots(1,1)
ax.bar(xPosition,game01Series.values,width,label="Game 1")
ax.bar(xPosition+width,game02Series.values,width,label="Game 2",
       tick_label=game02Series.index)
ax.bar(xPosition+width*2,totalSeries.values,width,label="Sum")
ax.set_xlabel('Player Name')
ax.set_ylabel('Scores')
ax.set_title('Player Scores')
ax.grid()
plt.legend(loc='lower left')
plt.show()
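A Series also has its own plotting interface, which can be a shorter route when only one set of bars is needed. The following sketch plots just the summed scores using the plot.bar method of the Series. The names firstGame, secondGame, and totals are hypothetical. In a notebook with inline plotting enabled, the figure should appear when the cell finishes. In a standalone script, a call to plt.show() would be needed at the end.

```python
import pandas as pd
import matplotlib.pyplot as plt

playerNames = ['Alex', 'Bob', 'Don', 'Frank', 'Harry']
firstGame = pd.Series([78, 62, 72, 60, 75], index=playerNames)
secondGame = pd.Series([34, 70, 28, 54, 36], index=playerNames)
totals = firstGame + secondGame

# plot.bar uses the index values as the tick labels automatically.
fig, ax = plt.subplots(1, 1)
totals.plot.bar(ax=ax)
ax.set_xlabel('Player Name')
ax.set_ylabel('Scores')
ax.set_title('Total Player Scores')
```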
Author: Prof. Richard G. Baldwin
Affiliation: Professor of Computer Information Technology at Austin Community College in Austin, TX.
File: PandasSeries01.html
Revised: 08/30/18
Copyright 2018 Richard G. Baldwin