Pandas: Apply functions and GroupBy
April 19, 2019
Introduction
So far in the course, we have learnt quite a bit about DataFrames. In particular, we learnt about using various boolean and arithmetic operations on DataFrame columns, and also about indexing to select and modify various subsets of a DataFrame.
In this tutorial we will learn another way to operate on and modify a DataFrame, using DataFrame methods like apply() and applymap(). These methods allow us to apply a function over an entire DataFrame.
Let’s get started!
Set up
As in the previous tutorials, let us load the Pandas and NumPy libraries at the beginning.
import pandas as pd
import numpy as np
The student dataset
Let us load the dataset for this tutorial. We will use the read_csv() function for this. The dataset has 8 columns, but we will only keep 5 of them for this tutorial.
load_commonlounge_dataset('student_v3')
student = pd.read_csv("/tmp/student_v3.csv")
student = student[['Admn Yr', 'Board', 'Physics', 'Chemistry', 'Maths']]
Let us look at the first five rows of the data using the head() method.
print(student.head())
Brief description of the data-set:
This dataset contains information about students from an Engineering college. Here’s a brief description of the columns in the dataset:
- Admn Yr: The year in which the student was admitted into the college (numerical)
- Board: Board under which the student studied in High School (categorical)
- Physics: Marks secured in Physics in the final High School exam (numerical)
- Chemistry: Marks secured in Chemistry in the final High School exam (numerical)
- Maths: Marks secured in Maths in the final High School exam (numerical)
Numbers only dataset
For some parts of this tutorial, we will also need a DataFrame which only has numerical values. So, let’s also create a modified DataFrame with the numerical features from student. We will call this student_num:
# extract the numerical features
student_num = student.select_dtypes(include='number')
# display
print(student_num.head())
Note: In this tutorial, you can assume that we reload the dataset at the beginning of each section. That is, changes we make to the dataset will not carry over to the next section of the tutorial.
The apply() method
The apply() method is used to apply a function to every row or column of the DataFrame.
The syntax for the apply() method is as follows:
DataFrame.apply(func, axis=0)
where,
- axis: allows us to decide whether to apply the function over rows or columns. Here, 0 means column-by-column, and 1 means row-by-row.
- func: is the function which we want to apply. It must accept one argument, which will be a Series (either a column or a row of the DataFrame, depending on the value of axis).
When we use the apply() method, it calls the func function once for each row / column, and passes the Series object to func as an argument.
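To make the role of axis concrete, here is a minimal sketch on a small made-up DataFrame. The toy frame and the describe_arg helper are hypothetical and are not part of the student dataset. The helper simply reports the index labels of whatever Series it receives, so we can see what apply() passes in each case.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Physics": [80, 60], "Maths": [90, 70]}, index=["r0", "r1"])

def describe_arg(arg):
    # arg is a Series; return its index labels joined into one string
    return ",".join(str(label) for label in arg.index)

by_column = toy.apply(describe_arg, axis=0)  # func receives one column at a time
by_row = toy.apply(describe_arg, axis=1)     # func receives one row at a time

print(by_column)
print(by_row)
```

With axis=0 each Series is indexed by the row labels, and with axis=1 each Series is indexed by the column labels.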
Let’s see some examples.
apply() over columns
In this first example, we will use apply() to calculate the difference between the mean and the median of every column. For this example, we will be using the student_num DataFrame.
Let’s first define the function which will be called for each column.
def diff_func(arg):
    diff = np.mean(arg) - np.median(arg)
    return diff
Here, arg will be the Pandas Series object for a column in the DataFrame. Using the NumPy functions mean() and median(), we calculate the mean and median of the column and then return the difference.
Now, let’s use apply() to apply this function over all columns in the student_num DataFrame:
result = student_num.apply(diff_func, axis=0)
print(result)
As you can see, the result is a Series with the appropriate values.
Note: We do not put parentheses after the function name diff_func, since we do not want the function to execute immediately. We want to pass the function as a parameter, to be used by the apply() method.
Anonymous functions — lambda
Before we move on to the next topic, let’s learn a little about a concept in Python called lambda or anonymous functions.
It allows us to define and use a function directly in one expression instead of defining the function separately using def first.
In general, the syntax for lambda functions is as follows:
lambda arguments: expression using arguments
This returns a function object. In particular, note that there’s no function name, and that we can omit the return keyword. We’re only allowed to have one expression inside a lambda function.
For example, here’s a lambda function to find the cube of a number:
f = lambda x: x**3
print(f(5))
So, we can rewrite the previous code as follows:
result = student_num.apply(lambda arg: np.mean(arg) - np.median(arg), axis=0)
print(result)
This syntax is convenient when the function we are passing to apply() is really short.
Pre-defined functions
Obviously, we can also directly use existing functions with apply().
For example, to calculate the mean of the values in each column, we can simply pass np.mean to the apply() method:
print(student_num.apply(np.mean, axis=0))
apply() over rows
Now, let us apply a function over rows using axis=1.
We will be calculating the average marks from the Physics, Chemistry and Maths columns for every row.
Let’s define our function:
def avg_func(arg):
    x = (arg['Physics'] + arg['Chemistry'] + arg['Maths']) / 3
    return x
Here, the Series passed to arg will be a row of the DataFrame. The column labels will be the index of this Series.
Now, we can use apply() to call the function on every row.
avg = student.apply(avg_func, axis=1)
print(avg.head())
We can also store the results back in our DataFrame. Let’s try it:
student['Average'] = student.apply(avg_func, axis=1)
print(student.head())
Awesome!
apply() vs vectorized operations
Now, you may be wondering why we would use the apply() method when we could instead do these things using vectorized operations. For example, we could have done the last example without apply() as well:
student['Average'] = (student['Physics'] + student['Chemistry'] + student['Maths']) / 3
The advantage of apply() is that it is much more flexible, since we can write arbitrarily complicated code inside func. We will see some examples of more complicated functions being passed to apply() in the next couple of sections.
The main disadvantage of apply() is that it is not as fast as vectorized operations, which take advantage of the fact that Pandas DataFrame and Series objects are built on arrays. So the vectorized code above for calculating the average would be faster than doing the same thing using apply().
Hence, when Pandas or NumPy already provides vectorized operations to do what we want to do, we should use those operations. But if those operations are not available, or the code is less complicated using apply(), then we should use apply().
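To get a feel for the speed difference, here is a rough timing sketch. It uses randomly generated marks (a hypothetical marks DataFrame, not the student dataset), and the exact timings will depend on your machine and pandas version, so treat the numbers as indicative only.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical random marks, used only for illustration
rng = np.random.default_rng(0)
marks = pd.DataFrame(rng.integers(0, 101, size=(10_000, 3)),
                     columns=["Physics", "Chemistry", "Maths"])

# Row-by-row average using apply()
start = time.perf_counter()
avg_apply = marks.apply(
    lambda row: (row["Physics"] + row["Chemistry"] + row["Maths"]) / 3, axis=1)
apply_seconds = time.perf_counter() - start

# The same average using vectorized column arithmetic
start = time.perf_counter()
avg_vectorized = (marks["Physics"] + marks["Chemistry"] + marks["Maths"]) / 3
vectorized_seconds = time.perf_counter() - start

print(f"apply():    {apply_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.4f} s")
```

Both approaches produce the same values; only the time taken differs.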
Let’s do some slightly more complicated things using apply().
apply() function if-else example
Our student DataFrame contains the Maths, Physics and Chemistry marks for some students. However, the students took different examinations: for students whose school Board was 'HSC', the subject exams had a maximum possible score of 200, whereas for all other students the maximum possible score was 100.
So, it would be nice to divide the Maths (and Physics and Chemistry) marks by 2, but only if Board is 'HSC'.
Let’s define a function that divides the Maths marks by 2 if Board is 'HSC', and then use apply() to run it on every row:
# print first few rows before applying function
print(student.head(10))
print('')
# define func
def normalize_math(x):
    if x['Board'] == 'HSC':
        return x['Maths'] / 2
    else:
        return x['Maths']
# do apply and store results in student
student['Maths_normalized'] = student.apply(normalize_math, axis=1)
# print first few rows after applying function
print(student.head(10))
print('')
The marks in the Maths_normalized column are all out of 100 now! Similarly, we can normalize the marks for Physics and Chemistry.
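One possible way to handle all three subjects in a single pass is to return a Series from the function instead of a single value. The sketch below uses a few made-up rows (the toy DataFrame and the normalize_marks helper are hypothetical) rather than the real dataset. When func returns a Series for each row, apply() assembles the results into a DataFrame.

```python
import pandas as pd

# Hypothetical rows that mimic the student columns, used only for illustration
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC"],
    "Physics": [150, 80, 120],
    "Chemistry": [160, 70, 100],
    "Maths": [180, 90, 140],
})

subjects = ["Physics", "Chemistry", "Maths"]

def normalize_marks(row):
    # Halve every subject for HSC rows, leave other rows unchanged
    if row["Board"] == "HSC":
        return row[subjects] / 2
    return row[subjects]

# Each call returns a Series, so the overall result is a DataFrame
normalized = toy.apply(normalize_marks, axis=1)
print(normalized)
```

The result has one column per subject, which we could then assign back to the original DataFrame if needed.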
The applymap() method
The applymap() method is used to apply a function on every single element of the DataFrame.
The syntax is very simple and similar to the apply() method:
DataFrame.applymap(func)
where func is the function we want to apply.
This returns a DataFrame with transformed elements.
Note: Since pandas 2.1, applymap() is deprecated in favor of DataFrame.map(), which does the same thing.
Let us take an example where we divide all the elements by 100 and then square them.
First, we will define a function foo(). Here the argument to the function is an individual value in the DataFrame.
def foo(arg):
    return np.square(arg/100)
Next, we use applymap() to execute the function:
# apply the new function
temp = student_num.applymap(foo)
# display top 5 results
print(temp.head())
Now, let us try and do the same thing with a lambda function. Instead of passing a separately defined function as the argument, we will directly write the lambda expression.
# apply function with lambda
temp = student_num.applymap(lambda x: np.square(x/100))
# display top 5 results
print(temp.head())
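Since this particular transformation is plain arithmetic, it can also be written as a vectorized expression. The sketch below compares the two on a small made-up DataFrame (toy and scale_and_square are hypothetical names). The hasattr check is an assumption about which pandas version is installed: it picks DataFrame.map when it exists and falls back to applymap otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, used only for illustration
toy = pd.DataFrame({"Physics": [80, 60], "Maths": [90, 100]})

def scale_and_square(value):
    # Same transformation as foo(): divide by 100, then square
    return np.square(value / 100)

# DataFrame.map exists from pandas 2.1; older versions only have applymap
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
result_elementwise = elementwise(scale_and_square)

# The same arithmetic written as a vectorized expression on the whole DataFrame
result_vectorized = (toy / 100) ** 2

print(result_elementwise)
print(result_vectorized)
```

As with apply(), the element-by-element version is the more flexible one, while the vectorized version is usually the faster one.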
The groupby() method
Earlier, we used the apply() method to find the mean of each column in the student_num DataFrame.
But columns like Board or Admn Yr suggest that students may belong to different groups, according to the values in these columns. Therefore, it is possible that aggregate statistics like the mean or median are different for each such group of students.
We will use the groupby() method to break up the dataset into different groups, and then calculate aggregate statistics using the Pandas DataFrame method mean().
We will use the following syntax:
DataFrame.groupby(["column"]).mean()
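Here the groupby() method will group the data according to the values in the column that is passed as the argument. Then mean() will calculate the mean for each group separately.
Note: In older versions of pandas, DataFrame methods like mean() operated only on the columns with a numeric dtype by default. From pandas 2.0 onwards, non-numeric columns are no longer skipped silently, so we need to pass numeric_only=True when the DataFrame also contains non-numeric columns.
A good way to understand groupby() is to think of it as a three-step process of Split-Apply-Combine: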
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
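The three steps above can be sketched by hand on a few made-up rows (the toy DataFrame is hypothetical). This is only meant to illustrate the idea; it is not a description of how pandas implements groupby() internally.

```python
import pandas as pd

# Hypothetical rows that mimic the student columns, used only for illustration
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC", "CBSE"],
    "Physics": [150, 80, 130, 90],
    "Maths": [180, 90, 160, 70],
})

# 1. Split: iterating over a groupby object yields (group label, sub-DataFrame) pairs
# 2. Apply: compute the column means of each sub-DataFrame
group_means = {}
for board, group in toy.groupby("Board"):
    group_means[board] = group[["Physics", "Maths"]].mean()

# 3. Combine: put the per-group Series together, one row per group
manual = pd.DataFrame(group_means).T

# The one-line equivalent
builtin = toy.groupby("Board")[["Physics", "Maths"]].mean()

print(manual)
print(builtin)
```

Both DataFrames contain the same group means; groupby().mean() just does all three steps for us.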
Let’s see an example of group by.
First we will look at the mean of all the columns as a whole. This will allow us to understand the difference in the aggregate statistics better.
print(student.mean(numeric_only=True))
Now, let us use groupby() on the student DataFrame and find the average values for groups of students from different Boards.
# calculate the group mean
group_mean = student.groupby(["Board"]).mean()
# display result
print(group_mean)
We can see the different groups of students based on Board have significantly different means for the Physics, Chemistry and Maths columns.
groupby() can be a powerful tool when applied appropriately! Similarly, we can also calculate other statistics like median, standard deviation, etc. with groupby().
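As a small sketch of what that could look like, the example below computes the median for each group, and then several statistics at once using agg(). The toy DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows, used only for illustration
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC", "CBSE", "HSC"],
    "Maths": [180, 90, 160, 70, 140],
})

# Median of the Maths marks within each Board
medians = toy.groupby("Board")["Maths"].median()
print(medians)

# Several statistics at once, one column per statistic
stats = toy.groupby("Board")["Maths"].agg(["mean", "median", "std"])
print(stats)
```

Here std is the sample standard deviation, which is the default in Pandas.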
Summary
- We can apply a function to every row / column of a DataFrame using the apply() method.
- We can apply either pre-defined statistical/mathematical functions or our own functions.
- The anonymous function syntax, lambda, can be used to define a function inline, without defining it separately.
- Although apply() is slower than performing vectorized operations, it is more flexible.
- The applymap() method is used to apply a function on every single element of the DataFrame.
- The groupby() method is used to break up the dataset into different groups, after which we can apply functions on each group separately.
Reference
apply()
DataFrame.apply(func, axis=0)
lambda functions
lambda arguments: expression using arguments
applymap()
DataFrame.applymap(func)
groupby()
DataFrame.groupby(["Column label"]).function()