Pandas: Apply functions and GroupBy

April 19, 2019

Introduction

So far in the course, we have learnt quite a bit about DataFrames. In particular, we learnt about using various boolean and arithmetic operations on DataFrame columns, and also about indexing to select and modify various subsets of a DataFrame.

In this tutorial we will learn another way to operate on and modify a DataFrame, using DataFrame methods like apply() and applymap(). These methods allow us to apply a function over an entire DataFrame. We will also see how groupby() lets us calculate statistics separately for different groups of rows.

Let’s get started!

Set up

As in the previous tutorials, let us load the Pandas and NumPy libraries at the beginning.

import pandas as pd
import numpy as np

The `student` dataset

Let us load the dataset for this tutorial. We will use the read_csv() function for this. The dataset has 8 columns, but we will only keep 5 of them for this tutorial.

load_commonlounge_dataset('student_v3')
student = pd.read_csv("/tmp/student_v3.csv")
student = student[['Admn Yr', 'Board', 'Physics', 'Chemistry', 'Maths']]

Let us look at the first five rows of the data using the head() method.

print(student.head())

Brief description of the data-set:

This dataset contains information about students from an Engineering college. Here’s a brief description of the columns in the dataset:

Admn Yr - The year in which the student was admitted into the college (numerical)
Board - Board under which the student studied in High School (categorical)
Physics - Marks secured in Physics in the final High School exam (numerical)
Chemistry - Marks secured in Chemistry in the final High School exam (numerical)
Maths - Marks secured in Maths in the final High School exam (numerical)

Numbers only dataset

For some parts of this tutorial, we will also need a DataFrame which only has numerical values. So, let’s also create a modified DataFrame with the numerical features from student. We will call this student_num:

# extract the numerical features 
student_num = student.select_dtypes(include='number')
# display 
print(student_num.head())

Note: In this tutorial, you can assume that we reload the dataset at the beginning of each section in the tutorial. That is, changes we make to the dataset will not carry on to the next section of the tutorial.

The `apply()` method

The apply() method is used to apply a function to every row or column of the DataFrame.

The syntax for apply() method is as follows:

DataFrame.apply(func, axis=0)

where,

axis — allows us to decide whether to apply the function over rows or columns. Here, 0 means column-by-column, and 1 means row-by-row.
func — is the function which we want to apply. It must accept one argument, which will be a Series (either a column or a row of the DataFrame, depending on the value of axis).

When we use the apply() method, it calls the func function once for each row / column, and passes the Series object to func as an argument.
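As a quick illustration of the axis argument, here is a minimal sketch on a made-up DataFrame. The toy data and the helper name value_range are assumptions for illustration only; they are not part of the student dataset.

```python
import pandas as pd

# Toy data, made up for illustration (not the student dataset)
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def value_range(series):
    # series is one column (axis=0) or one row (axis=1) of the DataFrame
    return series.max() - series.min()

# axis=0: the function is called once per column
col_range = toy.apply(value_range, axis=0)

# axis=1: the function is called once per row
row_range = toy.apply(value_range, axis=1)

print(col_range)
print(row_range)
```

With axis=0 the result has one entry per column label, and with axis=1 it has one entry per row label.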

Let’s see some examples.

`apply()` over columns

In this first example, we will use apply() to calculate the difference between the mean and the median of every column. For this example, we will be using the student_num DataFrame.

Let’s first define the function which will be called for each column.

def diff_func(arg):
    diff = np.mean(arg) - np.median(arg)
    return diff

Here, arg will be the Pandas Series object for a column in the DataFrame. Using the NumPy functions mean() and median(), we calculate the mean and median of the column and then return the difference.

Now, let’s use apply() to apply this function over all columns in the student_num DataFrame:

result = student_num.apply(diff_func, axis=0)
print(result)

As you can see, the result is a Series with one value per column, indexed by the column labels.

Note: We do not put parentheses after the function name diff_func, since we do not want the function to execute immediately. We want to pass the function as a parameter, to be used by the apply() method.

Anonymous functions — `lambda`

Before we move on to the next topic, let’s learn a little about a concept in Python called lambda or anonymous functions.

It allows us to define and use a function directly in one expression instead of defining the function separately using def first.

In general, the syntax for lambda functions is as follows:

lambda arguments: expression using arguments

This returns a function object. In particular, note that there’s no function name, and that we can omit the return keyword. We’re only allowed to have one expression inside a lambda function.

For example, here’s a lambda function to find the cube of a number:

f = lambda x: x**3
print(f(5))

So, we can rewrite the previous code as follows:

result = student_num.apply(lambda arg: np.mean(arg) - np.median(arg), axis=0)
print(result)

This syntax is convenient when the function we are passing to apply() is really short.

Pre-defined functions

Obviously, we can also directly use existing functions with apply().

For example, to calculate the mean of the values in each column, we can simply pass np.mean to the apply() method:

print(student_num.apply(np.mean, axis=0))

`apply()` over rows

Now, let us apply a function over rows using axis=1.

We will be calculating the average marks from Physics, Chemistry and Maths columns for every row.

Let’s define our function:

def avg_func(arg):
    x = (arg['Physics'] + arg['Chemistry'] + arg['Maths']) / 3
    return x

Here, the Series passed to arg will be a row of the DataFrame. The column labels will be the index of this Series.

Now, we can apply() the function over all the rows.

avg = student.apply(avg_func, axis=1)
print(avg.head())

We can also store the results back in our DataFrame. Let’s try it:

student['Average'] = student.apply(avg_func, axis=1)
print(student.head())

Awesome!

`apply()` vs vectorized operations

Now, you may be wondering why we would use the apply() method when we could instead do these things using vectorized operations. For example, we could have done the last example without apply() as well:

student['Average'] = (student['Physics'] + student['Chemistry'] + student['Maths']) / 3

The advantage of apply() is that it is much more flexible, since we can write arbitrarily complicated code inside func. We will see some examples of more complicated functions being passed to apply() in the next couple of sections.

The main disadvantage of apply() is that it is not as fast as vectorized operations, which take advantage of the fact that Pandas DataFrame and Series objects are built on NumPy arrays. So the vectorized code above for calculating the average would be faster than doing the same thing using apply().
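To get a rough feel for the speed difference, here is a minimal timing sketch. It uses synthetic random marks (an assumption, not the student dataset), and the exact timings will vary from machine to machine.

```python
import time

import numpy as np
import pandas as pd

# Synthetic marks, made up for illustration (not the student dataset)
rng = np.random.default_rng(0)
marks = pd.DataFrame(rng.integers(0, 101, size=(20_000, 3)),
                     columns=['Physics', 'Chemistry', 'Maths'])

def avg_func(row):
    return (row['Physics'] + row['Chemistry'] + row['Maths']) / 3

# time the row-by-row apply() version
start = time.perf_counter()
avg_apply = marks.apply(avg_func, axis=1)
apply_seconds = time.perf_counter() - start

# time the vectorized version
start = time.perf_counter()
avg_vectorized = (marks['Physics'] + marks['Chemistry'] + marks['Maths']) / 3
vectorized_seconds = time.perf_counter() - start

print(f'apply():    {apply_seconds:.4f} s')
print(f'vectorized: {vectorized_seconds:.4f} s')

# both approaches compute the same values
print(np.allclose(avg_apply, avg_vectorized))
```

Both versions produce the same averages; only the time taken differs.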

Hence, when Pandas or NumPy already provides vectorized operations to do what we want to do, we should use those operations. But if those functions are not available, or the code is less complicated using apply(), then we should use apply().

Let’s do some slightly more complicated things using apply().

`apply()` function if-else example

Our student DataFrame contains the Maths, Physics and Chemistry marks for some students. However, the students took different examinations: for students whose school Board was 'HSC', the subject exams had a maximum possible score of 200, whereas for all other students the maximum possible score was 100.

So, it would be nice to divide the Maths (and Physics and Chemistry) marks by 2, but only if Board is 'HSC'.

Let’s define our function for dividing the Maths marks by 2 if Board is 'HSC', and then use apply() to apply the function:

# print first few rows before applying function
print(student.head(10))
print('')
# define func 
def normalize_math(x):
    if x['Board'] == 'HSC':
        return x['Maths'] / 2
    else:
        return x['Maths']
# do apply and store results in student
student['Maths_normalized'] = student.apply(normalize_math, axis=1)
# print first few rows after applying function
print(student.head(10))
print('')

All the Maths marks are out of 100 now! Similarly, we can normalize the marks for Physics and Chemistry.
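One possible way to normalize all three subjects is sketched below on a made-up DataFrame (the toy rows and the helper name normalize are assumptions for illustration). It relies on the args parameter of DataFrame.apply(), which has not been covered in this tutorial: args forwards extra positional arguments to the function.

```python
import pandas as pd

# Toy rows, made up for illustration (not the real student data)
toy = pd.DataFrame({
    'Board': ['HSC', 'CBSE', 'HSC'],
    'Physics': [150, 80, 120],
    'Chemistry': [160, 70, 100],
    'Maths': [180, 90, 140],
})

def normalize(row, subject):
    # halve the marks only for HSC students, whose exams were out of 200
    if row['Board'] == 'HSC':
        return row[subject] / 2
    else:
        return row[subject]

# reuse the same function for each subject by passing the column name via args
for subject in ['Physics', 'Chemistry', 'Maths']:
    toy[subject + '_normalized'] = toy.apply(normalize, axis=1, args=(subject,))

print(toy)
```

Passing the column name as an argument avoids writing three nearly identical functions.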

The `applymap()` method

The applymap() method is used to apply a function on every single element of the DataFrame.

The syntax is very simple and similar to the apply() function:

DataFrame.applymap(func)

where func is the function we want to apply.

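This returns a new DataFrame with the transformed elements.

Note: From pandas 2.1 onwards, applymap() is deprecated in favour of DataFrame.map(), which is used in the same way.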

Let us take an example where we divide all the elements by 100 and then square them.

First, we will define a function foo(). Here, the argument to the function is an individual value from the DataFrame.

def foo(arg):
    return np.square(arg/100)

Next, we use applymap() to execute the function on every element:

# apply the new function 
temp = student_num.applymap(foo)
# display top 5 results
print(temp.head())

Now, let us do the same thing with a lambda function. Instead of passing a named function as the argument, we will write the lambda expression directly.

# apply function with lambda
temp = student_num.applymap(lambda x: np.square(x/100))
# display top 5 results
print(temp.head())
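The sketch below repeats the same element-wise transformation on a made-up DataFrame (the toy values are an assumption for illustration). Since newer pandas releases (2.1 and later) provide DataFrame.map() as the element-wise method and deprecate applymap(), the code picks whichever one is available. It also checks the result against the equivalent vectorized expression.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the student dataset)
toy = pd.DataFrame({'Physics': [50, 80], 'Maths': [100, 20]})

def foo(arg):
    return np.square(arg / 100)

# use DataFrame.map() when the installed pandas has it, otherwise applymap()
if hasattr(toy, 'map'):
    elementwise = toy.map(foo)
else:
    elementwise = toy.applymap(foo)

# the same computation written as a vectorized expression on the whole DataFrame
vectorized = (toy / 100) ** 2

print(elementwise)
print(np.allclose(elementwise, vectorized))
```

For simple arithmetic like this, the vectorized expression is usually the better choice; the element-wise method is useful when the function cannot be expressed with vectorized operations.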

The `groupby()` method

Earlier, we used the apply() method to find the mean of each column in the student_num DataFrame.

But columns like Board or Admn Yr suggest that students may belong to different groups or clusters, according to the values in these columns. Therefore, it is possible that aggregate statistics like mean or median are different for each such group of students.

We will use the groupby() method to break up the dataset into different groups, and then calculate aggregate statistics using the Pandas DataFrame method mean().

We will use the following syntax:

DataFrame.groupby(["column"]).mean()

Here the groupby() method will group the data according to the values from the column that is passed as argument. Then mean() will calculate the mean for each group separately.

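Note: In older versions of Pandas, DataFrame methods like mean() operate only on the columns with numeric dtype by default. From pandas 2.0 onwards, we need to pass numeric_only=True to get this behaviour when the DataFrame also has non-numeric columns.

A good way to understand groupby() is to think of it as a three-step process of Split-Apply-Combine: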

Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
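The three steps above can be sketched by hand on a made-up DataFrame (the toy rows are an assumption for illustration), and compared with what groupby() gives in a single call:

```python
import pandas as pd

# Toy rows, made up for illustration (not the real student data)
toy = pd.DataFrame({
    'Board': ['HSC', 'CBSE', 'HSC', 'CBSE'],
    'Maths': [180, 90, 140, 70],
})

# 1. Split: one sub-DataFrame per distinct Board value
groups = {board: toy[toy['Board'] == board] for board in toy['Board'].unique()}

# 2. Apply: calculate the mean of Maths within each group
means = {board: group['Maths'].mean() for board, group in groups.items()}

# 3. Combine: collect the results into a single Series indexed by Board
manual = pd.Series(means).sort_index()

# groupby() performs the same three steps in one call
grouped = toy.groupby('Board')['Maths'].mean()

print(manual)
print(grouped)
```

Both Series hold the same group means; groupby() simply saves us from writing the split and combine steps ourselves.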

Let’s see an example of group by.

First we will look at the mean of all the columns as a whole. This will allow us to understand the difference in the aggregate statistics better.

# numeric_only=True skips the non-numeric Board column
print(student.mean(numeric_only=True))

Now, let us use groupby() on the student DataFrame and find the average values of different groups of students from different Boards.

# calculate the group mean
group_mean = student.groupby(["Board"]).mean()
# display result
print(group_mean)

We can see the different groups of students based on Board have significantly different means for Physics, Chemistry and Maths columns.

groupby() can be a powerful tool when applied appropriately! Similarly, we can also calculate other statistics like the median, standard deviation, etc., with groupby().
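As a minimal sketch of these other statistics, the code below uses a made-up DataFrame (the toy rows are an assumption for illustration). It also uses agg(), a groupby method not covered in this tutorial, which calculates several statistics in one call.

```python
import pandas as pd

# Toy rows, made up for illustration (not the real student data)
toy = pd.DataFrame({
    'Board': ['HSC', 'CBSE', 'HSC', 'CBSE'],
    'Physics': [150, 80, 130, 60],
    'Maths': [180, 90, 140, 70],
})

by_board = toy.groupby('Board')

# median and standard deviation of every column, per Board
medians = by_board.median()
stds = by_board.std()

# several statistics for one column in a single call
maths_stats = by_board['Maths'].agg(['mean', 'median', 'std'])

print(medians)
print(stds)
print(maths_stats)
```

Each result has one row per Board, just like the group means calculated earlier.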

Summary

We can apply a function to every row / column of a DataFrame using apply() method.
We can apply either pre-defined functions (such as NumPy's statistical and mathematical functions) or functions we write ourselves.
Anonymous (lambda) functions can be used to define functions inline, without defining them separately.
Although apply() is slower than performing vectorized operations, it is more flexible.
The applymap() method is used to apply a function on every single element of the DataFrame.
The groupby() method is used to break up the dataset into different groups, after which we can apply functions on each group separately.

Reference

apply()

DataFrame.apply(func, axis=0)

lambda functions

lambda arguments: expression using arguments

applymap()

DataFrame.applymap(func)

groupby()

DataFrame.groupby(["Column label"]).function()
