Pandas: Criteria Based Selection
April 18, 2019
Introduction
You have already learnt about basics of the Pandas DataFrame, and how to select various subsets of the DataFrame using indexing and slicing.
In this tutorial we will learn how to select a subset of DataFrame which matches some criteria. For example, we might want to select the rows of a DataFrame based on the values in each row (say age >= 30). Or, we might even want to select a subset of the DataFrame based on the datatypes.
Let’s get started!
Set-up
As in the previous tutorial, let’s start by importing the libraries and loading the dataset.
import pandas as pd
import numpy as np
# load the dataset
load_commonlounge_dataset('adult_income_data_small')
adult_income = pd.read_csv("/tmp/adult_income_data_small.csv")
# just to check data loaded
print(adult_income.head())
Selection using Boolean Array (Boolean Indexing)
Sometimes we want to select a subset of DataFrame based on the values it contains.
For example, from the adult_income
DataFrame, we might want to select all the adults with more than 12 years of education (edu_yrs >= 12
).
Pandas allows us to create a subset based on a Boolean Array. This Array needs to be of the same length as number of rows in the DataFrame. The rows corresponding to True
are kept in the subset and the rows corresponding to False
are excluded. This is known as Boolean Indexing.
Let’s write the code for select all the adults with more than 12 years of education (edu_yrs >= 12
) from our adult_income
DataFrame.
We will first create the Boolean array:
b_arr = adult_income["edu_yrs"] >= 12
Then, we will use this array to select rows from adult_income
DataFrame using []
operator:
temp_df = adult_income[b_arr]
Let us create the Boolean Array and take glimpse at the first 10 elements. The array created is actually a Pandas Series with elements of datatype bool
— so all the values are True
or False
.
# create boolean array
b_arr = adult_income["edu_yrs"] >= 12
# print the first 10 values
print(b_arr.head(10))
We can see that among the top 10 rows, only the row with index 4 is True
and everything else is False
.
Hence, when perform the boolean indexing, that row is the only one which will be selectively kept from the first 10 rows. Let us look at the data subset selected.
# selection of subset
temp_df = adult_income[b_arr]
# display first few elements
print(temp_df.head())
We can do this directly within the []
operator too:
temp_df = adult_income[adult_income["edu_yrs"] >= 12]
We can also use loc
to perform Boolean Indexing, which works exactly the same way. Here’s an example:
temp_df = adult_income.loc[adult_income["edu_yrs"]>=12, :]
print(temp_df.head())
Although we will be using []
operator throughout this tutorial, you should try and re-write the code with loc
and see that they give the same results!
Note: In Pandas,
iloc
does not support Indexing using a Boolean Series.
Criterias using logical operations — and, or, not
In Python, we can combine boolean values using logical operators and
, or
and not
.
Pandas supports these operations for Series, so we can write more complicated expressions and then use them for Boolean Indexing. In Pandas, &
is used for the and operation, |
is used for the or operation, and ~
is used for the not operation.
Let’s see a few examples.
Suppose we wanted the rows for which edu_yrs
is greater than or equal to 12 and sex
is Female
.
We can create a Boolean array using the &
operator as follows:
b_arr = (adult_income["edu_yrs"] >= 12) & (adult_income["sex"] == 'Female')
Let us use this to create a subset and see the first few elements!
# create boolean array
b_arr = (adult_income["edu_yrs"] >= 12) & (adult_income["sex"] == 'Female')
# selection of subset
temp_df = adult_income[b_arr]
# display first few elements
print(temp_df.head())
Now, let’s select rows for which either capital_gain is more than 1000 or income is >50K.
This time, we will use the |
operator to create the Boolean array:
# create boolean array
b_arr = (adult_income["capital_gain"] > 1000) | (adult_income["income"] == '>50K')
# selection of subset
temp_df = adult_income[b_arr]
# display first 10 elements
print(temp_df.head(10))
We can use the ~
operator select all the rows which do not match some criteria!
For example, suppose we need to select all the rows where the education
is not “Doctorate”. The Boolean expression corresponding to this criteria is
~(adult_income["education"] == 'Doctorate')
Let’s see the full code using this criteria:
# create boolean array
b_arr = ~(adult_income["education"] == 'Doctorate')
# selection of subset
temp_df = adult_income[b_arr]
# display first 10 elements
print(temp_df.head(10))
You can do some boolean indexing yourself to find interesting types of people in the above data. Are there high school graduates (HS-grad
) with income >50K
? Are there people who work less than 30 hours a week with income >50K
?
Selection based on datatypes
Sometimes we might want to keep only particular types of data in our DataFrame subset. The data type of a column in a DataFrame is referred to as dtype.
We will use the method select_dtypes()
to carry out data-type based selection. The syntax is as follows:
DataFrame.select_dtypes(include="", exclude="")
Here, the parameter include
is used to decide which dtypes to include, and exclude
is used to decide which dtypes to exclude.
Each parameter can be either a dtype, or a list of dtypes.
Usually, we specify the each dtype as a string. Also, most of the time, one parameter is enough to work with.
Some common valid strings for dtypes are:
'int'
: for integers'float'
: for floating point numbers'bool'
: Boolean data type'number'
: all numeric types'categorical'
: for pandas dtype categorical'object'
: all object data types, generally strings are the only object datatypes'datetime'
and'timedelta'
: these are datatypes related to time and date. We will learn about these in a later tutorial.
Let us use this to select all the numeric datatypes from the adult_income
dataset:
# select numeric dtypes
df_temp = adult_income.select_dtypes(include='number')
# display first few elements
print(df_temp.head())
Summary
- With Boolean Indexing, we can select DataFrame rows satisfying some criteria
- We can define criterias / Boolean arrays using comparison operations (using
==
,>
, etc) and logical operations (using&
,|
and~
) - We can also selectively keep columns with particular datatypes using
select_dtypes()
Reference
Boolean indexing:
subset_dataframe = DataFrame[boolean_array]
subset_dataframe = DataFrame[(boolean_array_1) & (boolean_array_2)] # and
subset_dataframe = DataFrame[(boolean_array_1) | (boolean_array_2)] # or
subset_dataframe = DataFrame[~(boolean_array)] # not
Selecting columns with specific datatypes:
DataFrame.select_dtypes(include="", exclude="")
# common arguments: 'int', 'float', 'bool', 'number', 'categorical', 'object, 'datetime'