Pandas Introduction: DataFrames
April 12, 2019
Introduction
Pandas is an open-source Python library which provides a 2-dimensional data structure called a DataFrame.
A DataFrame is the Python equivalent of a spreadsheet or SQL table, and is the most commonly used Pandas object. It provides a lot of useful functionality to work with tabular data, which is something we do regularly in data science and machine learning.
In this tutorial, we will be learning about the pandas DataFrame and explore some of its useful methods.
Importing
We will need the Pandas and NumPy libraries for this tutorial, so let us load them at the beginning.
import pandas as pd
import numpy as np
Note:
import pandas as pd
is the standard convention for importing pandas. It is followed universinally within the machine learning / data science communities.
The next step is to load our dataset.
The adult_income
dataset
Before we start loading the dataset using Pandas, we need to download the data on CommonLounge’s system. For this, we will use the load_commonlounge_dataset()
function:
load_commonlounge_dataset('adult_income_data_small')
The above function downloads a file to the location /tmp/adult_income_data_small.csv
. Here are the first few lines from the file:
education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
The above format is called comma-separated values, or CSV. It’s a very common way of storing tabular data in plain-text files. Often, the first line is a header line containing the column names. The rest of the lines have one observation per line.
Now that the dataset is downloaded, let’s use the Pandas read_csv()
function to read the dataset.
adult_income = pd.read_csv("/tmp/adult_income_data_small.csv")
DataFrames
The read_csv()
function returns a DataFrame. A DataFrame is a 2-dimensional data structure — analogous to a spreadsheet or SQL table. It is generally the most commonly used Pandas object.
Once the data is loaded into a pandas DataFrame, the first step is to see how large the dataset is. For this, we can check the shape
of the DataFrame, which tells us the number of rows (observations) and columns (variables).
print('Shape of dataset:', adult_income.shape)
Our data set has 326 rows and 6 columns.
As you may already know, you can run this code right here. To run the program, just click
Try It Now!
and then theRun
button. You can also edit the code in any way you want.
To get the column names, use the columns
attribute:
print(adult_income.columns)
To get a glimpse of the data itself, we can see the first five rows using the head()
method:
print(adult_income.head())
Pandas takes care of printing the table nicely for us! Sweet!
Note: On the very left, you can see the numbers 0, 1, 2, and so on. These are the row indexes. The same way every column has a name, every row has an index. By default, the row indexes go from 0 to n-1, where n is the total number of rows in the DataFrame.
We can also pass in an argument to head()
specifying how many lines we want to see. Let’s try printing the first 8 observations:
print(adult_income.head(8))
Remember, you can run the code yourself. Click on
Try it now!
and edit the code to print the first 50 rows of the dataset. Then, press theRun
button and take a detailed look at the dataset.
Brief description of the Dataset
Before we move further, it might be helpful to know a little more about our dataset. This data has been taken from the UCI Machine Learning Repository. (The dataset has been modified for the purpose of this tutorial).
Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
The features/variables in the dataset are described below:
education
: highest educational qualification (categorical)edu_yrs
: total number of years spent in educational institutes (numerical)sex
:Male
orFemale
(categorical, binary)capital_gain
: profits made through capital markets (numerical)hours_per_week
: working hours per week (numerical)income
: the target variable - either>50k
or<=50K
(categorical, binary)
Dataset overview
Let’s see another couple of helpful DataFrame methods — info()
and describe()
.
We can use the .info()
method to look at the datatype of every variable, along with number of observations present, like so:
adult_income.info()
As you can see, the numerical variables such as edu_yrs
, capital_gain
, etc have type int64
(64-bit integers), and the strings (categorical variables) such as education
, sex
, etc have type object
.
Often, datasets have missing values, usually denoted by null
. Hence, pandas is being helpful and letting us know that for all variables in our dataset, we have 326 non-null
values, i.e. no missing values. 🙂
Finally, there’s the .describe()
method, which by default summarizes the numerical variables. For each numerical variable, it tells us the minimum value, maximum value, mean, standard deviation, etc. Even the 25th and 75th percentile values are included.
print(adult_income.describe())
The .describe()
method also accepts an argument which allows us to specify which datatype we want a summary of. For categorical variables, the datatype is 'object'
.
Hence, to get a summary, of the categorical variables, we will pass the parameter include
with argument 'object'
. When summarizing categorical variables, the .describe()
method includes information about the number of unique values for the variable, the most common value for the variable, etc.
Let’s see it in action:
print(adult_income.describe(include='object'))
Awesome! Let’s see some even more interesting things we could do with our dataset.
Sorting a DataFrame - sort_values()
Sometimes in order to do data exploration or feature engineering, we might need the DataFrame to be sorted in a particular order.
For example, suppose we want the adult-income dataset to be sorted in increasing order of edu_yrs
variable.
We can use the sort_values()
DataFrame method to sort our dataset. The syntax is as follows:
Sorted_DataFrame = DataFrame.sort_values(by, axis=0, ascending=True)
This returns a sorted copy of the DataFrame.
Parameters:
by
: the column name or list of column names to sort the DataFrame byaxis
: “0” denotes sorting by values in a column. (“1” means sorting by row)ascending
: if this is “True”, sort in ascending order, otherwise descending order
Let us sort the adult_income
data and see for ourselves!
We will sort according to the edu_yrs
column:
df_sorted = adult_income.sort_values(by="edu_yrs", axis=0, ascending=False)
Now, let’s look at the sorted dataframe:
# display the top 5 values (maximum "edu_yrs")
print(df_sorted.head())
Note: On the very left, you can see the row indexes 75, 4, 211, and so on. When creating a DataFrame from another DataFrame, Pandas operations maintain the index values. Recall that the
adult_income
DataFrame has row indexes going from 0 to 325, in that order. So now, the sorted DataFrame still has the row indexes 0 to 325, but the order of the indexes has changed because of sorting.
# display the bottom 5 values (minimum "edu_yrs")
print(df_sorted.tail())
As you can see, our data is now sorted.
However, you may notice that even though the data is sorted, multiple rows with the same value for edu_yrs
are not arranged in any particular order.
What if we wanted all the rows which have the same value of edu_yrs
to be sorted with respect to the hours_per_week
column?
We just need to pass the name of the second column we want to sort by, along with the first column as a list to the by
parameter:
df_sorted = adult_income.sort_values(by=["edu_yrs","hours_per_week"], axis=0, ascending=False)
# display the top 12 values after sorting
print(df_sorted.head(12))
We can even modify this further. It is not necessary that we will want the DataFrame to be sorted in descending order by both the columns.
We might want to get it sorted in descending order of edu_yrs
values, while sorting in ascending order of hours_per_week
values within that.
To do that, we pass a list to the ascending
parameter as well, with the order corresponding to that of the list passed to the by
parameter.
For example in our current problem we will pass the following argument ascending = [False, True]
df_sorted = adult_income.sort_values(by=["edu_yrs","hours_per_week"], axis=0, ascending=[False, True])
# display the top 12 values after sorting
print(df_sorted.head(12))
Top Values - nlargest()
and nsmallest()
Sometimes we may not be interested in sorting the DataFrame, but just need to look at the top 10 or top 20 instances according to some column/feature.
For example:
- Who are the top 10 earners of
capital_gain
? - According to
hours_per_week
, which 5 people have the lowest working hours?
Instead of sorting and then viewing the top 5 or 10 results, we can use a function to display the largest or smallest n-values according to any column. We will be using the following functions for this: nlargest()
and nsmallest()
1. Who are the top 10 earners of “capital-gain”?
Let us answer the first question with nlargest()
. The syntax is as follows:
DataFrame.nlargest(n, column_label)
n
— number of rows to returncolumn_label
— name of the column for which we want the largest values
This function returns a DataFrame with n-rows.
df_largest = adult_income.nlargest(10, "capital_gain")
# display results
print(df_largest)
2. According to “hours-per-week”, which 5 people have the lowest working hours?
We will answer the second question with nsmallest(n, column_label)
. The parameters are exactly the same:
# Who are the top 10 earners of "capital-gain".
df_smallest = adult_income.nsmallest(5, "hours_per_week")
# display results
print(df_smallest)
Summary
- The pandas DataFrame is a 2D table. It is the Python equivalent of a spreadsheet or SQL table.
read_csv()
function — reads a CSV file and returns a DataFramehead()
andtail()
methods — used to see the first and last few rowsinfo()
method — prints the datatypes of every variabledescribe()
method — summarizes each variable in the DataFrame. For numerical variables, displays the mean, median, etc. For categorical variables, displays the mode, number of unique values, etc.sort_values()
method — used to sort a DataFrameby
parameter — for specifying the column to sort byascending
parameter — for ascending or descending ordernlargest(n, column)
andnsmallest(n, column)
methods — returns the n rows with the largest / smallest values for the specified column