Pandas: Creating Series and DataFrames
April 18, 2019
Introduction
In the previous tutorial, you used the DataFrame, a 2-dimensional data structure supported by Pandas that looks and behaves like a table.
In this tutorial, you will learn about the Series, a 1-dimensional data structure supported by Pandas. In fact, each column of a DataFrame is a Series.
We will also learn in depth about creating DataFrames, adding rows to a DataFrame and converting DataFrames to other formats, such as NumPy arrays or Python lists and dictionaries.
Imports
Before we start, let’s import the Pandas and NumPy libraries, as we will need them throughout the tutorial.
import pandas as pd
import numpy as np
Series
Just like the DataFrame, Series is another useful data structure provided by the Pandas library.
A Series is a 1-dimensional labeled array. It can hold any data type: integers, strings, floating point numbers, Python objects, etc.
Unlike DataFrames, which have an index (row labels) and columns (column labels), Series objects have only one set of labels.
Creating a Series
Series can be created with the following syntax:
pandas.Series(data=None, index=None)
Here’s a description of each of the parameters:
- data: refers to the data to store in the Series. It can be a list, 1D NumPy array, dict or a scalar value.
- index: refers to the index labels. If no index is passed, the default index is the range 0 to n-1 (where n is the number of elements). Index values must be hashable and have the same length as the data.
Additionally, if a dict of key-value pairs is passed as data and no index is passed, then the keys are used as the index.
Let us look at some examples of Series construction.
# From a list, without passing any index
s1 = pd.Series([1, 'tom', 32, 'qualified'])
print(s1)
# From a list, with an index
s2 = pd.Series([1, 'tom', 32, 'qualified'], index=['number', 'name', 'age', 'status'])
print(s2)
# From a list of integer values, with an index
s3 = pd.Series([1, 345, 14, 24, 12], index=['first', 'second', 'third', 'fourth', 'fifth'])
print(s3)
# From a dict of key-value pairs
s4 = pd.Series({'number':1, 'name':'tom', 'age':32, 'status':'qualified'})
print(s4)
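The examples above cover lists and dicts. As a minimal sketch, the two other kinds of data listed in the parameter description, a 1D NumPy array and a scalar value, can be passed in the same way. The variable names and index labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a 1D NumPy array, with an index
s_arr = pd.Series(np.array([1.5, 2.5, 3.5]), index=['a', 'b', 'c'])
print(s_arr)

# From a scalar value: the scalar is repeated once for every index label
s_scalar = pd.Series(7, index=['x', 'y', 'z'])
print(s_scalar)
```

With a scalar, it is the index that decides how long the Series is.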
You will notice that the data type (dtype) of the Series is inferred from the elements passed to the Series. It is automatically chosen such that all elements in the Series are of the same dtype (or a sub-type of that dtype).
For example, if all the values are integers, the dtype will be int64. If the values are a mix of integers and floats, the dtype of the Series will be float64. Similarly, if any of the values is a string, the Series’ dtype will be object.
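A quick way to check these rules is to build a few small Series and inspect their dtype attribute. This is only a sketch; the variable names are illustrative and not part of the earlier examples.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])            # only integers
mixed = pd.Series([1, 2.5, 3])         # integers and floats
with_str = pd.Series([1, 2.5, 'tom'])  # numbers plus one string

print(ints.dtype)      # an integer dtype (int64 on most platforms)
print(mixed.dtype)     # float64
print(with_str.dtype)  # object
```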
Creating a DataFrame
Now we will look at the different ways for creating DataFrames. We use the following syntax to create a DataFrame.
pandas.DataFrame(data=None, index=None, columns=None)
Here’s a description of each of the parameters:
- data: a 2D array, or a dict (of 1D arrays, Series, lists or dicts).
- index: row index values for the DataFrame that will be created. If not specified, row index values default to the range 0 to n-1 (where n is the number of rows).
- columns: column labels of the DataFrame that will be created. If not specified, column labels default to the range 0 to c-1 (where c is the number of columns).
There are many other ways to pass information about the row index values and the column labels as well, which we will see soon in this section.
Creating DataFrame from ndarray
In this section, we will see how to create a DataFrame using a NumPy ndarray as data, and a list of column labels.
We will:
- Create a random ndarray with 7 rows and 5 columns: np.random.rand(7, 5).
- Create a list of 5 column labels: ['col1','col2','col3','col4','col5']
- Create the DataFrame
Let us put these steps together and create a DataFrame:
# Create a random 7 x 5 numpy ndarray
np.random.seed(42) # set a seed so that the same random numbers are generated each time
np_array = 10 * np.random.rand(7, 5)
# Create a list of 5 column labels
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
# Create the DataFrame
ndf = pd.DataFrame(data=np_array, columns=cols)
# Display dataframe
print(ndf)
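The DataFrame above uses the default row index. As a small sketch of how the index and columns parameters interact with an ndarray, the snippet below builds one DataFrame with explicit labels and one with none. The array contents and the labels 'r1' to 'r3' are invented for this illustration.

```python
import numpy as np
import pandas as pd

# A small 3 x 2 array: [[0, 1], [2, 3], [4, 5]]
values = np.arange(6).reshape(3, 2)

# Explicit row index and column labels
labeled = pd.DataFrame(data=values, index=['r1', 'r2', 'r3'], columns=['col1', 'col2'])
print(labeled)

# No labels given: rows default to 0..n-1 and columns to 0..c-1
default = pd.DataFrame(data=values)
print(default)
```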
Creating DataFrame from dict
Next, let’s see how to create a DataFrame using a dict of Series.
First, we will create a dict of key-value pairs, where the values are pandas Series. This will be the data parameter.
# make three Series
s1 = pd.Series([10, 20, 30, 40, 50])
s2 = pd.Series(['a', 'b', 'c', 'd', 'e'])
s3 = pd.Series(['one', 'two', 'three', 'four', 'five'])
# create a dict
data_dict = {'col1': s1, 'col2': s2, 'col3': s3}
If we don’t pass a list of labels to the columns parameter explicitly, the column labels in the constructed DataFrame will be the dict keys, in the order in which they appear in the dict.
# create dataframe
df = pd.DataFrame(data=data_dict)
# display dataframe
print(df)
If we pass a list to the columns parameter, then only the dictionary keys which match the list of column labels are kept in the DataFrame.
So, first we must create a list of columns which matches some of the dict keys.
cols = ['col1', 'col2']
Then, we can create the DataFrame with these columns:
# data_dict same as defined earlier
# create a list of column labels
cols = ['col1', 'col2']
# create DataFrame
df = pd.DataFrame(data=data_dict, columns=cols)
# display DataFrame
print(df)
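The columns list works as a selection in both directions: a dict key that is not listed is dropped, and a listed label that has no matching key becomes a column filled with NaN. The sketch below shows both cases with a shortened, hypothetical version of data_dict; 'col9' is a made-up label that is deliberately absent from the dict.

```python
import pandas as pd

data_dict = {
    'col1': pd.Series([10, 20, 30]),
    'col2': pd.Series(['a', 'b', 'c']),
}

# 'col2' is left out of the list; 'col9' is not a key in data_dict
df_partial = pd.DataFrame(data=data_dict, columns=['col1', 'col9'])
print(df_partial)
```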
DataFrame from multiple lists as rows
Finally, let’s see an example where we use a list of lists to create a DataFrame. This is similar to creating a DataFrame using a 2D Numpy array.
Let us start with the following lists of rows:
a1 = ['one', 1, 'up', 'top', 'beauty']
a2 = ['zero', 0, 'down', 'bottom', 'charm']
Then, we combine both the above lists, to get a list of lists.
l = [a1, a2]
Let’s see the full code:
# create multiple lists (one per row)
a1 = ['one', 1, 'up', 'top', 'beauty']
a2 = ['zero', 0, 'down', 'bottom', 'charm']
# combine the data into a single list
l = [a1, a2]
# create a list of column names
col = ['col1', 'col2', 'col3', 'col4', 'col5']
# create the DataFrame
df2 = pd.DataFrame(data=l, columns=col)
# display DataFrame
print(df2)
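To see the similarity with the 2D NumPy array case, the sketch below builds the same DataFrame twice, once from a list of lists and once from the equivalent ndarray, and compares them. The all-float sample data is an assumption made so that both versions end up with the same dtype.

```python
import numpy as np
import pandas as pd

rows = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
cols = ['col1', 'col2', 'col3']

from_lists = pd.DataFrame(data=rows, columns=cols)
from_array = pd.DataFrame(data=np.array(rows), columns=cols)

# equals() compares values, labels and dtypes
print(from_lists.equals(from_array))
```

One difference worth keeping in mind: a list of lists with mixed types, such as a1 and a2 above, keeps a separate dtype for each column, whereas a NumPy array has a single dtype for all of its elements.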
Awesome! We saw how to create DataFrames in several different ways. Depending on how your data is stored initially, or where you are getting your data from, some of these methods will be more convenient than others for creating the DataFrame.
Creating or adding rows
In this section we will learn how to add rows to an existing DataFrame.
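To do this, we will be using the .append() DataFrame method. This method allows us to add a single row or multiple rows to the DataFrame.
Note: .append() was deprecated in Pandas 1.4 and removed in Pandas 2.0, so the examples in this section need an older Pandas version. On newer versions, pandas.concat() serves the same purpose.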
Let us see the syntax:
DataFrame.append(other, ignore_index=False)
where,
- other: can be a Series or a dict. In case you want to add multiple rows, you can also pass a list of Series, a list of dicts or even a DataFrame.
- ignore_index: if True, the index of the object passed in other is ignored and the rows of the resulting DataFrame are relabeled 0 to n-1. If False (the default value), the index from other is preserved.
There are a few other parameters which give us more options, but we won’t be going over them in this tutorial.
append() returns a new DataFrame with the added row(s). It does not modify the original DataFrame.
Note: If the columns of the DataFrame and the other object don’t match, any column present in only one of them is kept, and the cells with no data are filled with NaN. We will avoid doing this for now and assume that the other object we pass has exactly the same columns as the DataFrame.
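The NaN filling described in the note can be seen in a small sketch. It uses pandas.concat() rather than append(), an assumption made so that the snippet also runs on Pandas versions where append() is not available; concat() fills non-matching columns in the same way. The DataFrame and the extra row are invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['one', 'zero'], 'col2': [1, 0]})

# The new row has no 'col2' value and brings an extra column, 'col3'
extra = pd.DataFrame([{'col1': 'two', 'col3': 'blue'}])

combined = pd.concat([df, extra], ignore_index=True)
print(combined)

# The original DataFrame is left unchanged
print(df)
```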
Let us see a few examples.
We will use the last DataFrame that we created in the previous section. Let’s take a look at it:
# display DataFrame
print(df2)
Dict as new row
Let us use a dict with key-value pairs as a new row.
The keys denote the column names and values denote the values of corresponding columns in the new row.
We will first declare the dictionary and then pass the dictionary as other. Note that a dict can only be appended with ignore_index=True.
# declare a dict whose keys match the column labels of df2
row = {'col1':'two', 'col2':2, 'col3':'blue', 'col4':'green', 'col5':'red'}
# pass it to the append() method
new_df = df2.append(row, ignore_index=True)
# display the new DataFrame
print(new_df)
Series as new row
Next we append a Series as a new row, where the index of the Series is the same as the column names of df2.
We will first declare the Series with the required index and then pass it to the append method.
# create a series with column labels of "df2" as index
row = pd.Series(['three', 3, 'black', 'white', 'grey'],
                index=df2.columns)
# pass it to the append() method
new_df = df2.append(row, ignore_index=True)
# display the new DataFrame
print(new_df)
Adding multiple rows
We can also append multiple rows to the DataFrame by passing a list of Series or Python dictionaries.
Let us look at an example using Series. We will pass a list with multiple Series to the append method.
# create two series with column labels of "df2" as index
row1 = pd.Series(['four', 4, 'left', 'right', 'center'],
                 index=df2.columns)
row2 = pd.Series(['five', 5, 'Winterfell', 'Eyrie', 'Sunspear'],
                 index=df2.columns)
# pass it to the append() method
new_df = df2.append([row1, row2], ignore_index=True)
# display the new DataFrame
print(new_df)
Great! The new DataFrame now has two more rows.
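On Pandas 2.0 and later, DataFrame.append() is no longer available. A minimal sketch of the same operations with pandas.concat() follows. df2 is rebuilt here so that the snippet is self-contained, and the names dict_df and series_df are illustrative only.

```python
import pandas as pd

df2 = pd.DataFrame(
    [['one', 1, 'up', 'top', 'beauty'],
     ['zero', 0, 'down', 'bottom', 'charm']],
    columns=['col1', 'col2', 'col3', 'col4', 'col5'],
)

# Dict as new row: wrap it in a list to get a one-row DataFrame first
row = {'col1': 'two', 'col2': 2, 'col3': 'blue', 'col4': 'green', 'col5': 'red'}
dict_df = pd.concat([df2, pd.DataFrame([row])], ignore_index=True)
print(dict_df)

# Series as new rows: a list of Series becomes one DataFrame row per Series
row1 = pd.Series(['four', 4, 'left', 'right', 'center'], index=df2.columns)
row2 = pd.Series(['five', 5, 'Winterfell', 'Eyrie', 'Sunspear'], index=df2.columns)
series_df = pd.concat([df2, pd.DataFrame([row1, row2])], ignore_index=True)
print(series_df)
```

Like append(), concat() returns a new DataFrame and leaves df2 untouched.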
Converting a DataFrame to other formats
Earlier in the tutorial, you saw how to create DataFrames using NumPy arrays, lists, dictionaries, etc.
Often, it is also useful to convert a DataFrame back to one of these forms. In this section, we will see how to convert DataFrames to NumPy arrays, Python dictionaries, etc.
Converting to an ndarray (NumPy array)
First, let’s see how to convert a DataFrame to a NumPy ndarray. There are two ways to do this:
- using the Pandas DataFrame .values attribute
- using the Pandas DataFrame .to_numpy() method
values attribute
The values attribute extracts all the values from a Pandas DataFrame in the form of a NumPy ndarray. Its syntax is as follows:
DataFrame.values
This returns a NumPy ndarray. The data type (dtype) of the ndarray will be chosen such that it can preserve and accommodate all the values from the DataFrame.
For example, if the DataFrame contains integers and floats, the dtype of the ndarray will be float. But if the DataFrame has numeric as well as non-numeric values, then the dtype will be object.
Let us see an example. We will reuse the ndf DataFrame we created at the beginning of the tutorial.
# print DataFrame
print('The DataFrame')
print(ndf)
print("")
# use the values attribute to return an ndarray
print('Using values attribute')
print(ndf.values)
print("")
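Since ndf only holds floats, it does not show how the dtype of the resulting ndarray is picked for mixed columns. The sketch below checks the two cases described earlier using two small, made-up DataFrames.

```python
import pandas as pd

# One integer column and one float column
numeric_df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

# One integer column and one string column
mixed_df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

print(numeric_df.values.dtype)  # float64
print(mixed_df.values.dtype)    # object
```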
to_numpy() method
We can also use the to_numpy() method to convert a DataFrame to a NumPy ndarray. The syntax is as follows:
DataFrame.to_numpy(dtype=None)
In this case too, the dtype of the ndarray is by default chosen such that it can preserve and accommodate all the values from the DataFrame. But with this method we can also explicitly specify the dtype by passing the dtype parameter.
Let’s see an example. First, we will just use the method as is, so the result is the same as before. Then, we will specify dtype='int', which drops the decimal part of each float.
# use the to_numpy() method to convert to ndarray
print('Using to_numpy() method')
print(ndf.to_numpy())
print("")
# use to_numpy() method with explicit dtype
print('Using to_numpy() method with dtype="int"')
print(ndf.to_numpy(dtype='int'))
print("")
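With ten random numbers per row, the effect of dtype='int' can be hard to spot. A smaller sketch with hand-picked values makes it easier to see that the conversion truncates rather than rounds. The prices DataFrame and its column names are invented for this example.

```python
import pandas as pd

prices = pd.DataFrame({'low': [1.9, 2.2], 'high': [3.7, 4.5]})

as_float = prices.to_numpy()           # dtype inferred from the data
as_int = prices.to_numpy(dtype='int')  # decimal part is dropped, not rounded

print(as_float)
print(as_int)  # [[1 3]
               #  [2 4]]
```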
Note: The to_numpy() method was added in Pandas version 0.24.0. If you are following this tutorial on your own computer, make sure your Pandas is at least this version before using the method. You can check the Pandas version with print(pd.__version__).
Converting to a dict or a list
We can also convert a DataFrame to a Python dict. We will use the to_dict() method to do this. The syntax is as follows:
DataFrame.to_dict(orient='dict')
The orient parameter can take a number of arguments, but we will concentrate on four of them: "dict", "list", "series" and "records".
- For the first three arguments ("dict", "list" and "series"), the method returns a dict of key-value pairs, where the keys are column labels of the DataFrame. The data structure of the values is as specified by the orient parameter (dict, list or Series).
  - "dict": this is the default argument. The values of the returned dict are dicts themselves, with the row index as keys and the elements of the column as values.
  - "list": the values of the dict are lists of the corresponding column elements.
  - "series": the values of the dict are Series of column elements, with the row index as the index labels of the Series. The dtype of each Series is inferred from the data.
- For the last argument, "records", the method returns a list with one dict corresponding to each row in the DataFrame.
Let’s convert the DataFrame new_df from the “Adding multiple rows” section into these various formats. We’ll use Python’s pretty print library, pprint, to print the results in a nicely formatted way so that they are easier to read.
import pprint
# print the actual dataframe
print('The dataframe')
print(new_df)
print('')
print('to_dict() with orient="dict"')
pprint.pprint(new_df.to_dict(orient='dict'))
print('')
print('to_dict() with orient="list"')
pprint.pprint(new_df.to_dict(orient='list'))
print('')
print('to_dict() with orient="series"')
pprint.pprint(new_df.to_dict(orient='series'))
print('')
print('to_dict() with orient="records"')
pprint.pprint(new_df.to_dict(orient='records'))
print('')
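For reference, here is a sketch of what each orient value returns for a tiny two-row DataFrame, with the results written out as comments. The DataFrame small and its contents are made up for this illustration.

```python
import pandas as pd

small = pd.DataFrame({'name': ['tom', 'ann'], 'age': [32, 28]})

as_dict = small.to_dict(orient='dict')
# {'name': {0: 'tom', 1: 'ann'}, 'age': {0: 32, 1: 28}}

as_list = small.to_dict(orient='list')
# {'name': ['tom', 'ann'], 'age': [32, 28]}

as_series = small.to_dict(orient='series')
# a dict with one pandas Series per column

as_records = small.to_dict(orient='records')
# [{'name': 'tom', 'age': 32}, {'name': 'ann', 'age': 28}]
```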
Summary
- Series is a 1-dimensional data-structure supported by Pandas.
- Series objects have only one set of labels.
- Each column of a Pandas DataFrame is a Series.
- We can create DataFrames from ndarrays, dicts, etc., and also convert a DataFrame back to these formats.
- We can add rows to a DataFrame using append().
Reference
Creating a Series:
pandas.Series(data=None, index=None)
# data is usually 1D numpy array, list or dict
Creating a DataFrame:
pandas.DataFrame(data=None, index=None, columns=None)
# data is usually a 2D array, or
# a dict where each key-value pair represents a column.
# the values can be 1D arrays, Series, lists or dicts
Append rows:
DataFrame.append(other, ignore_index=False)
# other can be 1D array, Series, list, dict, DataFrame
# or a list of 1D arrays, Series, lists, dicts, DataFrames
Converting DataFrame to other formats:
# to ndarray
DataFrame.values
DataFrame.to_numpy(dtype=None)
# to Python dict with key-value pairs representing columns
DataFrame.to_dict(orient='dict') # orient can also be "list" or "series"
# to Python list with each element representing rows
DataFrame.to_dict(orient='records')