Pandas: Creating Series and DataFrames

April 18, 2019

Introduction

In the previous tutorial, you used the DataFrame, a 2-dimensional data structure supported by Pandas that looks and behaves like a table.

In this tutorial, you will learn about the Series, a 1-dimensional data structure supported by Pandas. In fact, in Pandas, each column of a DataFrame is a Series.

We will also learn in depth about creating DataFrames, adding rows to a DataFrame and converting DataFrames to other formats, such as NumPy arrays or Python lists and dictionaries.

Imports

Before we start, let’s import the Pandas and NumPy libraries, as we will need them throughout the tutorial.

import pandas as pd
import numpy as np

Series

Just like the DataFrame, Series is another useful data structure provided by the Pandas library.

A Series is a 1-dimensional labeled array. It can hold any data type: integers, strings, floating point numbers, Python objects, etc.

Unlike DataFrames, which have an index (row labels) and columns (column labels), Series objects have only one set of labels: the index.
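As a quick illustration of the difference in labels, the sketch below builds a small DataFrame and pulls out one of its columns. The names people, name and age are made up for this example.

```python
import pandas as pd

# A small, made-up DataFrame with two columns
people = pd.DataFrame({'name': ['tom', 'ann'], 'age': [32, 27]})

# Selecting a single column gives back a Series
ages = people['age']
print(type(ages))

# A DataFrame has two sets of labels: index and columns
print(people.index)
print(people.columns)

# A Series has only one set of labels: its index
print(ages.index)
```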

Creating a Series

Series can be created with the following syntax:

pandas.Series(data=None, index=None)

Here’s a description of each of the parameters:

data — refers to the data to store in the Series. It can be a list, 1D numpy array, dict or a scalar value.
index — refers to the index labels. If no index is passed, the default index is range 0 to n-1. Index values must be hashable and have the same length as the data.

Additionally, if a dict of key-value pairs is passed as data and no index is passed, then the dict keys are used as the index labels.

Let us look at some examples of Series construction.

# From a list, without passing any index
s1 = pd.Series([1, 'tom', 32, 'qualified'])
print(s1)

# From a list, with an index
s2 = pd.Series([1, 'tom', 32, 'qualified'], index=['number', 'name', 'age', 'status'])
print(s2)

# From a list of integer values, with an index
s3 = pd.Series([1, 345, 14, 24, 12], index=['first', 'second', 'third', 'fourth', 'fifth'])
print(s3)

# From a dict of key-value pairs
s4 = pd.Series({'number':1, 'name':'tom', 'age':32, 'status':'qualified'})
print(s4)
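The parameter description also mentions 1D NumPy arrays and scalar values, which the examples above do not cover. Here is a minimal sketch of both; the names s5 and s6 and the values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# From a 1D NumPy array, with an index
s5 = pd.Series(np.array([1.5, 2.5, 3.5]), index=['a', 'b', 'c'])
print(s5)

# From a scalar value: the value is repeated for every index label
s6 = pd.Series(7, index=['x', 'y', 'z'])
print(s6)
```

With a scalar, the index determines the length of the Series, so it is worth passing one explicitly.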

You will notice that the data type (dtype) of the Series is inferred from the elements passed to it. It is chosen automatically so that all elements in the Series are of the same dtype (or a sub-type of that dtype).

For example, if all the values are integers, the dtype will be int64. If the values are a mix of integers and floats, the dtype of the Series will be float64. Similarly, if any of the values is a string, the Series’ dtype will be object.
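The three cases described above can be checked directly. This is a small sketch with made-up values; the exact integer dtype may depend on your platform.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed_numbers = pd.Series([1, 2.5, 3])
with_string = pd.Series([1, 2.5, 'three'])

print(ints.dtype)           # an integer dtype
print(mixed_numbers.dtype)  # a float dtype
print(with_string.dtype)    # object
```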

Creating a DataFrame

Now we will look at the different ways of creating DataFrames. We use the following syntax to create a DataFrame.

pandas.DataFrame(data=None, index=None, columns=None)

Here’s a description of each of the parameters:

data — a 2D array, or a dict (of 1D array, Series, list or dicts).
index — row index values for the DataFrame that will be created. If not specified, row index values default to the range _0_ to _n-1_ (where n is the number of rows).
columns — column labels of the DataFrame that will be created. If not specified, column labels default to the range _0_ to _c-1_ (where c is the number of columns).

There are many other ways to pass information about the row index values and the column labels as well, which we will see soon in this section.
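To make the default and explicit labels concrete, here is a minimal sketch. The data and the labels r1, r2, a, b and c are invented for this example.

```python
import pandas as pd

values = [[1, 2, 3], [4, 5, 6]]

# No labels passed: rows are labelled 0 to n-1 and columns 0 to c-1
default_df = pd.DataFrame(data=values)
print(default_df)

# Explicit row and column labels
labelled_df = pd.DataFrame(data=values, index=['r1', 'r2'], columns=['a', 'b', 'c'])
print(labelled_df)
```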

Creating DataFrame from ndarray

In this section, we will see how to create a DataFrame using a numpy ndarray as data, and a list of column labels.

We will:

Create a random ndarray with 7 rows and 5 columns: np.random.rand(7, 5).
Create a list of 5 column labels: ['col1','col2','col3','col4','col5']
Create the DataFrame

Let us put these steps together and create a DataFrame:

# Create a random 7 x 5 numpy ndarray
np.random.seed(42) # set a seed so that the same random numbers are generated each time
np_array = 10 * np.random.rand(7, 5)
# Create a list of 5 column labels
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
# Create the DataFrame
ndf = pd.DataFrame(data=np_array, columns=cols)
# Display dataframe
print(ndf)

Creating DataFrame from dict

Next, let’s see how to create a DataFrame using a dict of Series.

First, we will create a dict of key-value pairs, where the values are pandas Series. This will be the data parameter.

# make three Series
s1 = pd.Series([10, 20, 30, 40, 50])
s2 = pd.Series(['a', 'b', 'c', 'd', 'e'])
s3 = pd.Series(['one', 'two', 'three', 'four', 'five'])
# create a dict 
data_dict = {'col1': s1, 'col2': s2, 'col3': s3}

If we don’t explicitly pass a list of labels to the columns parameter, the column labels in the constructed DataFrame will be the dict keys, in order.

# create dataframe
df = pd.DataFrame(data=data_dict)
# display dataframe
print(df)

If we pass a list to the columns parameter, then only the dictionary keys which match the list of column labels are kept in the DataFrame.

So, first we must create a list of columns which matches some of the dict keys.

cols = ['col1', 'col2']

Then, we can create the DataFrame with these columns:

# data_dict same as defined earlier
# create a list of column labels
cols = ['col1', 'col2']
# create DataFrame 
df = pd.DataFrame(data=data_dict, columns=cols)
# display DataFrame
print(df)
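The same constructor also accepts a dict of plain Python lists. The sketch below uses one, and passes a column label that is not a key in the dict to show what happens in that case. The names list_dict, partial_df and the label col9 are made up for this example.

```python
import pandas as pd

# A dict of plain Python lists works like a dict of Series
list_dict = {'col1': [10, 20, 30], 'col2': ['a', 'b', 'c']}

# 'col9' is a made-up label that is not a key in list_dict
partial_df = pd.DataFrame(data=list_dict, columns=['col1', 'col9'])
print(partial_df)
```

Here col2 is dropped because it is not in the list of columns, and col9 appears as a column filled with NaN because there is no data for it.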

DataFrame from multiple lists as rows

Finally, let’s see an example where we use a list of lists to create a DataFrame. This is similar to creating a DataFrame using a 2D NumPy array.

Let us start with the following lists of rows:

a1 = ['one', 1, 'up', 'top', 'beauty']
a2 = ['zero', 0, 'down', 'bottom', 'charm']

Then, we combine the two lists above to get a list of lists.

l = [a1, a2]

Let’s see the full code:

# create multiple lists (one per row)
a1 = ['one', 1, 'up', 'top', 'beauty']
a2 = ['zero', 0, 'down', 'bottom', 'charm']
# combine the data into a single list
l = [a1, a2]
# create a list of column names
col = ['col1', 'col2', 'col3', 'col4', 'col5']
# create the DataFrame
df2 = pd.DataFrame(data=l, columns=col)
# display DataFrame
print(df2)

Awesome! We saw how to create DataFrames using several methods. Depending on how your data is stored initially, or where you are getting your data from, some of these methods will be more convenient than others for creating the DataFrame.

Creating or adding rows

In this section we will learn how to add rows to an existing DataFrame.

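To do this, we will be using the .append() DataFrame method. This method allows us to add a single row or multiple rows to the DataFrame. Note that append() was deprecated in Pandas 1.4 and removed in Pandas 2.0, so the examples in this section need an earlier version; on newer versions, pd.concat() does the same job.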

Let us see the syntax:

DataFrame.append(other, ignore_index=False)

where,

other - can be a Series or a dictionary. In case you want to pass multiple rows, you can also pass a list of Series or dictionaries, or even a DataFrame.
ignore_index - if True, it ignores the index of the object passed in other and labels the rows of the result 0 to n-1 instead. If False (the default value), it preserves the index from other. Appending a dict or an unnamed Series requires ignore_index=True.

There are a few other parameters which give us more options, but we won’t be going over them in this tutorial.

append() returns a new DataFrame with the added row(s). It does not modify the original DataFrame.

Note: If the columns of the DataFrame and the other object don’t match, the result contains the columns of both, and any cell that has no value is filled with NaN. We will avoid doing this for now and assume that the other object we pass has exactly the same columns as the DataFrame.
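The NaN filling described in the note can be seen in a small sketch. It uses pd.concat, which aligns columns in the same way and also works on Pandas 2.0 and later, where append() is no longer available. The names base, extra and combined are invented for this example.

```python
import pandas as pd

base = pd.DataFrame({'col1': ['one', 'zero'], 'col2': [1, 0]})

# The new row has a column ('col3') that base lacks, and no value for 'col2'
extra = pd.DataFrame([{'col1': 'two', 'col3': 'blue'}])

combined = pd.concat([base, extra], ignore_index=True)
print(combined)
```

The result has three columns. The new row has NaN under col2, and the two original rows have NaN under col3.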

Let us see a few examples.

We will use the last DataFrame that we created in the previous section. Let’s take a look at it:

# display DataFrame
print(df2)

Dict as new row

Let us use a dict with key-value pairs as a new row.

The keys denote the column names and values denote the values of corresponding columns in the new row.

We will first declare the dictionary and then pass the dictionary as other.

# declare a dict whose keys match the column labels of df2
row = {'col1':'two', 'col2':2, 'col3':'blue', 'col4':'green', 'col5':'red'}
# pass it to the append() method
new_df = df2.append(row, ignore_index=True)
# display the new DataFrame
print(new_df)

Series as new row

Next, we append a Series as a new row, where the index of the Series is the same as the column names of df2.

We will first declare the Series with the required index and then pass it to the append method.

# create a series with column labels of "df2" as index
row = pd.Series(['three',3,'black','white','grey'],
                index=df2.columns)
# pass it to the append() method
new_df = df2.append(row, ignore_index=True)
# display the new DataFrame
print(new_df)

Adding multiple rows

We can also append multiple rows to the DataFrame by passing a list of Series or Python dictionaries.

Let us look at an example using Series. We will pass a list with multiple Series to the append method.

# create two series with column labels of "df2" as index
row1 = pd.Series(['four',4,'left','right','center'],
                 index=df2.columns)
row2 = pd.Series(['five',5,'Winterfell','Eyrie','Sunspear'],
                 index=df2.columns)
# pass both rows as a list to the append() method
new_df = df2.append([row1, row2], ignore_index=True)
# display the new DataFrame
print(new_df)

Great! The new DataFrame now has two more rows.
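If your Pandas version no longer provides append() (it was removed in Pandas 2.0), the same rows can be added with pd.concat. The following is only a sketch of one way to do it: the starting DataFrame is rebuilt under the hypothetical name df2_copy so the snippet runs on its own, and dict_row, series_row, new_rows and combined_df are also names chosen for this example.

```python
import pandas as pd

# Same contents as df2, rebuilt so that this snippet is self-contained
df2_copy = pd.DataFrame(
    data=[['one', 1, 'up', 'top', 'beauty'],
          ['zero', 0, 'down', 'bottom', 'charm']],
    columns=['col1', 'col2', 'col3', 'col4', 'col5'])

# One row given as a dict and one given as a Series
dict_row = {'col1': 'two', 'col2': 2, 'col3': 'blue', 'col4': 'green', 'col5': 'red'}
series_row = pd.Series(['three', 3, 'black', 'white', 'grey'], index=df2_copy.columns)

# Wrap the new rows in a DataFrame, then concatenate
new_rows = pd.DataFrame([dict_row, series_row.to_dict()])
combined_df = pd.concat([df2_copy, new_rows], ignore_index=True)
print(combined_df)
```

Like append(), pd.concat returns a new DataFrame and leaves df2_copy unchanged.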

Converting a DataFrame to other formats

Earlier in the tutorial, you saw how to create DataFrames using NumPy arrays, lists, dictionaries, etc.

Often, it is also useful to convert a DataFrame back to one of these forms. In this section, we will see how to convert DataFrames to NumPy arrays, Python dictionaries, etc.

Converting to an `ndarray` (NumPy array)

First, let’s see how to convert a DataFrame to a NumPy ndarray. There are two ways to do this:

using the Pandas DataFrame .values attribute
using the Pandas DataFrame .to_numpy() method

values attribute

The values attribute extracts all the values from a Pandas DataFrame in the form of a NumPy ndarray. Its syntax is as follows:

DataFrame.values

This returns a NumPy ndarray. The datatype (dtype) of the ndarray will be chosen such that it can preserve and accommodate all the values from the DataFrame.

For example, if the DataFrame contains integers and floats, the dtype of the ndarray will be float. But if the DataFrame has numeric as well as non-numeric values, then the dtype will be object.

Let us see an example. We will reuse the ndf DataFrame we created at the beginning of the tutorial.

# print DataFrame 
print('The DataFrame')
print(ndf)
print("")
# use the values attribute to return an ndarray
print('Using values attribute')
print(ndf.values)
print("")
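Since ndf holds only floats, the example above does not show how the dtype changes with mixed columns. Here is a minimal sketch with two made-up DataFrames, numeric_df and mixed_df.

```python
import pandas as pd

numeric_df = pd.DataFrame({'ints': [1, 2], 'floats': [0.5, 1.5]})
mixed_df = pd.DataFrame({'ints': [1, 2], 'text': ['a', 'b']})

# Integers and floats together are stored as floats
print(numeric_df.values.dtype)

# Numeric and non-numeric values together are stored as object
print(mixed_df.values.dtype)
```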

to_numpy() method

We can also use the to_numpy() method to convert a DataFrame to a NumPy ndarray. The syntax is as follows:

DataFrame.to_numpy(dtype=None)

In this case too, the dtype of the ndarray is chosen such that it can preserve and accommodate all the values from the DataFrame. But with this method we can also explicitly specify the dtype by passing the dtype parameter.

Let’s see an example. First, we will just use the method as is, so the result is the same as before. Then, we will specify dtype='int', which drops the fractional part of each value.

# use the to_numpy() method to convert to ndarray
print('Using to_numpy() method')
print(ndf.to_numpy())
print("")
# use to_numpy() method with explicit dtype
print('Using to_numpy() method with dtype="int"')
print(ndf.to_numpy(dtype='int'))
print("")

Note: The to_numpy() method was added in Pandas version 0.24.0. If you are following this tutorial on your own computer, make sure your Pandas is at least this version before using the method. You can check the Pandas version by typing pd.__version__.
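One detail worth seeing on small numbers is that converting floats with dtype='int' truncates them rather than rounding. The sketch below uses a made-up DataFrame named prices.

```python
import pandas as pd

prices = pd.DataFrame({'low': [1.9, -1.9], 'high': [2.1, 7.5]})

as_float = prices.to_numpy()
as_int = prices.to_numpy(dtype='int')

print(as_float)
print(as_int)  # fractional parts are dropped, not rounded
```

If you need rounded values instead, round the DataFrame first (for example with prices.round()) and then convert.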

Converting to a `dict` or a `list`

We can also convert a DataFrame to a Python dict. We will use the to_dict() method to do this. The syntax is as follows:

DataFrame.to_dict(orient='dict')

The orient parameter can take a number of values, but we will concentrate on four of them: "dict", "list", "series" and "records".

For the first three values ("dict", "list" and "series"), the method returns a dict of key-value pairs, where the keys are the column labels of the DataFrame. The type of each value is set by the orient parameter (dict, list or Series).
"dict" - this is the default value. Each value of the returned dict is itself a dict, with the row index labels as keys and the elements of the column as values.
"list" - each value of the dict is a list of the corresponding column’s elements.
"series" - each value of the dict is a Series of the column’s elements, with the row index as the index of the Series. The dtype of each Series is inferred from the data.
For the last value, "records", the method returns a list with one dict corresponding to each row in the DataFrame.

Let’s convert the DataFrame new_df from the “Adding multiple rows” section into these various formats. We’ll use pprint, Python’s pretty-print library, to print the results in a nicely formatted way so that they are easier to read.

import pprint
# print the actual dataframe
print('The dataframe')
print(new_df)
print('')
print('to_dict() with orient="dict"')
pprint.pprint(new_df.to_dict(orient='dict'))
print('')
print('to_dict() with orient="list"')
pprint.pprint(new_df.to_dict(orient='list'))
print('')
print('to_dict() with orient="series"')
pprint.pprint(new_df.to_dict(orient='series'))
print('')
print('to_dict() with orient="records"')
pprint.pprint(new_df.to_dict(orient='records'))
print('')
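The shapes of these results are easier to compare on a very small DataFrame. The sketch below uses a made-up two-row DataFrame named small_df, and also shows that a list of records can be passed back to the DataFrame constructor.

```python
import pandas as pd

small_df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# dict of dicts: column label -> {row label -> value}
as_dict = small_df.to_dict(orient='dict')
# dict of lists: column label -> [values]
as_list = small_df.to_dict(orient='list')
# list of dicts: one dict per row
as_records = small_df.to_dict(orient='records')

print(as_dict)
print(as_list)
print(as_records)

# A list of records can be passed straight back to the constructor
rebuilt = pd.DataFrame(as_records)
print(rebuilt)
```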

Summary

Series is a 1-dimensional data-structure supported by Pandas.
Series objects have only one set of labels.
Each column of a Pandas DataFrame is a Series.
We can create DataFrames from ndarrays, dicts, etc., and also convert a DataFrame back to these formats.
We can add rows to a DataFrame using append().

Reference

Creating a Series:

pandas.Series(data=None, index=None)
# data is usually 1D numpy array, list or dict

Creating a DataFrame:

pandas.DataFrame(data=None, index=None, columns=None)
# data is usually a 2D array, or
# a dict where each key-value pair represents a column.
# the values can be 1D arrays, Series, lists or dicts

Append rows:

DataFrame.append(other, ignore_index=False)
# other can be a Series, dict or DataFrame,
# or a list of Series or dicts

Converting DataFrame to other formats:

# to ndarray
DataFrame.values
DataFrame.to_numpy(dtype=None)
# to Python dict with key-value pairs representing columns
DataFrame.to_dict(orient='dict') # orient can also be "list" or "series"
# to Python list with each element representing a row
DataFrame.to_dict(orient='records')