Histograms
March 22, 2019
What is a histogram?
A histogram graphically summarizes the distribution of numerical data. Typically, it consists of vertical rectangular bars which show the frequency of data (along y-axis) in successive class intervals (x-axis) of equal size. The height of each bar of a histogram is proportional to the frequency of the data.
Let’s take as sample data the heights (in cm) of 50 students in a class.
Heights (in cm) of students in a class
To plot a histogram, the first step is to divide the entire range of values into a series of intervals or bins. Here the heights range from 150.6 cm to 189.1 cm. Let’s divide the range from 150 cm to 190 cm into 8 intervals, each of width 5 cm. The frequency distribution table is:
Frequency distribution table for the above data
Each row of the table tells us the number of data points in that range. For example, the number of students with heights in range 180 - 185 is 4 (the students’ heights are 182.1, 180.9, 181.4 and 181.9).
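The frequency distribution table can also be computed rather than counted by hand. The following is a minimal sketch, assuming NumPy is available; it uses numpy.histogram (a helper chosen here, not part of the original walkthrough) with the same 8 intervals of width 5 between 150 and 190, on the same height data used in this article.

```python
import numpy as np

# Heights (in cm) of the 50 students used in this article
heights = [
    158.9, 175.4, 167.2, 162.3, 152.6, 159.2, 176.9, 155.8, 163.6, 181.9,
    167.4, 161.5, 180.9, 171.6, 165.6, 160.1, 172.7, 155.3, 171.2, 162.7,
    175.7, 182.1, 188.4, 157.0, 189.1, 168.7, 178.9, 162.3, 186.9, 178.8,
    166.0, 165.9, 169.3, 155.4, 178.9, 181.4, 162.2, 176.5, 165.9, 150.6,
    172.7, 174.9, 150.9, 152.2, 172.6, 170.7, 171.5, 169.4, 155.9, 162.7,
]

# 8 equal-width bins covering 150 to 190 (each 5 cm wide)
counts, edges = np.histogram(heights, bins=8, range=(150, 190))

# Print one row of the frequency table per class interval
for left, right, count in zip(edges[:-1], edges[1:], counts.tolist()):
    print(f"{left:.0f} - {right:.0f}: {count}")
```

Each interval includes its left edge and excludes its right edge, except the last one, which includes both.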
Here’s the histogram for this data:
We can see that the histogram visualizes the distribution of data over a continuous interval. The height of each bar in the histogram represents the frequency at each interval / bin. For example, as we discussed earlier, there are 4 students whose heights are in the range 180 - 185.
Creating histogram with Python
Now let’s see how we would create the histogram with Python. We will be using the matplotlib library, which provides a hist() function for plotting histograms. The two most important parameters the function takes are:
- range: The lower and upper range of the bins (defaults to the minimum and maximum values in the data). Any data values outside the specified range are ignored.
- bins: The number of bins/intervals to divide the range into.
Let’s see the code in action:
import matplotlib.pyplot as plt
# height data
heights = [
158.9, 175.4, 167.2, 162.3, 152.6, 159.2, 176.9, 155.8, 163.6, 181.9,
167.4, 161.5, 180.9, 171.6, 165.6, 160.1, 172.7, 155.3, 171.2, 162.7,
175.7, 182.1, 188.4, 157.0, 189.1, 168.7, 178.9, 162.3, 186.9, 178.8,
166.0, 165.9, 169.3, 155.4, 178.9, 181.4, 162.2, 176.5, 165.9, 150.6,
172.7, 174.9, 150.9, 152.2, 172.6, 170.7, 171.5, 169.4, 155.9, 162.7,
]
# plot histogram
# 'ec' stands for edge color. we add black borders to the bars.
plt.hist(heights, bins=8, range=(150,190), ec='black')
# add title and axis labels
plt.title("Histogram – Height of Students")
plt.ylabel("Frequency")
plt.xlabel("Height Bins")
# display the plot
plt.show()
Notice how we choose range and bins such that all the interval boundaries are integers: in this case, 150 - 155, 155 - 160, and so on.
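To check which intervals a given choice of range and bins produces, one option is to inspect the values that hist() returns. The sketch below reuses the height data; the variable names are illustrative, and selecting the Agg backend is only an assumption made here so that the snippet runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

heights = [
    158.9, 175.4, 167.2, 162.3, 152.6, 159.2, 176.9, 155.8, 163.6, 181.9,
    167.4, 161.5, 180.9, 171.6, 165.6, 160.1, 172.7, 155.3, 171.2, 162.7,
    175.7, 182.1, 188.4, 157.0, 189.1, 168.7, 178.9, 162.3, 186.9, 178.8,
    166.0, 165.9, 169.3, 155.4, 178.9, 181.4, 162.2, 176.5, 165.9, 150.6,
    172.7, 174.9, 150.9, 152.2, 172.6, 170.7, 171.5, 169.4, 155.9, 162.7,
]

# hist() returns the bar heights, the bin edges and the drawn bar objects
counts, edges, _ = plt.hist(heights, bins=8, range=(150, 190), ec='black')
print(edges)   # bin boundaries: 150, 155, ..., 190
print(counts)  # frequency in each bin

# Values outside the given range are ignored: only heights in 160 - 180 are counted
plt.figure()
narrow_counts, narrow_edges, _ = plt.hist(heights, bins=4, range=(160, 180), ec='black')
print(narrow_counts.sum())  # fewer than the 50 data points
```

With range=(160, 180), the students shorter than 160 cm or taller than 180 cm do not appear in any bar.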
Note: We should choose the number of bins carefully. With too few bins, the histogram reveals very little, since most data points fall into the same few bars. With too many bins, almost every data point ends up in its own bar. As a rule of thumb, for larger, realistic data sets you usually want around 20 - 50 bins.
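The effect of the bin count can be seen numerically as well. The sketch below uses synthetic, normally distributed data (an assumption for illustration, not the height data) and numpy.histogram to compare a very small, a moderate and a very large number of bins.

```python
import numpy as np

# Synthetic data for illustration only: 1000 draws from a normal distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=1000)

for n_bins in (2, 30, 1000):
    counts, _ = np.histogram(data, bins=n_bins)
    tallest = int(counts.max())          # height of the tallest bar
    empty = int((counts == 0).sum())     # number of bins with no data
    print(f"bins={n_bins:4d}  tallest bar={tallest:4d}  empty bins={empty}")
```

With 2 bins nearly all structure is lost, while with 1000 bins most bars hold only a handful of points and many bins are empty.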
Histograms for various types of frequency distributions
The shape of the histogram reflects the underlying distribution of data. Hence, often taking a single glance at the histogram allows us to easily understand the data distribution. Below, we will discuss some distributions which are encountered commonly and their corresponding histogram shapes.
Skew
Symmetric
In a symmetric distribution, data points are distributed symmetrically around the mean, so the mean is approximately equal to the median. An example of a symmetric distribution is the height of individuals in a locality. Many quantities observed in nature are distributed roughly this way.
The histogram representing a symmetric distribution has the left and right hand sides roughly equally balanced around the centre. You can observe that the left and right tails in the graph below have about the same length.
Histogram for data with symmetric distribution
Right Skewed (Positively skewed)
Variables such as the income or wealth of individuals are typically right (positively) skewed. This is because the majority of individuals have low or medium incomes, while a small minority are very rich.
The histogram of a right skewed distribution has a long tail to the right. In these cases, the mean is typically greater than the median.
Histogram for right-skewed data
Left Skewed (Negatively skewed)
Similarly, if the histogram has a long tail to the left, we can infer that the data is from a negatively skewed distribution. In a left skewed distribution, most data points are typically greater than the mean.
Observe the shape of the histogram with a long left tail. In this case, the mean is typically less than the median.
Histogram for left-skewed data
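The relationship between skew and the ordering of mean and median can be checked on synthetic data. In this sketch the normal and exponential distributions are chosen purely for illustration (the plots in this article may be based on different data): the exponential sample is right skewed, and negating it gives a left skewed sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Illustrative samples, one per shape discussed above
samples = {
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=n),
    "right skewed": rng.exponential(scale=1.0, size=n),
    "left skewed": -rng.exponential(scale=1.0, size=n),
}

for name, x in samples.items():
    print(f"{name:13s} mean={np.mean(x):+.3f}  median={np.median(x):+.3f}")
```

The symmetric sample should show mean and median close together, the right skewed sample a mean above the median, and the left skewed sample a mean below the median.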
Kurtosis
Apart from the left / right skew of the data distribution, another important thing to notice in histograms is the “tailedness” of a distribution, or how much of the data is close to the center vs in the tails. This property of the distribution is known as its kurtosis.
Mesokurtic (Normal tailed)
To measure how heavy or light-tailed a distribution is, we take the normal distribution (also known as Gaussian distribution) as the point of comparison. Below is a histogram plot of data points from a normal distribution.
Histogram for data from normal distribution
Leptokurtic (Heavy tailed)
A distribution whose tails are heavier than those of the normal distribution is called leptokurtic. Often, this kind of distribution will have heavy tails as well as a taller center, but fewer data points at a moderate distance from the center.
Histogram for data from a heavy-tailed distribution
Platykurtic (Light tailed)
A distribution whose tails are thinner than the normal distribution is called platykurtic. In some cases (as in the plot below), the tails might be completely non-existent.
Histogram for data from a distribution with non-existent tails
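Tailedness can be put into a number with the excess kurtosis: the fourth standardized moment minus 3, which is 0 for a normal distribution. The sketch below computes it with NumPy. The helper name excess_kurtosis is hypothetical, and the Laplace and uniform distributions are assumed stand-ins for a heavy-tailed and a tail-free distribution, not necessarily the ones plotted above.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
n = 100_000

k_normal = excess_kurtosis(rng.normal(size=n))           # mesokurtic reference
k_laplace = excess_kurtosis(rng.laplace(size=n))         # heavy tails (leptokurtic)
k_uniform = excess_kurtosis(rng.uniform(-1, 1, size=n))  # no tails (platykurtic)

print(f"normal:  {k_normal:+.2f}")
print(f"laplace: {k_laplace:+.2f}")
print(f"uniform: {k_uniform:+.2f}")
```

A positive value indicates heavier tails than the normal distribution and a negative value lighter tails.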
Histogram for multiple distributions
Histograms are also useful for comparing different subsets of the data.
Let’s say we have data containing the salaries (in $k) of 1000 employees in the USA and 1000 employees in the UK. We will compare their distributions by plotting two histograms in the same graph, on the same axes.
Let’s first create the data which we will be plotting:
import numpy as np
# random seed so that the generated values are always the same
np.random.seed(2)
# 1000 values for USA incomes with mean 100 and standard deviation 30
income_USA = 30 * np.random.randn(1000) + 100
# 1000 values for UK incomes with mean 85 and standard deviation 20
income_UK = 20 * np.random.randn(1000) + 85
Superimposed histogram
In a superimposed histogram, we display the histograms layered on top of each other.
Here’s the code to create a superimposed histogram:
# Superimposed histogram
plt.hist([income_USA, income_UK], label=['USA', 'UK'], histtype='stepfilled',
alpha=0.6, bins=30, ec='black')
# 'alpha' sets the 'opacity'.
# if alpha = 1.0 (default) we won't be able to see the second histogram behind.
# add title and axis labels
plt.title("Histogram – Income of Employees")
plt.ylabel("Frequency")
plt.xlabel("Income Bins")
# add legend
plt.legend(loc='upper right')
# display plot
plt.show()
Notice some things about the code above:
- To plot multiple data sets together, we pass all of them to the hist() function.
- We use the label argument to label the data, and the legend() function to display the labels.
- To create a superimposed histogram, we set the histtype parameter to 'stepfilled'. There are other histogram types as well, which we will see in the next section.
- Lastly, since the histograms are layered on top of each other, we use the alpha parameter to make them a little transparent so that we can also see ‘behind’ them.
Here’s the resulting plot:
From the histogram, we can see that on average income levels in USA are higher than in UK (note that income is on the x-axis, so higher income means the histogram is shifted to the right).
We can also see that incomes of employees from USA have a higher spread (since the histogram is wider).
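The two observations read off the plot (a higher average and a larger spread for USA) can be cross-checked against the sample statistics. A short sketch that regenerates the same data with the same seed:

```python
import numpy as np

# Same seed and parameters as the data used for the plots
np.random.seed(2)
income_USA = 30 * np.random.randn(1000) + 100
income_UK = 20 * np.random.randn(1000) + 85

# Sample mean (location) and standard deviation (spread) of each group
for name, income in (("USA", income_USA), ("UK", income_UK)):
    print(f"{name}: mean={income.mean():.1f}  std={income.std():.1f}")
```

The sample values will not be exactly 100/30 and 85/20, since they are estimates from 1000 random draws each.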
Stacked Histogram
Apart from superimposing the plots, there are other ways to plot the distributions in the same graph. One option is to stack the multiple sets of data over one another. For that, we assign the parameter histtype the value 'barstacked'.
# Stacked histogram
plt.hist([income_USA,income_UK], label=['USA','UK'], histtype='barstacked', bins=30, ec='black')
plt.title("Histogram – Income of Employees")
plt.ylabel("Frequency")
plt.xlabel("Income Bins")
plt.legend(loc='upper right')
plt.show()
Stacking makes it easier to see the combined frequency, while making it harder to see the exact frequency within each category.
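To make this trade-off concrete: the top of each stacked bar is the sum of the per-group counts in that bin. The sketch below shows this with numpy.histogram, assuming both groups are binned on the same edges (computed here from the combined data with numpy.histogram_bin_edges, a choice made for illustration).

```python
import numpy as np

# Same seed and parameters as the data used for the plots
np.random.seed(2)
income_USA = 30 * np.random.randn(1000) + 100
income_UK = 20 * np.random.randn(1000) + 85

# Shared bin edges over the combined data, so both groups use the same intervals
combined = np.concatenate([income_USA, income_UK])
edges = np.histogram_bin_edges(combined, bins=30)

counts_USA, _ = np.histogram(income_USA, bins=edges)
counts_UK, _ = np.histogram(income_UK, bins=edges)
counts_all, _ = np.histogram(combined, bins=edges)

# The stacked height in every bin equals the combined frequency
stacked_top = counts_USA + counts_UK
print(np.array_equal(stacked_top, counts_all))
```

Reading the frequency of the upper group from a stacked plot requires subtracting the lower bar, which is why the individual frequencies are harder to see.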
Side by side histogram
Finally, we can also plot the data side by side. To achieve this, we set the parameter histtype to 'bar'. This plot makes it easier to compare the data frequency for each class interval.
# Side by side histogram
plt.hist([income_USA,income_UK], label=['USA','UK'], histtype='bar', bins=20, ec='black')
plt.title("Histogram – Income of Employees")
plt.ylabel("Frequency")
plt.xlabel("Income Bins")
plt.legend(loc='upper right')
plt.show()
We can visually check the frequencies of income levels in USA and UK, and determine where one dominates the other. From the above plot, we can see that at very low and very high income levels, USA has a higher frequency, whereas in the middle regions, UK has a higher frequency.
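The same comparison can be made numerically by counting both groups on shared bins and checking, bin by bin, which group is more frequent. This is a sketch that regenerates the data with the same seed; the 20 shared bins are computed from the combined data, which is an assumption about how the intervals are formed rather than something read from the plot.

```python
import numpy as np

# Same seed and parameters as the data used for the plots
np.random.seed(2)
income_USA = 30 * np.random.randn(1000) + 100
income_UK = 20 * np.random.randn(1000) + 85

# 20 shared bins spanning the combined range of both groups
edges = np.histogram_bin_edges(np.concatenate([income_USA, income_UK]), bins=20)
counts_USA, _ = np.histogram(income_USA, bins=edges)
counts_UK, _ = np.histogram(income_UK, bins=edges)

# Report, for every class interval, which group has the higher frequency
rows = zip(edges[:-1], edges[1:], counts_USA.tolist(), counts_UK.tolist())
for left, right, usa, uk in rows:
    if usa > uk:
        leader = "USA"
    elif uk > usa:
        leader = "UK"
    else:
        leader = "tie"
    print(f"{left:6.1f} to {right:6.1f}: USA={usa:3d}  UK={uk:3d}  -> {leader}")
```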
Summary
- Histograms summarize large data sets graphically
- It is easy to understand the underlying distribution by looking at the histogram — for example, we can infer whether the distribution is symmetrical or skewed.
- Multiple distributions can be easily compared by plotting a superimposed histogram.
- You saw how to plot histograms using matplotlib’s hist() function, and the various parameters it supports such as bins, range, ec (for adding borders to the bars), label, alpha (for transparency) and histtype.
References
We covered the most important parameters that matplotlib’s hist() function supports. If you would like to read about the other parameters it supports, you can check the matplotlib documentation for hist().