Hive Tutorial
February 09, 2018
What is Hive?
Apache Hive is a Hadoop application designed for data analytics on structured data inside the Hadoop ecosystem. Hive sits on top of HDFS and provides an easy way to query data using a SQL-like language called HiveQL.
Hive was initially developed by Facebook to provide their Business Analysts a way to analyze Hadoop data without having to teach them how to write MapReduce. Facebook donated Hive to the Apache Software Foundation.
Hive has the ability to store a schema for data in a database called the Hive Metastore. It can then apply that schema to data stored in HDFS.
What Hive isn’t
There is a misconception that Hive is a relational database. This is NOT true. Relational databases do more than run queries for analytics: they maintain and keep track of transactional events. Hive cannot do that, because it does not do Online Transaction Processing (OLTP). Hive also cannot perform row-level updates.
Hive’s Architecture
Hive has two main components: the Hive Metastore and Hive.
Hive Metastore
The Hive Metastore utilizes a relational database to store schemas of data, data about tables, databases, columns, data types, and how they apply to HDFS directories. This makes the Hive Metastore very important. Without it, Hive wouldn’t know:
- the schema of the data that is being queried.
- where to find the data that it is supposed to query.
The Hive Metastore is an essential part of the Hive application.
Hive
The second main component is Hive itself. This is broken down further into the HiveQL Process Engine and the Execution Engine.
The HiveQL Process Engine is the component that communicates with the Hive Metastore. The Process Engine parses HiveQL, a SQL-like language, and looks up schemas in the Metastore. HiveQL is special because it gives the end user the ability to query large sets of data without having to write a bunch of MapReduce code. The query actually gets converted into a MapReduce job to calculate the final result.
The Execution Engine is the bridge between the HiveQL Process Engine and MapReduce. The Execution Engine processes the query that was written in HiveQL and uses the MapReduce framework to produce a result. Hive can also use Spark as its execution engine for better performance.
Hive’s Data Types
Hive is very similar to a relational database in several ways.
Just as in a relational database, a schema must be defined before it can be applied to data in HDFS.
Another similarity is that Hive is organized into databases, which are more like namespaces rather than a collection of related tables. These databases have tables, but they don’t have to be related at all — it is purely a way to organize data. Inside of those tables are rows and columns. The column types are very similar to those in a relational database. Hive supports a ton of data types. Let’s take a look at some of the ones that Hive supports:
- Integral types are whole numbers. There are four INT categories supported by Hive: TINYINT, SMALLINT, INT, and BIGINT.
- String types have two categories: VARCHAR and CHAR. Strings in Hive can be specified either by single quotes or double quotes.
- Timestamp is (shockingly!) a timestamp, written in the form yyyy-mm-dd hh:mm:ss with optional fractional seconds.
- Dates represent a year/month/day value in the form YYYY-MM-DD.
- Floating point types are akin to floats in other programming languages, i.e. they are just numbers with decimal points after them.
- Decimal type is analogous to the BigDecimal format in Java. The decimal type offers the exact precision that integral and floating point types can’t.
- NULL is used to express missing values.
The above are some of the more common types that are offered in Hive. Let’s also briefly go over some of the complex types.
- Arrays are collections of data that can be accessed individually or as a whole.
- Maps are collections of <key, value> pairs.
- Structs are complex types made up of named fields, with the ability to add a comment on each field.
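To make these types concrete, here is a hypothetical table definition (the table and column names are invented for illustration; they are not part of the movie dataset used later) that combines several primitive and complex types:

```sql
-- Hypothetical example: mixing primitive and complex Hive types.
CREATE TABLE employee_demo (
  id      INT,
  name    STRING,
  salary  DECIMAL(10,2),
  hired   DATE,
  skills  ARRAY<STRING>,        -- e.g. ["java","sql"]
  phones  MAP<STRING,STRING>,   -- e.g. {"home":"5551234"}
  address STRUCT<street:STRING, city:STRING, zip:STRING>
);
```

Individual elements of the complex types can then be addressed in a SELECT as skills[0], phones['home'], and address.city.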
Hive Example
For this tutorial, we will be working with some movie data. You can download the data that we are going to be using here: http://www.grouplens.org/system/files/ml-1m.zip.
If you want to read about the data and how it is set up, feel free to check out this file: http://files.grouplens.org/papers/ml-1m-README.txt. In the ml-1m.zip file, you should see three files: movies.dat, ratings.dat, and users.dat.
Let’s get those files into our Cloudera Quickstart Docker image and load them into Hive.
If you aren’t using the Digital Ocean version of the tutorial, you can volume-mount the directory where you saved your data into your Quickstart Docker image by running this command:
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -v /place/you/saved:/inside/your/container cloudera/quickstart:latest /usr/bin/docker-quickstart bash
Otherwise, once you have the Digital Ocean instance spun up and resized, let’s run some commands so that we are all on the same page.
docker-machine scp -r ./Downloads/ml-1m/ root@docker-sandbox:/tmp
This command will get the data into your /tmp folder on the Digital Ocean server.
docker-machine ssh docker-sandbox
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -v /tmp:/dataset cloudera/quickstart:latest /usr/bin/docker-quickstart bash
Great, now you should be inside your Docker instance.
We need to do some data preparation before we put the data into HDFS. The following commands will set the data up correctly by replacing every :: delimiter with a , inside the files.
[root@quickstart /]# cd /dataset/ml-1m/
[root@quickstart ml-1m]# sed 's/::/,/g' movies.dat > movies.t
[root@quickstart ml-1m]# sed 's/::/,/g' ratings.dat > ratings.t
[root@quickstart ml-1m]# sed 's/::/,/g' users.dat > users.t
Let’s put the data into HDFS.
[root@quickstart ml-1m]# hdfs dfs -mkdir -p /moviedata/ratings
[root@quickstart ml-1m]# hdfs dfs -mkdir -p /moviedata/movies
[root@quickstart ml-1m]# hdfs dfs -mkdir -p /moviedata/users
[root@quickstart ml-1m]# hdfs dfs -put movies.t /moviedata/movies
[root@quickstart ml-1m]# hdfs dfs -put ratings.t /moviedata/ratings
[root@quickstart ml-1m]# hdfs dfs -put users.t /moviedata/users
We’ll start by running some Hive commands via Beeline, the command-line interface for Hive. Start it up by typing the following command on the command line.
beeline -u jdbc:hive2://quickstart:10000/default -n admin -d org.apache.hive.jdbc.HiveDriver
beeline is the command that we are running. The -u flag gives the URL of the Hive database; since we are connecting through a JDBC driver (notice the -d at the end), the address is prefixed with jdbc:. The -n flag is the username that Beeline will connect as.
First, we have to create a database to put our tables in.
create database moviedata;
use moviedata;
The use command tells Beeline which database we will be working in. Next, we are going to create the tables.
CREATE EXTERNAL TABLE ratings (userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/moviedata/ratings';
This statement looks like DDL (Data Definition Language) in SQL — that’s because it is very similar.
One difference is the keyword EXTERNAL, which tells Hive that this table is going to sit on top of existing data rather than loading the data into Hive. Thus, if you drop this table in Hive, the data will not disappear.
Another difference is the second part of the query:
- ROW FORMAT DELIMITED means that each line in the text file is a row.
- FIELDS TERMINATED BY ',' means that the columns are separated by a comma.
- STORED AS TEXTFILE means that the data is stored in text files.
- LOCATION simply means that the data is located in HDFS at the specified path.
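To convince yourself of the EXTERNAL behavior, you can drop the table and recreate it over the same location; a quick sketch using the table defined above:

```sql
-- Dropping an EXTERNAL table removes only the metadata in the Metastore;
-- the files under /moviedata/ratings remain untouched in HDFS.
DROP TABLE ratings;

-- Recreating the table over the same LOCATION makes the data queryable
-- again, with no reloading required.
CREATE EXTERNAL TABLE ratings (userid INT, movieid INT, rating INT, tstamp STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/moviedata/ratings';
```

Had the table been created without EXTERNAL (a managed table), dropping it would have deleted the underlying files as well.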
Now, let’s create some more tables.
CREATE EXTERNAL TABLE movies ( movieid INT, title STRING, genres ARRAY<STRING>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY "|" STORED AS TEXTFILE LOCATION '/moviedata/movies/';
CREATE EXTERNAL TABLE users (userid INT, gender STRING, age INT, occupation INT, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/moviedata/users/';
We’ll make sure that the data was loaded correctly by looking at a sample of each table.
select * from ratings LIMIT 10;
+-----------------+------------------+-----------------+-----------------+--+
| ratings.userid | ratings.movieid | ratings.rating | ratings.tstamp |
+-----------------+------------------+-----------------+-----------------+--+
| 1 | 1193 | 5 | 978300760 |
| 1 | 661 | 3 | 978302109 |
| 1 | 914 | 3 | 978301968 |
| 1 | 3408 | 4 | 978300275 |
| 1 | 2355 | 5 | 978824291 |
| 1 | 1197 | 3 | 978302268 |
| 1 | 1287 | 5 | 978302039 |
| 1 | 2804 | 5 | 978300719 |
| 1 | 594 | 4 | 978302268 |
| 1 | 919 | 4 | 978301368 |
+-----------------+------------------+-----------------+-----------------+--+
select * from users LIMIT 10;
+---------------+---------------+------------+-------------------+----------------+--+
| users.userid | users.gender | users.age | users.occupation | users.zipcode |
+---------------+---------------+------------+-------------------+----------------+--+
| 1 | F | 1 | 10 | 48067 |
| 2 | M | 56 | 16 | 70072 |
| 3 | M | 25 | 15 | 55117 |
| 4 | M | 45 | 7 | 02460 |
| 5 | M | 25 | 20 | 55455 |
| 6 | F | 50 | 9 | 55117 |
| 7 | M | 35 | 1 | 06810 |
| 8 | M | 25 | 12 | 11413 |
| 9 | M | 25 | 17 | 61614 |
| 10 | F | 35 | 1 | 95370 |
+---------------+---------------+------------+-------------------+----------------+--+
select * from movies LIMIT 10;
+-----------------+-------------------------------------+---------------------------------------+--+
| movies.movieid | movies.title | movies.genres |
+-----------------+-------------------------------------+---------------------------------------+--+
| 1 | Toy Story (1995) | ["Animation","Children's","Comedy"] |
| 2 | Jumanji (1995) | ["Adventure","Children's","Fantasy"] |
| 3 | Grumpier Old Men (1995) | ["Comedy","Romance"] |
| 4 | Waiting to Exhale (1995) | ["Comedy","Drama"] |
| 5 | Father of the Bride Part II (1995) | ["Comedy"] |
| 6 | Heat (1995) | ["Action","Crime","Thriller"] |
| 7 | Sabrina (1995) | ["Comedy","Romance"] |
| 8 | Tom and Huck (1995) | ["Adventure","Children's"] |
| 9 | Sudden Death (1995) | ["Action"] |
| 10 | GoldenEye (1995) | ["Action","Adventure","Thriller"] |
+-----------------+-------------------------------------+---------------------------------------+--+
Congrats! You have created some tables with data inside. We suggest keeping the LIMIT clause so that you don’t blow up your Docker image by querying and printing too much data.
Hive Query Example
Next, let’s run some queries.
First, we will pair each rating with its movie. Notice that the ratings table only contains the movieid; we want to see the movie title with it. Let’s try to write a query that does that (Hint: it involves a JOIN).
select m.title, r.rating FROM MOVIES m JOIN RATINGS r ON (m.movieid = r.movieid) LIMIT 20;
+-----------------------------------------+-----------+--+
| m.title | r.rating |
+-----------------------------------------+-----------+--+
| One Flew Over the Cuckoo's Nest (1975) | 5 |
| James and the Giant Peach (1996) | 3 |
| My Fair Lady (1964) | 3 |
| Erin Brockovich (2000) | 4 |
| Bug's Life | 5 |
| Princess Bride | 3 |
| Ben-Hur (1959) | 5 |
| Christmas Story | 5 |
| Snow White and the Seven Dwarfs (1937) | 4 |
| Wizard of Oz | 4 |
| Beauty and the Beast (1991) | 5 |
| Gigi (1958) | 4 |
| Miracle on 34th Street (1947) | 4 |
| Ferris Bueller's Day Off (1986) | 4 |
| Sound of Music | 5 |
| Airplane! (1980) | 4 |
| Tarzan (1999) | 3 |
| Bambi (1942) | 4 |
| Awakenings (1990) | 5 |
| Big (1988) | 4 |
+-----------------------------------------+-----------+--+
Let’s try and order the ratings for the movies by altering the above query.
select m.title, r.rating FROM MOVIES m JOIN RATINGS r ON (m.movieid = r.movieid) ORDER BY r.rating DESC LIMIT 20;
+-------------------------------+-----------+--+
| m.title | r.rating |
+-------------------------------+-----------+--+
| Election (1999) | 5 |
| Sixth Sense | 5 |
| Dead Man Walking (1995) | 5 |
| King and I | 5 |
| On the Beach (1959) | 5 |
| Party Girl (1995) | 5 |
| Spartacus (1960) | 5 |
| Mutiny on the Bounty (1935) | 5 |
| Usual Suspects | 5 |
| Lost Weekend | 5 |
| Big (1988) | 5 |
| Best Years of Our Lives | 5 |
| Desperado (1995) | 5 |
| Schindler's List (1993) | 5 |
| Apollo 13 (1995) | 5 |
| Gentleman's Agreement (1947) | 5 |
| Aliens (1986) | 5 |
| Cabaret (1972) | 5 |
| On the Waterfront (1954) | 5 |
| Dangerous Beauty (1998) | 5 |
+-------------------------------+-----------+--+
20 rows selected (37.343 seconds)
Let’s get the average movie rating for the movies.
select m.title, AVG(r.rating) as AVG_RATING FROM MOVIES m JOIN RATINGS r ON (m.movieid = r.movieid) GROUP BY m.title ORDER BY AVG_RATING DESC LIMIT 20;
+----------------------------------------------------------------------+--------------------+--+
| m.title | avg_rating |
+----------------------------------------------------------------------+--------------------+--+
| Gate of Heavenly Peace | 5.0 |
| Follow the Bitch (1998) | 5.0 |
| Song of Freedom (1936) | 5.0 |
| Smashing Time (1967) | 5.0 |
| One Little Indian (1973) | 5.0 |
| Baby | 5.0 |
| Lured (1947) | 5.0 |
| Schlafes Bruder (Brother of Sleep) (1995) | 5.0 |
| Bittersweet Motel (2000) | 5.0 |
| Ulysses (Ulisse) (1954) | 5.0 |
| I Am Cuba (Soy Cuba/Ya Kuba) (1964) | 4.8 |
| Lamerica (1994) | 4.75 |
| Apple | 4.666666666666667 |
| Sanjuro (1962) | 4.608695652173913 |
| Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.560509554140127 |
| Shawshank Redemption | 4.554557700942973 |
| Godfather | 4.524966261808367 |
| Close Shave | 4.52054794520548 |
| Usual Suspects | 4.517106001121705 |
| Schindler's List (1993) | 4.510416666666667 |
+----------------------------------------------------------------------+--------------------+--+
20 rows selected (54.088 seconds)
Play around with these tables and make your own queries. If you know SQL, you know HiveQL for the most part.
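As a starting point, here is one more sketch (untested against this dataset) that brings the users table into the mix, comparing the average rating given by each gender:

```sql
-- Average rating given by each gender, joining ratings to users on userid.
SELECT u.gender, AVG(r.rating) AS avg_rating
FROM users u
JOIN ratings r ON (u.userid = r.userid)
GROUP BY u.gender;
```

The same pattern works for any of the users columns, e.g. grouping by age or occupation instead of gender.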
Conclusion
We just finished a quick dive into Hive, a tool that lets you write SQL-like queries against data stored in Hadoop.