Hadoop Tutorial
January 05, 2018
Apache Hadoop is an open source framework used for processing massive amounts of data in a distributed fashion across a cluster of commodity hardware. By design, Hadoop can scale from a single-node installation to thousands of nodes or servers. With more nodes come more computation power and storage.
Hadoop is a simple programming framework written in Java. Two main components make up the backbone of the framework: MapReduce and the Hadoop Distributed File System. Over the years, companies and other open source communities have continued to expand the number of applications and products that can interact with the Hadoop ecosystem.
Now that you know what Hadoop is, let’s talk about what Hadoop isn’t. Hadoop is not a replacement for relational databases such as Postgres that handle millions of transactions a day. Relational databases are organized into tables and normalized to optimize SQL queries, and those queries often run in milliseconds.
Hadoop works best with big data that is non-normalized. Most of the time, data inside Hadoop is stored as plain files rather than tables. This data is not indexed, so queries involve running a MapReduce job, which takes time. If you don’t have Big Data or just need a transactional system, Hadoop probably isn’t for you. Hadoop is great, but it’s not always the best choice.
History of Hadoop
In 2003, Google released a paper named “The Google File System”. A year later, in 2004, Google released another paper named “MapReduce: Simplified Data Processing on Large Clusters”. These two papers were the main inspirations for Hadoop, which applied the Google File System model to a file system that would later be called the Hadoop Distributed File System (HDFS). In the beginning, the work lived inside the Apache Nutch project until 2006, when Doug Cutting and Mike Cafarella (the co-founders) spun it out and named the project Hadoop. Doug Cutting named the project after his son’s toy elephant.
In April of 2006, Hadoop was able to sort 1.8 TB of data in 47.9 hours across 188 nodes. Because Doug Cutting worked at Yahoo during Hadoop’s inception, Yahoo contributed heavily during the early years of the project. In May 2006, Yahoo deployed a 300-node Hadoop cluster, and in October of that year it doubled the number of machines to 600 nodes. By June of 2011, Yahoo had a total of 42,000 Hadoop nodes, which equates to hundreds of petabytes of data.
Many changes have happened to Hadoop over the years. For example, in January of 2011, MapReduce was replaced as the cluster management tool of Hadoop by YARN (Yet Another Resource Negotiator). A cluster management tool manages the resources on a cluster of machines; MapReduce is still used as a data processing framework, just not as the manager of resources inside the Hadoop cluster. Since the inception of Hadoop in 2006, there have been three major version releases, with the latest (3.0) released in December 2017.
Even though Hadoop is open source, some companies package the ecosystem up so that they can provide easy setup, configuration, and support for enterprises. Two major companies offer distributions: Cloudera and Hortonworks. Throughout these wikis we will be using a Docker image that is a single-node installation of the Cloudera Hadoop Distribution, or CDH.
The Hadoop Ecosystem
There are many applications that can run in the Hadoop ecosystem; Hadoop isn’t just one thing. It is a combination of applications that you fit together to create the best Big Data solution that you can. There are common applications that people use, such as HDFS, Spark, Oozie, Pig, or Mahout. The truth is that it is almost impossible to know every single detail about every ecosystem application out there. However, knowing a little bit about each and knowing the most popular tools will give you enough information to build the right Big Data application. With that said, we are going to do a quick overview of some of the most popular tools being used in the industry. Most of these will be explained in depth in later posts.
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that makes up the backbone of Hadoop. Since HDFS scales horizontally across many servers, it is an ideal file system for Big Data. The HDFS servers also double as processing machines, giving you computation right next to the data.
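To make this concrete, here is a minimal sketch of interacting with HDFS from the command line; the file name and the /user/cloudera paths are just illustrative examples, not anything the cluster requires.
# make a directory in HDFS
hdfs dfs -mkdir -p /user/cloudera/books
# copy a local file (hypothetical) into HDFS
hdfs dfs -put war_and_peace.txt /user/cloudera/books/
# list what is stored there
hdfs dfs -ls /user/cloudera/books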
MapReduce
MapReduce is the original algorithm and framework used to manage and process data inside of a Hadoop cluster. It uses the nodes that make up HDFS to process large amounts of data through a framework written in Java.
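To get a feel for what running a MapReduce job looks like, here is a hedged sketch using the WordCount example that ships with Hadoop; the jar path is the usual CDH location and may differ on your installation, and the input and output HDFS paths are placeholders.
# run the built-in WordCount example over files already in HDFS
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/cloudera/books /user/cloudera/wordcount_out
# inspect the reducer output
hdfs dfs -cat /user/cloudera/wordcount_out/part-r-00000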
Spark
Spark is MapReduce’s faster younger brother. Spark processes data in memory rather than via disk like MapReduce does. Because it uses memory instead of hard disks for processing, Spark can be up to 100x faster than MapReduce for some workloads.
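If you want to try Spark on the sandbox, one low-effort option is submitting the SparkPi example that ships with Spark; the jar path below is a typical CDH 5 location and may differ on your system.
# estimate pi on the cluster using 10 partitions
spark-submit --class org.apache.spark.examples.SparkPi --master yarn /usr/lib/spark/lib/spark-examples.jar 10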
Yet Another Resource Negotiator (YARN)
Yet Another Resource Negotiator (YARN) is the cluster management tool that replaced MapReduce as the resource manager for Hadoop. Switching to YARN opened the door to non-MapReduce ecosystem tools such as Spark.
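You rarely talk to YARN directly, but its command-line tool is a handy way to see what the cluster is doing, for example:
# list the worker nodes YARN is managing
yarn node -list
# list applications (MapReduce, Spark, etc.) currently running on the cluster
yarn application -list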
Pig
Pig is a Hadoop application that was originally written at Yahoo. It provides a scripting language that solves problems using a data-flow strategy. Pig scripts are written in Pig Latin (a simple, SQL-like scripting language) and can run on top of execution engines such as MapReduce and Spark.
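To give you a taste of Pig Latin, here is a minimal word-count style sketch run in local mode; the input file name is made up and the script is deliberately stripped down.
cat > top_words.pig <<'EOF'
-- load a plain-text file, one line per record (file name is hypothetical)
lines   = LOAD 'war_and_peace.txt' USING TextLoader() AS (line:chararray);
-- break each line into words and count how often each appears
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
DUMP counts;
EOF
# run the script on the local machine instead of the cluster
pig -x local top_words.pig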
HBase
HBase is a NoSQL database built on top of HDFS. Building on HDFS lets it scale linearly and gives it the durability that comes with HDFS. HBase adds benefit by allowing random reads and writes, which are not possible in HDFS alone.
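As a quick taste, the HBase shell lets you create a table and do single-row writes and reads; the table, column family, and values below are made-up examples, fed to the shell over stdin.
hbase shell <<'EOF'
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Ada'
get 'users', 'row1'
EOF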
Hive
Hive is a Hadoop ecosystem tool that offers a SQL-like interface, allowing a user to put a relational view on top of data stored in HDFS. HiveQL is the SQL-like language that the user can use to write queries on that data. Even though this sounds like a relational database, Hive has some restrictions and is used primarily as a batch data warehousing and query tool rather than a transactional database.
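To make that concrete, here is a hedged sketch of putting a Hive table on top of delimited files already sitting in HDFS and querying them; the table name, columns, and HDFS path are all invented for illustration.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/cloudera/page_views';
SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url;"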
Flume
Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of data. With its simple and flexible architecture, Flume offers a robust, fault-tolerant solution for a highly available ingestion pipeline.
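Flume agents are configured entirely through a properties file; the sketch below is a trimmed-down version of the netcat-to-logger example from the Flume user guide, with an arbitrary file name and agent name a1.
cat > example.conf <<'EOF'
# one source, one in-memory channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# listen for lines of text on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# buffer events in memory
a1.channels.c1.type = memory
# log each event to the console for this demo
a1.sinks.k1.type = logger
# wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console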
Oozie
Oozie is a workflow application that helps schedule and manage Hadoop jobs. It can chain multiple jobs together and kick them off based on many triggers, including input data availability and time. Oozie also offers decision points so that you can insert logic into your workflows.
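A full workflow definition (an XML file plus a properties file) is beyond this overview, but submitting and checking on one from the command line looks roughly like this; the server URL and job.properties file are placeholders for your own setup.
# submit and start a workflow described by job.properties
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
# check on it later using the job ID that -run printed
oozie job -oozie http://localhost:11000/oozie -info <job-id>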
Zeppelin
Zeppelin is a data visualization tool that works with Hadoop. It can run Spark jobs from an interactive shell and visualize data that you have mined from Hive. Zeppelin supports both preconfigured setups and custom inputs so that you can visualize exactly the data you want.
Kafka
Kafka is a popular message queueing system used for ingesting data into the Hadoop environment. It is highly available and offers very high throughput. Kafka also allows multiple consumers to read from a single topic (queue), which is hard to find in other queueing systems.
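If you want to try it out, the console tools that ship with Kafka make a quick demo; the script names below are the stock Apache Kafka ones (some distributions drop the .sh suffix), and the local ZooKeeper and broker addresses are just defaults.
# create a topic (Kafka releases of this era manage topics through ZooKeeper)
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
# type messages into the topic
kafka-console-producer.sh --broker-list localhost:9092 --topic test
# in another terminal, read them back from the beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning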
Installing Hadoop using Docker
Introduction
This guide focuses on giving you practical, hands-on experience, so we will use Docker to install a free sandbox provided by Cloudera where you can test out the code that you write.
Docker allows developers to quickly and seamlessly package and deploy their applications in sandboxes known as containers. You may be thinking that these sound similar to virtual machines; however, Docker containers have a much smaller footprint and don’t need an entire virtual operating system.
Even though we are using Docker, we won’t be explaining a lot of Docker concepts. The reason we are using Docker is so that we can easily spin up and tear down a single-node Hadoop cluster and you can follow along at home. Also, once you get inside the Docker container, the commands are the same across Windows, Mac, and Linux, keeping everything consistent.
If you have a machine with 8 GB of RAM or more, you are welcome to follow the local installation instructions below; otherwise, we will provide instructions on how to run a Docker server in the cloud using Digital Ocean.
Setting Up Docker
For Mac: Head over to https://docs.docker.com/docker-for-mac/install/, download the stable version for Mac, and follow the instructions on that page. By the end, you should be able to open a terminal, run docker, and see its usage options like any other command.
For Windows: Head over to https://docs.docker.com/docker-for-windows/install/#start-docker-for-windows, download the stable version for Windows, and follow the instructions on that page. By the end, you should be able to open a terminal, run docker, and see its usage options like any other command.
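Either way, a quick sanity check from the terminal confirms that Docker is installed and the daemon is running:
docker --version
# the classic smoke test: pulls and runs a tiny container that prints a greeting
docker run hello-world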
Local Installation Instructions
Note: Only use these instructions if you have enough RAM available on your computer.
If you have a computer with 8 GB or more of RAM, you can disregard the cloud instructions below unless you want to have some fun with a cloud provider. All you have to do to get started is install Docker using the instructions above, then run this command:
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 cloudera/quickstart:latest /usr/bin/docker-quickstart bash
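A quick breakdown of that command: --hostname=quickstart.cloudera sets the hostname the preconfigured Cloudera services expect, --privileged=true grants the extra permissions some of those services need, -t -i gives you an interactive terminal, -p 8888:8888 exposes the Hue web UI on port 8888 of your machine, and /usr/bin/docker-quickstart starts the Hadoop services before dropping you into bash.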
To exit the container, just type exit or press Ctrl+C.
Cloud Installation Instructions
Note: Only use these instructions if you don’t have enough RAM on your computer.
We are going to be utilizing a cloud company called Digital Ocean to spin up and tear down a server that runs Docker. With Docker, we will run a popular free sandbox provided by Cloudera. While Digital Ocean isn’t free, if you use this link, you’ll get $10 to get started. Plus, you only pay for what you use, so be sure to follow along in each wiki, because we will always shut down the virtual server so that you’re not charged for something you’re not using.
Setting Up Digital Ocean
Head on over to this link and sign up for an account. Once the account is made, go to the home screen. On the menu bar, click API → Generate New Token → enter the name of the token → click “Generate Token”.
This should spit out a generated hex string, which is super important when we go to create a Docker machine, so hold on to it. MAKE A COPY BECAUSE YOU WON’T BE ABLE TO GET THIS AGAIN. You’ll have to recreate it if you lose it.
Creating a Docker Host Machine
Here comes the fun part. We are going to create a virtual server in Digital Ocean with a single command. This will spin up an environment that we can SSH into so that we can run the commands we need to set up our sandbox environment. We are almost there.
In your terminal, while Docker is running, run the following command, replacing the xxxxx with your personal access token hex string.
docker-machine create --driver digitalocean --digitalocean-access-token xxxxx docker-sandbox
You’ll see it jump into action. After just a few minutes, you should be able to go to your Digital Ocean home page and see your first droplet running.
Congratulations! You just created a virtual server on the cloud! Pretty impressive, but there is still more to do. Always make sure to return to this page later to verify that your droplet (server) has been destroyed properly.
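You can also verify the machine from your terminal with docker-machine itself:
# list the machines docker-machine knows about, with their state and IP
docker-machine ls
# print just the public IP address of the new machine
docker-machine ip docker-sandbox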
Using the Docker Machine
We need a way to ssh into that box to run some commands. Try the following command:
docker-machine ssh docker-sandbox
That should have transported you to your server. You can tell by the root@docker-sandbox that shows up in your command prompt. Now we are going to spin up our single-node Hadoop cluster using this command (while still on your virtual server):
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 cloudera/quickstart:latest /usr/bin/docker-quickstart bash
This is doing a lot of things, including downloading the Cloudera image from Docker Hub and starting the Hadoop cluster. If it seems to be moving pretty slowly, you may want to resize the server you’re running on (see below). Once it has loaded, you should be able to run
hdfs dfs -ls /
If that lists the top-level directories in HDFS, your Hadoop installation is working. Awesome job. Let’s see how to increase the size of your machine.
Increasing the size
From the homepage, locate the droplet you want to resize and go to More → Resize.
You should come to the resize page.
Don’t let these numbers scare you. The top number is the cost of the server per month, and underneath the big number is the price per hour. Not so bad. Before resizing, though, you need to make sure your droplet is off, so flip the On toggle in the top right corner to off. Now you can pick the $80/month ($0.119/hour) server, which isn’t too bad either, because if you used the link above to sign up, you should have gotten $10 free. Once you select the $80 server, scroll down and click Resize. This will take a couple of minutes, but once it is done, make sure the droplet is back on and you are ready to go.
Destroying Your Virtual Server
This is almost as easy as starting one. From your local machine’s command line, run the following commands
docker-machine stop docker-sandbox
docker-machine rm docker-sandbox
You’ll be prompted to confirm that you want to remove the machine; the answer is y or yes.
To confirm, check your Digital Ocean homepage and run
docker-machine ls
which should return an empty list.
Conclusion
We’ve looked at the history of Hadoop and some of the most popular tools in the Hadoop ecosystem. We even gave you a crash course in Docker and Digital Ocean, and spun up a single-node Hadoop cluster. Don’t worry if it didn’t all click or you forget exactly how to spin up a Docker machine. We will have refreshers throughout each tutorial.