Run Hadoop Cluster in Docker: Update

Abstract: Last year, I developed the kiwenlau/hadoop-cluster-docker project, which aims to help users quickly build a Hadoop cluster on the local host using Docker. The project is quite popular, with 250+ stars on GitHub and 2500+ pulls on Docker Hub. In this blog, I’d like to introduce the updated version.

Introduction

By packaging Hadoop into a Docker image, we can easily build a Hadoop cluster within Docker containers on the local host. This is very helpful for beginners who want to learn:

  • How to configure a Hadoop cluster correctly?
  • How to run a word count application?
  • How to manage HDFS?
  • How to run a test program on the local host?

The following figure shows the architecture of the kiwenlau/hadoop-cluster-docker project. The Hadoop master and slaves run in separate Docker containers: the NameNode and ResourceManager run in the hadoop-master container, while the DataNode and NodeManager run in the hadoop-slave containers. The NameNode and DataNode are components of the Hadoop Distributed File System (HDFS), while the ResourceManager and NodeManager are components of Hadoop’s cluster resource management system, Yet Another Resource Negotiator (YARN). HDFS is in charge of storing input and output data, while YARN is in charge of managing CPU and memory resources.

In the old version, I used serf/dnsmasq to provide DNS service for the Hadoop cluster. This was not an elegant solution, because it required extra installation/configuration and delayed the cluster startup procedure. Thanks to the enhancement of Docker’s network functionality, we no longer need serf/dnsmasq. We can create an independent network for the Hadoop cluster using the following command:

sudo docker network create --driver=bridge hadoop

By using the “--net=hadoop” option when starting the Hadoop containers, the containers attach to the “hadoop” network and can communicate with each other by container name.
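As a quick sanity check of name-based resolution, you can attach two containers to the network and ping one from the other. This is a hypothetical illustration, not part of the project’s scripts; it assumes the image provides a ping binary and overrides the entrypoint with a no-op command:

```shell
# Create the bridge network (same command as above)
sudo docker network create --driver=bridge hadoop

# Start a container on the "hadoop" network, kept alive with a no-op
sudo docker run -d --net=hadoop --name=hadoop-master \
    kiwenlau/hadoop:1.0 tail -f /dev/null

# From a second container on the same network, the first
# container is reachable simply by its container name
sudo docker run --rm --net=hadoop kiwenlau/hadoop:1.0 \
    ping -c 1 hadoop-master
```

If the ping succeeds, Docker’s embedded DNS is resolving container names on the user-defined bridge network, which is exactly what serf/dnsmasq used to do in the old version.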

Key points of the update:

  • Remove serf/dnsmasq: a Docker bridge network now provides hostname resolution for the cluster
  • Add a resize-cluster.sh script to rebuild the image for an arbitrary number of slave nodes

3 Nodes Hadoop Cluster

1. pull docker image

sudo docker pull kiwenlau/hadoop:1.0

2. clone github repository

git clone https://github.com/kiwenlau/hadoop-cluster-docker

3. create hadoop network

sudo docker network create --driver=bridge hadoop

4. start container

cd hadoop-cluster-docker
sudo ./start-container.sh

output:

start hadoop-master container...
start hadoop-slave1 container...
start hadoop-slave2 container...
root@hadoop-master:~#
  • starts 3 containers: 1 master and 2 slaves
  • you will be dropped into the /root directory of the hadoop-master container

5. start hadoop

./start-hadoop.sh

6. run wordcount

./run-wordcount.sh

output:

input file1.txt:
Hello Hadoop
input file2.txt:
Hello Docker
wordcount output:
Docker 1
Hadoop 1
Hello 2
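For reference, run-wordcount.sh is roughly equivalent to the following sketch. The file names match the sample output above, but the HDFS paths and the examples jar location are illustrative assumptions, not taken from the actual script (the jar path varies with the installed Hadoop version):

```shell
# Create sample input files and upload them to HDFS
echo "Hello Hadoop" > file1.txt
echo "Hello Docker" > file2.txt
hdfs dfs -mkdir -p input
hdfs dfs -put file1.txt file2.txt input

# Run the word count example jar shipped with Hadoop
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount input output

# Print the aggregated word counts
hdfs dfs -cat output/part-r-00000
```

The map phase splits each line into words, and the reduce phase sums the counts per word, producing the “Docker 1 / Hadoop 1 / Hello 2” result shown above.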

Arbitrary Size Hadoop Cluster

1. pull docker images and clone github repository

repeat steps 1~3 from the previous section

2. rebuild docker image

sudo ./resize-cluster.sh 5
  • specify a parameter > 1: 2, 3, ...
  • this script just rebuilds the hadoop image with a different slaves file, which specifies the hostnames of all slave nodes
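The slaves file itself is just a list of slave hostnames, one per line. Conceptually, regenerating it for an N-node cluster looks like the loop below. This is a simplified sketch of the idea, not the actual contents of resize-cluster.sh; it assumes the hadoop-slaveX naming used by start-container.sh:

```shell
#!/bin/bash
# Sketch: regenerate a slaves file for an N-node cluster
# (1 master + N-1 slaves)
N=5
> slaves                       # truncate/create the slaves file
for i in $(seq 1 $((N - 1))); do
    echo "hadoop-slave$i" >> slaves
done
cat slaves                     # prints hadoop-slave1 through hadoop-slave4
```

Rebuilding the image with this file baked in is what lets the master’s start-dfs/start-yarn scripts know about every slave, which is why the same N must then be passed to start-container.sh.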

3. start container

cd hadoop-cluster-docker
sudo ./start-container.sh 5
  • use the same parameter as in step 2

4. run hadoop cluster

repeat steps 5~6 from the previous section
