This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. Check out Refcard #117, Getting Started with Apache Hadoop, for basic terminology and for an overview of the tools available in the Hadoop Project.
Apache Hadoop is a framework for building reliable, scalable computational networks. This Refcard shows how to deploy and use such networks in development and production. HDFS, MapReduce, and Pig are the foundational tools for developing Hadoop applications.
There are two basic Hadoop distributions: Apache Hadoop and Cloudera's Distribution including Apache Hadoop (CDH). Which one to use depends on the organization's objectives.
Cloudera offers professional services and puts out an enterprise distribution of Apache Hadoop. Their toolset complements Apache’s. Documentation about Cloudera’s CDH is available from http://docs.cloudera.com.
The Apache Hadoop distribution assumes that the person installing it is comfortable with configuring a system manually. CDH, on the other hand, is designed as a drop-in component for all major Linux distributions.
Linux is the supported platform for production systems. Windows is adequate as a development platform but is not supported for production.
Every system in a Hadoop deployment must provide SSH access for data exchange between nodes. Log in to the node as the Hadoop user and run the commands in Listing 1 to validate or create the required SSH configuration.
|Listing 1 - Hadoop SSH Prerequisites|
The key passphrase in this example is left blank; on a public network that could be a security hole. Distribute the public key from the master node to all other nodes for data exchange. All nodes are assumed to run in a secure network behind a firewall.
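A minimal sketch of the kind of setup Listing 1 performs, assuming the service user is hadoop and an RSA key pair with an empty passphrase:

# generate an RSA key pair with an empty passphrase if one does not exist yet
test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# authorize the key for password-less logins and lock down the file permissions
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# validate: this should log in without prompting for a password
ssh localhost exit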
All the bash shell commands in this Refcard are available for cutting and pasting from: http://ciurana.eu/DeployingHadoopDZone
Cloudera simplified the installation process by offering packages for Ubuntu Server and Red Hat Linux distributions.
CDH packages have names like CDH2, CDH3, and so on, corresponding to the CDH version. The examples here use CDH3. Use the appropriate version for your installation.
Execute these commands as root or via sudo to add the Cloudera repositories:
|Listing 2 - Ubuntu Pre-Install Setup|
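A sketch of what that setup looks like for CDH3, assuming an Ubuntu release codename of lucid (substitute your own codename and verify the repository URL against the Cloudera documentation):

# add the Cloudera CDH3 apt repository (replace lucid with your release codename)
echo "deb http://archive.cloudera.com/debian lucid-cdh3 contrib" | sudo tee /etc/apt/sources.list.d/cloudera.list
# import Cloudera's package signing key and refresh the package index
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo apt-get update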
Run these commands as root or through sudo to add the yum Cloudera repository:
|Listing 3 - Red Hat Pre-Install Setup|
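A sketch of the equivalent Red Hat setup; the repository URLs below are the ones Cloudera published for CDH3, so verify them against the current documentation:

# define the Cloudera CDH3 yum repository
sudo tee /etc/yum.repos.d/cloudera-cdh3.repo <<'EOF'
[cloudera-cdh3]
name=Cloudera's Distribution for Apache Hadoop, Version 3
baseurl=http://archive.cloudera.com/redhat/cdh/3/
gpgkey=http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
EOF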
Ensure that all the prerequisite software and configuration are in place on every machine intended to be a Hadoop node. Don't mix and match operating systems, distributions, or Hadoop or Java versions!
If you have an OS X or a Windows development workstation, consider using a Linux distribution hosted on VirtualBox for running Hadoop. It will help prevent support or compatibility headaches.
This Refcard is a reference for development and production deployment of the components shown in Figure 1. It includes the components available in the basic Hadoop distribution and the enhancements that Cloudera released.
Whether the user intends to run Hadoop in non-distributed or distributed modes, it’s best to install every required component in every machine in the computational network. Any computer may assume any role thereafter.
A non-trivial, basic Hadoop installation includes at least these components: Hadoop Common, HDFS, MapReduce, and Pig.
Enterprise users often choose CDH for its packaged installation, the extra components it bundles, and the professional services and support Cloudera offers.
The steps in this section must be repeated for every node in a Hadoop cluster. Downloads, installation, and configuration can be automated with shell scripts. All these steps are performed as the service user hadoop introduced in the SSH prerequisites above.
The latest version of the Hadoop common tools is available from http://hadoop.apache.org/common/releases.html. This guide uses version 0.20.2.
|Listing 4 - Set the Hadoop Runtime Environment|
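A sketch of such an environment setup, assuming the 0.20.2 tarball from the releases page above is unpacked under the hadoop user's home and symlinked as runtime (the directory referenced in the next paragraph); the JAVA_HOME path is an example and must point at your installed JDK:

# unpack the release and create a stable "runtime" symlink to it
tar xzf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 runtime
# environment variables for the hadoop user (add to ~/.profile or ~/.bashrc)
export JAVA_HOME=/usr/lib/jvm/java-6-sun   # adjust to the installed JDK
export HADOOP_HOME=$HOME/runtime
export PATH=$HADOOP_HOME/bin:$PATH
# the daemons read JAVA_HOME from conf/hadoop-env.sh as well
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/conf/hadoop-env.sh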
Pseudo-distributed operation (each daemon runs in a separate Java process) requires updates to core-site.xml, hdfs-site.xml, and mapred-site.xml. These files configure the master node, the file system, and the MapReduce framework, and they live in the runtime/conf directory.
|Listing 5 - Pseudo-Distributed Operation Config|
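A minimal sketch of those three files, using the stock single-node values from the Apache Hadoop 0.20 documentation:

cd $HADOOP_HOME/conf
# core-site.xml: where the default file system (the NameNode) lives
cat > core-site.xml <<'EOF'
<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
</configuration>
EOF
# hdfs-site.xml: a single node can only hold one replica of each block
cat > hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>
EOF
# mapred-site.xml: where the JobTracker listens
cat > mapred-site.xml <<'EOF'
<configuration>
  <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
</configuration>
EOF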
These files are documented in the Apache Hadoop Clustering reference, http://is.gd/E32L4s — some parameters are discussed in this Refcard’s production deployment section.
Hadoop requires a formatted HDFS cluster to do its work:
hadoop namenode -format
The HDFS volume lives on top of the standard file system. The format command will show this upon successful completion:
/tmp/dfs/name has been successfully formatted.
Start the Hadoop processes and perform these operations to validate the installation:
|Listing 6 - Testing the Hadoop Installation|
You may ignore any warnings or errors about a missing slaves file.
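A sketch of such a test run, assuming the environment from Listing 4 and the examples jar bundled with Hadoop 0.20.2:

# start the NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode
start-all.sh
# copy the configuration directory into HDFS and run the bundled grep example
hadoop fs -put $HADOOP_HOME/conf input
hadoop jar $HADOOP_HOME/hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'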
|Listing 7 - Job Completion and Daemon Termination|
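To wrap up, a sketch of inspecting the job output and stopping the daemons:

# show the matches found by the example job, then shut everything down
hadoop fs -cat 'output/part-*'
stop-all.sh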
That’s it! Apache Hadoop is installed in your system and ready for development.
CDH removes a lot of grueling work from the Hadoop installation process by offering ready-to-go packages for mainstream Linux server distributions. Compare the instructions in Listing 8 against the previous section. CDH simplifies installation and configuration for huge time savings.
|Listing 8 - Installing CDH|
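As an illustration of how compact that installation is, a sketch using the CDH3 package names for a pseudo-distributed node (assuming the repositories from Listing 2 or 3 are in place):

# Ubuntu / Debian
sudo apt-get install hadoop-0.20 hadoop-0.20-conf-pseudo
# Red Hat / CentOS
sudo yum install hadoop-0.20 hadoop-0.20-conf-pseudo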
Leveraging some or all of the extra components bundled with CDH is another good reason for using it over the Apache version. Install Flume or Pig with the instructions in Listing 9.
|Listing 9 - Adding Optional Components|
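A sketch of those installs; the package names below are the CDH3 ones, so check the Cloudera documentation for the version you are using:

# Ubuntu / Debian
sudo apt-get install flume hadoop-pig
# Red Hat / CentOS
sudo yum install flume hadoop-pig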
The CDH daemons are ready to be executed as services. There is no need to create a service account for executing them. They can be started or stopped as any other Linux service, as shown in Listing 10.
|Listing 10 - Starting the CDH Daemons|
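A sketch of starting every installed daemon through its init script, assuming the CDH3 naming scheme:

# start whichever Hadoop daemons are installed on this machine
for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc start; done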
CDH will create an HDFS partition when its daemons start, another convenience it offers over regular Hadoop. Listing 11 shows how to validate the installation.
|Listing 11 - Testing the CDH Installation|
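A sketch of such a validation run; the paths below assume CDH3's package layout:

# give the current user a home directory in HDFS (the hdfs user owns the root)
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
# stage some input and run the bundled grep example, then inspect the output
hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/conf/*.xml input
hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar grep input output 'dfs[a-z.]+'
hadoop fs -cat 'output/part-*'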
The daemons will continue to run until the server stops. All the Hadoop services are available.
Use a browser to check the NameNode or the JobTracker state through their web UI and web services interfaces. All daemons expose their data over HTTP. Users can choose to monitor a node or daemon interactively through the web UI, as in Figure 2. Developers, monitoring tools, and system administrators can use the same ports to track system performance and state through web service calls.
The web interface can be used for monitoring the JobTracker, which dispatches tasks to specific nodes in a cluster, the DataNodes, or the NameNode, which manages directory namespaces and file nodes in the file system.
Use the information in Table 1 for configuring a development workstation or production server firewall.
The Hadoop daemons also expose internal data over a RESTful interface. Automated monitoring tools like Nagios, Splunk, or SOBA can use them. Listing 12 shows how to fetch a daemon’s metrics as a JSON document:
|Listing 12 - Fetching Daemon Metrics|
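A sketch of such a call against the NameNode, assuming the stock web port of 50070; the same pattern works for the other daemons on their respective ports:

# fetch the NameNode's metrics as JSON over its built-in HTTP server
curl 'http://localhost:50070/metrics?format=json'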
All the daemons expose these useful resource paths: /metrics (runtime metrics), /stacks (thread dumps), /logLevel (view or set log levels), and /logs (the daemon log files).
Each daemon type also exposes one or more resource paths specific to its operation. A comprehensive list is available from: http://is.gd/MBN4qz
The fastest way to deploy a Hadoop cluster is by using the prepackaged tools in CDH. They include all the same software as the Apache Hadoop distribution but are optimized to run on production servers, with tools familiar to system administrators.
Detailed guides that complement this Refcard are available from Cloudera at http://is.gd/RBWuxm and from Apache at http://is.gd/ckUpu1.
The deployment diagram in Figure 3 describes all the participating nodes in a computational network. The basic procedure for deploying a Hadoop cluster is:
1. Install the same software on every node; each node's role is set by configuration alone.
2. Switch every node to the production configuration (Listing 13).
3. Format HDFS from the NameNode (Listing 14).
4. Stop the daemons and set directory ownership and permissions on every node (Listings 15 and 16).
5. Start the daemons that match each node's role (Listing 17).
6. Create the MapReduce system directory and update the HDFS and MapReduce configuration (Listings 18 through 20).
The Apache Hadoop documentation shows this as a rather involved process. The value added by CDH is that most of that work is already in place, and role-based configuration is very easy to accomplish. The rest of this Refcard is based on CDH.
Each server role will be determined by its configuration, since they will all have the same software installed. CDH supports the Ubuntu and Red Hat mechanism for handling alternative configurations.
Check the man pages to learn more about alternatives. Ubuntu: man update-alternatives; Red Hat: man alternatives.
The Linux alternatives mechanism ensures that all files associated with a specific package are selected as a system default. This customization is where all the extra work went into CDH. The CDH installation uses alternatives to set the effective CDH configuration.
Listing 13 takes a basic Hadoop configuration and sets it up for production.
|Listing 13 - Set the Production Configuration|
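A sketch of that switch using the alternatives mechanism, based on CDH3's directory layout (the configuration directory name my_cluster is arbitrary):

# start a production configuration from the empty skeleton shipped with CDH
sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster
# register it with a priority high enough to become the active configuration
# Ubuntu:
sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50
# Red Hat:
sudo alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50
# restart the daemons so they pick up the new configuration
for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc restart; done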
The server will restart all the Hadoop daemons using the new production configuration.
Pick a node from the cluster to act as the NameNode (see Figure 3). All Hadoop activity depends on having a valid R/W file system. Format the distributed file system from the NameNode, using user hdfs:
|Listing 14 - Create a New File System|
|sudo -u hdfs hadoop namenode -format|
Stop all the nodes to complete the file system, permissions, and ownership configuration. Optionally, set daemons for automatic startup using rc.d.
|Listing 15 - Stop All Daemons|
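A sketch, again assuming the CDH3 init script naming:

# stop every Hadoop daemon on this node
for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc stop; done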
Every node in the cluster must be configured with appropriate directory ownership and permissions. Execute the commands in Listing 16 in every node:
|Listing 16 - File System Setup|
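A sketch of that setup; the /data/1 paths are placeholders that must match the directories named in the HDFS and MapReduce configuration below, and the hdfs and mapred service users are the ones the CDH packages create:

# local storage for HDFS metadata (NameNode) and blocks (DataNode)
sudo mkdir -p /data/1/dfs/nn /data/1/dfs/dn
sudo chown -R hdfs:hadoop /data/1/dfs
sudo chmod -R 700 /data/1/dfs
# local scratch space for MapReduce tasks
sudo mkdir -p /data/1/mapred/local
sudo chown -R mapred:hadoop /data/1/mapred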
CDH daemons are defined in /etc/init.d — they can be configured to start along with the operating system or they can be started manually. Execute the command appropriate for each node type using this example:
|Listing 17 - Starting a Node Example|
Use jobtracker, datanode, tasktracker, etc. corresponding to the node you want to start or stop.
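For example, on the node acting as the NameNode:

# start only the role this node plays in the cluster
sudo /etc/init.d/hadoop-0.20-namenode start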
Refer to the Linux distribution’s documentation for information on how to start the /etc/init.d daemons with the chkconfig tool.
|Listing 18 - Set the MapReduce Directory Up|
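A sketch of that step, assuming /mapred/system as the HDFS system directory (it must match mapred.system.dir in Listing 20) and that the NameNode is already running:

# create the MapReduce system directory in HDFS and hand it to the mapred user
sudo -u hdfs hadoop fs -mkdir /mapred/system
sudo -u hdfs hadoop fs -chown mapred /mapred/system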
|Listing 19 - Minimal HDFS Config Update|
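A minimal sketch of that update, written into the production configuration from Listing 13; the NameNode host name and the storage paths are placeholders that must match your cluster and Listing 16:

# point every node at the NameNode
sudo tee /etc/hadoop-0.20/conf.my_cluster/core-site.xml <<'EOF'
<configuration>
  <property><name>fs.default.name</name><value>hdfs://namenode.example.com:8020</value></property>
</configuration>
EOF
# tell HDFS where to keep its metadata and blocks on the local disks
sudo tee /etc/hadoop-0.20/conf.my_cluster/hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.name.dir</name><value>/data/1/dfs/nn</value></property>
  <property><name>dfs.data.dir</name><value>/data/1/dfs/dn</value></property>
</configuration>
EOF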
The last step consists of configuring the MapReduce nodes to find their local working and system directories:
|Listing 20 - Minimal MapReduce Config Update|
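A minimal sketch of that update; the JobTracker host name is a placeholder, and the directories must match Listings 16 and 18:

# JobTracker address, local scratch space, and the shared system directory in HDFS
sudo tee /etc/hadoop-0.20/conf.my_cluster/mapred-site.xml <<'EOF'
<configuration>
  <property><name>mapred.job.tracker</name><value>jobtracker.example.com:8021</value></property>
  <property><name>mapred.local.dir</name><value>/data/1/mapred/local</value></property>
  <property><name>mapred.system.dir</name><value>/mapred/system</value></property>
</configuration>
EOF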
Start the JobTracker and all other nodes. You now have a working Hadoop cluster. Use the commands in Listing 11 to validate that it’s operational.
The instructions in this Refcard result in a working development or production Hadoop cluster. Hadoop is a complex framework and requires attention to configure and maintain. Review the Apache Hadoop and Cloudera CDH documentation, and pay particular attention to the sections on cluster configuration and operations.
Happy Hadoop computing!
Do you want to know about specific projects and use cases where Hadoop and data scalability are the hot topics? Join the scalability newsletter: http://ciurana.eu/scalablesystems