Apache Hadoop Deployment: A Blueprint for Reliable Distributed Computing
Apache Hadoop Deployment
A Blueprint for Reliable Distributed Computing
By Eugene Ciurana
16,745 Downloads · Refcard 133 of 199 (see them all)
The Essential Hadoop Deployment Cheat Sheet
Apache Hadoop Deployment:A Blueprint for Reliable Distributed Computing
This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. Check out Refcard #117, Getting Started with Apache Hadoop, for basic terminology and for an overview of the tools available in the Hadoop Project.
WHICH HADOOP DISTRIBUTION?
Apache Hadoop is a scalable framework for implementing reliable and scalable computational networks. This Refcard presents how to deploy and use development and production computational networks. HDFS, MapReduce, and Pig are the foundational tools for developing Hadoop applications.
There are two basic Hadoop distributions:
- Apache Hadoop is the main open-source, bleeding-edge distribution from the Apache foundation.
- The Cloudera Distribution for Apache Hadoop (CDH) is an open-source, enterprise-class distribution for productionready environments.
The decision of using one or the other distributions depends on the organization’s desired objective.
- The Apache distribution is fine for experimental learning exercises and for becoming familiar with how Hadoop is put together.
- CDH removes the guesswork and offers an almost turnkey product for robustness and stability; it also offers some tools not available in the Apache distribution.
Cloudera offers professional services and puts out an enterprise distribution of Apache Hadoop. Their toolset complements Apache’s. Documentation about Cloudera’s CDH is available from http://docs.cloudera.com.
The Apache Hadoop distribution assumes that the person installing it is comfortable with configuring a system manually. CDH, on the other hand, is designed as a drop-in component for all major Linux distributions.
Linux is the supported platform for production systems. Windows is adequate but is not supported as a development platform.
- Java 1.6 from Oracle, version 1.6 update 8 or later; identify your current JAVA_HOME
- sshd and ssh for managing Hadoop daemons across multiple systems
- rsync for file and directory synchronization across the nodes in the cluster
- Create a service account for user hadoop where $HOME=/home/hadoop
Every system in a Hadoop deployment must provide SSH access for data exchange between nodes. Log in to the node as the Hadoop user and run the commands in Listing 1 to validate or create the required SSH configuration.
|Listing 1 - Hadoop SSH Prerequisits|
The public key for this example is left blank. If this were to run on a public network it could be a security hole. Distribute the public key from the master node to all other nodes for data exchange. All nodes are assumed to run in a secure network behind the firewall.
All the bash shell commands in this Refcard are available for cutting and pasting from: http://ciurana.eu/DeployingHadoopDZone
Enterprise: CDH Prerequisites
Cloudera simplified the installation process by offering packages for Ubuntu Server and Red Hat Linux distributions.
CDH packages have names like CDH2, CDH3, and so on, corresponding to the CDH version. The examples here use CDH3. Use the appropriate version for your installation.
CDH on Ubuntu Pre-Install Setup
Execute these commands as root or via sudo to add the Cloudera repositories:
|Listing 2 - Ubuntu Pre-Install Setup|
CDH on Red Hat Pre-Install Setup
Run these commands as root or through sudo to add the yum Cloudera repository:
|Listing 3 - Red Hat Pre-Install Setup|
Ensure that all the pre-required software and configuration are installed on every machine intended to be a Hadoop node. Don’t mix and match operating systems, distributions, Hadoop, or Java versions!
Hadoop for Development
- Hadoop runs as a single Java process, in non-distributed mode, by default. This configuration is optimal for development and debugging.
- Hadoop also offers a pseudo-distributed mode, in which every Hadoop daemon runs in a separate Java process. This configuration is optimal for development and will be used for the examples in this guide.
If you have an OS X or a Windows development workstation, consider using a Linux distribution hosted on VirtualBox for running Hadoop. It will help prevent support or compatibility headaches.
Hadoop for Production
- Production environments are deployed across a group of machines that make the computational network. Hadoop must be configured to run in fully distributed, clustered mode.
APACHE HADOOP INSTALLATION
This Refcard is a reference for development and production deployment of the components shown in Figure 1. It includes the components available in the basic Hadoop distribution and the enhancements that Cloudera released.
Figure 1 - Hadoop Components
Whether the user intends to run Hadoop in non-distributed or distributed modes, it’s best to install every required component in every machine in the computational network. Any computer may assume any role thereafter.
A non-trivial, basic Hadoop installation includes at least these components:
- Hadoop Common: the basic infrastructure necessary for running all components and applications
- HDFS: the Hadoop Distributed File System
- MapReduce: the framework for large data set distributed processing
- Pig: an optional, high-level language for parallel computation and data flow
Enterprise users often chose CDH because of:
- Flume: a distributed service for efficient large data transfers in real-time
- Sqoop: a tool for importing relational databases into Hadoop clusters
Apache Hadoop Development Deployment
The steps in this section must be repeated for every node in a Hadoop cluster. Downloads, installation, and configuration
could be automated with shell scripts. All these steps are performed as the service user hadoop, defined in the
http://hadoop.apache.org/common/releases.html has the latest version of the common tools. This guide used version 0.20.2.
- Download Hadoop from a mirror and unpack it in the /home/hadoop work directory.
- Set the JAVA_HOME environment variable.
- Set the run-time environment:
|Listing 4 - Set the Hadoop Runtime Environment|
Pseudo-distributed operation (each daemon runs in a separate Java process) requires updates to core-site.xml, hdfs-site.xml, and the mapred-site.xml. These files configure the master, the file system, and the MapReduce framework and live in the runtime/conf directory.
|Listing 5 - Pseudo-Distributed Operation Config|
These files are documented in the Apache Hadoop Clustering reference, http://is.gd/E32L4s — some parameters are discussed in this Refcard’s production deployment section.
Test the Hadoop Installation
Hadoop requires a formatted HDFS cluster to do its work:
hadoop namenode -format
The HDFS volume lives on top of the standard file system. The format command will show this upon successful completion:
/tmp/dfs/name has been successfully formatted.
Start the Hadoop processes and perform these operations to validate the installation:
- Use the contents of runtime/conf as known input
- Use Hadoop for finding all text matches in the input
- Check the output directory to ensure it works
|Listing 6 - Testing the Hadoop Installation|
You may ignore any warnings or errors about a missing slaves file.
- View the output files in the HDFS volume and stop the Hadoop daemons to complete testing the install
|Listing 7 - Job Completion and Daemon Termination|
That’s it! Apache Hadoop is installed in your system and ready for development.
CDH Development Deployment
CDH removes a lot of grueling work from the Hadoop installation process by offering ready-to-go packages for mainstream Linux server distributions. Compare the instructions in Listing 8 against the previous section. CDH simplifies installation and configuration for huge time savings.
|Listing 8 - Installing CDH|
Leveraging some or all of the extra components in Hadoop or CDH is another good reason for using it over the Apache version. Install Flume or Pig with the instructions in Listing 9.
|Listing 9 - Adding Optional Components|
Test the CDH Installation
The CDH daemons are ready to be executed as services. There is no need to create a service account for executing them. They can be started or stopped as any other Linux service, as shown in Listing 10.
|Listing 10 - Starting the CDH Daemons|
CDH will create an HDFS partition when its daemons start. It’s another convenience it offers over regular Hadoop. Listing 11 shows how to validate the installation by:
- Listing the HDFS module
- Moving files to the HDFS volume
- Running an example job
- Validating the output
|Listing 11 - Testing the CDH Installation|
The daemons will continue to run until the server stops. All the Hadoop services are available.
Monitoring the Local Installation
Use a browser to check the NameNode or the JobTracker state through their web UI and web services interfaces. All daemons expose their data over HTTP. The users can chose to monitor a node or daemon interactively using the web UI, like in Figure 2. Developers, monitoring tools, and system administrators can use the same ports for tracking the system performance and state using web service calls.
Figure 2 - NameNode status web UI
The web interface can be used for monitoring the JobTracker, which dispatches tasks to specific nodes in a cluster, the DataNodes, or the NameNode, which manages directory namespaces and file nodes in the file system.
HADOOP MONITORING PORTS
Use the information in Table 1 for configuring a development workstation or production server firewall.
Table 1 - Hadoop ports
Plugging a Monitoring Agent
The Hadoop daemons also expose internal data over a RESTful interface. Automated monitoring tools like Nagios, Splunk, or SOBA can use them. Listing 12 shows how to fetch a daemon’s metrics as a JSON document:
|Listing 12 - Fetching Daemon Metrics|
All the daemons expose these useful resource paths:
- /metrics - various data about the system state
- /stacks - stack traces for all threads
- /logs - enables fetching logs from the file system
- /logLevel - interface for setting log4j logging levels
Each daemon type also exposes one or more resource paths specific to its operation. A comprehensive list is available from: http://is.gd/MBN4qz
APACHE HADOOP PRODUCTION DEPLOYMENT
The fastest way to deploy a Hadoop cluster is by using the prepackaged tools in CDH. They include all the same software as the Apache Hadoop distribution but are optimized to run in production servers and with tools familiar to system administrators.
Detailed guides that complement this Refcard are available from Cloudera at http://is.gd/RBWuxm and from Apache at http://is.gd/ckUpu1.
Figure 3 - Hadoop Computational Network
The deployment diagram in Figure 3 describes all the participating nodes in a computational network. The basic procedure for deploying a Hadoop cluster is:
- Pick a Hadoop distribution
- Prepare a basic configuration on one node
- Deploy the same pre-configured package across all machines in the cluster
- Configure each machine in the network according to its role
The Apache Hadoop documentation shows this as a rather involved process. The value-added in CDH is that most of that work is already in place. Role-based configuration is very easy to accomplish. The rest of this Refcard will be based on CDH.
Handling Multiple Configurations: Alternatives
Each server role will be determined by its configuration, since they will all have the same software installed. CDH supports the Ubuntu and Red Hat mechanism for handling alternative configurations.
Check the main page to learn more about alternatives. Ubuntu: man update-alternatives Red Hat: man alternatives
The Linux alternatives mechanism ensures that all files associated with a specific package are selected as a system default. This customization is where all the extra work went into CDH. The CDH installation uses alternatives to set the effective CDH configuration.
Setting Up the Production Configuration
Listing 13 takes a basic Hadoop configuration and sets it up for production.
|Listing 13 - Set the Production Configuration|
The server will restart all the Hadoop daemons using the new production configuration.
Figure 4 - Hadoop Conceptual Topology
Readying the NameNode for Hadoop
Pick a node from the cluster to act as the NameNode (see Figure 3). All Hadoop activity depends on having a valid R/W file system. Format the distributed file system from the NameNode, using user hdfs:
|Listing 14 - Create a New File System|
|sudo -u hdfs hadoop namenode -format|
Stop all the nodes to complete the file system, permissions, and ownership configuration. Optionally, set daemons for automatic startup using rc.d.
|Listing 15 - Stop All Daemons|
File System Setup
Every node in the cluster must be configured with appropriate directory ownership and permissions. Execute the commands in Listing 16 in every node:
|Listing 16 - File System Setup|
Starting the Cluster
- Start the NameNode to make HDFS available to all nodes
- Set the MapReduce owner and permissions in the HDFS volume
- Start the JobTracker
- Start all other nodes
CDH daemons are defined in /etc/init.d — they can be configured to start along with the operating system or they can be started manually. Execute the command appropriate for each node type using this example:
|Listing 17 - Starting a Node Example|
Use jobtracker, datanode, tasktracker, etc. corresponding to the node you want to start or stop.
Refer to the Linux distribution’s documentation for information on how to start the /etc/init.d daemons with the chkconfig tool.
|Listing 18 - Set the MapReduce Directory Up|
Update the Hadoop Configuration Files
|Listing 19 - Minimal HDFS Config Update|
The last step consists of configuring the MapReduce nodes to find their local working and system directories:
|Listing 20 - Minimal MapReduce Config Update|
Start the JobTracker and all other nodes. You now have a working Hadoop cluster. Use the commands in Listing 11 to validate that it’s operational.
The instructions in this Refcard result in a working development or production Hadoop cluster. Hadoop is a complex framework and requires attention to configure and maintain it. Review the Apache Hadoop and Cloudera CDH documentation. Pay particular attention to the sections on:
- How to write MapReduce, Pig, or Hive applications
- Multi-node cluster management with ZooKeeper
- Hadoop ETL with Sqoop and Flume
Happy Hadoop computing!
Do you want to know about specific projects and use cases where Hadoop and data scalability are the hot topics? Join the scalability newsletter: http://ciurana.eu/scalablesystems