Apache Maven 2

By Matthew McCullough

Refcard 55 of 151


The Essential Maven 2 Cheat Sheet

Maven is a comprehensive project information tool whose most common application is building Java code. It is receiving renewed recognition for its convention-over-configuration approach to builds. This DZone Refcard shows how Maven manages the software lifecycle, and gives Java developers a wide range of execution commands, tips for debugging Mavenized builds, and a clear introduction to the "Maven vocabulary". It also covers the mvn command, dependencies, plugins, profiles and more.

ABOUT APACHE MAVEN

Maven is a comprehensive project information tool, whose most common application is building Java code. Maven is often considered an alternative to Ant, but as you’ll see in this Refcard, it offers unparalleled software lifecycle management, providing a cohesive suite of verification, compilation, testing, packaging, reporting, and deployment plugins.

Maven is receiving renewed recognition in the emerging development space for its convention over configuration approach to builds. This Refcard aims to give JVM platform developers a range of basic to advanced execution commands, tips for debugging Mavenized builds, and a clear introduction to the “Maven vocabulary”.

Interoperability and Extensibility

New Maven users are pleasantly surprised to find that Maven offers easy-to-write custom build-supplementing plugins, reuses any desired aspect of Ant, and can compile native C, C++, and .NET code in addition to its strong support for Java and JVM languages and platforms, such as Scala, JRuby, Groovy and Grails.

Hot Tip

All things Maven can be found at http://maven.apache.org

THE MVN COMMAND

Maven supplies a Unix shell script and MSDOS batch file named mvn and mvn.bat respectively. This command is used to start all Maven builds. Optional parameters are supplied in a space-delimited fashion. An example of cleaning and packaging a project, then running it in a Jetty servlet container, yet skipping the unit tests, reads as follows:


mvn clean package jetty:run -Dmaven.test.skip

PROJECT OBJECT MODEL

The world of Maven revolves around metadata files named pom.xml. A file of this name exists at the root of every Maven project and defines the plugins, paths and settings that supplement the Maven defaults for your project.

Basic pom.xml Syntax

The smallest valid pom.xml, which inherits the default artifact type of “jar”, reads as follows:


<project>
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.ambientideas</groupId>
	<artifactId>barestbones</artifactId>
	<version>1.0-SNAPSHOT</version>
</project>

Super POM

The Super POM is a virtual pom.xml file that ships inside the core Maven JARs, and provides numerous default settings. All projects automatically inherit from the Super POM, much like the Object super class in Java. Its contents can be viewed in one of two ways:

View Super POM via SVN

Open the following SVN viewing URL in your web browser:


http://svn.apache.org/repos/asf/maven/components/branches/maven-2.1.x/pom.xml

View Super POM via effective-pom

Run the following command in a directory that contains the most minimal Maven project pom.xml, listed above.


mvn help:effective-pom

Multi-module Projects

Maven showcases exceptional support for componentization via its concept of multi-module builds. Place sub-projects in sub-folders beneath your top level project and reference each with a module tag. To build all sub projects, just execute your normal mvn command and goals from a prompt in the top-most directory.


<project>
  <!-- ... -->
  <packaging>pom</packaging>
  <modules>
    <module>servlets</module>
    <module>ejbs</module>
    <module>ear</module>
  </modules>
</project>

Artifact Vector

Each Maven project produces an artifact, such as a JAR, WAR or EAR, identified by a composite of fields known as groupId, artifactId, packaging, version and scope. This vector of fields uniquely distinguishes a Maven artifact from all others.

Many Maven reports and plugins print the details of a specific artifact in this colon separated fashion:


groupId:artifactId:packaging:version:scope

An example of this output for the core Spring JAR would be:


org.springframework:spring:jar:2.5.6:compile
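
The colon-separated form above is easy to take apart in a script. The following is a minimal Python sketch, not part of Maven itself; the names ArtifactCoordinate and parse_coordinate are hypothetical and it assumes exactly the five fields shown above.

```python
from typing import NamedTuple


class ArtifactCoordinate(NamedTuple):
    """The five fields of the artifact vector, in printed order."""
    group_id: str
    artifact_id: str
    packaging: str
    version: str
    scope: str


def parse_coordinate(text: str) -> ArtifactCoordinate:
    """Split a groupId:artifactId:packaging:version:scope string."""
    parts = text.strip().split(":")
    if len(parts) != 5:
        raise ValueError(
            f"expected 5 colon-separated fields, got {len(parts)}: {text!r}"
        )
    return ArtifactCoordinate(*parts)


coord = parse_coordinate("org.springframework:spring:jar:2.5.6:compile")
print(coord.artifact_id, coord.version)
```

Some plugins print fewer or more fields (a classifier, for example), so a real parser would need to handle those variants.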

EXECUTION GROUPS

Maven divides execution into four nested hierarchies. From most-encompassing to most-specific, they are: Lifecycle, Phase, Plugin, and Goal.

Lifecycles, Phases, Plugins and Goals

Maven defines the concept of language-independent project build flows that model the steps that all software goes through during a compilation and deployment process.

Lifecycles

Lifecycles represent a well-recognized flow of steps (Phases) used in software assembly.

Each step in a lifecycle flow is called a phase. Zero or more plugin goals are bound to a phase.

A plugin is a logical grouping and distribution (often a single JAR) of related goals, such as JARing.

A goal, the most granular step in Maven, is a single executable task within a plugin. For example, discrete goals in the jar plugin include packaging the jar (jar:jar), signing the jar (jar:sign), and verifying the signature (jar:sign-verify).

Executing a Phase or Goal

At the command prompt, either a phase or a plugin goal can be requested. Multiple phases or goals can be specified and are separated by spaces.


If you ask Maven to run a specific plugin goal, then only that goal is run. This example runs two plugin goals: compilation of code, then JARing the result, skipping over any intermediate steps.


mvn compiler:compile jar:jar

Conversely, if you ask Maven to execute a phase, all phases and bound plugin goals up to that point in the lifecycle are also executed. This example requests the deploy lifecycle phase, which will also execute the verification, compilation, testing and packaging phases.


mvn deploy
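
The difference between requesting a goal and requesting a phase can be modeled in a few lines. This Python sketch is only an illustration of the ordering rule, not how Maven is implemented; the phase names are those of the default lifecycle, and phases_run_for is a hypothetical helper.

```python
from typing import List

# Phases of the default lifecycle, in execution order.
DEFAULT_LIFECYCLE = [
    "validate", "initialize",
    "generate-sources", "process-sources",
    "generate-resources", "process-resources",
    "compile", "process-classes",
    "generate-test-sources", "process-test-sources",
    "generate-test-resources", "process-test-resources",
    "test-compile", "test",
    "prepare-package", "package",
    "pre-integration-test", "integration-test", "post-integration-test",
    "verify", "install", "deploy",
]


def phases_run_for(requested: str) -> List[str]:
    """Return every phase executed when `requested` is asked for:
    all earlier phases in the lifecycle, then the requested one."""
    position = DEFAULT_LIFECYCLE.index(requested)  # ValueError if unknown
    return DEFAULT_LIFECYCLE[: position + 1]


print(phases_run_for("compile"))
```

Asking for "package" in this model runs "test" but not "install", which mirrors why mvn package leaves the local repository untouched.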

Online and Offline

During a build, Maven attempts to download any uncached referenced artifacts and proceeds to cache them in the ~/.m2/repository directory on Unix, or in the %USERPROFILE%/.m2/repository directory on Windows.

To prepare for compiling offline, you can instruct Maven to download all referenced artifacts from the Internet via the command:


mvn dependency:go-offline

If all required artifacts and plugins have been cached in your local repository, you can instruct Maven to run in offline mode with a simple flag:


mvn <phase or goal> -o

Built-in Maven Lifecycles

Maven ships with three lifecycles: clean, default, and site. Many of the phases within these three lifecycles are bound to a sensible plugin goal.

Hot Tip

The official lifecycle reference, which extensively lists all the default bindings, can be found at http://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html

The clean lifecycle is simple in nature. It deletes all generated and compiled artifacts in the output directory.

Clean Lifecycle
Lifecycle Phase  Purpose
pre-clean
clean            Remove all generated and compiled artifacts in preparation for a fresh build.
post-clean

Default Lifecycle
Lifecycle Phase          Purpose
validate                 Cross check that all elements necessary for the build are correct and present.
initialize               Set up and bootstrap the build process.
generate-sources         Generate dynamic source code.
process-sources          Filter, sed and copy source code.
generate-resources       Generate dynamic resources.
process-resources        Filter, sed and copy resource files.
compile                  Compile the primary or mixed language source files.
process-classes          Augment compiled classes, such as for code-coverage instrumentation.
generate-test-sources    Generate dynamic unit test source code.
process-test-sources     Filter, sed and copy unit test source code.
generate-test-resources  Generate dynamic unit test resources.
process-test-resources   Filter, sed and copy unit test resources.
test-compile             Compile unit test source files.
test                     Execute unit tests.
prepare-package          Manipulate generated artifacts immediately prior to packaging. (Maven 2.1 and above)
package                  Bundle the module or application into a distributable package (commonly JAR, WAR, or EAR).
pre-integration-test
integration-test         Execute tests that require connectivity to external resources or other components.
post-integration-test
verify                   Inspect and cross-check the distribution package (JAR, WAR, EAR) for correctness.
install                  Place the package in the user’s local Maven repository.
deploy                   Upload the package to a remote Maven repository.

The site lifecycle generates a project information web site, and can deploy the artifacts to a specified web server or local path.

Site Lifecycle
Lifecycle Phase  Purpose
pre-site         Run any processes needed before the site is generated.
site             Generate an HTML web site containing project information and reports.
post-site
site-deploy      Upload the generated web site to a web server.

Default Goal

The default goal codifies the author’s intended usage of the build script. Only one goal or lifecycle phase can be set as the default. The most common default goal is install.


<project>
   [...]
   <build>
      <defaultGoal>install</defaultGoal>
   </build>
   [...]
</project>

HELP

Help for a Plugin

Lists all the possible goals for a given plugin and any associated documentation.


mvn help:describe -Dplugin=<pluginname>

Help for POMs

To view the composite pom that’s a result of all inherited poms:


mvn help:effective-pom

Help for Profiles

To view all profiles that are active from either manual or automatic activation:


mvn help:active-profiles

DEPENDENCIES

Declaring a Dependency

To express your project’s reliance on a particular artifact, you declare a dependency in the project’s pom.xml.

Hot Tip

You can use the search engine at repository.sonatype.org to find dependencies by name and get the xml necessary to paste into your pom.xml

<project>
  <dependencies>
    <dependency>
      <groupId>com.yourcompany</groupId>
      <artifactId>yourlib</artifactId>
      <version>1.0</version>
      <type>jar</type>
      <scope>compile</scope>
    </dependency>
  </dependencies>
  <!-- ... -->
</project>

Standard Scopes

Each dependency can specify a scope, which controls its visibility and inclusion in the final packaged artifact, such as a WAR or EAR. Scoping enables you to minimize the JARs that ship with your product.

Scope     Description
compile   Needed for compilation, included in packages.
test      Needed for unit tests, not included in packages.
provided  Needed for compilation, but provided at runtime by the runtime container.
system    Needed for compilation, given as absolute path on disk, and not included in packages.
import    An inline inclusion of a POM-type artifact facilitating dependency-declaring POM snippets.
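
To make the packaging effect of scopes concrete, here is a small Python sketch that encodes only what the table above states. The dependency names are hypothetical, and treating "provided" as not packaged is an assumption drawn from "provided at runtime by the runtime container".

```python
# Whether a dependency of each scope ships inside the final package.
# The import scope is omitted because it pulls in POM declarations, not JARs.
INCLUDED_IN_PACKAGE = {
    "compile": True,
    "test": False,
    "provided": False,  # assumption: the container supplies it at runtime
    "system": False,
}


def jars_to_ship(dependencies):
    """Return names of dependencies whose scope places them in the package.

    `dependencies` is a list of (name, scope) pairs; unknown scopes are
    treated as not packaged.
    """
    return [name for name, scope in dependencies
            if INCLUDED_IN_PACKAGE.get(scope, False)]


# Hypothetical dependency list for a web application.
deps = [("yourlib", "compile"), ("junit", "test"), ("servlet-api", "provided")]
print(jars_to_ship(deps))
```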

PLUGINS

Adding a Plugin

A plugin and its configuration are added via a small declaration, very similar to a dependency, in the <build> section of your pom.xml.


<build>
  <!-- ... -->
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <maxmem>512m</maxmem>
      </configuration>
    </plugin>
  </plugins>
</build>

Common Plugins

Maven calls its plugin classes Mojos. The name is a play on POJO (“Plain Old Java Object”), substituting “Maven” for “Plain”.

There are dozens of Maven plugins, but a handful constitute some of the most valuable, yet underused features:

surefire    Runs unit tests.
checkstyle  Checks the code’s styling.
clover      Code coverage evaluation.
enforcer    Verifies many types of environmental conditions as prerequisites.
assembly    Creates ZIPs and other distribution packages of apps and their transitive dependency JARs.

Hot Tip

The full catalog of plugins can be found at: http://maven.apache.org/plugins/index.html

VISUALIZE DEPENDENCIES

Users often mention that the most challenging task is identifying dependencies: why they are being included, where they are coming from and if there are collisions. Maven has a suite of goals to assist with this.

List a hierarchy of dependencies.


mvn dependency:tree

List dependencies in alphabetic form.


mvn dependency:resolve

List plugin dependencies in alphabetic form.


mvn dependency:resolve-plugins

Analyze dependencies and list any that are unused, or undeclared.


mvn dependency:analyze

REPOSITORIES

Repositories are the web sites that host collections of Maven plugins and dependencies.

Declaring a Repository


<repositories>
  <repository>
    <id>JavaDotNetRepo</id>
    <url>https://maven-repository.dev.java.net</url>
  </repository>
</repositories>

The Maven community strongly recommends using a repository manager such as Nexus to define all repositories. This results in cleaner pom.xml files and centrally cached and managed connections to external artifact sources. Nexus can be downloaded from http://nexus.sonatype.org/

Popular Repositories

Central   http://repo1.maven.org/maven2/
Java.net  https://maven-repository.dev.java.net/
Codehaus  http://repository.codehaus.org/
JBoss     http://repository.jboss.org/maven2

Hot Tip

A near-complete list of repositories can be found at http://www.mvnbrowser.com/repositories.html

PROPERTY VARIABLES

A wide range of predefined or custom property variables can be used anywhere in your pom.xml files to keep string and path repetition to a minimum.

All properties in Maven begin with ${ and end with }. To list all available properties, run the following command.


mvn help:expressions

Predefined Properties (Partial List)

${env.PATH}                  Any OS environment variable, such as EDITOR or GROOVY_HOME. Specifically, the PATH environment variable.
${project.groupId}           Any project node from the aggregated Maven pom.xml. Specifically, the Group ID of the project.
${project.artifactId}        Name of the artifact.
${project.basedir}           Path of the directory containing the pom.xml.
${settings.localRepository}  The path to the user’s local repository.
${java.home}                 Any Java System Property. Specifically, the Java System Property path to its home.
${java.vendor}               The Java System Property declaring the JRE vendor’s name.
${my.somevar}                A user-defined variable.

Project properties could previously be referenced with a pom prefix, as in ${pom.basedir}, or with no prefix at all, as in ${basedir}. Maven now requires that you prefix these variables with the word project, as in ${project.basedir}.

Define a Property

You can define a new custom property in your pom.xml like so:


<project>
   [...]
   <properties>
      <my.somevar>My Value</my.somevar>
   </properties>
   [...]
</project>
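
The ${...} substitution itself is simple to picture. The Python sketch below is a simplified model under stated assumptions, not Maven's interpolator: it does a single pass over a flat dictionary, leaves unknown placeholders untouched, and the function name interpolate is hypothetical.

```python
import re

# Matches ${name}, capturing the text between the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def interpolate(text, properties):
    """Replace each ${name} in `text` with its value from `properties`.

    Placeholders with no matching property are left as written.
    """
    def replace(match):
        name = match.group(1)
        return properties.get(name, match.group(0))

    return PLACEHOLDER.sub(replace, text)


props = {"my.somevar": "My Value", "project.artifactId": "barestbones"}
print(interpolate("${project.artifactId}: ${my.somevar}", props))
```

Maven additionally resolves environment variables, Java system properties, and settings values, which this sketch does not attempt.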

DEBUGGING

Exception Full Stack Traces

If a Maven plugin is reporting an error, to see the full detail of the exception’s stack trace run Maven with the -e flag.


mvn <yourgoal> -e

Output Debugging Info

Whenever reporting a Maven bug, or troubleshooting a problem, turn on all the debugging info by running Maven like so:


mvn <yourgoal> -X

Debug Maven Core/Plugins

Core Maven operations and plugins can be stepped through with any JPDA-compatible debugger, the most common option being Eclipse. When run in debug mode, Maven will wait for you to connect your debugger to socket port 8000 before continuing with its lifecycle.


mvnDebug <yourgoal>
Preparing to Execute Maven in Debug Mode
Listening for transport dt_socket at address: 8000

Debug a Unit Test

Your suite or an individual unit test can be debugged in much the same fashion by telling the Surefire test-execution plugin to wait for you to attach a debugger to port 5005.


mvn test -Dmaven.surefire.debug
Listening for transport dt_socket at address: 5005

SOURCE CODE MANAGEMENT

Configuring SCM

Your project’s SCM connection can be quickly configured with just three XML tags, which adds significant capabilities to the scm, release, and reactor plugins.

The connection tag is your read-only view of your repository, and developerConnection is the writable link. The url tag is your web-based view of the source.


<scm>
  <connection>scm:svn:http://myvendor.com/ourrepo/trunk</connection>
  <developerConnection>scm:svn:https://myvendor.com/ourrepo/trunk</developerConnection>
  <url>http://myvendor.com/viewsource.pl</url>
</scm>

Hot Tip

Over 12 SCM systems are supported by Maven. The full list can be viewed at http://docs.codehaus.org/display/SCM/SCM+Matrix

Using the SCM Plugin

The core SCM plugin offers two highly useful goals.

The diff command produces a standard Unix patch file with the extension .diff of the pending (uncommitted) changes on disk that can be emailed or attached to a bug report.


mvn scm:diff

The update-subprojects goal invokes a recursive scm-provider specific update (svn update, git pull) across all the submodules of a multimodule project.


mvn scm:update-subprojects

PROFILES

Profiles are a means to conditionally turn on portions of Maven configuration, including plugins, pathing and configuration.

The most common uses of profiles are for Windows/Unix platform-specific variations and build-time customization of JAR dependencies based on the use of a specific J2EE vendor such as WebLogic, WebSphere or JBoss.


<project>
  [...]
  <profiles>
    <profile>
      <id>YourProfile</id>
      [...settings, build, plugins etc...]
      <dependencies>
        <dependency>
          <groupId>com.yourcompany</groupId>
          <artifactId>yourlib</artifactId>
        </dependency>
      </dependencies>
    </profile>
  </profiles>
  [...]
</project>

Profile Definition Locations

Profiles can be defined in pom.xml, profiles.xml (parallel to the pom.xml), ~/.m2/settings.xml, or $M2_HOME/conf/settings.xml.

Hot Tip

The full Maven Profile reference, including details about when to use each of the profile definition files, can be found at http://maven.apache.org/guides/introduction/introduction-to-profiles.html

PROFILE ACTIVATION

Profiles can be activated manually from the command line or through an activation rule (OS, file existence, Maven version, etc.). Profiles are primarily additive, so best practices suggest leaving most off by default, and activating based on specific conditions.

Manual Profile Activation


mvn <yourgoal> -P YourProfile

Automatic Profile Activation


<project>
  [...]
  <profiles>
    <profile>
      <id>YourProfile</id>
      [...settings, build, etc...]
      <activation>
        <os>
          <name>Windows XP</name>
          <family>Windows</family>
          <arch>x86</arch>
          <version>5.1.2600</version>
        </os>
        <file>
          <missing>somefolder/somefile.txt</missing>
        </file>
      </activation>
    </profile>
  </profiles>
  [...]
</project>

CUTTING A RELEASE

Maven offers excellent automation for cutting a release of your project. In short, this is a plugin-guided ceremony for verifying that all tests pass, tagging your source code repository, and altering the POMs to reflect a product version increment.

The prepare goal runs the unit tests, continuing only if all pass, then changes the value in the pom <version> tag to a release version, tags the source repository accordingly, and finally advances the pom version to the next SNAPSHOT version.


mvn release:prepare
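
The version changes made during prepare can be sketched as two string transformations. This Python example assumes the common convention in which 1.0-SNAPSHOT is released as 1.0 and development continues at 1.1-SNAPSHOT; the release plugin prompts for both values, so the real numbers are whatever you confirm. The function names are hypothetical.

```python
SNAPSHOT_SUFFIX = "-SNAPSHOT"


def release_version(snapshot_version):
    """Drop the -SNAPSHOT suffix to get the version that is tagged."""
    if not snapshot_version.endswith(SNAPSHOT_SUFFIX):
        raise ValueError(f"not a snapshot version: {snapshot_version!r}")
    return snapshot_version[: -len(SNAPSHOT_SUFFIX)]


def next_snapshot(released_version):
    """Increment the last numeric component and re-append -SNAPSHOT.

    Assumes a purely numeric, dot-separated version such as 1.0 or 2.3.9.
    """
    parts = released_version.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts) + SNAPSHOT_SUFFIX


tagged = release_version("1.0-SNAPSHOT")
print(tagged, next_snapshot(tagged))
```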

After a release has been successfully prepared, run the perform goal. This goal checks out the prepared release and deploys it to the POM’s specified remote Maven repository for consumption by other teams and Maven builds.


mvn release:perform

ARCHETYPES

An archetype is a powerful template that uses your corporate Java package names and project name in the instantiated project and establishes a baseline of dependencies, with a bonus of basic sample code.

You can leverage public archetypes for quickly starting a project that uses a familiar stack, such as Struts+Spring, or Tapestry+Hibernate. You can also create private archetypes within your company to offer new projects a level of consistent dependencies matching your approved corporate technology stack.

Using an Archetype

The default behavior of the generate goal is to bring up a menu of choices. You are then prompted for various replaceables such as package name and artifactId. Type this command, then answer each question at the command line prompt.


mvn archetype:generate

Creating Archetypes

An archetype can be created from an existing project, using it as the pattern by which to build the template. Run the command from the root of your existing project.


mvn archetype:create-from-project

Archetype Catalogs

The Maven Archetype plugin comes bundled with a default catalog of applications it can create, but other projects on the Internet also publish catalogs. To use an alternate catalog:


mvn archetype:generate -DarchetypeCatalog=<catalog>

A list of the most commonly used catalogs is as follows:


local
remote
http://repo.fusesource.com/maven2
http://cocoon.apache.org
http://download.java.net/maven/2
http://myfaces.apache.org
http://tapestry.formos.com/maven-repository
http://scala-tools.org
http://www.terracotta.org/download/reflector/maven2/

REPORTS

Maven has a robust offering of reporting plugins, commonly run with the site generation phase, that evaluate and aggregate information about the project, contributors, its source, tests, code coverage, and more.

Adding a Report Plugin


<reporting>
  <plugins>
    <plugin>
      <artifactId>maven-javadoc-plugin</artifactId>
    </plugin>
  </plugins>
</reporting>

Hot Tip

A list of commonly used reporting plugins can be reviewed here http://maven.apache.org/plugins/

About The Author

Matthew McCullough

Matthew McCullough is an Open Source Architect with the Denver, Colorado consulting firm Ambient Ideas, LLC, which he co-founded in 1997. He’s spent the last 13 years passionately aiming for ever-greater efficiencies in software development, all while exploring how to share these practices with his clients and their team members. Matthew is a nationally touring speaker on all things open source and has provided long-term mentoring and architecture services to over 40 companies ranging from startups to Fortune 500 firms. Feedback and questions are always welcomed at matthewm@ambientideas.com

Recommended Book

Maven

Several sources for Maven have appeared online for some time, but nothing served as an introduction and comprehensive reference guide to this tool -- until now. Maven: The Definitive Guide is the ideal book to help you manage development projects for software, web applications, and enterprise applications. And it comes straight from the source.


Understanding Lucene

Powering Better Search Results

By Erik Hatcher

Refcard 137 of 151


The Essential Apache Lucene Cheat Sheet

Apache Lucene is a cross-platform, high-performance, full-text search engine library written in Java. Today, there are also .NET and Python ports available. When used in conjunction with Apache Solr, Lucene becomes a world-class search platform. Solr includes a number of other features like faceting and a rich function query/sort capability. This Refcard will give you a foundational knowledge of Lucene’s features from the inverted index structure on up. This includes documents, indexes, fields, analysis, searching and more. There will also be plenty of usage examples to look at with Solr as the front-end.

WHAT IS LUCENE?

The Lucene Ecosystem

“Lucene” is a broadly used term. It’s the original Java indexing and search library created by Doug Cutting. Lucene was then chosen as a top-level Apache Software Foundation project name — http://lucene.apache.org. The name is also used for various ports of the Java library to other languages (Lucene.Net, PyLucene, etc). The following table shows the key projects at http://lucene.apache.org.

Project         Description
Lucene - Java   Java-based indexing and search library. Also comes with extras such as highlighting, spellchecking, etc.
Solr            High-performance enterprise search server. HTTP interface. Built upon Lucene Java. Adds faceting, replication, sharding, and more.
Droids          Intelligent robot crawling framework.
Open Relevance  Aims to collect and distribute free materials for relevance testing and performance.
PyLucene        Python port of the Lucene Java project.

There are many projects and products that use, expose, port, or in some way wrap various pieces of the Apache Lucene ecosystem.

WHICH LUCENE DISTRIBUTION?

There are many ways to obtain and leverage Lucene technology. How you choose to go about it will depend on your specific needs and integration points, your technical expertise and resources, and budget/time constraints.

When Lucene in Action was published in 2004, before the advent of many of the projects mentioned above, we just had Lucene Java and some other open-source building blocks. It served its purpose and did so extremely well. Lucene has only gotten better since then: faster, more efficient, newer features, and more. If you’ve got Java skills you can easily grab lucene.jar and go for it.

However, some better and easier ways to build Lucene-based search applications are now available. Apache Solr, specifically, is a top notch server architecture, built from the ground up with Lucene. Solr factors in Lucene best practices and simplifies many aspects of indexing content and integrating search into your application as well as addressing scalability needs that exceed the capacity of single machines.

This Refcard is about the concepts of Lucene more than the specifics of the Lucene API. We’ll be shining the light on Lucene internals and concepts with Solr. Solr provides some very direct ways to interact with Lucene.

We recommend you start with one of the following distributions:

  • LucidWorks for Solr – certified distributions of the official Apache Solr distributions, including any critical bug fixes and key performance enhancements.
  • Apache Solr – a great starting point for developers; grab a distro, write a script, integrate into UI.

Hot Tip

If you’re getting started on building a search application, your quickest, easiest bet is to use LucidWorks Enterprise. LucidWorks Enterprise is Lucene and Solr, plus more. Easy to install, easy to configure and monitor. LucidWorks Enterprise is free for development, with support subscriptions available for production deployments.

Lucid Imagination offers professional services, training, and the new LucidWorks Enterprise platform. Visit http://www.lucidimagination.com.

Definitions/Glossary

There are many common terms used when elaborating on Lucene’s design and usage.

Term      Definition/context/usage
Document  Returnable search result item. A document typically represents a crawled web page, a file system file, or a row from a database query.
Field     Property, metadata item, or attribute of a document. Documents typically have a unique key field, often called “id”. Other common fields are “title”, “body”, “last_modified_date”, and “categories”.
Term      Searchable text, extracted from each indexed field by analysis (a process of tokenization and filtering).
tf/idf    Term frequency / inverse document frequency. A commonly used scoring factor that weighs term frequency (how many times the query term occurs in a given document) against inverse document frequency (how many documents in the entire collection contain that query term, inverted).
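
A small worked example may help with tf/idf. The Python sketch below is a simplified illustration, not Lucene's scoring code. It assumes the square-root term frequency and the 1 + ln(numDocs / (docFreq + 1)) inverse document frequency that Lucene's default similarity is commonly described as using; the toy documents and the function names are hypothetical.

```python
import math


def tf(freq):
    """Term frequency factor: grows with occurrences, but sub-linearly."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf_idf(term, doc_tokens, all_docs):
    """Weight of `term` in one tokenized document of a small collection."""
    freq = doc_tokens.count(term)
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    return tf(freq) * idf(doc_freq, len(all_docs))


# Three already-tokenized toy documents.
docs = [["mountain", "mountain", "bike"], ["river", "bike"], ["bike"]]
print(tf_idf("mountain", docs[0], docs))
print(tf_idf("bike", docs[0], docs))
```

In this collection "mountain" outweighs "bike" in the first document because it occurs more often there and appears in fewer documents overall.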

Lucene Java and Core Lucene Concepts Explained

The design of Lucene is, at a high level, quite straightforward. Documents are “indexed”.

Documents are a representation of whatever types of “objects” and granularities your application needs to work with on the search/discovery side of the equation. In other words, when thinking Lucene, it is important to consider the use cases / demands of the encompassing application in order to effectively tune the indexing process with the end goal in mind.

Lucene provides APIs to open, read, write, and search an index. Documents contain “fields”. Fields are the useful individually named attributes of a document used by your search application. For example, when indexing traditional files such as Word, HTML, and PDF documents, commonly used fields are “title”, “body”, “keywords”, “author”, and “last_modified_date”.

DOCUMENTS

Documents, to Lucene, are the findable items. Here’s where domain-specific abstractions really matter. A Lucene Document can represent a file on a file system, a row in a database, a news article, a book, a poem, an historical artifact (see collections.si.edu), and so on. Documents contain “fields”. Fields represent attributes of the containing document, such as title, author, keywords, filename, file_type, lastModified, and fileSize.

Fields have a name and one or more values. A field name, to Lucene, is arbitrary, whatever you want.

When indexing documents, the developer has the choice of what fields to add to the Document instance, their names, and how they are each handled. Field values can be stored and/or indexed. A large part of the magic of Lucene is in how field values are analyzed and how a field’s terms are represented and structured.

[Figure: “document” example (filename.doc)]

Hot Tip

There are additional bits of metadata that can be indexed along with the terms text. Terms can optionally carry along their positions (relative position of term to previous term within the field), offsets (character offsets of the term in the original field), and payloads (arbitrary bytes associated with a term which can influence matching and scoring). Additionally, fields can store term vectors (an intra-field term/frequency data structure).

The heart of Lucene’s search capabilities is in the elegance of the index structure, a form of an “inverted index”. An inverted index is a data structure mapping “terms” to the documents that contain them. Indexed fields can be “analyzed”, a process of tokenizing and filtering text into individual searchable terms. Often these terms from the analysis process are simply the individual words from the text. The analysis process of general text typically also includes normalization processes (lowercasing, stemming, other cleansing). Tuning analysis at index time offers many sophisticated ways to support typical search application needs such as sorting, faceting, spell checking, autosuggest, highlighting, and more.

Inverted Index

Again we need to look back at the search application needs. Almost every search application ends up with a human user interface with the infamous and ubiquitous “search box”.

The trick is going from a human entered “query” to returning matching documents blazingly fast. This is where the inverted index structure comes into play. For example, a user searching for “mountain” can be readily accommodated by looking up the term in the inverted index and matching associated documents.
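The lookup described above can be sketched in a few lines of plain Java. This is a toy illustration only, not Lucene's API or index format: the class name InvertedIndexSketch, its methods, and the sample documents are all hypothetical, and the "analysis" is reduced to lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of ids of documents containing it.
// Illustration only; this is not Lucene's API or its index format.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // "Analyze" the text (lowercase, split on non-letters/digits) and record each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A single-term search is one map lookup; no document text is scanned.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits != null ? hits : new TreeSet<Integer>();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Climbing the mountain");
        index.add(2, "Mountain bike trails");
        index.add(3, "Ocean sailing");
        System.out.println(index.search("mountain")); // [1, 2]
    }
}
```

The cost of the search depends on the number of distinct terms and matching documents, not on the total amount of text indexed, which is what makes the structure fast.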

Not only are documents matched to a query, but they are also scored. For a given search request, a subset of the matching documents are returned to the user. We can easily provide sorting options for the results, though presenting results in “relevancy” order is more often the desired sort criteria. Relevancy refers to a numeric “score” based on the relationship between the query and the matching document. (“Show me the documents best matching my query first, please”).

The following formula (straight from Lucene’s Similarity class javadoc) illustrates the basic factors used to score a document.

score(q,d) = coord(q,d) · queryNorm(q) · Σ (t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]
Lucene practical scoring formula

Each of the factors in this equation is explained further in the following table:

Factor Explanation
score(q,d) The final computed value of numerous factors and weights, numerically representing the relationship between the query and a given document.
coord(q,d) A search-time score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query’s terms will receive a higher score than another document with fewer query terms.
queryNorm(q) A normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.
tf(t in d) Correlates to the term’s frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. Note that tf(t in q) is assumed to be 1 and, therefore, does not appear in this equation. However, if a query contains the same term twice, there will be two term-queries with that same term. Hence, the computation would still be correct (although not very efficient).
idf(t) Stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give higher contribution to the total score. idf(t) appears for t in both the query and the document, hence it is squared in the equation.
t.getBoost() A search-time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to setBoost().
norm(t,d) Encapsulates a few (indexing time) boost and length factors.

Understanding how these factors work can help you control exactly how to get the most effective search results from your search application. It's worth noting that in many applications these days, there are numerous other factors involved in scoring a document. Consider boosting documents by recency (latest news articles bubble up), popularity/ratings (or even like/dislike factors), inbound link count, user search/click activity feedback, profit margin, geographic distance, editorial decisions, or many other factors. But let's not get carried away just yet, and focus on Lucene's basic tf/idf.

So now we've briefly covered the gory details of how Lucene works for matching and scoring documents during a search. There's one missing bit of magic: going from the human input of a search box and translating that into a representative data structure, the Lucene Query object. This string-to-Query process is called "query parsing". Lucene itself includes a basic QueryParser that can parse sophisticated expressions including AND, OR, +/-, parenthetical grouped expressions, range, fuzzy, wildcarded, and phrase query clauses. For example, the following expression will match documents with a title field with the terms "Understanding" and "Lucene" collocated successively (provided positional information was enabled!) where the mimeType (MIME type is the document type) value is "application/pdf":


title:"Understanding Lucene" AND mimeType:application/pdf

For more information on Lucene QueryParser syntax, see http://lucene.apache.org/java/3_0_3/queryparsersyntax.html (or the docs for the version of Lucene you are using).

It is important to note that query parsing and allowable user syntax is often an area of customization consideration. Lucene’s API richly exposes many Query subclasses, making it very straightforward to construct sophisticated Query objects using building blocks such as TermQuery, BooleanQuery, PhraseQuery, WildcardQuery, and so on.

Shining the Light on Lucene: Solr

Apache Solr embeds Java Lucene, exposing its capabilities through an easy-to-use HTTP interface. Solr has Lucene best practices built in, and provides distributed and replicated search for large scale power.

For the examples that follow, we’ll be using Solr as the front-end to Lucene. This allows us to demonstrate the capabilities with simple HTTP commands and scripts, rather than coding in Java directly. Additionally, Solr adds easy-to-use faceting, clustering, spell checking, autosuggest, rich document indexing, and much more. We’ll introduce some of Solr’s value-added pieces along the way.

Lucene has a lot of flexibility, likely much more than you will need or use. Solr layers some general common-sense best practices on top of Lucene with a schema. A Solr schema is conceptually the same as a relational database schema. It is a way to map fields/columns to data types, constraints, and representations. Let’s preview the fields defined in the Solr schema (conf/schema.xml) for our running example:


<fields>
	<field name="id"
		type="string" indexed="true" stored="true" />
	<field name="title"
		type="text_en" indexed="true" stored="true" />
	<field name="mimeType"
		type="string" indexed="true" stored="true" />
	<field name="lastModified"
		type="date" indexed="true" stored="true" />
</fields>

The schema constrains all fields of a particular name (there is dynamic wildcard matching capability too) to a “field type”. A field type controls how the Lucene Field instances are constructed during indexing, in a consistent manner. We saw above that Lucene fields have a number of additional attributes and controls, including whether the field value is stored, indexed, if indexed, how so, which analysis chain, and whether positions, offsets, and/or term vectors are stored.

Our Running Example, Quick Proof-of-Concepts

The (Solr) documents we index will have a unique “id” field, a “title” field, a “mimeType” field to represent the file type for filtering/faceting purposes, and a “lastModified” date field to represent a file’s last modified timestamp. Here’s an example document (in Solr XML format, suitable for direct POSTing):


<add>
  <doc>
	<field name="id">doc01</field>
	<field name="title">Our first document</field>
	<field name="mimeType">application/pdf</field>
	<field name="lastModified">NOW</field>
  </doc>
</add>

That example shows indexing the metadata regarding an actual file. Ultimately, we also want the contents of the file to be searchable. Solr natively supports extracting and indexing content from rich documents. And LucidWorks Enterprise has built-in file and web crawling and scheduling along with content extraction.

Launching Solr, using its example configuration, is as straightforward as this, from a Solr installation directory:


cd example
java -jar start.jar

And from another command-shell, documents can be easily indexed. Our example document shown previously (saved as docs.xml for us) can be indexed like this:


cd example/exampledocs
java -jar post.jar docs.xml

First of all, this isn’t going to work out of the box, as we have a custom schema and application needs not supported by Solr’s example configuration. Get used to it, it’s the real world! The example schema is there as an example, and is likely inappropriate for your application as-is. Borrow what makes sense for your own application’s needs, but don’t leave cruft behind.

At this point, we have a fully functional search engine, with a single document, and will use this for all further examples. Solr will be running at http://localhost:8983/solr.

INDEXING

The process of adding documents to Lucene or Solr is called indexing. With Lucene Java, you create a new Document instance and call the addDocument method of an IndexWriter. This is straightforward and simple enough, leaving the burden on you to come up with the textual strings that'll comprise the document.

Contrast with Solr, which provides numerous ways out of the box to index. We've seen an example of Solr XML, one basic way to bring in documents. Here are detailed examples of various ways to index content into Solr. Solr’s schema centralizes the decisions made about how fields are indexed, freeing the indexer from any internal knowledge about how fields should be handled.

Solr XML/JSON

Solr’s basic XML format can be a convenient way to map your application’s “documents” into Solr. A simple HTTP POST to /update is all it takes.

Posting XML to Solr can be done using the post.jar tool that comes with Solr’s example data, curl (see Solr’s post.sh), or any other HTTP library or tool capable of POST. In fact, most of the popular Solr client API libraries out there simply wrap an HTTP library with some convenience methods for indexing documents, packaging up documents and field values into this XML structure and POSTing it to Solr’s /update handler. Documents indexed in this fashion will be updated if they share the same unique key field value (configured in schema.xml) as existing documents.
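To make the "packaging up documents and field values into this XML structure" step concrete, here is a minimal sketch in plain Java. The class SolrXmlSketch and its methods are hypothetical helpers written for this illustration, not part of Solr or any client library; the sketch only builds the message body and leaves the actual HTTP POST to whichever HTTP tool you prefer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Packages field name/value pairs into Solr's <add><doc> XML message.
// Hypothetical helper for illustration; real client libraries differ in detail.
class SolrXmlSketch {
    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Build one <add> message containing a single <doc>.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue())).append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>(); // keeps insertion order
        doc.put("id", "doc01");
        doc.put("title", "Our first document");
        doc.put("mimeType", "application/pdf");
        System.out.println(toAddXml(doc)); // body to POST to the /update handler
    }
}
```

Escaping matters in practice: titles containing & or < would otherwise produce a malformed message that Solr rejects.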

Recently, JSON support has been added so it can be even cleaner to post documents into Solr and easier to adapt to a wider variety of clients. It looks like this:


{"add": {
  "doc": {
	"id": "doc02",
	"title": "Solr JSON",
	"mimeType": "application/pdf"}
  }
}

Simply post this type of JSON to /update/json. All other Solr commands can be posted as JSON as well (delete, commit, optimize).

Comma, or Tab, Separated Values

Another convenient way to bring documents into Solr is through CSV (comma-separated values; or, more generally, character-separated values, as the separator character is configurable). An example CSV file is shown here:


id,title,mimeType,lastModified
doc03,CSV ftw,application/pdf,2011-02-28T23:59:59Z

This CSV can be POSTed to the /update/csv handler, mapping rows to documents and columns to fields in a flexible, mappable manner. Using curl, this file (we named docs.csv) can be posted like this:


curl "http://localhost:8983/solr/update/csv?commit=true" \
  --data-binary @docs.csv -H 'Content-type:text/plain; charset=utf-8'

Note that this Content-type header is a necessary HTTP header to use for the CSV update handler.

Indexing Rich Document Types

Thus far, our indexing examples have omitted extracting and indexing file content. Numerous rich document types, such as Word, PDF, and HTML, can be processed using Solr’s built-in Apache Tika integration. To index the contents and metadata of a Word document, using the HTTP command-line tool curl, this is basically all that is needed:


curl "http://localhost:8983/solr/update/extract?literal.id=doc04" \
  -F "myfile=@technical_manual.doc"

To index rich documents with Lucene’s API, you would need to interface with one or more extractor libraries, such as Tika, extract the text, and map full text and document metadata as appropriate to Lucene fields. It’s much more straightforward, with no coding, to accomplish this task with Solr.

Hot Tip

Apache Tika http://tika.apache.org/ is a toolkit for detecting and extracting metadata from various types of documents. Existing open-source extractors and parsers are bundled with Tika to handle the majority of file types folks desire to search. Tika is baked into Solr, under the covers of the /update/extract capability.

DataImportHandler

And finally, Solr includes a general-purpose “data import handler” framework that has built-in capabilities for indexing relational databases (anything with a JDBC driver), arbitrary XML, and e-mail folders. The neat thing about the DataImportHandler is that it allows aggregating data from various sources into whole Solr documents.

For more information on Solr’s DataImportHandler, see http://wiki.apache.org/solr/DataImportHandler.

Deleting Documents

Documents can be deleted from a Lucene index, either by precise term matching (a unique identifier field, generally) or in bulk for all documents matching a Query.

When using Solr, deletes are accomplished by POSTing <delete><id>refcard01</id></delete> or <delete><query>mimeType:application/pdf</query></delete> XML messages to the /update handler, or "delete": { "id":"ID" } or "delete": { "query":"mimeType:application/pdf" } messages to /update/json.

Hot Tip

Deleting by query “*:*” and committing is a handy trick for deleting all documents and starting with a fresh index; very helpful during rapid iterative development.

Committing

Lucene is designed such that documents can continuously be indexed, though the view of what is searchable is fixed to a certain snapshot of an index (for performance, caching, and versioning reasons). This architecture allows batches of documents to be indexed and only made searchable after the entire batch has been ingested. Pending changes to an index, including added and deleted documents, are made visible using a commit command. With Solr, a <commit/> message can be posted to the /update handler, “commit”: {} to /update/json, or even simpler as a bodyless /update GET (or POST) with commit=true set: http://localhost:8983/solr/update?commit=true
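The snapshot behavior can be modeled with a toy class in plain Java. This is only a sketch of the visibility rule described above, not how Lucene or Solr implement commits; the class CommitSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of commit semantics: searches see only the last committed snapshot.
class CommitSketch {
    private final Map<String, String> pending = new LinkedHashMap<>(); // working copy
    private Map<String, String> searchable = new LinkedHashMap<>();    // committed view

    // Adding a document with an existing id replaces the earlier one.
    void add(String id, String title) { pending.put(id, title); }

    void delete(String id) { pending.remove(id); }

    // Publish all pending adds and deletes at once as a new snapshot.
    void commit() { searchable = new LinkedHashMap<>(pending); }

    // Number of documents a search would currently see.
    int numFound() { return searchable.size(); }

    public static void main(String[] args) {
        CommitSketch index = new CommitSketch();
        index.add("doc01", "Our first document");
        index.add("doc02", "Solr JSON");
        System.out.println(index.numFound()); // 0: nothing committed yet
        index.commit();
        System.out.println(index.numFound()); // 2
        index.delete("doc01");
        System.out.println(index.numFound()); // 2: the delete is still pending
        index.commit();
        System.out.println(index.numFound()); // 1
    }
}
```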

FIELDS

As mentioned, fields have a lot of configuration flexibility. The following table details the various decisions you must make regarding each field’s configuration.

Field Attribute Effect and Uses
stored Stores the original incoming field value in the index. Stored field values are available when documents are retrieved for search results.
term positions Location information of terms within a field. Positional information is necessary for proximity-related queries, such as phrase queries.
term offsets Character begin and end offset values of a term within a field’s textual value. Offsets can speed up the generation of highlighted field fragments for query terms. This one typically is a trade-off between highlighting performance and index size. If offsets aren’t stored, they can be computed at highlighting time.
term vectors An “inverted index” structure within a document, containing term/frequency pairs. Term vectors can be useful for more advanced search techniques, such as “more like this” where terms and their frequencies within a single document can be leveraged for finding similar documents.

In Solr’s schema.xml, a field can be configured to have all of these bells and whistles enabled like this:


<field name="kitchen_sink" type="text" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true" />

Only indexed fields have “terms”. These additional term-based structures are only available on indexed fields and really only make sense when used with analyzed full-text fields.

When indexing non-textual information, such as dates or numbers, the representation and ordering of the terms in the index drastically impact the types of operations available. Especially for numeric and date types, which typically are used for range queries and sorting, Lucene (and Solr) offer special ways to handle them. When indexing dates and numerics, use the Trie*Field types in Solr, and the NumericField/NumericTokenStream APIs with Lucene. This is a crucial reminder that what you want your end application to do with the search server greatly impacts how you index your documents. Sorting and range queries, specifically, require up-front planning to index properly to support those operations.

ANALYSIS

The Lucene analysis process consists of several stages. The text is sent initially through an optional CharFilter, then through a Tokenizer, and finally through any number of TokenFilters. CharFilters are useful for mapping diacritical characters to their ASCII equivalent, or mapping Traditional to Simplified Chinese. A Tokenizer is the first step in breaking a string into “tokens” (what they are called before being written to the index as “terms”). TokenFilters can subsequently add, remove, or modify/augment tokens in a sequential pipeline fashion.
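The three stages can be imitated with plain Java string handling. The sketch below is a simplified stand-in, not Lucene's CharFilter, Tokenizer, or TokenFilter classes: the class AnalysisChainSketch, its method names, and the tiny stop-word list are assumptions made for illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for an analysis chain: char filter -> tokenizer -> token filters.
class AnalysisChainSketch {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // CharFilter stage: works on raw characters, here folding accented letters to ASCII.
    static String charFilter(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Tokenizer stage: break the character stream into tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // TokenFilter stages, applied in sequence: lowercase, then drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(charFilter(text))) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Understanding Lucene Refcard")); // [understanding, lucene, refcard]
        System.out.println(analyze("The Café of Lucene"));           // [cafe, lucene]
    }
}
```

The same chain must run on query text as well as on indexed text; otherwise a query for "Cafe" would never meet the indexed term "cafe".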

Diagram 1

Hot Tip

Solr includes a very handy analysis introspection tool. You can access it at http://localhost:8983/solr/admin/analysis.jsp. Specify a field name or field type, enter some text, and see how it gets analyzed through each of the processing stages.

Using the Solr admin analysis introspection tool, using the field type “text_en” with the value “Understanding Lucene Refcard”, the following terms result:

Diagram 2

The analysis tool shows the term text that would be indexed ([understanding], [lucene]…), and the position and offset attributes we previously discussed. The analysis tool will handily show you the term output of each of the analysis stages, from tokenization through each of the filters.

SEARCHING

Now that we’ve got content indexed, searching it is easy! Ultimately, a Lucene Query object is handed to a Lucene IndexSearcher.search() method and results are processed. How to construct a query is the next step.

With Lucene Java, TermQuery is the most primitive Query. Then there’s BooleanQuery, PhraseQuery, and many other Query subclasses to choose from. Programmatically, the sky’s the limit in terms of query complexity. Lucene also includes a QueryParser, which parses a string into a Query object, supporting fielded, grouped, fuzzy, phrase, range, AND/OR/NOT/+/- and other sophisticated syntax.

Solr makes this all possible without coding and accepts a simple string query (q) parameter (and other parameters that can affect query parsing/generation). Solr includes a couple of general purpose query parsers, most notably a schema-aware subclass of Lucene’s QueryParser. This Lucene query parser is the default.

Hot Tip

Solr also includes a number of other specialized query parsers and the capability to mix-and-match them in rich combinations. Most notable are the “dismax” (disjunction maximum) parser and a new experimental “edismax” (extended dismax) parser, which allow typical users’ queries to search across a number of configurable fields with individual field boosting. Dismax is the parser most often used with Solr these days.

Searching Solr is a straightforward HTTP request to /select?q=<your query>. Displaying search results in JSON format (adding &wt=json), we get something like this:


{"responseHeader":{
	"status":0,
	"QTime":2,
	"params":{
	  "indent":"true", "wt":"json", "q":"*:*"}},
  "response":{"numFound":3,"start":0,
	"docs":[
	  {"id":"refcard01",
		"timestamp":"2011-02-17T20:44:49.064Z",
		"title":["Understanding Lucene"]},
	  {"id":"refcard02",
		"timestamp":"2011-02-17T20:48:16.862Z",
		"title":["Refcard 2"]},
	  {"id":"doc03",
		"mimeType":"application/pdf",
		"lastModified":"2011-02-28T23:59:59Z",
		"timestamp":"2011-02-17T21:42:31.423Z",
		"title":["CSV ftw"]}]
  }}

Note that Solr can return search results in a number of formats (XML, JSON, Ruby, PHP, Python, CSV, etc.); choose the one that is most convenient for your environment.

Debugging Query Parsing

Query parsing is complex business. It can be very helpful to see a representation of the underlying Query object generated. By adding a debug=query parameter to the request, you can see how a query is parsed. For example, using the query “title:lucene AND timestamp:[NOW-1YEAR TO NOW]”, the debug output returns a parsedquery value of:


parsedquery: +title:lucene +timestamp:[1266446158657 TO 1297982158657]

Note that AND translated to both clauses as mandatory (leading +) and the date range values were parsed by Solr’s useful date math feature and then converted to the Lucene “date” type index representation.
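The two numbers in the parsed range are epoch milliseconds. As a rough check, the following plain Java sketch reproduces the lower bound from the upper bound by subtracting one calendar year. It assumes the date math is evaluated in UTC and takes "now" from the debug output above; the class DateMathSketch is hypothetical and is not Solr's date math parser.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Reproduces NOW-1YEAR for a fixed "now", assuming evaluation in UTC.
class DateMathSketch {
    // Subtract one calendar year and return the result as epoch milliseconds.
    static long oneYearBefore(long nowMillis) {
        return Instant.ofEpochMilli(nowMillis)
                .atZone(ZoneOffset.UTC)
                .minusYears(1)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        long now = 1297982158657L; // upper bound of the parsed range above
        System.out.println(oneYearBefore(now)); // 1266446158657
    }
}
```

No leap day falls between the two dates, so the gap is exactly 365 days, or 31,536,000,000 milliseconds.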

Explaining Result Scoring

Now that we have real documents indexed, we can take a look at Lucene’s scoring first-hand. Solr provides an easy way to look at Lucene’s “explain” output, which details how/why a document scored the way it did. In our Refcard lab, doing a title:lucene search matches a document and scores it like this:


0.8784157 = (MATCH) fieldWeight(title:lucene in 0), product of:
	1.0 = tf(termFreq(title:lucene)=1)
	1.4054651 = idf(docFreq=1, maxDocs=3)
	0.625 = fieldNorm(field=title, doc=0)

Add the debug=results parameter to the Solr search request to have explanation output added to the response.
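The numbers in the explain output can be reproduced by hand. The sketch below assumes the formulas of Lucene's default TF-IDF similarity of that era, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes the fieldNorm value directly from the output rather than computing it. The class ExplainSketch is hypothetical and does not call Lucene.

```java
// Recomputes the fieldWeight from the explain output, assuming the default
// TF-IDF formulas: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)).
class ExplainSketch {
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // fieldNorm is passed in as reported by explain; it is not derived here.
    static double fieldWeight(int termFreq, int docFreq, int maxDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        System.out.printf("tf=%.7f idf=%.7f score=%.7f%n",
                tf(1), idf(1, 3), fieldWeight(1, 1, 3, 0.625));
    }
}
```

With termFreq=1, docFreq=1, maxDocs=3, and fieldNorm=0.625, the product agrees with the 0.8784157 shown above to within rounding.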

BELLS AND WHISTLES

Solr includes a number of other features; some of them wrap Lucene Java add-on libraries and some of them (like faceting and rich function query/sort capability) are currently only at the Solr layer. We aren’t going into any detail of these particular features here, but now that you understand Lucene, you have the foundation to understand basically how they work from the inverted index structure on up. These features include:

  • Faceting: providing counts for various document attributes across the entire result set.
  • Highlighting: generating relevant snippets of document text, highlighting query terms. Useful in result display to show users the context in which their queries matched.
  • Spell checking: “Did you mean…?”. Looks up terms textually close to the query terms and suggests possible intended queries.
  • More-like-this: Given a particular document, or some arbitrary text, what other documents are similar?

Version Information

These Refcard demos use the current development branch of Lucene/Solr. This is likely to be what is eventually released from Apache as Lucene and Solr 4.0. LucidWorks Enterprise is also based on this same version. The concepts apply to all versions of Lucene and Solr, and the bulk of these examples should also work with earlier versions of Solr.

For Further Information

For all things Apache Lucene, start here: http://lucene.apache.org

Solr sports relatively decent developer-centric documentation: http://wiki.apache.org/solr

Lucene in Action (Manning): http://www.manning.com/lucene

To answer your Lucene questions, try LucidFind (http://search.lucidimagination.com), where the Lucene ecosystem’s e-mail lists, wikis, issue tracker, etc. are made searchable for the entire Lucene community’s benefit.

See Apache Solr: Getting Optimal Search Results, http://refcardz.dzone.com/refcardz/solr-essentials, for more information on Apache Solr.

About The Authors

Erik Hatcher

Erik Hatcher evangelizes and engineers at Lucid Imagination. He co-authored both Lucene in Action and Java Development with Ant. At Lucid, he has worked with many companies deploying Lucene/Solr search systems. Erik has spoken at numerous industry events including Lucene EuroCon, ApacheCon, JavaOne, OSCON, and user groups and meetups around the world.

Recommended Book

Lucene in Action

When Lucene first appeared, this superfast search engine was nothing short of amazing. Today, Lucene still delivers. Its high-performance, easy-to-use API features like numeric fields, payloads, near-realtime search, and huge increases in indexing and searching speed make it the leading search tool.

And with clear writing, reusable examples, and unmatched advice, Lucene in Action, Second Edition is still the definitive guide to effectively integrating search into your applications. This totally revised book shows you how to index your documents, including formats such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, and filtering and covers the numerous improvements to Lucene since the first edition. Source code is for Lucene 3.0.1.

The Top Twelve Integration Patterns for Apache Camel

By Claus Ibsen


The Essential Apache Camel Cheat Sheet

Enterprise Integration Patterns (EIP) have become the standard way to describe, document and implement complex integration problems. Apache Camel is an open-source project for implementing the EIP simply in a few lines of Java code or XML configuration. This DZone Refcard will guide you through the most common Enterprise Integration Patterns and give you examples of how to implement them either in Java code or using Spring XML. While it is targeted toward software developers and enterprise architects, anyone in the integration space can benefit from this Refcard.

Enterprise Integration Patterns: with Apache Camel

About Enterprise Integration Patterns

Integration is a hard problem. To help deal with its complexity, the Enterprise Integration Patterns (EIP) have become the standard way to describe, document and implement complex integration problems. Hohpe and Woolf's book Enterprise Integration Patterns has become the bible in the integration space: essential reading for any integration professional.

Apache Camel is an open source project for implementing the EIP easily in a few lines of Java code or Spring XML configuration. This reference card, the first in a two card series, guides you through the most common Enterprise Integration Patterns and gives you examples of how to implement them either in Java code or using Spring XML. This Refcard is targeted for software developers and enterprise architects, but anyone in the integration space can benefit as well.

About Apache Camel

Apache Camel is a powerful open source integration platform based on Enterprise Integration Patterns (EIP) with powerful Bean Integration. Camel lets you implement EIP routing using Camel's intuitive Domain Specific Language (DSL) based on Java (aka fluent builder) or XML. Camel uses URIs for endpoint resolution, so it's very easy to work with any kind of transport such as HTTP, REST, JMS, web service, File, FTP, TCP, Mail, JBI, Bean (POJO) and many others. Camel also provides Data Formats for various popular formats such as CSV, EDI, FIX, HL7, JAXB, JSON and XStream. Camel is an integration API that can be embedded in any server of choice, such as a J2EE server, ActiveMQ, Tomcat or OSGi, or run standalone. Camel's Bean Integration lets you define loose coupling, allowing you to fully separate your business logic from the integration logic. Camel is based on a modular architecture allowing you to plug in your own component or data format, so they seamlessly blend in with existing modules. Camel provides a test kit for unit and integration testing with strong mock and assertion capabilities.

Essential Patterns

This group consists of the most essential patterns that anyone working with integration must know.

Pipes and Filters

Diagram How can we perform complex processing on a message while maintaining independence and flexibility?
Pipes and Filters
Problem A single event often triggers a sequence of processing steps
Solution Use Pipes and Filters to divide a larger processing task into a sequence of smaller, independent processing steps (filters) that are connected by channels (pipes)
Camel Camel supports Pipes and Filters using the pipeline node.
Java DSL

from("jms:queue:order:in").pipeline("direct:transformOrder", "direct:validateOrder", "jms:queue:order:process");

Where jms represents the JMS component used for consuming JMS messages on the JMS broker. Direct is used for combining endpoints in a synchronous fashion, allowing you to divide routes into sub routes and/or reuse common routes.

TIP: Pipeline is the default mode of operation when you specify multiple outputs, so it can be omitted and replaced with the more common to node:


from("jms:queue:order:in").to("direct:transformOrder",
"direct:validateOrder", "jms:queue:order:process");

TIP: You can also separate each step as individual to nodes:


from("jms:queue:order:in")
	.to("direct:transformOrder")
	.to("direct:validateOrder")
	.to("jms:queue:order:process");

Spring DSL

<route>
	<from uri="jms:queue:order:in"/>
	<pipeline>
		<to uri="direct:transformOrder"/>
		<to uri="direct:validateOrder"/>
		<to uri="jms:queue:order:process"/>
	</pipeline>
</route>
<route>
	<from uri="jms:queue:order:in"/>
	<to uri="direct:transformOrder"/>
	<to uri="direct:validateOrder"/>
	<to uri="jms:queue:order:process"/>
</route>
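Stripped of transports, the pattern itself is just sequential composition: the output of each filter becomes the input of the next. The following plain Java sketch shows that idea without any Camel API; the class PipelineSketch and the two example filters are hypothetical stand-ins for the transformOrder and validateOrder steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Pipes and Filters without Camel: each filter's output feeds the next filter.
class PipelineSketch {
    static String pipeline(String message, List<UnaryOperator<String>> filters) {
        String current = message;
        for (UnaryOperator<String> filter : filters) {
            current = filter.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> filters = new ArrayList<>();
        // Stand-in for "direct:transformOrder": normalize the raw order text.
        filters.add(order -> order.trim().toUpperCase(Locale.ROOT));
        // Stand-in for "direct:validateOrder": reject empty orders, pass others through.
        filters.add(order -> {
            if (order.isEmpty()) {
                throw new IllegalArgumentException("empty order");
            }
            return order;
        });
        System.out.println(pipeline("  widget x2 ", filters)); // WIDGET X2
    }
}
```

Because every filter has the same shape, steps can be added, removed, or reordered without touching the others, which is the independence the pattern is after.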

Message Router

Diagram How can you decouple individual processing steps so that messages can be passed to different filters depending on a set of conditions?
Message Router
Problem Pipes and Filters route each message in the same processing steps. How can we route messages differently?
Solution Filter using predicates to choose the right output destination.
Camel Camel supports Message Router using the choice node. For more details see the Content-Based Router pattern.

Content-Based Router

Diagram How do we handle a situation where the implementation of a single logical function (e.g., inventory check) is spread across multiple physical systems?
Content-Based Router
Problem How do we ensure a Message is sent to the correct recipient based on information from its content?
Solution Use a Content-Based Router to route each message to the correct recipient based on the message content.
Camel Camel has extensive support for Content-Based Routing. Camel supports content based routing based on choice, filter, or any other expression.
Java DSL

Choice


from("jms:queue:order")
	.choice()
		.when(header("type").in("widget","wiggy"))
			.to("jms:queue:order:widget")
		.when(header("type").isEqualTo("gadget"))
			.to("jms:queue:order:gadget")
		.otherwise().to("jms:queue:order:misc")
	.end();

TIP: In the route above end() can be omitted as it's the last node and we do not route the message to a new destination after the choice.

TIP: You can continue routing after the choice ends.
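The decision logic behind the choice node is first-match-wins over an ordered list of predicates, with a fallback. The plain Java sketch below mirrors the route above without using any Camel API; the class ContentRouterSketch and its when and route methods are hypothetical names chosen to echo the DSL.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Content-Based Router without Camel: the first matching predicate picks the destination.
class ContentRouterSketch {
    private final Map<Predicate<String>, String> rules = new LinkedHashMap<>(); // kept in order
    private final String otherwise;

    ContentRouterSketch(String otherwise) {
        this.otherwise = otherwise;
    }

    ContentRouterSketch when(Predicate<String> predicate, String destination) {
        rules.put(predicate, destination);
        return this;
    }

    // Returns the destination for a message with the given "type" header value.
    String route(String type) {
        for (Map.Entry<Predicate<String>, String> rule : rules.entrySet()) {
            if (rule.getKey().test(type)) {
                return rule.getValue();
            }
        }
        return otherwise;
    }

    public static void main(String[] args) {
        ContentRouterSketch router = new ContentRouterSketch("jms:queue:order:misc")
                .when(type -> Arrays.asList("widget", "wiggy").contains(type), "jms:queue:order:widget")
                .when("gadget"::equals, "jms:queue:order:gadget");
        System.out.println(router.route("wiggy"));  // jms:queue:order:widget
        System.out.println(router.route("gadget")); // jms:queue:order:gadget
        System.out.println(router.route("gizmo"));  // jms:queue:order:misc
    }
}
```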

Spring DSL

Choice


<route>
	<from uri="jms:queue:order"/>
	<choice>
		<when>
			<simple>${header.type} in 'widget,wiggy'</simple>
			<to uri="jms:queue:order:widget"/>
		</when>
		<when>
			<simple>${header.type} == 'gadget'</simple>
			<to uri="jms:queue:order:gadget"/>
		</when>
		<otherwise>
			<to uri="jms:queue:order:misc"/>
		</otherwise>
	</choice>
</route>

TIP: In Spring DSL you cannot invoke code, as opposed to the Java DSL that is 100% Java. To express the predicates for the choices we need to use a language. We will use the Simple language, which uses a simple expression parser that supports a limited set of operators. You can use any of the more powerful languages supported in Camel such as: JavaScript, Groovy, Unified EL and many others.

TIP: You can also use a method call to invoke a method on a bean to evaluate the predicate. Let's try that:


<when>
	<method bean="myBean" method="isGadget"/>
	...
</when>

<bean id="myBean" class="com.mycompany.MyBean"/>
	
public boolean isGadget(@Header(name = "type") String type) {
	// match the lowercase value used in the routes above; also safe if the header is missing
	return "gadget".equals(type);
}

Notice how we use Bean Parameter Binding to instruct Camel to invoke this method and pass in the type header as the String parameter. This allows your code to be fully decoupled from any Camel API so it's easy to read, write and unit test.

Message Translator

Diagram How can systems using different data formats communicate with each other using messaging?
Message Translator
Problem Each application uses its own data format, so we need to translate the message into the data format the application supports.
Solution Use a special filter, a message translator, between filters or applications to translate one data format into another.
Camel Camel supports the message translator using the processor, bean or transform nodes. TIP: Camel routes the message as a chain of processor nodes.
Java DSL

Processor


public class OrderTransformProcessor
		implements Processor {
	public void process(Exchange exchange)
			throws Exception {
		// do message translation here
	}
}
from("direct:transformOrder")
	.process(new OrderTransformProcessor());

Bean

Instead of the processor we can use a Bean (POJO). An advantage of using a Bean over a Processor is that we do not have to implement or use any Camel specific interfaces or types. This allows you to fully decouple your beans from Camel.


public class OrderTransformerBean {
	public String transformOrder(String body) {
		// do message translation here
		return body; // placeholder: return the translated message
	}
}

Object transformer = new OrderTransformerBean();
from("direct:transformOrder").bean(transformer);

TIP: Camel can create an instance of the bean automatically; you can just refer to the class type.


from("direct:transformOrder")
	.bean(OrderTransformerBean.class);

TIP: Camel will try to figure out which method to invoke on the bean in case there are multiple methods. In case of ambiguity you can specify which method to invoke with the method name parameter:


from("direct:transformOrder")
	.bean(OrderTransformerBean.class, "transformOrder");

Transform

Transform is a particular processor allowing you to set a response to be returned to the original caller. We use transform to return a constant ACK response to the TCP listener after we have copied the message to the JMS queue. Notice we use a constant to build an "ACK" string as response.


from("mina:tcp://localhost:8888?textline=true")
	.to("jms:queue:order:in")
	.transform(constant("ACK"));

Spring DSL

Processor


<route>
	<from uri="direct:transformOrder"/>
	<process ref="transformer"/>
</route>

<bean id="transformer"
	class="com.mycompany.OrderTransformProcessor"/>

In Spring DSL Camel will look up the processor or POJO/Bean in the registry based on the id of the bean.

Bean


<route>
	<from uri="direct:transformOrder"/>
	<bean ref="transformer"/>
</route>

<bean id="transformer"
	class="com.mycompany.OrderTransformerBean"/>

Transform


<route>
	<from uri="mina:tcp://localhost:8888?textline=true"/>
	<to uri="jms:queue:order:in"/>
	<transform>
		<constant>ACK</constant>
	</transform>
</route>

Annotation DSL

You can also use the @Consume annotation for transformations. For example, in the method below we consume from a JMS queue and do the transformation in regular Java code. Notice that both the parameter and the return type of the method are String. Camel will automatically coerce the payload to the expected type defined by the method. Since this is a JMS example, the response will be sent back to the JMS reply-to destination.


@Consume(uri="jms:queue:order:transform")
public String transformOrder(String body) {
	// do message translation
}

TIP: You can use Bean Parameter Binding to help Camel coerce the Message into the method parameters. For instance you can use @Body, @Headers parameter annotations to bind parameters to the body and headers.

Message Filter

Diagram How can a component avoid receiving unwanted messages?
Message Filter
Problem How do you discard unwanted messages?
Solution Use a special kind of Message Router, a Message Filter, to eliminate undesired messages from a channel based on a set of criteria.
Camel Camel has support for Message Filter using the filter node. The filter evaluates a predicate; only messages for which the predicate is true pass the filter, whereas messages for which it is false are silently discarded.
Java DSL We want to discard any test messages so we only route non-test messages to the order queue.

from("jms:queue:inbox")
	.filter(header("test").isNotEqualTo("true"))
	.to("jms:queue:order");

Spring DSL For the Spring DSL we use XPath to evaluate the predicate. The $test is a special shorthand in Camel to refer to the header with the given name. So even if the payload is not XML based we can still use XPath to evaluate predicates.

<route>
	<from uri="jms:queue:inbox"/>
	<filter>
		<xpath>$test = 'false'</xpath>
		<to uri="jms:queue:order"/>
	</filter>
</route>
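
The filter semantics can be sketched without Camel. In the example below a message is reduced to its header map, and MessageFilterSketch, NOT_TEST and filter are hypothetical names used only for illustration; the real filter node works on the Camel Exchange.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the filter node; each message is reduced to its header map.
class MessageFilterSketch {
	// Same intent as header("test").isNotEqualTo("true"); in this sketch a
	// message without the header also passes.
	static final Predicate<Map<String, String>> NOT_TEST =
			headers -> !"true".equals(headers.get("test"));

	// Returns the messages that satisfy the predicate; the rest are dropped silently.
	static List<Map<String, String>> filter(List<Map<String, String>> messages,
			Predicate<Map<String, String>> predicate) {
		List<Map<String, String>> passed = new ArrayList<>();
		for (Map<String, String> headers : messages) {
			if (predicate.test(headers)) {
				passed.add(headers);
			}
		}
		return passed;
	}

	public static void main(String[] args) {
		List<Map<String, String>> inbox = new ArrayList<>();
		inbox.add(Collections.singletonMap("test", "true"));
		inbox.add(Collections.singletonMap("test", "false"));
		System.out.println(filter(inbox, NOT_TEST).size() + " message(s) passed the filter");
	}
}
```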

Dynamic Router

Diagram
Dynamic Router
Problem How can we route messages based on a dynamic list of destinations?
Solution Use a Dynamic Router, a router that can self-configure based on special configuration messages from participating destinations.
Camel Camel has support for Dynamic Router using the Dynamic Recipient List combined with a data store holding the list of destinations.
Java DSL We use a Processor as the dynamic router to determine the destinations. We could also have used a Bean instead.

from("jms:queue:order")
	.processRef("myDynamicRouter")
	.recipientList(header("destinations"));
	
public class MyDynamicRouter implements Processor {
	public void process(Exchange exchange) {
		// query a data store to find the best match for the
		// endpoint and set the destination(s) as a header:
		// exchange.getIn().setHeader("destinations", list);
	}
}

Spring DSL

<route>
	<from uri="jms:queue:order"/>
	<process ref="myDynamicRouter"/>
	<recipientList>
		<header>destinations</header>
	</recipientList>
</route>

Annotation DSL

public class MyDynamicRouter {
	@Consume(uri = "jms:queue:order")
	@RecipientList
	public List<String> route(@XPath("/customer/id") String customerId,
			@Header("location") String location, Document body) {
		// query data store, find best match for the
		// endpoint and return destination(s)
	}
}

TIP: Notice how we used Bean Parameter Binding to bind the parameters to the route method based on an @XPath expression on the XML payload of the JMS message. This allows us to extract the customer id as a String parameter. @Header will bind a JMS property with the key location. Document is the XML payload of the JMS message.

TIP: Camel uses its strong type converter feature to convert the payload to the type of the method parameter. We could use String and Camel will convert the body to a String instead. You can register your own type converters as well using the @Converter annotation at the class and method level.

Recipient List

Diagram How do we route a message to a list of statically or dynamically specified recipients?
Recipient List
Problem How can we route messages based on a static or dynamic list of destinations?
Solution Define a channel for each recipient. Then use a Recipient List to inspect an incoming message, determine the list of desired recipients and forward the message to all channels associated with the recipients in the list.
Camel Camel supports the static Recipient List using the multicast node, and the dynamic Recipient List using the recipientList node.
Java DSL

Static

In this route we route to a static list of two recipients, that will receive a copy of the same message simultaneously.


from("jms:queue:inbox")
	.multicast().to("file://backup", "seda:inbox");

Dynamic

In this route we route to a dynamic list of recipients defined in the message header destinations, which contains a list of recipients as endpoint URIs. The bean processMails is used to add the destinations header to the message.


from("seda:confirmMails").beanRef("processMails")
	.recipientList(header("destinations"));

In the processMails bean we use @Headers Bean Parameter Binding to obtain the message headers as a java.util.Map, in which we store the recipients.


public void confirm(@Headers Map headers, @Body String body) {
	String[] recipients = ...
	headers.put("destinations", recipients);
}

Spring DSL

Static


<route>
	<from uri="jms:queue:inbox" />
	<multicast>
		<to uri="file://backup"/>
		<to uri="seda:inbox"/>
	</multicast>
</route>

Dynamic

In this example we invoke a method call on a Bean to provide the dynamic list of recipients.


<route>
	<from uri="jms:queue:inbox" />
	<recipientList>
		<method bean="myDynamicRouter" method="route"/>
	</recipientList>
</route>

<bean id="myDynamicRouter"
	class="com.mycompany.MyDynamicRouter"/>
	
public class MyDynamicRouter {
	public String[] route(String body) {
		return new String[] { "file://backup", .... };
	}
}

Annotation DSL

In the CustomerService class we annotate the whereTo method with @RecipientList, and return a single destination based on the customer id. Notice the flexibility of Camel: it adapts to whatever your methods return, be it a single element, a list, an iterator, etc.


public class CustomerService {
	@RecipientList
	public String whereTo(@Header("customerId") String id) {
		return "jms:queue:customer:" + id;
	}
}

And then we can route to the bean and it will act as a dynamic recipient list.


from("jms:queue:inbox")
	.bean(CustomerService.class, "whereTo");
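
To make the idea of adapting to different return types concrete, here is a minimal plain-Java sketch. It is not Camel's implementation: RecipientNormalizer and toRecipients are hypothetical names, and splitting a single String on commas is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not Camel API: turns different return types into a list of URIs.
class RecipientNormalizer {
	static List<String> toRecipients(Object result) {
		List<String> recipients = new ArrayList<>();
		if (result == null) {
			return recipients;
		}
		if (result instanceof String) {
			// assumption for this sketch: a single String may hold comma-separated URIs
			for (String uri : ((String) result).split(",")) {
				if (!uri.trim().isEmpty()) {
					recipients.add(uri.trim());
				}
			}
		} else if (result instanceof Object[]) {
			for (Object uri : (Object[]) result) {
				recipients.add(String.valueOf(uri));
			}
		} else if (result instanceof Iterable) {
			for (Object uri : (Iterable<?>) result) {
				recipients.add(String.valueOf(uri));
			}
		} else if (result instanceof Iterator) {
			Iterator<?> it = (Iterator<?>) result;
			while (it.hasNext()) {
				recipients.add(String.valueOf(it.next()));
			}
		} else {
			// any other single value is treated as one recipient
			recipients.add(String.valueOf(result));
		}
		return recipients;
	}

	public static void main(String[] args) {
		System.out.println(toRecipients("jms:queue:customer:42"));
		System.out.println(toRecipients(Arrays.asList("file://backup", "seda:inbox")));
	}
}
```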

Splitter

Diagram How can we process a message if it contains multiple elements, each of which may have to be processed in a different way?
Splitter
Problem How can we split a single message into pieces to be routed individually?
Solution Use a Splitter to break out the composite message into a series of individual messages, each containing data related to one item.
Camel Camel has support for Splitter using the split node.
Java DSL

In this route we consume files from the inbox folder. Each file is then split into a new message. We use a tokenizer to split the file content line by line based on line breaks.


from("file://inbox")
	.split(body().tokenize("\n"))
	.to("seda:orderLines");

TIP: Camel also supports splitting streams using the streaming node. We can split the stream by using a comma:


.split(body().tokenize(",")).streaming().to("seda:parts");

TIP: In the routes above each individual split message will be executed in sequence. Camel also supports parallel execution using the parallelProcessing node.


.split(body().tokenize(",")).streaming()
	.parallelProcessing().to("seda:parts");

Spring DSL In this route we use XPath to split XML payloads received on the JMS order queue.

<route>
	<from uri="jms:queue:order"/>
	<split>
		<xpath>/invoice/lineItems</xpath>
		<to uri="seda:processOrderLine"/>
	</split>
</route>

And in this route we split the messages using a regular expression:


<route>
	<from uri="jms:queue:order"/>
	<split>
		<tokenizer token="([A-Z|0-9]*);" regex="true"/>
		<to uri="seda:processOrderLine"/>
	</split>
</route>

TIP: Split evaluates an org.apache.camel.Expression that must provide something iterable, from which each individual new message is produced. This allows you to provide any kind of expression, such as a Bean invoked as a method call.


<split>
	<method bean="mySplitter" method="splitMe"/>
	<to uri="seda:processOrderLine"/>
</split>

<bean id="mySplitter" class="com.mycompany.MySplitter"/>

public List splitMe(String body) {
	// split using java code and return a List
	List parts = ...
	return parts;
}
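
As an illustration of what such a splitter method might contain, the sketch below splits the body on line breaks and skips blank lines. MySplitterSketch is a hypothetical class name, and the splitting rule is an assumption chosen for the example rather than anything the route requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter bean: one list entry becomes one new message.
class MySplitterSketch {
	public List<String> splitMe(String body) {
		List<String> parts = new ArrayList<>();
		if (body == null) {
			return parts;
		}
		// accept both Unix and Windows line endings
		for (String line : body.split("\\r?\\n")) {
			if (!line.trim().isEmpty()) {
				parts.add(line);
			}
		}
		return parts;
	}

	public static void main(String[] args) {
		System.out.println(new MySplitterSketch().splitMe("line 1\nline 2\n\nline 3"));
	}
}
```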

Aggregator

Diagram How do we combine the results of individual, but related messages so that they can be processed as a whole?
Aggregator
Problem How do we combine multiple messages into a single combined message?
Solution Use a stateful filter, an Aggregator, to collect and store individual messages until it receives a complete set of related messages to be published.
Camel Camel has support for the Aggregator using the aggregate node. Camel uses a stateful batch processor that is capable of aggregating related messages into a single combined message. A correlation expression is used to determine which messages should be aggregated. An aggregation strategy is used to combine aggregated messages into the result message. Camel’s aggregator also supports a completion predicate allowing you to signal when the aggregation is complete. Camel also supports other completion signals based on timeout and/or a number of messages already aggregated.
Java DSL

Stock quote example

We want to update a website every five minutes with the latest stock quotes. The quotes are received on a JMS topic. As we can receive multiple quotes for the same stock within this time period, we only want to keep the last one, as it's the most up to date. We can do this with the aggregator:


from("jms:topic:stock:quote")
	.aggregate().xpath("/quote/@symbol")
	.batchTimeout(5 * 60 * 1000).to("seda:quotes");

As the correlation expression we use XPath to fetch the stock symbol from the message body. As the aggregation strategy we use the default provided by Camel that picks the latest message, and thus also the most up to date. The time period is set as a timeout value in milliseconds.

Loan broker example

We aggregate responses from various banks for their quote for a given loan request. We want to pick the bank with the best quote (the cheapest loan), so we need an aggregation strategy that picks the best quote.


from("jms:topic:loan:quote")
	.aggregate().header("loanId")
	.aggregationStrategy(bestQuote)
	.completionPredicate(header(Exchange.AGGREGATED_SIZE)
	.isGreaterThan(2))
	.to("seda:bestLoanQuote");

We use a completion predicate that signals when we have received more than 2 quotes for a given loan, giving us at least 3 quotes to pick among. The following shows the code snippet for the aggregation strategy we must implement to pick the best quote:


public class BestQuoteStrategy implements AggregationStrategy {
	public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
		double oldQuote = oldExchange.getIn().getBody(Double.class);
		double newQuote = newExchange.getIn().getBody(Double.class);
		// return the "winner" that has the lowest quote
		return newQuote < oldQuote ? newExchange : oldExchange;
	}
}
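
How correlation, aggregation strategy and completion work together can be sketched in plain Java. The class below is not Camel API: LoanQuoteAggregator and its add method are hypothetical, quotes are reduced to a loan id and a price, and a group is considered complete once it holds the configured number of quotes (3 here, matching "more than 2").

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an aggregator: correlate by loanId, keep the lowest
// quote, and complete the group after a fixed number of quotes.
class LoanQuoteAggregator {
	private final Map<String, Double> best = new HashMap<>();
	private final Map<String, Integer> counts = new HashMap<>();
	private final int completionSize;

	LoanQuoteAggregator(int completionSize) {
		this.completionSize = completionSize;
	}

	// Adds one quote; returns the best quote when the group completes, otherwise null.
	Double add(String loanId, double quote) {
		Double current = best.get(loanId);
		// aggregation strategy: the lowest quote wins
		best.put(loanId, current == null ? quote : Math.min(current, quote));
		int count = counts.merge(loanId, 1, Integer::sum);
		if (count < completionSize) {
			return null;
		}
		// completion: publish the winner and forget the group
		counts.remove(loanId);
		return best.remove(loanId);
	}

	public static void main(String[] args) {
		LoanQuoteAggregator aggregator = new LoanQuoteAggregator(3);
		aggregator.add("loan-1", 5.0);
		aggregator.add("loan-1", 4.5);
		System.out.println(aggregator.add("loan-1", 4.8));
	}
}
```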

Spring DSL

Loan Broker Example


<route>
	<from uri="jms:topic:loan:quote"/>
	<aggregate strategyRef="bestQuote">
		<correlationExpression>
			<header>loanId</header>
		</correlationExpression>
		<completionPredicate>
			<simple>${header.CamelAggregatedSize} > 2</simple>
		</completionPredicate>
	</aggregate>
	<to uri="seda:bestLoanQuote"/>
</route>

<bean id="bestQuote"
	class="com.mycompany.BestQuoteStrategy"/>

TIP: We use the Simple language to declare the completion predicate. Simple is a basic language that supports a primitive set of operators. ${header.CamelAggregatedSize} will fetch a header holding the number of messages aggregated.

TIP: If the completion predicate is more complex we can use a method call to invoke a Bean so we can do the evaluation in pure Java code:


<completionPredicate>
	<method bean="quoteService" method="isComplete"/>
</completionPredicate>

public boolean isComplete(@Header(Exchange.AGGREGATED_SIZE)
		int count, String body) {
	return body.equals("STOP");
}

Notice how we can use Bean Parameter Binding to get hold of the aggregation size as a parameter, instead of looking it up in the message.

Resequencer

Diagram How can we get a stream of related but out-of-sequence messages back into the correct order?
Resequencer
Problem How do we ensure ordering of messages?
Solution Use a stateful filter, a Resequencer, to collect and reorder messages so that they can be published in a specified order.
Camel

Camel has support for the Resequencer using the resequence node. Camel uses a stateful batch processor that is capable of reordering related messages. Camel supports two resequencing algorithms:

- batch: collects messages into a batch, sorts them and publishes them in order

- stream: continuously re-orders message streams based on the detection of gaps between messages

Batch is similar to the aggregator but with sorting. Stream is the traditional Resequencer pattern with gap detection. Stream requires numeric (long) sequence numbers, because gap detection must be able to compute whether a gap exists. A gap is detected if a number in a series is missing, e.g. 3, 4, 6 with number 5 missing. Camel will hold back message 6 until number 5 arrives.
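
Gap detection can be illustrated with a small plain-Java sketch. It is not Camel's resequencer: StreamResequencerSketch and offer are hypothetical names, the first expected sequence number is assumed to be known up front, and timeout handling is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of stream resequencing with gap detection (no timeout).
class StreamResequencerSketch {
	private final TreeMap<Long, String> pending = new TreeMap<>();
	private long next;

	StreamResequencerSketch(long firstExpected) {
		this.next = firstExpected;
	}

	// Accepts one message and returns the messages that can now be released in order.
	List<String> offer(long sequenceNumber, String body) {
		pending.put(sequenceNumber, body);
		List<String> released = new ArrayList<>();
		// release while the lowest pending number is the one we are waiting for
		while (!pending.isEmpty() && pending.firstKey() == next) {
			released.add(pending.pollFirstEntry().getValue());
			next++;
		}
		return released;
	}

	public static void main(String[] args) {
		StreamResequencerSketch resequencer = new StreamResequencerSketch(3);
		System.out.println(resequencer.offer(3, "m3"));
		System.out.println(resequencer.offer(4, "m4"));
		System.out.println(resequencer.offer(6, "m6")); // held back: 5 is missing
		System.out.println(resequencer.offer(5, "m5")); // releases 5 and 6
	}
}
```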

Java DSL

Batch:

We want to process received stock quotes, once a minute, ordered by their stock symbol. We use XPath as the expression to select the stock symbol, as the value used for sorting.


from("jms:topic:stock:quote")
	.resequence().xpath("/quote/@symbol")
	.timeout(60 * 1000)
	.to("seda:quotes");

Camel will default the order to ascending. You can provide your own comparison for sorting if needed.

Stream:

Suppose we continuously poll a file directory for inventory updates, and it's important they are processed in sequence by their inventory id. To do this we enable streaming and use one hour as the timeout.


from("file://inventory")
	.resequence().xpath("/inventory/@id")
	.stream().timeout(60 * 60 * 1000)
	.to("seda:inventoryUpdates");

Spring DSL

Batch:


<route>
	<from uri="jms:topic:stock:quote"/>
	<resequence>
		<xpath>/quote/@symbol</xpath>
		<batch-config batchTimeout="60000"/>
	</resequence>
	<to uri="seda:quotes"/>
</route>

Stream:


<route>
	<from uri="file://inventory"/>
	<resequence>
		<xpath>/inventory/@id</xpath>
		<stream-config timeout="3600000"/>
	</resequence>
	<to uri="seda:inventoryUpdates"/>
</route>

Notice that you can enable streaming by specifying <stream-config> instead of <batch-config>.

Dead Letter Channel

Diagram What will the messaging system do with a message it cannot deliver?
Dead Letter Channel
Problem The messaging system cannot deliver a message.
Solution When a message cannot be delivered it should be moved to a Dead Letter Channel.
Camel

Camel has extensive support for Dead Letter Channel by its error handler and exception clauses. Error handler supports redelivery policies to decide how many times to try redelivering a message, before moving it to a Dead Letter Channel.

The default Dead Letter Channel will log the message at ERROR level and perform up to 6 redeliveries using a one second delay before each retry.

The error handler has two scopes: global and per route.

TIP: See Exception Clause in the Camel documentation for selective interception of thrown exceptions. This allows you to route certain exceptions differently or even reset the failure by marking it as handled.

TIP: DeadLetterChannel supports processing the message before it gets redelivered using onRedelivery. This allows you to alter the message beforehand (e.g. to set custom headers).
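
The redelivery idea can be sketched in plain Java as one initial attempt followed by a bounded number of retries, after which the message is moved aside. This is not Camel's error handler: RedeliverySketch, deliver and deadLetters are hypothetical names, and the in-memory list merely stands in for a real dead letter endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of redelivery with a dead letter store.
class RedeliverySketch {
	final List<String> deadLetters = new ArrayList<>();
	private final int maximumRedeliveries;
	private final long delayMillis;

	RedeliverySketch(int maximumRedeliveries, long delayMillis) {
		this.maximumRedeliveries = maximumRedeliveries;
		this.delayMillis = delayMillis;
	}

	// Returns true if the target accepted the message, false if it was dead-lettered.
	boolean deliver(String message, Consumer<String> target) {
		// one initial attempt plus up to maximumRedeliveries retries
		for (int attempt = 0; attempt <= maximumRedeliveries; attempt++) {
			if (attempt > 0) {
				pause();
			}
			try {
				target.accept(message);
				return true;
			} catch (RuntimeException e) {
				// delivery failed; fall through and try again if attempts remain
			}
		}
		deadLetters.add(message);
		return false;
	}

	private void pause() {
		try {
			Thread.sleep(delayMillis);
		} catch (InterruptedException e) {
			Thread.currentThread().interrupt();
		}
	}

	public static void main(String[] args) {
		RedeliverySketch channel = new RedeliverySketch(3, 10);
		boolean delivered = channel.deliver("order-1", m -> {
			throw new IllegalStateException("endpoint down");
		});
		System.out.println(delivered + ", dead letters: " + channel.deadLetters);
	}
}
```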

Java DSL

Global scope


errorHandler(deadLetterChannel("jms:queue:error")
	.maximumRedeliveries(3));
	
from(...)

Route scope


from("jms:queue:event")
	.errorHandler(deadLetterChannel("jms:queue:error")
		.maximumRedeliveries(5))
	.multicast().to("log:event", "seda:handleEvent");

In this route we override the global scope to use up to five redeliveries, whereas the global scope only allows three. You can of course also set a different error queue destination:


deadLetterChannel("log:badEvent").maximumRedeliveries(5)

Spring DSL

The error handler is configured very differently in the Java DSL vs. the Spring DSL. The Spring DSL relies more on standard Spring bean configuration whereas the Java DSL uses fluent builders.

Global scope

The Global scope error handler is configured using the errorHandlerRef attribute on the camelContext tag.


<camelContext errorHandlerRef="myDeadLetterChannel">
...
</camelContext>

Route scope

The route scope error handler is configured using the errorHandlerRef attribute on the route tag.


<route errorHandlerRef="myDeadLetterChannel">
...
</route>

For both scopes, the error handler itself is configured using a regular Spring bean:


<bean id="myDeadLetterChannel"
		class="org.apache.camel.builder.DeadLetterChannelBuilder">
	<property name="deadLetterUri" value="jms:queue:error"/>
	<property name="redeliveryPolicy"
		ref="myRedeliveryPolicy"/>
</bean>

<bean id="myRedeliveryPolicy"
		class="org.apache.camel.processor.RedeliveryPolicy">
	<property name="maximumRedeliveries" value="5"/>
	<property name="delay" value="5000"/>
</bean>

Wire Tap

Diagram How do you inspect messages that travel on a point-to-point channel?
Wire Tap
Problem How do you tap messages while they are routed?
Solution Insert a Wire Tap into the channel, that publishes each incoming message to the main channel as well as to a secondary channel.
Camel Camel has support for Wire Tap using the wireTap node, which supports two modes: traditional and new message. The traditional mode sends a copy of the original message, whereas the new message mode sends a newly created message. All tapped messages are sent as Event Messages and are processed in parallel with the original message.
Java DSL

Traditional

The route uses the traditional mode to send a copy of the original message to the seda tapped queue, while the original message is routed to its destination, the process order bean.


from("jms:queue:order")
	.wireTap("seda:tappedOrder")
	.to("bean:processOrder");

New message

In this route we tap the high priority orders and send a new message containing a body with the from part of the order. Tip: As Camel uses an Expression for evaluation you can use other functions than xpath, for instance to send a fixed String you can use constant.


from("jms:queue:order")
	.choice()
		.when(xpath("/order/priority = 'high'"))
			.wireTap("seda:from", xpath("/order/from"))
			.to("bean:processHighOrder")
		.otherwise()
			.to("bean:processOrder");

Spring DSL

Traditional


<route>
	<from uri="jms:queue:order"/>
	<wireTap uri="seda:tappedOrder"/>
	<to uri="bean:processOrder"/>
</route>

New Message


<route>
	<from uri="jms:queue:order"/>
	<choice>
		<when>
			<xpath>/order/priority = 'high'</xpath>
			<wireTap uri="seda:from">
				<body><xpath>/order/from</xpath></body>
			</wireTap>
			<to uri="bean:processHighOrder"/>
		</when>
		<otherwise>
			<to uri="bean:processOrder"/>
		</otherwise>
	</choice>
</route>

Conclusion

The twelve patterns in this Refcard cover the most used patterns in the integration space, together with two of the most complex: the Aggregator and the Dead Letter Channel. In the second part of this series we will take a further look at common patterns and transactions.

Get More Information

Camel Website http://camel.apache.org The home of the Apache Camel project. Find downloads, tutorials, examples, getting started guides, issue tracker, roadmap, mailing lists, irc chat rooms, and how to get help.
FuseSource Website http://fusesource.com The home of the FuseSource company, the professional company behind Apache Camel with enterprise offerings, support, consulting and training.
About Author http://davsclaus.blogspot.com The personal blog of the author of this reference card.

About The Author

Photo of author Claus Ibsen

Claus Ibsen

Claus Ibsen is a passionate open-source enthusiast who specializes in the integration space. As an engineer in the Progress FUSE open source team he works full time on Apache Camel, FUSE Mediation Router (based on Apache Camel) and related projects. Claus is very active in the Apache Camel and FUSE communities, writing blogs, twittering, assisting on the forums and IRC channels, and driving the Apache Camel roadmap.

About Progress Fuse

FUSE products are standards-based, open source enterprise integration tools based on Apache SOA projects, and are productized and supported by the people who wrote the code.

Recommended Book

Utilizing years of practical experience, seasoned experts Gregor Hohpe and Bobby Woolf show how asynchronous messaging has proven to be the best strategy for enterprise integration success. However, building and deploying messaging solutions presents a number of problems for developers. Enterprise Integration Patterns provides an invaluable catalog of sixty-five patterns, with real-world solutions that demonstrate the formidable power of messaging and help you to design effective messaging solutions for your enterprise.


Apache Hadoop Deployment

A Blueprint for Reliable Distributed Computing

By Eugene Ciurana


The Essential Hadoop Deployment Cheat Sheet

Apache Hadoop Deployment is covered in this refcard. It's a basic blueprint for deploying Apache Hadoop HDFS and MapReduce using the Cloudera Distribution. It will take you from installation to deployment. It provides developers and data experts with the instructions they need for deploying Big Data applications. The process is made simpler by the Cloudera Distribution for Apache Hadoop: an open-source, enterprise-class distribution for production ready environments. To learn about basic tools and terminology of Hadoop, check out our Getting Started with Apache Hadoop Refcard.

INTRODUCTION

This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. Check out Refcard #117, Getting Started with Apache Hadoop, for basic terminology and for an overview of the tools available in the Hadoop Project.

WHICH HADOOP DISTRIBUTION?

Apache Hadoop is a framework for implementing reliable and scalable computational networks. This Refcard presents how to deploy and use development and production computational networks. HDFS, MapReduce, and Pig are the foundational tools for developing Hadoop applications.

There are two basic Hadoop distributions:

  • Apache Hadoop is the main open-source, bleeding-edge distribution from the Apache foundation.
  • The Cloudera Distribution for Apache Hadoop (CDH) is an open-source, enterprise-class distribution for production-ready environments.

The choice between the two distributions depends on the organization’s objective.

  • The Apache distribution is fine for experimental learning exercises and for becoming familiar with how Hadoop is put together.
  • CDH removes the guesswork and offers an almost turnkey product for robustness and stability; it also offers some tools not available in the Apache distribution.

Hot Tip

Cloudera offers professional services and puts out an enterprise distribution of Apache Hadoop. Their toolset complements Apache’s. Documentation about Cloudera’s CDH is available from http://docs.cloudera.com.

The Apache Hadoop distribution assumes that the person installing it is comfortable with configuring a system manually. CDH, on the other hand, is designed as a drop-in component for all major Linux distributions.

Hot Tip

Linux is the supported platform for production systems. Windows is adequate but is not supported as a development platform.

Minimum Prerequisites

  • Java 1.6 from Oracle, version 1.6 update 8 or later; identify your current JAVA_HOME
  • sshd and ssh for managing Hadoop daemons across multiple systems
  • rsync for file and directory synchronization across the nodes in the cluster
  • Create a service account for user hadoop where $HOME=/home/hadoop

SSH Access

Every system in a Hadoop deployment must provide SSH access for data exchange between nodes. Log in to the node as the Hadoop user and run the commands in Listing 1 to validate or create the required SSH configuration.

Listing 1 - Hadoop SSH Prerequisites

keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh localhost -C true ; then \
  if [ ! -e "$keyFile" ]; then \
     ssh-keygen -t rsa -b 2048 -P '' \
        -f "$pKeyFile"; \
  fi; \
  cat "$keyFile" >> "$authKeys"; \
  chmod 0640 "$authKeys"; \
  echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi

The key passphrase for this example is left blank. If this were to run on a public network it could be a security hole. Distribute the public key from the master node to all other nodes for data exchange. All nodes are assumed to run in a secure network behind the firewall.

Hot Tip

All the bash shell commands in this Refcard are available for cutting and pasting from: http://ciurana.eu/DeployingHadoopDZone

Enterprise: CDH Prerequisites

Cloudera simplified the installation process by offering packages for Ubuntu Server and Red Hat Linux distributions.

Hot Tip

CDH packages have names like CDH2, CDH3, and so on, corresponding to the CDH version. The examples here use CDH3. Use the appropriate version for your installation.

CDH on Ubuntu Pre-Install Setup

Execute these commands as root or via sudo to add the Cloudera repositories:

Listing 2 - Ubuntu Pre-Install Setup

DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
	$DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
	$DISTRO-cdh3 contrib" >> "$REPO"
apt-get update

CDH on Red Hat Pre-Install Setup

Run these commands as root or through sudo to add the yum Cloudera repository:

Listing 3 - Red Hat Pre-Install Setup

curl -sL http://is.gd/3ynKY7 | tee \
	/etc/yum.repos.d/cloudera-cdh3.repo | \
	awk '/^name/'
yum update yum

Ensure that all the prerequisite software and configuration are installed on every machine intended to be a Hadoop node. Don’t mix and match operating systems, distributions, Hadoop, or Java versions!

Hadoop for Development

  • Hadoop runs as a single Java process, in non-distributed mode, by default. This configuration is optimal for development and debugging.
  • Hadoop also offers a pseudo-distributed mode, in which every Hadoop daemon runs in a separate Java process. This configuration is optimal for development and will be used for the examples in this guide.

Hot Tip

If you have an OS X or a Windows development workstation, consider using a Linux distribution hosted on VirtualBox for running Hadoop. It will help prevent support or compatibility headaches.

Hadoop for Production

  • Production environments are deployed across a group of machines that make the computational network. Hadoop must be configured to run in fully distributed, clustered mode.

APACHE HADOOP INSTALLATION

This Refcard is a reference for development and production deployment of the components shown in Figure 1. It includes the components available in the basic Hadoop distribution and the enhancements that Cloudera released.

Figure 1 - Hadoop Components

Hot Tip

Whether the user intends to run Hadoop in non-distributed or distributed modes, it’s best to install every required component in every machine in the computational network. Any computer may assume any role thereafter.

A non-trivial, basic Hadoop installation includes at least these components:

  • Hadoop Common: the basic infrastructure necessary for running all components and applications
  • HDFS: the Hadoop Distributed File System
  • MapReduce: the framework for large data set distributed processing
  • Pig: an optional, high-level language for parallel computation and data flow

Enterprise users often choose CDH because of:

  • Flume: a distributed service for efficient large data transfers in real-time
  • Sqoop: a tool for importing relational databases into Hadoop clusters

Apache Hadoop Development Deployment

The steps in this section must be repeated for every node in a Hadoop cluster. Downloads, installation, and configuration could be automated with shell scripts. All these steps are performed as the service user hadoop, defined in the prerequisites section.
http://hadoop.apache.org/common/releases.html has the latest version of the common tools. This guide used version 0.20.2.

  1. Download Hadoop from a mirror and unpack it in the /home/hadoop work directory.
  2. Set the JAVA_HOME environment variable.
  3. Set the run-time environment:

Listing 4 - Set the Hadoop Runtime Environment

version=0.20.2 # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
# link the unpacked release and its logs into the work directory
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
# keep a copy of the original environment file
cp "$runtimeEnv" "$runtimeEnv".org
echo "export HADOOP_SLAVES=$HADOOP_HOME/slaves" >> "$runtimeEnv"
mkdir "$HADOOP_HOME"/slaves
echo "export HADOOP_IDENT_STRING=$identity" >> "$runtimeEnv"
echo "export JAVA_HOME=$JAVA_HOME" >> "$runtimeEnv"
export PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv

Configuration

Pseudo-distributed operation (each daemon runs in a separate Java process) requires updates to core-site.xml, hdfs-site.xml, and mapred-site.xml. These files configure the master, the file system, and the MapReduce framework, and they live in the runtime/conf directory.

Listing 5 - Pseudo-Distributed Operation Config

<!-- core-site.xml -->
<configuration>
 <property>
	<name>fs.default.name</name>
	<value>hdfs://localhost:9000</value>
 </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
 <property>
	<name>dfs.replication</name>
	<value>1</value>
 </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
 <property>
	<name>mapred.job.tracker</name>
	<value>localhost:9001</value>
 </property>
</configuration>

These files are documented in the Apache Hadoop Clustering reference, http://is.gd/E32L4s — some parameters are discussed in this Refcard’s production deployment section.
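After editing these files, it can help to read a property back out of them as a sanity check. The following is a minimal sketch, not part of Hadoop: hadoop_prop is a hypothetical helper, and it only handles the layout used in Listing 5, where the name and value elements sit on separate, consecutive lines.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
# Assumption: <name> and <value> are on separate lines, as in Listing 5.
# hadoop_prop is a hypothetical helper name, not a Hadoop command.
hadoop_prop() {
  file=$1; name=$2
  awk -v n="$name" '
    index($0, "<name>" n "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
      print; exit
    }
  ' "$file"
}

# Example call, run from the work directory:
# hadoop_prop runtime/conf/core-site.xml fs.default.name
```

A full XML parser would be more robust. The awk version is used here only because it needs nothing beyond a standard shell environment.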

Test the Hadoop Installation

Hadoop requires a formatted HDFS cluster to do its work:


hadoop namenode -format

The HDFS volume lives on top of the standard file system. The format command will show this upon successful completion:


/tmp/dfs/name has been successfully formatted.

Start the Hadoop processes and perform these operations to validate the installation:

  • Use the contents of runtime/conf as known input
  • Use Hadoop for finding all text matches in the input
  • Check the output directory to ensure it works

Listing 6 - Testing the Hadoop Installation

start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'

Hot Tip

You may ignore any warnings or errors about a missing slaves file.

  • View the output files in the HDFS volume and stop the Hadoop daemons to complete testing the install

Listing 7 - Job Completion and Daemon Termination

hadoop fs -cat output/*
stop-all.sh

That’s it! Apache Hadoop is installed on your system and ready for development.

CDH Development Deployment

CDH removes a lot of grueling work from the Hadoop installation process by offering ready-to-go packages for mainstream Linux server distributions. Compare the instructions in Listing 8 against the previous section. CDH simplifies installation and configuration for huge time savings.

Listing 8 - Installing CDH

ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
then command="/usr/bin/yum"; fi
"$command" install \
hadoop-"$ver"-conf-pseudo
unset command ; unset ver

The extra components packaged with CDH are another good reason for using it over the plain Apache version. Install Pig, Flume, or Sqoop with the instructions in Listing 9.

Listing 9 - Adding Optional Components

apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop

Test the CDH Installation

The CDH daemons are ready to be executed as services. There is no need to create a service account for executing them. They can be started or stopped as any other Linux service, as shown in Listing 10.

Listing 10 - Starting the CDH Daemons

for s in /etc/init.d/hadoop* ; do \
"$s" start; done

CDH will create an HDFS partition when its daemons start. It’s another convenience it offers over regular Hadoop. Listing 11 shows how to validate the installation by:

  • Listing the HDFS volume
  • Moving files to the HDFS volume
  • Running an example job
  • Validating the output

Listing 11 - Testing the CDH Installation

hadoop fs -ls /
# run a job:
pushd /usr/lib/hadoop
hadoop fs -put /etc/hadoop/conf input
hadoop fs -ls input
hadoop jar hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
# Validate it ran OK:
hadoop fs -cat output/*

The daemons will continue to run until the server stops. All the Hadoop services are available.

Monitoring the Local Installation

Use a browser to check the NameNode or the JobTracker state through their web UI and web services interfaces. All daemons expose their data over HTTP. Users can choose to monitor a node or daemon interactively using the web UI, as in Figure 2. Developers, monitoring tools, and system administrators can use the same ports for tracking the system performance and state using web service calls.

Figure 2 - NameNode status web UI

The web interface can be used for monitoring the JobTracker, which dispatches tasks to specific nodes in a cluster, the DataNodes, or the NameNode, which manages directory namespaces and file nodes in the file system.

HADOOP MONITORING PORTS

Use the information in Table 1 for configuring a development workstation or production server firewall.

Port Service
50030 JobTracker
50060 TaskTrackers
50070 NameNode
50075 DataNodes
50090 Secondary NameNode
50105 Backup Node
Table 1 - Hadoop ports
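
Table 1 can be turned into firewall rules mechanically. The sketch below assumes an iptables-based firewall, which this Refcard does not specify; hadoop_ports and hadoop_firewall_rules are hypothetical helper names. It only prints the rules so they can be reviewed before anything is applied.

```shell
# Table 1 as data: one "port service" pair per line.
hadoop_ports() {
  cat <<'EOF'
50030 JobTracker
50060 TaskTrackers
50070 NameNode
50075 DataNodes
50090 Secondary NameNode
50105 Backup Node
EOF
}

# Print (do not apply) one iptables ACCEPT rule per monitoring port.
# Assumption: iptables is the firewall in use on the node.
hadoop_firewall_rules() {
  hadoop_ports | while read -r port service; do
    echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT # $service"
  done
}

hadoop_firewall_rules
```

In practice each node only needs the ports of the daemons it actually runs, so the list would normally be filtered by role before use.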

Plugging a Monitoring Agent

The Hadoop daemons also expose internal data over a RESTful interface. Automated monitoring tools like Nagios, Splunk, or SOBA can use them. Listing 12 shows how to fetch a daemon’s metrics as a JSON document:

Listing 12 - Fetching Daemon Metrics
http://localhost:50070/metrics?format=json

All the daemons expose these useful resource paths:

  • /metrics - various data about the system state
  • /stacks - stack traces for all threads
  • /logs - enables fetching logs from the file system
  • /logLevel - interface for setting log4j logging levels

Each daemon type also exposes one or more resource paths specific to its operation. A comprehensive list is available from: http://is.gd/MBN4qz
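
The resource paths combine with the ports from Table 1 in a predictable way. As a rough sketch, the hypothetical helper daemon_url below builds the URL for a given daemon and path; the host name localhost is an assumption that fits the pseudo-distributed setup. The curl call is left as a comment because it needs a running daemon.

```shell
# Build the URL of a daemon resource path.
# daemon_url is a hypothetical helper; host and port are supplied by the caller.
daemon_url() {
  host=$1; port=$2; path=$3
  echo "http://$host:$port/$path"
}

# List the common resource URLs for the NameNode (port 50070 in Table 1).
for path in metrics stacks logs logLevel; do
  daemon_url localhost 50070 "$path"
done

# Fetch the NameNode metrics as JSON (requires curl and a running daemon):
# curl -s "$(daemon_url localhost 50070 'metrics?format=json')"
```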

APACHE HADOOP PRODUCTION DEPLOYMENT

The fastest way to deploy a Hadoop cluster is by using the prepackaged tools in CDH. They include all the same software as the Apache Hadoop distribution but are optimized to run in production servers and with tools familiar to system administrators.

Hot Tip

Detailed guides that complement this Refcard are available from Cloudera at http://is.gd/RBWuxm and from Apache at http://is.gd/ckUpu1.
Figure 3 - Hadoop Computational Network

The deployment diagram in Figure 3 describes all the participating nodes in a computational network. The basic procedure for deploying a Hadoop cluster is:

  • Pick a Hadoop distribution
  • Prepare a basic configuration on one node
  • Deploy the same pre-configured package across all machines in the cluster
  • Configure each machine in the network according to its role

The Apache Hadoop documentation shows this as a rather involved process. The value added by CDH is that most of that work is already in place. Role-based configuration is very easy to accomplish. The rest of this Refcard will be based on CDH.

Handling Multiple Configurations: Alternatives

Each server role will be determined by its configuration, since they will all have the same software installed. CDH supports the Ubuntu and Red Hat mechanism for handling alternative configurations.

Hot Tip

Check the man pages to learn more about alternatives. Ubuntu: man update-alternatives. Red Hat: man alternatives.

The Linux alternatives mechanism ensures that all files associated with a specific package are selected as a system default. This customization is where all the extra work went into CDH. The CDH installation uses alternatives to set the effective CDH configuration.
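
To see which configuration is currently in effect, the alternatives tool can display a link group. The sketch below is an illustration under assumptions: pick_alt is a hypothetical helper, and the group name hadoop-0.20-conf is taken from Listing 13 and may differ on other CDH releases.

```shell
# Print the first candidate path that exists and is executable.
# pick_alt is a hypothetical helper, not part of CDH.
pick_alt() {
  for candidate in "$@"; do
    if [ -x "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Show the active Hadoop configuration (Ubuntu tool first, then Red Hat):
# alt=$(pick_alt /usr/sbin/update-alternatives /usr/sbin/alternatives) &&
#   "$alt" --display hadoop-0.20-conf
```

Probing for the tool keeps the same script usable on both distribution families, which is the approach Listing 13 takes as well.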

Setting Up the Production Configuration

Listing 13 takes a basic Hadoop configuration and sets it up for production.

Listing 13 - Set the Production Configuration

ver="0.20"
prodConf="/etc/hadoop-$ver/conf.prod"
cp -Rfv /etc/hadoop-"$ver"/conf.empty \
"$prodConf"
chown hadoop:hadoop "$prodConf"
# activate the new configuration:
alt="/usr/sbin/update-alternatives"
if [ ! -e "$alt" ]; then alt="/usr/sbin/alternatives"; fi
"$alt" --install /etc/hadoop-"$ver"/conf \
hadoop-"$ver"-conf "$prodConf" 50
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" restart; done

The server will restart all the Hadoop daemons using the new production configuration.

Figure 4 - Hadoop Conceptual Topology

Readying the NameNode for Hadoop

Pick a node from the cluster to act as the NameNode (see Figure 3). All Hadoop activity depends on having a valid R/W file system. Format the distributed file system from the NameNode, using user hdfs:

Listing 14 - Create a New File System
sudo -u hdfs hadoop namenode -format

Stop all the nodes to complete the file system, permissions, and ownership configuration. Optionally, set daemons for automatic startup using rc.d.

Listing 15 - Stop All Daemons

# Run this on every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do
  "$h" stop
  # Optional command for auto-start (takes the service name, not the path):
  update-rc.d "$(basename "$h")" defaults
done

File System Setup

Every node in the cluster must be configured with appropriate directory ownership and permissions. Execute the commands in Listing 16 in every node:

Listing 16 - File System Setup

mkdir -p /data/1/dfs/nn /data/2/dfs/nn
mkdir -p /data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
mkdir -p /data/1/mapred/local \
/data/2/mapred/local
chown -R hdfs:hadoop /data/1/dfs/nn \
/data/2/dfs/nn /data/1/dfs/dn \
/data/2/dfs/dn /data/3/dfs/dn \
/data/4/dfs/dn
chown -R mapred:hadoop \
/data/1/mapred/local \
/data/2/mapred/local
chmod -R 755 /data/1/dfs/nn \
/data/2/dfs/nn \
/data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
chmod -R 755 /data/1/mapred/local \
/data/2/mapred/local
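
Because these commands run on every node, a quick check that the permissions actually took effect can save debugging time later. This is a minimal sketch: check_mode is a hypothetical helper, and it relies on the GNU coreutils form of stat (stat -c), which is an assumption about the target Linux distribution.

```shell
# Report every directory whose permission bits differ from the expected mode.
# Returns non-zero if at least one directory is missing or has another mode.
# check_mode is a hypothetical helper; stat -c is the GNU coreutils syntax.
check_mode() {
  expected=$1; shift
  rc=0
  for dir in "$@"; do
    actual=$(stat -c '%a' "$dir" 2>/dev/null) || actual=missing
    if [ "$actual" != "$expected" ]; then
      echo "$dir: mode $actual, expected $expected"
      rc=1
    fi
  done
  return $rc
}

# Example call for some of the directories created above:
# check_mode 755 /data/1/dfs/nn /data/2/dfs/nn /data/1/mapred/local
```

Ownership can be checked the same way by swapping the format string for '%U:%G' and comparing against hdfs:hadoop or mapred:hadoop.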

Starting the Cluster

  • Start the NameNode to make HDFS available to all nodes
  • Set the MapReduce owner and permissions in the HDFS volume
  • Start the JobTracker
  • Start all other nodes

CDH daemons are defined in /etc/init.d — they can be configured to start along with the operating system or they can be started manually. Execute the command appropriate for each node type using this example:

Listing 17 - Starting a Node Example

# Run this on the node that takes the role
ver=0.20
role=namenode # or jobtracker, datanode, tasktracker, ...
/etc/init.d/hadoop-"$ver"-"$role" start

Use jobtracker, datanode, tasktracker, etc. corresponding to the node you want to start or stop.

Hot Tip

Refer to the Linux distribution’s documentation for information on how to start the /etc/init.d daemons with the chkconfig tool.

Listing 18 - Set the MapReduce Directory Up

sudo -u hdfs hadoop fs -mkdir \
/mapred/system
sudo -u hdfs hadoop fs -chown mapred \
/mapred/system

Update the Hadoop Configuration Files

Listing 19 - Minimal HDFS Config Update

<!-- hdfs-site.xml -->
<property>
	<name>dfs.name.dir</name>
	<value>/data/1/dfs/nn,/data/2/dfs/nn</value>
	<final>true</final>
</property>
<property>
	<name>dfs.data.dir</name>
	<value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
	<final>true</final>
</property>

The last step consists of configuring the MapReduce nodes to find their local working and system directories:

Listing 20 - Minimal MapReduce Config Update

<!-- mapred-site.xml -->
<property>
	<name>mapred.local.dir</name>
	<value>/data/1/mapred/local,/data/2/mapred/local</value>
	<final>true</final>
</property>
<property>
	<name>mapred.system.dir</name>
	<value>/mapred/system</value>
	<final>true</final>
</property>
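
The directory lists in Listings 19 and 20 must match what Listing 16 created on the local disks, and a mismatch is easy to miss. The sketch below checks a comma-separated list against the local file system. Both split_dirs and missing_dirs are hypothetical helpers. The check applies to local paths such as dfs.name.dir, dfs.data.dir, and mapred.local.dir, not to /mapred/system, which lives inside HDFS.

```shell
# Split a comma-separated directory list into one path per line,
# trimming surrounding whitespace and dropping empty entries.
split_dirs() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | grep -v '^$'
}

# Print every listed directory that does not exist on this node.
missing_dirs() {
  split_dirs "$1" | while IFS= read -r dir; do
    [ -d "$dir" ] || echo "$dir"
  done
}

# Example call with the dfs.data.dir value:
# missing_dirs "/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn"
```

No output from missing_dirs means every configured directory is present on the node where it ran.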

Start the JobTracker and all other nodes. You now have a working Hadoop cluster. Use the commands in Listing 11 to validate that it’s operational.

WHAT’S NEXT?

The instructions in this Refcard result in a working development or production Hadoop cluster. Hadoop is a complex framework and requires attention to configure and maintain it. Review the Apache Hadoop and Cloudera CDH documentation. Pay particular attention to the sections on:

  • How to write MapReduce, Pig, or Hive applications
  • Multi-node cluster management with ZooKeeper
  • Hadoop ETL with Sqoop and Flume

Happy Hadoop computing!

STAYING CURRENT

Do you want to know about specific projects and use cases where Hadoop and data scalability are the hot topics? Join the scalability newsletter: http://ciurana.eu/scalablesystems

About The Authors

Eugene Ciurana

Eugene Ciurana (http://eugeneciurana.eu) is the VP of Technology at Badoo.com, the largest dating site worldwide, and cofounder of SOBA Labs, the most sophisticated public and private clouds management software. Eugene is also an open-source evangelist who specializes in the design and implementation of mission-critical, high-availability systems. He recently built scalable computational networks for leading financial, software, insurance, SaaS, government, and healthcare companies in the US, Japan, Mexico, and Europe.

Publications
  • Developing with Google App Engine, Apress
  • DZone Refcard #117: Getting Started with Apache Hadoop
  • DZone Refcard #105: NoSQL and Data Scalability
  • DZone Refcard #43: Scalability and High Availability
  • The Tesla Testament: A Thriller, CIMEntertainment

Thank You!

Thanks to all the technical reviewers, especially to Pavel Dovbush at http://dpp.su

Recommended Book

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open-source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems; programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.



ServiceMix 4.2

The Apache Open Source ESB

By Jos Dirksen

8,730 Downloads · Refcard 65 of 151 (see them all)



The Essential ServiceMix Cheat Sheet

ServiceMix 4.2 is an enterprise-class open source ESB from Apache. In the open-source community, there are many different solutions for each problem. Although there are many open-source ESB projects, not all of them are mature enough to be used to solve enterprise mission-critical integration problems. This DZone Refcard will introduce you to what's new in version 4.2 and walk you through the OSGi-based architecture, Web services, configuration, and deployment options of ServiceMix 4.2 components, with some key tips mixed in along the way.
Getting Started with ServiceMix 4.0


By Jos Dirksen

About ServiceMix 4.0

In the open source community there are many different solutions for each problem. When you look for an open source ESB, however, you don't have that many options. Even though there are many open source ESB projects, not all of them are mature enough to be used to solve enterprise mission critical integration problems. ServiceMix is one of the open source projects that is mature enough to be used in these scenarios. ServiceMix, an Apache project, has been around for a couple of years now. It provides all the features you expect from an ESB such as routing, transformation, etc. The previous version was built based on JBI (JSR-208), but in its latest iteration, which we're discussing in this Refcard, ServiceMix has moved to an OSGi based architecture, which we'll discuss later on.

This DZone Refcard will provide an overview of the core elements of ServiceMix 4.0 and will show you how to use ServiceMix 4 by providing example configurations.

ServiceMix 4.0 Architecture

Before we show how to configure ServiceMix 4.0 for use, let us first look at the architecture of ServiceMix 4.0. This figure shows the following components:

Architecture of ServiceMix

ServiceMix Kernel: In this figure you can see that the basis of ServiceMix 4 is the ServiceMix Kernel. This kernel, which is based on the Apache Felix Karaf project (an OSGi based runtime), handles the core features ServiceMix provides, such as hot-deployment, provisioning of libraries or applications, remote access using ssh, JMX management and more.

ServiceMix NMR: This component, a normalized message router, handles all the routing of messages within ServiceMix and is used by all the other components.

ActiveMQ: ActiveMQ, another Apache project, is the message broker which is used to exchange messages between components. Besides this, ActiveMQ can also be used to create a fully distributed ESB.

Web: ServiceMix 4 also provides a web component. You can use this to start ServiceMix 4 embedded in a web application. An example of this is provided in the ServiceMix distribution.

JBI compatibility layer: The previous version of ServiceMix was based on JBI 1.0. For JBI a lot of components (from ServiceMix, but also from other parties), are available. This layer provides compatibility with the JBI specification, so that all the components from the previous version of ServiceMix can run on ServiceMix 4. Be sure though to use the 2009.01 version of these components.

Camel NMR: ServiceMix 4 provides a couple of different ways you can configure routing. You can use the endpoints provided by the ServiceMix NMR, but you can also use more advanced routing engines. One of those is the Camel NMR. This component allows you to run Camel based routes on ServiceMix.

CXF NMR: Besides an NMR based on Camel, ServiceMix also provides an NMR based on CXF. You can use this NMR to expose and route to Java POJOs annotated with JAX-WS annotations.

Hot Tip

OSGi runtime
ServiceMix runs on an OSGi-based kernel, but what is OSGi? In short, an OSGi container provides a service-based in-VM platform on which you can deploy services and components dynamically. OSGi provides strict classloading separation and forces you to think about the dependencies your components have. Besides that, OSGi also defines a simple lifecycle model for your services and components. This results in an environment where you can easily add and remove components and services at runtime, and it allows the creation of modular applications. An added advantage of using an OSGi container is that you can use many components out of the box: remote administration, a web container, configuration and preferences services, etc.

Before we move on to the next part, let's have a quick look at how a message is processed by ServiceMix. The following figure shows how a message is routed by the NMR. In this case we're showing a reply / response (in-out) message pattern.

Reply/Response Pattern

In this figure you can see a number of steps being executed:

  1. The consumer creates a message exchange for a specific service and sends a request.
  2. The NMR determines the provider this exchange needs to be sent to and queues the message for delivery. The provider accepts this message and executes its business logic.
  3. After the provider has finished processing, the response message is returned to the NMR.
  4. The NMR once again queues the message for delivery, this time to the consumer. The consumer accepts the message.
  5. After the response is accepted, the consumer sends a confirmation to the NMR.
  6. The NMR routes this confirmation to the provider, who accepts it and ends this message exchange.

Now that we've seen the architecture and how a message is handled by the NMR, we'll have a look at how to configure ServiceMix 4.

Configuration of ServiceMix 4.0

ServiceMix 4 configuration is mostly done through Spring XML files supported by XML schemas for easy code completion. Let's look at two simple examples. The first one uses the File Binding component to poll a directory and the second one exposes a Web service using ServiceMix's CXF support.


<beans xmlns:file="http://servicemix.apache.org/file/1.0"
		xmlns:dzone="http://servicemix.org/dzone/">
	<file:poller service="dzone:filePoller"
		endpoint="filePoller"
		targetService="dzone:fileSender"
		file="inbox" />
</beans>

In this listing you can see that we define a poller. A poller is one of the standard components that is provided by ServiceMix's file-binding-component. If we deploy this configuration to ServiceMix, ServiceMix will start polling the inbox directory for files. If it finds one, the file will be sent to the specified targetService.

Hot Tip

Service Addressing
An important concept to understand when working with ServiceMix is that of services and endpoints. When you configure services on a component, you need to tell ServiceMix how to route messages to and from that service. The name used for this is called a service endpoint. In the previous example we created a file:poller and defined a service and an endpoint attribute on it. Together, these two attributes uniquely identify this file:poller. Note, though, that you can have multiple endpoints defined on the same service. The file:poller also has a targetService attribute, and there is a matching targetEndpoint attribute. With these two attributes you identify the service endpoint to send the message to. The targetEndpoint can be omitted if only one endpoint is registered on that service.

In the following listing, we've again used a simple XML file. This time we've configured a webservice.


<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:jaxws="http://cxf.apache.org/jaxws"
	xsi:schemaLocation="
	http://www.springframework.org/schema/beans
	http://www.springframework.org/schema/beans/spring-beans.xsd
	http://cxf.apache.org/jaxws http://cxf.apache.org/schemas/jaxws.xsd">
	<import resource="classpath:META-INF/cxf/cxf.xml" />
	<import resource="classpath:META-INF/cxf/cxf-extension-soap.xml" />
	<import resource="classpath:META-INF/cxf/cxf-extension-http.xml" />
	<import resource="classpath:META-INF/cxf/osgi/cxf-extension-osgi.xml" />
	<jaxws:endpoint id="helloWorld"
		implementor="dzone.refcards.HelloWorld"
		address="/HelloWorld"/>
</beans>

In this listing we use a jaxws:endpoint to define a webservice. The implementor points to a simple POJO annotated with JAX-WS annotations. If this example is deployed to ServiceMix, ServiceMix will register a webservice based on the value in the address attribute.

Deployment of ServiceMix 4 Components

ServiceMix provides a number of different options which you can use to deploy artifacts. In this section we'll look at these options, and show you how to use these.

ServiceMix 4, deployment options

Name Description
OSGi Bundles ServiceMix 4 is built around OSGi and ServiceMix 4 also allows you to deploy your configurations as an OSGi bundle with all the advantages OSGi provides.
Spring XML files ServiceMix 4 supports plain Spring XML files.
JBI artifacts You can also deploy artifacts following the JBI standard (service assemblies and service units) to ServiceMix 4.
Feature descriptors This is a Karaf specific way for installing applications. It will install the necessary OSGi bundles and will add configuration defaults. This is mostly used to install core parts of the ServiceMix distribution.

OSGi bundle deployment
The easiest way to create an OSGi-based ServiceMix bundle is by using Maven 2. To create a bundle you need to take a couple of simple steps. The first one is adding the maven-bundle-plugin to your pom.xml file. This is shown in the following code fragment.


...
<dependencies>
	<dependency>
		<groupId>org.apache.felix</groupId>
		<artifactId>org.osgi.core</artifactId>
		<version>1.0.0</version>
	</dependency>
	...
</dependencies>
...
<build>
	<plugins>
		<plugin>
			<groupId>org.apache.felix</groupId>
			<artifactId>maven-bundle-plugin</artifactId>
			<configuration>
				<instructions>
				<Bundle-SymbolicName>${pom.artifactId}</Bundle-SymbolicName>
				<Import-Package>*,org.apache.camel.osgi</Import-Package>
				<Private-Package>org.apache.servicemix.examples.camel</Private-Package>
				</instructions>
			</configuration>
		</plugin>
	</plugins>
</build>
...

The important part here is the instructions section. This determines how the plugin packages your project. For more information on these settings see the Maven OSGi bundle plugin page at http://cwiki.apache.org/FELIX/apache-felix-maven-bundle-plugin-bnd.html.

The next step is to make sure your project is packaged as an OSGi bundle. You do this by setting the <packaging> element in your pom.xml to bundle.

Now you can use mvn install to create an OSGi bundle, which you can copy to the deploy directory of ServiceMix and your bundle will be installed. If you use Spring to configure your application, make sure the Spring configuration files are located in the META-INF/spring directory. That way the Spring application context will be automatically created based on these files.
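
The copy step can be wrapped in a small helper so that a wrong path fails loudly instead of silently doing nothing. This is a sketch under assumptions: deploy_bundle is a hypothetical helper, SERVICEMIX_HOME is an assumed variable pointing at the ServiceMix installation, and the jar name in the usage comment is only an example.

```shell
# Copy a built bundle into the deploy directory of a ServiceMix installation.
# deploy_bundle is a hypothetical helper, not part of ServiceMix or Maven.
deploy_bundle() {
  jar=$1; smx_home=$2
  [ -f "$jar" ] || { echo "no such bundle: $jar" >&2; return 1; }
  [ -d "$smx_home/deploy" ] || { echo "no deploy directory under $smx_home" >&2; return 1; }
  cp "$jar" "$smx_home/deploy/"
}

# Build, then deploy (jar name and SERVICEMIX_HOME are assumptions):
# mvn install && deploy_bundle target/camel-router-1.0.jar "$SERVICEMIX_HOME"
```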

If you don't want to do this by hand you can also use a Maven archetype. ServiceMix provides a set of archetypes you can use. A good starting point for a project is the Camel OSGi archetype, which you can use by executing the following Maven command:


mvn archetype:create \
  -DarchetypeGroupId=org.apache.servicemix.tooling \
  -DarchetypeArtifactId=servicemix-osgi-camel-archetype \
  -DarchetypeVersion=4.0.0.2-fuse \
  -DgroupId=com.yourcompany -DartifactId=camel-router \
  -DremoteRepositories=http://repo.fusesource.com/maven2/

There are many other archetypes available. For an overview of the available archetypes see: http://repo.fusesource.com/maven2/org/apache/servicemix/tooling/

Spring XML Files Deployment

It's also possible to deploy Spring files without OSGi. Just drop a Spring file into the deploy directory. There are two points to take into account. First, you need to add the following to your Spring configuration file:


<bean class="org.apache.servicemix.common.osgi.EndpointExporter" />

This will register the endpoints you've configured in your Spring file. The next element is optional but is good practice to add:


<manifest>
	Bundle-Version = 1.0.0
	Bundle-Name = Dzone :: Dzone test application
	Bundle-SymbolicName = dzone.refcards.test
	Bundle-Description = An example for servicemix refcard
	Bundle-Vendor = jos.dirksen@gmail.com
	Require-Bundle = servicemix-file, servicemix-eip
</manifest>

Using a manifest configuration element allows you to specify how your application is registered in ServiceMix.

JBI artifacts deployment

If you've already invested in JBI-based applications, you can still use ServiceMix 4 to run them. Just deploy your Service Assembly (SA) in the ServiceMix deploy directory and ServiceMix will deploy your application.

Feature descriptor based deployment

If you've got an application which contains many bundles and that requires additional configuration you can use a feature to easily manage this. A feature contains a set of bundles and configuration which can be easily installed from the ServiceMix console. The following listing shows the feature descriptor of the nmr component.


<features>
	<feature name="nmr" version="1.0.0">
		<bundle>mvn:org.apache.servicemix.document/org.apache.servicemix.document/1.0.0</bundle>
		<bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.api/1.0.0</bundle>
		<bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.core/1.0.0</bundle>
		<bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.osgi/1.0.0</bundle>
		<bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.spring/1.0.0</bundle>
		<bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.commands/1.0.0</bundle>
		<bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.management/1.0.0</bundle>
	</feature>
</features>

If you want to install this feature you can just type features/install nmr from the ServiceMix console.

Routing in ServiceMix 4.0

For routing in ServiceMix you've got two options:

  • EIP: ServiceMix provides a JBI component that implements a number of Enterprise Integration Patterns.
  • Camel: You can use Camel routes in ServiceMix. Camel provides the most flexible and exhaustive routing options for ServiceMix.

EIP Component Routing

This routing is provided by the EIP component. To check whether it is installed in your ServiceMix runtime, execute features/list from the ServiceMix command line. This shows a list of features and their state. If you see [installed] [ 2009.01] servicemix-eip, the component is installed. If it shows uninstalled instead of installed, use features/install servicemix-eip to install it. You can now use this router with a simple XML file:


<eip:static-routing-slip service="test:routingSlip" endpoint="endpoint">
	<eip:targets>
		<eip:exchange-target service="test:echo" />
		<eip:exchange-target service="test:echo" />
	</eip:targets>
</eip:static-routing-slip>

When installed this component provides the following routing options (this information is also available in the XSD of this component):

XML Element Description
async-bridge The async bridge pattern is used to bridge an In-Out exchange with two In-Only (or Robust-In-Only) exchanges. This pattern is the opposite of the pipeline.
content-based-router Component that can be used for content-based routing of the message. You can configure this component with a set of predicates which define how the message is routed.
content-enricher A content enricher can be used to add extra information to the message from a different source.
message-filter With a message filter you specify a set of predicates which determine whether to process the message or not.
pipeline The pipeline component is a bridge between an In-Only (or Robust-In-Only) MEP and an In-Out MEP. This is the opposite of the async bridge.
resequencer A resequencer can be used to re-order a set of incoming messages before passing them on in the new order.
split-aggregator A split aggregator is used to reassemble messages that have been split by a splitter.
static-recipient-list A static recipient list will forward the incoming message to a set of predefined destinations.
static-routing-slip The static routing slip routes a message through a set of services. It uses the result of the first invocation as input for the next.
wire-tap The wire-tap will copy and forward a message to the specified destination.
xpath-splitter This splitter uses an XPath expression to split an incoming message into multiple parts.

Camel Routing

Apache Camel is a project which provides a lot of different routing and integration options. In this section we'll show how to use Camel with ServiceMix and give an overview of the routing options it provides. Installing the Camel component in ServiceMix is done in the same way as for the EIP component. We use the features/list command to check what's already installed, and we can use features/install to add new Camel functionality. Once installed, we can use Camel to route messages between our components. Camel provides two types of configuration: XML and a Java-based DSL. XML configuration was used for the following two listings:

Camel XML configuration - Listing 1: Camel configuration

<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	<import resource="classpath:org/apache/servicemix/camel/nmr/camel-nmr.xml" />
	<camelContext xmlns="http://camel.apache.org/schema/spring">
		<route>
			<from uri="ftp://gertv@localhost/testfile?password=secret"/>
			<to uri="nmr:IncomingOrders"/>
		</route>
	</camelContext>
</beans>
	
Camel XML configuration - Listing 2: Target service

<beans xmlns:file="http://servicemix.apache.org/file/1.0"
	xmlns:dzone="http://servicemix.org/dzone/">
	<import resource="classpath:org/apache/servicemix/camel/nmr/camel-nmr.xml" />
	<file:sender service="nmr:IncomingOrders" directory="file:target/pollerFiles" />
</beans>
	

In these two listings you can see how easily the Camel routes integrate with the other components from ServiceMix. We use the nmr prefix to tell Camel to send the message to the NMR. The other service, which can be deployed separately, will then pick up this message since it's also configured to listen to an nmr-prefixed service.

Now let's look at two listings that use Camel's Java based DSL to configure the routes. For this we need a small XML file describing where the routes can be found, and a Java file which contains the routing.

Camel Java configuration - Listing 1: Spring configuration

<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="
	http://www.springframework.org/schema/beans
	http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
	http://activemq.apache.org/camel/schema/spring
	http://activemq.apache.org/camel/schema/spring/camel-spring.xsd">

	<import resource="classpath:org/apache/servicemix/camel/nmr/camel-nmr.xml" />
	<camelContext xmlns="http://activemq.apache.org/camel/schema/spring">
		<package>dzone.refcards.camel.routes</package>
	</camelContext>
</beans>
	
Camel Java configuration - Listing 2: Java route

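package dzone.refcards.camel.routes;

import org.apache.camel.builder.RouteBuilder;

// Found through the <package> scan in the Spring configuration (Listing 1).
public class SimpleRouter extends RouteBuilder {
	public void configure() throws Exception {
		// Fire a timer event at a fixed rate, set the body, and send it to the NMR.
		from("timer:myTimerEvent?fixedRate=true")
			.setBody(constant("Hello World!"))
			.to("nmr:someService");
	}
}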
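
For comparison, the same timer route can be expressed in Camel's Spring XML. This is only a sketch: it assumes the Camel 2.x Spring namespace (http://camel.apache.org/schema/spring), and the timer name and the nmr:someService target are simply carried over from the Java route.

```xml
<!-- Sketch only: XML equivalent of the SimpleRouter Java route.
     Assumes the Camel 2.x Spring schema; place it inside a Spring <beans> file. -->
<camelContext xmlns="http://camel.apache.org/schema/spring">
	<route>
		<from uri="timer:myTimerEvent?fixedRate=true"/>
		<setBody>
			<constant>Hello World!</constant>
		</setBody>
		<to uri="nmr:someService"/>
	</route>
</camelContext>
```

Both styles describe the same route, so the choice is largely a matter of whether you prefer to keep routing in configuration files or in compiled code.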
	

Camel itself provides a lot of standard functionality. It doesn't just provide routing; it can also provide connectivity for different technologies. For more information on Camel, please see its website at http://camel.apache.org/ or look at the "Enterprise Integration Patterns with Camel" Refcard.

Hot Tip

Differences between ServiceMix and Camel
If you've looked at the Camel website, you'll notice that it provides much the same functionality as ServiceMix. It provides connectivity to various standards and technologies, provides routing and transformation, and even allows you to expose Web services. The main difference, though, is that Camel isn't a container. Camel is designed to be used inside some other container. We've shown that you can use Camel in ServiceMix, but you can also use Camel in other ESBs or in ActiveMQ or CXF. So if you just want a routing and mediation engine, Camel is a good choice. If, however, you need a full ESB with good support for JBI, a flexible OSGi-based kernel, hot deployment and easy administration, ServiceMix is the better choice.

ServiceMix and web services

Support for Web services is an important feature for an ESB. ServiceMix uses the CXF project for this. Since CXF is also completely Spring-based, using CXF to deploy Web services is very easy.

Hosting Web services

When you want to expose a service as a Web service, you can easily do this using CXF. Just create a CXF OSGi bundle using the servicemix-osgi-cxf-code-first-archetype archetype. This creates an OSGi- and CXF-enabled Maven project that you can use to develop Web services. Edit the src/main/resources/META-INF/spring/beans.xml file, run the mvn install command, and then deploy the bundle to ServiceMix. The following listing shows such an example. It creates a Web service and hosts it on http://localhost:8080/cxf/HelloDzone.

CXF Host Web service example using CXF

<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:jaxws="http://cxf.apache.org/jaxws"
	xsi:schemaLocation="
	http://www.springframework.org/schema/beans
	http://www.springframework.org/schema/beans/spring-beans.xsd
	http://cxf.apache.org/jaxws http://cxf.apache.org/schemas/jaxws.xsd">

	<import resource="classpath:META-INF/cxf/cxf.xml" />
	<import resource="classpath:META-INF/cxf/cxf-extension-soap.xml" />
	<import resource="classpath:META-INF/cxf/cxf-extension-http.xml" />
	<import resource="classpath:META-INF/cxf/osgi/cxf-extension-osgi.xml" />
	<jaxws:endpoint id="helloDZone"
		implementor="dzone.examples.ws.HelloDZoneImpl"
		address="/HelloDzone"/>
</beans>
	

In the previous example we hosted a Web service that could be called from outside the container. You can also configure CXF to host the Web service internally by prefixing the address with nmr. That way you can easily expose JAX-WS annotated Java beans to the other services inside the ESB. The following example shows this:

CXF Host Web service internally

<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:jaxws="http://cxf.apache.org/jaxws"
	xsi:schemaLocation="
	http://www.springframework.org/schema/beans
	http://www.springframework.org/schema/beans/spring-beans.xsd
	http://cxf.apache.org/jaxws http://cxf.apache.org/schemas/jaxws.xsd">
		
	<import resource="classpath:META-INF/cxf/cxf.xml" />
	<import resource="classpath:META-INF/cxf/cxf-extension-soap.xml" />
	<import resource="classpath:META-INF/cxf/transport/nmr/cxf-transport-nmr.xml" />
	<jaxws:endpoint id="helloDzone"
		implementor="dzone.examples.ws.HelloDZoneImpl"
			address="nmr:helloDZone" />
</beans>	
	

You can also host a Web service using the servicemix-cxf-bc component.

Host Web service using the servicemix-cxf-bc component

<beans xmlns:cxfbc="http://servicemix.apache.org/cxfbc/1.0"
		xmlns:dzone="http://dzone.org/refcard/example">
	<cxfbc:consumer wsdl="classpath:dzone-example.wsdl"
		targetService="dzone:ExampleService"
		targetInterface="dzone:Example"/>
</beans>	
	

Consuming Web services

Consuming Web services in ServiceMix is just as easy. ServiceMix provides two different options for this. You can use Camel or use the servicemix-cxf-bc component:

Consume Web service using the servicemix-cxf-bc component

<beans xmlns:cxfbc="http://servicemix.apache.org/cxfbc/1.0"
		xmlns:dzone="http://dzone.org/refcard/example">
	<cxfbc:provider wsdl="classpath:target-service.wsdl"
		locationURI="http://webservice.com/Service"
		endpoint="ServicePort"
		service="dzone:ServicePortService"/>
</beans>	
	

With this configuration you can consume a Web service which is located at http://webservice.com/Service and which is defined by the WSDL file target-service.wsdl. Other services can use this component by making a call to the dzone:ServicePortService.

You can also consume a Web service using Camel. For more information on how to configure the Camel route for this, see the Camel CXF integration section of the Camel website: http://camel.apache.org/cxf.html.

For Web services ServiceMix provides the following useful archetypes:

Name Description
servicemix-cxf-bc-service-unit Create a Maven project which uses the JBI CXF binding component.
servicemix-cxf-se-service-unit Create a Maven project which uses the JBI CXF service engine.
servicemix-cxf-se-wsdl-first-service-unit Create a Maven project which uses the JBI CXF service engine. This project is based on WSDL-first development.
servicemix-osgi-cxf-code-first-archetype Create a Maven project which uses CXF and OSGi together. This project is based on code-first development.
servicemix-osgi-cxf-wsdl-first-archetype Create a Maven project which uses CXF and OSGi together. This project is based on WSDL-first development.

ServiceMix Components

Besides integration with Web services through CXF, ServiceMix provides a lot of components you can use out of the box to integrate with various other standards and technologies. In this section we'll give an overview of these components. This list is based on the 2009.1 versions. Most of this information can also be found in the XML schemas of these components.

ServiceMix Components

XML Element Description
ServiceMix Bean
Endpoint Allows you to define a simple bean that can receive and send message exchanges.
ServiceMix File
Poller A polling endpoint that looks for a file or files in a directory and sends the files to a target service. You can configure various options on this endpoint such as archiving, filters, use of subdirectories etc.
Sender An endpoint that receives messages from the NMR and writes them to a specific file or directory.
ServiceMix CXF Binding Component
Consumer A consumer endpoint that is capable of using SOAP/HTTP or SOAP/JMS.
Provider A provider endpoint that is capable of exposing SOAP/HTTP or SOAP/JMS services.
ServiceMix CXF Service Engine
ServiceMix Drools
Endpoint With the Drools endpoint you can use a Drools rule set as a service or as a router.
ServiceMix FTP
Poller This endpoint can be used to poll an FTP directory for files, download them and send them to a service.
Sender With a sender endpoint you can store a message on an FTP server.
ServiceMix HTTP
Consumer Plain HTTP consumer endpoint. This endpoint can be used to handle plain HTTP requests (without SOAP) or to process the request in a non-standard way.
Provider A plain HTTP provider. This type of endpoint can be used to send non-SOAP requests to HTTP endpoints.
Soap-Consumer An HTTP consumer endpoint that is optimized to work with SOAP messages.
Soap-Provider An HTTP provider endpoint that is optimized to work with SOAP messages.
ServiceMix JMS
Consumer An endpoint that can receive messages from a JMS broker.
Provider An endpoint that can send messages to a JMS broker.
Soap-Consumer A JMS consumer that is optimized to work with SOAP messages.
Soap-Provider A JMS provider that is optimized to work with SOAP messages.
JCA-Consumer A JMS consumer that uses JCA to connect to the JMS broker.
ServiceMix Mail
Poller An endpoint which can be used to retrieve messages.
Sender An endpoint which you can use to send messages.
ServiceMix OSWorkflow
Endpoint This endpoint can be used to start an OSWorkflow process.
ServiceMix Quartz
Endpoint The Quartz endpoint can be used to fire messages into the NMR at specific intervals.
ServiceMix Saxon
XSLT With the XSLT endpoint you can apply an XSLT transformation to the received message.
Proxy The proxy component allows you to transform an incoming message and send it to an endpoint. You can also configure a transformation that needs to be applied to the result of that invocation.
XQuery The XQuery endpoint can be used to apply a selected XQuery to the input document.
ServiceMix Scripting
Endpoint With the scripting endpoint you can create a service which is implemented using a scripting language. The following languages are supported: Groovy, JRuby, Rhino JavaScript
ServiceMix SMPP
Consumer A polling component that binds with jSMPP, receives SMPP messages and sends them into the NMR as messages.
Provider A provider component that receives an XML message from the NMR, converts it into an SMPP packet and sends it to the SMPP server.
ServiceMix SNMP
Poller With this poller you can receive SNMP events by using the SNMP4J library.
ServiceMix Validation
Endpoint With this endpoint you can provide schema validation of documents using JAXP 1.3 and XMLSchema or RelaxNG.
ServiceMix-VFS
Poller A polling endpoint that looks for a file or files in a virtual file system (based on Apache commons-vfs) and sends the files to a target service.
Sender An endpoint which receives messages from the NMR and writes the message to the virtual file system.
ServiceMix-wsn2005
Create-pullpoint Lets you create a WS-Notification pull point that can be used by a requester to retrieve accumulated notification messages.
Publisher Sends messages to a specific topic.
Registerpublisher An endpoint that can be used by publishers to register themselves.
Subscribe Lets you create subscriptions to a specific topic using the WS-Notification specification.

Apache Solr

Getting Optimal Search Results

By Chris Hostetter

The Essential Apache Solr Cheat Sheet

Apache Solr is the HTTP-based server product of the Apache Lucene Project. It makes it easy for programmers to develop sophisticated, high performance search applications with open source technology. The Apache Solr project continues to gain more advanced features such as geo-searching, faceting, dynamic clustering, rich document handling, and database integration, making it a useful tool in the Big Data revolution. This DZone Refcard on Solr Essentials starts with an introduction to Solr and how to run it. You'll learn about searching, solrconfig.xml, schema.xml, field types, analyzers, indexing, and advanced search features.

ABOUT SOLR

Solr makes it easy for programmers to develop sophisticated, high performance search applications with advanced features such as faceting, dynamic clustering, database integration and rich document handling.

Solr (http://lucene.apache.org/solr/) is the HTTP based server product of the Apache Lucene Project. It uses the Lucene Java library at its core for indexing and search technology, as well as spell checking, hit highlighting, and advanced analysis/tokenization capabilities.

The fundamental premise of Solr is simple. You feed it a lot of information, then later you can ask it questions and find the piece of information you want. Feeding in information is called indexing or updating. Asking a question is called querying.

Figure 1: A typical Solr setup

Core Solr Concepts

Solr’s basic unit of information is a document: a set of information that describes something, like a class in Java. Documents themselves are composed of fields. These are more specific pieces of information, like attributes in a class.

RUNNING SOLR

Solr Installation

The LucidWorks for Solr installer (http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr) makes it easy to set up your initial Solr instance. The installer brings you through configuration and deployment of the Web service on either Jetty or Tomcat.

Solr Home Directory

Solr Home is the main directory where Solr will look for configuration files, data and plug-ins.

When LucidWorks is installed at ~/LucidWorks the Solr Home directory is ~/LucidWorks/lucidworks/solr/.

Single Core and Multicore Setup

By default, Solr is set up to manage a single “Solr Core” which contains one index. It is also possible to segment Solr into multiple virtual instances, or cores, each with its own configuration and indices. Cores can be dedicated to a single application, or to different ones, but all are administered through a common administration interface.

Multiple Solr Cores can be configured by placing a file named solr.xml in your Solr Home directory, identifying each Solr Core, and the corresponding instance directory for each. When using a single Solr Core, the Solr Home directory is automatically the instance directory for your Solr Core.
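
As an illustration, a solr.xml for two cores might look like the sketch below. It follows the Solr 1.4 multicore format; the core names and instance directories (core0, core1) are placeholders chosen for this example.

```xml
<!-- Sketch of a Solr 1.4 style solr.xml; core names and directories are hypothetical -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- Each core points at its own instance directory -->
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```

Each instanceDir is resolved relative to Solr Home and is expected to contain its own conf subdirectory.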

Configuration of each Solr Core is done through two main config files, both of which are placed in the conf subdirectory for that Core:

  • schema.xml: where you describe your data
  • solrconfig.xml: where you describe how people can interact with your data.

By default, Solr will store the index inside the data subdirectory for that Core.

Solr Administration

Administration for Solr can be done through http://[hostname]:8983/solr/admin, which provides a section with menu items for monitoring indexing and performance statistics, information about index distribution and replication, and information on all threads running in the JVM at the time. There is also a section where you can run queries, and an assistance area.

SCHEMA.XML

To build a searchable index, Solr takes in documents composed of data fields of specific field types. The schema.xml configuration file defines the field types and specific fields that your documents can contain, as well as how Solr should handle those fields when adding documents to the index or when querying those fields. schema.xml is structured as follows:


<schema>
	<types>
	<fields>
	<uniqueKey>
	<defaultSearchField>
	<solrQueryParser>
	<copyField>
</schema>

FIELD TYPES

A field type includes three important pieces of information:

  • The name of the field type
  • Implementation class name
  • Field attributes

Field types are defined in the types element of schema.xml.


<fieldType name="textTight" class="solr.TextField">
…
</fieldType>

The type name is specified in the name attribute of the fieldType element. The name of the implementing class, which makes sure the field is handled correctly, is referenced using the class attribute.

Hot Tip

Shorthand for Class References
When referencing classes in Solr, the string solr is used as shorthand in place of full Solr package names, such as org.apache.solr.schema or org.apache.solr.analysis.

Numeric Types

Solr supports two distinct groups of field types for dealing with numeric data:

  • Numerics with Trie Encoding: TrieDateField, TrieDoubleField, TrieIntField, TrieFloatField, and TrieLongField.
  • Numerics Encoded As Strings: DateField, SortableDoubleField, SortableIntField, SortableFloatField, and SortableLongField.

Which Type to Use?

Trie encoded types support faster range queries, and sorting on these fields is more RAM efficient. Documents that do not have a value for a Trie field will be sorted as if they contained the value of “0”. String encoded types are less efficient for range queries and sorting, but support the sortMissingLast and sortMissingFirst attributes.
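
To make the trade-off concrete, the sketch below declares one field type from each group. The type names (tint, sint) and the precisionStep value are illustrative choices modelled on the Solr 1.4 example schema, not required settings.

```xml
<!-- Trie-encoded int: faster range queries; the precisionStep value is illustrative -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true"/>

<!-- String-encoded sortable int: documents without a value sort after all others -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
```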

Class Description
BinaryField Binary data that needs to be base64 encoded when reading or writing
BoolField Contains either true or false. Values of “1”, “t”, or “T” in the first character are interpreted as true. Any other values in the first character are interpreted as false.
ExternalFileField Pulls values from a file on disk.
RandomSortField Does not contain a value. Queries that sort on this field type will return results in random order. Use a dynamic field to use this feature.
StrField String
TextField Text, usually multiple words or tokens
UUIDField Universally Unique Identifier (UUID). Pass in a value of “NEW” and Solr will create a new UUID.

Hot Tip

Date Field
Dates are of the format YYYY-MM-DDThh:mm:ssZ. The Z is the timezone indicator (for UTC) in the canonical representation. Solr requires dates and times to be in the canonical form, so clients are required to format and parse dates in UTC when dealing with Solr. Date fields also support date math, such as expressing a time two months from now using NOW+2MONTHS.

Field Type Properties

The field class determines most of the behavior of a field type, but optional properties can also be defined in schema.xml.

Some important Boolean properties are:

Property Description
indexed If true, the value of the field can be used in queries to retrieve matching documents. This is also required for fields where sorting is needed.
stored If true, the actual value of the field can be retrieved in query results.
sortMissingFirst, sortMissingLast Control the placement of documents when a sort field is not present in supporting field types.
multiValued If true, indicates that a single document might contain multiple values for this field type.

ANALYZERS

Field analyzers are used both during ingestion, when a document is indexed, and at query time. Analyzers are only valid for <fieldType> declarations that specify the TextField class. Analyzers may be a single class or they may be composed of a series of zero or more CharFilter, one Tokenizer and zero or more TokenFilter classes.

Analyzers are specified by adding <analyzer> children to the <fieldType> element in the schema.xml config file. Field Types typically use a single analyzer, but the type attribute can be used to specify distinct analyzers for the index vs query.
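
A sketch of distinct index-time and query-time analyzers follows. The field type name text_syn and the synonyms.txt file are assumptions made for this example; the factory classes are ones listed elsewhere in this Refcard.

```xml
<!-- Hypothetical field type: synonyms are expanded only when querying -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms.txt is assumed to exist in the core's conf directory -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```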

The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is the fully qualified Java class name of an existing Lucene analyzer.
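
For instance, a field type that delegates all analysis to one Lucene analyzer could be declared as in this sketch. The type name text_ws is hypothetical, and the class name assumes the Lucene package layout that shipped with Solr 1.4 (later Lucene releases moved the class to a different package).

```xml
<!-- Hypothetical field type using a single, pre-built Lucene analyzer -->
<fieldType name="text_ws" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>
```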

For more configurable analysis, an analyzer chain can be created using an <analyzer> element with no class attribute. Its child elements name the factory classes for the CharFilter, Tokenizer and TokenFilter to use, in the order they should run, as in the following example:


<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

CharFilter

A CharFilter pre-processes input characters and can add, remove or change characters while preserving the original character offsets.

The following table provides an overview of some of the CharFilter factories available in Solr 1.4:

CharFilter Description
MappingCharFilterFactory Applies mapping contained in a map to the character stream. The map contains pairings of String input to String output.
PatternReplaceCharFilterFactory Applies a regular expression pattern to the string in the character stream, replacing matches with the specified replacement string.
HTMLStripCharFilterFactory Strips HTML from the input stream and passes the result to either a CharFilter or a Tokenizer. This filter removes tags while keeping content. It also removes <script>, <style>, comments, and processing instructions.

Tokenizer

Tokenizer breaks up a stream of text into tokens. Tokenizer reads from a Reader and produces a TokenStream containing various metadata such as the locations at which each token occurs in the field.

The following table provides an overview of some of the Tokenizer factory classes included in Solr 1.4:

Tokenizer Description
StandardTokenizerFactory Treats whitespace and punctuation as delimiters.
NGramTokenizerFactory Generates n-gram tokens of sizes in the given range.
EdgeNGramTokenizerFactory Generates edge n-gram tokens of sizes in the given range.
PatternTokenizerFactory Uses a Java regular expression to break the text stream into tokens.
WhitespaceTokenizerFactory Splits the text stream on whitespace, returning sequences of non-whitespace characters as tokens.

TokenFilter

TokenFilter consumes and produces TokenStreams. TokenFilter looks at each token sequentially and decides to pass it along, replace it or discard it.

A TokenFilter may also do more complex analysis by buffering to look ahead and consider multiple tokens at once.

The following table provides an overview of some of the TokenFilter factory classes included in Solr 1.4:

TokenFilter Description
KeepWordFilterFactory Discards all tokens except those that are listed in the given word list. Inverse of StopFilterFactory.
LengthFilterFactory Passes tokens whose length falls within the min/max limit specified.
LowerCaseFilterFactory Converts any uppercase letters in a token to lowercase.
PatternReplaceFilterFactory Applies a regular expression to each token, and substitutes the given replacement string in place of the matched pattern.
PhoneticFilterFactory Creates tokens using one of the phonetic encoding algorithms from the org.apache.commons.codec.language package.
PorterStemFilterFactory An algorithmic stemmer that is not as accurate as a table-based stemmer, but faster and less complex.
ShingleFilterFactory Constructs shingles (token n-grams) from the token stream.
StandardFilterFactory Removes dots from acronyms and 's from the end of tokens. This class only works when used in conjunction with the StandardTokenizerFactory.
StopFilterFactory Discards, or stops, analysis of tokens that are on the given stop words list.
SynonymFilterFactory Each token is looked up in the list of synonyms and if a match is found, then the synonym is emitted in place of the token.
TrimFilterFactory Trims leading and trailing whitespace from tokens.
WordDelimiterFilterFactory Splits and recombines tokens at punctuation, case changes and numbers. Useful for indexing

Hot Tip

Testing Your Analyzer
There is a handy page in the Solr admin interface that allows you to test out your analysis against a field type, at the http://[hostname]:8983/solr/admin/analysis.jsp page in your installation.

FIELDS

Once you have field types set up, defining the fields themselves is simple: all you need to do is supply the name and a reference to the name of the declared type you wish to use. You can also provide options that override the options for that field type.


<field name="price" type="sfloat" indexed="true"/>

Dynamic Fields

Dynamic fields allow you to define behavior for fields that are not explicitly defined in the schema, allowing you to have fields in your document whose underlying <fieldType/> will be driven by the field naming convention instead of having an explicit declaration for every field.

Dynamic fields are also defined in the fields element of the schema, and have a name, field type, and options.


<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>

OTHER SCHEMA ELEMENTS

Copying Fields

Solr has a mechanism for making copies of fields so that you can apply several distinct field types to a single piece of incoming information.


<copyField source="cat" dest="text" maxChars="30000" />

Unique Key

The uniqueKey element specifies which field is a unique identifier for documents. Although uniqueKey is not required, it is nearly always warranted by your application design. For example, uniqueKey should be used if you will ever update a document in the index.


<uniqueKey>id</uniqueKey>

Default Search Field

If you are using the Lucene query parser, queries that don’t specify a field name will use the defaultSearchField. The dismax query parser does not use this value in Solr 1.4.


<defaultSearchField>text</defaultSearchField>

Query Parser Operator

In queries with multiple clauses that are not explicitly required or prohibited, Solr can either return results where all conditions are met or where one or more conditions are met. The default operator controls this behavior. An operator of AND means that all conditions must be fulfilled, while an operator of OR means that one or more conditions must be true.

In schema.xml, use the solrQueryParser element to control what operator is used if an operator is not specified in the query. The default operator setting only applies to the Lucene query parser (not the DisMax query parser, which uses the mm parameter to control the equivalent behavior).
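
A minimal sketch of this element, using OR as the example value:

```xml
<!-- Inside <schema>: clauses without an explicit operator are combined with OR -->
<solrQueryParser defaultOperator="OR"/>
```

With this setting, a Lucene-parser query for two bare terms matches documents containing either term; changing the value to AND would require both.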

SOLRCONFIG.XML

Configuring solrconfig.xml

solrconfig.xml, found in the conf directory for the Solr Core, consists of a set of XML statements that set the configuration values for your Solr instance.

AutoCommit

The <updateHandler> section affects how updates are done internally. The <autoCommit> subelement contains further configuration for controlling how often pending updates will be automatically pushed to the index.

Element Description
<maxDocs> Number of updates that have occurred since last commit
<maxTime> Number of milliseconds since the oldest uncommitted update

If either of these limits is reached, then Solr automatically performs a commit operation. If the <autoCommit> tag is missing, then only explicit commits will update the index.
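
As a sketch, an autoCommit block inside the update handler could look like the following. The thresholds are arbitrary example values, and the handler class shown is the one used in the Solr 1.4 example configuration.

```xml
<!-- Example values only: commit after 10,000 pending documents or 60 seconds -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```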

HTTP RequestDispatcher Settings

The <requestDispatcher> section controls how the RequestDispatcher implementation responds to HTTP requests.

Element Description
<requestParsers> Contains attributes for enableRemoteStreaming and multipartUploadLimitInKB
<httpCaching> Specifies how Solr should generate its HTTP caching-related headers

Internal Caching

The <query> section contains settings that affect how Solr will process and respond to queries.

There are three predefined types of caches that you can configure whose settings affect performance:

Element Description
<filterCache> Used by SolrIndexSearcher for filters for unordered sets of all documents that match a query. Solr uses the filterCache to cache results of queries that use the fq search parameter.
<queryResultCache> Holds the sorted and paginated results of previous searches
<documentCache> Holds Lucene Document objects (the stored fields for each document).
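
The three caches might be declared inside the <query> section roughly as sketched here. The sizes and autowarm counts are placeholder values to be tuned for a given index, not recommendations.

```xml
<!-- Placeholder sizes; tune for your own index and query load -->
<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
</query>
```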

Request Handlers

A Request Handler defines the logic executed for any request. Multiple instances of various request handlers, each with a different name and configuration options, can be declared. The qt URL parameter or the path of the URL can be used to select the request handler by name.

Most request handlers recognize three main sub-sections in their declaration:

  • defaults, which are used when a request does not include a parameter.
  • appends, which are added to the parameter values specified in the request.
  • invariants, which override values specified in the query.
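
The sketch below shows how these sub-sections can appear in a handler declaration. In solrconfig.xml they are written as lst elements named defaults, appends and invariants; the handler name /products and the parameter values (including the inStock field) are hypothetical.

```xml
<!-- Hypothetical search handler illustrating the three parameter sections -->
<requestHandler name="/products" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Used only if the request does not specify rows -->
    <str name="rows">10</str>
  </lst>
  <lst name="appends">
    <!-- Always added to any fq parameters sent by the client -->
    <str name="fq">inStock:true</str>
  </lst>
  <lst name="invariants">
    <!-- Cannot be overridden by the request -->
    <str name="facet">false</str>
  </lst>
</requestHandler>
```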

LucidWorks for Solr includes the following indexing handlers:

  • XMLUpdateRequestHandler: processes XML messages containing data and other index modification instructions.
  • BinaryUpdateRequestHandler: processes messages from the Solr Java client.
  • CSVRequestHandler: processes CSV files containing documents
  • DataImportHandler: processes commands to pull data from remote data sources
  • ExtractingRequestHandler (aka Solr Cell): uses Apache Tika to process binary files such as Office/PDF and index them

The out-of-the-box searching handler is SearchHandler.

Search Components

Instances of SearchComponent define discrete units of logic that can be combined and reused by Request Handlers (in particular SearchHandler) that know about them. The default SearchComponents used by SearchHandler are query, facet, mlt (MoreLikeThis), highlight, stats, and debug. Additional Search Components are available with additional configuration.

Response Writers

Response writers generate the formatted response of a search. The wt URL parameter selects the response writer to use by name. The default response writers are json, php, phps, python, ruby, xml, and xslt.

INDEXING

Indexing is the process of adding content to a Solr index, and as necessary, modifying that content or deleting it. By adding content to an index, it becomes searchable by Solr.

Client Libraries

There are a number of client libraries available to access Solr. SolrJ is a Java client included with the Solr 1.4 release which allows clients to add, update and query the Solr index. http://wiki.apache.org/solr/IntegratingSolr provides a list of such libraries.

Indexing Using XML

Solr accepts POSTed XML messages that add/update, commit, delete and delete by query using the http://[hostname]:8983/solr/update URL. Multiple documents can be specified in a single <add> command.


<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>

Command Description
commit Writes all documents loaded since last commit
optimize Requests Solr to merge the entire index into a single segment to improve search performance
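
As a sketch, each of these commands is sent as its own small XML message, POSTed to the same update URL that accepts <add> messages:

```xml
<!-- Message 1: make all pending documents searchable -->
<commit/>

<!-- Message 2 (sent separately): merge the index into a single segment -->
<optimize/>
```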

Delete by id deletes the document with the specified ID (i.e. uniqueKey), while delete by query deletes documents that match the specified query:


<delete><id>05991</id></delete>
<delete><query>office:Bridgewater</query></delete>

Indexing Using CSV

CSV records can be uploaded to Solr by sending the data to the http://[hostname]:8983/solr/update/csv URL.

The CSV handler accepts various parameters, some of which can be overridden on a per field basis using the form:


f.fieldname.parameter=value

These parameters can be used to specify how data should be parsed, such as specifying the delimiter, quote character and escape characters. You can also handle whitespace, define which lines or field names to skip, map columns to fields, or specify if columns should be split into multiple values.

Indexing Using SolrCell

Using the Solr Cell framework, Solr uses Tika to automatically determine the type of a document and extract fields from it. These fields are then indexed directly, or mapped to other fields in your schema.

The URL for this handler is http://[hostname]:8983/solr/update/extract.

The Extraction Request Handler accepts various parameters that can be used to specify how data should be mapped to fields in the schema, including specific XPaths of content to be extracted, how content should be mapped to fields, whether attributes should be extracted, and in which format to extract content. You can also specify a dynamic field prefix to use when extracting content that has no corresponding field.

Indexing Using Data Import Handler

The Data Import Handler (DIH) can pull data from relational databases (through JDBC), RSS feeds, email repositories, and structured XML, using XPath to generate fields.

The Data Import Handler is registered in solrconfig.xml, with a pointer to its data-config.xml file which has the following structure:


<dataConfig>
  <dataSource/>
  <document>
    <entity>
      <field column="" name=""/>
      <field column="" name=""/>
    </entity>
  </document>
</dataConfig>

The Data Import Handler is accessed using the http://[hostname]:8983/solr/dataimport URL. It also includes a browser-based console that lets you experiment with data-config.xml changes and demonstrates all of the commands and options to help with development. You can access the console at this address: http://[hostname]:port/solr/admin/dataimport.jsp

SEARCHING

Data can be queried using either the http://[hostname]:8983/solr/select?qt=name URL, or by using the http://[hostname]:8983/solr/name syntax for SearchHandler instances with names that begin with a "/".

SearchHandler processes requests by delegating to its Search Components which interpret the various request parameters. The QueryComponent delegates to a query parser, which determines which documents the user is interested in. Different query parsers support different syntax.

Query Parsing

Input to a query parser can include:

  • Search strings, that is, terms to search for in the index.
  • Parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results.
  • Parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application’s schema.

Search parameters may also specify a filter query. As part of a search response, a filter query runs a query against the entire index and caches the results. Because Solr allocates a separate cache for filter queries, the strategic use of filter queries can improve search performance.

Common Query Parameters

The table below summarizes Solr’s common query parameters:

Parameter Description
defType The query parser to be used to process the query
sort Sort results in ascending or descending order based on the document's score or another characteristic
start The offset (0 by default) into the results at which Solr should begin displaying
rows Indicates how many rows of results are displayed at a time (10 by default)
fq Applies a filter query to the search results
fl Limits the query’s results to a listed set of fields
debugQuery Causes Solr to include additional debugging information in the response, including score explain information for each document returned
explainOther Allows client to specify a Lucene query to identify a set of documents not already included in the response, returning explain information for each of those documents
wt Specifies the Response Writer to be used to format the query response
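
To show how these parameters combine, here is a small Python sketch that builds a paged select URL. The function name select_url and the sample query are assumptions made for illustration; only the parameter names come from the table above.

```python
from urllib.parse import urlencode

def select_url(hostname, query, page=0, rows=10, filters=(), fields=()):
    """Build a select URL for the given zero-based result page."""
    params = [("q", query), ("start", page * rows), ("rows", rows)]
    params += [("fq", fq) for fq in filters]      # one fq per filter query
    if fields:
        params.append(("fl", ",".join(fields)))   # limit the returned fields
    return "http://%s:8983/solr/select?%s" % (hostname, urlencode(params))

# Third page of ten results, filtered to one office.
url = select_url("localhost", "smith", page=2,
                 filters=["office:Bridgewater"],
                 fields=["employeeId", "office"])
print(url)
```

Because start is an offset rather than a page number, the sketch multiplies the page index by rows.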

Lucene Query Parser

The standard query parser syntax allows users to specify queries containing complex expressions, such as: http://[hostname]:8983/solr/select?q=id:SP2514N+price:[*+TO+10].

The standard query parser supports the parameters described in the following table:

Parameter Description
q Query string using the Lucene Query syntax
q.op Specifies the default operator for the query expression, overriding that in schema.xml. May be AND or OR
df Default field, overriding what is defined in schema.xml

DisMax Query Parser

The DisMax query parser is designed to provide an experience similar to that of popular search engines such as Google, which rarely display syntax errors to users.

Instead of allowing complex expressions in the query string, additional parameters can be used to specify how the query string should be used to find matching documents.

Parameter Description
q Defines the raw user input strings for the query
q.alt Calls the standard query parser and defines query input strings when q is not used
qf Query Fields: the fields in the index on which to perform the query
mm Minimum “Should” Match: a minimum number of clauses in the query that must match a document. This can be specified as a complex expression.
pf Phrase Fields: Fields that give a score boost when all terms of the q parameter appear in close proximity
ps Phrase Slop: the number of positions all terms can be apart in order to match the pf boost
tie Tie Breaker: a float value (less than 1) used as a multiplier when more than one of the qf fields contains a term from the query string. The smaller the value, the less influence multiple matching fields have
bq Boost Query: a raw Lucene query that will be added to the user's query to influence the score
bf Boost Function: like bq, but directly supports the Solr function query syntax
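
The effect of the tie parameter is easier to see with numbers. The Python sketch below is a simplification of disjunction-max scoring, under the assumption that the best-matching field contributes its full score and every other matching field contributes its score multiplied by tie; real Solr scores include further factors, and the per-field scores here are invented.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field plus tie times the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

# Invented scores for a term that matched three qf fields.
scores = [0.8, 0.4, 0.2]
print(dismax_score(scores, tie=0.0))  # only the best field counts
print(dismax_score(scores, tie=0.1))  # other fields add a small bonus
```

With tie set to 0 the result is a pure maximum; as tie approaches 1 the result approaches a plain sum of the field scores.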

ADVANCED SEARCH FEATURES

Faceting makes it easy for users to drill down on search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.

There are three types of faceting, all of which use indexed terms:

  • Field Faceting: treats each indexed term as a facet constraint.
  • Query Faceting: allows the client to specify an arbitrary query and uses that as a facet constraint.
  • Date Range Faceting: creates date range queries on the fly.

Solr provides a collection of highlighting utilities which can be called by various Request Handlers to include highlighted matches in field values. Popular search engines such as Google and Yahoo! return snippets in their search results: 3-4 lines of text offering a description of a search result.

When an index becomes too large to fit on a single system, or when a query takes too long to execute, the index can be split into multiple shards on different Solr servers, for Distributed Search. Solr can query and merge results across shards. It’s up to you to get all your documents indexed on each shard of your server farm. Solr does not include out-of-the-box support for distributed indexing, but your method can be as simple as a round robin technique. Just index each document to the next server in the circle.
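
The round robin technique can be sketched in a few lines of Python. The shard URLs below are placeholders, and the sketch only decides which shard each document goes to; posting the documents is left to whatever indexing client you use.

```python
from itertools import cycle

def assign_round_robin(doc_ids, shard_urls):
    """Map each shard URL to the documents it should index."""
    assignment = {url: [] for url in shard_urls}
    # cycle() walks the shard list in a circle, one document per step
    for doc_id, url in zip(doc_ids, cycle(shard_urls)):
        assignment[url].append(doc_id)
    return assignment

# Placeholder shard addresses.
shards = ["http://shard1:8983/solr", "http://shard2:8983/solr"]
plan = assign_round_robin(["d1", "d2", "d3", "d4", "d5"], shards)
for url, docs in plan.items():
    print(url, docs)
```

A plain rotation keeps the shards evenly filled but gives no way to find a document's shard later; hashing the document ID is a common alternative when updates and deletes must be routed.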

Clustering groups search results by similarities discovered when a search is executed, rather than when content is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help users rule out content that isn’t pertinent to what they’re really searching for.

The primary purpose of the Replication Handler is to replicate an index to multiple slave servers, which can then be load balanced for horizontal scaling. The Replication Handler can also be used to make a back-up copy of a server's index, even without any slave servers in operation.

MoreLikeThis is a component that can be used with the SearchHandler to return documents similar to each of the documents matching a query. The MoreLikeThis Request Handler can be used instead of the SearchHandler to find documents similar to an individual document, utilizing faceting, pagination and filtering on the related documents.

About The Authors

Chris Hostetter

Chris Hostetter is Senior Staff Engineer at Lucid Imagination, a member of the Apache Software Foundation, and serves as a committer for the Apache Lucene/Solr Projects. Prior to joining Lucid Imagination in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about searching “structured data” that was never as structured as it should have been.

Recommended Book

Photo of Book: Lucid works for solr

Designed to provide complete, comprehensive documentation, the Reference Guide is intended to be more encyclopedic and less of a cookbook. It is structured to address a broad spectrum of needs, ranging from new developers getting started to well experienced developers extending their application or troubleshooting. It will be of use at any point in the application lifecycle, for whenever you need deep, authoritative information about Solr.

Getting Started with Apache Hadoop

By Eugene Ciurana and Masoud Kalali

15,277 Downloads · Refcard 117 of 151 (see them all)

The Essential Apache Hadoop Cheat Sheet

The Apache Hadoop Refcard from DZone is the perfect introduction and quick reference to the MapReduce technology that is leading the charge in the Big Data Movement. This Refcard presents a basic blueprint for applying MapReduce to solving large-scale, unstructured data processing problems by showing you how to deploy and use an Apache Hadoop computational cluster. It complements DZone Refcardz #43 and #105, which provide introductions to high-performance computational scalability and high-volume data handling techniques, including MapReduce.

INTRODUCTION

This Refcard presents a basic blueprint for applying MapReduce to solving large-scale, unstructured data processing problems by showing how to deploy and use an Apache Hadoop computational cluster. It complements DZone Refcardz #43 and #105, which provide introductions to high-performance computational scalability and high-volume data handling techniques, including MapReduce.

What Is MapReduce?

MapReduce refers to a framework that runs on a computational cluster to mine large datasets. The name derives from the application of map() and reduce() functions repurposed from functional programming languages.

  • “Map” applies to all the members of the dataset and returns a list of results
  • “Reduce” collates and resolves the results from one or more mapping operations executed in parallel
  • Very large datasets are split into large subsets called splits
  • A parallelized operation performed on all splits yields the same results as if it were executed against the larger dataset before turning it into splits
  • Implementations separate business logic from multiprocessing logic
  • MapReduce framework developers focus on process dispatching, locking, and logic flow
  • App developers focus on implementing the business logic without worrying about infrastructure or scalability issues
Implementation patterns

The Map(k1, v1) -> list(k2, v2) function is applied to every item in the split. It produces a list of (k2, v2) pairs for each call. The framework groups all the results with the same key together in a new split.

The Reduce(k2, list(v2)) -> list(v3) function is applied to each intermediate results split to produce a collection of values v3 in the same domain. This collection may have zero or more values. The desired result consists of all the v3 collections, often aggregated into one result file.
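
The pattern can be tried out on a single machine. The Python sketch below is an in-memory illustration, not Hadoop code: run_mapreduce and the word-count functions are names invented for this example, and the grouping step stands in for the work the framework does between the map and reduce phases.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1), group by k2, then reduce each group."""
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):       # Map(k1, v1) -> list(k2, v2)
            intermediate[k2].append(v2)     # framework groups by key
    # Reduce(k2, list(v2)) -> list(v3)
    return {k2: reduce_fn(k2, values) for k2, values in intermediate.items()}

def word_count_map(line_no, text):
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    return [sum(counts)]

result = run_mapreduce([(1, "to be or not to be")],
                       word_count_map, word_count_reduce)
print(result)
```

Note that both functions return lists, matching the signatures above, even when a list holds a single value.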

Hot Tip

MapReduce frameworks produce lists of values. Users familiar with functional programming mistakenly expect a single result from the mapping operations.
figure1

APACHE HADOOP

Apache Hadoop is an open source Java framework for implementing reliable and scalable computational networks. Hadoop includes several subprojects:

  • MapReduce
  • Pig
  • ZooKeeper
  • HBase
  • HDFS
  • Hive
  • Chukwa

This Refcard presents how to deploy and use the common tools, MapReduce, and HDFS for application development after a brief overview of all of Hadoop’s components.

Hot Tip

http://hadoop.apache.org is the authoritative reference for all things Hadoop.

Hadoop comprises tools and utilities for data serialization, file system access, and interprocess communication pertaining to MapReduce implementations. Single and clustered configurations are possible. This configuration almost always includes HDFS because it’s better optimized for high throughput MapReduce I/O than general-purpose file systems.

Components

Figure 2 shows how the various Hadoop components relate to one another:

figure2
Essentials
  • HDFS - a scalable, high-performance distributed file system. It stores its data blocks on top of the native file system. HDFS is designed for consistency; commits aren’t considered “complete” until data is written to at least two different configurable volumes. HDFS presents a single view of multiple physical disks or file systems.
  • MapReduce - A Java-based job tracking, node management, and application container for mappers and reducers written in Java or in any scripting language that supports STDIN and STDOUT for job interaction.

Hot Tip

Hadoop also supports other file systems like Amazon Simple Storage Service (S3), Kosmix's CloudStore, and IBM's General Parallel File System. These may be cheaper alternatives to hosting data in the local data center.
Frameworks
  • Chukwa - a data collection system for monitoring, displaying, and analyzing logs from large distributed systems.
  • Hive - structured data warehousing infrastructure that provides mechanisms for storage, data extraction, transformation, and loading (ETL), and a SQL-like language for querying and analysis.
  • HBase - a column-oriented (NoSQL) database designed for real-time storage, retrieval, and search of very large tables (billions of rows/millions of columns) running atop HDFS.
Utilities
  • Pig - a set of tools for programmatic flat-file data analysis that provides a programming language, data transformation, and parallelized processing.
  • Sqoop - a tool for importing and exporting data stored in relational databases into Hadoop or Hive, and vice versa using MapReduce tools and standard JDBC drivers.
  • ZooKeeper - a distributed application management tool for configuration, event synchronization, naming, and group services used for managing the nodes in a Hadoop computational network.

Hot Tip

Sqoop is a product released by Cloudera, the most influential Hadoop commercial vendor, under the Apache 2.0 license. The source code and binary packages are available at: http://wiki.github.com/cloudera/sqoop

Hadoop Cluster Building Blocks

Hadoop clusters may be deployed in three basic configurations:

Mode | Description | Usage
Local (default) | Multi-threading components, single JVM | Development, test, debug
Pseudo-distributed | Multiple JVMs, single node | Development, test, debug
Distributed | All components run in separate nodes | Staging, production

Figure 3 shows how the components are deployed for any of these configurations:

figure3

Each node in a Hadoop installation runs one or more daemons executing MapReduce code or HDFS commands. Each daemon’s responsibilities in the cluster are:

  • NameNode: manages HDFS and communicates with every DataNode daemon in the cluster
  • JobTracker: dispatches jobs and assigns splits to mappers or reducers as each stage completes
  • TaskTracker: executes tasks sent by the JobTracker and reports status
  • DataNode: Manages HDFS content in the node and updates status to the NameNode

These daemons execute in the three distinct processing layers of a Hadoop cluster: master (Name Node), slaves (Data Nodes), and user applications.

Name Node (Master)
  • Manages the file system name space
  • Keeps track of job execution
  • Manages the cluster
  • Replicates data blocks and keeps them evenly distributed
  • Manages lists of files, list of blocks in each file, list of blocks per node, and file attributes and other meta-data
  • Tracks HDFS file creation and deletion operations in an activity log

Depending on system load, the NameNode and JobTracker daemons may run on separate computers.

Hot Tip

Although a cluster can contain two or more Name Nodes, Hadoop supports only one active Name Node. Secondary nodes, at the time of writing, only log what happened in the primary. The Name Node is a single point of failure that requires manual fail-over!
Data Nodes (Slaves)
  • Store blocks of data in their local file system
  • Store meta-data for each block
  • Serve data and meta-data to the job they execute
  • Send periodic status reports to the Name Node
  • Send data blocks to other nodes as required by the Name Node

Data nodes execute the DataNode and TaskTracker daemons described earlier in this section.

User Applications
  • Dispatch mappers and reducers to the Name Node for execution in the Hadoop cluster
  • Execute implementation contracts for Java and scripting-language mappers and reducers
  • Provide application-specific execution parameters
  • Set Hadoop runtime configuration parameters with semantics that apply to the Name or the Data nodes

A user application may be a stand-alone executable, a script, a web application, or any combination of these. The application is required to implement either the Java or the streaming APIs.

Hadoop Installation

Hot Tip

Cygwin is a requirement for any Windows systems running Hadoop — install it before continuing if you’re using this OS.

Detailed instructions for this section are available at: http://hadoop.apache.org/common/docs/current

  • Ensure that Java 6 is installed and that both ssh and sshd are running on all nodes
  • Get the most recent, stable release from http://hadoop.apache.org/common/releases.html
  • Decide on local, pseudo-distributed or distributed mode
  • Install the Hadoop distribution on each server
  • Set the HADOOP_HOME environment variable to the directory where the distribution is installed
  • Add $HADOOP_HOME/bin to PATH

Follow the instructions for local, pseudo-distributed, or distributed configuration from the Hadoop site. All the configuration files are located in the directory $HADOOP_HOME/conf; the minimum configuration requirements for each file are:

  • hadoop-env.sh — environmental configuration, JVM configuration, logging, master and slave configuration files
  • core-site.xml — site wide configuration, such as users, groups, sockets
  • hdfs-site.xml — HDFS block size, Name and Data node directories
  • mapred-site.xml — total MapReduce tasks, JobTracker address
  • masters, slaves files — NameNode, JobTracker, DataNodes, and TaskTrackers addresses, as appropriate
Test the Installation

Log on to each server without a passphrase: ssh servername or ssh localhost

Format a new distributed file system: hadoop namenode -format

Start the Hadoop daemons: start-all.sh

Check the logs for errors at $HADOOP_HOME/logs!

Browse the NameNode and JobTracker interfaces at (localhost is a valid name for local configurations):

  • http://namenode.server.name:50070/
  • http://jobtracker.server.name:50030/

HADOOP QUICK REFERENCE

The official commands guide is available from: http://hadoop.apache.org/common/docs/current/commands_manual.html

Usage

Hot Tip

hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Hadoop can parse generic options and run classes from the command line. confdir can override the default $HADOOP_HOME/conf directory.

Generic Options

-conf <config file> | App configuration file
-D <property=value> | Set a property
-fs <local|namenode:port> | Specify a namenode
-jt <local|jobtracker:port> | Specify a job tracker; applies only to a job
-files <file1, file2, .., fileN> | Files to copy to the cluster (job only)
-libjars <file1, file2, .., fileN> | .jar files to include in the classpath (job only)
-archives | Archives to unbundle on the computational nodes (job only)
User Commands
archive -archiveName file.har /var/data1 /var/data2 | Create an archive
distcp hdfs://node1:8020/dir_a hdfs://node2:8020/dir_b | Distributed copy from one or more node/dirs to a target
fsck -locations /var/data1 | File system check: list blocks and their locations
fsck -move /var/data1 | File system check: move corrupted files to /lost+found
fsck /var/data | General file system check
job -list [all] | List jobs
job -submit job_file | Dispatch a job; submitting a job returns its ID
job -status 42 | Check the status of a job
job -kill 42 | Kill a job
pipes -conf file | Use HDFS and MapReduce from a C++ program
pipes -map File.class | Use HDFS and MapReduce from a C++ program
pipes -map M.class -reduce R.class -files | Use HDFS and MapReduce from a C++ program
queue -list | List job queues
Administrator Commands
balancer -threshold 50 | Cluster balancing at percent of disk capacity
daemonlog -getlevel host name | Fetch http://host/logLevel?log=name
datanode | Run a new datanode
jobtracker | Run a new job tracker
namenode -format | Format
namenode -regular | Start a new instance
namenode -upgrade | Upgrade from a previous version of Hadoop
namenode -finalize | Remove previous version's files and complete upgrade

HDFS shell commands apply to local or HDFS file systems and take the form:


hadoop dfs -command dfs_command_options

HDFS Shell
du /var/data1 hdfs://node/data2 | Display cumulative size of files and directories
lsr | Recursive directory list
cat hdfs://node/file | Types a file to stdout
count hdfs://node/data | Count the directories, files, and bytes in a path
chmod, chgrp, chown | Permissions
expunge | Empty file system trash
get hdfs://node/data2 /var/data2 | Recursive copy files to the local system
put /var/data2 hdfs://node/data2 | Recursive copy files to the target file system
cp, mv, rm | Copy, move, or delete files in HDFS only
mkdir hdfs://node/path | Recursively create a new directory in the target
setrep -R -w 3 | Recursively set a file or directory replication factor (number of copies of the file)

Hot Tip

Wildcard expansion happens in the host’s shell, not in the HDFS shell! A command issued to a directory will affect the directory and all the files in it, inclusive. Remember this to prevent surprises.

To leverage this quick reference, review and understand all the Hadoop configuration, deployment, and HDFS management concepts. The complete documentation is available from http://hadoop.apache.org.

HADOOP APPS QUICK HOW-TO

A Hadoop application is made up of one or more jobs. A job consists of a configuration file and one or more Java classes or a set of scripts. Data must already exist in HDFS.

Figure 4 shows the basic building blocks of a Hadoop application written in Java:

Figure4

An application has one or more mappers and reducers and a configuration container that describes the job, its stages, and intermediate results. Classes are submitted and monitored using the tools described in the previous section.

Input Formats and Types

  • KeyValueTextInputFormat - each line represents a key and value delimited by a separator; if the separator is missing, the whole line is the key and the value is empty
  • TextInputFormat - the key is the byte offset of the line in the file, the value is the text of the line
  • NLineInputFormat - N sequential lines represent the value, the offset is the key
  • MultiFileInputFormat - an abstraction that the user overrides to define the keys and values in terms of multiple files
  • SequenceFileInputFormat - raw format serialized key/value pairs
  • DBInputFormat - JDBC driver fed data input

Output Formats

The output formats have a 1:1 correspondence with the input formats and types. The complete list is available from: http://hadoop.apache.org/common/docs/current/api

Word Indexer Job Example

Applications are often required to index massive amounts of text. This sample application shows how to build a simple indexer for text files. The input is free-form text such as:


hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily farewell.

The map function output should be something like:


<KING, hamlet@11141>
<CLAUDIUS, hamlet@11141>
<We, hamlet@11141>
<doubt, hamlet@11141>

The number represents the line in which the text occurred. The mapper and reducer/combiner implementations in this section rely on the API documentation at: http://hadoop.apache.org/mapreduce/docs/current/api

The Mapper

The basic Java code implementation for the mapper has the form:


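import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LineIndexMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Called once per input line: k is the line's offset, v is its text
  public void map(LongWritable k, Text v,
      OutputCollector<Text, Text> o, Reporter r)
      throws IOException {
    /* implementation here */
  }
  // ...
}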

The implementation itself uses standard Java text manipulation tools; you can use regular expressions, scanners, whatever is necessary.

Hot Tip

There were significant changes to the method signatures in Hadoop 0.18, 0.20, and 0.21 - check the documentation to get the exact signature for the version you use.
The Reducer/Combiner

The combiner is an output handler for the mapper to reduce the total data transferred over the network. It can be thought of as a reducer on the local node.


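import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LineIndexReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // Called once per word: k is the word, v iterates over its locations
  public void reduce(Text k, Iterator<Text> v,
      OutputCollector<Text, Text> o, Reporter r)
      throws IOException {
    /* implementation */
  }
  // ...
}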

The reducer iterates over keys and values generated in the previous step adding a line number to each word’s occurrence index. The reduction results have the form:


<KING, hamlet@11141; hamlet@42691; lear@31337>

A complete index shows the line where each word occurs, and the file/work where it occurred.
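
To make the data flow concrete without a cluster, the Python sketch below imitates the indexer's map and reduce steps in memory. It is an illustration under assumptions, not the Hadoop implementation: the regular expression used to split words and the second input line are invented for this example.

```python
import re
from collections import defaultdict

def index_map(line):
    """Yield (word, location) pairs for one tab-separated input line."""
    location, *columns = line.rstrip("\n").split("\t")
    for column in columns:
        for word in re.findall(r"[A-Za-z']+", column):  # assumed word pattern
            yield word, location

def index_reduce(pairs):
    """Collect every location of each word into one index entry."""
    index = defaultdict(list)
    for word, location in pairs:
        index[word].append(location)
    return {word: "; ".join(locations) for word, locations in index.items()}

lines = [
    "hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily farewell.",
    "lear@31337\tKING LEAR\tNothing will come of nothing.",  # invented line
]
pairs = [pair for line in lines for pair in index_map(line)]
index = index_reduce(pairs)
print("<KING, %s>" % index["KING"])
```

In a real job the framework performs the grouping between the two functions, and the combiner applies the same reduction on each node before data crosses the network.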

Job Driver

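import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Driver {
  public static void main(String... argV) throws Exception {
    // Job belongs to the newer org.apache.hadoop.mapreduce API; the
    // mapper and reducer must be written against the same API version
    Job job = new Job(new Configuration(), "test");
    job.setMapperClass(LineIndexMapper.class);
    job.setCombinerClass(LineIndexReducer.class);
    job.setReducerClass(LineIndexReducer.class);

    job.waitForCompletion(true);
  }
} // Driver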

This driver is submitted to the Hadoop cluster for processing, along with the rest of the code in a .jar file. One or more files must be available in a reachable hdfs://node/path before submitting the job using the command:


hadoop jar shakespeare_indexer.jar

Using the Streaming API

The streaming API is intended for users with very limited Java knowledge and interacts with any code that supports STDIN and STDOUT streaming. Java is considered the best choice for "heavy duty" jobs. Development speed could be a reason for using the streaming API instead. Some scripted languages may work as well or better than Java in specific problem domains. This section shows how to implement the same mapper and reducer using awk and compares its performance against Java's.

The Mapper

#!/usr/bin/gawk -f
{
  for (n = 2; n <= NF; n++) {
    # Strip punctuation, then emit word<TAB>location ($1 is the location)
    gsub("[,:;)(|!\\[\\]\\.\\?]|--", "");
    if (length($n) > 0) printf("%s\t%s\n", $n, $1);
  }
}

The output is mapped with the key, a tab separator, then the index occurrence.

The Reducer

#!/usr/bin/gawk -f
{
  wordsList[$1] = ($1 in wordsList) ?
    sprintf("%s,%s", wordsList[$1], $2) : $2;
}

END {
  for (key in wordsList)
    printf("%s\t%s\n", key, wordsList[key]);
}

The output is a list of all entries for a given word, like in the previous section:


doubt\thamlet@111141,romeoandjuliet@23445,henryv@426917

Awk's main advantage is conciseness and raw text processing power over other scripting languages and Java. Other languages, like Python and Perl, are supported if they are installed in the Data Nodes. It's all about balancing speed of development and deployment vs. speed of execution.

Job Driver

hadoop jar hadoop-streaming.jar -mapper shakemapper.awk
-reducer shakereducer.awk -input hdfs://node/shakespeareworks

Performance Tradeoff
Figure5

Hot Tip

The streamed awk and Java implementations are functionally equivalent, and the awk version is only about 5% slower. This may be a good tradeoff if the scripted version is significantly faster to develop and is continuously maintained.

STAYING CURRENT

Do you want to know about specific projects and use cases where NoSQL and data scalability are the hot topics? Join the scalability newsletter:

http://eugeneciurana.com/scalablesystems

About The Authors

Eugene Ciurana

Eugene Ciurana (http://eugeneciurana.com) is an open-source evangelist who specializes in the design and implementation of mission-critical, high-availability large scale systems. Over the last two years, Eugene designed and built hybrid cloud scalable systems and computational networks for leading financial, software, insurance, and healthcare companies in the US, Japan, Mexico, and Europe.

Publications

  • Developing with Google App Engine, Apress
  • DZone Refcard #105: NoSQL and Data Scalability
  • DZone Refcard #43: Scalability and High Availability
  • DZone Refcard #38: SOA Patterns
  • The Tesla Testament: A Thriller, CIMEntertainment

Masoud Kalali

Masoud Kalali (http://kalali.me) is a software engineer and author. He has been working on software development projects since 1998. He is experienced in a variety of technologies and platforms.

Masoud is the author of several DZone Refcardz, including: Using XML in Java, Berkeley DB Java Edition, Java EE Security, and GlassFish v3. Masoud is also the author of a book on GlassFish Security published by Packt. He is one of the founding members of the NetBeans Dream Team and is a GlassFish community spotlighted developer.

Recommended Book

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Getting Started with Maven Repository Management

By Jason Van Zyl

20,297 Downloads · Refcard 98 of 151 (see them all)

The Essential Maven Repository Cheat Sheet

Maven repositories provide a standard for storing and serving binary software. Maven and other tools such as Ivy interact with repositories to search for binary software artifacts, locate project dependencies, and retrieve software artifacts from a repository. Maven repository managers serve two purposes: they act as highly configurable proxies between your organization and the public Maven repositories, and they provide an organization with a deployment destination for its own generated artifacts. This DZone Refcard is an in-depth introduction to Maven repository management. It covers everything from what repository management is to proxy repositories, hosted repositories, and repository groups.

MAVEN REPOSITORY MANAGEMENT

A Maven repository provides a standard for storing and serving binary software. Maven and other tools such as Ivy interact with repositories to search for binary software artifacts, locate project dependencies, and retrieve software artifacts from a repository.

Maven Repository managers serve two purposes: they act as highly configurable proxies between your organization and the public Maven repositories and they also provide an organization with a deployment destination for your own generated artifacts.

Proxy Remote Repositories

When you proxy a remote repository, your repository manager accepts requests for artifacts from clients. If the artifact is not already cached, the repository manager will retrieve the artifact from the remote repository and cache the artifact. Subsequent requests for the same artifact will be served from the local cache.
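
The caching behavior can be sketched in a few lines of Python. This is a toy model rather than how any particular repository manager is implemented: the ProxyRepository class, the fake_remote function, and the artifact path are all invented for the example.

```python
class ProxyRepository:
    """Serve artifacts from a local cache, fetching from the remote on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote
        self._cache = {}
        self.remote_requests = 0

    def get(self, artifact_path):
        if artifact_path not in self._cache:
            # Cache miss: retrieve from the remote repository and keep a copy
            self.remote_requests += 1
            self._cache[artifact_path] = self._fetch_remote(artifact_path)
        return self._cache[artifact_path]

def fake_remote(path):
    # Stand-in for a download from a public repository
    return ("bytes of " + path).encode("utf-8")

proxy = ProxyRepository(fake_remote)
path = "commons-lang/commons-lang/2.4/commons-lang-2.4.jar"
first = proxy.get(path)    # fetched from the remote and cached
second = proxy.get(path)   # served from the local cache
print(proxy.remote_requests)
```

Only the first request reaches the remote repository, which is why builds behind a proxy keep working when the remote is slow or unreachable.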

Hosted Internal Repositories

When you host a repository, your repository manager takes care of organizing, storing, and serving binary artifacts. You can use a hosted, internal repository to store internal release artifacts, snapshot artifacts, or 3rd party artifacts.

Release Artifacts

These are specific, point-in-time releases. Released artifacts are considered to be solid, stable, and perpetual in order to guarantee that builds which depend upon them are repeatable over time. Released JAR artifacts are associated with PGP signatures and checksums that verify both the authenticity and integrity of the binary software artifact. The Central Maven repository stores release artifacts.

Snapshot Artifacts

Snapshots capture a work in progress and are used during development. A Snapshot artifact has both a version number such as “1.3.0” or “1.3” and a timestamp. For example, a snapshot artifact for commons-lang 1.3.0 might have the name commons-lang-1.3.0-20090314.182342-1.jar.
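
As a rough illustration of the naming scheme above, the following Python sketch splits a timestamped snapshot file name into its parts. The function name, the regular expression, and the assumption that the version is purely numeric and dot-separated are illustrative choices, not part of Maven or of any repository manager.

```python
import re
from datetime import datetime

# Hypothetical pattern: <artifactId>-<version>-<yyyyMMdd.HHmmss>-<build>.<ext>
# It assumes a numeric, dot-separated version such as "1.3.0" or "1.3".
SNAPSHOT_NAME = re.compile(
    r"^(?P<artifact>.+)-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<timestamp>\d{8}\.\d{6})-(?P<build>\d+)\.(?P<ext>\w+)$"
)


def parse_snapshot_name(file_name):
    """Return the parts of a timestamped snapshot file name, or None."""
    match = SNAPSHOT_NAME.match(file_name)
    if match is None:
        return None
    return {
        "artifact": match.group("artifact"),
        "version": match.group("version"),
        # The timestamp records when the snapshot was deployed.
        "deployed_at": datetime.strptime(match.group("timestamp"), "%Y%m%d.%H%M%S"),
        "build": int(match.group("build")),
        "extension": match.group("ext"),
    }


parsed = parse_snapshot_name("commons-lang-1.3.0-20090314.182342-1.jar")
print(parsed["artifact"], parsed["version"], parsed["build"])
# prints: commons-lang 1.3.0 1
```

A release file name such as commons-lang-1.3.0.jar carries no timestamp, so this sketch returns None for it.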

Reasons to Use a Repository Manager

  • Builds will run much faster as they will be downloading artifacts from a local cache.
  • Builds will be more stable because you will not be relying on external resources. If your internet connection becomes unavailable, your builds will rely on a local cache of artifacts from a remote repository.
  • You can deploy 3rd party artifacts to your repository manager. If you have a proprietary JDBC driver, add it to an internal 3rd party repository so developers can add it as a project dependency without having to manually install it in a local repository.
  • It will be easier to collaborate and distribute software internally. Instead of sending other developers instructions for checking out source from source control and building entire applications from source, publish artifacts to an internal repository and share binary artifacts.
  • If you are deploying software to the public, the fastest way to get your users productive is with a standard Maven repository.
  • You can control which artifacts and repositories are referenced by your projects.

Additional Features and Benefits

Searching and Indexing Artifacts: All repository managers provide an easy way to index and search software artifacts using the standard Nexus Indexer format.

Repository Groups: Repository managers can consolidate multiple repositories into a single repository group, making it easier to configure tools to retrieve artifacts from a single URL.

Procuring External Artifacts: Organizations often want some control over what artifacts are allowed into the organization. Many repository managers allow administrators to define lists of allowed and/or blocked repositories.

Staging and Release Management: Repository managers can also support decisions and workflow associated with software releases, sending email notifications to release managers, developers, and testers.

Security and LDAP Integration: Repository managers can be configured to verify artifacts downloaded from remote repositories and to integrate with external security providers such as LDAP.

Multiple Repository Formats: Repository managers can also automatically transform between various repository formats including OSGi Bundle repositories (OBR), P2 repositories, Maven repositories, and other repository formats.

REPOSITORY COORDINATES

Repositories store artifacts using a set of coordinates: groupId, artifactId, version, and packaging. The GAV coordinate standard is the foundation for Maven’s ability to manage dependencies.

Hot Tip

This set of coordinates is often referred to as a GAV coordinate, which is short for “Group, Artifact, Version coordinate.”

Coordinate: groupId

A group identifier groups a set of artifacts into a logical group. For example, software components being produced by the Maven project are available under the groupId org.apache.maven.

Coordinate: artifactId

An artifactId is simply a name for a software artifact. A simple web application project might have the artifactId “simple-webapp”, and a simple library might be “simple-library”. The combination of groupId and artifactId must be unique for a project.

Coordinate: version

A numerical version for a software artifact. For example, if your simple-library artifact has a major release version of 1, a minor release version of 2, and a point release version of 3, your version would be 1.2.3. Versions can also contain extra information to denote release status, such as “1.2-beta”.

Coordinate: packaging

Packaging describes the contents of the software artifact. While the most common artifact is a JAR, Maven repositories can store any type of binary software format, including ZIP, SWC, SWF, NAR, WAR, EAR, and SAR.

Addressing Resources in a Repository

Tools designed to interact with Maven repositories translate artifact coordinates into a URL which corresponds to a location in a Maven repository. If a tool such as Maven is looking for version 1.2.0 of the some-library JAR in the group com.example, this request is translated into:


/com/example/some-library/1.2.0/some-library-1.2.0.jar
/com/example/some-library/1.2.0/some-library-1.2.0.pom
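
The translation above can be sketched in a few lines of Python. This is a minimal illustration for release artifacts only; the function name and the `extension` parameter are hypothetical, and timestamped snapshot file names are not handled.

```python
def repository_path(group_id, artifact_id, version, extension="jar"):
    """Translate Maven coordinates into a repository-relative path."""
    # Dots in the groupId become directory separators.
    group_path = group_id.replace(".", "/")
    file_name = f"{artifact_id}-{version}.{extension}"
    return f"/{group_path}/{artifact_id}/{version}/{file_name}"


print(repository_path("com.example", "some-library", "1.2.0"))
# prints: /com/example/some-library/1.2.0/some-library-1.2.0.jar
print(repository_path("com.example", "some-library", "1.2.0", "pom"))
# prints: /com/example/some-library/1.2.0/some-library-1.2.0.pom
```

Appending the returned path to a repository's base URL gives the location a tool would request.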

PROJECT DEPENDENCIES

Build tools like Maven and Ivy allow you to define project dependencies using Maven coordinates.

Declaring Dependencies in Maven


<project>
  ...
  <dependencies>
    <dependency>
      <groupId>org.codehaus.xfire</groupId>
      <artifactId>xfire-java5</artifactId>
      <version>1.2.5</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  ...
</project>

REMOTE REPOSITORIES

Central Maven Repository

The Central Maven repository contains almost 90,000 software artifacts occupying around 100 GB of disk space. You can look at Central as an example of how Maven repositories operate and how they are assembled.
http://repo1.maven.org

Apache Snapshot Repository

The Apache Snapshot repository contains snapshot artifacts for projects in the Apache Software Foundation. http://repository.apache.org/snapshots/

Codehaus Snapshot Repository

The Codehaus Snapshot repository contains snapshot artifacts for projects hosted by Codehaus. http://nexus.codehaus.org/snapshots/

ABOUT NEXUS

Nexus manages software “artifacts” required for development, deployment, and provisioning. If you develop software, Nexus can help you share those artifacts with other developers and end-users. Maven’s central repository has always served as a great convenience for users of Maven, but it has always been recommended to maintain your own repositories to ensure stability within your organization. Nexus greatly simplifies the maintenance of your own internal repositories and access to external repositories. With Nexus you can completely control access to, and deployment of, every artifact in your organization from a single location.

Downloading Nexus Open Source

To download Nexus Open Source, go to http://nexus.sonatype.org and click on the Download menu item. Download the nexus-oss-webapp-1.6.0-bundle.tar.gz or nexus-oss-webapp-1.6.0-bundle.zip file from the Download directory.

Downloading Nexus Professional

To download Nexus Professional, go to http://www.sonatype.com/products/nexus and click on Download Nexus Pro. After you fill out a simple registration form, a download link will be sent via email.

Installing Java

Nexus Open Source and Nexus Professional only have one prerequisite, a Java Runtime Environment (JRE) compatible with Java 5 or higher. To download the latest release of the Sun JDK, go to http://developers.sun.com/downloads/.

Installing Nexus

Unpack the Nexus distribution in any directory. Nexus doesn’t have any hard-coded directories; it will run from any directory. If you downloaded the ZIP archive, run:


$ unzip nexus-oss-webapp-1.6.0-bundle.zip

And, if you downloaded the GZip’d TAR archive, run:


$ tar xvzf nexus-oss-webapp-1.6.0-bundle.tar.gz

This will create two directories nexus-webapp-1.6.0/ and sonatype-work/.

The Sonatype Work Directory

The Nexus installation directory nexus-webapp-1.6.0 has a sibling directory named sonatype-work/. This directory contains all of the repository and configuration data for Nexus and is stored outside of the Nexus installation directory to make it easier to upgrade to a newer version of Nexus.

RUNNING NEXUS

When you start Nexus for the first time, it will be running on http://localhost:8081/nexus/. To start Nexus, find the appropriate startup script for your platform in the ${NEXUS_HOME}/bin/jsw directory.

Starting Nexus

The following example listing starts Nexus using the script for Mac OS X. The Mac OS X wrapper is started with a call to nexus start:


$ cd ~/nexus-webapp-1.6.0
$ ls ./bin/jsw/
aix-ppc-32/ linux-ppc-64/ solaris-sparc-32/
aix-ppc-64/ linux-x86-32/ solaris-sparc-64/
hpux-parisc-32/ linux-x86-64/ solaris-x86-32/
hpux-parisc-64/ macosx-universal-32/ windows-x86-32/
$ chmod -R a+x bin
$ ./bin/jsw/macosx-universal-32/nexus start
Nexus Repository Manager...
$ tail -f logs/wrapper.log
INFO ... [ServletContainer:default] -SelectChannelConnector@0.0.0.0:8081


Configuring Nexus as a Service

When installing Nexus, you will often want to configure Nexus as a service. To configure Nexus as a service on Windows:

  • (A) Open a Command Prompt
  • (B) Change directories to C:/Program Files/nexus-webapp-1.6.0
  • (C) Change directories to bin/jsw/windows-x86-32
  • (D) Run InstallNexus.bat to install Nexus as a Windows Service
  • (E) Run “net start nexus-webapp” to start the Nexus service

To configure Nexus as a service on Linux:

  • (A) Copy bin/jsw/$PLATFORM/nexus to /etc/init.d
  • (B) chmod 755 /etc/init.d/nexus
  • (C) Edit the startup script changing APP_NAME, APP_LONG_NAME, NEXUS_HOME, PLATFORM, WRAPPER_CMD, and WRAPPER_CONF
  • (D) (Optional) Set the RUN_AS_USER to “nexus”

Login to Nexus

To use Nexus, fire up a web browser and go to: http://localhost:8081/nexus. Click on the “Log In” link in the upper right-hand corner of the web page, and you should see the login dialog.

Hot Tip

The default Nexus username and password are “admin” and “admin123”.

Post-install Checklist

After installing Nexus make sure to make the following configuration changes.

  • Change the Administrative Password by clicking on Security -> Users. Right-click on the admin user and choose “Set Password”.
  • Configure the SMTP Settings by selecting Administration -> Server and filling out the SMTP server information.
  • Enable Remote Index Downloads for the Central Maven Repository. Click on Views/Repositories -> Repositories. Select the “Maven Central” repository and open up the Configuration tab. Under Remote Repository Access set Download Remote Indexes to true.
  • Install Professional License (Nexus Professional Only). Select Administration -> Licensing and upload your Nexus Professional License.

CONFIGURING MAVEN FOR NEXUS

To use Nexus, you will configure Maven to check Nexus instead of the public repositories. To do this, you’ll need to edit your mirror settings in your ~/.m2/settings.xml file.

Update your Maven Settings

Place the following XML into a file named ~/.m2/settings.xml. This Maven Settings file configures your Maven builds to fetch artifacts from the public group of the Nexus installation available at http://localhost:8081/nexus/


<settings>
  <mirrors>
    <mirror>
      <!--This sends everything else to /public -->
      <id>nexus</id>
      <mirrorOf>*</mirrorOf>
      <url>http://localhost:8081/nexus/content/groups/public</url>
    </mirror>
  </mirrors>
  <profiles>
    <profile>
      <id>nexus</id>
      <repositories>
        <repository>
          <id>central</id>
          <url>http://central</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>
      </repositories>
      <pluginRepositories>
        <pluginRepository>
          <id>central</id>
          <url>http://central</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </pluginRepository>
      </pluginRepositories>
    </profile>
  </profiles>
  <activeProfiles>
    <!--make the profile active all the time -->
    <activeProfile>nexus</activeProfile>
  </activeProfiles>
</settings>

Deploying Artifacts to Nexus

To deploy artifacts to Nexus you must set server credentials in your Maven Settings and configure your project’s POM to publish to Nexus. Using the default deployment user’s credentials, put the following server element in your Maven Settings XML stored in ~/.m2/settings.xml


<settings>
  …
  <servers>
    <server>
      <id>releases</id>
      <username>deployment</username>
      <password>deployment123</password>
    </server>
    <server>
      <id>snapshots</id>
      <username>deployment</username>
      <password>deployment123</password>
    </server>
  </servers>
  …
</settings>

And, add the following XML to your Maven project’s pom.xml:


<distributionManagement>
  <repository>
    <id>releases</id>
    <name>Releases Repository</name>
    <url>http://localhost:8081/nexus/content/repositories/releases</url>
  </repository>
  <snapshotRepository>
    <id>snapshots</id>
    <name>Snapshot Repository</name>
    <url>http://localhost:8081/nexus/content/repositories/snapshots</url>
  </snapshotRepository>
</distributionManagement>

This configures your Maven build to deploy snapshots to the hosted Snapshots repository and releases to the hosted Releases repository. When Maven performs the deployment, it will match the id element of the repository with the id element of the server in the settings.xml and send the appropriate credentials.

Hot Tip

The default deployment user is deployment and the default password is deployment123.

PROXY REPOSITORIES

This section details working with Proxy Repositories.

What is a Proxy Repository?

A proxy repository sits between your builds and a remote repository like the Central Maven repository. When you ask a proxy repository for an artifact, it checks a local cache of artifacts it has already downloaded. If it does not have the artifact requested, it will retrieve the artifact from the remote repository.

Proxy repositories speed up your builds by serving frequently used artifacts from a local cache. They also provide more stability when your internet connection or the remote repository becomes unavailable.
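
The caching behavior described here can be sketched as follows. This is a simplified Python illustration, not how Nexus is implemented: the class name, the in-memory dictionary standing in for the local cache, and the `fetch_remote` callable are all hypothetical.

```python
class ProxyRepository:
    """Serve artifacts from a local cache, fetching from the remote on a miss."""

    def __init__(self, fetch_remote):
        # fetch_remote is a hypothetical callable: path -> artifact bytes.
        self._fetch_remote = fetch_remote
        self._cache = {}

    def get(self, path):
        if path not in self._cache:
            # Cache miss: retrieve from the remote repository and keep a copy.
            self._cache[path] = self._fetch_remote(path)
        # Cache hit (or freshly cached): serve the local copy.
        return self._cache[path]


remote_requests = []


def fake_remote(path):
    """Stand-in for a remote repository that records each request."""
    remote_requests.append(path)
    return b"contents of " + path.encode()


proxy = ProxyRepository(fake_remote)
proxy.get("/junit/junit/3.8.1/junit-3.8.1.jar")
proxy.get("/junit/junit/3.8.1/junit-3.8.1.jar")
print(len(remote_requests))
# prints: 1
```

The second request never reaches the remote, which is why a proxy keeps serving already-cached artifacts when the remote repository is unreachable.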

Adding a New Proxy Repository

To add a new Proxy Repository, go to Views/Repositories -> Repositories, click on the Add button, and select Proxy Repository from the drop-down. The New Proxy Repository form is then displayed.

Supply a unique identifier and name, choose a Repository Policy of either Release or Snapshot, and provide the URL of the remote repository in the Remote Storage Location.

Enabling Remote Index Downloads

While Nexus is preconfigured with the Central Maven repository, it is not configured to download indexes from remote repositories. Enabling indexes is essential if you want to take full advantage of Nexus’ search interface. To enable Remote Index Downloads, go to Views/Repositories -> Repositories, select the Maven Central repository, and click on the Configuration tab. Set “Download Remote Indexes” to true and click on Save. Nexus will then download the repository index from the remote repository. This process may take a few minutes depending on the speed of your connection.

If the remote index has been successfully downloaded and processed, the Browse Index tab for the Maven Central repository will display thousands of artifacts.

HOSTED REPOSITORIES

What is a Hosted Repository?

A Hosted Repository contains artifacts which have been published to a Nexus instance. These published artifacts are stored in the Sonatype Work directory. This can include repositories that hold release artifacts and repositories that hold snapshot artifacts.

Nexus comes configured with three Hosted repositories: 3rd Party, Releases, and Snapshots. The Releases repository is for your own internal software release artifacts, and the Snapshots repository is for your own project’s snapshot artifacts. The 3rd Party repository is for 3rd party artifacts such as proprietary drivers or commercial libraries which are not available from a public Maven repository.

Adding a New Hosted Repository

REPOSITORY GROUPS

What is a Repository Group?

A repository group combines one or more repositories under a single repository URL. You use repository groups to simplify the configuration of tools like Maven which need to retrieve artifacts from a set of common repositories. As a Nexus administrator you can define new repositories, control which repositories are available in a group, and control the order in which artifacts are resolved from repositories in a group.

Adding Repositories to a Group

Nexus ships with a Public Repository Group which contains all of your hosted and proxy repositories. If you create a new repository, and you need to add this repository to the Public Group, go to Views/Repositories -> Repositories and select the Configuration tab.

To add a repository to a repository group, drag a repository from the “Available Repositories” list to the “Ordered Group Repositories” list and click on the Save button.

Reordering Repositories in a Group

When Nexus resolves an artifact from a Repository Group, it iterates over the repositories in the group, returning the first match. If an artifact exists in more than one repository, you may need to change the order of repositories in a Repository Group. To change the order, go to Views/Repositories -> Repositories, select the group you need to reorder, and then select the Configuration tab. To reorder repositories, click and drag repositories to the correct order in the Ordered Group Repositories field and then click Save.
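
To see why ordering matters, consider this small Python sketch of first-match resolution. It is an illustration only: the function name and the use of dictionaries as stand-ins for repositories are assumptions, not Nexus internals.

```python
def resolve_from_group(ordered_repositories, path):
    """Return (repository name, artifact) for the first repository holding path.

    ordered_repositories is a list of (name, contents) pairs in group order,
    where contents is a dict mapping repository paths to artifacts.
    """
    for name, contents in ordered_repositories:
        if path in contents:
            return name, contents[path]
    return None  # No repository in the group has the artifact.


path = "/com/example/some-library/1.2.0/some-library-1.2.0.jar"
releases = {path: "internal build"}
central = {path: "public build"}

print(resolve_from_group([("Releases", releases), ("Maven Central", central)], path))
# prints: ('Releases', 'internal build')
print(resolve_from_group([("Maven Central", central), ("Releases", releases)], path))
# prints: ('Maven Central', 'public build')
```

The same request returns a different artifact once the two repositories swap places, which is the situation reordering is meant to resolve.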

NEXUS ADMINISTRATION

Configuring Nexus Server

To configure Sonatype Nexus, click on Administration -> Server. This will load the Nexus configuration panel. The following is a list of some of the configuration sections in this panel:

SMTP Settings: Nexus supports release and deployment using email. Before Nexus can send emails, you will need to configure the appropriate SMTP settings in this section.

HTTP Request Settings: Configure custom timeouts and retry behavior for remote repositories as well as customize the Nexus User Agent.

Security Settings: Nexus’ pluggable security providers are configured in this section. You can control which security realms are active and the order in which they are consulted during authentication and authorization.

Anonymous Access: Control how and if Nexus is made available to anonymous, unauthenticated users.

Application Server Settings: If Nexus is hosted behind a proxy, or if you need to customize the URL, you can do so here.

System Notifications Settings: Configure automatic email notifications for important system events.

Configuring Scheduled Tasks

If you are publishing snapshot releases to Nexus, you will want to configure at least one scheduled task to periodically delete older snapshot releases. To configure a Scheduled Task, click on Administration -> Scheduled Tasks, and click on the Add button. Select the appropriate Task Type. Some of the more common and useful Task Types follow:

Backup All Nexus Configuration Files: Will cause Nexus to create a snapshot of all Nexus configuration files.

Download Indexes: Nexus will retrieve or update indexes for all remote, proxy repositories.

Evict Unused Proxy Items: If space is at a premium, you can configure Nexus to remove proxy items which have not been used within a specific time period.

Remove Snapshots from Repository: Nexus can be configured to keep a minimum number of snapshots and to delete snapshots older than a specific time period.

Scheduled tasks can be configured to send an email alert when they are executed, and you can schedule a task to run Once, Hourly, Daily, Weekly, Monthly, or using a custom cron expression.
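
One plausible reading of the snapshot removal settings is sketched below in Python. How Nexus actually combines the minimum count with the age limit is an assumption here, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_remove(deploy_times, min_to_keep, max_age_days, now):
    """Pick snapshot deploy times eligible for removal.

    Assumed policy: always keep the min_to_keep newest snapshots, then
    remove any remaining snapshot older than max_age_days.
    """
    newest_first = sorted(deploy_times, reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    beyond_minimum = newest_first[min_to_keep:]
    return [deployed for deployed in beyond_minimum if deployed < cutoff]


deploys = [
    datetime(2009, 3, 1),
    datetime(2009, 4, 1),
    datetime(2009, 5, 1),
    datetime(2009, 5, 30),
]
old = snapshots_to_remove(deploys, min_to_keep=2, max_age_days=30,
                          now=datetime(2009, 6, 1))
print([d.date().isoformat() for d in old])
# prints: ['2009-04-01', '2009-03-01']
```

Under this reading the May 1 snapshot survives despite its age, because it is one of the two newest.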

Defining Repository Routes

Repository routes allow you to direct requests matching specific patterns to specific repositories. For example, if you wanted to make sure all requests for artifacts under org.some-oss were directed to the internal, hosted Releases and Snapshots repositories, you would define the following route:


Type: Inclusive
URL Pattern: .*/org/some-oss/.*
Repositories: Releases, Snapshots

To define a Repository Route, go to Administration -> Routing. The Routing panel is where you can edit existing routes and create additional routes.
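
The effect of an inclusive route like the one above can be sketched in Python. This is an illustration under stated assumptions: the data layout and function name are hypothetical, the URL pattern is assumed to be matched against the whole request path, and only the inclusive route type from the example is modeled.

```python
import re


def candidate_repositories(group_repositories, routes, request_path):
    """Narrow the repositories consulted for a request using inclusive routes."""
    candidates = list(group_repositories)
    for route in routes:
        # Assumption: the pattern must match the entire request path.
        if re.fullmatch(route["pattern"], request_path):
            # Inclusive: only the listed repositories are consulted.
            candidates = [r for r in candidates if r in route["repositories"]]
    return candidates


group = ["Releases", "Snapshots", "3rd Party", "Maven Central"]
routes = [
    {
        "type": "inclusive",
        "pattern": r".*/org/some-oss/.*",
        "repositories": ["Releases", "Snapshots"],
    }
]

print(candidate_repositories(group, routes, "/org/some-oss/widget/1.0/widget-1.0.jar"))
# prints: ['Releases', 'Snapshots']
print(candidate_repositories(group, routes, "/junit/junit/3.8.1/junit-3.8.1.jar"))
# prints: ['Releases', 'Snapshots', '3rd Party', 'Maven Central']
```

Requests that do not match the pattern are unaffected and still consult every repository in the group.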

Configuring Nexus Security

Nexus Security has a highly configurable Role-based Access Control system which relies on Privileges, Roles, and Users. By default, Nexus ships with default admin, deployment, and anonymous users along with associated roles. To configure a new Nexus user, go to Security -> Users and open up the Users panel. On the Users panel, click on the Add button to add a new Nexus user. Once the user is created, click on the user to edit the user’s email address or to assign the user new Nexus roles.

To create or edit roles, click on Security -> Roles. Most of the default roles cannot be edited directly, but you are free to create new, custom roles by clicking on the Add button. Once a role is created, you can assign it new privileges by dragging Roles and Privileges from the Available Roles/Privileges list to the Selected Roles/Privileges list and clicking on the Save button.

NEXUS PROFESSIONAL

Nexus Professional is a central point of access to external repositories which provides the necessary controls to make sure that only approved artifacts enter into your software development environment. Central features of Nexus Professional are:

Nexus Procurement Suite: Gives Nexus administrators control of what artifacts are allowed into an organization from a remote repository.

Nexus Staging Suite: Provides workflow support for software releases. Artifacts can be deployed to staging repositories, tested, and promoted only after they have been tested and certified.

Hosting Project Web Sites: With Nexus Professional, you can publish Maven project sites directly to your repository manager.

Support for OSGi Repositories: Nexus Professional supports OBR and P2 repositories used in OSGi and Eclipse development.

Enterprise LDAP Support: Nexus Professional adds support for LDAP clustering and supports mixed authentication configurations for multiple sources of security information, including Atlassian’s Crowd server.

In addition to these features, Nexus Pro also adds support for Artifact Bundles, Centralized Management of Maven Settings, Custom Repository Metadata, Self-serve User Account Sign-up, and Artifact Validation and Verification.

OTHER NEXUS RESOURCES

For more information about Sonatype’s Nexus, see the following resources:

Free Nexus Book:
http://books.sonatype.com/nexus-book

Nexus OSS Site:
http://nexus.sonatype.org

Nexus Pro Site:
http://www.sonatype.com/products/nexus

Participate in the Nexus Community

Everyone is welcome to participate in the Nexus community as a developer or user. To participate, take advantage of the following resources:

Nexus IRC Channel:
#nexus on irc.codehaus.org:6667

Subscribe to the Nexus User Mailing List:
nexus-user-subscribe@sonatype.org

Subscribe to the Nexus Developer Mailing List:
nexus-dev-subscribe@sonatype.org

Subscribe to the Nexus Pro User Mailing List:
nexus-pro-users-subscribe@sonatype.org

Checkout Nexus Source Code from Subversion:
http://svn.sonatype.org/nexus/trunk

Browse the Nexus JIRA Project:
https://issues.sonatype.org/browse/NEXUS

About The Authors

Jason Van Zyl

Jason Van Zyl is the founder and CTO of Sonatype, the Maven company, and founder of the Apache Maven Project, the Plexus IoC framework, and the Apache Velocity project.

Recommended Book

Nexus

This book covers both Nexus Open Source and Nexus Professional, a product which brings full control and visibility to organizations which depend on Maven repositories to manage releases and distribute software.

