Apache Maven 2
By Matthew McCullough
DZone Refcard 55 of 151
The Essential Maven 2 Cheat Sheet
ABOUT APACHE MAVEN
Maven is a comprehensive project information tool, whose most common application is building Java code. Maven is often considered an alternative to Ant, but as you’ll see in this Refcard, it offers unparalleled software lifecycle management, providing a cohesive suite of verification, compilation, testing, packaging, reporting, and deployment plugins.
Maven is receiving renewed recognition in the emerging development space for its convention over configuration approach to builds. This Refcard aims to give JVM platform developers a range of basic to advanced execution commands, tips for debugging Mavenized builds, and a clear introduction to the “Maven vocabulary”.
Interoperability and Extensibility
New Maven users are pleasantly surprised to find that Maven offers easy-to-write custom build-supplementing plugins, reuses any desired aspect of Ant, and can compile native C, C++, and .NET code in addition to its strong support for Java and JVM languages and platforms, such as Scala, JRuby, Groovy and Grails.

THE MVN COMMAND
Maven supplies a Unix shell script and MSDOS batch file named mvn and mvn.bat respectively. This command is used to start all Maven builds. Optional parameters are supplied in a space-delimited fashion. An example of cleaning and packaging a project, then running it in a Jetty servlet container, yet skipping the unit tests, reads as follows:
mvn clean package jetty:run -Dmaven.test.skip
PROJECT OBJECT MODEL
The world of Maven revolves around metadata files named pom.xml. A file of this name exists at the root of every Maven project and defines the plugins, paths and settings that supplement the Maven defaults for your project.
Basic pom.xml Syntax
The smallest valid pom.xml, which inherits the default artifact type of “jar”, reads as follows:
<project>
<modelVersion>4.0.0</modelVersion>
<groupId>com.ambientideas</groupId>
<artifactId>barestbones</artifactId>
<version>1.0-SNAPSHOT</version>
</project>
Super POM
The Super POM is a virtual pom.xml file that ships inside the core Maven JARs, and provides numerous default settings. All projects automatically inherit from the Super POM, much like the Object super class in Java. Its contents can be viewed in one of two ways:
View Super POM via SVN
Open the following SVN viewing URL in your web browser:
http://svn.apache.org/repos/asf/maven/components/branches/maven-2.1.x/pom.xml
View Super POM via effective-pom
Run the following command in a directory that contains the most minimal Maven project pom.xml, listed above.
mvn help:effective-pom
Multi-module Projects
Maven showcases exceptional support for componentization via its concept of multi-module builds. Place sub-projects in sub-folders beneath your top level project and reference each with a module tag. To build all sub projects, just execute your normal mvn command and goals from a prompt in the top-most directory.
<project>
<!-- ... -->
<packaging>pom</packaging>
<modules>
<module>servlets</module>
<module>ejbs</module>
<module>ear</module>
</modules>
</project>
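As a rough illustration of how the module list drives a multi-module build, the following Python sketch reads the module names out of a parent POM. It assumes a POM without an XML namespace declaration, like the snippet above; the function name list_modules is hypothetical and not part of Maven.

```python
import xml.etree.ElementTree as ET
from typing import List

# Parent POM modeled on the multi-module example above (no xmlns, for simplicity).
PARENT_POM = """<project>
  <packaging>pom</packaging>
  <modules>
    <module>servlets</module>
    <module>ejbs</module>
    <module>ear</module>
  </modules>
</project>"""


def list_modules(pom_xml: str) -> List[str]:
    """Return the sub-folder names declared in <modules>, in declaration order."""
    root = ET.fromstring(pom_xml)
    return [(m.text or "").strip() for m in root.findall("./modules/module")]


modules = list_modules(PARENT_POM)
print(modules)
```

Real-world POMs usually declare the Maven XML namespace, in which case the element lookups would need to be namespace-qualified.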
Artifact Vector
Each Maven project produces an element, such as a JAR, WAR or EAR, uniquely identified by a composite of fields known as groupId, artifactId, packaging, version and scope. This vector of fields uniquely distinguishes a Maven artifact from all others.
Many Maven reports and plugins print the details of a specific artifact in this colon separated fashion:
groupid:artifactid:packaging:version:scope
An example of this output for the core Spring JAR would be:
org.springframework:spring:jar:2.5.6:compile
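When post-processing report output, the colon-separated form can be split back into its fields. This is a minimal Python sketch, assuming exactly the five fields shown above; ArtifactVector and parse_artifact_vector are illustrative names, not a Maven API.

```python
from typing import NamedTuple


class ArtifactVector(NamedTuple):
    """The five fields of the colon-separated form shown above."""
    group_id: str
    artifact_id: str
    packaging: str
    version: str
    scope: str


def parse_artifact_vector(text: str) -> ArtifactVector:
    """Split 'groupId:artifactId:packaging:version:scope' into named fields."""
    parts = text.strip().split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(parts)}: {text!r}")
    return ArtifactVector(*parts)


spring = parse_artifact_vector("org.springframework:spring:jar:2.5.6:compile")
print(spring.artifact_id, spring.version)
```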
EXECUTION GROUPS
Maven divides execution into four nested hierarchies. From most-encompassing to most-specific, they are: Lifecycle, Phase, Plugin, and Goal.
Lifecycles, Phases, Plugins and Goals
Maven defines the concept of language-independent project build flows that model the steps that all software goes through during a compilation and deployment process.

Lifecycles represent a well-recognized flow of steps (Phases) used in software assembly.
Each step in a lifecycle flow is called a phase. Zero or more plugin goals are bound to a phase.
A plugin is a logical grouping and distribution (often a single JAR) of related goals, such as JARing.
A goal, the most granular step in Maven, is a single executable task within a plugin. For example, discrete goals in the jar plugin include packaging the jar (jar:jar), signing the jar (jar:sign), and verifying the signature (jar:sign-verify).
Executing a Phase or Goal
At the command prompt, either a phase or a plugin goal can be requested. Multiple phases or goals can be specified and are separated by spaces.
If you ask Maven to run a specific plugin goal, then only that goal is run. This example runs two plugin goals: compilation of code, then JARing the result, skipping over any intermediate steps.
mvn compiler:compile jar:jar
Conversely, if you ask Maven to execute a phase, all phases and bound plugin goals up to that point in the lifecycle are also executed. This example requests the deploy lifecycle phase, which will also execute the verification, compilation, testing and packaging phases.
mvn deploy
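The "everything up to that point" rule can be modeled as a prefix of an ordered list. The Python sketch below is only a model of that ordering, not Maven's implementation; the phase names are those listed in the Default Lifecycle table of this Refcard, and phases_to_run is a hypothetical helper.

```python
from typing import List

# Default lifecycle phases, in execution order.
DEFAULT_LIFECYCLE = [
    "validate", "initialize",
    "generate-sources", "process-sources",
    "generate-resources", "process-resources",
    "compile", "process-classes",
    "generate-test-sources", "process-test-sources",
    "generate-test-resources", "process-test-resources",
    "test-compile", "test",
    "prepare-package", "package",
    "pre-integration-test", "integration-test", "post-integration-test",
    "verify", "install", "deploy",
]


def phases_to_run(requested: str) -> List[str]:
    """Return every phase from the start of the lifecycle through the requested one."""
    index = DEFAULT_LIFECYCLE.index(requested)  # raises ValueError for unknown phases
    return DEFAULT_LIFECYCLE[: index + 1]


print(phases_to_run("compile"))
```

Requesting package therefore includes test, but not install or deploy.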
Online and Offline
During a build, Maven attempts to download any uncached referenced artifacts and proceeds to cache them in the ~/.m2/repository directory on Unix, or in the %USERPROFILE%/.m2/repository directory on Windows.
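To locate a cached artifact by hand, it helps to know that the local repository nests folders by groupId segments, artifactId and version. The Python sketch below computes the expected path under that layout; it assumes the file extension equals the packaging (true for jar, not for every packaging type), and local_repo_path is a hypothetical helper.

```python
from pathlib import PurePosixPath


def local_repo_path(group_id: str, artifact_id: str, version: str,
                    packaging: str = "jar",
                    repo_root: str = "~/.m2/repository") -> PurePosixPath:
    """Build the expected cache location of an artifact in the local repository."""
    file_name = f"{artifact_id}-{version}.{packaging}"  # assumption: extension == packaging
    return PurePosixPath(repo_root, *group_id.split("."), artifact_id, version, file_name)


spring_jar = local_repo_path("org.springframework", "spring", "2.5.6")
print(spring_jar)
```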
To prepare for compiling offline, you can instruct Maven to download all referenced artifacts from the Internet via the command:
mvn dependency:go-offline
If all required artifacts and plugins have been cached in your local repository, you can instruct Maven to run in offline mode with a simple flag:
mvn <phase or goal> -o
Built-in Maven Lifecycles
Maven ships with three lifecycles: clean, default, and site. Many of the phases within these three lifecycles are bound to a sensible plugin goal.

The clean lifecycle is simple in nature. It deletes all generated and compiled artifacts in the output directory.
| Clean Lifecycle | |
| Lifecycle Phase | Purpose |
| pre-clean | |
| clean | Remove all generated and compiled artifacts in preparation for a fresh build. |
| post-clean | |
| Default Lifecycle | |
| Lifecycle Phase | Purpose |
| validate | Cross check that all elements necessary for the build are correct and present. |
| initialize | Set up and bootstrap the build process. |
| generate-sources | Generate dynamic source code |
| process-sources | Filter, sed and copy source code |
| generate-resources | Generate dynamic resources |
| process-resources | Filter, sed and copy resource files. |
| compile | Compile the primary or mixed language source files. |
| process-classes | Augment compiled classes, such as for code-coverage instrumentation. |
| generate-test-sources | Generate dynamic unit test source code. |
| process-test-sources | Filter, sed and copy unit test source code. |
| generate-test-resources | Generate dynamic unit test resources. |
| process-test-resources | Filter, sed and copy unit test resources. |
| test-compile | Compile unit test source files |
| test | Execute unit tests |
| prepare-package | Manipulate generated artifacts immediately prior to packaging. (Maven 2.1 and above) |
| package | Bundle the module or application into a distributable package (commonly, JAR, WAR, or EAR). |
| pre-integration-test | |
| integration-test | Execute tests that require connectivity to external resources or other components |
| post-integration-test | |
| verify | Inspect and cross-check the distribution package (JAR, WAR, EAR) for correctness. |
| install | Place the package in the user’s local Maven repository. |
| deploy | Upload the package to a remote Maven repository |
The site lifecycle generates a project information web site, and can deploy the artifacts to a specified web server or local path.
| Site Lifecycle | |
| Lifecycle Phase | Purpose |
| pre-site | Perform any work required before the project site is generated. |
| site | Generate an HTML web site containing project information and reports. |
| post-site | |
| site-deploy | Upload the generated website to a web server |
Default Goal
The default goal codifies the author’s intended usage of the build script. Only one goal or lifecycle can be set as the default. The most common default goal is install.
<project>
[...]
<build>
<defaultGoal>install</defaultGoal>
</build>
[...]
</project>
HELP
Help for a Plugin
Lists all the possible goals for a given plugin and any associated documentation.
mvn help:describe -Dplugin=<pluginname>
Help for POMs
To view the composite pom that’s a result of all inherited poms:
mvn help:effective-pom
Help for Profiles
To view all profiles that are active from either manual or automatic activation:
mvn help:active-profiles
DEPENDENCIES
Declaring a Dependency
To express your project’s reliance on a particular artifact, you declare a dependency in the project’s pom.xml.

<project>
<dependencies>
<dependency>
<groupId>com.yourcompany</groupId>
<artifactId>yourlib</artifactId>
<version>1.0</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
</dependencies>
<!-- ... -->
</project>
Standard Scopes
Each dependency can specify a scope, which controls its visibility and inclusion in the final packaged artifact, such as a WAR or EAR. Scoping enables you to minimize the JARs that ship with your product.
| Scope | Description |
| compile | Needed for compilation, included in packages. |
| test | Needed for unit tests, not included in packages. |
| provided | Needed for compilation, but provided at runtime by the runtime container. |
| system | Needed for compilation, given as absolute path on disk, and not included in packages. |
| import | An inline inclusion of a POM-type artifact facilitating dependency-declaring POM snippets. |
PLUGINS
Adding a Plugin
A plugin and its configuration are added via a small declaration, very similar to a dependency, in the <build> section of your pom.xml.
<build>
<!-- ... -->
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<maxmem>512m</maxmem>
</configuration>
</plugin>
</plugins>
</build>
Common Plugins
Maven coined the term Mojo for its plugin classes: a play on POJO (“Plain Old Java Object”) that stands for “Maven plain Old Java Object”.
There are dozens of Maven plugins, but a handful constitute some of the most valuable, yet underused features:
| surefire | Runs unit tests. |
| checkstyle | Checks the code’s style. |
| clover | Evaluates code coverage. |
| enforcer | Verifies many types of environmental conditions as prerequisites. |
| assembly | Creates ZIPs and other distribution packages of apps and their transitive dependency JARs. |

VISUALIZE DEPENDENCIES
Users often mention that the most challenging task is identifying dependencies: why they are being included, where they are coming from and if there are collisions. Maven has a suite of goals to assist with this.
List a hierarchy of dependencies.
mvn dependency:tree
List dependencies in alphabetic form.
mvn dependency:resolve
List plugin dependencies in alphabetic form.
mvn dependency:resolve-plugins
Analyze dependencies and list any that are declared but unused, or used but undeclared.
mvn dependency:analyze
REPOSITORIES
Repositories are the web sites that host collections of Maven plugins and dependencies.
Declaring a Repository
<repositories>
<repository>
<id>JavaDotNetRepo</id>
<url>https://maven-repository.dev.java.net</url>
</repository>
</repositories>
The Maven community strongly recommends using a repository manager such as Nexus to define all repositories. This results in cleaner pom.xml files and centrally cached and managed connections to external artifact sources. Nexus can be downloaded from http://nexus.sonatype.org/
Popular Repositories
| Central | http://repo1.maven.org/maven2/ |
| Java.net | https://maven-repository.dev.java.net/ |
| Codehaus | http://repository.codehaus.org/ |
| JBoss | http://repository.jboss.org/maven2 |

PROPERTY VARIABLES
A wide range of predefined or custom property variables can be used anywhere in your pom.xml files to keep string and path repetition to a minimum.
All properties in Maven begin with ${ and end with }. To list all available properties, run the following command.
mvn help:expressions
Predefined Properties (Partial List)
| ${env.PATH} | Any OS environment variable such as EDITOR, or GROOVY_HOME. Specifically, the PATH environment variable. |
| ${project.groupId} | Any project node from the aggregated Maven pom.xml. Specifically, the Group ID of the project |
| ${project.artifactId} | Name of the artifact. |
| ${project.basedir} | Directory that contains the pom.xml. |
| ${settings.localRepository} | The path to the user’s local repository. |
| ${java.home} | Any Java System Property. Specifically, the Java System Property path to its home. |
| ${java.vendor} | The Java System Property declaring the JRE vendor’s name. |
| ${my.somevar} | A user-defined variable. |
Project properties could previously be referenced with a pom prefix (${pom.basedir}) or with no prefix at all (${basedir}). Maven now requires that you prefix these variables with the word project, as in ${project.basedir}.
Define a Property
You can define a new custom property in your pom.xml like so:
<project>
[...]
<properties>
<my.somevar>My Value</my.somevar>
</properties>
[...]
</project>
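The ${...} substitution can be pictured as a simple lookup. This Python sketch is an illustration only, under the assumption that unknown keys are left untouched and that values are not themselves re-expanded; interpolate is a hypothetical helper, not Maven's interpolator.

```python
import re
from typing import Dict

# Matches ${some.property.name}
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def interpolate(text: str, properties: Dict[str, str]) -> str:
    """Replace each ${key} with its value; leave unknown keys as written."""
    def replace(match: "re.Match[str]") -> str:
        return properties.get(match.group(1), match.group(0))
    return _PLACEHOLDER.sub(replace, text)


props = {"my.somevar": "My Value", "project.artifactId": "barestbones"}
result = interpolate("${project.artifactId} uses ${my.somevar} and ${unknown.key}", props)
print(result)
```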
DEBUGGING
Exception Full Stack Traces
If a Maven plugin is reporting an error, to see the full detail of the exception’s stack trace run Maven with the -e flag.
mvn <yourgoal> -e
Output Debugging Info
Whenever reporting a Maven bug, or troubleshooting a problem, turn on all the debugging info by running Maven like so:
mvn <yourgoal> -X
Debug Maven Core/Plugins
Core Maven operations and plugins can be stepped through with any JPDA-compatible debugger, the most common option being Eclipse. When run in debug mode, Maven will wait for you to connect your debugger to socket port 8000 before continuing with its lifecycle.
mvnDebug <yourgoal>
Preparing to Execute Maven in Debug Mode
Listening for transport dt_socket at address: 8000
Debug a Unit Test
Your suite or an individual unit test can be debugged in much the same fashion by telling the Surefire test-execution plugin to wait for you to attach a debugger to port 5005.
mvn test -Dmaven.surefire.debug
Listening for transport dt_socket at address: 5005
SOURCE CODE MANAGEMENT
Configuring SCM
Your project’s SCM connection can be quickly configured with just three XML tags, which adds significant capabilities to the scm, release, and reactor plugins.
The connection tag is your read-only view of your repository and developerConnection is the writable link. URL is your web-based view of the source.
<scm>
<connection>scm:svn:http://myvendor.com/ourrepo/trunk</connection>
<developerConnection>scm:svn:https://myvendor.com/ourrepo/trunk</developerConnection>
<url>http://myvendor.com/viewsource.pl</url>
</scm>

Using the SCM Plugin
The core SCM plugin offers two highly useful goals.
The diff command produces a standard Unix patch file with the extension .diff of the pending (uncommitted) changes on disk that can be emailed or attached to a bug report.
mvn scm:diff
The update-subprojects goal invokes a recursive scm-provider specific update (svn update, git pull) across all the submodules of a multimodule project.
mvn scm:update-subprojects
PROFILES
Profiles are a means to conditionally turn on portions of Maven configuration, including plugins, paths and settings.
The most common uses of profiles are for Windows/Unix platform-specific variations and build-time customization of JAR dependencies based on the use of a specific WebLogic, WebSphere or JBoss J2EE vendor.
<project>
[...]
<profiles>
<profile>
<id>YourProfile</id>
[...settings, build, plugins etc...]
<dependencies>
<dependency>
<groupId>com.yourcompany</groupId>
<artifactId>yourlib</artifactId>
</dependency>
</dependencies>
</profile>
</profiles>
[...]
</project>
Profile Definition Locations
Profiles can be defined in pom.xml, profiles.xml (parallel to the pom.xml), ~/.m2/settings.xml, or $M2_HOME/conf/settings.xml.

PROFILE ACTIVATION
Profiles can be activated manually from the command line or through an activation rule (OS, file existence, Maven version, etc.). Profiles are primarily additive, so best practices suggest leaving most off by default, and activating based on specific conditions.
Manual Profile Activation
mvn <yourgoal> -P YourProfile
Automatic Profile Activation
<project>
[...]
<profiles>
<profile>
<id>YourProfile</id>
[...settings, build, etc...]
<activation>
<os>
<name>Windows XP</name>
<family>Windows</family>
<arch>x86</arch>
<version>5.1.2600</version>
</os>
<file>
<missing>somefolder/somefile.txt</missing>
</file>
</activation>
</profile>
</profiles>
[...]
</project>
CUTTING A RELEASE
Maven offers excellent automation for cutting a release of your project. In short, this is a plugin-guided ceremony for verifying that all tests pass, tagging your source code repository, and altering the POMs to reflect a product version increment.
The prepare goal runs the unit tests, continuing only if all pass, then increments the value in the pom <version> tag to a release version, tags the source repository accordingly, and increments the pom version tag back to a SNAPSHOT version.
mvn release:prepare
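The version arithmetic of the prepare goal can be sketched as follows. This Python snippet is an illustration only: it assumes a purely numeric, dot-separated version whose last component is bumped for the next development cycle, whereas the plugin itself prompts for (and lets you override) both values. The name release_versions is hypothetical.

```python
from typing import Tuple

SNAPSHOT_SUFFIX = "-SNAPSHOT"


def release_versions(snapshot_version: str) -> Tuple[str, str]:
    """Return (release version, next development version) for a SNAPSHOT version."""
    if not snapshot_version.endswith(SNAPSHOT_SUFFIX):
        raise ValueError(f"not a SNAPSHOT version: {snapshot_version!r}")
    release = snapshot_version[: -len(SNAPSHOT_SUFFIX)]
    parts = release.split(".")
    parts[-1] = str(int(parts[-1]) + 1)  # assumption: last component is numeric
    return release, ".".join(parts) + SNAPSHOT_SUFFIX


release, next_dev = release_versions("1.0-SNAPSHOT")
print(release, next_dev)
```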
After a release has been successfully prepared, run the perform goal. This goal checks out the prepared release and deploys it to the POM’s specified remote Maven repository for consumption by other teams and Maven builds.
mvn release:perform
ARCHETYPES
An archetype is a powerful template that uses your corporate Java package names and project name in the instantiated project and establishes a baseline of dependencies, with a bonus of basic sample code.
You can leverage public archetypes for quickly starting a project that uses a familiar stack, such as Struts+Spring, or Tapestry+Hibernate. You can also create private archetypes within your company to offer new projects a level of consistent dependencies matching your approved corporate technology stack.
Using an Archetype
The default behavior of the generate goal is to bring up a menu of choices. You are then prompted for various replaceables such as package name and artifactId. Type this command, then answer each question at the command line prompt.
mvn archetype:generate
Creating Archetypes
An archetype can be created from an existing project, using it as the pattern by which to build the template. Run the command from the root of your existing project.
mvn archetype:create-from-project
Archetype Catalogs
The Maven Archetype plugin comes bundled with a default catalog of applications it can create, but other projects on the Internet also publish catalogs. To use an alternate catalog:
mvn archetype:generate -DarchetypeCatalog=<catalog>
A list of the most commonly used catalogs is as follows:
local
remote
http://repo.fusesource.com/maven2
http://cocoon.apache.org
http://download.java.net/maven/2
http://myfaces.apache.org
http://tapestry.formos.com/maven-repository
http://scala-tools.org
http://www.terracotta.org/download/reflector/maven2/
REPORTS
Maven has a robust offering of reporting plugins, commonly run with the site generation phase, that evaluate and aggregate information about the project, contributors, its source, tests, code coverage, and more.
Adding a Report Plugin
<reporting>
<plugins>
<plugin>
<artifactId>maven-javadoc-plugin</artifactId>
</plugin>
</plugins>
</reporting>

About The Author

Matthew McCullough
Matthew McCullough is an Open Source Architect with the Denver, Colorado consulting firm Ambient Ideas, LLC which he co-founded in 1997. He’s spent the last 13 years passionately aiming for ever-greater efficiencies in software development, all while exploring how to share these practices with his clients and their team members. Matthew is a nationally touring speaker on all things open source and has provided long term mentoring and architecture services to over 40 companies ranging from startups to Fortune 500 firms. Feedback and questions are always welcomed at matthewm@ambientideas.com
Recommended Book
Several sources for Maven have appeared online for some time, but nothing served as an introduction and comprehensive reference guide to this tool -- until now. Maven: The Definitive Guide is the ideal book to help you manage development projects for software, web applications, and enterprise applications. And it comes straight from the source.
Understanding Lucene
Powering Better Search Results
By Erik Hatcher
DZone Refcard 137 of 151
The Essential Apache Lucene Cheat Sheet
WHAT IS LUCENE?
The Lucene Ecosystem
“Lucene” is a broadly used term. It’s the original Java indexing and search library created by Doug Cutting. Lucene was then chosen as the name of a top-level Apache Software Foundation project (http://lucene.apache.org). The name is also used for various ports of the Java library to other languages (Lucene.Net, PyLucene, etc.). The following table shows the key projects at http://lucene.apache.org.
| Project | Description |
| Lucene - Java | Java-based indexing and search library. Also comes with extras such as highlighting, spellchecking, etc. |
| Solr | High-performance enterprise search server. HTTP interface. Built upon Lucene Java. Adds faceting, replication, sharding, and more. |
| Droids | Intelligent robot crawling framework. |
| Open Relevance | Aims to collect and distribute free materials for relevance testing and performance. |
| PyLucene | Python port of the Lucene Java project. |
There are many projects and products that use, expose, port, or in some way wrap various pieces of the Apache Lucene ecosystem.
WHICH LUCENE DISTRIBUTION?
There are many ways to obtain and leverage Lucene technology. How you choose to go about it will depend on your specific needs and integration points, your technical expertise and resources, and budget/time constraints.
When Lucene in Action was published in 2004, before the advent of many of the projects mentioned above, we just had Lucene Java and some other open-source building blocks. It served its purpose and did so extremely well. Lucene has only gotten better since then: faster, more efficient, newer features, and more. If you’ve got Java skills you can easily grab lucene.jar and go for it.
However, some better and easier ways to build Lucene-based search applications are now available. Apache Solr, specifically, is a top notch server architecture, built from the ground up with Lucene. Solr factors in Lucene best practices and simplifies many aspects of indexing content and integrating search into your application as well as addressing scalability needs that exceed the capacity of single machines.
This Refcard is about the concepts of Lucene more than the specifics of the Lucene API. We’ll be shining the light on Lucene internals and concepts with Solr. Solr provides some very direct ways to interact with Lucene.
We recommend you start with one of the following distributions:
- LucidWorks for Solr – certified distributions of the official Apache Solr distributions, including any critical bug fixes and key performance enhancements.
- Apache Solr – a great starting point for developers; grab a distro, write a script, integrate into UI.

If you’re getting started on building a search application, your quickest, easiest bet is to use LucidWorks Enterprise. LucidWorks Enterprise is Lucene and Solr, plus more. Easy to install, easy to configure and monitor. LucidWorks Enterprise is free for development, with support subscriptions available for production deployments.
Lucid Imagination offers professional services, training, and the new LucidWorks Enterprise platform. Visit http://www.lucidimagination.com.
Definitions/Glossary
There are many common terms used when elaborating on Lucene’s design and usage.
| Term | Definition/context/usage |
| Document | Returnable search result item. A document typically represents a crawled web page, a file system file, or a row from a database query. |
| Field | Property, metadata item, or attribute of a document. Documents typically have a unique key field, often called “id”. Other common fields are “title”, “body”, “last_modified_date”, and “categories”. |
| Term | Searchable text, extracted from each indexed field by analysis (a process of tokenization and filtering). |
| tf/idf | Term frequency / inverse document frequency. This is a commonly used relevance factor, relating term frequency (how many times the query term occurs in a given document) to inverse document frequency (the number of documents in the entire collection that contain the query term, inverted). |
Lucene Java and Core Lucene Concepts Explained
The design of Lucene is, at a high level, quite straightforward. Documents are “indexed”.
Documents are a representation of whatever types of “objects” and granularities your application needs to work with on the search/discovery side of the equation. In other words, when thinking Lucene, it is important to consider the use cases / demands of the encompassing application in order to effectively tune the indexing process with the end goal in mind.
Lucene provides APIs to open, read, write, and search an index. Documents contain “fields”. Fields are the useful individually named attributes of a document used by your search application. For example, when indexing traditional files such as Word, HTML, and PDF documents, commonly used fields are “title”, “body”, “keywords”, “author”, and “last_modified_date”.
DOCUMENTS
Documents, to Lucene, are the findable items. Here’s where domain-specific abstractions really matter. A Lucene Document can represent a file on a file system, a row in a database, a news article, a book, a poem, an historical artifact (see collections.si.edu), and so on. Documents contain “fields”. Fields represent attributes of the containing document, such as title, author, keywords, filename, file_type, lastModified, and fileSize.
Fields have a name and one or more values. A field name, to Lucene, is arbitrary, whatever you want.
When indexing documents, the developer has the choice of what fields to add to the Document instance, their names, and how they are each handled. Field values can be stored and/or indexed. A large part of the magic of Lucene is in how field values are analyzed and how a field’s terms are represented and structured.
“document” example

The heart of Lucene’s search capabilities is in the elegance of the index structure, a form of an “inverted index”. An inverted index is a data structure mapping “terms” to the documents that contain them. Indexed fields can be “analyzed”, a process of tokenizing and filtering text into individual searchable terms. Often these terms from the analysis process are simply the individual words from the text. The analysis process of general text typically also includes normalization processes (lowercasing, stemming, other cleansing). Tuning index-time analysis in more sophisticated ways can serve typical search application needs such as sorting, faceting, spell checking, autosuggest, highlighting, and more.
Inverted Index
Again we need to look back at the search application needs. Almost every search application ends up with a human user interface with the infamous and ubiquitous “search box”.
The trick is going from a human entered “query” to returning matching documents blazingly fast. This is where the inverted index structure comes into play. For example, a user searching for “mountain” can be readily accommodated by looking up the term in the inverted index and matching associated documents.
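To make the lookup concrete, here is a toy inverted index in Python. It is a sketch of the concept only, not of Lucene's on-disk structures: the analyzer just lowercases and splits on non-alphanumeric characters, and the sample documents and function names are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str) -> List[str]:
    """Toy analysis chain: lowercase, then split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], word: str) -> Set[int]:
    """Look up a single user-entered word after running it through the same analysis."""
    terms = analyze(word)
    return set(index.get(terms[0], set())) if terms else set()


docs = {1: "Mountain biking trails", 2: "The mountain lodge", 3: "City parks"}
index = build_inverted_index(docs)
print(sorted(search(index, "mountain")))
```

Because the query word goes through the same analysis as the indexed text, a search for "Mountain" finds the same documents as "mountain".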
Not only are documents matched to a query, but they are also scored. For a given search request, a subset of the matching documents is returned to the user. We can easily provide sorting options for the results, though presenting results in “relevancy” order is more often the desired sort criterion. Relevancy refers to a numeric “score” based on the relationship between the query and the matching document. (“Show me the documents best matching my query first, please”).
The following formula (straight from Lucene’s Similarity class javadoc) illustrates the basic factors used to score a document.
Lucene practical scoring formula
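Written out in LaTeX, the practical scoring function has the following shape. This rendering follows the Similarity javadoc of Lucene 3.x and uses the same factor names as the table of factors in this Refcard; other Lucene versions may differ.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
\sum_{t \,\in\, q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^{2} \cdot
t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
```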
Each of the factors in this equation is explained further in the following table:
| Factor | Explanation |
| score(q,d) | The final computed value of numerous factors and weights, numerically representing the relationship between the query and a given document. |
| coord(q,d) | A search-time score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query’s terms will receive a higher score than another document with fewer query terms. |
| queryNorm(q) | A normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. |
| tf(t in d) | Correlates to the term’s frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. Note that tf(t in q) is assumed to be 1 and, therefore, does not appear in this equation. However, if a query contains the same term twice, there will be two term-queries with that same term. Hence, the computation would still be correct (although not very efficient). |
| idf(t) | Stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give higher contribution to the total score. idf(t) appears for t in both the query and the document, hence it is squared in the equation. |
| t.getBoost() | A search-time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to setBoost(). |
| norm(t,d) | Encapsulates a few (indexing time) boost and length factors. |
Understanding how these factors work can help you control exactly how to get the most effective search results from your search application. It's worth noting that in many applications these days, there are numerous other factors involved in scoring a document. Consider boosting documents by recency (latest news articles bubble up), popularity/ratings (or even like/dislike factors), inbound link count, user search/click activity feedback, profit margin, geographic distance, editorial decisions, or many other factors. But let's not get carried away just yet, and focus on Lucene's basic tf/idf.
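For a feel of how tf and idf interact, the Python sketch below computes the per-term part of the score. It assumes the square-root tf and logarithmic idf used by DefaultSimilarity in Lucene 3.x, and it deliberately leaves out coord and queryNorm; the function names are illustrative, not Lucene API calls.

```python
import math


def tf(freq: int) -> float:
    """Term frequency factor: grows with occurrences in the document, but sub-linearly."""
    return math.sqrt(freq)


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: rarer terms (low doc_freq) score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_score(freq: int, doc_freq: int, num_docs: int,
               boost: float = 1.0, norm: float = 1.0) -> float:
    """Per-term contribution: tf * idf^2 * boost * norm (coord and queryNorm omitted)."""
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * boost * norm


# A term appearing 4 times in the document, found in 9 of 10 indexed documents.
print(term_score(freq=4, doc_freq=9, num_docs=10))
```

With the same in-document frequency, a term found in few documents contributes far more than one found in nearly all of them.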
So now we've briefly covered the gory details of how Lucene works for matching and scoring documents during a search. There's one missing bit of magic: going from the human input of a search box to a representative data structure, the Lucene Query object. This string-to-Query process is called “query parsing”. Lucene itself includes a basic QueryParser that can parse sophisticated expressions including AND, OR, +/-, parenthetical grouped expressions, range, fuzzy, wildcarded, and phrase query clauses. For example, the following expression will match documents whose title field contains the terms “Understanding” and “Lucene” in successive positions (provided positional information was enabled!) and whose mimeType value (the MIME type is the document type) is “application/pdf”:
title:"Understanding Lucene" AND mimeType:application/pdf
For more information on Lucene QueryParser syntax, see http://lucene.apache.org/java/3_0_3/queryparsersyntax.html (or the docs for the version of Lucene you are using).
It is important to note that query parsing and allowable user syntax is often an area of customization consideration. Lucene’s API richly exposes many Query subclasses, making it very straightforward to construct sophisticated Query objects using building blocks such as TermQuery, BooleanQuery, PhraseQuery, WildcardQuery, and so on.
Shining the Light on Lucene: Solr
Apache Solr embeds Java Lucene, exposing its capabilities through an easy-to-use HTTP interface. Solr has Lucene best practices built in, and provides distributed and replicated search for large scale power.
For the examples that follow, we’ll be using Solr as the front-end to Lucene. This allows us to demonstrate the capabilities with simple HTTP commands and scripts, rather than coding in Java directly. Additionally, Solr adds easy-to-use faceting, clustering, spell checking, autosuggest, rich document indexing, and much more. We’ll introduce some of Solr’s value-added pieces along the way.
Lucene has a lot of flexibility, likely much more than you will need or use. Solr layers some general common-sense best practices on top of Lucene with a schema. A Solr schema is conceptually the same as a relational database schema. It is a way to map fields/columns to data types, constraints, and representations. Let’s take a preview look at fields defined in the Solr schema (conf/schema.xml) for our running example:
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_en" indexed="true" stored="true"/>
  <field name="mimeType" type="string" indexed="true" stored="true"/>
  <field name="lastModified" type="date" indexed="true" stored="true"/>
</fields>
The schema constrains all fields of a particular name (there is dynamic wildcard matching capability too) to a “field type”. A field type controls how the Lucene Field instances are constructed during indexing, in a consistent manner. We saw above that Lucene fields have a number of additional attributes and controls, including whether the field value is stored, indexed, if indexed, how so, which analysis chain, and whether positions, offsets, and/or term vectors are stored.
Our Running Example, Quick Proof-of-Concepts
The (Solr) documents we index will have a unique “id” field, a “title” field, a “mimeType” field to represent the file type for filtering/faceting purposes, and a “lastModified” date field to represent a file’s last modified timestamp. Here’s an example document (in Solr XML format, suitable for direct POSTing):
<add>
  <doc>
    <field name="id">doc01</field>
    <field name="title">Our first document</field>
    <field name="mimeType">application/pdf</field>
    <field name="lastModified">NOW</field>
  </doc>
</add>
That example shows indexing the metadata regarding an actual file. Ultimately, we also want the contents of the file to be searchable. Solr natively supports extracting and indexing content from rich documents. And LucidWorks Enterprise has built-in file and web crawling and scheduling along with content extraction.
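The Solr XML message shown above can also be generated from application data instead of being written by hand. The following is a minimal sketch using only Python's standard library; the helper name solr_add_xml is hypothetical and not part of Solr or any Solr client.

```python
import xml.etree.ElementTree as ET


def solr_add_xml(doc):
    """Render one document (a dict of field name -> value) as a Solr <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # One <field name="..."> element per field; text is XML-escaped automatically.
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_message = solr_add_xml({
    "id": "doc01",
    "title": "Our first document",
    "mimeType": "application/pdf",
    "lastModified": "NOW",
})
print(xml_message)
```

Building the message through an XML library rather than string concatenation means characters such as & and < in field values are escaped correctly.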
Launching Solr, using its example configuration, is as straightforward as this, from a Solr installation directory:
cd example
java -jar start.jar
And from another command-shell, documents can be easily indexed. Our example document shown previously (saved as docs.xml for us) can be indexed like this:
cd example/exampledocs
java -jar post.jar docs.xml
First of all, this isn’t going to work out of the box, as we have a custom schema and application needs not supported by Solr’s example configuration. Get used to it, it’s the real world! The example schema is there as an example, and is likely inappropriate for your application as-is. Borrow what makes sense for your own application's needs, but don’t leave cruft behind.
At this point, we have a fully functional search engine, with a single document, and will use this for all further examples. Solr will be running at http://localhost:8983/solr.
INDEXING
The process of adding documents to Lucene or Solr is called indexing. With Lucene Java, you create a new Document instance and call the addDocument method of an IndexWriter. This is straightforward and simple enough, leaving the burden on you to come up with the textual strings that'll comprise the document.
Contrast with Solr, which provides numerous ways out of the box to index. We've seen an example of Solr XML, one basic way to bring in documents. Here are detailed examples of various ways to index content into Solr. Solr’s schema centralizes the decisions made about how fields are indexed, freeing the indexer from any internal knowledge about how fields should be handled.
Solr XML/JSON
Solr’s basic XML format can be a convenient way to map your application’s “documents” into Solr. A simple HTTP POST to /update is all it takes.
Posting XML to Solr can be done using the post.jar tool that comes with Solr’s example data, curl (see Solr’s post.sh), or any other HTTP library or tool capable of POST. In fact, most of the popular Solr client API libraries out there simply wrap an HTTP library with some convenience methods for indexing documents, packaging up documents and field values into this XML structure and POSTing it to Solr’s /update handler. Documents indexed in this fashion will be updated if they share the same unique key field value (configured in schema.xml) as existing documents.
Recently, JSON support has been added so it can be even cleaner to post documents into Solr and easier to adapt to a wider variety of clients. It looks like this:
{"add": {
  "doc": {
    "id": "doc02",
    "title": "Solr JSON",
    "mimeType": "application/pdf"}
  }
}
Simply post this type of JSON to /update/json. All other Solr commands can be posted as JSON as well (delete, commit, optimize).
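As an illustration of posting JSON commands from code, the sketch below builds (but does not send) the HTTP request using Python's standard library. It assumes the example server address used throughout this Refcard and an application/json content type; the helper name json_update_request is hypothetical.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumed: the example server from this Refcard


def json_update_request(command):
    """Build (but do not send) a POST request carrying a JSON update command."""
    body = json.dumps(command).encode("utf-8")
    return urllib.request.Request(
        SOLR_URL + "/update/json",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


request = json_update_request(
    {"add": {"doc": {"id": "doc02", "title": "Solr JSON", "mimeType": "application/pdf"}}}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it to a running Solr instance: urllib.request.urlopen(request)
```

The same helper can carry the other commands mentioned above by passing a different dictionary, for example {"commit": {}}.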
Comma, or Tab, Separated Values
Another convenient way to bring documents into Solr is through CSV (comma-separated values; or, more generally, character-separated values, as the separator character is configurable). An example CSV file is shown here:
id,title,mimeType,lastModified
doc03,CSV ftw,application/pdf,2011-02-28T23:59:59Z
This CSV can be POSTed to the /update/csv handler, mapping rows to documents and columns to fields in a flexible, mappable manner. Using curl, this file (we named docs.csv) can be posted like this:
curl "http://localhost:8983/solr/update/csv?commit=true" --data-binary @docs.csv -H 'Content-type:text/plain; charset=utf-8'
Note that this Content-type header is a necessary HTTP header to use for the CSV update handler.
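A CSV file like the one above can be produced from in-memory records with a CSV library, which takes care of quoting values that contain the separator. A minimal sketch in Python follows; the helper name docs_to_csv is hypothetical.

```python
import csv
import io


def docs_to_csv(docs, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(docs)
    return buffer.getvalue()


csv_text = docs_to_csv(
    [{"id": "doc03", "title": "CSV ftw", "mimeType": "application/pdf",
      "lastModified": "2011-02-28T23:59:59Z"}],
    ["id", "title", "mimeType", "lastModified"],
)
print(csv_text)
```

Writing csv_text to docs.csv gives a file that can be posted with the curl command shown earlier.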
Indexing Rich Document Types
Thus far, our indexing examples have omitted extracting and indexing file content. Numerous rich document types, such as Word, PDF, and HTML, can be processed using Solr’s built-in Apache Tika integration. To index the contents and metadata of a Word document, using the HTTP command-line tool curl, this is basically all that is needed:
curl "http://localhost:8983/solr/update/extract?literal.id=doc04" -F "myfile=@technical_manual.doc"
To index rich documents with Lucene’s API, you would need to interface with one or more extractor libraries, such as Tika, extract the text, and map full text and document metadata as appropriate to Lucene fields. It’s much more straightforward, with no coding, to accomplish this task with Solr.

DataImportHandler
And finally, Solr includes a general-purpose “data import handler” framework that has built-in capabilities for indexing relational databases (anything with a JDBC driver), arbitrary XML, and e-mail folders. The neat thing about the DataImportHandler is that it allows aggregating data from various sources into whole Solr documents.
For more information on Solr’s DataImportHandler, see http://wiki.apache.org/solr/DataImportHandler.
Deleting Documents
Documents can be deleted from a Lucene index, either by precise term matching (a unique identifier field, generally) or in bulk for all documents matching a Query.
When using Solr, deletes are accomplished by POSTing <delete><id>refcard01</id></delete> or <delete><query>mimeType:application/pdf</query></delete> XML messages to the /update handler, or {"delete": {"id":"ID"}} or {"delete": {"query":"mimeType:application/pdf"}} messages to /update/json.
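The delete messages described above are small enough to build by hand, but generating them with XML and JSON libraries avoids escaping mistakes in ids and queries. The following Python sketch shows one way to do that; all three helper names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def delete_by_id_xml(doc_id):
    """XML message that deletes the document with the given unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query_xml(query):
    """XML message that deletes every document matching the query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


def delete_by_query_json(query):
    """JSON equivalent of delete_by_query_xml."""
    return json.dumps({"delete": {"query": query}})


print(delete_by_id_xml("refcard01"))
print(delete_by_query_xml("mimeType:application/pdf"))
print(delete_by_query_json("mimeType:application/pdf"))
```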

Committing
Lucene is designed such that documents can continuously be indexed, though the view of what is searchable is fixed to a certain snapshot of an index (for performance, caching, and versioning reasons). This architecture allows batches of documents to be indexed and only made searchable after the entire batch has been ingested. Pending changes to an index, including added and deleted documents, are made visible using a commit command. With Solr, a <commit/> message can be posted to the /update handler, {"commit": {}} can be posted to /update/json, or, even simpler, a bodyless GET (or POST) to /update can be issued with commit=true set: http://localhost:8983/solr/update?commit=true
FIELDS
As mentioned, fields have a lot of configuration flexibility. The following table details the various decisions you must make regarding each field’s configuration.
| Field Attribute | Effect and Uses |
| stored | Stores the original incoming field value in the index. Stored field values are available when documents are retrieved for search results. |
| term positions | Location information of terms within a field. Positional information is necessary for proximity-related queries, such as phrase queries. |
| term offsets | Character begin and end offset values of a term within a field’s textual value. Offsets can be handy for speeding up the generation of highlighted field fragments for query terms. This one is typically a trade-off between highlighting performance and index size. If offsets aren’t stored, they can be computed at highlighting time. |
| term vectors | An “inverted index” structure within a document, containing term/frequency pairs. Term vectors can be useful for more advanced search techniques, such as “more like this” where terms and their frequencies within a single document can be leveraged for finding similar documents. |
In Solr’s schema.xml, a field can be configured to have all of these bells and whistles enabled like this:
<field name="kitchen_sink" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true" />
Only indexed fields have “terms”. These additional term-based structures are only available on indexed fields and really only make sense when used with analyzed full-text fields.
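To make the term-based attributes in the table concrete, the sketch below computes positions, character offsets, and a term vector for a short text. It is a plain Python illustration with a deliberately naive analyzer (lowercased word characters), not Lucene code.

```python
import re
from collections import Counter


def analyze(text):
    """Return (term, position, start_offset, end_offset) for each lowercased word."""
    return [
        (match.group().lower(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\w+", text))
    ]


def is_phrase(tokens, first, second):
    """True if `second` occurs at the position right after `first` (needs positions)."""
    positions = {}
    for term, position, _, _ in tokens:
        positions.setdefault(term, set()).add(position)
    return any(p + 1 in positions.get(second, set()) for p in positions.get(first, set()))


tokens = analyze("Understanding Lucene means understanding terms")
term_vector = Counter(term for term, _, _, _ in tokens)  # term -> frequency in this document

print(tokens[1])                                   # ('lucene', 1, 14, 20)
print(term_vector["understanding"])                # 2
print(is_phrase(tokens, "understanding", "lucene"))  # True
```

Positions drive phrase matching, offsets point back into the original text for highlighting, and the term vector is the per-document term/frequency summary.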
When indexing non-textual information, such as dates or numbers, the representation and ordering of the terms in the index drastically impact the types of operations available. Especially for numeric and date types, which typically are used for range queries and sorting, Lucene (and Solr) offer special ways to handle them. When indexing dates and numerics, use the Trie*Field types in Solr, and the NumericField/NumericTokenStream APIs with Lucene. This is a crucial reminder that what you want your end application to do with the search server greatly impacts how you index your documents. Sorting and range queries, specifically, require up-front planning to index properly to support those operations.
ANALYSIS
The Lucene analysis process consists of several stages. The text is sent initially through an optional CharFilter, then through a Tokenizer, and finally through any number of TokenFilters. CharFilters are useful for mapping diacritical characters to their ASCII equivalent, or mapping Traditional to Simplified Chinese. A Tokenizer is the first step in breaking a string into “tokens” (what they are called before being written to the index as “terms”). TokenFilters can subsequently add, remove, or modify/augment tokens in a sequential pipeline fashion.
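The three-stage chain described above can be sketched as ordinary function composition. This is an illustrative Python model only: the function names are hypothetical stand-ins for a CharFilter, a Tokenizer, and two TokenFilters, and the stopword list is made up for the example.

```python
import re
import unicodedata


def fold_diacritics(text):
    """CharFilter stage: map accented characters to their ASCII base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Tokenizer stage: break the string into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """TokenFilter: modify each token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """TokenFilter: remove tokens."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through char filters, then the tokenizer, then each token filter in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze("The Caf\u00e9 of Lucene", [fold_diacritics], tokenize,
                [lowercase, remove_stopwords])
print(terms)  # ['cafe', 'lucene']
```

The order of the token filters matters: here lowercasing runs first so that "The" matches the lowercase stopword list.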

Using the Solr admin analysis introspection tool, using the field type “text_en” with the value “Understanding Lucene Refcard”, the following terms result:
The analysis tool shows the term text that would be indexed ([understanding], [lucene]…), and the position and offset attributes we previously discussed. The analysis tool will handily show you the term output of each of the analysis stages, from tokenization through each of the filters.
SEARCHING
Now that we’ve got content indexed, searching it is easy! Ultimately, a Lucene Query object is handed to a Lucene IndexSearcher.search() method and results are processed. How to construct a query is the next step.
With Lucene Java, TermQuery is the most primitive Query. Then there’s BooleanQuery, PhraseQuery, and many other Query subclasses to choose from. Programmatically, the sky’s the limit in terms of query complexity. Lucene also includes a QueryParser, which parses a string into a Query object, supporting fielded, grouped, fuzzy, phrase, range, AND/OR/NOT/+/- and other sophisticated syntax.
Solr makes this all possible without coding and accepts a simple string query (q) parameter (and other parameters that can affect query parsing/generation). Solr includes a couple of general purpose query parsers, most notably a schema-aware subclass of Lucene’s QueryParser. This Lucene query parser is the default.

Searching Solr is a straightforward HTTP request to /select?q=<your query>. Displaying search results in JSON format (adding &wt=json), we get something like this:
{"responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "indent":"true", "wt":"json", "q":"*:*"}},
  "response":{"numFound":3,"start":0,
    "docs":[
      {"id":"refcard01",
       "timestamp":"2011-02-17T20:44:49.064Z",
       "title":["Understanding Lucene"]},
      {"id":"refcard02",
       "timestamp":"2011-02-17T20:48:16.862Z",
       "title":["Refcard 2"]},
      {"id":"doc03",
       "mimeType":"application/pdf",
       "lastModified":"2011-02-28T23:59:59Z",
       "timestamp":"2011-02-17T21:42:31.423Z",
       "title":["CSV ftw"]}]}}
Note that Solr can return search results in a number of formats (XML, JSON, Ruby, PHP, Python, CSV, etc.); choose the one that is most convenient for your environment.
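From client code, a search boils down to URL-encoding the query and reading the JSON that comes back. The Python sketch below builds the request URL and parses a canned response shaped like the one above, so it runs without a server. The helper name select_url is hypothetical and the sample response is abbreviated for illustration.

```python
import json
from urllib.parse import urlencode


def select_url(query, base="http://localhost:8983/solr"):
    """Build a /select URL for the query, asking for JSON output."""
    return base + "/select?" + urlencode({"q": query, "wt": "json"})


url = select_url('title:"Understanding Lucene"')
print(url)

# Abbreviated stand-in for the body Solr would return for this request.
sample_response = """
{"responseHeader": {"status": 0, "QTime": 2},
 "response": {"numFound": 1, "start": 0,
              "docs": [{"id": "refcard01", "title": ["Understanding Lucene"]}]}}
"""
result = json.loads(sample_response)["response"]
titles = {doc["id"]: doc["title"][0] for doc in result["docs"]}
print(result["numFound"], titles)
```

Encoding the query through urlencode matters because Lucene query syntax uses characters such as quotes, colons, and spaces that are not safe in a raw URL.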
Debugging Query Parsing
Query parsing is complex business. It can be very helpful to see a representation of the underlying Query object that was generated. By adding a debug=query parameter to the request, you can see how a query is parsed. For example, using the query "title:lucene AND timestamp:[NOW-1YEAR TO NOW]", the debug output returns a parsedquery value of:
parsedquery:+title:lucene +timestamp:[1266446158657 TO 1297982158657]
Note that AND made both clauses mandatory (leading +), and the date range values were resolved by Solr’s useful date math feature and then converted to the Lucene “date” type index representation.
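The two range bounds in the parsed query are consistent with milliseconds since the Unix epoch (an interpretation, not something the debug output states). Under that assumption, this short Python sketch converts them back to timestamps to show what NOW-1YEAR and NOW resolved to when the example was run.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis):
    """Interpret an integer as milliseconds since 1970-01-01T00:00:00Z."""
    return EPOCH + timedelta(milliseconds=millis)


lower = from_epoch_millis(1266446158657)  # NOW-1YEAR
upper = from_epoch_millis(1297982158657)  # NOW
print(lower.isoformat())
print(upper.isoformat())
print(upper - lower)
```

The upper bound falls on 2011-02-17, the same day as the timestamp values in the sample search response, and the two bounds are exactly 365 days apart.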
Explaining Result Scoring
Now that we have real documents indexed, we can take a look at Lucene’s scoring first-hand. Solr provides an easy way to look at Lucene’s “explain” output, which details how/why a document scored the way it did. In our Refcard lab, doing a title:lucene search matches a document and scores it like this:
0.8784157 = (MATCH) fieldWeight(title:lucene in 0), product of:
1.0 = tf(termFreq(title:lucene)=1)
1.4054651 = idf(docFreq=1, maxDocs=3)
0.625 = fieldNorm(field=title, doc=0)
Add the debug=results parameter to the Solr search request to have explanation output added to the response.
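The numbers in the explain output can be reproduced by hand. The sketch below assumes the formulas of Lucene's classic default similarity, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the fieldNorm value is simply copied from the output rather than derived.

```python
import math


def tf(term_freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: rarer terms get a larger value."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


field_norm = 0.625  # copied from the explain output above
score = tf(1) * idf(doc_freq=1, max_docs=3) * field_norm

print(round(tf(1), 7))       # 1.0
print(round(idf(1, 3), 7))   # 1.4054651
print(round(score, 7))       # 0.8784157
```

The product of the three factors matches the fieldWeight of 0.8784157 reported for the document.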
BELLS AND WHISTLES
Solr includes a number of other features; some of them wrap Lucene Java add-on libraries and some of them (like faceting and rich function query/sort capability) are currently only at the Solr layer. We aren’t going into any detail of these particular features here, but now that you understand Lucene, you have the foundation to understand basically how they work from the inverted index structure on up. These features include:
- Faceting: providing counts for various document attributes across the entire result set.
- Highlighting: generating relevant snippets of document text, highlighting query terms. Useful in result display to show users the context in which their queries matched.
- Spell checking: “Did you mean…?”. Looks up terms textually close to the query terms and suggests possible intended queries.
- More-like-this: Given a particular document, or some arbitrary text, what other documents are similar?
Version Information
These Refcard demos use the current development branch of Lucene/Solr. This is likely to be what is eventually released from Apache as Lucene and Solr 4.0. LucidWorks Enterprise is also based on this same version. The concepts apply to all versions of Lucene and Solr, and the bulk of these examples should also work with earlier versions of Solr.
For Further Information
For all things Apache Lucene, start here: http://lucene.apache.org
Solr sports relatively decent developer-centric documentation: http://wiki.apache.org/solr
Lucene in Action (Manning): http://www.manning.com/lucene
To answer your Lucene questions, try LucidFind (http://search.lucidimagination.com), where the Lucene ecosystem’s e-mail lists, wikis, issue tracker, etc. are made searchable for the entire Lucene community’s benefit.
See Apache Solr: Getting Optimal Search Results, http://refcardz.dzone.com/refcardz/solr-essentials, for more information on Apache Solr.
About The Authors

Erik Hatcher
Erik Hatcher evangelizes and engineers at Lucid Imagination. He co-authored both Lucene in Action and Java Development with Ant. At Lucid, he has worked with many companies deploying Lucene/Solr search systems. Erik has spoken at numerous industry events including Lucene EuroCon, ApacheCon, JavaOne, OSCON, and user groups and meetups around the world.
Recommended Book
When Lucene first appeared, this superfast search engine was nothing short of amazing. Today, Lucene still delivers. Its high-performance, easy-to-use API features like numeric fields, payloads, near-realtime search, and huge increases in indexing and searching speed make it the leading search tool.
And with clear writing, reusable examples, and unmatched advice, Lucene in Action, Second Edition is still the definitive guide to effectively integrating search into your applications. This totally revised book shows you how to index your documents, including formats such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, and filtering and covers the numerous improvements to Lucene since the first edition. Source code is for Lucene 3.0.1.

The Top Twelve Integration Patterns for Apache Camel
By Claus Ibsen
Enterprise Integration Patterns: with Apache Camel
About Enterprise Integration Patterns
Integration is a hard problem. To help deal with the complexity of integration problems the Enterprise Integration Patterns (EIP) have become the standard way to describe, document and implement complex integration problems. Hohpe & Woolf's book the Enterprise Integration Patterns has become the bible in the integration space - essential reading for any integration professional.
Apache Camel is an open source project for implementing the EIP easily in a few lines of Java code or Spring XML configuration. This reference card, the first in a two card series, guides you through the most common Enterprise Integration Patterns and gives you examples of how to implement them either in Java code or using Spring XML. This Refcard is targeted for software developers and enterprise architects, but anyone in the integration space can benefit as well.
About Apache Camel
Apache Camel is a powerful open source integration platform based on Enterprise Integration Patterns (EIP) with powerful Bean Integration. Camel lets you implement EIP routing using Camel's intuitive Domain Specific Language (DSL) based on Java (aka fluent builder) or XML. Camel uses URIs for endpoint resolution, so it's very easy to work with any kind of transport such as HTTP, REST, JMS, web service, File, FTP, TCP, Mail, JBI, Bean (POJO) and many others. Camel also provides Data Formats for various popular formats such as CSV, EDI, FIX, HL7, JAXB, JSON, and XStream. Camel is an integration API that can be embedded in any server of choice, such as a J2EE server, ActiveMQ, Tomcat, or OSGi, or run standalone. Camel's Bean Integration lets you define loose coupling, allowing you to fully separate your business logic from the integration logic. Camel is based on a modular architecture, allowing you to plug in your own component or data format so that it blends in seamlessly with existing modules. Camel provides a test kit for unit and integration testing with strong mock and assertion capabilities.
Essential Patterns
This group consists of the most essential patterns that anyone working with integration must know.
Pipes and Filters
| How can we perform complex processing on a message while maintaining independence and flexibility? | |
| Problem | A single event often triggers a sequence of processing steps |
| Solution | Use Pipes and Filters to divide a larger processing task into a sequence of smaller, independent processing steps (filters) that are connected by channels (pipes) |
| Camel | Camel supports Pipes and Filters using the pipeline node. |
| Java DSL |
Where jms represents the JMS component used for consuming JMS messages on the JMS broker. Direct is used for combining endpoints in a synchronous fashion, allowing you to divide routes into sub-routes and/or reuse common routes. TIP: Pipeline is the default mode of operation when you specify multiple outputs, so it can be omitted and replaced with the more common node:
TIP: You can also separate each step as individual to nodes:
|
| Spring DSL |
|
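Independent of Camel's DSL, the Pipes and Filters idea itself can be sketched in a few lines. The following plain Python illustration (not Camel code; every name in it is hypothetical) chains independent filter functions so that the output of one becomes the input of the next.

```python
from functools import reduce


def pipeline(*filters):
    """Compose filters so each one receives the previous filter's output."""
    def run(message):
        return reduce(lambda current, step: step(current), filters, message)
    return run


def validate(order):
    if "id" not in order:
        raise ValueError("order has no id")
    return order


def enrich(order):
    # Add a default currency without mutating the incoming message.
    return {**order, "currency": order.get("currency", "USD")}


def to_text(order):
    return f"{order['id']}:{order['amount']} {order['currency']}"


process_order = pipeline(validate, enrich, to_text)
result = process_order({"id": "A1", "amount": 10})
print(result)  # A1:10 USD
```

Because each filter only depends on the message it receives, steps can be reordered, removed, or reused in other pipelines without touching the others.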
Message Router
| How can you decouple individual processing steps so that messages can be passed to different filters depending on a set of conditions? | |
| Problem | Pipes and Filters route each message in the same processing steps. How can we route messages differently? |
| Solution | Filter using predicates to choose the right output destination. |
| Camel | Camel supports Message Router using the choice node. For more details see the Content-Based Router pattern. |
Content-Based Router
| How do we handle a situation where the implementation of a single logical function (e.g., inventory check) is spread across multiple physical systems? | |
| Problem | How do we ensure a Message is sent to the correct recipient based on information from its content? |
| Solution | Use a Content-Based Router to route each message to the correct recipient based on the message content. |
| Camel | Camel has extensive support for Content-Based Routing. Camel supports content based routing based on choice, filter, or any other expression. |
| Java DSL |
Choice
TIP: In the route above end() can be omitted as it's the last node and we do not route the message to a new destination after the choice. TIP: You can continue routing after the choice ends. |
| Spring DSL |
Choice
TIP: In Spring DSL you cannot invoke code, as opposed to the Java DSL that is 100% Java. To express the predicates for the choices we need to use a language. We will use the Simple language, which has a simple expression parser that supports a limited set of operators. You can use any of the more powerful languages supported in Camel such as JavaScript, Groovy, Unified EL and many others. TIP: You can also use a method call to invoke a method on a bean to evaluate the predicate. Let's try that:
Notice how we use Bean Parameter Binding to instruct Camel to invoke this method and pass in the type header as the String parameter. This allows your code to be fully decoupled from any Camel API so it's easy to read, write and unit test. |
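The routing decision at the heart of a Content-Based Router is an ordered list of predicates with a fallback. The sketch below models that in plain Python; it is not Camel code, and the function name, header name, and queue names are hypothetical.

```python
def make_router(routes, otherwise):
    """Return a function that maps a message to the first destination whose predicate matches."""
    def route(message):
        for predicate, destination in routes:
            if predicate(message):
                return destination
        return otherwise
    return route


route = make_router(
    [
        (lambda m: m["headers"].get("type") == "widget", "queue:widget"),
        (lambda m: m["headers"].get("type") == "gadget", "queue:gadget"),
    ],
    otherwise="queue:other",
)

print(route({"headers": {"type": "widget"}, "body": "..."}))  # queue:widget
print(route({"headers": {}, "body": "..."}))                  # queue:other
```

Keeping the predicates as plain functions of the message mirrors the point made above: the decision logic stays free of any routing framework API and is easy to unit test.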
Message Translator
| How can systems using different data formats communicate with each other using messaging? | |
| Problem | Each application uses its own data format, so we need to translate the message into the data format the application supports. |
| Solution | Use a special filter, a Message Translator, between filters or applications to translate one data format into another. |
| Camel | Camel supports the message translator using the processor, bean or transform nodes. TIP: Camel routes the message as a chain of processor nodes. |
| Java DSL |
Processor
Bean
Instead of the processor we can use a Bean (POJO). An advantage of using a Bean over a Processor is that we do not have to implement or use any Camel-specific interfaces or types. This allows you to fully decouple your beans from Camel.
TIP: Camel can create an instance of the bean automatically; you can just refer to the class type.
TIP: Camel will try to figure out which method to invoke on the bean in case there are multiple methods. In case of ambiguity you can specify which method to invoke by the method parameter:
Transform
Transform is a particular processor allowing you to set a response to be returned to the original caller. We use transform to return a constant ACK response to the TCP listener after we have copied the message to the JMS queue. Notice we use a constant to build an "ACK" string as the response.
|
| Spring DSL |
Processor
In Spring DSL Camel will look up the processor or POJO/Bean in the registry based on the id of the bean.
Bean
Transform
|
| Annotation DSL | You can also use the @Consume annotation for transformations. For example in the method below we consume from a JMS queue and do the transformation in regular Java code. Notice that the input and output parameters of the method are String. Camel will automatically coerce the payload to the expected type defined by the method. Since this is a JMS example the response will be sent back to the JMS reply-to destination.
TIP: You can use Bean Parameter Binding to help Camel coerce the Message into the method parameters. For instance you can use @Body, @Headers parameter annotations to bind parameters to the body and headers. |
Message Filter
| How can a component avoid receiving unwanted messages? | |
| Problem | How do you discard unwanted messages? |
| Solution | Use a special kind of Message Router, a Message Filter, to eliminate undesired messages from a channel based on a set of criteria. |
| Camel | Camel has support for Message Filter using the filter node. The filter evaluates whether a predicate is true or false, only allowing messages for which it is true to pass the filter, whereas messages for which it is false are silently ignored. |
| Java DSL | We want to discard any test messages so we only route non-test messages to the order queue.
|
| Spring DSL | For the Spring DSL we use XPath to evaluate the predicate. The $test is a special shorthand in Camel to refer to the header with the given name. So even if the payload is not XML based we can still use XPath to evaluate predicates.
|
Dynamic Router
| Problem | How can we route messages based on a dynamic list of destinations? |
| Solution | Use a Dynamic Router, a router that can self-configure based on special configuration messages from participating destinations. |
| Camel | Camel has support for Dynamic Router using the Dynamic Recipient List combined with a data store holding the list of destinations. |
| Java DSL | We use a Processor as the dynamic router to determine the destinations. We could also have used a Bean instead.
|
| Spring DSL |
|
| Annotation DSL |
TIP: Notice how we used Bean Parameter Binding to bind the parameters to the route method based on an @XPath expression on the XML payload of the JMS message. This allows us to extract the customer id as a string parameter. @Header will bind a JMS property with the key location. Document is the XML payload of the JMS message. TIP: Camel uses its strong type converter feature to convert the payload to the type of the method parameter. We could use String and Camel will convert the body to a String instead. You can register your own type converters as well using the @Converter annotation at the class and method level. |
Recipient List
| How do we route a message to a list of statically or dynamically specified recipients? | |
| Problem | How can we route messages based on a static or dynamic list of destinations? |
| Solution | Define a channel for each recipient. Then use a Recipient List to inspect an incoming message, determine the list of desired recipients and forward the message to all channels associated with the recipients in the list. |
| Camel | Camel supports the static Recipient List using the multicast node, and the dynamic Recipient List using the recipientList node. |
| Java DSL |
Static
In this route we route to a static list of two recipients that will receive a copy of the same message simultaneously.
Dynamic
In this route we route to a dynamic list of recipients defined in the message header[mails], which contains a list of recipients as endpoint URLs. The bean processMails is used to add the header[mails] to the message.
And in the processMails bean we use @Headers Bean Parameter Binding to provide a java.util.Map to store the recipients.
|
| Spring DSL |
Static
Dynamic
In this example we invoke a method call on a Bean to provide the dynamic list of recipients.
|
| Annotation DSL | In the CustomerService class we annotate the whereTo method with @RecipientList, and return a single destination based on the customer id. Notice the flexibility of Camel as it can adapt accordingly to how you define what your methods are returning: a single element, a list, an iterator, etc.
And then we can route to the bean and it will act as a dynamic recipient list.
|
Splitter
| How can we process a message if it contains multiple elements, each of which may have to be processed in a different way? | |
| Problem | How can we split a single message into pieces to be routed individually? |
| Solution | Use a Splitter to break out the composite message into a series of individual messages, each containing data related to one item. |
| Camel | Camel has support for Splitter using the split node. |
| Java DSL | In this route we consume files from the inbox folder. Each file is then split into a new message. We use a tokenizer to split the file content line by line based on line breaks.
TIP: Camel also supports splitting streams using the streaming node. We can split the stream by using a comma:
TIP: In the routes above each individual split message will be executed in sequence. Camel also supports parallel execution using the parallelProcessing node.
|
| Spring DSL | In this route we use XPath to split XML payloads received on the JMS order queue.
And in this route we split the messages using a regular expression
TIP: Split evaluates an org.apache.camel.Expression to provide something that is iterable to produce each individual new message. This allows you to provide any kind of expression such as a Bean invoked as a method call.
|
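The Splitter pattern reduces to two pieces: an expression that turns one message into something iterable, and a step applied to each resulting part. The following plain Python sketch (not Camel code; all names are hypothetical) splits a file body line by line and processes the lines in sequence.

```python
def split_lines(body):
    """Splitting expression: one part per non-empty line."""
    return [line for line in body.splitlines() if line.strip()]


def split_and_process(body, splitter, process):
    """Apply `process` to each part produced by `splitter`, in order."""
    return [process(part) for part in splitter(body)]


file_body = "widget,3\ngadget,5\n"
rows = split_and_process(file_body, split_lines, lambda line: line.split(","))
print(rows)  # [['widget', '3'], ['gadget', '5']]
```

Swapping split_lines for a different function (for example one that splits on commas or applies a regular expression) changes how the message is broken up without changing the processing step.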
Aggregator
| How do we combine the results of individual, but related messages so that they can be processed as a whole? | |
| Problem | How do we combine multiple messages into a single combined message? |
| Solution | Use a stateful filter, an Aggregator, to collect and store individual messages until it receives a complete set of related messages to be published. |
| Camel | Camel has support for the Aggregator using the aggregate node. Camel uses a stateful batch processor that is capable of aggregating related messages into a single combined message. A correlation expression is used to determine which messages should be aggregated. An aggregation strategy is used to combine aggregated messages into the result message. Camel’s aggregator also supports a completion predicate allowing you to signal when the aggregation is complete. Camel also supports other completion signals based on timeout and/or a number of messages already aggregated. |
| Java DSL |
Stock quote example
We want to update a website every five minutes with the latest stock quotes. The quotes are received on a JMS topic. As we can receive multiple quotes for the same stock within this time period, we only want to keep the last one, as it's the most up to date. We can do this with the aggregator:
As the correlation expression we use XPath to fetch the stock symbol from the message body. As the aggregation strategy we use the default provided by Camel, which picks the latest message and thus also the most up to date. The time period is set as a timeout value in milliseconds.
Loan broker example
We aggregate responses from various banks for their quote for a given loan request. We want to pick the bank with the best quote (the cheapest loan), so our aggregation strategy must pick the best quote.
We use a completion predicate that signals when we have received more than 2 quotes for a given loan, giving us at least 3 quotes to pick among. The following shows the code snippet for the aggregation strategy we must implement to pick the best quote:
|
| Spring DSL |
Loan Broker Example
TIP: We use the simple language to declare the completion predicate. Simple is a basic language that supports a primitive set of operators. ${header.CamelAggregatedSize} will fetch a header holding the number of messages aggregated.
TIP: If the completion predicate is more complex we can use a method call to invoke a Bean so we can do the evaluation in pure Java code:
Notice how we can use Bean Binding Parameter to get hold of the aggregation size as a parameter, instead of looking it up in the message. |
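To make the three moving parts of the loan broker example concrete (correlation, aggregation strategy, completion predicate), here is a minimal plain-Java sketch. It deliberately avoids the Camel API; the Quote class, its fields, and the method names are hypothetical and only illustrate the logic described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the loan broker aggregation; not the Camel API. */
final class LoanAggregatorSketch {

    /** A bank's reply to a loan request (hypothetical fields). */
    static final class Quote {
        final String loanId;
        final String bank;
        final double rate;

        Quote(String loanId, String bank, double rate) {
            this.loanId = loanId;
            this.bank = bank;
            this.rate = rate;
        }
    }

    /** State kept per correlation key. */
    private static final class Group {
        Quote best;
        int size;
    }

    private final Map<String, Group> groups = new HashMap<>();
    private final List<Quote> published = new ArrayList<>();

    void onQuote(Quote quote) {
        // Correlation: quotes for the same loan request belong together
        Group group = groups.computeIfAbsent(quote.loanId, key -> new Group());
        // Aggregation strategy: keep the cheapest quote seen so far
        if (group.best == null || quote.rate < group.best.rate) {
            group.best = quote;
        }
        group.size++;
        // Completion predicate: more than 2 quotes have been aggregated
        if (group.size > 2) {
            published.add(group.best);
            groups.remove(quote.loanId);
        }
    }

    List<Quote> published() {
        return published;
    }
}
```

Feeding three quotes for the same loan publishes exactly one result, the quote with the lowest rate. A timeout based completion is left out of the sketch.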
Resequencer
| How can we get a stream of related but out-of-sequence messages back into the correct order? | |
![]() |
|
| Problem | How do we ensure ordering of messages? |
| Solution | Use a stateful filter, a Resequencer, to collect and reorder messages so that they can be published in a specified order. |
| Camel | Camel has support for the Resequencer using the resequence node. Camel uses a stateful batch processor that is capable of reordering related messages. Camel supports two resequencing algorithms: batch, which collects messages into a batch, sorts them, and publishes them; and stream, which continuously reorders message streams based on detection of gaps between messages. Batch is similar to the aggregator but with sorting. Stream is the traditional Resequencer pattern with gap detection. Stream requires the use of numbers (longs) as sequence numbers, enforced by the gap detection, as it must be able to compute whether gaps exist. A gap is detected if a number in a series is missing, e.g. 3, 4, 6 with number 5 missing. Camel will hold back the later messages until number 5 arrives. |
| Java DSL |
Batch:
We want to process received stock quotes, once a minute, ordered by their stock symbol. We use XPath as the expression to select the stock symbol, as the value used for sorting.
Camel will default the order to ascending. You can provide your own comparison for sorting if needed.
Stream:
Suppose we continuously poll a file directory for inventory updates, and it's important they are processed in sequence by their inventory id. To do this we enable streaming and use one hour as the timeout.
|
| Spring DSL |
Batch:
Stream:
Notice that you can enable streaming by specifying <stream-config> instead of the batch configuration. |
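The gap detection used by the stream algorithm can be illustrated with a small plain-Java sketch. It models only the idea from the description (hold back messages until the missing sequence number arrives); the class and method names are hypothetical, it is not the Camel implementation, and the timeout handling is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Plain-Java model of a stream resequencer with gap detection. */
final class StreamResequencerSketch {

    // Messages that arrived early, sorted by sequence number
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private long nextExpected;

    StreamResequencerSketch(long firstSequenceNumber) {
        this.nextExpected = firstSequenceNumber;
    }

    /** Accepts one message and returns every message that can now be released in order. */
    List<String> offer(long sequenceNumber, String body) {
        pending.put(sequenceNumber, body);
        List<String> released = new ArrayList<>();
        // Release while there is no gap at the head of the pending messages
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            released.add(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
        return released;
    }
}
```

With the series 3, 4, 6, 5 from the description, message 6 is held back when it arrives and is released together with message 5.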
Dead Letter Channel
| What will the messaging system do with a message it cannot deliver? | |
![]() |
|
| Problem | The messaging system cannot deliver a message |
| Solution | When a message cannot be delivered it should be moved to a Dead Letter Channel |
| Camel | Camel has extensive support for Dead Letter Channel through its error handler and exception clauses. The error handler supports redelivery policies that decide how many times to try redelivering a message before moving it to a Dead Letter Channel. The default Dead Letter Channel will log the message at ERROR level and perform up to 6 redeliveries using a one second delay before each retry. The error handler has two scopes: global and per route. TIP: See Exception Clause in the Camel documentation for selective interception of thrown exceptions. This allows you to route certain exceptions differently or even reset the failure by marking it as handled. TIP: DeadLetterChannel supports processing the message before it gets redelivered using onRedelivery. This allows you to alter the message beforehand (e.g. to set custom headers). |
| Java DSL |
Global scope
In this route we override the global scope to use up to five redeliveries, whereas the global scope allows only three. You can of course also set a different error queue destination:
|
| Spring DSL |
The error handler is configured very differently in the Java DSL vs. the Spring DSL. The Spring DSL relies more on standard Spring bean configuration whereas the Java DSL uses fluent builders.
Global scope
The global scope error handler is configured using the errorHandlerRef attribute on the camelContext tag.
Route scope
The route scope error handler is configured using the errorHandlerRef attribute on the route tag.
For both, the error handler itself is configured using a regular Spring bean.
|
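The redelivery policy described in this section boils down to a retry loop with a fallback destination. The following plain-Java sketch shows that logic under stated assumptions: it is not the Camel error handler, all names are hypothetical, and an endpoint is modelled as a Consumer that throws a RuntimeException when delivery fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Plain-Java model of a redelivery policy with a dead letter queue. */
final class DeadLetterSketch {

    final List<String> deadLetters = new ArrayList<>();
    private final int maximumRedeliveries;
    private final long redeliveryDelayMillis;

    DeadLetterSketch(int maximumRedeliveries, long redeliveryDelayMillis) {
        this.maximumRedeliveries = maximumRedeliveries;
        this.redeliveryDelayMillis = redeliveryDelayMillis;
    }

    /** Tries to deliver the message and returns the number of attempts made. */
    int deliver(String message, Consumer<String> endpoint) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                endpoint.accept(message);
                return attempts;
            } catch (RuntimeException failure) {
                // One initial attempt plus maximumRedeliveries retries
                if (attempts > maximumRedeliveries) {
                    deadLetters.add(message);
                    return attempts;
                }
                try {
                    Thread.sleep(redeliveryDelayMillis);
                } catch (InterruptedException interrupted) {
                    // Give up early but keep the interrupt flag set
                    Thread.currentThread().interrupt();
                    deadLetters.add(message);
                    return attempts;
                }
            }
        }
    }
}
```

With a maximum of five redeliveries, an endpoint that always fails is tried six times in total before the message ends up in the dead letter list.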
Wire Tap
| How do you inspect messages that travel on a point-to-point channel? | |
![]() |
|
| Problem | How do you tap messages while they are routed? |
| Solution | Insert a Wire Tap into the channel, that publishes each incoming message to the main channel as well as to a secondary channel. |
| Camel | Camel has support for Wire Tap using the wireTap node, which supports two modes: traditional and new message. The traditional mode sends a copy of the original message, whereas the new message mode sends a newly created message. All tapped messages are sent as Event Messages and run in parallel with the original message. |
| Java DSL |
Traditional
The route uses the traditional mode to send a copy of the original message to the seda tapped queue, while the original message is routed to its destination, the process order bean.
New message
In this route we tap the high priority orders and send a new message containing a body with the from part of the order.
TIP: As Camel uses an Expression for evaluation you can use functions other than xpath; for instance, to send a fixed String you can use constant.
|
| Spring DSL |
Traditional
New Message
|
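The difference between the two modes can be shown in a few lines of plain Java. This sketch is not the Camel wireTap node: the class, its lists standing in for channels, and the method names are all hypothetical, and the parallel execution mentioned above is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of a Wire Tap with its two modes. */
final class WireTapSketch {

    final List<String> tapped = new ArrayList<>();     // secondary channel
    final List<String> delivered = new ArrayList<>();  // main channel

    /** Traditional mode: the tap receives a copy of the original message. */
    void routeTraditional(String message) {
        tapped.add(message);
        delivered.add(message);
    }

    /** New message mode: the tap receives a body computed by an expression. */
    void routeNewMessage(String message, Function<String, String> expression) {
        tapped.add(expression.apply(message));
        delivered.add(message);
    }
}
```

In both modes the original message reaches the main channel unchanged; only what lands on the tap differs.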
Conclusion
The twelve patterns in this Refcard cover the most used patterns in the integration space, together with two of the most complex: the Aggregator and the Dead Letter Channel. In the second part of this series we will take a further look at common patterns and transactions.
Get More Information
| Camel Website http://camel.apache.org | The home of the Apache Camel project. Find downloads, tutorials, examples, getting started guides, issue tracker, roadmap, mailing lists, irc chat rooms, and how to get help. |
| FuseSource Website http://fusesource.com | The home of the FuseSource company, the professional company behind Apache Camel with enterprise offerings, support, consulting and training. |
| About Author http://davsclaus.blogspot.com | The personal blog of the author of this reference card. |
About The Author

Claus Ibsen
Claus Ibsen is a passionate open-source enthusiast who specializes in the integration space. As an engineer in the Progress FUSE open source team he works full time on Apache Camel, FUSE Mediation Router (based on Apache Camel) and related projects. Claus is very active in the Apache Camel and FUSE communities, writing blogs, twittering, assisting on the forums and IRC channels, and driving the Apache Camel roadmap.
About Progress Fuse
FUSE products are standards-based, open source enterprise integration tools based on Apache SOA projects, and are productized and supported by the people who wrote the code.
Recommended Book
Utilizing years of practical experience, seasoned experts Gregor Hohpe and Bobby Woolf show how asynchronous messaging has proven to be the best strategy for enterprise integration success. However, building and deploying messaging solutions presents a number of problems for developers. Enterprise Integration Patterns provides an invaluable catalog of sixty-five patterns, with real-world solutions that demonstrate the formidable of messaging and help you to design effective messaging solutions for your enterprise.

Claus Ibsen is a principal engineer working for FuseSource Corporation specializing in the enterprise integration space. Claus focuses mostly on Apache Camel.
Apache Hadoop Deployment
A Blueprint for Reliable Distributed Computing
By Eugene Ciurana
9,712 Downloads · Refcard 133 of 151 (see them all)
INTRODUCTION
This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. Check out Refcard #117, Getting Started with Apache Hadoop, for basic terminology and for an overview of the tools available in the Hadoop Project.
WHICH HADOOP DISTRIBUTION?
Apache Hadoop is a framework for implementing reliable and scalable computational networks. This Refcard presents how to deploy and use development and production computational networks. HDFS, MapReduce, and Pig are the foundational tools for developing Hadoop applications.
There are two basic Hadoop distributions:
- Apache Hadoop is the main open-source, bleeding-edge distribution from the Apache foundation.
- The Cloudera Distribution for Apache Hadoop (CDH) is an open-source, enterprise-class distribution for production-ready environments.
The choice between the two distributions depends on the organization’s objective.
- The Apache distribution is fine for experimental learning exercises and for becoming familiar with how Hadoop is put together.
- CDH removes the guesswork and offers an almost turnkey product for robustness and stability; it also offers some tools not available in the Apache distribution.

The Apache Hadoop distribution assumes that the person installing it is comfortable with configuring a system manually. CDH, on the other hand, is designed as a drop-in component for all major Linux distributions.

Minimum Prerequisites
- Java 1.6 from Oracle, update 8 or later; identify your current JAVA_HOME
- sshd and ssh for managing Hadoop daemons across multiple systems
- rsync for file and directory synchronization across the nodes in the cluster
- Create a service account for user hadoop where $HOME=/home/hadoop
SSH Access
Every system in a Hadoop deployment must provide SSH access for data exchange between nodes. Log in to the node as the Hadoop user and run the commands in Listing 1 to validate or create the required SSH configuration.
| Listing 1 - Hadoop SSH Prerequisites |
|
The key passphrase for this example is left blank. If this were to run on a public network it could be a security hole. Distribute the public key from the master node to all other nodes for data exchange. All nodes are assumed to run in a secure network behind the firewall.

Enterprise: CDH Prerequisites
Cloudera simplified the installation process by offering packages for Ubuntu Server and Red Hat Linux distributions.

CDH on Ubuntu Pre-Install Setup
Execute these commands as root or via sudo to add the Cloudera repositories:
| Listing 2 - Ubuntu Pre-Install Setup |
|
CDH on Red Hat Pre-Install Setup
Run these commands as root or through sudo to add the yum Cloudera repository:
| Listing 3 - Red Hat Pre-Install Setup |
|
Ensure that all the pre-required software and configuration are installed on every machine intended to be a Hadoop node. Don’t mix and match operating systems, distributions, Hadoop, or Java versions!
Hadoop for Development
- Hadoop runs as a single Java process, in non-distributed mode, by default. This configuration is optimal for development and debugging.
- Hadoop also offers a pseudo-distributed mode, in which every Hadoop daemon runs in a separate Java process. This configuration is optimal for development and will be used for the examples in this guide.

Hadoop for Production
- Production environments are deployed across a group of machines that make up the computational network. Hadoop must be configured to run in fully distributed, clustered mode.
APACHE HADOOP INSTALLATION
This Refcard is a reference for development and production deployment of the components shown in Figure 1. It includes the components available in the basic Hadoop distribution and the enhancements that Cloudera released.
Figure 1 - Hadoop Components

A non-trivial, basic Hadoop installation includes at least these components:
- Hadoop Common: the basic infrastructure necessary for running all components and applications
- HDFS: the Hadoop Distributed File System
- MapReduce: the framework for large data set distributed processing
- Pig: an optional, high-level language for parallel computation and data flow
Enterprise users often choose CDH because of:
- Flume: a distributed service for efficient large data transfers in real-time
- Sqoop: a tool for importing relational databases into Hadoop clusters
Apache Hadoop Development Deployment
The steps in this section must be repeated for every node in a Hadoop cluster. Downloads, installation, and configuration could be automated with shell scripts. All these steps are performed as the service user hadoop, defined in the prerequisites section.
http://hadoop.apache.org/common/releases.html has the latest version of the common tools. This guide used version 0.20.2.
- Download Hadoop from a mirror and unpack it in the /home/hadoop work directory.
- Set the JAVA_HOME environment variable.
- Set the run-time environment:
| Listing 4 - Set the Hadoop Runtime Environment |
|
Configuration
Pseudo-distributed operation (each daemon runs in a separate Java process) requires updates to core-site.xml, hdfs-site.xml, and the mapred-site.xml. These files configure the master, the file system, and the MapReduce framework and live in the runtime/conf directory.
| Listing 5 - Pseudo-Distributed Operation Config |
|
These files are documented in the Apache Hadoop Clustering reference, http://is.gd/E32L4s — some parameters are discussed in this Refcard’s production deployment section.
Test the Hadoop Installation
Hadoop requires a formatted HDFS cluster to do its work:
hadoop namenode -format
The HDFS volume lives on top of the standard file system. The format command will show this upon successful completion:
/tmp/dfs/name has been successfully formatted.
Start the Hadoop processes and perform these operations to validate the installation:
- Use the contents of runtime/conf as known input
- Use Hadoop for finding all text matches in the input
- Check the output directory to ensure it works
| Listing 6 - Testing the Hadoop Installation |
|

- View the output files in the HDFS volume and stop the Hadoop daemons to complete testing the install
| Listing 7 - Job Completion and Daemon Termination |
|
That’s it! Apache Hadoop is installed in your system and ready for development.
CDH Development Deployment
CDH removes a lot of grueling work from the Hadoop installation process by offering ready-to-go packages for mainstream Linux server distributions. Compare the instructions in Listing 8 against the previous section. CDH simplifies installation and configuration for huge time savings.
| Listing 8 - Installing CDH |
|
Leveraging some or all of the extra components in Hadoop or CDH is another good reason for using it over the Apache version. Install Flume or Pig with the instructions in Listing 9.
| Listing 9 - Adding Optional Components |
|
Test the CDH Installation
The CDH daemons are ready to be executed as services. There is no need to create a service account for executing them. They can be started or stopped as any other Linux service, as shown in Listing 10.
| Listing 10 - Starting the CDH Daemons |
|
CDH will create an HDFS partition when its daemons start. It’s another convenience it offers over regular Hadoop. Listing 11 shows how to validate the installation by:
- Listing the HDFS module
- Moving files to the HDFS volume
- Running an example job
- Validating the output
| Listing 11 - Testing the CDH Installation |
|
The daemons will continue to run until the server stops. All the Hadoop services are available.
Monitoring the Local Installation
Use a browser to check the NameNode or the JobTracker state through their web UI and web services interfaces. All daemons expose their data over HTTP. Users can choose to monitor a node or daemon interactively using the web UI, as in Figure 2. Developers, monitoring tools, and system administrators can use the same ports for tracking the system performance and state using web service calls.
Figure 2 - NameNode status web UI
The web interface can be used for monitoring the JobTracker, which dispatches tasks to specific nodes in a cluster, the DataNodes, or the NameNode, which manages directory namespaces and file nodes in the file system.
HADOOP MONITORING PORTS
Use the information in Table 1 for configuring a development workstation or production server firewall.
| Port | Service |
| 50030 | JobTracker |
| 50060 | TaskTrackers |
| 50070 | NameNode |
| 50075 | DataNodes |
| 50090 | Secondary NameNode |
| 50105 | Backup Node |
Table 1 - Hadoop ports
Plugging a Monitoring Agent
The Hadoop daemons also expose internal data over a RESTful interface. Automated monitoring tools like Nagios, Splunk, or SOBA can use them. Listing 12 shows how to fetch a daemon’s metrics as a JSON document:
| Listing 12 - Fetching Daemon Metrics |
| http://localhost:50070/metrics?format=json |
All the daemons expose these useful resource paths:
- /metrics - various data about the system state
- /stacks - stack traces for all threads
- /logs - enables fetching logs from the file system
- /logLevel - interface for setting log4j logging levels
Each daemon type also exposes one or more resource paths specific to its operation. A comprehensive list is available from: http://is.gd/MBN4qz
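As an illustration of how a monitoring agent could consume these resource paths, here is a minimal Java sketch that builds the metrics URL from Listing 12 and fetches it over HTTP. The class and method names are hypothetical, the timeouts are arbitrary choices, and the code assumes a daemon is actually listening on the given host and port (50070 is the NameNode port from Table 1).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Fetches a Hadoop daemon's metrics as JSON using only the JDK. */
final class DaemonMetricsSketch {

    /** Builds the /metrics resource URL for a daemon, requesting JSON output. */
    static URI metricsUri(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/metrics?format=json");
    }

    /** Performs an HTTP GET and returns the response body as text. */
    static String fetch(URI uri) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) uri.toURL().openConnection();
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Requires a running NameNode on the local machine
        System.out.println(fetch(metricsUri("localhost", 50070)));
    }
}
```

Swapping the port for another entry in Table 1 targets a different daemon; the other resource paths listed above can be fetched the same way.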
APACHE HADOOP PRODUCTION DEPLOYMENT
The fastest way to deploy a Hadoop cluster is by using the prepackaged tools in CDH. They include all the same software as the Apache Hadoop distribution but are optimized to run in production servers and with tools familiar to system administrators.

Figure 3 - Hadoop Computational Network
The deployment diagram in Figure 3 describes all the participating nodes in a computational network. The basic procedure for deploying a Hadoop cluster is:
- Pick a Hadoop distribution
- Prepare a basic configuration on one node
- Deploy the same pre-configured package across all machines in the cluster
- Configure each machine in the network according to its role
The Apache Hadoop documentation shows this as a rather involved process. The value added by CDH is that most of that work is already in place. Role-based configuration is very easy to accomplish. The rest of this Refcard will be based on CDH.
Handling Multiple Configurations: Alternatives
Each server role will be determined by its configuration, since they will all have the same software installed. CDH supports the Ubuntu and Red Hat mechanism for handling alternative configurations.

The Linux alternatives mechanism ensures that all files associated with a specific package are selected as a system default. This customization is where all the extra work went into CDH. The CDH installation uses alternatives to set the effective CDH configuration.
Setting Up the Production Configuration
Listing 13 takes a basic Hadoop configuration and sets it up for production.
| Listing 13 - Set the Production Configuration |
|
The server will restart all the Hadoop daemons using the new production configuration.
Figure 4 - Hadoop Conceptual Topology
Readying the NameNode for Hadoop
Pick a node from the cluster to act as the NameNode (see Figure 3). All Hadoop activity depends on having a valid R/W file system. Format the distributed file system from the NameNode, using user hdfs:
| Listing 14 - Create a New File System |
| sudo -u hdfs hadoop namenode -format |
Stop all the nodes to complete the file system, permissions, and ownership configuration. Optionally, set daemons for automatic startup using rc.d.
| Listing 15 - Stop All Daemons |
|
File System Setup
Every node in the cluster must be configured with appropriate directory ownership and permissions. Execute the commands in Listing 16 in every node:
| Listing 16 - File System Setup |
|
Starting the Cluster
- Start the NameNode to make HDFS available to all nodes
- Set the MapReduce owner and permissions in the HDFS volume
- Start the JobTracker
- Start all other nodes
CDH daemons are defined in /etc/init.d — they can be configured to start along with the operating system or they can be started manually. Execute the command appropriate for each node type using this example:
| Listing 17 - Starting a Node Example |
|
Use jobtracker, datanode, tasktracker, etc. corresponding to the node you want to start or stop.

| Listing 18 - Set the MapReduce Directory Up |
|
Update the Hadoop Configuration Files
| Listing 19 - Minimal HDFS Config Update |
|
The last step consists of configuring the MapReduce nodes to find their local working and system directories:
| Listing 20 - Minimal MapReduce Config Update |
|
Start the JobTracker and all other nodes. You now have a working Hadoop cluster. Use the commands in Listing 11 to validate that it’s operational.
WHAT’S NEXT?
The instructions in this Refcard result in a working development or production Hadoop cluster. Hadoop is a complex framework and requires attention to configure and maintain it. Review the Apache Hadoop and Cloudera CDH documentation. Pay particular attention to the sections on:
- How to write MapReduce, Pig, or Hive applications
- Multi-node cluster management with ZooKeeper
- Hadoop ETL with Sqoop and Flume
Happy Hadoop computing!
STAYING CURRENT
Do you want to know about specific projects and use cases where Hadoop and data scalability are the hot topics? Join the scalability newsletter: http://ciurana.eu/scalablesystems
About The Authors

Eugene Ciurana
Eugene Ciurana (http://eugeneciurana.eu) is the VP of Technology at Badoo.com, the largest dating site worldwide, and cofounder of SOBA Labs, the most sophisticated public and private clouds management software. Eugene is also an open-source evangelist who specializes in the design and implementation of mission-critical, high-availability systems. He recently built scalable computational networks for leading financial, software, insurance, SaaS, government, and healthcare companies in the US, Japan, Mexico, and Europe.
Publications
- Developing with Google App Engine, Apress
- DZone Refcard #117: Getting Started with Apache Hadoop
- DZone Refcard #105: NoSQL and Data Scalability
- DZone Refcard #43: Scalability and High Availability
- The Tesla Testament: A Thriller, CIMEntertainment
Thank You!
Thanks to all the technical reviewers, especially to Pavel Dovbush at http://dpp.su
Recommended Book
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open-source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems; programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Eugene Ciurana is an open-source evangelist who specializes in the design and implementation of mission-critical, high-availability large scale systems.
ServiceMix 4.2
The Apache Open Source ESB
By Jos Dirksen
8,730 Downloads · Refcard 65 of 151 (see them all)
Getting Started with ServiceMix 4.0
By Jos Dirksen
About ServiceMix 4.0
In the open source community there are many different solutions for each problem. When you look for an open source ESB, however, you don't have that many options. Even though there are many open source ESB projects, not all of them are mature enough to be used to solve enterprise mission critical integration problems. ServiceMix is one of the open source projects that is mature enough to be used in these scenarios. ServiceMix, an Apache project, has been around for a couple of years now. It provides all the features you expect from an ESB such as routing, transformation, etc. The previous version was built based on JBI (JSR-208), but in its latest iteration, which we're discussing in this Refcard, ServiceMix has moved to an OSGi based architecture, which we'll discuss later on.
This DZone Refcard will provide an overview of the core elements of ServiceMix 4.0 and will show you how to use ServiceMix 4 by providing example configurations.
ServiceMix 4.0 Architecture
Before we show how to configure ServiceMix 4.0 for use, let us first look at the architecture of ServiceMix 4.0. This figure shows the following components:

ServiceMix Kernel: In this figure you can see that the basis of ServiceMix 4 is the ServiceMix Kernel. This kernel, which is based on the Apache Felix Karaf project (an OSGi based runtime), handles the core features ServiceMix provides, such as hot-deployment, provisioning of libraries or applications, remote access using ssh, JMX management and more.
ServiceMix NMR: This component, a normalized message router, handles all the routing of messages within ServiceMix and is used by all the other components.
ActiveMQ: ActiveMQ, another Apache project, is the message broker which is used to exchange messages between components. Besides this, ActiveMQ can also be used to create a fully distributed ESB.
Web: ServiceMix 4 also provides a web component. You can use this to start ServiceMix 4 embedded in a web application. An example of this is provided in the ServiceMix distribution.
JBI compatibility layer: The previous version of ServiceMix was based on JBI 1.0. For JBI a lot of components (from ServiceMix, but also from other parties), are available. This layer provides compatibility with the JBI specification, so that all the components from the previous version of ServiceMix can run on ServiceMix 4. Be sure though to use the 2009.01 version of these components.
Camel NMR: ServiceMix 4 provides a couple of different ways you can configure routing. You can use the endpoints provided by the ServiceMix NMR, but you can also use more advanced routing engines. One of those is the Camel NMR. This component allows you to run Camel based routes on ServiceMix.
CXF NMR: Besides an NMR based on Camel, ServiceMix also provides an NMR based on CXF. You can use this NMR to expose and route to Java POJOs annotated with JAX-WS annotations.

OSGi runtime
ServiceMix runs on an OSGi based kernel, but what is OSGi? In short, an OSGi container provides a service based in-VM platform on which you can deploy services and components dynamically. OSGi provides strict classloading separation and forces you to think about the dependencies your components have. Besides that, OSGi also defines a simple lifecycle model for your services and components. This results in an environment where you can easily add and remove components and services at runtime, and allows the creation of modular applications. An added advantage of using an OSGi container is that you can use many components out of the box: remote administration, a web container, configuration and preferences services, etc.
Before we move on to the next part, let's have a quick look at how a message is processed by ServiceMix. The following figure shows how a message is routed by the NMR. In this case we're showing a request / response (in-out) message pattern.

In this figure you can see a number of steps being executed:
- The consumer creates a message exchange for a specific service and sends a request.
- The NMR determines the provider this exchange needs to be sent to and queues the message for delivery. The provider accepts this message and executes its business logic.
- After the provider has finished processing, the response message is returned to the NMR.
- The NMR once again queues the message for delivery. This time to the consumer. The consumer accepts the message.
- After the response is accepted, the consumer sends a confirmation to the NMR.
- The NMR routes this confirmation to the provider, who accepts it and ends this message exchange.
Now that we've seen the architecture and how a message is handled by the NMR, we'll have a look at how to configure ServiceMix 4.
Configuration of ServiceMix 4.0
ServiceMix 4 configuration is mostly done through Spring XML files supported by XML schemas for easy code completion. Let's look at two simple examples. The first one uses the File Binding component to poll a directory and the second one exposes a Web service using ServiceMix's CXF support.
<beans xmlns:file="http://servicemix.apache.org/file/1.0"
       xmlns:dzone="http://servicemix.org/dzone/">
  <!-- Poll the inbox directory and send each file to the target service -->
  <file:poller service="dzone:filePoller"
               endpoint="filePoller"
               targetService="dzone:fileSender"
               file="inbox" />
</beans>
In this listing you can see that we define a poller. A poller is one of the standard components that is provided by ServiceMix's file-binding-component. If we deploy this configuration to ServiceMix, ServiceMix will start polling the inbox directory for files. If it finds one, the file will be sent to the specified targetService.

Service Addressing
An important concept to understand when working with ServiceMix is that of services and endpoints. When you configure services on a component you need to tell ServiceMix how to route messages to and from that service. The name used for this is called a service endpoint. If you look back at the previous example, we created a file:poller. On this file:poller we defined a service and an endpoint attribute. These two attributes together uniquely identify this file:poller. Note though that you can have multiple endpoints defined on the same service. You can also see a targetService attribute on the file:poller. Besides this attribute there is also a targetEndpoint attribute. With these two attributes you identify the service endpoint to send the message to. The targetEndpoint isn't needed if only one endpoint is registered on that service.
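The addressing rules can be pictured as a lookup in a registry keyed by service name. The plain-Java sketch below is only a mental model, not ServiceMix code: the class and method names are hypothetical, and rejecting an ambiguous lookup with an exception is an assumption made for the sketch rather than documented NMR behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of resolving a service endpoint from targetService and targetEndpoint. */
final class EndpointRegistrySketch {

    private final Map<String, List<String>> endpointsByService = new HashMap<>();

    /** Registers an endpoint on a service; a service may have several endpoints. */
    void register(String service, String endpoint) {
        endpointsByService.computeIfAbsent(service, key -> new ArrayList<>()).add(endpoint);
    }

    /** Resolves the endpoint to send to; targetEndpoint may be null. */
    String resolve(String targetService, String targetEndpoint) {
        List<String> endpoints =
                endpointsByService.getOrDefault(targetService, Collections.emptyList());
        if (targetEndpoint != null) {
            if (!endpoints.contains(targetEndpoint)) {
                throw new IllegalArgumentException("Unknown endpoint: " + targetEndpoint);
            }
            return targetEndpoint;
        }
        // Without a targetEndpoint the service alone must identify the endpoint
        if (endpoints.size() != 1) {
            throw new IllegalStateException("targetEndpoint required for " + targetService);
        }
        return endpoints.get(0);
    }
}
```

With a single endpoint registered on a service, passing null for the target endpoint is enough; once a second endpoint is registered, the sketch requires the explicit name.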
In the following listing, we've again used a simple XML file. This time we've configured a webservice.
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:jaxws="http://cxf.apache.org/jaxws"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans
         http://www.springframework.org/schema/beans/spring-beans.xsd
         http://cxf.apache.org/jaxws http://cxf.apache.org/schemas/jaxws.xsd">
  <!-- Load the CXF core and its SOAP, HTTP, and OSGi extensions -->
  <import resource="classpath:META-INF/cxf/cxf.xml" />
  <import resource="classpath:META-INF/cxf/cxf-extension-soap.xml" />
  <import resource="classpath:META-INF/cxf/cxf-extension-http.xml" />
  <import resource="classpath:META-INF/cxf/osgi/cxf-extension-osgi.xml" />
  <!-- Expose the annotated POJO as a web service at /HelloWorld -->
  <jaxws:endpoint id="helloWorld"
                  implementor="dzone.refcards.HelloWorld"
                  address="/HelloWorld"/>
</beans>
In this listing we use a jaxws:endpoint to define a webservice. The implementor points to a simple POJO annotated with JAX-WS annotations. If this example is deployed to ServiceMix, ServiceMix will register a webservice based on the value in the address attribute.
Deployment of ServiceMix 4 Components
ServiceMix provides a number of different options which you can use to deploy artifacts. In this section we'll look at these options, and show you how to use these.
ServiceMix 4, deployment options
| Name | Description |
| OSGi Bundles | ServiceMix 4 is built around OSGi and ServiceMix 4 also allows you to deploy your configurations as an OSGi bundle with all the advantages OSGi provides. |
| Spring XML files | ServiceMix 4 supports plain Spring XML files. |
| JBI artifacts | You can also deploy artifacts following the JBI standard (service assemblies and service units) to ServiceMix 4. |
| Feature descriptors | This is a Karaf specific way for installing applications. It will install the necessary OSGi bundles and will add configuration defaults. This is mostly used to install core parts of the ServiceMix distribution. |
OSGi bundle deployment
The easiest way to create an OSGi based ServiceMix bundle is by using Maven 2. To create a bundle you need to take a couple of simple steps. The first one is adding the maven-bundle-plugin to your pom.xml file. This is shown in the following code fragment.
...
<dependencies>
  <dependency>
    <groupId>org.apache.felix</groupId>
    <artifactId>org.osgi.core</artifactId>
    <version>1.0.0</version>
  </dependency>
  ...
</dependencies>
...
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <!-- needed so Maven recognizes the "bundle" packaging type -->
      <extensions>true</extensions>
      <configuration>
        <!-- bnd instructions that end up in the bundle manifest -->
        <instructions>
          <Bundle-SymbolicName>${pom.artifactId}</Bundle-SymbolicName>
          <Import-Package>*,org.apache.camel.osgi</Import-Package>
          <Private-Package>org.apache.servicemix.examples.camel</Private-Package>
        </instructions>
      </configuration>
    </plugin>
  </plugins>
</build>
...
The important part here is the instructions section. This determines how the plugin packages your project. For more information on these settings see the Maven bundle plugin page at http://cwiki.apache.org/FELIX/apache-felix-maven-bundle-plugin-bnd.html.
The next step is to make sure your project is packaged as an OSGi bundle. You do this by setting the <packaging> element in your pom.xml to bundle.
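As a rough sketch, the top of such a pom.xml could look like the fragment below. The groupId, artifactId and version are placeholder values (borrowed from the archetype command used later in this section), not values the build requires.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- placeholder coordinates, replace with your own -->
  <groupId>com.yourcompany</groupId>
  <artifactId>camel-router</artifactId>
  <version>1.0.0</version>

  <!-- tells Maven to build an OSGi bundle instead of a plain jar -->
  <packaging>bundle</packaging>
</project>
```

The bundle packaging type is contributed by the maven-bundle-plugin, so it only works when that plugin is declared in the build section.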
Now you can use mvn install to create an OSGi bundle, which you can copy to the deploy directory of ServiceMix and your bundle will be installed. If you use Spring to configure your application, make sure the Spring configuration files are located in the META-INF/spring directory. That way the Spring application context will be automatically created based on these files.
If you don't want to do this by hand you can also use a Maven archetype. ServiceMix provides a set of archetypes you can use. A good starting point for a project is the Camel OSGi archetype, which you can use by executing the following Maven command:
mvn archetype:create -DarchetypeGroupId=org.apache.servicemix.tooling \
  -DarchetypeArtifactId=servicemix-osgi-camel-archetype \
  -DarchetypeVersion=4.0.0.2-fuse \
  -DgroupId=com.yourcompany -DartifactId=camel-router \
  -DremoteRepositories=http://repo.fusesource.com/maven2/
There are many other archetypes available. For an overview of the available archetypes see: http://repo.fusesource.com/maven2/org/apache/servicemix/tooling/
Spring XML Files Deployment
It's also possible to deploy Spring files without OSGi. Just drop a Spring file into the deploy directory. There are two points to take into account. First, you need to add the following to your Spring configuration file:
<bean class="org.apache.servicemix.common.osgi.EndpointExporter" />
This will register the endpoints you've configured in your Spring file. The next element is optional but is good practice to add:
<manifest>
Bundle-Version = 1.0.0
Bundle-Name = Dzone :: Dzone test application
Bundle-SymbolicName = dzone.refcards.test
Bundle-Description = An example for servicemix refcard
Bundle-Vendor = jos.dirksen@gmail.com
Require-Bundle = servicemix-file, servicemix-eip
</manifest>
Using a manifest configuration element allows you to specify how your application is registered in ServiceMix.
JBI artifacts deployment
If you've already invested in JBI based applications, you can still use ServiceMix 4 to run them. Just copy your Service Assembly (SA) into the ServiceMix deploy directory and ServiceMix will deploy your application.
Feature descriptor based deployment
If you've got an application which contains many bundles and that requires additional configuration you can use a feature to easily manage this. A feature contains a set of bundles and configuration which can be easily installed from the ServiceMix console. The following listing shows the feature descriptor of the nmr component.
<features>
  <feature name="nmr" version="1.0.0">
    <bundle>mvn:org.apache.servicemix.document/org.apache.servicemix.document/1.0.0</bundle>
    <bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.api/1.0.0</bundle>
    <bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.core/1.0.0</bundle>
    <bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.osgi/1.0.0</bundle>
    <bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.spring/1.0.0</bundle>
    <bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.commands/1.0.0</bundle>
    <bundle>mvn:org.apache.servicemix.nmr/org.apache.servicemix.nmr.management/1.0.0</bundle>
  </feature>
</features>
If you want to install this feature you can just type features/install nmr from the ServiceMix console.
Routing in ServiceMix 4.0
For routing in ServiceMix you've got two options:
- EIP: ServiceMix provides a JBI component that implements a number of Enterprise Integration Patterns.
- Camel: You can use Camel routes in ServiceMix. Camel provides the most flexible and exhaustive routing options for ServiceMix.
EIP Component Routing
This routing is provided by the EIP component. To check whether it is installed in your ServiceMix runtime, execute features/list from the ServiceMix command line. This shows a list of available features and their state. If you see [installed] [ 2009.01] servicemix-eip, the component is installed. If it shows uninstalled instead of installed, use features/install servicemix-eip to install the component. You can now use this router from a simple XML file:
<eip:static-routing-slip service="test:routingSlip" endpoint="endpoint">
<eip:targets>
<eip:exchange-target service="test:echo" />
<eip:exchange-target service="test:echo" />
</eip:targets>
</eip:static-routing-slip>
When installed this component provides the following routing options (this information is also available in the XSD of this component):
| XML Element | Description |
| async-bridge | The async bridge pattern is used to bridge an In-Out exchange with two In-Only (or Robust-In-Only) exchanges. This pattern is the opposite of the pipeline. |
| content-based-router | Component that can be used for content based routing of the message. You can configure this component with a set of predicates which define how the message is routed. |
| content-enricher | A content enricher can be used to add extra information to the message from a different source. |
| message-filter | With a message filter you specify a set of predicates which determine whether to process the message or not. |
| pipeline | The pipeline component is a bridge between an In-Only (or Robust-In-Only) MEP and an In-Out MEP. This is the opposite of the async bridge. |
| resequencer | A resequencer can be used to re-order a set of incoming messages before passing them on in the new order. |
| split-aggregator | A split aggregator is used to reassemble messages that have been split by a splitter. |
| static-recipient-list | A static recipient list will forward the incoming message to a set of predefined destinations. |
| static-routing-slip | The static routing slip routes a message through a set of services. It uses the result of the first invocation as input for the next. |
| wire-tap | The wire-tap will copy and forward a message to the specified destination. |
| xpath-splitter | This splitter uses an XPath expression to split an incoming message into multiple parts. |
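The other patterns in this table are configured in the same style as the static routing slip. As an illustration, the sketch below sets up a wire-tap that forwards each message to its real target while copying the incoming message to a second service. The service names are hypothetical, and the child element names (target, inListener) are written from memory of the component's schema, so check them against the XSD of your servicemix-eip version.

```xml
<eip:wire-tap service="test:wireTap" endpoint="endpoint">
  <!-- the service that actually handles the exchange -->
  <eip:target>
    <eip:exchange-target service="test:echo" />
  </eip:target>
  <!-- receives a copy of every incoming message -->
  <eip:inListener>
    <eip:exchange-target service="test:trace" />
  </eip:inListener>
</eip:wire-tap>
```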
Camel Routing
Apache Camel is a project which provides a lot of different routing and integration options. In this section we'll show how to use Camel with ServiceMix and give an overview of the routing options it provides. Installing the Camel component in ServiceMix is done in the same way as we did for the EIP component. We use the features/list command to check what's already installed and we can use features/install to add new Camel functionality. Once installed we can use Camel to route messages between our components. Camel provides two types of configuration: XML and a Java based DSL. XML configuration was used for the following two listings:
Camel XML configuration - Listing 1: Camel configuration
Camel XML configuration - Listing 2: Target service
In these two listings you can see how we can easily integrate the Camel routes with the other components from ServiceMix. We use the nmr prefix to tell Camel to send the message to the NMR. The other service, which can be deployed separately, will then pick up this message since it's also configured to listen to an nmr prefixed service.
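The general shape of such a pair of routes can be sketched as follows. This is an illustration, not the original listings: the timer and log endpoints and the service name dzoneTarget are made up, the namespace shown is the Camel 2.x one (Camel 1.x releases used http://activemq.apache.org/camel/schema/spring), and the sketch assumes the nmr Camel component has already been registered in the Spring context.

```xml
<!-- Bundle 1: sends a message to the NMR every five seconds -->
<camelContext xmlns="http://camel.apache.org/schema/spring">
  <route>
    <from uri="timer:dzone?period=5000" />
    <to uri="nmr:dzoneTarget" />
  </route>
</camelContext>

<!-- Bundle 2 (deployed separately): listens on the same NMR service name -->
<camelContext xmlns="http://camel.apache.org/schema/spring">
  <route>
    <from uri="nmr:dzoneTarget" />
    <to uri="log:dzone" />
  </route>
</camelContext>
```

The only coupling between the two bundles is the name after the nmr prefix, which is what allows them to be deployed and replaced independently.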
Now let's look at two listings that use Camel's Java based DSL to configure the routes. For this we need a small XML file describing where the routes can be found, and a Java file which contains the routing.
Camel Java configuration - Listing 1: Spring configuration
Camel Java configuration - Listing 2: Java route
Camel itself provides a lot of standard functionality. It doesn't just provide routing; it can also provide connectivity for different technologies. For more information on Camel please see its website at http://camel.apache.org/ or look at the "Enterprise Integration Patterns with Camel" Refcard.

Differences between ServiceMix and Camel
If you've looked at the Camel website you'll notice that it provides much the same functionality as ServiceMix. It provides connectivity to various standards and technologies, provides routing and transformation, and even allows you to expose Web services. The main difference, though, is that Camel isn't a container. Camel is designed to be used inside some other container. We've shown that you can use Camel in ServiceMix, but you can also use Camel in other ESBs or in ActiveMQ or CXF. So if you just want a routing and mediation engine, Camel is a good choice. If, however, you need a full ESB with good support for JBI, a flexible OSGi based kernel, hot-deploy and easy administration, ServiceMix is the better choice.
ServiceMix and web services
Support for Web services is an important feature for an ESB. ServiceMix uses the CXF project for this. Since CXF is also completely Spring based, using CXF to deploy Web services is very easy.
Hosting Web services
When you want to expose a service as a Web service you can easily do this using CXF. Just create a CXF OSGi bundle using the archetype servicemix-osgi-cxf-code-first-archetype. This will create an OSGi and CXF enabled Maven project which you can use to develop Web services. Now just edit the src/main/resources/META-INF/spring/beans.xml file, and after you've run the mvn install command you can deploy the bundle to ServiceMix. The following listing shows such an example. This will create a Web service and host it on http://localhost:8080/cxf/HelloDzone.
CXF Host Web service example using CXF
In the previous example we hosted a Web service which could be called from outside the container. You can also configure CXF to host the Web service internally by prefixing the address with nmr. That way you can easily expose JAX-WS annotated Java beans to the other services inside the ESB. The following example shows this:
CXF Host Web service internally
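A minimal sketch of the idea, reusing the jaxws:endpoint element from the earlier listing: the only change is the nmr prefix in the address. The bean id, implementor class and service name below are hypothetical.

```xml
<!-- address starts with "nmr:", so the service is registered on the NMR
     instead of being exposed over HTTP -->
<jaxws:endpoint id="helloDzoneInternal"
                implementor="dzone.refcards.HelloDzone"
                address="nmr:HelloDzone" />
```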
You can also host a Web service using the servicemix-cxf-bc component.
Host Web service using the servicemix-cxf-bc component
Consuming Web services
Consuming Web services in ServiceMix is just as easy. ServiceMix provides two different options for this. You can use Camel or use the servicemix-cxf-bc component:
Consume Web service using the servicemix-cxf-bc component
With this configuration you can consume a Web service which is located at http://webservice.com/Service and which is defined by the WSDL file target-service.wsdl. Other services can use this component by making a call to the dzone:ServicePortService.
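A configuration matching this description could look roughly like the sketch below. The URL, WSDL file name and service name are the ones mentioned above; the dzone namespace URI, the endpoint name and the interface name are assumptions, and the attribute names should be checked against the XSD of your servicemix-cxf-bc version.

```xml
<beans xmlns:cxfbc="http://servicemix.apache.org/cxfbc/1.0"
       xmlns:dzone="http://dzone.refcards/service">

  <!-- a provider endpoint makes an external Web service available on the NMR -->
  <cxfbc:provider wsdl="classpath:target-service.wsdl"
                  locationURI="http://webservice.com/Service"
                  service="dzone:ServicePortService"
                  endpoint="ServicePort"
                  interfaceName="dzone:ServicePort" />
</beans>
```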
You can also consume a Web service using Camel. For more information on how you can configure the Camel route for this, look at the Camel CXF integration section of the Camel website: http://camel.apache.org/cxf.html.
For Web services ServiceMix provides the following useful archetypes:
| Name | Description |
| servicemix-cxf-bc-service-unit | Create a Maven project which uses the JBI CXF binding component. |
| servicemix-cxf-se-service-unit | Create a Maven project which uses the JBI CXF service engine. |
| servicemix-cxf-se-wsdl-first-service-unit | Create a Maven project which uses the JBI CXF service engine. This project is based on WSDL first development. |
| servicemix-osgi-cxf-code-first-archetype | Create a Maven project which uses CXF and OSGi together. This project is based on code first development. |
| servicemix-osgi-cxf-wsdl-first-archetype | Create a Maven project which uses CXF and OSGi together. This project is based on WSDL first development. |
ServiceMix Components
Besides integration with Web services through CXF, ServiceMix provides a lot of components you can use out of the box to integrate with various other standards and technologies. In this section we'll give an overview of these components. This list is based on the 2009.01 versions. Most of this information can also be found in the XML schemas of these components.
ServiceMix Components
| XML Element | Description |
| ServiceMix Bean | |
| Endpoint | Allows you to define a simple bean that can receive and send message exchanges. |
| ServiceMix File | |
| Poller | A polling endpoint that looks for a file or files in a directory and sends the files to a target service. You can configure various options on this endpoint such as archiving, filters, use of subdirectories etc. |
| Sender | An endpoint that receives messages from the NMR and writes them to a specific file or directory. |
| ServiceMix CXF Binding Component | |
| Consumer | A consumer endpoint that is capable of using SOAP/HTTP or SOAP/JMS. |
| Provider | A provider endpoint that is capable of exposing SOAP/HTTP or SOAP/JMS services. |
| ServiceMix CXF Service Engine | |
| Endpoint | Exposes a JAX-WS annotated POJO as a service on the NMR. |
| ServiceMix Drools | |
| Endpoint | With the Drools Endpoint you can use a Drools rule set as a service or as a router. |
| ServiceMix FTP | |
| Poller | This endpoint can be used to poll an FTP directory for files, download them and send them to a service. |
| Sender | With a sender endpoint you can store a message on an FTP server. |
| ServiceMix HTTP | |
| Consumer | Plain HTTP consumer endpoint. This endpoint can be used to handle plain HTTP requests (without SOAP) or to process the request in a non-standard way. |
| Provider | A plain HTTP provider. This type of endpoint can be used to send non-SOAP requests to HTTP endpoints. |
| Soap-Consumer | An HTTP consumer endpoint that is optimized to work with SOAP messages. |
| Soap-Provider | An HTTP provider endpoint that is optimized to work with SOAP messages. |
| ServiceMix JMS | |
| Consumer | An endpoint that can receive messages from a JMS broker. |
| Provider | An endpoint that can send messages to a JMS broker. |
| Soap-Consumer | A JMS consumer that is optimized to work with SOAP messages. |
| Soap-Provider | A JMS provider that is optimized to work with SOAP messages. |
| JCA-Consumer | A JMS consumer that uses JCA to connect to the JMS broker. |
| ServiceMix Mail | |
| Poller | An endpoint which can be used to retrieve messages. |
| Sender | An endpoint which you can use to send messages. |
| ServiceMix OSWorkflow | |
| Endpoint | This endpoint can be used to start an OSWorkflow process. |
| ServiceMix Quartz | |
| Endpoint | The Quartz endpoint can be used to fire messages into the NMR at specific intervals. |
| ServiceMix Saxon | |
| XSLT | With the XSLT endpoint you can apply an XSLT transformation to the received message. |
| Proxy | The proxy component allows you to transform an incoming message and send it to an endpoint. You can also configure a transformation that needs to be applied to the result of that invocation. |
| XQuery | The XQuery endpoint can be used to apply a selected XQuery to the input document. |
| ServiceMix Scripting | |
| Endpoint | With the scripting endpoint you can create a service which is implemented using a scripting language. The following languages are supported: Groovy, JRuby, Rhino JavaScript |
| ServiceMix SMPP | |
| Consumer | A polling component which binds with jSMPP, receives SMPP messages and sends them into the NMR as messages. |
| Provider | A provider component which receives an XML message from the NMR, converts it into an SMPP packet and sends it to an SMPP server. |
| ServiceMix SNMP | |
| Poller | With this poller you can receive SNMP events by using the SNMP4J library. |
| ServiceMix Validation | |
| Endpoint | With this endpoint you can provide schema validation of documents using JAXP 1.3 and XMLSchema or RelaxNG. |
| ServiceMix-VFS | |
| Poller | A polling endpoint that looks for a file or files in a virtual file system (based on Apache commons-vfs) and sends the files to a target service. |
| Sender | An endpoint which receives messages from the NMR and writes the message to the virtual file system. |
| ServiceMix-wsn2005 | |
| Create-pullpoint | Lets you create a WS-Notification pull point that can be used by a requester to retrieve accumulated notification messages. |
| Publisher | Sends messages to a specific topic. |
| Registerpublisher | An endpoint that can be used by publishers to register themselves. |
| Subscribe | Lets you create subscriptions to a specific topic using the WS-Notification specification. |
About The Author

Jos Dirksen
Jos Dirksen is a software architect for Atos Origin, where he has been the architect for a number of large integration projects over the last couple of years. Jos has worked with various integration products, commercial and open source, for the last five years. He co-authored the book Open Source ESBs in Action, and regularly presents on topics ranging from enterprise integration patterns to JavaFX and OSGi, at such conferences as Devoxx and JavaOne.
Recommended Book
Open-Source ESBs in Action describes how to use ESBs in realworld situations. You will learn how the various features of an ESB such as transformation, routing, security, connectivity, and more can be implemented on the example of two open-source ESB implementations: Mule and ServiceMix.

Apache Solr
Getting Optimal Search Results
By Chris Hostetter
ABOUT SOLR
Solr makes it easy for programmers to develop sophisticated, high performance search applications with advanced features such as faceting, dynamic clustering, database integration and rich document handling.
Solr (http://lucene.apache.org/solr/) is the HTTP based server product of the Apache Lucene Project. It uses the Lucene Java library at its core for indexing and search technology, as well as spell checking, hit highlighting, and advanced analysis/tokenization capabilities.
The fundamental premise of Solr is simple. You feed it a lot of information, then later you can ask it questions and find the piece of information you want. Feeding in information is called indexing or updating. Asking a question is called querying.
Figure 1: A typical Solr setup
Core Solr Concepts
Solr’s basic unit of information is a document: a set of information that describes something, like a class in Java. Documents themselves are composed of fields. These are more specific pieces of information, like attributes in a class.
RUNNING SOLR
Solr Installation
The LucidWorks for Solr installer (http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr) makes it easy to set up your initial Solr instance. The installer brings you through configuration and deployment of the Web service on either Jetty or Tomcat.
Solr Home Directory
Solr Home is the main directory where Solr will look for configuration files, data and plug-ins.
When LucidWorks is installed at ~/LucidWorks the Solr Home directory is ~/LucidWorks/lucidworks/solr/.
Single Core and Multicore Setup
By default, Solr is set up to manage a single “Solr Core” which contains one index. It is also possible to segment Solr into multiple virtual instances of cores, each with its own configuration and indices. Cores can be dedicated to a single application, or to different ones, but all are administered through a common administration interface.
Multiple Solr Cores can be configured by placing a file named solr.xml in your Solr Home directory, identifying each Solr Core, and the corresponding instance directory for each. When using a single Solr Core, the Solr Home directory is automatically the instance directory for your Solr Core.
Configuration of each Solr Core is done through two main config files, both of which are placed in the conf subdirectory for that Core:
- schema.xml: where you describe your data
- solrconfig.xml: where you describe how people can interact with your data.
By default, Solr will store the index inside the data subdirectory for that Core.
Solr Administration
Administration for Solr can be done through http://[hostname]:8983/solr/admin which provides a section with menu items for monitoring indexing and performance statistics, information about index distribution and replication, and information on all threads running in the JVM at the time. There is also a section where you can run queries, and an assistance area.
SCHEMA.XML
To build a searchable index, Solr takes in documents composed of data fields of specific field types. The schema.xml configuration file defines the field types and specific fields that your documents can contain, as well as how Solr should handle those fields when adding documents to the index or when querying those fields. The schema.xml file is structured as follows:
<schema>
<types>
<fields>
<uniqueKey>
<defaultSearchField>
<solrQueryParser>
<copyField>
</schema>
FIELD TYPES
A field type includes three important pieces of information:
- The name of the field type
- Implementation class name
- Field attributes
Field types are defined in the types element of schema.xml.
<fieldType name="textTight" class="solr.TextField">
…
</fieldType>
The type name is specified in the name attribute of the fieldType element. The name of the implementing class, which makes sure the field is handled correctly, is referenced using the class attribute.

Numeric Types
Solr supports two distinct groups of field types for dealing with numeric data:
- Numerics with Trie Encoding: TrieDateField, TrieDoubleField, TrieIntField, TrieFloatField, and TrieLongField.
- Numerics Encoded As Strings: DateField, SortableDoubleField, SortableIntField, SortableFloatField, and SortableLongField.
Which Type to Use?
Trie encoded types support faster range queries, and sorting on these fields is more RAM efficient. Documents that do not have a value for a Trie field will be sorted as if they contained the value of “0”. String encoded types are less efficient for range queries and sorting, but support the sortMissingLast and sortMissingFirst attributes.
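The trade-off can be sketched with one declaration of each kind. The type names tint and sint and the precisionStep value are illustrative choices, not required values.

```xml
<!-- Trie encoded: faster range queries, more RAM efficient sorting -->
<fieldType name="tint" class="solr.TrieIntField"
           precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

<!-- String encoded: slower ranges, but documents without a value
     can be pushed to the end of sorted results -->
<fieldType name="sint" class="solr.SortableIntField"
           sortMissingLast="true" omitNorms="true"/>
```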
| Class | Description |
| BinaryField | Binary data that needs to be base64 encoded when reading or writing |
| BoolField | Contains either true or false. Values of “1”, “t”, or “T” in the first character are interpreted as true. Any other values in the first character are interpreted as false. |
| ExternalFileField | Pulls values from a file on disk. |
| RandomSortField | Does not contain a value. Queries that sort on this field type will return results in random order. Use a dynamic field to use this feature. |
| StrField | String |
| TextField | Text, usually multiple words or tokens |
| UUIDField | Universally Unique Identifier (UUID). Pass in a value of “NEW” and Solr will create a new UUID. |

Field Type Properties
The field class determines most of the behavior of a field type, but optional properties can also be defined in schema.xml.
Some important Boolean properties are:
| Property | Description |
| indexed | If true, the value of the field can be used in queries to retrieve matching documents. This is also required for fields where sorting is needed. |
| stored | If true, the actual value of the field can be retrieved in query results. |
| sortMissingFirst sortMissingLast | Control the placement of documents when a sort field is not present in supporting field types. |
| multiValued | If true, indicates that a single document might contain multiple values for this field type. |
ANALYZERS
Field analyzers are used both during ingestion, when a document is indexed, and at query time. Analyzers are only valid for <fieldType> declarations that specify the TextField class. Analyzers may be a single class or they may be composed of a series of zero or more CharFilter, one Tokenizer and zero or more TokenFilter classes.
Analyzers are specified by adding <analyzer> children to the <fieldType> element in the schema.xml config file. Field Types typically use a single analyzer, but the type attribute can be used to specify distinct analyzers for the index vs query.
The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is the fully qualified Java class name of an existing Lucene analyzer.
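Both forms described above can be sketched as follows. The field type names and the synonyms.txt file are assumptions for illustration, and the package of WhitespaceAnalyzer shown is the one used by the Lucene release bundled with Solr 1.4.

```xml
<!-- A single existing Lucene analyzer, named by its Java class -->
<fieldType name="text_ws_simple" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>

<!-- Distinct analyzers for indexing and querying via the type attribute -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms are expanded only at query time in this sketch -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```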
For more configurable analysis, an analyzer chain can be created using a simple <analyzer> element with no class attribute, with the child elements that name factory classes for CharFilter, Tokenizer and TokenFilter to use, and in the order they should run, as in the following example:
<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
CharFilter
A CharFilter pre-processes input characters, with the possibility to add, remove or change characters while preserving the original character offsets.
The following table provides an overview of some of the CharFilter factories available in Solr 1.4:
| CharFilter | Description |
| MappingCharFilterFactory | Applies mapping contained in a map to the character stream. The map contains pairings of String input to String output. |
| PatternReplaceCharFilterFactory | Applies a regular expression pattern to the string in the character stream, replacing matches with the specified replacement string. |
| HTMLStripCharFilterFactory | Strips HTML from the input stream and passes the result to either a CharFilter or a Tokenizer. This filter removes tags while keeping content. It also removes <script>, <style>, comments, and processing instructions. |
Tokenizer
Tokenizer breaks up a stream of text into tokens. Tokenizer reads from a Reader and produces a TokenStream containing various metadata such as the locations at which each token occurs in the field.
The following table provides an overview of some of the Tokenizer factory classes included in Solr 1.4:
| Tokenizer | Description |
| StandardTokenizerFactory | Treats whitespace and punctuation as delimiters. |
| NGramTokenizerFactory | Generates n-gram tokens of sizes in the given range. |
| EdgeNGramTokenizerFactory | Generates edge n-gram tokens of sizes in the given range. |
| PatternTokenizerFactory | Uses a Java regular expression to break the text stream into tokens. |
| WhitespaceTokenizerFactory | Splits the text stream on whitespace, returning sequences of non-whitespace characters as tokens. |
TokenFilter
TokenFilter consumes and produces TokenStreams. TokenFilter looks at each token sequentially and decides to pass it along, replace it or discard it.
A TokenFilter may also do more complex analysis by buffering to look ahead and consider multiple tokens at once.
The following table provides an overview of some of the TokenFilter factory classes included in Solr 1.4:
| TokenFilter | Description |
| KeepWordFilterFactory | Discards all tokens except those that are listed in the given word list. Inverse of StopFilterFactory. |
| LengthFilterFactory | Passes tokens whose length falls within the min/max limit specified. |
| LowerCaseFilterFactory | Converts any uppercase letters in a token to lowercase. |
| PatternReplaceFilterFactory | Applies a regular expression to each token, and substitutes the given replacement string for each match. |
| PhoneticFilterFactory | Creates tokens using one of the phonetic encoding algorithms from the org.apache.commons.codec.language package. |
| PorterStemFilterFactory | An algorithmic stemmer that is not as accurate as a table-based stemmer, but faster and less complex. |
| ShingleFilterFactory | Constructs shingles (token n-grams) from the token stream. |
| StandardFilterFactory | Removes dots from acronyms and 's from the end of tokens. This class only works when used in conjunction with the StandardTokenizerFactory. |
| StopFilterFactory | Discards, or stops, analysis of tokens that are on the given stop words list. |
| SynonymFilterFactory | Each token is looked up in the list of synonyms and if a match is found, then the synonym is emitted in place of the token. |
| TrimFilterFactory | Trims leading and trailing whitespace from tokens. |
| WordDelimiterFilterFactory | Splits and recombines tokens at punctuation, case changes and numbers. Useful for indexing |

FIELDS
Once you have field types set up, defining the fields themselves is simple: all you need to do is supply the name and a reference to the name of the declared type you wish to use. You can also provide options that override the options for that field type.
<field name="price" type="sfloat" indexed="true"/>
Dynamic Fields
Dynamic fields allow you to define behavior for fields that are not explicitly defined in the schema, allowing you to have fields in your document whose underlying <fieldType/> will be driven by the field naming convention instead of having an explicit declaration for every field.
Dynamic fields are also defined in the fields element of the schema, and have a name, field type, and options.
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
OTHER SCHEMA ELEMENTS
Copying Fields
Solr has a mechanism for making copies of fields so that you can apply several distinct field types to a single piece of incoming information.
<copyField source="cat" dest="text" maxChars="30000" />
Unique Key
The uniqueKey element specifies which field is a unique identifier for documents. Although uniqueKey is not required, it is nearly always warranted by your application design. For example, uniqueKey should be used if you will ever update a document in the index.
<uniqueKey>id</uniqueKey>
Default Search Field
If you are using the Lucene query parser, queries that don’t specify a field name will use the defaultSearchField. The dismax query parser does not use this value in Solr 1.4.
<defaultSearchField>text</defaultSearchField>
Query Parser Operator
In queries with multiple clauses that are not explicitly required or prohibited, Solr can either return results where all conditions are met or where one or more conditions are met. The default operator controls this behavior. An operator of AND means that all conditions must be fulfilled, while an operator of OR means that one or more conditions must be true.
In schema.xml, use the solrQueryParser element to control what operator is used if an operator is not specified in the query. The default operator setting only applies to the Lucene query parser (not the DisMax query parser, which uses the mm parameter to control the equivalent behavior).
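For example, the following line (a minimal sketch) makes OR the operator for clauses that carry no explicit operator, so a query such as solr lucene matches documents containing either term:

```xml
<solrQueryParser defaultOperator="OR"/>
```

Switching the value to AND would require every such clause to match.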
SOLRCONFIG.XML
Configuring solrconfig.xml
solrconfig.xml, found in the conf directory for the Solr Core, comprises a set of XML statements that set the configuration values for your Solr instance.
AutoCommit
The <updateHandler> section affects how updates are done internally. The <autoCommit> subelement contains further configuration for controlling how often pending updates will be automatically pushed to the index.
| Element | Description |
| <maxDocs> | Number of updates that have occurred since last commit |
| <maxTime> | Number of milliseconds since the oldest uncommitted update |
If either of these limits is reached, then Solr automatically performs a commit operation. If the <autoCommit> tag is missing, then only explicit commits will update the index.
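The either-limit rule can be modelled in a few lines of Java. This is only an illustrative sketch of the decision described above, not Solr's implementation; the class name AutoCommitPolicy and its method are hypothetical.

```java
// Illustrative model of the <autoCommit> rule: commit when EITHER limit is reached.
// AutoCommitPolicy is a hypothetical name, not a Solr class.
class AutoCommitPolicy {
    private final int maxDocs;     // <maxDocs>: updates since the last commit
    private final long maxTimeMs;  // <maxTime>: age of the oldest uncommitted update, in ms

    AutoCommitPolicy(int maxDocs, long maxTimeMs) {
        this.maxDocs = maxDocs;
        this.maxTimeMs = maxTimeMs;
    }

    /** Returns true if a commit should be triggered for the given pending state. */
    boolean shouldCommit(int pendingDocs, long oldestPendingAgeMs) {
        if (pendingDocs == 0) {
            return false; // nothing is waiting to be committed
        }
        return pendingDocs >= maxDocs || oldestPendingAgeMs >= maxTimeMs;
    }
}
```

With limits of 1000 documents and 60000 milliseconds, a single pending update that has waited a full minute triggers a commit just as 1000 fresh updates would.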
HTTP RequestDispatcher Settings
The <requestDispatcher> section controls how the RequestDispatcher implementation responds to HTTP requests.
| Element | Description |
| <requestParsers> | Contains attributes for enableRemoteStreaming and multipartUploadLimitInKB |
| <httpCaching> | Specifies how Solr should generate its HTTP caching-related headers |
Internal Caching
The <query> section contains settings that affect how Solr will process and respond to queries.
There are three predefined types of caches that you can configure whose settings affect performance:
| Element | Description |
| <filterCache> | Used by SolrIndexSearcher to cache filters: unordered sets of all documents that match a query. Solr uses the filterCache to cache results of queries that use the fq search parameter. |
| <queryResultCache> | Holds the sorted and paginated results of previous searches |
| <documentCache> | Holds Lucene Document objects (the stored fields for each document). |
Request Handlers
A Request Handler defines the logic executed for any request. Multiple instances of various request handlers, each with different names and configuration options can be declared. The qt url parameter or the path of the url can be used to select the request handler by name.
Most request handlers recognize three main sub-sections in their declaration:
- defaults, which are used when a request does not include a parameter.
- appends, which are added to the parameter values specified in the request.
- invariants, which override values specified in the query.
LucidWorks for Solr includes the following indexing handlers:
- XMLUpdateRequestHandler: processes XML messages containing data and other index modification instructions.
- BinaryUpdateRequestHandler: processes messages from the Solr Java client.
- CSVRequestHandler: processes CSV files containing documents
- DataImportHandler: processes commands to pull data from remote data sources
- ExtractingRequestHandler (aka Solr Cell): uses Apache Tika to process binary files such as Office/PDF and index them
The out-of-the-box searching handler is SearchHandler.
Search Components
Instances of SearchComponent define discrete units of logic that can be combined together and reused by Request Handlers (in particular SearchHandler) that know about them. The default SearchComponents used by SearchHandler are query, facet, mlt (MoreLikeThis), highlight, stats, and debug. Additional Search Components are also available with additional configuration.
Response Writers
Response writers generate the formatted response of a search. The wt url parameter selects the response writer to use by name. The default response writers are json, php, phps, python, ruby, xml, and xslt.
INDEXING
Indexing is the process of adding content to a Solr index, and as necessary, modifying that content or deleting it. By adding content to an index, it becomes searchable by Solr.
Client Libraries
There are a number of client libraries available to access Solr. SolrJ is a Java client included with the Solr 1.4 release which allows clients to add, update and query the Solr index. http://wiki.apache.org/solr/IntegratingSolr provides a list of such libraries.
Indexing Using XML
Solr accepts POSTed XML messages that add/update, commit, delete and delete by query using the http://[hostname]:8983/solr/update url. Multiple documents can be specified in a single <add> command.
<add>
<doc>
<field name="employeeId">05991</field>
<field name=”office”>Bridgewater</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
| Command | Description |
| commit | Writes all documents loaded since last commit |
| optimize | Requests Solr to merge the entire index into a single segment to improve search performance |
Delete by id deletes the document with the specified ID (i.e. uniqueKey), while delete by query deletes documents that match the specified query:
<delete><id>05991</id></delete>
<delete><query>office:Bridgewater</query></delete>
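The add and delete messages above can be assembled programmatically before they are POSTed. The following is a minimal sketch that only builds the XML strings and makes no HTTP call; SolrXmlMessages is a hypothetical helper and is not part of SolrJ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build Solr XML update messages as strings (no HTTP call is made).
// SolrXmlMessages is a hypothetical helper, not part of SolrJ.
class SolrXmlMessages {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds an <add> message containing a single <doc>. */
    static String add(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    /** Builds a delete-by-id message. */
    static String deleteById(String id) {
        return "<delete><id>" + escape(id) + "</id></delete>";
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("employeeId", "05991");
        doc.put("office", "Bridgewater");
        System.out.println(add(doc));
        System.out.println(deleteById("05991"));
    }
}
```

Escaping field values matters because raw text containing characters such as & or < would otherwise produce a malformed message.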
Indexing Using CSV
CSV records can be uploaded to Solr by sending the data to the http://[hostname]:8983/solr/update/csv URL.
The CSV handler accepts various parameters, some of which can be overridden on a per field basis using the form:
f.fieldname.parameter=value
These parameters can be used to specify how data should be parsed, such as specifying the delimiter, quote character and escape characters. You can also handle whitespace, define which lines or field names to skip, map columns to fields, or specify if columns should be split into multiple values.
Indexing Using SolrCell
Using the Solr Cell framework, Solr uses Tika to automatically determine the type of a document and extract fields from it. These fields are then indexed directly, or mapped to other fields in your schema.
The URL for this handler is http://[hostname]:8983/solr/update/extract.
The Extraction Request Handler accepts various parameters that can be used to specify how data should be mapped to fields in the schema, including specific XPaths of content to be extracted, how content should be mapped to fields, whether attributes should be extracted, and in which format to extract content. You can also specify a dynamic field prefix to use when extracting content that has no corresponding field.
Indexing Using Data Import Handler
The Data Import Handler (DIH) can pull data from relational databases (through JDBC), RSS feeds, email repositories, and structured XML, using XPath to generate fields.
The Data Import Handler is registered in solrconfig.xml, with a pointer to its data-config.xml file which has the following structure:
<dataConfig>
<dataSource/>
<document>
<entity>
<field column="" name=""/>
<field column="" name=""/>
</entity>
</document>
</dataConfig>
The Data Import Handler is accessed using the http://[hostname]:8983/solr/dataimport URL but it also includes a browser-based console which allows you to experiment with data-config.xml changes and demonstrates all of the commands and options to help with development. You can access the console at this address: http://[hostname]:port/solr/admin/dataimport.jsp
SEARCHING
Data can be queried using either the http://[hostname]:8983/solr/select?qt=name URL, or by using the http://[hostname]:8983/solr/name syntax for SearchHandler instances with names that begin with a “/”.
SearchHandler processes requests by delegating to its Search Components which interpret the various request parameters. The QueryComponent delegates to a query parser, which determines which documents the user is interested in. Different query parsers support different syntax.
Query Parsing
Input to a query parser can include:
- Search strings, that is, terms to search for in the index.
- Parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results.
- Parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application’s schema.
Search parameters may also specify a filter query. As part of a search response, a filter query runs a query against the entire index and caches the results. Because Solr allocates a separate cache for filter queries, the strategic use of filter queries can improve search performance.
Common Query Parameters
The table below summarizes Solr’s common query parameters:
| Parameter | Description |
| defType | The query parser to be used to process the query |
| sort | Sort results in ascending or descending order based on the document's score or another characteristic |
| start | An offset (0 by default) to the results that Solr should begin displaying |
| rows | Indicates how many rows of results are displayed at a time (10 by default) |
| fq | Applies a filter query to the search results |
| fl | Limits the query’s results to a listed set of fields |
| debugQuery | Causes Solr to include additional debugging information in the response, including score explain information for each document returned |
| explainOther | Allows client to specify a Lucene query to identify a set of documents not already included in the response, returning explain information for each of those documents |
| wt | Specifies the Response Writer to be used to format the query response |
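As a rough illustration of how these parameters combine into a request, the sketch below assembles a URL-encoded select URL. SolrSelectUrl is a hypothetical helper, and the host name and field names (office, price, id) are placeholders chosen for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: assemble a select URL from common query parameters.
// SolrSelectUrl is a hypothetical helper; host and field names are placeholders.
class SolrSelectUrl {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Appends each parameter, URL-encoded, to the base URL in insertion order. */
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "office:Bridgewater");
        p.put("fq", "price:[* TO 10]"); // filter query, cached separately
        p.put("start", "0");
        p.put("rows", "10");
        p.put("fl", "id,price");
        p.put("wt", "json");
        System.out.println(build("http://localhost:8983/solr/select", p));
    }
}
```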
Lucene Query Parser
The standard query parser syntax allows users to specify queries containing complex expressions, such as: http://[hostname]:8983/solr/select?q=id:SP2514N+price:[*+TO+10]
The standard query parser supports the parameters described in the following table:
| Parameter | Description |
| q | Query string using the Lucene Query syntax |
| q.op | Specifies the default operator for the query expression, overriding that in schema.xml. May be AND or OR |
| df | Default field, overriding what is defined in schema.xml |
DisMax Query Parser
The DisMax query parser is designed to provide an experience similar to that of popular search engines such as Google, which rarely display syntax errors to users.
Instead of allowing complex expressions in the query string, additional parameters can be used to specify how the query string should be used to find matching documents.
| Parameter | Description |
| q | Defines the raw user input strings for the query |
| q.alt | Calls the standard query parser and defines query input strings when q is not used |
| qf | Query Fields: the fields in the index on which to perform the query |
| mm | Minimum “Should” Match: a minimum number of clauses in the query that must match a document. This can be specified as a complex expression. |
| pf | Phrase Fields: Fields that give a score boost when all terms of the q parameter appear in close proximity |
| ps | Phrase Slop: the number of positions all terms can be apart in order to match the pf boost |
| tie | Tie Breaker: a float value (less than 1) used as a multiplier when more than one of the qf fields contains a term from the query string. The smaller the value, the less influence multiple matching fields have |
| bq | Boost Query: a raw Lucene query that will be added to the user's query to influence the score |
| bf | Boost Function: like bq, but directly supports the Solr function query syntax |
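The effect of the tie parameter can be sketched numerically. The model below follows the way a disjunction-max score is usually described: the best-matching field's score plus tie times the sum of the other fields' scores. It is a simplified illustration that assumes non-negative per-field scores; DisMaxScore is a hypothetical name, not a Solr or Lucene class.

```java
// Simplified model of how the tie breaker blends per-field scores.
// Assumes non-negative scores; DisMaxScore is a hypothetical name.
class DisMaxScore {
    /** Returns max(fieldScores) + tie * (sum of the remaining field scores). */
    static double combine(double[] fieldScores, double tie) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
            if (s > max) {
                max = s;
            }
        }
        return max + tie * (sum - max);
    }
}
```

With tie set to 0 only the best field counts; as tie approaches 1 the result approaches a plain sum of all matching fields.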
ADVANCED SEARCH FEATURES
Faceting makes it easy for users to drill down on search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.
There are three types of faceting, all of which use indexed terms:
- Field Faceting: treats each indexed term as a facet constraint.
- Query Faceting: allows the client to specify an arbitrary query and uses that as a facet constraint.
- Date Range Faceting: creates date range queries on the fly.
Solr provides a collection of highlighting utilities which can be called by various Request Handlers to include highlighted matches in field values. Popular search engines such as Google and Yahoo! return snippets in their search results: 3-4 lines of text offering a description of a search result.
When an index becomes too large to fit on a single system, or when a query takes too long to execute, the index can be split into multiple shards on different Solr servers, for Distributed Search. Solr can query and merge results across shards. It’s up to you to get all your documents indexed on each shard of your server farm. Solr does not include out-of-the-box support for distributed indexing, but your method can be as simple as a round robin technique. Just index each document to the next server in the circle.
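A round robin assignment of documents to shards might look like the sketch below. RoundRobinShards is a hypothetical helper and the shard URLs are whatever your deployment uses; sending the document to the chosen shard is left out.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: choose the next shard in the circle for each document to index.
// RoundRobinShards is a hypothetical helper, not part of Solr.
class RoundRobinShards {
    private final List<String> shardUrls;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinShards(List<String> shardUrls) {
        if (shardUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shardUrls = shardUrls;
    }

    /** Returns the shard that should receive the next document. */
    String nextShard() {
        // floorMod keeps the index valid even after the counter overflows
        int i = Math.floorMod(next.getAndIncrement(), shardUrls.size());
        return shardUrls.get(i);
    }
}
```

Note that plain round robin does not remember which shard holds a given document, so updates and deletes by id need either a broadcast to all shards or a deterministic scheme such as hashing the unique key.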
Clustering groups search results by similarities discovered when a search is executed, rather than when content is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help users rule out content that isn’t pertinent to what they’re really searching for.
The primary purpose of the Replication Handler is to replicate an index to multiple slave servers, which can then use load balancing for horizontal scaling. The Replication Handler can also be used to make a back-up copy of a server’s index, even without any slave servers in operation.
MoreLikeThis is a component that can be used with the SearchHandler to return documents similar to each of the documents matching a query. The MoreLikeThis Request Handler can be used instead of the SearchHandler to find documents similar to an individual document, utilizing faceting, pagination and filtering on the related documents.
About The Authors

Chris Hostetter
Chris Hostetter is Senior Staff Engineer at Lucid Imagination, a member of the Apache Software Foundation, and serves as a committer for the Apache Lucene/Solr Projects. Prior to joining Lucid Imagination in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about searching “structured data” that was never as structured as it should have been.
Recommended Book
Designed to provide complete, comprehensive documentation, the Reference Guide is intended to be more encyclopedic and less of a cookbook. It is structured to address a broad spectrum of needs, ranging from new developers getting started to well experienced developers extending their application or troubleshooting. It will be of use at any point in the application lifecycle, for whenever you need deep, authoritative information about Solr.
Getting Started with Apache Hadoop
By Eugene Ciurana and Masoud Kalali
The Essential Apache Hadoop Cheat Sheet
INTRODUCTION
This Refcard presents a basic blueprint for applying MapReduce to solving large-scale, unstructured data processing problems by showing how to deploy and use an Apache Hadoop computational cluster. It complements DZone Refcardz #43 and #103, which provide introductions to high-performance computational scalability and high-volume data handling techniques, including MapReduce.
What Is MapReduce?
MapReduce refers to a framework that runs on a computational cluster to mine large datasets. The name derives from the application of map() and reduce() functions repurposed from functional programming languages.
- “Map” applies to all the members of the dataset and returns a list of results
- “Reduce” collates and resolves the results from one or more mapping operations executed in parallel
- Very large datasets are split into large subsets called splits
- A parallelized operation performed on all splits yields the same results as if it were executed against the larger dataset before turning it into splits
- Implementations separate business logic from multiprocessing logic
- MapReduce framework developers focus on process dispatching, locking, and logic flow
- App developers focus on implementing the business logic without worrying about infrastructure or scalability issues
Implementation patterns
The Map(k1, v1) -> list(k2, v2) function is applied to every item in the split. It produces a list of (k2, v2) pairs for each call. The framework groups all the results with the same key together in a new split.
The Reduce(k2, list(v2)) -> list(v3) function is applied to each intermediate results split to produce a collection of values v3 in the same domain. This collection may have zero or more values. The desired result consists of all the v3 collections, often aggregated into one result file.
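The two signatures and the grouping step between them can be tried out without a cluster. The sketch below is a single-process model in plain Java, using word counting as a stand-in problem; it does not use the Hadoop API, and MiniMapReduce and its method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process model of Map -> group by key -> Reduce, using word counting.
// MiniMapReduce is a hypothetical name; no Hadoop classes are involved.
class MiniMapReduce {
    // Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in the line
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Framework step: collect all values that share a key
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce(k2, list(v2)) -> list(v3): sum the counts for one word
    static List<Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return Collections.singletonList(sum);
    }

    // Runs map over every line of a split, groups the pairs, then reduces each group
    static Map<String, List<Integer>> run(List<String> split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : split) {
            pairs.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        Map<String, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In a real cluster the map calls run in parallel across splits and the framework performs the grouping; only the map and reduce bodies are application code.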

APACHE HADOOP
Apache Hadoop is an open source, Java framework for implementing reliable and scalable computational networks. Hadoop includes several subprojects:
- MapReduce
- Pig
- ZooKeeper
- HBase
- HDFS
- Hive
- Chukwa
This Refcard presents how to deploy and use the common tools, MapReduce, and HDFS for application development after a brief overview of all of Hadoop’s components.

Hadoop comprises tools and utilities for data serialization, file system access, and interprocess communication pertaining to MapReduce implementations. Single and clustered configurations are possible. This configuration almost always includes HDFS because it’s better optimized for high throughput MapReduce I/O than general-purpose file systems.
Components
Figure 2 shows how the various Hadoop components relate to one another:
Essentials
- HDFS - a scalable, high-performance distributed file system. It stores its data blocks on top of the native file system. HDFS is designed for consistency; commits aren’t considered “complete” until data is written to at least two different configurable volumes. HDFS presents a single view of multiple physical disks or file systems.
- MapReduce - A Java-based job tracking, node management, and application container for mappers and reducers written in Java or in any scripting language that supports STDIN and STDOUT for job interaction.

Frameworks
- Chukwa - a data collection system for monitoring, displaying, and analyzing logs from large distributed systems.
- Hive - structured data warehousing infrastructure that provides mechanisms for storage, data extraction, transformation, and loading (ETL), and a SQL-like language for querying and analysis.
- HBase - a column-oriented (NoSQL) database designed for real-time storage, retrieval, and search of very large tables (billions of rows/millions of columns) running atop HDFS.
Utilities
- Pig - a set of tools for programmatic flat-file data analysis that provides a programming language, data transformation, and parallelized processing.
- Sqoop - a tool for importing and exporting data stored in relational databases into Hadoop or Hive, and vice versa using MapReduce tools and standard JDBC drivers.
- ZooKeeper - a distributed application management tool for configuration, event synchronization, naming, and group services used for managing the nodes in a Hadoop computational network.

Hadoop Cluster Building Blocks
Hadoop clusters may be deployed in three basic configurations:
| Mode | Description | Usage |
| Local (default) | Multi-threading components, single JVM | Development, test, debug |
| Pseudo-distributed | Multiple JVMs, single node | Development, test, debug |
| Distributed | All components run in separate nodes | Staging, production |
Figure 3 shows how the components are deployed for any of these configurations:
Each node in a Hadoop installation runs one or more daemons executing MapReduce code or HDFS commands. Each daemon’s responsibilities in the cluster are:
- NameNode: manages HDFS and communicates with every DataNode daemon in the cluster
- JobTracker: dispatches jobs and assigns input splits to mappers or reducers as each stage completes
- TaskTracker: executes tasks sent by the JobTracker and reports status
- DataNode: Manages HDFS content in the node and updates status to the NameNode
These daemons execute in the three distinct processing layers of a Hadoop cluster: master (Name Node), slaves (Data Nodes), and user applications.
Name Node (Master)
- Manages the file system name space
- Keeps track of job execution
- Manages the cluster
- Replicates data blocks and keeps them evenly distributed
- Manages lists of files, list of blocks in each file, list of blocks per node, and file attributes and other meta-data
- Tracks HDFS file creation and deletion operations in an activity log
Depending on system load, the NameNode and JobTracker daemons may run on separate computers.

Data Nodes (Slaves)
- Store blocks of data in their local file system
- Store meta-data for each block
- Serve data and meta-data to the job they execute
- Send periodic status reports to the Name Node
- Send data blocks to other nodes as required by the Name Node
Data nodes execute the DataNode and TaskTracker daemons described earlier in this section.
User Applications
- Dispatch mappers and reducers to the Name Node for execution in the Hadoop cluster
- Execute implementation contracts for Java and scripting-language mappers and reducers
- Provide application-specific execution parameters
- Set Hadoop runtime configuration parameters with semantics that apply to the Name or the Data nodes
A user application may be a stand-alone executable, a script, a web application, or any combination of these. The application is required to implement either the Java or the streaming APIs.
Hadoop Installation

Detailed instructions for this section are available at: http://hadoop.apache.org/common/docs/current
- Ensure that Java 6 is installed and that both ssh and sshd are running on all nodes
- Get the most recent, stable release from http://hadoop.apache.org/common/releases.html
- Decide on local, pseudo-distributed or distributed mode
- Install the Hadoop distribution on each server
- Set the HADOOP_HOME environment variable to the directory where the distribution is installed
- Add $HADOOP_HOME/bin to PATH
Follow the instructions for local, pseudo-clustered, or clustered configuration from the Hadoop site. All the configuration files are located in the directory $HADOOP_HOME/conf; the minimum configuration requirements for each file are:
- hadoop-env.sh — environmental configuration, JVM configuration, logging, master and slave configuration files
- core-site.xml — site wide configuration, such as users, groups, sockets
- hdfs-site.xml — HDFS block size, Name and Data node directories
- mapred-site.xml — total MapReduce tasks, JobTracker address
- masters, slaves files — NameNode, JobTracker, DataNodes, and TaskTrackers addresses, as appropriate
Test the Installation
Log on to each server without a passphrase: ssh servername or ssh localhost
Format a new distributed file system: hadoop namenode -format
Start the Hadoop daemons: start-all.sh
Check the logs for errors at $HADOOP_HOME/logs!
Browse the NameNode and JobTracker interfaces at (localhost is a valid name for local configurations):
- http://namenode.server.name:50070/
- http://jobtracker.server.name:50030/
HADOOP QUICK REFERENCE
The official commands guide is available from: http://hadoop.apache.org/common/docs/current/commands_manual.html
Usage

Hadoop can parse generic options and run classes from the command line. confdir can override the default $HADOOP_HOME/conf directory.
Generic Options
| -conf <config file> | App configuration file |
| -D <property=value> | Set a property |
| -fs <local|namenode:port> | Specify a namenode |
| -jt <local|jobtracker:port> | Specify a job tracker; applies only to a job |
| -files <file1, file2, .., fileN> | Files to copy to the cluster (job only) |
| -libjars <file1, file2, .., fileN> | .jar files to include in the classpath (job only) |
| -archives | Archives to unbundle on the computational nodes (job only) |
| User Commands | |
| archive -archiveName file.har /var/data1 /var/data2 | Create an archive |
| distcp hdfs://node1:8020/dir_a hdfs://node2:8020/dir_b | Distributed copy from one or more node/dirs to a target |
| fsck -locations /var/data1; fsck -move /var/data1; fsck /var/data | File system checks: list block/location, move corrupted files to /lost+found, and general check |
| job -list [all]; job -submit job_file; job -status 42; job -kill 42 | Job list, dispatching, status check, and kill; submitting a job returns its ID |
| pipes -conf file; pipes -map File.class; pipes -map M.class -reduce R.class -files | Use HDFS and MapReduce from a C++ program |
| queue -list | List job queues |
| Administrator Commands | |
| balancer -threshold 50 | Cluster balancing at percent of disk capacity |
| daemonlog -getlevel host name | Fetch http://host/logLevel?log=name |
| datanode | Run a new datanode |
| jobtracker | Run a new job tracker |
| namenode -format; namenode -regular; namenode -upgrade; namenode -finalize | Format, start a new instance, upgrade from a previous version of Hadoop, or remove previous version's files and complete upgrade |
HDFS shell commands apply to local or HDFS file systems and take the form:
hadoop dfs -command dfs_command_options
| HDFS Shell | |
| du /var/data1 hdfs://node/data2 | Display cumulative size of files and directories |
| lsr | Recursive directory list |
| cat hdfs://node/file | Types a file to stdout |
| count hdfs://node/data | Count the directories, files, and bytes in a path |
| chmod, chgrp, chown | Permissions |
| expunge | Empty file system trash |
| get hdfs://node/data2 /var/data2 | Recursively copy files to the local system |
| put /var/data2 hdfs://node/data2 | Recursively copy files to the target file system |
| cp, mv, rm | Copy, move, or delete files in HDFS only |
| mkdir hdfs://node/path | Recursively create a new directory in the target |
| setrep -R -w 3 | Recursively set a file or directory replication factor (number of copies of the file) |

To leverage this quick reference, review and understand all the Hadoop configuration, deployment, and HDFS management concepts. The complete documentation is available from http://hadoop.apache.org.
HADOOP APPS QUICK HOW-TO
A Hadoop application is made up of one or more jobs. A job consists of a configuration file and one or more Java classes or a set of scripts. Data must already exist in HDFS.
Figure 4 shows the basic building blocks of a Hadoop application written in Java:
An application has one or more mappers and reducers and a configuration container that describes the job, its stages, and intermediate results. Classes are submitted and monitored using the tools described in the previous section.
Input Formats and Types
- KeyValueTextInputFormat: each line represents a key and value delimited by a separator; if the separator is missing, the whole line is the key and the value is empty
- TextInputFormat: the key is the position (byte offset) of the line in the file, the value is the text of the line
- NLineInputFormat: N sequential lines represent the value, the offset is the key
- MultiFileInputFormat: an abstraction that the user overrides to define the keys and values in terms of multiple files
- SequenceFileInputFormat: raw format serialized key/value pairs
- DBInputFormat: JDBC driver fed data input
Output Formats
The output formats have a 1:1 correspondence with the input formats and types. The complete list is available from: http://hadoop.apache.org/common/docs/current/api
Word Indexer Job Example
Applications are often required to index massive amounts of text. This sample application shows how to build a simple indexer for text files. The input is free-form text such as:
hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily
farewell.
The map function output should be something like:
<KING, hamlet@11141>
<CLAUDIUS, hamlet@11141>
<We, hamlet@11141>
<doubt, hamlet@11141>
The number represents the line in which the text occurred. The mapper and reducer/combiner implementations in this section require the documentation from: http://hadoop.apache.org/mapreduce/docs/current/api
The Mapper
The basic Java code implementation for the mapper has the form:
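import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineIndexMapper
    extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  // k: position of the line in the input; v: the text of the line
  public void map(LongWritable k,
      Text v, OutputCollector<Text, Text> o,
      Reporter r) throws IOException { /* implementation here
  */ }
  // ...
}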
The implementation itself uses standard Java text manipulation tools; you can use regular expressions, scanners, whatever is necessary.
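As one possible shape for that text manipulation, the sketch below turns a tab-delimited input line into (word, location) pairs using only the standard library. It is not wired into the Hadoop API; LineIndexLogic and mapLine are hypothetical names, and the punctuation rule (keep letters, digits, and apostrophes) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-line map logic: "location<TAB>text..." -> (word, location) pairs.
// LineIndexLogic is a hypothetical helper; it does not use Hadoop classes.
class LineIndexLogic {
    static List<String[]> mapLine(String line) {
        List<String[]> pairs = new ArrayList<>();
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return pairs; // no location marker: nothing to index
        }
        String location = line.substring(0, tab);
        // Remaining fields are split on any whitespace, including tabs
        for (String token : line.substring(tab + 1).split("\\s+")) {
            // Assumed rule: drop everything except letters, digits, and apostrophes
            String word = token.replaceAll("[^\\p{L}\\p{Nd}']", "");
            if (!word.isEmpty()) {
                pairs.add(new String[] { word, location });
            }
        }
        return pairs;
    }
}
```

Inside a real mapper, each pair would be passed to the OutputCollector instead of being added to a list.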

The Reducer/Combiner
The combiner is an output handler for the mapper to reduce the total data transferred over the network. It can be thought of as a reducer on the local node.
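import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LineIndexReducer
    extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  // k: a word; v: every location emitted for that word
  public void reduce(Text k,
      Iterator<Text> v,
      OutputCollector<Text, Text> o,
      Reporter r) throws IOException {
    /* implementation */ }
  // ...
}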
The reducer iterates over keys and values generated in the previous step adding a line number to each word’s occurrence index. The reduction results have the form:
<KING, hamlet@11141; hamlet@42691; lear@31337>
A complete index shows the line where each word occurs, and the file/work where it occurred.
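The reduction step itself amounts to joining all locations seen for a word. A minimal sketch in plain Java, outside the Hadoop API, could look like this; LineIndexReduceLogic is a hypothetical name and the output formatting simply mirrors the example shown above.

```java
import java.util.Iterator;

// Sketch of the reduce logic: join every location emitted for one word.
// LineIndexReduceLogic is a hypothetical helper; it does not use Hadoop classes.
class LineIndexReduceLogic {
    static String reduce(String word, Iterator<String> locations) {
        StringBuilder joined = new StringBuilder();
        while (locations.hasNext()) {
            if (joined.length() > 0) {
                joined.append("; ");
            }
            joined.append(locations.next());
        }
        return "<" + word + ", " + joined + ">";
    }
}
```

Because joining is associative, the same logic can serve as a combiner on each node and as the final reducer, which is why one class can fill both roles.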
Job Driver
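import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Driver {
  public static void main(String... argV) throws Exception {
    // JobConf and JobClient belong to the same org.apache.hadoop.mapred API
    // that the mapper and reducer in this section implement
    JobConf job = new JobConf(Driver.class);
    job.setJobName("test");
    job.setMapperClass(LineIndexMapper.class);
    job.setCombinerClass(LineIndexReducer.class);
    job.setReducerClass(LineIndexReducer.class);
    JobClient.runJob(job); // blocks until the job completes
  }
} // Driver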
This driver is submitted to the Hadoop cluster for processing, along with the rest of the code in a .jar file. One or more files must be available in a reachable hdfs://node/path before submitting the job using the command:
hadoop jar shakespeare_indexer.jar
Using the Streaming API
The streaming API is intended for users with very limited Java knowledge and interacts with any code that supports STDIN and STDOUT streaming. Java is considered the best choice for “heavy duty” jobs. Development speed could be a reason for using the streaming API instead. Some scripted languages may work as well or better than Java in specific problem domains. This section shows how to implement the same mapper and reducer using awk and compares its performance against Java’s.
The Mapper
#!/usr/bin/gawk -f
{
  for (n = 2;n <= NF;n++) {
    gsub("[,:;)(|!\\[\\]\\.\\?]|--","");
    if (length($n) > 0) printf("%s\t%s\n", $n, $1);
  }
}
The output is mapped with the key, a tab separator, then the index occurrence.
The Reducer
#!/usr/bin/gawk -f
{ wordsList[$1] = ($1 in wordsList) ?
    sprintf("%s,%s",wordsList[$1], $2) : $2; }
END {
  for (key in wordsList)
    printf("%s\t%s\n", key,wordsList[key]);
}
The output is a list of all entries for a given word, like in the previous section:
doubt\thamlet@111141,romeoandjuliet@23445,henryv@426917
Awk’s main advantage is conciseness and raw text processing power over other scripting languages and Java. Other languages, like Python and Perl, are supported if they are installed in the Data Nodes. It’s all about balancing speed of development and deployment vs. speed of execution.
Job Driver
hadoop jar hadoop-streaming.jar -mapper shakemapper.awk \
  -reducer shakereducer.awk -input hdfs://node/shakespeareworks
Performance Tradeoff

STAYING CURRENT
Do you want to know about specific projects and use cases where NoSQL and data scalability are the hot topics? Join the scalability newsletter:
http://eugeneciurana.com/scalablesystems
About The Authors

Eugene Ciurana
Eugene Ciurana (http://eugeneciurana.com) is an open-source evangelist who specializes in the design and implementation of mission-critical, high-availability large scale systems. Over the last two years, Eugene designed and built hybrid cloud scalable systems and computational networks for leading financial, software, insurance, and healthcare companies in the US, Japan, Mexico, and Europe.
Publications
- Developing with Google App Engine, Apress
- DZone Refcard #105: NoSQL and Data Scalability
- DZone Refcard #43: Scalability and High Availability
- DZone Refcard #38: SOA Patterns
- The Tesla Testament: A Thriller, CIMEntertainment
Masoud Kalali

Masoud Kalali (http://kalali.me) is a software engineer and author. He has been working on software development projects since 1998. He is experienced in a variety of technologies and platforms.
Masoud is the author of several DZone Refcardz, including: Using XML in Java, Berkeley DB Java Edition, Java EE Security, and GlassFish v3. Masoud is also the author of a book on GlassFish Security published by Packt. He is one of the founding members of the NetBeans Dream Team and is a GlassFish community spotlighted developer.
Recommended Book
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
This DZone Refcard was authored by Eugene Ciurana.
Getting Started with Maven Repository Management
By Jason Van Zyl
20,297 Downloads · Refcard 98 of 151 (see them all)
The Essential Maven Repository Cheat Sheet
MAVEN REPOSITORY MANAGEMENT
A Maven repository provides a standard for storing and serving binary software. Maven and other tools such as Ivy interact with repositories to search for binary software artifacts, locate project dependencies, and retrieve software artifacts from a repository.
Maven Repository managers serve two purposes: they act as highly configurable proxies between your organization and the public Maven repositories and they also provide an organization with a deployment destination for your own generated artifacts.
Proxy Remote Repositories
When you proxy a remote repository, your repository manager accepts requests for artifacts from clients. If the artifact is not already cached, the repository manager will retrieve the artifact from the remote repository and cache the artifact. Subsequent requests for the same artifact will be served from the local cache.
Hosted Internal Repositories
When you host a repository, your repository manager takes care of organizing, storing, and serving binary artifacts. You can use a hosted, internal repository to store internal release artifacts, snapshot artifacts, or 3rd party artifacts.
Release Artifacts
These are specific, point-in-time releases. Released artifacts are considered to be solid, stable, and perpetual in order to guarantee that builds which depend upon them are repeatable over time. Released JAR artifacts are associated with PGP signatures and checksums that verify both the authenticity and integrity of the binary software artifact. The Central Maven repository stores release artifacts.
Snapshot Artifacts
Snapshots capture a work in progress and are used during development. A Snapshot artifact has both a version number such as “1.3.0” or “1.3” and a timestamp. For example, a snapshot artifact for commons-lang 1.3.0 might have the name commons-lang-1.3.0-20090314.182342-1.jar.
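The snapshot file name above can be taken apart programmatically. The following is a minimal sketch, assuming the layout artifactId-version-yyyyMMdd.HHmmss-N.extension and assuming that the trailing number is a build counter; the function name and the returned keys are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: <artifactId>-<version>-<yyyyMMdd.HHmmss>-<buildNumber>.<extension>
SNAPSHOT_NAME = re.compile(
    r"^(?P<artifact>.+?)-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<stamp>\d{8}\.\d{6})-(?P<build>\d+)\.(?P<ext>\w+)$"
)


def parse_snapshot_name(file_name):
    """Split a timestamped snapshot file name into its parts, or return None."""
    match = SNAPSHOT_NAME.match(file_name)
    if match is None:
        return None
    return {
        "artifactId": match.group("artifact"),
        "version": match.group("version"),
        "timestamp": datetime.strptime(match.group("stamp"), "%Y%m%d.%H%M%S"),
        "build": int(match.group("build")),
        "extension": match.group("ext"),
    }
```

A release file name such as commons-lang-1.3.0.jar carries no timestamp, so the sketch returns None for it.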
Reasons to Use a Repository Manager
- Builds will run much faster as they will be downloading artifacts from a local cache.
- Builds will be more stable because you will not be relying on external resources. If your internet connection becomes unavailable, your builds will rely on a local cache of artifacts from a remote repository.
- You can deploy 3rd party artifacts to your repository manager. If you have a proprietary JDBC driver, add it to an internal 3rd party repository so developers can add it as a project dependency without having to manually install it in a local repository.
- It will be easier to collaborate and distribute software internally. Instead of sending other developers instructions for checking out source from source control and building entire applications from source, publish artifacts to an internal repository and share binary artifacts.
- If you are deploying software to the public, the fastest way to get your users productive is with a standard Maven repository.
- You can control which artifacts and repositories are referenced by your projects.
Additional Features and Benefits
Searching and Indexing Artifacts: All repository managers provide an easy way to index and search software artifacts using the standard Nexus Indexer format.
Repository Groups: Repository managers can consolidate multiple repositories into a single repository group, making it easier to configure tools to retrieve artifacts from a single URL.
Procuring External Artifacts: Organizations often want some control over what artifacts are allowed into the organization. Many repository managers allow administrators to define lists of allowed and/or blocked repositories.
Staging and Release Management: Repository managers can also support decisions and workflow associated with software releases, sending email notifications to release managers, developers, and testers.
Security and LDAP Integration: Repository managers can be configured to verify artifacts downloaded from remote repositories and to integrate with external security providers such as LDAP.
Multiple Repository Formats: Repository managers can also automatically transform between various repository formats including OSGi Bundle repositories (OBR), P2 repositories, Maven repositories, and other repository formats.
REPOSITORY COORDINATES
Repositories store artifacts using a set of coordinates: groupId, artifactId, version, and packaging. The GAV coordinate standard is the foundation for Maven’s ability to manage dependencies.

Coordinate: groupId
A group identifier groups a set of artifacts into a logical group. For example, software components being produced by the Maven project are available under the groupId org.apache.maven.
Coordinate: artifactId
An artifactId is simply a name for a software artifact. A simple web application project might have the artifactId “simple-webapp”, and a simple library might be “simple-library”. The combination of groupId and artifactId must be unique for a project.
Coordinate: version
A numerical version for a software artifact. For example, if your simple-library artifact has a major release version of 1, a minor release version of 2, and a point release version of 3, your version would be 1.2.3. Versions can also contain extra information to denote release status such as “1.2-beta”.
Coordinate: packaging
Packaging describes the contents of the software artifact. While the most common artifact is a JAR, Maven repositories can store any type of binary software format including ZIP, SWC, SWF, NAR, WAR, EAR, and SAR.
Addressing Resources in a Repository
Tools designed to interact with Maven repositories translate artifact coordinates into a URL which corresponds to a location in a Maven repository. If a tool such as Maven is looking for version 1.2.0 of the some-library JAR in the group com.example, this request is translated into:
/com/example/some-library/1.2.0/some-library-1.2.0.jar
/com/example/some-library/1.2.0/some-library-1.2.0.pom
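The translation from coordinates to a repository location can be sketched as a small function. This is a minimal illustration, not the code any particular tool uses; the function name is hypothetical, and the file extension is assumed to be passed in directly (for many packaging types it matches the packaging name).

```python
def repository_path(group_id, artifact_id, version, extension="jar"):
    """Build the repository-relative path for an artifact's coordinates."""
    # Dots in the groupId become directory separators.
    group_path = group_id.replace(".", "/")
    file_name = "%s-%s.%s" % (artifact_id, version, extension)
    return "/%s/%s/%s/%s" % (group_path, artifact_id, version, file_name)
```

Calling repository_path("com.example", "some-library", "1.2.0") gives the JAR location listed above, and passing extension="pom" gives the matching POM location.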
PROJECT DEPENDENCIES
Build tools like Maven and Ivy allow you to define project dependencies using Maven coordinates.
Declaring Dependencies in Maven
<project>
  ...
  <dependencies>
    <dependency>
      <groupId>org.codehaus.xfire</groupId>
      <artifactId>xfire-java5</artifactId>
      <version>1.2.5</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  ...
</project>
REMOTE REPOSITORIES
Central Maven Repository
The Central Maven repository contains almost 90,000 software artifacts occupying around 100 GB of disk space. You can look at Central as an example of how Maven repositories operate and how they are assembled.
http://repo1.maven.org
Apache Snapshot Repository
The Apache Snapshot repository contains snapshot artifacts for projects in the Apache Software Foundation. http://repository.apache.org/snapshots/
Codehaus Snapshot Repository
The Codehaus Snapshot repository contains snapshot artifacts for projects hosted by Codehaus. http://nexus.codehaus.org/snapshots/
ABOUT NEXUS
Nexus manages software “artifacts” required for development, deployment, and provisioning. If you develop software, Nexus can help you share those artifacts with other developers and end-users. Maven’s central repository has always served as a great convenience for users of Maven, but it has always been recommended to maintain your own repositories to ensure stability within your organization. Nexus greatly simplifies the maintenance of your own internal repositories and access to external repositories. With Nexus you can completely control access to, and deployment of, every artifact in your organization from a single location.
Downloading Nexus Open Source
To download Nexus Open Source, go to http://nexus.sonatype.org and click on the Download menu item. Download the nexus-oss-webapp-1.6.0-bundle.tar.gz or nexus-oss-webapp-1.6.0-bundle.zip file from the Download directory.
Downloading Nexus Professional
To download Nexus Professional, go to http://www.sonatype.com/products/nexus and click on Download Nexus Pro. After you fill out a simple registration form, a download link will be sent via email.
Installing Java
Nexus Open Source and Nexus Professional only have one prerequisite, a Java Runtime Environment (JRE) compatible with Java 5 or higher. To download the latest release of the Sun JDK, go to http://developers.sun.com/downloads/.
Installing Nexus
Unpack the Nexus distribution in any directory. Nexus doesn’t have any hard coded directories, so it will run from any directory. If you downloaded the ZIP archive, run:
$ unzip nexus-oss-webapp-1.6.0-bundle.zip
And, if you downloaded the GZip’d TAR archive, run:
$ tar xvzf nexus-oss-webapp-1.6.0-bundle.tar.gz
This will create two directories: nexus-webapp-1.6.0/ and sonatype-work/.
The Sonatype Work Directory
The Nexus installation directory nexus-webapp-1.6.0 has a sibling directory named sonatype-work/. This directory contains all of the repository and configuration data for Nexus and is stored outside of the Nexus installation directory to make it easier to upgrade to a newer version of Nexus.
RUNNING NEXUS
When you start Nexus for the first time, it will be running on http://localhost:8081/nexus/. To start Nexus, find the appropriate startup script for your platform in the ${NEXUS_HOME}/bin/jsw directory.
Starting Nexus
The following example listing starts Nexus using the script for Mac OS X. The Mac OS X wrapper is started with a call to nexus start:
$ cd ~/nexus-webapp-1.6.0
$ ls ./bin/jsw/
aix-ppc-32/ linux-ppc-64/ solaris-sparc-32/
aix-ppc-64/ linux-x86-32/ solaris-sparc-64/
hpux-parisc-32/ linux-x86-64/ solaris-x86-32/
hpux-parisc-64/ macosx-universal-32/ windows-x86-32/
$ chmod -R a+x bin
$ ./bin/jsw/macosx-universal-32/nexus start
Nexus Repository Manager...
$ tail -f logs/wrapper.log
INFO ... [ServletContainer:default] -SelectChannelConnector@0.0.0.0:8081
Configuring Nexus as a Service
When installing Nexus, you will often want to configure Nexus as a service. To configure Nexus as a service on Windows:
- (A) Open a Command Prompt
- (B) Change directories to C:/Program Files/nexus-webapp-1.6.0
- (C) Change directories to bin/jsw/windows-x86-32
- (D) Run InstallNexus.bat to install Nexus as a Windows Service
- (E) Run “net start nexus-webapp” to start the Nexus service
To configure Nexus as a service on Linux:
- (A) Copy bin/jsw/$PLATFORM/nexus to /etc/init.d
- (B) chmod 755 /etc/init.d/nexus
- (C) Edit the startup script changing APP_NAME, APP_LONG_NAME, NEXUS_HOME, PLATFORM, WRAPPER_CMD, and WRAPPER_CONF
- (D) (Optional) Set the RUN_AS_USER to “nexus”
Login to Nexus
To use Nexus, fire up a web browser and go to: http://localhost:8081/nexus. Click on the “Log In” link in the upper right-hand corner of the web page, and you should see the login dialog.

Post-install Checklist
After installing Nexus, make the following configuration changes.
- Change the Administrative Password by clicking on Security -> Users. Right-click on the admin user and choose “Set Password”.
- Configure the SMTP Settings by selecting Administration -> Server and filling out the SMTP server information.
- Enable Remote Index Downloads for the Central Maven Repository. Click on Views/Repositories -> Repositories. Select the “Maven Central” repository and open up the Configuration tab. Under Remote Repository Access set Download Remote Indexes to true.
- Install Professional License (Nexus Professional Only). Select Administration -> Licensing and upload your Nexus Professional License.
CONFIGURING MAVEN FOR NEXUS
To use Nexus, you will configure Maven to check Nexus instead of the public repositories. To do this, you’ll need to edit your mirror settings in your ~/.m2/settings.xml file.
Update your Maven Settings
Place the following XML into a file named ~/.m2/settings.xml. This Maven Settings file configures your Maven builds to fetch artifacts from the public group of the Nexus installation available at http://localhost:8081/nexus/.
<settings>
  <mirrors>
    <mirror>
      <!--This sends everything else to /public -->
      <id>nexus</id>
      <mirrorOf>*</mirrorOf>
      <url>http://localhost:8081/nexus/content/groups/public</url>
    </mirror>
  </mirrors>
  <profiles>
    <profile>
      <id>nexus</id>
      <repositories>
        <repository>
          <id>central</id>
          <url>http://central</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>
      </repositories>
      <pluginRepositories>
        <pluginRepository>
          <id>central</id>
          <url>http://central</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </pluginRepository>
      </pluginRepositories>
    </profile>
  </profiles>
  <activeProfiles>
    <!--make the profile active all the time -->
    <activeProfile>nexus</activeProfile>
  </activeProfiles>
</settings>
Deploying Artifacts to Nexus
To deploy artifacts to Nexus you must set server credentials in your Maven Settings and configure your project’s POM to publish to Nexus. Using the default deployment user’s credentials, put the following server element in your Maven Settings XML stored in ~/.m2/settings.xml:
<settings>
  …
  <servers>
    <server>
      <id>releases</id>
      <username>deployment</username>
      <password>deployment123</password>
    </server>
    <server>
      <id>snapshots</id>
      <username>deployment</username>
      <password>deployment123</password>
    </server>
  </servers>
  …
</settings>
And, add the following XML to your Maven project’s pom.xml:
<distributionManagement>
  <repository>
    <id>releases</id>
    <name>Releases Repository</name>
    <url>http://localhost:8081/nexus/content/repositories/releases</url>
  </repository>
  <snapshotRepository>
    <id>snapshots</id>
    <name>Snapshot Repository</name>
    <url>http://localhost:8081/nexus/content/repositories/snapshots</url>
  </snapshotRepository>
</distributionManagement>
This configures your Maven build to deploy snapshots to the hosted Snapshots repository and releases to the hosted Releases repository. When Maven performs the deployment, it will match the id element of the repository with the id element of the server in the settings.xml and send the appropriate credentials.
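The id matching described above can be illustrated with a short lookup. This is a sketch of the idea only, not Maven's implementation: the function name is hypothetical, and the settings document is assumed to have no XML namespace, as in the listing above.

```python
import xml.etree.ElementTree as ET

# Sample settings document mirroring the server entries shown earlier.
SETTINGS = """
<settings>
  <servers>
    <server>
      <id>releases</id>
      <username>deployment</username>
      <password>deployment123</password>
    </server>
    <server>
      <id>snapshots</id>
      <username>deployment</username>
      <password>deployment123</password>
    </server>
  </servers>
</settings>
"""


def find_credentials(settings_xml, repository_id):
    """Return (username, password) for the server whose id matches, else None."""
    root = ET.fromstring(settings_xml)
    for server in root.findall("./servers/server"):
        if server.findtext("id") == repository_id:
            return server.findtext("username"), server.findtext("password")
    return None
```

With the sample above, the id "snapshots" from the snapshotRepository element resolves to the deployment user's credentials, while an id with no matching server entry yields None.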

PROXY REPOSITORIES
This section details working with Proxy Repositories.
What is a Proxy Repository?
A proxy repository sits between your builds and a remote repository like the Central Maven repository. When you ask a proxy repository for an artifact, it checks a local cache of artifacts it has already downloaded. If it does not have the artifact requested, it will retrieve the artifact from the remote repository.
Proxy repositories speed up your builds by serving frequently used artifacts from a local cache. They also provide more stability when your internet connection or the remote repository becomes unavailable.
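The check-the-cache-first behavior can be sketched in a few lines. This is a simplified illustration under assumptions: the class name is hypothetical, the cache is an in-memory dictionary rather than storage on disk, and the remote repository is represented by any callable that maps a path to bytes.

```python
class ProxyRepository:
    """Serve artifacts from a local cache, fetching from the remote on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: path -> bytes
        self._cache = {}
        self.remote_requests = 0

    def get(self, path):
        if path not in self._cache:
            # Cache miss: retrieve the artifact once and keep a local copy.
            self.remote_requests += 1
            self._cache[path] = self._fetch_remote(path)
        return self._cache[path]
```

After the first request for a path, later requests are answered from the cache, so they succeed even if the remote callable would fail.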
Adding a New Proxy Repository
To add a new Proxy Repository, go to Views/Repositories -> Repositories, click on the Add button, and select Proxy Repository from the drop down. In the New Proxy Repository form, supply a unique identifier and name, choose a Repository Policy of either Release or Snapshot, and provide the URL of the remote repository in the Remote Storage Location.
Enabling Remote Index Downloads
While Nexus is preconfigured with the Central Maven repository, it is not configured to download indexes from remote repositories. Enabling indexes is essential if you want to take full advantage of Nexus’ intuitive search interface. To enable Remote Index Downloads, go to Views/Repositories -> Repositories. Select the Maven Central repository and click on the Configuration tab. Set “Download Remote Indexes” to true and click on Save. Nexus will then download the repository index from the remote repository. This process may take a few minutes depending on the speed of your connection.
If the remote index has been successfully downloaded and processed, the Browse Index tab for the Maven Central repository will display thousands of artifacts.
HOSTED REPOSITORIES
What is a Hosted Repository?
A Hosted Repository contains artifacts which have been published to a Nexus instance. These published artifacts are stored in the Sonatype Work directory. This can include repositories that hold release artifacts and repositories that hold snapshot artifacts.
Nexus comes configured with three Hosted repositories: 3rd Party, Releases, and Snapshots. The Releases repository is for your own internal software release artifacts, and the Snapshots repository is for your own project’s snapshot artifacts. The 3rd Party repository is for 3rd party artifacts such as proprietary drivers or commercial libraries which are not available from a public Maven repository.
Adding a New Hosted Repository
REPOSITORY GROUPS
What is a Repository Group?
A repository group combines one or more repositories under a single repository URL. You use repository groups to simplify the configuration of tools like Maven which need to retrieve artifacts from a set of common repositories. As a Nexus administrator you can define new repositories, control which repositories are available in a group, and set the order in which artifacts are resolved from repositories in a group.
Adding Repositories to a Group
Nexus ships with a Public Repository Group which contains all of your hosted and proxy repositories. If you create a new repository, and you need to add this repository to the Public Group, go to Views/Repositories -> Repositories and select the Configuration tab.
To add a repository to a repository group, drag a repository from the “Available Repositories” list to the “Ordered Group Repositories” list and click on the Save button.
Reordering Repositories in a Group
When Nexus resolves an artifact from a Repository Group it iterates over the repositories in the group, returning the first match. If an artifact exists in more than one repository, you may need to change the order of repositories in a Repository Group. To change the order, go to Views/Repositories -> Repositories, select the group you need to reorder, and then select the Configuration tab. To reorder repositories, click and drag repositories to the correct order in the Ordered Group Repositories field and then click Save.
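The first-match rule explains why order matters. A minimal sketch, assuming each repository is modeled as a name plus a dictionary from path to content (both hypothetical simplifications, not how Nexus stores repositories):

```python
def resolve_from_group(ordered_repositories, path):
    """Return (repository name, content) from the first repository holding path."""
    for name, contents in ordered_repositories:
        if path in contents:
            return name, contents[path]
    return None
```

If the same path exists in two member repositories, swapping their positions in the list changes which copy is returned.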
NEXUS ADMINISTRATION
Configuring Nexus Server
To configure Sonatype Nexus, click on Administration -> Server to load the Nexus configuration panel. The following is a list of some of the configuration sections in this panel:
SMTP Settings: Nexus supports release and deployment using email. Before Nexus can send emails, you will need to configure the appropriate SMTP settings in this section.
HTTP Request Settings: Configure custom timeouts and retry behavior for remote repositories as well as customize the Nexus User Agent.
Security Settings: Nexus’ pluggable security providers are configured in this section. You can control which security realms are active and the order in which they are consulted during authentication and authorization.
Anonymous Access: Control how and if Nexus is made available to anonymous, unauthenticated users.
Application Server Settings: If Nexus is hosted behind a proxy, or if you need to customize the URL, you can do so here.
System Notifications Settings: Configure automatic email notifications for important system events.
Configuring Scheduled Tasks
If you are publishing snapshot releases to Nexus, you will want to configure at least one scheduled task to periodically delete older snapshot releases. To configure a Scheduled Task, click on Administration -> Scheduled Tasks, and click on the Add button. Select the appropriate Task Type. Some of the more common and useful Task Types follow:
Backup All Nexus Configuration Files: Will cause Nexus to create a snapshot of all Nexus configuration files.
Download Indexes: Nexus will retrieve or update indexes for all remote, proxy repositories.
Evict Unused Proxy Items: If space is at a premium, you can configure Nexus to remove proxy items which have not been used within a specific time period.
Remove Snapshots from Repository: Nexus can be configured to keep a minimum number of snapshots and to delete snapshots older than a specific time period.
Scheduled tasks can be configured to send an email alert when they are executed, and you can schedule a task to run Once, Hourly, Daily, Weekly, Monthly, or using a custom cron expression.
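The retention rule behind a snapshot removal task can be sketched as follows. This is one possible reading of "keep a minimum number and delete those older than a period", not the exact Nexus algorithm; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_remove(snapshots, min_count, max_age_days, now):
    """Pick snapshots to delete from a list of (file_name, deployed_at) pairs.

    The newest min_count snapshots are always kept; of the rest, only those
    deployed more than max_age_days before now are selected for removal.
    """
    newest_first = sorted(snapshots, key=lambda item: item[1], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    candidates = newest_first[min_count:]
    return [name for name, deployed_at in candidates if deployed_at < cutoff]
```

Passing the current time in as a parameter keeps the rule easy to check against fixed dates.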
Defining Repository Routes
Repository routes allow you to direct requests matching specific patterns to specific repositories. For example, if you wanted to make sure all requests for artifacts under org.some-oss were directed to the internal, hosted Releases and Snapshots repositories, you would define the following route:
Type: Inclusive
URL Pattern: .*/org/some-oss/.*
Repositories: Releases, Snapshots
To define a Repository Route, go to Administration -> Routing. The Routing panel is where you can edit existing routes and create additional routes.
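The effect of an inclusive route like the one above can be sketched with a regular expression match. This is an illustration under assumptions: only inclusive routes are modeled, the first matching route wins, and the function name and data layout are hypothetical.

```python
import re


def repositories_to_search(request_path, inclusive_routes, all_repositories):
    """Narrow the repositories consulted for a request using inclusive routes.

    inclusive_routes is a list of (url_pattern, repository_names) pairs.
    """
    for pattern, repositories in inclusive_routes:
        if re.fullmatch(pattern, request_path):
            # Only the repositories named by the route are consulted.
            return [name for name in all_repositories if name in repositories]
    # No route matched: every repository remains eligible.
    return list(all_repositories)
```

A request under /org/some-oss/ is limited to Releases and Snapshots, while any other request is still resolved against every repository.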
Configuring Nexus Security
Nexus Security has a highly configurable Role-based Access Control system which relies on Privileges, Roles, and Users. By default, Nexus ships with three default users (admin, deployment, and anonymous) along with associated roles. To configure a new Nexus user, go to Security -> Users and open up the Users panel. On the Users panel, click on the Add button to add a new Nexus user. Once the user is created, click on the user to edit the user’s email address or to assign the user new Nexus roles.
To create or edit roles, click on Security -> Roles. Most of the default roles cannot be edited directly, but you are free to create new, custom roles by clicking on the Add button. Once a role is created, you can assign it new privileges by dragging Roles and Privileges from the Available Roles/Privileges list to the Selected Roles/Privileges list and clicking on the Save button.
NEXUS PROFESSIONAL
Nexus Professional is a central point of access to external repositories which provides the necessary controls to make sure that only approved artifacts enter into your software development environment. Central features of Nexus Professional are:
Nexus Procurement Suite: Gives Nexus administrators control of what artifacts are allowed into an organization from a remote repository.
Nexus Staging Suite: Provides workflow support for software releases. Artifacts can be deployed to staging repositories and promoted only after they have been tested and certified.
Hosting Project Web Sites: With Nexus Professional, you can publish Maven project sites directly to your repository manager.
Support for OSGi Repositories: Nexus Professional supports OBR and P2 repositories used in OSGi and Eclipse development.
Enterprise LDAP Support: Nexus Professional adds support for LDAP clustering and for mixed authentication configurations that draw on multiple sources of security information, including Atlassian’s Crowd server.
In addition to these features, Nexus Pro also adds support for Artifact Bundles, Centralized Management of Maven Settings, Custom Repository Metadata, Self-serve User Account Sign-up, and Artifact Validation and Verification.
OTHER NEXUS RESOURCES
For more information about Sonatype’s Nexus, see the following resources:
Free Nexus Book:
http://books.sonatype.com/nexus-book
Nexus OSS Site:
http://nexus.sonatype.org
Nexus Pro Site:
http://www.sonatype.com/products/nexus
Participate in the Nexus Community
Everyone is welcome to participate in the Nexus community as a developer or user. To participate, take advantage of the following resources:
Nexus IRC Channel:
#nexus on irc.codehaus.org:6667
Subscribe to the Nexus User Mailing List:
nexus-user-subscribe@sonatype.org
Subscribe to the Nexus Developer Mailing List:
nexus-dev-subscribe@sonatype.org
Subscribe to the Nexus Pro User Mailing List:
nexus-pro-users-subscribe@sonatype.org
Checkout Nexus Source Code from Subversion:
http://svn.sonatype.org/nexus/trunk
Browse the Nexus JIRA Project:
https://issues.sonatype.org/browse/NEXUS
About The Authors

Jason Van Zyl
Jason Van Zyl is the founder and CTO of Sonatype, the Maven company, and founder of the Apache Maven Project, the Plexus IoC framework, and the Apache Velocity project.
Recommended Book
This book covers both Nexus Open Source and Nexus Professional, a product which brings full control and visibility to organizations which depend on Maven repositories to manage releases and distribute software.













