Getting Started with Selenium

By Frank Cohen

16,809 Downloads · Refcard 67 of 151 (see them all)

The Essential Selenium Cheat Sheet

Selenium is a portable software testing framework for Web applications. Selenium works well for QA testers needing record/playback authoring of tests and for software developers needing to author tests in Java, Ruby, Python, PHP, and several other languages using the Selenium API. The Selenium architecture runs tests directly in most modern Web browsers. This DZone Refcard starts with how to install Selenium and then moves on to cover working with TinyMCE, Ajax Objects, Reporting options and even the Future of Selenium.

About Selenium

Selenium is a portable software testing framework for Web applications. Selenium works well for QA testers needing record/playback authoring of tests and for software developers needing to author tests in Java, Ruby, Python, PHP, and several other languages using the Selenium API. The Selenium architecture runs tests directly in most modern Web browsers, including MS IE, Firefox, Opera, Safari, and Chrome. Selenium deploys on Windows, Linux, and Macintosh platforms.

Selenium was developed by a team of programmers and testers at ThoughtWorks. Selenium is open source software, released under the Apache 2.0 license and can be downloaded and used without royalty to the originators.

Architecture in a Nutshell

Selenium Browserbot is a JavaScript class that runs within a hidden frame within a browser window. The Browserbot runs your Web application within a sub-frame. The Browserbot receives commands to operate against your Web application, including commands to open a page, type characters into form fields, and click buttons.

Selenium architecture offers several ways to play a test.

Selenium architecture

Functional testing (Type 1) uses the Selenium IDE add-on to Firefox to record and play back Selenium tests in Firefox. Functional testing (Type 2) uses Selenium Grid to run tests in a farm of browsers and operating environments. For example, install Selenium Grid on 3 operating environments (for example, Windows Vista, Windows XP, and Ubuntu) and on each install 2 browsers (for example, Microsoft Internet Explorer and Firefox) to smoke test, integration test, and functional test your application on 6 combinations of operating environment and browser. Many more combinations of operating environment and browser are possible. An option for functional testing (Type 2) is to use the PushToTest TestMaker/TestNode open source project. It uses Selenium RC to provide Selenium Grid-like capability with the added advantage of providing data-driven Selenium tests, results analysis charts and graphs, and better stability of the test operations.

The PushToTest open-source project provides Selenium data-driven testing, load testing, service monitoring, and reporting. TestMaker runs load and performance tests (Type 3) in a PushToTest TestNode using the PushToTest SeleniumHTMLUnit library and the HTMLUnit Web browser (with the Rhino JavaScript engine).

Hot Tip

HTMLUnit runs Selenium tests faster than a real browser and requires much less memory and CPU resources.

Installing Selenium

Selenium IDE installs as a Firefox add-on. Below are the steps to download and install Selenium IDE:

  1. Download selenium-ide-1.0.2.xpi (or similar) from http://seleniumhq.org.
  2. From Firefox open the .xpi file. Follow the Firefox instructions.

Note: Selenium Grid runs as an Ant task. You need JDK 1.6, Ant 1.7, and the Selenium Grid 1.0 binary distribution. Additional directions can be found at http://selenium-grid.seleniumhq.org/get_started.html

See http://www.pushtotest.com/products for TestMaker installation instructions.

Record/Playback Using Selenium IDE

Hot Tip

Selenium IDE is a Firefox add-on that records clicks, typing, and other actions to make a test, which you can play back in the Firefox browser. Open Selenium IDE from the Firefox Tools drop-down menu, Selenium IDE command.

Selenium IDE

Selenium IDE preferences

Selenium IDE records interactions with the Web application, with one command per line. Clicking a recorded command highlights the command, displays a reference page, and displays the command in a command form editor. Click the command form entry down-triangle to see a list of all the Selenium commands.

Run the current test by clicking the Run Test Case icon in the icon bar. Right click a test command to choose the Set Breakpoint command. Selenium IDE runs the test to a breakpoint and then pauses. The icon bar Step icon continues executing the test one command at a time.

With Selenium IDE open, the menu bar context changes to provide access to Selenium commands: Open/Close Test Case and Test Suite. Test Suites contain one or more Test Cases.

Use the Options dropdown menu, Options command to set general preferences for Selenium IDE.

Selenium IDE provides an extensibility API set called User Extensions. You can implement custom functions and modify Selenium IDE behavior by writing JavaScript functions. We do not recommend writing User Extensions because the Selenium project does not guarantee backward compatibility from one version to the next.

Selenium Context Menu provides quick commands to insert new Selenium commands, evaluate XPath expressions within the live Web page, and to show all available Selenium commands. Right click on commands in Selenium IDE, and right-click on elements in the browser page to view the Selenium Context Menu commands.

Selenese Table Format

Selenium IDE is meant to be a lightweight record/playback tool to facilitate getting started with Selenium. It is not designed to be a full test development environment. While Selenium records in an HTML table format (named Selenese), the table format only handles simple procedural test use cases. The Selenese table format does not provide operational test data support, conditionals, branching, or looping. For these you must export Selenese files into Java, Ruby, or other supported languages.

Selenium Command reference

Selenium comes with commands for controlling Selenium test operations, for browser and cookie operations, and for pop-up, button, list, edit field, keyboard, mouse, and form operations. Selenium also provides access operations to examine the Web application (details are at http://release.seleniumhq.org/selenium-core/0.8.0/reference.html).

Command | Target/Value | Wait Command

Selenium Control
setTimeout | milliseconds | -
setMouseSpeed | number of pixels | setMouseSpeedAndWait
setSpeed | milliseconds | setSpeedAndWait
addLocationStrategy | strategyName | addLocationStrategyAndWait
allowNativeXpath | boolean | allowNativeXpathAndWait
ignoreAttributesWithoutValue | boolean | ignoreAttributesWithoutValueAndWait
assignId | locator | assignIdAndWait
captureEntirePageScreenShot | filename, kwargs | captureEntirePageScreenShotAndWait
echo | message | -
pause | milliseconds | -
runScript | javascript | runScriptAndWait
waitForCondition | javascript | -
waitForPageToLoad | milliseconds | -
waitForPopUp | windowID | -
fireEvent | locator | fireEventAndWait

Browser Operations
open | url | openAndWait
openWindow | url | openWindowAndWait
goBack | - | goBackAndWait
refresh | - | refreshAndWait
close | - | -
deleteCookie | name | deleteCookieAndWait
deleteAllVisibleCookies | - | deleteAllVisibleCookiesAndWait
setBrowserLogLevel | logLevel | setBrowserLogLevelAndWait

Cookie Operations
createCookie | nameValuePair | createCookieAndWait
deleteCookie | name | deleteCookieAndWait
deleteAllVisibleCookies | - | deleteAllVisibleCookiesAndWait

Popup Box Operations
answerOnNextPrompt | answer | answerOnNextPromptAndWait
chooseCancelOnNextConfirmation | - | chooseCancelOnNextConfirmationAndWait
chooseOkOnNextConfirmation | - | chooseOkOnNextConfirmationAndWait

Checkbox & Radio Buttons
check | locator | checkAndWait
uncheck | locator | uncheckAndWait

Lists & Dropdowns
addSelection | locator | addSelectionAndWait
removeSelection | - | removeSelectionAndWait
removeAllSelections | - | removeAllSelectionsAndWait

Edit Fields
type | locator | typeAndWait
typeKeys | locator | typeKeysAndWait
setCursorPosition | locator | setCursorPositionAndWait

Keyboard Operations
keyDown | locator | keyDownAndWait
keyPress | locator | keyPressAndWait
keyUp | locator | keyUpAndWait
altKeyDown | - | altKeyDownAndWait
altKeyUp | - | altKeyUpAndWait
controlKeyDown | - | controlKeyDownAndWait
controlKeyUp | - | controlKeyUpAndWait
metaKeyDown | - | metaKeyDownAndWait
metaKeyUp | - | metaKeyUpAndWait
shiftKeyDown | - | shiftKeyDownAndWait
shiftKeyUp | - | shiftKeyUpAndWait

Mouse Operations
click | locator | clickAndWait
clickAt | locator | clickAtAndWait
doubleClick | locator | doubleClickAndWait
doubleClickAt | locator | doubleClickAtAndWait
contextMenu | locator | contextMenuAndWait
contextMenuAt | locator | contextMenuAtAndWait
mouseDown | locator | mouseDownAndWait
mouseDownAt | locator | mouseDownAtAndWait
mouseMove | locator | mouseMoveAndWait
mouseMoveAt | locator | mouseMoveAtAndWait
mouseOut | locator | mouseOutAndWait
mouseOver | locator | mouseOverAndWait
mouseUp | locator | mouseUpAndWait
mouseUpAt | locator | mouseUpAtAndWait
dragAndDrop | locator | dragAndDropAndWait
dragAndDropToObject | sourceLocator | dragAndDropToObjectAndWait

Form Operations
submit | formLocator | submitAndWait

Windows/Element Selection
select | locator | selectAndWait
selectFrame | locator | -
selectWindow | windowID | -
focus | locator | focusAndWait
highlight | locator | highlightAndWait
windowFocus | - | windowFocusAndWait
windowMaximize | - | windowMaximizeAndWait

Element Locators

Selenium commands identify elements within a Web page using:

identifier=id Select the element with the specified @id attribute. If no match is found, select the first element whose @name attribute is id.

name=name Select the first element with the specified @name attribute. The name may optionally be followed by one or more element-filters, separated from the name by whitespace. If the filterType is not specified, value is assumed. For example: name=style value=carol

dom=javascriptExpression Find an element using JavaScript traversal of the HTML Document Object Model. DOM locators must begin with "document." For example:

dom=document.forms['form1'].myList
dom=document.images[1]

xpath=xpathExpression Locate an element using an XPath expression. Here are a few examples:

xpath=//img[@alt='The image alt text']
xpath=//table[@id='table1']//tr[4]/td[2]
/html/body/table/tr/td/a
//div[@id='manage_messages_iterator']
//tr[@class='SelectedRow']/td[2]
//td[child::text()='myemail@me.com']
//td[contains(child::text(),'@')]

link=textPattern Select the link (anchor) element which contains text matching the specified pattern.

css=cssSelectorSyntax Select the element using css selectors. For example:

css=a[href="#id1"]
css=span#firstChild + span

Selenium 1.0 css selector locator supports all css1, css2 and css3 selectors except namespace in css3, some pseudo classes (:nth-of-type, :nth-last-of-type, :first-of-type, :last-of-type, :only-of-type, :visited, :hover, :active, :focus, :indeterminate) and pseudo elements (::first-line, ::first-letter, ::selection, ::before, ::after).

Without an explicit locator prefix, Selenium uses the following default strategies:

dom, for locators starting with "document."
xpath, for locators starting with "//"
identifier, otherwise
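The default-strategy rules above can be sketched as a small Python function. This is only an illustration of the three rules as stated here, not Selenium's own code; the function name default_locator_strategy is made up for the example.

```python
def default_locator_strategy(locator):
    """Return the strategy assumed for a locator that has no explicit prefix."""
    if locator.startswith("document."):
        return "dom"
    if locator.startswith("//"):
        return "xpath"
    return "identifier"


# Three unprefixed locators taken from the examples in this section
for example in ("document.forms[0].elements[1]", "//input[@name='q']", "q"):
    print(example, "->", default_locator_strategy(example))
```

Note that a locator such as /html/body/table/tr/td/a starts with a single slash, so under these rules it would be treated as an identifier unless it carries the xpath= prefix.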

Your choice of element locator type has an impact on the test playback performance. The following table compares performance of Selenium element locators using Firefox 3 and Internet Explorer 7.

Locator used | Type | Firefox 3 | Internet Explorer 7
q | Locator | 47 ms | 798 ms
//input[@name='q'] | XPath | 32 ms | 563 ms
//html[1]/body[1]//form[1]//input[2] | XPath | 47 ms | 859 ms
//input[2] | XPath | 31 ms | 564 ms
document.forms[0].elements[1] | DOM Index | 31 ms | 125 ms

Additional details on Selenium performance can be found at: http://www.pushtotest.com/docs/thecohenblog/symposium

Script-Driven Testing

Selenium implements a domain specific language (DSL) for testing. Some applications do not lend themselves to record/playback: 1) The test flow changes depending on the results of a step in the test, 2) The input data changes depending on the state of the application, and 3) The test requires asynchronously operating test flows. For these conditions, consider using the Selenium DSL in a script-driven test. Selenium provides support for Java, Python, Ruby, Groovy, PHP, and C#.

Selenium IDE helps get a script-driven test started by exporting to a unit test format. For example, consider the following test in the Selenese table format:

Selenese table format

Use the Selenium IDE File menu, Export, Python Selenium RC command to export the test to a JUnit-style TestCase written in Python. The following shows the Python source code:


from selenium import selenium
import unittest, time, re

class franktest(unittest.TestCase):
	def setUp(self):
		self.verificationErrors = []
		# Connect to the Selenium RC server on localhost:4444 and launch the browser
		self.selenium = selenium("localhost", 4444, "*chrome", \
			"http://change-this-to-the-site-you-are-testing/")
		self.selenium.start()

	def test_franktest(self):
		sel = self.selenium
		sel.open("/")
		sel.type("q", "sock puppet")
		sel.click("sa")
		sel.wait_for_page_to_load("30000")
		sel.click("//div[@id='res']/div[1]/ol/li[1]/div/h2/a/em")
		sel.click("//div[@id='res']/div[1]/ol/li[1]/div/h2/a/em")
		sel.wait_for_page_to_load("30000")

	def tearDown(self):
		# Close the browser, then fail if any verify command recorded an error
		self.selenium.stop()
		self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
	unittest.main()

An exported test like the one above has access to all of Python's functions, including conditionals, looping and branching, reusable object libraries, inheritance, collections, and dynamically typed data formats.
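As a rough sketch of what looping and conditionals add, the recorded search steps can be wrapped in a function and run once per search term. The locators "q" and "sa" come from the exported test; the FakeSelenium class is a hypothetical stand-in so the sketch runs without a Selenium RC server, and a real test would pass the started client from setUp instead.

```python
def search_each(sel, terms):
    """Repeat the recorded search for every term; return the terms not found on the result page."""
    missing = []
    for term in terms:
        sel.open("/")
        sel.type("q", term)
        sel.click("sa")
        sel.wait_for_page_to_load("30000")
        if not sel.is_text_present(term):
            missing.append(term)
    return missing


class FakeSelenium:
    """Hypothetical stand-in for a started Selenium RC client (assumption for offline runs)."""

    def __init__(self, known_terms):
        self.known_terms = set(known_terms)
        self.typed = None

    def open(self, url):
        pass

    def type(self, locator, value):
        self.typed = value

    def click(self, locator):
        pass

    def wait_for_page_to_load(self, timeout):
        pass

    def is_text_present(self, pattern):
        # Pretend the result page shows the term only when the site "knows" it
        return self.typed in self.known_terms


missing = search_each(FakeSelenium({"sock puppet"}), ["sock puppet", "zzzz"])
print(missing)
```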

Selenium provides a Selenium RC client package for Java, Python, C#, Ruby, Groovy, PHP, and Perl. The client object identifies the Selenium RC service in its constructor:


self.selenium = selenium("localhost", 4444, "*iexplore", \
	"http://change-this-to-the-site-you-are-testing/")
self.selenium.start()

The above code identifies the Selenium RC service running on the localhost machine at port 4444. The third parameter selects the browser, so this client will run the test in Microsoft Internet Explorer. The fourth parameter identifies the base URL from which the recorded test will operate.

Selenium RC service

The selenium.start() command initializes and starts the Selenium RC service. The Selenium RC client module (from selenium import selenium in Python) provides methods to operate the Selenium DSL commands (click, type, etc.) in the Browserbot running in the browser. For example, selenium.click("open") tells the Browserbot to send a click command to the element with an id attribute equal to "open". The browser responds to the click command and communicates with the Web application.

At the end of the test the selenium.stop() command ends the Selenium RC service.

Selenium and Ajax

Ajax uses asynchronous JavaScript functions to manipulate the browser's DOM representation of the Web page. Many Selenium commands are not compatible with Ajax. For example, clickAndWait will time out waiting for the browser to load the Web page because Ajax functions that manipulate the current Web page in response to a click event do not reload the page. We recommend using Selenium commands that poll the DOM until the Ajax methods complete their tasks. For example, waitForElementPresent polls the DOM until the JavaScript function adds the desired element to the page before continuing with the rest of the Selenium script.

Consider the following checklist when using Selenium with Ajax applications:

Your Selenium tests may require a large number of extra commands to ensure the test stays in synchronization with the Ajax application. Consider an Ajax application that requires a log-in, then displays a selection list of items, then presents an order form. Ajax enabled applications often deliver multiple steps of function on a single page and show-and-hide elements as you work with the application. Some even disable form submit buttons and other user interface elements until you enter enough valid information. For an application like this you will need a combination of Selenium commands. Consider the following Selenium test:

waitForElementPresent pauses the test until the Ajax application adds the requisite element to the page. waitForCondition pauses the test until the JavaScript function evaluates to true.
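In an exported Python test, the same polling idea might be sketched as a helper built on the RC client's is_element_present call. The helper name, the default timings, the locator "id=order_form", and the AppearsLater stand-in class are all assumptions made for this illustration.

```python
import time


def wait_for_element_present(sel, locator, timeout=30.0, interval=0.5):
    """Poll sel.is_element_present(locator) until it is true or timeout seconds pass."""
    deadline = time.time() + timeout
    while True:
        if sel.is_element_present(locator):
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)


class AppearsLater:
    """Hypothetical stand-in client: the element shows up on the third poll."""

    def __init__(self):
        self.polls = 0

    def is_element_present(self, locator):
        self.polls += 1
        return self.polls >= 3


client = AppearsLater()
found = wait_for_element_present(client, "id=order_form", timeout=5.0, interval=0.01)
print(found, client.polls)
```

Returning False on timeout, rather than raising, leaves it to the calling test to decide whether a missing element is a failure.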

Some Ajax applications use lazy-loading techniques to improve user interaction with the application. For example, a stock market application provides a list of 10 stock quotes asynchronously after the user clicks the submit button. The list may take 10 to 50 seconds to completely update on the screen. Using waitForXPathCount pauses the test until the page contains the number of nodes that match the specified XPath expression.

Many Ajax applications use dynamic element ids. The Ajax application that named the Log-out button app_6 may later rename the button to app_182. We recommend using DOM element locator techniques, or XPath techniques if needed, to find elements dynamically by position or by another attribute.
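One way to avoid depending on a generated id is to build the XPath locator from an attribute that stays stable. The sketch below assumes the Log-out button is an input element whose value attribute reads "Log out"; both the helper name and that markup are hypothetical.

```python
def xpath_by_attribute(tag, attribute, value):
    """Build an XPath locator that matches on a stable attribute instead of the id."""
    if "'" in value:
        # Quoting rules for embedded quotes are left out of this sketch
        raise ValueError("value must not contain a single quote")
    return "//%s[@%s='%s']" % (tag, attribute, value)


locator = xpath_by_attribute("input", "value", "Log out")
print(locator)
```

The resulting string can be passed to any command that takes a locator, for example sel.click(locator) on a started client.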

Command window

Working with TinyMCE and Ajax Objects

Ajax is about moving functions off the server and into the browser. Selenium architecture supports innovative new browser-based functions because Selenium's Browserbot is a JavaScript class itself. The Browserbot even lets Selenium tests operate JavaScript functions as part of the test. For example, TinyMCE (http://tinymce.moxiecode.com) is a graphical text editor component for embedding in Web pages. TinyMCE supports styled text and what-you-see-is-what-you-get editing. Testing TinyMCE can be challenging. Selenium offers click and type functions that interact with TinyMCE but no direct commands for TinyMCE's more advanced functions. For example, imagine testing TinyMCE's ability to stylize text. The test needs to insert text, move the insertion point, select a sentence, bold the text, and drag the sentence to another paragraph. This is beyond Selenium's DSL. Instead, the Selenium test may include JavaScript commands that interact with TinyMCE's published API (http://tinymce.moxiecode.com/documentation.php).

Here is an example of using the TinyMCE API from a Selenium test context:


this.browserbot.getCurrentWindow().tinyMCE.execCommand
('mceInsertContent',false,'<b>Hello world!!</b>');

Run the above JavaScript function from within a Selenium test using the AssertEval command.


AssertEval javascript:this.browserbot.getCurrentWindow().tinyMCE.
execCommand('mceInsertContent',false,'<b>Hello world!!</b>');
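From an exported Python test, the same JavaScript could be sent through the RC client's get_eval method. The sketch below only builds the script string so that it runs without a browser; the helper name is an assumption, and json.dumps is used here simply to quote the HTML safely as a JavaScript string literal.

```python
import json


def tinymce_insert_script(html):
    """Build the JavaScript that inserts html through TinyMCE's execCommand."""
    return (
        "this.browserbot.getCurrentWindow().tinyMCE."
        "execCommand('mceInsertContent', false, %s);" % json.dumps(html)
    )


script = tinymce_insert_script("<b>Hello world!!</b>")
print(script)
# With a started Selenium RC client named sel, the script would be run as:
#     sel.get_eval(script)
```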

Data Production

Selenium offers no operational test data production capability itself. For example, a Selenium test of a sign-in page usually needs sign-in name and sign-in password operational test data to operate. Two options are available: 1) Use the data access features in Java, Ruby, or one of the other supported languages, 2) Use PushToTest TestMaker's Selenium Script Runner to inject data from comma separated value (CSV) files, relational databases, objects, and Web services. See http://tinyurl.com/btxvn4 for details.

Create a Comma-Separated-Value file. Use your favorite text editor or spreadsheet program. Name the file data.csv. The contents must be in the following form.

Comma-Separated-Value file

The first row of the data file contains column names. These will be used to map values into the Selenium test. Change the Selenium test to refer to the mapping names. PushToTest maps the data from the named column in the CSV data file to the Selenium test data using the first row definitions.
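For option 1, an exported Python test could read the same kind of file with the standard csv module, which also uses the first row as column names. This is a sketch outside TestMaker; the column names username and password and the sample rows are invented for illustration, and an in-memory string stands in for data.csv.

```python
import csv
import io

# Stand-in for the contents of data.csv (hypothetical columns and values)
SAMPLE = "username,password\nfrank,secret1\nmaria,secret2\n"


def load_rows(handle):
    """Read CSV rows as dictionaries keyed by the names in the first row."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE))
for row in rows:
    # In a real test these values would feed sel.type(...) calls
    print(row["username"], row["password"])
```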

Connect the Data Production Library (DPL) to the Selenium test in a TestMaker TestScenario. Begin by defining a HashDPL. This DPL reads from CSV data files and provides the data to the test.


<DataSources>
	<dpl name="mydpl" type="HashDPL">
		<argument name="file" dpl="rsc" value="getDataByIndex" index="0"/>
	</dpl>
</DataSources>

Next, tell the TestScenario to send the data.csv and Selenium test files to the TestNodes that will operate the test.


<resources>
	<data path="data.csv"/>
	<selenese path="CalendarTest.selenium"/>
</resources>

Then tell the Selenium ScriptRunner to use the DPL provided data when running the Selenium test.


<run name="CalendarTest" testclass="CalendarTest.selenium"
	method="runSeleneseFile" langtype="selenium">
	<argument dpl="mydpl" name="DPL_Properties" value="getNextData"/>
</run>

The getNextData operation gets the next row of data from the CSV file. The Selenium ScriptRunner injects the data into the Selenium test.

Browser Sandbox, Redirect, and Proxy Issues

Selenium RC launches the browser with itself as the proxy server to inject the JavaScript of the Browserbot and your test. This architecture makes it possible to run the same test on multiple browsers. However, some browsers will warn the user of possible security threats when the proxy starts and when the test requests functions or pages outside of the originating domain. The browser takes control and stops the Browserbot operations to display the warning message. When this happens, the test stops until a user dismisses the warning. There are no reliable cross-browser workarounds.

Some Web applications redirect from http to https URLs. The browser will often issue a warning that stops the Selenium test.

Selenium does not support a test moving across domains. For example, a test that started with a base URL of www.mydomain.com may not open a page on www.secondomain.com.

Selenium RC Browser Profiles

Selenium Remote Control (RC) enables test operation on multiple real browsers. A browser profile attribute may be any of the following installed browsers: chrome, konqueror, piiexplore, iehta, mock, opera, pifirefox, safari, iexplore, and custom. In the client constructor the profile name is prefixed with an asterisk, as in "*iexplore". Append the path to the real browser after the browser profile if your system path does not include the path to the browser. For example:


*firefox /Applications/Firefox.app/Contents/MacOS/firefox
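A small helper can assemble that browser string before it is passed as the third argument of the client constructor. The helper name browser_start_command is hypothetical; it only joins the asterisk-prefixed profile and the optional path.

```python
def browser_start_command(profile, path=None):
    """Return the browser string: '*profile', optionally followed by the browser path."""
    command = "*" + profile.lstrip("*")
    if path is None:
        return command
    return command + " " + path


print(browser_start_command("iexplore"))
print(browser_start_command("firefox", "/Applications/Firefox.app/Contents/MacOS/firefox"))
```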

Component approach example

Many organizations pursue a "Test and Trash" methodology to achieve agile software development lifecycles. For example, an organization in pursuit of agile techniques may change up to 30% of an application with an application lifecycle of 8 weeks. Without giving the change much thought, up to 30% of their recorded tests break!

Sample test

We recommend a component approach to building tests. Test components perform specific test operations. We write or record tests as individual components of test function. For example, a component operates the sign-in function of a private Web application. When the sign-in portion of the application changes, we only need to change the sign-in test and the rest of the test continues to perform normally.

Selenium supports the component approach in three ways: 1) Selenium IDE supports Test Suites and Test Cases, 2) exporting Selenium tests to programming languages (Java, Ruby, Perl, etc.) creates reusable software classes, and 3) PushToTest TestMaker supports multiple use cases with parameterized test use cases.
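For the second of these, a component in an exported Python test can be as simple as a function per operation. In the sketch below, the page paths, locators, and function names are all hypothetical, and RecordingSelenium is a stand-in that just records calls so the example runs without a Selenium RC server.

```python
def sign_in(sel, user, password):
    """Sign-in component: the only place that knows the sign-in page's locators."""
    sel.open("/login")
    sel.type("username", user)
    sel.type("password", password)
    sel.click("submit")
    sel.wait_for_page_to_load("30000")


def order_product(sel, sku):
    """Order component: reuses whatever session the sign-in component set up."""
    sel.open("/order")
    sel.type("sku", sku)
    sel.click("buy")


class RecordingSelenium:
    """Hypothetical stand-in client that records every command it receives."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args):
            self.calls.append((name,) + args)
        return record


sel = RecordingSelenium()
sign_in(sel, "frank", "secret")
order_product(sel, "A-100")
print(len(sel.calls), sel.calls[0])
```

If the sign-in page changes, only sign_in needs editing; every test that calls it picks up the fix.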

In Selenium IDE, the File menu enables tests to be saved as test cases or test suites. Record a test, use File -> Save Test Case. Create a second Test Case by choosing File -> New Test Case. Record the second test use case. Save the TestSuite for these two test use cases by choosing File -> Save TestSuite. Click the "Run entire test suite" icon from the Selenium IDE tool bar.

TestMaker defines test use cases using a simple XML notation:


<usecases>
	<usecase name="MailerCheck_usecase">
		<test>
		<run name="LogIn" testclass="Login.selenium" instance="myinst"
			method="runSeleneseFile" langtype="selenium">
		</run>
		<run name="OrderProduct" testclass="OrderProduct.selenium" instance="myinst"
			method="runSeleneseFile" langtype="selenium">
		</run>
		</test>
	</usecase>
</usecases>

Reporting options

Selenium offers no results reporting capability of its own. Two options are available: 1) Write your tests as a set of JUnit tests and use JUnit Report (http://ant.apache.org/manual/OptionalTasks/junitreport.html) to plot success/failure charts, 2) Use PushToTest TestMaker Results Analysis Engine to produce more than 300 charts from the transaction and step time tracking of Selenium tests.

For example, TestMaker tracks Selenium command duration in a test suite or test case. Consider the following chart. This shows the "Step" time it takes to process each Selenium command in a test use case over 10 equal periods of time that the test took to operate.

Step contribution

Selenium Biosphere

TestMaker allows repurposing Selenium tests as load tests and service monitors. http://www.pushtotest.com

BrowserMob facilitates low-cost Selenium load testing. http://browsermob.com/load-testing

SauceLabs provides a farm of Selenium RC servers for testing. http://saucelabs.com/

ThoughtWorks Twist can be used for test authoring and management. http://studios.thoughtworks.com/twist-agile-test-automation

Running a Selenium test as a functional test in TestMaker. TestMaker displays the success/failure of each command in the test and the duration in milliseconds of each step.

The Future: Selenium 2.0 (AKA WebDriver)

The Selenium Project started the WebDriver project, to be delivered as Selenium 2.0. WebDriver is a new architecture that plays Selenium tests by driving the browser through its native interface. This solves the test playback stability issue in Selenium 1.0 but requires the Selenium project to maintain individual API drivers for all the supported browsers. While there is no release date for Selenium 2.0, the WebDriver code is already functional and available for download at http://code.google.com/p/webdriver.

Available Training

SkillsMatter.com, Think88.com, PushToTest.com, RTTSWeb.com, and Scott Bellware (http://blog.scottbellware.com) offer training courses for Selenium. PushToTest offers free Open Source Test Workshops (http://workshop.pushtotest.com) as a meet-up for Selenium and other Open Source Test tool users.

About The name Selenium

Selenium lore has it that the originators chose the name Selenium after learning that selenium is the antidote to mercury poisoning. There appears to be no love between the Selenium team and HP Mercury, but perhaps a bit of envy.

Apache Maven 2

By Matthew McCullough

34,910 Downloads · Refcard 55 of 151 (see them all)

The Essential Maven 2 Cheat Sheet

Maven is a comprehensive project information tool whose most common application is building Java code. It is receiving renewed recognition in the emerging development space for its convention over configuration approach to builds. This DZone Refcard showcases how Maven offers unparalleled software lifecycle management, and gives Java developers a wide range of execution commands, tips for debugging Mavenized builds, and a clear introduction to the "Maven vocabulary". This Refcard also covers the MVN command, dependencies, plugins, profiles and more. Download it today!

ABOUT APACHE MAVEN

Maven is a comprehensive project information tool, whose most common application is building Java code. Maven is often considered an alternative to Ant, but as you’ll see in this Refcard, it offers unparalleled software lifecycle management, providing a cohesive suite of verification, compilation, testing, packaging, reporting, and deployment plugins.

Maven is receiving renewed recognition in the emerging development space for its convention over configuration approach to builds. This Refcard aims to give JVM platform developers a range of basic to advanced execution commands, tips for debugging Mavenized builds, and a clear introduction to the “Maven vocabulary”.

Interoperability and Extensibility

New Maven users are pleasantly surprised to find that Maven offers easy-to-write custom build-supplementing plugins, reuses any desired aspect of Ant, and can compile native C, C++, and .NET code in addition to its strong support for Java and JVM languages and platforms, such as Scala, JRuby, Groovy and Grails.

Hot Tip

All things Maven can be found at http://maven.apache.org

THE MVN COMMAND

Maven supplies a Unix shell script and MSDOS batch file named mvn and mvn.bat respectively. This command is used to start all Maven builds. Optional parameters are supplied in a space-delimited fashion. An example of cleaning and packaging a project, then running it in a Jetty servlet container, yet skipping the unit tests, reads as follows:


mvn clean package jetty:run -Dmaven.test.skip

PROJECT OBJECT MODEL

The world of Maven revolves around metadata files named pom.xml. A file of this name exists at the root of every Maven project and defines the plugins, paths and settings that supplement the Maven defaults for your project.

Basic pom.xml Syntax

The smallest valid pom.xml, which inherits the default artifact type of “jar”, reads as follows:


<project>
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.ambientideas</groupId>
	<artifactId>barestbones</artifactId>
	<version>1.0-SNAPSHOT</version>
</project>

Super POM

The Super POM is a virtual pom.xml file that ships inside the core Maven JARs, and provides numerous default settings. All projects automatically inherit from the Super POM, much like the Object super class in Java. Its contents can be viewed in one of two ways:

View Super POM via SVN

Open the following SVN viewing URL in your web browser:


http://svn.apache.org/repos/asf/maven/components/branches/maven-2.1.x/pom.xml

View Super POM via effective-pom

Run the following command in a directory that contains the most minimal Maven project pom.xml, listed above.


mvn help:effective-pom

Multi-module Projects

Maven showcases exceptional support for componentization via its concept of multi-module builds. Place sub-projects in sub-folders beneath your top level project and reference each with a module tag. To build all sub projects, just execute your normal mvn command and goals from a prompt in the top-most directory.


<project>
  <!-- ... -->
  <packaging>pom</packaging>
  <modules>
    <module>servlets</module>
    <module>ejbs</module>
    <module>ear</module>
  </modules>
</project>

Artifact Vector

Each Maven project produces an element, such as a JAR, WAR or EAR, uniquely identified by a composite of fields known as groupId, artifactId, packaging, version and scope. This vector of fields uniquely distinguishes a Maven artifact from all others.

Many Maven reports and plugins print the details of a specific artifact in this colon separated fashion:


groupid:artifactid:packaging:version:scope

An example of this output for the core Spring JAR would be:


org.springframework:spring:jar:2.5.6:compile
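When scripting around Maven output, such a colon-separated string can be split back into its five fields. The sketch below assumes exactly the five-field form shown above; the Artifact tuple and parse_artifact function are names invented for the example.

```python
from collections import namedtuple

Artifact = namedtuple("Artifact", "group_id artifact_id packaging version scope")


def parse_artifact(text):
    """Split 'groupid:artifactid:packaging:version:scope' into named fields."""
    parts = text.split(":")
    if len(parts) != 5:
        raise ValueError("expected 5 colon-separated fields, got %d" % len(parts))
    return Artifact(*parts)


spring = parse_artifact("org.springframework:spring:jar:2.5.6:compile")
print(spring.artifact_id, spring.version, spring.scope)
```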

EXECUTION GROUPS

Maven divides execution into four nested hierarchies. From most-encompassing to most-specific, they are: Lifecycle, Phase, Plugin, and Goal.

Lifecycles, Phases, Plugins and Goals

Maven defines the concept of language-independent project build flows that model the steps that all software goes through during a compilation and deployment process.

Lifecycles

Lifecycles represent a well-recognized flow of steps (Phases) used in software assembly.

Each step in a lifecycle flow is called a phase. Zero or more plugin goals are bound to a phase.

A plugin is a logical grouping and distribution (often a single JAR) of related goals, such as JARing.

A goal, the most granular step in Maven, is a single executable task within a plugin. For example, discrete goals in the jar plugin include packaging the jar (jar:jar), signing the jar (jar:sign), and verifying the signature (jar:sign-verify).

Executing a Phase or Goal

At the command prompt, either a phase or a plugin goal can be requested. Multiple phases or goals can be specified and are separated by spaces.


If you ask Maven to run a specific plugin goal, then only that goal is run. This example runs two plugin goals: compilation of code, then JARing the result, skipping over any intermediate steps.


mvn compiler:compile jar:jar

Conversely, if you ask Maven to execute a phase, all phases and bound plugin goals up to that point in the lifecycle are also executed. This example requests the deploy lifecycle phase, which will also execute the verification, compilation, testing and packaging phases.


mvn deploy
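The "everything up to that point" rule can be sketched in a few lines. This is an illustration of the ordering only, not Maven's own code: the DefaultLifecycle class and its phasesUpTo method are hypothetical names, and the phase list is the default lifecycle as documented for Maven 2.1.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of phase ordering; not Maven's implementation.
final class DefaultLifecycle {
    // Phases of the default lifecycle, in execution order.
    static final List<String> PHASES = Arrays.asList(
        "validate", "initialize", "generate-sources", "process-sources",
        "generate-resources", "process-resources", "compile", "process-classes",
        "generate-test-sources", "process-test-sources", "generate-test-resources",
        "process-test-resources", "test-compile", "test", "prepare-package", "package",
        "pre-integration-test", "integration-test", "post-integration-test",
        "verify", "install", "deploy");

    // Returns the requested phase together with every phase that precedes it.
    static List<String> phasesUpTo(String requested) {
        int index = PHASES.indexOf(requested);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown phase: " + requested);
        }
        return PHASES.subList(0, index + 1);
    }

    public static void main(String[] args) {
        // Lists what "mvn compile" would walk through, in order.
        System.out.println(phasesUpTo("compile"));
    }
}
```

Requesting a phase late in the list, such as deploy, therefore walks through the whole lifecycle, while a single plugin goal walks through none of it.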

Online and Offline

During a build, Maven attempts to download any uncached referenced artifacts and proceeds to cache them in the ~/.m2/repository directory on Unix, or in the %USERPROFILE%/.m2/repository directory on Windows.

To prepare for compiling offline, you can instruct Maven to download all referenced artifacts from the Internet via the command:


mvn dependency:go-offline

If all required artifacts and plugins have been cached in your local repository, you can instruct Maven to run in offline mode with a simple flag:


mvn <phase or goal> -o

Built-in Maven Lifecycles

Maven ships with three lifecycles: clean, default, and site. Many of the phases within these three lifecycles are bound to a sensible plugin goal.

Hot Tip

The official lifecycle reference, which extensively lists all the default bindings, can be found at http://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html

The clean lifecycle is simplistic in nature. It deletes all generated and compiled artifacts in the output directory.

Clean Lifecycle
Lifecycle Phase Purpose
pre-clean
clean Remove all generated and compiled artifacts in preparation for a fresh build.
post-clean
Default Lifecycle
Lifecycle Phase Purpose
validate Cross check that all elements necessary for the build are correct and present.
initialize Set up and bootstrap the build process.
generate-sources Generate dynamic source code
process-sources Filter, sed and copy source code
generate-resources Generate dynamic resources
process-resources Filter, sed and copy resources files.
compile Compile the primary or mixed language source files.
process-classes Augment compiled classes, such as for code-coverage instrumentation.
generate-test-sources Generate dynamic unit test source code.
process-test-sources Filter, sed and copy unit test source code.
generate-test-resources Generate dynamic unit test resources.
process-test-resources Filter, sed and copy unit test resources.
test-compile Compile unit test source files
test Execute unit tests
prepare-package Manipulate generated artifacts immediately prior to packaging. (Maven 2.1 and above)
package Bundle the module or application into a distributable package (commonly, JAR, WAR, or EAR).
pre-integration-test
integration-test Execute tests that require connectivity to external resources or other components
post-integration-test
verify Inspect and cross-check the distribution package (JAR, WAR, EAR) for correctness.
install Place the package in the user’s local Maven repository.
deploy Upload the package to a remote Maven repository

The site lifecycle generates a project information web site, and can deploy the artifacts to a specified web server or local path.

Site Lifecycle
Lifecycle Phase Purpose
pre-site Execute processes needed prior to the actual project site generation.
site Generate an HTML web site containing project information and reports.
post-site
site-deploy Upload the generated website to a web server

Default Goal

The default goal codifies the author’s intended usage of the build script. Only one goal or lifecycle phase can be set as the default. The most common default goal is install.


<project>
   [...]
   <build>
      <defaultGoal>install</defaultGoal>
   </build>
   [...]
</project>

HELP

Help for a Plugin

Lists all the possible goals for a given plugin and any associated documentation.


mvn help:describe -Dplugin=<pluginname>

Help for POMs

To view the composite pom that’s a result of all inherited poms:


mvn help:effective-pom

Help for Profiles

To view all profiles that are active from either manual or automatic activation:


mvn help:active-profiles

DEPENDENCIES

Declaring a Dependency

To express your project’s reliance on a particular artifact, you declare a dependency in the project’s pom.xml.

Hot Tip

You can use the search engine at repository.sonatype.org to find dependencies by name and get the xml necessary to paste into your pom.xml

<project>
  <dependencies>
    <dependency>
      <groupId>com.yourcompany</groupId>
      <artifactId>yourlib</artifactId>
      <version>1.0</version>
      <type>jar</type>
      <scope>compile</scope>
    </dependency>
  </dependencies>
  <!-- ... -->
</project>

Standard Scopes

Each dependency can specify a scope, which controls its visibility and inclusion in the final packaged artifact, such as a WAR or EAR. Scoping enables you to minimize the JARs that ship with your product.

Scope Description
compile Needed for compilation, included in packages.
test Needed for unit tests, not included in packages.
provided Needed for compilation, but provided at runtime by the runtime container.
system Needed for compilation, given as absolute path on disk, and not included in packages.
import An inline inclusion of a POM-type artifact facilitating dependency-declaring POM snippets.

PLUGINS

Adding a Plugin

A plugin and its configuration are added via a small declaration, very similar to a dependency, in the <build> section of your pom.xml.


<build>
  <!-- ... -->
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <maxmem>512m</maxmem>
     </configuration>
    </plugin>
  </plugins>
</build>

Common Plugins

Maven created an acronym for its plugin classes that aggregates “Plain Old Java Object” and “Maven Java Object” into the resultant word, Mojo.

There are dozens of Maven plugins, but a handful constitute some of the most valuable, yet underused features:

surefire Runs unit tests.
checkstyle Checks the code’s styling
clover Code coverage evaluation.
enforcer Verify many types of environmental conditions as prerequisites.
assembly Creates ZIPs and other distribution packages of apps and their transitive dependency JARs.

Hot Tip

The full catalog of plugins can be found at: http://maven.apache.org/plugins/index.html

VISUALIZE DEPENDENCIES

Users often mention that the most challenging task is identifying dependencies: why they are being included, where they are coming from and if there are collisions. Maven has a suite of goals to assist with this.

List a hierarchy of dependencies.


mvn dependency:tree

List dependencies in alphabetic form.


mvn dependency:resolve

List plugin dependencies in alphabetic form.


mvn dependency:resolve-plugins

Analyze dependencies and list any that are unused, or undeclared.


mvn dependency:analyze

REPOSITORIES

Repositories are the web sites that host collections of Maven plugins and dependencies.

Declaring a Repository


<repositories>
  <repository>
    <id>JavaDotNetRepo</id>
    <url>https://maven-repository.dev.java.net</url>
  </repository>
</repositories>

The Maven community strongly recommends using a repository manager such as Nexus to define all repositories. This results in cleaner pom.xml files and centrally cached and managed connections to external artifact sources. Nexus can be downloaded from http://nexus.sonatype.org/

Popular Repositories

Central http://repo1.maven.org/maven2/
Java.net https://maven-repository.dev.java.net/
Codehaus http://repository.codehaus.org/
JBoss http://repository.jboss.org/maven2

Hot Tip

A near complete list of repositories can be found at http://www.mvnbrowser.com/repositories.html

PROPERTY VARIABLES

A wide range of predefined or custom property variables can be used anywhere in your pom.xml files to keep string and path repetition to a minimum.

All properties in Maven begin with ${ and end with }. To list all available properties, run the following command.


mvn help:expressions

Predefined Properties (Partial List)

${env.PATH} Any OS environment variable such as EDITOR, or GROOVY_HOME. Specifically, the PATH environment variable.
${project.groupId} Any project node from the aggregated Maven pom.xml. Specifically, the Group ID of the project
${project.artifactId} Name of the artifact.
${project.basedir} Directory that contains the pom.xml.
${settings.localRepository} The path to the user’s local repository.
${java.home} Any Java System Property. Specifically, the Java System Property path to its home.
${java.vendor} The Java System Property declaring the JRE vendor’s name.
${my.somevar} A user-defined variable.

Project properties could previously be referenced with a pom prefix, as in ${pom.basedir}, or with no prefix at all, as in ${basedir}. Maven now requires that you prefix these variables with the word project, as in ${project.basedir}.

Define a Property

You can define a new custom property in your pom.xml like so:


<project>
   [...]
   <properties>
      <my.somevar>My Value</my.somevar>
   </properties>
   [...]
</project>
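To show what referencing a property such as ${my.somevar} amounts to, the following sketch replaces ${...} placeholders in a string with values from a map. It is only an approximation of the idea: PropertyInterpolator is a hypothetical class, it is not how Maven resolves properties internally, and it simply leaves unknown placeholders untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of ${...} substitution; not Maven's implementation.
final class PropertyInterpolator {
    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String interpolate(String text, Map<String, String> properties) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Unknown properties are left as they were written.
            String value = properties.containsKey(name) ? properties.get(name) : matcher.group();
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<String, String>();
        properties.put("my.somevar", "My Value");
        System.out.println(interpolate("Configured with ${my.somevar}", properties));
    }
}
```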

DEBUGGING

Exception Full Stack Traces

If a Maven plugin is reporting an error, to see the full detail of the exception’s stack trace run Maven with the -e flag.


mvn <yourgoal> -e

Output Debugging Info

Whenever reporting a Maven bug, or troubleshooting a problem, turn on all the debugging info by running Maven like so:


mvn <yourgoal> -X

Debug Maven Core/Plugins

Core Maven operations and plugins can be stepped through with any JPDA-compatible debugger, the most common option being Eclipse. When run in debug mode, Maven will wait for you to connect your debugger to socket port 8000 before continuing with its lifecycle.


mvnDebug <yourgoal>
Preparing to Execute Maven in Debug Mode
Listening for transport dt_socket at address: 8000

Debug a Unit Test

Your suite or an individual unit test can be debugged in much the same fashion by telling the Surefire test-execution plugin to wait for you to attach a debugger to port 5005.


mvn test -Dmaven.surefire.debug
Listening for transport dt_socket at address: 5005

SOURCE CODE MANAGEMENT

Configuring SCM

Your project’s SCM connection can be quickly configured with just three XML tags, which adds significant capabilities to the scm, release, and reactor plugins.

The connection tag is your read-only view of your repository and developerConnection is the writable link. URL is your web-based view of the source.


<scm>
  <connection>scm:svn:http://myvendor.com/ourrepo/trunk</connection>
  <developerConnection>scm:svn:https://myvendor.com/ourrepo/trunk</developerConnection>
  <url>http://myvendor.com/viewsource.pl</url>
</scm>

Hot Tip

Over 12 SCM systems are supported by Maven. The full list can be viewed at http://docs.codehaus.org/display/SCM/SCM+Matrix

Using the SCM Plugin

The core SCM plugin offers two highly useful goals.

The diff command produces a standard Unix patch file with the extension .diff of the pending (uncommitted) changes on disk that can be emailed or attached to a bug report.


mvn scm:diff

The update-subprojects goal invokes a recursive scm-provider specific update (svn update, git pull) across all the submodules of a multimodule project.


mvn scm:update-subprojects

PROFILES

Profiles are a means to conditionally turn on portions of Maven configuration, including plugins, pathing and configuration.

The most common uses of profiles are for Windows/Unix platform-specific variations and build-time customization of JAR dependencies based on the use of a specific Weblogic, Websphere or JBoss J2EE vendor.


<project>
  [...]
  <profiles>
    <profile>
      <id>YourProfile</id>
      [...settings, build, plugins etc...]
      <dependencies>
        <dependency>
          <groupId>com.yourcompany</groupId>
          <artifactId>yourlib</artifactId>
        </dependency>
      </dependencies>
    </profile>
  </profiles>
  [...]
</project>

Profile Definition Locations

Profiles can be defined in pom.xml, profiles.xml (parallel to the pom.xml), ~/.m2/settings.xml, or $M2_HOME/conf/settings.xml.

Hot Tip

The full Maven Profile reference, including details about when to use each of the profile definition files, can be found at http://maven.apache.org/guides/introduction/introduction-to-profiles.html

PROFILE ACTIVATION

Profiles can be activated manually from the command line or through an activation rule (OS, file existence, Maven version, etc.). Profiles are primarily additive, so best practices suggest leaving most off by default, and activating based on specific conditions.

Manual Profile Activation


mvn <yourgoal> -P YourProfile

Automatic Profile Activation


<project>
  [...]
  <profiles>
    <profile>
      <id>YourProfile</id>
      [...settings, build, etc...]
      <activation>
        <os>
          <name>Windows XP</name>
          <family>Windows</family>
          <arch>x86</arch>
          <version>5.1.2600</version>
        </os>
        <file>
          <missing>somefolder/somefile.txt</missing>
        </file>
      </activation>
    </profile>
  </profiles>
  [...]
</project>

CUTTING A RELEASE

Maven offers excellent automation for cutting a release of your project. In short, this is a plugin-guided ceremony for verifying that all tests pass, tagging your source code repository, and altering the POMs to reflect a product version increment.

The prepare goal runs the unit tests, continuing only if all pass, then increments the value in the pom <version> tag to a release version, tags the source repository accordingly, and increments the pom version tag back to a SNAPSHOT version.


mvn release:prepare

After a release has been successfully prepared, run the perform goal. This goal checks out the prepared release and deploys it to the POM’s specified remote Maven repository for consumption by other teams and Maven builds.


mvn release:perform

ARCHETYPES

An archetype is a powerful template that uses your corporate Java package names and project name in the instantiated project and establishes a baseline of dependencies, with a bonus of basic sample code.

You can leverage public archetypes for quickly starting a project that uses a familiar stack, such as Struts+Spring, or Tapestry+Hibernate. You can also create private archetypes within your company to offer new projects a level of consistent dependencies matching your approved corporate technology stack.

Using an Archetype

The default behavior of the generate goal is to bring up a menu of choices. You are then prompted for various replaceables such as package name and artifactId. Type this command, then answer each question at the command line prompt.


mvn archetype:generate

Creating Archetypes

An archetype can be created from an existing project, using it as the pattern by which to build the template. Run the command from the root of your existing project.


mvn archetype:create-from-project

Archetype Catalogs

The Maven Archetype plugin comes bundled with a default catalog of applications it can create, but other projects on the Internet also publish catalogs. To use an alternate catalog:


mvn archetype:generate -DarchetypeCatalog=<catalog>

A list of the most commonly used catalogs is as follows:


local
remote
http://repo.fusesource.com/maven2
http://cocoon.apache.org
http://download.java.net/maven/2
http://myfaces.apache.org
http://tapestry.formos.com/maven-repository
http://scala-tools.org
http://www.terracotta.org/download/reflector/maven2/

REPORTS

Maven has a robust offering of reporting plugins, commonly run with the site generation phase, that evaluate and aggregate information about the project, contributors, its source, tests, code coverage, and more.

Adding a Report Plugin


<reporting>
  <plugins>
    <plugin>
      <artifactId>maven-javadoc-plugin</artifactId>
    </plugin>
  </plugins>
</reporting>

Hot Tip

A list of commonly used reporting plugins can be reviewed here http://maven.apache.org/plugins/

About The Author

Matthew McCullough

Matthew McCullough is an Open Source Architect with the Denver, Colorado consulting firm Ambient Ideas, LLC which he co-founded in 1997. He’s spent the last 13 years passionately aiming for ever-greater efficiencies in software development, all while exploring how to share these practices with his clients and their team members. Matthew is a nationally touring speaker on all things open source and has provided long term mentoring and architecture services to over 40 companies ranging from startups to Fortune 500 firms. Feedback and questions are always welcomed at matthewm@ambientideas.com

Recommended Book

Maven

Several sources for Maven have appeared online for some time, but nothing served as an introduction and comprehensive reference guide to this tool -- until now. Maven: The Definitive Guide is the ideal book to help you manage development projects for software, web applications, and enterprise applications. And it comes straight from the source.


Getting Started with Apache Wicket

By Andrew Lombardi

13,852 Downloads · Refcard 63 of 151 (see them all)


The Essential Apache Wicket Cheat Sheet

Apache Wicket is a Java-based web application framework. Among the dizzying number of Web frameworks available today, Wicket’s simple and intuitive approach to Web development has made it a favorite among many Java developers. This DZone Refcard brings you quickly up to speed on some of the key features of Apache Wicket 1.3, showing you how to configure the framework, define your domain model, create standard Wicket components and add internationalization options.

About Apache Wicket

Apache Wicket is a Java-based web application framework that has rapidly grown to be a favorite among many developers. It features a POJO data model, no XML, and a proper mark-up / logic separation not seen in most frameworks. Apache Wicket gives you a simple framework for creating powerful, reusable components and offers an object oriented methodology to web development while requiring only Java and HTML. This Refcard covers Apache Wicket 1.3 and describes common configuration, models, the standard components, implementation of a form, the markup and internationalization options available.

Project Layout

The project layout most typical of Apache Wicket applications is based on the default Maven directories. Any Wicket component that requires view markup in the form of HTML needs to be side-by-side with the Java file. Using Maven, however, we can separate the source directories into java/ and resources/ to give some distinction. To get started, either download the wicket-quickstart project and modify it to your needs, or use the Maven archetype here:


mvn archetype:create \
-DarchetypeGroupId=org.apache.wicket \
-DarchetypeArtifactId=wicket-archetype-quickstart \
-DarchetypeVersion=1.3.5 \
-DgroupId=com.mysticcoders.refcardmaker \
-DartifactId=refcardmaker

Either way, if using Maven, you’ll need the wicket jar, and the latest slf4j jar.


<dependency>
  <groupId>org.apache.wicket</groupId>
  <artifactId>wicket</artifactId>
  <version>1.3.6</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.4.2</version>
</dependency>

Configuring the web application

I mentioned that Wicket has no XML, and that’s mostly true, but J2EE requires a web.xml file to do anything. We set up the WicketFilter and point it to our implementation of WebApplication along with the URL mapping.


<web-app>
  <filter>
    <filter-name>wicketFilter</filter-name>
    <filter-class>org.apache.wicket.protocol.http.WicketFilter</filter-class>
    <init-param>
      <param-name>applicationClassName</param-name>
      <param-value>com.mysticcoders.refcardmaker.RefcardApplication</param-value>
    </init-param>
    <init-param>
      <param-name>filterPath</param-name>
      <param-value>/*</param-value>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>wicketFilter</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>
</web-app>

Apache Wicket offers a development and deployment mode that can be configured in the web.xml file:


<context-param>
	<param-name>configuration</param-name>
	<param-value>development</param-value>
</context-param>

Hot Tip

Depending on your configuration needs, you can set this parameter in any of the following ways:
  • as a context-param or an init-param to the filter in the web.xml
  • as a command line parameter, wicket.configuration
  • by overriding Application.getConfigurationType()

Models

Apache Wicket uses models to separate the domain layer from the view layer in your application and to bind them together. Components can retrieve data from their model, and convert and store data in the model upon receiving an event. There are a variety of implementations of Model, and they all extend from the interface IModel.

IModel

A class has to implement only two methods to be a Model: getObject and setObject. getObject returns the value from the model, and setObject sets the value of the model. Your particular implementation of IModel can get data from wherever you’d like; the Component in the end only requires the ability to get and set the value. Every component in Wicket has a Model: some use it, some don’t, but it’s always there.
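The get/set contract can be illustrated without Wicket on the classpath. In the sketch below, SimpleModel is a simplified stand-in for the idea behind IModel, not Wicket’s own interface, and both implementations (and the refcard.example.title key) are hypothetical. The point is that the caller sees the same two methods no matter where the value lives.

```java
// Simplified stand-in for the idea behind IModel; not Wicket's own interface.
interface SimpleModel {
    Object getObject();
    void setObject(Object object);
}

// Hypothetical implementation that keeps its value in memory.
final class InMemoryModel implements SimpleModel {
    private Object value;

    InMemoryModel(Object value) {
        this.value = value;
    }

    public Object getObject() {
        return value;
    }

    public void setObject(Object object) {
        this.value = object;
    }
}

// Hypothetical implementation backed by a JVM system property,
// showing that the data can come from somewhere else entirely.
final class SystemPropertyModel implements SimpleModel {
    private final String key;

    SystemPropertyModel(String key) {
        this.key = key;
    }

    public Object getObject() {
        return System.getProperty(key);
    }

    public void setObject(Object object) {
        System.setProperty(key, String.valueOf(object));
    }
}

final class SimpleModelDemo {
    public static void main(String[] args) {
        SimpleModel model = new SystemPropertyModel("refcard.example.title");
        model.setObject("Getting Started with Apache Wicket");
        System.out.println(model.getObject());
    }
}
```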

PropertyModel

A model contains a domain object, and it’s common practice to follow JavaBean conventions. The PropertyModel allows you to use a property expression to access a property in your domain object. For instance if you had a model containing Person and it had a getter/setter for firstName to access this property, you would pass the String “firstName” to that Model.
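As a rough picture of what a property expression such as "firstName" does, the sketch below turns the name into a getFirstName() call through reflection. It is a simplified illustration under stated assumptions: Person and SimplePropertyReader are hypothetical classes, only a single property name is handled (no nested expressions such as address.address1), and this is not how PropertyModel is implemented.

```java
import java.lang.reflect.Method;

// Hypothetical domain object that follows JavaBean conventions.
final class Person {
    private String firstName;

    Person(String firstName) {
        this.firstName = firstName;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }
}

// Hypothetical helper; resolves one property name to its JavaBean getter.
final class SimplePropertyReader {
    static Object read(Object target, String property) {
        // "firstName" becomes "getFirstName"
        String getter = "get" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            Method method = target.getClass().getMethod(getter);
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read property " + property, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(new Person("Ada"), "firstName"));
    }
}
```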

CompoundPropertyModel

An even fancier way of using models is the CompoundPropertyModel. Since most of the time, the property identifier you would give Wicket mimics that of the JavaBean property, this Model takes that implied association and makes it work for you.


...
setModel(new CompoundPropertyModel(person));
add(new Label("firstName"));
add(new Label("lastName"));
add(new Label("address.address1"));
...

We can see from the example above, that if we set the model with the person object using a CompoundPropertyModel, the corresponding components added to this parent Component will use the component identifiers as the property expression.

IDetachable

In order to keep domain objects around, you’re either going to need a lot of memory / disk space, or devise a method to minimize what gets serialized in session. The detachable design helps you do this with minimal effort. Simply store as little as needed to reconstruct the domain object, and within the detach method that your Model overrides, null out the rest.

LoadableDetachableModel

In order to make things easier, LoadableDetachableModel implements a very common use case for detachable models. It gives you the ability to override a few constructors and a load method, and provides automatic reattach and detach within Wicket’s lifecycle. Let’s look at an example:


import org.apache.wicket.model.LoadableDetachableModel;

public class LoadableRefcardModel extends LoadableDetachableModel {
    // Only the identifier is kept between requests.
    private Long id;

    public LoadableRefcardModel(Refcard refcard) {
        super(refcard);
        id = refcard.getId();
    }

    public LoadableRefcardModel(Long id) {
        super();
        this.id = id;
    }

    @Override
    protected Object load() {
        if (id == null) return new Refcard();
        RefcardDao dao = ...
        return dao.get(id);
    }
}

Here we have two constructors, each grabbing the identifier and storing it with the Model. We also override the load method so that we can either return a newly created Object, or use an access object to return the Object associated with the Model’s stored identifier. LoadableDetachableModel handles the process of attaching and detaching the Object properly giving us as little overhead as possible.
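The attach and detach cycle described above can be pictured with a small stand-alone class. This is a sketch of the pattern only, assuming a loader function supplies the object on demand; LazyDetachable and its loadCount field are hypothetical and are not Wicket’s LoadableDetachableModel.

```java
import java.util.function.Supplier;

// Simplified illustration of the load/detach cycle; not Wicket's own class.
final class LazyDetachable<T> {
    private final Supplier<T> loader;
    private T cached;          // dropped on detach, so it never needs to be serialized
    private boolean attached;
    int loadCount;             // exists only so the example can show when loading happens

    LazyDetachable(Supplier<T> loader) {
        this.loader = loader;
    }

    // Loads the object on first access and reuses it until detach() is called.
    T getObject() {
        if (!attached) {
            cached = loader.get();
            attached = true;
            loadCount++;
        }
        return cached;
    }

    // Forgets the loaded object; the next getObject() call loads it again.
    void detach() {
        cached = null;
        attached = false;
    }

    public static void main(String[] args) {
        LazyDetachable<String> model = new LazyDetachable<String>(() -> "refcard-42");
        model.getObject();
        model.getObject();
        model.detach();
        model.getObject();
        System.out.println("loads: " + model.loadCount);
    }
}
```

Within one request the object is loaded once however often it is read; after a detach it is loaded again on the next read.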

Components

In Wicket, Components display data and can react to events from the end user. In terms of the Model-View-Controller pattern, a Component is the View and the Controller. The following three distinct items make up every Wicket Component in some way:

  • Java Class implementation – defines the behavior and responsibilities
  • HTML Markup – references the Component by its identifier within the view markup to determine where it shows to the end user
  • The Model – provides data to the Component

Now that we have an idea about what makes up a Component, let’s look at a few of the building blocks that make up the majority of our Pages. Forms and their Components are so important they have their own section.

Label

When developing your application, if you’d like to show text on the frontend chances are pretty good that you’ll be using the Label Component. A Label contains a Model object which it will convert to a String for display on the frontend.


<span wicket:id="message">[message]</span>
...
add(new Label("message", "Hello, World!"));

The first portion is an HTML template, which gives a component identifier of “message” which must be matched in the Java code. The Java code passes the component identifier as the first parameter.

Link

Below is a list of the different types of links, bookmarkable and non-bookmarkable, and how they are used to navigate from page-to-page.

Name Description
Link

If linking to another Page, it is best to use a Link in most instances:
add(new Link("myLink") {
    public void onClick() {
        setResponsePage(MyNewPage.class);
    }
});

BookmarkablePageLink

A Bookmarkable page gives you a human readable URL that can be linked to directly from outside of the application. The default look of this URL can be overridden as we’ll see in the next section.
add(new BookmarkablePageLink("myLink", MyNewPage.class));

ExternalLink

If linking to an external website that is not within your application, here’s the Link component you’ll need and an example:
add(new ExternalLink("myLink", "http://www.mysticcoders.com", "Mystic"));

Repeaters

Due to the lack of any executable code inside of Wicket’s HTML templates, the method of showing lists may seem a little counterintuitive at first. One of the simplest methods for showing a list of data is the RepeatingView. Here’s an example of how to use it:


<ul>
    <li wicket:id="list"></li>
</ul>
...
RepeatingView list = new RepeatingView("list");
add(list);
for(int i = 1; i <= 10; i++) {
    list.add(new Label(list.newChildId(), "Item " + i));
}

This will simply print out a list from 1 to 10 into HTML. RepeatingView provides a method .newChildId() which should be used to ensure the Component identifier is unique. As your needs get more complex, this method quickly turns stale as there is a lot of setup that has to be done. Using a ListView is a great approach for managing possibly complex markup and business logic, and is more akin to other ways we’re asked to interact with Apache Wicket:


<ul>
   <li wicket:id="list"><span wicket:id="description">[description]</span></li>
</ul>
...
ListView list = new ListView("list", Arrays.asList("1", "2", "3",
      "4", "5", "6", "7", "8", "9", "10")) {
   @Override
   protected void populateItem(ListItem item) {
      String text = (String) item.getModelObject();
      item.add(new Label("description", text));
   }
};
add(list);

This method, while it looks more complex, allows us a lot more flexibility in building our lists to show to the user. The two list approaches described above each suffer from some drawbacks, one of which is that the entirety of the list must be held in memory. This doesn’t work well for large data sets, so if you need finer grain control on how much data is kept in memory, paging, etc., DataTable or DataView is something to look into.

Custom

The beauty of Wicket is that reuse is as simple as putting together a Panel of Components and adding it to any number of pages – this could be a login module, a cart, or whatever you think needs to be reused. For more great examples of reusable components check out the wicket-extensions (http://cwiki.apache.org/Wicket/wicket-extensions.html) and wicket-stuff (http://wicketstuff.org) projects.

Hot Tip

Since Wicket always needs a tag to bind to, even for a label, a <span> tag is sometimes easier to place into your markup; however, this can throw your CSS design off. .setRenderBodyOnly(true) can be used so the span never shows on the frontend, but be careful using this with any AJAX enabled components, since they require the tag to stick around.

Page and navigation

A Wicket Page is a component that allows you to group components together that make up your view. All Components will be related in a tree hierarchy to your page, and if the page is bookmarkable you can navigate directly to it. To create a new page, simply extend WebPage and start adding components.

Most webapps will share common areas that you don’t want to duplicate on every page -- this is where markup inheritance comes into play. Because every page is just a component, you can extend from a base page and inherit things like header, navigation, footer, whatever fits your requirements. Here’s an example:


public class BasePage extends WebPage {
   ... header, footer, navigation, etc ...
}
public class HomePage extends BasePage {
... everything else, the content of your pages...
}

Everything is done similarly to how you would do it in Java, without the need for messy configuration files. If we need to offer up pages that can be referenced and copied, we’re going to need to utilize bookmarkable pages. The default Wicket implementation of a BookmarkablePage is not exactly easy to memorize, so in your custom Application class you can define several mount points for your pages:


// when a user goes to /about they will get directly to this page
mountBookmarkablePage("/about", AboutPage.class);
// this mount makes page available at /blog/param0/param1/param2
// and fills PageParameters with 0-indexed numbers as the key
mount(new IndexedParamUrlCodingStrategy("/blog", BlogPage.class));
// this mount makes page available at /blog?paramKey=paramValue&paramKey2=paramValue2
mount(new QueryStringUrlCodingStrategy("/blog", BlogPage.class));
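To picture the indexed mount, the sketch below maps the path segments after a mount point to keys "0", "1", "2" and so on. It illustrates the mapping only: IndexedParams is a hypothetical class using a plain Map rather than Wicket’s PageParameters, and it does not reflect how IndexedParamUrlCodingStrategy is implemented (URL decoding, for instance, is left out).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of 0-indexed path parameters; not Wicket's code.
final class IndexedParams {
    // Maps "/blog/2009/06" mounted at "/blog" to {0=2009, 1=06}.
    static Map<String, String> parse(String mountPath, String requestPath) {
        if (!requestPath.startsWith(mountPath)) {
            throw new IllegalArgumentException(requestPath + " is not under " + mountPath);
        }
        Map<String, String> params = new LinkedHashMap<String, String>();
        String rest = requestPath.substring(mountPath.length());
        int index = 0;
        for (String segment : rest.split("/")) {
            // Skip the empty piece produced by the leading slash.
            if (!segment.isEmpty()) {
                params.put(String.valueOf(index++), segment);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(parse("/blog", "/blog/2009/06/wicket"));
    }
}
```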

In your code, you’ll need several ways of navigating to pages, including within Link implementations, in Form onSubmits, and for an innumerable number of reasons. Here are a few of the more useful:


// Redirect to MyPage.class
setResponsePage(MyPage.class);
// Useful to immediately interrupt request processing to perform a redirect
throw new RestartResponseException(MyPage.class);
// Redirect to an interim page such as a login page, keep the URL in memory
// so page can call continueToOriginalDestination()
redirectToInterceptPage(LoginPage.class);
// Useful to immediately interrupt request processing to perform a
// redirectToInterceptPage call
throw new RestartResponseAtInterceptPageException(MyPage.class);

Markup

Apache Wicket does require adding some attributes and tags to otherwise pristine X/HTML pages to achieve binding with Component code. The following table illustrates the attributes available to use in your X/HTML templates, the most important and often used being wicket:id.

Attribute Name Description
wicket:id Used on any X/HTML element you want to bind a component to
wicket:message Used on any tag we want to fill an attribute with a resource bundle value. To use, give the value in the form [attributename]:[resource name]

The following table lists out all of the most commonly used tags in X/HTML templates with Wicket.

Tag Name Description
wicket:panel

This tag is used in your template to define the area associated with
the component. Anything outside of this tag’s hierarchy will be
ignored. It is sometimes useful to wrap each of your templates with
html and body tags like so:
<html xmlns:wicket="http://wicket.apache.org">
<body>
<wicket:panel> ... </wicket:panel>
</body>
</html>
Wrapping the panel this way avoids errors showing in your IDE, and it
won’t affect the resulting HTML.

wicket:child Used in conjunction with markup inheritance. The subclassing Page will replace the tag with the output of its component
wicket:extend Defining a page that inherits from a parent Page requires a mirroring of the relationship in your X/HTML template. As with wicket:panel, everything outside of the tag’s hierarchy will be ignored, and the component’s result will end up in the wrapping template
wicket:link

Using this tag enables autolinking to another page without having
to add BookmarkablePageLinks to the component hierarchy, as this
is done automatically for you. To link to the homepage from one of
its subpages:
<wicket:link><a href="Homepage.html">Homepage</a></wicket:link>

wicket:head Adding this to the root-level hierarchy of the template will give you access to inject code into the X/HTML <head></head> section.
wicket:message This tag will look for the given key in the resource bundle component hierarchy and replace the tag with the String retrieved from that bundle property. To pull the resource property page.label: <wicket:message key="page.label">[page label]</wicket:message>
wicket:remove The entire contents of this tag will be removed upon running this code in the container. Its use is to ensure that the template can show design intentions such as repeated content without interfering with the resulting markup.
wicket:fragment A fragment is an inline Panel. Using a Panel requires a separate markup file, and with a fragment this block can be contained within the parent component.
wicket:enclosure

A convenience tag added in 1.3 that defines a block of markup surrounding
your component and derives its entire visibility from the enclosed
component. This is useful when showing multiple fields, some of which may
be empty or null, where you don’t want to add WebMarkupContainers to every
field just to mimic this behavior. For example, if we were printing out
phone and fax:
<wicket:enclosure>
  <tr><td class="label">Fax:</td><td><span wicket:id="fax">[fax number]</span></td></tr>
</wicket:enclosure>
...
add(new Label("fax") {
   public boolean isVisible() {
      return getModelObjectAsString() != null;
   }
});

wicket:container

This tag is useful when you don’t want to render any tags into the
markup because it may cause invalid markup. Consider the following:
<table>
  <wicket:container wicket:id="repeater">
    <tr><td>1</td></tr>
    <tr><td>2</td></tr>
  </wicket:container>
</table>
In this instance, if we were to add any other element in between the table and
tr tags, the markup would be invalid. wicket:container fixes that.

Form

A Form in Wicket is a component that takes user input and processes it upon submission. This component is a logical holder of one or more input fields that get processed together. The Form component, like all others, must be bound to an HTML equivalent, in this case the <form> tag.


<form wicket:id="form">
   Name: <input type="text" wicket:id="name" />
   <input type="submit" value="Send" />
</form>
...
Form form = new Form("form") {
   @Override
   protected void onSubmit() {
      System.out.println("form submit");
   }
};
add(form);
form.add(new TextField("name", new Model("")));

Form input controls can each have their own Models attached to them, or can inherit from their parent, the Form. This is usually a good place to use CompoundPropertyModel as it gets rid of a lot of duplicate code. As you can see, each input component should be added to the Form element.

Wicket uses a POST to submit your form, which can be changed by overriding the Form’s getMethod and returning Form.METHOD_GET. Wicket also redirects to a buffered response after a form post, which hides implementation details and gets around the browser’s form repost popup. The following behavior settings can be changed:

Name Setting Description
No redirect IRequestCycleSettings.ONE_PASS_RENDER Renders the response directly
Redirect to buffer IRequestCycleSettings.REDIRECT_TO_BUFFER Renders the response directly to a buffer, redirects the browser and prevents reposting the form
Redirect to render IRequestCycleSettings.REDIRECT_TO_RENDER Redirects the browser directly; renders in a separate request

Components of a Form

The following table lists all the different form components available, and how to use them with Models.

Name Example
TextField

<input type="text" wicket:id="firstName" />
...
add(new TextField("firstName", new PropertyModel(person, "firstName")));

TextArea

<textarea wicket:id="comment"></textarea>
...
add(new TextArea("comment", new PropertyModel(feedback, "comment")));

Button

<form wicket:id="form">
   <input type="submit" value="Submit" wicket:id="submit" />
</form>
...
Form form = new Form("form") {
   @Override
   protected void onSubmit() {
      System.out.println("onSubmit called");
   }
};
add(form);
form.add(new Button("submit"));

CheckBoxMultipleChoice

<span wicket:id="operatingSystems">
   <input type="checkbox" /> Windows
   <input type="checkbox" /> OS/2 Warp
</span>
...
add(new CheckBoxMultipleChoice("operatingSystems",
   new PropertyModel(system, "operatingSystems"),
   Arrays.asList("Windows", "OS X", "Linux", "Solaris", "HP/UX", "DOS")));

DropDownChoice

<select wicket:id="states">
   <option>[state]</option>
</select>
...
add(new DropDownChoice("states", new PropertyModel(address, "state"), listOfStates));

PasswordTextField

<input type="password" wicket:id="password" />
...
add(new PasswordTextField("password", new PropertyModel(user, "password")));

RadioChoice

<span wicket:id="gender">
   <input type="radio" /> Male<br />
   <input type="radio" /> Female<br />
</span>
...
add(new RadioChoice("gender", new PropertyModel(person, "gender"),
   Arrays.asList("Male", "Female")));

SubmitLink

<form wicket:id="form">
   <a href="#" wicket:id="submitLink">Submit</a>
</form>
...
form.add(new SubmitLink("submitLink") {
   @Override
   public void onSubmit() {
      System.out.println("submitLink called");
   }
});

Validation

When dealing with user input, we need to validate it against what we’re expecting, and guide the user in the right direction if they stray. Any user input is processed through this flow:

  • Check that the required input is supplied
  • Convert input values from String to the expected type
  • Validate input using registered validators
  • Push converted and validated input to models
  • Call onSubmit or onError depending on the result
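
The flow above can be sketched in plain Java. This is only an illustration of the ordering of the steps, not Wicket's implementation: ValidationFlowSketch, its Function-based converter, its Predicate-based validators and the returned "onSubmit"/"onError" strings are all hypothetical stand-ins and use no Wicket classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for a single Integer-typed form field; not a Wicket class.
class ValidationFlowSketch {
    private final boolean required;
    private final Function<String, Integer> converter;   // String -> expected type
    private final List<Predicate<Integer>> validators = new ArrayList<>();
    private Integer model;                                // stands in for the field's Model

    ValidationFlowSketch(boolean required, Function<String, Integer> converter) {
        this.required = required;
        this.converter = converter;
    }

    // Registers a validator; returns this so several validators can be chained.
    ValidationFlowSketch add(Predicate<Integer> validator) {
        validators.add(validator);
        return this;
    }

    Integer getModel() {
        return model;
    }

    // Returns "onSubmit" or "onError" to show which callback the flow would reach.
    String process(String rawInput) {
        // 1. Check that the required input is supplied
        if (rawInput == null || rawInput.trim().isEmpty()) {
            return required ? "onError" : "onSubmit";
        }
        // 2. Convert the input value from String to the expected type
        Integer converted;
        try {
            converted = converter.apply(rawInput.trim());
        } catch (RuntimeException e) {
            return "onError";
        }
        // 3. Validate the converted value using every registered validator
        for (Predicate<Integer> validator : validators) {
            if (!validator.test(converted)) {
                return "onError";
            }
        }
        // 4. Push the converted and validated input to the model
        model = converted;
        // 5. Everything passed, so the submit callback is reached
        return "onSubmit";
    }
}
```

Note that the model is only touched in step 4, so a failed conversion or validation leaves the previous model value in place.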

Wicket provides the following set of validators:

Resource Key Example
Required textField.setRequired(true)
RangeValidator.range numField.add(RangeValidator.range(0,10))
MinimumValidator.minimum numField.add(MinimumValidator.minimum(0))
MaximumValidator.maximum numField.add(MaximumValidator.maximum(0))
StringValidator.exact textField.add(StringValidator.exact(8))
StringValidator.range textField.add(StringValidator.range(6,18))
StringValidator.maximum textField.add(StringValidator.maximum(8))
StringValidator.minimum textField.add(StringValidator.minimum(2))
DateValidator.range dateField.add(DateValidator.range(startDate, endDate))
DateValidator.minimum dateField.add(DateValidator.minimum(minDate))
DateValidator.maximum dateField.add(DateValidator.maximum(maxDate))
CreditCardValidator ccField.add(new CreditCardValidator())
PatternValidator textField.add(new PatternValidator("\\d+"))
EmailAddressValidator emailField.add(EmailAddressValidator.getInstance())
UrlValidator urlField.add(new UrlValidator())
EqualInputValidator add(new EqualInputValidator(formComp1,formComp2))
EqualPasswordInputValidator add(new EqualPasswordInputValidator(passFld1, passFld2))

More than one validator can be added to a component if needed. For instance, if you have a password that needs to be within the range of 6 – 20 characters, must be alphanumeric and is required, simply chain the needed validators above to your component. If the validators listed above don’t fit your needs, Wicket lets you create your own and apply them to your components.


public class PostalCodeValidator extends AbstractValidator {
	// postalCodeService is assumed to be supplied by your application
	public PostalCodeValidator() {
	}
	@Override
	protected void onValidate(IValidatable validatable) {
	  String value = (String)validatable.getValue();
	  if(!postalCodeService.isValid(value)) {
	    error(validatable);
	  }
	}
	@Override
	protected String resourceKey() {
	  return "PostalCodeValidator";
	}
	@Override
	protected Map variablesMap(IValidatable validatable) {
	  Map map = super.variablesMap(validatable);
	  // expose the rejected value to the error message as ${postalCode}
	  map.put("postalCode", validatable.getValue());
	  return map;
	}
}

When Wicket has completed processing all input it will either pass control to the Form’s onSubmit, or the Form’s onError. If you don’t choose to override onError, you’ll need a way to customize the error messages that show up.

Feedback Messages

Apache Wicket offers a facility to send back messages for failed validations or flash messages to provide notification of status after submitting a form or performing some action. Wicket’s validators come with a default set of feedback messages in a variety of languages, which you can override in your own properties files. Here’s the order Wicket uses to grab messages out of resource bundles:

Location Order Description Example
Page class 1 Messages specific to a page Index.properties, Index_es.properties
Component class 2 Messages specific to a component AddressPanel_es.properties, CheckOutForm.properties
Custom Application class 3 Default application-wide message bundle RefcardApplication_es_MX.properties, RefcardApplication_es.properties, RefcardApplication.properties

During a Form submission, if you’d like to pass back messages to the end user, Wicket has a message queue that you can access with any component:


info("Info message");
warn("Warn message");
error("Error message");

With that added to the queue, the most basic method of showing these to users is to use the FeedbackPanel component which you can add to your Page as follows:


<div wicket:id="feedback"></div>
...
add(new FeedbackPanel("feedback"));

If you’d like to read the queued messages back out yourself, the Session will give you an Iterator to cycle through on the subsequent page:


getSession().getFeedbackMessages().iterator();

Internationalization

Earlier sections touched on the order of precedence of resource bundles, from the Page down to Wicket’s default application bundle. Apache Wicket uses the same resource bundles that are standard in the Java platform, including the naming convention and the choice of properties file or XML file.
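
To make the naming convention concrete, here is a small sketch that asks the Java platform's own ResourceBundle.Control which bundle names it would try for a Mexican Spanish locale. It illustrates only the standard Java convention; Wicket's own resource lookup is not shown, and the base name RefcardApplication is simply borrowed from the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class BundleNameSketch {
    // Lists the bundle names the Java platform would try, most specific first.
    static List<String> candidateBundleNames(String baseName, Locale locale) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_PROPERTIES);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            // toBundleName appends the locale suffix, e.g. "_es_MX"
            names.add(control.toBundleName(baseName, candidate));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(
                candidateBundleNames("RefcardApplication", Locale.forLanguageTag("es-MX")));
    }
}
```

Each returned name corresponds to a properties file with the same name plus the .properties extension.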

Using ResourceBundles, you can pull out messages in your markup using <wicket:message>, or use a ResourceModel with the component to pull out the localized text.

Another available option is to directly localize the filename of the markup files, e.g. HomePage_es_MX.html and HomePage.html. Your default locale will be used for HomePage.html, and if you were from Mexico, Wicket would dutifully grab HomePage_es_MX.html.

Hot Tip

Wicket’s Label component overrides the getModelObjectAsString of Component to offer you Locale-aware String output to the client, so you don’t have to create your own custom converter.
Resources
Wicket 1.3 Homepage http://wicket.apache.org/
Component Reference http://wicketstuff.org/wicket13/compref/
Wicket Wiki http://cwiki.apache.org/WICKET/
Wicket by Example http://wicketbyexample.com/

About The Author

Andrew Lombardi

is one of a new breed of businessmen: the enlightened entrepreneur. He has been writing code since he was a 5-year old, sitting at his dad’s knee at their Apple II computer. Having such a deep affinity for the computer model, it is no surprise that at the age of 17 he began to delve deeply into the inner workings of the human mind. He became a student of Neuro Linguistic Programming and other mind technologies, and then went on to study metaphysics. He is certified as an NLP Trainer, Master Hypnotherapist and Time Line Therapy practitioner.

Using all of his accumulated skills, at the age of 24, Andrew began his consulting business, Mystic Coders, LLC. Since the inception of Mystic in 2000, Andrew has been building the business and studying finance and economics as he stays on the cutting edge of computer technology.

Recommended Book

Wicket in Action

Wicket in Action is an authoritative, comprehensive guide for Java developers building Wicket-based Web applications. This book starts with an introduction to Wicket’s structure and components, and moves quickly into examples of Wicket at work. Written by two of the project’s earliest and most authoritative experts, this book shows you both the “how-to” and the “why” of Wicket. As you move through the book, you’ll learn to use and customize Wicket components, how to interact with other technologies like Spring and Hibernate, and how to build rich, Ajax-driven features into your applications.



Understanding Lucene

Powering Better Search Results

By Erik Hatcher


The Essential Apache Lucene Cheat Sheet

Apache Lucene is a cross-platform, high-performance, full-text search engine library written in Java. Today, there are also .NET and Python ports available. When used in conjunction with Apache Solr, Lucene becomes a world-class search platform. Solr includes a number of other features like faceting and a rich function query/sort capability. This Refcard will give you a foundational knowledge of Lucene's features from the inverted index structure on up. This includes documents, indexes, fields, analysis, searching and more. There will also be plenty of usage examples to look at with Solr as the front-end.

WHAT IS LUCENE?

The Lucene Ecosystem

“Lucene” is a broadly used term. It’s the original Java indexing and search library created by Doug Cutting. Lucene was then chosen as a top-level Apache Software Foundation project name — http://lucene.apache.org. The name is also used for various ports of the Java library to other languages (Lucene.Net, PyLucene, etc). The following table shows the key projects at http://lucene.apache.org.

Project Description
Lucene - Java Java-based indexing and search library. Also comes with extras such as highlighting, spellchecking, etc.
Solr High-performance enterprise search server. HTTP interface. Built upon Lucene Java. Adds faceting, replication, sharding, and more.
Droids Intelligent robot crawling framework.
Open Relevance Aims to collect and distribute free materials for relevance testing and performance.
PyLucene Python port of the Lucene Java project.

There are many projects and products that use, expose, port, or in some way wrap various pieces of the Apache Lucene ecosystem.

WHICH LUCENE DISTRIBUTION?

There are many ways to obtain and leverage Lucene technology. How you choose to go about it will depend on your specific needs and integration points, your technical expertise and resources, and budget/time constraints.

When Lucene in Action was published in 2004, before the advent of many of the projects mentioned above, we just had Lucene Java and some other open-source building blocks. It served its purpose and did so extremely well. Lucene has only gotten better since then: faster, more efficient, newer features, and more. If you’ve got Java skills you can easily grab lucene.jar and go for it.

However, some better and easier ways to build Lucene-based search applications are now available. Apache Solr, specifically, is a top notch server architecture, built from the ground up with Lucene. Solr factors in Lucene best practices and simplifies many aspects of indexing content and integrating search into your application as well as addressing scalability needs that exceed the capacity of single machines.

This Refcard is about the concepts of Lucene more than the specifics of the Lucene API. We’ll be shining the light on Lucene internals and concepts with Solr. Solr provides some very direct ways to interact with Lucene.

We recommend you start with one of the following distributions:

  • LucidWorks for Solr – certified distributions of the official Apache Solr distributions, including any critical bug fixes and key performance enhancements.
  • Apache Solr – a great starting point for developers; grab a distro, write a script, integrate into UI.

Hot Tip

If you’re getting started on building a search application, your quickest, easiest bet is to use LucidWorks Enterprise. LucidWorks Enterprise is Lucene and Solr, plus more. Easy to install, easy to configure and monitor. LucidWorks Enterprise is free for development, with support subscriptions available for production deployments.

Lucid Imagination offers professional services, training, and the new LucidWorks Enterprise platform. Visit http://www.lucidimagination.com.

Definitions/Glossary

There are many common terms used when elaborating on Lucene’s design and usage.

Term Definition/context/usage
Document Returnable search result item. A document typically represents a crawled web page, a file system file, or a row from a database query.
Field Property, metadata item, or attribute of a document. Documents typically have a unique key field, often called “id”. Other common fields are “title”, “body”, “last_modified_date”, and “categories”.
Term Searchable text, extracted from each indexed field by analysis (a process of tokenization and filtering).
tf/idf Term frequency / inverse document frequency. This is a commonly used scoring factor, relating the term frequency (how many times the query term occurs in a document) to the inverse document frequency (how many documents in the entire collection contain that query term, inverted).

Lucene Java and Core Lucene Concepts Explained

The design of Lucene is, at a high level, quite straightforward. Documents are “indexed”.

Documents are a representation of whatever types of “objects” and granularities your application needs to work with on the search/discovery side of the equation. In other words, when thinking Lucene, it is important to consider the use cases / demands of the encompassing application in order to effectively tune the indexing process with the end goal in mind.

Lucene provides APIs to open, read, write, and search an index. Documents contain “fields”. Fields are the useful individually named attributes of a document used by your search application. For example, when indexing traditional files such as Word, HTML, and PDF documents, commonly used fields are “title”, “body”, “keywords”, “author”, and “last_modified_date”.

DOCUMENTS

Documents, to Lucene, are the findable items. Here’s where domain-specific abstractions really matter. A Lucene Document can represent a file on a file system, a row in a database, a news article, a book, a poem, an historical artifact (see collections.si.edu), and so on. Documents contain “fields”. Fields represent attributes of the containing document, such as title, author, keywords, filename, file_type, lastModified, and fileSize.

Fields have a name and one or more values. A field name, to Lucene, is arbitrary, whatever you want.

When indexing documents, the developer has the choice of what fields to add to the Document instance, their names, and how they are each handled. Field values can be stored and/or indexed. A large part of the magic of Lucene is in how field values are analyzed and how a field’s terms are represented and structured.

Figure: “document” example (filename.doc)

Hot Tip

There are additional bits of metadata that can be indexed along with the terms text. Terms can optionally carry along their positions (relative position of term to previous term within the field), offsets (character offsets of the term in the original field), and payloads (arbitrary bytes associated with a term which can influence matching and scoring). Additionally, fields can store term vectors (an intra-field term/frequency data structure).

The heart of Lucene’s search capabilities is in the elegance of the index structure, a form of an “inverted index”. An inverted index is a data structure mapping “terms” to the documents that contain them. Indexed fields can be “analyzed”, a process of tokenizing and filtering text into individual searchable terms. Often these terms from the analysis process are simply the individual words from the text. The analysis process of general text typically also includes normalization processes (lowercasing, stemming, other cleansing). Tuning analysis at index time can, in many interesting and sophisticated ways, facilitate typical search application needs for sorting, faceting, spell checking, autosuggest, highlighting, and more.
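
As a rough illustration of the idea (not of Lucene's actual index format or API), the following plain-Java sketch builds a tiny in-memory inverted index. The class name InvertedIndexSketch and its toy analyze step, which only splits on non-alphanumeric characters and lowercases, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Toy "analysis": tokenize on non-letter/digit characters, then lowercase.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Indexing: record the document id under every term produced by analysis.
    void addDocument(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Searching a single term is one map lookup, regardless of collection size.
    SortedSet<Integer> search(String queryTerm) {
        List<String> terms = analyze(queryTerm);
        if (terms.isEmpty()) {
            return Collections.emptySortedSet();
        }
        return postings.getOrDefault(terms.get(0), Collections.emptySortedSet());
    }
}
```

Because the query goes through the same analysis as the indexed text, a search for "Mountain" and one for "mountain" find the same documents.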

Figure: Inverted Index

Again we need to look back at the search application needs. Almost every search application ends up with a human user interface with the infamous and ubiquitous “search box”.

The trick is going from a human entered “query” to returning matching documents blazingly fast. This is where the inverted index structure comes into play. For example, a user searching for “mountain” can be readily accommodated by looking up the term in the inverted index and matching associated documents.

Not only are documents matched to a query, but they are also scored. For a given search request, a subset of the matching documents are returned to the user. We can easily provide sorting options for the results, though presenting results in “relevancy” order is more often the desired sort criteria. Relevancy refers to a numeric “score” based on the relationship between the query and the matching document. (“Show me the documents best matching my query first, please”).

The following formula (straight from Lucene’s Similarity class javadoc) illustrates the basic factors used to score a document.

score(q,d) = coord(q,d) · queryNorm(q) · Σ (t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]
Lucene practical scoring formula

Each of the factors in this equation is explained further in the following table:

Factor Explanation
score(q,d) The final computed value of numerous factors and weights, numerically representing the relationship between the query and a given document.
coord(q,d) A search-time score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query’s terms will receive a higher score than another document with fewer query terms.
queryNorm(q) A normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.
tf(t in d) Correlates to the term’s frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. Note that tf(t in q) is assumed to be 1 and, therefore, does not appear in this equation. However, if a query contains the same term twice, there will be two term-queries with that same term. Hence, the computation would still be correct (although not very efficient).
idf(t) Stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give higher contribution to the total score. idf(t) appears for t in both the query and the document, hence it is squared in the equation.
t.getBoost() A search-time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to setBoost().
norm(t,d) Encapsulates a few (indexing time) boost and length factors.

Understanding how these factors work can help you control exactly how to get the most effective search results from your search application. It's worth noting that in many applications these days, there are numerous other factors involved in scoring a document. Consider boosting documents by recency (latest news articles bubble up), popularity/ratings (or even like/dislike factors), inbound link count, user search/click activity feedback, profit margin, geographic distance, editorial decisions, or many other factors. But let's not get carried away just yet, and focus on Lucene's basic tf/idf.
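
To see how the factors interact, here is a simplified scoring sketch in plain Java. It is not Lucene code: the tf, idf and length-norm formulas are modelled on the defaults commonly documented for Lucene's DefaultSimilarity (check the javadoc for your version), while queryNorm and all boosts are assumed to be 1 and left out. ScoringSketch and its method names are hypothetical.

```java
import java.util.List;

class ScoringSketch {
    // tf(t in d): more occurrences score higher; the square root dampens the effect
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf(t): terms found in fewer documents contribute more
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // norm(t,d) reduced to a length norm: shorter fields score higher (boosts assumed 1)
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    static int count(List<String> terms, String term) {
        int n = 0;
        for (String t : terms) {
            if (t.equals(term)) n++;
        }
        return n;
    }

    // docs is the whole collection, each document already analyzed into terms.
    static double score(List<String> query, List<String> doc, List<List<String>> docs) {
        double sum = 0.0;
        int overlap = 0;                       // how many query terms the document contains
        for (String term : query) {
            int freq = count(doc, term);
            if (freq == 0) continue;
            overlap++;
            int docFreq = 0;
            for (List<String> d : docs) {
                if (d.contains(term)) docFreq++;
            }
            double idfValue = idf(docFreq, docs.size());
            // idf is squared, as in the formula above
            sum += tf(freq) * idfValue * idfValue * lengthNorm(doc.size());
        }
        // coord(q,d): fraction of the query terms found in the document
        double coord = query.isEmpty() ? 0.0 : (double) overlap / query.size();
        return coord * sum;
    }
}
```

With a three-document collection, a document containing both query terms outranks one containing only the more common term, and a document containing neither scores 0.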

So now we've briefly covered the gory details of how Lucene works for matching and scoring documents during a search. There's one missing bit of magic: going from the human input of a search box and translating that into a representative data structure, the Lucene Query object. This string-to-Query process is called "query parsing". Lucene itself includes a basic QueryParser that can parse sophisticated expressions including AND, OR, +/-, parenthetical grouped expressions, range, fuzzy, wildcarded, and phrase query clauses. For example, the following expression will match documents with a title field with the terms "Understanding" and "Lucene" collocated successively (provided positional information was enabled!) where the mimeType (MIME type is the document type) value is "application/pdf":


title:"Understanding Lucene" AND mimeType:application/pdf
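
Why a phrase clause depends on positional information can be shown with a small plain-Java sketch. It is not Lucene's PhraseQuery; PhraseMatchSketch, its whitespace-only tokenization and its method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PhraseMatchSketch {
    // term -> positions at which the term occurs within one field value
    static Map<String, List<Integer>> positions(String fieldValue) {
        Map<String, List<Integer>> result = new HashMap<>();
        int position = 0;
        for (String token : fieldValue.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) continue;
            result.computeIfAbsent(token, t -> new ArrayList<>()).add(position++);
        }
        return result;
    }

    // True when the phrase terms occur at consecutive positions in the field.
    static boolean matchesPhrase(String fieldValue, String... phrase) {
        if (phrase.length == 0) return false;
        Map<String, List<Integer>> index = positions(fieldValue);
        List<Integer> starts = index.get(phrase[0].toLowerCase(Locale.ROOT));
        if (starts == null) return false;
        for (int start : starts) {
            boolean all = true;
            for (int i = 1; i < phrase.length && all; i++) {
                List<Integer> p = index.get(phrase[i].toLowerCase(Locale.ROOT));
                // the i-th phrase term must sit exactly i positions after the first
                all = p != null && p.contains(start + i);
            }
            if (all) return true;
        }
        return false;
    }
}
```

Without the stored positions, the index could only say that both terms occur somewhere in the field, not that they are adjacent and in order.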

For more information on Lucene QueryParser syntax, see http://lucene.apache.org/java/3_0_3/queryparsersyntax.html (or the docs for the version of Lucene you are using).

It is important to note that query parsing and allowable user syntax is often an area of customization consideration. Lucene’s API richly exposes many Query subclasses, making it very straightforward to construct sophisticated Query objects using building blocks such as TermQuery, BooleanQuery, PhraseQuery, WildcardQuery, and so on.

Shining the Light on Lucene: Solr

Apache Solr embeds Java Lucene, exposing its capabilities through an easy-to-use HTTP interface. Solr has Lucene best practices built in, and provides distributed and replicated search for large scale power.

For the examples that follow, we’ll be using Solr as the front-end to Lucene. This allows us to demonstrate the capabilities with simple HTTP commands and scripts, rather than coding in Java directly. Additionally, Solr adds easy-to-use faceting, clustering, spell checking, autosuggest, rich document indexing, and much more. We’ll introduce some of Solr’s value-added pieces along the way.

Lucene has a lot of flexibility, likely much more than you will need or use. Solr layers some general common-sense best practices on top of Lucene with a schema. A Solr schema is conceptually the same as a relational database schema. It is a way to map fields/columns to data types, constraints, and representations. Let’s preview the fields defined in the Solr schema (conf/schema.xml) for our running example:


<fields>
	<field name="id"
		type="string" indexed="true" stored="true" />
	<field name="title"
		type="text_en" indexed="true" stored="true" />
	<field name="mimeType"
		type="string" indexed="true" stored="true" />
	<field name="lastModified"
		type="date" indexed="true" stored="true" />
</fields>

The schema constrains all fields of a particular name (there is dynamic wildcard matching capability too) to a “field type”. A field type controls how the Lucene Field instances are constructed during indexing, in a consistent manner. We saw above that Lucene fields have a number of additional attributes and controls, including whether the field value is stored and/or indexed and, if indexed, how: which analysis chain is used, and whether positions, offsets, and/or term vectors are stored.

Our Running Example, Quick Proof-of-Concepts

The (Solr) documents we index will have a unique “id” field, a “title” field, a “mimeType” field to represent the file type for filtering/faceting purposes, and a “lastModified” date field to represent a file’s last modified timestamp. Here’s an example document (in Solr XML format, suitable for direct POSTing):


<add>
  <doc>
	<field name="id">doc01</field>
	<field name="title">Our first document</field>
	<field name="mimeType">application/pdf</field>
	<field name="lastModified">NOW</field>
  </doc>
</add>

That example shows indexing the metadata regarding an actual file. Ultimately, we also want the contents of the file to be searchable. Solr natively supports extracting and indexing content from rich documents. And LucidWorks Enterprise has built-in file and web crawling and scheduling along with content extraction.

Launching Solr, using its example configuration, is as straightforward as this, from a Solr installation directory:


cd example
java -jar start.jar

And from another command-shell, documents can be easily indexed. Our example document shown previously (saved as docs.xml for us) can be indexed like this:


cd example/exampledocs
java -jar post.jar docs.xml

First of all, this isn’t going to work out of the box, as we have a custom schema and application needs not supported by Solr’s example configuration. Get used to it, it’s the real world! The example schema is there as an example, and likely inappropriate for your application as-is. Borrow what makes sense for your own application’s needs, but don’t leave cruft behind.

At this point, we have a fully functional search engine, with a single document, and will use this for all further examples. Solr will be running at http://localhost:8983/solr.

INDEXING

The process of adding documents to Lucene or Solr is called indexing. With Lucene Java, you create a new Document instance and call the addDocument method of an IndexWriter. This is straightforward and simple enough, leaving the burden on you to come up with the textual strings that'll comprise the document.

Contrast with Solr, which provides numerous ways out of the box to index. We've seen an example of Solr XML, one basic way to bring in documents. Here are detailed examples of various ways to index content into Solr. Solr’s schema centralizes the decisions made about how fields are indexed, freeing the indexer from any internal knowledge about how fields should be handled.


Solr XML/JSON

Solr’s basic XML format can be a convenient way to map your application’s “documents” into Solr. A simple HTTP POST to /update is all it takes.

Posting XML to Solr can be done using the post.jar tool that comes with Solr’s example data, curl (see Solr’s post.sh), or any other HTTP library or tool capable of POST. In fact, most of the popular Solr client API libraries out there simply wrap an HTTP library with some convenience methods for indexing documents, packaging up documents and field values into this XML structure and POSTing it to Solr’s /update handler. Documents indexed in this fashion will be updated if they share the same unique key field value (configured in schema.xml) as existing documents.

Recently, JSON support has been added so it can be even cleaner to post documents into Solr and easier to adapt to a wider variety of clients. It looks like this:


{"add": {
  "doc": {
	"id": "doc02",
	"title": "Solr JSON",
	"mimeType": "application/pdf"}
  }
}

Simply post this type of JSON to /update/json. All other Solr commands can be posted as JSON as well (delete, commit, optimize).
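
From Java, the same document can be assembled and wrapped in an HTTP POST without any Solr client library. The following is only a minimal sketch: it assumes Java 11+ for java.net.http, the class and method names are made up for illustration, and the hand-rolled escaping covers only the plain values used here. The request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class SolrJsonAddSketch {
    // Escapes backslashes and double quotes only; real JSON escaping must
    // also handle control characters, so use a JSON library for real data.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the {"add": {"doc": {...}}} command with this section's example fields.
    static String addCommand(String id, String title, String mimeType) {
        return "{\"add\": {\"doc\": {"
                + "\"id\": " + quote(id) + ", "
                + "\"title\": " + quote(title) + ", "
                + "\"mimeType\": " + quote(mimeType)
                + "}}}";
    }

    // Prepares (but does not send) a POST to the /update/json handler.
    static HttpRequest addRequest(String solrBaseUrl, String json) {
        return HttpRequest.newBuilder(URI.create(solrBaseUrl + "/update/json"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        String json = addCommand("doc02", "Solr JSON", "application/pdf");
        HttpRequest request = addRequest("http://localhost:8983/solr", json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);
    }
}
```

To actually index the document, hand the request to a java.net.http.HttpClient pointed at a running Solr instance, and remember that the document only becomes searchable after a commit.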

Comma, or Tab, Separated Values

Another convenient way to bring documents into Solr is through CSV (comma-separated values; or, more generally, character-separated values, as the separator character is configurable). An example CSV file is shown here:


id,title,mimeType,lastModified
doc03,CSV ftw,application/pdf,2011-02-28T23:59:59Z

This CSV can be POSTed to the /update/csv handler, which maps rows to documents and columns to fields in a flexible, configurable manner. Using curl, this file (which we named docs.csv) can be posted like this:


curl "http://localhost:8983/solr/update/csv?commit=true" --data-binary \
  @docs.csv -H 'Content-type:text/plain; charset=utf-8'

Note that this Content-type header is a necessary HTTP header to use for the CSV update handler.

Indexing Rich Document Types

Thus far, our indexing examples have omitted extracting and indexing file content. Numerous rich document types, such as Word, PDF, and HTML, can be processed using Solr’s built-in Apache Tika integration. To index the contents and metadata of a Word document, using the HTTP command-line tool curl, this is basically all that is needed:


curl "http://localhost:8983/solr/update/extract?literal.id=doc04" \
  -F "myfile=@technical_manual.doc"

To index rich documents with Lucene’s API, you would need to interface with one or more extractor libraries, such as Tika, extract the text, and map full text and document metadata as appropriate to Lucene fields. It’s much more straightforward, with no coding, to accomplish this task with Solr.

Hot Tip

Apache Tika (http://tika.apache.org/) is a toolkit for detecting and extracting metadata and text from various types of documents. Existing open-source extractors and parsers are bundled with Tika to handle the majority of file types folks desire to search. Tika is baked into Solr, under the covers of the /update/extract capability.

DataImportHandler

And finally, Solr includes a general-purpose “data import handler” framework that has built-in capabilities for indexing relational databases (anything with a JDBC driver), arbitrary XML, and e-mail folders. The neat thing about the DataImportHandler is that it allows aggregating data from various sources into whole Solr documents.

For more information on Solr’s DataImportHandler, see http://wiki.apache.org/solr/DataImportHandler.

Deleting Documents

Documents can be deleted from a Lucene index, either by precise term matching (a unique identifier field, generally) or in bulk for all documents matching a Query.

When using Solr, deletes are accomplished by POSTing <delete><id>refcard01</id></delete> or <delete><query>mimeType:application/pdf</query></delete> XML messages to the /update handler, or "delete": { "id":"ID" } or "delete": { "query":"mimeType:application/pdf" } messages to /update/json.

Hot Tip

Deleting by query “*:*” and committing is a handy trick for deleting all documents and starting with a fresh index; very helpful during rapid iterative development.

Committing

Lucene is designed such that documents can continuously be indexed, though the view of what is searchable is fixed to a certain snapshot of an index (for performance, caching, and versioning reasons). This architecture allows batches of documents to be indexed and only made searchable after the entire batch has been ingested. Pending changes to an index, including added and deleted documents, are made visible using a commit command. With Solr, a <commit/> message can be posted to the /update handler, "commit": {} to /update/json, or even simpler as a bodyless /update GET (or POST) with commit=true set: http://localhost:8983/solr/update?commit=true
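
The snapshot behavior can be pictured with a toy in-memory model. This is a sketch of the visibility rule only, not of how Lucene stores segments or opens readers; the class and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CommitVisibilitySketch {
    // Adds and deletes accumulate here and are invisible to searches.
    private final Map<String, String> working = new LinkedHashMap<>();
    // The fixed snapshot that searches read; only commit() replaces it.
    private Map<String, String> snapshot = new LinkedHashMap<>();

    // Adding a document with an existing id replaces it (unique key behavior).
    void add(String id, String title) { working.put(id, title); }

    void delete(String id) { working.remove(id); }

    // Publishes all pending adds and deletes as the new searchable snapshot.
    void commit() { snapshot = new LinkedHashMap<>(working); }

    // Returns the ids of committed documents whose title contains the word.
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        String needle = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> doc : snapshot.entrySet()) {
            if (doc.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CommitVisibilitySketch index = new CommitVisibilitySketch();
        index.add("refcard01", "Understanding Lucene");
        System.out.println("before commit: " + index.search("lucene"));
        index.commit();
        System.out.println("after commit:  " + index.search("lucene"));
        index.delete("refcard01");
        System.out.println("pending delete: " + index.search("lucene"));
        index.commit();
        System.out.println("after commit:  " + index.search("lucene"));
    }
}
```

The point of the model is that both adds and deletes stay pending until the commit, so a batch becomes visible (or disappears) all at once.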

FIELDS

As mentioned, fields have a lot of configuration flexibility. The following table details the various decisions you must make regarding each field’s configuration.

Field Attribute Effect and Uses
stored Stores the original incoming field value in the index. Stored field values are available when documents are retrieved for search results.
term positions Location information of terms within a field. Positional information is necessary for proximity-related queries, such as phrase queries.
term offsets Character begin and end offset values of a term within a field’s textual value. Offsets can speed up the generation of highlighted field fragments for query terms. This one typically is a trade-off between highlighting performance and index size. If offsets aren’t stored, they can be computed at highlighting time.
term vectors An “inverted index” structure within a document, containing term/frequency pairs. Term vectors can be useful for more advanced search techniques, such as “more like this” where terms and their frequencies within a single document can be leveraged for finding similar documents.

In Solr’s schema.xml, a field can be configured to have all of these bells and whistles enabled like this:


<field name="kitchen_sink" type="text" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true" />

Only indexed fields have “terms”. These additional term-based structures are only available on indexed fields and really only make sense when used with analyzed full-text fields.

When indexing non-textual information, such as dates or numbers, the representation and ordering of the terms in the index drastically impact the types of operations available. Especially for numeric and date types, which typically are used for range queries and sorting, Lucene (and Solr) offer special ways to handle them. When indexing dates and numerics, use the Trie*Field types in Solr, and the NumericField/NumericTokenStream APIs with Lucene. This is a crucial reminder that what you want your end application to do with the search server greatly impacts how you index your documents. Sorting and range queries, specifically, require up-front planning to index properly to support those operations.

ANALYSIS

The Lucene analysis process consists of several stages. The text is sent initially through an optional CharFilter, then through a Tokenizer, and finally through any number of TokenFilters. CharFilters are useful for mapping diacritical characters to their ASCII equivalent, or mapping Traditional to Simplified Chinese. A Tokenizer is the first step in breaking a string into “tokens” (what they are called before being written to the index as “terms”). TokenFilters can subsequently add, remove, or modify/augment tokens in a sequential pipeline fashion.
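
The three stages can be mimicked in plain Java to make the data flow concrete. This sketch deliberately does not use Lucene's CharFilter, Tokenizer, or TokenFilter classes; the method names, the splitting rule, and the tiny stop-word list are assumptions chosen for illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisPipelineSketch {
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    // CharFilter stage: rewrites raw characters before tokenization.
    // Here: decompose accented characters and drop the combining marks.
    static String charFilter(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Tokenizer stage: breaks the string into tokens on non-letter/digit runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // TokenFilter stage 1: modifies each token (lowercasing).
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) out.add(token.toLowerCase(Locale.ROOT));
        return out;
    }

    // TokenFilter stage 2: removes tokens (stop words).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) out.add(token);
        }
        return out;
    }

    // The whole chain: char filter, then tokenizer, then filters in order.
    static List<String> analyze(String text) {
        return removeStopWords(lowercase(tokenize(charFilter(text))));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Understanding Lucene Refcard"));
        System.out.println(analyze("The Caf\u00e9 of Lucene"));
    }
}
```

A real analyzer chain would typically add more filters (stemming, synonyms) and would also track each token's position and offsets, which this sketch leaves out.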

Diagram 1

Hot Tip

Solr includes a very handy analysis introspection tool. You can access it at http://localhost:8983/solr/admin/analysis.jsp. Specify a field name or field type, enter some text, and see how it gets analyzed through each of the processing stages.

Using the Solr admin analysis introspection tool, using the field type “text_en” with the value “Understanding Lucene Refcard”, the following terms result:

Diagram 2

The analysis tool shows the term text that would be indexed ([understanding], [lucene]…), and the position and offset attributes we previously discussed. The analysis tool will handily show you the term output of each of the analysis stages, from tokenization through each of the filters.

SEARCHING

Now that we’ve got content indexed, searching it is easy! Ultimately, a Lucene Query object is handed to a Lucene IndexSearcher.search() method and results are processed. How to construct a query is the next step.

With Lucene Java, TermQuery is the most primitive Query. Then there’s BooleanQuery, PhraseQuery, and many other Query subclasses to choose from. Programmatically, the sky’s the limit in terms of query complexity. Lucene also includes a QueryParser, which parses a string into a Query object, supporting fielded, grouped, fuzzy, phrase, range, AND/OR/NOT/+/- and other sophisticated syntax.

Solr makes this all possible without coding and accepts a simple string query (q) parameter (and other parameters that can affect query parsing/generation). Solr includes a couple of general purpose query parsers, most notably a schema-aware subclass of Lucene’s QueryParser. This Lucene query parser is the default.

Hot Tip

Solr also includes a number of other specialized query parsers and the capability to mix and match them in rich combinations. Most notable are the “dismax” (disjunction maximum) parser and a new experimental “edismax” (extended dismax) parser, which allow typical user queries to search across a number of configurable fields with individual field boosting. Dismax is the parser most often used with Solr these days.

Searching Solr is a straightforward HTTP request to /select?q=<your query>. Displaying search results in JSON format (adding &wt=json), we get something like this:


{"responseHeader":{
	"status":0,
	"QTime":2,
	"params":{
	  "indent":"true", "wt":"json", "q":"*:*"}},
  "response":{"numFound":3,"start":0,
	"docs":[
	  {"id":"refcard01",
		"timestamp":"2011-02-17T20:44:49.064Z",
		"title":["Understanding Lucene"]},
	  {"id":"refcard02",
		"timestamp":"2011-02-17T20:48:16.862Z",
		"title":["Refcard 2"]},
	  {"id":"doc03",
		"mimeType":"application/pdf",
		"lastModified":"2011-02-28T23:59:59Z",
		"timestamp":"2011-02-17T21:42:31.423Z",
		"title":["CSV ftw"]}]
  }}

Note that Solr can return search results in a number of formats (XML, JSON, Ruby, PHP, Python, CSV, etc.); choose the one that is most convenient for your environment.

Debugging Query Parsing

Query parsing is complex business. It helps to see a representation of the underlying Query object that was generated. By adding a debug=query parameter to the request, you can see how a query is parsed. For example, using the query "title:lucene AND timestamp:[NOW-1YEAR TO NOW]", the debug output returns a parsedquery value of:


"parsedquery":"+title:lucene +timestamp:[1266446158657 TO 1297982158657]"

Note that AND made both clauses mandatory (leading +), and the date range values were evaluated by Solr’s date math feature and then converted to the Lucene “date” type index representation.
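
The two numbers in the range are epoch milliseconds, and they sit exactly one calendar year apart. The sketch below reproduces that arithmetic with java.time rather than Solr's own date math parser, so treat it as an equivalent calculation, not as Solr's implementation; the class and method names are made up.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

class DateMathSketch {
    // Evaluates the equivalent of "NOW-1YEAR" for a given "now", in UTC.
    static long nowMinusOneYear(long nowEpochMillis) {
        ZonedDateTime now = Instant.ofEpochMilli(nowEpochMillis).atZone(ZoneOffset.UTC);
        return now.minusYears(1).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        // Upper bound copied from the parsedquery output above.
        long upper = 1297982158657L;
        long lower = nowMinusOneYear(upper);
        System.out.println(lower + " TO " + upper);
        System.out.println(Instant.ofEpochMilli(lower) + " TO " + Instant.ofEpochMilli(upper));
    }
}
```

Running it with the upper bound from the debug output yields the lower bound shown there, which confirms that NOW was resolved once and the year was subtracted from that same instant.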

Explaining Result Scoring

Now that we have real documents indexed, we can take a look at Lucene’s scoring first-hand. Solr provides an easy way to look at Lucene’s “explain” output, which details how/why a document scored the way it did. In our Refcard lab, doing a title:lucene search matches a document and scores it like this:


0.8784157 = (MATCH) fieldWeight(title:lucene in 0), product of:
	1.0 = tf(termFreq(title:lucene)=1)
	1.4054651 = idf(docFreq=1, maxDocs=3)
	0.625 = fieldNorm(field=title, doc=0)

Add the debug=results parameter to the Solr search request to have explanation output added to the response.
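
The explain output is a product of three factors, and the first two can be recomputed by hand. The sketch below assumes the classic TF-IDF formulas (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))), which reproduce the numbers shown; fieldNorm is taken as given from the output because it is stored in the index in a lossy encoded form.

```java
import java.util.Locale;

class FieldWeightSketch {
    // tf: square root of the term's frequency within the field.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // idf: 1 + ln(maxDocs / (docFreq + 1)); rarer terms get a larger value.
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // fieldWeight is the product of tf, idf, and the stored field norm.
    static double fieldWeight(int termFreq, int docFreq, int maxDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // Inputs read off the explain output: termFreq=1, docFreq=1, maxDocs=3, fieldNorm=0.625.
        System.out.printf(Locale.ROOT, "tf=%.7f idf=%.7f fieldWeight=%.7f%n",
                tf(1), idf(1, 3), fieldWeight(1, 1, 3, 0.625));
    }
}
```

With only three documents and one match, idf is small; in a larger index a rare term would dominate the score far more.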

BELLS AND WHISTLES

Solr includes a number of other features; some of them wrap Lucene Java add-on libraries and some of them (like faceting and rich function query/sort capability) are currently only at the Solr layer. We aren’t going into any detail of these particular features here, but now that you understand Lucene, you have the foundation to understand basically how they work from the inverted index structure on up. These features include:

  • Faceting: providing counts for various document attributes across the entire result set.
  • Highlighting: generating relevant snippets of document text, highlighting query terms. Useful in result display to show users the context in which their queries matched.
  • Spell checking: “Did you mean…?”. Looks up terms textually close to the query terms and suggests possible intended queries.
  • More-like-this: Given a particular document, or some arbitrary text, what other documents are similar?

Version Information

These Refcard demos use the current development branch of Lucene/Solr. This is likely to be what is eventually released from Apache as Lucene and Solr 4.0. LucidWorks Enterprise is also based on this same version. The concepts apply to all versions of Lucene and Solr, and the bulk of these examples should also work with earlier versions of Solr.

For Further Information

For all things Apache Lucene, start here: http://lucene.apache.org

Solr sports relatively decent developer-centric documentation: http://wiki.apache.org/solr

Lucene in Action (Manning): http://www.manning.com/lucene

To answer your Lucene questions, try LucidFind (http://search.lucidimagination.com), where the Lucene ecosystem’s e-mail lists, wikis, issue tracker, etc. are made searchable for the entire Lucene community’s benefit.

See Apache Solr: Getting Optimal Search Results, http://refcardz.dzone.com/refcardz/solr-essentials, for more information on Apache Solr.

About The Authors

Erik Hatcher

Erik Hatcher evangelizes and engineers at Lucid Imagination. He co-authored both Lucene in Action and Java Development with Ant. At Lucid, he has worked with many companies deploying Lucene/Solr search systems. Erik has spoken at numerous industry events including Lucene EuroCon, ApacheCon, JavaOne, OSCON, and user groups and meetups around the world.

Recommended Book

Lucene in Action

When Lucene first appeared, this superfast search engine was nothing short of amazing. Today, Lucene still delivers. Its high-performance, easy-to-use API features like numeric fields, payloads, near-realtime search, and huge increases in indexing and searching speed make it the leading search tool.

And with clear writing, reusable examples, and unmatched advice, Lucene in Action, Second Edition is still the definitive guide to effectively integrating search into your applications. This totally revised book shows you how to index your documents, including formats such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, and filtering and covers the numerous improvements to Lucene since the first edition. Source code is for Lucene 3.0.1.


The Top Twelve Integration Patterns for Apache Camel

By Claus Ibsen

8,198 Downloads · Refcard 47 of 151 (see them all)



The Essential Apache Camel Cheat Sheet

Enterprise Integration Patterns (EIP) have become the standard way to describe, document and implement complex integration problems. Apache Camel is an open-source project for implementing the EIP simply in a few lines of Java code or XML configuration. This DZone Refcard will guide you through the most common Enterprise Integration Patterns and give you examples of how to implement them either in Java code or using Spring XML. While it is targeted toward software developers and enterprise architects, anyone in the integration space can benefit from this Refcard.

Enterprise Integration Patterns: with Apache Camel

By Claus Ibsen

About Enterprise Integration Patterns

Integration is a hard problem. To help deal with this complexity, the Enterprise Integration Patterns (EIP) have become the standard way to describe, document and implement complex integration problems. Hohpe and Woolf's book Enterprise Integration Patterns has become the bible in the integration space and is essential reading for any integration professional.

Apache Camel is an open source project for implementing the EIP easily in a few lines of Java code or Spring XML configuration. This reference card, the first in a two card series, guides you through the most common Enterprise Integration Patterns and gives you examples of how to implement them either in Java code or using Spring XML. This Refcard is targeted for software developers and enterprise architects, but anyone in the integration space can benefit as well.

About Apache Camel

Apache Camel is a powerful open source integration platform based on Enterprise Integration Patterns (EIP) with powerful Bean Integration. Camel lets you implement EIP routing using Camel's intuitive Domain Specific Language (DSL) based on Java (aka fluent builder) or XML. Camel uses URIs for endpoint resolution, so it's very easy to work with any kind of transport such as HTTP, REST, JMS, web service, File, FTP, TCP, Mail, JBI, Bean (POJO) and many others. Camel also provides Data Formats for various popular formats such as CSV, EDI, FIX, HL7, JAXB, JSON and XStream. Camel is an integration API that can be embedded in any server of choice, such as a J2EE server, ActiveMQ, Tomcat or OSGi, or run standalone. Camel's Bean Integration lets you define loose coupling, allowing you to fully separate your business logic from the integration logic. Camel is based on a modular architecture that allows you to plug in your own component or data format so that it blends in seamlessly with existing modules. Camel provides a test kit for unit and integration testing with strong mock and assertion capabilities.

Essential Patterns

This group consists of the most essential patterns that anyone working with integration must know.

Pipes and Filters

Diagram How can we perform complex processing on a message while maintaining independence and flexibility?
Pipes and Filters
Problem A single event often triggers a sequence of processing steps
Solution Use Pipes and Filters to divide a larger processing task into a sequence of smaller, independent processing steps (filters) that are connected by channels (pipes)
Camel Camel supports Pipes and Filters using the pipeline node.
Java DSL

from("jms:queue:order:in").pipeline("direct:transformOrder", "direct:validateOrder", "jms:queue:order:process");

Where jms represents the JMS component used for consuming JMS messages on the JMS broker. Direct is used for combining endpoints in a synchronous fashion, allowing you to divide routes into sub routes and/or reuse common routes.

TIP: Pipeline is the default mode of operation when you specify multiple outputs, so it can be omitted and replaced with the more common to node:


from("jms:queue:order:in").to("direct:transformOrder",
"direct:validateOrder", "jms:queue:order:process");

TIP: You can also separate each step as individual to nodes:


from("jms:queue:order:in")
	.to("direct:transformOrder")
	.to("direct:validateOrder")
	.to("jms:queue:order:process");

Spring DSL

<route>
	<from uri="jms:queue:order:in"/>
	<pipeline>
		<to uri="direct:transformOrder"/>
		<to uri="direct:validateOrder"/>
		<to uri="jms:queue:order:process"/>
	</pipeline>
</route>
<route>
	<from uri="jms:queue:order:in"/>
	<to uri="direct:transformOrder"/>
	<to uri="direct:validateOrder"/>
	<to uri="jms:queue:order:process"/>
</route>
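
Stripped of Camel, the pattern itself is just a chain of independent steps in which each step's output is the next step's input. The following plain-Java sketch illustrates that idea only; it is not how Camel implements pipelines, and the step bodies (trimming, upper-casing, an emptiness check) are invented placeholders named after the endpoints above.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class PipesAndFiltersSketch {
    // Filter 1: a hypothetical transformation step.
    static String transformOrder(String order) {
        return order.trim().toUpperCase(Locale.ROOT);
    }

    // Filter 2: a hypothetical validation step that rejects empty orders.
    static String validateOrder(String order) {
        if (order.isEmpty()) throw new IllegalArgumentException("empty order");
        return order;
    }

    // The pipe: feeds the output of each filter into the next, in order.
    static String pipeline(String message, List<UnaryOperator<String>> filters) {
        String current = message;
        for (UnaryOperator<String> filter : filters) {
            current = filter.apply(current);
        }
        return current;
    }

    // The fixed chain used in this sketch: transform first, then validate.
    static String processOrder(String message) {
        List<UnaryOperator<String>> filters = List.of(
                PipesAndFiltersSketch::transformOrder,
                PipesAndFiltersSketch::validateOrder);
        return pipeline(message, filters);
    }

    public static void main(String[] args) {
        System.out.println(processOrder("  widget x2 "));
    }
}
```

Because each filter only sees a message and returns a message, steps can be reordered, reused, or tested on their own, which is the independence the pattern is after.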

Message Router

Diagram How can you decouple individual processing steps so that messages can be passed to different filters depending on a set of conditions?
Message Router
Problem Pipes and Filters route each message in the same processing steps. How can we route messages differently?
Solution Filter using predicates to choose the right output destination.
Camel Camel supports Message Router using the choice node. For more details see the Content-Based Router pattern.

Content-Based Router

Diagram How do we handle a situation where the implementation of a single logical function (e.g., inventory check) is spread across multiple physical systems?
Content-Based Router
Problem How do we ensure a Message is sent to the correct recipient based on information from its content?
Solution Use a Content-Based Router to route each message to the correct recipient based on the message content.
Camel Camel has extensive support for Content-Based Routing. Camel supports content based routing based on choice, filter, or any other expression.
Java DSL

Choice


from("jms:queue:order")
	.choice()
		.when(header("type").in("widget","wiggy"))
			.to("jms:queue:order:widget")
		.when(header("type").isEqualTo("gadget"))
			.to("jms:queue:order:gadget")
		.otherwise().to("jms:queue:order:misc")
	.end();

TIP: In the route above end() can be omitted, as it's the last node and we do not route the message to a new destination after the choice.

TIP: You can continue routing after the choice ends.

Spring DSL

Choice


<route>
	<from uri="jms:queue:order"/>
	<choice>
		<when>
			<simple>${header.type} in 'widget,wiggy'</simple>
			<to uri="jms:queue:order:widget"/>
		</when>
		<when>
			<simple>${header.type} == 'gadget'</simple>
			<to uri="jms:queue:order:gadget"/>
		</when>
		<otherwise>
			<to uri="jms:queue:order:misc"/>
		</otherwise>
	</choice>
</route>

TIP: In Spring DSL you cannot invoke code, as opposed to the Java DSL, which is 100% Java. To express the predicates for the choices we need to use a language. We use the Simple language, which has a simple expression parser that supports a limited set of operators. You can use any of the more powerful languages supported in Camel, such as JavaScript, Groovy, Unified EL and many others.

TIP: You can also use a method call to invoke a method on a bean to evaluate the predicate. Let's try that:


<when>
	<method bean="myBean" method="isGadget"/>
	...
</when>

<bean id="myBean" class="com.mycompany.MyBean"/>

public boolean isGadget(@Header("type") String type) {
	return "gadget".equals(type);
}

Notice how we use Bean Parameter Binding to instruct Camel to invoke this method and pass in the type header as the String parameter. This allows your code to be fully decoupled from any Camel API, so it's easy to read, write and unit test.
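
Since the predicate is an ordinary method, it can be exercised without starting Camel at all. The sketch below is one way to do that; the @Header annotation is left off so the class compiles without Camel on the classpath, the comparison uses the lowercase gadget value from the routes above, and the surrounding check class is hypothetical.

```java
class BeanPredicateCheck {
    // Stand-in for the MyBean class referenced from the Spring XML.
    // In a real route the parameter would carry Camel's @Header("type").
    static class MyBean {
        public boolean isGadget(String type) {
            // Constant-first equals also copes with a missing (null) header.
            return "gadget".equals(type);
        }
    }

    // Tiny assertion helper so the sketch needs no test framework.
    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    public static void main(String[] args) {
        MyBean bean = new MyBean();
        check(bean.isGadget("gadget"), "gadget orders should match");
        check(!bean.isGadget("widget"), "widget orders should not match");
        check(!bean.isGadget(null), "a missing header should not match");
        System.out.println("all predicate checks passed");
    }
}
```

In a project with a test framework, the three checks would simply become three test methods against the same bean.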

Message Translator

Diagram How can systems using different data formats communicate with each other using messaging?
Message Translator
Problem Each application uses its own data format, so we need to translate the message into the data format the application supports.
Solution Use a special filter, a message translator, between filters or applications to translate one data format into another.
Camel Camel supports the message translator using the processor, bean or transform nodes. TIP: Camel routes the message as a chain of processor nodes.
Java DSL

Processor


public class OrderTransformProcessor
		implements Processor {
	public void process(Exchange exchange)
			throws Exception {
		// do message translation here
	}
}
from("direct:transformOrder")
	.process(new OrderTransformProcessor());

Bean

Instead of the processor we can use Bean (POJO). An advantage of using a Bean over Processor is the fact that we do not have to implement or use any Camel specific interfaces or types. This allows you to fully decouple your beans from Camel.


public class OrderTransformerBean {
	public String transformOrder(String body) {
		// do message translation here
	}
}
Object transformer = new OrderTransformerBean();
from("direct:transformOrder").bean(transformer);

TIP: Camel can create an instance of the bean automatically; you can just refer to the class type.


from("direct:transformOrder")
	.bean(OrderTransformerBean.class);

TIP: Camel will try to figure out which method to invoke on the bean in case there are multiple methods. In case of ambiguity you can specify which method to invoke with the method name parameter:


from("direct:transformOrder")
	.bean(OrderTransformerBean.class, "transformOrder");

Transform

Transform is a particular processor allowing you to set a response to be returned to the original caller. We use transform to return a constant ACK response to the TCP listener after we have copied the message to the JMS queue. Notice we use a constant to build an "ACK" string as response.


from("mina:tcp://localhost:8888?textline=true")
	.to("jms:queue:order:in")
	.transform(constant("ACK"));

Spring DSL

Processor


<route>
	<from uri="direct:transformOrder"/>
	<process ref="transformer"/>
</route>

<bean id="transformer"
	class="com.mycompany.OrderTransformProcessor"/>

In Spring DSL Camel will look up the processor or POJO/Bean in the registry based on the id of the bean.

Bean


<route>
	<from uri="direct:transformOrder"/>
	<bean ref="transformer"/>
</route>
<bean id="transformer"
	class="com.mycompany.OrderTransformerBean"/>

Transform


<route>
	<from uri="mina:tcp://localhost:8888?textline=true"/>
	<to uri="jms:queue:order:in"/>
	<transform>
		<constant>ACK</constant>
	</transform>
</route>

Annotation DSL

You can also use the @Consume annotation for transformations. For example, in the method below we consume from a JMS queue and do the transformation in regular Java code. Notice that the input and output parameters of the method are Strings. Camel will automatically coerce the payload to the expected type defined by the method. Since this is a JMS example, the response will be sent back to the JMS reply-to destination.


@Consume(uri="jms:queue:order:transform")
public String transformOrder(String body) {
	// do message translation
}

TIP: You can use Bean Parameter Binding to help Camel coerce the Message into the method parameters. For instance you can use @Body, @Headers parameter annotations to bind parameters to the body and headers.

Message Filter

Diagram How can a component avoid receiving unwanted messages?
Message Filter
Problem How do you discard unwanted messages?
Solution Use a special kind of Message Router, a Message Filter, to eliminate undesired messages from a channel based on a set of criteria.
Camel Camel supports Message Filter using the filter node. The filter evaluates a predicate; only messages for which the predicate is true pass the filter, whereas messages for which it is false are silently dropped.
Java DSL We want to discard any test messages so we only route non-test messages to the order queue.

from("jms:queue:inbox")
	.filter(header("test").isNotEqualTo("true"))
	.to("jms:queue:order");

Spring DSL For the Spring DSL we use XPath to evaluate the predicate. The $test is a special shorthand in Camel to refer to the header with the given name. So even if the payload is not XML based we can still use XPath to evaluate predicates.

<route>
	<from uri="jms:queue:inbox"/>
	<filter>
		<xpath>$test = 'false'</xpath>
		<to uri="jms:queue:order"/>
	</filter>
</route>

Dynamic Router

Diagram
Dynamic Router
Problem How can we route messages based on a dynamic list of destinations?
Solution Use a Dynamic Router, a router that can self-configure based on special configuration messages from participating destinations.
Camel Camel has support for Dynamic Router using the Dynamic Recipient List combined with a data store holding the list of destinations.
Java DSL We use a Processor as the dynamic router to determine the destinations. We could also have used a Bean instead.

from("jms:queue:order")
	.processRef("myDynamicRouter")
	.recipientList(header("destinations"));
	
public class MyDynamicRouter implements Processor {
	public void process(Exchange exchange) {
		// query a data store to find the best match for the
		// endpoint and return the destination(s) in a header:
		// exchange.getIn().setHeader("destinations", list);
	}
}

Spring DSL

<route>
	<from uri="jms:queue:order"/>
	<process ref="myDynamicRouter"/>
	<recipientList>
		<header>destinations</header>
	</recipientList>
</route>

Annotation DSL

public class MyDynamicRouter {
	@Consume(uri = "jms:queue:order")
	@RecipientList
	public List<String> route(@XPath("/customer/id") String customerId,
			@Header("location") String location,
			Document body) {
		// query data store, find best match for the
		// endpoint and return destination(s)
	}
}

TIP: Notice how we used Bean Parameter Binding to bind the parameters to the route method based on an @XPath expression on the XML payload of the JMS message. This allows us to extract the customer id as a string parameter. @Header will bind a JMS property with the key location. Document is the XML payload of the JMS message.

TIP: Camel uses its strong type converter feature to convert the payload to the type of the method parameter. We could use String and Camel will convert the body to a String instead. You can register your own type converters as well using the @Converter annotation at the class and method level.

Recipient List

Diagram How do we route a message to a list of statically or dynamically specified recipients?
Recipient List
Problem How can we route messages based on a static or dynamic list of destinations?
Solution Define a channel for each recipient. Then use a Recipient List to inspect an incoming message, determine the list of desired recipients and forward the message to all channels associated with the recipients in the list.
Camel Camel supports the static Recipient List using the multicast node, and the dynamic Recipient List using the recipientList node.
Java DSL

Static

In this route we route to a static list of two recipients, each of which receives a copy of the same message.


from("jms:queue:inbox")
	.multicast().to("file://backup", "seda:inbox");

Dynamic

In this route we route to a dynamic list of recipients defined in the message header destinations, which contains a list of recipients as endpoint URIs. The bean processMails is used to add the destinations header to the message.


from("seda:confirmMails").beanRef("processMails")
	.recipientList(header("destinations"));

And in the processMails bean we use @Headers Bean Parameter Binding to provide a java.util.Map in which to store the recipients.


public void confirm(@Headers Map headers, @Body String body) {
	String[] recipients = ...
	headers.put("destinations", recipients);
}

Spring DSL

Static


<route>
	<from uri="jms:queue:inbox" />
	<multicast>
		<to uri="file://backup"/>
		<to uri="seda:inbox"/>
	</multicast>
</route>

Dynamic

In this example we invoke a method call on a Bean to provide the dynamic list of recipients.


<route>
	<from uri="jms:queue:inbox" />
	<recipientList>
		<method bean="myDynamicRouter" method="route"/>
	</recipientList>
</route>

<bean id="myDynamicRouter"
	class="com.mycompany.MyDynamicRouter"/>
	
public class MyDynamicRouter {
	public String[] route(String body) {
		return new String[] { "file://backup", .... };
	}
}

Annotation DSL

In the CustomerService class we annotate the whereTo method with @RecipientList and return a single destination based on the customer id. Notice the flexibility of Camel, as it adapts to whatever your methods return: a single element, a list, an iterator, etc.


public class CustomerService {
	@RecipientList
	public String whereTo(@Header("customerId") String id) {
		return "jms:queue:customer:" + id;
	}
}

And then we can route to the bean and it will act as a dynamic recipient list.


from("jms:queue:inbox")
	.bean(CustomerService.class, "whereTo");

Splitter

Diagram How can we process a message if it contains multiple elements, each of which may have to be processed in a different way?
Splitter
Problem How can we split a single message into pieces to be routed individually?
Solution Use a Splitter to break out the composite message into a series of individual messages, each containing data related to one item.
Camel Camel has support for Splitter using the split node.
Java DSL

In this route we consume files from the inbox folder. Each file is then split into a new message. We use a tokenizer to split the file content line by line based on line breaks.


from("file://inbox")
	.split(body().tokenize("\n"))
	.to("seda:orderLines");
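
The effect of the tokenizer can be pictured with a few lines of plain Java. This is only a sketch of the idea, not Camel's implementation: the class name LineSplitter is hypothetical, and skipping blank lines is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of splitting a file body into one message per line.
// LineSplitter is a hypothetical name; this is not part of the Camel API.
class LineSplitter {
    // Splits the body on line breaks; each non-empty line becomes its own message.
    static List<String> split(String body) {
        List<String> messages = new ArrayList<>();
        for (String line : body.split("\n")) {
            if (!line.isEmpty()) { // assumption of this sketch: blank lines are dropped
                messages.add(line);
            }
        }
        return messages;
    }
}
```

Each element of the returned list corresponds to one message that the route above would send to seda:orderLines.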

TIP: Camel also supports splitting streams using the streaming node. We can split the stream by using a comma:


.split(body().tokenize(",")).streaming().to("seda:parts");

TIP: In the routes above each individual split message will be executed in sequence. Camel also supports parallel execution using the parallelProcessing node.


.split(body().tokenize(",")).streaming()
	.parallelProcessing().to("seda:parts");

Spring DSL

In this route we use XPath to split XML payloads received on the JMS order queue.

<route>
	<from uri="jms:queue:order"/>
	<split>
		<xpath>/invoice/lineItems</xpath>
		<to uri="seda:processOrderLine"/>
	</split>
</route>

And in this route we split the messages using a regular expression:


<route>
	<from uri="jms:queue:order"/>
	<split>
		<tokenize token="([A-Z|0-9]*);" regex="true"/>
		<to uri="seda:processOrderLine"/>
	</split>
</route>

TIP: Split evaluates an org.apache.camel.Expression to provide something that is iterable to produce each individual new message. This allows you to provide any kind of expression such as a Bean invoked as a method call.


<split>
	<method bean="mySplitter" method="splitMe"/>
	<to uri="seda:processOrderLine"/>
</split>

<bean id="mySplitter" class="com.mycompany.MySplitter"/>

public List splitMe(String body) {
	// split using java code and return a List
	List parts = ...
	return parts;
}

Aggregator

Diagram How do we combine the results of individual, but related messages so that they can be processed as a whole?
Message Router
Problem How do we combine multiple messages into a single combined message?
Solution Use a stateful filter, an Aggregator, to collect and store individual messages until it receives a complete set of related messages to be published.
Camel Camel has support for the Aggregator using the aggregate node. Camel uses a stateful batch processor that is capable of aggregating related messages into a single combined message. A correlation expression is used to determine which messages should be aggregated. An aggregation strategy is used to combine aggregated messages into the result message. Camel’s aggregator also supports a completion predicate allowing you to signal when the aggregation is complete. Camel also supports other completion signals based on timeout and/or a number of messages already aggregated.
Java DSL

Stock quote example

We want to update a website every five minutes with the latest stock quotes. The quotes are received on a JMS topic. As we can receive multiple quotes for the same stock within this time period, we only want to keep the last one as it's the most up to date. We can do this with the aggregator:


from("jms:topic:stock:quote")
	.aggregate().xpath("/quote/@symbol")
	.batchTimeout(5 * 60 * 1000).to("seda:quotes");

As the correlation expression we use XPath to fetch the stock symbol from the message body. As the aggregation strategy we use the default provided by Camel that picks the latest message, and thus also the most up to date. The time period is set as a timeout value in milliseconds.
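
The interplay of correlation key, "keep the latest" strategy, and timeout can be sketched in plain Java. This is an illustration under stated assumptions, not Camel's aggregator: the class LatestQuoteAggregator and its methods are hypothetical, and the timeout is modeled as an explicit call to complete().

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of correlation plus a "latest wins" aggregation strategy.
// LatestQuoteAggregator is a hypothetical name; this is not Camel's implementation.
class LatestQuoteAggregator {
    // Correlation key (stock symbol) -> most recent quote seen for that symbol.
    private final Map<String, Double> latest = new LinkedHashMap<>();

    // Called for every incoming quote; a newer quote replaces the older one.
    void onQuote(String symbol, double price) {
        latest.put(symbol, price);
    }

    // Stands in for the batch timeout firing: publish one quote per symbol, then reset.
    Map<String, Double> complete() {
        Map<String, Double> batch = new LinkedHashMap<>(latest);
        latest.clear();
        return batch;
    }
}
```

However many quotes arrive for a symbol within the period, only one per symbol is published when the period ends.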

Loan broker example

We aggregate the quotes that various banks return for a given loan request. We want to pick the bank with the best quote (the cheapest loan), so we need an aggregation strategy that picks the best quote.


from("jms:topic:loan:quote")
	.aggregate().header("loanId")
	.aggregationStrategy(bestQuote)
	.completionPredicate(header(Exchange.AGGREGATED_SIZE)
		.isGreaterThan(2))
	.to("seda:bestLoanQuote");

We use a completion predicate that signals when we have received more than 2 quotes for a given loan, giving us at least 3 quotes to pick among. The following shows the code snippet for the aggregation strategy we must implement to pick the best quote:


public class BestQuoteStrategy implements AggregationStrategy {
	public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
		// defensive: some Camel versions pass null as oldExchange for the first message
		if (oldExchange == null) {
			return newExchange;
		}
		double oldQuote = oldExchange.getIn().getBody(Double.class);
		double newQuote = newExchange.getIn().getBody(Double.class);
		// return the "winner" that has the lowest quote
		return newQuote < oldQuote ? newExchange : oldExchange;
	}
}

Spring DSL

Loan Broker Example


<route>
	<from uri="jms:topic:loan:quote"/>
	<aggregate strategyRef="bestQuote">
		<correlationExpression>
			<header>loanId</header>
		</correlationExpression>
		<completionPredicate>
			<simple>${header.CamelAggregatedSize} > 2</simple>
		</completionPredicate>
	</aggregate>
	<to uri="seda:bestLoanQuote"/>
</route>

<bean id="bestQuote"
	class="com.mycompany.BestQuoteStrategy"/>

TIP: We use the simple language to declare the completion predicate. Simple is a basic language that supports a primitive set of operators. ${header.CamelAggregatedSize} will fetch a header holding the number of messages aggregated.

TIP: If the completion predicate is more complex we can use a method call to invoke a Bean so we can do the evaluation in pure Java code:


<completionPredicate>
	<method bean="quoteService" method="isComplete"/>
</completionPredicate>

public boolean isComplete(@Header(Exchange.AGGREGATED_SIZE)
	int count, String body) {
	return body.equals("STOP");
}

Notice how we can use Bean Binding Parameter to get hold of the aggregation size as a parameter, instead of looking it up in the message.

Resequencer

Diagram How can we get a stream of related but out-of-sequence messages back into the correct order?
Resequencer
Problem How do we ensure ordering of messages?
Solution Use a stateful filter, a Resequencer, to collect and reorder messages so that they can be published in a specified order.
Camel

Camel has support for the Resequencer using the resequence node. Camel uses a stateful batch processor that is capable of reordering related messages. Camel supports two resequencing algorithms:

-batch = collects messages into a batch, sorts them, and publishes them in order

-stream = continuously re-orders message streams based on the detection of gaps between messages.

Batch is similar to the aggregator but with sorting. Stream is the traditional Resequencer pattern with gap detection. Stream requires numbers (longs) as sequence numbers, because gap detection must be able to compute whether a gap exists. A gap is detected if a number in a series is missing, e.g. 3, 4, 6 with number 5 missing. Camel will hold back the later messages until number 5 arrives.
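
Gap detection can be sketched in plain Java as follows. This is a minimal illustration of the idea, not Camel's resequencer: the class StreamResequencer is hypothetical, and the timeout that eventually releases held-back messages is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java illustration of stream resequencing with gap detection.
// StreamResequencer is a hypothetical name; timeouts are deliberately omitted.
class StreamResequencer {
    // Held-back messages, ordered by sequence number.
    private final TreeMap<Long, String> pending = new TreeMap<>();
    // The sequence number we expect to deliver next.
    private long next;

    StreamResequencer(long firstExpected) {
        this.next = firstExpected;
    }

    // Accepts one message and returns every message that can now be delivered in order.
    List<String> offer(long seq, String body) {
        pending.put(seq, body);
        List<String> ready = new ArrayList<>();
        // Deliver while there is no gap between 'next' and the lowest pending number.
        while (!pending.isEmpty() && pending.firstKey() == next) {
            ready.add(pending.pollFirstEntry().getValue());
            next++;
        }
        return ready;
    }
}
```

With the series 3, 4, 6, the call for number 6 returns nothing; once number 5 is offered, both 5 and 6 are released in order.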

Java DSL

Batch:

We want to process received stock quotes, once a minute, ordered by their stock symbol. We use XPath as the expression to select the stock symbol, as the value used for sorting.


from("jms:topic:stock:quote")
	.resequence().xpath("/quote/@symbol")
	.timeout(60 * 1000)
	.to("seda:quotes");

Camel will default the order to ascending. You can provide your own comparison for sorting if needed.

Stream:

Suppose we continuously poll a file directory for inventory updates, and it's important they are processed in sequence by their inventory id. To do this we enable streaming and use one hour as the timeout.


from("file://inventory")
	.resequence().xpath("/inventory/@id")
	.stream().timeout(60 * 60 * 1000)
	.to("seda:inventoryUpdates");

Spring DSL

Batch:


<route>
	<from uri="jms:topic:stock:quote"/>
	<resequence>
		<xpath>/quote/@symbol</xpath>
		<batch-config batchTimeout="60000"/>
	</resequence>
	<to uri="seda:quotes"/>
</route>

Stream:


<route>
	<from uri="file://inventory"/>
	<resequence>
		<xpath>/inventory/@id</xpath>
		<stream-config timeout="3600000"/>
	</resequence>
	<to uri="seda:inventoryUpdates"/>
</route>

Notice that you can enable streaming by specifying <stream-config> instead of <batch-config>.

Dead Letter Channel

Diagram What will the messaging system do with a message it cannot deliver?
Message Router
Problem The messaging system cannot deliver a message.
Solution When a message cannot be delivered, it should be moved to a Dead Letter Channel.
Camel

Camel has extensive support for Dead Letter Channel through its error handler and exception clauses. The error handler supports redelivery policies that decide how many times to try redelivering a message before moving it to a Dead Letter Channel.

The default Dead Letter Channel will log the message at ERROR level and perform up to 6 redeliveries using a one second delay before each retry.
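
The redelivery behavior can be pictured with a small plain-Java loop. This is a sketch under stated assumptions, not Camel's error handler: the class RedeliverySketch is hypothetical, the dead letter endpoint is modeled as an in-memory list, and a delivery counts as failed when the target throws a RuntimeException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java illustration of redelivery followed by a dead letter queue.
// RedeliverySketch is a hypothetical name; this is not Camel's error handler.
class RedeliverySketch {
    // Stands in for the dead letter endpoint.
    final List<String> deadLetters = new ArrayList<>();
    private final int maximumRedeliveries;
    private final long delayMillis;

    RedeliverySketch(int maximumRedeliveries, long delayMillis) {
        this.maximumRedeliveries = maximumRedeliveries;
        this.delayMillis = delayMillis;
    }

    // Returns the number of delivery attempts made (one initial attempt plus redeliveries).
    int deliver(String message, Consumer<String> target) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                target.accept(message);
                return attempts; // delivered successfully
            } catch (RuntimeException e) {
                if (attempts > maximumRedeliveries) {
                    deadLetters.add(message); // redeliveries exhausted
                    return attempts;
                }
            }
            try {
                Thread.sleep(delayMillis); // wait before the next redelivery
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                deadLetters.add(message); // give up if interrupted while waiting
                return attempts;
            }
        }
    }
}
```

With maximumRedeliveries set to 3, a target that always fails is tried four times in total before the message lands in deadLetters.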

The error handler has two scopes: global and per route.

TIP: See Exception Clause in the Camel documentation for selective interception of thrown exceptions. This allows you to route certain exceptions differently or even reset the failure by marking it as handled.

TIP: DeadLetterChannel supports processing the message before it gets redelivered using onRedelivery. This allows you to alter the message beforehand (e.g. to set custom headers).

Java DSL

Global scope


errorHandler(deadLetterChannel("jms:queue:error")
	.maximumRedeliveries(3));
	
from(...)

Route scope


from("jms:queue:event")
	.errorHandler(deadLetterChannel("jms:queue:error")
		.maximumRedeliveries(5))
	.multicast().to("log:event", "seda:handleEvent");

In this route we override the global scope to use up to five redeliveries, whereas the global scope only has three. You can of course also set a different error queue destination:


deadLetterChannel("log:badEvent").maximumRedeliveries(5)

Spring DSL

The error handler is configured very differently in the Java DSL vs. the Spring DSL. The Spring DSL relies more on standard Spring bean configuration whereas the Java DSL uses fluent builders.

Global scope

The Global scope error handler is configured using the errorHandlerRef attribute on the camelContext tag.


<camelContext errorHandlerRef="myDeadLetterChannel">
...
</camelContext>

Route scope

Route scope is configured using the errorHandlerRef attribute on the route tag.


<route errorHandlerRef="myDeadLetterChannel">
...
</route>

For both, the error handler itself is configured using a regular Spring bean:


<bean id="myDeadLetterChannel"
	class="org.apache.camel.builder.DeadLetterChannelBuilder">
	<property name="deadLetterUri" value="jms:queue:error"/>
	<property name="redeliveryPolicy"
		ref="myRedeliveryPolicy"/>
</bean>

<bean id="myRedeliveryPolicy"
		class="org.apache.camel.processor.RedeliveryPolicy">
	<property name="maximumRedeliveries" value="5"/>
	<property name="delay" value="5000"/>
</bean>

Wire Tap

Diagram How do you inspect messages that travel on a point-to-point channel?
Wire Tap
Problem How do you tap messages while they are routed?
Solution Insert a Wire Tap into the channel, that publishes each incoming message to the main channel as well as to a secondary channel.
Camel Camel has support for Wire Tap using the wireTap node, which supports two modes: traditional and new message. The traditional mode sends a copy of the original message, whereas the new message mode sends a newly created message. All tapped messages are sent as Event Messages and run in parallel with the original message.
Java DSL

Traditional

The route uses the traditional mode to send a copy of the original message to the seda tapped queue, while the original message is routed to its destination, the process order bean.


from("jms:queue:order")
	.wireTap("seda:tappedOrder")
	.to("bean:processOrder");

New message

In this route we tap the high priority orders and send a new message containing a body with the from part of the order. Tip: As Camel uses an Expression for evaluation you can use other functions than xpath, for instance to send a fixed String you can use constant.


from("jms:queue:order")
	.choice()
		.when(xpath("/order/priority = 'high'"))
			.wireTap("seda:from", xpath("/order/from"))
			.to("bean:processHighOrder")
		.otherwise()
			.to("bean:processOrder");

Spring DSL

Traditional


<route>
	<from uri="jms:queue:order"/>
	<wireTap uri="seda:tappedOrder"/>
	<to uri="bean:processOrder"/>
</route>

New Message


<route>
	<from uri="jms:queue:order"/>
	<choice>
		<when>
			<xpath>/order/priority = 'high'</xpath>
			<wireTap uri="seda:from">
				<body><xpath>/order/from</xpath></body>
			</wireTap>
			<to uri="bean:processHighOrder"/>
		</when>
		<otherwise>
			<to uri="bean:processOrder"/>
		</otherwise>
	</choice>
</route>

Conclusion

The twelve patterns in this Refcard cover the most used patterns in the integration space, together with two of the most complex: the Aggregator and the Dead Letter Channel. In the second part of this series we will take a further look at common patterns and transactions.

Get More Information

Camel Website http://camel.apache.org The home of the Apache Camel project. Find downloads, tutorials, examples, getting started guides, issue tracker, roadmap, mailing lists, irc chat rooms, and how to get help.
FuseSource Website http://fusesource.com The home of the FuseSource company, the professional company behind Apache Camel with enterprise offerings, support, consulting and training.
About Author http://davsclaus.blogspot.com The personal blog of the author of this reference card.

About The Author

Photo of author Claus Ibsen

Claus Ibsen

Claus Ibsen is a passionate open-source enthusiast who specializes in the integration space. As an engineer in the Progress FUSE open source team he works full time on Apache Camel, FUSE Mediation Router (based on Apache Camel) and related projects. Claus is very active in the Apache Camel and FUSE communities, writing blogs, tweeting, assisting on the forums and IRC channels, and driving the Apache Camel roadmap.

About Progress Fuse

FUSE products are standards-based, open source enterprise integration tools based on Apache SOA projects, and are productized and supported by the people who wrote the code.

Recommended Book

Utilizing years of practical experience, seasoned experts Gregor Hohpe and Bobby Woolf show how asynchronous messaging has proven to be the best strategy for enterprise integration success. However, building and deploying messaging solutions presents a number of problems for developers. Enterprise Integration Patterns provides an invaluable catalog of sixty-five patterns, with real-world solutions that demonstrate the formidable power of messaging and help you to design effective messaging solutions for your enterprise.



Apache Hadoop Deployment

A Blueprint for Reliable Distributed Computing

By Eugene Ciurana

9,715 Downloads · Refcard 133 of 151 (see them all)


The Essential Hadoop Deployment Cheat Sheet

Apache Hadoop Deployment is covered in this refcard. It's a basic blueprint for deploying Apache Hadoop HDFS and MapReduce using the Cloudera Distribution. It will take you from installation to deployment. It provides developers and data experts with the instructions they need for deploying Big Data applications. The process is made simpler by the Cloudera Distribution for Apache Hadoop: an open-source, enterprise-class distribution for production ready environments. To learn about basic tools and terminology of Hadoop, check out our Getting Started with Apache Hadoop Refcard.

INTRODUCTION

This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. Check out Refcard #117, Getting Started with Apache Hadoop, for basic terminology and for an overview of the tools available in the Hadoop Project.

WHICH HADOOP DISTRIBUTION?

Apache Hadoop is a framework for implementing reliable and scalable computational networks. This Refcard presents how to deploy and use development and production computational networks. HDFS, MapReduce, and Pig are the foundational tools for developing Hadoop applications.

There are two basic Hadoop distributions:

  • Apache Hadoop is the main open-source, bleeding-edge distribution from the Apache foundation.
  • The Cloudera Distribution for Apache Hadoop (CDH) is an open-source, enterprise-class distribution for production-ready environments.

The choice between the two distributions depends on the organization’s objective.

  • The Apache distribution is fine for experimental learning exercises and for becoming familiar with how Hadoop is put together.
  • CDH removes the guesswork and offers an almost turnkey product for robustness and stability; it also offers some tools not available in the Apache distribution.

Hot Tip

Cloudera offers professional services and puts out an enterprise distribution of Apache Hadoop. Their toolset complements Apache’s. Documentation about Cloudera’s CDH is available from http://docs.cloudera.com.

The Apache Hadoop distribution assumes that the person installing it is comfortable with configuring a system manually. CDH, on the other hand, is designed as a drop-in component for all major Linux distributions.

Hot Tip

Linux is the supported platform for production systems. Windows is adequate as a development platform but is not supported for production.

Minimum Prerequisites

  • Java 1.6 from Oracle, version 1.6 update 8 or later; identify your current JAVA_HOME
  • sshd and ssh for managing Hadoop daemons across multiple systems
  • rsync for file and directory synchronization across the nodes in the cluster
  • Create a service account for user hadoop where $HOME=/home/hadoop

SSH Access

Every system in a Hadoop deployment must provide SSH access for data exchange between nodes. Log in to the node as the Hadoop user and run the commands in Listing 1 to validate or create the required SSH configuration.

Listing 1 - Hadoop SSH Prerequisites

keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh localhost -C true ; then \
  if [ ! -e "$keyFile" ]; then \
     ssh-keygen -t rsa -b 2048 -P '' \
        -f "$pKeyFile"; \
  fi; \
  cat "$keyFile" >> "$authKeys"; \
  chmod 0640 "$authKeys"; \
  echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi

The passphrase for the key in this example is left blank. If this were to run on a public network it could be a security hole. Distribute the public key from the master node to all other nodes for data exchange. All nodes are assumed to run in a secure network behind the firewall.

Hot Tip

All the bash shell commands in this Refcard are available for cutting and pasting from: http://ciurana.eu/DeployingHadoopDZone

Enterprise: CDH Prerequisites

Cloudera simplified the installation process by offering packages for Ubuntu Server and Red Hat Linux distributions.

Hot Tip

CDH packages have names like CDH2, CDH3, and so on, corresponding to the CDH version. The examples here use CDH3. Use the appropriate version for your installation.

CDH on Ubuntu Pre-Install Setup

Execute these commands as root or via sudo to add the Cloudera repositories:

Listing 2 - Ubuntu Pre-Install Setup

DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
	$DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
	$DISTRO-cdh3 contrib" >> "$REPO"
apt-get update

CDH on Red Hat Pre-Install Setup

Run these commands as root or through sudo to add the yum Cloudera repository:

Listing 3 - Red Hat Pre-Install Setup

curl -sL http://is.gd/3ynKY7 | tee \
	/etc/yum.repos.d/cloudera-cdh3.repo | \
	awk '/^name/'
yum update yum

Ensure that all the pre-required software and configuration are installed on every machine intended to be a Hadoop node. Don’t mix and match operating systems, distributions, Hadoop, or Java versions!

Hadoop for Development

  • Hadoop runs as a single Java process, in non-distributed mode, by default. This configuration is optimal for development and debugging.
  • Hadoop also offers a pseudo-distributed mode, in which every Hadoop daemon runs in a separate Java process. This configuration is optimal for development and will be used for the examples in this guide.

Hot Tip

If you have an OS X or a Windows development workstation, consider using a Linux distribution hosted on VirtualBox for running Hadoop. It will help prevent support or compatibility headaches.

Hadoop for Production

  • Production environments are deployed across a group of machines that make up the computational network. Hadoop must be configured to run in fully distributed, clustered mode.

APACHE HADOOP INSTALLATION

This Refcard is a reference for development and production deployment of the components shown in Figure 1. It includes the components available in the basic Hadoop distribution and the enhancements that Cloudera released.

Figure 1 - Hadoop Components

Hot Tip

Whether the user intends to run Hadoop in non-distributed or distributed modes, it’s best to install every required component in every machine in the computational network. Any computer may assume any role thereafter.

A non-trivial, basic Hadoop installation includes at least these components:

  • Hadoop Common: the basic infrastructure necessary for running all components and applications
  • HDFS: the Hadoop Distributed File System
  • MapReduce: the framework for large data set distributed processing
  • Pig: an optional, high-level language for parallel computation and data flow

Enterprise users often choose CDH because of:

  • Flume: a distributed service for efficient large data transfers in real-time
  • Sqoop: a tool for importing relational databases into Hadoop clusters

Apache Hadoop Development Deployment

The steps in this section must be repeated for every node in a Hadoop cluster. Downloads, installation, and configuration could be automated with shell scripts. All these steps are performed as the service user hadoop, defined in the prerequisites section.

http://hadoop.apache.org/common/releases.html has the latest version of the common tools. This guide uses version 0.20.2.

  1. Download Hadoop from a mirror and unpack it in the /home/hadoop work directory.
  2. Set the JAVA_HOME environment variable.
  3. Set the run-time environment:

Listing 4 - Set the Hadoop Runtime Environment

version=0.20.2 # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
cp "$runtimeEnv" "$runtimeEnv".org
echo "export \
HADOOP_SLAVES=$HADOOP_HOME/slaves" \
>> "$runtimeEnv"
mkdir "$HADOOP_HOME"/slaves
echo \
"export HADOOP_IDENT_STRING=$identity" >> \
"$runtimeEnv"
echo \
"export JAVA_HOME=$JAVA_HOME" \
>>"$runtimeEnv"
export \
PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv

Configuration

Pseudo-distributed operation (each daemon runs in a separate Java process) requires updates to core-site.xml, hdfs-site.xml, and the mapred-site.xml. These files configure the master, the file system, and the MapReduce framework and live in the runtime/conf directory.

Listing 5 - Pseudo-Distributed Operation Config

<!-- core-site.xml -->
<configuration>
 <property>
	<name>fs.default.name</name>
	<value>hdfs://localhost:9000</value>
 </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
 <property>
	<name>dfs.replication</name>
	<value>1</value>
 </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
 <property>
	<name>mapred.job.tracker</name>
	<value>localhost:9001</value>
 </property>
</configuration>

These files are documented in the Apache Hadoop Clustering reference, http://is.gd/E32L4s — some parameters are discussed in this Refcard’s production deployment section.
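
All three files share the same layout: a configuration element holding property entries, each with a name and a value. The following plain-Java sketch reads such a document into a map, which can serve as a quick sanity check of an edited file. It is an illustration only and does not use Hadoop's own configuration classes; the class name SiteConfigReader is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Plain-Java illustration of the name/value layout of the *-site.xml files.
// SiteConfigReader is a hypothetical name; Hadoop itself is not involved.
class SiteConfigReader {
    // Parses the XML text and returns property name -> trimmed value, in document order.
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Malformed XML or a property without name/value ends up here.
            throw new IllegalArgumentException("Not a valid configuration document", e);
        }
    }
}
```

Reading the core-site.xml from Listing 5 this way yields a single entry, fs.default.name mapped to hdfs://localhost:9000.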

Test the Hadoop Installation

Hadoop requires a formatted HDFS cluster to do its work:


hadoop namenode -format

The HDFS volume lives on top of the standard file system. The format command will show this upon successful completion:


/tmp/dfs/name has been successfully formatted.

Start the Hadoop processes and perform these operations to validate the installation:

  • Use the contents of runtime/conf as known input
  • Use Hadoop for finding all text matches in the input
  • Check the output directory to ensure it works

Listing 6 - Testing the Hadoop Installation

start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'

Hot Tip

You may ignore any warnings or errors about a missing slaves file.

  • View the output files in the HDFS volume and stop the Hadoop daemons to complete testing the install

Listing 7 - Job Completion and Daemon Termination

hadoop fs -cat output/*
stop-all.sh

That’s it! Apache Hadoop is installed in your system and ready for development.

CDH Development Deployment

CDH removes a lot of grueling work from the Hadoop installation process by offering ready-to-go packages for mainstream Linux server distributions. Compare the instructions in Listing 8 against the previous section. CDH simplifies installation and configuration for huge time savings.

Listing 8 - Installing CDH

ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
then command="/usr/bin/yum"; fi
"$command" install \
hadoop-"$ver"-conf-pseudo
unset command ; unset ver

Leveraging some or all of the extra components in CDH is another good reason for using it over the Apache version. Install Pig, Flume, or Sqoop with the instructions in Listing 9.

Listing 9 - Adding Optional Components

apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop

Test the CDH Installation

The CDH daemons are ready to be executed as services. There is no need to create a service account for executing them. They can be started or stopped as any other Linux service, as shown in Listing 10.

Listing 10 - Starting the CDH Daemons

for s in /etc/init.d/hadoop* ; do \
“$s” start; done

CDH will create an HDFS partition when its daemons start. It’s another convenience it offers over regular Hadoop. Listing 11 shows how to validate the installation by:

  • Listing the HDFS volume
  • Moving files to the HDFS volume
  • Running an example job
  • Validating the output

Listing 11 - Testing the CDH Installation

hadoop fs -ls /
# run a job:
pushd /usr/lib/hadoop
hadoop fs -put /etc/hadoop/conf input
hadoop fs -ls input
hadoop jar hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
# Validate it ran OK:
hadoop fs -cat output/*

The daemons will continue to run until the server stops. All the Hadoop services are available.

Monitoring the Local Installation

Use a browser to check the NameNode or the JobTracker state through their web UI and web services interfaces. All daemons expose their data over HTTP. Users can choose to monitor a node or daemon interactively using the web UI, as in Figure 2. Developers, monitoring tools, and system administrators can use the same ports for tracking the system performance and state using web service calls.

Figure 2 - NameNode status web UI

The web interface can be used for monitoring the JobTracker, which dispatches tasks to specific nodes in a cluster, the DataNodes, or the NameNode, which manages directory namespaces and file nodes in the file system.

HADOOP MONITORING PORTS

Use the information in Table 1 for configuring a development workstation or production server firewall.

Port Service
50030 JobTracker
50060 TaskTrackers
50070 NameNode
50075 DataNodes
50090 Secondary NameNode
50105 Backup Node
Table 1 - Hadoop ports

Plugging a Monitoring Agent

The Hadoop daemons also expose internal data over a RESTful interface. Automated monitoring tools like Nagios, Splunk, or SOBA can use them. Listing 12 shows how to fetch a daemon’s metrics as a JSON document:

Listing 12 - Fetching Daemon Metrics

http://localhost:50070/metrics?format=json

All the daemons expose these useful resource paths:

  • /metrics - various data about the system state
  • /stacks - stack traces for all threads
  • /logs - enables fetching logs from the file system
  • /logLevel - interface for setting log4j logging levels

Each daemon type also exposes one or more resource paths specific to its operation. A comprehensive list is available from: http://is.gd/MBN4qz
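
A monitoring agent typically combines a port from Table 1 with one of the resource paths above. The following plain-Java sketch only builds such URLs; it performs no HTTP call. The class name MonitoringUrls and its method are hypothetical, and the ports are the defaults listed in Table 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of building monitoring URLs from Table 1.
// MonitoringUrls is a hypothetical name; no request is actually sent.
class MonitoringUrls {
    // Service name -> default web port, copied from Table 1.
    static final Map<String, Integer> PORTS = new LinkedHashMap<>();
    static {
        PORTS.put("JobTracker", 50030);
        PORTS.put("TaskTrackers", 50060);
        PORTS.put("NameNode", 50070);
        PORTS.put("DataNodes", 50075);
        PORTS.put("Secondary NameNode", 50090);
        PORTS.put("Backup Node", 50105);
    }

    // Builds the URL for a resource path such as /metrics or /stacks on the given host.
    static String url(String host, String service, String resourcePath) {
        Integer port = PORTS.get(service);
        if (port == null) {
            throw new IllegalArgumentException("Unknown service: " + service);
        }
        return "http://" + host + ":" + port + resourcePath;
    }
}
```

For example, url("localhost", "NameNode", "/metrics?format=json") produces the address shown in Listing 12.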

APACHE HADOOP PRODUCTION DEPLOYMENT

The fastest way to deploy a Hadoop cluster is by using the prepackaged tools in CDH. They include all the same software as the Apache Hadoop distribution but are optimized to run in production servers and with tools familiar to system administrators.

Hot Tip

Detailed guides that complement this Refcard are available from Cloudera at http://is.gd/RBWuxm and from Apache at http://is.gd/ckUpu1.

Figure 3 - Hadoop Computational Network

The deployment diagram in Figure 3 describes all the participating nodes in a computational network. The basic procedure for deploying a Hadoop cluster is:

  • Pick a Hadoop distribution
  • Prepare a basic configuration on one node
  • Deploy the same pre-configured package across all machines in the cluster
  • Configure each machine in the network according to its role

The Apache Hadoop documentation shows this as a rather involved process. The added value of CDH is that most of that work is already in place. Role-based configuration is very easy to accomplish. The rest of this Refcard will be based on CDH.

Handling Multiple Configurations: Alternatives

Each server role will be determined by its configuration, since they will all have the same software installed. CDH supports the Ubuntu and Red Hat mechanism for handling alternative configurations.

Hot Tip

Check the man page to learn more about alternatives. Ubuntu: man update-alternatives. Red Hat: man alternatives.

The Linux alternatives mechanism ensures that all files associated with a specific package are selected as a system default. This customization is where all the extra work went into CDH. The CDH installation uses alternatives to set the effective CDH configuration.

Setting Up the Production Configuration

Listing 13 takes a basic Hadoop configuration and sets it up for production.

Listing 13 - Set the Production Configuration

ver="0.20"
prodConf="/etc/hadoop-$ver/conf.prod"
cp -Rfv /etc/hadoop-"$ver"/conf.empty \
"$prodConf"
chown hadoop:hadoop "$prodConf"
# activate the new configuration:
alt="/usr/sbin/update-alternatives"
if [ ! -e "$alt" ]; then alt="/usr/sbin/alternatives"; fi
"$alt" --install /etc/hadoop-"$ver"/conf \
hadoop-"$ver"-conf "$prodConf" 50
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" restart; done

The server will restart all the Hadoop daemons using the new production configuration.

Figure 4 - Hadoop Conceptual Topology

Readying the NameNode for Hadoop

Pick a node from the cluster to act as the NameNode (see Figure 3). All Hadoop activity depends on having a valid R/W file system. Format the distributed file system from the NameNode, using user hdfs:

Listing 14 - Create a New File System

sudo -u hdfs hadoop namenode -format

Stop all the nodes to complete the file system, permissions, and ownership configuration. Optionally, set daemons for automatic startup using rc.d.

Listing 15 - Stop All Daemons

# Run this in every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do
  "$h" stop
  # Optional command for auto-start (update-rc.d expects the service name):
  update-rc.d "$(basename "$h")" defaults
done

File System Setup

Every node in the cluster must be configured with appropriate directory ownership and permissions. Execute the commands in Listing 16 in every node:

Listing 16 - File System Setup

mkdir -p /data/1/dfs/nn /data/2/dfs/nn
mkdir -p /data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
mkdir -p /data/1/mapred/local \
/data/2/mapred/local
chown -R hdfs:hadoop /data/1/dfs/nn \
/data/2/dfs/nn /data/1/dfs/dn \
/data/2/dfs/dn /data/3/dfs/dn \
/data/4/dfs/dn
chown -R mapred:hadoop \
/data/1/mapred/local \
/data/2/mapred/local
chmod -R 755 /data/1/dfs/nn \
/data/2/dfs/nn \
/data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
chmod -R 755 /data/1/mapred/local \
/data/2/mapred/local
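
It may be worth confirming the result of Listing 16 before starting any daemon. The sketch below compares a directory's owner, group, and mode against expected values. It assumes GNU stat (for the -c option), and check_dir is a hypothetical helper name.

```shell
# Sketch: verify that a directory has the expected owner:group and octal mode.
# check_dir is a hypothetical helper; stat -c is the GNU coreutils form.
check_dir() {
  dir="$1"
  want_owner="$2"   # for example hdfs:hadoop
  want_mode="$3"    # for example 755
  actual="$(stat -c '%U:%G %a' "$dir")" || return 1
  if [ "$actual" != "$want_owner $want_mode" ]; then
    echo "MISMATCH $dir: have '$actual', want '$want_owner $want_mode'" >&2
    return 1
  fi
}

# Example, using directories from Listing 16:
#   for d in /data/1/dfs/nn /data/2/dfs/nn; do
#     check_dir "$d" hdfs:hadoop 755
#   done
#   check_dir /data/1/mapred/local mapred:hadoop 755
```

The function only inspects the directory itself. It does not recurse the way chown -R and chmod -R do.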

Starting the Cluster
  • Start the NameNode to make HDFS available to all nodes
  • Set the MapReduce owner and permissions in the HDFS volume
  • Start the JobTracker
  • Start all other nodes

CDH daemons are defined in /etc/init.d. They can be configured to start along with the operating system, or they can be started manually. Execute the command appropriate for each node type using this example:

Listing 17 - Starting a Node Example

# Run this on the node that hosts the daemon
ver=0.20
/etc/init.d/hadoop-"$ver"-namenode start

Use jobtracker, datanode, tasktracker, etc. corresponding to the node you want to start or stop.

Hot Tip

Refer to the Linux distribution’s documentation for information on how to start the /etc/init.d daemons with the chkconfig tool.

Listing 18 - Set the MapReduce Directory Up

sudo -u hdfs hadoop fs -mkdir \
/mapred/system
sudo -u hdfs hadoop fs -chown mapred \
/mapred/system

Update the Hadoop Configuration Files

Listing 19 - Minimal HDFS Config Update

<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
  <final>true</final>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
  <final>true</final>
</property>
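
Before restarting the daemons, it can help to confirm that every local directory named in a comma-separated value such as dfs.name.dir or dfs.data.dir actually exists on the node. This is a minimal sketch rather than a CDH tool: missing_dirs is a hypothetical helper name, and it assumes that the paths contain no spaces.

```shell
# Sketch: print each directory from a comma-separated list that does not exist.
# missing_dirs is a hypothetical helper; paths must not contain spaces.
missing_dirs() {
  # Turn commas, newlines, and tabs into spaces, then put one path per line,
  # so a value copied from a multi-line <value> element also works.
  printf '%s\n' "$1" | tr ',\n\t' '   ' | tr -s ' ' '\n' |
  while IFS= read -r d; do
    [ -z "$d" ] && continue
    [ -d "$d" ] || printf '%s\n' "$d"
  done
}

# Example, using the dfs.data.dir value from Listing 19:
#   missing_dirs "/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn"
```

No output means that every listed directory is present. The check applies only to local paths, not to directories that live inside HDFS.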

The last step consists of configuring the MapReduce nodes to find their local working and system directories:

Listing 20 - Minimal MapReduce Config Update

<!-- mapred-site.xml -->
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local</value>
  <final>true</final>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
  <final>true</final>
</property>

Start the JobTracker and all other nodes. You now have a working Hadoop cluster. Use the commands in Listing 11 to validate that it’s operational.

WHAT’S NEXT?

The instructions in this Refcard result in a working development or production Hadoop cluster. Hadoop is a complex framework that takes ongoing attention to configure and maintain. Review the Apache Hadoop and Cloudera CDH documentation. Pay particular attention to the sections on:

  • How to write MapReduce, Pig, or Hive applications
  • Multi-node cluster management with ZooKeeper
  • Hadoop ETL with Sqoop and Flume

Happy Hadoop computing!

STAYING CURRENT

Do you want to know about specific projects and use cases where Hadoop and data scalability are the hot topics? Join the scalability newsletter: http://ciurana.eu/scalablesystems

About The Authors

Eugene Ciurana

Eugene Ciurana (http://eugeneciurana.eu) is the VP of Technology at Badoo.com, the largest dating site worldwide, and cofounder of SOBA Labs, the most sophisticated public and private cloud management software. Eugene is also an open-source evangelist who specializes in the design and implementation of mission-critical, high-availability systems. He recently built scalable computational networks for leading financial, software, insurance, SaaS, government, and healthcare companies in the US, Japan, Mexico, and Europe.

Publications
  • Developing with Google App Engine, Apress
  • DZone Refcard #117: Getting Started with Apache Hadoop
  • DZone Refcard #105: NoSQL and Data Scalability
  • DZone Refcard #43: Scalability and High Availability
  • The Tesla Testament: A Thriller, CIMEntertainment

Thank You!

Thanks to all the technical reviewers, especially to Pavel Dovbush at http://dpp.su

Recommended Book

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open-source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems; programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

