Babeldoc Userguide

Bruce McDonald

Dejan Krsmanovic


Table of Contents

A Call To Arms!
1. Introduction
Babeldoc
Fundamental terms and expressions
Running the Examples
The Java Runtime Environment
Windows
Unix-like systems
Testing your Babeldoc installation
Running a document through the pipeline
Inspecting the Pipeline
Inspecting the Journal
Modules
Configuration
LightConfig
Setting up a new project
2. Pipelines
Introduction
Pipeline Documents
Body
Attributes
Pipeline types
SimplePipelineStageFactory
Xml Pipeline Stage Factory
Multithreaded Operation
Processors
Feeders
Pipeline stage types
CallStage
Compression
Decompression
DecryptDocument
Domify
Enrich
ExternalApplication
FileWriter
FlatToXml
FtpWriter
HttpClient
JavaXmlDecoder
JournalUpdate
JTidy
Null
Reader
Router
RssChannel
Scripting
Signer
SmtpWriter
SoapWriter
SocketWriter
SqlEnrich
SqlQuery
SqlWriter
SvgTranscode
VelocityTemplatize
XlsToXml
XpathExtract
XpathSplitter
XslFoTransform
XslTransform
ZipArchiveWriter
Handling errors in pipeline stages
Tracking documents
Ignoring Pipelinestages
Pipeline Tool
3. Resources
Introduction
jdbc
jndi
pooled
4. Journal
Introduction
Journal Operations
Journal Implementations
Simple Journal
Jdbc Journal
Ejb Journal Implementation
Journal Tool
5. Scanner
Introduction
Starting scanner
Configuration
DirectoryScanner
null
HttpScanner
MailboxScanner
SqlScanner
FtpScanner
ExternalApplicationScanner
6. Flat file conversions
Introduction
configuration
Header
Paragraphs
conversion XML document
CSV files
Flat lines
Paragraphs
Line Segments
7. HOW TOs
Introduction
HOWTO Set up Eclipse with Babeldoc
HOWTO Read an attribute from external XML file
HOWTO Access the attributes of a pipeline document inside an XSLT
HOWTO Package up your application into a single jar file for easy distribution
A. The Apache Software License, Version 1.1

List of Examples

1.1. Running the demonstration pipeline
1.2. Inspecting the pipeline
1.3. Inspecting the Journal
1.4. Listing the modules
1.5. Listing configuration data
1.6. Tracing a configuration key
2.1. Adding attributes from the command-line
2.2. Declaring a 'Simple' Pipeline
2.3. Defining a 'Simple' Pipeline
2.4. XML Pipeline
2.5. Using another pipeline stage processor

A Call To Arms!

If you are using Babeldoc, please participate in its development. This can range from just sending a hello, or posting bug reports, to actually contributing code. We want to hear from you on the Babeldoc forums. There is also a mailing list hosted on SourceForge, babeldoc-devel, that will be very useful for technical users of Babeldoc. Please join for the latest news.

Chapter 1. Introduction

Babeldoc

Babeldoc is a document processing system. It is especially suited to Business-to-Business (B2B) environments and similar integration projects. Babeldoc has a flexible and reconfigurable processing pipeline through which documents flow; these pipelines transform the document. Additionally, Babeldoc has a sophisticated and extensible journaling system so that documents may be reprocessed and resubmitted as well as tracked through the system. Its runtime environment is flexible, so it can run standalone, in a web container or in a J2EE container (currently tested on JBoss 2.4.x and JBoss 3.0.x). Babeldoc has a Web console, a number of GUI tools and a command-line console to control the document flow.

Note

The flow-based metaphor is an appropriate one that will be expanded on in this document.

Babeldoc can be used to process documents which are flowing in and out of a system. There are three basic ways to develop applications to handle these document flows:

  1. Hand-coded standalone applications have the benefit of simplicity. However, these applications can be slow (Java start-up times are significant) and inflexible; they are hard to maintain, errors are difficult to track, and the result can be a fragile system. Such systems can also be non-portable in that they make assumptions about their environments.
  2. Hand-coded RMI (or similar protocol-based) server applications perform better than (1) but can be hard to administer and control (you have to do this yourself).
  3. Server-based processes are better in terms of manageability and scalability but can be overkill for simple situations.

How about a system that can be implemented in any of these ways but still be centrally managed and configured? Additionally, how about a system that seems like (1) but is actually implemented as a server architecture (3)? Babeldoc can be configured to run in any one of these three ways.

Additionally, document processing sometimes fails, and then the issue is how you react. Babeldoc has a very sophisticated journaling function that allows administrators to reintroduce documents at any place in the pipeline.

Fundamental terms and expressions

  1. Document: A document is a string of characters of arbitrary length. Additionally, a document can be enriched by each stage in the pipeline; this is useful for carrying expensively obtained information along with the document.
  2. Pipeline: A pipeline is a series of processing stages that successively apply transformations to the document. There may be one or more pipelines defined, but each pipeline must have a unique name. Examples of pipeline functionality are: convert a flat file to an XML document, convert one XML document to another XML document, extract variables from an XML document, split a document into sub-documents, and write the document to disk. Each pipeline must have an entry stage. See below for stages.
  3. PipelineStage: A pipeline stage is the smallest step in a pipeline. Each pipeline-stage has a unique name and a type. The name of the pipeline-stage must uniquely identify that pipeline-stage within its pipeline. Each pipeline-stage informs the pipeline of the name of the next pipeline-stage. Examples of pipeline-stages are: FileWriter (write the document to disk), XslTransform (convert one XML document to another XML document).
  4. Journal: The sub-system of Babeldoc that is concerned with capturing the transformations that are applied to a document as it moves through the pipelines.
  5. Ticket: The ticket is a unique identifier that is associated with every document in the system. This allows the journaling system to reload the document associated with that ticket. The ticket is a 64-bit integer.
  6. Step: A step is a single operation that can be associated with a ticket and is journaled to the store. The ticket and the step together uniquely identify a document at a particular stage in a pipeline.
  7. TicketStep: The combination of a ticket with a step to uniquely identify an operation upon a document.
  8. Replaying: Replaying the journal takes a ticket and step and reintroduces the document into the pipeline at the specified pipeline stage.
  9. Scanner: A standalone application that monitors resources (disks, FTP sites) and, when a file is found, makes a document and introduces it to the specified pipeline. This is a powerful tool and can be made to perform some intricate tasks.
  10. Feeder (1): A standalone application that introduces a document into a pipeline.
  11. Feeder (2): An internal babeldoc processing component that governs how documents are fed to pipelines.
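The journaling terms above fit together simply: a document receives a ticket when it enters the system, and every journaled operation increments a step counter for that ticket. The sketch below is a toy model of that relationship (the names and API are illustrative only, not Babeldoc's actual journal classes):

```python
import itertools

class Journal:
    """Toy model of ticket/step bookkeeping, for illustration only."""

    def __init__(self):
        self._tickets = itertools.count(1)  # real Babeldoc tickets are 64-bit integers
        self._next_step = {}                # ticket -> next step number

    def new_ticket(self):
        ticket = next(self._tickets)
        self._next_step[ticket] = 0
        return ticket

    def record(self, ticket, stage, op):
        step = self._next_step[ticket]
        self._next_step[ticket] = step + 1
        # the (ticket, step) pair is the TicketStep: it uniquely
        # identifies this operation upon this document
        return (ticket, step)

journal = Journal()
t = journal.new_ticket()
assert journal.record(t, "test.entry", "updateDocument") == (t, 0)
assert journal.record(t, "test.transform", "updateStatus") == (t, 1)
```

Replaying, in these terms, means taking one recorded (ticket, step) pair, reloading the stored document, and handing it back to the named stage.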

Running the Examples

Babeldoc comes ready to run with demonstration pipelines right out of the box. There are two pipelines configured: demo and test. They provide examples of how to construct a pipeline using the simple configuration style (property files) and the more sophisticated XML-based configuration. There are additional pipelines configured, including one to recreate the documentation you are currently reading in a number of different formats.

In addition, the usage whitepaper, also found in the readme directory, provides a walk-through of the creation of two usage scenarios.

In order for these examples to run correctly, please ensure that the following instructions are followed.

The Java Runtime Environment

Download and install the JRE. Babeldoc requires version 1.4 or greater. It may be possible to run parts of Babeldoc using 1.3, but this is not supported. Ensure that java is in your path; this is essential. To test this, execute the command java from the command line. If you get a "command not found" error, you will need to add the jre/bin directory to your path. It is also useful to have JAVA_HOME set, or you may receive warning messages.

Windows

The BABELDOC_HOME and PATH environment variables must be set. This is done as follows:

  • C:> set BABELDOC_HOME=c:\babeldoc
  • C:> set PATH=c:\babeldoc\bin;%PATH%

It is assumed that you have installed Babeldoc in the c:\babeldoc directory. If it is in another directory, please change accordingly.

Unix-like systems

The BABELDOC_HOME and PATH environment variables must be set. This is done as follows:

  • $ export BABELDOC_HOME=/opt/babeldoc
  • $ export PATH=$BABELDOC_HOME/bin:$PATH

It is assumed that you have installed Babeldoc in the /opt/babeldoc directory. If it is in another directory, please change accordingly. It is quite possible that you will have installed it in a non-privileged location.

Testing your Babeldoc installation

Open a command window. Ensure that the paths are correct for your platform. Run the command:

  • babeldoc

The output from this command should look similar to this:

*** This is Babeldoc! ***
Usage: babeldoc <command>
where command must be one of:
xls2xml, scanmon, addstagewiz, setupwiz, lightconfig, sqlload, pipeline, setentrywiz, journal, process, journalbrowser, flat2xml, scanner, guiprocess, pipelinebuilder, babelfish, module.
Babeldoc 1.3.0 Copyright (C) 2002,2003,2004 The Babeldoc Team!!
Babeldoc comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it under certain conditions

If your output is not like this or you get an error, please check the paths and the JRE requirements.

Running a document through the pipeline

Jumping right in, you can run the demonstration pipelines:

Example 1.1. Running the demonstration pipeline

  • babeldoc process -p test -f test/quickstart/stats.xml

You will see a number of logging messages scroll over the screen, and in the current directory you will find a file named stats.html. Take a look at this file using your favorite browser; to many this is a more pleasant way to look at sports scores. This is the output file from the processing of the input file, stats.xml. It is interesting to note that the file stats.xml does not actually reside in the filesystem as a file; it is in the Babeldoc core Java archive.

<2002-07-25 23:37:51,279> <root> <INFO> Process stage: entry
<2002-07-25 23:37:52,692> <root> <INFO> Process stage: transform
<2002-07-25 23:37:53,382> <root> <INFO> Process stage: choose
<2002-07-25 23:37:53,400> <root> <INFO> Process stage: writer
Processed. Ticket: 1027654671023 assigned

Note that this is an example of what your output looks like. The ticket number will be different. Use your ticket number in the following examples.

Note that some versions may not print the ticket number. The ticket number can be found by running babeldoc journal -L, or with the Journal Browser tool by running babeldoc journalbrowser.

Inspecting the Pipeline

The pipeline can be inspected using the pipeline tool. To see the options, simply type:

Example 1.2. Inspecting the pipeline

  • babeldoc pipeline

The options for the pipeline tool will be printed to the screen. Please experiment with this tool. To interrogate the configuration of, say, the entry stage, issue the command:

  • babeldoc pipeline -C test.entry

Notice the common syntax for accessing pipeline-stages: pipeline-name.pipelinestage-name.

Inspecting the Journal

The journal tracks documents moving through the pipelines. It can also track the changes to the documents as well as the status of each stage in the pipeline. The tool to access the journal functionality from the command-line is:

Example 1.3. Inspecting the Journal

  • babeldoc journal

The journal tool is primarily suited to querying the journal data. Please experiment with all the options. For now, though, review all the steps that occurred during the processing in 1.2.1. To do this, you must use the ticket number printed during your session (not 1027654671023 as below).

  • babeldoc journal -T 1027654671023

This will result in the following output

ticket: step: 0; date: Thu Jul 25 23:53:43 EDT 2002; stage: null; op: newTicket; other: null
ticket: step: 1; date: Thu Jul 25 23:53:43 EDT 2002; stage: test.entry; op: updateDocument; other:
ticket: step: 2; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.extract; op: updateStatus; other: success
ticket: step: 3; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.transform; op: updateStatus; other: success
ticket: step: 4; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.choose; op: updateDocument; other:
ticket: step: 5; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.choose; op: updateStatus; other: success
ticket: step: 6; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.writer; op: updateStatus; other: success

This listing shows that the document resulted in seven recorded steps in the journal (numbered 0 to 6). Most of these steps are updateStatus steps, which merely record the processing status of the document at the corresponding stage in the pipeline. The updateDocument steps indicate points in the pipeline where the entire document was stored. At these steps it is possible to extract the document and display its contents (journal -D) or even reprocess the document (journal -R). Display the document at step 4 with journal -D 1027654671023.4.

It is possible to toggle the document tracking at any stage in a pipeline. This is done by setting the tracked flag in the pipeline configurations. Please review the pipeline documentation.

Modules

Babeldoc is a modular piece of software. Each module successively adds to and refines its operation. Although a module participates both in the runtime and in the build of Babeldoc, this document is concerned with the runtime aspects of modules.

Example 1.4. Listing the modules

There is a commandline tool to list the current set of modules known to Babeldoc:

  • babeldoc module -l

This will list the modules and their dependencies

Module: core is dependent on:
Module: web is dependent on: core
Module: gui is dependent on: core
Module: crypto is dependent on: core
Module: xslfo is dependent on: core
Module: soap is dependent on: web, core
Module: sql is dependent on: core
Module: scanner is dependent on: core, sql
Module: babelfish is dependent on: core
Module: conversion is dependent on: core

It is possible to remove and add modules to Babeldoc. The current set of modules is installed in the directory $BABELDOC_HOME/lib. The standard modules are named: babeldoc_module-name.

Configuration

All configuration data within Babeldoc is handled in a structured fashion. Every configuration key must be contained in a configuration file. Configuration files are arranged hierarchically, very much like regular files in a filesystem's directories and subdirectories. The "directory" part of a configuration file name is specified with UNIX-style forward slashes separating directory names. The configuration key is a string which must be unique within a configuration file. An example is the configuration key Journal.simple, which is defined in the configuration file service/query.properties.

LightConfig

The configuration implementation of Babeldoc is itself configurable, but the default implementation is LightConfig. This stores the configuration data in properties files which are hierarchically arranged into directories. The files may be stored on the local filesystem or in archive (JAR) files.

The LightConfig implementation also has the very interesting, and sometimes perplexing, ability to merge configuration files with the same name into a single configuration file. This means that configuration file data does not overwrite other data except where the configuration key is identical; in that case, the configuration file specified at the end of the configuration search path is dominant. This is logical and is consistent with how the PATH (or CLASSPATH) environment variable is used by the command processor to search for executables, except that instead of the first match overriding all else, all of the matches are merged into a single "file".
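The merge behaviour can be sketched in a few lines. This is a minimal illustration, assuming each same-named configuration file is represented as a dictionary of keys, ordered from least to most dominant; it is not the actual LightConfig implementation:

```python
def merge_configs(files):
    """Merge property maps in search-path order; later (more dominant)
    files win only where a configuration key is identical."""
    merged = {}
    for props in files:
        merged.update(props)  # identical keys override, others accumulate
    return merged

# hypothetical example: a core file overridden by a project-level file
core = {"documentation.type": "simple", "demo.type": "xml"}
project = {"documentation.type": "custom"}
assert merge_configs([core, project]) == {
    "documentation.type": "custom",  # identical key: last file dominates
    "demo.type": "xml",              # unique keys are simply accumulated
}
```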

The configuration search path is very important: it is used by Babeldoc to determine where to find the configuration data and how to load it. The parts of the search path are given below:

  • module dependency: The module dependencies determine the builtin searchpath. This cannot be changed by the user except by excluding modules from Babeldoc runtime. The module search path moves from the core module to the most dependent module. In other words, a configuration key in a configuration file in the most dependent module will override the same key in the same file in the core module. This is important to how dependent modules "specialize" Babeldoc behavior.
  • BABELDOC_HOME: This environment variable is set by the user to indicate those directories which contain configuration information. This environment variable is structured just as the CLASSPATH variable is. The earlier path elements indicate less dominant paths. It is very important for projects to set this variable to either a directory or a JAR file which contains your configuration settings.
  • current directory: The current directory is automatically added to the configuration search path for convenience reasons.

There are times when the configuration does not work as expected. There is a small command-line tool which makes it easier to inspect the configuration files and see how each configuration key is modified along the search path. The tool, lightconfig, is illustrated below:

Example 1.5. Listing configuration data

  • babeldoc lightconfig -l pipeline/config

The location of the configuration file pipeline/config.properties in each part of the configuration search path is then listed. A typical output would be:

Listing urls for the configuration: pipeline/config.properties
1: jar:file:/c:/download/babeldoc/build/lib/babeldoc_core.jar!/core/pipeline/config.properties
0: file:/C:/work/vap_rpt/./pipeline/config.properties

This output indicates that the file pipeline/config.properties exists in the babeldoc_core.jar file and is then overridden in the directory C:/work/vap_rpt.

Example 1.6. Tracing a configuration key

  • babeldoc lightconfig -l pipeline/config -t documentation.type

This traces how a particular configuration key (documentation.type) found in the configuration file: pipeline/config.properties is modified in all the possible configuration files. A typical output would be:

Listing urls for the configuration: pipeline/config.properties
1: jar:file:/c:/download/babeldoc/build/lib/babeldoc_core.jar!/core/pipeline/config.properties
documentation.type = simple
0: file:/C:/work/vap_rpt/./pipeline/config.properties
documentation.type: not defined

This output indicates that the configuration key is defined once in the babeldoc_core.jar file and is not subsequently overridden.

Setting up a new project

This section briefly describes the steps necessary to set up a new Babeldoc project. In the interests of brevity, the following assumptions are made:

  • You are on a MS Windows environment.
  • You have installed the Java SDK into the c:\j2sdk1.4.2_04 directory.
  • Babeldoc is installed in the c:\babeldoc directory.
  • Your project is in the c:\project directory.

The simplest method of configuring your environment is to create a setup batch file in the project directory. This file is usually called setup.bat, but the name is unimportant. The purpose of the file is to configure the local environment so that Babeldoc can run. The contents of this file for this environment are given below:

@echo off
set JAVA_HOME=c:\j2sdk1.4.2_04
set BABELDOC_HOME=c:\babeldoc
set BABELDOC_USER=c:\project
set PATH=%PATH%;%BABELDOC_HOME%\bin

Prior to using Babeldoc, run this script. Now create the configuration files in this directory.

Chapter 2. Pipelines

Introduction

A pipeline is a program whose purpose is to transform a document into one or more resultant documents. An example pipeline could transform a received XML purchase order into a set of SQL statements intended to update a database, produce a printable PDF file for record keeping, and send a confirmation email to the originating party.

All of the pipelines in Babeldoc must have a unique name, like test or document. A pipeline is a set of processing steps arranged in a linear fashion. Each processing step is called a "pipeline stage", and each pipeline stage in a pipeline must have a unique name. Two pipelines may have a pipeline stage of the same name. There is a special pipeline stage in a pipeline, the entryStage, which indicates which pipeline stage should initially receive the document from the feeder mechanisms. It is also possible to introduce a document into the "middle" of a pipeline. In order to designate a particular stage in a pipeline, the name is given as pipeline-name.pipelinestage-name.

The pipelines in the Babeldoc system are managed by the pipeline factory, which determines how and when a pipeline runs. Each pipeline stage in a pipeline has a type and a set of configuration options. An example of a pipeline stage is the test.transform pipeline stage, whose type is XslTransform. This type of pipeline stage requires either the configuration option transformationFile, which supplies the filename (or URL) of the XSLT file to perform the transformation, or transformationScript, which is the inline XSLT document. There is an additional non-mandatory configuration option, bufferSize, which can help with larger transformations.
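As a sketch, the configuration of such a stage might look like the fragment below (the stage name transform and the stylesheet path are hypothetical; supply either transformationFile or transformationScript, not both):

```
transform.stageType=XslTransform
# an external stylesheet, given as a filename or URL...
transform.transformationFile=stylesheets/order-to-invoice.xsl
# ...or, alternatively, an inline XSLT document:
#transform.transformationScript=<xsl:stylesheet ...>...</xsl:stylesheet>
# optional; can help with larger transformations
transform.bufferSize=4096
```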

Pipeline Documents

The pipeline stages operate on documents. A useful metaphor is the pipes that constitute the plumbing in your home. Each of the stages in the plumbing pipeline represents bends, faucets and other functional requirements. A pipeline document is the water in the plumbing pipeline. A document is successively transformed by the pipeline until it is finally stored, discarded or otherwise disposed of. The transformations are determined by the pipeline and its stages. A document is primarily a number of bytes (characters) of data, characterized by a MIME type. There are a number of ways a document can be fed into a pipeline, namely:

  1. the pipeline feeder programs (including soapfeeder, socketfeeder)
  2. the scanner program
  3. the journal re-player

A document consists of the following components:

  1. Body - the data representation of the document.
  2. Attributes - enrichments of the document body
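This two-part structure can be sketched as a small class (the names below are hypothetical, not Babeldoc's actual document classes):

```python
class PipelineDocument:
    """Minimal sketch of a pipeline document: a body plus attribute enrichments."""

    def __init__(self, body, mime_type="text/xml"):
        self.body = body          # the raw character data of the document
        self.mime_type = mime_type
        self._attributes = {}     # enrichments carried alongside the body

    def set(self, name, value):
        self._attributes[name] = value

    def get(self, name):
        return self._attributes.get(name)

doc = PipelineDocument("<orders><order/><order/></orders>")
# e.g. cache the result of an expensive XPath count as an attribute:
doc.set("numOrders", 2)
assert doc.get("numOrders") == 2
```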

Body

The document body may be an XML document, a flat-file text document or a binary document. Significant processing can only be applied to XML documents in the standard Babeldoc. In order to convert flat files to XML documents, there is a conversion pipeline stage which can convert a number of flat-file formats to XML; please see the conversion chapter of this document. The XML functionality is the primary focus of the default Babeldoc distribution. Binary documents are also acceptable; however, Babeldoc does not have many stages that process binary documents. This does not mean that binary processing is not possible - it is. Examples could be processing of photographic images, sound files or even video files.

Attributes

Attributes are said to enrich the document because they are basically shortcuts to data found in the document itself or from some other source. The attributes can be applied to a document in the following ways:

  • Externally applied from the processor commandline using the -a switch or similar
  • Internally from the document data. The XpathExtract pipeline stage can apply an xpath expression on the document body and then store the result as an attribute on the document.
  • Internally from other sources. It is possible to apply attributes like the current time, etc on the document.

For instance, the number of purchase orders in a bulk purchase XML document can be extracted from the document (using the XPathExtract pipeline stage) and placed in the attribute named numOrders. This results in significant speed increases in subsequent processing because the attribute (numOrders) can be used instead of multiple expensive XPath operations. The attributes are available through the variable ${document.get("attribute.name")}. This means that it is possible to customize the pipeline processing based on extracted (or enriched) data from the document.

Attributes are not limited to data extracted from the document; they can also be options passed into the pipeline along with the document, like the email address to send the document to, the file path the document was read from, and more besides. These kinds of attributes allow the internal processing of the pipeline to be influenced by the external environment. The command babeldoc process will accept any number of name=value pairs on the command line. Each of the supplied attributes will be placed on the document and will be available to the pipeline stages in the pipeline. In the test pipeline it is possible to email the processed document by supplying the smtpHost, smtpFrom and smtpTo attributes.

Example 2.1. Adding attributes from the command-line

Instead of running the test pipeline as before, we can add a number of attributes to the process command line which will activate "hidden" stages in the pipeline.

  • babeldoc process -p test -f test/quickstart/foo.xml -a "smtpHost=mailer" -a "smtpFrom=you@here.com" -a "smtpTo=some.one@place.com"

Babeldoc has a rather powerful data abstraction mechanism. It is just as easy to read a file from your hard disk as it is to load it from a file in your classpath, or even from a website (http://...) or an FTP site (ftp://...). This means that your simple pipeline, which works on local files, will also work in a networked environment.

Pipeline types

The names of the pipelines and the configuration options for each pipeline are provided in the file config/pipeline/config.properties. Since this file (like every other configuration file) participates in the Babeldoc configuration system, you will need to create your own copy of this file in your configuration directory. Please see the configuration handling described in chapter 1. The handling of pipeline stages is performed by a set of PipelineStageFactories; the type of the pipeline determines which PipelineStageFactory will handle it. Here are the current pipeline factory types:

  1. SimplePipelineStageFactory This is the pipeline factory that is driven by regular textual property files. This is the quickest way to get a pipeline up and working.
  2. XmlPipelineStageFactory This is the pipeline factory that is driven by XML configuration files. It is anticipated that these kind of configurations are better suited to larger projects and for automated tools.
  3. ImplEjbPipelineStageFactory This is a stub to the EJB pipeline factory. This factory resides in a J2EE server instance. The actual pipeline factory in the server can be either the simple or the XML configuration factory.

SimplePipelineStageFactory

This pipeline factory is the simplest to set up. Its pipeline type is simple. This is indicated in the configuration file pipeline/config.properties, which declares the pipeline and provides its type and the configuration file that defines the pipeline. For instance, if the pipeline name is test, the type of the test pipeline is set with the entry test.type=simple. The pipeline definition configuration file (see more later) for the test pipeline is given as test.configFile=pipeline/your-config (note: the .properties extension is omitted from the file name).

Example 2.2. Declaring a 'Simple' Pipeline

The configuration file pipeline/config.properties shows how a simple pipeline called test is declared to Babeldoc. A subsequent example will show how the pipeline is defined.

test.type=simple
test.configFile=pipeline/simple/test

Notice that the configuration file for the pipeline is given as pipeline/simple/test; the actual name of the file is pipeline/simple/test.properties.

Notice here that the configuration is resolved relative to the config directory, which is in the classpath by default. So you can define your own configuration in a directory, say mega-project, and place this directory in the CLASSPATH, as:

  • (UNIX) export CLASSPATH=/mega-project
  • (WINDOWS) set CLASSPATH=c:\mega-project

In this directory, create a subdirectory called pipeline and, within it, the file config.properties (this location is mandated: the PipelineFactory looks for this file, and if you do not put your pipeline declarations in this configuration file, they will NOT be found). The declaration of your pipelines is done in this file. The pipeline configuration files themselves may be in the same directory as config.properties or in subdirectories of the pipeline directory - the choice is yours.

The actual definition of the pipeline is provided in the value of the pipeline-name.configFile property, which is specified in the pipeline/config.properties file. Each of the pipeline stages within the pipeline is defined there, as well as the document flow from one pipeline stage to the next.

Every simple pipeline definition document must contain the entryStage property. This property informs Babeldoc which pipeline stage is the starting point for the pipeline. If this property is not given in this file, processing of this pipeline results in an error.

Other than the entryStage property, every property in the pipeline definition file is of the form:

pipelinestage-name.option-1...option-n=value

The first part (up to the first period) is the name of the pipeline stage. The subsequent options (period-separated, up to the '=') are arguments to the pipeline stage. There are two kinds of options for each pipeline stage:

  • general - these options can be applied to all the pipelinestages
  • specific - these options are only applicable to the specific type of pipelinestage

Additionally there are mandatory and optional pipeline stage options. The pipeline will fail to run if a mandatory option is not provided. The following are general options:

  • stageType (mandatory) - This indicates the type of this pipeline stage
  • nextStage (mandatory) - This indicates the name of the next pipeline stage in the pipeline
  • ignore (optional) - This disables this pipeline stage from processing
  • tracked (optional) - This causes the entire document to be stored in the journal. This would allow this pipeline to be re-executed from this point with identical data

For the complete list of pipelinestage configuration options, please refer later in this chapter to the list of pipelinestages.
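The naming convention above can be sketched with a small parser. This is an illustration of the file layout only, not Babeldoc's actual parser:

```python
def parse_pipeline_definition(lines):
    """Split 'stage.option=value' lines into per-stage option maps.

    Returns (entry_stage, stages), where stages maps each pipeline stage
    name to its options. Comments and blank lines are skipped.
    """
    entry_stage, stages = None, {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        if key == "entryStage":
            entry_stage = value
        else:
            # stage name runs up to the first period; the rest is the option
            stage, _, option = key.partition(".")
            stages.setdefault(stage, {})[option] = value
    return entry_stage, stages

entry, stages = parse_pipeline_definition([
    "entryStage=entry",
    "entry.stageType=Null",
    "entry.nextStage=transform",
])
assert entry == "entry"
assert stages["entry"] == {"stageType": "Null", "nextStage": "transform"}
```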

Example 2.3. Defining a 'Simple' Pipeline

The pipeline is defined in a properties file which enumerates the pipelinestage configuration.

entryStage=entry
entry.stageType=Null
entry.nextStage=transform
entry.tracked=true
transform.stageType=XslTransform
transform.nextStage=choose
transform.transformationFile=test/quickstart/stats-html.xsl
transform.bufferSize=2048
choose.stageType=Router
choose.nextStage=writer
choose.tracked=true
choose.nextStage.emailer=#if(${document.get("smtpHost")})true#end
emailer.stageType=SmtpWriter
emailer.nextStage=writer
emailer.smtpHost=$document.get("smtpHost")
emailer.smtpFrom=$document.get("smtpFrom")
emailer.smtpTo=$document.get("smtpTo")
emailer.smtpSubject=Document: Ticket: ${ticket.Value}
emailer.smtpMessage=${document.toString()}
writer.stageType=FileWriter
writer.nextStage=null
writer.outputFile=${system.getProperty("user.dir")}/stats.html

The structure of this file is regular except for the entryStage. This property has to be present and its value is the name of the pipelinestage that is the starting point for this pipeline. If this property is not provided, Babeldoc cannot process this pipeline.

The rest of the properties in this pipeline stage definition file configure the 5 pipeline stages:

  • entry - this does nothing but store the document in the journal
  • transform - this stage uses XSL to convert the XML pipeline document into HTML
  • choose - This routes the document to the stage emailer if the attribute smtpHost is set; otherwise the nextStage is writer
  • emailer - This stage emails the document, using the attributes stored on the document
  • writer - this stage writes the document to the disk

Xml Pipeline Stage Factory

This factory builds pipelines from an XML document that completely describes all elements of a pipeline. The schema document for it is found in the directory readme/schema. The pipeline definition document has two areas: the static area and the dynamic area. The static area is optional and describes each of the types of pipeline stages available. The dynamic area is mandatory. It describes each of the pipeline stages in the system, their configuration options and the connections between them. The document is illustrated below:

pipelines
    static [0..1]
    dynamic [1]
        stage-instances [1..*]
            configuration [0..*]
        connections [1]

Example 2.4. XML Pipeline

The demonstration pipeline, demo, is defined using an XML pipeline stage factory. This file is given below:

<?xml version="1.0"?>
<pipeline xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.babeldoc.com/xsd/pipeline.xsd">
<documentation>This is a demonstration babel pipeline</documentation>
<pipeline-name>some-name</pipeline-name>
<dynamic>
<entry-stage>entry</entry-stage>
<!-- STAGES: Defines the stages -->
<stage-inst>
<stage-name>entry</stage-name>
<stage-desc>This does nothing</stage-desc>
<stage-type>Null</stage-type>
</stage-inst>
<stage-inst>
<stage-name>extract</stage-name>
<stage-desc>this extracts stuff</stage-desc>
<stage-type>XpathExtract</stage-type>
<option>
<option-name>XPath</option-name>
<option-value></option-value>
<sub-option>
<option-name>documentId</option-name>
<option-value>
/AppointmentDocument/DocumentHeader/DocumentId/text()
</option-value>
</sub-option>
<sub-option>
<option-name>senderId</option-name>
<option-value>
/AppointmentDocument/DocumentHeader/SenderId/text()</option-value>
</sub-option>
<sub-option>
<option-name>documentType</option-name>
<option-value>
/AppointmentDocument/DocumentHeader/DocumentType/text()
</option-value>
</sub-option>
<sub-option>
<option-name>documentVersion</option-name>
<option-value>
/AppointmentDocument/DocumentHeader/DocumentVersion/text()
</option-value>
</sub-option>
</option>
</stage-inst>
<stage-inst>
<stage-name>transform</stage-name>
<stage-desc>this transforms stuff</stage-desc>
<stage-type>XslTransform</stage-type>
<option>
<option-name>transformationFile</option-name>
<option-value>
${system.getProperty("user.dir")}/test/quickstart/foo.xsl
</option-value>
</option>
<option>
<option-name>bufferSize</option-name>
<option-value>2048</option-value>
</option>
</stage-inst>
<stage-inst>
<stage-name>choose</stage-name>
<stage-desc>this chooses stuff</stage-desc>
<stage-type>Router</stage-type>
<option>
<option-name>tracked</option-name>
<option-value>true</option-value>
</option>
<option>
<option-name>nextStage</option-name>
<option-value></option-value>
<sub-option>
<option-name>emailer</option-name>
<option-value><![CDATA[
#if(${document.get("smtpHost")})
true
#end
]]></option-value>
</sub-option>
</option>
</stage-inst>
<stage-inst>
<stage-name>emailer</stage-name>
<stage-desc>this emails stuff</stage-desc>
<stage-type>SmtpWriter</stage-type>
<option>
<option-name>smtpHost</option-name>
<option-value>$document.get("smtpHost")</option-value>
</option>
<option>
<option-name>smtpTo</option-name>
<option-value>$document.get("smtpTo")</option-value>
</option>
<option>
<option-name>smtpFrom</option-name>
<option-value>$document.get("smtpFrom")</option-value>
</option>
<option>
<option-name>smtpSubject</option-name>
<option-value>Document: Ticket: ${ticket.getValue()}</option-value>
</option>
<option>
<option-name>smtpMessage</option-name>
<option-value>
<![CDATA[${system.get("os.name")} - ${system.get("os.arch")} - ${system.get("os.version")}
Message:
${document.toString()}
]]></option-value>
</option>
</stage-inst>
<stage-inst>
<stage-name>writer</stage-name>
<stage-desc>this writes stuff</stage-desc>
<stage-type>FileWriter</stage-type>
<option>
<option-name>outputFile</option-name>
<option-value>${system.getProperty("user.dir")}/out1.xml</option-value>
</option>
<option>
<option-name>doneFile</option-name>
<option-value>.done</option-value>
</option>
</stage-inst>
<!-- Define the connections between stages -->
<connection>
<source>entry</source>
<sink>extract</sink>
</connection>
<connection>
<source>extract</source>
<sink>transform</sink>
</connection>
<connection>
<source>transform</source>
<sink>choose</sink>
</connection>
<connection>
<source>choose</source>
<sink>writer</sink>
</connection>
<connection>
<source>emailer</source>
<sink>writer</sink>
</connection>
<connection>
<source>writer</source>
<sink>null</sink>
</connection>
</dynamic>
</pipeline>

Multithreaded Operation

Babeldoc is capable of spawning multiple threads to process multiple pipelines in parallel and to process documents within each pipeline in parallel. This has important consequences for large-scale computing systems. This is an advanced topic; feel free to skip this section.

Processors

A processor determines how a pipeline handles documents which are returned by a pipeline stage. Some pipeline stages produce multiple documents from a single input document; XpathSplitter is one such stage. By default, Babeldoc processes each of the resultant documents in turn. It is also possible to process the resultant documents in parallel.

The following processors are available:

sync

Synchronously process the pipeline documents. Each document is processed serially - no new threads are created.

threadpool

Asynchronously process the pipeline documents using a threadpool. This is probably the most useful in a multithreaded environment.

  • poolSize (integer, 0..1) - The number of threads in the thread pool. This sets the maximum number of documents to process at one time. Default is 5.
  • keepAlive (integer, 0..1) - The number of milliseconds that an idle thread in the threadpool will remain alive before being reclaimed. Default is 15000.

async

Asynchronously process the pipeline documents

  • maxThreads (integer, 0..1) - The maximum number of threads that this processor can spawn. The pipeline stage may override this but can never exceed this value. Default is 5.

The standard processor is the sync processor. This can be overridden if necessary. The processor for each pipeline is given in the pipeline/config.properties file. This is specified by: pipeline-name.processor.type=processor-type.

Example 2.5. Using another pipeline stage processor

This example is also provided in the Babeldoc distribution as 'threads'. The following is a simple pipeline definition found in the directory pipeline/pipeline.properties.

entryStage=ffconvert
ffconvert.stageType=FlatToXml
ffconvert.flatToXmlFile=flatfile.xml
ffconvert.nextStage=splitter
splitter.stageType=XpathSplitter
splitter.XPath=/big-un/row
splitter.nextStage=writer
splitter.threaded=true
splitter.maxThreads=7
writer.stageType=FileWriter
writer.outputFile=out.txt
writer.nextStage=null

This simple pipeline definition accepts a text file, converts it to XML, then splits the XML using the XPath expression: /big-un/row. The resultant documents are all written to the same file, out.txt

There are three declared pipelines, all using the same pipeline definition. This is found in the file pipeline/config.properties below:

pipeline.type=simple
pipeline.configFile=pipeline/pipeline
asyncpipeline.type=simple
asyncpipeline.configFile=pipeline/pipeline
asyncpipeline.processor.type=async
asyncpipeline.processor.maxThreads=4
pooledpipeline.type=simple
pooledpipeline.configFile=pipeline/pipeline
pooledpipeline.processor.type=threadpool
pooledpipeline.processor.poolSize=10

The three pipelines: pipeline, asyncpipeline and pooledpipeline all illustrate the various processor configurations possible.

Feeders

A feeder is a strategy for getting documents into Babeldoc. The following feeders are available:

  • sync - this feeder synchronously feeds each document to the pipeline. The feeder waits until the processing has completed before feeding the next document
  • async - this feeder asynchronously feeds each document to the pipeline. The feeder immediately submits all documents and then returns. The documents are then submitted in parallel to the pipelines. The pipelines are run in parallel
  • async-disk - this feeder is like the async feeder except that the documents are spooled to a directory on the disk so that if the processing is terminated, the feeding may be restarted without any documents being lost

The configuration of each of the feeders is done using the configuration file feeder/config. Babeldoc comes with the following feeders:

# The generic feeders: synchronous
sync.type=synchronous
# The generic feeders: asynchronous - with an in-memory queue
async.type=asynchronous
async.queue=memory
# The "specific" feeders: asynchronous - with disk queue
async-d.type=asynchronous
async-d.queue=disk
async-d.queueDir=/tmp
async-d.queueName=async-d

The async feeders accept an additional parameter, poolSize, which limits the thread-pool size and therefore the maximum number of pipelines that can run in parallel.
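
As a sketch, the poolSize parameter could be added to the async feeder definition shown above (the value here is illustrative):

# feeder/config: cap the async feeder at 3 parallel pipelines
async.type=asynchronous
async.queue=memory
async.poolSize=3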

Pipeline stage types

There are a limited number of types of pipeline stages. Each of the stages performs a single function. The options available through the configurations change the operation of the stage. In order for your custom pipeline to do any useful work, you have to configure the pipeline stages. You can also create your own custom pipeline stage for specialized processing. See the documentation for each stage type.

CallStage

Allows a pipeline to call another pipeline. This pipeline stage is very useful in that it allows for modular pipeline configurations. The result of the called pipeline is either used instead of the current pipeline document or is discarded, depending on the setting of the discardResults configuration option.

This stage accepts the following configuration options:

  • stageType (service-name, 1..1) - Type of pipeline stage
  • nextStage (string, 1..1) - Name of the next stage in the pipeline, or null if this is the last stage
  • ignored (boolean, 0..1) - If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline
  • tracked (boolean, 0..1) - If this is set then this stage is tracked - the pipeline document is written to the journal
  • encoding (string, 0..1) - Encoding of the resulting document. This is used for text documents. Default is the system file.encoding
  • callStage (string, 1..1) - Pipeline to call
  • discardResults (boolean, 0..1) - Discard the pipeline document from the called stage
  • test (boolean, 0..1) - If this option is set and it evaluates to true, the call is made; otherwise the document is passed on unchanged
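
A CallStage might be wired into a properties pipeline like this (pipeline and stage names are invented for illustration):

# 'validate' delegates the document to another pipeline, then continues
validate.stageType=CallStage
validate.nextStage=writer
validate.callStage=validation-pipeline
# keep the called pipeline's result as the current document
validate.discardResults=false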

Compression

Compress the document using either zip or gzip compression. **EXPERIMENTAL**

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • compressType (enumeration, 0..1) - Compression type (zip or gzip)
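
A minimal sketch of a Compression stage in a properties pipeline (stage names are illustrative):

# gzip the document before writing it out
pack.stageType=Compression
pack.compressType=gzip
pack.nextStage=writer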

Decompression

Decompress the document using either zip or gzip decompression. **EXPERIMENTAL**

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • compressType (enumeration, 0..1) - Compression type (zip or gzip)

DecryptDocument

Cryptography helper

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • operation (enumeration, 0..1) - Encryption or decryption
  • transformation (string, 0..1) - The encryption transform type
  • useSessionKey (string, 0..1) - Use the session key
  • sessionKeyFile (directory-path, 0..1) - File containing the session key
  • sessionKeyAlgorithm (string, 0..1) - The session key algorithm
  • sessionKeySize (integer, 0..1) - Size of the session key

Domify

Domify the document contents (assumed to be XML) and save as an attribute on the pipeline document.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • validate (boolean, 0..1) - Validate the XML. Default is false.
  • schemaFile (directory-path, 0..1) - The schema file to validate against
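
A minimal sketch of a Domify stage in a properties pipeline (the schema path is illustrative):

# parse the document contents as XML and validate against a schema
domify.stageType=Domify
domify.validate=true
domify.schemaFile=schemas/order.xsd
domify.nextStage=transform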

Enrich

Adds attributes to the document. The value of the attribute can be a constant value or a velocity script.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • enrichScript (0..n) - List of enrichment attributes to add to the document
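
The enrichScript option takes one sub-option per attribute to add. Assuming the same dotted sub-option syntax used for the Router's nextStage option (the attribute names and values here are invented for illustration):

# stamp each document with a constant and a Velocity-derived attribute
enrich.stageType=Enrich
enrich.nextStage=writer
enrich.enrichScript.source=batch-import
enrich.enrichScript.workDir=${system.getProperty("user.dir")}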

ExternalApplication

This pipeline stage allows external applications to be run. Optionally, the pipeline document contents are piped to the application as standard input, or the output of the application can be read in as a new pipeline document.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • application (directory-path, 1..1) - Full path to the application to run
  • pipeOutDocument (boolean, 0..1) - Pipe the current document to the application - the application must fully consume the standard input, otherwise an exception is thrown. Default is false.
  • pipeInResponse (boolean, 0..1) - Pipe the response into the document in the attribute ExternalApplicationResponse. Default is false.
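
A sketch of an ExternalApplication stage in a properties pipeline (the application path and stage names are illustrative):

# pipe the document through an external command and capture its output
extapp.stageType=ExternalApplication
extapp.application=/usr/bin/sort
extapp.pipeOutDocument=true
extapp.pipeInResponse=true
extapp.nextStage=writer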

FileWriter

Writes the document to a disk file. The contents are written as binary or text data depending on the binary flag on the document. When the pipeline document has been written to disk, this stage can optionally create a 'done' file which could act as a flag file for external processes indicating that the output file is completely written.

In addition to the general options (stageType, nextStage, ignored, tracked), this stage accepts:

  • append (boolean, 0..1) - Append the data to the existing file
  • outputFile (directory-path, 0..1) - Output filename
  • doneFile (directory-path, 0..1) - Write the "done" file when the document is written. This can act as a flag for other disk-scanning processes
  • encoding (string, 0..1) - Name of the charset used to write the file
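
A sketch of a FileWriter stage using the append and doneFile options (file names are illustrative):

# append each document to a log file and drop a .done marker when finished
writer.stageType=FileWriter
writer.outputFile=${system.getProperty("user.dir")}/out.log
writer.append=true
writer.doneFile=.done
writer.nextStage=null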

FlatToXml

Convert this flat document to an XML document

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • flatToXmlFile (directory-path, 0..1) - Flat file conversion specification XML file

FtpWriter

Write the document to an FTP server using the FTP protocol. This enables pipelines to distribute documents on the internet using this well-supported protocol.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • ftpHost (string, 0..1) - FTP hostname or IP address
  • ftpUsername (string, 0..1) - FTP username to log in with
  • ftpPassword (string, 0..1) - FTP password to authenticate with
  • ftpFolder (string, 0..1) - The name of the folder on the FTP server
  • ftpFilename (string, 0..1) - The filename under which to store the document on the FTP server
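
A sketch of an FtpWriter stage (the host, credentials, and file names are placeholders):

# upload the document to an FTP server
upload.stageType=FtpWriter
upload.ftpHost=ftp.example.com
upload.ftpUsername=anonymous
upload.ftpPassword=guest@example.com
upload.ftpFolder=incoming
upload.ftpFilename=stats.html
upload.nextStage=null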

HttpClient

Act as an HTTP client and read the response as a new document

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • method (string, 0..1) - HTTP method
  • URL (url, 0..1) - URL
  • queryString (0..n) - Query parameters
  • followRedirects (boolean, 0..1) - Follow redirects
  • http1.1 (boolean, 0..1) - Use HTTP 1.1
  • strictMode (boolean, 0..1) - Strict mode
  • headers (0..n) - Headers
  • parameters (0..n) - Post parameters
  • fileParameters (0..n) - Post file parameters
  • splitAttributes (boolean, 0..1) - Add the old document's attributes into the new document after the HttpClient call
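
A sketch of an HttpClient stage in a properties pipeline (the URL and stage names are illustrative):

# POST the document to a web service and continue with the HTTP response
post.stageType=HttpClient
post.method=POST
post.URL=http://www.example.com/receive
post.followRedirects=true
post.nextStage=transform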

JavaXmlDecoder

Uses the java.beans.XMLDecoder class to deserialize the document contents into Java objects

This stage accepts only the general options (stageType, nextStage, ignored, tracked, encoding).

JournalUpdate

This pipeline stage writes a message into the journal that can be viewed with the journal tool (babeldoc journal). Please note that journal entries should be one line long and contain no quotes, commas, or newlines. If these characters are detected, they will be translated into their HTML equivalents to prevent 'bad things' from happening to the journal tool. However, the output from the journal tool will most likely not be what you are expecting.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • message (string, 1..1) - The message to write to the journal
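
A sketch of a JournalUpdate stage (the message text and stage names are illustrative):

# record a one-line progress note in the journal
note.stageType=JournalUpdate
note.message=Document ${ticket.getValue()} passed validation
note.nextStage=writer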

JTidy

Format the pipeline document using JTidy. This is used to "clean-up" HTML documents into well-formed documents.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts the standard JTidy options:

  • indent-spaces (integer, 0..1) - default indentation
  • wrap (integer, 0..1) - default wrap margin
  • wrap-attributes (boolean, 0..1) - wrap within attribute values
  • wrap-script-literals (boolean, 0..1) - wrap within JavaScript string literals
  • wrap-sections (boolean, 0..1) - wrap within <![ ... ]> section tags
  • wrap-asp (boolean, 0..1) - wrap within ASP pseudo elements
  • wrap-jste (boolean, 0..1) - wrap within JSTE pseudo elements
  • wrap-php (boolean, 0..1) - wrap within PHP pseudo elements
  • literal-attributes (boolean, 0..1) - if true, attributes may use newlines
  • tab-size (integer, 0..1) - tab size; default is 4
  • markup (boolean, 0..1) - if true, normal output is suppressed
  • quiet (boolean, 0..1) - no 'Parsing X', guessed DTD or summary
  • tidy-mark (boolean, 0..1) - add meta element indicating tidied doc
  • indent (boolean, 0..1) - indent content of appropriate tags
  • indent-attributes (boolean, 0..1) - newline+indent before each attribute
  • hide-endtags (boolean, 0..1) - suppress optional end tags
  • input-xml (boolean, 0..1) - treat input as XML
  • output-xml (boolean, 0..1) - create output as XML
  • output-xhtml (boolean, 0..1) - output extensible HTML
  • add-xml-pi (boolean, 0..1) - add <?xml?> for XML docs
  • add-xml-decl (boolean, 0..1) - add <?xml?> for XML docs
  • assume-xml-procins (boolean, 0..1) - if set to yes, PIs must end with ?>
  • raw (boolean, 0..1) - avoid mapping values > 127 to entities
  • uppercase-tags (boolean, 0..1) - output tags in upper not lower case
  • uppercase-attributes (boolean, 0..1) - output attributes in upper not lower case
  • clean (boolean, 0..1) - remove presentational clutter
  • logical-emphasis (boolean, 0..1) - replace i by em and b by strong
  • word-2000 (boolean, 0..1) - draconian cleaning for Word2000
  • drop-empty-paras (boolean, 0..1) - discard empty p elements
  • drop-font-tags (boolean, 0..1) - discard presentation tags
  • enclose-text (boolean, 0..1) - if true, text at body is wrapped in <p>'s
  • enclose-block-text (boolean, 0..1) - if yes, text in blocks is wrapped in <p>'s
  • add-xml-space (boolean, 0..1) - if set to yes, adds xml:space attr as needed
  • fix-bad-comments (boolean, 0..1) - fix comments with adjacent hyphens
  • split (boolean, 0..1) - create slides on each h2 element
  • break-before-br (boolean, 0..1) - output newline before <br>
  • numeric-entities (boolean, 0..1) - use numeric entities
  • quote-marks (boolean, 0..1) - output " marks as &quot;
  • quote-nbsp (boolean, 0..1) - output non-breaking space as entity
  • quote-ampersand (boolean, 0..1) - output naked ampersand as &amp;
  • write-back (boolean, 0..1) - if true, then output tidied markup
  • keep-time (boolean, 0..1) - if yes, last modified time is preserved
  • show-warnings (boolean, 0..1) - show warnings (errors are always shown)
  • error-file (string, 0..1) - file name to write errors to
  • slide-style (string, 0..1) - style sheet for slides
  • new-inline-tags (string, 0..1) - new inline tags
  • new-blocklevel-tags (string, 0..1) - new block level tags
  • new-empty-tags (string, 0..1) - new empty tags
  • new-pre-tags (string, 0..1) - new pre tags
  • char-encoding (integer, 0..1) - character encoding; default is ASCII
  • doctype (string, 0..1) - user-specified doctype
  • fix-backslash (boolean, 0..1) - fix URLs by replacing \ with /
  • gnu-emacs (boolean, 0..1) - if true, format error output for GNU Emacs
  • smart-indent (boolean, 0..1) - whether text/block level content affects indentation
  • alt-text (string, 0..1) - default text for alt attribute

Null

Null stage. This do-nothing stage is useful in certain situations like a tracking placeholder or just a placeholder for some future pipeline stage.

This stage accepts only the general options (stageType, nextStage, ignored, tracked, encoding).

Reader

Load the contents of the file, completely overwriting the current document's contents with the file's contents.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • file (directory-path, 1..1) - The filename or URL of the object to read
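
A sketch of a Reader stage in a properties pipeline (the file path is illustrative):

# replace the current document's contents with a file on disk
read.stageType=Reader
read.file=${system.getProperty("user.dir")}/template.xml
read.nextStage=transform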

Router

Route this document to a number of specified stages. This stage would be used to specialize processing based on some criterion very much like an if-else statement. Usually the criteria used would be an attribute on the document like time of processing, filename, etc but could be a script. The nextStage complex parameter must evaluate to the literal 'true'. If more than one of the nextStages resolves to true, then the document is routed to each of those stages. If none of the matches are made, the regular nextStage configuration option is used. This provides the 'else' part.

In addition to the general options (stageType, ignored, tracked, encoding), this stage accepts:

  • nextStage (0..n) - Stage name to route to if the script resolves to 'true'. Each of the matching nextStages will be routed to.

RssChannel

Write an item entry to an RSS Channel

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • channelFile (directory-path, 1..1) - RSS file to process
  • channelSize (integer, 1..1) - Maximum number of items in the RSS channel
  • itemDescription (string, 1..1) - Item description
  • itemLink (string, 1..1) - Item link
  • itemTitle (string, 1..1) - Item title
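
A sketch of an RssChannel stage (the file path, URL, and stage names are illustrative):

# publish each processed document as an item in an RSS feed
rss.stageType=RssChannel
rss.channelFile=site/news.rss
rss.channelSize=20
rss.itemTitle=Document ${ticket.getValue()}
rss.itemLink=http://www.example.com/docs/${ticket.getValue()}
rss.itemDescription=${document.toString()}
rss.nextStage=null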

Scripting

Execute a user supplied script. This pipeline stage enables pipeline developers to create and manipulate documents in novel and unforeseen ways.

In addition to the general options (stageType, nextStage, ignored, tracked, encoding), this stage accepts:

  • language (enumeration, 1..1) - Scripting language, as supported by Apache BSF. Default is javascript
  • script (multiline, 0..1) - Script to be executed
  • scriptFile (directory-path, 0..1) - Script file to be processed
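
A sketch of a Scripting stage that runs a JavaScript file against the pipeline document (the script path is illustrative):

# run a user-supplied script against the pipeline document
script.stageType=Scripting
script.language=javascript
script.scriptFile=scripts/tag-document.js
script.nextStage=writer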

Signer

This stage performs digital signing, or verifies signatures.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

operation

enumeration

1..1

Type of operation that should be performed

keyStoreFile

directory-path

1..1

Absolute or relative file path to the keystore file

keyStoreType

string

0..1

Type of the keystore

keyStorePass

string

1..1

Password of the keystore

signatureFile

directory-path

0..1

File path of the signature file. The signature is saved here when signing, or loaded from here when verifying.

signatureAttribute

string

0..1

Document attribute where signature will be stored when signing or loaded from if verifying

verifiedAttribute

string

0..1

Document attribute where result of verify operation will be saved

algorithm

string

1..1

Signature algorithm used for performing operations

keyAlias

string

1..1

Alias of the private key used for signing

keyPassword

string

0..1

Password of the private key used for signing if key is protected with password

certificateAlias

string

1..1

Alias of the certificate (public key) used for verifying signature
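A Signer stage configured for signing might look like the sketch below. The stage name, keystore path, password, algorithm, and alias values are all placeholders; only the option names come from the table above.

```properties
# Hypothetical Signer stage named "sign"
sign.stageType=Signer
sign.nextStage=send
sign.operation=sign
sign.keyStoreFile=config/keys/keystore.jks
sign.keyStorePass=changeit
sign.algorithm=SHA1withRSA
sign.keyAlias=mykey
sign.certificateAlias=mycert
# Store the resulting signature as a document attribute
sign.signatureAttribute=signature
```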

SmtpWriter

Email the document using the SMTP protocol. This will allow for documents to be transmitted via email to a number of recipients. The document is normally the body of the email but could also be an attachment.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

smtpHost

string

1..1

The SMTP host to communicate with

smtpFrom

string

1..1

The email address of the sender

smtpTo

string

1..1

The email address to send the email to

smtpSubject

string

1..1

The subject line of the email

smtpMessage

string

0..1

The body message of the email

filesToAttach

string

0..1

The list of files to attach to this email

attachDocument

boolean

0..1

true if the document should be sent as an attachment. Default is false

documentFileName

string

0..1

The name of the attached document

format

enumeration

0..1

The mail format - text/plain or text/html. Default is text/plain
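An SmtpWriter stage that mails the document as an attachment could be sketched as follows. The host, addresses, and file name are illustrative.

```properties
# Hypothetical SmtpWriter stage named "mail"
mail.stageType=SmtpWriter
mail.nextStage=null
mail.smtpHost=smtp.example.com
mail.smtpFrom=babeldoc@example.com
mail.smtpTo=ops@example.com
mail.smtpSubject=Document processed
# Send the pipeline document as an attachment rather than the body
mail.attachDocument=true
mail.documentFileName=order.xml
```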

SoapWriter

Send the document to a soap service

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

soapUrl

url

0..1

URL for the SOAP service

soapAction

string

0..1

SOAP action

resultStage

string

1..1

Name of the stage that receives the SOAP service result

responseDoc

boolean

0..1

Return SOAP service response as an attribute

authentication

boolean

0..1

Post soap document with authentication

username

string

0..1

User id for authentication

password

string

0..1

Password for authentication

SocketWriter

Send the pipeline document contents to a tcp/ip socket. This is useful for low-level operations.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

hostName

string

1..1

The name of the host

hostIp

string

1..1

The ip address of the host

port

integer

1..1

The port to connect to on the host

SqlEnrich

Enrich documents with values based on sql queries

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

resourceName

string

1..1

Name of the resource that provides the database connection

attributeSql

null

0..n

List of attribute names, each containing an SQL query that returns a single value. The attribute will be set to the value returned by that query. If a multi-cell result is returned, only the first column of the first row is taken.

sqlScript

null

0..n

List of scripts that can return multiple columns (but a single row). An attribute will be created for each column; the name of the attribute will be the column name and the value will be the column value. The script name must be unique but is otherwise unused.
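The two enrichment options could be used as sketched below. This assumes the attribute or script name is appended to the option key (following the stage-name.option.name convention shown for other stages), and that Velocity lookups such as $document.get are substituted into the query; the resource name and SQL are illustrative.

```properties
# Hypothetical SqlEnrich stage named "enrich"
enrich.stageType=SqlEnrich
enrich.nextStage=route
enrich.resourceName=mydb
# attributeSql: set the "customerName" attribute to the single value the query returns
enrich.attributeSql.customerName=SELECT name FROM customer WHERE id = '$document.get("customerId")'
# sqlScript: create one attribute per returned column (single row)
enrich.sqlScript.custDetails=SELECT city, country FROM customer WHERE id = '$document.get("customerId")'
```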

SqlQuery

Creates an XML file from a SQL query

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

resourceName

string

1..1

Name of the resource that provides the database connection

sql

null

0..n

List of scripts that can return multiple columns (but a single row). An attribute will be created for each column; the name of the attribute will be the column name and the value will be the column value. The script name must be unique but is otherwise unused.

SqlWriter

Executes the specified SQL statement

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

resourceName

string

1..1

Name of the resource that provides the database connection

useBatch

boolean

0..1

Use JDBC SQL batching - depends on the driver support

batchSize

integer

0..1

The batch size if applicable

sql

string

1..1

The SQL statement to execute

failOnFirst

boolean

0..1

Set to true if the pipeline should not attempt subsequent SQL statements if a statement fails

messageTag

string

1..1

The message tag to search for if the statement fails - this is then logged instead of the SQL error message
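Putting these options together, an SqlWriter stage that batches inserts might be configured as below. The resource name, SQL statement, and the $document Velocity lookups are illustrative assumptions, not part of the Babeldoc distribution.

```properties
# Hypothetical SqlWriter stage named "store"
store.stageType=SqlWriter
store.nextStage=null
store.resourceName=mydb
# Use JDBC batching, 50 statements per batch (driver permitting)
store.useBatch=true
store.batchSize=50
store.sql=INSERT INTO orders (id, body) VALUES ('$document.get("orderId")', '$document.get("orderBody")')
store.failOnFirst=true
store.messageTag=order-insert-failed
```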

SvgTranscode

Render the SVG XML document to a binary image

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

transcode

enumeration

1..1

The transcoder to use for the output image format

width

integer

0..1

Width of the output image

height

integer

0..1

Height of the output image

quality

integer

1..1

Quality of the translation expressed as a percentage

VelocityTemplatize

This stage uses Velocity to templatize the document. The results of the operation will replace the original template.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

XlsToXml

Converts Microsoft Excel files to XML format. This creates a regular XML output document of workbooks, rows and cells. The XML encoding can be configured if necessary.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding string for the output XML. By default it is UTF-8.

attributes

multiline

0..1

Attributes

locale

string

0..1

Locale which should be used for formatting numbers and dates from Excel workbook. If not specified, default Locale will be used.

XpathExtract

Use XPath expressions to extract nodes from the document and store them as attributes on the document. This pipeline stage is widely used when data needs to be extracted from XML documents for router or calculation steps. The extracted attributes can be quickly and easily obtained using velocity $document.get and from the scripting stages. Routing decisions based on the document contents are also possible using this technique.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

XPath

null

0..n

The name of the xpath configuration option is the attribute to assign to the document
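For example, an XpathExtract stage could be sketched as below. This assumes the attribute name is appended to the XPath option key (following the stage-name.option.name convention used for other repeating options); the XPath expressions and attribute names are illustrative.

```properties
# Hypothetical XpathExtract stage named "extract"
extract.stageType=XpathExtract
extract.nextStage=route
# Each option name becomes a document attribute holding the matched value
extract.XPath.orderId=/order/@id
extract.XPath.total=/order/total/text()
```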

XpathSplitter

Split the XML document using xpath expressions. This will result in a number of documents being forwarded to the next stage. This is useful when each of the split nodes represents a document that needs to be actioned. An example would be splitting out each of the orders from an XML document that is a collection of orders.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

xmlOmitDecl

boolean

0..1

Omit the XML PI declaration from the output document

xmlIndent

boolean

0..1

Indent the output document

XPath

string

1..1

The XPath expression to use to split the document

XslFoTransform

Apply XSL:fo transformation on the document

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

outputType

integer

0..1

The output type of the rendered document

XslTransform

Transform the document (has to be XML) using this XSL script. The script can access all of the babeldoc internals via a number of parameters. The parameters (accessed through the xsl:param element) which are always placed in the transformer are: pipelinestage and document. Other parameters may be placed on the transformer using the param option.

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

transformationFile

directory-path

0..1

The filename or URL to the XSL transformation file. If this is a file, then the XSL will be cached. If the file is modified, then the XSL document will be reloaded.

transformationScript

multiline

0..1

An inline XSL document that could be used instead of the file option above. This will be cached.

param

null

0..n

Complex configuration parameter (of form stage-name.param.param-name-n=param-value) of xsl:params that will be placed in the XSL transformer. This can significantly aid transformation tasks.
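An XslTransform stage using a cached stylesheet file and an extra xsl:param could look like this sketch. The file path and parameter value are illustrative; the param key form follows the stage-name.param.param-name convention described above.

```properties
# Hypothetical XslTransform stage named "transform"
transform.stageType=XslTransform
transform.nextStage=write
# The stylesheet is cached and reloaded if the file changes
transform.transformationFile=config/xsl/order-to-invoice.xsl
# Placed in the transformer as an xsl:param named "company"
transform.param.company=Acme Ltd
```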

ZipArchiveWriter

Crack this document as a zip archive

Name

Type

number

description

stageType

service-name

1..1

Type of pipeline stage

nextStage

string

1..1

Name of the next stage in pipeline or null if this is the last stage.

ignored

boolean

0..1

If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline.

tracked

boolean

0..1

If this is set then this stage is tracked - the pipeline document is written to the journal.

encoding

string

0..1

Encoding of resulting document. This is used for text documents. Default is system file.encoding

Handling errors in pipeline stages

Babeldoc has a configurable error handling mechanism. In the case of an exception, the exception will be handled using the default error handler. You can override the default error handler by specifying a custom error handler for a pipeline stage if the default handler is not suitable. The default Babeldoc error handler performs the following steps:

  1. Log the exception to the error log
  2. Set the processing state for the pipeline stage to FAIL
  3. Continue or quit processing. This is determined by the failOnError flag. By default this is false, but you can set it to true if you want to stop processing the current document when an error occurs. If it is false, Babeldoc will continue processing by proceeding to the next stage.

If you want to have some other error handler you can do it by writing your own error handler class. Your class should implement the interface com.babeldoc.core.pipeline.IPipelineStageErrorHandler. You will also need to provide your new error handler Java class to the pipeline configuration.
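The error-handling behaviour can be tuned per stage. The failOnError option is documented above; the errorHandler option name below is an assumption for how a custom handler class might be supplied to the pipeline configuration, and the class name is a placeholder.

```properties
# Stop processing the current document when this stage fails
transform.failOnError=true
# Hypothetical option name: plug in a custom handler implementing
# com.babeldoc.core.pipeline.IPipelineStageErrorHandler
transform.errorHandler=com.example.MyErrorHandler
```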

Tracking documents

You can store the whole document with all its attributes in the journal at a given pipeline stage. This is done by setting the configuration option tracked to true for the pipeline stage you want to track. The document will then be stored in the journal. However, the attributes are not guaranteed to be stored along with the document; this depends on the journal implementation. Also, all attributes are saved as strings. If you want to use the replay operation in the journal, you should set this option on one of the preceding stages so that the replayer is able to recreate the document.

Ignoring Pipelinestages

There can be situations where you do not want a stage to be processed. This is done by setting the configuration option ignored to true. For example, you may have a stage that unzips files with a .zip extension, but you do not want it to run on files that are not zip archives. In this situation you can set ignored to true (for example, using Velocity scripting based on the file extension) and processing will not be performed in that stage.
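A sketch of the file-extension case, assuming Velocity expressions are evaluated in option values and that the file_name attribute (set by the scanner, as noted in Chapter 5) is available on the document:

```properties
# Skip the unzip stage for anything that is not a .zip file
unzip.ignored=#if( $document.get("file_name").endsWith(".zip") )false#{else}true#end
```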

Pipeline Tool

The pipeline can be accessed using the pipeline commandline tool. This allows for the inspection of the pipeline, the stages, the configuration options and connectivity options.

  1. -Q or --query: Lists the pipelines in the system.
  2. -L pipeline-name or --list pipeline-name: Lists the stages in a particular pipeline.
  3. -C pipeline-name.stage-name or --config pipeline-name.stage-name: List the configuration for a pipeline stage.
  4. -L pipeline-name.stage-name -y or -L pipeline-name.stage-name --type: List the type of the pipeline stage.

There are a number of options in this tool. Use the -h option to get the complete list.

Chapter 3. Resources

Introduction

Resources in Babeldoc are a generalized way of accessing data sources. Resources are considered scarce in that they have to be protected from leakage. This is particularly important for database and J2EE resources. Resources are named using a string name and are thus uniquely identified in the system. The resources are defined in the config/resource/config.xml file. Each resource name maps to a specific class name which governs the policy of the resource and is programmatically specified. The available resources are:

  1. jdbc Unpooled Jdbc access to a database connection
  2. jndi Jndi lookup to a database connection
  3. pooled Pooled jdbc access to a database connection

Each of the named resources is defined in config/resources as resource-name.properties. Each properties file has a required name/value pair called type, which can be one of the types listed above. The rest of the configuration options are specific to the type of resource. These configuration options are given below:

jdbc

Each simple jdbc resource defines a connection to a database using the following configuration options:

  1. dbUser name of user to log into the database.
  2. dbPassword password of the user.
  3. dbDriver the jdbc driver for the database.
  4. dbUrl the URL for jdbc to resolve the specific database.

For this to work correctly, the jdbc jar file for the specific database you need to access must be in the classpath. Currently only the mysql jar is built into Babeldoc. Access to other kinds of databases, such as Oracle, DB2 and Sybase, will require that the JDBC driver libraries be placed in the CLASSPATH. The name of the dbDriver parameter and the form of the dbUrl will be highly dependent on the particulars of the vendor database. There are a number of limitations to the simple jdbc resource, the primary one being that the resource does not pool connections. Each time a connection is requested, a new connection is created, and this can be very time consuming.
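A simple jdbc resource file might look like the following sketch. The file name "mydb.properties" and all values (user, password, driver class, URL) are illustrative for a MySQL database; only the option names come from the list above.

```properties
# config/resources/mydb.properties - a simple (unpooled) jdbc resource
type=jdbc
dbUser=babeldoc
dbPassword=secret
dbDriver=com.mysql.jdbc.Driver
dbUrl=jdbc:mysql://localhost:3306/babeldoc
```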

jndi

This resource is useful when running in a J2EE container. This allows for accessing datasources using JNDI. There is a single configuration option:

  1. datasourceName the jndi name of the datasource to lookup.

pooled

Babeldoc provides a pooled connection using the Apache Commons DBCP library. The configuration for the resource is provided by the following configuration options:

  1. dbUser The user name with access the database
  2. dbPassword The user's password
  3. dbUrl Url for access to database

Chapter 4. Journal

Introduction

The journal keeps track of documents as they move through the system as well as the status of each operation performed on the document. The primary purpose of the journal is to provide a safe environment for the processing of documents. There are a number of mission critical situations where losing data is not acceptable. It is possible to recreate document processing if an error condition should arise. Errors can be both external and internal. Internal problems could be temporary database errors, disk space, etc. External causes could be erroneous documents, network outages, etc.

Each document is associated with a JournalTicket which is assigned uniquely just as the document enters the pipeline. Each operation upon a document for a JournalTicket (hereafter also referred to as a ticket) is performed at a step. Steps start at zero and increase until the document is finished processing. Each operation (or pipelinestage) on a document can be uniquely identified by a combination of a ticket and a step.

Journal Operations

A journal operation indicates what happened in the journal for the document at that pipelinestage. This is essential for determining problems with document processing. There are a number of journal operations available:

  1. newTicket. This operation is the first operation (step 0) when a document is introduced into a pipeline. This returns a new ticket.
  2. forkTicket. This operation occurs when a document is split into many documents or similar operations. The forked ticket is a new ticket but is associated with its parent ticket in the ticket lineage and may thus be traced.
  3. updateStatus. This operation will cause the status of this ticket to be updated and the step updated. The ticket is unchanged, the step is incremented.
  4. updateDocument. This operation writes the document to the journal data store (implementation dependent). The ticket is unchanged and step is incremented.
  5. replay. This operation causes the document associated with the ticket to be replayed from the step specified. This operation can only succeed if the document was updated (see update document operation).

Journal Implementations

The implementation of the journal depends on your specific circumstances. There are currently three implementations that are available. Which specific journal to use is defined in the configuration file: config/journal/config.properties. The journal to be used is set in the single name/value pair: journalType. The options are:

  1. simple
  2. mysql
  3. oracle
  4. postgresql
  5. sqlserver
  6. ejb

Simple Journal

The simple journal implements its operations as disk files and directories. It is not intended as a robust, enterprise-level implementation, and it lacks structured query functions. Its configuration file is config/journal/config.properties. This file has a number of configuration options:

  1. simpleJournalDir: The directory to create the log-detail files.
  2. simpleJournalLog: The path to the journal file. See later.
  3. logMaxSize: This will roll-over the log file once the journal log reaches this size.
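Putting these options together, a simple journal configuration might look like this sketch. The directory paths and roll-over size are illustrative values.

```properties
# config/journal/config.properties - simple (file-based) journal
journalType=simple
simpleJournalDir=journal/detail
simpleJournalLog=journal/journal.log
# Roll the log over once it reaches roughly 1 MB
logMaxSize=1048576
```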

Each operation is logged as a line in the journal log file. The lines are comma-separated values (CSV) and can be parsed by third-party applications. The columns are:

  1. ticket number: the ticket number is currently the time in milliseconds at time of creation of the ticket.
  2. step: the step number - starting from 0
  3. operation: The particular operation being executed
  4. timestamp: The time in milliseconds when the operation was logged
  5. status information: The fail / success for updateStatus.
  6. pipeline stage name: The stage within the pipeline when this step was logged.
  7. additional status information: The additional status information that indicates further information about this journal log.

For each ticket, there is a directory created with the value of the ticket (this is a long string of numbers - it is actually the time in milliseconds when the ticket was created). Inside this directory there are step delta files which represent each step in the log for that ticket. The contents of a delta file may be the status string or the document itself (if the operation is updateDocument). The document is persisted as an object serialization.

Jdbc Journal

It is possible to use a database to store the journal log and the document data. Currently mysql, oracle, postgresql and sqlserver are supported. The schema creation scripts are in the directory readme/sql. The document data is stored as binary data (BLOBs). Each vendor supports BLOBs slightly differently, hence the per-database support. There are three main tables involved in storing the journal data (the table table_key is for unique key generation):

  1. log: Stores tickets and steps for the tickets as well as the operation details for each ticket step. The log_other_data column can either store the status message for updateStatus operations or the parent ticket id for forkTicket operations.
  2. journal: Stores the document as a blob for the ticket step. This is associated with updateDocument operations.
  3. journal_data: Storage for the enriched variables associated with the document. The primary reason that these variables are stored separately is that they can be used as query parameters for console operations. Note that long and binary variables are not stored to the database and that strings can get truncated.

The configuration for the Mysql, Oracle, PostgreSQL, and SQL Server journals is stored in the configuration file config/journal/sql/config.properties. The only configuration option in this file is resourceName, which indicates the name of the resource that will manage the database connection. Currently the journal is implemented in a separate schema (instance, whatever) from the other database storage areas (user, and console).

Ejb Journal Implementation

The intent of this journal implementation is to run the journal in a J2EE container. Currently Jboss is explicitly supported, but not to the exclusion of other containers. This implementation is really a shell around either the simple or sql journal implementations, but running in a remote server. By this means, it is possible to move the journal operation to a central location. The configuration for the ejb implementation is stored in the configuration file:

Journal Tool

The journal tool allows access to the journal from the command line. This enables complex queries to be applied against the journal. There are four separate types of queries:

  1. -L or --list: List all the tickets and the steps in the journal. This can produce lots of output, which can be limited by the flag -n (no more than this many lines of output). It is also possible to start from an index other than zero using the -i flag.
  2. -T ticket-number or --tickets ticket-number: List all the ticketsteps for the supplied ticket.
  3. -D ticket-number.step or --document ticket-number.step: Displays the contents of the document stored at the ticket/step to the screen
  4. -R ticket-number.step or --replay ticket-number.step: This will reintroduce the document at the point it was stored or later.

There are a number of options which can change the display of the data from the tool - use the -h command line option to get all the options for this tool.

Chapter 5. Scanner

Introduction

The scanner is a tool that scans for messages from a variety of sources and, when a message is found, feeds it into the pipeline. The scanner is an automation tool, in that a system can be built up using scanners and pipelines. This is an alternative to the process script, which feeds a single document into the pipeline when run. The scanner is currently capable of scanning a directory in a filesystem, a mailbox on a mail server, an FTP server, a web server, a database via a SQL query, external application output and a JMS queue. The period of scan and the pipeline to feed, as well as other specific configuration options, are all set in the config/scanner/config.properties file. There may be one or many scanning threads active, each configured differently. For example, one scanner thread could be polling a mailbox once every 60 seconds while another is scanning a directory every 10 seconds. The scanner is also capable of scanning based on a schedule specified in the same way that CRON is on UNIX systems.

General attributes available are file_name, scan_path and scan_date.

Starting scanner

The scanner tool is started by running the command babeldoc scanner. This command will use configuration from config/scanner/config.properties. If you want to use configuration from a different file, you can use the -s another_configuration switch to specify the configuration that should be used instead of the default one.

Configuration

There are two kinds of configuration options available:

  1. general: these options are global and apply to all types of scanners.
  2. specific: Options for a certain kind of scanner. For example the configuration: 'host' is only pertinent to the email scanner.

The options for each scanner type are laid out below.

DirectoryScanner

The directory scanner is used for scanning directories on the local file system. It can be configured to scan subdirectories of a given folder recursively, and it can use inclusion and exclusion filters to control which files are scanned. This is very useful for integrating Babeldoc into larger systems. An example would be reading documents placed in a shared, networked directory by another application running on the same or another computer.

Name

Type

number

description

type

service-name

1..n

General: Type of scanner (DirectoryScanner)

period

integer

0..1

General: Interval between two scanning operations in milliseconds (Only one of cronSchedule or period can be used)

cronSchedule

string

0..1

General: Cron-like entry for specifying the scanner schedule (Only one of cronSchedule or period can be used)

pipeline

string

1..n

General: Name of pipeline where scanned document will be processed

contentType

string

0..1

General: Content type of document to be scanned

ignored

boolean

0..1

General: true if scanner should not scan false otherwise. Default is false

journal

boolean

0..1

General: Should the scanner use the journal. Default is true

countDown

integer

0..1

General: The number of times this countdown must run.

binary

boolean

0..1

General: The documents from this stage must be submitted as binary pipeline documents

encoding

string

0..1

General: The encoding used for reading input files

inDirectory

directory-path

1..n

Directory to be scanned

doneDirectory

directory-path

0..1

Folder that is used for storing scanned files. Note that scanned files will be removed from inDirectory

includeSubfolders

boolean

0..1

Specifies if scanning should be recursive, and include subfolders. If yes, files will be copied to doneDirectory with path relative to inDirectory.

filter

string

0..1

Regular expression filter. Only files that do match will be included. If not specified all files will be included

minimumFileAge

integer

0..1

Minimum age of file in ms (attempts to guard against incomplete reads)
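A hypothetical DirectoryScanner entry using these options might read as follows. The thread name, pipeline name and paths are invented, and the name.option=value key layout is assumed from the pipeline examples elsewhere in this guide.

```properties
# Hypothetical directory scanner: poll every 10 seconds for XML files
# that have been stable for at least 5 seconds.
orders.type=DirectoryScanner
orders.period=10000
orders.pipeline=orderPipeline
orders.inDirectory=/var/spool/babeldoc/in
orders.doneDirectory=/var/spool/babeldoc/done
orders.filter=.*\.xml
orders.minimumFileAge=5000
```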

null

The null scanner feeds a null document into the pipeline every time the scanner runs. This is useful for scheduling.

Each option is listed as name (type, number of occurrences), followed by its description.

  type (service-name, 1..n)
      General: the type of scanner (null).
  period (integer, 0..1)
      General: the interval between two scanning operations, in milliseconds. Only one of cronSchedule or period may be used.
  cronSchedule (string, 0..1)
      General: a cron-like entry specifying the scanner schedule. Only one of cronSchedule or period may be used.
  pipeline (string, 1..n)
      General: the name of the pipeline in which scanned documents will be processed.
  contentType (string, 0..1)
      General: the content type of the documents to be scanned.
  ignored (boolean, 0..1)
      General: true if the scanner should not scan, false otherwise. Default is false.
  journal (boolean, 0..1)
      General: whether the scanner should use the journal. Default is true.
  countDown (integer, 0..1)
      General: the number of times this scanner should run (a countdown).
  binary (boolean, 0..1)
      General: submit the documents from this scanner as binary pipeline documents.
  encoding (string, 0..1)
      General: the encoding used for reading input files.
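Because the null scanner simply fires on its schedule, it pairs naturally with cronSchedule to drive a pipeline at fixed times. A hypothetical entry (names, values and the cron field layout are assumptions):

```properties
# Feed a null document into a housekeeping pipeline at 02:00 every day.
nightly.type=null
nightly.cronSchedule=0 2 * * *
nightly.pipeline=housekeepingPipeline
```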

HttpScanner

The HttpScanner allows the scanner to pull down documents from web servers. Any headers received by the HttpScanner are placed on the document as attributes.

Each option is listed as name (type, number of occurrences), followed by its description.

  type (service-name, 1..n)
      General: the type of scanner (HttpScanner).
  period (integer, 0..1)
      General: the interval between two scanning operations, in milliseconds. Only one of cronSchedule or period may be used.
  cronSchedule (string, 0..1)
      General: a cron-like entry specifying the scanner schedule. Only one of cronSchedule or period may be used.
  pipeline (string, 1..n)
      General: the name of the pipeline in which scanned documents will be processed.
  contentType (string, 0..1)
      General: the content type of the documents to be scanned.
  ignored (boolean, 0..1)
      General: true if the scanner should not scan, false otherwise. Default is false.
  journal (boolean, 0..1)
      General: whether the scanner should use the journal. Default is true.
  countDown (integer, 0..1)
      General: the number of times this scanner should run (a countdown).
  binary (boolean, 0..1)
      General: submit the documents from this scanner as binary pipeline documents.
  encoding (string, 0..1)
      General: the encoding used for reading input files.
  url (url, 1..n)
      The URL to get the document from.
  attempts (integer, 0..1)
      The number of times to attempt to get the document.
  user (string, 0..1)
      Authenticate with this user name.
  password (string, 0..1)
      Authenticate with this password.
  realm (string, 0..1)
      Authenticate with this realm.
  proxyHost (url, 0..1)
      The URL of the proxy host.
  proxyPort (integer, 0..1)
      The proxy port.
  timeout (integer, 0..1)
      The retry timeout.
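A hypothetical HttpScanner entry might look like this; the thread name, URL and credentials are invented, and the key layout is assumed.

```properties
# Pull a document from a web server every 5 minutes, retrying up to 3 times.
feed.type=HttpScanner
feed.period=300000
feed.pipeline=newsPipeline
feed.url=http://www.example.com/export/data.xml
feed.attempts=3
feed.user=reader
feed.password=secret
```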

MailboxScanner

The MailboxScanner is used for scanning mail servers for e-mail messages. Documents can be scanned from the e-mail body or from attachments. This is very useful for integration with email-enabled clients. An example would be purchase orders emailed to a mailbox scanned by Babeldoc. The From, To and Subject filters are regular expression filters: enter regular expressions which, if matched, cause the matching email to be processed. For example, if you wanted to match a recipient address of first.last@server.com, you would enter "first\.last@server\.com" in the toFilter. The expressions are effectively OR'd together, because if any one of the filters matches, the e-mail message will be processed. The toFilter is tested against all addresses in the TO field. It is NOT tested against the CC or BCC fields. Accessible attributes are subject, from, to and replyTo.

Each option is listed as name (type, number of occurrences), followed by its description.

  type (service-name, 1..n)
      General: the type of scanner (MailboxScanner).
  period (integer, 0..1)
      General: the interval between two scanning operations, in milliseconds. Only one of cronSchedule or period may be used.
  cronSchedule (string, 0..1)
      General: a cron-like entry specifying the scanner schedule. Only one of cronSchedule or period may be used.
  pipeline (string, 1..n)
      General: the name of the pipeline in which scanned documents will be processed.
  contentType (string, 0..1)
      General: the content type of the documents to be scanned.
  ignored (boolean, 0..1)
      General: true if the scanner should not scan, false otherwise. Default is false.
  journal (boolean, 0..1)
      General: whether the scanner should use the journal. Default is true.
  countDown (integer, 0..1)
      General: the number of times this scanner should run (a countdown).
  binary (boolean, 0..1)
      General: submit the documents from this scanner as binary pipeline documents.
  encoding (string, 0..1)
      General: the encoding used for reading input files.
  host (string, 0..1)
      The mail server host name or address.
  protocol (string, 0..1)
      The protocol used for connecting to the mail server (pop3, imap, ...).
  timeOut (integer, 0..1)
      The socket I/O timeout value in milliseconds. Default is an infinite timeout.
  folder (string, 0..1)
      The name of the folder on the mail server (for example, INBOX).
  username (string, 1..n)
      The username for logging in to the mail server.
  password (string, 1..n)
      The password for logging in to the mail server.
  getFrom (enumeration, 0..1)
      Whether the message should be created from the mail body or an attachment. Default is body.
  fromFilter (string, 1..n)
      Regular expression which, if matched by the From field, causes the message to be processed.
  toFilter (string, 1..n)
      Regular expression which, if matched by the To field, causes the message to be processed.
  subjectFilter (string, 1..n)
      Regular expression which, if matched by the Subject field, causes the message to be processed.
  fromFilterResult (boolean, 0..1)
      The result of the regular expression match (true or false).
  toFilterResult (boolean, 0..1)
      The result of the regular expression match (true or false).
  subjectFilterResult (boolean, 0..1)
      The result of the regular expression match (true or false).
  deleteInvalid (boolean, 1..n)
      Delete messages that are not valid (invalid address, etc.) and not processed by Babeldoc.
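Tying these options together, a hypothetical MailboxScanner entry for the purchase-order example might read as follows. The host, credentials and pipeline name are invented, and the key layout is assumed.

```properties
# Scan a POP3 inbox once a minute for purchase orders sent as attachments
# to first.last@server.com.
po.type=MailboxScanner
po.period=60000
po.pipeline=purchaseOrderPipeline
po.host=mail.example.com
po.protocol=pop3
po.folder=INBOX
po.username=orders
po.password=secret
po.getFrom=attachment
po.toFilter=first\.last@server\.com
po.deleteInvalid=false
```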

SqlScanner

The SQL scanner is used for generating documents by executing SQL queries. It can produce XML documents, CSV documents or simple documents. For XML and CSV, only one document is returned, and it contains all returned rows. For simple documents, each document is formed from the first column of each returned row.

Each option is listed as name (type, number of occurrences), followed by its description.

  type (service-name, 1..n)
      General: the type of scanner (SqlScanner).
  period (integer, 0..1)
      General: the interval between two scanning operations, in milliseconds. Only one of cronSchedule or period may be used.
  cronSchedule (string, 0..1)
      General: a cron-like entry specifying the scanner schedule. Only one of cronSchedule or period may be used.
  pipeline (string, 1..n)
      General: the name of the pipeline in which scanned documents will be processed.
  contentType (string, 0..1)
      General: the content type of the documents to be scanned.
  ignored (boolean, 0..1)
      General: true if the scanner should not scan, false otherwise. Default is false.
  journal (boolean, 0..1)
      General: whether the scanner should use the journal. Default is true.
  countDown (integer, 0..1)
      General: the number of times this scanner should run (a countdown).
  binary (boolean, 0..1)
      General: submit the documents from this scanner as binary pipeline documents.
  encoding (string, 0..1)
      General: the encoding used for reading input files.
  resourceName (string, 1..n)
      The name of the connection resource.
  sqlStatement (string, 1..n)
      The SQL statement that is executed to get documents.
  updateStatement (string, 0..1)
      An SQL statement that is executed after selecting rows and creating documents. It is used for marking rows as processed so they are not processed again later.
  documentType (enumeration, 0..1)
      The type of document that is returned. Choices are simple, xml or csv.
  cvsFieldSeparator (string, 0..1)
      The character used for separating fields in a CSV file. Default is a comma. (Note the spelling of the option name.)
  csvRowSeparator (string, 0..1)
      The character used for separating rows in a CSV file. Default is \n.
  xmlHeadingTag (string, 0..1)
      The tag used in the XML document for the heading. Default is document.
  xmlRowTag (string, 0..1)
      The tag used in the XML document for each row. Default is row.
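A hypothetical SqlScanner entry showing the select-then-mark-processed pattern might look like the following. The table, column and resource names are invented, and the key layout is assumed.

```properties
# Every 30 seconds, turn unprocessed outbox rows into one XML document
# and mark them as processed.
outbox.type=SqlScanner
outbox.period=30000
outbox.pipeline=invoicePipeline
outbox.resourceName=mainDatabase
outbox.sqlStatement=SELECT id, payload FROM outbox WHERE processed = 0
outbox.updateStatement=UPDATE outbox SET processed = 1 WHERE processed = 0
outbox.documentType=xml
outbox.xmlRowTag=invoice
```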

FtpScanner

Scans a given folder on a remote FTP server. This allows Babeldoc to connect to remote FTP servers and then scan folders for documents to process.

Each option is listed as name (type, number of occurrences), followed by its description.

  type (service-name, 1..n)
      General: the type of scanner (FtpScanner).
  period (integer, 0..1)
      General: the interval between two scanning operations, in milliseconds. Only one of cronSchedule or period may be used.
  cronSchedule (string, 0..1)
      General: a cron-like entry specifying the scanner schedule. Only one of cronSchedule or period may be used.
  pipeline (string, 1..n)
      General: the name of the pipeline in which scanned documents will be processed.
  contentType (string, 0..1)
      General: the content type of the documents to be scanned.
  ignored (boolean, 0..1)
      General: true if the scanner should not scan, false otherwise. Default is false.
  journal (boolean, 0..1)
      General: whether the scanner should use the journal. Default is true.
  countDown (integer, 0..1)
      General: the number of times this scanner should run (a countdown).
  binary (boolean, 0..1)
      General: submit the documents from this scanner as binary pipeline documents.
  encoding (string, 0..1)
      General: the encoding used for reading input files.
  ftpHost (string, 1..n)
      The host name or address of the FTP server.
  ftpUsername (string, 1..n)
      The username used for connecting to the host.
  ftpPassword (string, 1..n)
      The password used for connecting to the host.
  ftpFolder (string, 1..n)
      The folder name which is scanned.
  includeSubfolders (boolean, 1..n)
      Whether subfolders should be scanned too.
  ftpOutFolder (string, 0..1)
      The folder on the FTP server where scanned documents should be copied.
  localBackupFolder (directory-path, 0..1)
      The folder on the local file system where scanned documents should be copied.
  filter (string, 0..1)
      Regular expression filter; only files that match are included.
  maxDepth (string, 0..1)
      The maximum depth of subfolders to scan.
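A hypothetical FtpScanner entry might look like this; the host, credentials and folders are invented, and the key layout is assumed.

```properties
# Scan /outgoing on a remote FTP server every 10 minutes, keeping local backups.
partner.type=FtpScanner
partner.period=600000
partner.pipeline=ediPipeline
partner.ftpHost=ftp.example.com
partner.ftpUsername=babeldoc
partner.ftpPassword=secret
partner.ftpFolder=/outgoing
partner.includeSubfolders=false
partner.localBackupFolder=/var/spool/babeldoc/ftp-backup
```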

ExternalApplicationScanner

The ExternalApplicationScanner runs an external application and pipes the standard output from that application into the pipeline.

Each option is listed as name (type, number of occurrences), followed by its description.

  type (service-name, 1..n)
      General: the type of scanner (ExternalApplicationScanner).
  period (integer, 0..1)
      General: the interval between two scanning operations, in milliseconds. Only one of cronSchedule or period may be used.
  cronSchedule (string, 0..1)
      General: a cron-like entry specifying the scanner schedule. Only one of cronSchedule or period may be used.
  pipeline (string, 1..n)
      General: the name of the pipeline in which scanned documents will be processed.
  contentType (string, 0..1)
      General: the content type of the documents to be scanned.
  ignored (boolean, 0..1)
      General: true if the scanner should not scan, false otherwise. Default is false.
  journal (boolean, 0..1)
      General: whether the scanner should use the journal. Default is true.
  countDown (integer, 0..1)
      General: the number of times this scanner should run (a countdown).
  binary (boolean, 0..1)
      General: submit the documents from this scanner as binary pipeline documents.
  encoding (string, 0..1)
      General: the encoding used for reading input files.
  application (string, 1..n)
      The application to run.
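A hypothetical ExternalApplicationScanner entry might read as follows. The application path, pipeline name and cron field layout are invented assumptions.

```properties
# Run a report generator at 06:00 every Monday and pipe its standard
# output into the pipeline.
report.type=ExternalApplicationScanner
report.cronSchedule=0 6 * * 1
report.pipeline=reportPipeline
report.application=/usr/local/bin/generate-report
```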

Chapter 6. Flat file conversions

Introduction

Flat-file ASCII data is produced by a number of modern and legacy systems. Examples of flat-file data include CSV; COBOL copybooks; positional data (data items placed in a two-dimensional grid, at specific columns and rows, each occupying a number of characters); and repeating groups (groups of data which repeat based on markers found in the document).

configuration

The flat-file conversion is governed by a conversion configuration file that conforms to the schema readme/schema/conversion.xsd, which clearly describes the various configuration options.

Each input document is considered to be split into two major parts:

  1. header: details various options that are globally applicable.
  2. paragraphs: details the specifics of the conversion.

Header

The header is the first part of the conversion XML document. It describes characteristics of the input document (type of conversion, line-ending character, the number of lines in a paragraph, lines from the top of the document to the first paragraph, lines between paragraphs, etc.) and of the output document (the root element name and the row element name).

Paragraphs

The paragraphs in the input document represent the lines of data that are of interest to be mapped to the output XML document. Each paragraph may consist of one or more lines, each line consisting of one or more characters up to the end-of-line character. Each paragraph maps to a sub-root element in the output document. Each field in the paragraph is represented either by a position and a width in characters in a positional document, or by a column number in a CSV document. These fields are represented by sub-row elements in the output document, i.e. in XPath: /root/paragraph/field.

There are two basic kinds of paragraphs: segmented and non-segmented lines.

Non-segmented lines

Non-segmented lines are lines whose output paragraph XML element does not change based on the presence of data in the input document. There are three types of non-segmented input documents:

CSV documents

This is the simplest document. Each paragraph is a line of comma-separated values. Each field is specified by a column number and a field name. The name is the subrow element to emit for the data found at the column number.

Single-line paragraphs

This is a positional document where each line of the input document represents a paragraph. Each field in the line is specified by an offset (starting from zero) into the line, the width of the field and the field name of the sub-row element to emit.

Multi-line paragraphs

This is a positional document where each paragraph consists of a number of lines. Each field is specified by a line offset into the paragraph (from the top of the paragraph), an offset from the left margin, character width and a field name. This is useful for screen scraping operations where the screen height and width (usually 80x24) represents the paragraph.

Segmented lines

The premise of segmented lines is that the input file may contain some value which indicates, as a marker, the kind of data on that line. The marker is specified as a column/width and a value to match. Once a line has been identified, it is possible to perform either a single-line paragraph conversion or a CSV conversion on it. There is an optional nesting element to output when a segment is matched; it is situated between the row and field elements.

conversion XML document

The conversion XML document is divided into two sections: header and conversion information. The basic format is:

conversion
    header
        output-document
            root-element
                The root element of the output document.
            row-element
                The row element for each of the input paragraphs.
        input-document
            conversion-type
                This can be one of: line, csv, para, segmented-line.
            line-ending
                The line-ending characters. This is currently ignored.
            field-separator
                For CSV files, the field separator character.
            inter-skip
                The number of lines to skip between paragraphs.
            top-skip
                The number of lines to skip before the first paragraph is encountered.
            left-margin
                Characters to skip from the beginning of the line to the first character of interest in the paragraph.
            lines-per-para
                The lines for each paragraph. This is used for establishing the chunk size.
    fields
        The fields element depends on which type of flat file is to be converted. There are three types:

CSV files

These files are often used when exporting data from spreadsheet applications. Each column of data is separated by a comma, and cells may be enclosed in quotation marks to escape text.

field [1..*]

there can be one or more fields

field-name

The name of the output field element

field-number

the number of the CSV column
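Putting the header and CSV fields together, a minimal conversion document for a two-column CSV file might look like the following sketch. The element-per-item layout and all names and values here are assumptions; readme/schema/conversion.xsd is the authoritative reference.

```xml
<!-- Hypothetical sketch: convert "ref,amount" CSV lines (skipping a one-line
     heading) into <orders><order><order-ref>...</order-ref>
     <amount>...</amount></order></orders>. -->
<conversion>
  <header>
    <output-document>
      <root-element>orders</root-element>
      <row-element>order</row-element>
    </output-document>
    <input-document>
      <conversion-type>csv</conversion-type>
      <field-separator>,</field-separator>
      <top-skip>1</top-skip>
    </input-document>
  </header>
  <csv-fields>
    <field>
      <field-name>order-ref</field-name>
      <field-number>1</field-number>
    </field>
    <field>
      <field-name>amount</field-name>
      <field-number>2</field-number>
    </field>
  </csv-fields>
</conversion>
```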

Flat lines

These files consist of lines of data; each line corresponds to a row of data. The fields are positionally arranged in the line. For instance, the order reference could exist at column 15 with a width of 10.

line-fields

Holds the fields

field [1..*]

There can be 1 or more fields

field-name

The name of the output field element

field-column

The character number of the column

field-width

The number of characters of the width
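The column-15, width-10 order-reference example above could be sketched as the following line-fields fragment. The element-per-item layout is an assumption; consult readme/schema/conversion.xsd for the authoritative structure.

```xml
<!-- Hypothetical sketch: an order reference at column 15, width 10. -->
<line-fields>
  <field>
    <field-name>order-reference</field-name>
    <field-column>15</field-column>
    <field-width>10</field-width>
  </field>
</line-fields>
```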

Paragraphs

These files consist of regular groups of lines. Fields exist at a particular row and column in the array of lines. Each field then consists of a number of characters, that is, a width. This is very similar to the flat-line case except that it is a two-dimensional array of data. This is useful for screen scrapes, etc.

para-fields

element to hold the paragraph fields

field [1..*]

There can be 1 or more fields

field-name

The name of the output field element

field-column

The character number of the column

field-row

The line (from top) of the field.

field-width

The number of characters of the width

Line Segments

Segments are the method of mutating the output based on key fields in the input lines.

line-segments

element to hold all of the segments

segment [1..*]

Segments (there can be 1 or more)

segment-name

The name of the output segment element

segment-column

The column of the segment marker

segment-width

The width of the segment marker

segment-value

The value of the segment to match.

begin-group-name

The name of the element to begin the group.

csv-fields | line-fields
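A segmented-line definition could be sketched as follows, nesting line-fields inside a segment. The marker value, element names and element-per-item layout are assumptions; readme/schema/conversion.xsd is the authoritative reference.

```xml
<!-- Hypothetical sketch: lines whose first two characters are "HD" are
     matched as "header" segments, wrapped in a <header-group> element and
     converted with the nested line-fields. -->
<line-segments>
  <segment>
    <segment-name>header</segment-name>
    <segment-column>0</segment-column>
    <segment-width>2</segment-width>
    <segment-value>HD</segment-value>
    <begin-group-name>header-group</begin-group-name>
    <line-fields>
      <field>
        <field-name>order-reference</field-name>
        <field-column>15</field-column>
        <field-width>10</field-width>
      </field>
    </line-fields>
  </segment>
</line-segments>
```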

Chapter 7. HOW TOs

Introduction

This chapter is intended to collect some of the accumulated knowledge of those using Babeldoc. The intention is to save you time if you are trying to perform some of these tasks or similar ones. Please contribute your nuggets of information - they can help others.

HOWTO Set up Eclipse with Babeldoc

Pre-requisites: Eclipse 3.0 build M4 or later. Anything earlier will not work.

  1. Open Eclipse
  2. Make sure that you can see the CVS Repositories View. If you can't, click Window | Show View | Other ... and select CVS Repositories.
  3. The Repositories View will (probably) come up empty. Right click on the white space, and click New | Repository Location ... Enter all the repository details (extssh for anybody with a developer account, otherwise pserver), and click OK.
  4. A new entry for the repository will appear. It's the root node in a tree. Open the tree. Below you should see entries HEAD, Branches and Versions. If you want to develop on the HEAD, as most core developers would probably want, open the HEAD node.
  5. Right-click on the babeldoc node that appears under the HEAD, and select Check Out As... A dialog will appear. You can use "Check out as a project configured using the New Project Wizard". or try the "Check out as a project in the workspace".
  6. Complete the New Project Wizard details.
  7. Once the New Project Wizard has finished processing, you should have a project open in the Java perspective. If not, click on the Add Perspective button, and add a Java Perspective.
  8. Right-click on the project node, and select Properties. Select the Java Build Path option, and select the Source tab.
  9. You should see your project appear with a single (empty) exclusion filter. Edit the filter, and set it to **, i.e.: exclude all files (trust me, I'm a programmer), and click OK.
  10. Select Add Folder... and add each of the src/ folders. You can multiselect on the folder selection dialog. For example, you should open the root node, and then open modules/, and then open babelfish/, and then select the src/ directory. Then open the conversion/ directory, and select src/ (with the Ctrl button down, this time), and so on. In the j2ee/ folder, don't forget to add both src/ and gensrc/. All src/ directories inside modules/ should be added.
  11. Now in the Properties dialog, still in the Java Build Path option, select the Libraries tab. Click Add JARs... and select all the jar files in build/lib, except for any library beginning with "babeldoc_". Also add support/ant/lib/ant.jar and support/ant/lib/junit.jar.
  12. Now click OK in the Properties dialog. Eclipse will probably spend a few seconds rebuilding its project information.

You should now have a happy eclipse system showing all the source modules, and the libraries. Eclipse should not show any errors detected by the background compiler. However, there will be a stack of warnings. They can, for the moment, be ignored.

Now to get ant working.

  1. In the Java perspective, right-click on build.xml, and select Run Ant...
  2. The Ant dialog will now appear. Click on the Main tab, and ensure that the Base Directory is set to the project root. It will probably look something like this: ${workspace_loc:/Babeldoc}
  3. Click on the Classpath tab. Uncheck the "Use global classpath as specified in the Ant runtime preferences". Click Add JARs... and add support/ant/lib/babeldoc_bootstrap.jar, support/ant/lib/xercesImpl.jar, and support/jalopy/lib/jalopy-ant-0.6.1.jar, or whatever the current version is.
  4. Click on the JRE tab. Click Alternate JRE, and select one of your JREs. You should probably set it to something fairly recent. Now, this is critical. You have to set the Working directory. Uncheck the "Use default working directory", and select "Workspace". Click Browse... and select the root node of the project. Click OK. If you don't have a "Working directory" section on the JRE tab, running ant is not going to work. If you do not have the working directory section, you need to upgrade your eclipse to at least version 3.0 build M4.
  5. Click Apply.
  6. Click on the Targets tab. Select the "build" target, and click Run.

You should now get a Console View appear, and the ant output will be spooled into the Console View.

HOWTO Read an attribute from external XML file

As there is no equivalent to SqlEnrich for XML, it is not obvious how to get an attribute from an external file and then revert to the original document. One way to do this is to store the current document as an attribute, process the second file, and then revert the document to the value of the attribute:

# Save the current document content into an attribute
doc2attrib.stageType=Scripting
doc2attrib.nextStage={Stages that load other document etc.}
doc2attrib.script=document.put("originalContent", document.getBytes());

# Restore the original document content from the attribute
attrib2doc.stageType=Scripting
attrib2doc.nextStage={Continue with processing}
attrib2doc.script=document.setBytes(document.get("originalContent"));

HOWTO Access the attributes of a pipeline document inside an XSLT

Essentially you can use document.get("myprop"):

<xsl:param name="doc" select="$document"/>
<xsl:param name="myprop" select="java:get($doc, 'myprop')"/>

For the syntax, see the Java section of http://xml.apache.org/xalan-j/extensions.html.

Additionally you can get the pipeline stage object from the XSL and then you can manipulate the java code directly.

The snippet below is an example of how to get the current time and format it nicely:

<xsl:variable name="date" select="java:java.util.Date.new()"/>
<xsl:variable name="seconds" select="java:getTime($date)"/>
<xsl:variable name="velocity"
    select="java:com.babeldoc.core.VelocityUtilityContext.new()"/>
<xsl:variable name="datestr"
    select="java:getFormattedDate($velocity, 'd MMM yyyy HH:mm:ss', $seconds)"/>

HOWTO Package up your application into a single jar file for easy distribution

The idea of this HOWTO is to avoid distributing all the directories that make up your configuration by packaging them all up into a single jar file and using this to run your pipelines.

Let's assume that your BABELDOC_USER points to the c:\project directory. This directory has all the required configuration directories, like pipeline, resource, etc.

  • Jar up your configuration files: jar cf myproject.jar pipeline resource journal, producing a myproject.jar file.
  • Change your BABELDOC_USER to the jar: set BABELDOC_USER=c:\project\myproject.jar
  • Verify that your pipelines still work, but change directory away from c:\project first to make sure that the configuration files there don't interfere with the new BABELDOC_USER variable!

Appendix A. The Apache Software License, Version 1.1

Copyright (c) 2000 The Apache Software Foundation. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. The end-user documentation included with the redistribution, if any, must include the following acknowledgment:

"This product includes software developed by the Apache Software Foundation (http://www.apache.org/)."

Alternately, this acknowledgment may appear in the software itself, if and wherever such third-party acknowledgments normally appear.

4. The names "Apache" and "Apache Software Foundation" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact apache@apache.org.

5. Products derived from this software may not be called "Apache", nor may "Apache" appear in their name, without prior written permission of the Apache Software Foundation.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

====================================================================

This software consists of voluntary contributions made by many individuals on behalf of the Apache Software Foundation. For more information on the Apache Software Foundation, please see http://www.apache.org.

Portions of this software are based upon public domain software originally written at the National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign.

====================================================================