If you are using Babeldoc, please participate in its development. This can range from just sending a hello, through posting bug reports, to actually contributing code. We want to hear from you on the Babeldoc forums. There is also a mailing list hosted on SourceForge, babeldoc-devel, that will be very useful for technical users of Babeldoc. Please join for the latest news.
Babeldoc is a document processing system. Babeldoc is especially suited for Business-to-Business (B2B) environments and similar integration projects. Babeldoc has a flexible and reconfigurable processing pipeline through which documents flow. These pipelines transform the document. Additionally, Babeldoc has a sophisticated and extensible journaling system so that documents may be reprocessed and resubmitted as well as tracked through the system. Its runtime environment is flexible, so that it can run standalone, in a web container or in a J2EE container (currently tested on JBoss 2.4.x and JBoss 3.0.x). Babeldoc has a Web console, a number of GUI tools and a command-line console to control the document flow.
The flow-based metaphor is an appropriate one that will be expanded on in this document.
Babeldoc can be used to process documents flowing into and out of a system. There are three basic ways to develop applications to handle these document flows:
How about a system that can be implemented in any of these ways but is still centrally managed and configured? Additionally, how about a system that seems like (1) but is actually implemented as a server architecture (3)? Babeldoc can be configured to run in any one of these three ways.
Additionally, document processing sometimes fails, and then the issue is how you react. Babeldoc has a very sophisticated journaling function that allows administrators to reintroduce documents at any place in the pipeline.
Babeldoc comes ready to run with demonstration pipelines right out of the box. There are two pipelines configured: demo and test. They provide an example of how to construct a pipeline using the simple configuration style (property files) and the more sophisticated XML-based configuration. There are additional pipelines configured, including one to recreate the documentation you are currently reading in a number of different formats.
In addition, the usage whitepaper, also found in the readme directory, provides a walk-through of two usage scenarios.
In order for these examples to run correctly, please ensure that the following instructions are followed:
Download and install the JRE. Babeldoc requires version 1.4 or greater; it may be possible to run parts of Babeldoc using 1.3, but this is not supported. Ensure that java is in your path. This is essential. To test this, execute the command java from the command line. If you get a "command not found" error, you will need to add the jre/bin directory to your path. It is also useful to have JAVA_HOME set, or you may receive warning messages.
The BABELDOC_HOME and the PATH environment must be set. This is done as follows:
It is assumed that you have installed Babeldoc in the c:\babeldoc directory. If it is in another directory, please change accordingly.
The BABELDOC_HOME and the PATH environment must be set. This is done as follows:
It is assumed that you have installed Babeldoc in the /opt/babeldoc directory. If it is in another directory, please change accordingly. It is quite possible that you will have installed it in a non-privileged location.
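The required settings can be sketched as a pair of shell commands (assuming a Bourne-style shell and the /opt/babeldoc location above; adjust the path if Babeldoc is installed elsewhere):

```shell
# Assumed install location from the text above; change if yours differs
export BABELDOC_HOME=/opt/babeldoc
# Put the Babeldoc launcher scripts on the PATH
export PATH=$PATH:$BABELDOC_HOME/bin
```

These lines can be placed in your shell profile (e.g. ~/.profile) so they take effect in every session.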
Open a command window. Ensure that the paths are correct for your platform. Run the command:
The output from this command should look similar to this:
*** This is Babeldoc! ***
Usage: babeldoc <command>
where command must be one of:
xls2xml, scanmon, addstagewiz, setupwiz, lightconfig, sqlload, pipeline, setentrywiz, journal, process, journalbrowser, flat2xml, scanner, guiprocess, pipelinebuilder, babelfish, module.
Babeldoc 1.3.0 Copyright (C) 2002,2003,2004 The Babeldoc Team!!
Babeldoc comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it under certain conditions
If your output is not like this or you get an error, please check the paths and the JRE requirements.
Jumping right in, you can run the demonstration pipelines:
Example 1.1. Running the demonstration pipeline
You will see a number of logging messages scroll over the screen, and in the current directory find a file named stats.html. Take a look at this file using your favorite browser - to many this is a more pleasant way to look at sports scores. This is the output file from the processing of the input file, stats.xml. It is interesting to note that the file stats.xml does not actually reside in the filesystem as a file. It is in the Babeldoc core Java archive.
<2002-07-25 23:37:51,279> <root> <INFO> Process stage: entry
<2002-07-25 23:37:52,692> <root> <INFO> Process stage: transform
<2002-07-25 23:37:53,382> <root> <INFO> Process stage: choose
<2002-07-25 23:37:53,400> <root> <INFO> Process stage: writer
Processed. Ticket: 1027654671023 assigned
Note that this is an example of what your output looks like. The ticket number will be different. Use your ticket number in the following examples.
Note that some versions may not print the ticket number. The ticket number can be found by running babeldoc journal -L, or with the Journal Browser tool by running babeldoc journalbrowser.
The pipeline can be inspected using the pipeline tool. To see the options, simply type:
Example 1.2. Inspecting the pipeline
The options for the pipeline tool will be printed to the screen. Please experiment with this tool. To interrogate the configuration of, say, the entry stage, issue the command:
Notice the common syntax for accessing pipeline-stages:
The journal tracks documents moving through the pipelines. It can also track the changes to the documents as well as the status of each stage in the pipeline. The tool to access the journal functionality from the command-line is:
Example 1.3. Inspecting the Journal
The journal tool is primarily suited to querying the journal data. Please experiment with all the options. For now, though, review all the steps that occurred during the processing in 1.2.1. To do this, you must use the ticket number printed during your session (not 1027654671023 as below).
This will result in the following output:
ticket: step: 0; date: Thu Jul 25 23:53:43 EDT 2002; stage: null; op: newTicket; other: null
ticket: step: 1; date: Thu Jul 25 23:53:43 EDT 2002; stage: test.entry; op: updateDocument; other:
ticket: step: 2; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.extract; op: updateStatus; other: success
ticket: step: 3; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.transform; op: updateStatus; other: success
ticket: step: 4; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.choose; op: updateDocument; other:
ticket: step: 5; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.choose; op: updateStatus; other: success
ticket: step: 6; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.writer; op: updateStatus; other: success
This listing shows that the document resulted in seven recorded steps in the journal (steps 0 through 6). Most of these steps are updateStatus steps, which merely record the processing status of the document at the corresponding stage in the pipeline. The updateDocument steps indicate points in the pipeline where the entire document was stored. At these steps it is possible to extract the document and display its contents (journal -D) or even reprocess the document (journal -R). Display the document at step 4 (journal -D 1027654671023.4).
It is possible to toggle the document tracking at any stage in a pipeline. This is done by setting the tracked flag in the pipeline configurations. Please review the pipeline documentation.
Babeldoc is a modular piece of software. Each of the modules in Babeldoc successively adds to and refines its operation. Although a module participates in both the runtime and the build of Babeldoc, this document is concerned with the runtime aspects of modules.
Example 1.4. Listing the modules
There is a command-line tool to list the current set of modules known to Babeldoc:
This will list the modules and their dependencies:
Module: core is dependent on:
Module: web is dependent on: core
Module: gui is dependent on: core
Module: crypto is dependent on: core
Module: xslfo is dependent on: core
Module: soap is dependent on: web, core
Module: sql is dependent on: core
Module: scanner is dependent on: core, sql
Module: babelfish is dependent on: core
Module: conversion is dependent on: core
It is possible to remove and add modules to Babeldoc. The current set of modules is installed in the directory $BABELDOC_HOME/lib. The standard modules are named babeldoc_module-name (for example, babeldoc_core.jar).
All configuration data within Babeldoc is handled in a structured fashion. Every configuration key must be contained in a configuration file. Configuration files are hierarchically arranged, much like regular filesystem files in directories and subdirectories. The "directory" part of a configuration file name is specified using UNIX-style forward slashes separating directory names. The configuration key is a string which must be unique within a configuration file. An example is the configuration key Journal.simple, which is defined in the configuration file service/query.properties.
The configuration implementation of Babeldoc is itself configurable, but the default implementation is the LightConfig. This stores the configuration data in properties files which are then hierarchically arranged into directories. The files may be stored on the local filesystem or in archive (JAR) files.
The LightConfig implementation also has the very interesting and sometimes perplexing ability to merge configuration files with the same name into a single configuration file. This means that configuration file data does not overwrite other data except where the configuration key is identical; in that case, the configuration file specified at the end of the configuration search path is dominant. This is logical and is consistent with how the PATH (or CLASSPATH) environment variable is used by the command processor to search for executables, except that instead of the first match overriding all else, all of the matches are merged into a single "file".
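As a hypothetical illustration of this merge behavior (the pipeline names and values here are invented for the sketch), suppose the same configuration file exists in two places on the search path:

```properties
# In babeldoc_core.jar (earlier on the search path): pipeline/config.properties
orders.type=simple
orders.configFile=pipeline/simple/orders

# In the local configuration directory (later on the search path):
# pipeline/config.properties
orders.configFile=pipeline/simple/myorders
invoices.type=simple
```

The merged result contains all of the keys from both files; orders.configFile takes the locally defined value pipeline/simple/myorders, because for identical keys the file later on the search path is dominant.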
The configuration searchpath is very important and is used by Babeldoc to determine where to find the configuration data and how to load it. The parts of the search path are given below:
There are times when the configuration is not working as expected. There is a small command-line tool which makes it easier to inspect the configuration files and see how each configuration key is modified. The tool, lightconfig, is illustrated below:
Example 1.5. Listing configuration data
The location of the configuration file pipeline/config.properties in each part of the configuration search path is then listed. A typical output would be:
Listing urls for the configuration: pipeline/config.properties
1: jar:file:/c:/download/babeldoc/build/lib/babeldoc_core.jar!/core/pipeline/config.properties
0: file:/C:/work/vap_rpt/./pipeline/config.properties
This output indicates that the file pipeline/config.properties exists in the babeldoc_core.jar file and is then overridden in the directory C:/work/vap_rpt.
Example 1.6. Tracing a configuration key
This traces how a particular configuration key (documentation.type) found in the configuration file: pipeline/config.properties is modified in all the possible configuration files. A typical output would be:
Listing urls for the configuration: pipeline/config.properties
1: jar:file:/c:/download/babeldoc/build/lib/babeldoc_core.jar!/core/pipeline/config.properties
documentation.type = simple
0: file:/C:/work/vap_rpt/./pipeline/config.properties
documentation.type: not defined
This output indicates that the configuration key is defined once in the babeldoc_core.jar file and is not subsequently overridden.
This section briefly describes the steps necessary to setup a new Babeldoc project. In the interests of brevity, the following assumptions are made:
The simplest method of configuring your environment is to create a setup batch file in the project directory. This file is usually called setup.bat, but the name is unimportant. The purpose of the file is to configure the local environment so that Babeldoc can run. The contents of this file for this environment are given below:
@echo off
set JAVA_HOME=c:\j2sdk1.4.2_04
set BABELDOC_HOME=c:\babeldoc
set BABELDOC_USER=c:\project
set PATH=%PATH%;%BABELDOC_HOME%\bin
Prior to using Babeldoc, run this script. Now create the configuration files in this directory.
A pipeline is a program whose purpose is to transform a document into one or more resultant documents. An example pipeline could transform a received XML purchase order into a set of SQL statements intended to update a database, produce a printable PDF file for record keeping, and send a confirmation email to the originating party.
All of the pipelines in Babeldoc must have a unique name, like test or document. A pipeline is a set of processing steps arranged in a linear fashion. Each processing step is called a "pipeline stage", and each pipeline stage in a pipeline must have a unique name; two pipelines may have pipeline stages of the same name. There is a special pipeline stage in a pipeline, the entryStage, which indicates which pipeline stage should initially receive the document from the feeder mechanisms. It is also possible to introduce a document into the "middle" of a pipeline. In order to designate a particular stage in a pipeline, the name is given as pipeline-name.stage-name (for example, test.transform).
The pipelines in the Babeldoc system are managed by the pipeline factory, which determines how and when a pipeline runs. Each pipeline stage in a pipeline has a type and a set of configuration options. An example of a pipeline stage is the test.transform pipeline stage, whose type is XslTransform. This type of pipeline stage requires either the configuration option transformationFile, which supplies the filename (or URL) of the XSLT file to perform the transformation, or transformationScript, which is the inline XSLT document. There is an additional non-mandatory configuration option, bufferSize, which can help with larger transformations.
The pipeline stages operate on documents. A useful metaphor is the pipes that constitute the plumbing in your home. Each of the stages in the plumbing pipeline represents bends, faucets and other functional requirements. A pipeline document is the water in the plumbing pipeline. A document is successively transformed by the pipeline until it is finally stored, discarded or otherwise disposed of. The transformations are determined by the pipeline and its stages. A document is primarily a number of bytes (characters) of data, characterized by a MIME type. There are a number of ways a document can get fed into a pipeline, namely:
A document consists of the following components:
The document body may be an XML document, a flat-file text document or a binary document. Significant processing can only be applied to XML documents in the standard Babeldoc. In order to convert flat files to XML documents, there is a conversion pipeline stage which can convert a number of flat-file formats to XML; please see the conversion chapter of this document. The XML functionality is the primary focus of the default Babeldoc distribution. Binary documents are also acceptable; however, Babeldoc does not have many stages that process binary documents. This does not mean that binary processing is not possible - it is. Examples could be processing of photographic images, sound files or even video files.
Attributes are said to enrich the document because they are basically shortcuts to data found in the document itself or from some other source. The attributes can be applied to a document in the following ways:
For instance, the number of purchase orders in a bulk purchase XML document can be extracted from the document (using the XPathExtract pipeline stage) and placed in the attribute named numOrders. This results in significant speed increases in subsequent processing because the attribute (numOrders) can be used instead of multiple expensive XPath operations. The attributes are available through the variable ${document.get("attribute.name")}. This means that it is possible to customize the pipeline processing based on extracted (or enriched) data from the document.
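For example, a hypothetical fragment modeled on the Router syntax used in the test pipeline later in this chapter: the extracted numOrders attribute could steer routing without re-evaluating any XPath (the stage name "bulk" and the numeric comparison are assumptions for this sketch, and depend on the template engine supporting numeric comparison):

```properties
# Hypothetical Router stage: divert bulk documents to a "bulk" stage
# when the extracted numOrders attribute exceeds 100
choose.stageType=Router
choose.nextStage=writer
choose.nextStage.bulk=#if(${document.get("numOrders")} > 100)true#end
```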
Attributes are not limited to data extracted from the document, but can also be options passed into the pipeline along with the document, like the email address to send the document to, the file path that the document was read from, and more besides. These kinds of attributes allow the internal processing of the pipeline to be influenced by the external environment. The command babeldoc process will accept any number of name=value pairs on the command line. Each of the supplied attributes will be placed on the document and will be available to the pipeline stages in the pipeline. In the test pipeline it is possible to email the processed document by supplying the smtpHost, smtpFrom and smtpTo attributes.
Example 2.1. Adding attributes from the command-line
Instead of running the test pipeline as before, we can add a number of attributes to the process command line which will activate "hidden" stages in the pipeline.
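The invocation might look like the following sketch. The name=value pairs are the documented attribute mechanism; the leading arguments (pipeline name and input file) are left as placeholders, since the exact process syntax may differ by version, and the SMTP values are illustrative:

```
babeldoc process <pipeline> <input-file> smtpHost=smtp.example.com \
    smtpFrom=sender@example.com smtpTo=recipient@example.com
```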
Babeldoc has a rather sophisticated data abstraction mechanism. It is just as easy to read a file from your hard disk as it is to load it from your classpath, or even from a website (http://...) or an FTP site (ftp://...). This means that a simple pipeline which works on local files will also work in a networked environment.
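For instance, the transformationFile option of the XslTransform stage (shown in the examples in this chapter) could equally point at any of these sources; the remote URLs below are hypothetical:

```properties
# Local file, relative to the working directory
transform.transformationFile=test/quickstart/stats-html.xsl
# The same stylesheet served from a (hypothetical) web server
#transform.transformationFile=http://www.example.com/xsl/stats-html.xsl
# ...or from a (hypothetical) FTP server
#transform.transformationFile=ftp://ftp.example.com/xsl/stats-html.xsl
```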
The name and configuration options of each pipeline are provided in the file config/pipeline/config.properties. Since this file (like every other configuration file) participates in the Babeldoc configuration system, you will need to create your own copy of it in your configuration directory. Please see the configuration handling described in chapter 1. The handling of pipeline stages is performed by a set of PipelineStageFactories; which PipelineStageFactory handles a pipeline is determined by the pipeline's type. Here are the current pipeline factory types:
This pipeline factory is the simplest to set up. Its pipeline type is simple. This is indicated in the configuration file pipeline/config.properties, which declares the pipeline and provides its type and the actual configuration file that defines the pipeline. For instance, if the pipeline name is test, the entry that sets the type of the test pipeline will be test.type=simple. The pipeline definition configuration file (see more later) for the test pipeline will be given as test.configFile=pipeline/your-config (note: the .properties extension is omitted from the file name).
Example 2.2. Declaring a 'Simple' Pipeline
The configuration file pipeline/config.properties shows how a simple pipeline called test is declared to Babeldoc. A subsequent example will show how the pipeline is defined.
test.type=simple
test.configFile=pipeline/simple/test
Notice that the configuration file for the pipeline is pipeline/simple/test - the actual name of the file is pipeline/simple/test.properties.
In this directory there is a subdirectory called pipeline. Within it lives config.properties (this location is mandated - the PipelineFactory looks for this file; if you do not put your pipeline declarations in this configuration file, they will NOT be found). The declaration of your pipelines is done in this file. The pipeline configuration files themselves may be in the same directory as config.properties or in subdirectories of the pipeline directory - the choice is yours.
The actual definition of the pipeline is provided in the value of the pipeline-name.configFile property which is specified in the pipeline/config.properties file. Each of the pipeline stages within the pipeline are defined here as well as the document flow from one pipeline to the next.
Every simple pipeline definition document must contain the entryStage property. This property informs Babeldoc which pipeline stage is the starting point for the pipeline. If this property is not given in this file, processing of this pipeline results in an error.
Other than the entryStage property, every property in the pipeline definition file is of the form:
pipelinestage-name.option-1...option-n=value
The first part (up to the first period) is the name of the pipeline stage. The subsequent options (period-separated, up to the '=') are arguments to the pipeline stage. There are two kinds of options for each pipeline stage:
Additionally there are mandatory and optional pipeline stage options. The pipeline will fail to run if a mandatory option is not provided. The following are general options:
For the complete list of pipelinestage configuration options, please refer later in this chapter to the list of pipelinestages.
Example 2.3. Defining a 'Simple' Pipeline
The pipeline is defined in a properties file which enumerates the pipelinestage configuration.
entryStage=entry
entry.stageType=Null
entry.nextStage=transform
entry.tracked=true
transform.stageType=XslTransform
transform.nextStage=choose
transform.transformationFile=test/quickstart/stats-html.xsl
transform.bufferSize=2048
choose.stageType=Router
choose.nextStage=writer
choose.tracked=true
choose.nextStage.emailer=#if(${document.get("smtpHost")})true#end
emailer.stageType=SmtpWriter
emailer.nextStage=writer
emailer.smtpHost=$document.get("smtpHost")
emailer.smtpFrom=$document.get("smtpFrom")
emailer.smtpTo=$document.get("smtpTo")
emailer.smtpSubject=Document: Ticket: ${ticket.Value}
emailer.smtpMessage=${document.toString()}
writer.stageType=FileWriter
writer.nextStage=null
writer.outputFile=${system.getProperty("user.dir")}/stats.html
The structure of this file is regular except for the entryStage. This property has to be present and its value is the name of the pipelinestage that is the starting point for this pipeline. If this property is not provided, Babeldoc cannot process this pipeline.
The rest of the properties in this pipeline stage definition file configure the 5 pipeline stages:
This factory builds pipelines from an XML document that completely describes all elements of a pipeline. The schema document for it is found in the directory readme/schema. The two areas of the pipeline definition document are the static area and the dynamic area. The static area is optional and describes each of the types of pipeline stages available. The dynamic area is mandatory. It describes each of the pipeline stages in the system, their configuration options and the connections between them. The document structure is illustrated below:
pipelines
    static [0..1]
    dynamic [1]
        stage-instances [1..*]
            configuration [0..*]
        connections [1]
Example 2.4. XML Pipeline
The demonstration pipeline, demo, is defined using an XML pipeline stage factory. This file is given below:
<?xml version="1.0"?>
<pipeline xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.babeldoc.com/xsd/pipeline.xsd">
  <documentation>This is a demonstration babel pipeline</documentation>
  <pipeline-name>some-name</pipeline-name>
  <dynamic>
    <entry-stage>entry</entry-stage>
    <!-- STAGES: Defines the stages -->
    <stage-inst>
      <stage-name>entry</stage-name>
      <stage-desc>This does nothing</stage-desc>
      <stage-type>Null</stage-type>
    </stage-inst>
    <stage-inst>
      <stage-name>extract</stage-name>
      <stage-desc>this extracts stuff</stage-desc>
      <stage-type>XpathExtract</stage-type>
      <option>
        <option-name>XPath</option-name>
        <option-value></option-value>
        <sub-option>
          <option-name>documentId</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/DocumentId/text()
          </option-value>
        </sub-option>
        <sub-option>
          <option-name>senderId</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/SenderId/text()
          </option-value>
        </sub-option>
        <sub-option>
          <option-name>documentType</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/DocumentType/text()
          </option-value>
        </sub-option>
        <sub-option>
          <option-name>documentVersion</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/DocumentVersion/text()
          </option-value>
        </sub-option>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>transform</stage-name>
      <stage-desc>this transforms stuff</stage-desc>
      <stage-type>XslTransform</stage-type>
      <option>
        <option-name>transformationFile</option-name>
        <option-value>
          ${system.getProperty("user.dir")}/test/quickstart/foo.xsl
        </option-value>
      </option>
      <option>
        <option-name>bufferSize</option-name>
        <option-value>2048</option-value>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>choose</stage-name>
      <stage-desc>this chooses stuff</stage-desc>
      <stage-type>Router</stage-type>
      <option>
        <option-name>tracked</option-name>
        <option-value>true</option-value>
      </option>
      <option>
        <option-name>nextStage</option-name>
        <option-value></option-value>
        <sub-option>
          <option-name>emailer</option-name>
          <option-value><![CDATA[
            #if(${document.get("smtpHost")})
            true
            #end
          ]]></option-value>
        </sub-option>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>emailer</stage-name>
      <stage-desc>this emails stuff</stage-desc>
      <stage-type>SmtpWriter</stage-type>
      <option>
        <option-name>smtpHost</option-name>
        <option-value>$document.get("smtpHost")</option-value>
      </option>
      <option>
        <option-name>smtpTo</option-name>
        <option-value>$document.get("smtpTo")</option-value>
      </option>
      <option>
        <option-name>smtpFrom</option-name>
        <option-value>$document.get("smtpFrom")</option-value>
      </option>
      <option>
        <option-name>smtpSubject</option-name>
        <option-value>Document: Ticket: ${ticket.getValue()}</option-value>
      </option>
      <option>
        <option-name>smtpMessage</option-name>
        <option-value>
          <![CDATA[${system.get("os.name")} - ${system.get("os.arch")} - ${system.get("os.version")}
Message:
${document.toString()}
]]></option-value>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>writer</stage-name>
      <stage-desc>this writes stuff</stage-desc>
      <stage-type>FileWriter</stage-type>
      <option>
        <option-name>outputFile</option-name>
        <option-value>${system.getProperty("user.dir")}/out1.xml</option-value>
      </option>
      <option>
        <option-name>doneFile</option-name>
        <option-value>.done</option-value>
      </option>
    </stage-inst>
    <!-- Define the connections between stages -->
    <connection>
      <source>entry</source>
      <sink>extract</sink>
    </connection>
    <connection>
      <source>extract</source>
      <sink>transform</sink>
    </connection>
    <connection>
      <source>transform</source>
      <sink>choose</sink>
    </connection>
    <connection>
      <source>choose</source>
      <sink>writer</sink>
    </connection>
    <connection>
      <source>emailer</source>
      <sink>writer</sink>
    </connection>
    <connection>
      <source>writer</source>
      <sink>null</sink>
    </connection>
  </dynamic>
</pipeline>
Babeldoc is capable of spawning multiple threads to process multiple pipelines in parallel and to process documents within each pipeline in parallel. This has important consequences for large-scale computing systems. This is an advanced concept; please skip this section if you feel it is too advanced.
A processor determines how a pipeline handles documents which are returned by a pipeline stage. There are pipeline stages which produce multiple documents from a single input document. The XPathSplit is such a pipeline stage. The standard way that Babeldoc operates is that each of the resultant documents from the pipeline stage is processed in turn. It is also possible to process the resultant documents in parallel.
The following processors are available:
Synchronously process the pipeline documents. Each document is processed serially - no new threads are created.
Asynchronously process the pipeline documents using a threadpool. This is probably the most useful in a multithreaded environment.
Name | Type | Number | Description |
---|---|---|---|
poolSize | integer | 0..1 | The number of threads in the thread pool. This sets the maximum number of documents to process at one time. Default is 5. |
keepAlive | integer | 0..1 | The number of milliseconds that an idle thread in the threadpool will remain alive before being reclaimed. Default is 15000. |
The standard processor is the sync processor. This can be overridden if necessary. The processor for each pipeline is given in the pipeline/config.properties file. This is specified by: pipeline-name.processor.type=processor-type.
Example 2.5. Using another pipeline stage processor
This example is also provided in the Babeldoc distribution as 'threads'. The following is a simple pipeline definition found in the file pipeline/pipeline.properties.
entryStage=ffconvert
ffconvert.stageType=FlatToXml
ffconvert.flatToXmlFile=flatfile.xml
ffconvert.nextStage=splitter
splitter.stageType=XpathSplitter
splitter.XPath=/big-un/row
splitter.nextStage=writer
splitter.threaded=true
splitter.maxThreads=7
writer.stageType=FileWriter
writer.outputFile=out.txt
writer.nextStage=null
This simple pipeline definition accepts a text file, converts it to XML, then splits the XML using the XPath expression /big-un/row. The resultant documents are all written to the same file, out.txt.
There are three declared pipelines, all using the same pipeline definition. This is found in the file pipeline/config.properties below:
pipeline.type=simple
pipeline.configFile=pipeline/pipeline
asyncpipeline.type=simple
asyncpipeline.configFile=pipeline/pipeline
asyncpipeline.processor.type=async
asyncpipeline.processor.maxThreads=4
pooledpipeline.type=simple
pooledpipeline.configFile=pipeline/pipeline
pooledpipeline.processor.type=threadpool
pooledpipeline.processor.poolSize=10
The three pipelines: pipeline, asyncpipeline and pooledpipeline all illustrate the various processor configurations possible.
A feeder is a software strategy for getting documents into Babeldoc. The following feeders are available:
The configuration of each of the feeders is done using the configuration file feeder/config. Babeldoc comes with the following feeders:
# The generic feeders: synchronous
sync.type=synchronous
# The generic feeders: asynchronous - with an in-memory queue
async.type=asynchronous
async.queue=memory
# The "specific" feeders: asynchronous - with disk queue
async-d.type=asynchronous
async-d.queue=disk
async-d.queueDir=/tmp
async-d.queueName=async-d
The async feeders accept an additional parameter, poolSize, which limits the thread pool size and thus the maximum number of pipelines that can run in parallel.
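Building on the feeder configuration above, limiting an asynchronous feeder's parallelism might look like the following sketch (the value 4 is arbitrary):

```properties
# Hypothetical: allow at most 4 pipelines to run in parallel for this feeder
async.type=asynchronous
async.queue=memory
async.poolSize=4
```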
There are a limited number of types of pipeline stages. Each of the stages performs a single function. The options available through the configurations change the operation of the stage. In order for your custom pipeline to do any useful work, you have to configure the pipeline stages. You can also create your own custom pipeline stage for specialized processing. See the documentation for each stage type.
Allows a pipeline to call another pipeline. This pipeline stage is very useful in that it allows for modular pipeline configurations. The result of the called pipeline is either used in place of the current pipeline document or discarded, depending on the setting of the discardResults configuration option.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
callStage | string | 1..1 | Pipeline to call |
discardResults | boolean | 0..1 | Discard the pipeline document from the called stage. |
test | boolean | 0..1 | If this option is set and it evaluates to true, the call is made; otherwise the call is skipped. |
Compress the document using either zip or gzip compression. **EXPERIMENTAL**
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
compressType | enumeration | 0..1 | Compression type (zip or gzip) |
Decompress the document using either zip or gzip compression. **EXPERIMENTAL**
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
compressType | enumeration | 0..1 | Compression type (zip or gzip) |
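The gzip variant of these two stages behaves like a standard java.util.zip round trip. The following is a minimal sketch of that behaviour, not Babeldoc's own code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    /** Compress document contents the way the compress stage (gzip) would. */
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    /** Decompress them again, as the decompress stage would. */
    public static byte[] decompress(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "pipeline document contents".getBytes();
        System.out.println(new String(decompress(compress(original))));
    }
}
```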
Cryptography helper
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
operation | enumeration | 0..1 | Encryption or decryption |
transformation | string | 0..1 | The encryption transform type |
useSessionKey | string | 0..1 | Use a session key |
sessionKeyFile | directory-path | 0..1 | File in which the session key is stored |
sessionKeyAlgorithm | string | 0..1 | Algorithm used to generate the session key |
sessionKeySize | integer | 0..1 | size of the session key |
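As a rough illustration of the session-key options, this is how a session key would be generated and used with the standard javax.crypto API. The comments map onto the option names in the table above; the stage's actual behaviour may differ:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class SessionKeyDemo {
    /** Encrypt, then decrypt, with a freshly generated session key. */
    public static String roundTrip(String plaintext) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES"); // cf. sessionKeyAlgorithm
        kg.init(128);                                      // cf. sessionKeySize
        SecretKey key = kg.generateKey();

        Cipher cipher = Cipher.getInstance("AES");         // cf. transformation
        cipher.init(Cipher.ENCRYPT_MODE, key);             // cf. operation
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes());

        cipher.init(Cipher.DECRYPT_MODE, key);
        return new String(cipher.doFinal(ciphertext));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("secret document"));
    }
}
```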
Domify the document contents (assumed to be XML) and save as an attribute on the pipeline document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
validate | boolean | 0..1 | Validate the XML. Default is false. |
schemaFile | directory-path | 0..1 | The schema file to validate against |
Adds attributes to the document. The value of the attribute can be a constant value or a velocity script.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
enrichScript | null | 0..n | List of enrichment attributes to add to the document |
This pipeline stage allows external applications to be run. Optionally, the pipeline document contents are piped to the application as standard input, or the output of the application can be read back as a new pipeline document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
application | directory-path | 1..1 | Full path to the application to run |
pipeOutDocument | boolean | 0..1 | Pipe the current document to the script - the script must fully consume the standard input, otherwise an exception is thrown. Default is false. |
pipeInResponse | boolean | 0..1 | Pipe the response into the document attribute ExternalApplicationResponse. Default is false. |
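The piping behaviour can be pictured with plain java.lang.ProcessBuilder. This sketch illustrates the documented behaviour only, not the stage's implementation, and assumes a POSIX `cat` command is available:

```java
import java.io.OutputStream;

public class ExternalAppDemo {
    /** Run a command, pipe stdin to it, and return its output (trimmed). */
    public static String run(String[] command, String stdin) throws Exception {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        try (OutputStream out = p.getOutputStream()) {  // cf. pipeOutDocument
            out.write(stdin.getBytes());
        }
        String response = new String(p.getInputStream().readAllBytes()); // cf. pipeInResponse
        p.waitFor();
        return response.trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(new String[]{"cat"}, "document body"));
    }
}
```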
Writes the document to a disk file. The contents are written as binary or text data depending on the binary flag on the document. When the pipeline document has been written to disk, this stage can optionally create a 'done' file which could act as a flag file for external processes indicating that the output file is completely written.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
append | boolean | 0..1 | Append the data to the existing file |
outputFile | directory-path | 0..1 | Output filename |
doneFile | directory-path | 0..1 | Write the "done" file when the document is written. This can act as a flag for other disk scanning processes |
encoding | string | 0..1 | Name of charset used to write file |
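The write-then-flag sequence the stage performs can be sketched with java.nio.file (this is an illustration, not Babeldoc's code; the file names are made up):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskWriteDemo {
    /** Write the document first, then create the empty 'done' flag file. */
    public static void write(Path outputFile, byte[] contents, Path doneFile) throws Exception {
        Files.write(outputFile, contents);
        if (doneFile != null) Files.createFile(doneFile);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("babeldoc-demo");
        write(dir.resolve("out.txt"), "hello".getBytes(), dir.resolve("out.done"));
        // external disk-scanning processes would look for out.done before reading out.txt
        System.out.println(Files.exists(dir.resolve("out.done")));
    }
}
```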
Convert this flat document to an XML document
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
flatToXmlFile | directory-path | 0..1 | Flat file conversion specification XML file |
Write the document to an FTP server using the FTP protocol. This enables pipelines to distribute documents over the internet using this well-supported protocol.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
ftpHost | string | 0..1 | FTP hostname or ip address |
ftpUsername | string | 0..1 | FTP username to login with |
ftpPassword | string | 0..1 | FTP password to authenticate with |
ftpFolder | string | 0..1 | The name of the folder on the FTP server |
ftpFilename | string | 0..1 | The filename under which the document is stored on the FTP server |
Act as an HTTP client and fetch the result as a new document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
method | string | 0..1 | HTTP method |
URL | url | 0..1 | URL |
queryString | null | 0..n | Query parameters |
followRedirects | boolean | 0..1 | Follow redirects |
http1.1 | boolean | 0..1 | HTTP 1.1 |
strictMode | boolean | 0..1 | Strict mode |
headers | null | 0..n | Headers |
parameters | null | 0..n | Post parameters |
fileParameters | null | 0..n | Post file parameters |
splitAttributes | boolean | 0..1 | Add the old document's attributes to the new document after the httpClient call |
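In plain JDK terms the stage's GET behaviour resembles the following sketch. It uses java.net.HttpURLConnection, with a throwaway local com.sun.net.httpserver endpoint for the demo; none of this is Babeldoc's own code:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class HttpGetDemo {
    /** Fetch a URL and return the response body as the new document contents. */
    public static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");            // cf. the method option
        conn.setInstanceFollowRedirects(true);   // cf. followRedirects
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) body.append(line);
            return body.toString();
        }
    }

    /** Demo: serve "pong" locally, then fetch it. */
    public static String demo() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", ex -> {
            byte[] resp = "pong".getBytes();
            ex.sendResponseHeaders(200, resp.length);
            ex.getResponseBody().write(resp);
            ex.close();
        });
        server.start();
        try {
            return get("http://127.0.0.1:" + server.getAddress().getPort() + "/");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```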
Uses the java.beans.XMLDecoder class to deserialize the document contents into Java objects.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
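The class named here is the standard JDK one, so the decoding step can be illustrated with a self-contained round trip: encode a bean with XMLEncoder, then decode it back the way the stage would decode the document contents. This is a demonstration of the JDK API, not of the stage's internals:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class XmlDecodeDemo {
    /** Encode a bean to the XMLEncoder format, then decode it back. */
    public static Object roundTrip(Object bean) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(buf)) {
            enc.writeObject(bean);
        }
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(buf.toByteArray()))) {
            return dec.readObject();
        }
    }

    public static void main(String[] args) {
        java.util.ArrayList<String> list = new java.util.ArrayList<>();
        list.add("hello");
        System.out.println(roundTrip(list));
    }
}
```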
This pipeline stage writes a message into the journal that can be viewed with the journal tool (babeldoc journal). Please note that journal entries should be one line long and contain no quotes, commas, or newlines. If these characters are detected, they will be translated into their HTML equivalents to prevent 'bad things' from happening to the journal tool. However, the output from the journal tool will most likely not be what you are expecting.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
message | string | 1..1 | The message to write to the Journal |
Format the pipeline document using JTidy. This is used to "clean-up" HTML documents into well-formed documents.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
indent-spaces | integer | 0..1 | default indentation |
wrap | integer | 0..1 | default wrap margin |
wrap-attributes | boolean | 0..1 | wrap within attribute values |
wrap-script-literals | boolean | 0..1 | wrap within JavaScript string literals |
wrap-sections | boolean | 0..1 | wrap within <![ ... ]> section tags |
wrap-asp | boolean | 0..1 | wrap within ASP pseudo elements |
wrap-jste | boolean | 0..1 | wrap within JSTE pseudo elements |
wrap-php | boolean | 0..1 | wrap within PHP pseudo elements |
literal-attributes | boolean | 0..1 | if true attributes may use newlines |
tab-size | integer | 0..1 | tab size; default is 4 |
markup | boolean | 0..1 | if true normal output is suppressed |
quiet | boolean | 0..1 | no 'Parsing X', guessed DTD or summary |
tidy-mark | boolean | 0..1 | add meta element indicating tidied doc |
indent | boolean | 0..1 | indent content of appropriate tags |
indent-attributes | boolean | 0..1 | newline+indent before each attribute |
hide-endtags | boolean | 0..1 | suppress optional end tags |
input-xml | boolean | 0..1 | treat input as XML |
output-xml | boolean | 0..1 | create output as XML |
output-xhtml | boolean | 0..1 | output extensible HTML |
add-xml-pi | boolean | 0..1 | add <?xml?> for XML docs |
add-xml-decl | boolean | 0..1 | add <?xml?> for XML docs |
assume-xml-procins | boolean | 0..1 | if set to yes PIs must end with ?> |
raw | boolean | 0..1 | avoid mapping values > 127 to entities |
uppercase-tags | boolean | 0..1 | output tags in upper not lower case |
uppercase-attributes | boolean | 0..1 | output attributes in upper not lower case |
clean | boolean | 0..1 | remove presentational clutter |
logical-emphasis | boolean | 0..1 | replace i by em and b by strong |
word-2000 | boolean | 0..1 | draconian cleaning for Word2000 |
drop-empty-paras | boolean | 0..1 | discard empty p elements |
drop-font-tags | boolean | 0..1 | discard presentation tags |
enclose-text | boolean | 0..1 | if true text at body is wrapped in <p>'s |
enclose-block-text | boolean | 0..1 | if yes text in blocks is wrapped in <p>'s |
add-xml-space | boolean | 0..1 | if set to yes adds xml:space attr as needed |
fix-bad-comments | boolean | 0..1 | fix comments with adjacent hyphens |
split | boolean | 0..1 | create slides on each h2 element |
break-before-br | boolean | 0..1 | output a newline before <br> or not |
numeric-entities | boolean | 0..1 | use numeric entities |
quote-marks | boolean | 0..1 | output " marks as &quot; |
quote-nbsp | boolean | 0..1 | output non-breaking space as entity |
quote-ampersand | boolean | 0..1 | output naked ampersand as &amp; |
write-back | boolean | 0..1 | if true then output tidied markup |
keep-time | boolean | 0..1 | if yes the last modified time is preserved |
show-warnings | boolean | 0..1 | however errors are always shown |
error-file | string | 0..1 | file name to write errors to |
slide-style | string | 0..1 | style sheet for slides |
new-inline-tags | string | 0..1 | new inline tags |
new-blocklevel-tags | string | 0..1 | new block level tags |
new-empty-tags | string | 0..1 | new empty tags |
new-pre-tags | string | 0..1 | new pre tags |
char-encoding | integer | 0..1 | character encoding; default is ASCII |
doctype | string | 0..1 | user specified doctype |
fix-backslash | boolean | 0..1 | fix URLs by replacing \ with / |
gnu-emacs | boolean | 0..1 | if true format error output for GNU Emacs |
smart-indent | boolean | 0..1 | does text/block level content affect indentation |
alt-text | string | 0..1 | default text for alt attribute |
Null stage. This do-nothing stage is useful in certain situations like a tracking placeholder or just a placeholder for some future pipeline stage.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
Load the contents of the file, completely overwriting the current document's contents with the file's contents.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
file | directory-path | 1..1 | The filename or URL to the object to read. |
Route this document to a number of specified stages. This stage would be used to specialize processing based on some criterion very much like an if-else statement. Usually the criteria used would be an attribute on the document like time of processing, filename, etc but could be a script. The nextStage complex parameter must evaluate to the literal 'true'. If more than one of the nextStages resolves to true, then the document is routed to each of those stages. If none of the matches are made, the regular nextStage configuration option is used. This provides the 'else' part.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
nextStage | null | 0..n | Stage name to route to if the script resolves to 'true'. Each of the matching nextStages will be routed. |
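The if-else behaviour described above can be sketched in a few lines. This is a model of the documented routing rules, not Babeldoc's implementation; the stage names and attribute tests are made up:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RouterSketch {
    /** Route to every stage whose test matches; fall back to the default ('else'). */
    public static List<String> route(Map<String, Predicate<Map<String, String>>> routes,
                                     String defaultStage,
                                     Map<String, String> docAttributes) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> e : routes.entrySet())
            if (e.getValue().test(docAttributes))
                targets.add(e.getKey());                 // all matches are routed
        if (targets.isEmpty()) targets.add(defaultStage); // the 'else' part
        return targets;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Map<String, String>>> routes = new LinkedHashMap<>();
        routes.put("xmlStage", doc -> doc.get("fileName").endsWith(".xml"));
        routes.put("bigStage", doc -> Integer.parseInt(doc.get("size")) > 1000);
        System.out.println(route(routes, "defaultStage",
                Map.of("fileName", "order.xml", "size", "10")));
    }
}
```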
Write an item entry to an RSS Channel
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
channelFile | directory-path | 1..1 | RSS File to process |
channelSize | integer | 1..1 | Maximum number of items in the RSS Channel |
itemDescription | string | 1..1 | Item Description |
itemLink | string | 1..1 | Item Link |
itemTitle | string | 1..1 | Item Title |
Execute a user supplied script. This pipeline stage enables pipeline developers to create and manipulate documents in novel and unforeseen ways.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
language | enumeration | 1..1 | Scripting language - supported as per Apache BSF - Default is javascript |
script | multiline | 0..1 | Script to be executed |
scriptFile | directory-path | 0..1 | Script file to be processed |
This stage performs digital signing or signature verification.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
operation | enumeration | 1..1 | Type of operation that should be performed |
keyStoreFile | directory-path | 1..1 | Absolute or relative file path to the keystore file |
keyStoreType | string | 0..1 | Type of the keystore |
keyStorePass | string | 1..1 | Password of the keystore |
signatureFile | directory-path | 0..1 | File path of the signature file. The signature is saved here when signing, or loaded from here when verifying |
signatureAttribute | string | 0..1 | Document attribute where signature will be stored when signing or loaded from if verifying |
verifiedAttribute | string | 0..1 | Document attribute where result of verify operation will be saved |
algorithm | string | 1..1 | Signature algorithm used for performing operations |
keyAlias | string | 1..1 | Alias of the private key used for signing |
keyPassword | string | 0..1 | Password of the private key used for signing if key is protected with password |
certificateAlias | string | 1..1 | Alias of the certificate (public key) used for verifying signature |
Email the document using the SMTP protocol. This will allow for documents to be transmitted via email to a number of recipients. The document is normally the body of the email but could also be an attachment.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
smtpHost | string | 1..1 | The SMTP host to communicate with |
smtpFrom | string | 1..1 | The email address of the sender |
smtpTo | string | 1..1 | The email address to send the email to |
smtpSubject | string | 1..1 | The subject line of the email |
smtpMessage | string | 0..1 | The body message of the email |
filesToAttach | string | 0..1 | The list of files to attach to this email |
attachDocument | boolean | 0..1 | true if the document should be sent as an attachment. Default is false |
documentFileName | string | 0..1 | The name of the attached document |
format | enumeration | 0..1 | The mail format - text/plain or text/html - Default is text/plain |
Send the document to a SOAP service.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
soapUrl | url | 0..1 | URL for the SOAP service |
soapAction | string | 0..1 | SOAP action |
resultStage | string | 1..1 | Stage that receives the SOAP result |
responseDoc | boolean | 0..1 | Return SOAP service response as an attribute |
authentication | boolean | 0..1 | Post soap document with authentication |
username | string | 0..1 | User id for authentication |
password | string | 0..1 | Password for authentication |
Send the pipeline document contents to a tcp/ip socket. This is useful for low-level operations.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
hostName | string | 1..1 | The name of the host |
hostIp | string | 1..1 | The ip address of the host |
port | integer | 1..1 | The TCP port to connect to |
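The low-level operation the stage performs is just a raw socket write. The sketch below models that with java.net.Socket, adding a throwaway local server so the demo is self-contained; it is an illustration, not Babeldoc's code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketSendDemo {
    /** Send the document contents to host:port, as the stage would. */
    public static void send(String host, int port, byte[] contents) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(contents);
        }
    }

    /** Demo helper: receive one line on a local throwaway server while send() runs. */
    public static String roundTrip(String message) throws Exception {
        final String[] got = new String[1];
        try (ServerSocket server = new ServerSocket(0)) {
            Thread receiver = new Thread(() -> {
                try (Socket c = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(c.getInputStream()))) {
                    got[0] = in.readLine();
                } catch (IOException ignored) { }
            });
            receiver.start();
            send("127.0.0.1", server.getLocalPort(), (message + "\n").getBytes());
            receiver.join();
        }
        return got[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello"));
    }
}
```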
Enrich documents with values based on SQL queries.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
resourceName | string | 1..1 | Name of the resource that contains Database Connection |
attributeSql | null | 0..n | List of attribute names containing SQL queries that each return a single value. The attribute receives the value returned by its query. If the query returns multiple cells, only the first column of the first row is used. |
sqlScript | null | 0..n | List of scripts that may return multiple columns (but a single row). An attribute is created for each column; the name of the attribute is the column name and the value is the column value. Script names should be unique but are otherwise unimportant. |
Creates an XML file from a SQL query
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
resourceName | string | 1..1 | Name of the resource that contains Database Connection |
sql | null | 0..n | List of SQL queries whose results are written into the output XML document |
Executes the specified SQL statement
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
resourceName | string | 1..1 | Name of the resource that contains Database Connection |
useBatch | boolean | 0..1 | Use JDBC SQL batching - depends on the driver support |
batchSize | integer | 0..1 | The batch size if applicable |
sql | string | 1..1 | The SQL statement to execute |
failOnFirst | boolean | 0..1 | Set to true if the pipeline should not attempt subsequent SQL statements if a statement fails |
messageTag | string | 1..1 | The message tag to search for if the statement fails - this is then logged instead of the SQL error message |
Render the SVG XML document to a binary image.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
transcode | enumeration | 1..1 | The transcoder (output image format) to use |
width | integer | 0..1 | Width of the output image |
height | integer | 0..1 | Height of the output image |
quality | integer | 1..1 | Quality of the transcoding expressed as a percentage |
This stage uses Velocity to templatize the document. The results of the operation will replace the original template.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
Converts Microsoft Excel files to XML format. This creates a regular XML output document of workbooks, rows and cells. The XML encoding can be configured if necessary.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding string for the output XML. By default it is UTF-8. |
attributes | multiline | 0..1 | Attributes |
locale | string | 0..1 | Locale which should be used for formatting numbers and dates from Excel workbook. If not specified, default Locale will be used. |
Use XPath expressions to extract nodes from the document and store them as attributes on the document. This pipeline stage is widely used when data needs to be extracted from XML documents for routing or calculation steps. The extracted attributes can be quickly and easily obtained using Velocity's $document.get and from the scripting stages. Routing decisions based on the document contents are also possible using this technique.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
XPath | null | 0..n | The name of each XPath configuration option is the name of the attribute to assign to the document; its value is the XPath expression to evaluate |
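A minimal sketch of such a stage as a property file. The stage name, the `stageType` value, and the exact key layout are assumptions; the option-name-becomes-attribute behavior is taken from the table:

```properties
# Hypothetical extraction stage "extract"; stageType value is an assumption.
# Each remaining key names a document attribute; its value is the XPath.
extract.stageType=xpathExtract
extract.nextStage=route
extract.orderId=/order/@id
extract.customer=/order/customer/name
```

The extracted attributes would then be available to later stages, for example via Velocity's $document.get("orderId") as described above.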
Split the XML document using xpath expressions. This will result in a number of documents being forwarded to the next stage. This is useful when each of the split nodes represents a document that needs to be actioned. An example would be splitting out each of the orders from an XML document that is a collection of orders.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
xmlOmitDecl | boolean | 0..1 | Omit the XML PI declaration from the output document |
xmlIndent | boolean | 0..1 | Indent the output document |
XPath | string | 1..1 | The XPath expression to use to split the document |
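A property-file sketch of a splitter stage, using the options from the table. The stage name and `stageType` value are assumptions:

```properties
# Hypothetical splitter stage: each /orders/order node becomes a document.
split.stageType=xmlSplit
split.nextStage=processOrder
split.XPath=/orders/order
split.xmlIndent=true
split.xmlOmitDecl=false
```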
Apply an XSL:FO transformation to the document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
outputType | integer | 0..1 | The buffer size to use |
Transform the document (which has to be XML) using this XSL script. The script can access all of the Babeldoc internals via a number of parameters. The parameters (accessed through the xsl:param element) which are always placed in the transformer are: pipelinestage and document. Other parameters may be placed on the transformer using the param option.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
transformationFile | directory-path | 0..1 | The filename or URL to the XSL transformation file. If this is a file, then the XSL will be cached. If the file is modified, then the XSL document will be reloaded. |
transformationScript | multiline | 0..1 | An inline XSL document that could be used instead of the file option above. This will be cached. |
param | null | 0..n | Complex configuration parameter (of form stage-name.param.param-name-n=param-value) of xsl:params that will be placed in the XSL transformer. This can significantly aid transformation tasks. |
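A property-file sketch of an XSL stage. The `param` key form (stage-name.param.param-name=param-value) is taken from the table; the stage name, `stageType` value, and file path are assumptions:

```properties
# Hypothetical XSL stage "toInvoice".
toInvoice.stageType=xsl
toInvoice.nextStage=render
toInvoice.transformationFile=config/xsl/invoice.xsl
# Becomes available in the stylesheet as <xsl:param name="company"/>
toInvoice.param.company=Acme Ltd
```

Inside the stylesheet, the always-present pipelinestage and document parameters described above are also accessible through xsl:param elements.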
Crack this document as a zip archive
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
Babeldoc has a configurable error-handling mechanism. When an exception occurs, it is handled by the default error handler; if that is not suitable, you can override it by specifying a custom error handler for a pipeline stage. The default Babeldoc error handler performs the following steps:
To use some other error handler, write your own error handler class. Your class should implement the interface com.babeldoc.core.pipeline.IPipelineStageErrorHandler. You will also need to supply the name of your error handler class in the pipeline configuration.
You can store the whole document, with all its attributes, in the journal at a given pipeline stage. This is done by setting the configuration option tracked to true for the pipeline stage you want to track. The document will then be stored in the journal. However, the attributes are not guaranteed to be stored along with the document; this depends on the journal implementation. Also, all attributes are saved as Strings. If you want to use the journal's replay operation, you should set this option on one of the preceding stages so that the replayer is able to recreate the document.
There can be situations when you don't want a stage to be processed. You can skip it by setting the configuration option ignored to true. For example, you may have a stage for unzipping files with a zip extension, but you don't want it to run on files that are not zip files. In this situation you can set ignored to true (using Velocity scripting based on the file extension) and no processing will be performed in that stage.
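A sketch of the zip-extension case above. The stage names and, in particular, the inline Velocity syntax for option values are assumptions; only the idea of driving `ignored` from the file extension comes from the text:

```properties
# Hypothetical unzip stage: skipped unless the incoming file name ends in ".zip".
# The Velocity expression form is an assumption.
unzip.stageType=unzip
unzip.nextStage=store
unzip.ignored=#if($document.get("file_name").endsWith(".zip"))false#{else}true#end
```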
The pipeline can be accessed using the pipeline commandline tool. This allows for the inspection of the pipeline, the stages, the configuration options and connectivity options.
There are a number of options in this tool. Use the -h option to get the complete list.
Table of Contents
Resources in Babeldoc are a generalized way of accessing data sources. Resources are also considered to be scarce, in that they have to be protected from leakage. This is particularly important for database and J2EE resources. Resources are identified by a unique string name. They are defined in the config/resource/config.xml file. Each resource name maps to a specific class name which governs the policy of the resource and is programmatically specified. The available resources are:
Each of the named resources is defined in config/resources as resource-name.properties. Each properties file has a required name/value pair called type, which can be one of the types listed above. The rest of the configuration options are specific to the type of resource and are given below:
Each simple jdbc resource defines a connection to a database using the following configuration options:
For this to work correctly, the JDBC jar file for the specific database you need to access must be on the classpath. Currently only the MySQL driver jar is built into Babeldoc; access to other databases such as Oracle, DB2 and Sybase requires that the JDBC driver libraries be placed on the CLASSPATH. The value of the dbDriver parameter and the form of the dbUrl are highly dependent on the particular vendor database. There are a number of limitations to the simple jdbc resource, the primary one being that it does not pool connections: each time a connection is requested, a new connection is created, which can be VERY time consuming.
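A sketch of a simple jdbc resource file for MySQL. The `type` value and the user/password key names are assumptions; `dbDriver` and `dbUrl` are the parameter names mentioned in the text:

```properties
# Hypothetical resource file config/resources/orderDb.properties.
# "simplejdbc", "dbUser" and "dbPassword" are assumed names.
type=simplejdbc
dbDriver=com.mysql.jdbc.Driver
dbUrl=jdbc:mysql://localhost:3306/babeldoc
dbUser=babeldoc
dbPassword=secret
```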
This resource is useful when running in a J2EE container. This allows for accessing datasources using JNDI. There is a single configuration option:
Babeldoc provides a pooled connection using the Apache Commons DBCP library. The configuration for the resource is provided by the following configuration options:
Table of Contents
The journal keeps track of documents as they move through the system as well as the status of each operation performed on the document. The primary purpose of the journal is to provide a safe environment for the processing of documents. There are a number of mission critical situations where losing data is not acceptable. It is possible to recreate document processing if an error condition should arise. Errors can be both external and internal. Internal problems could be temporary database errors, disk space, etc. External causes could be erroneous documents, network outages, etc.
Each document is associated with a JournalTicket which is assigned uniquely just as the document enters the pipeline. Each operation upon a document for a JournalTicket (hereafter also referred to as a ticket) is performed at a step. Steps start at zero and increase until the document is finished processing. Each operation (or pipelinestage) on a document can be uniquely identified by a combination of a ticket and a step.
A journal operation indicates what happened in the journal for the document at that pipelinestage. This is essential for determining problems with document processing. There are a number of journal operations available:
The implementation of the journal depends on your specific circumstances. There are currently three implementations that are available. Which specific journal to use is defined in the configuration file: config/journal/config.properties. The journal to be used is set in the single name/value pair: journalType. The options are:
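Selecting an implementation is then a one-line setting. The `journalType` option name and the `simple` implementation are named in the text; treat this only as a sketch of the file:

```properties
# config/journal/config.properties: pick the journal implementation.
journalType=simple
```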
The simple journal implements its operations as disk files and directories. It is not intended as a robust, enterprise-level implementation, and it lacks structured query functions. Its configuration file is config/journal/config.properties. This file has a number of configuration options:
For each operation logged to the journal, a line is written to the journal log file. The lines are comma-separated values (CSV) and can be parsed by third-party applications. The columns are:
For each ticket, a directory is created whose name is the value of the ticket (this is a long string of numbers - it is actually the time, in milliseconds, at which the ticket was created). Inside this directory there are step delta files, one for each step in the log for that ticket. The contents of a delta file may be the status string or the document itself (if the operation is updateDocument). The document is persisted as an object serialization.
It is possible to use a database to store the journal log and the document data. Currently Oracle and MySQL are supported. The schema creation scripts are in the directory readme/sql. The document data is stored as binary data (BLOBs); each vendor supports BLOBs slightly differently, hence the database-specific support. There are three main tables involved in storing the journal data (the table table_key is for unique key generation):
The configuration for the MySQL, Oracle, PostgreSQL, and SQL Server journals is stored in the configuration file config/journal/sql/config.properties. The only configuration option in this file, resourceName, indicates the name of the resource that will manage the database connection. Currently the journal is implemented in a separate schema (or instance) from the other database storage areas (user and console).
The intent of this journal implementation is to store the operation journal in a J2EE container. Currently JBoss is explicitly supported, but not to the exclusion of other containers. This implementation is really a shell around either the simple or sql journal implementation, running in a remote server. By this means it is possible to move the journal operation to a central location. The configuration for the ejb implementation is stored in the configuration file:
The journal tool allows access to the journal from the command line. This enables complex queries to be applied against the journal. There are four separate types of queries:
There are a number of options which can change the display of the data from the tool - use the -h command-line option to get all the options for this tool.
Table of Contents
The scanner is a tool that scans for messages from a variety of sources; when a message is found, it is fed into the pipeline. The scanner is an automation tool, in that a system can be built up using scanners and pipelines. This is an alternative to the process script, which feeds a single document into the pipeline when run. The scanner is currently capable of scanning a directory in a filesystem, a mailbox on a mail server, an FTP server, a web server, a database via a SQL query, external application output and a JMS queue. The scan period and the pipeline to feed, as well as other specific configuration options, are all set in the config/scanner/config.properties file. There may be one or many scanning threads active, each configured differently. For example, one scanner thread could be polling a mailbox once every 60 seconds while another is scanning a directory every 10 seconds. The scanner is also capable of scanning based on a schedule, specified in the same way as cron on UNIX systems.
General attributes available are file_name, scan_path and scan_date.
The scanner tool is started by running the command babeldoc scanner. This command will use configuration from config/scanner/config.properties. If you want to use configuration from a different file, use the -s another_configuration switch to specify the configuration that should be used instead of the default one.
There are two kinds of configuration options available:
The options for each scanner type are laid out below.
The directory scanner is used for scanning directories on the local file system. It can be configured to scan subdirectories of a given folder recursively, and it can use filters to include or exclude files from scanning (i.e. inclusion and exclusion parameters). This is very useful for integrating Babeldoc into larger systems. An example would be reading documents placed in a directory by another application running on the same computer, or by another computer writing to a shared, networked filesystem.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (DirectoryScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
inDirectory | directory-path | 1..n | Directory to be scanned |
doneDirectory | directory-path | 0..1 | Folder that is used for storing scanned files. Note that scanned files will be removed from inDirectory |
includeSubfolders | boolean | 0..1 | Specifies whether scanning should be recursive and include subfolders. If so, files will be copied to doneDirectory with a path relative to inDirectory. |
filter | string | 0..1 | Regular expression filter; only files that match will be included. If not specified, all files are included |
minimumFileAge | integer | 0..1 | Minimum age of file in ms (attempts to guard against incomplete reads) |
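A sketch of a directory scanner entry using the options above. The scanner name, the key-prefix form, and the paths are assumptions; note that backslashes in the filter regex may need doubling depending on the properties loader:

```properties
# Hypothetical scanner "orders": poll a spool directory every 10 seconds.
orders.type=DirectoryScanner
orders.period=10000
orders.pipeline=demo
orders.inDirectory=/var/spool/babeldoc/in
orders.doneDirectory=/var/spool/babeldoc/done
orders.filter=.*\.xml
orders.minimumFileAge=5000
```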
The null scanner feeds a null document into the pipeline every time it runs. This is useful for scheduling.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (null) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
The HttpScanner allows the scanner to pull down documents from web servers. Any headers received by the HttpScanner are placed on the document as attributes.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (HttpScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
url | url | 1..n | URL to get the document from |
attempts | integer | 0..1 | Number of times to attempt to get the document |
user | string | 0..1 | Authenticate with this user name |
password | string | 0..1 | Authenticate with this password |
realm | string | 0..1 | Authenticate with this realm |
proxyHost | url | 0..1 | URL of the proxy host |
proxyPort | integer | 0..1 | proxy port |
timeout | integer | 0..1 | retry timeout |
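A sketch of an HttpScanner entry polling a web server once a minute. The scanner name, key-prefix form, and URL are assumptions; the option names come from the table:

```properties
# Hypothetical scanner "prices": fetch a document from a web server every 60 s.
prices.type=HttpScanner
prices.period=60000
prices.pipeline=demo
prices.url=http://example.com/prices.xml
prices.attempts=3
prices.user=reader
prices.password=secret
```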
The MailboxScanner is used for scanning mail servers for e-mail messages. Documents can be scanned from the e-mail body or from attachments. This is very useful for integration with email-enabled clients; an example would be purchase orders emailed to a mailbox scanned by Babeldoc. The From, To and Subject filters are regular expression filters: enter regular expressions which, if matched, cause the matching email to be processed. For example, if you wanted to match a recipient address of first.last@server.com, you would enter "first\.last@server\.com" in the toFilter. The expressions are effectively OR'd together: if any one of the filters matches, the e-mail message will be processed. The toFilter is tested against all addresses in the TO field; it is NOT tested against the CC or BCC fields. Accessible attributes are subject, from, to and replyTo.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (MailboxScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
host | string | 0..1 | Mail server host name or address |
protocol | string | 0..1 | Protocol which is used for connecting to mail server (pop3, imap...) |
timeOut | integer | 0..1 | Socket I/O timeout value in milliseconds. Default is infinite timeout. |
folder | string | 0..1 | Name of the folder on the mail server (for example, INBOX) |
username | string | 1..n | Username for logging in to the mail server |
password | string | 1..n | Password for logging in to the mail server |
getFrom | enumeration | 0..1 | Whether the message should be created from the mail body or from an attachment. Default is body |
fromFilter | string | 1..n | Regular expression which, if matched by the From field, causes the message to be processed |
toFilter | string | 1..n | Regular expression which, if matched by the To field, causes the message to be processed |
subjectFilter | string | 1..n | Regular expression which, if matched by the Subject field, causes the message to be processed |
fromFilterResult | boolean | 0..1 | Result of the regular expression (true or false) |
toFilterResult | boolean | 0..1 | Result of the regular expression (true or false) |
subjectFilterResult | boolean | 0..1 | Result of the regular expression (true or false) |
deleteInvalid | boolean | 1..n | Delete messages that are not valid (invalid address, etc.) and are not processed by Babeldoc |
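The toFilter example from the text ("first\.last@server\.com") behaves like any Java regular expression: the escaped dots match literal dots rather than any character. A small standalone demonstration (the filters' exact matching semantics inside Babeldoc are assumed to follow java.util.regex):

```java
import java.util.regex.Pattern;

public class ToFilterDemo {
    public static void main(String[] args) {
        // The toFilter value from the text, with dots escaped so "." is literal.
        String toFilter = "first\\.last@server\\.com";
        Pattern p = Pattern.compile(toFilter);

        // Matches the intended recipient address.
        System.out.println(p.matcher("first.last@server.com").find());

        // Does not match a different address, since the escaped dots
        // only match literal "." characters.
        System.out.println(p.matcher("firstXlast@serverXcom").find());
    }
}
```

An unescaped filter such as first.last@server.com would also match addresses like firstXlast@serverXcom, which is why the text recommends escaping the dots.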
The SQL scanner is used for generating documents by executing SQL queries. It can produce XML documents, CSV documents or simple documents. In the case of XML and CSV, only one document is returned and it contains all returned rows. In the case of simple documents, each document is formed from the first column of each returned row.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (SqlScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
resourceName | string | 1..n | Name of the connection resource |
sqlStatement | string | 1..n | SQL statement that is executed to get documents |
updateStatement | string | 0..1 | SQL Statement that is executed after selecting rows and creating documents. It is used for marking rows as processed so they don't need to be processed later. |
documentType | enumeration | 0..1 | Type of document that is returned. Choices are simple, xml or csv |
cvsFieldSeparator | string | 0..1 | Character used to separate fields in the CSV output. Default is a comma |
csvRowSeparator | string | 0..1 | Character used to separate rows in the CSV output. Default is \n |
xmlHeadingTag | string | 0..1 | Tag that is used in XML document for heading. Default is document |
xmlRowTag | string | 0..1 | Tag that is used in XML document for each row. Default is row. |
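A sketch of a SqlScanner entry using the options above. The scanner name, key-prefix form, table and column names are assumptions; the pattern of selecting rows and then marking them processed via updateStatement is taken from the table:

```properties
# Hypothetical scanner "outbox": select unprocessed rows as one XML document,
# then mark them as processed so they are not picked up again.
outbox.type=SqlScanner
outbox.period=30000
outbox.pipeline=demo
outbox.resourceName=orderDb
outbox.sqlStatement=SELECT id, body FROM outbox WHERE processed = 0
outbox.updateStatement=UPDATE outbox SET processed = 1 WHERE processed = 0
outbox.documentType=xml
outbox.xmlHeadingTag=outbox
outbox.xmlRowTag=row
```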
Scans a given folder on a remote FTP server. This allows Babeldoc to connect to remote FTP servers and scan folders for documents to process.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (FtpScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
ftpHost | string | 1..n | Host name or address of the ftp server |
ftpUsername | string | 1..n | Username that is used for connecting to host |
ftpPassword | string | 1..n | Password that is used for connecting to host |
ftpFolder | string | 1..n | Folder name which is scanned |
includeSubfolders | boolean | 1..n | Should subfolders be scanned too |
ftpOutFolder | string | 0..1 | Folder on FTP server where scanned documents should be copied |
localBackupFolder | directory-path | 0..1 | Folder on local file system where scanned documents should be copied |
filter | string | 0..1 | Regular expression filter; only files that match will be included |
maxDepth | string | 0..1 | Maximum depth of subfolders to scan |
The ExternalApplicationScanner runs an external application and pipes the standard output from that application into the pipeline.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (ExternalApplicationScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
application | string | 1..n | The application to run. |
Table of Contents
Flat-file ASCII data is produced by a number of modern and legacy systems. Examples of flat-file data include CSV; COBOL copybooks; positional data (data items placed in a two-dimensional layout, at given columns and rows, each occupying a fixed number of characters); and repeating groups (groups of data which repeat based on markers found in the document).
The flat-file conversion is governed by a conversion configuration file that conforms to the schema readme/schema/conversion.xsd, which clearly describes the various configuration options.
Each input document is considered to be split into two major parts:
The header is the first part of the conversion XML document. It describes characteristics of the input document (type of conversion, line-ending character, the number of lines in a paragraph, lines from the top of the document to the first paragraph, lines between paragraphs, etc.) and of the output document (the root element and the row element name).
The paragraphs in the input document represent the lines of data that are of interest to be mapped to the output XML document. Each paragraph may consist of one or more lines, each line consisting of one or more characters up to the end-of-line character. Each paragraph maps to a sub-root element in the output document. Each field in the paragraph is represented either by a position and a width in characters, in a positional document, or by a column number in a CSV document. These fields are represented by sub-row elements in the output document, i.e. in XPath: /root/paragraph/field.
There are two basic kinds of paragraphs: segmented and non-segmented lines.
Non-segmented lines are lines whose output paragraph XML element does not change based on the presence of data in the input document. There are three types of non-segmented input documents:
This is the simplest document. Each paragraph is a line of comma-separated values. Each field is specified by a column number and a field name; the name is the sub-row element to emit for the data found at that column number.
This is a positional document where each line of the input document represents a paragraph. Each field in the line is specified by an offset (starting from zero) into the line, the width of the field, and the field name of the sub-row element to emit.
This is a positional document where each paragraph consists of a number of lines. Each field is specified by a line offset into the paragraph (from the top of the paragraph), an offset from the left margin, character width and a field name. This is useful for screen scraping operations where the screen height and width (usually 80x24) represents the paragraph.
The premise with segmented lines is that the input file may contain some value which indicates the kind of data on that line, acting as a marker. This is specified as a column/width and a value to match. Once a line has been identified, it is possible to then perform either a single-line paragraph conversion or a CSV conversion on it. There is an optional nesting element to output when a segment is matched - this is situated between the row and field elements.
The conversion XML document is divided into two sections: the header and the conversion information. The basic format is:
| Element | Description |
|---|---|
| `conversion` | The root element of the configuration document |
| `header` | Describes the output and input documents |
| `output-document` | The output document description |
| `root-element` | The root element of the output document |
| `row-element` | The row element for each of the input paragraphs |
| `input-document` | The input document description |
| `conversion-type` | One of: `line`, `csv`, `para`, `segmented-line` |
| `line-ending` | The line ending characters - this is currently ignored |
| `field-separator` | For CSV files, the separator characters |
| `inter-skip` | The number of lines to skip between paragraphs |
| `top-skip` | The number of lines to skip before the first paragraph is encountered |
| `left-margin` | Characters to skip from the beginning of the line to the first character of interest in the paragraph |
| `lines-per-para` | The number of lines in each paragraph; used for establishing the chunk size |
| `fields` | Holds the field definitions; the contents depend on the conversion type |
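Putting the elements above together, a skeleton conversion document might look as follows. This is a sketch only: the exact nesting and element names should be verified against readme/schema/conversion.xsd.

```xml
<conversion>
  <header>
    <output-document>
      <root-element>orders</root-element>
      <row-element>order</row-element>
    </output-document>
    <input-document>
      <conversion-type>csv</conversion-type>
      <field-separator>,</field-separator>
      <top-skip>1</top-skip>
    </input-document>
  </header>
  <!-- field definitions for the chosen conversion type go here -->
</conversion>
```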
The contents of the fields element given above depend on which type of flat file is to be converted. There are three types:
These files are often used when exporting data from spreadsheet applications. Each column of data is separated by a comma, and cells may be enclosed in quotation marks to escape text.
| Element | Description |
|---|---|
| `field` [1..*] | There can be one or more fields |
| `field-name` | The name of the output field element |
| `field-number` | The number of the CSV column |
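As a hypothetical example, an input line such as `ACME,42` could be mapped with two field definitions (whether `field-number` is zero- or one-based should be checked against the schema):

```xml
<csv-fields>
  <field>
    <field-name>customer</field-name>
    <field-number>1</field-number>
  </field>
  <field>
    <field-name>quantity</field-name>
    <field-number>2</field-number>
  </field>
</csv-fields>
```

This would produce a row element along the lines of `<order><customer>ACME</customer><quantity>42</quantity></order>`.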
These files consist of lines of data; each line corresponds to a row of data. The fields of data are positionally arranged in the line. For instance, an order reference could exist at column 15 with a width of 10.
| Element | Description |
|---|---|
| `line-fields` | Holds the fields |
| `field` [1..*] | There can be one or more fields |
| `field-name` | The name of the output field element |
| `field-column` | The character number of the column |
| `field-width` | The width in characters |
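To make the column/width arithmetic concrete, here is a minimal sketch in Java (illustrative only, not Babeldoc source; the class and method names are invented) of extracting a fixed-width field such as the order reference described above:

```java
// Illustrative sketch: extracting a fixed-width field from a positional line.
public class LineFieldDemo {

    /** Extract a field starting at a zero-based column with the given width. */
    public static String extractField(String line, int column, int width) {
        if (column >= line.length()) {
            return "";                              // field lies beyond this line
        }
        int end = Math.min(column + width, line.length());
        return line.substring(column, end).trim();  // trim positional padding
    }

    public static void main(String[] args) {
        // Example layout: columns 0-8 item code, columns 14-23 order reference
        String line = "ITEM-0001     ORD-42     WIDGET";
        System.out.println(extractField(line, 14, 10)); // prints "ORD-42"
    }
}
```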
These files consist of regular groups of lines. Fields exist at a particular row and column in the array of lines, and each field consists of a number of characters, that is, a width. This is very similar to the line-based conversion above, except that the data is a two-dimensional array. This is useful for screen scrapes, etc.
| Element | Description |
|---|---|
| `para-fields` | Element to hold the paragraph fields |
| `field` [1..*] | There can be one or more fields |
| `field-name` | The name of the output field element |
| `field-column` | The character number of the column |
| `field-row` | The line (from the top of the paragraph) of the field |
| `field-width` | The width in characters |
Segments are the method of mutating the output based on key fields in the input data.
| Element | Description |
|---|---|
| `line-segments` | Element to hold all of the segments |
| `segment` [1..*] | Segments (there can be one or more) |
| `segment-name` | The name of the output segment element |
| `segment-column` | The column of the segment marker |
| `segment-width` | The width of the segment marker |
| `segment-value` | The value of the segment to match |
| `begin-group-name` | The name of the element to begin the group |
| `csv-fields` \| `line-fields` | The field definitions for the matched segment, in either CSV or positional form |
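A hypothetical segment definition, matching lines whose first three characters are the marker `INV` and then applying a positional conversion to them (element names illustrative; verify against readme/schema/conversion.xsd):

```xml
<line-segments>
  <segment>
    <segment-name>invoice</segment-name>
    <segment-column>0</segment-column>
    <segment-width>3</segment-width>
    <segment-value>INV</segment-value>
    <begin-group-name>invoices</begin-group-name>
    <line-fields>
      <field>
        <field-name>invoice-number</field-name>
        <field-column>3</field-column>
        <field-width>10</field-width>
      </field>
    </line-fields>
  </segment>
</line-segments>
```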
This chapter collects together some of the accumulated knowledge of those using Babeldoc. The intention is to save you time if you are trying to perform these or similar tasks. Please contribute your nuggets of information - they can help others.
Prerequisites: Eclipse 3.0 build M4 or later. Anything earlier will not work.
You should now have a happy Eclipse system showing all the source modules and the libraries. Eclipse should not show any errors detected by the background compiler. However, there will be a stack of warnings; they can, for the moment, be ignored.
Now to get ant working.
A Console view should now appear, and the ant output will be spooled into it.
As there is no equivalent to SQLEnrich for XML, it is not obvious how to get an attribute from an external file and then revert to the original document. One way to do this is to store the current document as an attribute, process the second file, and then revert the document to the value of the attribute:
```properties
doc2attrib.stageType=Scripting
doc2attrib.nextStage={Stages that load other document etc.}
doc2attrib.script=document.put("originalContent", document.getBytes());

attrib2doc.stageType=Scripting
attrib2doc.nextStage={Continue with processing}
attrib2doc.script=document.setBytes(document.get("originalContent"));
```
Essentially you can use `document.get("myprop")` from within the stylesheet:
```xml
<xsl:param name="doc" select="$document"/>
<xsl:param name="myprop" select="java:get($doc, 'myprop')"/>
```
For the syntax, see the Java section of http://xml.apache.org/xalan-j/extensions.html.
Additionally, you can get the pipeline stage object from the XSL and then call the Java code directly.
The snippet below is an example of how to get the current time and format it nicely:
```xml
<xsl:variable name="date" select="java:java.util.Date.new()"/>
<xsl:variable name="seconds" select="java:getTime($date)"/>
<xsl:variable name="velocity"
    select="java:com.babeldoc.core.VelocityUtilityContext.new()"/>
<xsl:variable name="datestr"
    select="java:getFormattedDate($velocity, 'd MMM yyyy HH:mm:ss', $seconds)"/>
```
The idea of this HOWTO is to avoid distributing all the directories that make up your configuration: instead, package them up into a single jar file and use that to run your pipelines.

Let's assume that your BABELDOC_USER points to the c:\project directory. This directory contains all the required configuration directories, such as pipeline, resource, etc.
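Assuming a standard JDK jar tool is on the path, the configuration directories could be packaged along these lines (the jar name is illustrative, and how Babeldoc picks the configuration up from the classpath should be checked for your version):

```shell
# From the BABELDOC_USER directory (c:\project in this example),
# package the configuration directories into a single jar:
jar cvf babeldoc-config.jar pipeline resource
# Then place babeldoc-config.jar on the classpath when running Babeldoc.
```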
Copyright (c) 2000 The Apache Software Foundation. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The end-user documentation included with the redistribution, if any, must include the following acknowledgment:
"This product includes software developed by the Apache Software Foundation (http://www.apache.org/).
Alternately, this acknowledgment may appear in the software itself, if and wherever such third-party acknowledgments normally appear.
4. The names "Apache" and "Apache Software Foundation" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact apache@apache.org.
5. Products derived from this software may not be called "Apache", nor may "Apache" appear in their name, without prior written permission of the Apache Software Foundation.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
====================================================================
This software consists of voluntary contributions made by many individuals on behalf of the Apache Software Foundation. For more information on the Apache Software Foundation, please see http://www.apache.org.
Portions of this software are based upon public domain software originally written at the National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign.
====================================================================