If you are using Babeldoc, please participate in its development. This can range from just sending a hello, through posting bug reports, to actually contributing code. We want to hear from you on the Babeldoc forums. There is also a mailing list hosted on SourceForge, babeldoc-devel, that will be very useful for technical users of Babeldoc. Please join for the latest news.
Babeldoc is a document processing system. Babeldoc is especially suited for Business-to-Business (B2B) environments and similar integration projects. Babeldoc has a flexible and reconfigurable processing pipeline through which documents flow. These pipelines transform the document. Additionally, Babeldoc has a sophisticated and extensible journaling system so that documents may be reprocessed and resubmitted as well as tracked through the system. Its runtime environment is flexible, so that it can run standalone, in a web container or in a J2EE container (currently tested on JBoss 2.4.x and JBoss 3.0.x). Babeldoc has a Web console, a number of GUI tools and a command-line console to control the document flow.
The flow-based metaphor is an appropriate one that will be expanded on in this document.
Babeldoc can be used to process documents flowing into and out of a system. There are three basic ways to develop applications to handle these document flows:
How about a system that can be implemented in any of these ways but is still centrally managed and configured? Additionally, how about a system that seems like (1) but is actually implemented as a server architecture (3)? Babeldoc can be configured to run in any one of these three ways.
Additionally, document processing sometimes fails, and then the issue is how you react. Babeldoc has a very sophisticated journaling function that allows administrators to reintroduce documents at any place in the pipeline.
Babeldoc comes ready to run with demonstration pipelines right out of the box. There are two pipelines configured: demo and test. They provide an example of how to construct a pipeline using the simple configuration style (property files) and the more sophisticated XML-based configuration. There are additional pipelines configured, including one to recreate the documentation you are currently reading in a number of different formats.
In addition, the usage whitepaper, also found in the readme directory, provides a walk-through of two usage scenarios.
In order for these examples to run correctly, please ensure that the following instructions are followed:
Download and install the JRE. Babeldoc requires version 1.4 or greater; it may be possible to run parts of Babeldoc using 1.3, but this is not supported. Ensure that java is in your path. This is essential. To test this, execute the command java from the command line. If you get a "command not found" error, you will need to add the jre/bin directory to your path. It is also useful to have JAVA_HOME set, or you may receive warning messages.
The BABELDOC_HOME and the PATH environment must be set. This is done as follows:
It is assumed that you have installed Babeldoc in the c:\babeldoc directory. If it is in another directory, please change accordingly.
The BABELDOC_HOME and the PATH environment must be set. This is done as follows:
It is assumed that you have installed Babeldoc in the /opt/babeldoc directory. If it is in another directory, please change accordingly. It is quite possible that you will have installed it in a non-privileged location.
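The required settings can be sketched as a pair of shell commands (assuming a Bourne-style shell and the /opt/babeldoc location above; adjust the path if Babeldoc is installed elsewhere):

```shell
# Assumed install location from the text above; change if yours differs
export BABELDOC_HOME=/opt/babeldoc
# Put the Babeldoc launcher scripts on the PATH
export PATH=$PATH:$BABELDOC_HOME/bin
```

These lines can be placed in your shell profile (e.g. ~/.profile) so they take effect in every session.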
Open a command window. Ensure that the paths are correct for your platform. Run the command:
The output from this command should look similar to this:
*** This is Babeldoc! ***
Usage: babeldoc <command>
where command must be one of:
xls2xml, scanmon, addstagewiz, setupwiz, lightconfig, sqlload, pipeline, setentrywiz, journal, process, journalbrowser, flat2xml, scanner, guiprocess, pipelinebuilder, babelfish, module.
Babeldoc 1.3.0 Copyright (C) 2002,2003,2004 The Babeldoc Team!!
Babeldoc comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it under certain conditions
If your output is not like this or you get an error, please check the paths and the JRE requirements.
Jumping right in, you can run the demonstration pipelines:
Example 1.1. Running the demonstration pipeline
You will see a number of logging messages scroll over the screen, and in the current directory find a file named stats.html. Take a look at this file using your favorite browser - to many this is a more pleasant way to look at sports scores. This is the output file from the processing of the input file, stats.xml. It is interesting to note that the file stats.xml does not actually reside in the filesystem as a file. It is in the Babeldoc core Java archive.
<2002-07-25 23:37:51,279> <root> <INFO> Process stage: entry
<2002-07-25 23:37:52,692> <root> <INFO> Process stage: transform
<2002-07-25 23:37:53,382> <root> <INFO> Process stage: choose
<2002-07-25 23:37:53,400> <root> <INFO> Process stage: writer
Processed. Ticket: 1027654671023 assigned
Note that this is an example of what your output looks like. The ticket number will be different. Use your ticket number in the following examples.
Note that some versions may not print the ticket number. The ticket number can be found by running babeldoc journal -L, or with the Journal Browser tool by running babeldoc journalbrowser.
The pipeline can be inspected using the pipeline tool. To see the options, simply type:
Example 1.2. Inspecting the pipeline
The options for the pipeline tool will be printed to the screen. Please experiment with this tool. To interrogate the configuration of, say, the entry stage, issue the command:
Notice the common syntax for accessing pipeline-stages:
The journal tracks documents moving through the pipelines. It can also track the changes to the documents as well as the status of each stage in the pipeline. The tool to access the journal functionality from the command-line is:
Example 1.3. Inspecting the Journal
The journal tool is primarily suited to querying the journal data. Please experiment with all the options. For now, though, review all the steps that occurred during the processing in 1.2.1. To do this, you must use the ticket number printed during your session (not 1027654671023 as below).
This will result in the following output:
ticket: step: 0; date: Thu Jul 25 23:53:43 EDT 2002; stage: null; op: newTicket; other: null
ticket: step: 1; date: Thu Jul 25 23:53:43 EDT 2002; stage: test.entry; op: updateDocument; other:
ticket: step: 2; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.extract; op: updateStatus; other: success
ticket: step: 3; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.transform; op: updateStatus; other: success
ticket: step: 4; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.choose; op: updateDocument; other:
ticket: step: 5; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.choose; op: updateStatus; other: success
ticket: step: 6; date: Thu Jul 25 23:53:45 EDT 2002; stage: test.writer; op: updateStatus; other: success
This listing shows that the document resulted in seven recorded steps in the journal (steps 0 through 6). Most of these steps are updateStatus steps, which merely record the processing status of the document at the corresponding stage in the pipeline. The updateDocument steps indicate points in the pipeline where the entire document was stored. At these steps it is possible to extract the document and display its contents (journal -D) or even reprocess the document (journal -R). Display the document at step 4 (journal -D 1027654671023.4).
It is possible to toggle the document tracking at any stage in a pipeline. This is done by setting the tracked flag in the pipeline configurations. Please review the pipeline documentation.
Babeldoc is a modular piece of software. Each of the modules in Babeldoc successively adds to and refines its operation. Although a module participates in both the runtime and the build of Babeldoc, this document is concerned with the runtime aspects of modules.
Example 1.4. Listing the modules
There is a command-line tool to list the current set of modules known to Babeldoc:
This will list the modules and their dependencies:
Module: core is dependent on:
Module: web is dependent on: core
Module: gui is dependent on: core
Module: crypto is dependent on: core
Module: xslfo is dependent on: core
Module: soap is dependent on: web, core
Module: sql is dependent on: core
Module: scanner is dependent on: core, sql
Module: babelfish is dependent on: core
Module: conversion is dependent on: core
It is possible to remove and add modules to Babeldoc. The current set of modules is installed in the directory $BABELDOC_HOME/lib. The standard modules are named babeldoc_module-name (for example, babeldoc_core.jar).
All configuration data within Babeldoc is handled in a structured fashion. Every configuration key must be contained in a configuration file. Configuration files are hierarchically arranged, much like regular filesystem files in directories and subdirectories. The "directory" part of a configuration file name is specified using UNIX-style forward slashes separating directory names. The configuration key is a string which must be unique within a configuration file. An example is the configuration key Journal.simple, which is defined in the configuration file service/query.properties.
The configuration implementation of Babeldoc is itself configurable, but the default implementation is the LightConfig. This stores the configuration data in properties files which are then hierarchically arranged into directories. The files may be stored on the local filesystem or in archive (JAR) files.
The LightConfig implementation also has the very interesting and sometimes perplexing ability to merge configuration files with the same name into a single configuration file. This means that configuration file data does not overwrite other data except where the configuration key is identical; in that case, the configuration file specified at the end of the configuration search path is dominant. This is logical and is consistent with how the PATH (or CLASSPATH) environment variable is used by the command processor to search for executables, except that instead of the first match overriding all else, all of the matches are merged into a single "file".
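As a hypothetical illustration of this merge behavior (the pipeline names and values here are invented for the sketch), suppose the same configuration file exists in two places on the search path:

```properties
# In babeldoc_core.jar (earlier on the search path): pipeline/config.properties
orders.type=simple
orders.configFile=pipeline/simple/orders

# In the local configuration directory (later on the search path):
# pipeline/config.properties
orders.configFile=pipeline/simple/myorders
invoices.type=simple
```

The merged result contains all of the keys from both files; orders.configFile takes the locally defined value pipeline/simple/myorders, because for identical keys the file later on the search path is dominant.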
The configuration searchpath is very important and is used by Babeldoc to determine where to find the configuration data and how to load it. The parts of the search path are given below:
There are times when the configuration is not working as expected. There is a small command-line tool which makes it easier to inspect the configuration files and see how each configuration key is modified. The tool, lightconfig, is illustrated below:
Example 1.5. Listing configuration data
The location of the configuration file pipeline/config.properties in each part of the configuration search path is then listed. A typical output would be:
Listing urls for the configuration: pipeline/config.properties
1: jar:file:/c:/download/babeldoc/build/lib/babeldoc_core.jar!/core/pipeline/config.properties
0: file:/C:/work/vap_rpt/./pipeline/config.properties
This output indicates that the file pipeline/config.properties exists in the babeldoc_core.jar file and is then overridden in the directory C:/work/vap_rpt.
Example 1.6. Tracing a configuration key
This traces how a particular configuration key (documentation.type) found in the configuration file: pipeline/config.properties is modified in all the possible configuration files. A typical output would be:
Listing urls for the configuration: pipeline/config.properties
1: jar:file:/c:/download/babeldoc/build/lib/babeldoc_core.jar!/core/pipeline/config.properties
documentation.type = simple
0: file:/C:/work/vap_rpt/./pipeline/config.properties
documentation.type: not defined
This output indicates that the configuration key is defined once in the babeldoc_core.jar file and is not subsequently overridden.
This section briefly describes the steps necessary to setup a new Babeldoc project. In the interests of brevity, the following assumptions are made:
The simplest method of configuring your environment is to create a setup batch file in the project directory. This file is usually called setup.bat, but the name is unimportant. The purpose of the file is to configure the local environment so that Babeldoc can run. The contents of this file for this environment are given below:
@echo off
set JAVA_HOME=c:\j2sdk1.4.2_04
set BABELDOC_HOME=c:\babeldoc
set BABELDOC_USER=c:\project
set PATH=%PATH%;%BABELDOC_HOME%\bin
Prior to using Babeldoc, run this script. Now create the configuration files in this directory.
A pipeline is a program whose purpose is to transform a document into one or more resultant documents. An example pipeline could transform a received XML purchase order into a set of SQL statements intended to update a database, produce a printable PDF file for record keeping, and send a confirmation email to the originating party.
All of the pipelines in Babeldoc must have a unique name, like test or document. A pipeline is a set of processing steps arranged in a linear fashion. Each processing step is called a "pipeline stage", and each pipeline stage in a pipeline must have a unique name; two pipelines may have pipeline stages of the same name. There is a special pipeline stage in a pipeline, the entryStage, which indicates which pipeline stage should initially receive the document from the feeder mechanisms. It is also possible to introduce a document into the "middle" of a pipeline. In order to designate a particular stage in a pipeline, the name is given as pipeline-name.stage-name (for example, test.transform).
The pipelines in the Babeldoc system are managed by the pipeline factory, which determines how and when a pipeline runs. Each pipeline stage in a pipeline has a type and a set of configuration options. An example of a pipeline stage is the test.transform pipeline stage, whose type is XslTransform. This type of pipeline stage requires either the configuration option transformationFile, which supplies the filename (or URL) of the XSLT file to perform the transformation, or transformationScript, which is the inline XSLT document. There is an additional non-mandatory configuration option, bufferSize, which can help with larger transformations.
The pipeline stages operate on documents. A useful metaphor is the pipes that constitute the plumbing in your home. Each of the stages in the plumbing pipeline represents bends, faucets and other functional requirements. A pipeline document is the water in the plumbing pipeline. A document is successively transformed by the pipeline until it is finally stored, discarded or otherwise disposed of. The transformations are determined by the pipeline and its stages. A document is primarily a number of bytes (characters) of data, characterized by a MIME type. There are a number of ways a document can get fed into a pipeline, namely:
A document consists of the following components:
The document body may be an XML document, a flat-file text document or a binary document. Significant processing can only be applied to XML documents in the standard Babeldoc. In order to convert flat files to XML documents, there is a conversion pipeline stage which can convert a number of flat-file formats to XML; please see the conversion chapter of this document. The XML functionality is the primary focus of the default Babeldoc distribution. Binary documents are also acceptable; however, Babeldoc does not have many stages that process binary documents. This does not mean that binary processing is not possible - it is. Examples could be processing of photographic images, sound files or even video files.
Attributes are said to enrich the document because they are basically shortcuts to data found in the document itself or from some other source. The attributes can be applied to a document in the following ways:
For instance, the number of purchase orders in a bulk purchase XML document can be extracted from the document (using the XPathExtract pipeline stage) and placed in the attribute named numOrders. This results in significant speed increases in subsequent processing because the attribute (numOrders) can be used instead of multiple expensive XPath operations. The attributes are available through the variable ${document.get("attribute.name")}. This means that it is possible to customize the pipeline processing based on extracted (or enriched) data from the document.
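For example, a hypothetical fragment modeled on the Router syntax used in the test pipeline later in this chapter: the extracted numOrders attribute could steer routing without re-evaluating any XPath (the stage name "bulk" and the numeric comparison are assumptions for this sketch, and depend on the template engine supporting numeric comparison):

```properties
# Hypothetical Router stage: divert bulk documents to a "bulk" stage
# when the extracted numOrders attribute exceeds 100
choose.stageType=Router
choose.nextStage=writer
choose.nextStage.bulk=#if(${document.get("numOrders")} > 100)true#end
```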
Attributes are not limited to data extracted from the document, but can also be options passed into the pipeline along with the document, like the email address to send the document to, the file path that the document was read from, and more besides. These kinds of attributes allow the internal processing of the pipeline to be influenced by the external environment. The command babeldoc process will accept any number of name=value pairs on the command line. Each of the supplied attributes will be placed on the document and will be available to the pipeline stages in the pipeline. In the test pipeline it is possible to email the processed document by supplying the smtpHost, smtpFrom and smtpTo attributes.
Example 2.1. Adding attributes from the command-line
Instead of running the test pipeline as before, we can add a number of attributes to the process command line which will activate "hidden" stages in the pipeline.
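The invocation might look like the following sketch. The name=value pairs are the documented attribute mechanism; the leading arguments (pipeline name and input file) are left as placeholders, since the exact process syntax may differ by version, and the SMTP values are illustrative:

```
babeldoc process <pipeline> <input-file> smtpHost=smtp.example.com \
    smtpFrom=sender@example.com smtpTo=recipient@example.com
```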
Babeldoc has a rather sophisticated data abstraction mechanism. It is just as easy to read a file from your hard disk as it is to load it from your classpath, or even from a website (http://...) or an FTP site (ftp://...). This means that a simple pipeline which works on local files will also work in a networked environment.
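For instance, the transformationFile option of the XslTransform stage (shown in the examples in this chapter) could equally point at any of these sources; the remote URLs below are hypothetical:

```properties
# Local file, relative to the working directory
transform.transformationFile=test/quickstart/stats-html.xsl
# The same stylesheet served from a (hypothetical) web server
#transform.transformationFile=http://www.example.com/xsl/stats-html.xsl
# ...or from a (hypothetical) FTP server
#transform.transformationFile=ftp://ftp.example.com/xsl/stats-html.xsl
```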
The name and configuration options of each pipeline are provided in the file config/pipeline/config.properties. Since this file (like every other configuration file) participates in the Babeldoc configuration system, you will need to create your own copy of it in your configuration directory. Please see the configuration handling described in chapter 1. The handling of pipeline stages is performed by a set of PipelineStageFactories; which PipelineStageFactory handles a pipeline is determined by the pipeline's type. Here are the current pipeline factory types:
This pipeline factory is the simplest to set up. Its pipeline type is simple. This is indicated in the configuration file pipeline/config.properties, which declares the pipeline and provides its type and the actual configuration file that defines the pipeline. For instance, if the pipeline name is test, the entry that sets the type of the test pipeline will be test.type=simple. The pipeline definition configuration file (see more later) for the test pipeline will be given as test.configFile=pipeline/your-config (note: the .properties extension is omitted from the file name).
Example 2.2. Declaring a 'Simple' Pipeline
The configuration file pipeline/config.properties shows how a simple pipeline called test is declared to Babeldoc. A subsequent example will show how the pipeline is defined.
test.type=simple
test.configFile=pipeline/simple/test
Notice that the configuration file for the pipeline is pipeline/simple/test - the actual name of the file is pipeline/simple/test.properties.
In this directory there is a subdirectory called pipeline. Within it lives config.properties (this location is mandated - the PipelineFactory looks for this file; if you do not put your pipeline declarations in this configuration file, they will NOT be found). The declaration of your pipelines is done in this file. The pipeline configuration files themselves may be in the same directory as config.properties or in subdirectories of the pipeline directory - the choice is yours.
The actual definition of the pipeline is provided in the value of the pipeline-name.configFile property which is specified in the pipeline/config.properties file. Each of the pipeline stages within the pipeline are defined here as well as the document flow from one pipeline to the next.
Every simple pipeline definition document must contain the entryStage property. This property informs Babeldoc which pipeline stage is the starting point for the pipeline. If this property is not given in this file, processing of this pipeline results in an error.
Other than the entryStage property, every property in the pipeline definition file is of the form:
pipelinestage-name.option-1...option-n=value
The first part (up to the first period) is the name of the pipeline stage. The subsequent options (period-separated, up to the '=') are arguments to the pipeline stage. There are two kinds of options for each pipeline stage:
Additionally there are mandatory and optional pipeline stage options. The pipeline will fail to run if a mandatory option is not provided. The following are general options:
For the complete list of pipelinestage configuration options, please refer later in this chapter to the list of pipelinestages.
Example 2.3. Defining a 'Simple' Pipeline
The pipeline is defined in a properties file which enumerates the pipelinestage configuration.
entryStage=entry
entry.stageType=Null
entry.nextStage=transform
entry.tracked=true
transform.stageType=XslTransform
transform.nextStage=choose
transform.transformationFile=test/quickstart/stats-html.xsl
transform.bufferSize=2048
choose.stageType=Router
choose.nextStage=writer
choose.tracked=true
choose.nextStage.emailer=#if(${document.get("smtpHost")})true#end
emailer.stageType=SmtpWriter
emailer.nextStage=writer
emailer.smtpHost=$document.get("smtpHost")
emailer.smtpFrom=$document.get("smtpFrom")
emailer.smtpTo=$document.get("smtpTo")
emailer.smtpSubject=Document: Ticket: ${ticket.Value}
emailer.smtpMessage=${document.toString()}
writer.stageType=FileWriter
writer.nextStage=null
writer.outputFile=${system.getProperty("user.dir")}/stats.html
The structure of this file is regular except for the entryStage. This property has to be present and its value is the name of the pipelinestage that is the starting point for this pipeline. If this property is not provided, Babeldoc cannot process this pipeline.
The rest of the properties in this pipeline stage definition file configure the 5 pipeline stages:
This factory builds pipelines from an XML document that completely describes all elements of a pipeline. The schema document for it is found in the directory readme/schema. The two areas of the pipeline definition document are the static area and the dynamic area. The static area is optional and describes each of the types of pipeline stages available. The dynamic area is mandatory. It describes each of the pipeline stages in the system, their configuration options and the connections between them. The document structure is illustrated below:
pipelines
    static [0..1]
    dynamic [1]
        stage-instances [1..*]
            configuration [0..*]
        connections [1]
Example 2.4. XML Pipeline
The demonstration pipeline, demo, is defined using an XML pipeline stage factory. This file is given below:
<?xml version="1.0"?>
<pipeline xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.babeldoc.com/xsd/pipeline.xsd">
  <documentation>This is a demonstration babel pipeline</documentation>
  <pipeline-name>some-name</pipeline-name>
  <dynamic>
    <entry-stage>entry</entry-stage>
    <!-- STAGES: Defines the stages -->
    <stage-inst>
      <stage-name>entry</stage-name>
      <stage-desc>This does nothing</stage-desc>
      <stage-type>Null</stage-type>
    </stage-inst>
    <stage-inst>
      <stage-name>extract</stage-name>
      <stage-desc>this extracts stuff</stage-desc>
      <stage-type>XpathExtract</stage-type>
      <option>
        <option-name>XPath</option-name>
        <option-value></option-value>
        <sub-option>
          <option-name>documentId</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/DocumentId/text()
          </option-value>
        </sub-option>
        <sub-option>
          <option-name>senderId</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/SenderId/text()
          </option-value>
        </sub-option>
        <sub-option>
          <option-name>documentType</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/DocumentType/text()
          </option-value>
        </sub-option>
        <sub-option>
          <option-name>documentVersion</option-name>
          <option-value>
            /AppointmentDocument/DocumentHeader/DocumentVersion/text()
          </option-value>
        </sub-option>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>transform</stage-name>
      <stage-desc>this transforms stuff</stage-desc>
      <stage-type>XslTransform</stage-type>
      <option>
        <option-name>transformationFile</option-name>
        <option-value>
          ${system.getProperty("user.dir")}/test/quickstart/foo.xsl
        </option-value>
      </option>
      <option>
        <option-name>bufferSize</option-name>
        <option-value>2048</option-value>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>choose</stage-name>
      <stage-desc>this chooses stuff</stage-desc>
      <stage-type>Router</stage-type>
      <option>
        <option-name>tracked</option-name>
        <option-value>true</option-value>
      </option>
      <option>
        <option-name>nextStage</option-name>
        <option-value></option-value>
        <sub-option>
          <option-name>emailer</option-name>
          <option-value><![CDATA[
            #if(${document.get("smtpHost")})
            true
            #end
          ]]></option-value>
        </sub-option>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>emailer</stage-name>
      <stage-desc>this emails stuff</stage-desc>
      <stage-type>SmtpWriter</stage-type>
      <option>
        <option-name>smtpHost</option-name>
        <option-value>$document.get("smtpHost")</option-value>
      </option>
      <option>
        <option-name>smtpTo</option-name>
        <option-value>$document.get("smtpTo")</option-value>
      </option>
      <option>
        <option-name>smtpFrom</option-name>
        <option-value>$document.get("smtpFrom")</option-value>
      </option>
      <option>
        <option-name>smtpSubject</option-name>
        <option-value>Document: Ticket: ${ticket.getValue()}</option-value>
      </option>
      <option>
        <option-name>smtpMessage</option-name>
        <option-value>
          <![CDATA[${system.get("os.name")} - ${system.get("os.arch")} - ${system.get("os.version")}
Message:
${document.toString()}
]]></option-value>
      </option>
    </stage-inst>
    <stage-inst>
      <stage-name>writer</stage-name>
      <stage-desc>this writes stuff</stage-desc>
      <stage-type>FileWriter</stage-type>
      <option>
        <option-name>outputFile</option-name>
        <option-value>${system.getProperty("user.dir")}/out1.xml</option-value>
      </option>
      <option>
        <option-name>doneFile</option-name>
        <option-value>.done</option-value>
      </option>
    </stage-inst>
    <!-- Define the connections between stages -->
    <connection>
      <source>entry</source>
      <sink>extract</sink>
    </connection>
    <connection>
      <source>extract</source>
      <sink>transform</sink>
    </connection>
    <connection>
      <source>transform</source>
      <sink>choose</sink>
    </connection>
    <connection>
      <source>choose</source>
      <sink>writer</sink>
    </connection>
    <connection>
      <source>emailer</source>
      <sink>writer</sink>
    </connection>
    <connection>
      <source>writer</source>
      <sink>null</sink>
    </connection>
  </dynamic>
</pipeline>
Babeldoc is capable of spawning multiple threads to process multiple pipelines in parallel and to process documents within each pipeline in parallel. This has important consequences for large-scale computing systems. This is an advanced concept; please skip this section if you feel it is too advanced.
A processor determines how a pipeline handles documents which are returned by a pipeline stage. There are pipeline stages which produce multiple documents from a single input document. The XPathSplit is such a pipeline stage. The standard way that Babeldoc operates is that each of the resultant documents from the pipeline stage is processed in turn. It is also possible to process the resultant documents in parallel.
The following processors are available:
Synchronously process the pipeline documents. Each document is processed serially - no new threads are created.
Asynchronously process the pipeline documents using a threadpool. This is probably the most useful in a multithreaded environment.
Name | Type | Number | Description |
---|---|---|---|
poolSize | integer | 0..1 | The number of threads in the thread pool. This sets the maximum number of documents to process at one time. Default is 5. |
keepAlive | integer | 0..1 | The number of milliseconds that an idle thread in the threadpool will remain alive before being reclaimed. Default is 15000. |
The standard processor is the sync processor. This can be overridden if necessary. The processor for each pipeline is given in the pipeline/config.properties file. This is specified by: pipeline-name.processor.type=processor-type.
Example 2.5. Using another pipeline stage processor
This example is also provided in the Babeldoc distribution as 'threads'. The following is a simple pipeline definition found in the file pipeline/pipeline.properties.
entryStage=ffconvert
ffconvert.stageType=FlatToXml
ffconvert.flatToXmlFile=flatfile.xml
ffconvert.nextStage=splitter
splitter.stageType=XpathSplitter
splitter.XPath=/big-un/row
splitter.nextStage=writer
splitter.threaded=true
splitter.maxThreads=7
writer.stageType=FileWriter
writer.outputFile=out.txt
writer.nextStage=null
This simple pipeline definition accepts a text file, converts it to XML, then splits the XML using the XPath expression /big-un/row. The resultant documents are all written to the same file, out.txt.
There are three declared pipelines, all using the same pipeline definition. This is found in the file pipeline/config.properties below:
pipeline.type=simple
pipeline.configFile=pipeline/pipeline
asyncpipeline.type=simple
asyncpipeline.configFile=pipeline/pipeline
asyncpipeline.processor.type=async
asyncpipeline.processor.maxThreads=4
pooledpipeline.type=simple
pooledpipeline.configFile=pipeline/pipeline
pooledpipeline.processor.type=threadpool
pooledpipeline.processor.poolSize=10
The three pipelines: pipeline, asyncpipeline and pooledpipeline all illustrate the various processor configurations possible.
A feeder is a software strategy for getting documents into Babeldoc. The following feeders are available:
The configuration of each of the feeders is done using the configuration file feeder/config. Babeldoc comes with the following feeders:
# The generic feeders: synchronous
sync.type=synchronous
# The generic feeders: asynchronous - with an in-memory queue
async.type=asynchronous
async.queue=memory
# The "specific" feeders: asynchronous - with disk queue
async-d.type=asynchronous
async-d.queue=disk
async-d.queueDir=/tmp
async-d.queueName=async-d
The async feeders accept an additional parameter, poolSize, which limits the thread pool size and thus the maximum number of pipelines that can run in parallel.
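Building on the feeder configuration above, limiting an asynchronous feeder's parallelism might look like the following sketch (the value 4 is arbitrary):

```properties
# Hypothetical: allow at most 4 pipelines to run in parallel for this feeder
async.type=asynchronous
async.queue=memory
async.poolSize=4
```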
There are a limited number of types of pipeline stages. Each of the stages performs a single function. The options available through the configurations change the operation of the stage. In order for your custom pipeline to do any useful work, you have to configure the pipeline stages. You can also create your own custom pipeline stage for specialized processing. See the documentation for each stage type.
Allows a pipeline to call another pipeline. This pipeline stage is very useful in that it allows for modular pipeline configurations. The result of the called pipeline is either used in place of the current pipeline document or discarded, depending on the setting of the discardResults configuration option.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
callStage | string | 1..1 | Pipeline to call |
discardResults | boolean | 0..1 | Discard the pipeline document from the called stage. |
test | boolean | 0..1 | If this option is set and it evaluates to true, the call is made; otherwise the call is skipped. |
Compress the document using either zip or gzip compression. **EXPERIMENTAL**
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
compressType | enumeration | 0..1 | Compression type (zip or gzip) |
Decompress the document using either zip or gzip compression. **EXPERIMENTAL**
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
compressType | enumeration | 0..1 | Compression type (zip or gzip) |
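The gzip variant of these two stages behaves like a standard java.util.zip round trip. The following is a minimal sketch of that behaviour, not Babeldoc's own code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    /** Compress document contents the way the compress stage (gzip) would. */
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    /** Decompress them again, as the decompress stage would. */
    public static byte[] decompress(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "pipeline document contents".getBytes();
        System.out.println(new String(decompress(compress(original))));
    }
}
```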
Cryptography helper
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
operation | enumeration | 0..1 | Encryption or decryption |
transformation | string | 0..1 | The encryption transform type |
useSessionKey | string | 0..1 | Use a session key |
sessionKeyFile | directory-path | 0..1 | File in which the session key is stored |
sessionKeyAlgorithm | string | 0..1 | Algorithm used to generate the session key |
sessionKeySize | integer | 0..1 | size of the session key |
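As a rough illustration of the session-key options, this is how a session key would be generated and used with the standard javax.crypto API. The comments map onto the option names in the table above; the stage's actual behaviour may differ:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class SessionKeyDemo {
    /** Encrypt, then decrypt, with a freshly generated session key. */
    public static String roundTrip(String plaintext) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES"); // cf. sessionKeyAlgorithm
        kg.init(128);                                      // cf. sessionKeySize
        SecretKey key = kg.generateKey();

        Cipher cipher = Cipher.getInstance("AES");         // cf. transformation
        cipher.init(Cipher.ENCRYPT_MODE, key);             // cf. operation
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes());

        cipher.init(Cipher.DECRYPT_MODE, key);
        return new String(cipher.doFinal(ciphertext));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("secret document"));
    }
}
```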
Domify the document contents (assumed to be XML) and save as an attribute on the pipeline document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
validate | boolean | 0..1 | Validate the XML. Default is false. |
schemaFile | directory-path | 0..1 | The schema file to validate against |
Adds attributes to the document. The value of the attribute can be a constant value or a velocity script.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
enrichScript | null | 0..n | List of enrichment attributes to add to the document |
This pipeline stage allows external applications to be run. Optionally, the pipeline document contents are piped to the application as standard input, or the output of the application can be read back as a new pipeline document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
application | directory-path | 1..1 | Full path to the application to run |
pipeOutDocument | boolean | 0..1 | Pipe the current document to the script - the script must fully consume the standard input, otherwise an exception is thrown. Default is false. |
pipeInResponse | boolean | 0..1 | Pipe the response into the document attribute ExternalApplicationResponse. Default is false. |
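The piping behaviour can be pictured with plain java.lang.ProcessBuilder. This sketch illustrates the documented behaviour only, not the stage's implementation, and assumes a POSIX `cat` command is available:

```java
import java.io.OutputStream;

public class ExternalAppDemo {
    /** Run a command, pipe stdin to it, and return its output (trimmed). */
    public static String run(String[] command, String stdin) throws Exception {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        try (OutputStream out = p.getOutputStream()) {  // cf. pipeOutDocument
            out.write(stdin.getBytes());
        }
        String response = new String(p.getInputStream().readAllBytes()); // cf. pipeInResponse
        p.waitFor();
        return response.trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(new String[]{"cat"}, "document body"));
    }
}
```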
Writes the document to a disk file. The contents are written as binary or text data depending on the binary flag on the document. When the pipeline document has been written to disk, this stage can optionally create a 'done' file which could act as a flag file for external processes indicating that the output file is completely written.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
append | boolean | 0..1 | Append the data to the existing file |
outputFile | directory-path | 0..1 | Output filename |
doneFile | directory-path | 0..1 | Write the "done" file when the document is written. This can act as a flag for other disk scanning processes |
encoding | string | 0..1 | Name of charset used to write file |
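The write-then-flag sequence the stage performs can be sketched with java.nio.file (this is an illustration, not Babeldoc's code; the file names are made up):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskWriteDemo {
    /** Write the document first, then create the empty 'done' flag file. */
    public static void write(Path outputFile, byte[] contents, Path doneFile) throws Exception {
        Files.write(outputFile, contents);
        if (doneFile != null) Files.createFile(doneFile);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("babeldoc-demo");
        write(dir.resolve("out.txt"), "hello".getBytes(), dir.resolve("out.done"));
        // external disk-scanning processes would look for out.done before reading out.txt
        System.out.println(Files.exists(dir.resolve("out.done")));
    }
}
```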
Convert this flat document to an XML document
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
flatToXmlFile | directory-path | 0..1 | Flat file conversion specification XML file |
Write the document to an FTP server using the FTP protocol. This enables pipelines to distribute documents over the internet using this well-supported protocol.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
ftpHost | string | 0..1 | FTP hostname or ip address |
ftpUsername | string | 0..1 | FTP username to login with |
ftpPassword | string | 0..1 | FTP password to authenticate with |
ftpFolder | string | 0..1 | The name of the folder on the FTP server |
ftpFilename | string | 0..1 | The filename under which the document is stored on the FTP server |
Act as an HTTP client and fetch the result as a new document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
method | string | 0..1 | HTTP method |
URL | url | 0..1 | URL |
queryString | null | 0..n | Query parameters |
followRedirects | boolean | 0..1 | Follow redirects |
http1.1 | boolean | 0..1 | HTTP 1.1 |
strictMode | boolean | 0..1 | Strict mode |
headers | null | 0..n | Headers |
parameters | null | 0..n | Post parameters |
fileParameters | null | 0..n | Post file parameters |
splitAttributes | boolean | 0..1 | Add the old document's attributes to the new document after the httpClient call |
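In plain JDK terms the stage's GET behaviour resembles the following sketch. It uses java.net.HttpURLConnection, with a throwaway local com.sun.net.httpserver endpoint for the demo; none of this is Babeldoc's own code:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class HttpGetDemo {
    /** Fetch a URL and return the response body as the new document contents. */
    public static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");            // cf. the method option
        conn.setInstanceFollowRedirects(true);   // cf. followRedirects
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) body.append(line);
            return body.toString();
        }
    }

    /** Demo: serve "pong" locally, then fetch it. */
    public static String demo() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", ex -> {
            byte[] resp = "pong".getBytes();
            ex.sendResponseHeaders(200, resp.length);
            ex.getResponseBody().write(resp);
            ex.close();
        });
        server.start();
        try {
            return get("http://127.0.0.1:" + server.getAddress().getPort() + "/");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```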
Uses the java.beans.XMLDecoder class to deserialize the document contents into Java objects.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
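The class named here is the standard JDK one, so the decoding step can be illustrated with a self-contained round trip: encode a bean with XMLEncoder, then decode it back the way the stage would decode the document contents. This is a demonstration of the JDK API, not of the stage's internals:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class XmlDecodeDemo {
    /** Encode a bean to the XMLEncoder format, then decode it back. */
    public static Object roundTrip(Object bean) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(buf)) {
            enc.writeObject(bean);
        }
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(buf.toByteArray()))) {
            return dec.readObject();
        }
    }

    public static void main(String[] args) {
        java.util.ArrayList<String> list = new java.util.ArrayList<>();
        list.add("hello");
        System.out.println(roundTrip(list));
    }
}
```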
This pipeline stage writes a message into the journal that can be viewed with the journal tool (babeldoc journal). Please note that journal entries should be one line long and contain no quotes, commas, or newlines. If these characters are detected, they will be translated into their HTML equivalents to prevent 'bad things' from happening to the journal tool. However, the output from the journal tool will most likely not be what you are expecting.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
message | string | 1..1 | The message to write to the Journal |
Format the pipeline document using JTidy. This is used to "clean-up" HTML documents into well-formed documents.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
indent-spaces | integer | 0..1 | default indentation |
wrap | integer | 0..1 | default wrap margin |
wrap-attributes | boolean | 0..1 | wrap within attribute values |
wrap-script-literals | boolean | 0..1 | wrap within JavaScript string literals |
wrap-sections | boolean | 0..1 | wrap within <![ ... ]> section tags |
wrap-asp | boolean | 0..1 | wrap within ASP pseudo elements |
wrap-jste | boolean | 0..1 | wrap within JSTE pseudo elements |
wrap-php | boolean | 0..1 | wrap within PHP pseudo elements |
literal-attributes | boolean | 0..1 | if true attributes may use newlines |
tab-size | integer | 0..1 | tab size; default is 4 |
markup | boolean | 0..1 | if true normal output is suppressed |
quiet | boolean | 0..1 | no 'Parsing X', guessed DTD or summary |
tidy-mark | boolean | 0..1 | add meta element indicating tidied doc |
indent | boolean | 0..1 | indent content of appropriate tags |
indent-attributes | boolean | 0..1 | newline+indent before each attribute |
hide-endtags | boolean | 0..1 | suppress optional end tags |
input-xml | boolean | 0..1 | treat input as XML |
output-xml | boolean | 0..1 | create output as XML |
output-xhtml | boolean | 0..1 | output extensible HTML |
add-xml-pi | boolean | 0..1 | add <?xml?> for XML docs |
add-xml-decl | boolean | 0..1 | add <?xml?> for XML docs |
assume-xml-procins | boolean | 0..1 | if set to yes PIs must end with ?> |
raw | boolean | 0..1 | avoid mapping values > 127 to entities |
uppercase-tags | boolean | 0..1 | output tags in upper not lower case |
uppercase-attributes | boolean | 0..1 | output attributes in upper not lower case |
clean | boolean | 0..1 | remove presentational clutter |
logical-emphasis | boolean | 0..1 | replace i by em and b by strong |
word-2000 | boolean | 0..1 | draconian cleaning for Word2000 |
drop-empty-paras | boolean | 0..1 | discard empty p elements |
drop-font-tags | boolean | 0..1 | discard presentation tags |
enclose-text | boolean | 0..1 | if true text at body is wrapped in <p>'s |
enclose-block-text | boolean | 0..1 | if yes text in blocks is wrapped in <p>'s |
add-xml-space | boolean | 0..1 | if set to yes adds xml:space attr as needed |
fix-bad-comments | boolean | 0..1 | fix comments with adjacent hyphens |
split | boolean | 0..1 | create slides on each h2 element |
break-before-br | boolean | 0..1 | output a newline before <br> or not |
numeric-entities | boolean | 0..1 | use numeric entities |
quote-marks | boolean | 0..1 | output " marks as &quot; |
quote-nbsp | boolean | 0..1 | output non-breaking space as entity |
quote-ampersand | boolean | 0..1 | output naked ampersand as &amp; |
write-back | boolean | 0..1 | if true then output tidied markup |
keep-time | boolean | 0..1 | if yes the last modified time is preserved |
show-warnings | boolean | 0..1 | however errors are always shown |
error-file | string | 0..1 | file name to write errors to |
slide-style | string | 0..1 | style sheet for slides |
new-inline-tags | string | 0..1 | new inline tags |
new-blocklevel-tags | string | 0..1 | new block level tags |
new-empty-tags | string | 0..1 | new empty tags |
new-pre-tags | string | 0..1 | new pre tags |
char-encoding | integer | 0..1 | character encoding; default is ASCII |
doctype | string | 0..1 | user specified doctype |
fix-backslash | boolean | 0..1 | fix URLs by replacing \ with / |
gnu-emacs | boolean | 0..1 | if true format error output for GNU Emacs |
smart-indent | boolean | 0..1 | does text/block level content affect indentation |
alt-text | string | 0..1 | default text for alt attribute |
Null stage. This do-nothing stage is useful in certain situations like a tracking placeholder or just a placeholder for some future pipeline stage.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
Load the contents of the file, completely overwriting the current document's contents with the file's contents.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
file | directory-path | 1..1 | The filename or URL to the object to read. |
Route this document to a number of specified stages. This stage would be used to specialize processing based on some criterion very much like an if-else statement. Usually the criteria used would be an attribute on the document like time of processing, filename, etc but could be a script. The nextStage complex parameter must evaluate to the literal 'true'. If more than one of the nextStages resolves to true, then the document is routed to each of those stages. If none of the matches are made, the regular nextStage configuration option is used. This provides the 'else' part.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
nextStage | null | 0..n | Stage name to route to if the script resolves to 'true'. Each of the matching nextStages will be routed. |
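The if-else behaviour described above can be sketched in a few lines. This is a model of the documented routing rules, not Babeldoc's implementation; the stage names and attribute tests are made up:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RouterSketch {
    /** Route to every stage whose test matches; fall back to the default ('else'). */
    public static List<String> route(Map<String, Predicate<Map<String, String>>> routes,
                                     String defaultStage,
                                     Map<String, String> docAttributes) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> e : routes.entrySet())
            if (e.getValue().test(docAttributes))
                targets.add(e.getKey());                 // all matches are routed
        if (targets.isEmpty()) targets.add(defaultStage); // the 'else' part
        return targets;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Map<String, String>>> routes = new LinkedHashMap<>();
        routes.put("xmlStage", doc -> doc.get("fileName").endsWith(".xml"));
        routes.put("bigStage", doc -> Integer.parseInt(doc.get("size")) > 1000);
        System.out.println(route(routes, "defaultStage",
                Map.of("fileName", "order.xml", "size", "10")));
    }
}
```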
Write an item entry to an RSS Channel
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
channelFile | directory-path | 1..1 | RSS File to process |
channelSize | integer | 1..1 | Maximum number of items in the RSS Channel |
itemDescription | string | 1..1 | Item Description |
itemLink | string | 1..1 | Item Link |
itemTitle | string | 1..1 | Item Title |
Execute a user supplied script. This pipeline stage enables pipeline developers to create and manipulate documents in novel and unforeseen ways.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
language | enumeration | 1..1 | Scripting language - supported as per Apache BSF - Default is javascript |
script | multiline | 0..1 | Script to be executed |
scriptFile | directory-path | 0..1 | Script file to be processed |
This stage performs digital signing or signature verification.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
operation | enumeration | 1..1 | Type of operation that should be performed |
keyStoreFile | directory-path | 1..1 | Absolute or relative file path to the keystore file |
keyStoreType | string | 0..1 | Type of the keystore |
keyStorePass | string | 1..1 | Password of the keystore |
signatureFile | directory-path | 0..1 | File path of the signature file. The signature is saved here when signing, or loaded from here when verifying |
signatureAttribute | string | 0..1 | Document attribute where signature will be stored when signing or loaded from if verifying |
verifiedAttribute | string | 0..1 | Document attribute where result of verify operation will be saved |
algorithm | string | 1..1 | Signature algorithm used for performing operations |
keyAlias | string | 1..1 | Alias of the private key used for signing |
keyPassword | string | 0..1 | Password of the private key used for signing if key is protected with password |
certificateAlias | string | 1..1 | Alias of the certificate (public key) used for verifying signature |
Email the document using the SMTP protocol. This will allow for documents to be transmitted via email to a number of recipients. The document is normally the body of the email but could also be an attachment.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
smtpHost | string | 1..1 | The SMTP host to communicate with |
smtpFrom | string | 1..1 | The email address of the sender |
smtpTo | string | 1..1 | The email address to send the email to |
smtpSubject | string | 1..1 | The subject line of the email |
smtpMessage | string | 0..1 | The body message of the email |
filesToAttach | string | 0..1 | The list of files to attach to this email |
attachDocument | boolean | 0..1 | true if the document should be sent as an attachment. Default is false |
documentFileName | string | 0..1 | The name of the attached document |
format | enumeration | 0..1 | The mail format - text/plain or text/html - Default is text/plain |
Send the document to a SOAP service.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
soapUrl | url | 0..1 | URL for the SOAP service |
soapAction | string | 0..1 | SOAP action |
resultStage | string | 1..1 | Stage that receives the SOAP result |
responseDoc | boolean | 0..1 | Return SOAP service response as an attribute |
authentication | boolean | 0..1 | Post soap document with authentication |
username | string | 0..1 | User id for authentication |
password | string | 0..1 | Password for authentication |
Send the pipeline document contents to a tcp/ip socket. This is useful for low-level operations.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
hostName | string | 1..1 | The name of the host |
hostIp | string | 1..1 | The ip address of the host |
port | integer | 1..1 | The TCP port to connect to |
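The low-level operation the stage performs is just a raw socket write. The sketch below models that with java.net.Socket, adding a throwaway local server so the demo is self-contained; it is an illustration, not Babeldoc's code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketSendDemo {
    /** Send the document contents to host:port, as the stage would. */
    public static void send(String host, int port, byte[] contents) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(contents);
        }
    }

    /** Demo helper: receive one line on a local throwaway server while send() runs. */
    public static String roundTrip(String message) throws Exception {
        final String[] got = new String[1];
        try (ServerSocket server = new ServerSocket(0)) {
            Thread receiver = new Thread(() -> {
                try (Socket c = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(c.getInputStream()))) {
                    got[0] = in.readLine();
                } catch (IOException ignored) { }
            });
            receiver.start();
            send("127.0.0.1", server.getLocalPort(), (message + "\n").getBytes());
            receiver.join();
        }
        return got[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello"));
    }
}
```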
Enrich documents with values based on SQL queries.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
resourceName | string | 1..1 | Name of the resource that contains Database Connection |
attributeSql | null | 0..n | List of attribute names containing SQL queries that each return a single value. The attribute receives the value returned by its query. If the query returns multiple cells, only the first column of the first row is used. |
sqlScript | null | 0..n | List of scripts that may return multiple columns (but a single row). An attribute is created for each column; the name of the attribute is the column name and the value is the column value. Script names should be unique but are otherwise unimportant. |
Creates an XML file from a SQL query
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
resourceName | string | 1..1 | Name of the resource that contains Database Connection |
sql | null | 0..n | List of SQL queries whose results are written into the output XML document |
Executes the specified SQL statement
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
resourceName | string | 1..1 | Name of the resource that contains Database Connection |
useBatch | boolean | 0..1 | Use JDBC SQL batching - depends on the driver support |
batchSize | integer | 0..1 | The batch size if applicable |
sql | string | 1..1 | The SQL statement to execute |
failOnFirst | boolean | 0..1 | Set to true if the pipeline should not attempt subsequent SQL statements if a statement fails |
messageTag | string | 1..1 | The message tag to search for if the statement fails - this is then logged instead of the SQL error message |
Render the SVG XML document to a binary image.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
transcode | enumeration | 1..1 | The transcoder (output image format) to use |
width | integer | 0..1 | Width of the output image |
height | integer | 0..1 | Height of the output image |
quality | integer | 1..1 | Quality of the transcoding expressed as a percentage |
This stage uses Velocity to templatize the document. The results of the operation will replace the original template.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
Converts Microsoft Excel files to XML format. This creates a regular XML output document of workbooks, rows and cells. The XML encoding can be configured if necessary.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding string for the output XML. By default it is UTF-8. |
attributes | multiline | 0..1 | Attributes |
locale | string | 0..1 | Locale which should be used for formatting numbers and dates from Excel workbook. If not specified, default Locale will be used. |
Use XPath expressions to extract nodes from the document and store them as attributes on the document. This pipeline stage is widely used when data needs to be extracted from XML documents for routing or calculation steps. The extracted attributes can be quickly and easily obtained using Velocity's $document.get and from the scripting stages. Routing decisions based on the document contents are also possible using this technique.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
XPath | null | 0..n | The name of each XPath configuration option is the name of the attribute to assign to the document; its value is the XPath expression to evaluate |
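A minimal sketch of such a stage as a property file. The stage name, the `stageType` value, and the exact key layout are assumptions; the option-name-becomes-attribute behavior is taken from the table:

```properties
# Hypothetical extraction stage "extract"; stageType value is an assumption.
# Each remaining key names a document attribute; its value is the XPath.
extract.stageType=xpathExtract
extract.nextStage=route
extract.orderId=/order/@id
extract.customer=/order/customer/name
```

The extracted attributes would then be available to later stages, for example via Velocity's $document.get("orderId") as described above.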
Split the XML document using xpath expressions. This will result in a number of documents being forwarded to the next stage. This is useful when each of the split nodes represents a document that needs to be actioned. An example would be splitting out each of the orders from an XML document that is a collection of orders.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
xmlOmitDecl | boolean | 0..1 | Omit the XML PI declaration from the output document |
xmlIndent | boolean | 0..1 | Indent the output document |
XPath | string | 1..1 | The XPath expression to use to split the document |
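A property-file sketch of a splitter stage, using the options from the table. The stage name and `stageType` value are assumptions:

```properties
# Hypothetical splitter stage: each /orders/order node becomes a document.
split.stageType=xmlSplit
split.nextStage=processOrder
split.XPath=/orders/order
split.xmlIndent=true
split.xmlOmitDecl=false
```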
Apply an XSL:FO transformation to the document.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
outputType | integer | 0..1 | The buffer size to use |
Transform the document (which has to be XML) using this XSL script. The script can access all of the Babeldoc internals via a number of parameters. The parameters (accessed through the xsl:param element) which are always placed in the transformer are: pipelinestage and document. Other parameters may be placed on the transformer using the param option.
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
transformationFile | directory-path | 0..1 | The filename or URL to the XSL transformation file. If this is a file, then the XSL will be cached. If the file is modified, then the XSL document will be reloaded. |
transformationScript | multiline | 0..1 | An inline XSL document that could be used instead of the file option above. This will be cached. |
param | null | 0..n | Complex configuration parameter (of form stage-name.param.param-name-n=param-value) of xsl:params that will be placed in the XSL transformer. This can significantly aid transformation tasks. |
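A property-file sketch of an XSL stage. The `param` key form (stage-name.param.param-name=param-value) is taken from the table; the stage name, `stageType` value, and file path are assumptions:

```properties
# Hypothetical XSL stage "toInvoice".
toInvoice.stageType=xsl
toInvoice.nextStage=render
toInvoice.transformationFile=config/xsl/invoice.xsl
# Becomes available in the stylesheet as <xsl:param name="company"/>
toInvoice.param.company=Acme Ltd
```

Inside the stylesheet, the always-present pipelinestage and document parameters described above are also accessible through xsl:param elements.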
Crack this document as a zip archive
Name | Type | number | description |
---|---|---|---|
stageType | service-name | 1..1 | Type of pipeline stage |
nextStage | string | 1..1 | Name of the next stage in pipeline or null if this is the last stage. |
ignored | boolean | 0..1 | If this is set then this stage is ignored - the pipeline document is simply passed, unprocessed, to the next stage. Useful to quickly disable parts of a pipeline. |
tracked | boolean | 0..1 | If this is set then this stage is tracked - the pipeline document is written to the journal. |
encoding | string | 0..1 | Encoding of resulting document. This is used for text documents. Default is system file.encoding |
Babeldoc has a configurable error-handling mechanism. When an exception occurs, it is handled by the default error handler; if that is not suitable, you can override it by specifying a custom error handler for a pipeline stage. The default Babeldoc error handler performs the following steps:
To use some other error handler, write your own error handler class. Your class should implement the interface com.babeldoc.core.pipeline.IPipelineStageErrorHandler. You will also need to supply the name of your error handler class in the pipeline configuration.
You can store the whole document, with all its attributes, in the journal at a given pipeline stage. This is done by setting the configuration option tracked to true for the pipeline stage you want to track. The document will then be stored in the journal. However, the attributes are not guaranteed to be stored along with the document; this depends on the journal implementation. Also, all attributes are saved as Strings. If you want to use the journal's replay operation, you should set this option on one of the preceding stages so that the replayer is able to recreate the document.
There can be situations when you don't want a stage to be processed. You can skip it by setting the configuration option ignored to true. For example, you may have a stage for unzipping files with a zip extension, but you don't want it to run on files that are not zip files. In this situation you can set ignored to true (using Velocity scripting based on the file extension) and no processing will be performed in that stage.
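A sketch of the zip-extension case above. The stage names and, in particular, the inline Velocity syntax for option values are assumptions; only the idea of driving `ignored` from the file extension comes from the text:

```properties
# Hypothetical unzip stage: skipped unless the incoming file name ends in ".zip".
# The Velocity expression form is an assumption.
unzip.stageType=unzip
unzip.nextStage=store
unzip.ignored=#if($document.get("file_name").endsWith(".zip"))false#{else}true#end
```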
The pipeline can be accessed using the pipeline commandline tool. This allows for the inspection of the pipeline, the stages, the configuration options and connectivity options.
There are a number of options in this tool. Use the -h option to get the complete list.
Table of Contents
Resources in Babeldoc are a generalized way of accessing data sources. Resources are also considered to be scarce, in that they have to be protected from leakage. This is particularly important for database and J2EE resources. Resources are identified by a unique string name. They are defined in the config/resource/config.xml file. Each resource name maps to a specific class name which governs the policy of the resource and is programmatically specified. The available resources are:
Each of the named resources is defined in config/resources as resource-name.properties. Each properties file has a required name/value pair called type, which can be one of the types listed above. The rest of the configuration options are specific to the type of resource and are given below:
Each simple jdbc resource defines a connection to a database using the following configuration options:
For this to work correctly, the JDBC jar file for the specific database you need to access must be on the classpath. Currently only the MySQL driver jar is built into Babeldoc; access to other databases such as Oracle, DB2 and Sybase requires that the JDBC driver libraries be placed on the CLASSPATH. The value of the dbDriver parameter and the form of the dbUrl are highly dependent on the particular vendor database. There are a number of limitations to the simple jdbc resource, the primary one being that it does not pool connections: each time a connection is requested, a new connection is created, which can be VERY time consuming.
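A sketch of a simple jdbc resource file for MySQL. The `type` value and the user/password key names are assumptions; `dbDriver` and `dbUrl` are the parameter names mentioned in the text:

```properties
# Hypothetical resource file config/resources/orderDb.properties.
# "simplejdbc", "dbUser" and "dbPassword" are assumed names.
type=simplejdbc
dbDriver=com.mysql.jdbc.Driver
dbUrl=jdbc:mysql://localhost:3306/babeldoc
dbUser=babeldoc
dbPassword=secret
```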
This resource is useful when running in a J2EE container. This allows for accessing datasources using JNDI. There is a single configuration option:
Babeldoc provides a pooled connection using the Apache Commons DBCP library. The configuration for the resource is provided by the following configuration options:
Table of Contents
The journal keeps track of documents as they move through the system as well as the status of each operation performed on the document. The primary purpose of the journal is to provide a safe environment for the processing of documents. There are a number of mission critical situations where losing data is not acceptable. It is possible to recreate document processing if an error condition should arise. Errors can be both external and internal. Internal problems could be temporary database errors, disk space, etc. External causes could be erroneous documents, network outages, etc.
Each document is associated with a JournalTicket which is assigned uniquely just as the document enters the pipeline. Each operation upon a document for a JournalTicket (hereafter also referred to as a ticket) is performed at a step. Steps start at zero and increase until the document is finished processing. Each operation (or pipelinestage) on a document can be uniquely identified by a combination of a ticket and a step.
A journal operation indicates what happened in the journal for the document at that pipelinestage. This is essential for determining problems with document processing. There are a number of journal operations available:
The implementation of the journal depends on your specific circumstances. There are currently three implementations that are available. Which specific journal to use is defined in the configuration file: config/journal/config.properties. The journal to be used is set in the single name/value pair: journalType. The options are:
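Selecting an implementation is then a one-line setting. The `journalType` option name and the `simple` implementation are named in the text; treat this only as a sketch of the file:

```properties
# config/journal/config.properties: pick the journal implementation.
journalType=simple
```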
The simple journal implements its operations as disk files and directories. It is not intended as a robust, enterprise-level implementation, and it lacks structured query functions. Its configuration file is config/journal/config.properties. This file has a number of configuration options:
For each operation logged to the journal, a line is written to the journal log file. The lines are comma-separated values (CSV) and can be parsed by third-party applications. The columns are:
For each ticket, a directory is created whose name is the value of the ticket (this is a long string of numbers - it is actually the time, in milliseconds, at which the ticket was created). Inside this directory there are step delta files, one for each step in the log for that ticket. The contents of a delta file may be the status string or the document itself (if the operation is updateDocument). The document is persisted as an object serialization.
It is possible to use a database to store the journal log and the document data. Currently Oracle and MySQL are supported. The schema creation scripts are in the directory readme/sql. The document data is stored as binary data (BLOBs); each vendor supports BLOBs slightly differently, hence the database-specific support. There are three main tables involved in storing the journal data (the table table_key is for unique key generation):
The configuration for the MySQL, Oracle, PostgreSQL, and SQL Server journals is stored in the configuration file config/journal/sql/config.properties. The only configuration option in this file, resourceName, indicates the name of the resource that will manage the database connection. Currently the journal is implemented in a separate schema (or instance) from the other database storage areas (user and console).
The intent of this journal implementation is to store the operation journal in a J2EE container. Currently JBoss is explicitly supported, but not to the exclusion of other containers. This implementation is really a shell around either the simple or sql journal implementation, running in a remote server. By this means it is possible to move the journal operation to a central location. The configuration for the ejb implementation is stored in the configuration file:
The journal tool allows access to the journal from the command line. This enables complex queries to be applied against the journal. There are four separate types of queries:
There are a number of options which can change the display of the data from the tool - use the -h command-line option to get all the options for this tool.
Table of Contents
The scanner is a tool that scans for messages from a variety of sources; when a message is found, it is fed into the pipeline. The scanner is an automation tool, in that a system can be built up using scanners and pipelines. This is an alternative to the process script, which feeds a single document into the pipeline when run. The scanner is currently capable of scanning a directory in a filesystem, a mailbox on a mail server, an FTP server, a web server, a database via a SQL query, external application output and a JMS queue. The scan period and the pipeline to feed, as well as other specific configuration options, are all set in the config/scanner/config.properties file. There may be one or many scanning threads active, each configured differently. For example, one scanner thread could be polling a mailbox once every 60 seconds while another is scanning a directory every 10 seconds. The scanner is also capable of scanning based on a schedule, specified in the same way as cron on UNIX systems.
General attributes available are file_name, scan_path and scan_date.
The scanner tool is started by running the command babeldoc scanner. This command will use configuration from config/scanner/config.properties. If you want to use configuration from a different file, use the -s another_configuration switch to specify the configuration that should be used instead of the default one.
There are two kinds of configuration options available:
The options for each scanner type are laid out below.
The directory scanner is used for scanning directories on the local file system. It can be configured to scan subdirectories of a given folder recursively, and it can use filters to include or exclude files from scanning (i.e. inclusion and exclusion parameters). This is very useful for integrating Babeldoc into larger systems. An example would be reading documents placed in a directory by another application running on the same computer, or by another computer writing to a shared, networked filesystem.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (DirectoryScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
inDirectory | directory-path | 1..n | Directory to be scanned |
doneDirectory | directory-path | 0..1 | Folder that is used for storing scanned files. Note that scanned files will be removed from inDirectory |
includeSubfolders | boolean | 0..1 | Specifies whether scanning should be recursive and include subfolders. If so, files will be copied to doneDirectory with a path relative to inDirectory. |
filter | string | 0..1 | Regular expression filter; only files that match will be included. If not specified, all files are included |
minimumFileAge | integer | 0..1 | Minimum age of file in ms (attempts to guard against incomplete reads) |
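A sketch of a directory scanner entry using the options above. The scanner name, the key-prefix form, and the paths are assumptions; note that backslashes in the filter regex may need doubling depending on the properties loader:

```properties
# Hypothetical scanner "orders": poll a spool directory every 10 seconds.
orders.type=DirectoryScanner
orders.period=10000
orders.pipeline=demo
orders.inDirectory=/var/spool/babeldoc/in
orders.doneDirectory=/var/spool/babeldoc/done
orders.filter=.*\.xml
orders.minimumFileAge=5000
```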
The null scanner feeds a null document into the pipeline every time it runs. This is useful for scheduling.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (null) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
The HttpScanner allows the scanner to pull down documents from web servers. Any headers received by the HttpScanner are placed on the document as attributes.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (HttpScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
url | url | 1..n | URL to get the document from |
attempts | integer | 0..1 | Number of times to attempt to get the document |
user | string | 0..1 | Authenticate with this user name |
password | string | 0..1 | Authenticate with this password |
realm | string | 0..1 | Authenticate with this realm |
proxyHost | url | 0..1 | URL of the proxy host |
proxyPort | integer | 0..1 | proxy port |
timeout | integer | 0..1 | retry timeout |
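A sketch of an HttpScanner entry polling a web server once a minute. The scanner name, key-prefix form, and URL are assumptions; the option names come from the table:

```properties
# Hypothetical scanner "prices": fetch a document from a web server every 60 s.
prices.type=HttpScanner
prices.period=60000
prices.pipeline=demo
prices.url=http://example.com/prices.xml
prices.attempts=3
prices.user=reader
prices.password=secret
```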
The MailboxScanner is used for scanning mail servers for e-mail messages. Documents can be scanned from the e-mail body or from attachments. This is very useful for integration with email-enabled clients; an example would be purchase orders emailed to a mailbox scanned by Babeldoc. The From, To and Subject filters are regular expression filters: enter regular expressions which, if matched, cause the matching email to be processed. For example, if you wanted to match a recipient address of first.last@server.com, you would enter "first\.last@server\.com" in the toFilter. The expressions are effectively OR'd together: if any one of the filters matches, the e-mail message will be processed. The toFilter is tested against all addresses in the TO field; it is NOT tested against the CC or BCC fields. Accessible attributes are subject, from, to and replyTo.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (MailboxScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
host | string | 0..1 | Mail server host name or address |
protocol | string | 0..1 | Protocol which is used for connecting to mail server (pop3, imap...) |
timeOut | integer | 0..1 | Socket I/O timeout value in milliseconds. Default is infinite timeout. |
folder | string | 0..1 | Name of the folder on the mail server (for example, INBOX) |
username | string | 1..n | Username for logging in to the mail server |
password | string | 1..n | Password for logging in to the mail server |
getFrom | enumeration | 0..1 | Whether the message should be created from the mail body or from an attachment. Default is body |
fromFilter | string | 1..n | Regular expression which, if matched by the From field, causes the message to be processed |
toFilter | string | 1..n | Regular expression which, if matched by the To field, causes the message to be processed |
subjectFilter | string | 1..n | Regular expression which, if matched by the Subject field, causes the message to be processed |
fromFilterResult | boolean | 0..1 | Result of the regular expression (true or false) |
toFilterResult | boolean | 0..1 | Result of the regular expression (true or false) |
subjectFilterResult | boolean | 0..1 | Result of the regular expression (true or false) |
deleteInvalid | boolean | 1..n | Delete messages that are not valid (invalid address, etc.) and are not processed by Babeldoc |
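The toFilter example from the text ("first\.last@server\.com") behaves like any Java regular expression: the escaped dots match literal dots rather than any character. A small standalone demonstration (the filters' exact matching semantics inside Babeldoc are assumed to follow java.util.regex):

```java
import java.util.regex.Pattern;

public class ToFilterDemo {
    public static void main(String[] args) {
        // The toFilter value from the text, with dots escaped so "." is literal.
        String toFilter = "first\\.last@server\\.com";
        Pattern p = Pattern.compile(toFilter);

        // Matches the intended recipient address.
        System.out.println(p.matcher("first.last@server.com").find());

        // Does not match a different address, since the escaped dots
        // only match literal "." characters.
        System.out.println(p.matcher("firstXlast@serverXcom").find());
    }
}
```

An unescaped filter such as first.last@server.com would also match addresses like firstXlast@serverXcom, which is why the text recommends escaping the dots.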
The SQL scanner is used for generating documents by executing SQL queries. It can produce XML documents, CSV documents or simple documents. In the case of XML and CSV, only one document is returned and it contains all returned rows. In the case of simple documents, each document is formed from the first column of each returned row.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (SqlScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
resourceName | string | 1..n | Name of the connection resource |
sqlStatement | string | 1..n | SQL statement that is executed to get documents |
updateStatement | string | 0..1 | SQL Statement that is executed after selecting rows and creating documents. It is used for marking rows as processed so they don't need to be processed later. |
documentType | enumeration | 0..1 | Type of document that is returned. Choices are simple, xml or csv |
cvsFieldSeparator | string | 0..1 | Character used to separate fields in the CSV output. Default is a comma |
csvRowSeparator | string | 0..1 | Character used to separate rows in the CSV output. Default is \n |
xmlHeadingTag | string | 0..1 | Tag that is used in XML document for heading. Default is document |
xmlRowTag | string | 0..1 | Tag that is used in XML document for each row. Default is row. |
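A sketch of a SqlScanner entry using the options above. The scanner name, key-prefix form, table and column names are assumptions; the pattern of selecting rows and then marking them processed via updateStatement is taken from the table:

```properties
# Hypothetical scanner "outbox": select unprocessed rows as one XML document,
# then mark them as processed so they are not picked up again.
outbox.type=SqlScanner
outbox.period=30000
outbox.pipeline=demo
outbox.resourceName=orderDb
outbox.sqlStatement=SELECT id, body FROM outbox WHERE processed = 0
outbox.updateStatement=UPDATE outbox SET processed = 1 WHERE processed = 0
outbox.documentType=xml
outbox.xmlHeadingTag=outbox
outbox.xmlRowTag=row
```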
Scans a given folder on a remote FTP server. This allows Babeldoc to connect to remote FTP servers and scan folders for documents to process.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (FtpScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
ftpHost | string | 1..n | Host name or address of the ftp server |
ftpUsername | string | 1..n | Username that is used for connecting to host |
ftpPassword | string | 1..n | Password that is used for connecting to host |
ftpFolder | string | 1..n | Folder name which is scanned |
includeSubfolders | boolean | 1..n | Should subfolders be scanned too |
ftpOutFolder | string | 0..1 | Folder on FTP server where scanned documents should be copied |
localBackupFolder | directory-path | 0..1 | Folder on local file system where scanned documents should be copied |
filter | string | 0..1 | Regular expression filter; only files that match will be included |
maxDepth | string | 0..1 | Maximum depth of subfolders to scan |
The ExternalApplicationScanner runs an external application and pipes the standard output from that application into the pipeline.
Name | Type | number | description |
---|---|---|---|
type | service-name | 1..n | General: Type of scanner (ExternalApplicationScanner) |
period | integer | 0..1 | General: Interval between two scanning operations in milliseconds (only one of cronSchedule or period can be used) |
cronSchedule | string | 0..1 | General: Cron-like entry for specifying the scanner schedule (only one of cronSchedule or period can be used) |
pipeline | string | 1..n | General: Name of the pipeline where scanned documents will be processed |
contentType | string | 0..1 | General: Content type of the document to be scanned |
ignored | boolean | 0..1 | General: true if the scanner should not scan, false otherwise. Default is false |
journal | boolean | 0..1 | General: Should the scanner use the journal. Default is true |
countDown | integer | 0..1 | General: The number of times this scanner will run (counts down). |
binary | boolean | 0..1 | General: The documents from this scanner must be submitted as binary pipeline documents |
encoding | string | 0..1 | General: The encoding used for reading input files |
application | string | 1..n | The application to run. |
Table of Contents
Flat-file ASCII data is produced by a number of modern and legacy systems. Examples of flat-file data include CSV; COBOL copybooks; positional data (data items placed in a two-dimensional layout, at given columns and rows, each occupying a fixed number of characters); and repeating groups (groups of data which repeat based on markers found in the document).
The flat-file conversion is governed by a conversion configuration file that conforms to the schema readme/schema/conversion.xsd, which clearly describes the various configuration options.
Each input document is considered to be split into two major parts:
The header is the first part of the conversion XML document. It describes characteristics of the input document (type of conversion, line-ending character, the number of lines in a paragraph, lines from the top of the document to the first paragraph, lines between paragraphs, etc.) and of the output document (the root element and the row element name).
The paragraphs in the input document represent the lines of data that are of interest to be mapped to the output XML document. Each paragraph may consist of one or more lines, each line consisting of one or more characters up to the end-of-line character. Each paragraph maps to a sub-root element in the output document. Each field in the paragraph is represented either by a position and a width in characters, in a positional document, or by a column number in a CSV document. These fields are represented by sub-row elements in the output document, i.e. in XPath: /root/paragraph/field.
There are two basic kinds of paragraphs: segmented and non-segmented lines.
Non-segmented lines are lines whose output paragraph XML element does not change based on the presence of data in the input document. There are three types of non-segmented input documents:
This is the simplest document. Each paragraph is a line of comma-separated values. Each field is specified by a column number and a field name; the name is the sub-row element to emit for the data found at that column number.
This is a positional document where each line of the input document represents a paragraph. Each field in the line is specified by an offset (starting from zero) into the line, the width of the field, and the field name of the sub-row element to emit.
This is a positional document where each paragraph consists of a number of lines. Each field is specified by a line offset into the paragraph (from the top of the paragraph), an offset from the left margin, character width and a field name. This is useful for screen scraping operations where the screen height and width (usually 80x24) represents the paragraph.
The premise with segmented lines is that the input file may contain some value which indicates the kind of data on that line, acting as a marker. This is specified as a column/width and a value to match. Once a line has been identified, it is possible to then perform either a single-line paragraph conversion or a CSV conversion on it. There is an optional nesting element to output when a segment is matched - this is situated between the row and field elements.
The conversion XML document is divided into two sections: the header and the conversion information. The basic format is:
| Element | Description |
|---|---|
| `conversion` | The root element of the configuration document |
| `header` | Describes the output and input documents |
| `output-document` | The output document description |
| `root-element` | The root element of the output document |
| `row-element` | The row element for each of the input paragraphs |
| `input-document` | The input document description |
| `conversion-type` | One of: `line`, `csv`, `para`, `segmented-line` |
| `line-ending` | The line ending characters - this is currently ignored |
| `field-separator` | For CSV files, the separator characters |
| `inter-skip` | The number of lines to skip between paragraphs |
| `top-skip` | The number of lines to skip before the first paragraph is encountered |
| `left-margin` | Characters to skip from the beginning of the line to the first character of interest in the paragraph |
| `lines-per-para` | The number of lines in each paragraph; used for establishing the chunk size |
| `fields` | Holds the field definitions; the contents depend on the conversion type |
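Putting the elements above together, a skeleton conversion document might look as follows. This is a sketch only: the exact nesting and element names should be verified against readme/schema/conversion.xsd.

```xml
<conversion>
  <header>
    <output-document>
      <root-element>orders</root-element>
      <row-element>order</row-element>
    </output-document>
    <input-document>
      <conversion-type>csv</conversion-type>
      <field-separator>,</field-separator>
      <top-skip>1</top-skip>
    </input-document>
  </header>
  <!-- field definitions for the chosen conversion type go here -->
</conversion>
```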
The contents of the fields element given above depend on which type of flat file is to be converted. There are three types:
These files are often used when exporting data from spreadsheet applications. Each column of data is separated by a comma, and cells may be enclosed in quotation marks to escape text.
| Element | Description |
|---|---|
| `field` [1..*] | There can be one or more fields |
| `field-name` | The name of the output field element |
| `field-number` | The number of the CSV column |
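As a hypothetical example, an input line such as `ACME,42` could be mapped with two field definitions (whether `field-number` is zero- or one-based should be checked against the schema):

```xml
<csv-fields>
  <field>
    <field-name>customer</field-name>
    <field-number>1</field-number>
  </field>
  <field>
    <field-name>quantity</field-name>
    <field-number>2</field-number>
  </field>
</csv-fields>
```

This would produce a row element along the lines of `<order><customer>ACME</customer><quantity>42</quantity></order>`.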
These files consist of lines of data; each line corresponds to a row of data. The fields of data are positionally arranged in the line. For instance, an order reference could exist at column 15 with a width of 10.
| Element | Description |
|---|---|
| `line-fields` | Holds the fields |
| `field` [1..*] | There can be one or more fields |
| `field-name` | The name of the output field element |
| `field-column` | The character number of the column |
| `field-width` | The width in characters |
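To make the column/width arithmetic concrete, here is a minimal sketch in Java (illustrative only, not Babeldoc source; the class and method names are invented) of extracting a fixed-width field such as the order reference described above:

```java
// Illustrative sketch: extracting a fixed-width field from a positional line.
public class LineFieldDemo {

    /** Extract a field starting at a zero-based column with the given width. */
    public static String extractField(String line, int column, int width) {
        if (column >= line.length()) {
            return "";                              // field lies beyond this line
        }
        int end = Math.min(column + width, line.length());
        return line.substring(column, end).trim();  // trim positional padding
    }

    public static void main(String[] args) {
        // Example layout: columns 0-8 item code, columns 14-23 order reference
        String line = "ITEM-0001     ORD-42     WIDGET";
        System.out.println(extractField(line, 14, 10)); // prints "ORD-42"
    }
}
```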
These files consist of regular groups of lines. Fields exist at a particular row and column in the array of lines, and each field consists of a number of characters, that is, a width. This is very similar to the line-based conversion above, except that the data is a two-dimensional array. This is useful for screen scrapes, etc.
| Element | Description |
|---|---|
| `para-fields` | Element to hold the paragraph fields |
| `field` [1..*] | There can be one or more fields |
| `field-name` | The name of the output field element |
| `field-column` | The character number of the column |
| `field-row` | The line (from the top of the paragraph) of the field |
| `field-width` | The width in characters |
Segments are the method of mutating the output based on key fields in the input data.
| Element | Description |
|---|---|
| `line-segments` | Element to hold all of the segments |
| `segment` [1..*] | Segments (there can be one or more) |
| `segment-name` | The name of the output segment element |
| `segment-column` | The column of the segment marker |
| `segment-width` | The width of the segment marker |
| `segment-value` | The value of the segment to match |
| `begin-group-name` | The name of the element to begin the group |
| `csv-fields` \| `line-fields` | The field definitions for the matched segment, in either CSV or positional form |
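A hypothetical segment definition, matching lines whose first three characters are the marker `INV` and then applying a positional conversion to them (element names illustrative; verify against readme/schema/conversion.xsd):

```xml
<line-segments>
  <segment>
    <segment-name>invoice</segment-name>
    <segment-column>0</segment-column>
    <segment-width>3</segment-width>
    <segment-value>INV</segment-value>
    <begin-group-name>invoices</begin-group-name>
    <line-fields>
      <field>
        <field-name>invoice-number</field-name>
        <field-column>3</field-column>
        <field-width>10</field-width>
      </field>
    </line-fields>
  </segment>
</line-segments>
```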
This chapter collects together some of the accumulated knowledge of those using Babeldoc. The intention is to save you time if you are trying to perform these or similar tasks. Please contribute your nuggets of information - they can help others.
Prerequisites: Eclipse 3.0 build M4 or later. Anything earlier will not work.
You should now have a happy Eclipse system showing all the source modules and the libraries. Eclipse should not show any errors detected by the background compiler. However, there will be a stack of warnings; they can, for the moment, be ignored.
Now to get ant working.
A Console view should now appear, and the ant output will be spooled into it.
As there is no equivalent to SQLEnrich for XML, it is not obvious how to get an attribute from an external file and then revert to the original document. One way to do this is to store the current document as an attribute, process the second file, and then revert the document to the value of the attribute:
```properties
doc2attrib.stageType=Scripting
doc2attrib.nextStage={Stages that load other document etc.}
doc2attrib.script=document.put("originalContent", document.getBytes());

attrib2doc.stageType=Scripting
attrib2doc.nextStage={Continue with processing}
attrib2doc.script=document.setBytes(document.get("originalContent"));
```
Essentially you can use `document.get("myprop")` from within the stylesheet:
```xml
<xsl:param name="doc" select="$document"/>
<xsl:param name="myprop" select="java:get($doc, 'myprop')"/>
```
For the syntax, see the Java section of http://xml.apache.org/xalan-j/extensions.html.
Additionally, you can get the pipeline stage object from the XSL and then call the Java code directly.
The snippet below is an example of how to get the current time and format it nicely:
```xml
<xsl:variable name="date" select="java:java.util.Date.new()"/>
<xsl:variable name="seconds" select="java:getTime($date)"/>
<xsl:variable name="velocity"
    select="java:com.babeldoc.core.VelocityUtilityContext.new()"/>
<xsl:variable name="datestr"
    select="java:getFormattedDate($velocity, 'd MMM yyyy HH:mm:ss', $seconds)"/>
```
The idea of this HOWTO is to avoid distributing all the directories that make up your configuration: instead, package them up into a single jar file and use that to run your pipelines.

Let's assume that your BABELDOC_USER points to the c:\project directory. This directory contains all the required configuration directories, such as pipeline, resource, etc.
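Assuming a standard JDK jar tool is on the path, the configuration directories could be packaged along these lines (the jar name is illustrative, and how Babeldoc picks the configuration up from the classpath should be checked for your version):

```shell
# From the BABELDOC_USER directory (c:\project in this example),
# package the configuration directories into a single jar:
jar cvf babeldoc-config.jar pipeline resource
# Then place babeldoc-config.jar on the classpath when running Babeldoc.
```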
Copyright (c) 2000 The Apache Software Foundation. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The end-user documentation included with the redistribution, if any, must include the following acknowledgment:
"This product includes software developed by the Apache Software Foundation (http://www.apache.org/).
Alternately, this acknowledgment may appear in the software itself, if and wherever such third-party acknowledgments normally appear.
4. The names "Apache" and "Apache Software Foundation" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact apache@apache.org.
5. Products derived from this software may not be called "Apache", nor may "Apache" appear in their name, without prior written permission of the Apache Software Foundation.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
====================================================================
This software consists of voluntary contributions made by many individuals on behalf of the Apache Software Foundation. For more information on the Apache Software Foundation, please see http://www.apache.org.
Portions of this software are based upon public domain software originally written at the National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign.
====================================================================