New job
Next week (1st December) I’m switching to a new job at Connecta, a consultancy company in central Stockholm. My role will be to lead a team and build up its knowledge in the area of Enterprise Java.
A first look at Spring-Batch, part 2
In my first post about Spring-Batch, I described in detail a Hello-World application using Spring-Batch and discussed the plumbing needed in the spring-beans configuration file. In this second post I will take it one step further by introducing the concepts of tokenizer and field-set mapper.
I will reuse as much as possible from the previous project. Because the reuse is by copy, you can download and study the two projects independently, without any compile-time or runtime dependencies between them (in contrast to the Spring-Batch samples, which are one big heap of code).
The Application
The application reads a set of person data from a file, creates a person object for each record and prints it out. It is as simple as that, and still a toy application, but it lets you concentrate on the key concepts of tokenizer and mapper.
The Input File
Let’s start with the input data. It is in CSV (Character Separated Values) format and located at the top of the class path.
Name;Street;PostCode;City
Anna Conda;Hacker street 17;12345;Javaville
Sham Poo;Reboot lane 5;67890;Perlvillage
Sandy Shoes;Desert town street 11;98765;Cobolburgh
The Domain Class
The overall objective is to transform each record in the CSV file into a Person object. Here is the Person class for easy reference. It uses the ToStringBuilder from the Jakarta Commons Lang project.
package com.ribomation.tutorial;
import org.apache.commons.lang.builder.ToStringBuilder;
import org.apache.commons.lang.builder.ToStringStyle;
public class Person {
private String name, street, postCode, city;
public String toString() {
return ToStringBuilder.reflectionToString(this, ToStringStyle.SHORT_PREFIX_STYLE);
}
//. . .getters and setters. . .
}
Tokenizers and Mappers
The file above is a so-called flat file: each row is a record and each record is subdivided into fields, where the fields are separated by a semi-colon character. We will be using a FlatFileItemReader to read from our (class path) resource. The reader reads one line at a time and needs a means to break the line into fields. This is the task of a tokenizer; in our case we will use a DelimitedLineTokenizer and tell it to split fields around the semi-colon.
A tokenizer takes a string and returns a FieldSet. The next step is to convert the field-set into a business object. A FieldSetMapper takes a field-set and returns a fresh new object. This mapper class is something you often have to implement yourself, although in many cases you can get away with a BeanWrapperFieldSetMapper. I don’t want to use too much magic at once, so here is a very straightforward mapper class.
package com.ribomation.tutorial;
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.mapping.FieldSet;
public class PersonMapper implements FieldSetMapper {
public Object mapLine(FieldSet fs) {
Person p = new Person();
int idx = 0;
p.setName ( fs.readString(idx++) );
p.setStreet ( fs.readString(idx++) );
p.setPostCode( fs.readString(idx++) );
p.setCity ( fs.readString(idx++) );
return p;
}
}
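To make the division of labour between tokenizer and mapper concrete, here is a minimal plain-Java sketch of the same two steps, without any Spring-Batch classes. The class TokenizeAndMapSketch and its methods are hypothetical names of my own, and the nested Person is a simplified stand-in for the domain class above; the real DelimitedLineTokenizer and FieldSet do considerably more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Spring-Batch.
class TokenizeAndMapSketch {

    // Simplified stand-in for the tutorial's Person class.
    static class Person {
        String name, street, postCode, city;

        @Override
        public String toString() {
            return "Person[name=" + name + ",street=" + street
                 + ",postCode=" + postCode + ",city=" + city + "]";
        }
    }

    // "Tokenizer": breaks one line into fields around the delimiter.
    static String[] tokenize(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    // "Mapper": turns the fields into a fresh Person, by position.
    static Person map(String[] fields) {
        Person p = new Person();
        int idx = 0;
        p.name     = fields[idx++];
        p.street   = fields[idx++];
        p.postCode = fields[idx++];
        p.city     = fields[idx++];
        return p;
    }

    // "Reader": skips the header line, then tokenizes and maps each record.
    static List<Person> readAll(List<String> lines) {
        List<Person> result = new ArrayList<Person>();
        for (String line : lines.subList(1, lines.size())) {
            result.add(map(tokenize(line, ";")));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Name;Street;PostCode;City",
            "Anna Conda;Hacker street 17;12345;Javaville",
            "Sham Poo;Reboot lane 5;67890;Perlvillage");
        for (Person p : readAll(lines)) {
            System.out.println(p);
        }
    }
}
```

The point of the split is that the mapper never sees the delimiter, so the file format can change without touching the mapping code.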
The Reader Configuration
Now we know enough to configure the reader in the spring-beans configuration file. As I said initially, I reuse (by copy) as much as possible from the first hello spring-batch application, so I will only show you the new/changed XML snippet.
<bean id="inputFile" class="org.springframework.core.io.ClassPathResource">
<constructor-arg value="/names.csv"/>
</bean>
<bean id="reader" class="org.springframework.batch.item.file.FlatFileItemReader">
<property name="resource" ref="inputFile"/>
<property name="firstLineIsHeader" value="true"/>
<property name="lineTokenizer">
<bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
<property name="delimiter" value=";"/>
</bean>
</property>
<property name="fieldSetMapper">
<bean class="com.ribomation.tutorial.PersonMapper"/>
</property>
</bean>
Running the Application
Now we are ready to compile/build and execute. The new POM is almost the same as the previous one, except that the artifact name is ‘HelloSpringBatch2’ (plus a new dependency for commons-lang). Build the application with
mvn package
Finally run the application with the command below. We are reusing the LogWriter from the first application, which just prints out the supplied object using its toString() method.
HelloSpringBatch-2> java -jar target\HelloSpringBatch2-1.0.jar hello-spring-batch.xml helloJob
[SimpleJobLauncher] No TaskExecutor has been set, defaulting to synchronous executor.
[SimpleStepFactoryBean] Setting commit interval to default value (1)
[SimpleJobLauncher] Job: [SimpleJob: [name=helloJob]] launched with the following parameters: [{}{}{}{}]
[LogWriter] Person[name=Anna Conda,street=Hacker street 17,postCode=12345,city=Javaville]
[LogWriter] Person[name=Sham Poo,street=Reboot lane 5,postCode=67890,city=Perlvillage]
[LogWriter] Person[name=Sandy Shoes,street=Desert town street 11,postCode=98765,city=Cobolburgh]
[SimpleJobLauncher] Job: [SimpleJob: [name=helloJob]] completed successfully with the following parameters: [{}{}{}{}]
A Small Variation
I mentioned above that we can get away without implementing a mapper class, if it’s easy to map fields to bean properties. So let’s do just that.
A BeanWrapperFieldSetMapper needs to know the destination class (or use a prototype bean) and the name of each field. My own mapper class above intentionally used indexing instead of field names. Look at the input file again; the first line contains the field names
Name;Street;PostCode;City
Anna Conda;Hacker street 17;12345;Javaville
. . .
and the reader configuration contained an instruction to interpret the first line as the field name definition
<bean id="reader" class="org.springframework.batch.item.file.FlatFileItemReader">
<property name="firstLineIsHeader" value="true"/>
<property name="fieldSetMapper" ref="mapper"/>
. . .
That’s all we need, to provide BeanWrapperFieldSetMapper with sufficient information for it to perform its duties.
<bean id="mapper" class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
<property name="targetType" value="com.ribomation.tutorial.Person"/>
</bean>
Don’t forget to recompile/package (mvn package). The execution will produce the exact same output as before.
HelloSpringBatch-2>java -jar target\HelloSpringBatch2-1.0.jar hello-spring-batch.xml helloJob
[SimpleJobLauncher] No TaskExecutor has been set, defaulting to synchronous executor.
[SimpleStepFactoryBean] Setting commit interval to default value (1)
[SimpleJobLauncher] Job: [SimpleJob: [name=helloJob]] launched with the following parameters: [{}{}{}{}]
[LogWriter] Person[name=Anna Conda,street=Hacker street 17,postCode=12345,city=Javaville]
[LogWriter] Person[name=Sham Poo,street=Reboot lane 5,postCode=67890,city=Perlvillage]
[LogWriter] Person[name=Sandy Shoes,street=Desert town street 11,postCode=98765,city=Cobolburgh]
[SimpleJobLauncher] Job: [SimpleJob: [name=helloJob]] completed successfully with the following parameters: [{}{}{}{}]
This new version uses only two hand-written classes: the domain class Person and the writer class LogWriter. The rest is pure configuration.
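If you wonder how a mapper can work from field names alone, the sketch below shows one way to do it in plain Java with reflection. It is an assumption-laden illustration, not the actual BeanWrapperFieldSetMapper implementation: the class NameBasedMapperSketch is hypothetical, it only handles String properties, and it expects a public setter named after each field.

```java
import java.lang.reflect.Method;

// Hypothetical illustration only; Spring's own mapper is more capable.
class NameBasedMapperSketch {

    // Simplified stand-in for the tutorial's Person class.
    public static class Person {
        private String name, street, postCode, city;

        public void setName(String v)     { name = v; }
        public void setStreet(String v)   { street = v; }
        public void setPostCode(String v) { postCode = v; }
        public void setCity(String v)     { city = v; }

        @Override
        public String toString() {
            return "Person[name=" + name + ",street=" + street
                 + ",postCode=" + postCode + ",city=" + city + "]";
        }
    }

    // Creates a new target object and calls set<FieldName>(String) for each field.
    static <T> T map(Class<T> type, String[] names, String[] values) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (int i = 0; i < names.length; i++) {
                String n = names[i];
                String setter = "set" + Character.toUpperCase(n.charAt(0)) + n.substring(1);
                Method m = type.getMethod(setter, String.class);
                m.invoke(target, values[i]);
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot map fields onto " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        String[] header = "Name;Street;PostCode;City".split(";");
        String[] record = "Sham Poo;Reboot lane 5;67890;Perlvillage".split(";");
        System.out.println(map(Person.class, header, record));
    }
}
```

Because the field names come from the header line, reordering the columns in the input file would not require any code change in this sketch.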
A Second Variation
Let’s twist this application a second time by changing the output format. Instead of using our LogWriter, let Spring-Batch convert the output to XML.
We will use another bean called xmlWriter, which is of type StaxEventItemWriter. This bean needs an XML serializer, an output resource and the name of the XML root tag. The serializer uses a marshaller, which is an abstraction over different O/X mapping tools. We will use XStream, a lightweight XML tool.
Here are all additions to the beans configuration file. The xmlWriter reference should replace the LogWriter reference in the helloStep configuration.
<bean id="xmlWriter" class="org.springframework.batch.item.xml.StaxEventItemWriter">
<property name="rootTagName" value="persons"/>
<property name="serializer" ref="xmlSerializer"/>
<property name="overwriteOutput" value="true"/>
<property name="resource" ref="xmlOutputFile"/>
</bean>
<bean id="xmlSerializer" class="org.springframework.batch.item.xml.oxm.MarshallingEventWriterSerializer">
<constructor-arg>
<bean class="org.springframework.oxm.xstream.XStreamMarshaller"/>
</constructor-arg>
</bean>
<bean id="xmlOutputFile" class="org.springframework.core.io.FileSystemResource">
<constructor-arg value="persons.xml"/>
</bean>
The output now goes to a file named ‘persons.xml’ in the current directory. Don’t forget to recompile/rebuild and run it as before. The contents of the produced file are
<?xml version="1.0" encoding="UTF-8" ?>
<persons>
<com.ribomation.tutorial.Person>
<name>Anna Conda</name>
<street>Hacker street 17</street>
<postCode>12345</postCode>
<city>Javaville</city>
</com.ribomation.tutorial.Person>
<com.ribomation.tutorial.Person>
<name>Sham Poo</name>
<street>Reboot lane 5</street>
<postCode>67890</postCode>
<city>Perlvillage</city>
</com.ribomation.tutorial.Person>
<com.ribomation.tutorial.Person>
<name>Sandy Shoes</name>
<street>Desert town street 11</street>
<postCode>98765</postCode>
<city>Cobolburgh</city>
</com.ribomation.tutorial.Person>
</persons>
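For comparison, here is roughly what writing such a document by hand looks like with the standard StAX API (javax.xml.stream) that the writer's name refers to. This is only a sketch under my own assumptions: the class StaxPersonsSketch is hypothetical, it uses the cursor-style XMLStreamWriter rather than whatever StaxEventItemWriter does internally, it writes a short person tag instead of the fully qualified class name that XStream produces, and the output is not indented.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical illustration only; not the Spring-Batch writer.
class StaxPersonsSketch {

    // Each person is given as {name, street, postCode, city}.
    static String toXml(String[][] persons) {
        String[] tags = {"name", "street", "postCode", "city"};
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("persons");        // the root tag
            for (String[] p : persons) {
                xml.writeStartElement("person");     // one element per item
                for (int i = 0; i < tags.length; i++) {
                    xml.writeStartElement(tags[i]);
                    xml.writeCharacters(p[i]);       // escapes <, > and & for us
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(new String[][] {
            {"Anna Conda", "Hacker street 17", "12345", "Javaville"},
            {"Sham Poo", "Reboot lane 5", "67890", "Perlvillage"}
        }));
    }
}
```

Compared with this, the bean configuration above gets the same kind of result without any hand-written serialization code.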
This concludes my second post about Spring-Batch.
Source Code
A first look at Spring Batch
Spring-Batch is a rather new project within the Spring portfolio. It addresses a large field within computing, although one that is not mainstream in Java. A lot of corporate computing is managed by batch processing, with many business transactions based on file input picked up from FTP drop zones etc.
Back in 2003, I built a batch-oriented system that could deal with FLV (Cobol) files, assemble transaction data from a database, generate reports in various formats and push them away over FTP, HTTPS or mail. One offspring of that project is my library for reading and writing FLV files.
It is therefore a déjà vu to reconnect with the ideas and principles behind Spring-Batch. As is common for Spring projects, it solves more than one design problem and provides a smorgasbord of solutions. The only drawback is the lack of introductory reading material, which makes the learning curve steeper than it needs to be. So, let’s fill that gap.
A little bit of theory
Spring-Batch can be subdivided into two areas which you can use separately: item handling and batch execution.
Item Handling
Let’s start by discussing item handling. This means reading and interpreting file or database contents and writing them to a file or database. In this area SpringBatch really shines. During interpretation you typically want to create business objects, then operate on and transform them. SpringBatch comes with support for flat files, structured files and database access.
A flat file is either a CSV (Character Separated Values) or an FLV (Fixed Length Values) file. A structured file is, for example, XML. Typically you assemble a tokenizer with a mapper that produces business objects, and the other way around for output. A tokenizer understands the file format and shields the rest of the application from it, making it easy to swap file formats. This is a more common activity than you might expect, because transaction data suppliers often deliver in various obscure file formats.
Item handling is the easy-to-understand part of SpringBatch. And, as I said above, you can use it as is without touching the other part: batch executions.
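The claim that a tokenizer shields the rest of the application from the file format can be sketched in a few lines of plain Java. Everything here is hypothetical (the Tokenizer interface, the two factory methods and the field widths are my own assumptions, not SpringBatch classes); the point is only that the consuming code is identical for CSV and FLV input.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the SpringBatch tokenizer classes.
class SwappableTokenizerSketch {

    interface Tokenizer {
        String[] tokenize(String line);
    }

    // CSV style: fields are separated by a delimiter.
    static Tokenizer delimited(final String delimiter) {
        return line -> line.split(Pattern.quote(delimiter), -1);
    }

    // FLV style: each field occupies a fixed number of characters.
    static Tokenizer fixedLength(final int... widths) {
        return line -> {
            String[] fields = new String[widths.length];
            int pos = 0;
            for (int i = 0; i < widths.length; i++) {
                int start = Math.min(pos, line.length());
                int end = Math.min(pos + widths[i], line.length());
                fields[i] = line.substring(start, end).trim();
                pos += widths[i];
            }
            return fields;
        };
    }

    // The "rest of the application": it only ever sees fields.
    static String describe(Tokenizer tokenizer, String line) {
        String[] f = tokenizer.tokenize(line);
        return f[0] + " lives in " + f[1];
    }

    public static void main(String[] args) {
        System.out.println(describe(delimited(";"), "Anna Conda;Javaville"));
        System.out.println(describe(fixedLength(12, 10), "Anna Conda  Javaville "));
    }
}
```

Swapping the file format means swapping the tokenizer argument; describe() stays untouched.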
Batch Execution
For batch executions in general, and within SpringBatch in particular, you design a solution around the concept of a job, which is a named sequence of steps. A step is a chunk of work, typically processing an input file.
One important execution condition for batch processing is operational performance monitoring and management, which in practice means the ability to track individual step instances and, if needed, restart a step (or job) and continue from where it left off. Traditional logging is not sufficient to fulfill this requirement; batch processing is therefore surrounded by lots of tracking logic, and the execution progress is tracked and persisted to a database.
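The restart idea can be illustrated with a toy example. The sketch below is purely hypothetical and has nothing to do with the real SpringBatch classes: a plain map stands in for the database, and the step records how many items it has committed so that a second run can continue after a failure instead of starting over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; a map plays the role of the persistent store.
class RestartableStepSketch {

    // Processes items from the last committed position; fails at index failAt (use -1 for no failure).
    static void runStep(String stepName, List<String> items, List<String> output,
                        Map<String, Integer> repository, int failAt) {
        Integer committed = repository.get(stepName);
        int start = (committed == null) ? 0 : committed;
        for (int i = start; i < items.size(); i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated failure at item " + i);
            }
            output.add(items.get(i));
            repository.put(stepName, i + 1);   // "commit" progress after each item
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> repository = new HashMap<String, Integer>();
        List<String> output = new ArrayList<String>();
        List<String> items = Arrays.asList("a", "b", "c", "d");
        try {
            runStep("helloStep", items, output, repository, 2);
        } catch (IllegalStateException e) {
            System.out.println("first run stopped: " + e.getMessage());
        }
        runStep("helloStep", items, output, repository, -1);   // restart
        System.out.println(output);
    }
}
```

After the restart, every item has been written exactly once, which is what plain logging alone cannot guarantee.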
With this said as background information, it is easier to understand and approach SpringBatch. It is also easier to decide whether you need both components or whether the item handling part is sufficient. For every non-trivial batch application, you will end up with a loop over the input data anyway, so why not give the batch execution part a chance as well? “Nuf talking, show me the code”
The Job Model
You organize a SpringBatch application in one or more jobs. If you have more than one job, they typically share lots of functionality when it comes to step and item processing logic. The required infrastructure is a job repository, a job launcher (runner) and a transaction manager. The latter is needed for committing chunks of work.
This is the intended way of working - the model. On the other hand, at least during investigation and early development (and maybe later as well), you have no need for transaction management and persisted execution tracking. The good news is you can fake it, which is exactly what we will do below.
Hello SpringBatch
The (first) SpringBatch application will be a minimal ‘hello world’ application, just to demonstrate what is required to get something up and running. I will use Maven, because it hides all the tedious tasks of managing all 3rd party libraries SpringBatch depends on.
Writer
The initial version contains only one single very small Java class. It prints out its input using Log4j. Let’s start with this one, so we can move on to the interesting parts.
package com.ribomation.tutorial;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.FlushFailedException;
import org.springframework.batch.item.ClearFailedException;
import org.apache.log4j.Logger;
public class LogWriter implements ItemWriter {
private Logger log = Logger.getLogger(this.getClass());
public void write(Object item) throws Exception {
log.info( item );
}
public void flush() throws FlushFailedException { }
public void clear() throws ClearFailedException { }
}
Reader
The writer class serves as the end-point of a chain of components that starts with a reader, which in our case reads items from a list. Here is the list definition as given in a spring-beans configuration file.
<bean id="reader" class="org.springframework.batch.item.support.ListItemReader">
<constructor-arg>
<list>
<value>Hello</value>
<value>Spring</value>
<value>Batch</value>
</list>
</constructor-arg>
</bean>
<bean id="writer" class="com.ribomation.tutorial.LogWriter"/>
You can see the input side is a Reader and the output side is a Writer. For every non-toy application these abstractions are tied to files and/or databases. But leave that out for the moment. So what happens in between?
Step
A step is a chunk of work, for example reading the items from the list one at a time and sending them to the writer. As I said above, SpringBatch supports a heavy-weight execution model intended to track step instance executions and support restarts. For this reason, the configuration of a trivial step is more complex than expected. You will need a transaction manager and a job repository, in addition to the two more obvious components, the reader and the writer. Here is our spring snippet
<bean id="helloStep" class="org.springframework.batch.core.step.item.SimpleStepFactoryBean">
<property name="transactionManager" ref="tm"/>
<property name="jobRepository" ref="jobRepository"/>
<property name="itemReader" ref="reader"/>
<property name="itemWriter" ref="writer"/>
</bean>
You can see it uses a factory bean to create the actual step behind the scenes. There are several intricate ways to create a step, but this will do for the moment.
Job
A job is a sequence of steps, which means each step is run to completion before the next is started. (Support for concurrent execution of steps is around the corner.) The easiest way to create a job is to (re-)use a SimpleJob. In our case, it has just one single step.
<bean id="helloJob" class="org.springframework.batch.core.job.SimpleJob">
<property name="jobRepository" ref="jobRepository"/>
<property name="steps">
<list>
<ref bean="helloStep"/>
</list>
</property>
</bean>
The listings above capture all our application logic. What remains is the batch execution and build infrastructure.
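Stripped of tracking and transactions, the logic wired together above boils down to two loops: a step pulls items from the reader and hands them to the writer, and a job runs its steps in order. The plain-Java sketch below shows that shape. The interfaces and the class JobStepSketch are hypothetical simplifications of my own, not the SpringBatch ItemReader, ItemWriter, Step and Job types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not the SpringBatch interfaces.
class JobStepSketch {

    interface Reader { Object read(); }            // returns null when exhausted
    interface Writer { void write(Object item); }
    interface Step   { void execute(); }

    static Reader listReader(List<?> items) {
        final Iterator<?> it = items.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // A step: read items one at a time and send each to the writer.
    static Step step(final Reader reader, final Writer writer) {
        return () -> {
            Object item;
            while ((item = reader.read()) != null) {
                writer.write(item);
            }
        };
    }

    // A job: each step runs to completion before the next one starts.
    static void runJob(List<Step> steps) {
        for (Step s : steps) {
            s.execute();
        }
    }

    static List<Object> demo() {
        List<Object> written = new ArrayList<Object>();
        Step helloStep = step(listReader(Arrays.asList("Hello", "Spring", "Batch")), written::add);
        runJob(Arrays.asList(helloStep));
        return written;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```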
Repo, Launcher and TM
In this toy application we do not need persistent execution tracking support and will use fake components. The job repository will store its job and steps in a hash map and the transaction manager used will be just empty (view its source code).
<bean id="tm" class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>
<bean id="jobRepository" class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
<property name="transactionManager" ref="tm"/>
</bean>
<bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
<property name="jobRepository" ref="jobRepository"/>
</bean>
That’s it. This is all the Spring wiring needed. What remains is the Maven POM.
Maven POM
I will not digress into Maven here, just leave it as is. The POM lists the required dependencies and adds a few nice-to-have plugins: for example, the dependency-plugin, which assembles all 3rd party JAR files into a sub-directory, and the jar-plugin, which sets the class-path to this lib directory and points out the main entry point, so we can run the artifact from the command line. The main class is CommandLineJobRunner, a small bootstrapper that loads a spring beans configuration and kicks off the job launcher.
Without further ado, here it is.
<?xml version="1.0" encoding="iso-8859-1"?>
<project
xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<name>HelloSpringBatch</name>
<groupId>com.ribomation.tutorial</groupId>
<artifactId>${project.name}</artifactId>
<packaging>jar</packaging>
<version>1.0</version>
<properties>
<javaVersion>1.5</javaVersion>
<springBatchVersion>1.1.1.RELEASE</springBatchVersion>
<springDaoVersion>2.0.8</springDaoVersion>
<springVersion>2.5.5</springVersion>
<log4jVersion>1.2.14</log4jVersion>
<appClass>org.springframework.batch.core.launch.support.CommandLineJobRunner</appClass>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-core</artifactId>
<version>${springVersion}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-beans</artifactId>
<version>${springVersion}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-dao</artifactId>
<version>${springDaoVersion}</version>
</dependency>
<dependency>
<groupId>org.springframework.batch</groupId>
<artifactId>spring-batch-core</artifactId>
<version>${springBatchVersion}</version>
</dependency>
<dependency>
<groupId>org.springframework.batch</groupId>
<artifactId>spring-batch-infrastructure</artifactId>
<version>${springBatchVersion}</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>${log4jVersion}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-idea-plugin</artifactId>
<configuration>
<jdkLevel>${javaVersion}</jdkLevel>
<downloadSources>true</downloadSources>
<downloadJavadocs>true</downloadJavadocs>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>${javaVersion}</source>
<target>${javaVersion}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<index>true</index>
<manifest>
<mainClass>${appClass}</mainClass>
<addClasspath>true</addClasspath>
<classpathPrefix>lib/</classpathPrefix>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Log4j
We will also need a minimal log4j configuration file (log4j.properties), which resides together with the hello-spring-batch.xml configuration file in the src/main/resources directory of our maven project.
log4j.rootLogger=info, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %5p [%c{1}] %m%n
log4j.logger.org.springframework=warn
log4j.logger.org.springframework.batch=info
log4j.logger.com.ribomation.tutorial=debug
Compilation and Execution
Build the maven application using
mvn package
Run it using the command below
java -jar target\HelloSpringBatch-1.0.jar hello-spring-batch.xml helloJob
As you can see, we run the executable JAR file, with two required command line parameters. The first points to the spring beans file in the class path, and the second parameter is the name of the job to run. The output of the execution looks like this
13:17:26 INFO [SimpleJobLauncher] No TaskExecutor has been set, defaulting to synchronous executor.
13:17:26 INFO [SimpleStepFactoryBean] Setting commit interval to default value (1)
13:17:26 INFO [SimpleJobLauncher] Job: [SimpleJob: [name=helloJob]] launched with the following parameters: [{}{}{}{}]
13:17:26 INFO [LogWriter] Hello
13:17:26 INFO [LogWriter] Spring
13:17:26 INFO [LogWriter] Batch
13:17:26 INFO [SimpleJobLauncher] Job: [SimpleJob: [name=helloJob]] completed successfully with the following parameters: [{}{}{}{}]
Let’s take one step back and review our (toy) application. SpringBatch has taken care of all plumbing code to iterate over an input source and invoke various components leaving us to concentrate on the core business - in our case just to print it out to the console.
A minor variation
Before I close this post, let’s add one small variation. In the initial version of our hello application, we just let the data (items) pass straight on to the output writer. This is clearly not realistic, even if we ignore the list input for the moment. Typically, the items need to be processed and/or transformed in some way. One possibility is to attach an item transformer to the writer.
Item Transformer
The class below takes care of transforming the input item (a string) into another item (an upper case string).
package com.ribomation.tutorial;
import org.springframework.batch.item.transform.ItemTransformer;
public class UpperCaseTransformer implements ItemTransformer {
public Object transform(Object item) throws Exception {
return item.toString().toUpperCase();
}
}
The next step is to attach the transformer to the Writer, using an ItemTransformerItemWriter, which is a delegating writer combined with a transformer invoker.
<bean id="transformingWriter" class="org.springframework.batch.item.transform.ItemTransformerItemWriter">
<property name="itemTransformer">
<bean class="com.ribomation.tutorial.UpperCaseTransformer"/>
</property>
<property name="delegate" ref="writer"/>
</bean>
Complicated? No, not really. It first invokes the transformer object, and then the writer, sending it the transformed item. The only remaining task is to update the step definition to use the transformingWriter. If you allow me, I leave that as an exercise for you. The only difference in the output is that the item strings are now in upper case.
16:16:36 INFO [LogWriter] HELLO
16:16:36 INFO [LogWriter] SPRING
16:16:36 INFO [LogWriter] BATCH
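The delegating pattern behind the transforming writer is small enough to sketch in plain Java. The version below is a hypothetical illustration using java.util.function types instead of the SpringBatch ItemTransformer and ItemWriter interfaces; the class and method names are my own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical illustration only; not the SpringBatch ItemTransformerItemWriter.
class TransformingWriterSketch {

    // Wraps a writer: transform the item first, then pass the result to the delegate.
    static Consumer<Object> transforming(final Function<Object, Object> transformer,
                                         final Consumer<Object> delegate) {
        return item -> delegate.accept(transformer.apply(item));
    }

    static List<Object> demo() {
        List<Object> written = new ArrayList<Object>();
        Consumer<Object> writer = written::add;   // stand-in for the LogWriter
        Consumer<Object> transformingWriter =
            transforming(item -> item.toString().toUpperCase(), writer);
        for (String s : new String[] {"Hello", "Spring", "Batch"}) {
            transformingWriter.accept(s);
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The step only knows about a writer, so it cannot tell whether it talks to the plain writer or to the wrapped one.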
Source Code
The Camel Distribution
A long time ago, before the internet era, I was working on my PhD in the field of distributed event-driven simulation. One of the key questions was how to manage the event time line. The catch in this case is that the time line is distributed.
To be more precise: a simulation event time line is a priority queue, where event notices are inserted during the processing of an event. By definition, events are posted to a future point in simulated time. Distributed event-driven simulation means that several processors are processing events, where each event emits new future events. Whenever an event is processed the simulated time is incremented monotonically. If several events are processed in parallel, the simulated time is blurred between the earliest and the latest time points of the events being processed. The trick is to ensure that no newly emitted event falls within this blurred time interval, because time cannot move backwards.
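As a reminder of what the time line looks like in the ordinary, sequential case, here is a minimal sketch using java.util.PriorityQueue. It is a hypothetical illustration only: it runs on a single thread, the event names and delays are made up, and it says nothing about the distributed Calendar queue discussed in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; a sequential, single-threaded time line.
class EventTimeLineSketch {

    static class Event implements Comparable<Event> {
        final double time;
        final String name;

        Event(double time, String name) {
            this.time = time;
            this.name = name;
        }

        public int compareTo(Event other) {
            return Double.compare(time, other.time);
        }
    }

    static List<String> run() {
        PriorityQueue<Event> timeLine = new PriorityQueue<Event>();
        List<String> log = new ArrayList<String>();
        timeLine.add(new Event(0.0, "start"));
        double now = 0.0;
        while (!timeLine.isEmpty()) {
            Event e = timeLine.poll();   // always the earliest pending event
            now = e.time;                // simulated time never moves backwards
            log.add(now + ":" + e.name);
            if (e.name.equals("start")) {
                // Processing an event posts new events to future points in time.
                timeLine.add(new Event(now + 5.0, "late"));
                timeLine.add(new Event(now + 2.0, "early"));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With several processors polling the queue at once, the single value of now turns into the blurred interval described above.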
The time line was implemented using various priority-queue implementations, suitable for execution on a multi-processor. I was especially interested in the performance of my implementation of a distributed version of the Calendar queue.
During my tests I realized the need to emulate bursty traffic, i.e., bursts of insertions of events into the time line priority queue. I couldn’t find a suitable stochastic distribution that could easily model bursts. Like, “give me five bursts, over the distribution interval”.
I played around with several different stochastic distributions for random number generation, without finding a simple to use distribution with the property stated above. This inspired me to design a distribution myself, which was named the Camel distribution, because of the humps. A Camel distribution is a composition of one or more Dromedary distributions. A Dromedary distribution has one single hump.
The Camel distribution paper was published in 1991, and the distribution has been in use in various computer simulation projects ever since.
My first implementation for random number generation based on the Camel distribution was in C, because that was the language I used for my research. The implementation here is in Java. However, it is very easy to implement the algorithm in some other language, based on the Java code or straight from the source. More information can be found here.
Speaker at JavaForum
I will give a talk (in Swedish) at JavaForum in Stockholm, 30 September 2008. The title is Rena kartor - GoogleMaps med AJAX and I will talk about how functional programming is now a commodity on both the server and the client side. The basis of my talk is a small AJAX application using Google Maps. More info can be found here.

