Archive for August, 2007

There is a Spoon: Integrating ETL into your application

A not-too-uncommon occurrence when building a data-driven application is the realization that you need an honest-to-goodness data warehouse. Having one-off statistical tables scattered throughout your schema just doesn’t cut it after a while. Thankfully, there’s plenty of information on the web on how to build a data-warehouse. Once you create your denormalized, summarized, and dimensioned schema, you face the challenge of getting the data out of your transaction processing database and into the warehouse. That’s the domain of ETL (Extract, Transform, and Load) tools. This post is about how to get Pentaho Data Integration (formerly known as Kettle) embedded within your application. Since I like ‘Kettle’ much better than PDI, I’m going to use Kettle throughout the post.

Step 0: Create your transformation

The transformation will be very specific to the tables you are trying to transform. However, there are a few general guidelines that will make the process easier. Unless your production environment is your workstation, it’s a good practice to use Kettle variables for all the database fields. To make the spoon gui work, you will need to add a kettle.properties file to your userhome/.kettle/ directory. I’ll show you how to set the properties within your application a little further down.

Step 1: Getting your project environment configured

The first step is to identify which jars you’ll need to add to your project. Kettle depends on a lot of the usual java libraries, so the new stuff you need to add is probably minimal. From the lib directory of the Kettle distribution, you’ll need kettle.jar. Given that I work on a server application, I was somewhat chagrined that I had to import common.jar from the libswt directory. You’ll probably need to grab a few other things from libext before the kettle stuff will compile. What specifically you need will depend on what you already have. You won’t need everything, since you are only going to embed a small portion of the total Kettle toolkit.

Step 2: Create the Skeleton

There are three steps to executing a Kettle transformation: initializing the environment, loading the transformation, and execution. Therefore, I created a method that currently calls stubs to handle each of these steps as well as repackage any exceptions that are thrown.

    private void executeKettleTransform(String transformLocation) throws WarehouseDataEtlException {
        try {
            initializeEnvironment();
            TransMeta meta = loadTransform(transformLocation);
            executeTransform(meta);
        } catch(Throwable t) {
            throw new WarehouseDataEtlException("Unable to run transform at " + transformLocation, t);
        }
    }

Step 3: Implement the initialization functions

The initialization method needs to do a few things. First, it has to initialize the transformation’s environment, followed by loading the steps, then setting all the variables for the jdbc properties.

    private void initializeEnvironment() {
        EnvUtil.environmentInit();
        StepLoader steploader = StepLoader.getInstance();
        if (!steploader.read()) {
            throw new IllegalStateException("Spoon broke for some reason");
        }

        KettleVariables kettleVariables =  KettleVariables.getInstance();
        kettleVariables.setVariable("jdbc.server.name", jdbcServerName);
        kettleVariables.setVariable("jdbc.olap.server.name", jdbcOlapServerName);
        kettleVariables.setVariable("jdbc.dbname", jdbcDBName);
        kettleVariables.setVariable("jdbc.olap.dbname", jdbcOlapDBName);
        kettleVariables.setVariable("jdbc.username", jdbcUsername);
        kettleVariables.setVariable("jdbc.password", jdbcPassword);
    }

One of the more annoying things about working with the Kettle API is it’s use of returning false for failures instead of exceptions. Hence, the boolean check around steploader.read(). Also note that this is where we set the variables for our application runtime. This approach lets the developer run the transformation in the application as well as from the tools provided with the Kettle project, which can be extremely helpful while debugging.

Step 4: Load the transformation

A Kettle transformation is just an XML file, which provides some flexibility in how you load it within your server app. I chose to put the transformation on the classpath, then use the class loader to get access to it. Since the 2.5.1 release of Kettle does not support loading a transformation directly from a stream, you have to parse the XML then feed it to the transformation metadata object.

    private TransMeta loadTransform(String transformName) throws Exception {
        String transformLocation = getTransformPath(transformName);
        TransMeta transMeta = null;
        InputStream transformStream = this.getClass().getResourceAsStream(transformLocation);

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(transformStream);

        transMeta = new TransMeta(doc.getFirstChild());
        return transMeta;
    }

A crucial realization is that you need to pass in the first child of your xml document, and not the root of the document itself. Otherwise, Kettle will throw exceptions about being unable to find anything to load. The call to getTransformPath is a call to an internal method that maps a string to a classpath location.

Step 5: Execute the Spoon

Executing the transformation is just getting an instance of the transformation from the transformation metadata object, then executing it. If you are logging transformation executions to a table, you need to wait for it to finish, then manually call end processing. Failure to do these steps will mean you will never see a row with the status end in the log table. This is especially bad if you are using transformation times in your where clauses for data input steps.

    private void executeTransform(TransMeta transMeta) throws Exception {
        Trans trans = new Trans(LogWriter.getInstance(), transMeta);
        if(!trans.execute(transMeta.getArguments())) {
            throw new RuntimeException("Transformation failed");
        }

        trans.waitUntilFinished();
        trans.endProcessing("end");
    }

That’s all there is, and you can easily test it by wrapping a junit around the execute method call. In the next few weeks, I’ll post something about doing near-real-time warehousing using quartz, spring, and the code that I’ve shown here.

  • Share/Bookmark

Eclipse and Subversion on Mac OS X

I don’t think there’s anything more frustrating for a developer than having your development tools slow you down instead of help you get your work done. I’m still a fan of Eclipse, mostly because I’ve used it so much, everything else feels foreign. Trying to develop Java code without it is a lot like trying to find your way out of the middle of a dark forest with neither map nor flashlight.

Eclipse, however, is far from perfect, and one of it’s features that’s really been getting on my nerves lately is it’s Subversion support. I had been using Subclipse for over a year and a half. Unfortunately, Subclipse’s rough edges seemed to grow proportionally with the size of the code base. Eclipse has always been the best at version control support, but Subclipse always felt lacking and a little ugly.

Switching to Subversive alleviated some of the problems. Subversive doesn’t give me as many bizarre errors related to malformed network data. There’s partial support for visual merge’s, similar to the built in CVS support, and it’s icon set is better. However, one major problem remained: performance.

The problem seems to be the pure-java SVNKit. It’s dog slow. Checking out my repository in Eclipse took over 20 minutes. Checking it out with the SVN command-line client, took around 2 minutes. This is simply unacceptable. Windows users probably have not experienced this problem, since Subversive bundles the native javaHL Subversion implementation.

On the mac, the only alternative was to roll your own Subversion client from source. The desire to get away from a platform where I had to manually compile software was part of the draw for using OS X in the first place. Thankfully, the folks at OpenCollab.net just released a binary of Subversion 1.4.4 which includes the javaHL module and an easy to use installer.

Once the binary is installed, all that’s left is to open the Eclipse preference menu (which as of Eclipse 3.3 is in the correct place), then select the javaHL client under the SVN Client tab under Team -> SVN. Finally, the version control feature of Eclipse, with close to the command line speed!

  • Share/Bookmark
Return top

About

This is my blog about programming. For random stuff, checkout my Twitter or Tumblr