LZO vs Snappy vs LZF vs ZLIB, A comparison of compression algorithms for fat cells in HBase

Now and then, i talk about our usage of HBase and MapReduce. Although i am not able to discuss details beyond what is written on my linkedin profile, i try to talk about general findings which may help others trying to achieve similar goals. This post is about some recent research aimed at increasing IO performance for our MapReduce jobs.

Why Compression?

The HBase documentation and several posts on the hbase-user mailing list say that using some form of compression for stored data may lead to an increase in IO performance. Considering hadoop clusters almost always run on commodity machines, the reason is simple to explain: disks are slow. The Hadoop workloads i know about are generally data-intensive, which makes data reads a bottleneck in overall application performance. By using some sort of compression we reduce the size of our data and achieve faster reads. On the other hand we now need to uncompress that data, so we spend some CPU cycles. It is simply trading IO load for CPU load.

If the infrastructure is starved of disk capacity but has no performance problems, it may be logical to use an algorithm that gives huge compression ratios at the cost of some CPU time, but that’s usually not the case. Large capacity disks are far cheaper than fast storage solutions (think SSDs), so it is better for a compression algorithm to be fast than to give higher compression ratios. Because of that, hadoop applications prefer LZO, a real-time fast compression library, to ZLIB variants. Of course this is general talk; to see real performance changes and compression ratios, one has to try those algorithms with one’s own data.

Which algorithm?

Our data is around 700kB per row and for testing purposes we have 100k rows. Each cell contains an image, more specifically a subset of an image, so it is binary and supposedly not as compressible as some log file. Using no compression, our test data of 1000 items takes up 670MB and our MapReduce tasks are able to read a cell in 8.41ms.

The first algorithm we tried was ZLIB, that is java.util.zip.Deflater/Inflater, following this post by @jdcryans. It simply involves using Deflater just before “Put”ting data into HBase, and using Inflater just after reading data from “Result”s. The total size of our 1000 items decreased to 346MB, meaning a compression ratio of 48%. But our reading performance suffered by 16%, increasing the time per row to 9.73ms.
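To make the per-cell approach concrete, here is a minimal sketch of the Deflater/Inflater round trip using only java.util.zip. The class and method names (CellCodec, compress, decompress) are made up for illustration and are not taken from the linked post; in a real job, compress would be called on the value just before building the Put, and decompress on the bytes taken out of the Result.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper (not from the original post): compresses one cell value
// before it is written and restores it after it is read back.
final class CellCodec {

    // Compress a single cell value with ZLIB at the default compression level.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end(); // release the native zlib state
        return out.toByteArray();
    }

    // Restore the original bytes of a value produced by compress().
    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No output, not finished, and nothing left to read: the input is incomplete.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported ZLIB data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid ZLIB data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Because each value is compressed on its own, the codec never sees similarities between rows, which is the trade-off this approach makes for simplicity.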

The second one was the famous LZO. Although we are unable to re-distribute it because of licensing issues, we still felt the urge to test it and see what we are missing. It is somewhat harder to use in hadoop (at least the recommended way), but i’ve managed. You can check here and here for instructions on how to set it up. On the other hand, this complexity is sure to have a benefit. All other methods i talk about here compress data on a per-item basis. LZO, in contrast, compresses the whole file in HDFS, so in a regular setup it is expected to have better compression ratios, since there will be similarities among the rows and it will exploit those. Anyway, our 1000 item set resulted in 398MB, meaning a 41% compression ratio, and we’ve seen a 5% increase in reading performance too: we read one item in 8.1ms compared to 8.41ms uncompressed. So it is starting to become a win-win.

The third test was an LZF implementation, ning-compress, following Ferdy Galema’s response to the previous Deflater tip. It works the same way: use LZFEncoder.encode just before writing to HBase and LZFDecoder.decode just after reading. In this test our data size was 400MB, meaning a compression ratio of 40%. Reading performance increased by 21%, with 6.63ms spent per item.

The last one was Google’s recently announced snappy. The same compress-each-item-separately mechanism applies here, with Snappy.compress and Snappy.uncompress. Data size was 403MB, which means around a 40% compression ratio, and we read our data at 6.37ms per item, which indicates a 25% increase in IO performance.

Conclusion

Algorithm	Compression Ratio	IO performance increase
Snappy		40%			25%
LZF		40%			21%
LZO		41%			5%
ZLIB		48%			-16%
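For anyone who wants to redo the arithmetic with their own measurements, here is a small sketch. It assumes that “compression ratio” means the fraction of space saved and that “IO performance increase” means the relative reduction in per-item read time; both definitions, and the class and method names, are my reading rather than something the post spells out. With the sizes and times as printed above, it lands within a point or so of the table, so the table’s percentages were presumably rounded or taken from unrounded measurements.

```java
// Hypothetical helper: recomputes the table's percentages from raw sizes and read times.
final class CompressionStats {

    // Space saved relative to the uncompressed size, in percent.
    static double spaceSavedPercent(double originalMb, double compressedMb) {
        return (1.0 - compressedMb / originalMb) * 100.0;
    }

    // Reduction in per-item read time relative to the baseline, in percent.
    // A negative value means reads got slower.
    static double readTimeGainPercent(double baselineMs, double measuredMs) {
        return (baselineMs - measuredMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        // Figures quoted in the post: 670MB and 8.41ms per item without compression.
        String[] names = {"Snappy", "LZF", "LZO", "ZLIB"};
        double[] sizesMb = {403, 400, 398, 346};
        double[] readMs = {6.37, 6.63, 8.1, 9.73};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-6s space saved %.1f%%, read time gain %.1f%%%n",
                    names[i],
                    spaceSavedPercent(670, sizesMb[i]),
                    readTimeGainPercent(8.41, readMs[i]));
        }
    }
}
```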

I am suspicious about something in the LZO scores, since I was expecting much better performance. But it doesn’t matter because of our inability to redistribute it. Snappy-java, with its Apache license, is a clear winner. It is very easy to use too.

I have to remind you again: YMMV. These are the scores for data which consists of 700kB rows, each containing binary image data. They probably won’t apply to things like numeric or text data.

And the winner is …

It was harder than i thought to convince my sysadmin to do anything or to let me do anything, so i continued with my own machine, with some simpler tests. Since i am thinking about using a number of languages/frameworks, i tried to write the easiest possible example for each of them and measure performance.

The easiest example was a web page writing 3 bytes, “123”, as the output. One performance measure is the number of requests processed per second (RPS). The other is the response time, in milliseconds, within which 90% of the requests were served at a concurrency of 100. I started with a little confusion, not knowing which ones i should test for, but as the number of frameworks i tested grew, i managed to classify them into two categories.

Networking Frameworks: These are the pieces that handle user requests first and pass them to our code for further processing. In my setup i simply returned those 3 bytes along with some HTTP headers.

Web Frameworks: These are the things that make our lives easier by handling all generic stuff and letting us focus on our logic. They also tend to have simpler programming models.

And there are some other tools which i couldn’t put into these categories but which still had to be tested for this post to become more interesting. Actually, i will start with these. But before all, i should say that none of the tools had more than 15 minutes of my time [with the exception of Ruby tools]. Almost everything i did was a straight steal from their respective docs, running in default configurations. That means two things: your mileage may and will vary, and the numbers here may indicate how performance-oriented each tool is out of the box.
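Since every result below is reported as RPS plus a 90% response time, here is a sketch of how those two numbers can be derived from raw per-request timings. This is not how ab computes its report; it is just one common definition (the nearest-rank percentile), and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: derives the two benchmark measures from raw timings.
final class LoadTestStats {

    // Throughput: completed requests divided by the wall-clock duration of the run.
    static double requestsPerSecond(int completedRequests, double wallClockSeconds) {
        return completedRequests / wallClockSeconds;
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] latenciesMs, int p) {
        if (latenciesMs.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need samples and 1 <= p <= 100");
        }
        long[] sorted = latenciesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

A single slow outlier barely moves the 90% figure, which is why it says more about typical user experience than the maximum does.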

The first uncategorized tool is couchdb. It is not actually a web framework, and definitely not a networking framework. But it acts as a web server and is able to process some sort of user request, so it might fit the purpose of these tests. Besides, i love it and don’t want to do anything that doesn’t involve a piece of it.
RPS: 1150, 90% RT: 120
I cannot compare the numbers to any other tool out there because there is no other one like it. It still is a winner.

The next uncategorized tool is php. Considering the popularity, i cannot leave it behind. Considering the options, it generates another category in itself. I am sure google knows about many blogs whose sole purpose is to compare php frameworks, which i left out. What i did was to get some measurements for a number of different environments:

					RPS	90%
Apache prefork + mod_php 		7410	15
Apache prefork + mod_fcgi + php-cgi	5770	18
Apache prefork + php-fpm		5170	24
Apache worker + mod_fcgi + php-cgi	5440	22
Apache worker + php-fpm			5740	19
lighttpd + php-fpm			5000	24
nginx + php-fpm				7430	18

Seems like the old way with mod_php and prefork is pretty fast, but make no mistake: i did not measure the memory usage of these environments, and from what i read in various places, prefork sucks at it. All the other configurations with worker seem close to each other, so again memory usage may make the difference. I expected lighttpd and nginx to generate similar results but that wasn’t the case.
My winner is : nginx + php-fpm

Networking Frameworks

					RPS	90%
twisted					2750	31
node.js					8150	18
gunicorn				11000	12
jboss netty				13000	6

I would love to have unicorn added here too but i wasn’t able to find a way to run it without sinatra. At least not in 15 minutes. I am especially impressed by gunicorn’s performance. It was the fastest to get running too. Their website gave me what i wanted straight from the landing page and it didn’t take me 2 minutes to hit ab. node.js was fast too but i had to compile it first, so i finished in about 5 minutes. Twisted was a little … well, twisted. Its programming model is somewhat different from the others, so it took a little longer. No need to comment on netty. Java development cannot be faster than anything dynamic.
My winner is : gunicorn

Web frameworks

					RPS	90%
sinatra(unicorn)			734	140
express(node.js)			5660	30
django(gunicorn)			5270	20
play(netty)				7000	20
??					14220	6

I do not want to say anything on sinatra and unicorn here because i am almost sure that i did something wrong. I could not investigate further because of my 15 min rule. I believe all unicorn lovers have an explanation for this. I just don’t. On the other hand, this comparison is somewhat unfair. play and django are full web frameworks with ORMs and caching and all. sinatra and express are lightweight and they aim for performance. Oh, yet the numbers indicate otherwise.
My winner is : play

There actually is another web framework that totally blew my mind when i saw the output of the benchmark. But because of the unfairness i mentioned above, i did not want to include it in the list [but then i did]. It actually beat even netty and answered 14220 requests per second. 90% of the requests were answered in under 6 milliseconds. And ironically it was inspired by the worst performing tool of all:
My grand prize winner is scalatra. I will definitely use more of it in the future.

Announcing results with/of couchdb

I have already described the scenario in the previous post. Now it’s time for some business, with couchdb, as it is my favorite nowadays. The application is actually dead simple. One screen asks for your ID and password, the next one lists your results. In order to look like some real-life counterpart (not that one exists), i added some styling to the page, but still kept the size minimal. You can see the screenshots in the story-telling post.

All of this is composed of a single show function in order to keep the design doc minimal. However, in order to make it like a real-life application, with all the maintainability issues and such, i used a mustache template to render that screen, and that means i used mustache.js too. User passwords should actually live in some other authentication system, but since we aim for high performance, we cannot rely on some alien auth system. So every user/result doc also contains an md5’ed password. And that means we need to be able to generate md5 hashes out of user input, so i used md5.js too. The whole design doc (a special doc in which you define your application) looks like this:

{
   "_id": "_design/app",
   "_rev": "6-c80ec7e20e6ded17bf0e048fff596665",
   "templates": {
       "result": "[see below]"
   },
   "lib": {
       "mustache": "[mustache.js]",
       "md5": "[md5.js]"
   },
   "language": "javascript",
   "shows": {
       "result": "[see result.js below]"
   }
}

Show function result.js is :

function(doc, req) {
  var Mustache = require("lib/mustache");
  var md5 = require("lib/md5").hex;
  var ctx = {
    form: true
  };
  if (req.query.password) {
    if (!doc || doc.passwd != md5(req.query.password)) {
      ctx.formErr = "Wrong ID number and/or password";
    } else {
      ctx.form = false;
      ctx.doc = doc;
    }
  }
  return Mustache.to_html(this.templates.result, ctx);
}

For couchdb outsiders, this is a show-function which runs on couchdb when a request is received on some URL. If you add a doc-id to that URL, we receive the corresponding doc (a json object) in function’s doc parameter. req parameter contains details about http request. The result of the function is sent back to the user. Simple.

As you can see, both the login and the result screens spawn from the same function. It is simply a matter of the query string of the request. More specifically, if the user tries to log in we check for auth and provide the doc to the template; if not, we pass form:true to the template so the login form gets rendered. The relevant parts of the template go like this:

{{#form}}
<p>Please enter your identification number and password</p>
<form onsubmit="sb()">
  <table border="0" cellpadding="3" cellspacing="0">
    <tr><th><label for="idn">ID Number</label></th>
      <td><input type="text" id="idn"/></td></tr>
    <tr><th><label for="pass">Password</label></th>
      <td><input type="password" name="password" id="pass"/></td></tr>
  </table>
  <p><input type="submit"/></p>
</form>
{{/form}}

<p class="err">{{formErr}}</p>

{{#doc}}
...Result screen...
{{/doc}}

<script type="text/javascript">
  function sb() {
    document.forms[0].setAttribute(
      "action", "/"+document.getElementById("idn").value);
  }
</script>

That last script changes the form action to include the user’s id number in the request path, which means the show function will be run on the doc with that id. Short version: document ids are user ids, and in order to ‘show’ that document we request the doc with the user’s id. An example user/result doc looks like this:

{
   "_id": "1000002",
   "_rev": "1-72b2b17d3c46a69464c55c80373abc01",
   "name": "John Knife",
   "passwd": "877466ffd21fe26dd1b3366330b7b560"
   // result data
}

So if we request something like “_show/result/1000002” our result show function will be run with that doc.

I am omitting the result data and template because they may change for every exam. Also it is extremely easy to render json with mustache.

The application is done, but the system isn’t. I need 3 million docs in that db, so i wrote a simple script to generate them all. It took me 15 minutes to prepare and around 20 minutes to run. In just over half an hour i had 3 million records ready to be read by the application. Add that to the 2.5 hours mostly spent on styling, and i had completed a single node system in about 3 hours. Isn’t that relaxing!
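The generator script itself is not shown here, so the following is only a sketch of what producing one such doc could look like, with the password hashed the same way the show function checks it. The class and method names are made up, the name is assumed to contain no characters that need JSON escaping, and the result data is left out; docs like these could then be posted in batches to CouchDB’s _bulk_docs endpoint.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical generator helper (the original script is not shown in the post).
final class UserDocs {

    // Lowercase hex MD5 of the UTF-8 bytes of the given text.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Builds a minimal user/result doc whose _id is the user's ID number.
    // Assumption: name needs no JSON escaping; result data is omitted.
    static String userDoc(int id, String name, String password) {
        return String.format("{\"_id\":\"%d\",\"name\":\"%s\",\"passwd\":\"%s\"}",
                id, name, md5Hex(password));
    }
}
```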

Speaking of single node deployment (my pc, a quad core machine with 4G ram), here are some numbers. Using ab on the login screen, i got around 500 RPS. With 100 concurrent users, 90% of the requests were served in under 290ms. But that is hardly a simulation of the running system. Normally, a user lands on the login screen, then asks for his/her results using credentials. Using jmeter to fit this scenario, i got 390 RPS and 360ms for 90%. In either case, a single machine will take hours to serve 3 million results, so we need to scale.
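A quick back-of-the-envelope check of that “hours” claim, counting one request per result and assuming (optimistically) that throughput scales linearly with the number of identical nodes. The class and method names are illustrative only. At 390 RPS one node needs a bit over two hours, and four nodes would need roughly half an hour; the two-step login-then-result flow would push both numbers up.

```java
// Hypothetical estimate: how long a burst of requests takes to serve.
final class ServeTime {

    // Seconds needed to serve all requests at a sustained per-node rate,
    // assuming perfectly linear scaling across nodes.
    static double secondsToServe(long requests, double rpsPerNode, int nodes) {
        if (rpsPerNode <= 0 || nodes <= 0) {
            throw new IllegalArgumentException("rate and node count must be positive");
        }
        return requests / (rpsPerNode * nodes);
    }

    public static void main(String[] args) {
        long results = 3_000_000L; // candidates waiting for their results
        double rps = 390.0;        // jmeter figure measured on a single node
        for (int nodes : new int[] {1, 4}) {
            double minutes = secondsToServe(results, rps, nodes) / 60.0;
            System.out.printf("%d node(s): about %.0f minutes%n", nodes, minutes);
        }
    }
}
```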

Since the application is totally stateless, I should be able to run a number of couches behind some kind of a load balancer. Just put the data on all of them and assign an usher to show people where to sit. And since there are no writes i shouldn’t even need replication among them. Simple enough. The hard part is to convince some sysadmins to install couchdb on some servers. I need some time. I’ll update here when i’m done.

Examining the examination results - Warm up

Consider this: You took an examination and your result will place you in a university of your choosing, or not. Although the date of the result announcement has already been made public, on the evening before it a reporter on TV tells us: “2021 university acceptance exam results will be announced tomorrow morning at 10am. [Enters a video showing students taking exams and zoomed-in test answer forms] 3 million candidates have to wait just another night to learn if they have made it to the school of their dreams, or if those dreams are to be held onto for another year. The head of the “student picking and planting institute” spoke today to inform us that the candidates will be able to learn their scores online at http://localhost:5984/ [Enters a monitor showing the app (below)] and that official statements will be mailed till late July… [Continues with some interviews on the street]”

You got the picture. Only two strokes out of this picture really concern me: 3 million and 10am. That means that poor server will start to get hammered for results at around 10am. We need a high-performance web application to process all that. This’ll be my scenario for trying out web frameworks and datastores till i find a better one. The criteria will be requests per second on a single machine, requests per second on a four-machine cluster with some load balancing, and how long it took me to complete a running system able to serve 3 million results.

Although not-final I have these tools in mind to experiment on

And besides couchdb, i’ll need a datastore for them to work such as

I’ve already finished the app with couchdb (screenshots) but, you know, the post needs to be cooked before serving. Just hold on a little.

Compiling gcc-4.5 on debian unstable

As of today, i have found no way to install gcc-4.5 on debian unstable without manually compiling it. It seems there is a deb package in ‘experimental’ but it does not install because of a bunch of dependency issues. So i was back in the old days when we needed to compile several versions of gcc in a chain in order to run apache on AIX.

I had forgotten how hard it is to get these things right, so i dived straight into the ./configure && make cycle without RingTFM. My bad. On the other hand, TFM is so Fed up that it was not possible for me to R and understand anything. Anyways… this post is just for future reference.

First of all, just as everyone else mentions, you need GMP, MPFR and MPC to compile gcc. But these are not enough: you will need PPL, CLOOG and libelf too. You might try to get these from the debian repositories, but you may not be able to find them. Even if you do, you may not be able to install them because of the same dependency conflicts. I compiled all of them manually, and here are the versions used:

  • gmp-5.0.1
  • mpc-0.9
  • mpfr-3.0.0
  • ppl-0.11.1
  • cloog-ppl-0.15.10
  • libelf-0.8.13
  • and of course gcc-4.5.2

Whenever any of those required a library not in the list, i installed it using apt. After running ./configure && make && make install for all the dependencies, i did the following to get a build.

mkdir gcc-build
cd gcc-build
../gcc-4.5.2/configure --disable-ppl-version-check --enable-languages=c,c++
make -j3
make install

Creating a dir like gcc-build and building there is the preferred way of doing things. I did not come across any, but docs say doing a build in the same directory as sources may yield unexpected results.

The configure script looks in /usr/local for manually installed libraries by default. But if you have changed the prefix for the dependencies, you should point to them with the parameters --with-gmp, --with-mpfr, etc. I suggest you leave them at the defaults. No one needs any more complexity.

Here is the catch: gcc documentation says that it requires ppl-0.11 but the ./configure script somehow requires 0.10. So, if you provide 0.10 (say with apt-get install libppl0.10-dev) the config.log will not complain but ‘make’ will fail with:

configure: error: cannot compute suffix of object files: cannot compile

Per documentation, this tells you absolutely nothing. And if you give 0.11 (say by compiling manually) config.log complains with the stupidest error possible:

conftest.c:16: error: 'choke' undeclared (first use in this function)

It seems that “choke” is a non-keyword, non-variable statement, something like “foobar”, whose sole purpose is to fail the compiler. Another error message which tells you absolutely nothing about the problem and, worse, leads you in a completely irrelevant direction. I mean, if you need to fail the compiler, can’t you just write something like “_PPL_version_is_not_0.10” so that i get an error like

conftest.c:16: error: '_PPL_version_is_not_0.10' undeclared (first use in this function)

Is it really that hard?! Anyways, i guess everyone but me uses --disable-ppl-version-check by default in their build scripts, so no one ever mentions it.

Another weird behaviour that cost me about half an hour is this line in config.log

conftest.c:10:28: error: ac_nonexistent.h: No such file or directory

Google this and get nothing again. Because it is nothing. It is a way to test that the compiler refuses to load a nonexistent header, as it should. So this error is not an error but a success. But if you just look for error lines in config.log, you are as silly as i am. How can you even think this is a configuration error!

I have probably compiled dozens of different packages, but i had never seen a successful ./configure output together with a failing config.log. If there was a library missing, the ./configure output always told me what was missing. And if it did not complain about anything, i went straight to make. I never needed to check config.log except in some exceptional cases. With gcc, this whole behaviour is the default: you always need to check config.log for errors before make.

And remember, errors may not be errors! Even if they are, the few words written right next to them are not error descriptions. Now i understand why there are so many C-flavored languages.

UPDATE: I received my official gcc-4.5 package with today’s morning update. Guess i should’ve waited just a few more days.

Monitoring nVidia GPU metrics with Ganglia

Editor’s Note: For the first time in sleepcoding’s short history, i welcome a guest post: our sysadmin H. Çağlar Bilir writes about our recent experiences with graphics cards and ganglia. I mentioned our new cluster before but omitted the detail that the machines also have some GPUs in them. New toys mean new details to look after. So here it goes.

After a brief period of googling, we could not find any out-of-the-box solution for monitoring nVidia GPUs with Ganglia, so we created our own quick and dirty solution. This is the explanation, code and configuration. First some warm-up:

  • We have only gtx 480 cards, we have only one gpu per node. We only monitored temperature, fan speed and gpu/memory utilization.
  • We have linux as underlying OS. We used ganglia 3.1.7 compiled with python support.
  • nVidia has a command line tool called nvidia-smi (NVIDIA System Management Interface program) and it is in nvidia kernel module packages. (man nvidia-smi)
  • The nvidia-smi tool may not be able to query your gpu cards, so you may need to pass the --gpu=0 parameter to get some sensible output. Or it may be just us since we don’t run X.

Actually, Ganglia has a tool called gmetric, and if you have a command-line utility, you can send its output to Ganglia with the help of cron. To keep our OS installation clean we did not want to use cron, so we picked the harder way: Ganglia has a feature called metric modules, which simply enables you to run your C or python code as a third-party module (Ganglia README -> Extending Ganglia through metric modules). We used python ‘cos it is fast to develop.

Python pluggable module code (named as gpuwatch.py):

That ’ | tail -23’ is the result of trying to be quick-and-dirty. Standard output of the command has two lines starting with GPU. We needed the one in the last 23 lines.

To link gpuwatch.py with ganglia, a configuration file should be created and placed in appropriate folder. In our case the folder is /etc/ganglia/conf.d and the conf file is modpython.conf. The file is as follows:

For gmond to read the modpython.conf configuration file, there should be an include statement in gmond.conf file. Our gmond.conf is in /etc/ganglia and it contains the following configuration line:

include ('/etc/ganglia/conf.d/*.conf')

Ganglia must be compiled with python support and as a result, there should be modpython.so in the same place with the other metric module so’s.

Ex: In our system, modpython.so and others are in /usr/lib64/ganglia/.

Note: ldconfig must know about these so’s and have them in its cache.

After all of these, gmond must be restarted.

Here are some notes we took down during the time:

  • MultiGPU nodes may be processed in code.
  • Code could be cleaner/wiser.
  • Other parameters such as ECC errors etc. can be monitored.

To wrap things up, we wrote that python piece in about half an hour in “pair programming” style, which we do not do on our normal workdays since, you know, he is a sysadmin. It was an enlightening experience to “pair program” with a man who writes shell scripts on the fly and avoids programming for a living. It pushed me to the edge of my task-focusedness. Thanks man, for that experience and for letting me publish this.

UPDATE: After a short while of running this, we ran into issues with nvidia driver itself, causing kernel panics and crashing machines. Although we were unable to pinpoint the problem, it went away after disabling this. In short, use it at your own risk.

Notes from past week

Not focused on anything specific, last week was one of those where i had to multitask, so there is more than one subject that deserves a few words. I am also in a state where i both want and don’t want to talk about them, which usually means the post will be more like my weekly reports than an article.

Our new shiny cluster of 12 machines arrived at last, just in time to put my hands on HBase 0.90. As usual, I ran it in pseudo-distributed mode first, then re-packaged it with our configuration details and sent it over for deployment. Almost nothing has changed in terms of configuration details or API, so everything went as smoothly as it could. Our sysadmin made me laugh when he warned me not to go in guns blazing on the machines because they were not yet placed in the rack and i might burn some CPUs. It would be fun though.

I regret to say that we had to go with seam for that activiti-integrated human-resources application (seam is an application framework that i do not want to link to. I guess that is my way of protesting it. Not that anyone cares). That was already certain while i was drafting my last post, but i needed to ignore those facts and live in the peaceful world of play! just a little bit longer. Now that dream has ended and i went back to the JBoss workshop. Although it was a bit painful, i was able to pull off a maven and maven-jetty-plugin trick on seam 2. That saved me from the 40 sec startup of JBoss AS, but i think i’m gonna have to live with seam’s lump of unresolved-and-we-don’t-even-care-to-resolve-just-use-sth-else bugs. I am not sure if i should write about the experience. Anyone in need of something like seam/maven/jetty integration may google it to find things that don’t work out of the box but help to build a working pom.xml. In my opinion, anyone in need of seam should just use something else.

BTW, i know that it’s highly probable it’s just my allergies on this seam matter. @mozcelebi is using it with the whole JBoss stack and is pretty happy with it. Moving on.

I am not an expert but i think we people prefer talking about the things we don’t like over talking about the things we like. I am trying to avoid doing that, so i will write a post about how i integrated activiti with seam in order to raise my karma over the last paragraph. And this time i will be talking more about activiti; i need good things happening nowadays.

As a dessert, i tried to polish some details of plevsy. I added a few more CRUD screens and tweaked some UI to make use of the new data. There was nothing worth noting on the UI/Evently end this past week, but i tried to discover couches on the local network by sending JSONP requests all over it using web workers. It was fun. I can’t even begin to tell how delightful it is to be working with couchdb. They couldn’t have found a better motto than “Relax”. Note-to-self: should write about that web worker discovery experiment soon. On the other hand, plevsy is not without some factors that suck the fun and relaxation out of it. Note-to-self: should not write anything about these in order to, you know, keep my karma.

Activiti ‘Hello world’ on play!

Previously, i talked about my experiences with play module/plugin development. I tried to point out how easy it was to extend the framework to your needs. A primitive activiti plugin was the product of that effort, and i said there would be a simple application using it. Here goes.

I picked the “Financial Report” example from the 10 minute tutorial part of the activiti documentation because it seems the easiest way of demonstrating that it works. The BPMN file, which is used to define the business process, can be obtained from the site.

The use case is straightforward: we have a company, let’s call it BPMCorp. In BPMCorp, a financial report needs to be written every month for the company shareholders. This is the responsibility of the accountancy department. When the report is finished, one of the members of the upper management needs to approve the document before it is sent to all the shareholders.

The process simply consists of two user actions (emphasized). It also mentions user groups (accountancy department and upper management) but i did not include them in my example in order to keep things simple. However, i used the Secure module for user identification. The interface will be composed of three lists: one for processes deployed (registered) in activiti, one for tasks to start and another one for the tasks of the currently logged in user.

Since there is only one process definition deployed, the first list will show only one item. A user will start the process by clicking the [New] link next to it, and activiti will create a task per our process definition, which will be listed in the second list. This list may contain an arbitrary number of tasks from different process instances, which means there may be more than one instance of the financial report process at any given time. The user will start working on the task by clicking [Start] next to the task. Now the task will move into the next list. It will wait there until the task is finished. The user may log out and log back in to see his/her tasks waiting to be [Finish]ed. All of that information will be stored in activiti.

Other users may work on whatever tasks they should be working on too.

Enough crappy screenshots. The code is simply one template and one controller to show the lists and back the actions.

#{extends 'main.html' /}
#{set title:'Home' /}

Welcome ${user}! <a href="@{Secure.logout()}">Logout</a>
<h3>Processes</h3>
#{list items:pdl, as:'pd'}
    [<a href="@{Application.start(pd.getId())}">New</a>]
    ${pd.getName()}<br/>
#{/list}
<h3>Tasks</h3>
#{list items:utl, as:'ut'}
    [<a href="@{Application.claim(ut.getId())}">Start</a>]
    ${ut.getName()}<br/>
#{/list}
<h3>My Tasks</h3>
#{list items:atl, as:'at'}
    [<a href="@{Application.complete(at.getId())}">Finish</a>]
    ${at.getName()}<br/>
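#{/list}

package controllers;

import java.util.List;

import javax.inject.Inject;

import org.activiti.engine.ProcessEngine;

import play.mvc.Controller;
import play.mvc.With;

@With(Secure.class)
public class Application extends Controller {

    @Inject static ProcessEngine pe;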

    public static void index() {
        String user = Security.connected();
        List pdl = pe.getRepositoryService().
                createProcessDefinitionQuery().list();

        List utl = pe.getTaskService().createTaskQuery().
                taskUnnassigned().list();

        List atl = pe.getTaskService().createTaskQuery().
                taskAssignee(Security.connected()).list();
        render(user, pdl, utl, atl);
    }
    
    public static void start(String pid){
        String pdid = pe.getRepositoryService().
                createProcessDefinitionQuery().
                processDefinitionId(pid).
                singleResult().
                getId();
        pe.getRuntimeService().startProcessInstanceById(pdid);
        index();
    }
    
    public static void claim(String tid){
        pe.getTaskService().claim(tid, Security.connected());
        index();
    }

    public static void complete(String tid){
        pe.getTaskService().complete(tid);
        index();
    }
}

There is only one piece missing in this simple example: how did i deploy my process definition in the first place? Well, it is the dirtiest hack possible. I put it into the plugin where i initialize the ProcessEngine. There are many ways to deploy a process definition into activiti and i didn’t want to get into the specifics. I used an in-memory database for the same reason, so none of this was persistent. I wanted to see if the plugin would actually work, and it did! On the other hand, if i wanted something in production, i know it is a matter of configuration.

I said all i can about play’s plugin system in the previous post: it is great. But since that post was about initializing activiti, i didn’t say anything about activiti itself. AFAICT, it is one of the few libraries/frameworks that delivers what it advertises. It is light-weight, fast and simple. They say a BPM engine should work in every Java environment, and this ‘helloworld’ is one example. The API is clean, well documented and easy to work with. As a developer who associates BPM engines with application servers that cannot start in under a minute, i am really pleased with what i was able to achieve. The overall experience with activiti is simply great.

A prototype play! plugin for activiti integration

Nowadays, i am back on play!, trying to do things with its somewhat under-documented plugin feature. There is a project on the horizon which needs tight integration with a BPM engine, namely activiti, so i thought it’s a good time to try and deal with it as a play plugin or play module.

I said “plugin or module” deliberately, because i wasn’t aware of a plugin system other than the module system, which is basically described as just another play application. Actually i was wondering how it was possible to inject guice or spring dependencies into my application using just-another-application. Then i found out about the file play.plugins and saw that it was not magic at all (ability to code scala is still magic though).

I am also new to activiti and BPM, but i’ll try to explain what i made of it, at least the part that interested me. It is an application that uses a database to store some process definitions and ongoing processes, while executing everything according to the definition. Processes may be defined with a language called BPMN 2.0 which supposedly can be written by business-people so we shouldn’t care about the complexity or validity of them! We register these definitions with activiti, and they may in turn be instantiated into process instances using a Java API. Then, we may monitor and/or act on those instances using the same API. Activiti also provides a REST API which is marked as experimental so i didn’t touch it. Simple enough.

I am still not sure how i will use it in my future project, but there is one thing which will be needed anyway so i tried to cover that in this primitive plugin: initializing activiti. This means getting hold of a ProcessEngine instance.

A play module may put any classpath dependency in its lib/ directory, but that directory is not created by $ play new-module. I mkdired it first and copied the activiti related jars there. Activiti itself comes in a single jar named activiti-engine and this was the first jar to be copied. Because it has been developed using spring, it requires some spring jars to be on the classpath. It is possible to get them using the spring module but i wasn’t sure if it works with 1.1 or is compatible with what activiti requires. I simply copied spring-core, spring-beans and spring-asm. Activiti uses MyBatis to access its database so that jar has to be included too. The final state of lib/ is:

  • activiti-engine-5.0.jar
  • mybatis-3.0.1.jar
  • spring-asm-3.0.3.RELEASE.jar
  • spring-beans-3.0.3.RELEASE.jar
  • spring-core-3.0.3.RELEASE.jar

$ play new-module created some unnecessary directories [for this example] like app/ and conf/ so i removed them before starting anything.

Activiti documentation states that ProcessEngine is thread-safe and it will be OK to initialize and destroy it upon container boot and shutdown. It gives an example for plain Servlet environments which seems easy to translate to a play plugin.

A play plugin is a class that extends PlayPlugin, which lets it modify just about everything the framework does by overriding methods. In order to function, it needs to have a default constructor and has to be enabled in a play.plugins file which should be somewhere in the root of the classpath. $ play new-module creates such a file in src/ so i edited it to put my plugin classname.

1100:play.modules.activiti.ActivitiPlugin

The number before the colon is the priority of the plugin, which simply defines the order in which plugins are run. I’ll explain why i picked 1100 shortly. The class is as simple as it gets.

package play.modules.activiti;

import org.activiti.engine.ProcessEngines;
import play.PlayPlugin;

public class ActivitiPlugin extends PlayPlugin {
  @Override
  public void onApplicationStart() {
    ProcessEngines.init();
  }     
  @Override
  public void onApplicationStop() {
    ProcessEngines.destroy();
  }     
}

We have successfully initialized our ProcessEngines per activiti documentation. But it is obvious that this is less than useful. We should find a way to get a ProcessEngine instance in our controllers. I don’t know if it’s OK here but i am a huge fan of dependency injection so i would love to be able to just @Inject it. I knew that there is a @javax.inject.Inject support in play but i had to peek at the source code of guice module to figure out how to make it work. It turned out to be the simplest modification ever.

public class ActivitiPlugin extends PlayPlugin implements BeanSource {
  @Override
  public void onApplicationStart() {
    ProcessEngines.init();
    Injector.inject(this);
  }
  @Override
  @SuppressWarnings("unchecked")
  public <T> T getBeanOfType(Class<T> clazz) {
    // Only ProcessEngine is served by this bean source.
    if (clazz.equals(ProcessEngine.class))
      return (T) ProcessEngines.getDefaultProcessEngine();
    return null;
  }
  //...
}

The Injector.inject() call in onApplicationStart() crawls our classes for @Inject annotated static fields and calls getBeanOfType() with the field's type when it finds one. And my method returns the ProcessEngine instance if the type of the field is actually ProcessEngine. Now it is possible to get it in any controller, mailer or job:

public class Application extends Controller {
  @Inject
  static ProcessEngine pe;

  public static void index() {
    List pdl = pe.getRepositoryService()
      .createProcessDefinitionQuery().list();
    render(pdl);
  }     
}

That example gets the list of registered (deployed) process definitions. After getting that ProcessEngine, the rest is details about how you want to use it, which i will figure out after getting into the project.
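The injection mechanism described above can be sketched in plain Java, without play or activiti on the classpath. Everything in this block is a hypothetical stand-in: the Inject annotation, the BeanSource interface, the Engine class and the DemoController only mimic the roles of javax.inject.Inject, play's BeanSource, ProcessEngine and a controller. The sketch also overwrites the field unconditionally, which is a choice made here and not a confirmed detail of how play behaves.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class StaticInjectionSketch {

    // Hypothetical stand-in for javax.inject.Inject.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Inject {}

    // Hypothetical stand-in for play's BeanSource.
    interface BeanSource {
        <T> T getBeanOfType(Class<T> clazz);
    }

    // Hypothetical stand-in for ProcessEngine.
    static class Engine {
        final String name;
        Engine(String name) { this.name = name; }
    }

    // Plays the role of a controller with a static injected field.
    static class DemoController {
        @Inject static Engine engine;
        static String untouched = "as-is"; // no @Inject, so it is never written
    }

    // Walks the given classes and fills every @Inject static field from the source.
    static void inject(BeanSource source, Class<?>... classes) {
        for (Class<?> c : classes) {
            for (Field f : c.getDeclaredFields()) {
                boolean wanted = Modifier.isStatic(f.getModifiers())
                        && f.isAnnotationPresent(Inject.class);
                if (!wanted) continue;
                try {
                    f.setAccessible(true);
                    f.set(null, source.getBeanOfType(f.getType()));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot inject " + f, e);
                }
            }
        }
    }

    public static void main(String[] args) {
        final Engine engine = new Engine("default");
        BeanSource source = new BeanSource() {
            @Override
            @SuppressWarnings("unchecked")
            public <T> T getBeanOfType(Class<T> clazz) {
                // Serve only the one type this source knows about.
                return clazz.equals(Engine.class) ? (T) engine : null;
            }
        };
        inject(source, DemoController.class);
        System.out.println(DemoController.engine.name);
    }
}
```

The anonymous class is needed because getBeanOfType is a generic method, which a lambda cannot implement.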

On that priority 1100; since i am utilizing @Inject here, there is a huge possibility that guice or spring plugin will interfere. Both of the plugins’ priorities are 1000 so activiti plugin will be loaded after them and overwrite anything that has been injected by them. I am still not sure about this whole mechanism and i feel i’ll probably be forced to depend on spring plugin sooner or later.
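To make the ordering argument concrete, here is a minimal sketch of how "priority:className" lines could be turned into a load order. It is not play's own loader: the PluginOrder class and its loadOrder helper are hypothetical, and the guice and spring class names in main are only illustrative guesses at what those modules register.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PluginOrder {

    // Hypothetical holder for one "priority:className" line.
    static final class Entry {
        final int priority;
        final String className;
        Entry(int priority, String className) {
            this.priority = priority;
            this.className = className;
        }
    }

    // Returns class names sorted by ascending priority; ties keep their input order.
    static List<String> loadOrder(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            int colon = line.indexOf(':');
            int priority = Integer.parseInt(line.substring(0, colon));
            entries.add(new Entry(priority, line.substring(colon + 1)));
        }
        // List.sort is stable, so equal priorities stay in input order.
        entries.sort(Comparator.comparingInt((Entry e) -> e.priority));
        List<String> names = new ArrayList<>();
        for (Entry e : entries) names.add(e.className);
        return names;
    }

    public static void main(String[] args) {
        List<String> order = loadOrder(Arrays.asList(
                "1100:play.modules.activiti.ActivitiPlugin",
                "1000:play.modules.guice.GuicePlugin",     // illustrative name
                "1000:play.modules.spring.SpringPlugin")); // illustrative name
        for (String name : order) System.out.println(name);
    }
}
```

With these inputs the two priority-1000 entries come out first and the activiti plugin last, which is the situation the paragraph above relies on.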

For the completeness of the example, just in case anyone wants to work with it, there is one more step. Activiti offers you the easy way of defining your activiti database in a configuration file named activiti.cfg.xml. You should obtain one from the documentation and modify it under your conf/ directory. But be careful about it as you would probably need h2 or jdbc-driver jar in your lib/.

About that module/plugin confusion in the beginning, i think it is clearer now. From what i understood, a module is just another play application, more precisely a web application, that you may want to include in several projects. Like a google-maps widget for a yabe. Or an authentication system that sits on top of your application, saying hello to logged-in users and showing login controls for others. Plugins on the other hand, extend the framework for developer needs. Like my primitive activiti integration, or a hypothetical drools integration which may fire rules upon a valid request using a simple annotation.

Plugins also reminded me of a recent chat with @mozcelebi and @burakdalgic, about whether it is possible in play to do things like django middlewares are able to do. It seems they are the way to go for request pre/post processing and more. I am thinking about a middlewares plugin for example.

Next post will be about a simple activiti application built on this.

Scala for a newbie

Finally had the chance to get my hands dirty with scala. I’ve been trying to learn the stuff for almost a year now but you know… it’s not easy without something to build. Reading and trying the examples in the book doesn’t get one far unless he has an objective.

Anyways, what i was trying to do was to run some statistics over the performance data in some csv files. The files were the product of a python script which crawls our cluster for hadoop job logs and combines them with the timing data in job history. We ran 64 different tests so there were 64 csv files. Each file contains a number of data sets which represent repeated tests.

I thought it would be a simple spreadsheet job but i didn’t find a way to import those files into one workbook (i need to learn one of these programs soon). I started to copy each file manually and got bored on the fifth. I decided it would be easier to run the statistics in python then pipe the data to google charts. On the third line of that python script, i saw the opportunity to get it done in scala.

Being an experienced scala newbie i knew that IDE plugins for scala were still not up to speed, so i skipped plugging anything into eclipse and went straight to vim. I also thought that it would be hard to go back and forth between writing code and compiling, since this would be my first piece of scala code. After fixing some configuration problems with sbt and codefellow i had my first code running.

object Test {
    def main(args: Array[String]): Unit = {
        println("TEST")
    }
}

Starting small, i had a fileReader first:

def lines(file: java.io.File) =
    io.Source.fromFile(file).getLines.toList

A little more verbose than the pythonic file(filename).readlines() but far more concise than the java counterpart. The most interesting part may be the lack of a return. Don’t expect any, because the last expression produces a value which gets returned automatically.

Hadoop Job and TaskAttempt stats were in the same file so i had to filter them accordingly in order to generate different statistics for job times and task times. I have processed jobs with three simple functions (one of which being lines).

def jobTimes(file: java.io.File) =
    for {
        l <- lines(file)
        if l.startsWith("job")
        jt = l.split(",")(1).toDouble  // second csv column, parsed so average() gets Doubles
    } yield jt

def jobs(files: List[java.io.File]) = 
    for {
        f <- files
        jt = jobTimes(f)
        av = average(jt)
        sd = math.sqrt(variance(jt, av))
    } yield (f.getName, av, sd)

Doesn’t look scary, right? Tell me about it. I have read the for part of the book at least twice but still couldn’t remember the difference between {} and () in those for expressions. yield is familiar from python but the placement is important: it can’t be inside the code block. I am still trying to get the concept but this feels more correct than python. In the end yield is not related to the code in the block so it should be outside.

Average function is actually as concise as it gets

def average(lst: List[Double]) =
    lst.sum / lst.length

But i didn’t like the looks of that length so i tried a pure approach. I was trying to learn after all:

def average(lst: List[Double]) = {
    val mc = ((0.0, 0.0) /: lst.map(x => (x, 1.0))) {
        (a,b) => 
        (a._1 + b._1, a._2 + b._2)
    }
    mc._1 / mc._2
}

Now that is scarier, huh. You gotta trust me, it’s not that complicated once you get to recognize it. It is a scala counterpart of a slightly modified version of a MR job that calculates an average. You emit a 1 with each of your values, and your reducers add up the values and the 1s separately. In the end you have the sum of your values and the sum of those 1s, which equals the number of values. So you simply divide em in post-processing. In this code piece the first (0.0, 0.0) defines your initial values. The /: operator means “i am starting a fold (reduce) operation with this initial value over this list”. Since the list contains only the values to collect, i had to map it to (value, 1.0) tuples in order to comply with the mechanic just described. x => (x, 1.0) is lambda x: (x, 1.0) in python, and i guess => is better than a keyword BTW. After our map phase, the reduce phase in the {} block simply adds these tuples up. The weird looking _1 and _2 accessors are for the first and second elements of a tuple.
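For comparison with the java counterpart mentioned earlier, the same map-then-fold mechanic can be sketched with Java streams. This is only an illustration: the FoldAverage class is hypothetical, and a two-element double array stands in for the scala tuple, with index 0 as the running sum and index 1 as the running count.

```java
import java.util.Arrays;
import java.util.List;

public class FoldAverage {

    // Maps every value to a {value, 1.0} pair, then adds the pairs component-wise.
    static double average(List<Double> values) {
        double[] sumAndCount = values.stream()
                .map(x -> new double[] {x, 1.0})             // the "emit a 1" step
                .reduce(new double[] {0.0, 0.0},              // initial (0.0, 0.0)
                        (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]});
        return sumAndCount[0] / sumAndCount[1];               // sum divided by count
    }

    public static void main(String[] args) {
        System.out.println(average(Arrays.asList(2.0, 4.0, 6.0)));
    }
}
```

As in the scala version, an empty list divides 0.0 by 0.0 and yields NaN, so a real script would probably want to guard against that.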

In conclusion, after the gotchas just described, formatting the output was the easy part. I am aware of the fact that i probably used only 1 percent of what’s available in scala and there probably are better ways to do what i just did. But AFAIK scala is not an all-or-nothing language. If it was, no-one would be able to use it, because it’s too huge for an average developer to swallow as a whole. One should find similarities with other languages, do things ineffectively at first, then slowly learn the scala way.