LZO vs Snappy vs LZF vs ZLIB: a comparison of compression algorithms for fat cells in HBase
Now and then, I talk about our usage of HBase and MapReduce. Although I am not able to discuss details beyond what is written on my LinkedIn profile, I try to share general findings that may help others trying to achieve similar goals. This post is about a recent experiment aimed at increasing IO performance for our MapReduce jobs.
The HBase documentation and several posts on the hbase-user mailing list say that using some form of compression for storing data may increase IO performance. Considering that Hadoop clusters almost always run on commodity machines, the reason is simple to explain: disks are slow. The Hadoop workloads I know about are generally data-intensive, which makes data reads a bottleneck in overall application performance. By compressing the data we reduce its size and so read it faster. On the other hand, we now need to decompress that data, which costs some CPU cycles. It is simply trading IO load for CPU load.
If the infrastructure is starved for disk capacity but has no performance problems, it may be logical to use an algorithm that gives very high compression ratios at the cost of some CPU time, but that’s usually not the case. Large-capacity disks are far cheaper than fast storage solutions (think SSDs), so it matters more that a compression algorithm is fast than that it achieves a higher compression ratio. Because of that, Hadoop applications tend to prefer LZO, a fast real-time compression library, to ZLIB variants. Of course this is general talk; to see the real performance changes and compression ratios, you have to try those algorithms with your own data.
Our data is about 700kB per row, and for testing purposes we have 100k rows. Each cell contains an image, or more specifically a subset of an image, so it is binary and supposedly not as compressible as, say, a log file. Using no compression, our test data of 1000 items takes up 670MB and our MapReduce tasks are able to read a cell in 8.41ms.
The first algorithm we tried was ZLIB, that is java.util.zip.Deflater/Inflater, following this post by @jdcryans. It simply involves using Deflater just before “Put”ting data into HBase, and using Inflater just after reading data from “Result”s. The total size of our 1000 items decreased to 346MB, meaning a compression ratio of 48% (that is, the data shrank by 48%). But our read performance suffered by 16%, with the time per item rising to 9.73ms.
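The per-cell ZLIB approach described above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the class name ZlibCellCodec and its method names are made up for the example, and only the java.util.zip classes are real API.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper; the class and method names are illustrative only.
final class ZlibCellCodec {

    // Compress a cell value; the result is what would be stored in HBase.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(32, raw.length / 2));
        byte[] buffer = new byte[8192];
        try {
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
        } finally {
            deflater.end(); // release native zlib memory
        }
        return out.toByteArray();
    }

    // Decompress bytes that were produced by compress().
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(32, compressed.length * 2));
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress on an unfinished stream means the input is truncated or unusable.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported zlib stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

In an HBase client, compress would be applied to the value handed to a Put, and decompress to the bytes taken out of a Result. The buffer size and compression level here are arbitrary choices, not tuned values.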
The second one was the famous LZO. Although we are unable to redistribute it because of licensing issues, we still felt the urge to test it and see what we are missing. It is somewhat harder to use in Hadoop (at least the recommended way), but I’ve managed. You can check here and here for instructions on how to set it up. This complexity does come with a potential benefit, though. All the other methods I talk about here compress data on a per-item basis. LZO, on the other hand, is applied by HBase itself to the store files in HDFS, so in a regular setup it is expected to achieve better compression ratios, since it can exploit similarities among neighbouring rows. Anyway, our 1000-item set came out at 398MB, meaning a 41% compression ratio, and we’ve seen a 5% increase in read performance too: we read one item in 8.1ms, compared to 8.41ms uncompressed. So it is starting to become a win-win.
The third test was an LZF implementation, ning-compress, following Ferdy Galema’s response to the previous Deflater tip. It works the same way as the Deflater approach: call LZFEncoder.encode just before writing to HBase and LZFDecoder.decode just after reading. In this test our data size was 400MB, meaning a compression ratio of 40%. Read performance increased by 21%, at 6.63ms per item.
The last one was Google’s recently announced Snappy. The same compress-each-item-separately mechanism applies here, with Snappy.compress and Snappy.uncompress. Data size was 403MB, which means a compression ratio of around 40%, and we read our data at 6.37ms per item, which indicates a 25% increase in IO performance.
Algorithm   Compression ratio   IO performance increase
Snappy      40%                 25%
LZF         40%                 21%
LZO         41%                 5%
ZLIB        48%                 -16%
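For anyone who wants to redo the arithmetic behind these percentages, here is a small sketch. The compression ratio is taken as the size reduction relative to the 670MB uncompressed set. The formula for the performance change is an assumption on my part in this sketch: the relative reduction in per-item read time. With that formula, ZLIB and LZF come out at the figures in the table, while LZO and Snappy come out at roughly 4% and 24%, about one point below the 5% and 25% quoted. The class and method names are made up for the example.

```java
// Hypothetical helper; names are illustrative only.
final class CompressionStats {

    // Size reduction relative to the uncompressed size, in percent.
    static double sizeReductionPercent(double uncompressedMb, double compressedMb) {
        return 100.0 * (1.0 - compressedMb / uncompressedMb);
    }

    // Assumed formula: relative reduction in per-item read time, in percent.
    // A negative value means reads became slower.
    static double readTimeReductionPercent(double baselineMs, double measuredMs) {
        return 100.0 * (baselineMs - measuredMs) / baselineMs;
    }

    public static void main(String[] args) {
        String[] names = {"Snappy", "LZF", "LZO", "ZLIB"};
        double[] sizesMb = {403, 400, 398, 346};   // sizes reported in the post
        double[] readMs = {6.37, 6.63, 8.1, 9.73}; // per-item read times reported in the post
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-7s size reduction %.1f%%, read time reduction %.1f%%%n",
                    names[i],
                    sizeReductionPercent(670, sizesMb[i]),
                    readTimeReductionPercent(8.41, readMs[i]));
        }
    }
}
```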
I am suspicious about something in the LZO scores, since I was expecting much better performance. But it doesn’t matter, given our inability to redistribute it. Snappy-java, with its Apache license, is the clear winner. It is very easy to use too.
I have to remind you again: YMMV. These are the scores for data consisting of 700kB rows, each containing binary image data. They probably won’t apply to things like numeric or text data.
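If you want a first impression of how a per-item codec behaves on your own data before touching a cluster, something like the sketch below may help. It is not the benchmark used for this post: it only measures in-memory decode time and size reduction, not reads through HBase and MapReduce, and every name in it (CodecBench, measure, Result) is made up for the example. ZLIB from java.util.zip is used as the sample codec; other libraries could be plugged in as a pair of byte[]-to-byte[] functions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical harness; all names here are illustrative, not from any library.
final class CodecBench {

    static final class Result {
        final double sizeReductionPercent; // 100 * (1 - compressed / raw)
        final double avgDecodeMs;          // in-memory decode time per item
        Result(double sizeReductionPercent, double avgDecodeMs) {
            this.sizeReductionPercent = sizeReductionPercent;
            this.avgDecodeMs = avgDecodeMs;
        }
    }

    // Encodes and decodes every item once, checking the round trip as it goes.
    static Result measure(List<byte[]> items,
                          UnaryOperator<byte[]> encode,
                          UnaryOperator<byte[]> decode) {
        long rawBytes = 0, packedBytes = 0, decodeNanos = 0;
        for (byte[] item : items) {
            byte[] packed = encode.apply(item);
            long start = System.nanoTime();
            byte[] restored = decode.apply(packed);
            decodeNanos += System.nanoTime() - start;
            if (!Arrays.equals(item, restored)) {
                throw new IllegalStateException("round trip mismatch");
            }
            rawBytes += item.length;
            packedBytes += packed.length;
        }
        if (rawBytes == 0) {
            throw new IllegalArgumentException("need at least one non-empty item");
        }
        double reduction = 100.0 * (1.0 - (double) packedBytes / rawBytes);
        return new Result(reduction, decodeNanos / 1e6 / items.size());
    }

    // Sample codec: ZLIB via the JDK stream wrappers.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try (InflaterInputStream in =
                     new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A call would look like CodecBench.measure(items, CodecBench::deflate, CodecBench::inflate), where items holds a sample of your own cell values. A single pass like this is noisy, so treat the timing as a rough hint only.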