Unit-testing HBase Applications

For the past six months I’ve been on the road with Hadoop and related technologies. It hasn’t been a long journey, but I’ve finally come to understand some traits of my fellow traveller. Each of them deserves its own blog post; this one is only about one integral part: unit-testing.

First of all, here is the situation: We have some tables that we regularly read, write and update, and a MapReduce job that reads data from one of these tables, performs some operations, and writes the results to another table. Our code is structured in modules, one of them being the database module, which mostly performs Scans, Gets and Puts on HTables to read and write data.

And the problem: Classes in other modules depend on interfaces of the db module, which we can easily mock to get testable instances. But the classes in the db module mostly depend on HTables, which are not that easy to mock. Let me rephrase that: they are anything but easy to test. Trying to stub a few methods usually leads to writing the whole infrastructure, so don’t even try to go there.

And the research: From what I saw, people just don’t test their database-interacting classes. I can’t cite any sources, but I read some articles saying that such classes are not easy to test and should not contain anything that needs testing anyway, so testing them is unnecessary. Just by writing that sentence I felt I didn’t need to cite anything, because it’s bullshit. Try to find a spelling error that leads to writing to “cf:colum” and reading from “cf:column”, then we’ll have a conversation. Those database-interacting classes almost always work with byte arrays, which compile perfectly well whatever they contain, so you won’t know about this kind of error until your application has crashed all over your customers.
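To make the “cf:colum” point concrete, here is a minimal sketch in plain Java with no HBase classes involved. The class and constant names are made up for illustration. It only shows that a misspelled column name is still a perfectly valid byte array, so the compiler has nothing to complain about and only a test that compares what is written with what is read will notice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; these names are not from any real project.
class QualifierTypoDemo {
    // Column names written as separate literals, the way they tend to end up
    // scattered across the write path and the read path.
    static final byte[] WRITE_COLUMN = "cf:colum".getBytes(StandardCharsets.UTF_8);  // typo
    static final byte[] READ_COLUMN = "cf:column".getBytes(StandardCharsets.UTF_8);

    // True only when both names are byte-for-byte identical.
    static boolean sameColumn(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        // Both constants compile fine, yet they address different columns.
        System.out.println(sameColumn(WRITE_COLUMN, READ_COLUMN));  // prints "false"
    }
}
```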

The more sensible ones proposed solutions like starting mini-clusters prior to any testing. Even HBase itself supports this approach and provides org.apache.hadoop.hbase.HBaseTestingUtility for it [look in src/test, not src/main]. But that seemed more useful if you’re unit-testing HBase itself, not your applications that run on HBase. The cluster is an HBase dependency, not mine. So I don’t think I should need any mini, micro or nano cluster to run any of my tests. If it were SQL I was testing, I really wouldn’t have much choice but to run a test database server, load my data into it and run. But for a NoSQL database, unit-testing shouldn’t be that hard.

And my solution: If my only dependency is an HTable, I should be able to extend it, override its methods like put/get/delete/getScanner to create a MockHTable, and call it a day. But I knew it wasn’t that easy to override those methods. Even if I didn’t use it, I would still need a cluster just to connect to (that sucks, by the way; I really started to feel that the whole HBase API is not fascinating at all). A little more googling led me to HBASE-1758, and as soon as I found it I started switching my dependencies from HTable to HTableInterface. Then I put together my first MockHTable implementation and ported my horrific tests to work with it.
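The idea of depending on an interface and swapping in an in-memory table can be sketched roughly as follows. This is not the MockHTable from this post and not the real HTableInterface: SimpleTable, InMemoryTable and UserDao are hypothetical names, and the three-method interface is a deliberately tiny stand-in so the example runs without HBase on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical stand-in for the small slice of a table interface a DAO needs.
interface SimpleTable {
    void put(byte[] row, byte[] column, byte[] value);
    byte[] get(byte[] row, byte[] column);  // null when the cell is absent
    void delete(byte[] row);
}

// In-memory fake: cells live in sorted maps keyed by raw bytes, no cluster needed.
class InMemoryTable implements SimpleTable {
    // Lexicographic, unsigned byte order (an assumption chosen to mimic row-key sorting).
    static final Comparator<byte[]> BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    };

    private final TreeMap<byte[], TreeMap<byte[], byte[]>> rows =
            new TreeMap<byte[], TreeMap<byte[], byte[]>>(BYTES);

    public void put(byte[] row, byte[] column, byte[] value) {
        TreeMap<byte[], byte[]> cells = rows.get(row);
        if (cells == null) {
            cells = new TreeMap<byte[], byte[]>(BYTES);
            rows.put(row.clone(), cells);
        }
        cells.put(column.clone(), value.clone());  // copy so callers cannot mutate stored data
    }

    public byte[] get(byte[] row, byte[] column) {
        TreeMap<byte[], byte[]> cells = rows.get(row);
        if (cells == null) return null;
        byte[] value = cells.get(column);
        return value == null ? null : value.clone();
    }

    public void delete(byte[] row) {
        rows.remove(row);
    }
}

// Hypothetical database-module class: it depends on the interface, not on a live table.
class UserDao {
    private static final byte[] NAME = "cf:name".getBytes(StandardCharsets.UTF_8);
    private final SimpleTable table;

    UserDao(SimpleTable table) {
        this.table = table;
    }

    void saveName(String userId, String name) {
        table.put(userId.getBytes(StandardCharsets.UTF_8), NAME,
                  name.getBytes(StandardCharsets.UTF_8));
    }

    String loadName(String userId) {
        byte[] value = table.get(userId.getBytes(StandardCharsets.UTF_8), NAME);
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }
}
```

A test then constructs `new UserDao(new InMemoryTable())`, saves a name and reads it back. Because the write and read paths go through the same fake, a misspelled column on either side shows up as a null result instead of a production surprise.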

And the results: My tests’ LoC were cut in half, readability increased tenfold, and I even felt an increase in test performance. Besides all this, I found some places that I had not been able to test before and covered them too. I’ll post my MockHTable implementation somewhere and update this post with the details soon. UPDATE: Code is here along with the usage instructions in its javadoc. I tried hard to keep it simple, so there are no external dependencies. Just put it anywhere in your project.

Notes

  1. agaoglu posted this