Resource synchronization on Hadoop clusters with ZooKeeper - Part I
“We need ZooKeeper to run HBase.” Until last week that was basically my view of ZooKeeper. It is a distributed configuration and coordination service, HBase requires it, so we have to put it in the cluster. For a cluster of our size one instance seems to be enough, but we are running three. These four sentences pretty much summed up what I knew about it. Fortunately, I had looked at its main page before and remembered its somewhat abstract purpose: “distributed coordination”.
First, some clarification: this post is about coordinating processes that require access to some limited system resource. To keep things simple, I use one example resource throughout the post, the GPU. Another example might be distributed CD publishing using Hadoop on machines with a number of CD writers. Or controlling an array of Arduino devices to simulate the LHC. In short, this post has no direct relationship with GPGPU programming or GPU kernel thread synchronization. If you arrived here googling that, I’m sorry.
So, the problem: if you put a number of CPU cores in a single computer and start running processes, the operating system will place them on the cores accordingly. Say you have a 4-core machine and try to multiply 4 matrices simultaneously; each multiplication will generally run on a different core. Well, I don’t know if there have been any developments since, but that’s not the case for multi-GPU systems. Say you have 4 GPUs in a machine and you spawn 4 processes, each meant to multiply matrices on one of them; you need to explicitly tell those processes not to overlap with one another. If you leave it to the OS, one or more of your GPUs may sit idle while the processes on the others starve for resources. I heard Mac OS can manage this, but it is not suitable for our environment.
In theory: there is no way of letting anything other than yourself decide which process (read: map task in a MapReduce job) should occupy which resource. The simplest solution is to hand each process the identifier of the resource it should operate on, for example as a launch parameter. But this would mean manual control of all the processes, which is not possible with MapReduce. Another solution is to have processes ask some daemon process which resource to use. The daemon keeps track of which process uses which resource and answers other client processes accordingly. This is actually how we had been working for a while.
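The bookkeeping such a daemon has to do is small. Here is a minimal in-memory sketch of it, leaving out the networking part entirely. The class name ResourceTable and its methods are made up for illustration; this is not the daemon we actually ran.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the daemon's bookkeeping, not production code.
class ResourceTable {
    // resource id -> id of the owning process, or null when the resource is free
    private final Map<String, String> owners = new LinkedHashMap<>();

    ResourceTable(String... resources) {
        for (String r : resources) {
            owners.put(r, null);
        }
    }

    // Hands out the first free resource, or returns null if all are taken.
    synchronized String request(String processId) {
        for (Map.Entry<String, String> e : owners.entrySet()) {
            if (e.getValue() == null) {
                e.setValue(processId);
                return e.getKey();
            }
        }
        return null;
    }

    // Frees every resource currently held by the given process.
    synchronized void release(String processId) {
        for (Map.Entry<String, String> e : owners.entrySet()) {
            if (processId.equals(e.getValue())) {
                e.setValue(null);
            }
        }
    }
}
```

With two GPUs registered, the first two callers of request() each get one, a third caller gets null until somebody calls release(). The table itself is the easy part; the catch is that it lives inside one more process that somebody has to keep alive.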
In practice: this daemon would bring maintenance issues like any other software component. It is just another service one needs to deploy to the machines in the cluster and keep working properly. Because of this, we were hesitant to go to production with this setup and kept looking for another solution. I am not sure how it happened, but ZooKeeper seemed like it could do such a thing. Let me rephrase that: we thought we could do GPGPU process synchronization with ZooKeeper without actually knowing what ZooKeeper does. It is a distributed coordination service, right? How hard could it be?
After reading the “Getting Started” part of the ZooKeeper documentation, I saw that my hand was blackjack. I can create some data points (called znodes), load some small data into them, and access those from any other process. A complete solution to what had kept us scratching our heads. All I needed to do was modify my mappers a little to talk to ZooKeeper before and after the processing. Since GPUs should be coordinated on a per-computer basis, mappers should know which computer they run on and which GPUs are on that node. I applied Occam’s razor and got this.
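import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.InetAddress;

// Name of the machine this mapper runs on; used to build the per-host znode path.
private String hostname() throws IOException {
    return InetAddress.getLocalHost().getHostName();
}

// Lists the GPU device nodes under /dev (nvidia0, nvidia1, ...), skipping nvidiactl.
private static String[] discoverGpus() {
    File[] gpus = new File("/dev").listFiles(new FilenameFilter() {
        public boolean accept(File dir, String name) {
            return name.startsWith("nvidia") && !name.endsWith("ctl");
        }
    });
    if (gpus == null) return new String[0]; // listFiles() returns null if /dev cannot be read
    String[] ret = new String[gpus.length];
    for (int i = 0; i < gpus.length; i++) {
        ret[i] = gpus[i].getName();
    }
    return ret;
}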
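To see what that filter keeps, the same rule can be run against a hand-written list of names. The names below are sample input, not a listing from our machines, and GpuNameFilter is a helper made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that applies the same naming rule to plain strings.
class GpuNameFilter {
    // Same rule as the FilenameFilter: starts with "nvidia", does not end with "ctl".
    static boolean looksLikeGpu(String name) {
        return name.startsWith("nvidia") && !name.endsWith("ctl");
    }

    // Returns the names that pass the rule, in their original order.
    static List<String> keepGpus(String... names) {
        List<String> kept = new ArrayList<>();
        for (String n : names) {
            if (looksLikeGpu(n)) {
                kept.add(n);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [nvidia0, nvidia1]
        System.out.println(keepGpus("null", "nvidia0", "nvidia1", "nvidiactl", "sda"));
    }
}
```

The rule is deliberately crude. Any other device whose name starts with nvidia, such as nvidia-uvm on machines that expose it, would also pass, so a stricter pattern may be needed elsewhere.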
On our nodes, GPUs show up as devices under /dev with sequential names nvidia0, nvidia1, … There is one more device named nvidiactl, which is not a GPU. The other additions to my mappers are these.
// in setup()
rs = new ResourceSynchronizer(
        new ZooKeeper(QUORUM_ADDRESS, TIMEOUT, null),
        "/gpusync/" + hostname(),
        discoverGpus());

// in map()
String gpu = rs.request();
try {
    process(); // whatever
} finally {
    rs.release(); // give the GPU back even if process() throws
}

// in cleanup()
rs.close();
Now, I cheated a little bit here and didn’t include the core piece, ResourceSynchronizer. That’s because, in addition to holding GPU names and handing them to the processes that request them, it does one additional and somewhat more sophisticated task concerning GPGPU operations. Seasoned GPU developers may guess what it is, but I have left it for Part II of this post.