Hadoop MapReduce job statistics (a fraction of them)
Well, this has been on my backlog for a while. The problem is actually extremely simple: when did a MapReduce job start processing? I need this information to report it to the clients of my API, so redirecting them to the JobTracker’s web interface is not an option.
Everyone who has used Hadoop for some time knows 0.20 is the version to use, and everyone who has developed something other than a WordCount knows it’s a PITA. The API is hard to use at best, misleading and incomplete most of the time. You might wonder how hard it can be to extract a basic piece of information (one that is easily accessible over the web interface) such as the start time of a job. All I can say is: very.
Without further ado: while I would expect something like Job.instance("JOB_ID").getStartTime(), here is the piece of crappy code I found to be working:
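import java.lang.reflect.Field;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.*;

// Parameter renamed so it no longer clashes with the local JobID variable.
long startTime(String jobIdString) throws Exception {
    Configuration conf = new Configuration();
    JobClient jobClient = new JobClient(new JobConf(conf)); // deprecation WARN
    JobID jobID = JobID.forName(jobIdString); // deprecation WARN
    RunningJob runningJob = jobClient.getJob(jobID);
    // The start time lives in a private "status" field with no public getter.
    Field field = runningJob.getClass()
        .getDeclaredField("status"); // reflection !!!
    field.setAccessible(true);
    JobStatus jobStatus = (JobStatus) field.get(runningJob);
    return jobStatus.getStartTime(); // finally
}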
As noted above, JobConf and JobID are deprecated. But since there is no way of working with anything non-deprecated, we reluctantly accept that. What we may not accept is working with reflection, but well… I couldn’t find any other way (please point me to one if you know). It is actually funny to have that information in the status field of runningJob but not be able to access it, for lack of a getStartTime() method that reads from it. (BTW v0.21 is closer to what I expect, but it is largely unusable for various reasons.)
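One fragile spot in the reflection trick is that getDeclaredField() only sees fields declared on the exact class it is called on, not inherited ones. Here is a minimal sketch of a slightly more defensive lookup that also walks up the superclasses. PrivateFieldReader, FakeJob, FakeJobSubclass and FakeStatus are hypothetical names made up for illustration; they only stand in for the real job handle and its status object so the sketch runs without a cluster.

```java
import java.lang.reflect.Field;

class PrivateFieldReader {

    // Looks for the named field on the object's class, then on each superclass.
    static Object read(Object target, String fieldName) throws IllegalAccessException {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Field field = c.getDeclaredField(fieldName);
                field.setAccessible(true);
                return field.get(target);
            } catch (NoSuchFieldException e) {
                // Not declared on this class; try the parent.
            }
        }
        throw new IllegalArgumentException(
            "No field '" + fieldName + "' on " + target.getClass());
    }

    // Hypothetical stand-in for the status object.
    static class FakeStatus {
        private final long startTime;
        FakeStatus(long startTime) { this.startTime = startTime; }
        long getStartTime() { return startTime; }
    }

    // Hypothetical stand-in for the job handle with a private "status" field.
    static class FakeJob {
        private final FakeStatus status = new FakeStatus(1309521060000L);
    }

    // A subclass, to show that the inherited private field is still found.
    static class FakeJobSubclass extends FakeJob { }

    public static void main(String[] args) throws Exception {
        FakeStatus status = (FakeStatus) read(new FakeJobSubclass(), "status");
        System.out.println(status.getStartTime());
    }
}
```

With the real thing, the call would presumably look like (JobStatus) PrivateFieldReader.read(runningJob, "status"). It is still reflection, so it can break whenever the field is renamed.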
On the other hand, that wasn’t exactly my requirement. There may be a delay between the time I submit a job and the time it starts processing, most likely because the cluster is busy. What I needed was when the job actually started processing, meaning the time the first task is fired on a task tracker. Now I would expect something like Job.instance("JOB_ID").getTasksOrderedByStartDate().get(0).getStartTime(), but I know I won’t get what I expect. Instead:
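import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.*;

// Parameter renamed so it no longer clashes with the local JobID variable.
long actualStartTime(String jobIdString) throws Exception {
    Configuration conf = new Configuration();
    JobClient jobClient = new JobClient(new JobConf(conf)); // deprecation WARN
    JobID jobID = JobID.forName(jobIdString); // deprecation WARN
    RunningJob runningJob = jobClient.getJob(jobID);
    // Take the second completion event; see the note on [1] below.
    TaskID firstCompletedTaskID = // deprecation WARN
        runningJob.getTaskCompletionEvents(0)[1].getTaskAttemptId().getTaskID();
    for (TaskReport tr : jobClient.getMapTaskReports(jobID)) {
        if (tr.getTaskID().equals(firstCompletedTaskID)) {
            return tr.getStartTime(); // search !!!
        }
    }
    // No matching map task report: fail loudly instead of falling off the end.
    throw new IllegalStateException("No map task report for " + firstCompletedTaskID);
}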
The first task completion event belongs to the SETUP task, which runs at job submission time no matter what the cluster is busy with. That’s why I’m getting the second one in the array using [1].
One small problem is that I’m using task completion events, not task start events, so I am assuming the first task to finish is also the first task to have started. This is usually correct in my case, but I know it will not hold for others.
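If that assumption bothers you, one alternative is to skip the completion events and take the smallest start time over all task reports. Below is a minimal sketch of that selection logic only. TaskInfo is a hypothetical stand-in for a task report, so the example runs without a cluster; with the real API the values would come from tr.getTaskID() and tr.getStartTime() on the reports returned by jobClient.getMapTaskReports(jobID), as in the snippet above. Treating a start time of 0 as "not started yet" is my assumption, not something I have verified against Hadoop.

```java
import java.util.Arrays;
import java.util.List;

class EarliestTaskStart {

    // Hypothetical stand-in for a task report: only the two values we need.
    static class TaskInfo {
        final String taskId;
        final long startTime; // epoch millis; assumed 0 when not started yet

        TaskInfo(String taskId, long startTime) {
            this.taskId = taskId;
            this.startTime = startTime;
        }
    }

    // Returns the smallest positive start time, or -1 if no task has started.
    static long earliestStart(List<TaskInfo> tasks) {
        long earliest = -1;
        for (TaskInfo t : tasks) {
            if (t.startTime <= 0) {
                continue; // skip tasks that have not started (assumption)
            }
            if (earliest == -1 || t.startTime < earliest) {
                earliest = t.startTime;
            }
        }
        return earliest;
    }

    public static void main(String[] args) {
        List<TaskInfo> tasks = Arrays.asList(
            new TaskInfo("task_m_000000", 0L),
            new TaskInfo("task_m_000001", 2000L),
            new TaskInfo("task_m_000002", 1500L));
        System.out.println(earliestStart(tasks));
    }
}
```

This costs a full scan of the reports, which the original version does anyway while searching for the matching task ID.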
I haven’t been able to find a way to get a job’s finish date yet, so I’m using job.end.notification.url for that. Hadoop sends a GET to a servlet when a job finishes, so I simply record the time the service was called. It may not be accurate, but again, it works for me.
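For the curious, here is a rough sketch of what such a notification receiver could look like. It is not my actual servlet: to keep it self-contained it uses the HTTP server bundled with the JDK (com.sun.net.httpserver) instead of a servlet container. The class name JobEndRecorder, the /jobEnd path and the jobId query parameter are all made up for illustration. It assumes job.end.notification.url is set to something like http://myhost:8080/jobEnd?jobId=$jobId&status=$jobStatus, where myhost is a placeholder and $jobId and $jobStatus are the variables Hadoop substitutes into the URL.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class JobEndRecorder {

    // jobId -> time (epoch millis) at which the notification arrived
    static final Map<String, Long> finishTimes = new ConcurrentHashMap<>();

    // Extracts one parameter from a raw query string such as "a=1&b=2".
    static String queryParam(String query, String name) {
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(name)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    // Records the arrival time of the first notification for each job ID.
    static void handle(HttpExchange exchange) throws IOException {
        String jobId = queryParam(exchange.getRequestURI().getRawQuery(), "jobId");
        if (jobId != null) {
            finishTimes.putIfAbsent(jobId, System.currentTimeMillis());
        }
        // -1 means "no response body"; 400 if the job ID was missing.
        exchange.sendResponseHeaders(jobId != null ? 200 : 400, -1);
        exchange.close();
    }

    // Starts the receiver; pass 0 to let the OS pick a free port.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/jobEnd", JobEndRecorder::handle);
        server.start();
        return server;
    }
}
```

Calling JobEndRecorder.start(8080) once at application startup would be enough; afterwards finishTimes.get(jobId) returns the recorded time, or null if no notification has arrived. The recorded value is the arrival time of the request, not the real finish time, so the same accuracy caveat applies.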
In light of these difficulties, I am thinking about a simple application that serves easily parseable job information. It would probably be rendered obsolete when 0.22 is out, but it might still be useful to be able to consume such info from languages other than Java.