Friday, January 27, 2006

 

I finally tried to do try..finally

I had the great opportunity of tracking down a bug in a Java system we're running in production at the moment. This was one of those peculiar little bastards where you've got the following course of events:
  1. See the problem happening on one of the production servers (master node in a cluster)
  2. Don't have enough debug logging in your code to track down the problem, so...
  3. Introduce more debug logging and try to recreate the problem on the test environment... and guess what - the problem does not occur on the test environment at all.
  4. Roll out a new version to production with the debug logging enabled to find the problem (as a side-effect of the rollout, the master node changed)... and guess what - now the problem does not occur anymore.
  5. Change the production environment to force the original master node (with the observed problems) to become the master node again, but this time with debug information enabled.
  6. After sifting through a few hundred lines of debug logging and tracking down what each thread does, I finally found the monster!
  7. Fix it (write your test case... yeah, yeah), roll out again and write this blog about how good it feels to squash these unwanted guests... hopefully for good!


Without going into too much detail here, all I want to say is never underestimate the power of the try {...} finally {...} statement in any language!
Our particular problem was that we had a thread pool consisting of a number of worker threads in an I/O environment. Only one worker thread waits for work at any point in time. As soon as it finds work to do, it signals another thread in the pool to start waiting for more work, and then goes off to complete its own work. This is a variant of the Leader/Followers pattern (http://www.kircher-schwanninger.de/michael/publications/lf.pdf).

In my case, that one thread got an unchecked exception while waiting for work, so it never woke up one of the sleeping threads... leaving the whole thread pool asleep (not very productive if you ask me!). At a high level, the fix looks like this:


try {
    // Blocking call to wait for work
    wait_for_work();
} finally {
    // Done waiting. We don't know whether the wait
    // succeeded or failed, but we always want to
    // wake up one of the sleeping threads.
    wakeup_sleeping_dude();
}
// I can now go ahead and do my work (if any)...

I know that in hindsight this looks really simple! All I'm saying is that you should look out for code that must always be executed (regardless of exceptions or return values), and don't be afraid to put it in a finally block! One way to think about the problem is to go through each line in your method and ask yourself: what will happen if this line throws an unchecked exception?
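To make the hand-off concrete, here is a minimal, self-contained sketch of the idea. It is not our production code: the class name, the Semaphore used as a "leader token", the queue, and the "boom"/"stop" items are all assumptions made up for the illustration. The first worker hits a simulated unchecked exception while waiting, but because the release sits in a finally block, the next thread still gets woken up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: all names here are hypothetical.
class LeaderHandoffDemo {
    // Holding the single permit means "I am the one thread waiting for work".
    private final Semaphore leaderToken = new Semaphore(1);
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger processed = new AtomicInteger();

    // Stand-in for wait_for_work(): blocks until an item arrives and
    // simulates an unchecked exception when it sees "boom".
    private String waitForWork() throws InterruptedException {
        String item = queue.take();
        if (item.equals("boom")) {
            throw new IllegalStateException("simulated failure while waiting");
        }
        return item;
    }

    private void workerLoop() {
        try {
            while (true) {
                leaderToken.acquire();        // sleep until it is our turn to wait
                String item;
                try {
                    item = waitForWork();
                } finally {
                    leaderToken.release();    // always wake up the next thread
                }
                if (item.equals("stop")) {
                    return;
                }
                processed.incrementAndGet();  // "do my work"
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static int runDemo() {
        LeaderHandoffDemo demo = new LeaderHandoffDemo();
        Thread[] workers = new Thread[3];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(demo::workerLoop, "worker-" + i);
            workers[i].setDaemon(true);
            // Report the death in one line instead of a full stack trace.
            workers[i].setUncaughtExceptionHandler((t, e) ->
                    System.err.println(t.getName() + " died: " + e.getMessage()));
            workers[i].start();
        }
        // One failing wait, three real items, then one "stop" per surviving worker.
        for (String item : new String[] {"boom", "a", "b", "c", "stop", "stop"}) {
            demo.queue.add(item);
        }
        try {
            for (Thread w : workers) {
                w.join(2000);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return demo.processed.get();
    }

    public static void main(String[] args) {
        System.out.println("processed " + runDemo() + " items");
    }
}
```

One worker dies on "boom", yet the remaining two still process all three real items. If you move the release() call out of the finally block and put it after waitForWork(), the token is never handed back after the failure and the rest of the pool sleeps forever, which is exactly the bug described above.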

Unchecked exceptions can also kill the thread you wanted to run forever (no matter what!).
If your "running-forever" thread looks like this:


public void run() {
    while (true) {
        // do your work
    }
}


Rather change it to something like this:

public void run() {
    while (true) {
        try {
            // do your work
        } catch (Exception e) {
            // Log the fact that an exception propagated
            // to this high level, and just continue...
        }
    }
}

Just note that I'm catching Exception and not Throwable, as Throwable also includes all the Error subclasses (OutOfMemoryError, for example), which you normally can't do anything about (unless you know what you're doing). Reconsider this decision for every case though, as your specific situation might call for catching Throwable after all.
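Here is a small runnable sketch of that loop. The class name and the fake "jobs" are made up for the demo, and the loop is bounded to five iterations so that the program terminates (a real worker would use while (true)). Job 2 throws an unchecked exception, which gets logged, and the thread carries on with the rest.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: all names here are hypothetical.
class ResilientLoopDemo {
    static int runDemo() {
        AtomicInteger completed = new AtomicInteger();
        Runnable loop = () -> {
            // Bounded so the demo ends; stands in for while (true).
            for (int job = 1; job <= 5; job++) {
                try {
                    if (job == 2) {
                        // Simulated unchecked exception in "do your work".
                        throw new IllegalArgumentException("bad job " + job);
                    }
                    completed.incrementAndGet();
                } catch (Exception e) {
                    // Log it and keep going; Errors are deliberately not caught.
                    System.err.println("job failed, continuing: " + e.getMessage());
                }
            }
        };
        Thread worker = new Thread(loop, "running-forever");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed " + runDemo() + " of 5 jobs");
    }
}
```

Four of the five jobs complete. Without the try/catch inside the loop, the exception from job 2 would end the thread and jobs 3 to 5 would never run.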

Happy bug hunting!

["Programming today is a race between software engineers striving to build bigger and better idiot proof programs, and the universe trying to produce bigger and better idiots.
So far the universe is winning."
-Rich Cook]

