Wednesday, January 5, 2011

Prove your assumptions, but remember Murphy was an optimist

Continuing on my unremarkable coding task, having gotten it to work, it was now time to clean up the code.

Given the large load expected, I couldn't allocate a new direct ByteBuffer for every piece of data I wanted to handle. But code that accepts a ByteBuffer is equivalent to code that accepts a byte array, a starting offset and a length (or a limit), right? So I just set the position and the limit on the boundaries of the data and I'm good to go.

Now I ran into some of my own previous assumptions. Luckily, the failure resulted in a log message that made it clear where something was going wrong. It was one of those places where a deviation from the happy path either cannot happen, should not happen or we didn't really care too much. A place where I used to code an empty block "{}", or when I was feeling diligent, with a comment "{/*cannot happen*/}". Now I'll code it as "{throw new AssertionError("Cannot happen.");}" or at least "{LOG.fine("Don't care.");}".

Anyway, a quick analysis showed a likely cause. I say likely, because it's an assumption until proven. I wrote a quick unit test that failed, fixed the code and the test passed. But the program still failed, with the same log message as before. Imagine what I might have been doing if I hadn't proven the failure and the fix with the unit test, I'd still be staring at the same place and possibly ripping my code to shreds in desperation. Now I could immediately move on and fix the second error.

OK, on a roll now, all assumptions proven by assertions, I'm just about done. Except that the code still doesn't work and no sign of why. Murphy's law in full action.

Finally I find it. There's a trap in ExecutorService. What's the difference between calling "execute(Runnable)" versus "submit(Runnable)"? Nothing much, when the code works. But "submit(Runnable)" should have a big red warning sticker. It returns a Future, with no result. You don't bother to "get()" nothing. The devastating side-effect is that all exceptions get preserved until "get()" is called, so this is a hidden equivalent of "catch(Exception e){}". Next task: change this everywhere and add a rule to FindBugs.

Saturday, January 1, 2011

Your assumptions are dangerous, you know too much.

I have just completed a rather unremarkable piece of code. The system was designed to allow this type of addition, so it just took a couple of hours or three to write the code and touch up the parts to selectively enable the functionality by user.

Then a colleague and I spent two days debugging. We hacked around issue after issue, just to prove basic workability before trying to solve the issues.

The final obstacle was that our socket channel was not being selected for reading on the selector. So I hacked around that by creating a new thread to loop infinitely around reading the channel. Which made things work fine up to the point where actual data was coming in and the server blew up with a segmentation fault.

After reflecting while travelling on the bus home, I remembered that perhaps our JNI code needed a direct byte buffer, with data starting at position 0 and the whole backing array available for use. Inconvenient in this case, but another little hack and everything worked like a charm.

Back to the previous question: why no reads? Perhaps my server socket in non-blocking mode actually returned a socket configured for blocking mode instead of for non-blocking mode? It did, and explicitly setting it to non-blocking fixed it.

As I backtracked through the issues, it struck me that every single problem we'd run into had to do with assumptions.

I could of course have checked my assumption about the blocking mode. But I also have another assumption, which is more valid: incorrect usage of an API should not fail silently. This turns out to be correct, because a SelectableChannel throws an IllegalBlockingModeException.

Unfortunately, the "helper" framework that we have in place inadvertently masked that by running the register call in a FutureTask that had a boolean "success" return value that nobody had found the need to check, because "true" was the only possibility. Well, that is, assuming no exceptions are thrown.

Perhaps there is also a flaw in the assumption that a helper framework that obscures the standard API is actually helpful.

Certainly, the direct byte buffer assumption mentioned above should probably have been asserted somewhere, it's easy enough to throw an IllegalArgumentException if buffer.isDirect() returns false. Obviously, the programmer who created the JNI call was not assuming we had a direct byte buffer, he knew we had one. But that's the trick of maintainable and re-usable object oriented code: you cannot rely on any knowledge outside the class you are currently in. From the point of view of the class, such knowledge is an assumption.

Another issue I had hit on the way concerned the UserIdentifier class. It is really just a wrapper around a string, but because it has a specific semantic meaning it was correctly exposed as a separate value class. To limit the new functionality by groups of users, I found it convenient to construct the user identifier slightly differently. The code did not work as expected.

At another point in the code, a programmer had used his knowledge of how the user identifier was constructed, which introduced a hidden assumption about the structure of the user identifier. The correct code would have been to obtain the user identifier from a single authoritative place.

The root cause of the user identifier hack was a design assumption that two things were independent. When they were not, a co-dependency web was created. In a sense, this is also a case of classes having too much knowledge: they knew about too many different other classes that they needed to collaborate with. To keep your code clean and honest, such things need to be refactored (re-designed).

Consider the small amount of care and time it would have taken to avoid the assumptions in the first place and compare it to the four man-days of lost productivity that was caused. We are always under time pressure, but that will be the case next week, month or year as well. If you don't pay the price now, you pay it with interest later.

The more things your classes know about your system, the harder it is to change or re-use the code. Make sure that the knowledge you put into a class is appropriate and doesn't create a dangerous web of assumptions.