Exploring the error handling concepts for programming languages

 This is a follow-up to my previous post about concepts in programming languages, expanding the analysis to explore concepts around errors: raising, detecting and handling them.

It turns out that error handling isn't necessarily a separate concept; in fact, only one concept is strictly necessary for error handling:

Selection - Purpose: Allows a choice of which units to process and which to ignore. Operational Principle: If you specify the selection criteria, the matching units will be processed.

The failing code selects whether to signal an error or not, and the calling code selects how to proceed based on the result.
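As a rough sketch of both sides of that selection (in Kotlin, with made-up names like parsePort), the failing code below chooses between producing a value and signalling failure, and the caller chooses what to do with each outcome:

```kotlin
// The failing code selects whether to signal an error; the caller selects how to proceed.
fun parsePort(text: String): Int? =
    text.toIntOrNull()?.takeIf { it in 1..65535 }   // select: a valid value or null

fun main() {
    val port = parsePort("8080")
    if (port != null) {                              // the caller selects on the result
        println("Connecting on port $port")
    } else {
        println("Falling back to default port 80")
    }
}
```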

When execution of some code can result in an error instead of a result of the expected kind, it is extremely useful for programmers to know about it, so another important, though not strictly necessary, concept is:

Documentation - Purpose: Allows communicating aspects of the design and purpose to a future reader of the program. Operational Principle: If you specify the relevant information, a future reader will receive it when needed.

There are many different reasons for reaching an error condition, as will be explored below. It is probably desirable to distinguish between them, and in particular to single out the types of errors that indicate the program itself is incorrect, so perhaps we can use a verification mechanism.

Verification - Purpose: Allows checking correctness criteria of the program. Operational Principle: If you specify a criterion and run the verification procedure (which may be built in), a warning/failure is issued if the program does not fulfil the criterion.

The implementations of verification that I looked at then were tests, contracts, types and proofs. But I think verification is more general than that and obviously also extends to assertions. Following that train of thought, raising or throwing an error is also a verification, in that it always issues a warning/failure when that code path is followed.
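To illustrate, here is a minimal Kotlin sketch (the withdraw function is invented for this example) where a precondition check, a postcondition check and an unconditional raise are all verifications of something the programmer believes must hold:

```kotlin
// require, check and error are all verification mechanisms in the Kotlin standard library.
fun withdraw(balance: Int, amount: Int): Int {
    require(amount > 0) { "amount must be positive" }          // verifies the caller's input
    val newBalance = balance - amount
    check(newBalance >= 0) { "balance must not go negative" }  // verifies our own logic
    return newBalance
}

fun impossibleBranch(): Nothing =
    error("this code path should never be taken")              // an unconditional raise is also a verification
```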

Obviously, verification can be, and is, also used to ensure that errors are handled correctly.

While the concepts for raising and handling errors are not new, their implementations for errors may differ from the implementations previously explored.

In this article, a new, highly useful concept will be introduced:

Checkpoint - Purpose: Saves a known program state. Operational Principle: When a line of computation starting from the checkpoint is abandoned and control returns to the checkpoint, all changes made in that line of computation will be forgotten and will not impact any continued computation.

With complete or partial rollback to a known checkpoint, reasoning about state becomes a lot simpler.

Existing error implementations

Error codes, result values and exceptions are all just different implementations that enable selection on error versus normal return.

Error codes returned on the side can easily be forgotten. Sum-type result values are better at documentation and must usually be handled or explicitly ignored (enforced by type verification).

Exceptions allow selection to happen further up the call stack; checked exceptions force explicit handling by verification.
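A hedged Kotlin sketch of the three implementations might look as follows (the config-reading functions and the ReadResult type are made up for illustration):

```kotlin
import java.io.File
import java.io.IOException

// 1. Error code on the side: nothing forces the caller to look at it.
fun readConfigCode(path: String, out: StringBuilder): Int {
    val file = File(path)
    if (!file.exists()) return -1
    out.append(file.readText())
    return 0
}

// 2. Sum-type result value: the type forces the caller to deal with both cases.
sealed interface ReadResult
data class Ok(val text: String) : ReadResult
data class Missing(val path: String) : ReadResult

fun readConfigResult(path: String): ReadResult =
    File(path).let { if (it.exists()) Ok(it.readText()) else Missing(path) }

// 3. Exception: selection can happen further up the call stack.
fun readConfigThrow(path: String): String {
    val file = File(path)
    if (!file.exists()) throw IOException("missing: $path")
    return file.readText()
}
```

If the caller matches on a ReadResult with a when expression, the compiler insists that both Ok and Missing are covered, which is exactly the type verification that makes the sum-type variant hard to forget.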

While the erroring computation is in effect abandoned with no result, it is usually left to the programmer to restore the state, completely or partially, to a known checkpoint.

Checkpoint

The first thing that comes to mind is that checkpoints are a part of transactions, but on further examination, it turns out that checkpoints are everywhere.

You make backups to allow rollback to a previous checkpoint. A version control system allows rollback to a previous checkpoint.

Immutable persistent data structures are checkpoints. No matter what computations and derivations are done, you will always roll back to that known state when you get back up the execution stack.
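A small Kotlin sketch of that idea (the list and the business rule are arbitrary): the immutable value is the checkpoint, and a failed derivation leaves it untouched.

```kotlin
fun main() {
    val checkpoint = listOf(1, 2, 3)             // known state, cannot be mutated

    val attempt = runCatching {
        val derived = checkpoint + 4             // derivations build new values
        require(derived.sum() < 5) { "business rule violated" }
        derived
    }

    // On failure we simply continue from the checkpoint; nothing was changed in place.
    val current = attempt.getOrDefault(checkpoint)
    println(current)                             // prints [1, 2, 3]
}
```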

I want to argue that idempotency is essentially a checkpoint mechanism. Not that you actually roll back, but you can reason as if you had, as long as you retry the idempotent operation until it is confirmed to be successful.

Erlang's error handling philosophy of letting the process crash and restart is a form of rollback to a checkpoint. This is necessary because, when an error happens, nothing can be assumed to be known.

An interesting vision of checkpoints is the "worlds" concept of Alan Kay.

The Midori error model

Joe Duffy has written about the experience from the Midori project, including the error model they arrived at and why. The goal was to have an error model that was usable, reliable, performant, concurrent, diagnosable and composable. The feeling was that they largely succeeded and that programmers were very happy with it.

They chose exceptions as the basic implementation because exceptions do not add overhead on the normal "hot" path.

Abandonment

One key observation was that bugs aren't recoverable errors, so any condition that arose from a possible bug should lead to abandonment of the process, without any ability to catch the error or attempt any form of recovery.

When a bug occurs, the program is in an unknown state that the code has not been equipped to handle, so it would actually be impossible to be sure about any state after attempted recovery. This corresponds with the Erlang philosophy and Google coding practices.
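Midori built abandonment into the runtime; a very rough JVM-level approximation in Kotlin (the abandon helper and the discount example are invented here) would simply terminate the process so that nothing up the stack can pretend to recover:

```kotlin
import kotlin.system.exitProcess

// A rough approximation of abandonment: no catch block anywhere gets to "recover" from a bug.
fun abandon(reason: String): Nothing {
    System.err.println("ABANDON: $reason")    // gather enough diagnostics to fix the bug
    exitProcess(134)                          // terminate the whole process immediately
}

fun applyDiscount(price: Int, percent: Int): Int {
    if (percent !in 0..100) abandon("impossible discount $percent% - this is a bug")
    return price * (100 - percent) / 100
}
```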

Abandonment is annoying enough that the code will eventually get fixed, while error logs can be ignored for months.

This practice led to very stable code and a sense of safety for the programmers, whose trust in the code increased.

Note that this complete abandonment is often considered too disruptive to the system as a whole, because most other computations could still function properly. The mitigation is to have microprocesses that are automatically and swiftly restarted when abandoned, which effectively creates a rollback mechanism to the starting-state checkpoint.

Recoverable errors

Once the things that are probably bugs are out of the way, there is only a small number of recoverable errors left. (Abandonments outnumbered recoverable errors by 10 to 1.)

The decision was to implement them as checked exceptions, but to simplify to basically only one exception type (with an option to create others if deemed absolutely necessary). This is interesting because it reduces the detail level of documentation. The success of this idea implies that there is such a thing as too much documentation.

Another thing that proved very helpful was forcing the use of a try keyword on every call to a function that could fail. This made it much easier to reason about the code (increased documentation).

While result values holding either a result or an error could have worked equally well, exceptions only incur an execution cost on the exceptional path, whereas result values incur a cost on the successful path as well.

There was a way to convert between exceptions and result values when a more dataflow type of syntax was desired.
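Kotlin's standard library happens to offer a similar bridge with runCatching, getOrThrow and friends; the sketch below (settings.conf is a hypothetical file) shows the conversion in both directions:

```kotlin
import java.io.File

fun main() {
    // Exception style wrapped into a result value for dataflow-ish composition...
    val lineCount = runCatching { File("settings.conf").readText() }
        .map { it.lines().size }
        .getOrDefault(0)
    println(lineCount)

    // ...and a result value unwrapped back into an exception when that is more convenient.
    val text: String = runCatching { File("settings.conf").readText() }.getOrThrow()
    println(text.length)
}
```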

Other interesting patterns

Having undeniable exceptions that could only be suppressed by a catch block holding the right token turned out to be a useful pattern when you needed cancellation of a computation (or aborts, as they called them). The token prevented any intervening code in the call stack from catching the exception. Note that a checkpoint would probably be desirable here.

Opt-in try APIs: In rare cases, a condition that would normally lead to abandonment is one where you want to attempt the calculation and fall back to something else if it fails. For those cases they identified, a separate opt-in API for trying the call was provided. (This is the non-error validation case discussed further below.)
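Kotlin's standard library has a comparable split between calls that treat failure as a bug and opt-in "try" variants, for example:

```kotlin
fun main() {
    val numbers = listOf(1, 2, 3)

    val big = numbers.first { it > 2 }                   // throws if nothing matches: failure would be a bug
    val bigger = numbers.firstOrNull { it > 10 } ?: -1   // opt-in "try" variant: fall back instead of failing

    println("$big $bigger")                              // prints: 3 -1
}
```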

"Keepers" were a pattern where an object was set up to be able to "fix" certain errors on the spot, for example providing a fallback file if the desired file could not be found. Instead of unwinding the whole stack to catch the exception, they were made available to be called at the point of throw.

Different types of conditional failures 

It might be worth exploring this from the angle of what different types of conditional failures exist and what they mean, in order to determine suitable implementations. The best analysis of conditional failure types that I have seen so far is in the Google Guava documentation.

"The code I'm testing messed up"

This type of failure would normally not be raised at runtime, but rather in testing or analysis stages before the code itself runs. Typical implementations here would be tests and type checks. Also, typically, the failures will not be handled in code, just reported to the programmer/user.

In cases where tests are integrated into the code, they could be run automatically before starting the program, preventing a failing program from running at all. Pyret does this.

"You messed up (caller)"

This can sometimes be determined through static analysis of types, but more powerful checks generally need to happen at runtime. It corresponds to the precondition part of a contract specification.

If a failure is issued for this reason, it would tend to indicate a programming error, so it might not be useful to be able to detect and handle these errors in code.

On the other hand, if you want to call a procedure only when the precondition is satisfied, otherwise do something else, should you, the caller, be forced to duplicate the precondition check or could you just "catch" the failure as a selection signal?
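A Kotlin sketch of the dilemma (charAt is invented for the example; neither option is being advocated here):

```kotlin
fun charAt(text: String, index: Int): Char {
    require(index in text.indices) { "index $index out of bounds" }  // precondition on the caller
    return text[index]
}

fun main() {
    val text = "hello"
    val i = 7

    // Option 1: duplicate the precondition check before calling.
    val a = if (i in text.indices) charAt(text, i) else '?'

    // Option 2: "catch" the precondition failure and use it as a selection signal.
    val b = try { charAt(text, i) } catch (e: IllegalArgumentException) { '?' }

    println("$a $b")
}
```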

 "I messed up"

These are quite obviously programming errors, so abandonment would probably be the only strategy.

Postcondition checks of a contract are the first thing that comes to mind. Also invariant checks and other assertions about what the programmer believes is true at that point in the code.

Dereferencing a null pointer is another typical example. 

"Someone I depend on messed up"

Very similar to "I messed up", but instead of verifying that your code worked, you check that someone else's code, a dependency, (still) works the way you expect and require.

Abandonment is probably the only reasonable action, but it may be useful for debugging purposes to distinguish the two.

"What the? the world is messed up!"

Very similar to "I messed up" and "Someone I depend on messed up", but this distinguishes impossible things that should not be able to happen at all, according to our model of how things work.

Abandonment is probably the only reasonable action because you cannot really reason about anything at this point. Again it might be useful to distinguish for debugging purposes.

Background: at Google, second-rate hardware is used successfully with the understanding that you may need to check this kind of thing and just bail out and try again (usually somewhere else) when such failures happen.

"No one messed up, exactly (at least in this VM)"

Finally we reach the only type of error that you may need to be aware of and handle with a backup strategy. This is the case where things did not work out the way you expected, but the conditions are too complex to control for, or completely out of your control. Examples:

  • A file did not exist where it was supposed to
  • A web service did not respond as expected
  • The input data was incorrectly formatted and could not be interpreted.
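As a sketch of the first case in Kotlin (the file name and fallback content are made up), the failure is expected and handled with a fallback rather than treated as a bug:

```kotlin
import java.io.File
import java.io.IOException

// The file may legitimately be missing; fall back instead of abandoning.
fun loadSettings(): String =
    try {
        File("app-settings.conf").readText()
    } catch (e: IOException) {
        // The environment, not the code, is at fault here.
        "port=80\nhost=localhost"
    }
```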

My learnings from this

Performance 

You don't want error handling to affect the performance of normal processing.

However, that does not necessarily affect syntax and semantics, as long as the compiler can distinguish the cases. Also, dynamic branch prediction will reduce a lot of the overhead.

Errors

By error, I mean that a condition has been detected indicating that the code itself is flawed (or the setup/infrastructure in which it runs, such as memory allocation).

Whenever an error happens such that the logic of the code is no longer certain because the state of things is unexpected, the program or process needs to return to a known checkpoint.

Preferably enough information is gathered to understand how to fix the problem.

From experience, there is nothing more damaging to a codebase than ignoring an error signal/log, because it demoralizes the engineering team. Likewise, don't keep a backlog of things you probably will never do. If something is serious enough to fix, the problem will be rediscovered. So make sure the error log is clean: either fix an error or stop reporting it as an error.

Since it is too easy to ignore error logs, especially in development, I prefer that the program just crash, which makes a loud enough noise that the error will be fixed. The Midori experience seems to confirm that, as do Erlang programmers.

Note that crashing the whole program is often considered too severe for huge monolithic programs, but we probably shouldn't build those anyway. There is after all a reason why people keep coming up with independently deployable subsystems.

Non-error failures

There is a grey zone where a result is not quite a success, but not really an error either, or at least it is not unexpected that a computation will fail.

There is often a need to define fallback strategies in the face of failure. I'm still unsure of how valuable it might be to configure this in an outer context instead of just having a local fallback strategy.

In existing programming languages, these cases often end up at least partially attached to the error system, which they probably shouldn't be.

Fruitless search or unknown value

This is where null and friends come into play. Even though Tony Hoare calls it his billion-dollar mistake, we can't get away from having to handle the absence of a value, and null is easy to grab for.

It has always been very easy for programmers to miss handling the null case, which is a program flaw, so modern type checking will force the value to be declared as optional and force handling of it.

Most of the time we don't want to do anything with a missing value and I have previously written about the usefulness of simply not emitting any value at all in those cases.
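A rough Kotlin analogue of "emitting nothing" is to let missing values simply drop out of a pipeline:

```kotlin
fun main() {
    val inputs = listOf("1", "two", "3")
    val numbers = inputs.mapNotNull { it.toIntOrNull() }  // "two" yields no value at all
    println(numbers)                                      // [1, 3]
}
```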

Validation or test and fallback

There are times when the programmer cannot know beforehand whether data is valid or not and needs to check it first.

For optional values, a number of convenient ways have arisen to reduce the boilerplate of checking, such as the elvis and null-coalescing operators.
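In Kotlin, for example, the safe-call and elvis operators reduce the check to almost nothing (the User type is invented for the example):

```kotlin
data class User(val nickname: String?)

fun greet(user: User?): String {
    val name = user?.nickname ?: "guest"   // use the value if present, otherwise the fallback
    return "Hello, $name"
}

fun main() {
    println(greet(User(null)))   // Hello, guest
    println(greet(null))         // Hello, guest
    println(greet(User("Ada")))  // Hello, Ada
}
```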

When parsing a string to a number, for example, it is reasonable to expect that some strings may not be numeric. Should the programmer be forced to code up a test when the desired test already exists in the parser?

Previously it was common to just piggyback on the error system and the programmer would add error handling for it.

What about the cases when the programmer knows the string is numeric? Should testing or error handling be forced anyway, with an "impossible case" declaration (I've done this a fair amount)? Or should the call just proceed and abort the program on failure, to signify a bug?

There should be a more lightweight way to utilize existing precondition checks when the uncertainty is known.
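A Kotlin sketch of the three stances for the number-parsing example (the input values are made up):

```kotlin
fun main() {
    val userInput = "maybe42"
    val generatedId = "42"       // produced by code we trust to emit digits only

    // 1. Reuse the parser's own check and fall back.
    val a = userInput.toIntOrNull() ?: 0

    // 2. Declare the "impossible" case explicitly.
    val b = generatedId.toIntOrNull() ?: error("generator produced a non-numeric id")

    // 3. Just call the throwing variant and treat failure as a bug that aborts.
    val c = generatedId.toInt()

    println("$a $b $c")
}
```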

Aborts and cancellations

An infamous case is Java's InterruptedException, which looks and smells like an error, but definitely isn't and should be handled completely differently. It is just a way to propagate the fact that an interrupt flag has been set on the current thread, that is, that a cancellation has been requested.

In the case of cancellation from the outside, there is a need for a way to pass in a signal that the executing code can check.

Whether by cancellation or detection of another condition, the thread of execution should be abandoned and returned to a safe checkpoint.
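A sketch of that on the JVM in Kotlin (processBatches is invented; the thread calls are the standard java.lang.Thread APIs): treat the interrupt as a cancellation signal, restore the flag, and return to a safe point rather than handling it as an error:

```kotlin
fun processBatches(batches: List<List<String>>) {
    for (batch in batches) {
        if (Thread.currentThread().isInterrupted) {
            // Cancellation requested from the outside: stop at a safe point, nothing is "wrong".
            return
        }
        try {
            Thread.sleep(100)                    // stand-in for blocking work
        } catch (e: InterruptedException) {
            Thread.currentThread().interrupt()   // restore the flag so callers also see the request
            return                               // abandon this line of computation
        }
        println("processed ${batch.size} items")
    }
}
```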

Usability 

There seems to be a tension between too much and too little of a property, which is reminiscent of the cognitive dimensions of notations.

Optional handling

Allowing error codes to be optionally handled or ignored leads to low viscosity, but also generally low visibility. Handling is often forgotten and errors can be very hard to debug.

The experience with C does not encourage this. 

Explicit handling 

Explicit local handling, such as sum result types or checked exceptions (or effects), increases visibility, making the logic easier to analyze, but also increases viscosity, making the code harder to change, and can even act as an abstraction barrier.

Java's checked exceptions are disliked for this reason.

Option types are a little less tedious because a None value can be monadically bound, propagating None through subsequent operations. They don't contain any information about the error, though, so they're not quite error handling, just a way to allow a "no result" result. In Tailspin that can be done by simply not returning a result at all.

Maybe Midori's use of one exception type strikes the right balance. The same could be done with a standard result type.

Local handling

Having to handle errors locally after each call gives good visibility but also high viscosity, with lots of repeated boilerplate up the call stack.

Go's error handling is disliked for this. 

Remote handling 

A catch statement (or effect handler) is essentially a COMEFROM, but with even lower visibility because you can't even tell where the execution thread came from. To mitigate this, an exception will usually contain a stack trace to show where it came from.

Locally marking all possible sources, such as the obligatory try keyword on calls that could fail in Midori, helps increase visibility.

The ability to ignore an error at lower levels gives low viscosity for changing the code at those levels.

Summing it up

While I'm not entirely sure how to get all this into Tailspin, I have a pretty clear picture of how I want to handle errors and failures (which aligns well with how it already works):

  • Anything that smacks of a programming or configuration error should be a hard, uncatchable abandonment of the process. Probably an assert or abandon statement is desired to augment built-in checks.
  • There need to be ways to signal intent to broaden the built-in checks, such as the elvis operator for missing values, or the "type bounds" that Tailspin already has on comparisons. I think I want a try operator for proceeding with a function call only if the precondition checks are successful, and a fallback otherwise. Note the restriction to precondition checks, that is, errors of the "You messed up (caller)" kind. I don't want a catch-all for any error. I could probably introduce a reject statement for this case.
  • I think function entry points would be good checkpoints, and I want a way to roll back to them and, at the point of call, define an on-rollback strategy. There is probably a need to be able to commit parts of an operation, like calling a web service. I'll have to think further on how the resulting partial rollbacks should be visualized.
