From fb7839da20a479e82955099426d638d031f75f49 Mon Sep 17 00:00:00 2001
From: Dave Abrahams
Date: Wed, 17 Sep 2025 16:54:32 -0700
Subject: [PATCH 01/41] Begin Errors.

---
 better-code/src/SUMMARY.md          |   1 +
 better-code/src/chapter-3-errors.md | 768 ++++++++++++++++++++++++++++
 2 files changed, 769 insertions(+)
 create mode 100644 better-code/src/chapter-3-errors.md

diff --git a/better-code/src/SUMMARY.md b/better-code/src/SUMMARY.md
index 076ef79..bbfdac1 100644
--- a/better-code/src/SUMMARY.md
+++ b/better-code/src/SUMMARY.md
@@ -2,3 +2,4 @@
 - [Introduction](./chapter-1-introduction.md)
 - [Contracts](./chapter-2-contracts.md)
+- [Errors](./chapter-3-errors.md)

diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md
new file mode 100644
index 0000000..a730ab8
--- /dev/null
+++ b/better-code/src/chapter-3-errors.md
@@ -0,0 +1,768 @@
# Errors

In the *Contracts* chapter you may have noticed we made this reference
to the concept of *errors*:

> If the preconditions are met, but the postconditions are not, and
> the function does not report an error, we'd say the method has a
> bug.

In the interest of progressive disclosure, we didn't look closely at
the idea, because behind that simple word lies a chapter's worth of
discussion. Welcome to the *Errors* chapter!

## Definitions

To understand any topic, it's important to define it crisply, and
unfortunately “error” and associated words have been used rather
loosely, and previous attempts to define these words have relied on
other words, like “expected,” which themselves lack clear definitions,
at least when it comes to programming.

Unless we want to invent new terms, we will have to impose a little of
our own structure on the usual terminology. We hope these definitions
are at least consistent with your understanding:

> **Error**: a condition in conflict with the primary intention of the
> code.

When we write the word “error” in normal type, we mean the idea above,
distinct from the related Swift `Error` protocol, which we'll always
spell in code font.

We'll divide errors into two categories:

> - **Bug**: code contains an avoidable[^avoidable] mistake. For
> example, an `if` statement might test the logical inverse of the
> correct condition.
>
> - **Failure**: a function could not fulfill its postconditions even
> though its preconditions were satisfied. For example, writing a
> file might fail because the filesystem is full.

[^avoidable]: Although “bugs” are inevitable, every *specific* bug is
avoidable.

## Recovery

The idea of recovery from errors may have started in the domain of compilers.

So what do we mean by recovery? When I asked the web (which I do a lot), most of the hits defined error recovery in terms of what a parser does when it hits a syntax error in your code. That surprised me, because it's a rather esoteric thing, but it is a well-established idea in compiler engineering. Say you left out a semicolon in some C code. The parser could just stop right there and issue one diagnostic about the missing semicolon, if that's the only possibility in that syntactic position; otherwise it might produce a less useful diagnostic. But most compilers don't stop there, even though I often wish they would.
They want to give me all of the potentially useful diagnostics about errors in the rest of my code. But if the parser just starts over, discarding its state as though the text that follows were the beginning of the document, I'm going to get a lot of bogus error messages. That's a pretty poor recovery: although the program continues, it's doing something that almost certainly doesn't make sense. It thinks `f` is a type name, it thinks `x` is a type name, it complains about a missing type specifier, and then it reports an extra closing brace that doesn't match anything, even though that brace was fine as written.

So instead of doing that, parsers typically try to recover by pretending I had written something correct. In this case the parser injects a phantom semicolon and continues, and that's why the next diagnostic makes a little more sense: the call to `f` would at least be syntactically legal if I'd ended the statement earlier.

So, as a first cut, let's say that recovery means continuing to execute, doing sensible work. I really like a quote I found in a Stack Overflow answer. It still isn't a very technical definition, but I think it captures the spirit: to recover is to “sally forth entirely unscathed, as though such an inconvenient event had never occurred in the first place.”

What do they mean by unscathed? They mean that the program state is intact: not only are invariants upheld, but the state makes sense given the inputs the program has received. In the parsing case, if you start over from the beginning after the error, the state doesn't make sense given the inputs; it doesn't correspond to what you've already seen. Here's another example: if we hit an error while applying a blur to an image, it's not enough that the user's document is still a well-formed file. It also can't contain random, half-finished changes that the user didn't request. That would be very scathed indeed.

OK, so let's talk about recovering from a bug. What would that mean? Well, first, it assumes you had some way to detect the bug, and not all bugs are detectable. An example of an undetectable bug: you're sorting something, but your comparison function returns random results. That violates the sort function's requirements, a precondition there's no way to actually check.

So let's assume we have a detectable bug. Usually that means somebody is checking a precondition and the check fails, which means a bug in the caller caused it to pass an invalid argument. When that happens, though, you're not really detecting the bug itself; you're detecting one of its symptoms, a kind of cosmic echo. The bug itself occurred at some indefinite point earlier, and a chain of incorrect conclusions the code drew about its own state led it to produce this input that you can see doesn't satisfy your preconditions.
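To make that concrete, here's a minimal Swift sketch (mine, not the talk's; the function and names are invented) of a precondition check catching a *symptom* of a caller's bug:

```swift
// The check in the callee doesn't see the caller's faulty reasoning,
// only its result: an argument that makes no sense.
func element(at index: Int, of values: [Double]) -> Double {
    precondition(values.indices.contains(index),
                 "index \(index) is out of bounds for \(values.count) elements")
    return values[index]
}

func averageOfFirstTwo(_ values: [Double]) -> Double {
    // Bug: the author assumed `values` always has at least two elements.
    // The mistake lives here, but it is *detected* inside `element(at:of:)`.
    (element(at: 0, of: values) + element(at: 1, of: values)) / 2
}
```

By the time the check fires, the faulty assumption was made long ago and elsewhere; all the check can report is that the caller's state no longer makes sense.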
So can you sally forth unscathed? The problem is, you don't know: because of the bug, your program state could be very, very scathed indeed.

If your state is scathed, sallying forth at this point is a terrible idea, for reasons that fall into two categories. First, there are the effects on the outside world. The user's data might be corrupted, and they might save it that way and lose the last good state they had; that's serious. And if you've done a security evaluation, the assumptions underlying that evaluation might now be violated, so by continuing you may be opening a security hole. To sum up: you don't have enough information about the state of your system to recover reliably, you can't even detect whether you've recovered correctly, and the penalties for getting it wrong are very high.

That's one category; the other is the impact on the development process. If you sally forth, the bug is masked and will never get fixed until somebody observes its effects. It's going to affect your customers, and when it affects a really important customer, your management may insist that you do something about it right now. But all you'll have as evidence is a very distant echo, a corrupted document: you never captured the bug's detection, and it's a long process to figure out where that corruption came from. You've gotten the information very, very late. Last of all, most code is correct, so your “bug recovery” code will almost never run, and it certainly isn't going to get tested; if it were tested, you wouldn't ship that version anyway, because you'd just fix the bug. All of this recovery code bloats your program, and every single line is a liability with no offsetting benefit.

There's an interesting insight here, though. Robust systems that can recover from bugs do exist. How do they do it? Almost always, the recovery happens outside the process. Maybe the robustness of the system comes from redundancy: you run three different processes and they vote on the result, the kind of thing you might see in the flight software of an F-22 or the Joint Strike Fighter. They check that code far more carefully than we do, but they also put safeguards in place so that when one of the three voters disagrees, you can kill that process and start it up again. So yes, sometimes it's possible to design a system that recovers from bugs, but don't expect to do it inside your own process.

To sum up: in general you can't recover from bugs, and it's a bad idea to try. So what can you do? The way to handle bugs is to stop the program before any more damage is done and to generate a crash report or debuggable image that captures as much information as possible about the state of the program, so there's a chance of fixing the bug. There might also be some small emergency shutdown procedure.
For example, you might need to save information about the failing command so your application can offer to retry it when it restarts, and maybe you can say something to the user about the reason you're exiting. Make no mistake, this is bad: unless you really go out of your way to do something about it, it's going to be experienced as a crash by your users. But it's the only way to prevent the much worse consequences of a botched recovery attempt, and remember, the chances of botching it are really high, because you don't have enough information to do it reliably.

There is an upside, though. It's also going to be experienced as a crash by developers, QE teams, and beta testers, and that gives you a chance to fix the bug; it isn't going to slip by those people unnoticed and then hit your customers in a really damaging way.

You can also mitigate the experience of crashing. For example, you can tell the user why you're exiting, and you can make it sound pretty responsible. This is important: a lot of people have a hard time accepting the idea of voluntarily crashing, or, more accurately, exiting early. But let's face it, your bug detection isn't the only reason the program might exit early. You can crash from an undetected bug, or a person can trip over the power cord, and you should design your software so that when these bad things happen, they aren't catastrophic. In fact, if we stop pushing bugs and early exit away as though they were intolerable, we can embrace them and try to make the experience seamless, perhaps by arranging for the program to restart itself. To do that, though, you have to accept that early exits are sometimes going to be part of the whole user experience you're trying to deliver.

In fact, there are platforms that force you to live with early termination. On an iPhone or iPad, for example, to save battery and keep foreground apps responsive, the OS may kill your process any time it's in the background, while making it look to the user as though the app is still running. When the user switches back, every app is supposed to complete the illusion by coming back up in the same state it was killed in. I can tell you that as a user, it's really jarring to encounter an app that doesn't do that. The point is that resilience to early termination is something you can, and should, design into a system. Photoshop uses a variety of strategies for this: we always save documents into a new file and then atomically swap that file into place only after the save succeeds, so we never crash leaving a half-written, corrupted document on disk. We also periodically save backups, so you lose at most the last few minutes of work. We could be more ambitious; if we needed to tighten that up, we could save a record of the changes made since the last backup.

The usual mechanism we have for terminating a program when a bug is detected is called an assertion, traditionally spelled something like `assert(condition)`, a spelling that comes from C and C++.
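The example shown on screen at this point isn't reproduced in these notes. As a rough stand-in, here's the same kind of check written with Swift's built-in `assert`, which behaves much like the C macro (the function and names here are invented):

```swift
// Debug-only check: it compiles away in release builds, and in debug builds
// a failure stops the program, reporting the condition and its source location.
func copy(_ source: [UInt8], into destination: inout [UInt8]) {
    assert(destination.count >= source.count, "destination buffer too small")
    for (i, byte) in source.enumerated() {
        destination[i] = byte
    }
}
```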
If you're programming in some other language, you probably have something similar. The facility from C is pretty straightforward: either it's disabled, in which case it generates no code at all and even the check is skipped, or it performs the check and exits immediately with a predefined error code if the check fails, usually printing a message containing the text of the failed check and its location in the source. Good debuggers commonly stop at that assertion rather than just exiting, and even if you're not running in a debugger, on major OSes you'll get a crash report with the entire program state that can be loaded into a debugger. So this is great for catching bugs early, before they get shipped, and for actually diagnosing them, provided people use it.

Another important dynamic is that projects commonly disable assertions in release builds. This has the nice side effect of making programmers comfortable adding lots of assertions, because they know the assertions won't slow down the release build, and that means more bugs get caught early. But unless you really believe you're shipping bug-free software, you might want to leave most assertions on in release builds. In fact, the security of your software might depend on it. If you're programming in an unsafe language like C, opportunities to cause undefined behavior are all around you, and when you can assert that the conditions for avoiding that UB are met before executing the dangerous operation, the program comes to a controlled stop instead of opening an arbitrarily bad security hole. I meant to make this distinction earlier: exiting because of an assertion is not a crash; it's a controlled stop, for calculated reasons.

The problem with leaving assertions on in release builds is that some checks are too expensive to ship. And let's be honest, a lot of programmers will go with their gut about what's too expensive instead of measuring. So we really need a second, expensive assert that is only on in debug builds, so we can continue to catch those bugs early.

There's another problem with having just one assertion: it doesn't express sufficient intent. There are lots of different reasons you might be doing this kind of check. It might be a precondition check, or the function's author might just be double-checking their own reasoning. When those two different assertions fire, the meaning is really different: the first indicates a bug in the caller, the other a bug in the callee. So I really want separate precondition and self-check functions; I want both.

Now, if I'm writing in a safe-by-default language like Rust or Swift, the checks that prevent undefined behavior, like array bounds checks, are special. I can afford to turn off all of the other checks in shipping code, but these checks are the ones that uphold the safety properties of my system, and if I turn them off, that safety is compromised. So I want a separate assertion for the checks that prevent undefined behavior, even if I never anticipate turning the others off in a shipped product, because these are the ones we can't delete from the code, and you want to make that obvious in their spelling. Furthermore, I might want to turn the others off locally so I can measure how much overhead they're incurring. Alright, I hope you get the idea.
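Here's one possible shape for such a suite, sketched in Swift; the names, flags, and policies are invented for illustration, not a prescription:

```swift
// Build-time policy for which checks run.  Real projects would drive these
// from build settings; constants keep the sketch self-contained.
enum CheckPolicy {
    static let selfChecks = true         // cheap author sanity checks
    static let expensiveChecks = false   // e.g. "is this array actually sorted?"
}

/// The caller violated our contract: a bug in the *caller*.  Cheap enough
/// to leave on in release builds.
func requirePrecondition(_ condition: @autoclosure () -> Bool,
                         _ message: @autoclosure () -> String = "",
                         file: StaticString = #file, line: UInt = #line) {
    precondition(condition(), message(), file: file, line: line)
}

/// The author double-checking their own reasoning: a bug in the *callee*.
func selfCheck(_ condition: @autoclosure () -> Bool,
               _ message: @autoclosure () -> String = "",
               file: StaticString = #file, line: UInt = #line) {
    guard CheckPolicy.selfChecks else { return }
    precondition(condition(), message(), file: file, line: line)
}

/// A self-check that is too costly to ship.
func expensiveSelfCheck(_ condition: @autoclosure () -> Bool,
                        _ message: @autoclosure () -> String = "",
                        file: StaticString = #file, line: UInt = #line) {
    guard CheckPolicy.expensiveChecks else { return }
    precondition(condition(), message(), file: file, line: line)
}

/// A check that upholds memory/type safety; never compiled out.
func safetyCheck(_ condition: @autoclosure () -> Bool,
                 _ message: @autoclosure () -> String = "",
                 file: StaticString = #file, line: UInt = #line) {
    if !condition() { fatalError(message(), file: file, line: line) }
}
```

The point isn't these particular names; it's that precondition checks, self-checks, expensive checks, and safety checks can be recognized, and turned on and off, independently.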
I'm not trying to prescribe the exact set of assertion facilities your project needs, but a carefully engineered suite of these functions, with properties appropriate to your project, is part of a comprehensive strategy for dealing with bugs. If you haven't got one, go design it.

One last point about the C `assert`: it's better than nothing, but because it calls `abort`, there's no place to put any emergency shutdown measures; you can't even display a message to the user. So if you use C's assert, it's always going to feel like a hard, unceremonious crash. You'd probably prefer your assertions to call `terminate` instead of `abort`, because terminate handlers exist and would run, and that gives you a chance to perform some emergency shutdown measures. That's another reason to engineer your own assertions, even if you only engineer one.

At this point somebody always asks: “But I'm not allowed to terminate. My manager says we have to keep running no matter what.” So what do you do? Well, first, you've got to fight for the right to terminate. If you've got a critical system, you want to advocate for creating a recovery mechanism that lives outside the process, because there is no reliable recovery inside the process. And if you lose that fight today, keep fighting; in the meantime, fail as noisily as possible, preferably, at least when you're not shipping the code, by terminating. Also set yourself up for the day you win the fight, because at some point the costs of following this policy are going to become obvious. That means using a suite of assertions that today don't terminate, but whose behavior you can change when you do win.

I don't know if we're going to get to the end, given how the scope has expanded, but as much as we all love talking about bugs, it's time to leave bugs behind and talk about failures.

Let's say you identify a condition under which your function is unable to fulfill its primary purpose. That can occur in one of two ways: either something your function calls has a precondition that you can't be sure you're prepared to satisfy, or something your function calls itself reports a failure to you. You usually have two choices at this point. One, you can say that your inability to make progress reflects a bug in the caller: you make the absence of that condition a precondition of your function. Or two, you can report the condition as a failure, which is consistent with all of the code in the system being correct. It's counterintuitive, but you should actually prefer to classify the situation as a bug in the caller, as long as the condition satisfies the criteria for acceptable preconditions. There are a few criteria to satisfy. It needs to be possible for the caller to ensure the condition: there's no way for a caller to ensure there's enough disk space to save a file, because other processes can come along and use up any space that might have been free before the call, so “there's enough disk space to save” can't be a precondition. The other way something might not be a suitable precondition is if it takes as much work for the caller to ensure it as the work the operation would do anyway.
For example, if you're deserializing a document and you find that it's corrupted, you can't make “the file is well-formed” a precondition, because determining whether it's well-formed is the same work as doing the deserialization.

So, prefer to make it a precondition; but if you can't satisfy a postcondition and your code is correct, that's a failure.

Why am I tying this definition to postconditions? Binding our understanding of error handling to the way we understand correctness is valuable in itself, but there are more reasons. First, it simplifies contracts and improves their understandability. This is easiest to see if your language has a dedicated mechanism for error handling; I'm using a fictional programming language here, but it should be easy to understand what's happening. Here are a couple of examples. In the first case we treat the error case as though it's part of the postcondition: we have to say “returns `x` sorted, or throws an exception if something fails,” and you're going to end up saying that a lot if failure isn't separated from the postcondition. If instead you know that a thrown exception means the operation failed, there's nothing else you need to say; even if you do feel you need to say something about possible failures, it becomes a secondary note that isn't essential to the contract. In both cases a programmer can learn everything essential from the summary fragment at the top and the function's signature.

Another way this separation plays nicely with exceptions: you can say that the postcondition of a function describes what you get when it returns, and a throwing function that fails never returns.

Even if you don't use exceptions, you still get simplified contracts from this, as long as you have a dedicated type to represent the possibility of failure. For example, you can simply say “returns `x` in sorted order,” because a result-or-failure return type already says the operation might fail and that the failure will be reported.

Separating the primary intention from the reasons for failure also makes sense because the reasons for failure matter less. If that isn't obvious to you yet, some justification is coming.

Finally, another reason to exclude the failure case from the postcondition is that you want postconditions to be solid and fully described, but when a mutating operation fails, it often leaves behind a state that's quite nebulous. As I said in the contracts talk, you usually don't want to describe that state, because it's detail nobody cares about. But if it's part of the postcondition, you have to say something about it, and that further complicates the contract. You end up with “sorts `x` according to `order`, or throws an exception if ordering fails, leaving `x` modified in unspecified ways,” and you say something like that over and over for mutating operations, instead of just being able to say “sorts `x` according to `order`.”
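The fictional-language examples referred to above aren't reproduced in these notes; here's roughly how the same idea reads in a Swift doc comment (an illustrative sketch; the function and names are invented):

```swift
import Foundation

/// Returns the bytes stored at `url`, decoded as UTF-8 text.
///
/// That one line is the whole essential contract: because the function
/// throws on failure, the summary needs no "...or an error occurs" clause,
/// and the reasons for failure can stay a secondary note.
///
/// - Throws: if the file can't be read or its contents aren't valid UTF-8.
func textContents(of url: URL) throws -> String {
    struct NotUTF8: Error {}
    let bytes = try Data(contentsOf: url)
    guard let text = String(data: bytes, encoding: .utf8) else { throw NotUTF8() }
    return text
}
```

The same simplification applies without exceptions if you return a dedicated failure-carrying type such as `Result`: the type says “or it failed” once, so the summary doesn't have to.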
Now, if you spend some time writing code that handles errors carefully and correctly, especially in a language like C where all of the error propagation is explicit, failures start to sort themselves into two categories, local and nonlocal, based on where the recovery is likely to happen.

Local recovery occurs very close to the source of the failure, usually in the immediate caller, in a way that often depends heavily on the reason for the failure, and it tends to appear in performance-critical code. For example, you might have an ultra-fast memory allocator that draws from a local pool much smaller than system memory, and on top of it you build a general-purpose allocator that first tries your fast allocator and, only if that allocation fails, recovers by trying the system allocator. That's very local handling: you try the fast allocator, you try your alternative, and the error doesn't propagate any further. Another common example: the lowest-level function that tries to send a network packet can fail for a whole slew of reasons, which you can look up in the POSIX documentation. Some of them indicate a temporary condition, like a packet collision, and 99% of the time the immediate caller of that low-level function is a higher-level function that checks for those conditions and, if it finds one, initiates a retry protocol with exponential backoff, only failing itself after some number of failed retries. The lowest-level failure is local, and the failure after N retries is very likely to be nonlocal.

Nonlocal recovery is far, far more common. It occurs far from the source, usually in a way that doesn't depend on the details of the reason for failure. For example, when you're serializing a complex document, serializing any part means serializing all of its subparts, and parts end up nested many layers deep. Because you can run out of space in the serialization medium, every step of the process can fail. If you write the error propagation out explicitly, it usually looks something like the sketch below: you serialize each part, check whether there was a failure, and if there was, return early. After every operation that can fail, you're logically adding “and if there was a failure, return it.” There are many layers of this propagation, and none of it depends on the details of the reason for failure: whether the disk is full, or the OS detects directory corruption, or the serialization is going to an in-memory archive and you've run out of memory, you're going to do the same thing. Finally the propagation stops and the failure is ultimately handled, and if this is, say, a desktop app, the recovery is usually the same no matter what the reason for the failure was: you report the problem to the user and wait for the next command.
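The explicit-propagation code referred to above isn't reproduced in these notes; here's a rough Swift reconstruction of its shape (the types and names are invented):

```swift
struct SerializationError: Error { let reason: String }

protocol Archive {
    /// Appends `bytes`; returns nil on success, or the failure to propagate.
    mutating func append(_ bytes: [UInt8]) -> SerializationError?
}

struct Shape { var points: [(x: Int, y: Int)] }

// Explicit propagation: after every step that can fail we repeat the same
// move ("if there was a failure, return it"), and none of it depends on
// *why* the step failed.
func serialize<A: Archive>(_ value: Int, into archive: inout A) -> SerializationError? {
    if let failure = archive.append(Array(String(value).utf8)) { return failure }
    return nil
}

func serialize<A: Archive>(_ shape: Shape, into archive: inout A) -> SerializationError? {
    if let failure = archive.append(Array("shape".utf8)) { return failure }
    for point in shape.points {
        if let failure = serialize(point.x, into: &archive) { return failure }
        if let failure = serialize(point.y, into: &archive) { return failure }
    }
    return nil
}
```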
OK, so let's talk about exceptions for a minute. Way back in 1996 I developed a personal mission to dispel the widespread fear, loathing, and misunderstanding around exceptions (so yes, I'm old), and while I've seen real progress over the years, I know some of you out there are still not all that comfortable with the idea. If you'll let me, I think I can help. The first thing to know is that exceptions are just control flow, and you can see the motivation easily with cases like this one, because using an exception eliminates the boilerplate and lets you see the code's primary intent. There is no magic here: just like a `switch` statement, exceptions capture a commonly needed control-flow pattern and eliminate unneeded syntax. To grok the meaning of the code in full detail, you mentally add “and if there was a failure, return it,” the same thing we said we'd repeat over and over in the explicit version. But if you push failures out of your mind for a moment, it's much easier to see how the function fulfills its primary purpose (see the throwing sketch at the end of this section). That purpose was obscured by all the failure handling in the earlier version, and the clarifying effect is even stronger when there's control flow that isn't related to error handling, because then the pattern of stuff you can ignore is less obvious.

OK, I said exceptions are just control flow; I lied a little bit. There's one other big difference between the exception version and the explicit version: the exception version erases the types of the failure data, and catch blocks are just big type switches, with dynamic downcasts that recover the information. A lot of us are static-typing partisans, so at first erasing that type information might sound like a bad thing. But remember, as I said, almost none of the code propagating a failure, or even recovering from it, cares about the details of the reason for the failure; it doesn't care about the data in the failure report. What do you gain by threading all that failure type information through your code? When the reasons for failure change, you create lots of churn in your code base updating those types. In fact, if you look carefully at the explicit version's signature, you'll see something that typically shows up in systems where failure type information is included: people find a way to bypass the development friction induced by the static types. Here there's an “unknown” case, which is basically a type-erased box for any failure type. This is also why systems with statically checked exception types are a bad idea, and it doesn't matter whether you're doing exception handling or reporting errors some other way; the same dynamic occurs. Java has a feature called checked exceptions, a famously failed design, because of this dynamic: people end up having to bypass it. Swift recently added statically typed error handling in spite of this lesson, which should be well understood by the language designers. I don't understand why there was so much fanfare from the community; I suppose everybody thinks they want more static type safety. But I'm not optimistic that it will work out any better this time than it did for Java. The moral of the story is that sometimes dynamic polymorphism is the right answer. Nonlocal error handling is a great example, and the design of most exception systems optimizes for it.
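For contrast, here's the throwing counterpart of the explicit sketch above, again my reconstruction rather than code from the talk, with a top-level handler that recovers what little type information it needs dynamically:

```swift
protocol ThrowingArchive {
    mutating func append(_ bytes: [UInt8]) throws
}

// With exceptions, every "if there was a failure, return it" is implied by
// `try`, and the primary intent is much easier to see.
func serialize<A: ThrowingArchive>(_ shape: Shape, into archive: inout A) throws {
    try archive.append(Array("shape".utf8))
    for point in shape.points {
        try archive.append(Array(String(point.x).utf8))
        try archive.append(Array(String(point.y).utf8))
    }
}

// At the top level, recovery rarely depends on the concrete failure type:
// report the problem and wait for the next command.
func run(_ command: () throws -> Void) {
    do {
        try command()
    } catch let failure as SerializationError {
        print("Couldn't save: \(failure.reason)")   // a reason we know about
    } catch {
        print("The operation failed: \(error)")     // everything else, type-erased
    }
}
```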
OK, unfortunately we're right up against the limit on time, so we're not going to get to the end today. I think we're going to need a part two.

Todd Baumeister 50:51
Alright, I'll be the brave idiot who goes first.

Nick DeMarco 50:55
Thank you, Todd.

Todd Baumeister 50:56
Awesome presentation, thank you. As a former C developer, although I don't know if I can say “former” (can you ever forget how to ride a bike?), that was really good, with a lot of really good points about error handling. But I have to ask: the conditions you're looking for are the worst case, the ones we can't recover from. And you mentioned, I can't remember the exact words you used, the idea that you can have a chain of handlers when an exception comes out, filter through that, and then hit a catch-all at the end. So can I summarize your talk as…

Todd Baumeister 51:44
As a developer, I have expected errors, things that I expect to potentially go wrong, like my network timing out, and I need to handle those, but we always need a catch-all at the end for… or no, we should *not* have a catch-all at the end for the unexpected errors? Is that my main takeaway here, or…?

Dave Abrahams 52:08
If I understand your question right, no, that's not what I'm saying. But let me try to sort some of this out, because there's a lot of good stuff in your question. Remember from the beginning: “unexpected errors” almost always means bugs. And part of my advice, which we didn't get to, but it's right here:

Don't use exceptions for bugs. When a bug is detected, you should exit the program, not throw. Don't worry about catching, and certainly don't use exceptions as the way to exit the program. I know that's the default behavior if you don't catch, but the problem is what happens to everybody up the chain from you. I'm going to just read what I've got here: the default behavior of exceptions is to stop the program, but throwing when you find a bug defers the choice about whether to actually stop to every function above you in the call stack, and that is not a service, it's a burden. Giving your clients bad choices to make doesn't help anybody; you've made your function harder to use by giving your clients more decisions to make.

OK, so, last: what about a catch-all case at the top anyway? Maybe it's not for bugs; maybe it's for some exception type you weren't aware of at the top level.

Todd Baumeister 54:09
You're running .NET on top of the C++ platform, and .NET throws an exception, and you've got to catch it someplace.

Dave Abrahams 54:15
Yeah.

Todd Baumeister 54:16
That's where I've experienced this, yeah.

Dave Abrahams 54:18
OK. So let's talk about the desktop app, because that's something I can address easily; we can look at other examples too. You've got an unidentifiable error, but it prevented your operation from succeeding. So what's the problem here? If you don't know anything about the exception type at all, you can't really give a meaningful error report to the user; that's the worst part of it. You have to say “sorry, an unknown error occurred.” That's embarrassing, but
it's not catastrophic. From there you can proceed just as though any other failure, like running out of disk space, had occurred.

Todd Baumeister 55:15
OK, thank you. Yes, that's helpful.

Nick DeMarco 55:20
Dustin, you wanna go ahead?

Dustin Passofaro 55:22
In the last 30 seconds I've been sitting at the top of this bell curve of my relationship with exceptions. You start out with “ooh, exceptions are cool,” then “oh my gosh, never use exceptions, they are the bane of my existence, why oh why.” And I see you over here starting to push me over the edge toward “wait a minute, maybe there is something here.” I'm not won over yet, and please, I want to be won over. Please help me continue to see how this doesn't just lead to the most mind-boggling spaghetti.

Oliver Unter Ecker 55:51
Can you talk with the office? I could.

Dustin Passofaro 55:54
Sorry, somebody else is talking.

Dave Abrahams 55:56
OK.

Dustin Passofaro 55:57
So, to kind of double down: even in the cases where it's like “oh, this is a known error,” you're still deferring, and now you have your entire call stack above you asking “is it my responsibility? How about yours? How about yours?” That's the first problem I see.

Dave Abrahams 56:13
OK, well, let me address that first: if you do error handling carefully, no matter what mechanism you use, that pattern comes up.

Dustin Passofaro 56:26
Good observation.

Dave Abrahams 56:27
OK.

Dustin Passofaro 56:27
OK.

Dave Abrahams 56:27
So.

Dustin Passofaro 56:27
Yeah, I see that.

Dave Abrahams 56:28
So it's just a mechanism; it doesn't change the nature of failure handling, which is the same no matter what mechanism you use.

Dustin Passofaro 56:38
That's the light bulb, OK.

Dave Abrahams 56:40
OK.

Todd Baumeister 56:40
Can I just add: the big difference here is known versus unknown, right? You're talking about known exception handling versus unknown, which is a bug, and it's out of… no.

Dave Abrahams 56:52
Yeah, let me be really precise here. I prefer not to use the word “unknown,” because a library could throw a failure at you that doesn't represent a bug, even though it didn't tell you it was going to throw that type. So an unknown exception is not necessarily a bug.

Todd Baumeister 57:25
A structured exception, is that what we mean? An application exception, let me put it that way, versus, say, a null pointer someplace.

Dave Abrahams 57:37
OK, so yes, there's an unfortunate terminology clash. What I'm talking about are language-feature exceptions. Unfortunately, lots of systems and processors call things like divide-by-zero an “exception,” but those don't act like exceptions in languages, which propagate up the call stack and take care of things like destructors.

Todd Baumeister 58:08
What, so system interrupts versus application errors?

Dave Abrahams 58:13
Sorry, what?

Todd Baumeister 58:13
Maybe, then, system interrupts versus application errors: a divide-by-zero is a system interrupt.

Dave Abrahams 58:20
OK, again, let's be precise in our language. You say “application error,” but given what we saw about errors, that could just mean a bug in the logic of the application.
So don't handle that with an exception. If something you're using throws an exception at you for those cases, which is a common misdesign (C++ even has it in places), stop it: turn it into an assertion failure. You don't want your unrecoverable code paths mixed in with your recoverable code paths.

Todd Baumeister 59:06
Thank you.

Nick DeMarco 59:09
I have a question from Kevin Hopps, as I sense a cadence here.

Dave Abrahams 59:12
OK.

Nick DeMarco 59:14
Incidentally, since I've seen a clapping emoji: if you have to drop off, feel free to, but we like to keep the Q&A going for as long as folks are interested, or until someone gets exhausted, so we're going to go a little longer. Also, please note that I've just dropped a survey link in the chat; if you do drop off, please take the five minutes it takes to answer the questions. We recently made all of the questions optional, so if you just want to tell us one thing about the talk, you can do that, and don't feel obligated to fill out every single question. But Kevin asks: how should I write my utility function? Take “open file,” for example. It might be fatal for one caller, an error for another caller, and not even an error for yet another caller.

Dave Abrahams 1:00:02
That's a great question. So, if it's not necessarily fatal, and you want to make this thing useful to everybody, obviously you can't make the decision to treat it as fatal, so you have to report the condition to your caller. Now remember what I said about contracts and about how you decide whether something is an error. Actually, let's back up and classify this first: this is a not-a-bug error. Nobody is in a position to control whether opening a file is going to succeed, so clearly it's not a bug; it's a failure of some kind.

Kevin Hopps 1:01:13
It might be a bug: if the call to open the file is from a piece of code that knows the file should be there, then it's a bug.

Dave Abrahams 1:01:25
What do you mean, “should be there”?

Kevin Hopps 1:01:28
If you say “always create a log file when you start the program,” and later you try to open that log file and it's not there, that could be a bug.

Dave Abrahams 1:01:40
Well, not a bug in your code.

Kevin Hopps 1:01:46
Maybe I've come up with a poor example. The point is, only the caller knows whether the result is a bug or not. It might be a bug in some contexts and not a bug in others.

Dave Abrahams 1:01:59
Fine. In general, if only the caller knows, you can't make a catastrophic decision; it's the same as not knowing whether it's fatal.

Kevin Hopps 1:02:14
Right.

Dave Abrahams 1:02:15
And bugs are fatal; we've just said that, bugs are going to be fatal. So if you don't know whether it's going to be fatal for the client, you can't make the decision that it is fatal, and you treat it as a postcondition failure. What was the other possibility? “It's not even an error for another caller.” I'm not sure what that means; whether it's an error in the open-file function isn't up to the caller. The caller may decide,
“Oh, I'm going to deal with the failure to open the file some other way,” but it's still a failure of the open-file function. Maybe the caller isn't going to propagate that failure to its own caller; maybe it has some alternative way to achieve success; but it would still be an error in the scope of the open-file function. I hope that answers the question, Kevin.

Kevin Hopps 1:03:24
Yes, thank you.

Dave Abrahams 1:03:26
Sure.

Nick DeMarco 1:03:34
I'm not seeing other questions in the chat, although I do want to call out that Florin shared something very interesting about Erlang that I invite folks to read if they're curious. He describes it as a nice complement to the concepts Dave presented today, and having skimmed it twice now, I'm inclined to agree. We are a little over time, but I saw someone just come on camera. Florin, maybe you want to comment a little about Erlang and what you shared.

Florin Trofin 1:04:03
Yes. I highly recommend that engineers read that paper. You can skip over Erlang itself, the language is not my favorite, but the runtime system is remarkable, and so are some of the properties it has. For example, it achieved something astounding, I don't remember exactly, something like seven nines of availability: systems running for years and decades without stopping, and the ability to patch the system at runtime without shutting it down. Those two properties alone should raise an eyebrow and make you say, OK, that's something. And it was done back in the '70s, so way back. The principles are very sound, and they've been validated by the remarkable track record of these telephone switches that have been running for decades. The supervision concept Erlang introduced is very powerful. The idea is that when you want to do something nontrivial, anything of any complexity, you delegate it to a child process, and in Erlang processes are not OS processes; they're much cheaper, so you can spawn millions of them. A nice property of the system is that it establishes a bidirectional link between the parent and the children it spawns. So you need to do something, and you delegate it to a child, or to several children, and say one of them fails: you're immediately notified that that particular thing failed, and as the parent you can apply different policies. You can say, I want to restart just that node and retry, or I can restart all the nodes even though only one child failed, because they belong together and it doesn't make sense to restart just the one. And if I fail doing that, then I report back to my own parent. So simpler and simpler things get handled, until, if necessary, the whole system restarts, automatically. It's a very interesting idea. I've been thinking about it for a long time, and I think it has a lot of appeal especially for distributed systems and services, but I don't think it should be limited to distributed systems; I think it's also powerful for normal software, like desktop software.
Nick DeMarco 1:06:37
Hmm. Dave, your thoughts?

Dave Abrahams 1:06:42
Yeah, I was wondering, Florin: when you described those failures, are they indeed failures in the sense I've used in this presentation, i.e. not bugs? Or are they sometimes bugs?

Florin Trofin 1:07:05
The paper actually does make a distinction; it talks about the difference between an exception, an error or failure, and a bug. I think a lot of your talk overlaps with concepts that are in there, which is why I said it's a nice complement to what you discussed. I'd be curious to hear your thoughts, maybe next time, after you've read the paper.

Dave Abrahams 1:07:31
Yeah, I'll read it before Part 2.

Florin Trofin 1:07:32
I also find it really well written and well organized.

Dave Abrahams 1:07:34
That's a great idea.

Florin Trofin 1:07:39
The author organized his thoughts very well.

Dave Abrahams 1:07:44
So it's better than the Erlang movies. Have you seen the Erlang movies?

Florin Trofin 1:07:50
No.

Dave Abrahams 1:07:51
Oh, you can look them up on YouTube. They're kind of hilarious.

Florin Trofin 1:07:56
OK.

Dave Abrahams 1:07:58
Yeah. What else?

Nick DeMarco 1:08:09
All right.

Dave Abrahams 1:08:10
Anything else?

Nick DeMarco 1:08:11
I was just giving folks a moment, but I think we might be reaching a cadence here. So thank you to everyone who came, and thank you to the 28 of you who are sticking around for the discussion; that's always fun. I'm going to share the survey link in the chat one more time: if you've got a couple of minutes, please let us know what you think. Let us know in particular what you thought of the reading-text-on-a-screen narrative presentation style; that's a first for us, and I'm curious how well it worked for communicating ideas and for understandability. So please share your thoughts. For now, let's wrap this up. I'll see you next month for type design with Sean Parent; that should be fun. Enjoy your long weekend, and we'll see you at the next one. Thanks, everyone.

Florin Trofin 1:08:58
OK. Thank you for organizing these, and thank you, Dave, for your presentation.

Dave Abrahams 1:08:59
Thank you, everybody.

Nick DeMarco 1:09:03
I agree.

Speaker 1 1:09:03
Thank you.

Nick DeMarco 1:09:03
Big kudos today; this was a lot of fun.

Dave Abrahams 1:09:05
Thanks, bye.

Nick DeMarco 1:09:06
Bye.

Nick DeMarco stopped transcription

From 252e110740c65e777dcd411ae11a591eab9db8a5 Mon Sep 17 00:00:00 2001
From: Dave Abrahams
Date: Tue, 23 Sep 2025 13:25:59 -0700
Subject: [PATCH 02/41] Errors WIP

---
 better-code/src/chapter-3-errors.md | 726 ++++++++++++++++++++++++++--
 1 file changed, 687 insertions(+), 39 deletions(-)

diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md
index a730ab8..d8a6f3d 100644
--- a/better-code/src/chapter-3-errors.md
+++ b/better-code/src/chapter-3-errors.md
@@ -30,8 +30,12 @@
When we write the word “error” in normal type, we mean the idea above,
distinct from the related Swift `Error` protocol, which we'll always
spell in code font.
-We'll divide errors into two categories: +We'll divide errors into three categories: +> - **Input error**: the program's external inputs are malformed. For +> example, a `{` without a matching `}` is discovered in a JSON +> file. +> > - **Bug**: code contains an avoidable[^avoidable] mistake. For > example, an `if` statement might test the logical inverse of the > correct condition. @@ -43,45 +47,59 @@ We'll divide errors into two categories: [^avoidable]: Although “bugs” are inevitable, every *specific* bug is avoidable. -## Recovery - -The idea of recovery from errors may have started in the domain of compilers. - -OK, So what do we mean by recovery? -So when I asked the web which I E do a lot, most of the hits define error recovery in terms of what a parser does when it hits a syntax error in your code and that kind of surprised me because it's it's kind of an esoteric but thing. -But, but yeah, it's a well established, uh idea in in compiler engineering. -So let's say that you left out a semicolon. -Umm, so this is just some C code, right? -Uh, the parts are could just stop right there, right here. -And if she one diagnostic about the missing symbol? -Uh, if that's the only possibility in that syntactic position, otherwise it might, it might have a less useful diagnostic, but most programming languages, the they're they don't do that, even though I often I wish they would. -They wanna give me all of the potentially useful diagnostics about errors and the rest of my code, and so you know, if the parser just starts, our starts over as though as though this is the beginning of the document, you know this is the whole document and discards its state. -Umm, you know, I'm going to get a lot of bogus error messages. -That's a pretty poor recovery because although the program continues, it's doing something that almost certainly doesn't make any sense. -So you know, it thinks F is a type. -Name it thinks X is a type name. -It's complaining about a type specifier. -It's and then there's this extra closing brace that doesn't match anything right where, whereas it was there. -So instead of doing that, parsers typically try to recover by pretending I had written something correct. -In this case, it just injects a phantom semicolon and continues so as a first cut and here that's why you end up with this with this second error that that makes a little bit more sense, right? -That call to, to F would be at least syntactically legal. -If I'd put the the close paren. -Earlier, OK. -So. -Umm OK so so as a first cut, let's say the covery is continuing to execute doing sensible work, right? -And but I really like a quote I found in a stack overflow answer. -I mean, well, we're still not, we're still not getting down to it. -Very technical definition. -I think this really captures the spirit, they said. -It's to Sally forth entirely unscathed, as those such an inconvenient event had never occurred in the 1st place. -And So what do they mean by unscathed? -Well, they mean that the program state is intact. -Not only are the invariants) upheld, umm, but the state makes sense given the inputs the program has received. -So in like that parsing case, if you start over, you know from the beginning after the error, then the state doesn't really make sense given the inputs. -It doesn't correspond to what you've already seen. -Umm, so here's another example. -If we have an error while we're applying a blur to some image, it's not enough that the users document is still a well formed file, right? 
+## Error Recovery + +Let's begin by talking about what it means to “recover from an error.” +Perhaps the [earliest use of the +term](https://dl.acm.org/doi/10.1145/800028.808489) was in the domain +of compilers, where the challenge, after detecting a flaw in the +input, is to continue to process the rest of the input meaningfully. +Consider a simple syntax error: the simplest possiblities are that the +next or previous symbol is extra, missing, or misspelled. Guessing +correctly affects not only the quality of the error message, but also +whether further diagnostics will be useful. For example, in this code, +the `while` keyword is misspelled: + +```swift +func f(x: inout Int) { + whilee x < 10 { + x += 1 + } +} +``` + +As of this writing, the Swift compiler treats `whilee` as an +identifier and issues five unhelpful errors, four of which point to +the remaining otherwise-valid code. That's not an indictment of +Swift; doing this job correctly is nontrivial. + + + +More generally, [it has been +said](https://stackoverflow.com/a/38387506) that recovering from an +error means that the program can “sally forth entirely unscathed,” +i.e. that the program state is intact—its invariants are upheld. + +Also, the state must make sense given the correct inputs received so +far. “Making sense” is necessarily a subjective judgement, so examples +are called for. + +- The initial state of a compiler, before it has seen any input, + certainly meets the compiler's invariants. But when an error is encountered, + resuming with that state would ignore the context seen so far that + can help inform further diagnostics. If the following text did not + match what is expected at the beginning of a source file, it would + be flagged as an error. + +- If we have an error while we're applying a blur to some image, it's not enough that the users document is still a well formed file, right? It also can't have some random or half finished changes that they didn't request. + So that's that would be that would be very scathed, OK. OK, so let's talk about recovering from a bug. So what would that mean? @@ -766,3 +784,633 @@ Nick DeMarco 1:09:06 File. Nick DeMarco stopped transcription + +## PART 2 ## + +Alright, welcome back everybody. +Umm. +So just to refresh where we where well, first of all, for those of you who who don't remember part one or weren't here, umm this is a very slick presentation where I show you no slides and just this document that I wrote up with my notes is is in the background cuz it contains some examples which I'll no want you to look at. +Umm, so where we were at, we were talking about exceptions and I just wanna review a few things just for background. +You know, I tried to. +I tried to demystify exceptions. +A little bit there. +They're just a control flow mechanism. +I and and they don't introduce any new. +Problems to error handling, but if you're Handling errors right, you have basically all the same issues to to think about whether you're using return types or not, but they they do optimize for. +For things a little differently, they optimize for nonlocal error handling right where, where it's very likely that your immediate caller doesn't have anything to do with the error, and they're just gonna need to propagate it up. 
They also tend to erase the types of the error information, which prevents code churn as different kinds of errors end up propagated through the code. That turns out to be a good thing for most code, which mostly doesn't care what types are actually carried in the error, because most of the code is just propagating it.

So, we were about to talk about when to use exceptions and when not to. I want to start by piercing some of the aphorisms you may have heard, because there's a lot of nice-sounding advice about when to use exceptions that's either meaningless or really vague. “Use exceptions for exceptional conditions”: well, how do I measure what counts as an exceptional condition? I don't know. “Don't use exceptions for control flow”: that one, I know, is really popular around Adobe and even appears in one of our coding guidelines documents, but come on, if you're using exceptions, you're using them for control flow, because that's what they are; exceptions change which code executes next. So I hope I can improve on that advice a little.

First of all, you can use exceptions for things that aren't obviously failures. For example, when the user cancels a command, an exception is appropriate, because the control-flow pattern is identical to the one where the command runs out of disk space: the condition ends up propagated to the top level. The recovery is only slightly different (there's nothing to report to the user when they cancel), but all the intermediate levels between the point where the failure originates and the top of your event loop are the same. It would be silly to explicitly propagate cancellation with some other mechanism, in parallel with the implicit propagation of failures you get from exceptions (a small sketch appears at the end of this passage). If you do choose to use exceptions for user cancellation, though, I strongly urge you, in your thinking and in your terminology, to classify that case as a failure. I said it's OK to use exceptions for things that aren't obviously failures, but you can call this one a failure; otherwise you undo all the benefits you got by separating failures from postconditions, and you'll have to include “unless the user cancels, in which case an exception is thrown” in the description of every function that can be cancelled. So in the end, my broad advice is: only use exceptions for failures, but be open-minded about what you call a failure. Actually, even if you're not using exceptions, any condition whose control flow follows the same path as nonlocal failures should probably be classified as a failure.

Another prime example of a non-obvious place to use exceptions is the discovery of a syntax error in some input. In the general case you're parsing that input out of a file, so I/O failures can occur, and what happens then? If you have a nested call stack, say a recursive-descent parser, the control flow when you hit the I/O error is going to be the same as the control flow when you hit a syntax error. So if you call the syntax error a failure of the parsing routine and use the same error-reporting mechanism, you have a win for your code.
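Here is the small cancellation sketch promised above (my illustration, not code from the talk; the names are invented):

```swift
struct Cancelled: Error {}

/// Stand-in for real image-processing work that checks for cancellation.
func applyBlur(to image: inout [UInt8], isCancelled: () -> Bool) throws {
    for i in image.indices {
        if isCancelled() { throw Cancelled() }  // propagates like any failure
        image[i] /= 2                           // placeholder for real work
    }
}

/// The top of the event loop: the only place cancellation is treated
/// differently from, say, running out of disk space.
func handleCommand(_ body: () throws -> Void) {
    do { try body() }
    catch is Cancelled { /* nothing to report; the user asked for this */ }
    catch { print("The command failed: \(error)") }
}
```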
+OK.
So those are some places where you can use exceptions; next, when not to use them.
Don't use exceptions for bugs.
When a bug is detected, the program can't proceed reliably.
And what happens when you throw?
There is a whole set of unwinding actions: it destroys things on the stack, it changes where the stack pointer is, and all of that happens before your debugger or your crash report, if you even get one, sees anything.
So you're destroying valuable information that you might need to find the bug.
Furthermore, anything extra that you do once a bug is detected is that much more likely to cause a problem: maybe corrupt your document, maybe open a security hole.
And finally, it can hide the bug from developers, because when you throw an exception, that delegates responsibility for how to deal with it to your callers, and your callers maybe don't feel like stopping the application; maybe they want to swallow the exception and continue.
As I said before, it is not a service to delegate that choice to your callers; it's a burden.
Don't give your clients extra decisions to make, and especially don't open the door to bad decisions like continuing after a bug is detected; you just make your function that much harder to use.
OK. So another thing that, yeah.

David Sankel 9:20
It looks like there's a question in the chat. Dinesh, you want to go ahead?

Dinesh Agarwal 9:24
Yeah. OK. Thank you.
So I just had a quick question.
Dave, you mentioned that if we detect a bug, please don't pass it on as an exception.
But while the code is in production, ideally the bug would surface as an exception.
I don't really understand what it means to detect a bug; if a developer detects a bug, they will try to fix it, right?

Dave Abrahams 9:53
OK, so you said a few different things that I guess need to be responded to.

Dinesh Agarwal 10:00
Yeah. OK.

Dave Abrahams 10:01
Let me respond to the last thing first, because that one is easy: if the developer detects a bug, ideally they would try to fix it. Yes, I agree.
So then, about production: were you present for part one of the talk?

Dinesh Agarwal 10:19
I joined 5 minutes late.

Dave Abrahams 10:21
OK, so I'm pretty sure we covered this in part one.
Once a bug is detected, if you continue to run, you increase the chance that you silently corrupt the user's data in an unrecoverable way.
For example, take Photoshop, which periodically saves documents to recovery files.
There are even more sophisticated systems that will also dribble out the commands that have executed successfully so far, so that the application can replay them.
Regardless, that is a solid state before the bug is detected, or at least there's a very high likelihood that it is.
Once the bug is detected, you're proceeding based on incorrect assumptions, and what can very easily happen is that those incorrect assumptions lead to corruption in the document that the user doesn't see; they proceed, then they save their document, and it's all over.
You can never get that back.

Dinesh Agarwal 11:36
Got it. I see. Cool. Thank you.

Dave Abrahams 11:43
OK.

Dustin Passofaro 11:44
Can I step in there too?
+Because I think I'm also not understanding; I shared his question, actually, and I was here for part one, so maybe I missed something, maybe I wasn't understanding something there.

Dave Abrahams 11:47
Sure. Yeah, you were; I remember you.

Dustin Passofaro 11:56
Ah, good. That's good.

Dave Abrahams 11:58
Yeah, you had good questions.

Dustin Passofaro 11:59
Well, I hope this question is also memorable, but we'll see.
I was also thinking that sometimes a bug will come up and will present itself as an exception.
And, actually, I'll let you keep going.

Dave Abrahams 12:14
That's my next point, yes.
As it says right there, if you use components that misguidedly throw things like logic errors or domain errors or invalid-argument errors at you, those things all represent bugs.
Don't let those exceptions propagate.
Catch them and terminate the application.
Otherwise, you're just indirectly doing what we've just said is a bad idea.
OK, now, there are some systems like Python...

Dinesh Agarwal 13:05
Sorry to interrupt, but it's a very interesting statement.
We understand that there is some misguided code in the code base.
Is there any guidance, any guideline, for how we decide that a function or a piece of code is misguided?

Dave Abrahams 13:27
OK, well, this is a very simple criterion: if the code responds to bugs, in other words to misuse of the code, to precondition violations, by throwing an exception, that's misguided.

Dinesh Agarwal 13:49
But we would only know that once we've run it multiple times.
Let's say there is a library that is getting loaded that we have not run multiple times, a dynamic library.
In that case, is there any guidance you would like to share? Should we do some checks on that code before relying on it?

Dave Abrahams 14:13
OK, so, I probably shouldn't assume this, but my basic assumption is that the components you use have documented APIs.
That means they tell you what they're going to do.
Although, when there's a precondition failure, there are no obligations to do anything in particular.
So I guess discovering these misguided components probably ends up being a product of auditing, or of observing these kinds of misguided exceptions at runtime.

Dinesh Agarwal 15:09
Got it.

Dave Abrahams 15:09
So, yeah.

Sean Parent 15:14
One comment: your list there is pretty good, and a lot of people will just inherit from std::logic_error for anything that's a bug.
So that's a good case: if you catch a std::logic_error, you might want to treat it as fatal.
All the places in Photoshop where it puts up a dialog box that says a program error has occurred: it's fine to tell the user that, but the next thing should be to save out a recovery document and exit.
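(A minimal sketch of the policy Sean and Dave describe above. `thirdparty::resize` is a hypothetical stand-in for a component that misguidedly reports precondition violations by throwing `std::logic_error`; the wrapper refuses to let that propagate as if it were a failure.)

```cpp
#include <cstdio>
#include <cstdlib>
#include <stdexcept>

namespace thirdparty {
  // Stand-in for a misguided component: it reports a bug in its caller
  // (a precondition violation) by throwing std::logic_error.
  inline void resize(int width, int height) {
      if (width < 0 || height < 0) throw std::logic_error("negative size");
      // ... real work ...
  }
}

// Wrapper used everywhere instead of calling thirdparty::resize directly:
// a logic_error means a bug was detected, so stop rather than letting
// callers swallow it and continue on corrupt assumptions.
void resize_canvas(int width, int height) {
    try {
        thirdparty::resize(width, height);
    } catch (std::logic_error const& e) {
        std::fprintf(stderr, "program error: %s\n", e.what());
        // ... emergency-save / crash-report hook could go here ...
        std::abort();
    }
}
```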
+Dave Abrahams 15:50
Yeah.
That said, you shouldn't make the assumption that every component you use is going to be misguided; then you'll have try/catch blocks all over the place.
Basically, if you write your code right, try/catch blocks should be extremely rare, mostly only at the top level.
One exception: sometimes you have an otherwise unmanaged resource and you need to clean it up.
Usually you can deal with that by managing it in a destructor, but if, say, you're initializing uninitialized memory and an exception is thrown while you're doing that, you might need a catch block to go and deinitialize all of the elements you've already initialized.
So: very rare.
And all that said, I also want to acknowledge that there are some systems, like Python, where using exceptions to report bugs is just part of the fabric of the system.
In fact, in Python they use exceptions to exit loops, which is a little alarming to some people, but in Python you just can't use this rule that if you see one of these things you stop the program; you have to let it propagate.
OK. And there's a hand.

Josep Valls 17:33
Hi, yes.
I mostly program in Python, so that's me, but even in Java, when networking is involved, there are lots of things that are very commonly reported as exceptions by mainstream libraries, anything from timeouts to temporary resource-access problems.

Dave Abrahams 17:34
Are those bugs?

Josep Valls 17:58
No, those aren't bugs.
But these exceptions seem to be handled in a try/catch, so we have lots of them.

Dave Abrahams 18:05
Well, yeah, not immediately handled, right.
Remember, exceptions are for nonlocal error handling; they are generally for things that can't be responded to by the immediate caller.
So in the general case the pattern is that there's one try/catch block, sort of at the top level of the application, that catches all of the things that propagate out of the operations it invokes.

Josep Valls 18:51
But simple things like a retry: would that be a good excuse for an exception, where the library throws one?

Dave Abrahams 18:57
Yeah, right.
So if you have to retry a network operation, that's what I've been calling a local failure, and you might have a component that misguidedly uses exceptions to report local failures, in which case, yes, you do need a local try/catch.
But as I also said in the previous section, local failures are far and away more rare than nonlocal failures; there are just a few low-level functions that need to report local failures.
So if you get a component that reports a local failure with an exception, what you can do is put a little wrapper around it and use that wrapper everywhere, and make that wrapper report the error differently.
Which is going to be my next piece of advice: don't use exceptions for local failures; they're not optimized for that.

Josep Valls 20:13
Yeah.

Dave Abrahams 20:13
Does that help?

Josep Valls 20:15
Yes; I guess I'll get more context once I've had time to process things.

Dave Abrahams 20:22
OK, we can come back to that.
There will be time for questions at the end.
Are there other hands that we should deal with?

David Sankel 20:30
We've got one more hand in the queue. Izzy?
+
Izzy Muerte 20:32
Yeah, so this isn't actually a question, just a small note that, in the same shift in philosophy people are mentioning in the chat, Python has also been moving towards the approach that we shouldn't be throwing exceptions everywhere.
In recent versions of Python they've made optimizations to the internal compiler for the CPython runtime so that it doesn't actually throw StopIteration and the other exceptions that are used for control logic; in more recent years they've discovered ways to optimize that.
So they're actually starting to shift away from it.
They can't get rid of that behavior, unfortunately, because of 30-plus years of it, but it's now the worst-case fallback for what happens in Python.

Dave Abrahams 21:26
That's good to know.
Yeah, I strongly suspect they're also not separating the bug case from the failure case.
So they're going to keep reporting invalid arguments and other bugs to you using exceptions.

Izzy Muerte 21:48
That has been discouraged for new types that go into Python's stdlib.
There are still some functions in the Python stdlib that do that, but you'll see more of, say, a TypeError if you pass the wrong number of arguments, or an AssertionError.

Izzy Muerte 22:05
Rarely these days do you get a ValueError, except from the built-in types, because they've just had those for decades at this point.

Dave Abrahams 22:11
Yeah.
So from the perspective of what I'm saying in this talk, TypeError and ValueError and all of those things are equivalent: they're exceptions thrown to indicate precondition failures, failures of the caller to do the right thing.

Izzy Muerte 22:33
Right.
That's partially a result of Python's dynamic execution and lack of static typing.

Dave Abrahams 22:39
Yeah.
It's like, you have an interactive interpreter, and so when you hit a bug you need to be able to get back to the prompt, and they use exceptions to do that.
If it were me, I would prefer that there were some parallel but different mechanism, so that I could keep the handling of those things separate, but I understand why they only have one.
OK, so, next piece of advice: don't use exceptions for local failures.
They are optimized for the pattern of handling a problem far from its source, so if you use them for local failures, that means you're going to write a lot more catch blocks, which increases the complexity of the code.
It's usually easy to tell whether a failure is local or not local; just think about what a typical client is going to have to do.
But if you're writing a function and you really can't guess whether its failure is going to be handled locally or not, maybe you should consider writing two functions: one that reports its failure using some other mechanism, and one that can call the other, so you don't need to reimplement it.
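(A sketch of that "write two functions" advice, with invented names: `parse_port` reports a local failure in its return value, and `parse_port_or_throw` is a thin wrapper for the rare caller who wants the failure to propagate nonlocally.)

```cpp
#include <charconv>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Local failure: the immediate caller usually handles "not a number",
// so report it in the return value instead of throwing.
std::optional<int> parse_port(std::string_view text) {
    int value = 0;
    auto [ptr, ec] = std::from_chars(text.data(), text.data() + text.size(), value);
    if (ec != std::errc{} || ptr != text.data() + text.size()) return std::nullopt;
    return value;
}

// Thin wrapper for callers who do want nonlocal propagation; it reuses
// the local-reporting function rather than reimplementing the parse.
int parse_port_or_throw(std::string_view text) {
    if (auto p = parse_port(text)) return *p;
    throw std::runtime_error("not a valid port: " + std::string(text));
}
```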
+OK, next: consider the performance implications of throwing.
Most languages actually aren't like this, but C++ implementations are usually biased really heavily towards optimizing the non-failure case, so that handling a failure runs one or two orders of magnitude slower than code that's not handling a failure.
And that tends to be a really great tradeoff, because it allows them to skip the explicit checks and all the branch-prediction misses and other costs associated with checking for the error case on the hot path in the code.
This is what is meant by "zero-cost exception handling," if you've heard that term.
Nonlocal failures are rare in terms of the number of instructions executed, and they don't happen repeatedly inside of tight loops.
But that also means that if you're writing a tight loop that's really on a hot path, you don't want to repeatedly throw exceptions in there and catch them.
And if you're writing a real-time system, you might really want to think twice about using exceptions at all, because it might be hard to predict the amount of slowdown in those rare cases where an exception is actually thrown.
So I have an example that I think is useful.
I was one of the founding members of Boost and was involved in the design of the Boost Graph Library, and when we were discussing that design, we realized that occasionally a particular use of a graph algorithm might want to stop early.
For example, Dijkstra's algorithm: you give it a starting point in the graph, and it tells you the paths to the other reachable points, in order from shortest to longest.
But suppose you want to find the 10 shortest paths and then stop.
The way these algorithms work, you pass them a visitor object that gets notified about results as they are discovered, so you can think of the algorithm as a loop that calls the visitor every time it finds a new path, for example.
In fact, there are lots of notification points for various intermediate conditions, not just for finding a complete path.
So if we were going to handle this early stop explicitly, we would need an explicit test in the algorithm code after each of these points in the algorithm's inner loop.
Instead of doing that, which would both make the algorithm harder to read and cost performance for branching, we decided to take advantage of C++'s bias toward optimizing the non-failure case: a visitor that wants to stop early can just throw an exception.
To be perfectly fair, I don't think we ever benchmarked the effects of this choice, so it might actually have been wrong from an optimization point of view in the end, but it was at least plausibly right; there's nothing wrong in principle with using an exception for that, if it actually gets you a performance win.
Finally, you might also need to consider development culture and the way your team uses their tools.
Some people set up their debuggers to stop whenever an exception is thrown, and if you're on a team where that's an important practice, you might need to take some extra care not to throw when there's an alternate path to success; some developers get upset when code stops in a case that will eventually succeed.
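(Going back to the Boost Graph example for a moment: this is not the actual BGL interface, just the shape of the trick in a self-contained toy. The algorithm's inner loop contains no stop-checks; a visitor that has seen enough throws, and the caller catches.)

```cpp
#include <cstdio>
#include <vector>

// Toy stand-in for a visitor-driven algorithm: it notifies the visitor
// about each result and contains no explicit "should I stop?" branches.
template <class Visitor>
void for_each_path_length(const std::vector<int>& lengths, Visitor visit) {
    for (int n : lengths) visit(n);  // the hot loop stays branch-free
}

struct enough {};  // thrown by a visitor that has seen all it needs

int main() {
    std::vector<int> lengths{3, 5, 8, 13, 21, 34};
    int count = 0;
    try {
        for_each_path_length(lengths, [&](int n) {
            std::printf("path of length %d\n", n);
            if (++count == 3) throw enough{};  // stop after the 3 shortest
        });
    } catch (enough) {
        // Early termination requested by the visitor; not an error.
    }
}
```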
+OK, so, enough about exceptions.
Finally we come to the good part; this was originally going to be the focus of the entire talk.
I want to talk about the obligations of the failing function and of its caller.
The question is: what do you put in the contract for a function that could fail, and what does each side, the caller and the callee, need to do to ensure correctness?
OK, the callee first.
First of all, there's a documentation obligation: you have to document any local failures and what they mean, because you're going to report them as part of the return value.
Nonlocal failures you want to document at their source, but not where they're just propagated from other functions that you use.
The problem is that if you document them where they're propagated, you have the same problem as if you'd included the details of the failure in the type information, which, as we talked about last time, creates a lot of churn: failure reasons that don't really change anything about the code end up forcing changes to all of your function implementations.
OK, so, in code: if you're the callee and you have any unmanaged resources you've allocated, like an open temporary file, you need to make sure those things are released.
The other example I had is the uninitialized memory that you're initializing: the lifetimes of the objects you've put into that memory are a resource, and those objects need to be deinitialized.
Now, there's an optional thing that can be really useful if you're a mutating function, and that is to consider saying that you're transactional: if there is a failure, the function has no effects.
That's often called the strong guarantee.
It can be a really useful guarantee to give when it falls out of the implementation, or at least out of an efficient implementation, but you don't want to give it if it adds performance cost.
For example, the simplest way to give a transactional guarantee on a function that mutates data is to do what I call copy-and-swap: first you make a copy of the data, you mutate the copy in place, and only when that succeeds do you swap it back into place, back into the original data.
Sure, that ends up being transactional, but you pay the cost of making a full copy of the data, and you don't want to do that preemptively, because often your caller doesn't need that strong guarantee.
And what happens when components get composed?
If all components do that copy-and-swap thing, now you have an exponential increase in cost, where at every level you're making copies.
It's sort of the same reason we don't do object-level locking for concurrency, why we don't have a mutex in every object: clients might need transactionality at a different level.
Your component might be part of a bigger component, and the client needs transactionality on that whole component, so the locking of your individual component is a waste.
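(A minimal sketch of the copy-and-swap idea just described, assuming a hypothetical `Document` value type and an `add_watermark` step that may throw; the original is replaced only after the mutation fully succeeds.)

```cpp
#include <string>
#include <utility>
#include <vector>

struct Document { std::vector<std::string> layers; };

// Any step in here may throw (e.g., on allocation failure).
void add_watermark(Document& d) { d.layers.push_back("watermark"); }

// Strong (transactional) guarantee via copy-and-swap: if add_watermark
// throws, `doc` still has its original value. Note the cost: a full copy.
void add_watermark_transactionally(Document& doc) {
    Document tmp = doc;   // 1. copy
    add_watermark(tmp);   // 2. mutate the copy; may throw, doc untouched
    using std::swap;
    swap(doc, tmp);       // 3. nothrow commit
}
```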
+To get the strong guarantee, sometimes you can do it just by reordering the operations you're performing.
For example, if you do all of the things that can fail before you make any mutations that are visible to clients, now you have the strong guarantee.
The simple version is: do all your memory allocations up front, then make changes that can't throw, and it's transactional.
So that's a useful thing to document when you can get it.
Now, the caller.
The caller's obligation is to discard any partially completed mutations to program state.
If the caller is just calling a non-mutating function and it throws, they don't have to do anything; they can just allow the failure to propagate, unless they happen to have some recovery strategy.
But I hope I already said this: having a recovery strategy is really rare.
That usually means it's a local failure and the function shouldn't have been throwing in the first place.
If you pass something to the function and the function is going to mutate that thing, you need to make sure that thing gets discarded, unless the function has given you the strong transactional guarantee, in which case it still has its original meaning and its original value.
When I say discard partial mutations to program state, we have to talk about what counts as program state.
That's data that can have an observable effect on the future behavior of your code.
For example, if you have a log file that you're just streaming information into, that doesn't count as program state, because you never read it; you never change the program's behavior based on what's gone in there.
OK, so how do you arrange to discard partially mutated state?
There's really only one strategy that scales up in practice when mutations can fail, aside from the strategy of never mutating anything, and arguably that doesn't scale up either, because of the costs of copying.
If you're not writing in a pure functional language like Haskell, which most of us aren't, you have mutation, and mutation can fail.
So how do we manage discarding these partial mutations?
Normally, the only strategy I've found that scales up is to propagate the responsibility for discarding the partial mutation all the way up to the top of the application.
And what that means is that at the top level you do have to take the copy-and-swap strategy: you're essentially going to mutate a copy of the existing data and only replace the old copy when the mutation succeeds.
But if you have a large data structure, that could be really expensive, right?
We can't afford to copy an entire Photoshop document every time we make a change.
Well, actually we can, and we do.
Why is that possible?
It's possible because Photoshop documents are essentially a persistent data structure.
"Persistent" is a confusing name, because it doesn't have anything to do with persistence in the usual sense: a persistent data structure is one where a partial mutation of a copy ends up sharing a lot of storage with the original.
In Photoshop we store a separate document for each state in the undo history, but these copies share storage for any parts that weren't mutated between revisions, and this sharing behavior falls out naturally when you compose your data structure from copy-on-write parts.
Making the copy has basically zero cost; it's about bumping a reference count.
Then, when you start to make changes, something checks the reference count and says: oh, there's more than one reference, so now I need to copy the part of the data that's changing.
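(A toy sketch of one such copy-on-write part, using `std::shared_ptr`'s reference count; `Tile` is invented for illustration, and a real system's tile storage is far more elaborate and thread-aware.)

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// A copy-on-write "part": copying the handle is cheap (bumps a refcount);
// pixel data is copied only when a shared part is about to be mutated.
class Tile {
    std::shared_ptr<std::vector<float>> pixels_ =
        std::make_shared<std::vector<float>>(64 * 64);

public:
    float read(std::size_t i) const { return (*pixels_)[i]; }

    void write(std::size_t i, float v) {
        if (pixels_.use_count() > 1)  // still shared with another revision?
            pixels_ = std::make_shared<std::vector<float>>(*pixels_);  // copy just this part
        (*pixels_)[i] = v;
    }
};
```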
+Did everybody follow that? I want to make sure.
I see there are some hands; let's hear from those.

Stephen DiVerdi 40:05
Hey, yeah, thanks Dave.
A question, and if this is harping on a previous topic, let me know and we can just skip it.
What I'm wondering is: it seems like what you just described, this mechanism of copying, mutating, and then replacing, which makes you robust to local failures, also works for being robust to detected bugs.
So I guess I still don't understand why it wouldn't be preferable to handle bugs within that same framework, mutating a copy and then replacing it in a transactional manner, instead of crashing the application.

Dave Abrahams 40:45
Well, this is a good question.
If you really have data isolation, and you know that the only thing being mutated is this copy that will be discarded, I think you might be safe to continue.
Sean?

Sean Parent 41:26
Yeah, I agree with that.
The question is: do you really have data isolation?
And my answer would be that in C++, almost certainly not.

Dave Abrahams 41:41
Yeah.

Sean Parent 41:41
Yeah.

Dave Abrahams 41:41
There's usually something being mutated that isn't just the document state, and whose mutation isn't going to get undone by discarding the partially mutated document state.
For example, you might have a queue of background operations; things get added to that queue, and we don't have a way to roll back that add when some mutation fails.

Stephen DiVerdi 42:30
OK. Thanks.

Dave Abrahams 42:38
I guess another issue is that part of the way we get this copy-on-write behavior in Photoshop is using the VM system, which is essentially a bunch of copy-on-write tiles.
So if a bug were detected in that, it would undermine the guarantees you get from having copy-on-write.
The real problem with bugs is that you can't count on the systems that normally give you this recovery property.
David Sankel?

David Sankel 43:29
Yeah, I was just going to say that if a bug is detected, all you know is that a bug was detected; you have no idea what the nature of the bug is.
It could be corrupted memory, so even if you try to take your copy-on-write data structure and discard it, the old thing could have gotten messed up somehow by some random effect.
You really know nothing about the nature of a bug when it happens, so the idea of recovering from it is not really sound, I think.

Dave Abrahams 44:07
Yeah.
I mean, we have to think about the nature of the environment in which we're running.
+If I think about how this would play out in Hylo or in Rust: provided the bug didn't occur in unsafe code, which has to be very carefully vetted, then you really know what you're mutating when you do a mutation, and so you really would know that the original state was intact.
The problem with C++ is that we don't really have those kinds of protections, and when there's a bug it very typically leads to undefined behavior, which very typically could corrupt your old state.
It's undefined behavior; in other words, it can do anything.
If you look at the C++ standard and find all of the places where it says the behavior is undefined, there are lots of them, and many apply in many, many places; there are statements like "if any argument to a standard library function violates a precondition, the behavior is undefined."
So that's the problem with the C++ environment: it really can undermine all of the guarantees that you would otherwise get from something like a copy-on-write system.
OK, moving on.
I can't see whether there are any more hands because I've got a window covering it, but I think there aren't. Good.
So, some last advice that I just added, about what to do when an assertion fires, and especially what not to do, because we see this a lot.
First of all, don't remove the assertion because the program seems to work when you take it out.
That's just the case you've tested.
What the assertion is saying, and usually it's a precondition check, is this: the owner of the function is saying, you did something for which I'm not guaranteeing any particular result; I don't know what result you should expect to get under these conditions.
So just taking it out doesn't make the program work; there are probably some effects that you aren't able to observe that put the program in a broken state.
Another thing not to do: don't go to the owner of the assertion and complain that they're crashing the program.
Remember, an assertion is a controlled shutdown in response to a detected bug.
The first thing you need to do is understand what kind of check is being performed.
If it's a precondition check in someone else's component, that's probably your bug: you're probably calling that component in the wrong way.
Another possibility is that it's a self-check, what people often called a sanity check, although we try not to use that term anymore these days, or a postcondition check.
In those cases, you want to talk to the owner of the code about why its assumptions might have been violated; that is very possibly a bug in the code that you're using.
This just reminds us why it's important to have different kinds of assertion macros or functions that tell you what their purpose is, so that when they fire, people know what to do about them.
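(A sketch of "assertions that say what kind of check they are"; the macro names are invented, and the point is only that the report tells the reader whether to suspect the caller or the component itself before the controlled stop.)

```cpp
#include <cstdio>
#include <cstdlib>

// Each check names its own kind before aborting, so whoever sees the
// report knows where to start looking.
#define PRECONDITION(cond) \
    ((cond) ? (void)0 : (std::fprintf(stderr, \
        "precondition violated (suspect the caller): %s, %s:%d\n", \
        #cond, __FILE__, __LINE__), std::abort()))

#define INTERNAL_CHECK(cond) \
    ((cond) ? (void)0 : (std::fprintf(stderr, \
        "internal check failed (suspect this component): %s, %s:%d\n", \
        #cond, __FILE__, __LINE__), std::abort()))

#define POSTCONDITION(cond) \
    ((cond) ? (void)0 : (std::fprintf(stderr, \
        "postcondition violated (suspect this component): %s, %s:%d\n", \
        #cond, __FILE__, __LINE__), std::abort()))

int divide(int num, int den) {
    PRECONDITION(den != 0);
    int q = num / den;
    POSTCONDITION(q * den + num % den == num);
    return q;
}
```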
+And my last bit of advice: you probably don't want to use the same functions you use for checking preconditions as the assertions in your unit tests.
One reason: typically, when a unit test failure occurs, you don't go ahead using the same data; you throw it out and go on to another test, and the other tests use fresh data, so the failure hasn't invalidated the rest of your testing.
People would like to hear about all of the test failures rather than just the first one, so assertions that exit the program aren't really appropriate there.
You want a different suite for doing those kinds of checks.
And that's all I've got for you; I'm ready to open the floor to questions.

David Sankel 50:11
Folks can go ahead and put your hands up if you would like to ask a question. Build a queue.

Dave Abrahams 50:22
I have the feeling that I didn't quite adequately deal with everybody's questions that came up during the talk, so I'm happy to revisit those.
Got one hand?

David Sankel 50:36
Philip, go ahead.

Philip Levy 50:38
I'd like to go back to a comment you made about the Boost Graph Library and raising exceptions to terminate it.
You were pondering whether that was actually a good thing to have done, based on performance, and I was wondering: is the notion that the fact that a visitor could raise an exception affects the performance of the execution of the non-exceptional cases, or just the cost of terminating the algorithm by raising that one exception?

Dave Abrahams 51:21
OK, I'm going to try to answer your question as I understand it.

Philip Levy 51:28
Well, let me just clarify a little bit.
My expectation would be that raising an exception to terminate the algorithm wouldn't affect the performance of the execution of the algorithm; the termination is a one-time thing versus the many thousands of nodes you may be looking at, and so I was wondering why you were pondering that.

Dave Abrahams 51:47
Right; that's the tradeoff we thought we were making.
Philip, yes, that's the tradeoff we thought we were making: because C++ biases in favor of the straight-line code, we thought this would be a good optimization.
My reason for questioning it is that I don't think we ever actually did any measurements. That's all.

Philip Levy 52:16
OK, alright.
So it's an unknown, but there's no reason to believe it would be a problem.

Dave Abrahams 52:22
Right, that's correct.

Philip Levy 52:24
OK. Thank you.

Dave Abrahams 52:27
I suppose if these graph algorithms were themselves used in tight loops on small problems, where the amount of straight-line execution was low and you were throwing exceptions to terminate, that would be bad: the case where the algorithm is used repeatedly. Go ahead.

Sean Parent 52:48
I think the other thing, Dave, is that there was an assumption that the checks at each node, to see whether termination was requested, would be expensive, and on modern hardware it probably costs you something, but it's a little hard to say.

Dave Abrahams 53:13
Yeah, I mean, it's really hard to say without measuring.

Sean Parent 53:14
You'd have to test it.

Dave Abrahams 53:17
That's pretty much always the case for performance.
There's a solid argument that the functions on visitors are usually inlined, and when all of those intermediate visit points are no-ops, the compiler can see it, and then it could skip the checks.
+So like you know, the lesson is always measured before you make conclusions about performance. From 661e6cf60b87c8309f26d65e4b58740775a68287 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 30 Sep 2025 14:41:17 -0700 Subject: [PATCH 03/41] WIP --- better-code/src/chapter-3-errors.md | 75 +++++++++++++++++------------ 1 file changed, 45 insertions(+), 30 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index d8a6f3d..93eea59 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -44,21 +44,21 @@ We'll divide errors into three categories: > though its preconditions were satisfied. For example, writing a > file might fail because the filesystem is full. -[^avoidable]: Although “bugs” are inevitable, every *specific* bug is +[^avoidable]: While bugs are inevitable, every *specific* bug is avoidable. ## Error Recovery Let's begin by talking about what it means to “recover from an error.” -Perhaps the [earliest use of the -term](https://dl.acm.org/doi/10.1145/800028.808489) was in the domain -of compilers, where the challenge, after detecting a flaw in the -input, is to continue to process the rest of the input meaningfully. -Consider a simple syntax error: the simplest possiblities are that the -next or previous symbol is extra, missing, or misspelled. Guessing -correctly affects not only the quality of the error message, but also -whether further diagnostics will be useful. For example, in this code, -the `while` keyword is misspelled: +[Perhaps the earliest use +](https://dl.acm.org/doi/10.1145/800028.808489) of the term “error +recovery” was in the domain of compilers, where the challenge, after +detecting a flaw in the input, is to continue to process the rest of +the input meaningfully. Consider a simple syntax error: the simplest +possiblities are that the next or previous symbol is extra, missing, +or misspelled. Guessing correctly affects not only the quality of the +error message, but also whether further diagnostics will be +useful. For example, in this code, the `while` keyword is misspelled: ```swift func f(x: inout Int) { @@ -91,29 +91,44 @@ far. “Making sense” is necessarily a subjective judgement, so examples are called for. - The initial state of a compiler, before it has seen any input, - certainly meets the compiler's invariants. But when an error is encountered, - resuming with that state would ignore the context seen so far that - can help inform further diagnostics. If the following text did not - match what is expected at the beginning of a source file, it would - be flagged as an error. - -- If we have an error while we're applying a blur to some image, it's not enough that the users document is still a well formed file, right? -It also can't have some random or half finished changes that they didn't request. - -So that's that would be that would be very scathed, OK. -OK, so let's talk about recovering from a bug. -So what would that mean? -Well, first it is sumes that you had some way to detect the bug, right? -And not all bugs are detectable, but let's assume that this one is. -So an example of a nondetectable bug is you are trying to sort something, but you're but you're comparison function returns random results. -So, so that doesn't satisfy the requirements for the the sorting function. -It's a precondition that that there's no way to actually check for. -OK. -Uh, so, uh. 
-So let's assume that we have a detectable bug, and usually that means some somebody's checking a precondition and that precondition check fails. + certainly meets its invariants. But when an error is + encountered, resuming with that state would ignore the context seen + so far that can help inform further diagnostics. If the following + text did not match what is expected at the beginning of a source + file, it would be flagged as an error. We the error might, for + example have been detected in some deeply (correctly) nested + construct. If that state isn't preserved, each closing delimiter of + that construct will be flagged as a new error. + +- In a desktop graphics application, it's not enough that upon error + (say, file creation fails), the user has a well-formed document; an + empty document is not an acceptable result. Leaving them with a + well-formed document that is subtly changed from its state before + the error would be especially bad. + +These examples show that even if invariants are upheld, a program can +be very scathed indeed. + +### What About Recovery From Bugs? + +We've just seen an examples of recovery from an input error and a failure. +What would it mean to recover from a bug? + +First, the bug needs to be detected. As we saw in the previous +chapter, not all bugs are detectable. Also, it's important to admit +that when a runtime bug check fails, we're not detecting the bug +per-se: since bugs are flaws in *code*, finding bugs involves +analyzing the program. We're really detecting a *downstream effect* +that the bug has on *data*, akin to the way physicists conclude from +cosmic microwave background radiation that the universe started with a +big bang. We know something happened, but we don't know exactly where, +how or why. + +Assuming we have a detectable bug, usually that means somebody's checking a precondition and that precondition check fails. And that means there's a bug in the collar that caused them to pass an invalid argument. So when that happens though, you're not really detecting the bug itself. You're detecting one of its symptoms like some kind of a cosmic echo. + The bug itself occurred some indefinite point before that. Right then, there's a series of logical conclusions that the the code may have made about what it had that are incorrect. That led it to produce this input that you you see doesn't satisfy (preconditions,. From 81f3f448e2ba828880534f48f0dccd0395a315d9 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 30 Sep 2025 16:27:09 -0700 Subject: [PATCH 04/41] WIP --- better-code/src/chapter-3-errors.md | 15 ++------------- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 93eea59..08614bb 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -106,9 +106,6 @@ are called for. well-formed document that is subtly changed from its state before the error would be especially bad. -These examples show that even if invariants are upheld, a program can -be very scathed indeed. - ### What About Recovery From Bugs? We've just seen an examples of recovery from an input error and a failure. @@ -124,17 +121,9 @@ cosmic microwave background radiation that the universe started with a big bang. We know something happened, but we don't know exactly where, how or why. -Assuming we have a detectable bug, usually that means somebody's checking a precondition and that precondition check fails. 
-And that means there's a bug in the collar that caused them to pass an invalid argument. -So when that happens though, you're not really detecting the bug itself. -You're detecting one of its symptoms like some kind of a cosmic echo. +So can we “sally forth unscathed?” The problem is that you can't +know. The downstream effects of ###### STOPPED HERE ####### -The bug itself occurred some indefinite point before that. -Right then, there's a series of logical conclusions that the the code may have made about what it had that are incorrect. -That led it to produce this input that you you see doesn't satisfy (preconditions,. -OK. -So can you Sally forth on scathed? -Well, the problem is you don't know, right? Because of the bug, your program state could be very, very scathed indeed. Umm. From 919822ed3a15be639ec10a2210a19c09bdc1b0d3 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 5 Nov 2025 14:53:24 -0800 Subject: [PATCH 05/41] Errors WIP --- better-code/src/chapter-3-errors.md | 158 +++++++++++++++++----------- 1 file changed, 97 insertions(+), 61 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 08614bb..d827b08 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -111,57 +111,63 @@ are called for. We've just seen an examples of recovery from an input error and a failure. What would it mean to recover from a bug? -First, the bug needs to be detected. As we saw in the previous -chapter, not all bugs are detectable. Also, it's important to admit -that when a runtime bug check fails, we're not detecting the bug -per-se: since bugs are flaws in *code*, finding bugs involves -analyzing the program. We're really detecting a *downstream effect* -that the bug has on *data*, akin to the way physicists conclude from -cosmic microwave background radiation that the universe started with a -big bang. We know something happened, but we don't know exactly where, -how or why. +First, the bug needs to be detected, and that is not assured. As we +saw in the previous chapter, not all precondition violations are +detectable. Also, it's important to admit that when a runtime bug +check fails, we're not detecting the bug per-se: since bugs are flaws +in *code*, actually detecting bugs involves analyzing the program. +We're really detecting a *downstream effect* that the bug has on +*data*, like some kind of cosmic echo. We know something happened, +but we don't know exactly where, how or why. So can we “sally forth unscathed?” The problem is that you can't -know. The downstream effects of ###### STOPPED HERE ####### +know. The downstream effects of the problem could have affected many +things you didn't test for, and you can't test for everything, or your +code would spend more time on that than on fulfilling its purpose. +Because of the bug, your program state could be very, very scathed +indeed. -Because of the bug, your program state could be very, very scathed indeed. -Umm. +Sallying forth at this point is a terrible idea. +- First, there are effects in the outside world. -Umm OK? -OK, so your program state if it's scathed selling 4th at this point is a terrible idea for lots of reasons. -So there's sort of two categories. -First, there are effects in the outside world. -I don't know. + - so the users data might be corrupted, right? + And they might say that that way and they'll lose the last good state they had. + So that's that's pretty serious. 
+ + - if you've done a security evaluation, the assumptions that + underlie that evaluation might be violated. So by continuing, you + may be opening a security hole + + - You also can't detect whether you've recovered correctly. There's + there's nothing to look at and the penalties that we just talked + about for failure to do it correctly are really, really high. + +- then there's also the impact on the development process. + + - if you Sally forth the bug is gonna be masked and we'll never get + fixed until at some point, you know, somebody will observe the + effects of this + + - It's gonna affect your, your customers and your and if it you know + when it affects the really important customer, your management may + insist that you do something about it. + + - You didn't detect the bug. You don't have a detection of the bug. + You have some very distant echo in the users document that's + corrupted and now now it's a long process to, you know, try to + figure out where that corruption came from. + + - most code is correct, so you're “bug,” recovery code will never + run. it certainly isn't gonna get tested. if it got tested, you're + gonna fix the problem, so now you're shipping a lot of code that's + just protection against future programming mistakes. + + All of this recovery code bloats your program and every single line + is a liability with no offsetting benefits. + +### Actual Bug Recovery -Uh, so the users data might be corrupted, right? -And they might say that that way and they'll lose the last good state they had. -Right. -So that's that's pretty serious. -The other thing is, if you've done in a security evaluation, the assumptions that underlie that evaluation might be violated. -So by continuing, you may be opening a security hole and so it like sort of to sum up, you don't have enough information about the state of your system to do a recovery to to Sally forth reliably. -And you can't. -You also can't detect whether you've. -Recovered correctly, right? -There's there's nothing to look at and the penalties that we just talked about for failure to do it correctly are really, really high. -OK. -So that's one category, but then there's also the impact on the development process. -So if you Sally forth the bug is gonna be masked and we'll never get fixed until at some point, you know, somebody will observe the effects of this. -It's gonna affect your, your customers and your and if it you know when it affects the really important customer, your management may insist that you do something about it. -Right now all you'll have is evidence you don't remember. -You didn't detect the bug. -You don't have a detection of the bug. -You have some very distant echo in the users document that's corrupted and now now it's a long process to, you know, try to figure out where that corruption came from. -Right. -You're you've you've gotten the information very, very late. -Last of all, most code is correct, so you're “bug,” recovery code will never run. -Probably it certainly isn't gonna get tested. -I if it got tested, you're not gonna ship the tested one because you're gonna fix the. -You're gonna fix the problem right? -All of this recovery code bloats your program and every single line is a liability with no offsetting benefits. -So. -Yeah, I think this is. -This is an interesting insight. I mean, they're do exist robust systems, right? So they they can recover from bugs. How do they do that? @@ -170,12 +176,15 @@ It's almost always basically always. It's outside the process, right? 
Maybe the robustness of the system comes from redundancy. You have you have three different processes and they all vote on the result. -The like this is the kind of thing you might see in like the F22 Joint Strike Fighter, right? +The like this is the kind of thing you might see in like the F22 Joint Strike Fighter. So yeah, there could be a bug. First of all, they you know they check the code a lot more carefully than we do, but but they also put in safeguards in place so that so that if you know you have three systems voting on the result and one disagrees, you can kill that process and start it up again. Umm. So yeah, sometimes it's possible to design a system to recover from books, but don't expect to do it in in your process. To sum up, uh in general you can't recover from bugs and it's a bad idea to try. + +### Correct in-process response to bugs + So what can you do? Well, the way to handle bugs is to stop the program before any more damage is done and generate a crash report for debuggable image that captures as much information as you possibly can about the state of the program. So there's a chance of fixing the bug. @@ -183,6 +192,8 @@ Umm, be there might be some small emergency shutdown procedure. You might need to perform like saving information about the failing command so your application can offer to retry it for you when you restart it. Were you? You know, maybe you can say something to the user about the reason that you're exiting. + + So this is bad, right? This is really bad if if you don't do something, really go out of your way to do something about it, it's gonna be experienced as a crash by the users, but it's the only way to prevent much worse consequences of a botched recovery attempt. Remember the chances of battery are really high because you don't have enough information to do it reliably. @@ -192,6 +203,9 @@ It's not going to slip by those people unnoticed and then hit your customers in So you can though, mitigate this experience of of crashing right? For example, you could say something to the user about the reasons that you're exiting, and you can actually make it sound pretty responsible. So. So this is important. + +### Embracing Early Termination + You know, a lot of people have a hard time accepting the idea of voluntarily crashing or exiting right? Exiting early is really what that should say, but you know we should face it. You're bug detection isn't the only reason that the program might exit early, right? @@ -206,13 +220,15 @@ Umm so. In fact, there are platforms that actually force you to live under constraint of, you know, no early exit, right. So on an iPhone or iPad, for example, to save battery and keep your foreground apps responsive, the OS might kill your process anytime it's in the background. But it's going to make it look to the user like the the app is still running and when the user switches back, every app is supposed to complete the illusion by coming back up in the same state it was killed in. -I can tell you that as a user, it's really jarring when you encounter an app that doesn't do that, right? +I can tell you that as a user, it's really jarring when you encounter an app that doesn't do that correctly? So the point is resilience to early termination is something that you can and should design into the system. So Photoshop uses a variety of strategies for this, so we already we always save documents into a new file and then atomically swap that file into place only after the save succeeds. 
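(A minimal sketch of that save-to-a-new-file-then-swap-into-place pattern, assuming POSIX `rename` semantics, under which the replacement is atomic; the invented `serialize` stands in for whatever actually writes the document.)

```cpp
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>

void serialize(std::ofstream& out) { out << "document contents\n"; }  // stand-in

// Write to a temporary, then rename it over the old file, so a failure
// mid-save can never leave a half-written document behind.
void save_document(const std::string& path) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        serialize(out);
        if (!out.flush()) throw std::runtime_error("write failed: " + tmp);
    }
    if (std::rename(tmp.c_str(), path.c_str()) != 0)
        throw std::runtime_error("could not replace " + path);
}
```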
So we never crash, leaving some half written corrupted document on disk, right? We also periodically save backups so you only had Most lose the last few minutes of work, but we could be more ambitious about this, right? We, if we needed to tighten that up, we could maybe save a record of changes since the last fall back backup. -OK. + +### Assertions + Umm, so the usual mechanism that we have for terminating a program when a bug is detected is called an assertion, and traditionally it's spelled, you know something like this and this spelling comes from C and C++. If you're programming in in some other language, you probably have something similar and the the facility from C is pretty straightforward. Either it's disabled, in which case it generates no code at all, even the. @@ -230,6 +246,7 @@ Yeah, I should have. I meant to to make this distinction earlier, right? Exiting because of an assertion is not a crash, right? This is a controlled stop for calculated reasons. + Umm so but the problem with leaving assertions on and release is that some checks are too expensive to ship. And let's be honest, a lot of programmers are gonna go with their gut about what's too expensive instead of measuring. So we really need a second expensive assert, right? @@ -260,6 +277,9 @@ Because there are terminate handlers and those would run. That gives you a chance to do some origin. See shut down measures. So that's another reason to engineer your own assertions, even if you're only engineering one. + +### Fighting For the Right To Die. + OK, so at this point somebody always asks, but you know I I'm not allowed to terminate. My manager says that that we have to keep running no matter what. Right. @@ -272,23 +292,39 @@ And if you lose that fight today, right. You wanna keep fighting, but in the meantime, fail as noisily as possible, preferably by when at least when you're not shipping the code, get it to terminate right and also set yourself up to deal with the day that that you win the fight because at some point the cost of of following this possible this policy are gonna become obvious. And so that means use a suite of assertions that, well, today they don't terminate, but you can change their behavior when you do win the fight, OK. +## Failures + I I don't know if we're going to get to the end because of the scope expansion anyway, so as much as we all love talking about bugs, it's time to leave bugs behind and talk about failures. +> - **Failure**: a function could not fulfill its postconditions even +> though its preconditions were satisfied. For example, writing a +> file might fail because the filesystem is full. + + So let's say you identify a condition where your function is unable to fulfill its primary purpose, so that can occur in one of two ways. -Either something you're function calls has a precondition that you can't be sure you're prepared to satisfy, or something you're function calls. Itself. -Reports the failure to you so usually have two choices at this point. -So one is you can say that your inability to make progress reflects a bug in the caller, right? -You can make not XD be a precondition of your function or you can make X failure right, which means that all of the code in the system is correct. -Umm, that's counterintuitive, but you should actually always prefer to classify that situation as a bug in the caller, as long as it satisfies the criteria for acceptable (preconditions,. -So there there are a few things you need to satisfy, right? 
-It needs to be possible for the caller to ensure the condition, right? -There's no way for the caller to ensure there's enough disk space to save a file, because other processes can come and use up any space that might have been free before the call. -So you can't make there's enough disk to save a precondition. -The the other way in which something might not be a suitable precondition is if it takes as much work for the caller to ensure it as the work you're gonna do in in performing the operation in the end anyway. -So for example, if if they're deserializing a document, umm and you find that it's corrupted, you can't make it a precondition that the file is well formed, because determining whether it's well formed or not is the same work that as doing the deserialization so. + +Either + +1. something your function calls has a precondition that you can't be sure you're prepared to satisfy, or +2. something your function calls. Itself reports a failure. + + +you so usually have two choices at this point. + +1. Make it a precondition violation: your inability to make progress reflects a bug in the caller +2. Make it a failure, which means that all of the code in the system is correct. + +It's counterintuitive, but you should actually always prefer to classify that situation as a bug in the caller, as long as it satisfies the criteria for acceptable preconditions: + +- it must be possible for the caller to ensure the condition. There's no way for the caller to ensure there's enough disk space to save a file, because other processes can come and use up any space that might have been free before the call. So you can't make there's enough disk to save a precondition. +- The the other way in which something might not be a suitable precondition is if it takes as much work for the caller to ensure it as the work you're gonna do in in performing the operation in the end anyway. So for example, if if they're deserializing a document, umm and you find that it's corrupted, you can't make it a precondition that the file is well formed, because determining whether it's well formed or not is the same work that as doing the deserialization so. + OK, so prefer to make it a precondition, but. -If you can't satisfy a post condition and you're incorrect code, you're in correct code. -That's a failure. + +If you can't satisfy a post condition and you're in correct code, That's a failure. + +> **Definition** + So why am I tying this definition to postconditions, other than to bind our understanding of Error Handling to under to the way we understand correctness? That's a valuable thing, but there's there are more reasons. So first of all, it's simplifies and improves understandability of contracts. From 2ad2aef454d3a13600c9f3232657c65e8d3d92e9 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 12 Nov 2025 09:55:38 -0800 Subject: [PATCH 06/41] Recoverd talk notes --- better-code/src/chapter-3-errors.1.md | 423 ++++++++++++++++++++++++++ 1 file changed, 423 insertions(+) create mode 100644 better-code/src/chapter-3-errors.1.md diff --git a/better-code/src/chapter-3-errors.1.md b/better-code/src/chapter-3-errors.1.md new file mode 100644 index 0000000..23ca2ea --- /dev/null +++ b/better-code/src/chapter-3-errors.1.md @@ -0,0 +1,423 @@ +# Better Code: Errors + +So we're going to talk about errors and handling them. + +So what's an error? + +## Words + +When talking about anything, I like to start out by trying to define it, and reading existing definitions is usually a good way to start. 
After all, programming is about communication, and if we want to communicate effectively we should use words in the expected ways. + +Normally when I've done a version of this talk it's been a very interactive, in-person experience: I ask the audience for their definitions and we write them all on a board and then dissect them. I don't think that's going to work in this context, so instead I asked the web. + +That exercise was very revealing, and actually changed my mind about the meaning of error and the overall scope of the presentation. So let's review what I found out. These are roughly the top answers Google gave me when I asked it to define “error” and “error handling.” Aside from Wikipedia, I was surprised at some of the hits it chose, but if you don't like them you can take it up with Google. I feel pretty confident that these results reflect the way people talk about errors. + +### Definitions + +Wikipedia: + +An error (from the Latin errāre, meaning 'to wander'[1]) is an inaccurate or incorrect action, thought, or judgement.[1] + + +In statistics, "error" refers to the difference between the value which has been computed and the correct value.[2] An error could result in failure or in a deviation from the intended performance or behavior.[3] + +In human behavior the norms or expectations for behavior or its consequences can be derived from the intention of the actor or from the expectations of other individuals or from a social grouping or from social norms. (See deviance.) Gaffes and faux pas can be labels for certain instances of this kind of error. More serious departures from social norms carry labels such as misbehavior and labels from the legal system, such as misdemeanor and crime. Departures from norms connected to religion can have other labels, such as sin. + +In science and engineering in general, an error is defined as a difference between the desired and actual performance or behavior of a system or object. + +Engineers seek to design devices, machines and systems and in such a way as to mitigate or preferably avoid the effects of error, whether unintentional or not. Such errors in a system can be latent design errors that may go unnoticed for years, until the right set of circumstances arises that cause them to become active. Other errors in engineered systems can arise due to human error, which includes cognitive bias. Human factors engineering is often applied to designs in an attempt to minimize this type of error by making systems more forgiving or error-tolerant. + + + +Error Message: + +An error message is the information displayed when an unforeseen problem occurs, usually on a computer or other device. Modern operating systems with graphical user interfaces, often display error messages using dialog boxes. Error messages are used when user intervention is required, to indicate that a desired operation has failed, or to relay important warnings (such as warning a computer user that they are almost out of hard disk space). + +Lenovo: + +Computer error refers to a mistake or malfunction that occurs within a computer system, leading to unexpected or incorrect behavior. +Computer Hope: + +An error describes any issue that arises unexpectedly that cause a computer to not function properly. + +Vocabulary.com:Definitions of computer error +noun (computer science) the occurrence of an incorrect result produced by a computer +Toppr.com + +An error in computer data is called Bug. 
+ +A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. + +https://textexpander.com/blog/most-common-programming-errors: +The 7 Most Common Types of Errors in Programming and How to Avoid Them +Syntax Errors +Logic Errors +Compilation Errors +Runtime Errors +Arithmetic Errors +Resource Errors +Interface Errors + +Techopedia: + +What Does Error Handling Mean? +Error handling refers to the response and recovery procedures from error conditions present in a software application. In other words, it is the process comprised of anticipation, detection and resolution of application errors, programming errors or communication errors. Error handling helps in maintaining the normal flow of program execution. + +There are four main categories of errors: + +Logical errors +Generated errors +Compile-time errors +Runtime errors + +dremio.com: + +Error Handling refers to the process of detecting, managing, and resolving errors and exceptions that occur during data processing and analytics. It involves implementing mechanisms and strategies to handle unexpected events and ensure data integrity and reliability. + + +OK, so in this text I want to highlight four things: + +First, a lot of it, all this red stuff, is about bugs. If you happened to read the abstract blurb that we used in the talk announcement, you know it said we'll clearly define “error” distinct from “bug,” but these results force me to admit that error usually means bug, and if I want to talk about non-bugs I might need to find a different term. It also convinced me that in a talk about error handling you can't avoid the topic of how to deal with bugs. So we're going to talk about all kinds of errors, both bugs and the other kinds. + +Since I love defining things, I'm going to take this opportunity to define “bug” as an avoidable coding error. + +Statistically, bugs may be inevitable +but +Every individual bug is avoidable. + +Which is a good thing, because you can't really plan for bugs; they could be anywhere. That's why you see the word “unexpected” come up a lot in that red text. + +Second, in a couple of places I colored green, people are talking about things that definitely aren't bugs, like resource allocation failure. If I run out of space on the disk when I'm trying to save a document, that's not a bug. +Maybe it's rare, but you can predict that it will happen sometimes, and you know exactly where in your code it can happen, so you can plan a response for it. These non-bugs are what I used to call “errors” and had intended to be the sole topic of this talk. Let's call them failures. They represent a failure—sometimes temporary—of the code to achieve its primary intent. + +The blue highlight talks about errors due to cognitive bias, a very AI-forward concern. Is that a bug? I'm not sure cognitive bias is avoidable. So I guess I'd go with not-a-bug. However, as far as I know it's not an event; it's a property of the code and/or dataset, so it's really in its own category. + +Finally, these words in yellow talk about recovery, resolution, and maintaining data integrity. How you achieve that is going to be important. 
+ +So there are three important parts to this picture: + +Bugs +Failures (non-bugs, predictable obstacles) +Recovery and Integrity + +## Recovery + +So what do we mean by “recovery?” When I ask the web, most of the hits define error recovery in terms of what a parser does when it hits a syntax error in your code. + +int main() { + int x = 4 + // ^---- error: expected ';' at end of declaration + f(x); + f(x x); + // ^--------- error: expected ')' +} + +Let's say you left out a semicolon. The parser could just stop there and issue one diagnostic about the missing symbol, if that's the only possibility in that syntactic position. But most programming language parsers don't do that (even though I often wish they would). They want to give me all the potentially-useful diagnostics about errors in the rest of my code. If the parser just starts over, discarding its state and pretending the location of the error is the beginning of the file, I'm going to get lots of bogus error messages. That's a pretty poor recovery because although the program continues, it's doing something that almost certainly doesn't make sense. + +x.cpp:1:3: error: unknown type name 'f' + f(x); + ^ +x.cpp:2:5: error: unknown type name 'x' + f(x x); + ^ +x.cpp:2:3: error: a type specifier is required for all declarations + f(x x); + ^ +x.cpp:4:1: error: extraneous closing brace ('}') +} +^ + + +So instead parsers typically try to “recover” by pretending I had written something correct. In this case it injects a phantom semicolon and continues. So as a first cut, let's say recovery is continuing to execute, doing sensible work. But I really like this quote from a stack overflow answer: + +https://stackoverflow.com/a/38387506/125349 + +... i.e.: "to sally forth, entirely unscathed, as though 'such an inconvenient event' never had occurred in the first place." + +By “unscathed” they mean that the program state is intact: not only are the invariants upheld, but the state makes sense given the inputs the program has received. If we have an error while applying a blur, it's not enough that the user's document is a well-formed file; it also can't have some random or half-finished changes they didn't ask for. + +## Recovery from bugs? + +OK, so let's talk about recovering from a bug. What would that mean? +Well, first, it assumes you have some way to detect the bug; not all bugs are detectable, but let's assume this one is. Typically that means some precondition check fails: there's a bug in the caller that caused them to pass an invalid argument. + +When that happens, you're not really detecting the bug, you're detecting one of its symptoms, like a cosmic echo. The bug itself occurred at some indefinite point before that. So can you ”sally forth unscathed?” The problem is, you don't know. Because of the bug, your program state could be very, very scathed indeed. + +Sallying forth at this point is a terrible idea, for so many reasons. First there are effects in the outside world: +- The user's data might be corrupted and they might save it that way, losing the last good state they had. +- The assumptions underlying any security evaluation you did may be violated, so you could be opening a security hole. +- You don't have enough information about the state of your system to do it reliably, you can't detect whether you've done it correctly, and the penalties we just discussed for failure to do it correctly are astronomical. 
+ + +Continuing in the face of a known bug also has a terrible impact on the development process: +- The bug will be masked and will never get fixed… +- …until one day we're about to lose an important customer base because of that corruption. And then you might spend weeks hunting the bug down because the customer sees a much more distant echo of the bug than the earlier echo your code detected. +- Most code is correct, so most of your bug-recovery code will never run. It certainly won't be tested. All this recovery code bloats your program and every line is a liability with no offsetting benefits. + +Some systems can recover from bugs (e.g. redundant ones). Processes can't recover. + +To sum up, in general you can't recover from bugs, and it's a bad idea to try. So what can you do? + +## Handling bugs + +You can stop the program before any more damage is done, and generate a crash report or debuggable image that captures as much information as is available about the state of the program, so there's a chance of fixing the bug. Maybe there's some small emergency shutdown procedure you need to perform, like saving information about the failing command so the application can offer to retry it for you when you restart it. + +Let me be clear: THIS IS BAD. It could be experienced as a crash by users. +But it's the only way to prevent the much worse consequences of a botched recovery attempt. Remember, the chances of botchery are high because you don't have enough information to do it reliably. +Upside: it will also be experienced as a crash by developers, QE teams, and beta testers, giving you a chance to fix the bug. + +*** You can mitigate the experience of crashing *** +*** Don't tell me my assertion is a crash *** +*** An assertion is a controlled shutdown *** + +A lot of people have a hard time accepting the idea of voluntarily terminating, but let's face it: your bug detection isn't the only reason the program might suddenly stop. You can crash from an undetected bug. Or a person can trip over the power cord. You should design your software so that these bad things are not catastrophic. + +*** In fact you could be more ambitious and try to make it really seamless. You have to accept this is part of the UX package to even take this on. *** + +In fact some platforms force you to live under a similar constraint. On an iPhone or iPad, for example, to save battery and keep foreground apps responsive, the OS may kill your process any time it's in the background, but will make it look to the user like it's still running. When the user switches back, every app is supposed to complete the illusion by coming back up in the same state it was killed in. I can tell you as a user, it can be really jarring when you encounter an app that doesn't do it right. The point is, resilience to early termination is something you can and should design into the system. + +For example, Photoshop uses a variety of strategies: we always save documents into a new file and atomically swap it into place only after the save succeeds, so we never leave a half-saved document on disk. We also periodically save backups so at most you only lose the last few minutes of work. If we needed to tighten that up we could, by saving a record of changes since the last full backup. + +## Assertions + +The usual mechanism for terminating a program when a bug is detected is called an assertion and traditionally it spelled something like this: + + assert(n >= 0); + +This spelling comes from C and C++. 
If you're programming in another language, you probably have something similar.

The C assertion is pretty straightforward: either it's disabled, in which case it generates no code at all—even the check is skipped—or it does the check and exits immediately with a predefined error code if the check fails, usually printing a message containing the text of the failed check and its location in source.

Debuggers will commonly stop at the assertion rather than exiting, and even if you're not running in the debugger, on major desktop OSes you'll get a crash report with the entire program state that can be loaded into a debugger. So this is great for catching bugs early, before they get shipped, provided people use it.

Projects commonly disable assertions in release builds, which has the nice side-effect of making programmers comfortable adding lots of assertions, because they know they won't slow down the release build. And more bugs get caught early.

But unless you really believe you're shipping bug-free software, you might want to leave most assertions on in release builds. In fact, the security of your software might depend on it. If you're programming in an unsafe language like C++, opportunities to cause undefined behavior are all around you. When you can assert that the conditions for avoiding undefined behavior are met before executing the dangerous operation, the program will come to a controlled stop instead of opening an arbitrarily bad security hole.

The problem with leaving assertions on in release is that some checks are too expensive to ship. And let's be honest: many programmers will go with their gut, instead of measuring, when making that determination. We really need a second assertion, expensive_assert(), that's only on in debug builds, so we continue to catch those bugs early.

There's another problem with having just one assertion: it doesn't express sufficient intent. For example, it might be a precondition check, or the asserting function's author might just be double-checking their own reasoning. When these two assertions fire, the meaning is very different: the first indicates a bug in the caller, the other a bug in the callee. So I really want separate precondition and self_check functions.

If I'm writing in a safe-by-default language like Rust or Swift, the checks that prevent undefined behavior, like array bounds checks, are special: I can afford to turn off all the other checks in shipping code, but these are the checks that uphold the safety properties of my system; turn them off and those properties are compromised. So I want a different assertion for these checks, even if I don't ever anticipate turning off the other ones in a shipped product. These are the ones we can't delete from the code. I might also want to turn the other assertions off locally to measure how much overhead they are incurring.

I hope you get the idea. I'm not going to prescribe the exact set of assertion facilities your project needs, but a carefully engineered suite of these functions, with properties appropriate to your project, is part of a comprehensive strategy for dealing with bugs. If you haven't got one, go design it; a sketch of what one might look like appears at the end of this section.

One last point about the C++ assert: it's better than nothing, but because it calls abort(), there's no place to put emergency shutdown measures. You can't even display a message to the user, so to the user it will always feel like a hard, unceremonious crash. You probably want failed assertions to call terminate() instead, because that allows terminate handlers to run. So that's another reason to engineer your own assertions, even if you build just one.
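Here is the promised sketch, written in Swift purely for illustration. Nothing about it is a prescription: the names (`require`, `selfCheck`, `expensiveCheck`), the policy switch, and the `DEBUG` compilation condition are all assumptions I've made for the example, and a real project would wire the failure handler into its own logging and shutdown machinery.

```swift
/// What a failed check should do. The non-terminating case exists only
/// for teams that have (so far) lost the fight for the right to die.
enum CheckFailurePolicy {
    case terminate       // controlled stop, with room for shutdown work
    case logAndContinue  // fail as noisily as the policy allows
}

/// Flip this when you win the fight.
var checkFailurePolicy = CheckFailurePolicy.terminate

/// Central handler: the place for emergency shutdown measures (saving a
/// retry record, explaining the exit to the user) before stopping.
func checkFailed(_ kind: String, _ message: String,
                 file: StaticString, line: UInt) {
    let report = "\(kind) failed at \(file):\(line): \(message)"
    switch checkFailurePolicy {
    case .terminate: fatalError(report)   // a controlled stop, not a crash
    case .logAndContinue: print(report)   // route to logging/telemetry instead
    }
}

/// A failed precondition check means the *caller* has a bug.
func require(_ condition: @autoclosure () -> Bool,
             _ message: @autoclosure () -> String = "precondition violated",
             file: StaticString = #file, line: UInt = #line) {
    if !condition() { checkFailed("Precondition", message(), file: file, line: line) }
}

/// A failed self-check means the *callee* (this code) has a bug.
func selfCheck(_ condition: @autoclosure () -> Bool,
               _ message: @autoclosure () -> String = "internal check failed",
               file: StaticString = #file, line: UInt = #line) {
    if !condition() { checkFailed("Self-check", message(), file: file, line: line) }
}

/// A check too expensive to ship; compiled only into debug builds.
func expensiveCheck(_ condition: @autoclosure () -> Bool,
                    _ message: @autoclosure () -> String = "expensive check failed",
                    file: StaticString = #file, line: UInt = #line) {
    #if DEBUG
    if !condition() { checkFailed("Expensive check", message(), file: file, line: line) }
    #endif
}

// When these fire they mean different things: `require` points at the
// caller's bug; `selfCheck` points at mine.
func nthSmallest(_ n: Int, in values: [Int]) -> Int {
    require(0 <= n && n < values.count, "n out of range")
    let ordered = values.sorted()
    selfCheck(ordered.count == values.count)
    return ordered[n]
}
```

The exact spellings matter less than the separation of intent: which side has the bug, which checks uphold safety, and which are too expensive to ship.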
## What if you're not allowed to terminate?

- Fight for the right (to terminate). If the system is critical, advocate creating a recovery system outside the process.
- If you lose today, fail as noisily as possible, preferably by terminating in non-shipping code.
- Keep fighting.
- Be prepared to win someday. That means using a suite of assertions that don't terminate, but whose behavior you can change when you win the fight.

# Failures

OK, as much as we all love bugs, it's time to leave them behind and talk about failures. Let's say you identify a condition X where your function is unable to fulfill its primary purpose. That can occur one of two ways:

1. Something your function calls has a precondition that you're not sure would be satisfied.
2. Something your function calls can itself report a failure.

You usually have two choices at this point:

1. Make !X a precondition; X reflects a bug in the caller.
2. Make X a failure; all the code is correct.

It's counterintuitive, but you should always prefer to classify X as a bug, as long as !X satisfies the criteria for preconditions:

- It is possible to ensure !X. For example, there's no way for the caller to ensure there's enough disk space to save a file, because other processes can use up any space that might have been free before the call. So you can't make "there's enough disk to save" a precondition.
- Ensuring !X is considerably less work than the work done by the callee. For example, if the callee is deserializing a document and finds that it's corrupted, you can't make it a precondition that the file is well-formed, because determining whether it is or not is basically the same work as doing the deserialization.

## Definition

> **Failure**: inability to satisfy a postcondition in correct code.

So why am I tying this definition to postconditions other than to bind our understanding of error handling to our understanding of correctness?

First of all, it simplifies and improves understandability of contracts. This is easiest to see if you have a dedicated language mechanism for error handling:

** Note: fictional programming language **

// Returns `x` sorted in `order`, or throws an exception
// in case order fails.
fn sorted(x: [Int], order: Ordering) throws -> [Int]

// Returns `x` sorted in `order`.
fn sorted(x: [Int], order: Ordering) throws -> [Int]

Even if you feel you need to say something about possible failures, that becomes a secondary note that's not essential to the contract.

// Returns `x` sorted in `order`.
//
// Propagates any exceptions thrown by `order`.
fn sorted(x: [Int], order: Ordering) throws -> [Int]

A programmer can know everything essential from the summary fragment and the signature. Another way this separation plays nicely with exceptions is that you can say the postcondition of a function describes what you get when it returns, and a throwing function never returns.

If you don't use exceptions, you still get simplified contracts as long as you have dedicated types to represent the possibility of failure.

// Returns `x` sorted in `order`.
fn sorted(x: [Int], order: Ordering) -> ResultOrFailure<[Int]>

Separating the function's primary intention from the reasons for failure makes sense, because the reasons for failure matter less. If that's not obvious yet, some justification is coming.
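In the meantime, here's a rough sketch of how the same separation looks in a real language. It's Swift, the function names and the choice of `Int` elements are mine, and it's only meant to show that the one-line summary stays identical whether failure is expressed with `throws` or with a dedicated result type.

```swift
// Returns `x` sorted by `order`.
// (Failure is expressed by `throws`; it isn't part of the summary.)
func sorted(_ x: [Int], by order: (Int, Int) throws -> Bool) throws -> [Int] {
    try x.sorted(by: order)
}

// Returns `x` sorted by `order`.
// (Same contract; here the possibility of failure lives in the return type.)
func sortedOrFailure(_ x: [Int],
                     by order: (Int, Int) throws -> Bool) -> Result<[Int], Error> {
    Result(catching: { try x.sorted(by: order) })
}

// Either way, the caller reads the same summary and signature.
let ascending = try? sorted([3, 1, 2], by: <)        // [1, 2, 3]
let asResult = sortedOrFailure([3, 1, 2], by: <)     // .success([1, 2, 3])
```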
+ +Another reason to exclude the failure case from the postcondition is that you want postconditions to be solid and fully described, but a mutating operation that fails often leaves behind a state that's very difficult to nail down, and as I said in the contracts talk, that you usually don't want to nail down, because it's detail nobody cares about. But if it's part of the postcondition, you need to say something about it, and that further complicates the contract. + +// Sorts `x` according to `order` or throws an exception +// if `order` fails, leaving `x` modified in unspecified +// ways. +fn sort(mutating x: [Int], order: Ordering) throws + +// Sorts `x` according to `order`. +fn sort(mutating x: [Int], order: Ordering) throws + +## Two kinds of failures + +If you've spent some time writing code that carefully handles failures, especially in a language like C where all the error propagation is explicit, failures start to fall into two main categories: local and non-local, based on where the recovery is likely to happen. + +Local recovery occurs very close to the source of failure, usually in the immediate caller, in a way that often depends heavily on the reasons for the failure. In many cases, the recovery path is performance-critical. + +**Example**: you have an ultrafast memory allocator that draws from a local pool much smaller than your system memory. You build a general-purpose allocator that first tries your fast allocator, and only if that allocation fails, recovers by trying the system allocator. + +**Example**: the lowest level function that tries to send a network packet can fail for a whole slew of reasons (https://www.ibm.com/docs/en/zos/2.3.0?topic=codes-sockets-return-errnos), some of which may indicate a temporary condition like packet collision. 99% of the time, the immediate caller is a higher-level function that checks for these conditions and if found, initiates a retry protocol with exponential backoff, only itself failing after N failed retries. That lowest-level failure is local. The failure after N retries is very likely to be non-local. + +Non-local recovery, which is far more common, occurs far from the source, usually in a way that can be described without reference to the reasons for failure. For example, when you're serializing a complex document, serializing any part means serializing all of its sub-parts, and parts are ultimately nested many layers deep. Because you can run out of space in the serialization medium, every step of the process can fail. If you write out the error propagation explicitly, it usually looks like this: + +// Writes `s` into the archive. +fn serialize_section(s: Section) -> MaybeFailure +{ + var failure: Optional = none; + + failure = serialize_part1(s.part1); + if failure != none { return failure; } + + failure = serialize_part2(s.part2); + if failure != none { return failure; } + + ... + + return serialize_partN(s.partN); +} + +After every operation that can fail, you're adding “and if there was a failure, return it.” + +There are many layers of this propagation. None of it depends on the details of the reasons for failure: whether the disk is full or the OS detects directory corruption, or serialization is going to an in-memory archive and you run out of memory, you're going to do the same thing. 
Finally, where propagation stops and the failure is handled—let's say this is a desktop app—again, the recovery is usually the same no matter the reasons for the failure: you report the problem to the user and wait for the next command.

### Interlude: Exceptions?

Way back in 1996 I embarked on a mission to dispel the widespread fear, loathing, and misunderstanding around exceptions. Yes, I'm old. While I've seen some real progress on that over the years, I know some of you out there are still not all that comfortable with the idea. If you'll let me, I think I can help.

#### Just control flow

Cases like this are where the motivation for exceptions becomes really obvious. They eliminate the boilerplate and let you see the code's primary intent:

// Writes `s` into the archive.
fn serialize_section(s: Section) throws {
  serialize_part1(s.part1);
  serialize_part2(s.part2);
  ...
  serialize_partN(s.partN);
}

There's no magic. Exceptions are just control flow. Like a switch statement, they capture a commonly needed control-flow pattern and eliminate unneeded syntax.

To grok the meaning of this code in its full detail, you mentally add "and if there was a failure, return it" everywhere. But if you push failures out of your mind for a moment, the way the function fulfills its primary purpose leaps out at you in a way that was obscured by all the failure handling. The effect is even stronger when there's some control flow that isn't related to error handling.

#### Also, type erasure

OK, I lied a little when I said exceptions are just control flow. There's one other big difference between the exception version and the explicit version: the exception version erases the types of the failure data, and catch blocks are just big type switches with dynamic downcasts.

Lots of us are "static typing partisans," so at first this might sound like a bad thing, but remember, as I said, none of the code propagating this failure (or even recovering from it, usually) cares about its details. What do you gain by threading all this failure information through your code? When the reasons for failure change, you end up creating a lot of churn in your codebase updating those types.

In fact, if you look carefully at the explicit signature, you'll see something that typically shows up when failure type information is included: people find a way to bypass that development friction.

fn serialize_section(s: Section) -> MaybeFailure

Here an "unknown" case was added that is basically a box for any failure type. This is also a reason that systems with statically checked exception types are a bad idea. Java's "checked exceptions" are a famously failed design because of this dynamic.

Swift recently added statically-typed error handling in spite of this lesson, which should be well understood by language designers, for reasons I don't understand. There was great fanfare from the community because, I suppose, everybody thinks they want more static type safety. I'm not optimistic that this time it's going to work out any better.

The moral of the story: sometimes dynamic polymorphism is the right answer. Non-local error handling is a key example, and the design of most exception systems optimizes for that.
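To see the "catch blocks are just big type switches" idea in a concrete language, here's a small Swift sketch. The failure types and function names are invented for the example; the point is that the thrown value travels as a type-erased error, and only the one place that cares about details recovers a concrete type with a dynamic cast.

```swift
// Hypothetical failure types for the serialization example. Propagating
// code never mentions them; only the handler below does.
struct ArchiveFull: Error { let bytesNeeded: Int }
struct DirectoryCorruption: Error {}

func serializeDocument() throws {
    // Imagine many nested serialize_part calls here; any of them may
    // throw, and the thrown value is erased to `any Error` on the way up.
    throw ArchiveFull(bytesNeeded: 4096)
}

func runSaveCommand() {
    do {
        try serializeDocument()
    } catch let failure as ArchiveFull {
        // A dynamic downcast: details examined only where they matter.
        print("Couldn't save: need \(failure.bytesNeeded) more bytes.")
    } catch {
        // Everything else gets the generic top-level treatment:
        // report the problem and wait for the next command.
        print("Couldn't save: \(error)")
    }
}

runSaveCommand()
```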
### When (and when not) to use exceptions

There's a lot of nice-sounding advice out there about this that is either meaningless or vague, like "use exceptions for exceptional conditions," or "don't use exceptions for control flow." I know that one is really popular around Adobe, but c'mon: if you're using exceptions, you're using them for control flow. I hope to improve on that advice a little bit.

First of all, you can use exceptions for things that aren't obviously failures, like when the user cancels a command. An exception is appropriate because the control flow pattern is identical to the one where the command runs out of disk space: the condition is propagated up to the top level. In this case recovery is slightly different: there's nothing to report to the user when they cancel, but all the intermediate levels are the same. It would be silly to explicitly propagate cancellation in parallel with the implicit propagation of failures. (A small sketch of this appears at the end of this section.)

But if you make this choice, I strongly urge you to classify this not-obviously-a-failure thing as a failure! Otherwise you'll undo all the benefits of separating failures from postconditions, and you'll have to include "unless the user cancels, in which case…" in the summary of all your functions. So in the end, my broad advice is, "only use exceptions for failures (but be open-minded about what you call a failure)." Actually, even if you're not using exceptions, any condition whose control flow follows the same path as non-local failures should probably be classified as a failure.

Another prime example is the discovery of a syntax error in some input. In the general case, you are parsing this input out of a file. I/O failures can occur, and will follow the same control flow path. Classifying your syntax error as a failure and using the same reporting mechanism is a win in that case.

Next, don't use exceptions for bugs. As we've said, when a bug is detected the program cannot proceed reliably, and throwing is likely to destroy valuable debugging information you need to find the bug, leave a corrupt state, open a security hole, and hide the bug from developers. Even though the "default behavior" of exceptions is to stop the program, throwing defers the choice about whether to actually stop to every function above you in the call stack. This is not a service; it's a burden. You've made your function harder to use by giving your clients more decisions to make. Just don't.

That also means that if you use components that misguidedly throw logic_error, domain_error, invalid_argument, length_error, or out_of_range at you, you should almost always stop them and turn them into assertion failures. All that said, there are some systems, like Python, where using exceptions for bugs (to say nothing of exiting loops!) is so deeply ingrained that it's unavoidable. In Python you have to ignore this rule.

Don't use exceptions for local failures. As we've seen, exceptions are optimized for the patterns of non-local failures. Using them for local failures means more catch blocks, which increase code complexity. It's usually easy to tell what kind of failure you've got, but if you're writing a function and you really can't guess whether its failure is going to be handled locally, maybe you should write two functions.
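Before moving on, here's the promised sketch of the cancellation advice, in Swift with invented names. Cancellation rides exactly the same propagation path as a genuine failure, and only the top level treats the two differently. (In Swift concurrency code the standard library's own CancellationError could play the same role as the type invented here.)

```swift
// Invented for the example.
struct UserCancelled: Error {}
struct OutOfDiskSpace: Error {}

// A long-running command: cancellation takes the same exit path as a
// genuine failure, so none of the intermediate layers need to know.
func runExportCommand(isCancelled: () -> Bool) throws {
    for _ in 1...100 {
        if isCancelled() { throw UserCancelled() }
        // ... do one unit of work, which might itself throw OutOfDiskSpace ...
    }
}

// The top level is the only place that distinguishes the two cases.
func performExport(isCancelled: () -> Bool) {
    do {
        try runExportCommand(isCancelled: isCancelled)
    } catch is UserCancelled {
        // Nothing to report: the user asked for this.
    } catch {
        print("Export failed: \(error)")  // report, then await the next command
    }
}
```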
Next, consider performance implications. Most languages aren't like this, but most C++ implementations are biased so heavily toward optimizing the non-failure cases that handling a failure runs one or two orders of magnitude slower. Usually that's a great trade-off, because it allows them to skip checking for the error case on the hot path, and non-local failures are rare and don't happen repeatedly inside tight loops. But if you're writing a real-time system, for example, you might want to think twice.

Here's an example that might open your mind a bit: when we were discussing the design of the Boost C++ Graph Library, we realized that occasionally a particular use of a graph algorithm might want to stop early. For example, Dijkstra's algorithm finds shortest paths from a starting vertex, discovering them in order from shortest to longest. What if you want the ten shortest and then stop? The way this library's algorithms work, you pass them a "visitor" object that gets notified about results as they are discovered. And in fact there are lots of notification points for intermediate conditions, not just "complete path found," so if we were going to handle this early stop explicitly, we'd generate a test after each one of these points in the algorithm's inner loop. Instead, we decided to take advantage of the C++ bias toward non-failures. We said a visitor that wants to stop early can just throw. Now in fairness, I don't think we ever benchmarked the effects of this choice, so it might have been wrong in the end. But it was at least plausibly right.

Finally, you might need to consider your team's development culture and use of tooling. If people typically have their debuggers set up to stop when an exception occurs, you might need to take extra care not to throw when there's an alternate path to success. Some developers tend to get upset when code stops in a case that will eventually succeed.

## How to Handle Failure

OK, enough about exceptions. Finally we come to the good part! Seriously, this was originally going to be the focus of the entire talk.

Let's talk about the obligations of a failing function and of its caller. What goes in the contract, and what does each side need to do to ensure correctness?

### Callee

Documentation:

- Document local failures and what they mean.
- Document non-local failures at their source, but not where they are simply propagated. That information can be nice to have, but it also complicates contracts and is a burden to propagate and keep up-to-date.

Code:

- Release any unmanaged resources you've allocated (e.g. close a temporary file).

#### Optional

If mutating, consider giving the strong/transactional guarantee that if there is a failure, the function has no effects.

Only do this if it has no performance cost. Sometimes it just falls out of the implementation. Sometimes you can get it by reordering the operations: for example, if you do all the things that can fail before you mutate anything visible to clients, you've got it (see the sketch below, after the caller's obligations).

Don't pay a performance penalty to get it, because not all clients need it, and when composing parts all the needless overheads add up massively.

### Caller

- Discard any partially-completed mutations to program state, or propagate the error, and that responsibility, to your caller. This partially mutated state is meaningless.

What counts as state? Data that can have an observable effect on the future behavior of your code. Your log file doesn't count.
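Here's a minimal Swift sketch of both sides of that bargain, with types invented for the example: the callee gets the strong guarantee for free by doing all of its failable work before touching visible state, and the caller commits a mutated copy only on success, discarding it otherwise.

```swift
struct EmptyParagraph: Error {}

struct Document {
    var paragraphs: [String] = []
}

// Callee: everything that can fail happens before any visible mutation,
// so on failure `document` is exactly as it was. The strong guarantee
// falls out of the ordering; no extra work is paid for it.
func append(_ newText: [String], to document: inout Document) throws {
    let validated = try newText.map { (paragraph: String) -> String in
        if paragraph.isEmpty { throw EmptyParagraph() }
        return paragraph
    }
    // Commit phase: nothing below this line can fail.
    document.paragraphs.append(contentsOf: validated)
}

// Caller: mutate a copy and replace the real state only on success, so a
// failure anywhere leaves no half-finished changes behind.
func applyEdit(_ newText: [String], to current: Document) -> Document {
    var working = current
    do {
        try append(newText, to: &working)
        return working                    // commit
    } catch {
        print("Edit failed: \(error)")    // report...
        return current                    // ...and discard the partial mutation
    }
}
```

Because Swift's arrays are copy-on-write, the working copy shares storage with the original until something actually changes, which is a small-scale preview of the persistent-data-structure idea discussed next.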
+ +#### Implications as data structures scale up + +The only strategy that really scales in practice, when mutation can fail, is to propagate responsibility for discarding partial mutations all the way to the top of the application. That in turn implies mutating a copy of existing data and replacing the old copy only when mutation succeeds. Either way, you probably end up with a persistent data structure (which is a confusing name—it has nothing to do with persistence in the usual sense). + +A persistent data structure is one where a partial mutation of a copy shares a lot of storage with the original. For example, in Photoshop, we store a separate document for each state in the undo history, but these copies share storage for any parts that weren't mutated between revisions. This sharing behavior falls out naturally when you compose your data structure from copy-on-write parts. + +### What (not) to do when an assertion fires. + +- Don't remove the assertion because “without that the program works!” +- Don't complain to the owner of the assertion that they are crashing the program. +- Understand what kind of check is being performed + - If it's a precondition check, fix your bug + - If it's a self-check or postcondition check, talk to the code owner about why their assumptions might have been violated + +### Probably different functions for unit testing. + + + + + + + + +Notes: + - read from network, how much was read + - no-error case exists + - podcast + - likely a local handling case. + - don't go to vegas with something you're not prepared to lose. + +Quickdraw GX: 15% performance penalty for making silent null checks. From 0a366a8b29068bea6f64e8d1df25fbd122cb4038 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Fri, 14 Nov 2025 10:58:28 -0800 Subject: [PATCH 07/41] WIP --- better-code/src/chapter-3-errors.md | 84 +++++++++++++++-------------- 1 file changed, 44 insertions(+), 40 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index d827b08..47e4815 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -30,7 +30,7 @@ When we write the word “error” in normal type, we mean the idea above, distinct from the related Swift `Error` protocol, which we'll always spell in code font. -We'll divide errors into three categories: +We'll divide errors into three categories:[^common-definition] > - **Input error**: the program's external inputs are malformed. For > example, a `{` without a matching `}` is discovered in a JSON @@ -47,6 +47,11 @@ We'll divide errors into three categories: [^avoidable]: While bugs are inevitable, every *specific* bug is avoidable. +[^common-definition]: While some folks like to use the word “error” to +refer only to *failures*, as the authors have done in the past, the +use of “error” to encompass all three of these categories appears to +be more widespread. + ## Error Recovery Let's begin by talking about what it means to “recover from an error.” @@ -83,22 +88,26 @@ intact. -- DWA --> More generally, [it has been said](https://stackoverflow.com/a/38387506) that recovering from an -error means that the program can “sally forth entirely unscathed,” -i.e. that the program state is intact—its invariants are upheld. +error allows a program to “to sally forth, entirely unscathed, as +though 'such an inconvenient event' never had occurred in the first +place.” -Also, the state must make sense given the correct inputs received so -far. 
“Making sense” is necessarily a subjective judgement, so examples -are called for. +Being “unscathed” means two things: first, that the program state is +intact—its invariants are upheld so its code is not relying on any +newly-incorrect assumptions. Second, that the state makes sense +given the correct inputs received so far. “Making sense” is +necessarily a subjective judgement, so examples are called for. - The initial state of a compiler, before it has seen any input, - certainly meets its invariants. But when an error is + certainly meets the compiler's invariants. But when an error is encountered, resuming with that state would ignore the context seen so far that can help inform further diagnostics. If the following text did not match what is expected at the beginning of a source - file, it would be flagged as an error. We the error might, for - example have been detected in some deeply (correctly) nested - construct. If that state isn't preserved, each closing delimiter of - that construct will be flagged as a new error. + file, it would be flagged as an error. The error might, for example + have been detected in some otherwise-correct deeply nested + construct. If the compiler's state is reset to its initial + conditions, each closing delimiter of that construct would be + flagged as a new error. - In a desktop graphics application, it's not enough that upon error (say, file creation fails), the user has a well-formed document; an @@ -116,12 +125,13 @@ saw in the previous chapter, not all precondition violations are detectable. Also, it's important to admit that when a runtime bug check fails, we're not detecting the bug per-se: since bugs are flaws in *code*, actually detecting bugs involves analyzing the program. -We're really detecting a *downstream effect* that the bug has on -*data*, like some kind of cosmic echo. We know something happened, -but we don't know exactly where, how or why. +We're really detecting a *downstream effect* that the bug has had on +*data*. When we observe that a precondition has been violated, we know +something invalid occurred, but we don't necessarily know exactly +where, how, or the full extent of the damaged data. So can we “sally forth unscathed?” The problem is that you can't -know. The downstream effects of the problem could have affected many +know. The downstream effects of the problem could have affected many things you didn't test for, and you can't test for everything, or your code would spend more time on that than on fulfilling its purpose. Because of the bug, your program state could be very, very scathed @@ -131,40 +141,34 @@ Sallying forth at this point is a terrible idea. - First, there are effects in the outside world. - - so the users data might be corrupted, right? - And they might say that that way and they'll lose the last good state they had. - So that's that's pretty serious. + - In an editing application he user's document might be corrupted + and they might save it that way, losing the last good state they + had. - - if you've done a security evaluation, the assumptions that + - If you've done a security evaluation, the assumptions that underlie that evaluation might be violated. So by continuing, you may be opening a security hole - - You also can't detect whether you've recovered correctly. There's - there's nothing to look at and the penalties that we just talked - about for failure to do it correctly are really, really high. - -- then there's also the impact on the development process. 
+ - You don't have enough information about the state of your system + to do it reliably, you can't detect whether you've done it + correctly, and the penalties we just discussed for failure to do + it correctly are astronomical. - - if you Sally forth the bug is gonna be masked and we'll never get - fixed until at some point, you know, somebody will observe the - effects of this +Then, there's also the impact on the development process. - - It's gonna affect your, your customers and your and if it you know - when it affects the really important customer, your management may - insist that you do something about it. + - The bug will be at least partially masked. - - You didn't detect the bug. You don't have a detection of the bug. - You have some very distant echo in the users document that's - corrupted and now now it's a long process to, you know, try to - figure out where that corruption came from. + - If not completely masked, and addressing it will usually be + de-prioritized. - - most code is correct, so you're “bug,” recovery code will never - run. it certainly isn't gonna get tested. if it got tested, you're - gonna fix the problem, so now you're shipping a lot of code that's - just protection against future programming mistakes. + - If the bug ever becomes a priority, it will be harder and more + expensive to fix. - All of this recovery code bloats your program and every single line - is a liability with no offsetting benefits. + - Because most code is correct, so most of your bug-recovery code + will never run or be tested. All this recovery code bloats your + program and every line [is a + liability](https://blog.objectmentor.com/articles/2007/04/16/code-is-a-liability) + with no offsetting benefits. ### Actual Bug Recovery From aed127ccdaecec5a76892bf32ab02358152c6d67 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Mon, 1 Dec 2025 13:40:20 -0800 Subject: [PATCH 08/41] More WIP --- better-code/src/chapter-3-errors.md | 1503 +++++---------------------- 1 file changed, 281 insertions(+), 1222 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 47e4815..fa7a73f 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -130,31 +130,36 @@ We're really detecting a *downstream effect* that the bug has had on something invalid occurred, but we don't necessarily know exactly where, how, or the full extent of the damaged data. -So can we “sally forth unscathed?” The problem is that you can't -know. The downstream effects of the problem could have affected many -things you didn't test for, and you can't test for everything, or your -code would spend more time on that than on fulfilling its purpose. +So can we “sally forth unscathed?” The problem is that we can't +know. Since we don't know where the bug is, the downstream effects of +the problem could have affected many things we didn't test for. Because of the bug, your program state could be very, very scathed -indeed. +indeed, violating assumptions made when coding and potentially +compromising security, If user data is quietly corrupted and +subsequently saved, the damage is permanent. -Sallying forth at this point is a terrible idea. +In any case, unless the program has no mutable state and no external +effects, the only principled response to bug detection is to terminate +the process. [^fault-tolerant] -- First, there are effects in the outside world. 
+[^fault-tolerant]: There do exist systems that recover from bugs in a +principled way, using redundancy: for example, functionality could be +written three different ways by separate teams, and run in separate +processes that “vote” on results. In any case, the loser needs to be +terminated to flush any corrupted program state. - - In an editing application he user's document might be corrupted - and they might save it that way, losing the last good state they - had. +As terrible as that outcome may be, it's better than the alternative. - - If you've done a security evaluation, the assumptions that - underlie that evaluation might be violated. So by continuing, you - may be opening a security hole +Immediate dangers aside, sallying forth in the face of a detected bug +hurts the development process and the health of the codebase. One +could argue that - - You don't have enough information about the state of your system - to do it reliably, you can't detect whether you've done it - correctly, and the penalties we just discussed for failure to do - it correctly are astronomical. +Even if the condition is logged, it is at +least partially masked, and in practice, will usually be +de-prioritized. If it ever becomes a priority, it will be harder and +more expensive to fix than if it had immediately . It's easy to make +bugs drop -Then, there's also the impact on the development process. - The bug will be at least partially masked. @@ -164,1234 +169,288 @@ Then, there's also the impact on the development process. - If the bug ever becomes a priority, it will be harder and more expensive to fix. - - Because most code is correct, so most of your bug-recovery code - will never run or be tested. All this recovery code bloats your + - Because most code is correct, bug-recovery code will never run or be tested. All this recovery code bloats your program and every line [is a liability](https://blog.objectmentor.com/articles/2007/04/16/code-is-a-liability) with no offsetting benefits. +[^needless-checks]: Even if we _could_ test for everything, our code +would spend more time on tests than on fulfilling its purpose. And because most code is correct, the code attempting to recover would be untexted + + +In general, only fault-tolerant systems that can recover from bugs use redundancy + ### Actual Bug Recovery -I mean, they're do exist robust systems, right? -So they they can recover from bugs. -How do they do that? -Well, it's all. -It's almost always basically always. -It's outside the process, right? -Maybe the robustness of the system comes from redundancy. -You have you have three different processes and they all vote on the result. -The like this is the kind of thing you might see in like the F22 Joint Strike Fighter. -So yeah, there could be a bug. -First of all, they you know they check the code a lot more carefully than we do, but but they also put in safeguards in place so that so that if you know you have three systems voting on the result and one disagrees, you can kill that process and start it up again. -Umm. -So yeah, sometimes it's possible to design a system to recover from books, but don't expect to do it in in your process. -To sum up, uh in general you can't recover from bugs and it's a bad idea to try. - -### Correct in-process response to bugs - -So what can you do? -Well, the way to handle bugs is to stop the program before any more damage is done and generate a crash report for debuggable image that captures as much information as you possibly can about the state of the program. 
-So there's a chance of fixing the bug. -Umm, be there might be some small emergency shutdown procedure. -You might need to perform like saving information about the failing command so your application can offer to retry it for you when you restart it. -Were you? -You know, maybe you can say something to the user about the reason that you're exiting. - - -So this is bad, right? -This is really bad if if you don't do something, really go out of your way to do something about it, it's gonna be experienced as a crash by the users, but it's the only way to prevent much worse consequences of a botched recovery attempt. -Remember the chances of battery are really high because you don't have enough information to do it reliably. -There is an upside, though, right? -It's also gonna be experienced as a crash by developers, QE teams and beta testers, and that gives you a chance to fix the bug, right? -It's not going to slip by those people unnoticed and then hit your customers in a really damaging way. -So you can though, mitigate this experience of of crashing right? -For example, you could say something to the user about the reasons that you're exiting, and you can actually make it sound pretty responsible. So. -So this is important. - -### Embracing Early Termination - -You know, a lot of people have a hard time accepting the idea of voluntarily crashing or exiting right? -Exiting early is really what that should say, but you know we should face it. -You're bug detection isn't the only reason that the program might exit early, right? -You can crash from an undetected bug were a person can trip over the power cord, and really you should design your software so that when these bad things happen, they're not catastrophic. -In fact, you know, if we stop, you know, pushing, pushing bugs away and and early exit away. -As though as though it's an intolerable thing, we could actually embrace it and try to make it really seamless, right? -But you have to to do that. -You have to accept that early exits are sometimes gonna be a part of the whole package of user experience that you're trying to to deliver. -Umm. -Maybe you could arrange for the program to restart itself, for example. -Umm so. -In fact, there are platforms that actually force you to live under constraint of, you know, no early exit, right. -So on an iPhone or iPad, for example, to save battery and keep your foreground apps responsive, the OS might kill your process anytime it's in the background. -But it's going to make it look to the user like the the app is still running and when the user switches back, every app is supposed to complete the illusion by coming back up in the same state it was killed in. -I can tell you that as a user, it's really jarring when you encounter an app that doesn't do that correctly? -So the point is resilience to early termination is something that you can and should design into the system. -So Photoshop uses a variety of strategies for this, so we already we always save documents into a new file and then atomically swap that file into place only after the save succeeds. -So we never crash, leaving some half written corrupted document on disk, right? -We also periodically save backups so you only had Most lose the last few minutes of work, but we could be more ambitious about this, right? -We, if we needed to tighten that up, we could maybe save a record of changes since the last fall back backup. 
- -### Assertions - -Umm, so the usual mechanism that we have for terminating a program when a bug is detected is called an assertion, and traditionally it's spelled, you know something like this and this spelling comes from C and C++. -If you're programming in in some other language, you probably have something similar and the the facility from C is pretty straightforward. -Either it's disabled, in which case it generates no code at all, even the. -Check is skipped. -Umm. -Or it does the check and exits immediately with a predefined error code if the check fails, usually printing a a message containing the text of the failed check and its location in source. -Good debuggers commonly stop at that assertion rather than just exiting, and even if you're not running in the debugger on many on Major OS's you'll get a crash report with the entire program state that could be loaded into a debugger. -So this is great for catching bugs early before they get shipped and and actually diagnosing them provided people use it. -And uh, so another important dynamic is the project's commonly disable assertions in release builds. -So this has the nice side effects of making programmers comfortable adding a lots of assertions because they know they're not gonna slow down the release build, and that means more bugs get caught early. -But unless you really believe you're shipping bug free software, you might wanna leave most assertions on and release builds. -So in in fact the security of your software might depend on it. -So if you're programming in an unsafe language like C, opportunities to cause undefined behavior are all around you, and when you can assert that the conditions for avoiding that you be are met before executing the dangerous operation, the program will come to a controlled stop instead of instead of opening an arbitrarily bad security hole. -Yeah, I should have. -I meant to to make this distinction earlier, right? -Exiting because of an assertion is not a crash, right? -This is a controlled stop for calculated reasons. - -Umm so but the problem with leaving assertions on and release is that some checks are too expensive to ship. -And let's be honest, a lot of programmers are gonna go with their gut about what's too expensive instead of measuring. -So we really need a second expensive assert, right? -That is only on in debug builds, so we can continue to cache those bugs early. -And there's another problem with having just one assertion. -It doesn't Express sufficient intent. -There are lots of different reasons you might wanna be doing this kind of a check, so it might be a precondition check, right? -Or you're asserting functions author might just be double checking their own reasoning, and when these two different assertions fire, the meaning is really different. -The first one indicates a bug in the color and the other one is a bug in the callee, so I really wanna separate precondition and self check functions. -I want both of those. -Now, if I'm writing in a safe by default language like rust or swift, the checks that prevent undefined behavior like array bounds checks or special I can afford to turn off all of the other checks in shipping code. -But these checks are the ones that uphold the safety properties of my system. -Right. -And if I turn those off, the that's compromised. 
-So I wanna separate assertion for those for those checks that prevent undefined behavior even if I don't ever anticipate turning off the other ones in a shipped product, because these are the ones we can't delete from the code, right? -So you want to make that obvious by their spelling? -And furthermore, I'm I might wanna turn the others off locally so I can measure how much overhead they're incurring. -Alright, so I hope you get the idea. -I'm not trying to prescribe the exact set of assertion facilities your project needs, but at carefully engineered suite of these functions with properties appropriate to your project is part of a comprehensive strategy for dealing with bugs. -If you haven't gotten one of these, go design it. -OK. -So one last point about the C++ is Sir. -Umm, it's better than nothing, right? -But because it calls abort, there's no place to put any emergency shutdown measures. -So you can't even display a message to the user, so to the user if you use C's Cert, it's always gonna feel like a hard, unceremonious crash. -You probably won't fail to certains to call terminate instead of abort, right? -Because there are terminate handlers and those would run. -That gives you a chance to do some origin. -See shut down measures. -So that's another reason to engineer your own assertions, even if you're only engineering one. - -### Fighting For the Right To Die. - -OK, so at this point somebody always asks, but you know I I'm not allowed to terminate. -My manager says that that we have to keep running no matter what. -Right. -Umm, So what do you do? -Well, first, you've gotta fight for the right to park. -I need to terminate. -Right. -If you've got a critical system you wanna advocate creating some recovery system that's outside of the process because there is no reliable recovery inside the process. -And if you lose that fight today, right. -You wanna keep fighting, but in the meantime, fail as noisily as possible, preferably by when at least when you're not shipping the code, get it to terminate right and also set yourself up to deal with the day that that you win the fight because at some point the cost of of following this possible this policy are gonna become obvious. -And so that means use a suite of assertions that, well, today they don't terminate, but you can change their behavior when you do win the fight, OK. +Systems that are resilient to bugs do exist, though. They do it by adding -## Failures +Some systems can recover from bugs (e.g. redundant ones). Processes can't recover. -I I don't know if we're going to get to the end because of the scope expansion anyway, so as much as we all love talking about bugs, it's time to leave bugs behind and talk about failures. +To sum up, in general you can't recover from bugs, and it's a bad idea to try. So what can you do? -> - **Failure**: a function could not fulfill its postconditions even -> though its preconditions were satisfied. For example, writing a -> file might fail because the filesystem is full. +## Handling bugs +You can stop the program before any more damage is done, and generate a crash report or debuggable image that captures as much information as is available about the state of the program, so there's a chance of fixing the bug. Maybe there's some small emergency shutdown procedure you need to perform, like saving information about the failing command so the application can offer to retry it for you when you restart it. 
-So let's say you identify a condition where your function is unable to fulfill its primary purpose. That can occur in one of two ways. Either
-
-1. something your function calls has a precondition that you can't be sure you're prepared to satisfy, or
-2. something your function calls itself reports a failure.
-
-You usually have two choices at this point:
-
-1. Make it a precondition violation: your inability to make progress reflects a bug in the caller.
-2. Make it a failure, which means that all of the code in the system is correct.
-
-It's counterintuitive, but you should actually prefer to classify the situation as a bug in the caller, as long as it satisfies the criteria for acceptable preconditions:
-
-- It must be possible for the caller to ensure the condition. There's no way for the caller to ensure there's enough disk space to save a file, because other processes can come and use up any space that might have been free before the call. So you can't make "there's enough disk to save" a precondition.
-- The other way something might not be a suitable precondition is if it takes as much work for the caller to ensure it as the work you're going to do in performing the operation anyway. For example, if you're deserializing a document and you find that it's corrupted, you can't make it a precondition that the file is well formed, because determining whether it's well formed is the same work as doing the deserialization.
-
-So prefer to make it a precondition. But if you can't satisfy a postcondition and you're in correct code, that's a failure.
-
-> **Definition**
-
-So why am I tying this definition to postconditions, other than to bind our understanding of error handling to the way we understand correctness? That's a valuable thing, but there are more reasons.
-First of all, it simplifies and improves the understandability of contracts. This is easiest to see if you have a dedicated mechanism in the language for error handling. I'm using a fictional programming language here, but it should be easy to understand what's happening; here are a couple of examples.
-In the first case, the error case is treated as though it's part of the postcondition: we have to say "this returns x sorted, or it throws an exception in case something fails." You're going to end up saying that a lot. If the error case is not part of the postcondition, you can just say "returns x sorted": if you know it's throwing an exception, you know that means the operation failed, and there's nothing else you need to say. Even if you do feel you need to say something about possible failures, that becomes a secondary note that's not essential to the contract. In both of these cases, a programmer can know everything essential from the summary fragment at the top and the signature of the function.
-Another way this separation plays nicely with exceptions: you can say that the postcondition of a function describes what you get when it returns, and a throwing function never returns.
-But if you don't use exceptions, you still get simplified contracts from this, as long as you have a dedicated type to represent the possibility of failure. For example, you can say that this returns x in sorted order, because a result-or-failure return type already tells you there's the possibility that the operation failed and that it's being reported.
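+In C++, one concrete way to give failure its own channel is a result-or-failure return type. A minimal sketch using C++23 `std::expected` follows; the `Document`, `Failure`, and `parse_document` names are invented for illustration.

+    // Sketch: the contract summary states only the primary intention; the return
+    // type carries the possibility of failure. Requires C++23 for <expected>.
+    #include <expected>
+    #include <string>
+    #include <string_view>
+
+    struct Document { std::string title; };
+    struct Failure  { std::string reason; };  // the details matter less than the fact
+
+    // Returns the Document encoded in `text`.
+    std::expected<Document, Failure> parse_document(std::string_view text) {
+        if (text.empty())
+            return std::unexpected(Failure{"empty input"});
+        return Document{std::string(text.substr(0, 8))};
+    }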
-Separating the primary intention from the reasons for failure makes sense because the reasons for failure actually matter less; and if that's not obvious to you yet, some justification is coming.
-Finally, another reason to exclude the failure case from the postcondition is that you want postconditions to be solid and fully described. But when a mutating operation fails, it often leaves behind a state that's very nebulous, and as I said in the contracts talk, you usually don't want to describe it, because it's detail that nobody cares about. But if it's part of the postcondition, you need to say something about it, and that further complicates the contract. You end up with something like "sorts x according to order, or throws an exception if order fails, leaving x modified in unspecified ways," and you say something like that over and over again for mutating operations, instead of just being able to say "sorts x according to order."
-Now, if you spend some time writing code that handles errors carefully and correctly, especially in a language like C where all of the error propagation is explicit, failures start to sort themselves into two categories, local failures and nonlocal failures, based on where the recovery is likely to happen.
-Local recovery occurs very close to the source of the failure, usually in the immediate caller, in a way that often depends heavily on the reasons for the failure, and it tends to show up in performance-critical code. For example, you might have an ultra-fast memory allocator that draws from a local pool much smaller than system memory, and on top of it you build a general-purpose allocator that first tries your fast allocator and, only if that allocation fails, recovers by trying the system allocator. That's very local handling: you try the fast allocator, then try your alternative, and the error doesn't propagate any further than that.
-Another common example: the lowest-level function that tries to send a network packet can fail for a whole slew of reasons, which you can look up in the POSIX documentation. Some of them indicate a temporary condition, like a packet collision, and 99% of the time the immediate caller of this low-level function is a higher-level function that checks for those conditions, initiates a retry protocol with exponential backoff if it finds one, and only itself fails after some number of failed retries. That lowest-level failure is local, and the failure after N retries is very likely to be nonlocal.
-Nonlocal recovery is far, far more common, and it usually occurs far from the source, in a way that doesn't depend on the details of the reason for failure. For example, when you're serializing a complex document, serializing any part means serializing all of its subparts, and parts are ultimately nested many layers deep. Because you can run out of space in the serialization medium, every step of the process can fail. So if you write out the error propagation explicitly, it usually looks something like the sketch below: an error code, with the same pattern repeated over and over again.
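+Roughly like this, in C++. The `Part` and archive types are hypothetical stand-ins; the explicit version comes first, followed by the exception-based equivalent discussed below.

+    #include <system_error>
+    #include <vector>
+
+    struct Part { std::vector<Part> children; };
+
+    struct Archive {
+        std::error_code write_header(const Part&) { return {}; }  // may report failure
+    };
+
+    // Explicit propagation: after every step, "and if there was a failure, return it."
+    std::error_code serialize(const Part& p, Archive& out) {
+        if (auto ec = out.write_header(p); ec) return ec;
+        for (const Part& child : p.children)
+            if (auto ec = serialize(child, out); ec) return ec;
+        return {};
+    }
+
+    struct ThrowingArchive {
+        void write_header(const Part&) {}  // reports failure by throwing
+    };
+
+    // Exception-based propagation: the same control flow, with the checks implicit.
+    void serialize(const Part& p, ThrowingArchive& out) {
+        out.write_header(p);
+        for (const Part& child : p.children)
+            serialize(child, out);
+    }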
-Each part you serialize, you check to see whether there was a failure, and if there was, you have an early return. So after every operation that can fail, you're logically adding "and if there was a failure, return it." There are many layers of this propagation, and none of it depends on the details of the reasons for failure: whether the disk is full, or the OS detects directory corruption, or the serialization is going to an in-memory archive and you run out of memory, you're going to do the same thing.
-Finally, where the propagation stops, the failure is ultimately handled. Let's say this is a desktop app: again, the recovery is usually the same no matter what the reasons for the failure are. You report the problem to the user and you wait for the next command.
-OK, so let's talk about exceptions for a minute. Way back in 1996, I developed a personal mission to dispel the widespread fear, loathing, and misunderstanding around exceptions. Yeah, I'm old. And while I've seen some real progress on that over the years, I know that some of you out there are still not all that comfortable with the idea of exceptions, and if you'll let me, I think I can help.
-The first point to know is that exceptions are just control flow, and you can see the motivation really easily with cases like this one, because using an exception eliminates the boilerplate and lets you see the code's primary intent. There is no magic here. Just like a switch statement, exceptions capture a commonly needed control-flow pattern and eliminate unneeded syntax. To grok the meaning of the exception version in its full detail, you mentally add "and if there was a failure, return it": the same thing we said we were going to repeat over and over again in the code with explicit error handling everywhere.
-But if you push failures out of your mind for a moment, you can see how the function fulfills its primary purpose much more easily. That primary purpose was obscured by all of the failure handling in the earlier version, and this clarifying effect is even stronger when there's some control flow that isn't related to error handling, because then the pattern of stuff you can ignore is less obvious.
-OK, so I said exceptions are just control flow. I lied a little bit. There's one other big difference between the exception version and the explicit version: the exception version erases the types of the failure data, and catch blocks are just big type switches with dynamic downcasts that recover that information.
-A lot of us are static-typing partisans, so at first erasing this type information might sound like a bad thing. But remember, as I said, none of the code propagating this failure (or usually even the code recovering from it) cares about the details of the reason for the failure. They don't care about the data in the failure report. What do you gain by threading all that failure type information through your code? When the reasons for failure change, you end up creating lots of churn in your code base, updating those types.
-In fact, if you look carefully at the explicit signature, you'll see something that typically shows up in systems where failure type information is included.
-People find a way to bypass the development friction induced by static types: right here we have this "unknown" case, and that's basically a type-erased box for any failure type. This is also a reason that systems with statically checked exception types are a bad idea, and it doesn't matter whether you're doing exception handling or reporting errors another way; the same dynamic occurs. Java has a feature called checked exceptions, which is a famously failed design because of this dynamic: people end up having to bypass it.
-Swift recently added statically typed error handling, in spite of this lesson, which should be well understood by the language designers. I don't understand why. There was a lot of fanfare from the community, because I suppose everybody thinks they want more static type safety, but I'm not optimistic that this time it's going to work out any better than it did for Java.
-So the moral of the story is that sometimes dynamic polymorphism is the right answer, and nonlocal error handling is a great example of that; the design of most exception systems optimizes for it.
-OK, unfortunately we are getting right to the limit on time, so we're not going to get to the end today. I think we're going to need to have a part two.

-Todd Baumeister 50:51
-Alright, I'll be the brave idiot who goes first.

-Nick DeMarco 50:55
-Thank you, Todd.

-Todd Baumeister 50:56
-Awesome presentation, thank you. As a former C developer (although I don't know if I can say "former"; can you ever forget how to ride a bike?), it was really good, and there were a lot of really good points about error handling. But I have to ask: the conditions you're looking for are like the worst case, the ones we can't recover from. And you mentioned the idea that you can have a chain of handlers when an exception comes out, filter through that, and then hit a catch-all at the end. So can I summarize your talk?

-Todd Baumeister 51:44
-As a developer, I have expected errors: things that I expect to potentially go wrong. For example, my network times out; I need to handle that. But we always need a catch-all at the end for, well, no, we should *not* have a catch-all at the end for the unexpected errors. Is that my main takeaway here?

-Dave Abrahams 52:08
-If I understand your question right, no, I'm not saying that. Let me try to sort some of this out, though; there's a lot of good stuff in your question. Remember from the beginning: "unexpected errors" almost always means bugs. And part of my advice, which we didn't get to, is right here:
-
-Don't use exceptions for bugs. When a bug is detected, you should exit the program, not throw. Don't worry about catching. Certainly don't use exceptions to exit the program. I know that's the default behavior if you don't catch, but the problem is what it does to everybody up the chain from you. I'm going to just read what I've got here: the default behavior of exceptions stops the program, but throwing when you find a bug defers the choice about whether to actually stop to every function above you in the call stack, and that is not a service, that is a burden. Giving your clients bad choices to make does not help anybody.
-You've just made your function harder to use by giving your client more decisions to make.
-OK, so then, lastly: what about a catch-all case at the top anyway? Maybe it's not for bugs; maybe it's for some exception type that you want to be aware of at the top level.

-Todd Baumeister 54:09
-You're running .NET on the C++ platform, and .NET throws an exception, and you've got to catch it someplace.

-Dave Abrahams 54:15
-Yeah.

-Todd Baumeister 54:16
-That's where I've experienced this, yeah.

-Dave Abrahams 54:18
-OK, so let's talk about the desktop app, because that's something I can address easily; we can look at other examples too. You've got an unidentifiable error, but it prevented your operation from succeeding. So what's the problem here? If you don't know anything about the exception type at all, you can't really give a meaningful error report to the user; that's the worst part of it. You have to say "sorry, an unknown error occurred." That's embarrassing, but it's not catastrophic. From there you can proceed just as though any other failure, like running out of disk space, had occurred.

-Todd Baumeister 55:15
-OK, thank you. Very helpful.

-Nick DeMarco 55:20
-Dustin, you want to go ahead?

-Dustin Passofaro 55:22
-In the last 30 seconds I'm sitting at the top of this bell curve of my relationship with exceptions. You start out with "ooh, exceptions are cool," then "oh my gosh, never use exceptions, they are the bane of my existence, why oh why." And I see you over here starting to push me over the edge to "wait a minute, maybe there is something here." I'm not won over yet, and please, I want to be won over. Please help me continue to see how this doesn't just lead to the most mind-boggling spaghetti.

-Dustin Passofaro 55:57
-So to kind of double down: I still see, even in the cases where it's a known error, that you're still deferring, and now you have your entire call stack above you going "oh, is it my responsibility? How about yours? How about yours?" That's the first problem I see.

-Dave Abrahams 56:13
-Well, let me address that to start with: if you do error handling carefully, no matter what mechanism you use, that pattern comes up.

+Let me be clear: THIS IS BAD. It could be experienced as a crash by users.
+But it's the only way to prevent the much worse consequences of a botched recovery attempt. Remember, the chances of botchery are high because you don't have enough information to do it reliably.
+Upside: it will also be experienced as a crash by developers, QE teams, and beta testers, giving you a chance to fix the bug.

-Dustin Passofaro 56:27
-Yeah, I see that.

+*** You can mitigate the experience of crashing ***
+*** Don't tell me my assertion is a crash ***
+*** An assertion is a controlled shutdown ***

-Dave Abrahams 56:28
-So it's just a mechanism; it doesn't change the nature of failure handling, which is the same no matter what mechanism you use.
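+To picture the desktop-app handling described above, here's a sketch of a top-level event loop with nonlocal handling; all of the names are hypothetical stand-ins.

+    #include <cstdio>
+    #include <exception>
+    #include <string>
+
+    void report_to_user(const std::string& message) { std::puts(message.c_str()); }
+    bool run_one_command() { return false; }  // returns false when the user quits
+
+    void event_loop() {
+        for (;;) {
+            try {
+                if (!run_one_command()) return;
+            } catch (const std::exception& e) {
+                report_to_user(e.what());  // e.g. "out of disk space"
+            } catch (...) {
+                report_to_user("Sorry, an unknown error occurred.");  // embarrassing, not catastrophic
+            }
+        }
+    }
+
+    int main() { event_loop(); }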
+A lot of people have a hard time accepting the idea of voluntarily terminating, but let's face it: your bug detection isn't the only reason the program might suddenly stop. You can crash from an undetected bug. Or a person can trip over the power cord. You should design your software so that these bad things are not catastrophic.

-Dustin Passofaro 56:38
-That's the light bulb. OK.

+*** In fact you could be more ambitious and try to make it really seamless. You have to accept that this is part of the UX package to even take this on. ***

+In fact some platforms force you to live under a similar constraint. On an iPhone or iPad, for example, to save battery and keep foreground apps responsive, the OS may kill your process any time it's in the background, but will make it look to the user like it's still running. When the user switches back, every app is supposed to complete the illusion by coming back up in the same state it was killed in. I can tell you as a user, it can be really jarring when you encounter an app that doesn't do it right. The point is, resilience to early termination is something you can and should design into the system.

-Todd Baumeister 56:40
-Can I just add: the big difference here is known versus unknown, right? You're talking about known exception handling versus unknown, which is a bug.

+For example, Photoshop uses a variety of strategies: we always save documents into a new file and atomically swap it into place only after the save succeeds, so we never leave a half-saved document on disk. We also periodically save backups so at most you only lose the last few minutes of work. If we needed to tighten that up we could, by saving a record of changes since the last full backup.

-Dave Abrahams 56:52
-Yeah, let me be really precise here. I prefer not to use the word "unknown," because a library could throw you a failure that doesn't represent a bug, but without having told you about the type; they didn't tell you they were going to throw that type. So an unknown exception is not necessarily a bug.

+## Assertions

-Todd Baumeister 57:25
-By a structured exception I mean something like an application exception, versus, say, a null pointer someplace.

+The usual mechanism for terminating a program when a bug is detected is called an assertion, and traditionally it's spelled something like this:

-Dave Abrahams 57:37
-So yes, what I'm talking about are language-feature exceptions. Unfortunately, lots of systems and processors call things like divide-by-zero an "exception," but those don't act like exceptions in languages: they don't propagate up the call stack and take care of things like destructors.

-Todd Baumeister 58:08
-So, system interrupts versus application errors?

-Dave Abrahams 58:13
-Sorry, what?

-Todd Baumeister 58:13
-Maybe system interrupts for application errors; divide-by-zero is a system interrupt.

-Dave Abrahams 58:20
-OK, again, let's be precise in our language. You say "application error"; given what we saw about errors, that could just mean a bug in the logic of the application.
-So don't handle that with an exception. If something you're using throws an exception at you for those cases, which is a common misdesign (C++ has it in places, even), stop it: turn it into an assertion failure. You don't want your unrecoverable code paths mixed with your recoverable code paths.

-Todd Baumeister 59:06
-Thank you.

+    assert(n >= 0);

-Nick DeMarco 59:09
-I have a question from Kevin Hopps, as I sense a cadence here.

-Nick DeMarco 59:14
-Kevin asks: how should I write my utility function? Take "open file," for example. It might be fatal for one caller, an error for another caller, and not even an error for yet another caller.

-Dave Abrahams 1:00:02
-That's a great question. If it's not necessarily fatal, and you want to make this thing useful to everybody, obviously you can't make the decision to make it fatal. You have to report the condition to your caller. Now, remember what I said about contracts and whether you decide something is an error or not. Actually, let's back up first and classify this. This is a not-a-bug error: nobody is in a position to control whether opening a file is going to succeed, so clearly it's not a bug; it's a failure of some kind.

-Kevin Hopps 1:01:13
-It might be a bug: if the call to open the file is from a piece of code that knows the file should be there, then it's a bug.

-Dave Abrahams 1:01:25
-What do you mean, "should be there"?

-Kevin Hopps 1:01:28
-If you say "always create a log file when you start the program," and later you try to open that log file and it's not there, that could be a bug.

-Dave Abrahams 1:01:40
-Well, not a bug in your code.

-Kevin Hopps 1:01:46
-Well, maybe I've come up with a poor example. The point is, only the caller knows whether the result is a bug or not. It might be a bug in some context and not a bug in another context.

-Dave Abrahams 1:01:59
-Fine. In general, if only the caller knows, you can't make a catastrophic decision. It's the same as not knowing whether it's fatal.

-Kevin Hopps 1:02:14
-Right.

+This spelling comes from C and C++. If you're programming in another language, you probably have something similar.

-Dave Abrahams 1:02:15
-Bugs are fatal; we've said that. So if you don't know whether it's going to be fatal for the client, you can't make the decision that it is fatal. So you treat it as a postcondition failure. And what was the other possibility? "Not even an error for another caller."
-I'm not sure what that means. Whether it's an error in the open-file function is not up to the caller. The caller may decide, "oh, I'm going to deal with the failure to open the file some other way." Maybe the caller isn't going to propagate that failure to its own caller; maybe it's got some alternative way to achieve success. But it's still a failure in the scope of the open-file function. I hope that answers the question, Kevin.

-Kevin Hopps 1:03:24
-Yes, thank you.

-Nick DeMarco 1:03:34
-Not seeing other questions in the chat, although I do want to call out that Florin shared something very interesting about Erlang that I invite folks to read if they're curious. He describes it as a nice complement to the concepts that Dave presented today, and having skimmed it twice now, I'm inclined to agree. Florin, maybe you want to comment a little about Erlang and what you shared.

-Florin Trofin 1:04:03
-Yes. I highly recommend that engineers read that paper. You can skip over Erlang itself; the language is not my favorite, but the runtime system is remarkable. Some of the properties it has: astounding availability, something like seven nines (I don't remember exactly), running for years and decades without stopping, and the ability to patch the system at runtime without shutting it down. Those two properties alone should raise an eyebrow. And it was done way back, but the principles are very sound, and they've been validated by these remarkable telephone switches that have been running for decades. The supervision concept Erlang introduced is very powerful. The idea is that when you want to do something that's not trivial, anything of any complexity, you delegate it to a child subprocess. In Erlang, processes are not like OS processes; they're much cheaper, so you can spawn millions of them. A nice property of the system is that it establishes a bidirectional link between the parent and the children it spawns. So you delegate to a child, or to multiple children, and let's say one of them fails: you immediately get notified, and as the parent you can have different policies. For example, you can restart just that node and retry it, or you can restart all the nodes even though only one child failed, because they work together and it doesn't make sense to restart just the one and retry the whole thing. And if I fail doing that, then I report back to my parent. So simpler and simpler things get done, until the whole system restarts; it restarts automatically. That's a very interesting thought.

+The C assertion is pretty straightforward: either it's disabled, in which case it generates no code at all—even the check is skipped—or it does the check and exits immediately with a predefined error code if the check fails, usually printing a message containing the text of the failed check and its location in source.
-I've been thinking about that for a long time, and I think especially for distributed systems and services it has a lot of appeal, but I don't think it should be limited to distributed systems. I think it's also powerful when you think about normal software, like desktop software.

-Nick DeMarco 1:06:37
-Dave, your thoughts?

-Dave Abrahams 1:06:42
-I was wondering, Florin, when you described those failures: are those indeed failures the way I've described them in this presentation, i.e. not bugs, or are these sometimes bugs?

-Florin Trofin 1:07:05
-The paper actually does make a distinction; it talks about the difference between an exception, an error or failure, and a bug, and I think a lot of your talk overlaps with the concepts in there. That's why I said it's a nice complement to what you already discussed. I'd be curious to hear your thoughts next time, after you've read the paper. It's really well written and well organized.

-Dave Abrahams 1:07:31
-Yeah, I'll read it before Part 2. Is it better than the Erlang movies? Have you seen the Erlang movies? You can look them up on YouTube; they're kind of hilarious.

-Nick DeMarco 1:08:11
-I think we might be reaching a cadence here, so thank you to everyone who came and stuck around for the discussion. I'll see you next month for type design with Sean Parent. Enjoy your long weekend.

-Dave Abrahams 1:08:59
-Thank you, everybody.

-## PART 2 ##

-Alright, welcome back, everybody.
-So, just to refresh where we were: first of all, for those of you who don't remember part one or weren't here, this is a very slick presentation where I show you no slides; this document I wrote up with my notes is in the background because it contains some examples I'll want you to look at.
-Where we were: we were talking about exceptions, and I just want to review a few things for background. I tried to demystify exceptions a little bit. They're just a control-flow mechanism, and they don't introduce any new problems to error handling: if you're handling errors right, you have basically all the same issues to think about whether you're using return types or not. But they do optimize for things a little differently. They optimize for nonlocal error handling, where it's very likely that your immediate caller doesn't have anything to do with the error and is just going to need to propagate it up. And they tend to erase the types of error information, which tends to prevent code churn as different kinds of errors end up propagated through the code; that just turns out to be a good thing for most code, which mostly doesn't care about what types are actually in the failure report, because most of the code is just propagating.
-So, we were about to talk about when to use exceptions and when not to. And I want to start by piercing some of the aphorisms you may have heard about this, because there's a lot of really nice-sounding advice about when to use exceptions that's either meaningless or really vague. Like "use exceptions for exceptional conditions": well, how do I measure what's an exceptional condition? I don't know. Or "don't use exceptions for control flow." That one, I know, is really popular around Adobe and even appears in one of our coding guidelines documents, but come on: if you're using exceptions, you're using them for control flow, because that's what they are. Exceptions change which code executes next. So I hope I can improve on that advice a little bit.
-First of all, you can use exceptions for things that aren't obviously failures. For example, when the user cancels a command, an exception is appropriate, because the control-flow pattern is identical to the one where the command runs out of disk space: the condition ends up propagated to the top level. In this case the recovery is just very slightly different (there's nothing to report to the user when they cancel), but all the intermediate levels between the point where the failure is initiated and the point at the top of your event loop are the same. So it would be silly to explicitly propagate cancellation using some other mechanism, in parallel with the implicit propagation of failures that you get from exceptions.
-But if you make the choice to use exceptions to deal with user cancellation, I would strongly urge you, in your thinking and in your terminology, to classify this case as a failure. I said it's OK to use exceptions for things that aren't obviously failures, but you can call this one a failure. Otherwise you're going to undo all of the benefits you got by separating failures from postconditions, and you'll have to include "unless the user cancels, in which case an exception is thrown" in the description of every function that could be cancelled. A sketch of the pattern follows below.
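+Here is a minimal sketch of that pattern, with invented names (`CancelledError`, `apply_blur`, `run_one_command`); it only illustrates the shape of the control flow.

+    #include <cstdio>
+    #include <exception>
+
+    struct CancelledError {};  // carries no data; there's nothing to report
+
+    void apply_blur() {
+        // Deep inside, some progress callback discovers the user hit Cancel:
+        bool user_hit_cancel = true;  // stand-in for a real UI check
+        if (user_hit_cancel) throw CancelledError{};
+        // Otherwise it might throw for disk-space failures, etc.
+    }
+
+    void run_one_command() {
+        try {
+            apply_blur();                  // every intermediate level just propagates
+        } catch (const CancelledError&) {
+            // recovery differs only slightly: nothing to report
+        } catch (const std::exception& e) {
+            std::printf("Sorry: %s\n", e.what());  // report and wait for the next command
+        }
+    }
+
+    int main() { run_one_command(); }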
-So in the end, my broad advice is: only use exceptions for failures, but be open-minded about what you call a failure. Actually, even if you're not using exceptions, any condition whose control flow follows the same path as nonlocal failures should probably be classified as a failure.
-Another prime example of a non-obvious place to use exceptions is the discovery of a syntax error in some input. In the general case you're parsing that input out of a file, and I/O failures can occur. So what's going to happen? If you have some nested call stack, say a recursive-descent parser, then when you hit the I/O error, the control flow is going to be the same as the control flow when you hit a syntax error. So if you call the syntax error a failure of the parsing routine and use the same error-reporting mechanism, you have a win for your code.
-OK, so those are some places where you can use exceptions. Next, when not to use them. Don't use exceptions for bugs. When a bug is detected, the program can't proceed reliably. And what happens when you throw? Well, there's a whole set of unwinding actions that happen: it destroys things on the stack, it changes where the stack pointer is, and all of that happens before your debugger or your crash report, if you even get one, kicks in. So you're destroying valuable information that you might need to find the bug. Furthermore, anything extra you do once a bug is detected is that much more likely to cause a problem: maybe corrupt your document, maybe open a security hole. And finally, it can hide the bug from developers, because when you throw an exception, that delegates responsibility for how to deal with it to your callers, and your callers maybe don't feel like stopping the application; maybe they want to swallow the exception and continue. As I said before, it is not a service to delegate that choice to your callers; it's a burden. Don't give your clients extra decisions to make, and especially don't open the door to bad decisions like continuing after a bug is detected. You just make your function that much harder to use.

-David Sankel 9:20
-It looks like there's a question in the chat. Dinesh, you want to go ahead?

-Dinesh Agarwal 9:24
-Yeah, thank you. So I just had a quick question. Dave, you mentioned that if we detect a bug, we shouldn't pass it along as an exception. But while the code is in production, ideally the bug would surface as an exception. I don't really understand what it means to "detect a bug"; if a developer detects a bug, they will try to fix it, right?

-Dave Abrahams 9:53
-OK, you said a few different things there that need to be responded to.

-Dinesh Agarwal 10:00
-Yeah, OK.

-Dave Abrahams 10:01
-Let me respond to the last thing first, because that one's easy: if the developer detects a bug, ideally they would try to fix it. Yes, I agree. Then, about production: were you present for part one of the talk?

-Dinesh Agarwal 10:19
-I joined 5 minutes late.

-Dave Abrahams 10:21
-OK, I'm pretty sure we covered this in part one.
-Once a bug is detected, if you continue to run, you increase the chance that you silently corrupt the user's data in an unrecoverable way. For example, take Photoshop, which periodically saves the document to recovery files; there are even more sophisticated systems that will also dribble out the commands that have executed successfully so far, so that the application can replay them. Regardless, that's a solid state before the bug is detected, or at least there's a very high likelihood that it is. Once the bug is detected, you're proceeding based on incorrect assumptions, and what can very easily happen is that those incorrect assumptions lead to corruption in the document that the user doesn't see. They proceed, they save their document, and it's all over; you can never get that back.

-Dinesh Agarwal 11:36
-Got it, I see. Cool, thank you.

-Dustin Passofaro 11:44
-Can I step in there too? Because I think I'm also not understanding. I shared his question, actually, and I was here for part one, so maybe I missed something there.

-Dave Abrahams 11:47
-Sure. I remember you; you had good questions.

-Dustin Passofaro 11:59
-I hope this question is also memorable, but we'll see. I was also thinking: sometimes a bug will come up and present itself as an exception. And, awesome, I'll let you keep going.

-Dave Abrahams 12:14
-That's my next point. Yes. If you use components that misguidedly throw things like logic errors or domain errors or invalid arguments at you, those things all represent bugs. Don't let those exceptions propagate. Catch them and terminate the application. Otherwise, you're just essentially doing, indirectly, what we've just said is a bad idea. Now, there are some systems, like Python...

-Dinesh Agarwal 13:05
-Sorry to interrupt, but that's a very interesting statement. We understand there may be some misguided code in the code base. Is there any guidance, any guideline, for how we decide that a function or a piece of code is misguided?

-Dave Abrahams 13:27
-Well, it's a very simple criterion: if the code responds to bugs (in other words, to misuse of the code, precondition violations) by throwing an exception, that's misguided.

-Dinesh Agarwal 13:49
-But we would only know that once we've run it multiple times. Say there's a dynamic library being loaded that we haven't run multiple times. Is there any guidance you'd like to share? Maybe some sanity checking we should do on that code before relying on it?

-Dave Abrahams 14:13
-I probably shouldn't assume this, but my basic assumption is that the components you use have documented APIs: they tell you what they're going to do. Although, when there's a precondition failure, there are no obligations to do anything, so I guess discovering these misguided things probably ends up being a product of auditing, or of observing these kinds of misguided exceptions at runtime. (See the sketch below for the kind of boundary I have in mind.)
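+Here's a sketch of turning a component's misguided bug-reporting exceptions into a controlled stop at the boundary where you call it; `call_component` and `handle_program_error` are hypothetical stand-ins.

+    #include <cstdlib>
+    #include <iostream>
+    #include <stdexcept>
+
+    void call_component() {  // stand-in for a third-party call that throws on misuse
+        throw std::invalid_argument("negative size");
+    }
+
+    void handle_program_error(const char* what) {
+        std::cerr << "program error: " << what << "\n";
+        // e.g. save a recovery document here, then stop
+        std::abort();
+    }
+
+    void boundary() {
+        try {
+            call_component();
+        } catch (const std::logic_error& e) {  // covers domain_error, invalid_argument, ...
+            handle_program_error(e.what());    // a bug: don't let it propagate as a "failure"
+        }
+    }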
-Dinesh Agarwal 15:09
-Got it.

-Sean Parent 15:14
-One comment: your list there is pretty good, and a lot of people will just inherit from std::logic_error for anything that's a bug. So that's a good case: if you catch a std::logic_error, you might want to treat it as fatal. All the places in Photoshop where it puts up a dialog box that says "a program error has occurred": it's fine to tell the user that, but the next thing should be to save out a recovery document and exit.

-Dave Abrahams 15:50
-Yeah. That said, you shouldn't make the assumption that every component you use is going to be misguided, or you'll have try/catch blocks all over the place. If you write your code right, try/catch blocks should be extremely rare, mostly only at the top level. One exception: sometimes you have an otherwise unmanaged resource and you need to clean it up. Usually you can deal with that by managing it in a destructor, but for something like initializing uninitialized memory, if an exception is thrown while you're doing that, you might need a catch block to go and deinitialize all of the elements you've already initialized. So: very rare.
-And all that said, I also want to acknowledge that there are some systems, like Python, where using exceptions to report bugs is just part of the fabric of the system. In fact, in Python they use exceptions to exit loops, which is a little alarming to some people. In Python you just can't use this rule that if you see one of these things, you stop the program; you have to let it propagate.

-Josep Valls 17:33
-Hi, yes; I mostly program in Python, so that's me. But even in Java, when networking is involved, there are lots of things that are very commonly reported as exceptions by mainstream libraries, anything from a timeout to temporary resource access.

-Dave Abrahams 17:34
-Are those bugs?

-Josep Valls 17:58
-No, those aren't bugs. But these exceptions seem to be handled in a try/catch, so we have lots of them.

-Dave Abrahams 18:05
-Well, not immediately. Generally the pattern is (remember, exceptions are for nonlocal error handling; they're generally for things that can't be responded to by the immediate caller) that in the general case there's one try/catch block, sort of at the top level of the application, that catches all of the things that propagate out of the operations that fail.

-Josep Valls 18:51
-But for simple things like a retry, would that be a good excuse for a local try/catch where the library throws?

-Dave Abrahams 18:57
-Yeah, right. If you have to retry a network operation, that's what I've been calling a local failure, and you might have a component that misguidedly uses exceptions to report local failures, in which case, yes, you do need a local try/catch. But as I also said in the previous section, local failures are far and away much more rare than nonlocal failures.
-There are just a few low-level functions that need to report local failures. So if you get a component that reports a local failure with an exception, what you can do is put a little wrapper around it, use that wrapper everywhere, and make the wrapper report the error differently. Which leads to my next piece of advice: don't use exceptions for local failures; they're not optimized for that. (There's a sketch of such a wrapper after this exchange.)

-Josep Valls 20:13
-Yeah.

-Dave Abrahams 20:13
-Does that help?

-Josep Valls 20:15
-Yes. I guess I'll have more context after I've had time to process things.

-Dave Abrahams 20:22
-OK, we can come back to that; there will be time for questions at the end. Are there other hands we should deal with?

-David Sankel 20:30
-We've got one more hand in the queue. Izzy?

-Izzy Muerte 20:32
-Yeah, so this isn't actually a question, just a small note: in the same shift in philosophy people are mentioning in the chat, Python has also been moving towards the approach that we shouldn't be throwing exceptions everywhere. In recent versions they've made optimizations to the internal compiler of the CPython runtime so it doesn't actually throw StopIteration and the other exceptions that are used for control logic; in recent years they've discovered ways to optimize that. So they're actually starting to shift away from it. They can't get rid of the behavior, unfortunately, because of 30-plus years of it, but it's now the worst-case fallback for what happens in Python.

-Dave Abrahams 21:26
-That's good to know. Yeah, I strongly suspect they're also not separating the bug case from the failure case, so they're going to keep reporting invalid arguments and other bugs to you using exceptions.

-Izzy Muerte 21:48
-That has been discouraged for new types that go into Python's stdlib. There are still some functions in the stdlib that do it, but you'll see more of a TypeError (if you pass the wrong number of arguments, obviously) or an assertion error. Rarely these days do you get a ValueError, except from the built-in types, because they've just had those for decades at this point.

-Dave Abrahams 22:11
-Yeah. So, from the perspective of what I'm saying in this talk, TypeError and argument errors and all of those things are equivalent: they're exceptions thrown to indicate precondition failures, failures of the caller to do the right thing.

-Izzy Muerte 22:33
-Right. That's partially a result of Python's dynamic execution rather than static typing.

-Dave Abrahams 22:39
-Yeah. You have an interactive interpreter, and when you hit a bug you need to be able to get back to the prompt, and they use exceptions to do that. If it were me, I would prefer that there were some parallel but different mechanism, so that I could keep the handling of those things separate; but I understand why they only have one.
-OK, so: next piece of advice. Don't use exceptions for local failures. They're optimized for the patterns of handling a problem far from its source.
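+Returning to the wrapper idea above, here's a sketch of wrapping a component that misguidedly reports a *local* failure by throwing, so the rest of the code sees it as a return value. The `send_packet_or_throw` API is hypothetical.

+    #include <cstddef>
+    #include <system_error>
+
+    void send_packet_or_throw(const char* /*data*/, std::size_t /*size*/) {
+        // stand-in for a third-party call; pretend the network timed out
+        throw std::system_error(std::make_error_code(std::errc::timed_out));
+    }
+
+    // The wrapper we use everywhere: no catch blocks at call sites.
+    std::error_code send_packet(const char* data, std::size_t size) noexcept {
+        try {
+            send_packet_or_throw(data, size);
+            return {};
+        } catch (const std::system_error& e) {
+            return e.code();  // e.g. a temporary condition the caller can retry locally
+        }
+    }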
-If you use them for local failures, that means you're going to write a lot more catch blocks, which increases the complexity of the code. It's usually easy to tell whether a failure is local or not; just think about what a typical client is going to have to do. But if you're writing a function and you really can't guess whether its failure is going to be handled locally or not, maybe you should consider writing two functions, one that reports its failure using some other mechanism; one can call the other, so you don't need to reimplement it.
-OK, next: consider the performance implications of throwing. Most languages actually aren't like this, but C++ implementations are usually biased really heavily towards optimizing the non-failure case, so that handling of a failure runs one or two orders of magnitude slower than code that's not handling failure. That tends to be a really great trade-off, because it allows them to skip the explicit checks and all the branch-prediction misses and other costs associated with checking for the error case on the hot path in the code; this is what's meant by "zero-cost exception handling," if you've heard that term. And nonlocal failures are rare in terms of the number of instructions executed, and they don't happen repeatedly inside tight loops. But that also means that if you're writing a tight loop on a really hot path, you don't want to repeatedly throw exceptions in there and catch them. And if you're writing a real-time system, you might really want to think twice about using exceptions at all, because it might be hard to predict the amount of slowdown that happens in those rare cases where an exception is actually thrown.
-So I have an example that I think is useful. I was one of the founding members of Boost and was involved in the design of the Boost Graph Library, and when we were discussing that design, we realized that occasionally a particular use of a graph algorithm might want to stop early. For example, Dijkstra's algorithm finds the shortest paths from a starting point, discovering them in order from shortest to longest; but suppose you want to find the 10 shortest paths and then stop. The way the algorithms work, you pass them a visitor object that gets notified about results as they are discovered. You can think of the algorithm as a loop that calls the visitor every time it finds a new path, and in fact there are lots of notification points for various intermediate conditions, not just for finding a complete path. If we were going to handle this early-stop thing explicitly, we would need to generate an explicit test in the algorithm code after each of these points in the algorithm's inner loop. So instead of doing that, which would both make the algorithm harder to read and cost performance for branching, we decided to take advantage of C++'s bias toward optimizing the non-failure case: a visitor that wants to stop early can just throw an exception.
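+This is not the actual Boost.Graph API, just a sketch of the "visitor throws to stop early" pattern described above, with made-up names.

+    #include <cstdio>
+    #include <vector>
+
+    struct StopSearch {};  // thrown by a visitor that has seen enough
+
+    template <class Visitor>
+    void for_each_path_by_length(const std::vector<int>& paths, Visitor visit) {
+        for (int p : paths)  // stand-in for the algorithm's inner loop
+            visit(p);        // no explicit "should we stop?" test needed here
+    }
+
+    int main() {
+        std::vector<int> paths{1, 2, 3, 4, 5, 6};
+        int count = 0;
+        try {
+            for_each_path_by_length(paths, [&](int p) {
+                std::printf("path of length %d\n", p);
+                if (++count == 3) throw StopSearch{};  // stop after the 3 shortest
+            });
+        } catch (const StopSearch&) { /* stopped early on purpose */ }
+    }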
-To be perfectly fair, I don't think we ever benchmarked the effects of this choice, so it might actually have been wrong from an optimization point of view in the end. But it was at least plausibly right, and there's nothing wrong in principle with using an exception for that, if it actually gets you a performance win.
-Finally, you might also need to consider development culture and the way your team uses their tools. Some people set up their debuggers to stop whenever an exception is thrown, and if you're on a team where that's an important practice, you might need to take some extra care not to throw when there's an alternate path to success; some developers get upset when code stops in a case that will eventually succeed.
-OK, so enough about exceptions. Finally we come to the good part; this was originally going to be the focus of the entire talk. I want to talk about the obligations of the failing function and of its caller. The question is: what do you put in the contract of a function that could fail, and what does each side, the caller and the callee, need to do to ensure correctness?
-First, the callee. There's a documentation obligation: you have to document any local failures and what they mean, because you're going to report them as part of the return value. Nonlocal failures you want to document at their source, but not where they're merely propagated from other functions that you use. The problem is that if you document them where they're propagated, you have the same problem as if you'd included the details of the failure in the type information, which we talked about last time: it creates a lot of churn, as failure reasons that don't really change anything about the code end up changing the documentation of all of your functions.
-In code, if you're the callee and you have any unmanaged resources you've allocated, say you've opened a temporary file, you need to make sure those are released. The other example I had is the uninitialized memory you're initializing: the lifetimes of the objects you've put into that memory are a resource, and those objects need to be deinitialized.
-Now, there's an optional thing that can be really useful if you're a mutating function, and that is to consider saying that you're transactional: that if there is a failure, the function has no effects. That's often called the strong guarantee. It can be a really useful guarantee to give when it falls out of the implementation, or at least out of an efficient implementation, but you don't want to do this if it adds performance cost. For example, the simplest way to give a transactional guarantee on a function that mutates data is to do what I call copy-and-swap: first you make a copy of the data, you mutate the copy in place, and only when that succeeds do you swap it back into place over the original data (there's a sketch of this below). Sure, that ends up being transactional, but you pay the cost of making a full copy of the data, and you don't want to do that preemptively, because often your caller doesn't need that strong guarantee.
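+A minimal sketch of copy-and-swap, with a hypothetical `Document` type and `risky_edit` operation:

+    #include <stdexcept>
+    #include <utility>
+    #include <vector>
+
+    struct Document { std::vector<int> layers; };
+
+    void risky_edit(Document& d) {
+        d.layers.push_back(42);
+        if (d.layers.size() > 1000) throw std::runtime_error("out of space");
+    }
+
+    // Either the edit fully succeeds, or `doc` is left exactly as it was.
+    void edit_transactionally(Document& doc) {
+        Document copy = doc;   // pay for a full copy up front
+        risky_edit(copy);      // if this throws, `doc` is untouched
+        using std::swap;
+        swap(doc, copy);       // cheap, non-throwing commit
+    }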
-And what happens when components get composed? If all components do that copy-and-swap thing, now you have an exponential increase in cost, where at every level you're making copies. It's sort of the same reason we don't do object-level locking for concurrency, why we don't put a mutex in every object: clients might need transactionality at a different level. Your component might be part of a bigger component that needs transactionality on the whole thing, and then the locking, or the copying, inside your individual component is a waste.
-To get the strong guarantee, sometimes you can do it just by reordering the operations you're performing. For example, if you do all of the things that can fail before you make any mutations that are visible to clients, now you have the strong guarantee. The simple example is: do all your memory allocations up front, then make changes that can't throw, and it's transactional. So that's a useful thing to document when you can get it.
-Now, the caller. The caller's obligation is to discard any partially completed mutations to program state. If the caller is just calling a non-mutating function and it throws, they don't have to do anything; they can just allow the failure to propagate, unless they happen to have some recovery strategy. But as I hope I already said, having a local recovery strategy is really rare; that usually means it's a local failure and the function shouldn't have been throwing in the first place. If you pass something to the function and the function is going to mutate that thing, you need to make sure that thing gets discarded, unless the function has given you the strong transactional guarantee, in which case it still has its original meaning and its original value.
-When I say "discard partial mutations to program state," we have to talk about what counts as program state. That's data that can have an observable effect on the future behavior of your code. For example, if you have a log file that you're just streaming information into, that doesn't count as program state, because you never read it; you never change the program's behavior based on what's gone in there.
-So how do you arrange to discard partially mutated state? Well, there's really only one strategy that scales up in practice when mutations can fail, aside from the strategy of never mutating anything, and arguably that doesn't scale up either, because of the costs of copying. If you're not writing in a pure functional language like Haskell, which most of us aren't, you have mutation, and mutation can fail. So how do we manage discarding these partial mutations? Normally, the only strategy I've found that scales up is to propagate the responsibility for discarding the partial mutation all the way up to the top of the application. And what that means is that at the top level you do have to take the copy-and-swap strategy: you essentially mutate a copy of the existing data, and only replace the old copy when the mutation succeeds. But if you have a large data structure, that could be really expensive, right?
-To we we can't afford to copy an entire Photoshop document every time we make a change. -Well, actually we can, right? -And why? -Why is that the we do we actually do it? -And it's possible because Photoshop documents are essentially a persistent data structure. -You know, persistent, that's a confusing name because it doesn't have anything to do with persistence in the usual sense. -A persistent data structure is 1 where a partial mutation of a copy ends up sharing a lot of storage with the original. -So we store in Photoshop separate document for each state in the undo history. -But these copies share storage for any parts that weren't mutated between revisions, and this sharing behavior falls out naturally when you compose your data structure from copy on write parts. -So the original copy has basically 0 cost. -It's about, you know, bumping a reference count and then when you start to make changes, something is checking the reference count saying ohh if there's more than one reference. -Now I need to copy that part of the data that's changing. -So everybody follow that one, make sure I. -Yeah. -So there are some hints that's. -Let's hear from those. - -Stephen DiVerdi 40:05 -Hey. -Yeah. -Thanks Dave. -And and question and if this is harping on a previous topic then then let me know when we can just skip it. -But what I'm wondering is, it seems like what you just described about this mechanism for copying, mutating, and then replacing with the ability to handle local failures and robust to local failures also works for being robust to local errors. -And so I don't, I guess I still don't understand why that wouldn't be preferable to handle errors within that same framework of mutating a copy and then replacing it a transactional manner instead of crashing the application. - -Dave Abrahams 40:45 -Well. -This is. -Is a good question. -If you really have, if you really have data isolation and and you know that the only thing being mutated is this is this copy that will be discarded. -I think you might be safe. -To continue. -Uh, Sean? - -Sean Parent 41:26 -Yeah, I agree with that that the question is, do you really have data isolation? -And my answer would be, you know, in in C++, almost certainly not. -Ohm so. - -Dave Abrahams 41:41 -Yeah. - -Sean Parent 41:41 -Yeah. - -Dave Abrahams 41:41 -There, there, there's there's. -There's often, usually there's something that's being mutated that isn't the that isn't just the document state and and that that. -Whose? -Whose mutation isn't going to get undone by by discarding the partially mutated document state. -For example, you might have a queue of background operations, right? -Things get added to that queue. -Right. -And we don't have a we don't have a way to rollback that ad and some mutation fails. +Debuggers will commonly stop at the assertion rather than exiting, and even if you're not running in the debugger, on major desktop OSes, you'll get a crash report with the entire program state that can be loaded into a debugger. So this is great for catching bugs early, before they get shipped, provided people use it. -Stephen DiVerdi 42:30 -OK. Thanks. +Projects commonly disable assertions in release builds, which has the nice side-effect of making programmers comfortable adding lots of assertions, because they know they won't slow down the release build. And more bugs get caught early. 
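+
+To make that concrete, here's a minimal sketch of the usual pattern in Swift—`debugOnlyCheck` is a name I'm inventing for illustration, not a standard library function, and it assumes the common convention that a `DEBUG` compilation condition is defined only in debug builds:
+
+```swift
+/// Stops the program if `condition` is false, but only in debug builds.
+func debugOnlyCheck(
+    _ condition: @autoclosure () -> Bool,
+    _ message: @autoclosure () -> String = "check failed"
+) {
+    #if DEBUG
+    if !condition() { fatalError(message()) }
+    #endif
+}
+```
+
+Because the arguments are autoclosures, a release build never even evaluates the condition, which is exactly what makes people comfortable sprinkling these checks everywhere.
+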
-Dave Abrahams 42:38 -I guess I guess another another issue is like so part of the way we get this copy on write behavior with Photoshop is using the VM system with the you know which is. -A bunch of copy on write tiles essentially. -So if a bug were detected in that. -That would that would undermine the guarantees of the that you get from having copy on, right? -Right. -So the the the real problem with with bugs is you can't count on the systems that that uh normally give you this. -This their recovery property. -Uh, David sankel. - -David Sankel 43:29 -Yeah, I was just going to say that, you know, if if a bug is detected. -You all you know is that a bug is. -Detective, you have. -You have no idea of what the nature of the bug is. -I mean it could be corrupted memory, so even if you try to take your you know copy on write Das structure and discard it, it could be the old thing got messed up somehow because of some random thing. -You really, if you're really know nothing about the nature of a bug when it happened. -So the idea of recovering from it is. -Umm, it's not really sound I think. - -Dave Abrahams 44:07 -Yeah. -So I mean I have to like we have to think about, we have to think about the the nature of the environment in which we're running, so. -So I if if I think about you know how this would how this would play out in Hilo or in Rust? -Umm, you know, provided the bug didn't occur in unsafe code which has to be very carefully vetted. -Then you really know what stuff you're you're you're mutating. -When you do a mutation right and so you've really wouldn't know that the original state was was intact. -The problem with C++ is that. -We don't really have those kinds of protections and when there's a bug, it very typically leads to undefined behavior, which very typically could corrupt your old state, right? -It's undefined behavior and other words, it can do anything. -That's one of the you know, if you look at the C standard and and find all of the places where it says the behavior is undefined and you know there are lots of those and a lot of a lot of them apply in many, many places like there are statements like you know if any argument to a standard library function violates a precondition, the behavior is undefined. -Right. -And so that's the problem with with the C++ environment. -So it can really can undermine all of the guarantees that you would be getting from something like like copy on write system. -OK. -Umm. -Moving on. -Uh, I can't see if there are any more hands because I've got a window covering it, but I think there aren't good. -Umm OK so. -Yeah, some, some, some last advice. -Uh that I just added. -Uh. -About what to do when an assertion fires? -Umm. -And this is because especially what not to do because we see this a lot. -So first of all, don't remove the assertion, because the program seems to work when you take it out right the. -That's that's just the case you've tested. -Right. -And the what? -The assertion is saying. -Usually it's usually a precondition check with the assertion is if it's a precondition check, the assertion is saying. -The the owner of the function is saying you did something for which I'm not guaranteeing any particular result. -I don't know what was what result you should expect to get under these conditions. -Right. -So just taking it out doesn't make the program work. -The there are probably some effects that you aren't able to observe that that put the program in a broken state. -Uh. 
-Another thing not to do is don't go to the owner of the assertion and complain that they're crashing the program. -Remember the an assertion is a controlled shutdown in response to a detected bug, right? -And the first thing you need to do is to understand what kind of check is being performed, right? -So if it's a precondition check in someone else's component, that's probably your bug. -You're probably calling that that component in the wrong way. -Another possibility is that it's a self check, right? -What people often called sanity check, although we we try not to use that term anymore these days. -Umm. -Or it's a post condition check. -Uh, and in those cases, you want to talk to the the owner of the code about why the assumptions might have been violated. -That is, is very possibly a bug in the code that you're using, but this just reminds us why it's important to have different kinds of assertion macros or functions that tell you what their purpose is so that when they fire, people know what to do about them. -Uh, and my last bit of advice is you probably don't wanna use assertions. -Umm, you know the same functions you use for checking preconditions for doing your unit tests. -One reason is people often uh. -So typically when a unit test failure occurs, you don't go ahead using the same data, right? -You typically throw it out and go on to another test, and the other test you know uses fresh data and so that hasn't invalidated the rest of your testing. -People would like to hear about all of the test failures rather than just the just the first one. -So the assertions that exit the program aren't really appropriate there. -You want a different suite for doing those kind of checks, and that's all I've got for you. -Ready to open the floor to questions. +But unless you really believe you're shipping bug-free software, you might want to leave most assertions on in release builds. In fact, the security of your software might depend on it. If you're programming in an unsafe language like C++, opportunities to cause undefined behavior are all around you. When you can assert that the conditions for avoiding undefined behavior are met before executing the dangerous operation, the program will come to a controlled stop instead of opening an arbitrarily bad security hole. + +The problem with leaving assertions on in release is that some checks are too expensive to ship. And let's be honest; many programmers will go with their gut, instead of measuring, when making that determination. We really need a second, expensive_assert(), that's only on in debug builds, so we continue to catch those bugs early. + +There's another problem with having just one assertion: it doesn't express sufficient intent. For example, it might be a precondition check, or the asserting function's author might just be double-checking their own reasoning. When these two assertions fire, the meaning is very different: the first indicates a bug in the caller, the other one is a bug in the callee. So I really want separate precondition and self_check functions. + +If I'm writing in a safe-by-default language like Rust or Swift, the checks that prevent undefined behavior, like array bounds checks, are special: I can afford to turn off all the other checks in shipping code, but these checks are the ones upholding safety properties of my system are compromised. So I want a different assertion for these checks, even if I don't ever anticipate turning off the other ones in a shipped product. 
These are the ones that we can't delete from the code. I might want to turn the other assertions off locally to measure how much overhead they are incurring.
+
+I hope you get the idea. I'm not going to prescribe the exact set of assertion facilities your project needs, but a carefully engineered suite of these functions with properties appropriate to your project is part of a comprehensive strategy for dealing with bugs. If you haven't got one, go design it.
+
+One last point about the C++ assert: it's better than nothing, but because it calls abort(), there's no place to put emergency shutdown measures. You can't even display a message to the user, so to the user it will always feel like a hard, unceremonious crash. You probably want failed assertions to call terminate() instead, because it allows terminate handlers to run. So that's another reason to engineer your own assertions, even if you build just one.
+
+## What if you're not allowed to terminate?
+
+Fight for the right (to terminate). If the system is critical, advocate creating a recovery system outside the process.
+If you lose today
+Fail as noisily as possible, preferably by terminating in non-shipping code.
+Keep fighting
+Be prepared to win someday. That means use a suite of assertions that don't terminate, but whose behavior you can change when you win the fight.
+
+# Failures
+
+OK, as much as we all love bugs, it's time to leave them behind and talk about failures. Let's say you identify a condition X where your function is unable to fulfill its primary purpose. That can occur one of two ways:
+
+
+Something your function calls has a precondition that you're not sure would be satisfied.
+Something your function calls can itself report a failure.
+
+You usually have two choices at this point:
+Make !X a precondition; X reflects a bug in the caller.
+Make X a failure; all the code is correct.
+
+It's counterintuitive, but you should always prefer to classify X as a bug, as long as !X satisfies the criteria for preconditions:
+It is possible to ensure !X. For example, there's no way for the caller to ensure there's enough disk space to save a file, because other processes can use up any space that might have been free before the call. So you can't make “there's enough disk to save” a precondition.
+Ensuring !X is considerably less work than the work done by the callee. For example, if the callee is deserializing a document and finds that it's corrupted, you can't make it a precondition that the file is well-formed, because determining whether it is or not is basically the same work as doing the deserialization.
+
+## Definition
+
+ Failure: inability to satisfy a postcondition in correct code.
+
+So why am I tying this definition to postconditions other than to bind our understanding of error handling to our understanding of correctness?
+
+First of all, it simplifies and improves understandability of contracts. This is easiest to see if you have a dedicated language mechanism for error handling:
+
+** Note: fictional programming language **
+
+// Returns `x` sorted in `order`, or throws an exception
+// in case order fails.
+fn sorted(x: [Int], order: Ordering) throws -> [Int]
+
+// Returns `x` sorted in `order`.
+fn sorted(x: [Int], order: Ordering) throws -> [Int]
+
+Even if you feel you need to say something about possible failures, that becomes a secondary note that's not essential to the contract.
+
+// Returns `x` sorted in `order`.
+//
+// Propagates any exceptions thrown by `order`.
+fn sorted(x: [Int], order: Ordering) throws -> [Int]
+
+A programmer can know everything essential from the summary fragment and the signature. Another way this separation plays nicely with exceptions is that you can say the postcondition of a function describes what you get when it returns, and a throwing function never returns.
+
+If you don't use exceptions, you still get simplified contracts as long as you have dedicated types to represent the possibility of failure.
+
+// Returns `x` sorted in `order`.
+fn sorted(x: [Int], order: Ordering) -> ResultOrFailure<[Int]>
+
+Separating the function's primary intention from the reasons for failure makes sense, because the reasons for failure matter less. If that's not obvious yet, some justification is coming.
+
+Another reason to exclude the failure case from the postcondition is that you want postconditions to be solid and fully described, but a mutating operation that fails often leaves behind a state that's very difficult to nail down and, as I said in the *Contracts* chapter, one you usually don't want to nail down, because it's a detail nobody cares about. But if it's part of the postcondition, you need to say something about it, and that further complicates the contract.
+
+// Sorts `x` according to `order` or throws an exception
+// if `order` fails, leaving `x` modified in unspecified
+// ways.
+fn sort(mutating x: [Int], order: Ordering) throws
+
+// Sorts `x` according to `order`.
+fn sort(mutating x: [Int], order: Ordering) throws
+
+## Two kinds of failures
+
+If you've spent some time writing code that carefully handles failures, especially in a language like C where all the error propagation is explicit, failures start to fall into two main categories: local and non-local, based on where the recovery is likely to happen.
+
+Local recovery occurs very close to the source of failure, usually in the immediate caller, in a way that often depends heavily on the reasons for the failure. In many cases, the recovery path is performance-critical.
+
+**Example**: you have an ultrafast memory allocator that draws from a local pool much smaller than your system memory. You build a general-purpose allocator that first tries your fast allocator, and only if that allocation fails, recovers by trying the system allocator.
+
+**Example**: the lowest-level function that tries to send a network packet can fail for a whole slew of reasons (https://www.ibm.com/docs/en/zos/2.3.0?topic=codes-sockets-return-errnos), some of which may indicate a temporary condition like packet collision. 99% of the time, the immediate caller is a higher-level function that checks for these conditions and, if it finds them, initiates a retry protocol with exponential backoff, only itself failing after N failed retries. That lowest-level failure is local. The failure after N retries is very likely to be non-local.
+
+Non-local recovery, which is far more common, occurs far from the source, usually in a way that can be described without reference to the reasons for failure. For example, when you're serializing a complex document, serializing any part means serializing all of its sub-parts, and parts are ultimately nested many layers deep. Because you can run out of space in the serialization medium, every step of the process can fail. If you write out the error propagation explicitly, it usually looks like this:
+
+// Writes `s` into the archive.
+fn serialize_section(s: Section) -> MaybeFailure
+{
+    var failure: Optional = none;
+
+    failure = serialize_part1(s.part1);
+    if failure != none { return failure; }
+
+    failure = serialize_part2(s.part2);
+    if failure != none { return failure; }
+
+    ...
+
+    return serialize_partN(s.partN);
+}
+
+After every operation that can fail, you're adding “and if there was a failure, return it.”
+
+There are many layers of this propagation. None of it depends on the details of the reasons for failure: whether the disk is full or the OS detects directory corruption, or serialization is going to an in-memory archive and you run out of memory, you're going to do the same thing. Finally, where propagation stops and the failure is handled—let's say this is a desktop app— again, the recovery is usually the same no matter the reasons for the failure: you report the problem to the user and wait for the next command.
+
+### Interlude: Exceptions?
+
+Way back in 1996 I embarked on a mission to dispel the widespread fear, loathing, and misunderstanding around exceptions. Yes I'm old. While I've seen some real progress on that over the years, I know some of you out there are still not all that comfortable with the idea. If you'll let me, I think I can help.
+
+#### Just control flow
+
+Cases like this are where the motivation for exceptions becomes really obvious. They eliminate the boilerplate and let you see the code's primary intent:
+
+// Writes `s` into the archive.
+fn serialize_section(s: Section) throws {
+    serialize_part1(s.part1);
+    serialize_part2(s.part2);
+    ...
+    serialize_partN(s.partN);
+}
+
+There's no magic. Exceptions are just control flow. Like a switch statement, they capture a commonly needed control flow pattern and eliminate unneeded syntax.
+
+To grok the meaning of this code in its full detail, you mentally add “and if there was a failure, return it” everywhere. But if you push failures out of your mind for a moment you can see that how the function fulfills its primary purpose leaps out at you in a way that was obscured by all the failure handling. The effect is even stronger when there's some control flow that isn't related to error handling.
+
+#### Also, type erasure
+
+OK, I lied a little when I said exceptions are just control flow. There's one other big difference between the exception version and the explicit version: the exception version erases the types of the failure data, and catch blocks are just big type switches with dynamic downcasts.
+
+Lots of us are “static typing partisans,” so at first this might sound like a bad thing, but remember, as I said, none of the code propagating this failure (or even recovering from it usually) cares about its details. What do you gain by threading all this failure information through your code? When the reasons for failure change, you end up creating a lot of churn in your codebase updating those types.
+
+In fact, if you look carefully at the explicit signature, you'll see something that typically shows up when failure type information is included: people find a way to bypass that development friction.
+
+fn serialize_section(s: Section) -> MaybeFailure
+
+Here an “unknown” case was added that is basically a box for any failure type. This is also a reason that systems with statically checked exception types are a bad idea. Java's “checked exceptions” are a famously failed design because of this dynamic.
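+
+Swift's traditional untyped `throws` shows the same type erasure working in our favor. Here's a minimal sketch—`Section`, `serializePart`, and `runSaveCommand` are stand-ins I'm inventing for illustration, not an API from this book:
+
+```swift
+struct Section { var parts: [String] }      // stand-in for a real document type
+
+func serializePart(_ part: String) throws { // can fail, e.g. when the disk fills up
+    // ...
+}
+
+// Every step can fail, but nothing on the propagation path names an error type.
+func serializeSection(_ s: Section) throws {
+    for part in s.parts { try serializePart(part) }
+}
+
+// Only the top level looks at the failure, and even there it's handled
+// uniformly: `error` is just a type-erased `Error`.
+func runSaveCommand(_ s: Section) {
+    do {
+        try serializeSection(s)
+    } catch {
+        print("Couldn't save: \(error)")
+    }
+}
+```
+
+The propagation code never has to change when new failure reasons appear; that's exactly the property statically typed failure information puts at risk.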
+ +Swift recently added statically-typed error handling in spite of this lesson that should be well-understood to language designers, for reasons I don't understand. There was great fanfare from the community, because, I suppose, everybody thinks they want more static type safety. I'm not optimistic that this time it's going to work out any better. + +The moral of the story: sometimes dynamic polymorphism is the right answer. Non-local error handling is a key example, and the design of most exception systems optimize for that. + +### When (and when not) to use exceptions + +There's a lot of nice sounding advice out there about this that is either meaningless or vague, like “use exceptions for exceptional conditions,” or “don't use exceptions for control flow.” I know that one is really popular around Adobe, but c'mon: if you're using exceptions, you're using them for control flow. I hope to improve on that advice a little bit. + +First of all, you can use exceptions for things that aren't obviously failures, like when the user cancels a command. An exception is appropriate because the control flow pattern is identical to the one where the command runs out of disk space: the condition is propagated up to the top level. In this case recovery is slightly different: there's nothing to report to the user when they cancel, but all the intermediate levels are the same. It would be silly to explicitly propagate cancellation in parallel with the implicit propagation of failures. + +But if you make this choice, I strongly urge you to classify this not-obviously-a-failure thing as a failure! Otherwise you'll undo all the benefits of separating failures from postconditions, and you'll have to include “unless the user cancels, in which case…” in the summary of all your functions. So in the end, my broad advice is, “only use exceptions for failures (but be open minded about what you call a failure).” Actually, even if you're not using exceptions, any condition whose control flow follows the same path as non-local failures should probably be classified as a failure. + +Another prime example is the discovery of a syntax error in some input. In the general case, you are parsing this input out of a file. I/O failures can occur, and will follow the same control flow path. Classifying your syntax error as a failure and using the same reporting mechanism is a win in that case. + +Next, don't use exceptions for bugs. As we've said, when a bug is detected the program cannot proceed reliably, and throwing is likely to destroy valuable debugging information you need to find the bug, leave a corrupt state, open a security hole, and hide the bug from developers. Even though the “default behavior” of exceptions is to stop the program, throwing defers the choice about whether to actually stop to every function above you in the call stack. This is not a service, it's a burden. You've made your function harder to use by giving your clients more decisions to make. Just don't. + +That also means if you use components that misguidedly throw logic_errors, domain_error, invalid_argument, length_error or out_of_range at you, you should almost always stop them and turn them into assertion failures. All that said, there are some systems, like Python, where using exceptions for bugs (to say nothing of exiting loops!) is so deeply ingrained that it's unavoidable. In python you have to ignore this rule. + +Don't use exceptions for local failures. As we've seen, exceptions are optimized for the patterns of non-local failures. 
Using them for local failures means more catch blocks, which increase code complexity. It's usually easy to tell what kind of failure you've got, but if you're writing a function and you really can't guess whether its failure is going to be handled locally, maybe you should write two functions. + +Next, consider performance implications. Most languages aren't like this, but most C++ implementations are usually biased so heavily toward optimizing the non-failure cases that handling a failure runs one or two orders of magnitude slower. Usually that's a great trade-off because it allows them to skip checking for the error case on the hot path, and non-local failures are rare and don't happen repeatedly inside tight loops. But if you're writing a real-time system for example, you might want to think twice. + +Here's an example that might open your mind a bit: when we were discussing the design of the Boost C++ Graph Library, we realized that occasionally a particular use of a graph algorithm might want to stop early. For example, Dijkstra's algorithm finds all the paths from A to B in order, from shortest to longest. What if you want to find the ten shortest paths and stop? The way this library's algorithms work, you pass them a “visitor” object that gets notified about results as they are discovered. And in fact there are lots of notification points for intermediate conditions, not just “complete path found,” so if we were going to handle this early stop explicitly, we'd generate a test after each one of these points in the algorithm's inner loop. Instead, we decided to take advantage of the C++ bias toward non-failures. We said a visitor that wants to stop early can just throw. Now in fairness, I don't think we ever benchmarked the effects of this choice, so it might have been wrong in the end. But it was at least plausibly right. + +Finally, you might need to consider your team's development culture and use of tooling. If people typically have their debuggers set up to stop when an exception occurs, you might need to take extra care not to throw when there's an alternate path to success. Some developers tend to get upset when code stops in a case that will eventually succeed. + +## How to Handle Failure + +OK, enough about exceptions. Finally we come to the good part! Seriously, this was originally going to be the focus of the entire talk. + +Let's talk about the obligations of a failing function and of its caller. What goes in the contract and what does each side need to do to ensure correctness? + +### Callee + +Documentation: +Document local failures and what they mean. +Document non-local failures at their source, but not where they are simply propagated. That information can be nice to have, but it also complicates contracts and is a burden to propagate and keep up-to-date. + +Code: +Release any unmanaged resources you've allocated (e.g. close temporary file). + +#### Optional + +If mutating, consider giving the strong/transactional guarantee that if there is a failure, the function has no effects. + +Only do this if it has no performance cost. Sometimes it just falls out of the implementation. Sometimes you can get it by reordering the operations. For example, if you do all the things that can fail before you mutate anything visible to clients, you've got it. + +Don't pay a performance penalty to get it because not all clients need it and when composing parts all the needless overheads add up massively. 
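+
+To illustrate the reordering trick, here's a small Swift sketch—`Document`, `Page`, and `makePage` are hypothetical types and functions invented for this example:
+
+```swift
+struct Page { /* ... */ }
+struct Document { var pages: [Page] = [] }
+
+enum ImportError: Error { case malformed }
+
+func makePage(from bytes: [UInt8]) throws -> Page {
+    guard !bytes.isEmpty else { throw ImportError.malformed }
+    return Page()
+}
+
+/// Appends a page for each element of `inputs` to `document`.
+///
+/// All the work that can fail happens before the first visible mutation,
+/// so if an error is thrown, `document` is untouched: the strong guarantee
+/// falls out for free.
+func appendPages(_ inputs: [[UInt8]], to document: inout Document) throws {
+    let newPages = try inputs.map(makePage(from:)) // may throw; nothing mutated yet
+    document.pages.append(contentsOf: newPages)    // can't fail
+}
+```
+
+Note that nothing here required copying `document`; the guarantee comes purely from doing the fallible work up front.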
+ +### Caller + +- Discard any partially-completed mutations to program state or propagate the error and that responsibility to your caller. This partially mutated state is meaningless. + +What counts as state? Data that can have an observable effect on the future behavior of your code. Your log file doesn't count. + +#### Implications as data structures scale up + +The only strategy that really scales in practice, when mutation can fail, is to propagate responsibility for discarding partial mutations all the way to the top of the application. That in turn implies mutating a copy of existing data and replacing the old copy only when mutation succeeds. Either way, you probably end up with a persistent data structure (which is a confusing name—it has nothing to do with persistence in the usual sense). + +A persistent data structure is one where a partial mutation of a copy shares a lot of storage with the original. For example, in Photoshop, we store a separate document for each state in the undo history, but these copies share storage for any parts that weren't mutated between revisions. This sharing behavior falls out naturally when you compose your data structure from copy-on-write parts. + +### What (not) to do when an assertion fires. + +- Don't remove the assertion because “without that the program works!” +- Don't complain to the owner of the assertion that they are crashing the program. +- Understand what kind of check is being performed + - If it's a precondition check, fix your bug + - If it's a self-check or postcondition check, talk to the code owner about why their assumptions might have been violated + +### Probably different functions for unit testing. + + + + + + + + +Notes: + - read from network, how much was read + - no-error case exists + - podcast + - likely a local handling case. + - don't go to vegas with something you're not prepared to lose. + +Quickdraw GX: 15% performance penalty for making silent null checks. David Sankel 50:11 Folks can go ahead and put your hands up if you would like to. From ab2f6b567060ade519d899f7753e0a95c74a1967 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 2 Dec 2025 14:43:53 -0800 Subject: [PATCH 09/41] WIPPITY WIP WIP WOW --- better-code/src/chapter-3-errors.md | 60 +++++++++-------------------- 1 file changed, 19 insertions(+), 41 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index fa7a73f..fe045a0 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -136,11 +136,12 @@ the problem could have affected many things we didn't test for. Because of the bug, your program state could be very, very scathed indeed, violating assumptions made when coding and potentially compromising security, If user data is quietly corrupted and -subsequently saved, the damage is permanent. +subsequently saved, the damage becomes permanent. In any case, unless the program has no mutable state and no external effects, the only principled response to bug detection is to terminate -the process. [^fault-tolerant] +the process, possibly after taking some emergency shutdown +measures such as saving diagnostic information. [^fault-tolerant] [^fault-tolerant]: There do exist systems that recover from bugs in a principled way, using redundancy: for example, functionality could be @@ -148,45 +149,22 @@ written three different ways by separate teams, and run in separate processes that “vote” on results. 
In any case, the loser needs to be terminated to flush any corrupted program state. -As terrible as that outcome may be, it's better than the alternative. - -Immediate dangers aside, sallying forth in the face of a detected bug -hurts the development process and the health of the codebase. One -could argue that - -Even if the condition is logged, it is at -least partially masked, and in practice, will usually be -de-prioritized. If it ever becomes a priority, it will be harder and -more expensive to fix than if it had immediately . It's easy to make -bugs drop - - - - The bug will be at least partially masked. - - - If not completely masked, and addressing it will usually be - de-prioritized. - - - If the bug ever becomes a priority, it will be harder and more - expensive to fix. - - - Because most code is correct, bug-recovery code will never run or be tested. All this recovery code bloats your - program and every line [is a - liability](https://blog.objectmentor.com/articles/2007/04/16/code-is-a-liability) - with no offsetting benefits. - -[^needless-checks]: Even if we _could_ test for everything, our code -would spend more time on tests than on fulfilling its purpose. And because most code is correct, the code attempting to recover would be untexted - - -In general, only fault-tolerant systems that can recover from bugs use redundancy - -### Actual Bug Recovery - -Systems that are resilient to bugs do exist, though. They do it by adding - -Some systems can recover from bugs (e.g. redundant ones). Processes can't recover. - -To sum up, in general you can't recover from bugs, and it's a bad idea to try. So what can you do? +As terrible as that outcome may be, it's better than the +alternative. Recovery code is almost never exercised or tested and +thus is likely wrong, and the consequences of a botched recovery +attempt can be worse than termination. To no advantage, most recovery +code obscures the rest of the code and adds bloat, which hurts +performance. Continuing to run after a bug is detected also hurts our +ability to fix the bug. When a bug is detected, before any further +state changes, you want to immediately capture as much information as +possible that could assist in diagnosis. In development that +typically means dropping into a debugger, and in deployed code that +might mean producing a crash log or core dump. If deployed code +continues to run, the bug is obscured and—even if automatically +reported—will likely be de-prioritized for fixing until it is less +fresh and thus harder to address. Worse, it can result in *multiple* +symptoms that will be reported as separate higher-priority bugs whose +root cause could have been addressed once. ## Handling bugs From 333b81e78f080a6e14d2f7bbcbe1d849a5b658fb Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 11 Dec 2025 15:10:39 -0800 Subject: [PATCH 10/41] X --- better-code/src/chapter-3-errors.md | 53 +++++++++++++++++------------ 1 file changed, 32 insertions(+), 21 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index fe045a0..e44ff24 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -161,38 +161,49 @@ possible that could assist in diagnosis. In development that typically means dropping into a debugger, and in deployed code that might mean producing a crash log or core dump. 
If deployed code continues to run, the bug is obscured and—even if automatically -reported—will likely be de-prioritized for fixing until it is less +reported—will likely be de-prioritized until it is less fresh and thus harder to address. Worse, it can result in *multiple* symptoms that will be reported as separate higher-priority bugs whose root cause could have been addressed once. ## Handling bugs -You can stop the program before any more damage is done, and generate a crash report or debuggable image that captures as much information as is available about the state of the program, so there's a chance of fixing the bug. Maybe there's some small emergency shutdown procedure you need to perform, like saving information about the failing command so the application can offer to retry it for you when you restart it. - -Let me be clear: THIS IS BAD. It could be experienced as a crash by users. -But it's the only way to prevent the much worse consequences of a botched recovery attempt. Remember, the chances of botchery are high because you don't have enough information to do it reliably. -Upside: it will also be experienced as a crash by developers, QE teams, and beta testers, giving you a chance to fix the bug. - -*** You can mitigate the experience of crashing *** -*** Don't tell me my assertion is a crash *** -*** An assertion is a controlled shutdown *** - -A lot of people have a hard time accepting the idea of voluntarily terminating, but let's face it: your bug detection isn't the only reason the program might suddenly stop. You can crash from an undetected bug. Or a person can trip over the power cord. You should design your software so that these bad things are not catastrophic. - -*** In fact you could be more ambitious and try to make it really seamless. You have to accept this is part of the UX package to even take this on. *** - -In fact some platforms force you to live under a similar constraint. On an iPhone or iPad, for example, to save battery and keep foreground apps responsive, the OS may kill your process any time it's in the background, but will make it look to the user like it's still running. When the user switches back, every app is supposed to complete the illusion by coming back up in the same state it was killed in. I can tell you as a user, it can be really jarring when you encounter an app that doesn't do it right. The point is, resilience to early termination is something you can and should design into the system. - -For example, Photoshop uses a variety of strategies: we always save documents into a new file and atomically swap it into place only after the save succeeds, so we never leave a half-saved document on disk. We also periodically save backups so at most you only lose the last few minutes of work. If we needed to tighten that up we could, by saving a record of changes since the last full backup. +The best strategy is to stop the program before any more damage is +done and generate a crash report or debuggable image that captures as +much information as is available about the state of the program, so +there's a chance of fixing the bug. Maybe there's some small +emergency shutdown procedure you need to perform, like saving +information about the failing command so the application can offer to +retry it for you when you restart it. + +Many people have a hard time accepting the idea of voluntarily +terminating, but let's face it: your bug detection isn't the only +reason the program might suddenly stop. 
The program can crash from an +undetected bug… or a person can trip over the power cord. Where it +matters, software should be designed so that sudden termination is not +catastrophic. Techniques for doing that, such as saving backup files, +are well-known, but outside the scope of this book. + +In fact, it's often possible to make restarting the app a completely +seamless experience. On an iPhone or iPad, for example, to save +battery and keep foreground apps responsive, the OS may kill your +process any time it's in the background, but will make it look to the +user like it's still running. When the user switches back, every app +is supposed to complete the illusion by coming back up in the same +state it was killed in. Resilience to early termination is something +you can and should design into your system. ## Assertions -The usual mechanism for terminating a program when a bug is detected is called an assertion and traditionally it spelled something like this: +The classic mechanism for terminating a program when a bug is detected +is called an assertion and traditionally it spelled something like +this: - assert(n >= 0); +```swift +assert(n >= 0); +``` -This spelling comes from C and C++. If you're programming in another language, you probably have something similar. +This spelling comes from the C programming language. The C assertion is pretty straightforward: either it's disabled, in which case it generates no code at all—even the check is skipped—or it does the check and exits immediately with a predefined error code if the check fails, usually printing a message containing the text of the failed check and its location in source. From 32a543b8c406a76a10ef656796135cf5c68a4059 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Mon, 15 Dec 2025 14:25:28 -0800 Subject: [PATCH 11/41] Progress --- better-code/src/chapter-3-errors.md | 118 +++++++++++++++++++++++++--- 1 file changed, 108 insertions(+), 10 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index e44ff24..46df4df 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -193,25 +193,123 @@ is supposed to complete the illusion by coming back up in the same state it was killed in. Resilience to early termination is something you can and should design into your system. -## Assertions +## Checking For Bugs -The classic mechanism for terminating a program when a bug is detected -is called an assertion and traditionally it spelled something like -this: +While, as we've seen, not all bugs are detectable, checking for the +others at runtime is an extremely valuable technique for creating +robust software. + +### Precondition Checks + +Swift supplies a function for checking that a precondition is upheld, +which can be used as follows: + +```swift +precondition(n >= 0) +``` + +*or* + +```swift +precondition(n >= 0, "n == \(n); it must be non-negative.") +``` + +In either case, if the condition is false, the program will be +terminated (or stop if run in a debugger). [^Onone] In debug builds, +the file and line of the call will be written to the standard error +stream, along with any message supplied. In release builds, to save +on program size, nothing is printed and any expression passed as a +second argument is never evaluated. + +[^Onone]: Actually, if you build your program with `-Onone`, both + forms have no effect; the conditional expression will never even + be evaluated. 
However, `-Onone` makes Swift an unsafe language: + any failure to satisfy preconditions can cause *arbitrary + behavior*. The results can be so serious that we strongly advise + against using `-Onone`, except as an experiment to satisfy + yourself that Swift's built-in checks do not have unacceptable + cost. The rest of this book is therefore written as though + `-Onone` does not exist. + +### Assertions + +Swift supplies a similar function called `assert`, modeled on the one +from the C programming language. Its intended use is as a “soundness +check,” to validate your own assumptions rather than to make checks at +function boundaries. For example, in the binary search algorithm +mentioned in the previous chapter, ```swift -assert(n >= 0); + // precondition: l <= h + let m = (h - l) / 2 + h = l + m + // postcondition: l <= h ``` -This spelling comes from the C programming language. +There is no contract supplying the Hoare-style precondition and +postcondition you see there; they are internal to a single function. +If violated, they indicate we've failed to understand the code we've +written: the informal proof we used to evaluate the function's +correctness was flawed. Replacing those comments with assertions can +help us uncover those failures during testing of debug builds without +impacting performance of release builds: + +```swift + assert(l <= h) + let m = (h - l) / 2 + h = l + m + assert(l <= h, "unexpected h value \(h)") +``` -The C assertion is pretty straightforward: either it's disabled, in which case it generates no code at all—even the check is skipped—or it does the check and exits immediately with a predefined error code if the check fails, usually printing a message containing the text of the failed check and its location in source. +Similarly, `assert` can be useful for ensuring loop invariants are +correct (see the algorithms chapter). When trying to track down a +mysterious bug, temporarily adding as many assertions as possible in +the problem area can be a useful technique for narrowing the scope of +code you have to review. + +Assertions are checked only in debug builds, compiling to nothing in +release builds. This has the useful effect of allowing programmers to +use `assert`s liberally without concern for slowing down release +builds. + +> **Note:** when unsafe components are used to build safe ones, any +> checks that prevent misuse of unsafe functionality must of course be +> unconditional unless you can prove that the code's logic implies +> those checks will always pass. + +### Postcondition and Expensive Precondition Checks + +Checking postconditions is the role of unit tests, so in most cases we +recommend leaving postcondition checks out of function bodies. +However, if you can't be confident that unit tests cover enough cases, +since postconditions are often expensive to check, it might make sense +to use assertions to check them as a confidence-building +measure. Similarly, a precondition that can only be checked with a +significant cost to preformance could be checked with +`assert`. However, in both cases we suggest using a forwarding +function whose name describes its meaning, so that `assert` is used +exclusively for internal soundness checks: -Debuggers will commonly stop at the assertion rather than exiting, and even if you're not running in the debugger, on major desktop OSes, you'll get a crash report with the entire program state that can be loaded into a debugger. 
So this is great for catching bugs early, before they get shipped, provided people use it. +```swift +public func preconditionUncheckedInRelease( + _ condition: @autoclosure () -> Bool, + _ message: @autoclosure () -> String = String(), + file: StaticString = #file, line: UInt = #line +) { + assert(condition, message, file: file, line: line) +} +``` -Projects commonly disable assertions in release builds, which has the nice side-effect of making programmers comfortable adding lots of assertions, because they know they won't slow down the release build. And more bugs get caught early. +### -But unless you really believe you're shipping bug-free software, you might want to leave most assertions on in release builds. In fact, the security of your software might depend on it. If you're programming in an unsafe language like C++, opportunities to cause undefined behavior are all around you. When you can assert that the conditions for avoiding undefined behavior are met before executing the dangerous operation, the program will come to a controlled stop instead of opening an arbitrarily bad security hole. +Unless you really believe you're shipping bug-free software, you +might want to leave most assertions on in release builds. In fact, the +security of your software might depend on it. If you're programming +in an unsafe language like C++, opportunities to cause undefined +behavior are all around you. When you can assert that the conditions +for avoiding undefined behavior are met before executing the dangerous +operation, the program will come to a controlled stop instead of +opening an arbitrarily bad security hole. The problem with leaving assertions on in release is that some checks are too expensive to ship. And let's be honest; many programmers will go with their gut, instead of measuring, when making that determination. We really need a second, expensive_assert(), that's only on in debug builds, so we continue to catch those bugs early. From 0dbc6b3b8ddc048bd284cbddfae22fe85771cf08 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 16 Dec 2025 16:21:27 -0800 Subject: [PATCH 12/41] End section on bugs. --- better-code/src/chapter-3-errors.md | 80 +++++++++++++++++++---------- 1 file changed, 54 insertions(+), 26 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 46df4df..efc8a19 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -293,41 +293,69 @@ exclusively for internal soundness checks: ```swift public func preconditionUncheckedInRelease( _ condition: @autoclosure () -> Bool, - _ message: @autoclosure () -> String = String(), + _ message: @autoclosure () -> String = "Precondition violated", file: StaticString = #file, line: UInt = #line ) { assert(condition, message, file: file, line: line) } ``` -### +The distinction between this check and a use of `assert` is important: +when it fails, this one indicates a bug in the caller, while a failed +`assert` normally indicates a bug in the callee. -Unless you really believe you're shipping bug-free software, you -might want to leave most assertions on in release builds. In fact, the -security of your software might depend on it. If you're programming -in an unsafe language like C++, opportunities to cause undefined -behavior are all around you. 
When you can assert that the conditions -for avoiding undefined behavior are met before executing the dangerous -operation, the program will come to a controlled stop instead of -opening an arbitrarily bad security hole. - -The problem with leaving assertions on in release is that some checks are too expensive to ship. And let's be honest; many programmers will go with their gut, instead of measuring, when making that determination. We really need a second, expensive_assert(), that's only on in debug builds, so we continue to catch those bugs early. - -There's another problem with having just one assertion: it doesn't express sufficient intent. For example, it might be a precondition check, or the asserting function's author might just be double-checking their own reasoning. When these two assertions fire, the meaning is very different: the first indicates a bug in the caller, the other one is a bug in the callee. So I really want separate precondition and self_check functions. - -If I'm writing in a safe-by-default language like Rust or Swift, the checks that prevent undefined behavior, like array bounds checks, are special: I can afford to turn off all the other checks in shipping code, but these checks are the ones upholding safety properties of my system are compromised. So I want a different assertion for these checks, even if I don't ever anticipate turning off the other ones in a shipped product. These are the ones that we can't delete from the code. I might want to turn the other assertions off locally to measure how much overhead they are incurring. - -I hope you get the idea. I'm not going to prescribe the exact set of assertion facilities your project needs, but a carefully engineered suite of these functions with properties appropriate to your project is part of a comprehensive strategy for dealing with bugs. If you haven't got one, go design it. - -One last point about the C++ assert: it's better than nothing, but because it calls abort(), there's no place to put emergency shutdown measures. You can't even display a message to the user, so to the user it will always feel like a hard, unceremonious crash. You probably want failed assertions to call terminate() instead, because it allows terminate handlers can run. So that's another reason to engineer your own assertions, even if you build just one. +All that said, beware the temptation to turn off a precondition check +in release builds before measuring its effect on performance. The +value of stopping the program before things go too far wrong is often +higher than the cost of any particular check. Certainly, any +precondition check in a safe function that ultimately prevents an +unsafe component from being misused can never be turned off in release +builds. -## What if you're not allowed to terminate? +```swift +/// Exchanges the first and last elements of `x`. +func swapFirstAndLast(_ x: inout Array) { + precondition(!x.isEmpty) + if x.count == 1 { return } + x.withUnsafeBufferPointer { + f = x.baseAddress + l = f + x.count - 1 + swap(&f[0], &l[0]) + } +} +``` -Fight for the right (to terminate). If the system is critical, advocate creating a recovery system outside the process. -If you lose today -Fail as noisily as possible, preferably by terminating in non-shipping code. -Keep fighting -Be prepared to win someday. That means use a suite of assertions that don't terminate, but whose behavior you can change when you win the fight. 
+ + +In this example, the precondition prevents an out-of-bounds access to +a non-existent first element. + +### Emergency Shutdown and Seamless Restarts + +When a bug is detected, it can be useful to take emergency measures +before shutdown, e.g.: + +- release system resources that aren't automatically reclaimed upon + process termination. +- log user actions to aid in reproducing the violation or in + recovering work that would otherwise be lost. + +Unfortunately, as of this writing, Swift does not provide a facility +for taking emergency shutdown measures. You cannot release resources +when a bug is detected, and the only way to generate logs is to do it +pre-emptively and unconditionally, which is probably a more principled +approach anyway. + +Regardless, it is useful to think about how to create an experience of +resilience for users. Bug detection is hardly the only reason your +process might suddenly terminate. Someone could trip over the power +cord, or the operating system itself could detect an internal bug, +causing a “kernel panic” that restarts the hardware. Some +environments, such as iOS, may kill any process to better manage +system resources, with the guideline that programs should come up in +the same state in which they were killed. When you accept that sudden +termination is part of *every* program's reality, it is easier to +accept it as a response to bug detection, and to mitigate the effects. # Failures From 1904312fc93eefee4ed6b32eca79615305eba545 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 16 Dec 2025 16:26:52 -0800 Subject: [PATCH 13/41] Fix levels --- better-code/src/chapter-3-errors.md | 40 ++++++++++++++--------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index efc8a19..0bf61fa 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -166,7 +166,7 @@ fresh and thus harder to address. Worse, it can result in *multiple* symptoms that will be reported as separate higher-priority bugs whose root cause could have been addressed once. -## Handling bugs +## How to Handle Bugs The best strategy is to stop the program before any more damage is done and generate a crash report or debuggable image that captures as @@ -193,13 +193,13 @@ is supposed to complete the illusion by coming back up in the same state it was killed in. Resilience to early termination is something you can and should design into your system. -## Checking For Bugs +### Checking For Bugs While, as we've seen, not all bugs are detectable, checking for the others at runtime is an extremely valuable technique for creating robust software. -### Precondition Checks +#### Precondition Checks Swift supplies a function for checking that a precondition is upheld, which can be used as follows: @@ -231,7 +231,7 @@ second argument is never evaluated. cost. The rest of this book is therefore written as though `-Onone` does not exist. -### Assertions +#### Assertions Swift supplies a similar function called `assert`, modeled on the one from the C programming language. Its intended use is as a “soundness @@ -277,7 +277,7 @@ builds. > unconditional unless you can prove that the code's logic implies > those checks will always pass. -### Postcondition and Expensive Precondition Checks +#### Postcondition and Expensive Precondition Checks Checking postconditions is the role of unit tests, so in most cases we recommend leaving postcondition checks out of function bodies. 
@@ -357,9 +357,9 @@ the same state in which they were killed. When you accept that sudden termination is part of *every* program's reality, it is easier to accept it as a response to bug detection, and to mitigate the effects. -# Failures +## Failures -OK, as much as we all love bugs, it's time to leave them behind and talk about failures. Let's say you identify a condition X where your function is unable to fulfill its primary purpose. That can occur one of two ways: +As much as we all love bugs, it's time to leave them behind and talk about failures. Let's say you identify a condition X where your function is unable to fulfill its primary purpose. That can occur one of two ways: Something your function calls has a precondition that you're not sure would be satisfied. @@ -373,7 +373,7 @@ It's counterintuitive, you should always prefer to classify X as a bug, as long It is possible to ensure !X. For example, there's no way for the caller to ensure there's enough disk space to save a file, because other processes can use up any space that might have been free before the call. So you can't make “there's enough disk to save” a precondition. Ensuring !X is considerably less work than the work done by the callee. For example, if the callee is deserializing a document and finds that it's corrupted, you can't make it a precondition that the file is well-formed, because determining whether it is or not is basically the same work as doing the deserialization. -## Definition +### Definition Failure: inability to satisfy a postcondition in correct code. @@ -416,7 +416,7 @@ fn sort(mutating x: [Int], order: Ordering) throws // Sorts `x` according to `order`. fn sort(mutating x: [Int], order: Ordering) throws -## Two kinds of failures +### Two kinds of failures If you've spent some time writing code that carefully handles failures, especially in a language like C where all the error propagation is explicit, failures start to fall into two main categories: local and non-local, based on where the recovery is likely to happen. @@ -448,11 +448,11 @@ After every operation that can fail, you're adding “and if there was a failure There are many layers of this propagation. None of it depends on the details of the reasons for failure: whether the disk is full or the OS detects directory corruption, or serialization is going to an in-memory archive and you run out of memory, you're going to do the same thing. Finally, where propagation stops and the failure is handled—let's say this is a desktop app— again, the recovery is usually the same no matter the reasons for the failure: you report the problem to the user and wait for the next command. -### Interlude: Exceptions? +#### Interlude: Exceptions? Way back in 1996 I embarked on a mission to dispel the widespread fear, loathing, and misunderstanding around exceptions. Yes I'm old. While I've seen some real progress on that over the years, I know some of you out there are still not all that comfortable with the idea. If you'll let me, I think I can help. -#### Just control flow +##### Just control flow Cases like this are where the motivation for exceptions becomes really obvious. They eliminate the boilerplate and let you see the code's primary intent: @@ -468,7 +468,7 @@ There's no magic. Exceptions are just control flow. Like a switch statement, t To grok the meaning of this code in its full detail, you mentally add “and if there was a failure, return it” everywhere. 
But if you push failures out of your mind for a moment you can see that how the function fulfills its primary purpose leaps out at you in a way that was obscured by all the failure handling. The effect is even stronger when there's some control flow that isn't related to error handling. -#### Also, type erasure +##### Also, type erasure OK, I lied a little when I said exceptions are just control flow. There's one other big difference between the exception version and the explicit version: the exception version erases the types of the failure data, and catch blocks are just big type switches with dynamic downcasts. @@ -484,7 +484,7 @@ Swift recently added statically-typed error handling in spite of this lesson tha The moral of the story: sometimes dynamic polymorphism is the right answer. Non-local error handling is a key example, and the design of most exception systems optimize for that. -### When (and when not) to use exceptions +#### When (and when not) to use exceptions There's a lot of nice sounding advice out there about this that is either meaningless or vague, like “use exceptions for exceptional conditions,” or “don't use exceptions for control flow.” I know that one is really popular around Adobe, but c'mon: if you're using exceptions, you're using them for control flow. I hope to improve on that advice a little bit. @@ -506,13 +506,13 @@ Here's an example that might open your mind a bit: when we were discussing the d Finally, you might need to consider your team's development culture and use of tooling. If people typically have their debuggers set up to stop when an exception occurs, you might need to take extra care not to throw when there's an alternate path to success. Some developers tend to get upset when code stops in a case that will eventually succeed. -## How to Handle Failure +### How to Handle Failure OK, enough about exceptions. Finally we come to the good part! Seriously, this was originally going to be the focus of the entire talk. Let's talk about the obligations of a failing function and of its caller. What goes in the contract and what does each side need to do to ensure correctness? -### Callee +#### Callee Documentation: Document local failures and what they mean. @@ -521,7 +521,7 @@ Document non-local failures at their source, but not where they are simply propa Code: Release any unmanaged resources you've allocated (e.g. close temporary file). -#### Optional +##### Optional If mutating, consider giving the strong/transactional guarantee that if there is a failure, the function has no effects. @@ -529,19 +529,19 @@ Only do this if it has no performance cost. Sometimes it just falls out of the i Don't pay a performance penalty to get it because not all clients need it and when composing parts all the needless overheads add up massively. -### Caller +#### Caller - Discard any partially-completed mutations to program state or propagate the error and that responsibility to your caller. This partially mutated state is meaningless. What counts as state? Data that can have an observable effect on the future behavior of your code. Your log file doesn't count. -#### Implications as data structures scale up +##### Implications as data structures scale up The only strategy that really scales in practice, when mutation can fail, is to propagate responsibility for discarding partial mutations all the way to the top of the application. That in turn implies mutating a copy of existing data and replacing the old copy only when mutation succeeds. 
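Here is a minimal sketch of that “mutate a copy, then commit” pattern; `Model` and the `edit` closure are stand-ins invented for illustration:

```swift
struct Model {
  var paragraphs: [String] = []
}

/// Applies `edit` to `model` transactionally: either the whole edit
/// succeeds, or `model` is left exactly as it was.
func apply(
  _ edit: (inout Model) throws -> Void, to model: inout Model
) rethrows {
  var draft = model   // a value-semantic copy; storage is shared until mutated
  try edit(&draft)    // may fail partway through its own changes
  model = draft       // replace the old copy only when the edit succeeds
}
```

Because `draft` is simply discarded on the throwing path, partially-completed changes never become visible to the rest of the program.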
Either way, you probably end up with a persistent data structure (which is a confusing name—it has nothing to do with persistence in the usual sense). A persistent data structure is one where a partial mutation of a copy shares a lot of storage with the original. For example, in Photoshop, we store a separate document for each state in the undo history, but these copies share storage for any parts that weren't mutated between revisions. This sharing behavior falls out naturally when you compose your data structure from copy-on-write parts. -### What (not) to do when an assertion fires. +#### What (not) to do when an assertion fires. - Don't remove the assertion because “without that the program works!” - Don't complain to the owner of the assertion that they are crashing the program. @@ -549,7 +549,7 @@ A persistent data structure is one where a partial mutation of a copy shares a l - If it's a precondition check, fix your bug - If it's a self-check or postcondition check, talk to the code owner about why their assumptions might have been violated -### Probably different functions for unit testing. +#### Probably different functions for unit testing. From 1c53ffaab291af64b39ca38720a84dd8f3232eb4 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 16 Dec 2025 16:37:56 -0800 Subject: [PATCH 14/41] Whitespace --- better-code/src/chapter-2-contracts.md | 1 + 1 file changed, 1 insertion(+) diff --git a/better-code/src/chapter-2-contracts.md b/better-code/src/chapter-2-contracts.md index aa4a6b7..86c42aa 100644 --- a/better-code/src/chapter-2-contracts.md +++ b/better-code/src/chapter-2-contracts.md @@ -639,6 +639,7 @@ It's an invariant of your program that a manager ID can't just be random; it has to identify an employee that's in the database—that's part of what it means for the program to be in a good state, and all through the program you have code to ensure that invariant is upheld. + #### Encapsulating invariants It would be a good idea to identify and document that whole-program From 797f1281b22d71ac23ff1c9f20458eadfb52e284 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 16 Dec 2025 16:41:13 -0800 Subject: [PATCH 15/41] Bugfix --- better-code/src/chapter-3-errors.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 0bf61fa..c893c43 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -317,10 +317,10 @@ builds. func swapFirstAndLast(_ x: inout Array) { precondition(!x.isEmpty) if x.count == 1 { return } - x.withUnsafeBufferPointer { - f = x.baseAddress - l = f + x.count - 1 - swap(&f[0], &l[0]) + x.withUnsafeBufferPointer { b in + f = b.baseAddress + l = f + b.count - 1 + swap(&f.pointee, &l.pointee) } } ``` From 85c96869b99ad705d16f68cfeedfe5718e35be9e Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 17 Dec 2025 16:21:21 -0800 Subject: [PATCH 16/41] X --- better-code/src/chapter-3-errors.md | 59 +++++++++++------------------ 1 file changed, 23 insertions(+), 36 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 46df4df..42e7c77 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -272,11 +272,6 @@ release builds. This has the useful effect of allowing programmers to use `assert`s liberally without concern for slowing down release builds. 
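As a small illustration (assuming the default `-Onone` debug and `-O` release configurations), only the `precondition` below survives into a release build; the `assert` costs nothing there:

```swift
func median(of values: [Double]) -> Double {
  // The caller's obligation: checked in debug *and* release builds.
  precondition(!values.isEmpty, "median of an empty collection is undefined")

  let sorted = values.sorted()
  let middle = sorted.count / 2
  let result = sorted.count.isMultiple(of: 2)
    ? (sorted[middle - 1] + sorted[middle]) / 2
    : sorted[middle]

  // The author's soundness check: compiled away in release builds.
  assert(result >= sorted.first! && result <= sorted.last!)
  return result
}
```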
-> **Note:** when unsafe components are used to build safe ones, any -> checks that prevent misuse of unsafe functionality must of course be -> unconditional unless you can prove that the code's logic implies -> those checks will always pass. - ### Postcondition and Expensive Precondition Checks Checking postconditions is the role of unit tests, so in most cases we @@ -300,41 +295,33 @@ public func preconditionUncheckedInRelease( } ``` -### - -Unless you really believe you're shipping bug-free software, you -might want to leave most assertions on in release builds. In fact, the -security of your software might depend on it. If you're programming -in an unsafe language like C++, opportunities to cause undefined -behavior are all around you. When you can assert that the conditions -for avoiding undefined behavior are met before executing the dangerous -operation, the program will come to a controlled stop instead of -opening an arbitrarily bad security hole. - -The problem with leaving assertions on in release is that some checks are too expensive to ship. And let's be honest; many programmers will go with their gut, instead of measuring, when making that determination. We really need a second, expensive_assert(), that's only on in debug builds, so we continue to catch those bugs early. - -There's another problem with having just one assertion: it doesn't express sufficient intent. For example, it might be a precondition check, or the asserting function's author might just be double-checking their own reasoning. When these two assertions fire, the meaning is very different: the first indicates a bug in the caller, the other one is a bug in the callee. So I really want separate precondition and self_check functions. - -If I'm writing in a safe-by-default language like Rust or Swift, the checks that prevent undefined behavior, like array bounds checks, are special: I can afford to turn off all the other checks in shipping code, but these checks are the ones upholding safety properties of my system are compromised. So I want a different assertion for these checks, even if I don't ever anticipate turning off the other ones in a shipped product. These are the ones that we can't delete from the code. I might want to turn the other assertions off locally to measure how much overhead they are incurring. - -I hope you get the idea. I'm not going to prescribe the exact set of assertion facilities your project needs, but a carefully engineered suite of these functions with properties appropriate to your project is part of a comprehensive strategy for dealing with bugs. If you haven't got one, go design it. - -One last point about the C++ assert: it's better than nothing, but because it calls abort(), there's no place to put emergency shutdown measures. You can't even display a message to the user, so to the user it will always feel like a hard, unceremonious crash. You probably want failed assertions to call terminate() instead, because it allows terminate handlers can run. So that's another reason to engineer your own assertions, even if you build just one. - -## What if you're not allowed to terminate? - -Fight for the right (to terminate). If the system is critical, advocate creating a recovery system outside the process. -If you lose today -Fail as noisily as possible, preferably by terminating in non-shipping code. -Keep fighting -Be prepared to win someday. That means use a suite of assertions that don't terminate, but whose behavior you can change when you win the fight. 
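A sketch, with invented names, of what such a configurable check might look like in Swift—one facility whose response to a violation is decided in a single place rather than at every call site:

```swift
/// What a failed check should do; change this in one place when the
/// policy changes (e.g. when termination finally becomes acceptable).
enum CheckFailurePolicy {
  case terminate
  case log((String) -> Void)
}

var checkFailurePolicy: CheckFailurePolicy = .terminate

func check(
  _ condition: @autoclosure () -> Bool,
  _ message: @autoclosure () -> String = "",
  file: StaticString = #file, line: UInt = #line
) {
  if condition() { return }
  switch checkFailurePolicy {
  case .terminate:
    fatalError("Check failed: \(message())", file: file, line: line)
  case .log(let log):
    log("Check failed at \(file):\(line): \(message())")
  }
}
```

Until termination is an option, the `log` policy at least makes the violation visible, and nothing about the call sites has to change when the policy does.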
+> **Note:** when unsafe components are used to build safe ones, any +> checks that prevent misuse of unsafe functionality must of course be +> unconditional unless you can prove that the code's logic implies +> those checks will always pass. # Failures -OK, as much as we all love bugs, it's time to leave them behind and talk about failures. Let's say you identify a condition X where your function is unable to fulfill its primary purpose. That can occur one of two ways: - +OK, as much as we all love bugs, it's time to leave them behind and +talk about failures. Let's say you identify a condition X where your +function is unable to fulfill its primary purpose. That can occur one +of two ways: + +1. Something your function uses has a precondition that you can't + be sure would be satisfied. For example, + + ```swift + extension Array { + /// Returns the number of unused elements when a maximal + /// number of `n`-element chunks are stored in `self`. + func excessWhenFilled(withChunksOfSize n: Int) { + size / + } + } + ``` +### -Something your function calls has a precondition that you're not sure would be satisfied. +your function might take an integer parameter that is used Something your function calls can itself report a failure. You usually have two choices at this point: From cc1b758fb13a383784503fb78d3bda8bd29de564 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 18 Dec 2025 13:06:24 -0800 Subject: [PATCH 17/41] Checkpoint --- better-code/src/chapter-3-errors.md | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 6544b7c..cb146d5 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -360,7 +360,7 @@ accept it as a response to bug detection, and to mitigate the effects. ## Failures As much as we all love bugs, it's time to leave them behind and talk -about failures. Let's say you identify a condition X where your +about failures. Let's say you identify a condition `X` where your function is unable to fulfill its primary purpose. That can occur one of two ways: @@ -372,16 +372,31 @@ of two ways: /// Returns the number of unused elements when a maximal /// number of `n`-element chunks are stored in `self`. func excessWhenFilled(withChunksOfSize n: Int) { - size / + count() % n // n == 0 would violate the precondition of % + } + } + ``` + +2. Something your function uses can itself report a failure: + + ```swift + extension Array { + /// Writes a textual representation of `self` to a temporary file, + /// which is returned. + func writeToTempFile(withChunksOfSize n: Int) -> URL { + let r = FileManager.defaultTemporaryDirectory + .appendingPathComponent(UUID().uuidString) + "\(self)".write( // compile error: call can throw; error not handled + to: r, atomically: false, encoding: .utf8) + return r } } ``` -### -your function might take an integer parameter that is used -Something your function calls can itself report a failure. +In general, you have two choices: you can make `!X` a precondition of your function, or you can have your function + +### -You usually have two choices at this point: Make !X a precondition; X reflects a bug in the caller. Make X a failure; all the code is correct. From 4f5d481a1eca563050eb361860b551a9dac05bb9 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 18 Dec 2025 13:36:03 -0800 Subject: [PATCH 18/41] Remove treatment of emergency shutdown measures. 
--- better-code/src/chapter-3-errors.md | 69 ++++++++++------------------- 1 file changed, 23 insertions(+), 46 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index cb146d5..19c1d63 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -140,13 +140,12 @@ subsequently saved, the damage becomes permanent. In any case, unless the program has no mutable state and no external effects, the only principled response to bug detection is to terminate -the process, possibly after taking some emergency shutdown -measures such as saving diagnostic information. [^fault-tolerant] +the process. [^fault-tolerant] [^fault-tolerant]: There do exist systems that recover from bugs in a -principled way, using redundancy: for example, functionality could be -written three different ways by separate teams, and run in separate -processes that “vote” on results. In any case, the loser needs to be +principled way by using redundancy: for example, functionality could +be written three different ways by separate teams, and run in separate +processes that “vote” on results. In any case, the loser needs to be terminated to flush any corrupted program state. As terrible as that outcome may be, it's better than the @@ -171,18 +170,15 @@ root cause could have been addressed once. The best strategy is to stop the program before any more damage is done and generate a crash report or debuggable image that captures as much information as is available about the state of the program, so -there's a chance of fixing the bug. Maybe there's some small -emergency shutdown procedure you need to perform, like saving -information about the failing command so the application can offer to -retry it for you when you restart it. +there's a chance of fixing the bug. Many people have a hard time accepting the idea of voluntarily -terminating, but let's face it: your bug detection isn't the only -reason the program might suddenly stop. The program can crash from an -undetected bug… or a person can trip over the power cord. Where it -matters, software should be designed so that sudden termination is not -catastrophic. Techniques for doing that, such as saving backup files, -are well-known, but outside the scope of this book. +terminating, but let's face it: bug detection isn't the only reason +the program might suddenly stop. The program can crash from an +*un*detected bug in unsafe code… or a person can trip over the power +cord, or the operating system itself could detect an internal bug, +causing a “kernel panic” that restarts the hardware. Software should +be designed so that sudden termination is not catastrophic. In fact, it's often possible to make restarting the app a completely seamless experience. On an iPhone or iPad, for example, to save @@ -190,8 +186,16 @@ battery and keep foreground apps responsive, the OS may kill your process any time it's in the background, but will make it look to the user like it's still running. When the user switches back, every app is supposed to complete the illusion by coming back up in the same -state it was killed in. Resilience to early termination is something -you can and should design into your system. +state it was killed in. Non-catastrophic early termination is +something you can and should design into your system. [^techniques] +When you accept that sudden termination is part of *every* program's +reality, it is easier to accept it as a response to bug detection, and +to mitigate the effects. 
+ +[^techniques]: Techniques for ensuring that restarting is seamless, +such as saving incremental backup files, are well-known, but outside +the scope of this book. + ### Checking For Bugs @@ -330,42 +334,15 @@ func swapFirstAndLast(_ x: inout Array) { In this example, the precondition prevents an out-of-bounds access to a non-existent first element. -### Emergency Shutdown and Seamless Restarts - -When a bug is detected, it can be useful to take emergency measures -before shutdown, e.g.: - -- release system resources that aren't automatically reclaimed upon - process termination. -- log user actions to aid in reproducing the violation or in - recovering work that would otherwise be lost. - -Unfortunately, as of this writing, Swift does not provide a facility -for taking emergency shutdown measures. You cannot release resources -when a bug is detected, and the only way to generate logs is to do it -pre-emptively and unconditionally, which is probably a more principled -approach anyway. - -Regardless, it is useful to think about how to create an experience of -resilience for users. Bug detection is hardly the only reason your -process might suddenly terminate. Someone could trip over the power -cord, or the operating system itself could detect an internal bug, -causing a “kernel panic” that restarts the hardware. Some -environments, such as iOS, may kill any process to better manage -system resources, with the guideline that programs should come up in -the same state in which they were killed. When you accept that sudden -termination is part of *every* program's reality, it is easier to -accept it as a response to bug detection, and to mitigate the effects. - ## Failures As much as we all love bugs, it's time to leave them behind and talk -about failures. Let's say you identify a condition `X` where your +about failures. Let's say you identify a condition where your function is unable to fulfill its primary purpose. That can occur one of two ways: 1. Something your function uses has a precondition that you can't - be sure would be satisfied. For example, + be sure would be satisfied: ```swift extension Array { From 642a4e53068bb2903470609b8e5760cd5546125f Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 18 Dec 2025 14:48:16 -0800 Subject: [PATCH 19/41] Preface caveat --- better-code/src/chapter-3-errors.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 19c1d63..2f7a786 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -11,6 +11,14 @@ In the interest of progressive disclosure, we didn't look closely at the idea, because behind that simple word lies a chapter's worth of discussion. Welcome to the *Errors* chapter! +Before we get into it, we want you to know that what we present here +is not the only logically consistent approach to errors, and our +approach may clash with your instincts. In the space of approaches, +ours is the result of optimizing for local reasoning and scalable +software development, and the justifications for our choices are +interdependent. We hope you'll bear with us as we tie them all +together. 
+ ## Definitions To understand any topic, it's important to define it crisply, and From bb19270c15adb867e229f486643b2f3055f46ad0 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 18 Dec 2025 14:49:19 -0800 Subject: [PATCH 20/41] Checkpoint --- better-code/src/chapter-3-errors.md | 113 +++++++++++++++++++--------- 1 file changed, 76 insertions(+), 37 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 2f7a786..424536a 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -208,8 +208,10 @@ the scope of this book. ### Checking For Bugs While, as we've seen, not all bugs are detectable, checking for the -others at runtime is an extremely valuable technique for creating -robust software. +others at runtime is a powerful way to make code better, by detecting +coding errors close to their source and creating an incentive to +prioritize fixing them. + #### Precondition Checks @@ -294,53 +296,60 @@ to use assertions to check them as a confidence-building measure. Similarly, a precondition that can only be checked with a significant cost to preformance could be checked with `assert`. However, in both cases we suggest using a forwarding -function whose name describes its meaning, so that `assert` is used -exclusively for internal soundness checks: +function whose name describes its meaning, so that `assert` is +directly used only for internal soundness checks: ```swift public func preconditionUncheckedInRelease( _ condition: @autoclosure () -> Bool, - _ message: @autoclosure () -> String = "Precondition violated", + _ message: @autoclosure () -> String = "", + file: StaticString = #file, line: UInt = #line +) { + assert( + condition, "Precondition violated:" + message, + file: file, line: line) +} + +public func postconditionUncheckedInRelease( + _ condition: @autoclosure () -> Bool, + _ message: @autoclosure () -> String = "", file: StaticString = #file, line: UInt = #line ) { - assert(condition, message, file: file, line: line) + assert( + condition, "Postcondition violated:" + message, + file: file, line: line) } ``` -The distinction between this check and a use of `assert` is important: -when it fails, this one indicates a bug in the caller, while a failed +The distinction between these checks and a use of `assert` is important: +on failure, these indicate a bug in the caller, while a failed `assert` normally indicates a bug in the callee. -> **Note:** when unsafe components are used to build safe ones, any -> checks that prevent misuse of unsafe functionality must of course be -> unconditional unless you can prove that the code's logic implies -> those checks will always pass. - -All that said, beware the temptation to turn off a precondition check +All that said, resist the temptation to turn off a precondition check in release builds before measuring its effect on performance. The -value of stopping the program before things go too far wrong is often -higher than the cost of any particular check. Certainly, any +value of stopping the program before things go too far wrong is +usually higher than the cost of any particular check. Certainly, any precondition check in a safe function that ultimately prevents an unsafe component from being misused can never be turned off in release builds. ```swift -/// Exchanges the first and last elements of `x`. 
-func swapFirstAndLast(_ x: inout Array) { - precondition(!x.isEmpty) - if x.count == 1 { return } - x.withUnsafeBufferPointer { b in - f = b.baseAddress - l = f + b.count - 1 - swap(&f.pointee, &l.pointee) +extension Array { + /// Exchanges the first and last elements. + mutating func swapFirstAndLast() { + precondition(!self.isEmpty) + if count() == 1 { return } // swapping would be a no-op. + withUnsafeBufferPointer { b in + f = b.baseAddress + l = f + b.count - 1 + swap(&f.pointee, &l.pointee) + } } } ``` - - -In this example, the precondition prevents an out-of-bounds access to -a non-existent first element. +In this example, the precondition check prevents an out-of-bounds +access to a non-existent first element. ## Failures @@ -366,8 +375,8 @@ of two ways: ```swift extension Array { - /// Writes a textual representation of `self` to a temporary file, - /// which is returned. + /// Writes a textual representation of `self` to a temporary file + /// whose location is returned. func writeToTempFile(withChunksOfSize n: Int) -> URL { let r = FileManager.defaultTemporaryDirectory .appendingPathComponent(UUID().uuidString) @@ -378,16 +387,46 @@ of two ways: } ``` -In general, you have two choices: you can make `!X` a precondition of your function, or you can have your function +> Note: both of the examples above are incomplete. + +In general, when condition *c* interferes with fulfilling your +postcondition, you have two choices: you can make *¬c* a precondition +of your function, or you can have your function report the error to its +caller. + +Making *¬c* a precondition is appropriate when: +- It is **possible for the caller to ensure** it is fulfilled. In the + second example above, the call to `write` can fail because the + storage is full. Even if the caller were to measure free space + before the call and find it sufficient, other processes could fill + that space before the call to `write`. We must report a failure in + this case: + + ```swift + extension Array { + /// Writes a textual representation of `self` to a temporary file + /// whose location is returned. + func writeToTempFile(withChunksOfSize n: Int) throws -> URL { + let r = FileManager.defaultTemporaryDirectory + .appendingPathComponent(UUID().uuidString) + try "\(self)".write(to: r, atomically: false, encoding: .utf8) + return r + } + } + ``` -### +- The work required for the caller to ensure the precondition is much + cheaper than the call it is making. For example, when deserializing + a document you might discover that the input is corrupted. The work + required by a caller to check for corruption before the call is + nearly as high as the cost of deserialization, so well-formedness + would be an inappropriate precondition for deserialization. -Make !X a precondition; X reflects a bug in the caller. -Make X a failure; all the code is correct. +BY CONSTRUCTION +## -It's counterintuitive, you should always prefer to classify X as a bug, as long as !X satisfies the criteria for preconditions: -It is possible to ensure !X. For example, there's no way for the caller to ensure there's enough disk space to save a file, because other processes can use up any space that might have been free before the call. So you can't make “there's enough disk to save” a precondition. -Ensuring !X is considerably less work than the work done by the callee. 
For example, if the callee is deserializing a document and finds that it's corrupted, you can't make it a precondition that the file is well-formed, because determining whether it is or not is basically the same work as doing the deserialization. +When both of these conditions are satisfied, you should prefer to make +*¬c* a precondition. ### Definition From ce7e1e64bb891d28ac5d2ae6b3f5bffd58a7723e Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 18 Dec 2025 15:45:18 -0800 Subject: [PATCH 21/41] Checkpoint --- better-code/src/chapter-3-errors.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 424536a..8a5ad6e 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -419,15 +419,25 @@ Making *¬c* a precondition is appropriate when: cheaper than the call it is making. For example, when deserializing a document you might discover that the input is corrupted. The work required by a caller to check for corruption before the call is - nearly as high as the cost of deserialization, so well-formedness - would be an inappropriate precondition for deserialization. + usually nearly as high as the cost of deserialization, so + well-formedness would be an inappropriate precondition for + deserialization. That said, remember that ensuring a precondition + can often be done *by construction*, which makes it free. If this + input is always known to be machine-generated by the same program + that parses it, a precondition is an appropriate choice. -BY CONSTRUCTION -## When both of these conditions are satisfied, you should prefer to make *¬c* a precondition. +- Reasoning: + + - Expense of conditional branches up the call chain + - Cascade of throws and try or complex Result types. + - Helps reasoning about bugs when they occur by classifying more + things as bugs. If every use of a function is legal it becomes + hard to point a finger at anything. + ### Definition Failure: inability to satisfy a postcondition in correct code. From cf5a22363b17bd60935b84913c253b79915d961a Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 23 Dec 2025 15:58:01 -0800 Subject: [PATCH 22/41] checkpt --- better-code/src/chapter-3-errors.md | 128 ++++++++++++++++++++++------ 1 file changed, 101 insertions(+), 27 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 8a5ad6e..f1379d2 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -13,11 +13,10 @@ discussion. Welcome to the *Errors* chapter! Before we get into it, we want you to know that what we present here is not the only logically consistent approach to errors, and our -approach may clash with your instincts. In the space of approaches, -ours is the result of optimizing for local reasoning and scalable -software development, and the justifications for our choices are -interdependent. We hope you'll bear with us as we tie them all -together. +approach may clash with your instincts. It is the result of +optimizing for local reasoning and scalable software development, and +the justifications for our choices are interdependent. We hope you'll +bear with us as we tie them all together. ## Definitions @@ -52,13 +51,14 @@ We'll divide errors into three categories:[^common-definition] > though its preconditions were satisfied. For example, writing a > file might fail because the filesystem is full. 
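To ground the bug and failure categories in code—`smallest(_:)` and `load(_:)` below are invented for illustration, not library API:

```swift
import Foundation

extension Array where Element: Comparable {
  /// Returns the `k` smallest elements of `self`, in ascending order.
  ///
  /// - Precondition: `k <= count`.
  func smallest(_ k: Int) -> [Element] {
    precondition(k <= count, "not enough elements")
    return Array(sorted().prefix(k))
  }
}

let scores = [7, 2, 9]
let lowTwo = scores.smallest(2)   // fine: the caller met the precondition
// scores.smallest(4)             // a *bug*: the caller violated the precondition

struct CorruptedInput: Error {}

/// Decodes UTF-8 `bytes`, reporting a *failure* if they are malformed:
/// a caller can't reasonably pre-validate arbitrary input, so
/// well-formedness is not a precondition here.
func load(_ bytes: [UInt8]) throws -> String {
  guard let text = String(bytes: bytes, encoding: .utf8) else {
    throw CorruptedInput()
  }
  return text
}
```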
-[^avoidable]: While bugs are inevitable, every *specific* bug is - avoidable. +[^avoidable]: While bugs in general are inevitable, every *specific* + bug is avoidable. [^common-definition]: While some folks like to use the word “error” to -refer only to *failures*, as the authors have done in the past, the -use of “error” to encompass all three of these categories appears to -be more widespread. +refer only to what we call *failures*—as the authors have done in the +past—the use of “error” to encompass all three of these categories +seems to be the most widespread practice. We've adopted it to avoid +clashing with common understanding. ## Error Recovery @@ -389,13 +389,27 @@ of two ways: > Note: both of the examples above are incomplete. -In general, when condition *c* interferes with fulfilling your -postcondition, you have two choices: you can make *¬c* a precondition -of your function, or you can have your function report the error to its -caller. - -Making *¬c* a precondition is appropriate when: -- It is **possible for the caller to ensure** it is fulfilled. In the +In general, when a condition *C* is necessary for fulfilling your +postcondition, there are three possible choices: you can make *C* a +precondition of your function, you can have your function throw an +`Error`, or you can weaken the postcondition, usually by making the +function return an `Result` instead of a +`T`.[^failable-initializer] + +[^failable-initializer]: Most functions that return `Optional`, and + what Swift calls a “failable initializer” (declared as `init?(…)`) + can be thought of as taking a “weakened postcondition” approach. + Despite the name “failable initializer,” by our definition an + optional result represents not a failure, but a successful + fulfillment of the weak postcondition. Producing an `Optional` + rather than a `Result` is appropriate when there is no + useful distinction among the reasons that the function can't + produce a `T` (which includes the case that there is only one + possible reason). + +A precondition is appropriate when: + +- It is **possible for the caller to ensure** *C* is fulfilled. In the second example above, the call to `write` can fail because the storage is full. Even if the caller were to measure free space before the call and find it sufficient, other processes could fill @@ -426,21 +440,81 @@ Making *¬c* a precondition is appropriate when: input is always known to be machine-generated by the same program that parses it, a precondition is an appropriate choice. +When both of these conditions are satisfied, you should prefer the +precondition, because, in general: + +- Making *C* a precondition classifies ¬*C* as a bug in the caller, + which aids reasoning about the source of misbehaviors. When all + inputs are allowed, an opportunity to easily identify the incorrect + code is lost. +- Even if you had chosen one of the other options, most clients will + have satisfied *C* by construction at the point of the call. +- Making a client deal with the possibility of a reported error or + return values that will never occur forces them to think about the + case and write code to deal with it. +- Adding error reporting or expanded return values to a function + inevitably generates code and some performance. Most often these + results can't be handled in the immediate caller, so are propagated + upwards, meaning these costs tend to spread to callers, and their + callers, and so forth. This applies even in Swift where the control + flow implied by `try` is implicit. 
+- The viral nature adds complexity to function signatures, either + by `throws` annotations or by more complex types such as `Result`. + +### The Non-Precondition Approaches + +Throwing is a syntactic optimization for the case where the immediate +caller will propagate the error to *its* caller, which can be done +with a simple `try` label on the expression containing the call. +Doing anything else with the error in the caller requires a much +heavier `do { ... } catch ... { ... }` construct. Because errors are +propagated much more often than they are handled Swift has a +first-class language feature—`throw`—to express that pattern. + + +### +- therefore Dynamic type + +Whether +to `throw` or weaken the postcondition is a judgement call + + +When a precondition is not viable, the choice whether to weaken the +postcondition or throw an `Error` is a judgement call. + + +### Failures Are Not Postconditions + +The fact that failures report an inability to satisfy postconditions +means that their details—and the possibilty that they occur—means that +unlike return values, they **are not documented in the description of +the postcondition**. If you find this difference counterintuitive, +consider our rationale + +The fact that the vast majority of errors are not handled in the +immediate caller, but instead propagated up the call chain, is +consequential. + + + +and in many cases +may not be described at all except at a module level, e.g. + +> Any `ThisModule` function that `throws` may report a +> `ThisModule.Error`. -When both of these conditions are satisfied, you should prefer to make -*¬c* a precondition. +Since a type satisfying a protocol with functions marked `throws` may +throw arbitrary errors, a module with generic components would often +have to add -- Reasoning: +> or any errors reported by types satisfying protocol requirements of +> the function. - - Expense of conditional branches up the call chain - - Cascade of throws and try or complex Result types. - - Helps reasoning about bugs when they occur by classifying more - things as bugs. If every use of a function is legal it becomes - hard to point a finger at anything. +This wrinkle means there is not much value to being more precise than -### Definition +> Any `ThisModule` function that `throws` may report arbitrary errors, +> including `ThisModule.Error`. - Failure: inability to satisfy a postcondition in correct code. So why am I tying this definition to postconditions other than to bind our understanding of error handling to our understanding of correctness? From 91c80db4b203a068402bf876eba538c83ecf908a Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Sat, 3 Jan 2026 22:01:06 -0800 Subject: [PATCH 23/41] Tweekz --- better-code/src/chapter-3-errors.md | 33 +++++++++++++---------------- 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index f1379d2..1e8b67e 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -101,27 +101,27 @@ though 'such an inconvenient event' never had occurred in the first place.” Being “unscathed” means two things: first, that the program state is -intact—its invariants are upheld so its code is not relying on any +intact—its invariants are upheld so code is not relying on any newly-incorrect assumptions. Second, that the state makes sense given the correct inputs received so far. “Making sense” is -necessarily a subjective judgement, so examples are called for. 
- -- The initial state of a compiler, before it has seen any input, - certainly meets the compiler's invariants. But when an error is - encountered, resuming with that state would ignore the context seen - so far that can help inform further diagnostics. If the following - text did not match what is expected at the beginning of a source - file, it would be flagged as an error. The error might, for example - have been detected in some otherwise-correct deeply nested - construct. If the compiler's state is reset to its initial - conditions, each closing delimiter of that construct would be - flagged as a new error. +a subjective judgement. For example: + +- The initial state of a compiler, before it has seen any input, meets + the compiler's invariants. But when an error is encountered, + resuming with that state would discard the context seen so + far. Unless the code following the error would have been legal at + the beginning a source file, the compiler will issue many unhelpful + diagnostics for that following code. Recovery means accounting + somehow for the non-erroneous code seen so far and re-synchronizing + the compiler with what follows. - In a desktop graphics application, it's not enough that upon error (say, file creation fails), the user has a well-formed document; an empty document is not an acceptable result. Leaving them with a well-formed document that is subtly changed from its state before - the error would be especially bad. + the error would be especially bad. Recovery means to preserving the + effects of actions issued before the last one, so the document + appears unchanged. ### What About Recovery From Bugs? @@ -482,8 +482,7 @@ to `throw` or weaken the postcondition is a judgement call When a precondition is not viable, the choice whether to weaken the postcondition or throw an `Error` is a judgement call. - -### Failures Are Not Postconditions +### Failures Are Not A Part of Postconditions The fact that failures report an inability to satisfy postconditions means that their details—and the possibilty that they occur—means that @@ -495,8 +494,6 @@ The fact that the vast majority of errors are not handled in the immediate caller, but instead propagated up the call chain, is consequential. - - and in many cases may not be described at all except at a module level, e.g. From cca0e8a1dba1e0924b8aad1e43643fe418f3eaaf Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Mon, 5 Jan 2026 19:45:29 -0800 Subject: [PATCH 24/41] Edits --- better-code/src/chapter-3-errors.md | 54 ++++++++++++++--------------- 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 1e8b67e..10d269c 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -125,18 +125,19 @@ a subjective judgement. For example: ### What About Recovery From Bugs? -We've just seen an examples of recovery from an input error and a failure. -What would it mean to recover from a bug? +We've just seen examples of recovery from an input error and of a +failure. What would it mean to recover from a bug? It's not entirely +clear. First, the bug needs to be detected, and that is not assured. As we saw in the previous chapter, not all precondition violations are detectable. Also, it's important to admit that when a runtime bug check fails, we're not detecting the bug per-se: since bugs are flaws -in *code*, actually detecting bugs involves analyzing the program. 
-We're really detecting a *downstream effect* that the bug has had on -*data*. When we observe that a precondition has been violated, we know -something invalid occurred, but we don't necessarily know exactly -where, how, or the full extent of the damaged data. +in *code*, truly detecting bugs involves analyzing the program. +Instead, a runtime check detects a *downstream effect* that the bug +has had on *data*. When we observe that a precondition has been +violated, we know something invalid occurred, but we don't necessarily +know exactly where, how, or the full extent of the damaged data. So can we “sally forth unscathed?” The problem is that we can't know. Since we don't know where the bug is, the downstream effects of @@ -175,10 +176,10 @@ root cause could have been addressed once. ## How to Handle Bugs -The best strategy is to stop the program before any more damage is -done and generate a crash report or debuggable image that captures as -much information as is available about the state of the program, so -there's a chance of fixing the bug. +When a bug is detected, the best strategy is to stop the program +before more damage is done to data and generate a crash report or +debuggable image that captures as much information as is available +about the state of the program so there's a chance of fixing it. Many people have a hard time accepting the idea of voluntarily terminating, but let's face it: bug detection isn't the only reason @@ -186,16 +187,17 @@ the program might suddenly stop. The program can crash from an *un*detected bug in unsafe code… or a person can trip over the power cord, or the operating system itself could detect an internal bug, causing a “kernel panic” that restarts the hardware. Software should -be designed so that sudden termination is not catastrophic. +be designed so that sudden termination is not catastrophic for its +users. In fact, it's often possible to make restarting the app a completely seamless experience. On an iPhone or iPad, for example, to save -battery and keep foreground apps responsive, the OS may kill your -process any time it's in the background, but will make it look to the -user like it's still running. When the user switches back, every app -is supposed to complete the illusion by coming back up in the same -state it was killed in. Non-catastrophic early termination is -something you can and should design into your system. [^techniques] +battery and keep foreground apps responsive, the operating system may +kill your process any time it's in the background, but the user can +still “switch back” to the app. When the user switches back, every +app is supposed to complete the illusion by coming back up in the same +state it was killed in. So non-catastrophic early termination is +something you *can and should* design into your system. [^techniques] When you accept that sudden termination is part of *every* program's reality, it is easier to accept it as a response to bug detection, and to mitigate the effects. @@ -204,14 +206,12 @@ to mitigate the effects. such as saving incremental backup files, are well-known, but outside the scope of this book. - ### Checking For Bugs -While, as we've seen, not all bugs are detectable, checking for the -others at runtime is a powerful way to make code better, by detecting -coding errors close to their source and creating an incentive to -prioritize fixing them. 
- +While, as we've seen, not all bugs are detectable, detecting as many +as possible at runtime is still a powerful way to improve code, by +finding detecting the presence of coding errors close to their source +and creating an incentive to prioritize fixing them. #### Precondition Checks @@ -297,7 +297,7 @@ measure. Similarly, a precondition that can only be checked with a significant cost to preformance could be checked with `assert`. However, in both cases we suggest using a forwarding function whose name describes its meaning, so that `assert` is -directly used only for internal soundness checks: +used directly only for internal soundness checks: ```swift public func preconditionUncheckedInRelease( @@ -306,7 +306,7 @@ public func preconditionUncheckedInRelease( file: StaticString = #file, line: UInt = #line ) { assert( - condition, "Precondition violated:" + message, + condition, "Precondition violated: \(message())", file: file, line: line) } @@ -316,7 +316,7 @@ public func postconditionUncheckedInRelease( file: StaticString = #file, line: UInt = #line ) { assert( - condition, "Postcondition violated:" + message, + condition, "Postcondition violated: \(message())", file: file, line: line) } ``` From 66303f0a6df86d39343cdaef1a22a4897206581f Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 6 Jan 2026 16:26:52 -0800 Subject: [PATCH 25/41] Checkpoindexter --- better-code/src/chapter-3-errors.md | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 10d269c..a8fa0da 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -299,31 +299,34 @@ significant cost to preformance could be checked with function whose name describes its meaning, so that `assert` is used directly only for internal soundness checks: -```swift public func preconditionUncheckedInRelease( _ condition: @autoclosure () -> Bool, _ message: @autoclosure () -> String = "", file: StaticString = #file, line: UInt = #line ) { assert( - condition, "Precondition violated: \(message())", - file: file, line: line) + condition() || ( + false, fatalError("Precondition violated: \(message())", + file: file, line: line)).0) } -public func postconditionUncheckedInRelease( +public func preconditionUncheckedInRelease( _ condition: @autoclosure () -> Bool, _ message: @autoclosure () -> String = "", file: StaticString = #file, line: UInt = #line ) { assert( - condition, "Postcondition violated: \(message())", - file: file, line: line) + condition() || ( + false, fatalError("Postcondition violated: \(message())", + file: file, line: line)).0) } ``` The distinction between these checks and a use of `assert` is important: on failure, these indicate a bug in the caller, while a failed -`assert` normally indicates a bug in the callee. +`assert` normally indicates a bug in the callee. [^tricky] + +[^tricky]: All that said, resist the temptation to turn off a precondition check in release builds before measuring its effect on performance. The @@ -500,6 +503,9 @@ may not be described at all except at a module level, e.g. > Any `ThisModule` function that `throws` may report a > `ThisModule.Error`. + +## It's just API design to tell people about the errors you think they can handle. 
+ Since a type satisfying a protocol with functions marked `throws` may throw arbitrary errors, a module with generic components would often have to add From a3abe801b9af1aeefe472dfce08632fbe802cf2c Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Tue, 6 Jan 2026 18:46:42 -0800 Subject: [PATCH 26/41] Copy-pasta --- better-code/src/chapter-3-errors.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index a8fa0da..ce81b22 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -310,7 +310,7 @@ public func preconditionUncheckedInRelease( file: file, line: line)).0) } -public func preconditionUncheckedInRelease( +public func postconditionUncheckedInRelease( _ condition: @autoclosure () -> Bool, _ message: @autoclosure () -> String = "", file: StaticString = #file, line: UInt = #line From b7b3d42e4bf2774f78181d82434dead93726a0ec Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 17 Dec 2025 16:21:21 -0800 Subject: [PATCH 27/41] X --- better-code/src/chapter-3-errors.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index ce81b22..2d74c3f 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -286,7 +286,7 @@ release builds. This has the useful effect of allowing programmers to use `assert`s liberally without concern for slowing down release builds. -#### Postcondition and Expensive Precondition Checks +### Postcondition and Expensive Precondition Checks Checking postconditions is the role of unit tests, so in most cases we recommend leaving postcondition checks out of function bodies. @@ -299,6 +299,7 @@ significant cost to preformance could be checked with function whose name describes its meaning, so that `assert` is used directly only for internal soundness checks: +``` public func preconditionUncheckedInRelease( _ condition: @autoclosure () -> Bool, _ message: @autoclosure () -> String = "", From 1ab7851b96ad6bfa61cede7e942e323129f0db53 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 7 Jan 2026 14:18:16 -0800 Subject: [PATCH 28/41] Simplicity! --- better-code/src/chapter-3-errors.md | 58 +++++++++-------------------- 1 file changed, 17 insertions(+), 41 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 2d74c3f..e08d2ae 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -293,49 +293,23 @@ recommend leaving postcondition checks out of function bodies. However, if you can't be confident that unit tests cover enough cases, since postconditions are often expensive to check, it might make sense to use assertions to check them as a confidence-building -measure. Similarly, a precondition that can only be checked with a -significant cost to preformance could be checked with -`assert`. However, in both cases we suggest using a forwarding -function whose name describes its meaning, so that `assert` is -used directly only for internal soundness checks: +measure. 
-``` -public func preconditionUncheckedInRelease( - _ condition: @autoclosure () -> Bool, - _ message: @autoclosure () -> String = "", - file: StaticString = #file, line: UInt = #line -) { - assert( - condition() || ( - false, fatalError("Precondition violated: \(message())", - file: file, line: line)).0) -} +Similarly, a precondition that can only be checked with a significant +cost to preformance could be checked with `assert`, but because—unlike +most uses of `assert`—a failure indicates a bug in the caller, it's +important to distinguish these uses in the assertion message: -public func postconditionUncheckedInRelease( - _ condition: @autoclosure () -> Bool, - _ message: @autoclosure () -> String = "", - file: StaticString = #file, line: UInt = #line -) { - assert( - condition() || ( - false, fatalError("Postcondition violated: \(message())", - file: file, line: line)).0) -} +``` +assert(x.isSorted(), "Precondition failed: x is not sorted.") ``` -The distinction between these checks and a use of `assert` is important: -on failure, these indicate a bug in the caller, while a failed -`assert` normally indicates a bug in the callee. [^tricky] - -[^tricky]: - -All that said, resist the temptation to turn off a precondition check -in release builds before measuring its effect on performance. The -value of stopping the program before things go too far wrong is -usually higher than the cost of any particular check. Certainly, any -precondition check in a safe function that ultimately prevents an -unsafe component from being misused can never be turned off in release -builds. +All that said, resist the temptation to skip a precondition check in +release builds before measuring its effect on performance. The value +of stopping the program before things go too far wrong is usually +higher than the cost of any particular check. Certainly, any +precondition check that prevents a safe function from misusing unsafe +operations must never be turned off in release builds. ```swift extension Array { @@ -352,8 +326,10 @@ extension Array { } ``` -In this example, the precondition check prevents an out-of-bounds -access to a non-existent first element. +The precondition check above prevents an out-of-bounds access to a +non-existent first element, and cannot be skipped without also making +the function unsafe (in which case “unsafe” should appear in the +function name). ## Failures From 79c426c7223b1886f9b2c32f8d5c9b18f0711784 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 7 Jan 2026 14:48:23 -0800 Subject: [PATCH 29/41] Examples and language cleanup --- better-code/src/chapter-3-errors.md | 40 +++++++++++++++++------------ 1 file changed, 23 insertions(+), 17 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index e08d2ae..e6991c0 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -379,13 +379,14 @@ function return an `Result` instead of a [^failable-initializer]: Most functions that return `Optional`, and what Swift calls a “failable initializer” (declared as `init?(…)`) can be thought of as taking a “weakened postcondition” approach. - Despite the name “failable initializer,” by our definition an - optional result represents not a failure, but a successful - fulfillment of the weak postcondition. 
Producing an `Optional` - rather than a `Result` is appropriate when there is no - useful distinction among the reasons that the function can't - produce a `T` (which includes the case that there is only one - possible reason). + Despite the name “failable initializer,” by our definition a `nil` + result represents not a failure, but a successful fulfillment of + the weak postcondition. Producing an `Optional` rather than a + `Result` is appropriate when there will never be a + distinction, useful to the client, among reasons that the function + can't produce a `T`. Subscripting a `Dictionary` with its key type + is a good example. The only reason it would not produce a value + is if the key were not present. A precondition is appropriate when: @@ -409,16 +410,16 @@ A precondition is appropriate when: } ``` -- The work required for the caller to ensure the precondition is much - cheaper than the call it is making. For example, when deserializing - a document you might discover that the input is corrupted. The work - required by a caller to check for corruption before the call is - usually nearly as high as the cost of deserialization, so - well-formedness would be an inappropriate precondition for - deserialization. That said, remember that ensuring a precondition - can often be done *by construction*, which makes it free. If this - input is always known to be machine-generated by the same program - that parses it, a precondition is an appropriate choice. +- The work required for the caller to ensure the precondition is + insignificant. For example, when deserializing a document you might + discover that the input is corrupted. The work required by a caller + to check for corruption before the call is usually nearly as high as + the cost of deserialization, so well-formedness would be an + inappropriate precondition for deserialization. That said, remember + that ensuring a precondition can often be done *by construction*, + which makes it free. If this input is always known to be + machine-generated by the same program that parses it, a precondition + is an appropriate choice. When both of these conditions are satisfied, you should prefer the precondition, because, in general: @@ -441,6 +442,11 @@ precondition, because, in general: - The viral nature adds complexity to function signatures, either by `throws` annotations or by more complex types such as `Result`. +Array indexing is a perfect example where a precondition is better +than a failure: a client can very cheaply ensure that the index is in +range, and in most cases the client's other logic means that no +separate check is needed. + ### The Non-Precondition Approaches Throwing is a syntactic optimization for the case where the immediate From f52541e2a7beed829524afef030689dadafaa5e9 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 7 Jan 2026 15:46:35 -0800 Subject: [PATCH 30/41] Progress on throw vs. 
Result --- better-code/src/chapter-3-errors.md | 76 +++++++++++++++++++---------- 1 file changed, 51 insertions(+), 25 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index e6991c0..e4e592f 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -372,16 +372,16 @@ of two ways: In general, when a condition *C* is necessary for fulfilling your postcondition, there are three possible choices: you can make *C* a precondition of your function, you can have your function throw an -`Error`, or you can weaken the postcondition, usually by making the -function return an `Result` instead of a -`T`.[^failable-initializer] +`Error`, or you can weaken the postcondition, usually by returning a +broader range of values such as those of `Result` instead of +`T`. [^failable-initializer] [^failable-initializer]: Most functions that return `Optional`, and what Swift calls a “failable initializer” (declared as `init?(…)`) can be thought of as taking a “weakened postcondition” approach. Despite the name “failable initializer,” by our definition a `nil` result represents not a failure, but a successful fulfillment of - the weak postcondition. Producing an `Optional` rather than a + the weakened postcondition. Producing an `Optional` rather than a `Result` is appropriate when there will never be a distinction, useful to the client, among reasons that the function can't produce a `T`. Subscripting a `Dictionary` with its key type @@ -443,30 +443,56 @@ precondition, because, in general: by `throws` annotations or by more complex types such as `Result`. Array indexing is a perfect example where a precondition is better -than a failure: a client can very cheaply ensure that the index is in -range, and in most cases the client's other logic means that no -separate check is needed. +than a failure or weakened postcondition: a client can very cheaply +ensure that the index is in range, and in most cases the client's +other logic means that no separate check is needed, and the simple +return type means there's no added cost (e.g. `!` or `try!`) imposed +on client code. ### The Non-Precondition Approaches -Throwing is a syntactic optimization for the case where the immediate -caller will propagate the error to *its* caller, which can be done -with a simple `try` label on the expression containing the call. -Doing anything else with the error in the caller requires a much -heavier `do { ... } catch ... { ... }` construct. Because errors are -propagated much more often than they are handled Swift has a -first-class language feature—`throw`—to express that pattern. - - -### -- therefore Dynamic type - -Whether -to `throw` or weaken the postcondition is a judgement call - - -When a precondition is not viable, the choice whether to weaken the -postcondition or throw an `Error` is a judgement call. +The decision about whether to `throw` or weaken the postcondition is +an API design judgement call, but it is dominated by one consequential +fact: + +> *In most cases*, when a callee can't fulfill its primary purpose, +> neither can the caller—that inability instead propagates up the call +> chain to some general handler that usually reports the condition +> somehow and, if continuing is possible, restores the program to its +> state before the failing operation. + +Because this pattern is so common, most languages provide first-class +features to accomodate it. 
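For a concrete sense of the pattern being accommodated, here is a rough sketch (hypothetical `readFile` and `decode` functions, not from the text) of the propagation boilerplate that such first-class support replaces:

```swift
enum LoadError: Error { case missingFile, corrupt }

// Hypothetical Result-returning operations, used only for illustration.
func readFile() -> Result<String, LoadError> {
    .failure(.missingFile)
}

func decode(_ text: String) -> Result<[String], LoadError> {
    .success(text.split(separator: "\n").map(String.init))
}

// Without first-class language support, every call that can fail is
// followed by the same "check and return early" propagation step.
func loadLines() -> Result<[String], LoadError> {
    let text: String
    switch readFile() {
    case .success(let contents): text = contents
    case .failure(let error): return .failure(error)
    }
    return decode(text)
}
```

Every caller up the chain repeats that same forwarding step, which is exactly the repetition a dedicated language feature removes.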
Swift's error handling fills that role, +propagating errors upward with a simple `try` label on an expression +containing the call. Doing anything else with the error in the caller +requires a much heavier `do { ... } catch ... { ... }` construct. + +The commonality of propagation also explains why most throwing +functions specify in their signatures *that*, but not *what*, they can +throw: the correctness of most of the call chain above the doesn't +depend on that information, and encoding it in the type system would +needlessly limit the evolution of function implementations or cause +meaningless churn in function signatures as implementations change, +essentially exposing what should be implementation +details. [^typed-throws] In fact, since reporting the error is +typically the only useful response, handling the error doesn't depend +on any specifics of its type: `any Error` already provides +[`localizedDescription`](https://developer.apple.com/documentation/swift/error/localizeddescription). + +[^typed-throws]: Swift does have a [“typed throws” + feature](https://docs.swift.org/swift-book/documentation/the-swift-programming-language/errorhandling#Specifying-the-Error-Type) + that lets you specify possible error types, but we suggest you + avoid it for the same reasons outlined above. + +So in the vast majority of cases, you will want to make the condition +a failure rather than weakening the postcondition. The most obvious +exceptions are those cases where it's very likely that an immediate +caller will be able to take some action that allows it to succeed. +For example, a low-level function that makes a single attempt to send +a network packet is very likely to be called by a higher-level +function that retries several times before failing. The low-level +function might return a `Result`, while the higher-level function can +throw. ### Failures Are Not A Part of Postconditions From 15c810da39304cbde324885a9164f8faccbc53bd Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Sat, 10 Jan 2026 14:38:40 -0800 Subject: [PATCH 31/41] Progress. --- better-code/src/chapter-3-errors.md | 182 ++++++++++++++++++---------- 1 file changed, 118 insertions(+), 64 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index e4e592f..f8fdcc4 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -14,9 +14,10 @@ discussion. Welcome to the *Errors* chapter! Before we get into it, we want you to know that what we present here is not the only logically consistent approach to errors, and our approach may clash with your instincts. It is the result of -optimizing for local reasoning and scalable software development, and -the justifications for our choices are interdependent. We hope you'll -bear with us as we tie them all together. +optimizing for local reasoning and the ergonomics of scalable software +development, and the justifications for our choices are +interdependent. We hope you'll bear with us as we tie them all +together. ## Definitions @@ -54,6 +55,10 @@ We'll divide errors into three categories:[^common-definition] [^avoidable]: While bugs in general are inevitable, every *specific* bug is avoidable. + + [^common-definition]: While some folks like to use the word “error” to refer only to what we call *failures*—as the authors have done in the past—the use of “error” to encompass all three of these categories @@ -109,10 +114,10 @@ a subjective judgement. 
For example: - The initial state of a compiler, before it has seen any input, meets the compiler's invariants. But when an error is encountered, resuming with that state would discard the context seen so - far. Unless the code following the error would have been legal at + far. Unless the input following the error would have been legal at the beginning a source file, the compiler will issue many unhelpful - diagnostics for that following code. Recovery means accounting - somehow for the non-erroneous code seen so far and re-synchronizing + diagnostics for that following input. Recovery means accounting + somehow for the non-erroneous input seen so far and re-synchronizing the compiler with what follows. - In a desktop graphics application, it's not enough that upon error @@ -161,15 +166,15 @@ As terrible as that outcome may be, it's better than the alternative. Recovery code is almost never exercised or tested and thus is likely wrong, and the consequences of a botched recovery attempt can be worse than termination. To no advantage, most recovery -code obscures the rest of the code and adds bloat, which hurts -performance. Continuing to run after a bug is detected also hurts our -ability to fix the bug. When a bug is detected, before any further -state changes, you want to immediately capture as much information as -possible that could assist in diagnosis. In development that -typically means dropping into a debugger, and in deployed code that -might mean producing a crash log or core dump. If deployed code -continues to run, the bug is obscured and—even if automatically -reported—will likely be de-prioritized until it is less +code obscures the rest of the code and adds needless tests, which +hurts performance. Continuing to run after a bug is detected also +hurts our ability to fix the bug. When a bug is detected, before any +further state changes, we want to immediately capture as much +information as possible that could assist in diagnosis. In +development that typically means dropping into a debugger, and in +deployed code that might mean producing a crash log or core dump. If +deployed code continues to run, the bug is obscured and—even if +automatically reported—will likely be de-prioritized until it is less fresh and thus harder to address. Worse, it can result in *multiple* symptoms that will be reported as separate higher-priority bugs whose root cause could have been addressed once. @@ -334,7 +339,7 @@ function name). ## Failures As much as we all love bugs, it's time to leave them behind and talk -about failures. Let's say you identify a condition where your +about failures. Suppose you identify a condition where your function is unable to fulfill its primary purpose. That can occur one of two ways: @@ -376,6 +381,14 @@ precondition of your function, you can have your function throw an broader range of values such as those of `Result` instead of `T`. [^failable-initializer] + + + [^failable-initializer]: Most functions that return `Optional`, and what Swift calls a “failable initializer” (declared as `init?(…)`) can be thought of as taking a “weakened postcondition” approach. @@ -390,12 +403,12 @@ broader range of values such as those of `Result` instead of A precondition is appropriate when: -- It is **possible for the caller to ensure** *C* is fulfilled. In the - second example above, the call to `write` can fail because the +- It is **possible for the caller to ensure** *C* is fulfilled. In + the second example above, the call to `write` can fail because the storage is full. 
Even if the caller were to measure free space before the call and find it sufficient, other processes could fill - that space before the call to `write`. We must report a failure in - this case: + that space before the call to `write`. We *cannot* make sufficient + disk space a precondition in this case: ```swift extension Array { @@ -410,16 +423,16 @@ A precondition is appropriate when: } ``` -- The work required for the caller to ensure the precondition is - insignificant. For example, when deserializing a document you might - discover that the input is corrupted. The work required by a caller - to check for corruption before the call is usually nearly as high as - the cost of deserialization, so well-formedness would be an - inappropriate precondition for deserialization. That said, remember - that ensuring a precondition can often be done *by construction*, - which makes it free. If this input is always known to be - machine-generated by the same program that parses it, a precondition - is an appropriate choice. +- It is **affordable for the caller to ensure** the precondition. For + example, when deserializing a document you might discover that the + input is corrupted. The work required by a caller to check for + corruption before the call is usually nearly as high as the cost of + deserialization, so well-formedness would be an inappropriate + precondition for deserialization. That said, remember that ensuring + a precondition can often be done *by construction*, which makes it + free. If this input is always known to be machine-generated by the + same program that parses it, a precondition is an appropriate + choice. When both of these conditions are satisfied, you should prefer the precondition, because, in general: @@ -431,14 +444,15 @@ precondition, because, in general: - Even if you had chosen one of the other options, most clients will have satisfied *C* by construction at the point of the call. - Making a client deal with the possibility of a reported error or - return values that will never occur forces them to think about the - case and write code to deal with it. + with return values that will never occur (because success can be + ensured by construction) forces them to think about the case and + write code to deal with it. - Adding error reporting or expanded return values to a function - inevitably generates code and some performance. Most often these - results can't be handled in the immediate caller, so are propagated - upwards, meaning these costs tend to spread to callers, and their - callers, and so forth. This applies even in Swift where the control - flow implied by `try` is implicit. + inevitably generates code and costs some performance. Most often + these results can't be handled in the immediate caller, so are + propagated upwards, meaning these costs tend to spread to callers, + and their callers, and so forth. This applies even in Swift where + the control flow implied by `try` is implicit. - The viral nature adds complexity to function signatures, either by `throws` annotations or by more complex types such as `Result`. @@ -457,50 +471,90 @@ fact: > *In most cases*, when a callee can't fulfill its primary purpose, > neither can the caller—that inability instead propagates up the call -> chain to some general handler that usually reports the condition -> somehow and, if continuing is possible, restores the program to its -> state before the failing operation. 
+> chain to some general handler that reports the condition somehow +> and restores the program to a state appropriate for continuing. Because this pattern is so common, most languages provide first-class features to accomodate it. Swift's error handling fills that role, propagating errors upward with a simple `try` label on an expression containing the call. Doing anything else with the error in the caller requires a much heavier `do { ... } catch ... { ... }` construct. +Handling a thrown error is slighly heavier-weight than the `if let` or +`if case let` required to decode an expanded return value, so +ergonomically speaking, throwing makes sense unless it's very +likely that most immediate callers will ultimately fulfill their primary +purpose even when the callee has failed to. -The commonality of propagation also explains why most throwing -functions specify in their signatures *that*, but not *what*, they can -throw: the correctness of most of the call chain above the doesn't -depend on that information, and encoding it in the type system would -needlessly limit the evolution of function implementations or cause -meaningless churn in function signatures as implementations change, -essentially exposing what should be implementation -details. [^typed-throws] In fact, since reporting the error is -typically the only useful response, handling the error doesn't depend -on any specifics of its type: `any Error` already provides -[`localizedDescription`](https://developer.apple.com/documentation/swift/error/localizeddescription). - -[^typed-throws]: Swift does have a [“typed throws” - feature](https://docs.swift.org/swift-book/documentation/the-swift-programming-language/errorhandling#Specifying-the-Error-Type) - that lets you specify possible error types, but we suggest you - avoid it for the same reasons outlined above. - -So in the vast majority of cases, you will want to make the condition -a failure rather than weakening the postcondition. The most obvious -exceptions are those cases where it's very likely that an immediate -caller will be able to take some action that allows it to succeed. For example, a low-level function that makes a single attempt to send a network packet is very likely to be called by a higher-level function that retries several times before failing. The low-level function might return a `Result`, while the higher-level function can -throw. +throw. These cases, however, are *extremely* rare, and if you have no +special insight into this aspect of callers, choosing to throw is a +pretty good bet. The use cases for `Result` are rare enough, in fact, +that it *can* be a reasonable choice to simplify your coding standard +and always throw when a function's primary purpose can't be fulfilled. + +The commonality of propagation means that the correctness of functions +in the call chain leading to the callee seldom depends on detailed +information about thrown errors. The usual untyped `throws` +specification in a function signature tells most callers everything +they need to use the function correctly. In fact, since reporting the +error is typically the only useful response when error propagation +stops, handling an error usually doesn't depend on any specifics of +its type: `any Error` provides +[`localizedDescription`](https://developer.apple.com/documentation/swift/error/localizeddescription) +for that purpose. 
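As a small illustration of that last point, here is a sketch (with a hypothetical `runExport` operation) of a handler at the top of a call chain that stops propagation and merely reports:

```swift
import Foundation

struct ExportError: Error {}

// A hypothetical throwing operation, used only for illustration.
func runExport() throws {
    throw ExportError()
}

// Where propagation finally stops, reporting usually needs nothing
// more specific than `any Error`'s `localizedDescription`.
func exportButtonTapped() {
    do {
        try runExport()
    } catch {
        print("Export failed: \(error.localizedDescription)")
    }
}
```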
+ +#### Dynamic Typing of Errors + +Swift does have a [“typed throws” +feature](https://docs.swift.org/swift-book/documentation/the-swift-programming-language/errorhandling#Specifying-the-Error-Type) +that lets you encode possible error types in the types of functions, +but we suggest you avoid it, because it doesn't scale well and tends +to “leak” what should be an implementation detail into a function's +interface. Because failing in a new way can be a breaking change for +clients that use the same feature, it adds development friction and—if +the friction is overcome—causes ripples of change throughout a +codebase. In practice, programmers routinely circumvent similar +features in other languages because they are a poor match for common +usage and have too high a cost to the development process. + +You can think of a thrown error the same way you'd think of a returned +`any P` (where `P` is a protocol—`Error` in this case): we normally +don't feel obliged to specify all the possible concrete types that can +inhabit a given protocol instance, because the protocol itself +provides the interface clients are expected to use. Just as an `is` +test or `as?` cast is *able* to interrogate the concrete type of a +protocol instance, so can a `catch` clause, but that ability does not +oblige a function to expose the details of those types. Of course, an +alternative to the “open” polymorphism of `any P` is the “closed” +polymorphism of an `enum`. Each has its place, but for all the +reasons outlined above, open polymorphism is generally a better fit +for the use case of error reporting. + +### How to Document Thrown Errors + +Because throwing indicates a failure to fulfill postconditions, +details about errors thrown does not belong in a function's summary +sentence fragment. In fact, for reasons just detailed, it's very +common that nothing needs to be documented at all: `throws` in the +function signature indicates that arbitrary errors can be thrown and +nothing more is needed to use the function correctly, so its contract +is complete with no further documentation. + +This does not mean that possible error types and conditions never need +to be documented. If you can anticipate that clients of a function +can use the details of some failure programmatically, it may make +sense to put details in the function's documentation, especially when +some reasons. ### Failures Are Not A Part of Postconditions The fact that failures report an inability to satisfy postconditions means that their details—and the possibilty that they occur—means that unlike return values, they **are not documented in the description of -the postcondition**. If you find this difference counterintuitive, -consider our rationale +the postcondition**. and may not be explicitly documented at all. 
The fact that the vast majority of errors are not handled in the immediate caller, but instead propagated up the call chain, is From 896dc1dc734217c0ec9ffa930cde1677e9a4f300 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Wed, 21 Jan 2026 16:31:21 -0800 Subject: [PATCH 32/41] Terminology + massaging --- better-code/src/chapter-3-errors.md | 484 +++++++++++++++------------- 1 file changed, 255 insertions(+), 229 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index f8fdcc4..cf05067 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -11,12 +11,11 @@ In the interest of progressive disclosure, we didn't look closely at the idea, because behind that simple word lies a chapter's worth of discussion. Welcome to the *Errors* chapter! -Before we get into it, we want you to know that what we present here -is not the only logically consistent approach to errors, and our -approach may clash with your instincts. It is the result of -optimizing for local reasoning and the ergonomics of scalable software -development, and the justifications for our choices are -interdependent. We hope you'll bear with us as we tie them all +What we present here is not the only logically consistent approach to +errors, and our approach may clash with your instincts. It is the +result of optimizing for local reasoning and the ergonomics of +scalable software development, and the justifications for our choices +are interdependent. We hope you'll bear with us as we tie them all together. ## Definitions @@ -31,38 +30,30 @@ Unless we want to invent new terms, we will have to impose a little of our own structure on the usual terminology. We hope these definitions are at least consistent with your understanding: -> **Error**: a condition in conflict with the primary intention of the -> code. +> **Error**: anything that prevents a function from fulfilling its +> postcondition. When we write the word “error” in normal type, we mean the idea above, distinct from the related Swift `Error` protocol, which we'll always spell in code font. -We'll divide errors into three categories:[^common-definition] +Errors come in two flavors:[^common-definition] -> - **Input error**: the program's external inputs are malformed. For -> example, a `{` without a matching `}` is discovered in a JSON -> file. +> - **Programming Error**, or **bug**: code contains an +> avoidable[^avoidable] mistake. For example, an `if` statement +> tests the logical inverse of the correct condition. > -> - **Bug**: code contains an avoidable[^avoidable] mistake. For -> example, an `if` statement might test the logical inverse of the -> correct condition. -> -> - **Failure**: a function could not fulfill its postconditions even -> though its preconditions were satisfied. For example, writing a -> file might fail because the filesystem is full. +> - **Runtime error**: a function could not fulfill its postconditions +> even though its preconditions were satisfied. For example, +> writing a file might fail because the filesystem is full. [^avoidable]: While bugs in general are inevitable, every *specific* bug is avoidable. - - [^common-definition]: While some folks like to use the word “error” to -refer only to what we call *failures*—as the authors have done in the -past—the use of “error” to encompass all three of these categories -seems to be the most widespread practice. 
We've adopted it to avoid +refer only to what we call *runtime errors*—as the authors have done +in the past—the use of “error” to encompass both categories seems to +be the most widespread practice. We've adopted that usage to avoid clashing with common understanding. ## Error Recovery @@ -87,9 +78,10 @@ func f(x: inout Int) { ``` As of this writing, the Swift compiler treats `whilee` as an -identifier and issues five unhelpful errors, four of which point to -the remaining otherwise-valid code. That's not an indictment of -Swift; doing this job correctly is nontrivial. +identifier rather than a misspelled keyword, and issues five unhelpful +errors, four of which point to the remaining otherwise-valid code. +That's not an indictment of Swift; doing this job correctly is +nontrivial. +1. You can make *C* a precondition of your function +2. You can make the function report a runtime error to its caller +3. You can weaken the postcondition (e.g. by returning + `Optional` instead of `T`). [^failable-initializer] [^failable-initializer]: Most functions that return `Optional`, and what Swift calls a “failable initializer” (declared as `init?(…)`) can be thought of as taking a “weakened postcondition” approach. Despite the name “failable initializer,” by our definition a `nil` - result represents not a failure, but a successful fulfillment of + result represents not a runtime error, but a successful fulfillment of the weakened postcondition. Producing an `Optional` rather than a `Result` is appropriate when there will never be a distinction, useful to the client, among reasons that the function @@ -401,14 +384,17 @@ between `() -> Result` and `() throws -> T`. is a good example. The only reason it would not produce a value is if the key were not present. -A precondition is appropriate when: +### Adding a Precondition + +It's appropriate to add a precondition when: - It is **possible for the caller to ensure** *C* is fulfilled. In the second example above, the call to `write` can fail because the - storage is full. Even if the caller were to measure free space - before the call and find it sufficient, other processes could fill - that space before the call to `write`. We *cannot* make sufficient - disk space a precondition in this case: + storage is full (among other reasons). Even if the caller were to + measure free space before the call and find it sufficient, other + processes could fill that space before the call to `write`. We + *cannot* make sufficient disk space a precondition in this case, so + we should instead propagate the error: ```swift extension Array { @@ -424,101 +410,101 @@ A precondition is appropriate when: ``` - It is **affordable for the caller to ensure** the precondition. For - example, when deserializing a document you might discover that the - input is corrupted. The work required by a caller to check for + example, when deserializing a data structure you might discover that + the input is corrupted. The work required by a caller to check for corruption before the call is usually nearly as high as the cost of - deserialization, so well-formedness would be an inappropriate - precondition for deserialization. That said, remember that ensuring - a precondition can often be done *by construction*, which makes it - free. If this input is always known to be machine-generated by the - same program that parses it, a precondition is an appropriate - choice. + deserialization, so validity is an inappropriate precondition for + deserialization. 
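As a sketch of that trade-off (with a hypothetical `Settings` type and wire format, not from the text), a deserializer that reports corruption as a runtime error rather than demanding valid input:

```swift
struct Settings { var values: [String: String] }

enum DeserializationError: Error { case corruptInput(line: Int) }

/// Returns the settings encoded in `text`.
func deserializeSettings(from text: String) throws -> Settings {
    var values: [String: String] = [:]
    for (index, line) in text.split(separator: "\n").enumerated() {
        // Detecting corruption is essentially the same work as parsing,
        // so callers can't affordably pre-validate; report it instead.
        let parts = line.split(separator: "=", maxSplits: 1)
        guard parts.count == 2 else {
            throw DeserializationError.corruptInput(line: index + 1)
        }
        values[String(parts[0])] = String(parts[1])
    }
    return Settings(values: values)
}
```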
That said, remember that ensuring a precondition + can often be done *by construction*, which makes it free. If the + input is always known to be machine-generated by the same OS process + that parses it, a precondition is an appropriate choice. -When both of these conditions are satisfied, you should prefer the -precondition, because, in general: +### Reporting a Runtime Error -- Making *C* a precondition classifies ¬*C* as a bug in the caller, - which aids reasoning about the source of misbehaviors. When all - inputs are allowed, an opportunity to easily identify the incorrect - code is lost. -- Even if you had chosen one of the other options, most clients will - have satisfied *C* by construction at the point of the call. -- Making a client deal with the possibility of a reported error or - with return values that will never occur (because success can be - ensured by construction) forces them to think about the case and - write code to deal with it. -- Adding error reporting or expanded return values to a function - inevitably generates code and costs some performance. Most often - these results can't be handled in the immediate caller, so are - propagated upwards, meaning these costs tend to spread to callers, - and their callers, and so forth. This applies even in Swift where - the control flow implied by `try` is implicit. -- The viral nature adds complexity to function signatures, either - by `throws` annotations or by more complex types such as `Result`. +Swift provides two ways to report runtime errors: `throw`ing an +`Error` and returning a `Result`. The choice of which to +use is an API design judgement call, but it is dominated by one +consequential fact: -Array indexing is a perfect example where a precondition is better -than a failure or weakened postcondition: a client can very cheaply -ensure that the index is in range, and in most cases the client's -other logic means that no separate check is needed, and the simple -return type means there's no added cost (e.g. `!` or `try!`) imposed -on client code. +> *In most cases*, when a callee can't fulfill its postconditions, +> neither can the caller—that inability instead propagates up the call +> chain to some general handler that restores the program to a state +> appropriate for continuing, usually after some form of error +> reporting. -### The Non-Precondition Approaches +Because this pattern is so common, most languages provide first-class +features to accomodate it without causing this kind of repeated +boilerplate: -The decision about whether to `throw` or weaken the postcondition is -an API design judgement call, but it is dominated by one consequential -fact: + ```swift + let someValueOrError = thing1ThatCanFail() + guard case .success(let someValue) = someValueOrError else { + return someValueOrError + } -> *In most cases*, when a callee can't fulfill its primary purpose, -> neither can the caller—that inability instead propagates up the call -> chain to some general handler that reports the condition somehow -> and restores the program to a state appropriate for continuing. + let otherValueOrError = thing2ThatCanFail() + guard case .success(let otherValue) = otherValueOrError else { + return otherValueOrError + } + ``` + + +Swift's thrown errors fill that role by propagating errors upward with +a simple `try` label on an expression containing the call. 
+ + ```swift + let someValue = try thing1ThatCanFail() + let otherValue = try thing2ThatCanFail() + ``` + +Doing anything with the error *other* than propagating it requires a +much heavier `do { ... } catch ... { ... }` construct, which is +slighly heavier-weight than the boilerplate pattern, making throwing a +worse choice when clients do not directly propagate errors. + +The great ergonomic advantage of throwing in the common case means +that returning a `Result` only makes sense when it's very likely that +your callers will be able to satisfy their postconditions, *even when +faced with your runtime error*. For example, a low-level +function that makes a single attempt to send a network packet is very +likely to be called by a higher-level function that retries several +times with an exponentially-increasing delay before failing. The +low-level function might return a `Result`, while the higher-level +function would throw. These cases, however, are *extremely* rare, and +if you have no special insight into your function's callers, choosing +to `throw` is a pretty good bet.[^uniform-choice] + +[^uniform-choice]: Returning a `Result` could also make sense when + most callers are going to transform the error somehow before + propagating it, but code that propagates transformed errors is + also very rare. The use cases for `Result` are rare enough, in + fact, that it's a reasonable choice to always `throw` for runtime + error reporting. -Because this pattern is so common, most languages provide first-class -features to accomodate it. Swift's error handling fills that role, -propagating errors upward with a simple `try` label on an expression -containing the call. Doing anything else with the error in the caller -requires a much heavier `do { ... } catch ... { ... }` construct. -Handling a thrown error is slighly heavier-weight than the `if let` or -`if case let` required to decode an expanded return value, so -ergonomically speaking, throwing makes sense unless it's very -likely that most immediate callers will ultimately fulfill their primary -purpose even when the callee has failed to. - -For example, a low-level function that makes a single attempt to send -a network packet is very likely to be called by a higher-level -function that retries several times before failing. The low-level -function might return a `Result`, while the higher-level function can -throw. These cases, however, are *extremely* rare, and if you have no -special insight into this aspect of callers, choosing to throw is a -pretty good bet. The use cases for `Result` are rare enough, in fact, -that it *can* be a reasonable choice to simplify your coding standard -and always throw when a function's primary purpose can't be fulfilled. - -The commonality of propagation means that the correctness of functions -in the call chain leading to the callee seldom depends on detailed -information about thrown errors. The usual untyped `throws` -specification in a function signature tells most callers everything -they need to use the function correctly. In fact, since reporting the -error is typically the only useful response when error propagation -stops, handling an error usually doesn't depend on any specifics of -its type: `any Error` provides +#### Dynamic Typing of Errors + +The overwhelming commonality of propagation means that functions in +the call chain above the one initiating the error report seldom +depends on detailed information about thrown errors. 
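A sketch of that common shape (with hypothetical image-loading helpers): intermediate functions forward errors with `try` and an untyped `throws`, never inspecting them:

```swift
struct ImageDecodeError: Error {}

// A hypothetical leaf operation that can fail.
func decodeImage(atPath path: String) throws -> [UInt8] {
    throw ImageDecodeError()
}

// Intermediate functions need only `throws` and `try`; nothing in
// them depends on *which* errors flow through.
func loadThumbnail(atPath path: String) throws -> [UInt8] {
    try decodeImage(atPath: path)
}

func loadGallery(atPaths paths: [String]) throws -> [[UInt8]] {
    try paths.map { try loadThumbnail(atPath: $0) }
}
```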
The usual +untyped `throws` specification in a function signature tells most +callers everything they need to use the function correctly. In fact, +since reporting the error to a human is typically the only useful +response when propagation stops, the same often applies to the +function that ultimately catches the error: `any Error` provides [`localizedDescription`](https://developer.apple.com/documentation/swift/error/localizeddescription) for that purpose. -#### Dynamic Typing of Errors - Swift does have a [“typed throws” feature](https://docs.swift.org/swift-book/documentation/the-swift-programming-language/errorhandling#Specifying-the-Error-Type) that lets you encode possible error types in the types of functions, but we suggest you avoid it, because it doesn't scale well and tends to “leak” what should be an implementation detail into a function's interface. Because failing in a new way can be a breaking change for -clients that use the same feature, it adds development friction and—if -the friction is overcome—causes ripples of change throughout a -codebase. In practice, programmers routinely circumvent similar -features in other languages because they are a poor match for common -usage and have too high a cost to the development process. +clients that use the same feature, it adds development friction +which—if overcome—causes ripples of change throughout a codebase. In +languages with statically constrained error reporting, programmers +routinely circumvent the mechanism because it is a poor match for +common usage and has too high a cost to the development process. You can think of a thrown error the same way you'd think of a returned `any P` (where `P` is a protocol—`Error` in this case): we normally @@ -527,60 +513,100 @@ inhabit a given protocol instance, because the protocol itself provides the interface clients are expected to use. Just as an `is` test or `as?` cast is *able* to interrogate the concrete type of a protocol instance, so can a `catch` clause, but that ability does not -oblige a function to expose the details of those types. Of course, an -alternative to the “open” polymorphism of `any P` is the “closed” -polymorphism of an `enum`. Each has its place, but for all the -reasons outlined above, open polymorphism is generally a better fit -for the use case of error reporting. - -### How to Document Thrown Errors - -Because throwing indicates a failure to fulfill postconditions, -details about errors thrown does not belong in a function's summary -sentence fragment. In fact, for reasons just detailed, it's very -common that nothing needs to be documented at all: `throws` in the -function signature indicates that arbitrary errors can be thrown and -nothing more is needed to use the function correctly, so its contract -is complete with no further documentation. - -This does not mean that possible error types and conditions never need -to be documented. If you can anticipate that clients of a function -can use the details of some failure programmatically, it may make -sense to put details in the function's documentation, especially when -some reasons. - -### Failures Are Not A Part of Postconditions - -The fact that failures report an inability to satisfy postconditions -means that their details—and the possibilty that they occur—means that -unlike return values, they **are not documented in the description of -the postcondition**. and may not be explicitly documented at all. 
- -The fact that the vast majority of errors are not handled in the -immediate caller, but instead propagated up the call chain, is -consequential. - -and in many cases -may not be described at all except at a module level, e.g. +oblige a function to expose the details of those types. + +Of course, an alternative to the “open” polymorphism of `any P` is the +“closed” polymorphism of an `enum`. Each has its place, but for all +the reasons outlined above, open polymorphism is generally a better +fit for the use case of error reporting. + +The exception to this reasoning is once again the case where clients +are very unlikely to directly propagate the error, in which case you +are likely to use `Result` rather than throwing, and using a +more specific error type than `any Error` might make sense. + +#### How to Document Thrown Errors + +Because a runtime error report indicates a failure to fulfill +postconditions, information about errors—including that they are +possible—does not belong in a function's postcondition documentation, +whose primary home is the summary sentence fragment.[^result-doc] + +[^result-doc]: This creates a slightly awkward special case for + functions that return a `Result`, which should be documented + as though they just return a `T`: + + swift``` + extension Array { + /// Writes a textual representation of `self` to a temporary file + /// whose location is returned. + func writeToTempFile(withChunksOfSize n: Int) + -> Result + { ... } + } + ``` + +In fact, for reasons just detailed, it's very common that nothing +about errors needs to be documented at all: `throws` (or `Result`) in +the function signature indicates that arbitrary errors can be thrown +and no further information about errors is required to use the +function correctly. + +That does not mean that possible error types and conditions should +*never* be documented. If you anticipate that clients of a function +will use the details of some runtime error programmatically, it may +make sense to put details in the function's documentation, but resist +the urge to document these details just because they “might be +needed.” As with any other detail of an API, documenting errors that +(almost) noone cares creates a usability tax that is paid by everyone. + +A useful middle ground is to describe reported errors at the module +level, e.g. > Any `ThisModule` function that `throws` may report a > `ThisModule.Error`. +A description like the one above does not preclude reporting other +errors, such as those thrown by a dependency like `Foundation`, but +calls attention to the error type introduced by `ThisModule`. -## It's just API design to tell people about the errors you think they can handle. +### Weakening The Postcondition -Since a type satisfying a protocol with functions marked `throws` may -throw arbitrary errors, a module with generic components would often -have to add +- sort example +- array indexing example (see below) +- sum of unsigned numbers returns -1 example +- dictionary indexing -> optional -> or any errors reported by types satisfying protocol requirements of -> the function. +When both of these conditions are satisfied, you should prefer the +precondition, because, in general: -This wrinkle means there is not much value to being more precise than +- Making *C* a precondition classifies ¬*C* as a bug in the caller, + which aids reasoning about the source of misbehaviors. When all + inputs are allowed, an opportunity to easily identify the incorrect + code is lost. 
+- Even if you had chosen one of the other options, most clients will + have satisfied *C* by construction at the point of the call. +- Making a client deal with the possibility of a reported error or + with return values that will never occur (because success can be + ensured by construction) forces them to think about the case and + write code to deal with it. +- Adding error reporting or expanded return values to a function + inevitably generates code and costs some performance. Most often + these results can't be handled in the immediate caller, so are + propagated upwards, meaning these costs tend to spread to callers, + and their callers, and so forth. This applies even in Swift where + the control flow implied by `try` is implicit. +- The viral nature adds complexity to function signatures, either + by `throws` annotations or by more complex types such as `Result`. -> Any `ThisModule` function that `throws` may report arbitrary errors, -> including `ThisModule.Error`. +Array indexing is a perfect example where a precondition is better +than a runtime error or weakened postcondition: a client can very cheaply +ensure that the index is in range, and in most cases the client's +other logic means that no separate check is needed, and the simple +return type means there's no added cost (e.g. `!` or `try!`) imposed +on client code. +### Onward So why am I tying this definition to postconditions other than to bind our understanding of error handling to our understanding of correctness? From 842b853aae3dfaa999084a7fb4ec589bc057a71b Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 22 Jan 2026 15:23:05 -0800 Subject: [PATCH 33/41] Onward --- better-code/src/chapter-3-errors.md | 131 ++++++++++++++++++++-------- 1 file changed, 97 insertions(+), 34 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index cf05067..0ce1e81 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -377,12 +377,7 @@ postcondition, there are three possible choices: can be thought of as taking a “weakened postcondition” approach. Despite the name “failable initializer,” by our definition a `nil` result represents not a runtime error, but a successful fulfillment of - the weakened postcondition. Producing an `Optional` rather than a - `Result` is appropriate when there will never be a - distinction, useful to the client, among reasons that the function - can't produce a `T`. Subscripting a `Dictionary` with its key type - is a good example. The only reason it would not produce a value - is if the key were not present. + the weakened postcondition. ### Adding a Precondition @@ -481,6 +476,7 @@ to `throw` is a pretty good bet.[^uniform-choice] fact, that it's a reasonable choice to always `throw` for runtime error reporting. + #### Dynamic Typing of Errors The overwhelming commonality of propagation means that functions in @@ -536,7 +532,7 @@ whose primary home is the summary sentence fragment.[^result-doc] functions that return a `Result`, which should be documented as though they just return a `T`: - swift``` + ```swift extension Array { /// Writes a textual representation of `self` to a temporary file /// whose location is returned. 
@@ -547,10 +543,9 @@ whose primary home is the summary sentence fragment.[^result-doc] ``` In fact, for reasons just detailed, it's very common that nothing -about errors needs to be documented at all: `throws` (or `Result`) in -the function signature indicates that arbitrary errors can be thrown -and no further information about errors is required to use the -function correctly. +about errors needs to be documented at all: `throws` in the function +signature indicates that arbitrary errors can be thrown and no further +information about errors is required to use the function correctly. That does not mean that possible error types and conditions should *never* be documented. If you anticipate that clients of a function @@ -572,39 +567,107 @@ calls attention to the error type introduced by `ThisModule`. ### Weakening The Postcondition -- sort example -- array indexing example (see below) -- sum of unsigned numbers returns -1 example -- dictionary indexing -> optional +There are several ways to weaken a postcondition. The first is to make +it conditional on some property of the function's inputs. For +example, take the `sort` method from the previous chapter. Instead of +making it a precondition that the comparison is a total preorder, we +could weaken the postcondition as follows: + +```swift +/// Sorts the elements so that all adjacent pairs satisfy +/// `areInOrder`, or permutes the elements in an unspecified way if +/// `areInOrder` is not a [total +/// preorder](https://en.wikipedia.org/wiki/Weak_ordering#Total_preorders) +/// `areInOrder`. +/// +/// - Complexity: at most N log N comparisons, where N is the number +/// of elements. +mutating func sort(areInOrder: (Element, Element)->Bool) { ... } +``` + +As you can see, this change makes the API more complicated to no +advantage: an unspecified permutation is not a result any client wants +from `sort`.[^random-sort] + +[^random-sort]: We've seen attempts to randomly shuffle elements using + `x.sort { Bool.random() }`, but that has worse performance than a + proper `x.randomShuffle()` would, and is not guaranteed to + preserve the same randomness properties. Perhaps more + importantly, the code lies by claiming to sort when it in fact + does not. + +Another approach is to intentionally expand the range of values +returned. For example, `Array`'s existing `subscript` could be +declared as: + +``` +/// The `i`th element. +subscript(i: Int) -> Element +``` + +but could have instead been designed this way: + +``` +/// The `i`th element, or `nil` if there is no such element. +subscript(i: Int) -> Element? +``` + +The change adds only a small amount of complexity to the contract, but +consider the impact on callers: every existing use of array indexing +now needs to be force-unwrapped. Aside from the runtime cost of all +those tests and branches, seeing `!` in the code adds cognitive +overhead for human readers. In the vast majority of callers, the +precondition of the original API is established by construction with +no special checks, but should a client need to check that an index is +in bounds, doing so is extremely cheap. + +Occasionally, though, a weakened postcondition is appropriate. +Dictionary's `subscript` taking a key is one example: + +``` +/// The value associated with `k`, or `nil` if there is no such value. +subscript(k: Key) -> Value? 
+``` + +In this case, it's common that callers have not somehow ensured the +dictionary has a key `k`, and checking for the presence of the key in +the caller would have a substantial cost similar to that of the +subscript itself, so it's much more efficient to pay that cost once in +the `subscript` implementation. + +### How to Choose? -When both of these conditions are satisfied, you should prefer the -precondition, because, in general: +Clearly weakening a postcondition seldom pays off and should be used +rarely. Whenever it is appropriate, you should prefer to add a +precondition, because: + +- A failure to satisfy the condition becomes a bug in the caller, + which aids in reasoning about the source of misbehaviors. When all + inputs are allowed, an opportunity to easily identify incorrect code + is lost. Furthermore, if the precondition is checkable at runtime, + you can catch misuse in testing, *before* it becomes misbehavior. -- Making *C* a precondition classifies ¬*C* as a bug in the caller, - which aids reasoning about the source of misbehaviors. When all - inputs are allowed, an opportunity to easily identify the incorrect - code is lost. -- Even if you had chosen one of the other options, most clients will - have satisfied *C* by construction at the point of the call. - Making a client deal with the possibility of a reported error or with return values that will never occur (because success can be ensured by construction) forces them to think about the case and write code to deal with it. + - Adding error reporting or expanded return values to a function inevitably generates code and costs some performance. Most often these results can't be handled in the immediate caller, so are propagated upwards, meaning these costs tend to spread to callers, - and their callers, and so forth. This applies even in Swift where - the control flow implied by `try` is implicit. -- The viral nature adds complexity to function signatures, either - by `throws` annotations or by more complex types such as `Result`. - -Array indexing is a perfect example where a precondition is better -than a runtime error or weakened postcondition: a client can very cheaply -ensure that the index is in range, and in most cases the client's -other logic means that no separate check is needed, and the simple -return type means there's no added cost (e.g. `!` or `try!`) imposed -on client code. + and their callers, and so forth. (The control flow implied by `try` + has a cost similar to the cost for checking and propagating a + returned `Result`). + +- The alternatives complicate APIs. + +Producing an `Optional` rather than a + `Result` is appropriate when there will never be a + distinction, useful to the client, among reasons that the function + can't produce a `T`. Subscripting a `Dictionary` with its key type + is a good example. The only reason it would not produce a value + is if the key were not present. 
### Onward From 9fb694fb28f8ef89c0dd448c4097c1fcefdccf57 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Fri, 23 Jan 2026 17:25:50 -0800 Subject: [PATCH 34/41] Finish 1st draft --- better-code/src/chapter-3-errors.md | 471 +++++++++++----------------- 1 file changed, 183 insertions(+), 288 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 0ce1e81..0d60278 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -521,39 +521,43 @@ are very unlikely to directly propagate the error, in which case you are likely to use `Result` rather than throwing, and using a more specific error type than `any Error` might make sense. -#### How to Document Thrown Errors +#### How to Document Runtime Errors Because a runtime error report indicates a failure to fulfill postconditions, information about errors—including that they are possible—does not belong in a function's postcondition documentation, whose primary home is the summary sentence fragment.[^result-doc] -[^result-doc]: This creates a slightly awkward special case for +[^result-doc]: This rule creates a slightly awkward special case for functions that return a `Result`, which should be documented as though they just return a `T`: ```swift extension Array { - /// Writes a textual representation of `self` to a temporary file - /// whose location is returned. + /// Writes a textual representation of `self` to a temporary file, + /// returning its location. func writeToTempFile(withChunksOfSize n: Int) -> Result { ... } } ``` -In fact, for reasons just detailed, it's very common that nothing -about errors needs to be documented at all: `throws` in the function -signature indicates that arbitrary errors can be thrown and no further -information about errors is required to use the function correctly. +In fact, because most callers propagate errors, it's very common that +nothing about errors needs to be documented at all: `throws` in the +function signature indicates that arbitrary errors can be thrown and +no further information about errors is required to use the function +correctly. That does not mean that possible error types and conditions should *never* be documented. If you anticipate that clients of a function will use the details of some runtime error programmatically, it may -make sense to put details in the function's documentation, but resist -the urge to document these details just because they “might be +make sense to put details in the function's documentation. That said, +resist the urge to document these details just because they “might be needed.” As with any other detail of an API, documenting errors that -(almost) noone cares creates a usability tax that is paid by everyone. +are irrelevant to most code creates a usability tax that is paid by +everyone. In any case, keeping runtime error information out of +postconditions (and thus summary documentation) works to simplify +contracts and make functions easier to use. A useful middle ground is to describe reported errors at the module level, e.g. @@ -565,6 +569,81 @@ A description like the one above does not preclude reporting other errors, such as those thrown by a dependency like `Foundation`, but calls attention to the error type introduced by `ThisModule`. +##### Documenting Mutating Functions + +When a runtime error occurs partway through a mutating operation, a a +partially mutated state may be left behind. Trying to describe these +states in detail is usually a bad idea. 
Apart from the fact that +such descriptions can be unmanageably complex—try to document the +state of an array from partway through an aborted sorting operation—it +is normally information no client can use. + +Partially documenting these states *can* be useful, however. For +example, [Swift's +`sort(by:)`](https://developer.apple.com/documentation/swift/array/sort(by:)) +method guarantees that no elements are lost if an error occurs, which +can be useful in code that manages allocated resources, or that +depends for its safety on invariants being upheld (usually the +implementations of safe types with unsafe implementation details). +The following code uses that guarantee to ensure that all the +allocated buffers are eventually freed. + +```swift +/// Processes each element of `xs` in an order determined by the +/// [total +/// preorder](https://en.wikipedia.org/wiki/Weak_ordering#Total_preorders) +/// `areInOrder` using a distinct 1Kb buffer for each one. +func f(_ xs: [X], orderedBy areInOrder: (X, X) throws -> Bool) rethrows +{ + var buffers = xs.map { x in + (p, UnsafeMutablePointer.allocate(capacity: 1024)) } + defer { for _, b in buffers { b.deallocate() } } + + buffers.sort { !areInOrder($1.0, $0.0) } + ... +} +``` + +The **strong guarantee** that *no mutation occurs at all* in case +of an error is the easiest to document and most useful special case: + +```swift +/// If `shouldSwap(x, y)`, swaps `x` and `y`. +/// +/// If an error is thrown there are no effects. +func swap( + _ x: inout T, _ y: inout T, if shouldSwap: (T, T) throws->Bool +) rethrows { + if try shouldSwap(x, y) { + swap(&x, &y) + } +} +``` + +A few caveats about mutation guarantees when errors occur: + +1. Known use cases are few and rare: most allacated resources are + ultimately managed by the `deinit` of some class, and uses of + unsafe operations are usually encapsulated. Weigh the marginal + utility of making guarantees against the complexity it adds to + documentation. +2. Like any guarantee, they can limit your ability to change a + function's implementation without breaking clients. +3. Avoid making guarantees if it has a performance cost. For example, + one way to get the strong guarantee is to order operations so the + first mutation occurs only after all throwing operations are + complete. Some mutating operations can be arranged that way at + little or no cost, but you can do it to *any* operation by copying + the data, mutating the copy (which might fail), and finally + replacing the data with the mutated copy. The problem is that the + copy can be expensive and you can't be sure all clients need it. + Even when a client needs to give the same guarantee itself, your + work may be wasted: when operations A and B give the strong + guarantee, the operation C composed of A and then B does not (if B + fails, the modifications of A remain). If you need a strong + guarantee for C, another copy is required and the lower-level + copies haven't helped at all. + ### Weakening The Postcondition There are several ways to weaken a postcondition. The first is to make @@ -641,288 +720,104 @@ Clearly weakening a postcondition seldom pays off and should be used rarely. Whenever it is appropriate, you should prefer to add a precondition, because: -- A failure to satisfy the condition becomes a bug in the caller, - which aids in reasoning about the source of misbehaviors. When all - inputs are allowed, an opportunity to easily identify incorrect code - is lost. 
Furthermore, if the precondition is checkable at runtime, - you can catch misuse in testing, *before* it becomes misbehavior. +- It makes it easy to identify incorrect code. A failure to satisfy + the condition becomes a bug in the caller, which aids in reasoning + about the source of misbehaviors. If the precondition is checkable + at runtime, you can even catch misuse in testing, *before* it + becomes misbehavior. -- Making a client deal with the possibility of a reported error or - with return values that will never occur (because success can be - ensured by construction) forces them to think about the case and - write code to deal with it. +- Making a client deal with the possibility of return values or + runtime errors that will never occur in practice forces authors and + readers of client code to think about the case and the code to + handle it (or about why that code isn't needed). - Adding error reporting or expanded return values to a function inevitably generates code and costs some performance. Most often these results can't be handled in the immediate caller, so are - propagated upwards, meaning these costs tend to spread to callers, - and their callers, and so forth. (The control flow implied by `try` - has a cost similar to the cost for checking and propagating a - returned `Result`). + propagated upwards, spreading the cost to callers, their callers, + and so forth. (The control flow implied by `try` has a cost similar + to the cost for checking and propagating a returned `Result`). - The alternatives complicate APIs. -Producing an `Optional` rather than a - `Result` is appropriate when there will never be a - distinction, useful to the client, among reasons that the function - can't produce a `T`. Subscripting a `Dictionary` with its key type - is a good example. The only reason it would not produce a value - is if the key were not present. - -### Onward - -So why am I tying this definition to postconditions other than to bind our understanding of error handling to our understanding of correctness? - -First of all, it simplifies and improves understandability of contracts. This is easiest to see if you have a dedicated language mechanism for error handling: - -** Note: fictional programming language ** - -// Returns `x` sorted in `order`, or throws an exception -// in case order fails. -fn sorted(x: [Int], order: Ordering) throws -> [Int] - -// Returns `x` sorted in `order`. -fn sorted(x: [Int], order: Ordering) throws -> [Int] - -Even if you feel you need to say something about possible failures, that becomes a secondary note that's not essential to the contract. - -// Returns `x` sorted in `order`. -// -// Propagates any exceptions thrown by `order`. -fn sorted(x: [Int], order: Ordering) throws -> [Int] - -A programmer can know everything essential from the summary fragment and the signature. Another way this separation plays nicely with exceptions is that you can say the postcondition of a function describes what you get when it returns, and a throwing function never returns. - -If you don't use exceptions, you still simplified contracts as long as you have dedicated types to represent the possibility of failure. - -// Returns `x` sorted in `order`. -fn sorted(x: [Int], order: Ordering) -> ResultOrFailure<[Int]> - -Separating the function's primary intention from the reasons for failure makes sense, because the reasons for failure matter less. If that's not obvious yet, some justification is coming. 
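
In Swift the same separation falls out naturally.  Here is a minimal
sketch, using invented names rather than anything from a real library:
the summary documents only the primary purpose, and the `Result`
return type keeps the possibility of failure out of it.

```swift
/// Returns `x` sorted according to `areInOrder`.
func sorted(
  _ x: [Int], by areInOrder: (Int, Int) throws -> Bool
) -> Result<[Int], any Error> {
  Result { try x.sorted(by: areInOrder) }
}
```

A reader gets everything essential from the signature and the summary
fragment; the reasons a `[Int]` might not be produced stay out of the
contract.
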
- -Another reason to exclude the failure case from the postcondition is that you want postconditions to be solid and fully described, but a mutating operation that fails often leaves behind a state that's very difficult to nail down, and as I said in the contracts talk, that you usually don't want to nail down, because it's detail nobody cares about. But if it's part of the postcondition, you need to say something about it, and that further complicates the contract. - -// Sorts `x` according to `order` or throws an exception -// if `order` fails, leaving `x` modified in unspecified -// ways. -fn sort(mutating x: [Int], order: Ordering) throws - -// Sorts `x` according to `order`. -fn sort(mutating x: [Int], order: Ordering) throws - -### Two kinds of failures - -If you've spent some time writing code that carefully handles failures, especially in a language like C where all the error propagation is explicit, failures start to fall into two main categories: local and non-local, based on where the recovery is likely to happen. - -Local recovery occurs very close to the source of failure, usually in the immediate caller, in a way that often depends heavily on the reasons for the failure. In many cases, the recovery path is performance-critical. - -**Example**: you have an ultrafast memory allocator that draws from a local pool much smaller than your system memory. You build a general-purpose allocator that first tries your fast allocator, and only if that allocation fails, recovers by trying the system allocator. - -**Example**: the lowest level function that tries to send a network packet can fail for a whole slew of reasons (https://www.ibm.com/docs/en/zos/2.3.0?topic=codes-sockets-return-errnos), some of which may indicate a temporary condition like packet collision. 99% of the time, the immediate caller is a higher-level function that checks for these conditions and if found, initiates a retry protocol with exponential backoff, only itself failing after N failed retries. That lowest-level failure is local. The failure after N retries is very likely to be non-local. - -Non-local recovery, which is far more common, occurs far from the source, usually in a way that can be described without reference to the reasons for failure. For example, when you're serializing a complex document, serializing any part means serializing all of its sub-parts, and parts are ultimately nested many layers deep. Because you can run out of space in the serialization medium, every step of the process can fail. If you write out the error propagation explicitly, it usually looks like this: - -// Writes `s` into the archive. -fn serialize_section(s: Section) -> MaybeFailure -{ - var failure: Optional = none; - - failure = serialize_part1(s.part1); - if failure != none { return failure; } - - failure = serialize_part2(s.part2); - if failure != none { return failure; } - - ... - - return serialize_partN(s.partN); -} - -After every operation that can fail, you're adding “and if there was a failure, return it.” - -There are many layers of this propagation. None of it depends on the details of the reasons for failure: whether the disk is full or the OS detects directory corruption, or serialization is going to an in-memory archive and you run out of memory, you're going to do the same thing. 
Finally, where propagation stops and the failure is handled—let's say this is a desktop app— again, the recovery is usually the same no matter the reasons for the failure: you report the problem to the user and wait for the next command. - -#### Interlude: Exceptions? - -Way back in 1996 I embarked on a mission to dispel the widespread fear, loathing, and misunderstanding around exceptions. Yes I'm old. While I've seen some real progress on that over the years, I know some of you out there are still not all that comfortable with the idea. If you'll let me, I think I can help. - -##### Just control flow - -Cases like this are where the motivation for exceptions becomes really obvious. They eliminate the boilerplate and let you see the code's primary intent: - -// Writes `s` into the archive. -fn serialize_section(s: Section) throws { - serialize_part1(s.part1); - serialize_part2(s.part2); - ... - serialize_partN(s.partN); -} - -There's no magic. Exceptions are just control flow. Like a switch statement, they capture a commonly needed pattern control flow pattern and eliminate unneeded syntax. - -To grok the meaning of this code in its full detail, you mentally add “and if there was a failure, return it” everywhere. But if you push failures out of your mind for a moment you can see that how the function fulfills its primary purpose leaps out at you in a way that was obscured by all the failure handling. The effect is even stronger when there's some control flow that isn't related to error handling. - -##### Also, type erasure - -OK, I lied a little when I said exceptions are just control flow. There's one other big difference between the exception version and the explicit version: the exception version erases the types of the failure data, and catch blocks are just big type switches with dynamic downcasts. - -Lots of us are “static typing partisans,” so at first this might sound like a bad thing, but remember, as I said, none of the code propagating this failure (or even recovering from it usually) cares about its details. What do you gain by threading all this failure information through your code? When the reasons for failure change you end up creating a lot of churn in your codebase updating those types. - -In fact, if you look carefully at the explicit signature, you'll see something that typically shows up when failure type information is included: people find a way to bypass that development friction. - -fn serialize_section(s: Section) -> MaybeFailure - -Here an “unknown” case was added that is basically a box for any failure type. This is also a reason that systems with statically checked exception types are a bad idea. Java's “checked exceptions” are a famously failed design because of this dynamic. - -Swift recently added statically-typed error handling in spite of this lesson that should be well-understood to language designers, for reasons I don't understand. There was great fanfare from the community, because, I suppose, everybody thinks they want more static type safety. I'm not optimistic that this time it's going to work out any better. - -The moral of the story: sometimes dynamic polymorphism is the right answer. Non-local error handling is a key example, and the design of most exception systems optimize for that. 
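
To see what that dynamic polymorphism looks like in Swift, here is a
hedged sketch of a top-level handler; `presentDiskFullAlert` and
`presentFailureAlert` are invented stand-ins for real UI code.
Everything below this level propagated the error without inspecting
its type; only the one case whose details matter is downcast.

```swift
import Foundation

// Invented stand-ins for real UI code.
func presentDiskFullAlert() { print("There isn't enough disk space to save.") }
func presentFailureAlert(_ error: any Error) { print("The command failed: \(error)") }

func runSaveCommand(_ document: Data, to destination: URL) {
  do {
    try document.write(to: destination, options: .atomic)
  } catch let error as CocoaError where error.code == .fileWriteOutOfSpace {
    presentDiskFullAlert()      // the one reason handled specifically
  } catch {
    presentFailureAlert(error)  // every other reason gets the same recovery
  }
}
```
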
- -#### When (and when not) to use exceptions - -There's a lot of nice sounding advice out there about this that is either meaningless or vague, like “use exceptions for exceptional conditions,” or “don't use exceptions for control flow.” I know that one is really popular around Adobe, but c'mon: if you're using exceptions, you're using them for control flow. I hope to improve on that advice a little bit. - -First of all, you can use exceptions for things that aren't obviously failures, like when the user cancels a command. An exception is appropriate because the control flow pattern is identical to the one where the command runs out of disk space: the condition is propagated up to the top level. In this case recovery is slightly different: there's nothing to report to the user when they cancel, but all the intermediate levels are the same. It would be silly to explicitly propagate cancellation in parallel with the implicit propagation of failures. - -But if you make this choice, I strongly urge you to classify this not-obviously-a-failure thing as a failure! Otherwise you'll undo all the benefits of separating failures from postconditions, and you'll have to include “unless the user cancels, in which case…” in the summary of all your functions. So in the end, my broad advice is, “only use exceptions for failures (but be open minded about what you call a failure).” Actually, even if you're not using exceptions, any condition whose control flow follows the same path as non-local failures should probably be classified as a failure. - -Another prime example is the discovery of a syntax error in some input. In the general case, you are parsing this input out of a file. I/O failures can occur, and will follow the same control flow path. Classifying your syntax error as a failure and using the same reporting mechanism is a win in that case. - -Next, don't use exceptions for bugs. As we've said, when a bug is detected the program cannot proceed reliably, and throwing is likely to destroy valuable debugging information you need to find the bug, leave a corrupt state, open a security hole, and hide the bug from developers. Even though the “default behavior” of exceptions is to stop the program, throwing defers the choice about whether to actually stop to every function above you in the call stack. This is not a service, it's a burden. You've made your function harder to use by giving your clients more decisions to make. Just don't. - -That also means if you use components that misguidedly throw logic_errors, domain_error, invalid_argument, length_error or out_of_range at you, you should almost always stop them and turn them into assertion failures. All that said, there are some systems, like Python, where using exceptions for bugs (to say nothing of exiting loops!) is so deeply ingrained that it's unavoidable. In python you have to ignore this rule. - -Don't use exceptions for local failures. As we've seen, exceptions are optimized for the patterns of non-local failures. Using them for local failures means more catch blocks, which increase code complexity. It's usually easy to tell what kind of failure you've got, but if you're writing a function and you really can't guess whether its failure is going to be handled locally, maybe you should write two functions. - -Next, consider performance implications. 
Most languages aren't like this, but most C++ implementations are usually biased so heavily toward optimizing the non-failure cases that handling a failure runs one or two orders of magnitude slower. Usually that's a great trade-off because it allows them to skip checking for the error case on the hot path, and non-local failures are rare and don't happen repeatedly inside tight loops. But if you're writing a real-time system for example, you might want to think twice. - -Here's an example that might open your mind a bit: when we were discussing the design of the Boost C++ Graph Library, we realized that occasionally a particular use of a graph algorithm might want to stop early. For example, Dijkstra's algorithm finds all the paths from A to B in order, from shortest to longest. What if you want to find the ten shortest paths and stop? The way this library's algorithms work, you pass them a “visitor” object that gets notified about results as they are discovered. And in fact there are lots of notification points for intermediate conditions, not just “complete path found,” so if we were going to handle this early stop explicitly, we'd generate a test after each one of these points in the algorithm's inner loop. Instead, we decided to take advantage of the C++ bias toward non-failures. We said a visitor that wants to stop early can just throw. Now in fairness, I don't think we ever benchmarked the effects of this choice, so it might have been wrong in the end. But it was at least plausibly right. - -Finally, you might need to consider your team's development culture and use of tooling. If people typically have their debuggers set up to stop when an exception occurs, you might need to take extra care not to throw when there's an alternate path to success. Some developers tend to get upset when code stops in a case that will eventually succeed. - -### How to Handle Failure - -OK, enough about exceptions. Finally we come to the good part! Seriously, this was originally going to be the focus of the entire talk. - -Let's talk about the obligations of a failing function and of its caller. What goes in the contract and what does each side need to do to ensure correctness? - -#### Callee - -Documentation: -Document local failures and what they mean. -Document non-local failures at their source, but not where they are simply propagated. That information can be nice to have, but it also complicates contracts and is a burden to propagate and keep up-to-date. - -Code: -Release any unmanaged resources you've allocated (e.g. close temporary file). - -##### Optional - -If mutating, consider giving the strong/transactional guarantee that if there is a failure, the function has no effects. - -Only do this if it has no performance cost. Sometimes it just falls out of the implementation. Sometimes you can get it by reordering the operations. For example, if you do all the things that can fail before you mutate anything visible to clients, you've got it. - -Don't pay a performance penalty to get it because not all clients need it and when composing parts all the needless overheads add up massively. - -#### Caller - -- Discard any partially-completed mutations to program state or propagate the error and that responsibility to your caller. This partially mutated state is meaningless. - -What counts as state? Data that can have an observable effect on the future behavior of your code. Your log file doesn't count. 
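
As a sketch of that obligation (the throwing `applyBlur(to:)` is
hypothetical), a caller that stops propagation can mutate a copy and
commit it only on success, so a partially mutated draft is simply
discarded:

```swift
// Hypothetical failable mutation; the body is elided.
func applyBlur(to pixels: inout [UInt8]) throws { /* ... */ }

/// Applies a blur to `document`, returning `true` on success and
/// leaving `document` untouched on failure.
func blurCommand(_ document: inout [UInt8]) -> Bool {
  var draft = document           // a copy, cheap thanks to copy-on-write
  do {
    try applyBlur(to: &draft)    // may fail partway through
  } catch {
    return false                 // the partially mutated draft is discarded
  }
  document = draft               // commit only after complete success
  return true
}
```

The cost of that copy is exactly what the next section is about.
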
- -##### Implications as data structures scale up - -The only strategy that really scales in practice, when mutation can fail, is to propagate responsibility for discarding partial mutations all the way to the top of the application. That in turn implies mutating a copy of existing data and replacing the old copy only when mutation succeeds. Either way, you probably end up with a persistent data structure (which is a confusing name—it has nothing to do with persistence in the usual sense). - -A persistent data structure is one where a partial mutation of a copy shares a lot of storage with the original. For example, in Photoshop, we store a separate document for each state in the undo history, but these copies share storage for any parts that weren't mutated between revisions. This sharing behavior falls out naturally when you compose your data structure from copy-on-write parts. - -#### What (not) to do when an assertion fires. - -- Don't remove the assertion because “without that the program works!” -- Don't complain to the owner of the assertion that they are crashing the program. -- Understand what kind of check is being performed - - If it's a precondition check, fix your bug - - If it's a self-check or postcondition check, talk to the code owner about why their assumptions might have been violated - -#### Probably different functions for unit testing. - - - - - - - - -Notes: - - read from network, how much was read - - no-error case exists - - podcast - - likely a local handling case. - - don't go to vegas with something you're not prepared to lose. - -Quickdraw GX: 15% performance penalty for making silent null checks. - -David Sankel 50:11 -Folks can go ahead and put your hands up if you would like to. -Uh. -Ask a question. -Build a queue. - -Dave Abrahams 50:22 -I have the feeling that I didn't. -I didn't quite adequately deal with everybody's. -I questions that came up during the talk, so I'm happy to revisit those. -Got one hand? - -David Sankel 50:36 -At Philip, go ahead. - -Philip Levy 50:38 -And like to go back to a comment you made about. -The Boost graph library and raising exceptions to terminate that and you were pondering whether that was actually a good thing to have done based on performance, and it was wondering, is the notion that the fact that a visitor could raise an exception affecting performance of the execution you know of of non exceptional cases or just the cost of terminating the the algorithm by just raising that one exception? - -Dave Abrahams 51:21 -OK, I'm I'm going to try to try to answer your question as I understand, but I'm OK. - -Philip Levy 51:28 -Well, let me just clarify a little bit. -My expectations would be that raising an exception to terminate the algorithm wouldn't affect the performance of the execution of the algorithm. -The termination is a one time thing versus you know many thousands of nodes. -You may be looking at and so I was wondering why you were pondering that. - -Dave Abrahams 51:47 -Right. -That's the trade off we thought would love me. -So. -So Philip, yes, that that's the tradeoff that we thought we were making we because because C++ biases in favor of the straight line code, we thought this would be, this would be a good optimization. -My my reason for questioning it is I don't think we ever actually did any measurements. -That's all. - -Philip Levy 52:16 -OK, alright. -So it's it's an unknown, but there's no reason to believe it would be a problem. - -Dave Abrahams 52:22 -Right, that's correct. - -Philip Levy 52:24 -OK. 
Thank. -Thank you. - -Dave Abrahams 52:27 -I suppose if these graph algorithms were themselves used in tight loops on small problems where where the amount of straight line execution was low and you were throwing exceptions to terminate, that would be that would be bad, right? -If the algorithm was used repeatedly, umm, go ahead. - -Sean Parent 52:48 -So I think the others, I, David, there is there was an assumption that the checks at each node to see if there was a termination requested would be expensive and under modern hardware it probably costs you something, but it's a little hard to say. - -Dave Abrahams 53:13 -Yeah, I mean, you know, it's really hard to say without measuring. - -Sean Parent 53:14 -She tested. - -Dave Abrahams 53:17 -That's pretty much always the case for for performance. -You know, there's a there's a solid argument that, you know, the the functions on visitors are usually inlined. -When all of those intermediate visit points are are, you know are no OPS, the compiler can see it, and then it could skip the checks. -So like you know, the lesson is always measured before you make conclusions about performance. +Most of the time, when a precondition isn't added, it makes sense to +report a runtime error, because it preserves the idea of the +function's simple primary purpose, implying that all the other cases +are some kind of failure to achieve that purpose. Weakening the +postcondition means considering more cases successful, which makes a +function into a multipurpose tool, which is usually harder to +document, use, and understand. + +If you must weaken the postcondition, returning an `Optional` +instead of a `T` adds the least possible amount of information to the +success case, and thus does the least harm to API simplicity. It can +be appropriate when there will never be a useful distinction among +reasons that the function can't produce a `T`. Subscripting a +`Dictionary` with its key type is a good example. The only reason it +would not produce a value is if the key were not present. + +Lastly, remember that the choice is in your hands, and what you choose +has a profound effect on clients of your code. There is no criterion +that tells us a condition must or must not be a runtime error other +than the effect it has on client code. + +## Handling Runtime Errors Correctly + +The previous section was about how to design APIs; this one covers how +to account for errors in function bodies. + +### When Propagation Stops + +Code that stops upward propagation of an error and continues to run +has one fundamental obligation: to discard any partially-mutated state +that can affect on the future behavior of your code (that excludes log +files, for example). In general, this state is completely unspecified +and there's no other valid thing you can do with it. + +For the same reasons that the strong guarantee does not compose, +neither does the discarding of partial mutations: if the second of two +composed operations fails, modifications made by the first +remain. So ultimately, that means responsibility for discarding partial +mutations tends to propagate all the way to the top of an application. + +In most cases, the only acceptable behavior at that point is to +present an error report to the user and leave their data unchanged, +i.e. the program must provide the strong guarantee. 
That in turn +means—unless the data is all in a transactional database—a program +must usually follow the formula already given for the strong +guarantee: mutate a copy of the user's data and replace the data only +when mutation succeeds.[^persistent] + +[^persistent]: This pattern is only reasonably efficient when the data + is small or in a [persistent data + structure](https://en.wikipedia.org/wiki/Persistent_data_structure). + Because of Swift's use of + [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write) for + variable-sized data, any data structure built out of standard + collections can be viewed as persistent provided none are allowed to + grow too large, but easier and more rigorous implementations of + persistence can be found in + [swift-collections](https://github.com/apple/swift-collections), + e.g. [`TreeSet` and + `TreeDictionary`](https://swiftpackageindex.com/apple/swift-collections/1.3.0/documentation/hashtreecollections) + +### Mutating Functions + +The fact that all partially-mutated state must be discarded has one +profound implication for invariants: in general, when an error occurs, +a mutating method need not restore any invariants it has broken. It +can instead propagate the error to its caller with no further +ceremony! + +The exception is the invariants of types whose safe operations are +implemented in terms of unsafe ones. Any invariants depended on to +satisfy preconditions of those unsafe operations must of course be +upheld to preserve the safety guarantees. + +### Functions that Allocate Unmanaged Resources + +Any resources such as open files or raw memory allocations that are +not otherwise managed must be released. It's best to limit that +concern by making resource release the responsibility of some class' +`deinit` method. From 8cbdf6a9a0b95e4e7e3d0a28e5e2c026e606cf24 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 29 Jan 2026 11:22:44 -0800 Subject: [PATCH 35/41] typo --- better-code/src/chapter-3-errors.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 0d60278..e49af28 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -140,7 +140,7 @@ know. Since we don't know where the bug is, the downstream effects of the problem could have affected many things we didn't test for. Because of the bug, your program state could be very, very scathed indeed, violating assumptions made when coding and potentially -compromising security, If user data is quietly corrupted and +compromising security. If user data is quietly corrupted and subsequently saved, the damage becomes permanent. In any case, unless the program has no mutable state and no external From 8147eb3dd8e804b54f788ab6576145a45b045a73 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 29 Jan 2026 13:26:40 -0800 Subject: [PATCH 36/41] Consistent terminology. --- better-code/src/chapter-2-contracts.md | 6 +++--- better-code/src/chapter-3-errors.md | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/better-code/src/chapter-2-contracts.md b/better-code/src/chapter-2-contracts.md index 86c42aa..ca41606 100644 --- a/better-code/src/chapter-2-contracts.md +++ b/better-code/src/chapter-2-contracts.md @@ -815,9 +815,9 @@ the array has an element. OK, so what about postconditions? The postconditions are the effects of the method plus any returned result. 
If the preconditions are met, but the postconditions are not, -and the function does not report an error, we'd say the method has a -bug. The bug could be in the documentation of course, *which is a -part of the method*. +and the function does not report a runtime error, we'd say the method +has a bug. The bug could be in the documentation of course, *which is +a part of the method*. ```swift /// Removes and returns the last element. diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index e49af28..1e28bc3 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -4,8 +4,8 @@ In the *Contracts* chapter you may have noticed we made this reference to the concept of *errors*: > If the preconditions are met, but the postconditions are not, and -> the function does not report an error, we'd say the method has a -> bug. +> the function does not report a runtime error, we'd say the method +> has a bug. In the interest of progressive disclosure, we didn't look closely at the idea, because behind that simple word lies a chapter's worth of From 25249dabd2417a8dc9fa1406494243570b99a3b7 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Mon, 26 Jan 2026 10:53:33 -0800 Subject: [PATCH 37/41] Intro tweaks --- better-code/src/chapter-3-errors.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 1e28bc3..439ca96 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -20,11 +20,13 @@ together. ## Definitions -To understand any topic, it's important to define it crisply, and -unfortunately “error” and associated words have been used rather -loosely, and previous attempts to define these words have relied on -other words, like “expected,” which themselves lack clear definitions, -at least when it comes to programming. +To understand any topic, it's important to have crisp definitions of +the terms you're using, and ideally, to take those definitions from +the most common existing practice. Unfortunately “error” and +associated words have been used rather loosely, and previous attempts +to define these words have relied on other words, like “expected,” +which themselves lack clear definitions, at least when it comes to +programming. Unless we want to invent new terms, we will have to impose a little of our own structure on the usual terminology. We hope these definitions From 9ab66651995412085fdb13eb23c497b947a1a9b8 Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Mon, 26 Jan 2026 17:27:10 -0800 Subject: [PATCH 38/41] Conclusion + fleshing out. --- better-code/src/chapter-3-errors.md | 94 ++++++++++++++++++++++++----- 1 file changed, 80 insertions(+), 14 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 439ca96..e21ff3c 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -769,13 +769,47 @@ than the effect it has on client code. The previous section was about how to design APIs; this one covers how to account for errors in function bodies. +### Reporting or propagating an Error From a Function + +When a function exits with an error, either locally initiated or +propagated, any resources such as open files or raw memory allocations +that are not otherwise managed must be released. 
The best way to
manage that is with a `defer` block releasing the resources
immediately after they are allocated:

```swift
let f = try FileHandle(forReadingFrom: p)
defer { try? f.close() }  // `close()` can throw; during cleanup we can only ignore that.
// use f
```

If the resources must be released somewhere other than the end of the
scope where they were allocated, you can tie them to the `deinit` of
some type:

```swift
struct OpenFileHandle: ~Copyable {
  /// The underlying type with unmanaged close functionality
  private let raw: FileHandle

  /// An instance for reading from `p`.
  init(forReadingFrom p: URL) throws { raw = try .init(forReadingFrom: p) }

  deinit {
    try? raw.close()
  }
}
```

### When Propagation Stops

Code that stops upward propagation of an error and continues to run
has one fundamental obligation: to discard any partially-mutated state
that can affect the future behavior of your code (that excludes log
files, for example).  In general, this state is completely unspecified
and there's no other valid thing you can do with it.  Any use of
a partially mutated instance, other than to deinitialize it, is
erroneous.

For the same reasons that the strong guarantee does not compose,
neither does the discarding of partial mutations: if the second of two
composed operations fails, modifications made by the first remain.  So
ultimately, responsibility for discarding partial mutations tends to
propagate all the way to the top of an application.

@@ -807,19 +841,51 @@ when mutation succeeds.[^persistent]
### Mutating Functions

The fact that all partially-mutated state must be discarded has one
profound implication for invariants: when an error occurs, with two
rare exceptions, a mutating method need not restore invariants it has
broken, and can simply propagate the error to its caller.

The first exception is for invariants depended on by a `deinit`
method.  However, `deinit` methods are rare, and `deinit` methods with
dependencies on invariants that might be left broken in case of an
error are rarer still.  You _might_ encounter one in a `ManagedBuffer`
subclass—see the Data Structures chapter for more details.

The second exception is for invariants of types whose safe operations
are implemented in terms of unsafe ones.  Any invariants depended on
to satisfy preconditions of those unsafe operations must of course be
upheld to maintain the safety guarantees.  So, for example, if a
supposedly-safe operation deallocates an `UnsafePointer`, it depends
on the precondition that the pointer was returned by an earlier
allocation and hasn't been deallocated.  Any invariant that ensures the
precondition would be satisfied (e.g. “`p: UnsafePointer?` is
either `nil` or valid for deallocation”) must be upheld by all
mutating methods.

The key to controlling any invariant is to factor the properties
involved into a `struct` whose only job is to manage the values of
those properties, and keep write access to those properties `private`.
Establish the invariant in this struct's `init` methods, and—for these
exceptional cases—take care that it is restored before propagating any
errors from its `mutating` methods.

## Conclusion

This chapter completes the Better Code picture of how to program by
contract.  As mentioned in the introduction, it's not the only
possible approach to errors.  One could, for example, view error
information as part of a function's postconditions, but that
complicates contracts, obscures a function's primary purpose, and
elevates information that most clients don't care about to the same
level as the postcondition, which they do care about.  One could take
the position that all invariants must be upheld even in the case of
errors during mutation, but that adds an unnecessary burden for
programmers and, in some cases, forces type authors to weaken
invariants to account for states that can only be reached when an
error occurs, even though the operations that could observe the broken
invariant can only arise through a failure to discard the partially
mutated instance.  One could try to statically constrain the types of
all errors, but that makes designs hard to evolve and elevates
implementation details to the API level.  Our approach minimizes
complexity and provides the tools to reason about code without overly
constraining development.

From d8c1a436f105a16732809b9f0775749b2bcb1b14 Mon Sep 17 00:00:00 2001
From: Dave Abrahams
Date: Mon, 26 Jan 2026 17:30:18 -0800
Subject: [PATCH 39/41] Remove flotsam

---
 better-code/src/chapter-3-errors.1.md | 423 --------------------------
 1 file changed, 423 deletions(-)
 delete mode 100644 better-code/src/chapter-3-errors.1.md

diff --git a/better-code/src/chapter-3-errors.1.md b/better-code/src/chapter-3-errors.1.md
deleted file mode 100644
index 23ca2ea..0000000
--- a/better-code/src/chapter-3-errors.1.md
+++ /dev/null
@@ -1,423 +0,0 @@
# Better Code: Errors

So we're going to talk about errors and handling them.

So what's an error?

## Words

When talking about anything, I like to start out by trying to define it, and reading existing definitions is usually a good way to start. After all, programming is about communication, and if we want to communicate effectively we should use words in the expected ways.

Normally when I've done a version of this talk it's been a very interactive, in-person experience: I ask the audience for their definitions and we write them all on a board and then dissect them. I don't think that's going to work in this context, so instead I asked the web.

That exercise was very revealing, and actually changed my mind about the meaning of error and the overall scope of the presentation. So let's review what I found out. These are roughly the top answers Google gave me when I asked it to define “error” and “error handling.” Aside from Wikipedia, I was surprised at some of the hits it chose, but if you don't like them you can take it up with Google. I feel pretty confident that these results reflect the way people talk about errors.
- -### Definitions - -Wikipedia: - -An error (from the Latin errāre, meaning 'to wander'[1]) is an inaccurate or incorrect action, thought, or judgement.[1] - - -In statistics, "error" refers to the difference between the value which has been computed and the correct value.[2] An error could result in failure or in a deviation from the intended performance or behavior.[3] - -In human behavior the norms or expectations for behavior or its consequences can be derived from the intention of the actor or from the expectations of other individuals or from a social grouping or from social norms. (See deviance.) Gaffes and faux pas can be labels for certain instances of this kind of error. More serious departures from social norms carry labels such as misbehavior and labels from the legal system, such as misdemeanor and crime. Departures from norms connected to religion can have other labels, such as sin. - -In science and engineering in general, an error is defined as a difference between the desired and actual performance or behavior of a system or object. - -Engineers seek to design devices, machines and systems and in such a way as to mitigate or preferably avoid the effects of error, whether unintentional or not. Such errors in a system can be latent design errors that may go unnoticed for years, until the right set of circumstances arises that cause them to become active. Other errors in engineered systems can arise due to human error, which includes cognitive bias. Human factors engineering is often applied to designs in an attempt to minimize this type of error by making systems more forgiving or error-tolerant. - - - -Error Message: - -An error message is the information displayed when an unforeseen problem occurs, usually on a computer or other device. Modern operating systems with graphical user interfaces, often display error messages using dialog boxes. Error messages are used when user intervention is required, to indicate that a desired operation has failed, or to relay important warnings (such as warning a computer user that they are almost out of hard disk space). - -Lenovo: - -Computer error refers to a mistake or malfunction that occurs within a computer system, leading to unexpected or incorrect behavior. -Computer Hope: - -An error describes any issue that arises unexpectedly that cause a computer to not function properly. - -Vocabulary.com:Definitions of computer error -noun (computer science) the occurrence of an incorrect result produced by a computer -Toppr.com - -An error in computer data is called Bug. - -A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. - -https://textexpander.com/blog/most-common-programming-errors: -The 7 Most Common Types of Errors in Programming and How to Avoid Them -Syntax Errors -Logic Errors -Compilation Errors -Runtime Errors -Arithmetic Errors -Resource Errors -Interface Errors - -Techopedia: - -What Does Error Handling Mean? -Error handling refers to the response and recovery procedures from error conditions present in a software application. In other words, it is the process comprised of anticipation, detection and resolution of application errors, programming errors or communication errors. Error handling helps in maintaining the normal flow of program execution. 
- -There are four main categories of errors: - -Logical errors -Generated errors -Compile-time errors -Runtime errors - -dremio.com: - -Error Handling refers to the process of detecting, managing, and resolving errors and exceptions that occur during data processing and analytics. It involves implementing mechanisms and strategies to handle unexpected events and ensure data integrity and reliability. - - -OK, so in this text I want to highlight four things: - -First, a lot of it, all this red stuff, is about bugs. If you happened to read the abstract blurb that we used in the talk announcement, you know it said we'll clearly define “error” distinct from “bug,” but these results force me to admit that error usually means bug, and if I want to talk about non-bugs I might need to find a different term. It also convinced me that in a talk about error handling you can't avoid the topic of how to deal with bugs. So we're going to talk about all kinds of errors, both bugs and the other kinds. - -Since I love defining things, I'm going to take this opportunity to define “bug” as an avoidable coding error. - -Statistically, bugs may be inevitable -but -Every individual bug is avoidable. - -Which is a good thing, because you can't really plan for bugs; they could be anywhere. That's why you see the word “unexpected” come up a lot in that red text. - -Second, in a couple of places I colored green, people are talking about things that definitely aren't bugs, like resource allocation failure. If I run out of space on the disk when I'm trying to save a document, that's not a bug. -Maybe it's rare, but you can predict that it will happen sometimes, and you know exactly where in your code it can happen, so you can plan a response for it. These non-bugs are what I used to call “errors” and had intended to be the sole topic of this talk. Let's call them failures. They represent a failure—sometimes temporary—of the code to achieve its primary intent. - -The blue highlight talks about errors due to cognitive bias, a very AI-forward concern. Is that a bug? I'm not sure cognitive bias is avoidable. So I guess I'd go with not-a-bug. However, as far as I know it's not an event; it's a property of the code and/or dataset, so it's really in its own category. - -Finally, these words in yellow talk about recovery, resolution, and maintaining data integrity. How you achieve that is going to be important. - -So there are three important parts to this picture: - -Bugs -Failures (non-bugs, predictable obstacles) -Recovery and Integrity - -## Recovery - -So what do we mean by “recovery?” When I ask the web, most of the hits define error recovery in terms of what a parser does when it hits a syntax error in your code. - -int main() { - int x = 4 - // ^---- error: expected ';' at end of declaration - f(x); - f(x x); - // ^--------- error: expected ')' -} - -Let's say you left out a semicolon. The parser could just stop there and issue one diagnostic about the missing symbol, if that's the only possibility in that syntactic position. But most programming language parsers don't do that (even though I often wish they would). They want to give me all the potentially-useful diagnostics about errors in the rest of my code. If the parser just starts over, discarding its state and pretending the location of the error is the beginning of the file, I'm going to get lots of bogus error messages. That's a pretty poor recovery because although the program continues, it's doing something that almost certainly doesn't make sense. 
- -x.cpp:1:3: error: unknown type name 'f' - f(x); - ^ -x.cpp:2:5: error: unknown type name 'x' - f(x x); - ^ -x.cpp:2:3: error: a type specifier is required for all declarations - f(x x); - ^ -x.cpp:4:1: error: extraneous closing brace ('}') -} -^ - - -So instead parsers typically try to “recover” by pretending I had written something correct. In this case it injects a phantom semicolon and continues. So as a first cut, let's say recovery is continuing to execute, doing sensible work. But I really like this quote from a stack overflow answer: - -https://stackoverflow.com/a/38387506/125349 - -... i.e.: "to sally forth, entirely unscathed, as though 'such an inconvenient event' never had occurred in the first place." - -By “unscathed” they mean that the program state is intact: not only are the invariants upheld, but the state makes sense given the inputs the program has received. If we have an error while applying a blur, it's not enough that the user's document is a well-formed file; it also can't have some random or half-finished changes they didn't ask for. - -## Recovery from bugs? - -OK, so let's talk about recovering from a bug. What would that mean? -Well, first, it assumes you have some way to detect the bug; not all bugs are detectable, but let's assume this one is. Typically that means some precondition check fails: there's a bug in the caller that caused them to pass an invalid argument. - -When that happens, you're not really detecting the bug, you're detecting one of its symptoms, like a cosmic echo. The bug itself occurred at some indefinite point before that. So can you ”sally forth unscathed?” The problem is, you don't know. Because of the bug, your program state could be very, very scathed indeed. - -Sallying forth at this point is a terrible idea, for so many reasons. First there are effects in the outside world: -- The user's data might be corrupted and they might save it that way, losing the last good state they had. -- The assumptions underlying any security evaluation you did may be violated, so you could be opening a security hole. -- You don't have enough information about the state of your system to do it reliably, you can't detect whether you've done it correctly, and the penalties we just discussed for failure to do it correctly are astronomical. - - -Continuing in the face of a known bug also has a terrible impact on the development process: -- The bug will be masked and will never get fixed… -- …until one day we're about to lose an important customer base because of that corruption. And then you might spend weeks hunting the bug down because the customer sees a much more distant echo of the bug than the earlier echo your code detected. -- Most code is correct, so most of your bug-recovery code will never run. It certainly won't be tested. All this recovery code bloats your program and every line is a liability with no offsetting benefits. - -Some systems can recover from bugs (e.g. redundant ones). Processes can't recover. - -To sum up, in general you can't recover from bugs, and it's a bad idea to try. So what can you do? - -## Handling bugs - -You can stop the program before any more damage is done, and generate a crash report or debuggable image that captures as much information as is available about the state of the program, so there's a chance of fixing the bug. Maybe there's some small emergency shutdown procedure you need to perform, like saving information about the failing command so the application can offer to retry it for you when you restart it. 
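
What such a controlled stop might look like is sketched below; the
breadcrumb key and the emergency action are assumptions, not a
prescription.

```swift
import Foundation

/// Stops the program when `condition` is false, after leaving a
/// breadcrumb the app can read on its next launch.
func checkedStop(
  _ condition: @autoclosure () -> Bool,
  _ message: @autoclosure () -> String = "",
  file: StaticString = #file, line: UInt = #line
) {
  if condition() { return }
  // Minimal emergency shutdown: record enough to offer a retry later.
  UserDefaults.standard.set("\(file):\(line): \(message())",
                            forKey: "lastDetectedBug")  // assumed key
  fatalError(message(), file: file, line: line)
}
```
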
- -Let me be clear: THIS IS BAD. It could be experienced as a crash by users. -But it's the only way to prevent the much worse consequences of a botched recovery attempt. Remember, the chances of botchery are high because you don't have enough information to do it reliably. -Upside: it will also be experienced as a crash by developers, QE teams, and beta testers, giving you a chance to fix the bug. - -*** You can mitigate the experience of crashing *** -*** Don't tell me my assertion is a crash *** -*** An assertion is a controlled shutdown *** - -A lot of people have a hard time accepting the idea of voluntarily terminating, but let's face it: your bug detection isn't the only reason the program might suddenly stop. You can crash from an undetected bug. Or a person can trip over the power cord. You should design your software so that these bad things are not catastrophic. - -*** In fact you could be more ambitious and try to make it really seamless. You have to accept this is part of the UX package to even take this on. *** - -In fact some platforms force you to live under a similar constraint. On an iPhone or iPad, for example, to save battery and keep foreground apps responsive, the OS may kill your process any time it's in the background, but will make it look to the user like it's still running. When the user switches back, every app is supposed to complete the illusion by coming back up in the same state it was killed in. I can tell you as a user, it can be really jarring when you encounter an app that doesn't do it right. The point is, resilience to early termination is something you can and should design into the system. - -For example, Photoshop uses a variety of strategies: we always save documents into a new file and atomically swap it into place only after the save succeeds, so we never leave a half-saved document on disk. We also periodically save backups so at most you only lose the last few minutes of work. If we needed to tighten that up we could, by saving a record of changes since the last full backup. - -## Assertions - -The usual mechanism for terminating a program when a bug is detected is called an assertion and traditionally it spelled something like this: - - assert(n >= 0); - -This spelling comes from C and C++. If you're programming in another language, you probably have something similar. - -The C assertion is pretty straightforward: either it's disabled, in which case it generates no code at all—even the check is skipped—or it does the check and exits immediately with a predefined error code if the check fails, usually printing a message containing the text of the failed check and its location in source. - -Debuggers will commonly stop at the assertion rather than exiting, and even if you're not running in the debugger, on major desktop OSes, you'll get a crash report with the entire program state that can be loaded into a debugger. So this is great for catching bugs early, before they get shipped, provided people use it. - -Projects commonly disable assertions in release builds, which has the nice side-effect of making programmers comfortable adding lots of assertions, because they know they won't slow down the release build. And more bugs get caught early. - -But unless you really believe you're shipping bug-free software, you might want to leave most assertions on in release builds. In fact, the security of your software might depend on it. If you're programming in an unsafe language like C++, opportunities to cause undefined behavior are all around you. 
When you can assert that the conditions for avoiding undefined behavior are met before executing the dangerous operation, the program will come to a controlled stop instead of opening an arbitrarily bad security hole. - -The problem with leaving assertions on in release is that some checks are too expensive to ship. And let's be honest; many programmers will go with their gut, instead of measuring, when making that determination. We really need a second, expensive_assert(), that's only on in debug builds, so we continue to catch those bugs early. - -There's another problem with having just one assertion: it doesn't express sufficient intent. For example, it might be a precondition check, or the asserting function's author might just be double-checking their own reasoning. When these two assertions fire, the meaning is very different: the first indicates a bug in the caller, the other one is a bug in the callee. So I really want separate precondition and self_check functions. - -If I'm writing in a safe-by-default language like Rust or Swift, the checks that prevent undefined behavior, like array bounds checks, are special: I can afford to turn off all the other checks in shipping code, but these checks are the ones upholding safety properties of my system are compromised. So I want a different assertion for these checks, even if I don't ever anticipate turning off the other ones in a shipped product. These are the ones that we can't delete from the code. I might want to turn the other assertions off locally to measure how much overhead they are incurring. - -I hope you get the idea. I'm not going to prescribe the exact set of assertion facilities your project needs, but a carefully engineered suite of these functions with properties appropriate to your project is part of a comprehensive strategy for dealing with bugs. If you haven't got one, go design it. - -One last point about the C++ assert: it's better than nothing, but because it calls abort(), there's no place to put emergency shutdown measures. You can't even display a message to the user, so to the user it will always feel like a hard, unceremonious crash. You probably want failed assertions to call terminate() instead, because it allows terminate handlers can run. So that's another reason to engineer your own assertions, even if you build just one. - -## What if you're not allowed to terminate? - -Fight for the right (to terminate). If the system is critical, advocate creating a recovery system outside the process. -If you lose today -Fail as noisily as possible, preferably by terminating in non-shipping code. -Keep fighting -Be prepared to win someday. That means use a suite of assertions that don't terminate, but whose behavior you can change when you win the fight. - -# Failures - -OK, as much as we all love bugs, it's time to leave them behind and talk about failures. Let's say you identify a condition X where your function is unable to fulfill its primary purpose. That can occur one of two ways: - - -Something your function calls has a precondition that you're not sure would be satisfied. -Something your function calls can itself report a failure. - -You usually have two choices at this point: -Make !X a precondition; X reflects a bug in the caller. -Make X a failure; all the code is correct. - -It's counterintuitive, you should always prefer to classify X as a bug, as long as !X satisfies the criteria for preconditions: -It is possible to ensure !X. 
For example, there's no way for the caller to ensure there's enough disk space to save a file, because other processes can use up any space that might have been free before the call. So you can't make “there's enough disk to save” a precondition. -Ensuring !X is considerably less work than the work done by the callee. For example, if the callee is deserializing a document and finds that it's corrupted, you can't make it a precondition that the file is well-formed, because determining whether it is or not is basically the same work as doing the deserialization. - -## Definition - - Failure: inability to satisfy a postcondition in correct code. - -So why am I tying this definition to postconditions other than to bind our understanding of error handling to our understanding of correctness? - -First of all, it simplifies and improves understandability of contracts. This is easiest to see if you have a dedicated language mechanism for error handling: - -** Note: fictional programming language ** - -// Returns `x` sorted in `order`, or throws an exception -// in case order fails. -fn sorted(x: [Int], order: Ordering) throws -> [Int] - -// Returns `x` sorted in `order`. -fn sorted(x: [Int], order: Ordering) throws -> [Int] - -Even if you feel you need to say something about possible failures, that becomes a secondary note that's not essential to the contract. - -// Returns `x` sorted in `order`. -// -// Propagates any exceptions thrown by `order`. -fn sorted(x: [Int], order: Ordering) throws -> [Int] - -A programmer can know everything essential from the summary fragment and the signature. Another way this separation plays nicely with exceptions is that you can say the postcondition of a function describes what you get when it returns, and a throwing function never returns. - -If you don't use exceptions, you still simplified contracts as long as you have dedicated types to represent the possibility of failure. - -// Returns `x` sorted in `order`. -fn sorted(x: [Int], order: Ordering) -> ResultOrFailure<[Int]> - -Separating the function's primary intention from the reasons for failure makes sense, because the reasons for failure matter less. If that's not obvious yet, some justification is coming. - -Another reason to exclude the failure case from the postcondition is that you want postconditions to be solid and fully described, but a mutating operation that fails often leaves behind a state that's very difficult to nail down, and as I said in the contracts talk, that you usually don't want to nail down, because it's detail nobody cares about. But if it's part of the postcondition, you need to say something about it, and that further complicates the contract. - -// Sorts `x` according to `order` or throws an exception -// if `order` fails, leaving `x` modified in unspecified -// ways. -fn sort(mutating x: [Int], order: Ordering) throws - -// Sorts `x` according to `order`. -fn sort(mutating x: [Int], order: Ordering) throws - -## Two kinds of failures - -If you've spent some time writing code that carefully handles failures, especially in a language like C where all the error propagation is explicit, failures start to fall into two main categories: local and non-local, based on where the recovery is likely to happen. - -Local recovery occurs very close to the source of failure, usually in the immediate caller, in a way that often depends heavily on the reasons for the failure. In many cases, the recovery path is performance-critical. 
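
The prose examples that follow describe two such cases; here is a
hedged Swift sketch of the first one, with a hypothetical pool
allocator standing in for the real thing.

```swift
// Hypothetical fast allocator drawing from a small local pool.
struct PoolExhausted: Error {}
final class FastPool {
  func allocate(_ byteCount: Int) throws -> UnsafeMutableRawPointer {
    // Real implementation elided; pretend the pool is always full.
    throw PoolExhausted()
  }
}

/// Allocates `byteCount` bytes, preferring the fast local pool and
/// falling back to the general-purpose allocator when the pool fails.
func allocate(_ byteCount: Int, from pool: FastPool) -> UnsafeMutableRawPointer {
  do {
    return try pool.allocate(byteCount)
  } catch {
    // Local recovery, right next to the source of the failure.
    return UnsafeMutableRawPointer.allocate(byteCount: byteCount, alignment: 16)
  }
}
```
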

**Example**: you have an ultrafast memory allocator that draws from a
local pool much smaller than your system memory. You build a
general-purpose allocator that first tries your fast allocator and,
only if that allocation fails, recovers by trying the system
allocator.

**Example**: the lowest-level function that tries to send a network
packet can fail for a whole slew of reasons
(https://www.ibm.com/docs/en/zos/2.3.0?topic=codes-sockets-return-errnos),
some of which may indicate a temporary condition like packet
collision. 99% of the time, the immediate caller is a higher-level
function that checks for these conditions and, if it finds one,
initiates a retry protocol with exponential backoff, itself failing
only after N failed retries. That lowest-level failure is local. The
failure after N retries is very likely to be non-local.

Non-local recovery, which is far more common, occurs far from the
source, usually in a way that can be described without reference to
the reasons for failure. For example, when you're serializing a
complex document, serializing any part means serializing all of its
sub-parts, and parts are ultimately nested many layers deep. Because
you can run out of space in the serialization medium, every step of
the process can fail. If you write out the error propagation
explicitly, it usually looks like this:

```
// Writes `s` into the archive.
fn serialize_section(s: Section) -> MaybeFailure
{
  var failure: Optional = none;

  failure = serialize_part1(s.part1);
  if failure != none { return failure; }

  failure = serialize_part2(s.part2);
  if failure != none { return failure; }

  ...

  return serialize_partN(s.partN);
}
```

After every operation that can fail, you're adding “and if there was a
failure, return it.”

There are many layers of this propagation, and none of it depends on
the details of the reasons for failure: whether the disk is full, the
OS detects directory corruption, or serialization is going to an
in-memory archive and you run out of memory, you're going to do the
same thing. Finally, where propagation stops and the failure is
handled (let's say this is a desktop app), the recovery is again
usually the same no matter the reasons for the failure: you report the
problem to the user and wait for the next command.

### Interlude: Exceptions?

Way back in 1996 I embarked on a mission to dispel the widespread
fear, loathing, and misunderstanding around exceptions. Yes, I'm old.
While I've seen some real progress on that over the years, I know some
of you out there are still not all that comfortable with the idea. If
you'll let me, I think I can help.

#### Just control flow

Cases like this are where the motivation for exceptions becomes really
obvious. They eliminate the boilerplate and let you see the code's
primary intent:

```
// Writes `s` into the archive.
fn serialize_section(s: Section) throws {
  serialize_part1(s.part1);
  serialize_part2(s.part2);
  ...
  serialize_partN(s.partN);
}
```

There's no magic. Exceptions are just control flow. Like a `switch`
statement, they capture a commonly needed control flow pattern and
eliminate unneeded syntax.

To grok the meaning of this code in its full detail, you mentally add
“and if there was a failure, return it” everywhere. But if you push
failures out of your mind for a moment, the way the function fulfills
its primary purpose leaps out at you in a way that was obscured by all
the failure handling.
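
If it helps to see the same contrast in a real language rather than
the fictional one above, here is a minimal Swift sketch; `Archive`,
`Section`, and `ArchiveFull` are hypothetical stand-ins invented for
illustration, not types from this chapter.

```swift
struct ArchiveFull: Error {}

/// A hypothetical serialization target that can run out of space.
struct Archive {
    var remainingCapacity = 1024

    /// Explicit style: reports failure as a returned value.
    mutating func write(_ bytes: [UInt8]) -> Error? {
        guard bytes.count <= remainingCapacity else { return ArchiveFull() }
        remainingCapacity -= bytes.count
        return nil
    }

    /// Throwing style: the same operation, reported with `throw`.
    mutating func writeOrThrow(_ bytes: [UInt8]) throws {
        if let failure = write(bytes) { throw failure }
    }
}

struct Section { var part1, part2, part3: [UInt8] }

/// Explicit propagation: every step repeats “and if there was a
/// failure, return it.”
func serializeSectionExplicit(_ s: Section, into archive: inout Archive) -> Error? {
    if let failure = archive.write(s.part1) { return failure }
    if let failure = archive.write(s.part2) { return failure }
    return archive.write(s.part3)
}

/// With `throws`, only the primary intent remains; propagation is implicit.
func serializeSection(_ s: Section, into archive: inout Archive) throws {
    try archive.writeOrThrow(s.part1)
    try archive.writeOrThrow(s.part2)
    try archive.writeOrThrow(s.part3)
}
```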

The effect is even stronger when there's some control flow that isn't
related to error handling.

#### Also, type erasure

OK, I lied a little when I said exceptions are just control flow.
There's one other big difference between the exception version and the
explicit version: the exception version erases the types of the
failure data, and catch blocks are just big type switches with dynamic
downcasts.

Lots of us are “static typing partisans,” so at first this might sound
like a bad thing, but remember, as I said, none of the code propagating
this failure (or usually even the code recovering from it) cares about
its details. What do you gain by threading all this failure
information through your code? When the reasons for failure change,
you end up creating a lot of churn in your codebase updating those
types.

In fact, if you look carefully at the explicit signature, you'll see
something that typically shows up when failure type information is
included: people find a way to bypass that development friction.

```
fn serialize_section(s: Section) -> MaybeFailure
```

Here an “unknown” case was added that is basically a box for any
failure type. This is also a reason that systems with statically
checked exception types are a bad idea. Java's “checked exceptions”
are a famously failed design because of this dynamic.

Swift recently added statically-typed error handling in spite of this
lesson, which should be well understood by language designers, for
reasons I don't understand. There was great fanfare from the
community, because, I suppose, everybody thinks they want more static
type safety. I'm not optimistic that this time it's going to work out
any better.

The moral of the story: sometimes dynamic polymorphism is the right
answer. Non-local error handling is a key example, and the design of
most exception systems optimizes for it.

### When (and when not) to use exceptions

There's a lot of nice-sounding advice out there about this that is
either meaningless or vague, like “use exceptions for exceptional
conditions,” or “don't use exceptions for control flow.” I know that
one is really popular around Adobe, but c'mon: if you're using
exceptions, you're using them for control flow. I hope to improve on
that advice a little bit.

First of all, you can use exceptions for things that aren't obviously
failures, like when the user cancels a command. An exception is
appropriate because the control flow pattern is identical to the one
where the command runs out of disk space: the condition is propagated
up to the top level. In this case recovery is slightly different:
there's nothing to report to the user when they cancel, but all the
intermediate levels are the same. It would be silly to explicitly
propagate cancellation in parallel with the implicit propagation of
failures.

But if you make this choice, I strongly urge you to classify this
not-obviously-a-failure thing as a failure! Otherwise you'll undo all
the benefits of separating failures from postconditions, and you'll
have to include “unless the user cancels, in which case…” in the
summary of all your functions. So in the end, my broad advice is,
“only use exceptions for failures (but be open-minded about what you
call a failure).” Actually, even if you're not using exceptions, any
condition whose control flow follows the same path as non-local
failures should probably be classified as a failure.

Another prime example is the discovery of a syntax error in some
input.

In the general case, you are parsing this input out of a file. I/O
failures can occur, and will follow the same control flow path.
Classifying your syntax error as a failure and using the same
reporting mechanism is a win in that case.

Next, don't use exceptions for bugs. As we've said, when a bug is
detected the program cannot proceed reliably, and throwing is likely
to destroy valuable debugging information you need to find the bug,
leave a corrupt state, open a security hole, and hide the bug from
developers. Even though the “default behavior” of exceptions is to
stop the program, throwing defers the choice about whether to actually
stop to every function above you in the call stack. This is not a
service, it's a burden. You've made your function harder to use by
giving your clients more decisions to make. Just don't.

That also means that if you use components that misguidedly throw
`logic_error`, `domain_error`, `invalid_argument`, `length_error`, or
`out_of_range` at you, you should almost always catch them and turn
them into assertion failures. All that said, there are some systems,
like Python, where using exceptions for bugs (to say nothing of
exiting loops!) is so deeply ingrained that it's unavoidable. In
Python you have to ignore this rule.

Don't use exceptions for local failures. As we've seen, exceptions are
optimized for the patterns of non-local failures. Using them for local
failures means more catch blocks, which increase code complexity. It's
usually easy to tell what kind of failure you've got, but if you're
writing a function and you really can't guess whether its failure is
going to be handled locally, maybe you should write two functions.

Next, consider performance implications. Most languages aren't like
this, but C++ implementations are usually biased so heavily toward
optimizing the non-failure case that handling a failure runs one or
two orders of magnitude slower. Usually that's a great trade-off,
because it allows them to skip checking for the error case on the hot
path, and non-local failures are rare and don't happen repeatedly
inside tight loops. But if you're writing a real-time system, for
example, you might want to think twice.

Here's an example that might open your mind a bit: when we were
discussing the design of the Boost C++ Graph Library, we realized that
occasionally a particular use of a graph algorithm might want to stop
early. For example, Dijkstra's algorithm discovers shortest paths from
a starting vertex in order, from shortest to longest. What if you want
to find the ten shortest and stop? The way this library's algorithms
work, you pass them a “visitor” object that gets notified about
results as they are discovered. And in fact there are lots of
notification points for intermediate conditions, not just “complete
path found,” so if we were going to handle this early stop explicitly,
we'd have to generate a test after each one of these points in the
algorithm's inner loop. Instead, we decided to take advantage of the
C++ bias toward non-failures: we said a visitor that wants to stop
early can just throw. Now, in fairness, I don't think we ever
benchmarked the effects of this choice, so it might have been wrong in
the end. But it was at least plausibly right.

Finally, you might need to consider your team's development culture
and use of tooling. If people typically have their debuggers set up to
stop when an exception occurs, you might need to take extra care not
to throw when there's an alternate path to success.
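
Here's a hedged sketch of what that extra care might look like;
`loadConfiguration` and its path argument are hypothetical, not
anything defined in this chapter. Probing for the common case first
keeps exception-stopping debuggers quiet on a path that is going to
succeed anyway.

```swift
import Foundation

/// Loads an optional configuration file, or returns `nil` if none exists.
///
/// Checking for the file first means the everyday “no config” case never
/// throws; only genuinely unexpected I/O problems reach `throw`.
func loadConfiguration(at path: String) throws -> Data? {
    guard FileManager.default.fileExists(atPath: path) else {
        return nil  // an ordinary, successful outcome; no throw needed
    }
    return try Data(contentsOf: URL(fileURLWithPath: path))
}
```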

Some developers tend to get upset when code stops in a case that will
eventually succeed.

## How to Handle Failure

OK, enough about exceptions. Finally we come to the good part!
Seriously, this was originally going to be the focus of this entire
chapter.

Let's talk about the obligations of a failing function and of its
caller. What goes in the contract, and what does each side need to do
to ensure correctness?

### Callee

Documentation:

- Document local failures and what they mean.
- Document non-local failures at their source, but not where they are
  simply propagated. That information can be nice to have, but it also
  complicates contracts and is a burden to propagate and keep
  up-to-date.

Code:

- Release any unmanaged resources you've allocated (e.g., close a
  temporary file).

#### Optional

If mutating, consider giving the strong/transactional guarantee that,
if there is a failure, the function has no effects.

Only do this if it has no performance cost. Sometimes it just falls
out of the implementation. Sometimes you can get it by reordering the
operations; for example, if you do all the things that can fail before
you mutate anything visible to clients, you've got it.

Don't pay a performance penalty to get it, because not all clients
need it, and when you compose parts, all the needless overheads add up
massively.

### Caller

- Discard any partially-completed mutations to program state, or
  propagate the error, and that responsibility, to your caller. This
  partially mutated state is meaningless.

What counts as state? Data that can have an observable effect on the
future behavior of your code. Your log file doesn't count.

#### Implications as data structures scale up

The only strategy that really scales in practice, when mutation can
fail, is to propagate responsibility for discarding partial mutations
all the way to the top of the application. That in turn implies
mutating a copy of the existing data and replacing the old copy only
when mutation succeeds. One way or another, you probably end up with a
persistent data structure (which is a confusing name: it has nothing
to do with persistence in the usual sense).

A persistent data structure is one where a mutated copy shares most of
its storage with the original. For example, in Photoshop, we store a
separate document for each state in the undo history, but these copies
share storage for any parts that weren't mutated between revisions.
This sharing behavior falls out naturally when you compose your data
structure from copy-on-write parts.

### What (not) to do when an assertion fires

- Don't remove the assertion because “without that the program works!”
- Don't complain to the owner of the assertion that they are crashing
  the program.
- Understand what kind of check is being performed:
  - If it's a precondition check, fix your bug.
  - If it's a self-check or postcondition check, talk to the code
    owner about why their assumptions might have been violated.

### Probably different functions for unit testing

Notes:

- read from network, how much was read
- no-error case exists
- podcast
- likely a local handling case
- don't go to Vegas with something you're not prepared to lose

QuickDraw GX: 15% performance penalty for making silent null checks.

From 08fa18e50082aee263cf55e5fcb076f42e63e92b Mon Sep 17 00:00:00 2001
From: Dave Abrahams
Date: Thu, 29 Jan 2026 15:46:27 -0800
Subject: [PATCH 40/41] Edits based on David Sankel's feedback.
--- better-code/src/chapter-3-errors.md | 52 +++++++++++++---------------- 1 file changed, 23 insertions(+), 29 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index e21ff3c..623f4c5 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -11,13 +11,6 @@ In the interest of progressive disclosure, we didn't look closely at the idea, because behind that simple word lies a chapter's worth of discussion. Welcome to the *Errors* chapter! -What we present here is not the only logically consistent approach to -errors, and our approach may clash with your instincts. It is the -result of optimizing for local reasoning and the ergonomics of -scalable software development, and the justifications for our choices -are interdependent. We hope you'll bear with us as we tie them all -together. - ## Definitions To understand any topic, it's important to have crisp definitions of @@ -41,16 +34,14 @@ spell in code font. Errors come in two flavors:[^common-definition] -> - **Programming Error**, or **bug**: code contains an -> avoidable[^avoidable] mistake. For example, an `if` statement -> tests the logical inverse of the correct condition. +> - **Programming Error**, or **bug**: code contains a mistake. For +> example, an `if` statement tests the logical inverse of the +> correct condition. > > - **Runtime error**: a function could not fulfill its postconditions > even though its preconditions were satisfied. For example, > writing a file might fail because the filesystem is full. -[^avoidable]: While bugs in general are inevitable, every *specific* - bug is avoidable. [^common-definition]: While some folks like to use the word “error” to refer only to what we call *runtime errors*—as the authors have done @@ -872,20 +863,23 @@ errors from its `mutating` methods. ## Conclusion This chapter completes the Better Code picture of how to program by -contract. As mentioned in the introduction, it's not the only -possible approach to errors. One could, for example, view error -information as part of a function's postconditions, but that -complicates contracts, obscures a function's primary purpose, and -elevates information that most clients don't care about to the same -level as the postcondition, which they do care about. One could take -the position that all invariants must be upheld even in the case of -errors during mutation, but that adds an unnecessary burden for -programmers, and in some cases, forces type authors to weaken -invariants to account for states that can only be reached when an -error occurs, when operations that could observe the broken invariant -can only arise through a failure discard the partially mutated -instance. One could try to statically constrain the types of all -errors, but that makes designs hard to evolve and elevates -implementation details to the API level. Our approach minimizes -complexity and provides the tools to reason about code without overly -constraining development. +contract. Your key takeaways: + +- Programming errors (bugs) are mistakes in the program code. The + most effective response to bug detection is to terminate the + program. +- Runtime errors signal dynamic conditions that prevent fulfilling + postconditions, even when all code is correct. +- Most runtime errors are propagated to callers. 
+- To keep contracts simple and a function's primary purpose clear, and + to emphasize the information most clients need, keep documentation + about errors out of summaries and postconditions. Consider omitting + detailed error information altogether, or documenting it only at the + module level. +- To keep invariants strong and simple and to reduce the mental tax of + handling errors that propagate, do not try to maintain invariants + (except those depended on for `deinit` methods or safety) when + mutating operations fail. +- To make designs easy to evolve with low friction, resist the + temptation to represent the static types of errors in function + signatures. From e7e7879ea42db21e97f9645cfea3434f8febb34a Mon Sep 17 00:00:00 2001 From: Dave Abrahams Date: Thu, 29 Jan 2026 15:46:52 -0800 Subject: [PATCH 41/41] Let it flow. --- better-code/src/chapter-3-errors.md | 78 +++++++++++++++++++++++------ 1 file changed, 62 insertions(+), 16 deletions(-) diff --git a/better-code/src/chapter-3-errors.md b/better-code/src/chapter-3-errors.md index 623f4c5..dba8892 100644 --- a/better-code/src/chapter-3-errors.md +++ b/better-code/src/chapter-3-errors.md @@ -616,7 +616,7 @@ func swap( A few caveats about mutation guarantees when errors occur: 1. Known use cases are few and rare: most allocated resources are - ultimately managed by the `deinit` of some class, and uses of + ultimately managed by a `deinit` method, and uses of unsafe operations are usually encapsulated. Weigh the marginal utility of making guarantees against the complexity it adds to documentation. @@ -798,15 +798,14 @@ Code that stops upward propagation of an error and continues to run has one fundamental obligation: to discard any partially-mutated state that can affect the future behavior of your code (that excludes log files, for example). In general, this state is completely unspecified -and there's no other valid thing you can do with it. Any use of -a partially mutated instance, other than to deinitialize it, is -erroneous. +and there's no other valid thing you can do with it. Use of a +partially mutated instance other than for deinitialization is a bug. For the same reasons that the strong guarantee does not compose, neither does the discarding of partial mutations: if the second of two -composed operations fails, modifications made by the first -remain. So ultimately, that means responsibility for discarding partial -mutations tends to propagate all the way to the top of an application. +composed operations fails, modifications made by the first remain. So +ultimately, that means responsibility for discarding partial mutations +tends to propagate all the way to the top of an application. In most cases, the only acceptable behavior at that point is to present an error report to the user and leave their data unchanged, @@ -829,18 +828,65 @@ when mutation succeeds.[^persistent] e.g. [`TreeSet` and `TreeDictionary`](https://swiftpackageindex.com/apple/swift-collections/1.3.0/documentation/hashtreecollections) -### Mutating Functions +### Let It Flow The fact that all partially-mutated state must be discarded has one profound implication for invariants: when an error occurs, with two -rare exceptions, a mutating method need not restore invariants it has -broken, and can simply propagate the error to its caller. - -The first exception is for invariants depended on by a `deinit` -method.
However, `deinit` methods are rare, and `deinit` methods with -dependencies on invariants that might be left broken in case of an -error are rarer still. You _might_ encounter one in a `ManagedBuffer` -subclass—see the Data Structures chapter for more details. +rare exceptions detailed below, a mutating method need not restore +invariants it has broken, and can simply propagate the error to its +caller. Allowing type invariants to remain broken when a runtime +error occurs may seem to conflict with the very idea of an invariant, +but remember, the obligation to discard partially mutated state +implies that only incorrect code can ever observe this broken state. + +#### Why Not Maintain Invariants Always? + +The most obvious advantage of the “let it flow” approach over trying +to keep invariants intact is that it simplifies writing and reasoning +about error handling. For most types, discardability is trivial to +maintain, but invariants often have more complex relationships. A +less obvious advantage is that in some cases, it allows stronger +invariants. For example, imagine a disk-backed version of `PairArray` +from the last chapter, where I/O operations can throw: + +```swift +/// A disk-backed series of `(X, Y)` pairs, where the `X`s and `Y`s +/// are stored in separate files. +struct DiskBackedPairArray { + // Invariant: `xs.count == ys.count` + + /// The first part of each element. + private var xs = DiskBackedArray() + + /// The second part of each element. + private var ys = DiskBackedArray() + + // ... + + /// Adds `e` to the end. + public mutating func append(_ e: (X, Y)) throws { + try xs.append(e.0) // breaks invariant + try ys.append(e.1) // restores invariant + } +} +``` + +All mutations of a `DiskBackedArray` perform file I/O and thus can +throw. In the `append` method, if `ys.append(e.1)` throws, +there may be no way to restore the invariant that `xs` and `ys` have +the same length. If the rule were that invariants must be maintained +even in the face of errors, it would force us to weaken the invariant +of `DiskBackedPairArray`. + +#### The Exceptions: Invariants That Must Be Maintained + +The first exception to the “let it flow” rule is for invariants +depended on by a `deinit` method—the ones that maintain +discardability. However, `deinit` methods are rare, and `deinit` +methods with dependencies on invariants that might be left broken in +case of an error are rarer still. You _might_ encounter one in a +`ManagedBuffer` subclass—see the Data Structures chapter for more +details. The second exception is for invariants of types whose safe operations are implemented in terms of unsafe ones. Any invariants depended on to