The Chamber 🏰 of Tech Secrets is open. The summer is winding down (for kids, anyway) and the temperatures are heating up. I’ll look to escape some of the Georgia heat next week while traveling to San Francisco to speak about Edge Observability at the DataDog Dash conference. I hope to see some of you there! Thanks for reading and the continued slew of support and kind words. 🙏
Towards Better and Safer Crashes
Last week, I was driving to work for an in-person meeting. As I entered an intersection under a green light, a truck started making a slow left turn right in front of me. Unfortunately, there was not much time to respond. 💥 Thankfully nobody was significantly injured, though my Lexus GX460 is likely to be totaled. 😢 I was just about to make it more off-road friendly with a lift, new wheels and tires, and a new front bumper (a significant project itself), so thankfully I hadn’t purchased any of those items yet!
Modern vehicles are designed, engineered, and tested for crashes and are outfitted with numerous safety systems. To name a few…
Seat Belts: My seat belt kept me in place and prevented me from being tossed elsewhere in the vehicle. 🙏
Airbags: My driver side curtain airbags deployed and punched me in the ear.
Crumple Zones: Vehicles are designed with “crumple zones” which are areas of the vehicle that are designed to absorb the energy of a crash to keep the passenger safe. I am not sure if my vehicle has them.
Automatic Emergency Call Systems: My vehicle detected the crash and automatically connected me to a call center that offered to call the proper emergency services.
Police, Fire, EMS, and Tow Services: Aside from the vehicles, our society has invested in first responders to support traffic incidents and to attend to the people, vehicles, and collateral damage involved. In my case, the local police 👮🏻♂️ were on the scene within ~20 seconds to help everyone involved and assist with other traffic navigating the area.
Resiliency in Software
This got me thinking about the resiliency and reliability of software systems. As we all know very well, some sort of “crash event” is not a question of if… but when. In these cases, how much has the “crash event” been intentionally designed for? As Kelsey reminds us, reliability is often the number one feature.
A few reflections…
Resiliency: Resiliency is the ability of a system to [gracefully] handle and recover from errors. Attempts at resiliency have the potential to bring added complexity that can actually drive reliability down when not executed well. How resilient should a system be? I believe every solution justifies different levels of investment, but generally speaking, a system needs to be resilient enough that it nets a reliability that leaves its users satisfied. For example, in the case of my car, it was resilient to the crash and employed numerous safety systems such as my seat belt and airbags to ensure its critical asset (me) was able to endure. It is important to note that resiliency mattered for the thing that mattered (me), not the vessel I was traveling in. We use similar architecture patterns like “cattle not pets”, where we expect component failures but design a system to tolerate them. Crumple zones often mean a vehicle will be hard (or impossible) to repair, but they assist the passenger’s safety. Crumple architecture components may be valuable to preserve the larger system.
Observability: My car was outfitted with sensors and notification systems. It knew precisely where the crash impact took place and deployed only the appropriate systems (side curtain airbags, but no others) to respond to that scenario. It also shut the engine off to ensure there were no cascading effects (a circuit breaker pattern, if you will). It also had a communication system to notify those who could help (EMS) that something went wrong. A good software observability system will do something similar by pinpointing issues through smart instrumentation (first-class metrics, perhaps?) and precise automation. When humans need to be involved, they are provided with the information they need to act rapidly (more signal, less noise).
User Experience: Sometimes bad user experience is good user experience, such as when the side curtain airbag punched me in the head (not a great experience) but prevented potential traumatic head injury (a nice feature). Users don’t enjoy degraded states in applications, but they are better than a complete outage. Latency is friction, but may be justified when it ensures no critical user data is lost.
Support: The local police were quick to respond, navigate the situation with care and kindness, and return me to my home. They also got the intersection looking like nothing had happened within 20 minutes. This process was by design. The police were trained and followed protocols. They called one of two available towing companies that were selected for all of their incidents, and they were on site within 7-8 minutes. The road system is itself a complex system, including hundreds of thousands of cars, roads, intersections, traffic lights, police precincts, communication systems, and emergency response protocols. Support processes deserve design, too.
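To make the “engine shut-off” idea a little more concrete, here is a minimal sketch of a circuit breaker that falls back to a degraded response instead of failing outright. It is an illustration only: the class name, the thresholds, and the injected clock are all my own assumptions, not a reference to any particular library.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold      # assumed default
        self.reset_after_seconds = reset_after_seconds  # assumed default
        self.clock = clock                              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, allow a trial call ("half-open").
        return self.clock() - self.opened_at < self.reset_after_seconds

    def call(self, operation, fallback):
        if self.is_open():
            # Engine is off: skip the dependency and serve the degraded response.
            return fallback()
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look something like `breaker.call(fetch_live_prices, serve_cached_prices)`, where both function names are made up. The user gets a degraded (cached) experience while the dependency is down, which is not great, but it beats a complete outage.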
I recognize there is nothing particularly novel about this post, but hopefully it creates a good and tangible metaphor that is useful when thinking about system resiliency.
First of all, I am glad that you are okay.
There are so many good analogies and stories that can be told in the auto industry.
Why Do Cars Have Brakes?
Cars have brakes so they can go faster.
Companies have risk management programs so they can take risks.
There are four primary ways to handle risk in the professional world, no matter the industry, which include:
Avoid risk
Reduce or mitigate risk
Transfer risk
Accept risk
Brian, I have a topic that confuses some of us that perhaps you can shed some light on in the future. No rush, or just ignore the idea. No pressure. When to use (or not use) a VM, containers, Cloudflare Workers, or just a mini computer (NUC) at the edge or near edge.
I got the first level understanding, and maybe the 2nd level. But, they are not all interchangeable. At some point they have a limit (like connectivity to the internet) or a storage limit, or a complexity limit, scale limit, cost...
It would be interesting to hear you wrap it all up in a bow in like 3-4 paragraphs. 😂
Maybe that is easier said than done, but it is a compliment that I thought of you!