Software Engineering Lessons from RCAs of the World's Greatest Disasters

Disasters happen! Post-mortems are done. It would be stupid not to learn from them or to apply those lessons to other industries.
Compiled by Anoop Dixith

Disaster RCAs (taken from official investigations) | Software Engineering Lessons
Sinking of the RMS Titanic
  • 1. A lack of emergency preparations had left Titanic's passengers and crew in "a state of absolute unpreparedness"

  • 2. Safety-compromising ambition: in a quest to complete the journey sooner, the ship traveled at high speed through dangerous areas

  • 3. Titanic's Captain Edward Smith had shown an "indifference to danger [that] was one of the direct and contributing causes of this unnecessary tragedy."

    1. It's not a matter of if, it's a matter of when. Systems fail. Pre-mortem all known failure scenarios. Be a pessimist while planning, a pragmatist while implementing, and an optimist while handing over. Maintain structured, comprehensive runbooks for all known failure scenarios. Simulate and test failure scenarios using chaos engineering and synthetic monitoring techniques (a minimal fault-injection sketch follows this list).

    2. Leading during a software crisis (think production database dropped, security vulnerability found, system-wide failures, etc.) requires a leader who can stay calm and composed, yet think quickly and ACT. Gene Kranz's leadership during Apollo 13 is just one example of such action.
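
      A minimal fault-injection sketch of the "simulate failure scenarios" idea in lesson 1, assuming a hypothetical quote service with a documented cache fallback (all names here are illustrative, not from any real system):

        # Minimal chaos-style fault injection: force the dependency to fail and
        # verify that the documented fallback path actually holds.
        import random

        class FlakyDependency:
            """Wraps a real call and injects timeouts at a configurable rate."""
            def __init__(self, real_call, failure_rate=0.3):
                self.real_call = real_call
                self.failure_rate = failure_rate

            def __call__(self, *args, **kwargs):
                if random.random() < self.failure_rate:
                    raise TimeoutError("injected failure")  # simulated outage
                return self.real_call(*args, **kwargs)

        def get_price(symbol, quote_service, cache):
            """Code under test: must degrade gracefully when the dependency fails."""
            try:
                return quote_service(symbol)
            except TimeoutError:
                return cache.get(symbol, "UNAVAILABLE")  # documented fallback

        if __name__ == "__main__":
            cache = {"ACME": 41.5}
            always_down = FlakyDependency(lambda s: 42.0, failure_rate=1.0)
            assert get_price("ACME", always_down, cache) == 41.5
            print("fallback verified under injected failure")
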
    Challenger Disaster
  • 1. An accident rooted in history: the failure of both NASA and its contractor, Morton Thiokol, to respond adequately to the design flaw.

  • 2. Flawed launch decision: management overrode an engineering recommendation that was based on the data.

  • 3. Drastically divergent understandings of, and estimates for, the safety factors between engineering and management
    1. Software Rot, lack of Software Archaeology, Technical Debt, and unattended Software Entropy WILL sink you, sooner or later.

    2. Enable engineers to make "data-backed" engineering recommendations and decisions by creating a psychological safety net.

    Deepwater Horizon Oil Spill
  • 1. Ignoring "near misses" with the cement job

  • 2. A focus on speed over safety (the well was behind schedule, costing BP $1.5 million a day) helped lead to the accident.

  • 3. Most decisions made favored approaches that were shorter in time and lower in cost, without due consideration of the tradeoffs between cost, schedule, risk, and safety
    1. Let's please not have this argument again; we all know the answer to the question 'Is high quality software worth the cost?'

    2. I personally believe in: "If you're building a road, build it like a runway. But if you have to do it in an hour with two people, start with something your customer can just travel on, but reliably." (Quality is paramount, but if time and resources are constrained, ensure a reliable, quality barebones MVP first.)

    3. You can probably win a hackathon by going at tachyon speed, compromising quality and structure. But good luck making a customer happy with that! Check the evolution of NPS if you like!
    Chernobyl Disaster
  • 1. At the time of the accident the reactor was being operated with many key safety systems turned off, most notably the emergency core cooling system

  • 2. The reactor was brought to a state not specified by procedures; it was held that the designers of the reactor considered this combination of events to be impossible

  • 3. The poor quality of operating procedures and instructions, along with personnel having an insufficient understanding of the technical procedures involved with the nuclear reactor
    1. Never skip / disable any test whose associated code is active! (See the test sketch after this list.)

    2. Murphy's law states 'Whatever can go wrong, will.' This applies to code and software as well. ALWAYS plan for software failures: anticipate failures with a pre-mortem, lead during an outage, and learn from the failures.
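
      As referenced in lesson 1, a minimal pytest-style sketch of the difference between silencing a test for active code and tying the skip to the same switch that disables the code (the flag and function are hypothetical stand-ins):

        # Minimal sketch: never silence a test whose code path is still active.
        import pytest

        EMERGENCY_COOLING_ENABLED = True  # hypothetical flag guarding the code path

        def emergency_shutdown(temperature: int) -> str:
            return "SCRAM" if EMERGENCY_COOLING_ENABLED and temperature > 900 else "NOOP"

        # Anti-pattern: the code above is live, but its test is silenced outright.
        @pytest.mark.skip(reason="flaky, fix later")  # don't do this for active code
        def test_emergency_shutdown_silenced():
            assert emergency_shutdown(950) == "SCRAM"

        # Safer: skip only under the same condition that disables the code itself,
        # so the test can never be off while the code it covers is on.
        @pytest.mark.skipif(not EMERGENCY_COOLING_ENABLED, reason="feature disabled")
        def test_emergency_shutdown():
            assert emergency_shutdown(950) == "SCRAM"
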
    Air France Flight 447 Crash
  • 1. Lack of training on manual handling of the aircraft, owing to over-reliance on the autopilot

  • 2. The cockpit's lack of a clear display of the inconsistencies in airspeed readings identified by the flight computers.
    1. Not to the extent of Literate Programming, but method-level documentation, code-level documentation, API documentation, class-level documentation, design docs, specs, user guides, and other forms of documentation (READMEs, UMLs, licensing info, etc.) are all central to the software development process.

    2. Build UIs that show the right set of information in the right place at the right time.
    September 11 Attacks
  • 1. Lack of communication and information sharing between the CIA and the FBI (amongst various other political factors). The position of Director of National Intelligence was subsequently created
    1. Put an end to information hoarding within orgs/teams and set the stage for open communication across teams/orgs. Maintain a central, company-wide knowledge base without giving in to the epistemological entropy that naturally creeps in. Also, be human and just ask!
    2008 Financial Crisis
  • 1. Widespread failures in financial regulation and supervision proved devastating

  • 2. Dramatic failures of corporate governance and risk management at many systemically important financial institutions

  • 3. A combination of excessive borrowing, risky investments, and lack of transparency put the financial system on a collision course with crisis

  • 4. We conclude the failures of credit rating agencies were essential cogs in the wheel of financial destruction
    1. There should be no "single person responsible" for an engineering failure! Processes must be put in place that ensure any failure is treated as a collective responsibility.

      A bug causing a catastrophic failure in production is not (just) a developer's fault!

      Why didn't the reviewers catch it?

      Why did QA miss it? Why did the manager let it deploy without verifying that automated tests were in place?

      Why wasn't a canary deployment employed? (A minimal routing sketch follows this list.)

      Was dogfooding / fishfooding of the feature/changes considered?
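
      For the canary question above, a minimal sketch of consistent percentage-based routing; in practice this logic usually lives in the load balancer, service mesh, or deploy tooling rather than application code, and the names here are illustrative:

        # Minimal canary-routing sketch: send a small, stable slice of users to
        # the new build and watch its error/latency metrics before a full rollout.
        import hashlib

        CANARY_PERCENT = 5  # route 5% of users to the new version

        def backend_for(user_id: str) -> str:
            # Hash the user id so the same user consistently hits the same backend.
            bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
            return "v2-canary" if bucket < CANARY_PERCENT else "v1-stable"

        if __name__ == "__main__":
            routes = [backend_for(f"user-{i}") for i in range(1000)]
            print(routes.count("v2-canary"), "of 1000 simulated users hit the canary")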

    Bhopal Gas Tragedy (Going with the official version)

  • 1. The disaster was caused by a potent combination of under-maintained and decaying facilities, a weak attitude towards safety, and an under-trained workforce, culminating in worker actions that inadvertently enabled water to penetrate the MIC tanks in the absence of properly working safeguards
    1. Tech debt, combined with prioritizing speed over quality, insufficient requirements, and insufficient testing, can lead to a "software tragedy"!
    Chicago Tylenol Murders (This is "response" to a disaster rather than an RCA)

  • 1. Johnson & Johnson (maker of Tylenol) fully cooperated with RCA/investigation, openly.

  • 2. Johnson & Johnson did a full recall of the product, cooperated with the authorities, and announced a reward for any tip leading to the culprit(s)

  • 3. Johnson & Johnson promptly innovated a tamper-proof, triple-sealed package that was impossible to open without destroying it entirely. This move brought Tylenol's market share back to its original glory.

    1. Crisis Communication Strategy and Handling is a well-researched topic, and it applies equally well to the software industry. The following analyses by the Department of Defense of the strategies used by Johnson & Johnson, Denny's, Jack in the Box, and Union Carbide highlight different such strategies.
      Within the software industry, public-facing status portals have become the norm as a form of quick crisis communication during outages and incident management.

    2. As always, a crisis is considered a launchpad for innovation! Don't just fix problems, innovate solutions.
    Santiago de Compostela derailment
  • 1. Overspeeding.

  • 2. Lack of real time monitoring. Outdated monitoring devices.

  • 3. Inaction. The train driver had made his overspeeding habit public well before the accident.
    1. Real-time software and system monitoring and observability are not only the best ways to avoid and prepare for production failures; total "software visibility" through real-time monitoring is also a key to innovation (a small instrumentation sketch follows this list).

    2. Inaction/ignorance in a software development process should be avoided. That "backlog of bugs" needs to be fixed, that "tech debt" needs to be resolved, that "TODO" needs to be done, that "toggle" needs to be removed, that document needs to be written.
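
      A minimal sketch of the instrumentation behind lesson 1: time an operation and emit a structured record that any monitoring backend could ingest (the operation and the print-based sink are placeholders):

        # Minimal observability sketch: wrap an operation with timing and a
        # structured, machine-readable record of its outcome.
        import functools
        import json
        import time

        def observed(operation_name):
            """Decorator that records duration and status for every call."""
            def wrap(fn):
                @functools.wraps(fn)
                def inner(*args, **kwargs):
                    start = time.monotonic()
                    status = "ok"
                    try:
                        return fn(*args, **kwargs)
                    except Exception:
                        status = "error"
                        raise
                    finally:
                        record = {
                            "metric": f"{operation_name}.duration_ms",
                            "value": round((time.monotonic() - start) * 1000, 2),
                            "status": status,
                        }
                        print(json.dumps(record))  # stand-in for a real metrics sink
                return inner
            return wrap

        @observed("checkout")
        def checkout(order_id):
            time.sleep(0.05)  # placeholder work
            return f"order {order_id} placed"

        if __name__ == "__main__":
            checkout(101)
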
    MH370 findings or the lack thereof
  • 1. Lack of sufficient real-time monitoring for the aircraft, sufficient back-up systems, and sufficient longevity considerations for the black box and other communication systems

  • 2. Lack of openness in sharing radar data, mostly owing to each nation's security concerns

  • 3. Not enough resources for underwater and remote-water search, owing to rough seas, remote locations, and the high cost of the undertaking.
    1. Software longevity, and thus software-monitoring longevity, are "must consider" factors during software development.

    2. There is no such thing as over-investing in the different aspects of application monitoring, including
      a. uptime monitoring,
      b. realtime anomaly detection,
      c. realtime performance monitoring, and
      d. realtime cost-burn estimation,
      with realtime, well-thought-out alerts on all of the above (a minimal probe sketch follows).
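
      A minimal sketch of item (a) above, an uptime/latency probe with an alert threshold; the URL, the budget, and the print-based alert are placeholders for a real health endpoint and paging integration:

        # Minimal uptime/latency probe sketch (standard library only).
        import time
        import urllib.request

        TARGET_URL = "https://example.com/health"  # placeholder health endpoint
        LATENCY_BUDGET_MS = 500                    # alert if slower than this

        def alert(message: str) -> None:
            print(f"ALERT: {message}")  # stand-in for a pager/chat integration

        def probe(url: str) -> None:
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    latency_ms = (time.monotonic() - start) * 1000
                    if resp.status != 200:
                        alert(f"{url} returned HTTP {resp.status}")
                    elif latency_ms > LATENCY_BUDGET_MS:
                        alert(f"{url} took {latency_ms:.0f} ms (budget {LATENCY_BUDGET_MS} ms)")
            except Exception as exc:
                alert(f"{url} unreachable: {exc}")

        if __name__ == "__main__":
            probe(TARGET_URL)
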
    Boeing 737 MCAS Accidents
  • 1. Concealing information about changes to MCAS, a critical system within the Boeing 737 MAX, to avoid pilot training, which is a major cost for airlines.

  • 2. Insufficient oversight by the FAA despite high stakes

  • 3. No attention/consideration was given to red flags raised about MCAS well before the crashes
    1. Two of the three pillars of empiricism in agile project management are "inspection" and "transparency". Any and all yellow and red flags need to be considered, transparency should be upheld, and clear communication needs to be in place.
    The successful failure of Apollo 13
  • 1. A faulty Teflon coil resulted in a series of events that led to the explosion, which forced the capsule to return to Earth without landing on the Moon. The Teflon coil was one of the 2.5 million parts that make up an average spacecraft.
    1. The impact of a software bug can't be assessed by its complexity! No bug is too small! Check out this list of the costliest software bugs; one of them is literally just a unit mismatch (a small sketch of that class of bug follows). The economic cost of software bugs was estimated at $60 billion annually in 2002. These epic software failures also mandate the need for adequate software testing.
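
      A minimal sketch of the unit-mismatch class of bug, and one way to make the unit part of the type so it can only be converted once, on purpose (the thruster numbers are made up):

        # Minimal unit-safety sketch: carry the unit in a type, not in a comment.
        from dataclasses import dataclass

        LBF_TO_NEWTON = 4.44822  # 1 pound-force in newtons

        @dataclass(frozen=True)
        class Force:
            newtons: float

            @classmethod
            def from_pounds_force(cls, lbf: float) -> "Force":
                return cls(newtons=lbf * LBF_TO_NEWTON)

        def total_impulse(force: Force, seconds: float) -> float:
            """Impulse in newton-seconds; refuses to work on a bare, unitless number."""
            return force.newtons * seconds

        if __name__ == "__main__":
            thruster = Force.from_pounds_force(0.1)  # vendor data arrives in pound-force
            print(total_impulse(thruster, 10.0))     # converted exactly once, explicitly
            # total_impulse(0.1, 10.0) would now fail fast (no .newtons attribute)
            # instead of silently mixing imperial and metric units.
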
    The Great Depression
  • 1. A change in monetary policy from "Price Stability" to the "Real Bills Doctrine", under which all currency or securities must be backed by material goods.

  • 2. Fundamental understanding of the business cycle was wrong/not data-backed. The idea that 'if consumption fell due to savings, the savings would cause the rate of interest to fall; lower interest rates would lead to increased investment spending and demand would remain constant' was not correct.

  • 3. Over-production and under-consumption
    1. The importance of dogfooding, fishfooding, design partnerships, A/B testing, simulation, and canary deployment cannot be overstated when it comes to reducing risks from changes of any sort.

    2. Assumption management is an important discipline in software development. Assumptions form the fundamental understanding of a piece of software, and thus they need to be well thought through, properly communicated, clearly documented, and periodically re-evaluated (a small sketch follows).
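
      A minimal sketch of lesson 2, treating assumptions as checkable artifacts rather than folklore; the specific assumptions and values below are invented for illustration:

        # Minimal assumption-management sketch: document assumptions where they
        # can fail loudly (at startup or in CI) instead of silently going stale.
        from dataclasses import dataclass
        from typing import Callable

        MAX_ORDER_ID = 1_500_000_000   # illustrative current values
        BATCH_WINDOW_MINUTES = 45

        @dataclass
        class Assumption:
            description: str
            check: Callable[[], bool]

        ASSUMPTIONS = [
            Assumption("order IDs still fit in 32 bits for the legacy exporter",
                       lambda: MAX_ORDER_ID < 2**31),
            Assumption("the batch window is long enough for the nightly job",
                       lambda: BATCH_WINDOW_MINUTES >= 30),
        ]

        def verify_assumptions() -> None:
            failed = [a.description for a in ASSUMPTIONS if not a.check()]
            if failed:
                raise RuntimeError("stale assumptions: " + "; ".join(failed))

        if __name__ == "__main__":
            verify_assumptions()
            print("all documented assumptions still hold")
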
    Johnstown Floods
  • 1. Ignoring security for luxury. The dam was lowered to enable the construction of a luxury fishing club.

  • 2. Blatant failure to communicate the warnings: even though there was time, no warnings were properly dispatched

    1. There is no workaround for software security, or an excuse for the lack thereof! At the code level, the paradigms of Secure Coding and Defensive Programming need to be embedded by default while coding (a minimal sketch follows after lesson 2 below).
      At the system level, the design and architectural patterns of "Secure by Default" and "Secure by Design" need to be prioritized.

    2. Raise the red flags. Always, and as soon as possible. It's one of the essential qualities of an engineering or project manager to be able to assess and balance efficiently between "raising red flags" and "avoiding false alarms and red herrings".
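
      A minimal sketch of the secure/defensive-coding posture from lesson 1 above: validate untrusted input against an allow-list and fail closed; the field names, pattern, and roles are invented for illustration:

        # Minimal defensive-programming sketch: validate untrusted input, fail closed.
        import re

        USERNAME_PATTERN = re.compile(r"^[a-z0-9_]{3,32}$")  # allow-list, not deny-list
        ALLOWED_ROLES = {"viewer", "editor"}                  # least privilege by default

        def create_account(username: str, role: str = "viewer") -> dict:
            if not isinstance(username, str) or not USERNAME_PATTERN.fullmatch(username):
                raise ValueError("invalid username")          # reject, don't "fix up"
            if role not in ALLOWED_ROLES:
                raise ValueError("unknown role")              # unknown input fails closed
            return {"username": username, "role": role}

        if __name__ == "__main__":
            print(create_account("ada_l"))                    # defaults to least privilege
            try:
                create_account("ada_l; DROP TABLE users;--", role="admin")
            except ValueError as err:
                print("rejected:", err)
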