The words “chaos” and “engineering” aren’t usually found together. After all, good engineers keep chaos at bay. Yet lately software developers have been deploying what they loosely call “chaos” in careful amounts to strengthen their computer systems by revealing hidden flaws. The results aren’t perfect (anything chaotic can’t offer guarantees), but the techniques are often surprisingly effective, and that makes them worthwhile.

The process can be especially useful for security analysts because their job is to find undocumented and unanticipated backdoors. Chaotic testing can’t identify all security failures, but it can reveal dangerous, unpatched vulnerabilities that the developers never imagined. Good chaos engineering can help both DevOps and DevSecOps teams, because problems of reliability or resilience are sometimes security weaknesses as well. The same coding mistake can often either crash a system or open it to intrusion.

What is chaos engineering?

The term is a neologism meant to unify several techniques that have found some success. Some use words like “fuzzing” or “glitching” to describe how they tweak a computer system, perhaps knocking it off balance or even crashing it. They inject random behavior that stresses the software while watching carefully for malfunctions or bugs to appear. These are often failure modes that might take years to reveal themselves in regular usage.
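The idea can be sketched in a few lines. The following is a minimal, illustrative fuzzer in Python; the target function `parse_record` and its deliberate bug are hypothetical stand-ins, not anything from a real project, and real fuzzing tools are far more sophisticated about generating inputs and triaging failures.

```python
import random
import string

def parse_record(line: str) -> tuple[str, int]:
    """Toy parser: expects 'name:count'. It hides a bug: any input
    without exactly one colon, or with a non-numeric count, raises."""
    name, count = line.split(":")   # ValueError if the colon count is wrong
    return name, int(count)         # ValueError if count isn't an integer

def fuzz(target, trials: int = 1000, seed: int = 0) -> list[str]:
    """Feed short random strings to `target`, collecting every input
    that makes it raise an exception -- the 'careful amounts of chaos'."""
    rng = random.Random(seed)       # fixed seed so runs are reproducible
    crashers = []
    for _ in range(trials):
        candidate = "".join(
            rng.choice(string.printable)
            for _ in range(rng.randint(0, 12))
        )
        try:
            target(candidate)
        except Exception:
            crashers.append(candidate)
    return crashers

failures = fuzz(parse_record)
print(f"{len(failures)} of 1000 random inputs crashed the parser")
```

Even this naive version finds crashing inputs almost immediately, which is the point: random stress surfaces failure modes the author of `parse_record` never wrote a test for.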

John Gilmore, one of the founders of the Electronic Frontier Foundation (EFF) and a member of the development teams behind several key open-source projects, says that coding is a process of continual refinement, and chaos engineering is one way to speed up the search for all possible execution paths. “The real value of long-running code is that most of the bugs have been shaken out of it by the first 10 million people to run it, the first 20 compilers that have compiled it, the first five operating systems it runs on. Ones that have then been tested by fuzzing and penetration tests (e.g., Google Project Zero) have many fewer unexplored code paths than any new piece of code,” he explains.

Gilmore likes to tell a story from the 1970s, when he worked for Data General, an early minicomputer manufacturer. He found that flipping a power switch at random times would leave the operating system state in disarray. “Rather than fixing the problem, the operating system engineers claimed that flipping the breaker wasn’t a valid test,” Gilmore says, before adding, “As a result, Data General is dead now.”