The 7 worst automation failures

Source – csoonline.com

There are IT jobs that you just know are built for failure. They are so big and cumbersome, and in some cases plow so much new ground, that unforeseen outcomes are likely. Then there are other situations where an IT pro might just say “whoops” when that unforeseen result should have been, well, foreseen.

UpGuard has pulled together a group of the biggest instances in the past few years in which the well-intentioned automation of a company’s IT systems facilitated a major breach instead.


Healthcare.gov: How an oversight broke the U.S. government’s healthcare website

When the U.S. government rolled out the Affordable Care Act’s web enrollment tool, Healthcare.gov, in October 2013, it was expected to be a monumental undertaking, and with the delivery of millions of citizens’ health insurance on the line, the stakes were high. So when a major software failure crashed the website a mere two hours after its launch, the White House administration suffered a sizeable backlash. Due to a lack of integration, visibility, and testing, the project had significant problems from the start, beginning with more than 100 defects in Healthcare.gov’s account creation feature, dubbed “Account Lite.”

Given its function, Account Lite was a crucial piece of the Healthcare.gov site, serving as the mechanism by which people would create their accounts and gain access to their healthcare options. This particular module had so many problems that it was assuredly a disaster waiting to happen. Nevertheless, contractors moved forward with it as it stood.

The software release failed, preventing millions from securing healthcare coverage. What’s more, the outage had political ramifications as critics of the Affordable Care Act began citing the outage as evidence of the administration’s inability to develop a successful healthcare program. The site was eventually stabilized, but the work that should have been integrated before the release was completed only after the crash occurred.


Dropbox: The buggy outage that dropped Dropbox from the web

No IT team enjoys the experience of an outage, especially when it kicks off a race to implement emergency procedures. In January 2014, Dropbox found itself scrambling in this very scenario when a planned product upgrade took down its site for three hours.

A “subtle bug” in the upgrade script caused updates to be applied to a small number of active machines that were still serving production traffic, and the company’s live services failed as a result. Fortunately for Dropbox, its emergency procedures were well designed and largely effective. With its backup and recovery strategy, the IT team was able to restore most of the services within three hours. Recovery of some of the larger databases was slower, however, and it took several days for all of the company’s core services to fully return.
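The general lesson translates into a simple guard. The sketch below is a minimal, generic illustration (the host names and checks are hypothetical placeholders, not Dropbox’s actual tooling): an upgrade script that refuses to touch any machine the service registry still reports as serving production traffic.

```python
import sys

# Hosts scheduled for the planned upgrade (hypothetical names).
HOSTS_TO_UPGRADE = ["db-017", "db-042", "db-105"]

# Hypothetical snapshot from the load balancer / service registry of which
# hosts are still taking production requests. In a real script this would
# be a live query, not a hard-coded table.
SERVING_TRAFFIC = {"db-017": False, "db-042": True, "db-105": False}


def is_serving_traffic(host: str) -> bool:
    """Placeholder for a real load-balancer or service-registry check."""
    return SERVING_TRAFFIC.get(host, True)  # unknown hosts are treated as live


def upgrade(host: str) -> None:
    """Placeholder for whatever actually applies the upgrade."""
    print(f"upgrading {host}")


def main() -> None:
    for host in HOSTS_TO_UPGRADE:
        if is_serving_traffic(host):
            # Fail loudly instead of silently upgrading a live machine.
            print(f"refusing to upgrade {host}: still serving traffic", file=sys.stderr)
            continue
        upgrade(host)


if __name__ == "__main__":
    main()
```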


Amazon/DynamoDB: When the DynamoDB database disrupted all of Amazon’s infrastructure

Just as physical services like freight haulage require physical infrastructures like roads and highways, companies’ digital services depend on underlying digital infrastructures. When some of Amazon’s automated infrastructure processes timed out in September 2015, its Amazon Web Services cloud platform suffered an outage. A simple network disruption cascaded into a broad service failure, and despite its advanced, tightly integrated cloud platform, Amazon experienced the kind of outage more often associated with traditional on-premises data centers.

Amazon had a network disruption that impacted a portion of its DynamoDB cloud database’s storage servers. When this happened, a number of storage servers simultaneously requested their membership data and exceeded the time allowed for retrieval and transmission. As a result, the servers were unable to obtain their membership data and removed themselves from taking requests.

When those unavailable servers began retrying their requests, the DynamoDB timeout issue escalated into a broader network outage. Just like that, a network disruption became a vicious cycle that affected Amazon’s customers as it took AWS down for five hours.
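The underlying failure mode is a synchronized retry storm: many servers retrying on the same schedule keep the overloaded service saturated. A common mitigation, sketched below with generic names rather than AWS’s internal code, is capped exponential backoff with random jitter, which spreads retries out instead of sending them in waves.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Call `fetch()` with capped exponential backoff plus full jitter.

    `fetch` is any zero-argument callable that raises on failure,
    e.g. a request for a server's membership data.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff, capped, with full jitter so that many
            # clients retrying at once do not hammer the service in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))


# Example usage with a stand-in for the membership-data request:
if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_membership_lookup():
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError("membership service overloaded")
        return {"partitions": ["p1", "p2"]}

    print(fetch_with_backoff(flaky_membership_lookup))
```

Full jitter (a random delay between zero and the cap) is preferred over a fixed schedule so that thousands of clients recovering at the same moment do not re-synchronize and hit the service in waves again.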

Opsmatic: Recipe for disaster

When managed under traditional server administration, automation often faces the same set of age-old IT problems. One of those classic, faulty assumptions is “if it ain’t broke, don’t fix it” – assuming that all systems are operating the way they should be. When Opsmatic’s routine server maintenance shut down its whole operation, it was because things weren’t exactly as the team had thought.

In Opsmatic’s case, a Chef recipe called “remove_default_users” had been created during the early stages of the company’s Amazon Web Services experimentation. Now, long after the test, that recipe was somehow still running against the production servers, unbeknownst to the staff maintaining them.
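One way to catch this class of leftover automation, sketched below with hypothetical node and recipe data rather than Opsmatic’s or Chef’s actual setup, is a periodic audit that compares each production node’s configured run-list against an approved allow-list and flags anything unexpected, such as a forgotten experimental recipe.

```python
# Recipes that are allowed to run against production nodes (hypothetical).
APPROVED_RECIPES = {"base", "monitoring", "app_server"}

# Hypothetical snapshot of each production node's configured run-list;
# in practice this would come from the configuration-management server.
PRODUCTION_RUN_LISTS = {
    "prod-web-01": ["base", "monitoring", "app_server"],
    "prod-web-02": ["base", "monitoring", "app_server", "remove_default_users"],
}


def audit(run_lists, approved):
    """Return {node: [unapproved recipes]} for anything outside the allow-list."""
    findings = {}
    for node, recipes in run_lists.items():
        unexpected = [r for r in recipes if r not in approved]
        if unexpected:
            findings[node] = unexpected
    return findings


if __name__ == "__main__":
    for node, recipes in audit(PRODUCTION_RUN_LISTS, APPROVED_RECIPES).items():
        print(f"{node}: unapproved recipes still configured: {recipes}")
```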

Like many major outages, this incident was the result of a long, causal sequence of mistakes, none of which were caught until they added up to a giant problem.

Knight Capital: How one tiny mistype cost Knight Capital $1 billion

Knight Capital automated not only its administrative IT processes, but also its algorithmic trading. Unfortunately, this meant that changes and unplanned errors – in handling real money – could happen very quickly. This is the story of how a single error caused Knight Capital to lose $172,222 per second for 45 minutes straight in 2012.

When operating a data center at scale, clusters of servers often run a single function. This distributes the load across more computing resources and provides better performance for high-traffic applications. The model requires every server in a cluster to use the same configuration, so that the application behaves the same way no matter which server handles a given request. However, configurations – even if identical at provisioning – always drift apart over time.
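Drift of this kind is cheap to detect. The sketch below is purely illustrative (the host names and file contents are assumptions): it hashes the same configuration as collected from each server in a cluster and reports any host whose hash disagrees with the majority.

```python
import hashlib
from collections import Counter

# Hypothetical: contents of the same config file as collected from each host,
# e.g. via ssh or an agent. Keyed by host name.
CONFIGS = {
    "trade-01": "feature_flag=off\norder_router=v2\n",
    "trade-02": "feature_flag=off\norder_router=v2\n",
    "trade-03": "feature_flag=on\norder_router=v1\n",   # drifted host
}


def find_drift(configs):
    """Return hosts whose config hash differs from the most common hash."""
    hashes = {host: hashlib.sha256(text.encode()).hexdigest()
              for host, text in configs.items()}
    majority_hash, _ = Counter(hashes.values()).most_common(1)[0]
    return [host for host, h in hashes.items() if h != majority_hash]


if __name__ == "__main__":
    for host in find_drift(CONFIGS):
        print(f"configuration drift detected on {host}")
```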

Despite all of its automation, Knight Capital was still manually deploying code across server banks, and an inevitable human error caused one of its eight servers to have a different configuration from all the others. When one of Knight’s technicians made this mistake during the deployment of the new server code, no one knew. Thus, from that point forward, the IT staff were operating under the misconception that these servers were identical.

At the same time, decommissioned code remained active on the misconfigured server. “As a result, this server began sending orders to certain trading centers for execution,” and the error triggered a domino effect across the firm’s algorithmic stock trading, costing Knight Capital $465 million in trading losses.


Delta Airlines: Automated fleet of flightless birds

Large logistics operations rely on automated systems to achieve the necessary speed to perform at scale. Some airlines struggle to keep those systems functional. Just like traditional, manual methods of systems administration, automated systems suffer from misconfigurations. In the worst-case scenarios from recent years, failure of these systems has cost airlines hundreds of millions of dollars and more in their customers’ goodwill.

When misconfigurations occur, they are pushed out quickly through automated mechanisms and can bring entire systems down. For airlines, this means flight operations are interrupted, planes are delayed, and money is siphoned out of the business. In one such case in January 2017, Delta told investors that a single glitch in its automated system had caused an extensive outage, costing the airline more than $150 million.
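A standard way to blunt this failure mode, sketched below with hypothetical host groups and placeholder push and health-check functions rather than any airline’s real system, is a staged rollout: push the change to a small canary group first, verify health, and only then continue to the rest of the fleet.

```python
import sys

# Hypothetical fleet split into a small canary group and the remainder.
CANARY_HOSTS = ["ops-canary-01", "ops-canary-02"]
REMAINING_HOSTS = [f"ops-{i:03d}" for i in range(1, 51)]


def push_config(host: str, config: dict) -> None:
    """Placeholder for whatever mechanism actually ships the config."""
    print(f"pushed {config} to {host}")


def healthy(host: str) -> bool:
    """Placeholder health check (HTTP probe, metrics query, etc.)."""
    return True


def staged_rollout(config: dict) -> None:
    # Stage 1: canary group only.
    for host in CANARY_HOSTS:
        push_config(host, config)
    if not all(healthy(h) for h in CANARY_HOSTS):
        # Stop before a bad change reaches the whole fleet.
        sys.exit("canary group unhealthy; aborting rollout")
    # Stage 2: the rest of the fleet.
    for host in REMAINING_HOSTS:
        push_config(host, config)


if __name__ == "__main__":
    staged_rollout({"check_in_kiosk_mode": "standard"})
```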

Google Gmail: You’ve got mail? Gmail’s 2014 bug-induced failure

When technology giants experience the occasional automation-related outage, an hour of downtime can mean a great deal more. For these huge organizations, making any sort of change means making it across thousands of servers. Having always been on the bleeding edge of technology, Google has, unsurprisingly, automated its configuration management. Automation is employed to make operations easier, but when the wrong change is executed through an automated system, it can propagate far and wide within a matter of seconds.

In 2014, a bug in Google’s internal automated configuration system caused Gmail to crash for around half an hour. The incorrect configuration was sent to live services, causing users’ requests for their data to be ignored and those services, in turn, to generate errors.
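Validating a change before the automation fans it out is the other half of the defense. The sketch below is generic (the required keys and limits are assumptions, not Google’s internal rules): it checks a candidate configuration against a few structural rules and refuses to hand anything that fails over to the rollout system.

```python
def validate(config: dict) -> list:
    """Return a list of problems; an empty list means the config may ship."""
    problems = []
    # Hypothetical required keys for this illustrative service config.
    for key in ("service_name", "backend_pool", "timeout_ms"):
        if key not in config:
            problems.append(f"missing required key: {key}")
    if "timeout_ms" in config and not (0 < config["timeout_ms"] <= 60_000):
        problems.append("timeout_ms out of range")
    if config.get("backend_pool") == []:
        problems.append("backend_pool must not be empty")
    return problems


def propagate(config: dict) -> None:
    errors = validate(config)
    if errors:
        raise ValueError("refusing to propagate bad config: " + "; ".join(errors))
    print("config accepted; handing off to the rollout system")


if __name__ == "__main__":
    propagate({"service_name": "mail-frontend",
               "backend_pool": ["be-1", "be-2"],
               "timeout_ms": 2_000})
```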

The lesson is that configuration automation is not the same as configuration management. Automation ensures that changes get pushed out across all systems; management is what ensures the changes being pushed out are the right ones.
