What we learned from Google Cloud’s June outage
In early June 2019, Google Cloud suffered a cascading set of faults that rendered multiple service regions unavailable for several hours.
This by itself isn’t totally unprecedented; what made it significant was the way it propagated through the very software that was designed to contain it. Moreover, engineers’ initial attempts to correct the issue were thwarted by the failure of that same software architecture. It was the interconnectedness and interdependencies of Google’s management components that contributed to the outage.
To understand the situation more fully, here is a short summary of what happened. A maintenance “event” that would normally be routine triggered a reaction in Google Cloud Platform’s (GCP) network control plane. The reaction was made worse by a fault in that code, which allowed it to stop other jobs elsewhere in Google’s infrastructure.
The distributed nature of a cloud platform means that, although clusters in one area are designed to be maintained independently of clusters in another, a failure spreads like a virus if those management processes leak across regions. And because these controllers are responsible for routing traffic throughout the cloud, network capacity became more constrained as more of them shut down, leading to even more failures. In technical terms:
- Network control-plane jobs were stopped on multiple clusters, across multiple regions at the same time.
- Packet loss and error rates increased simultaneously across multiple regions.
- As a result, key network routes became unavailable, increasing network latency for certain services.
- Tooling to troubleshoot the issue became unusable because tooling traffic competed with service traffic.
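The cascade described in this list can be illustrated with a toy model. The sketch below is not Google's code and does not use figures from the incident: the function names, the number of controllers and the capacity values are all invented for illustration, and it assumes, as a simplification, that usable network capacity shrinks in proportion to the number of control-plane jobs still running.

```python
def packet_loss(offered_load: float, capacity: float) -> float:
    """Return the fraction of offered traffic that cannot be carried."""
    if offered_load <= 0:
        return 0.0
    return max(0.0, (offered_load - capacity) / offered_load)


def simulate_shutdown(controllers: int, capacity_each: float, offered_load: float):
    """Yield (controllers_running, loss_fraction) as control-plane jobs stop one by one.

    Assumption: each running controller contributes a fixed share of capacity.
    """
    for running in range(controllers, -1, -1):
        yield running, packet_loss(offered_load, running * capacity_each)


if __name__ == "__main__":
    # Hypothetical numbers: 4 controllers, 30 units of capacity each, 100 units of load.
    for running, loss in simulate_shutdown(controllers=4, capacity_each=30.0, offered_load=100.0):
        print(f"{running} controllers running -> {loss:.0%} of traffic dropped")
```

With these made-up numbers the network has headroom while all four controllers run, but every job that stops pushes the loss rate higher, which is the pattern the article describes.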
Once the root cause was identified, remediating the failures required unraveling the management paths to take the right processes offline in the right order and apply the necessary fixes:
- Google engineers first disabled the automation tool responsible for performing maintenance jobs.
- Server cluster management software was updated to no longer accept risky requests which could affect other clusters.
- Updates were made to store configurations locally, reducing recovery times by avoiding the latency caused by rebuilding systems automatically.
- Network fail-static conditions were extended to allow engineers more time to mitigate errors.
- Tooling was improved to communicate status to impacted customers even through network congestion.
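Two of the fixes in this list, rejecting requests that reach into other clusters and extending the fail-static period, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Google's implementation: `MaintenanceRequest`, `validate`, `FailStaticRouter` and the `hold_seconds` parameter are hypothetical names, and "risky" is assumed here to mean "targets any cluster other than the one the request came from".

```python
from dataclasses import dataclass


class RiskyRequestError(Exception):
    """Raised when a maintenance request reaches beyond its own cluster."""


@dataclass(frozen=True)
class MaintenanceRequest:
    origin_cluster: str
    target_clusters: frozenset


def validate(request: MaintenanceRequest) -> None:
    """Reject any request that would touch a cluster other than its origin."""
    foreign = request.target_clusters - {request.origin_cluster}
    if foreign:
        raise RiskyRequestError(
            f"request from {request.origin_cluster} targets other clusters: {sorted(foreign)}"
        )


class FailStaticRouter:
    """Keep serving the last known routes for hold_seconds after losing the control plane."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds
        self._routes = {}
        self._last_contact = None

    def update(self, routes: dict, now: float) -> None:
        """Record fresh routes received from the control plane at time `now`."""
        self._routes = dict(routes)
        self._last_contact = now

    def active_routes(self, now: float) -> dict:
        """Return the cached routes while inside the hold window, otherwise nothing."""
        if self._last_contact is None:
            return {}
        if now - self._last_contact <= self.hold_seconds:
            return dict(self._routes)
        return {}
```

A longer `hold_seconds` value corresponds to the extended fail-static condition: the data plane keeps forwarding on stale but working routes, which gives engineers more time before traffic is affected.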
Ultimately, the fault lay not in Google’s willingness or ability to address issues, but in a systemic problem with how the platform reacted to unforeseen events. In a real-time computing environment, there is no margin in which management systems can be offline while a problem is fixed in the very system used to apply fixes to them.
URGING fans to plunge into a virtual high-res surround sound universe of extraordinary games, Google hopes its cloud-based Stadia platform will take the world by storm on its November launch.
The US digital behemoth unveiled details of its nascent streaming video platform at last week’s Gamescom trade fair in Cologne in the hope it can gain massive traction among hardcore gamers to zap past other providers of existing gaming fare.
Gamescom, styling itself the biggest event in the European gaming industry, is a sizeable window on the state of play in a mushrooming market worth an estimated US$135 billion globally last year, according to analysts – with mobile platforms accounting for about half.
Stadia, details of which first publicly emerged in June at E3, the world’s premier event for computer and video games, offers as its USP the chance for users to play their favourite game in high resolution on a range of devices, from smart TV to console or smartphone.
That presages something of a gaming revolution. “People have been talking about cloud gaming for 10 years – we are on the third generation of actors. The signals have not yet turned green but Google has got solid enough guts to try it. We’ve never been so close,” says Laurent Michaud, director of studies at French digital market consultancy Idate.
Gamescom represents a chance for some hands-on experience and the brand’s huge logo and its battalion of hostesses on its stand are helping to pull in the curious, as they compare relative attractions with rivals led by Sony’s Playstation and Microsoft’s Xbox.
Google chief executive Sundar Pichai explained at E3 in Los Angeles that the idea is “to build a game platform for everyone” following an initial rollout in 14 countries using a subscription model after an initial bundled hardware purchase.
Some games will be free and others will require payment.
Even so, the Gamescom evidence after last Monday’s opening suggested interest had yet to hit the heights of neighbouring stands Nintendo or Konami – the latter being the developer of Pro Evolution Soccer’s latest gambit PES 2020.
“I find their concept interesting but I have doubts as to their capacity to guarantee good connectivity,” commented stand visitor Rishil Kuta, 22. A keen console user, he said he would nevertheless be “ready to pay” a premium for a “stable” product.
Not sharing that opinion was Steven Mertes, 28, who said he did not see himself as ready to log off from his PC or close his console “which propose games of much better quality”.
“I have always been used to playing on a computer – it’s much more comfortable.”
Whichever way the cloud gaming cards fall, the race is on to hook players, especially the hardcore ones, for next-generation gameplay.
“The most difficult gamers to convince will be the ‘hardcore gamers’. They may not be as numerous as occasional players but they are the ones who count. If they don’t go to a platform, things could be difficult,” predicts Mr Michaud.
The hardcore brigade tend to be willing to pay out for the rig and content they want – but are often very attached to their favoured support environment, be it console or PC-based.
Beyond the task of converting gamers to Stadia, Google must address various technical obstacles that go with the territory of developing cloud gaming.
Although Stadia is promising 4K high resolution at 60 frames per second with minimal time lag, it remains to be seen how the platform can persuade players to subscribe if they lack suitable screens and fibre-optic broadband or 4G connections.
“We have a small doubt on the development of cloud gaming,” says Wandrille Pruvot, CEO of Xtra Life, a cloud-based apps manager for Apple. “The challenge will notably be technical as the better the resolution the greater the need for a quality internet network. The games we are working on are simpler, more based on gameplay quality and that requires less bandwidth for the graphics,” says Mr Pruvot.
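The link between resolution and bandwidth can be made concrete with some simple arithmetic. The sketch below computes the uncompressed data rate of a video stream; it assumes that "4K" means 3840 x 2160 pixels and that each pixel takes 24 bits, neither of which is stated in the article. Streaming services compress video heavily, so real connections need far less than these raw figures, but the proportions between resolutions still hold.

```python
def raw_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int = 24) -> int:
    """Uncompressed video data rate in bits per second."""
    return width * height * fps * bits_per_pixel


# Assumed resolutions: 4K as 3840x2160 and Full HD as 1920x1080, both at 60 frames per second.
uhd = raw_bitrate_bps(3840, 2160, 60)
fhd = raw_bitrate_bps(1920, 1080, 60)

print(f"4K at 60 fps, uncompressed: {uhd / 1e9:.2f} Gbit/s")
print(f"Full HD at 60 fps, uncompressed: {fhd / 1e9:.2f} Gbit/s")
print(f"4K carries {uhd // fhd} times the raw data of Full HD")
```

Under these assumptions a 4K stream carries four times the raw data of a Full HD stream at the same frame rate, which is why simpler graphics ask less of the network.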
The bet for cloud gaming is thus to push independent, if not always very visible titles – a means for Google and rival producers to position themselves as a ‘Netflix for gaming’ by providing original content.
Overall, though, just as consoles did not kill off PC gaming, cloud gaming could essentially offer an extra strand of choice for fans of video games. AFP