How Monzo Isolated Their Microservices Using Kubernetes Network Policies

Source: infoq.com

Monzo’s security team shared their story about implementing Kubernetes network policies using Calico APIs to provide isolation among 1500 microservices.

Monzo is a mobile-only digital bank that runs its core infrastructure on AWS. It hosts its microservices on Kubernetes, uses Apache Cassandra as its primary database and Apache Kafka for messaging, and writes most of its application code in Go. Monzo’s security team adopted zero trust networking as one of its goals. A zero trust platform works on the principle that no entity, inside or outside the network, is trusted to access private information unless it is verified. For Monzo, this meant each backend service would be allowed to talk only to a pre-approved list of services. With around 1500 services and over 9300 inter-service interactions, this was a difficult task. The team used Calico-specific network policies on Kubernetes to provide this isolation, after building a custom toolset that derives the policies through code analysis.

The team isolated one service to test out their preliminary approach. They wrote a custom tool called rpcmap which discovers inter-service dependencies from static code analysis. According to Jack Kleeman, Backend Engineer at Monzo, they chose static analysis over observation during integration testing or during runtime because:

Monzo has a lot of code paths – there isn’t an integration test for everything. And for runtime, just because something is rarely called doesn’t mean it’s never called; a bank can have yearly processes.
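
The idea of deriving a dependency map from source code rather than from observed traffic can be illustrated with a minimal sketch. This is not how rpcmap works internally (the article only says it performs static analysis); the sketch simply scans source text for a hypothetical call convention, `"service.<name>/..."`, and the service names and snippets are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical convention (an assumption, not Monzo's): a call to another
# service appears in source as a string literal starting "service.<name>/".
CALL_PATTERN = re.compile(r'"service\.([a-z0-9-]+)/')


def discover_dependencies(sources):
    """Map each calling service to the set of services its code references."""
    deps = defaultdict(set)
    for caller, code in sources.items():
        for callee in CALL_PATTERN.findall(code):
            if callee != caller:  # ignore self-references
                deps[caller].add(callee)
    return dict(deps)


# Invented source snippets keyed by service name.
sources = {
    "payments": 'rpc.Get("service.ledger/balance")\nrpc.Post("service.fraud/check")',
    "ledger": "// no outbound calls",
}
deps = discover_dependencies(sources)
```

Because the scan covers every call site in the code, a rarely exercised path (such as a yearly process) shows up just as reliably as a frequently used one, which is the advantage Kleeman describes over runtime observation.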

The rules had to be stored in a manageable and readable way without disrupting how the existing services work. The team used Kubernetes’ NetworkPolicy to enforce the discovered rules. Monzo uses the Calico networking plugin to implement Kubernetes network policies. This initial approach was brittle in terms of testability, and put the onus of maintaining the rule list on the team managing the invoked service. Another drawback cited was that the dev teams would have to edit Kubernetes config files by hand.
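
As a rough sketch of this initial approach, the function below builds a standard Kubernetes `NetworkPolicy` manifest (API group `networking.k8s.io/v1`) in which the destination service lists the callers it accepts. The `app` label key, the policy naming scheme, and the service names are assumptions made for illustration; the article does not say which labels Monzo used.

```python
import json


def build_network_policy(service, allowed_callers, namespace="default"):
    """Build a NetworkPolicy allowing ingress to `service` only from listed callers.

    Assumes pods carry an `app: <service-name>` label (hypothetical).
    """
    peers = [
        {"podSelector": {"matchLabels": {"app": caller}}}
        for caller in sorted(allowed_callers)
    ]
    # An ingress rule with an empty `from` list matches all sources, so when
    # no callers are allowed we emit no ingress rules at all (deny all).
    ingress = [{"from": peers}] if peers else []
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-ingress-{service}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": service}},
            "policyTypes": ["Ingress"],
            "ingress": ingress,
        },
    }


policy = build_network_policy("ledger", {"payments", "fraud"})
manifest = json.dumps(policy, indent=2)  # JSON is accepted wherever YAML manifests are
```

Note that the allow list lives in the destination's policy here, which is why the team owning the invoked service had to maintain it.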

To solve these problems, Kleeman says that “we spoke to the Calico community about testing network policies, and found that we could use some features of Calico that aren’t accessible through Kubernetes to make our policies testable.” One of these features lets traffic that network policies would normally block pass through, while logging each such case. Kubernetes network policies work on selectors and labels, and in the absence of any policies Kubernetes allows all communication between pods. Monzo configured their logging policy to be evaluated last, after all other policies, and monitored their network traffic to work out which services would have had packets dropped.
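
The "allow but log" dry run described above can be modelled in a few lines. The sketch below is a plain simulation of the idea, not Calico's API: given a set of observed flows and a proposed allow list, it reports which flows enforcement would have dropped. The service names and flows are invented.

```python
def would_be_dropped(observed_flows, allowed):
    """Return the (caller, callee) flows that the allow list does not cover.

    `allowed` maps each caller to the set of services it may call.
    Duplicate observations are collapsed and the result is sorted.
    """
    return sorted(
        (caller, callee)
        for caller, callee in set(observed_flows)
        if callee not in allowed.get(caller, set())
    )


allowed = {"payments": {"ledger"}}
observed = [
    ("payments", "ledger"),  # covered by the allow list
    ("payments", "fraud"),   # missing from the allow list
    ("reports", "ledger"),   # caller has no rules at all
    ("payments", "fraud"),   # repeated observation
]
violations = would_be_dropped(observed, allowed)
```

Each entry in `violations` corresponds to a rule that static analysis missed, which can be fixed before enforcement is switched on.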

After this, the team made the list of allowed traffic a property of the invoking service rather than of the destination service. Calling services used labels to declare which services they needed to reach, and each destination’s policy matched on those labels in its ingress spec. To handle services that call a large number of others, such as monitoring, services were grouped by “service-type” so the same principle applied without listing every individual service. rpcmap was configured to run on every commit, and the deployment pipeline converted the rule files into service labels. In the future, the team plans to enforce these rules in a service mesh rather than at the CNI (Container Network Interface) layer, having already moved to a custom mesh built on Envoy.
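
A minimal sketch of the caller-owned model: the pipeline turns a service's list of allowed destinations into labels on the calling pods, and each destination's ingress rule selects any pod carrying the label that names it. The label prefix `allowed-egress.example.com/` is hypothetical, as are the service names; the article does not give Monzo's actual label format.

```python
LABEL_PREFIX = "allowed-egress.example.com/"  # hypothetical label namespace


def caller_labels(allowed_destinations):
    """Convert a caller's rule file (list of destinations) into pod labels."""
    return {f"{LABEL_PREFIX}{dest}": "true" for dest in sorted(allowed_destinations)}


def destination_ingress_rule(service):
    """Ingress rule for `service`: admit any pod labelled as allowed to call it."""
    return {
        "from": [
            {"podSelector": {"matchLabels": {f"{LABEL_PREFIX}{service}": "true"}}}
        ]
    }


def is_allowed(pod_labels, rule):
    """Check whether a pod's labels satisfy any peer selector in the rule."""
    return any(
        all(pod_labels.get(k) == v for k, v in peer["podSelector"]["matchLabels"].items())
        for peer in rule["from"]
    )


labels = caller_labels(["ledger", "fraud"])
```

With this layout the destination's policy never changes when a new caller appears; only the caller's own labels do, so the rule list is maintained by the team that needs the access.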