Securing your network with WeaveNet

Engineering

29/05/2020

At FINBOURNE we run highly available Kubernetes clusters across multiple availability zones, with about twenty-five m5.2xlarge nodes per cluster. These nodes live within an Amazon Web Services Virtual Private Cloud (VPC) and, although they are not physically linked, they can communicate. We use WeaveNet as our overlay network to encrypt cluster traffic within a VPC. WeaveNet (commonly known simply as Weave) gives us confidence that pod traffic is not being intercepted or manipulated between nodes, and deploying it as a daemonset means we can manage its configuration in a single place. That configuration is deployed through our CICD systems from a central repository, and by following an immutable infrastructure pattern we retain the ability to update low-level components like this without triggering an outage.

This post is about Weave, our customisations, and a Weave-related production issue that FINBOURNE encountered and resolved in the last couple of months.

Overlay networks

To use WeaveNet in Kubernetes effectively, a baseline knowledge of pod orchestration is required. Kubernetes is a collection of applications that together manipulate a container runtime, in our case Docker, to run a containerised application.

Kubernetes runs deployed applications as pods, which are in turn composed of one or more Docker containers. The scheduler places each pod on the most suitable host, and the kubelet is the Kubernetes component responsible for managing containers on any given host.

Pods have their own IP addresses that they use to communicate with one another. By default, Docker does not expose a running container's network beyond its host, meaning that without the help of a networking solution pods are unable to route traffic to applications that are not on the same node.

To solve this problem we need an overlay network. The Container Network Interface (CNI) specification is a Cloud Native Computing Foundation (CNCF) project that aims to define a single standard that overlay networks adhere to. As a consequence we are free to plug in whichever overlay network we choose. There are several popular solutions: Flannel, Calico and WeaveNet. At FINBOURNE we use Weave because of its superior fault tolerance and encrypted networking support. This allows us to secure the traffic between nodes while ensuring that traffic is routed as directly as possible between pods.

Deploying and Managing Weave

To deploy Weave we follow a few steps laid out in the WeaveNet documentation. Note that while we manage our own Kubernetes clusters, Weave is also available and useful for managed Kubernetes environments such as EKS or GKE. The installation process is quite simple, with the only caveat being that the WeaveNet daemonset must be present when the cluster first boots, because Weave writes vital CNI config to disk as it starts up and that config is required for the kubelet to operate normally. Failures that leave Weave unable to reach the Kubernetes API server, for instance kube-proxy not writing proxy rules correctly, will result in CNI errors being logged by the kubelet.
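
For reference, the quick-start documented by Weave at the time of writing was roughly the one-liner below; we instead vendor the manifest into our central repository and roll it out through our CICD systems. The CNI config path shown is the default location, so a customised kubelet may look elsewhere.


$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

# Once the daemonset is running, each node should have the CNI config the kubelet needs
$ ls /etc/cni/net.d/
10-weave.conflist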

When everything is installed you’ll see a number of Weave pods in your chosen deployment namespace:


$ kubectl get pods -n weave-namespace
NAME              READY   STATUS    RESTARTS   AGE
weave-net-297ss   2/2     Running   0          8d
weave-net-4km5h   2/2     Running   0          3d
weave-net-x2r8k   2/2     Running   0          9d
weave-net-x4mjw   2/2     Running   0          3d
...
weave-net-zfwnr   2/2     Running   0          6d

These pods have two containers, weave and weave-npc. weave-npc is the weave network policy controller, and is responsible for enforcing Kubernetes Network Policies.
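
For illustration, below is the kind of standard Kubernetes NetworkPolicy that weave-npc enforces; the namespace and labels are invented for the example.


apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: example
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend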

The pods run quite light on resources, with real usage looking like this:


$ kubectl top -n weave-namespace pod weave-net-2qjxq
NAME              CPU(cores)   MEMORY(bytes)
weave-net-2qjxq   11m          87Mi

That compares with a request of:


resources:
  requests:
    cpu: 10m

This indicates that this Weave cluster (one of our test clusters, with 25 AWS m5.2xlarge nodes) needed a little more headroom than the default request to operate in a healthy manner when the cluster is under heavy load.

Customisations

There are a few very easy changes that will improve your Weave setup. Weave supports encrypted traffic out of the box, but encryption is only enabled if you pass the WEAVE_PASSWORD environment variable!
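
A minimal sketch of how that can be wired up, assuming the password is kept in a Kubernetes secret; the secret name and key below are illustrative, not our exact manifest.


# In the weave container spec of the daemonset
env:
  - name: WEAVE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: weave-passwd
        key: weave-passwd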

Our clusters exceed the default CPU request and we want to ensure Weave always has the resources it needs to operate, so we raise the request and intentionally do not set limits:


resources:
  requests:
    cpu: 100m

Updating Weave can be difficult because restarting a pod breaks that node's cluster connectivity for a very small window of time. To mitigate this we set the daemonset's update strategy to OnDelete. Weave is compatible with one version either side, which allows us to update it by cordoning and draining nodes as we go and then restarting each WeaveNet pod in turn, so traffic between nodes is never unsecured and our service does not incur any interruption.


updateStrategy:
  type: OnDelete
  rollingUpdate:
    maxUnavailable: 1
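
A rough sketch of the per-node update loop described above; the node name and the name=weave-net label selector are illustrative and should match your own daemonset.


$ NODE=ip-10-0-0-1.eu-west-2.compute.internal
$ kubectl cordon "$NODE"
$ kubectl drain "$NODE" --ignore-daemonsets
# With an OnDelete update strategy, deleting the pod brings up the new version
$ kubectl delete pod -n weave-namespace -l name=weave-net --field-selector spec.nodeName="$NODE"
$ kubectl uncordon "$NODE"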


Finally, an environment variable populates the pod CIDR range at deployment time:


spec:
  containers:
    - env:
        - name: IPALLOC_RANGE
          value: ${pod_cidr}
        # Becomes
        - name: IPALLOC_RANGE
          value: 10.2.0.0/16


Each node gets a proportion of the IP addresses to allocate to pods; this variable restricts the range of addresses that pods can own.

Learning from an anomaly

A few months ago, some of our supporting services were experiencing a little disruption. The non-production account was sporadically losing connections between nodes. First we noticed that our three-node Elasticsearch cluster was destabilising: it had split-brained and shards were missing.

One of the systems FINBOURNE uses to grant pods access to AWS resources is KIAM. We noticed that pods depending on various AWS permissions were failing. Our Thanos deployments, applications for long-term telemetry storage, were in CrashLoopBackOff because they couldn't access AWS S3.

Our CICD system was seeing 5XX errors when running tests against this environment. The external HTTP interfaces were slow to respond, sometimes failing to respond at all, mimicking the behaviour our CI system was experiencing. The demonstration was only a few hours away, so we took stock to understand what we were seeing:

  • Some applications were misbehaving while others from the same deployment were not.
  • Some in-cluster services (for instance ingress traffic) were failing, despite the target pods being fully operational.
  • Traffic between members of the same deployment, e.g. Elasticsearch, was not being routed properly.

So what did we do? We started gracefully restarting pods that could be at fault:

  • CoreDNS
  • KubeProxy
  • Weave


No change.

We started a root cause analysis across the errors that were being identified by our telemetry and logging infrastructure and found that the Weave container was complaining about IP addresses being owned by another peer.

Understanding the anomaly

Weave forms quorum (consensus) amongst its members so that it can route traffic efficiently between them. Its gossip protocol allows a Weave network to continue to function even if some peers cannot be directly reached. This gives us greater resilience to failures, but it introduces a quirk of behaviour: a node can become completely isolated from all other nodes without anything immediately breaking.


This quirk can result in traffic becoming unroutable to and from specific members of the cluster. In our case, approximately half of the quorum believed that an IP address range belonged to one node, while the other half believed that it belonged to a different node. Each half of the cluster therefore dropped traffic from the node on the other side of the split.
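
When diagnosing this kind of split it helps to compare what each peer believes about its connections. A quick way to do that (the pod name here is a placeholder) is Weave's status connections command; running it from pods on each side of the split shows which peers disagree.


$ kubectl exec -n weave-namespace <weave-pod> -c weave -- sh -c './weave --local status connections'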


From the Weave documentation:

Weave is a networking plugin for Kubernetes that relies
on consensus to allocate IP addresses to nodes.
If a node suddenly leaves the cluster, the weave consensus
pool may not become aware that it can reclaim these
IP addresses, effectively rendering them blocked.


To see the IP address allocations, exec into a running Weave pod and run:


$ kubectl exec -n weave-namespace weave-net-mq2n7 -c weave -- sh -c './weave --local status ipam'
AA:AA:AA:AA:AA:AA(ip-00-000-00-00.eu-west-2.compute.internal) 5120 IPs (07.8% of total) (9 active)
AA:AA:AA:AA:AA:AA(ip-000-000-000-000.eu-west-2.compute.internal) 2816 IPs (04.3% of total) - unreachable!
...
AA:AA:AA:AA:AA:AA(ip-000-000-000-000.eu-west-2.compute.internal) 3567 IPs (05.4% of total) - unreachable!
AA:AA:AA:AA:AA:AA(ip-000-000-000-000.eu-west-2.compute.internal) 3584 IPs (05.5% of total)
AA:AA:AA:AA:AA:AA(ip-000-000-000-000.eu-west-2.compute.internal) 1024 IPs (01.6% of total)


What we saw was that when we exec'd into Weave pods on each half of the split, different nodes were reported as unreachable, as the two halves conflicted over which node actually owned an IP range.

The root cause of this event was our auto-scaling mechanism: when nodes leave the cluster their IP addresses are not always reclaimed. This has been resolved in the 2.6.X branch of Weave, but during this incident we were on 2.5.X.


With just over an hour left until the demonstration, we upgraded Weave and were still encountering issues. The stale IP allocations persisted even after removing the dissenting nodes from the cluster.

So we took another look at the pod specification, looking for anywhere that Weave could be storing state between pod restarts:


volumes:
  - name: weavedb
    hostPath:
      path: /var/lib/weave
      type: ""
  - name: cni-bin
    hostPath:
      path: /opt
      type: ""
  - name: cni-bin2
    hostPath:
      path: /home
      type: ""
  - name: cni-conf
    hostPath:
      path: /etc/kubernetes
      type: ""
  - name: dbus
    hostPath:
      path: /var/lib/dbus
      type: ""
  - name: lib-modules
    hostPath:
      path: /lib/modules
      type: ""
  - name: xtables-lock
    hostPath:
      path: /run/xtables.lock
      type: FileOrCreate

Weavedb is a BoltDB database that stores the shared quorum state, so it was where we turned our attention for an issue seemingly related to state.

The remediation we chose was to:

  • update Weave to 2.6.X;
  • create a daemonset that mounted the weavedb directory and deleted the database (a sketch of this follows the list);
  • remove the nodes that believed they owned the same IP address range.
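
A minimal sketch of the kind of clean-up daemonset we used; the image, names and database filename pattern are illustrative, and the essential parts are the hostPath mount of /var/lib/weave and the deletion of the BoltDB file on every node.


apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: weavedb-cleanup
  namespace: weave-namespace
spec:
  selector:
    matchLabels:
      app: weavedb-cleanup
  template:
    metadata:
      labels:
        app: weavedb-cleanup
    spec:
      containers:
        - name: cleanup
          image: busybox
          # Delete the weavedb BoltDB file, then idle so the pod stays Running
          command: ["sh", "-c", "rm -f /weavedb/*.db && while true; do sleep 3600; done"]
          volumeMounts:
            - name: weavedb
              mountPath: /weavedb
      volumes:
        - name: weavedb
          hostPath:
            path: /var/lib/weave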

Then we manually restarted all the WeaveNet pods. These actions restored full functionality to the cluster: removing the Weave state allowed the whole Weave cluster to establish quorum and function properly again, and removing the nodes that owned the same IP address ranges prevented the split from recurring.


If the quorum had been in agreement we would have been able to remove the offending peer and redistribute the held IP addresses.
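
For reference, reclaiming a departed peer's addresses looks something like this; the peer name placeholder is whichever peer status ipam reports as unreachable.


$ kubectl exec -n weave-namespace weave-net-mq2n7 -c weave -- sh -c './weave --local rmpeer <peer-name>'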


The demonstration was a success and we learnt a bit more about a key piece of our infrastructure.

Follow up actions

After verifying the integrity of the environment by running our product tests, we implemented two key changes.


We needed an early warning of this sort of failure, and we needed to ensure that when this early warning was tripped, we would be able to resolve the problem without causing an outage.


weave_ipam_unreachable_count is one of Weave's exported Prometheus metrics (OpenMetrics), available on :6784/metrics by default. For us, this figure should always be zero; if your use case relies on Weave's ability to route traffic around obstacles then you may wish to alert on a higher threshold instead. We also wanted to know when connections were being repeatedly terminated, so we alert on rate(weave_connection_terminations_total[5m]) > 0.1, which allows for some connection churn.
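
A sketch of PrometheusRule alerts along these lines; the alert names, durations and severities here are illustrative rather than our exact rules.


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: weave-net-alerts
  namespace: weave-namespace
spec:
  groups:
    - name: weave-net
      rules:
        - alert: WeaveIpamUnreachable
          expr: weave_ipam_unreachable_count > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Weave reports unreachable IPAM entries on {{ $labels.instance }}"
        - alert: WeaveConnectionChurn
          expr: rate(weave_connection_terminations_total[5m]) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Weave connections are repeatedly terminating on {{ $labels.instance }}"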


We then added information to our runbooks, ensuring that whoever received the call when this alert tripped would know what to do to resolve it. Now, we have entries for two scenarios:

  • A node leaving the cluster (through loss or scale-down) without its IP addresses being removed from the Weave state.
  • weave_ipam_unreachable_count > 0 but no nodes having left the cluster.

We have not seen a recurrence of this issue, because the latest version of Weave resolves the quorum instability arising from nodes exiting the cluster. Early identification and intervention maintains sanity in a busy Kubernetes cluster.

Useful resources

Get WeaveNet (there are multiple ways to deploy it; this is just the way that we chose): https://www.weave.works/docs/net/latest/kubernetes/kube-addon/#-installation

Learn about PrometheusRules: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ 

Detailed explanation on WeaveNet Fast DataPath: https://www.weave.works/docs/net/latest/concepts/fastdp-how-it-works/ 

Detailed information on WeaveNet’s network topology implementation: https://www.weave.works/docs/net/latest/concepts/network-topology/ 

Overview on Weave’s IP Address Manager (IPAM): https://www.weave.works/docs/net/latest/operational-guide/concepts/#ip-address-manager 

Weave on Github: https://github.com/weaveworks/weave 


Andrew Neudegg

29/05/2020
