How we made deployments safer at SEKOIA.IO

SEKOIA.IO processes almost a billion client events per day. That’s tens of thousands of log entries per second. Every single event has to be analyzed quickly and reliably by our detection pipeline, so that cyber threats are spotted and handled as soon as possible. You can imagine that breaking this pipeline, even for a few seconds, is out of the question.

Safely deploying changes to this high-throughput, low-latency workflow is a major challenge that we continuously try to solve.

Of course, we embrace the microservice architecture: production is handled by a large set of loosely coupled components. This approach allows us to easily scale and update parts of our infrastructure independently, but it also means that we have to handle a large number of deployments, frequently.

All our microservices run on Kubernetes, which is great for auto-healing and progressive rollouts. Kubernetes is smart, but Kubernetes is not that smart. Even combined with all the CI/CD and staging in the world, we can still deploy broken services. This has multiple implications:

  • Deployments need to be triggered and supervised by a real person (a Site Reliability Engineering team, in our case), to make sure that everything goes smoothly.

These issues are harmful to productivity and velocity, and can put stress on those responsible for deployments.

Even with these measures, we sometimes faced issues when deploying updates: services that worked in the test environment broke in production once put under pressure, or simply failed because of a configuration mismatch between environments.

So how do we avoid that? We needed a way to safely roll out updates to microservices, without fearing that a deployment could trigger an incident.

Fortunately, automation exists and can be applied to just about anything.

Canary deployments 🐦

Canary birds were used in coal mines to alert workers of carbon monoxide leaks, the “silent killer”. As canaries died of carbon monoxide way before mine workers did, they were, at the time, the only way of detecting a dangerous gas leak.

Nowadays, the term is used to describe any test subject, especially an inadvertent or unwilling one.

In the Ops field, canary deployments are exactly this: by transparently deploying the latest version of a service to a small percentage of users, you can see how the new version behaves without breaking everything if the release turns out to be buggy. If that version performs well, you gradually keep rolling it out to more users, and if it keeps performing well, you get a pretty strong indication that it is suitable to replace the old one.

As Kubernetes provides an API to interact with deployments, it is well suited for the automation of this kind of operation.
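To give a concrete idea of the primitives involved, here is a minimal sketch using the official kubernetes Python client; the deployment name and namespace are placeholders, not our actual services. It simply reads a deployment and patches its replica count, which is the kind of operation this sort of automation is built on.

```python
# Minimal sketch using the official `kubernetes` Python client.
# "worker" and "prod" are placeholder names, not our actual services.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

# Read the current state of a deployment.
deployment = apps.read_namespaced_deployment(name="worker", namespace="prod")
print(deployment.spec.replicas, deployment.spec.template.spec.containers[0].image)

# Patch it, for instance to scale it down to zero replicas.
apps.patch_namespaced_deployment(
    name="worker",
    namespace="prod",
    body={"spec": {"replicas": 0}},
)
```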

Existing solutions

As one can expect, quite a few solutions to this problem already exist. However, we had some hard requirements before adopting any of them:

  • We didn’t want to have to rewrite any of our Kubernetes stacks, because we have around 80 microservices.

  • It had to work for our Kafka-based workers, which are not HTTP services.

With these requirements in mind, we started hunting for a solution, without much success. Solutions like Gloo or Flagger are very much HTTP-based, and Argo is a whole ecosystem that requires changes to the deployment stacks.

We could not identify a simple solution for Kafka-based workers. After a few days of reading documentation and trying out stuff on our test environment, it became clear that none of the existing solutions would fit our needs.

Aviary 🦜

As any frustrated engineer would do, we decided to create our own solution, which we named Aviary (because it handles canaries all day long!).

What we were trying to achieve could be implemented quite simply:

  • On init, Aviary duplicates the service’s deployment by adding the suffix -primary to its name, and scales the original deployment to 0.

  • When a new version is pushed to the original deployment, Aviary creates a -canary deployment running that version alongside -primary, and progressively scales it up while monitoring its instances.

Two outcomes are then possible:

  • If we reach a break-point where enough -canary instances have been successful over a period of time, the -canary deployment is promoted to -primary, and the original deployment stays untouched.

  • If too many -canary instances fail before that break-point, the -canary deployment is removed and -primary keeps running the previous version: an automatic rollback.

This logic enables us to have progressive canary rollouts and automatic rollbacks in production while making absolutely no change to our existing codebase. Our CI/CD pipeline also stays untouched, as it interacts with the same object as before: the original deployment. Aviary handles the rest.
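As an illustration of the init step described above, here is a simplified sketch; it is not the actual Aviary code, and the service name and namespace are placeholders.

```python
# Simplified sketch of the "init" step described above; this is not the actual
# Aviary implementation, and "worker" / "prod" are placeholder names.
import copy

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()


def init_primary(name: str, namespace: str) -> None:
    original = apps.read_namespaced_deployment(name=name, namespace=namespace)

    # Duplicate the deployment under the -primary suffix.
    primary = copy.deepcopy(original)
    primary.metadata = client.V1ObjectMeta(
        name=f"{name}-primary",
        namespace=namespace,
        labels=original.metadata.labels,
    )
    primary.status = None  # status is managed by the cluster, not sent back by us
    apps.create_namespaced_deployment(namespace=namespace, body=primary)

    # Scale the original deployment to 0: it remains the object the CI/CD
    # pipeline keeps updating, but it no longer runs any pod itself.
    apps.patch_namespaced_deployment(
        name=name, namespace=namespace, body={"spec": {"replicas": 0}}
    )


init_primary("worker", "prod")
```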

The configuration stays minimal, with the ability to define, per service:

  • Break-point percentage (the percentage of successful canary instances after which the new version is fully deployed).
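In practice, the promotion decision boils down to comparing the ratio of healthy canary instances against that break-point over the analysis period. The sketch below illustrates the idea; the names and fields are hypothetical, not Aviary’s actual configuration schema.

```python
# Hypothetical sketch of the break-point decision; the names and fields are
# illustrative, not Aviary's actual configuration schema.
from dataclasses import dataclass


@dataclass
class CanaryConfig:
    break_point: float = 0.9  # promote once 90% of canary instances are healthy


def should_promote(healthy_canaries: int, total_canaries: int, cfg: CanaryConfig) -> bool:
    """Return True when enough canary instances succeeded to promote the release."""
    if total_canaries == 0:
        return False
    return healthy_canaries / total_canaries >= cfg.break_point


# Example: 9 healthy canary pods out of 10 observed over the analysis period.
assert should_promote(9, 10, CanaryConfig(break_point=0.9))
```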

Aviary is also highly interactive: an operator can cancel an in-progress deployment, or bypass Aviary entirely for the next deployment of a service (to push hotfixes).

Conclusion

Our solution has been running in production for a few months now, and it has already prevented numerous bad rollouts and regressions. The absence of codebase changes and its very small footprint in the Kubernetes cluster make it a very satisfying solution.

It doesn’t change anything about normal SRE operations and automation either, as it transparently handles scaling deployments up and down, as well as service restarts.

We are very happy to have made this step forward, as developers are one step closer to being able to deploy their changes themselves, without having to worry about undetected issues that would arise in production.

Of course, Aviary was made in the most business-agnostic way possible and is freely available on our GitHub repository.

Chat with our team!

Would you like to know more about our solutions? Do you want to discover our XDR and CTI products? Do you have a cyber security project in your organization? Make an appointment and meet us!
