4 February 2021
For our customers, otto.de is a large webshop with a huge range of products. Behind the seamless façade, numerous systems are involved. This highly complex landscape is now managed by 30 teams; each team is responsible for a manageable number of microservices, and each microservice provides exactly one function for otto.de. Teams can deploy and redeploy their microservices on their own initiative and largely independently, and keep them supplied with updated software (read more).
To ensure that deployments run smoothly, each microservice has its own build pipeline that automatically tests and rolls out the software. This process is called CI/CD and is an integral part of otto.de. Every microservice therefore always comes with a system that tests and updates it.
In the first few years after the rollout of Lhotse, the successful OTTO in-house development (read more), Jenkins was used by all teams. That changed over the following years and the build-systems landscape grew more diverse with GitLab CI, GoCD, and LambdaCD.
My team is responsible for static resources on otto.de and uses Node.js for back-end systems. Jenkins was long the tool of choice for us too.
When OTTO opened up more towards the cloud in 2018 and GitLab was replaced by GitHub, our team decided to switch from GitLab CI to CircleCI. In retrospect the migration was not too difficult, but it took a long time. Nevertheless, we were very happy to use a build system ‘as a service’. The combination of GitHub and CircleCI appeared very suitable to us. Many teams followed our lead and soon CircleCI became almost overcrowded. Because our resources there were contractually limited, we sometimes had to put up with longer queues. Bottlenecks like these are critical in key live deployments.
In April 2020, GitHub announced its own CI/CD platform, and thanks to the enterprise contract an alternative to CircleCI became available to us pretty much overnight. Since we already use the GitHub Package Registry for Docker images and npm packages, we definitely wanted to try out the new GitHub Actions. Our first impression, however, was rather sobering – Actions seemed unfinished and alien. In addition, a 1:1 migration was impossible: the concept of manual gates that we use for deployments in CircleCI does not exist there. Until now, we have found it reassuring that pipelines do not update software on live systems without our intervention, even after successful checks – some changes need to be very closely monitored. A manual gate lets us deploy code only when we have the capacity for monitoring and a rollback.
Since we had put a lot of work into our CircleCI workflows over the last two years, we had no desire to start again from scratch.
After an initial proof of concept, our opinion changed for the better and we decided to make the move.
We have migrated a total of 64 CircleCI pipelines to GitHub over the last few weeks and can now proudly give GitHub Actions a warm welcome!
Before we launched the move we identified the similarities between our pipelines.
All our pipelines:
Besides this we also have three types of pipelines:
Pipelines with an impact on the customer also report their deployment to a central monitoring function.
Besides this, we also have the following rules:
We were able to create appropriate GitHub workflows for all requirements. Only the manual gates could not be reproduced. Instead, we use the release trigger for a deployment; a release can be created via the GitHub GUI or the GitHub API. We use prereleases for deployments that are only to be rolled out as far as the develop system.
name: Deploy Terraform
'on':
  release:
    types:
      - prereleased
      - released
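To illustrate how such a release could be created through the GitHub API instead of the GUI, here is a minimal sketch in JavaScript. It only builds the request for GitHub's REST endpoint POST /repos/{owner}/{repo}/releases and does not send it. The helper names buildReleaseRequest and finalStage, as well as the owner, repository and tag values, are hypothetical and not taken from our pipelines. The mapping of release actions to stages is likewise only our reading of the rule described above.

```javascript
// Hypothetical helper: describes the HTTP request that creates a GitHub release.
// A release created with prerelease=true triggers the "prereleased" activity type,
// a regular release triggers "released".
function buildReleaseRequest({ owner, repo, tag, prerelease, token }) {
  return {
    method: 'POST',
    url: `https://api.github.com/repos/${owner}/${repo}/releases`,
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ tag_name: tag, prerelease: Boolean(prerelease) }),
  };
}

// Assumed mapping (not shown in the article): how far a rollout goes
// for each release activity type.
function finalStage(action) {
  if (action === 'prereleased') return 'develop';
  if (action === 'released') return 'live';
  throw new Error(`Unsupported release action: ${action}`);
}

// Example with made-up values; GITHUB_TOKEN is read from the environment.
const request = buildReleaseRequest({
  owner: 'example-org',
  repo: 'example-service',
  tag: 'v1.2.3',
  prerelease: true,
  token: process.env.GITHUB_TOKEN,
});
console.log(request.method, request.url, '->', finalStage('prereleased'));
```

With Node.js 18 or later, the request object could be passed to the built-in fetch by splitting off the url property and using the rest as the options argument.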
Pipelines, especially if they are very similar, are unfortunately prone to code duplication. For CircleCI we use Orbs and YAML anchors to make sure we stay DRY – ‘Don’t Repeat Yourself’.
For example, a deployment to our develop system differs from a live deployment in exactly one place – the target environment (live instead of develop). All other steps are completely identical. In the YAML files for CircleCI we were able to create a single source of truth via anchors and aliases, so we only have to make changes in one place.
Because some of our pipelines deploy into up to six different AWS accounts, this let us cut out a great deal of identical code and keep a clear overview of the workflow. Orbs helped us reuse code across pipelines.
Regrettably, there are no anchors in GitHub Actions, and the community sorely misses this feature. GitHub is aware of this pain point; so far, however, there has only been an announcement that they are working on it.
But since we didn't want to start duplicating code at all, we wrote an application called Gitty that updates workflows in all repositories via the GitHub API.
In each repo we have a config that Gitty reads via the GitHub API. Gitty generates the YAML workflows from this config and checks them back in with a commit. Our intensive initial effort quickly paid off.
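Gitty's code is not shown here, so the following is only a minimal sketch of what such a generation step might look like. The config shape (name, stages), the function name renderWorkflow, the TARGET_ENV variable and the deploy command npm run deploy are all assumptions made for illustration. The point is that one template produces every job, so the develop and live jobs differ only in the target environment.

```javascript
// Hypothetical per-repository config; the real Gitty config format is not shown in this article.
const exampleConfig = {
  name: 'Deploy Terraform',
  stages: ['develop', 'live'],
};

// Render one job per stage from a single template.
// JSON.stringify is used for the name because a JSON string is also a valid YAML scalar.
function renderWorkflow(config) {
  const lines = [
    `name: ${JSON.stringify(config.name)}`,
    "'on':",
    '  release:',
    '    types:',
    '      - prereleased',
    '      - released',
    'jobs:',
  ];
  for (const stage of config.stages) {
    lines.push(
      `  deploy-${stage}:`,
      '    runs-on: ubuntu-latest',
      '    env:',
      `      TARGET_ENV: ${stage}`, // the only line that differs between stages
      '    steps:',
      '      - uses: actions/checkout@v4',
      '      - run: npm run deploy', // assumed deploy script
    );
  }
  return lines.join('\n') + '\n';
}

console.log(renderWorkflow(exampleConfig));
```

A generated file like this could then be written to .github/workflows/ in each repository, for example through the repository contents endpoint of the GitHub REST API.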
When it was recently announced that set-env would soon be deactivated, we needed to adjust code in just one place to update all repositories (read more).
Pipelines outside AWS need credentials to create resources within AWS. With CircleCI we had already distributed the AWS credentials via Lambda and the API and rotated them regularly. This is also possible with GitHub. But beware! With the GitHub API, the abuse detection mechanism strikes quickly. An unbounded Promise.all() against all repos triggers a very prompt 403.
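One way to avoid firing all requests at once is to cap the number of calls that are in flight at the same time. The following is a minimal sketch of such a limiter, not the code we actually run; the name mapWithLimit and the limit of 2 in the example are arbitrary choices, and the simulated task merely stands in for a real GitHub API call.

```javascript
// Run `task` for every item, but never more than `limit` calls at the same time.
// Results are returned in the same order as `items`.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next item that has not been started yet

  // Each worker pulls the next free item until none are left.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Example with made-up repository names and a simulated API call.
const repos = ['repo-a', 'repo-b', 'repo-c', 'repo-d'];
mapWithLimit(repos, 2, async (repo) => {
  await new Promise((resolve) => setTimeout(resolve, 10)); // stands in for a request
  return `${repo}: updated`;
}).then((lines) => console.log(lines.join('\n')));
```

Setting the limit to 1 processes the repositories strictly one after another, which is the most conservative setting if the 403 responses persist.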
Unfortunately, a CI system can always crash. This has rarely been the case with CircleCI in recent years, but when it has actually occurred it has always been at the wrong time. We have therefore designed all pipelines so that they can also be executed locally.
That's why we use an AWS role in the pipeline that can also be ‘assumed’ by our local AWS users. This way, a local deployment does not fail because of a missing IAM policy. Fortunately, there are just a few core requirements for a development computer: Node.js, Terraform, git, a GITHUB_TOKEN, and AWS credentials are enough to build, test and deploy all services locally. We definitely want to maintain this independence in all situations. While CI systems handle a lot of the hard work, they should never be the only systems available to roll out code in an emergency.
We are now very happy with GitHub Actions and have hardly had to make any further adjustments over the last few weeks. Jobs run quickly and reliably. There are no waiting times for us at the moment, although things may soon get tight here too.
Changing the CI system is always a good opportunity for a good spring clean, and we have taken full advantage of this. The close dovetailing with GitHub has proven to be a great advantage. We are really happy to have dared to change and do not want to do without this solution in future!