July 10, 2023

Enterprise-scaled Self-Healing StackSets

What is the article about?

At OTTO, we faced several challenges operating AWS CloudFormation StackSets at scale. We must govern several hundred AWS accounts for our product teams while balancing the need for agility and control.

At this scale, operations can take a lot of time, because several recurring tasks have to be handled:

  • AWS accounts leave the AWS Organization, or teams nuke their AWS account.
  • StackSet instances drift, because not all resources required for compliance can be protected (SCP limitations).
  • Existing AWS accounts join the AWS Organization, and all mandatory StackSets need to be deployed to them.
  • Manual steps should be reduced to a minimum.

Furthermore, the service itself offers no feature that gives an overview of drifted instances or of the general health and compliance of your StackSets.

The cloud competence center at OTTO IT, also known as the Governance at Scale (GAS) team, developed a self-healing solution for StackSets that is integrated into the OTTO tooling ecosystem with Confluence and Microsoft Teams.

OTTO worked with globaldatanet to set up its Landing Zone. globaldatanet is an award-winning AWS Advanced Consulting Partner and longtime Cloud Solution Provider for OTTO, supporting the team in cloud security and GAS. Their focus on building cloud-native solutions using Serverless has helped over 100 companies within 5 years to develop and innovate products and services in the cloud.

In this post, we’ll demonstrate how to implement fully automated, enterprise-scaled self-healing for StackSets using AWS Step Functions, and how to create a dashboard that gives an overview of your StackSet health and compliance and reduces operational time.

The solution workflow includes the following steps:

  1. The tagging concept for StackSets
  2. Automatically create StackSets configuration in SSM Parameter Store
  3. Implementing StepFunction for StackSet Self-Healing

Let’s see how this works.

Prerequisites

The following prerequisites are necessary for following along with the contents of this post:

Solution overview

The following architecture shows the whole solution of the Self Healing StackSets:

Figure 1: Architecture of fully-automated Self Healing Solution with integration to Confluence

Tagging concept for StackSets

The solution requires a JSON file in the AWS Parameter Store. The easiest way to provide it is to generate it automatically from the StackSet configurations and the tags assigned to them. We go into more detail in the next section, "Automatically create StackSets configuration Parameter Store". Below, we describe which tags we introduced for our StackSets and what we need them for.

⚠️ AWS tags do not allow commas in values, so we use ":" as the divider for arrays.

| Key | Value | Result | Example |
|---|---|---|---|
| antidependson | StackSet Name | antidependson marks StackSets which collide with each other. | MYSTACKSET |
| dependson | [List of StackSet Names] | List of StackSets that need to be rolled out before deploying this StackSet (e.g. Enable Config before Activate Config Rules). NOTE: Please reduce to only one dependson-stackset for now. Form "chains" for multi-dependencies. | MY-STACKSET1:MYSTACKSET2 |
| mandatory | true or false | The StackSet instances must be present on all AWS accounts. | true |
| selfhealing | true or false | StackSet can be healed via Delete & Redeploy (exception e.g. IDP roles). Parameter Overwrites will be cached. | true |
| region | [Regions] | List of Regions in which the StackSet instances are to be deployed. | eu-west-1:eu-central-1:us-east-1 |
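
To make the tag convention concrete, here is a minimal Python sketch of how the tag values above could be turned into one configuration entry. The function name `parse_stackset_tags` and the shape of the resulting dictionary are assumptions for illustration, not the format OTTO actually stores; the input uses the Key/Value list shape in which CloudFormation APIs return tags.

```python
from typing import Dict, List


def parse_stackset_tags(tags: List[Dict[str, str]]) -> Dict[str, object]:
    """Turn StackSet tags into a config entry (hypothetical output shape)."""
    raw = {tag["Key"].lower(): tag["Value"] for tag in tags}

    def as_list(key: str) -> List[str]:
        # ":" is the list divider, because commas are not used in tag values.
        return [item for item in raw.get(key, "").split(":") if item]

    def as_bool(key: str) -> bool:
        # A missing tag is treated as "false" (an assumption of this sketch).
        return raw.get(key, "false").strip().lower() == "true"

    return {
        "dependson": as_list("dependson"),
        "antidependson": as_list("antidependson"),
        "mandatory": as_bool("mandatory"),
        "selfhealing": as_bool("selfhealing"),
        "regions": as_list("region"),
    }
```

Treating absent tags as `false` or as an empty list keeps untagged StackSets out of the self-healing run by default, which seems like the safer choice.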

Automatically create StackSets configuration Parameter Store

Automatically generating the StackSet configuration as a JSON document in the Parameter Store serves several purposes:

  • It removes the chore of maintaining the JSON document manually.
  • It ensures that the account vending machine knows what to deploy and in which order.
  • It tells the self-healing Step Function what setup to expect in the member accounts.

The Lambda function responsible for this task is invoked by an event rule every time a StackSet operation completes with the status "succeeded". This works because the tags of a StackSet are part of the StackSet itself rather than separate items describing it, so a change to the tags always results in a StackSet update operation.

From a computer science perspective, the Lambda function is quite interesting: the primary problem was to build an unweighted tree based on the "dependson" and "antidependson" tags and then compile it into an ordered one-dimensional list, which amounts to a topological sort of the dependencies.
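
The ordering step described above can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not the Lambda's actual code: the function name `deployment_order` and the input shape (StackSet name mapped to its "dependson" list) are assumptions, and the sketch only handles "dependson"; how "antidependson" collisions are resolved is not covered here.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def deployment_order(dependson: Dict[str, List[str]]) -> List[str]:
    """Return StackSet names so that each one comes after its dependencies."""
    sorter: TopologicalSorter = TopologicalSorter()
    for name in sorted(dependson):
        # add(node, *predecessors): predecessors must be deployed first.
        sorter.add(name, *dependson[name])
    try:
        return list(sorter.static_order())
    except CycleError as err:
        # args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"circular dependson chain: {err.args[1]}") from err
```

For example, a StackSet that activates Config rules and depends on one that enables Config would always be placed after it in the resulting list.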

Implementing StepFunction for StackSet Self-Healing

AWS Step Functions is a cloud service that enables you to coordinate the components of distributed applications and microservices using visual workflows. It allows you to build and automate the execution of complex processes and tasks across multiple AWS services, using a visual interface to define and execute your workflows. Since the self-healing solution needs a complex workflow, we decided to use Step Functions for this use case. Below, we explain the self-healing workflow.

Figure 2: StepFunction Workflow

Functionality

ƛ Serverless Functions

  • StackSetInitCleanupLambda: Searches for StackSet instances that belong to AWS accounts which are no longer part of the AWS Organization or which are suspended. Once identified, these instances are deleted from all associated StackSets.
  • MandatoryStackSetDeploymentLambda: Searches for missing StackSet instances (of StackSets tagged with mandatory = true) and deploys them.
  • StackSetDriftDetectionLambda: Triggers drift detection on all StackSets.
  • TriggerDriftStatusLambda: Checks whether drift detection has completed on all StackSets.
  • SearchStackSetInstanceToHealLambda: Searches for drifted StackSet instances in StackSets tagged with selfhealing = true.
  • StackSetCleanupLambda: Removes unhealthy StackSet instances and redeploys them. Parameter overrides are cached so that the newly deployed instance has the same settings as before.
  • StatusPrepareHTMLLambda: Prepares the HTML dashboard for Confluence and a JSON log file of the current StackSet health state.
  • TeamsNotificationLambda: Sends a Teams notification with a summary to the GAS team after each execution.
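
The core decisions of the first two functions in the list above can be sketched as pure Python over data that has already been collected. The function names, the `(stack_set_name, account_id, region)` tuple shape, and the idea of passing in the set of active account IDs are assumptions of this sketch; the real Lambdas would first gather this data from CloudFormation and AWS Organizations.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (stack_set_name, account_id, region) - a hypothetical shape for this sketch
Instance = Tuple[str, str, str]


def orphaned_instances(instances: Iterable[Instance],
                       active_accounts: Set[str]) -> List[Instance]:
    """Instances whose account left the organization or is suspended."""
    return [inst for inst in instances if inst[1] not in active_accounts]


def missing_mandatory_instances(instances: Iterable[Instance],
                                active_accounts: Set[str],
                                mandatory: Dict[str, List[str]]) -> List[Instance]:
    """Instances that mandatory StackSets should have but do not.

    `mandatory` maps each mandatory StackSet name to its target regions.
    """
    existing = set(instances)
    return [
        (name, account, region)
        for name, regions in sorted(mandatory.items())
        for account in sorted(active_accounts)
        for region in regions
        if (name, account, region) not in existing
    ]
```

Keeping these checks free of AWS calls makes them easy to unit-test separately from the code that talks to the APIs.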

?! Decisions

  • InitCleanup Complete: Checks whether all unnecessary instances have been removed. If not, the Step Function triggers the StackSetInitCleanupLambda function again.
  • MandatoryStackSetDeployment Complete: Checks whether all mandatory instances have been deployed. If not, the Step Function triggers the MandatoryStackSetDeploymentLambda function again.
  • StackSetDriftDetection Complete: Waits until StackSet drift detection has finished on all StackSets.
  • Healing Complete: Checks whether all unhealthy instances have been healed. If not, the Step Function invokes StackSetCleanupLambda again.
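
Each of the decisions above follows the same "run, check, repeat" pattern. A minimal sketch of how such a loop could look in Amazon States Language is shown below, built as a Python dictionary. The state names, the Lambda ARN, the `$.complete` output field, and the Wait state between retries are all assumptions for illustration and are not taken from the actual state machine.

```python
import json


def retry_loop_states(task: str, lambda_arn: str, next_state: str,
                      wait_seconds: int = 30) -> dict:
    """Task -> Choice -> (next step | Wait -> Task again)."""
    choice, wait = f"{task}Complete", f"{task}Wait"
    return {
        task: {"Type": "Task", "Resource": lambda_arn, "Next": choice},
        choice: {
            "Type": "Choice",
            # "$.complete" is an assumed boolean in the Lambda's output.
            "Choices": [
                {"Variable": "$.complete", "BooleanEquals": True, "Next": next_state}
            ],
            "Default": wait,
        },
        # Pausing before the retry is an assumption of this sketch.
        wait: {"Type": "Wait", "Seconds": wait_seconds, "Next": task},
    }


states = retry_loop_states(
    "InitCleanup",
    "arn:aws:lambda:eu-central-1:111111111111:function:StackSetInitCleanupLambda",
    "MandatoryStackSetDeployment",
)
print(json.dumps(states, indent=2))
```

The same helper could generate the loops for the deployment and healing decisions by changing the task name and the follow-up state.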


Limitations

While developing the solution, we faced several limitations. Here are our findings and how we addressed them.

🚨 StackSets instance operations: The maximum number of stack instances, across all StackSets, that you can run operations on at the same time in each Region per administrator account is limited to 10,000.

✅ We implemented a counter that tracks the StackSet operations currently in progress. In addition, we catch the exception from CloudFormation and wait a few seconds before trying the operation again.
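
The "catch and wait a few seconds" part can be sketched as a small generic helper. The helper name `call_with_retry`, the number of attempts, and the delay are assumptions; which exception class to pass in depends on the error CloudFormation raised, which is not named here.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T],
                    retry_on: Tuple[Type[BaseException], ...],
                    attempts: int = 5,
                    delay_seconds: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, waiting and retrying when a listed exception occurs."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the helper testable without real waiting; inside a Step Function, the waiting could alternatively be moved into a Wait state so the Lambda does not idle.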

🚨 Parameter Overwrites Caching: When you remove a drifted StackSet instance that has parameter overwrites, you lose the individual parameters of that instance.

✅ Before deleting the drifted StackSet instance, we cache the parameter overwrites, and after successful deletion, we re-deploy the StackSet instance with the cached parameter overwrites.
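
A minimal sketch of this cache, delete, and redeploy sequence is shown below. It assumes a boto3 CloudFormation client is passed in as `cfn`, that the StackSet uses self-managed permissions (so `Accounts` can be passed directly), and that `wait_for_operation` is a hypothetical helper that blocks until the given StackSet operation has finished. The function name `heal_instance` is also an assumption.

```python
from typing import Any, Callable


def heal_instance(cfn: Any, stack_set_name: str, account: str, region: str,
                  wait_for_operation: Callable[[str, str], None]) -> str:
    """Delete and redeploy one stack instance, keeping its parameter overrides."""
    described = cfn.describe_stack_instance(
        StackSetName=stack_set_name,
        StackInstanceAccount=account,
        StackInstanceRegion=region,
    )
    # Cache the overrides before the instance disappears.
    overrides = described["StackInstance"].get("ParameterOverrides", [])

    deleted = cfn.delete_stack_instances(
        StackSetName=stack_set_name,
        Accounts=[account],
        Regions=[region],
        RetainStacks=False,
    )
    # Deletion is asynchronous, so wait before redeploying.
    wait_for_operation(stack_set_name, deleted["OperationId"])

    created = cfn.create_stack_instances(
        StackSetName=stack_set_name,
        Accounts=[account],
        Regions=[region],
        ParameterOverrides=overrides,
    )
    return created["OperationId"]
```

In a Step Functions workflow, the waiting would more likely be handled by separate states than by a blocking helper; the sketch only shows the order of the calls.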

🚨 AWS Step Functions Payload size: AWS Step Functions supports payload sizes up to 256 KB. For our solution, we need to pass larger payloads between states, especially if we want to pass our log or the parameter overwrites per StackSet.

✅ We store our states in an S3 bucket to pass the state. At the end of the execution, we delete the state from S3 so as not to influence the next Step Function execution with the wrong state.
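
One way to implement this is to pass only a small pointer between states and keep the full state in S3. The sketch below assumes a boto3 S3 client is passed in as `s3`; the function names, the key layout `state/<execution_id>.json`, and the pointer shape are assumptions for illustration.

```python
import json
from typing import Any, Dict


def save_state(s3: Any, bucket: str, execution_id: str, state: dict) -> Dict[str, str]:
    """Write the full state to S3 and return a small pointer to pass on."""
    key = f"state/{execution_id}.json"  # hypothetical key layout
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(state).encode("utf-8"))
    return {"bucket": bucket, "key": key}


def load_state(s3: Any, pointer: Dict[str, str]) -> dict:
    """Read the state back in the next step."""
    body = s3.get_object(Bucket=pointer["bucket"], Key=pointer["key"])["Body"].read()
    return json.loads(body)


def delete_state(s3: Any, pointer: Dict[str, str]) -> None:
    """Clean up at the end of the execution so the next run starts fresh."""
    s3.delete_object(Bucket=pointer["bucket"], Key=pointer["key"])
```

Using the execution ID in the key would additionally keep parallel or overlapping executions from reading each other's state.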

Documentation

After each execution of the StackSet Health StepFunction, we aim to notify our GAS team about the actions taken during the previous run. Therefore, we have implemented a Teams notification that includes a status update, a link to the generated dashboard, and a link to the log file.

The following screenshot illustrates an example of a Teams notification. It provides a summary report and directs you to the dashboard and log file for further details.

Figure 3: Status updates via Teams

Dashboard

Our StackSet Health Dashboard is a simple HTML file generated by a Lambda function, stored in S3, and distributed via CloudFront. You can integrate this dashboard into your Confluence or any other internal wiki. The dashboard is secured via a CloudFront function. In addition, you can add a firewall to restrict access to a specific CIDR range or geographic region and prevent third-party access. The screenshot below shows an example of the overall StackSet health status information for an entire AWS organization.

Figure 4: Dashboard
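
As a rough illustration of the "simple HTML file generated by a Lambda function", the following sketch renders a status summary as an HTML table. The function name `render_dashboard`, the input shape (StackSet name mapped to counts), and the column choice are assumptions; the real dashboard contains more information than this.

```python
from html import escape
from typing import Dict


def render_dashboard(status: Dict[str, Dict[str, int]]) -> str:
    """Render per-StackSet instance counts as a minimal HTML page."""
    rows = "".join(
        "<tr>"
        f"<td>{escape(name)}</td>"
        f"<td>{int(counts.get('in_sync', 0))}</td>"
        f"<td>{int(counts.get('drifted', 0))}</td>"
        "</tr>"
        for name, counts in sorted(status.items())
    )
    return (
        "<html><body><h1>StackSet Health</h1>"
        "<table><tr><th>StackSet</th><th>In sync</th><th>Drifted</th></tr>"
        f"{rows}</table></body></html>"
    )
```

The resulting string could then be uploaded to the S3 bucket behind CloudFront; escaping the StackSet names avoids broken markup if a name contains special characters.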

Conclusion

In this post, we demonstrated a solution to automatically heal AWS CloudFormation StackSets at scale. By implementing this solution in our organization, we were able to reduce manual StackSet healing efforts by 4 hours per week, improve the overall reliability of our StackSets, increase compliance in our organization, and gain daily visibility into all StackSet instances using the dashboards. In summary, the self-healing CloudFormation StackSets solution combines automation, monitoring, and self-healing capabilities to provide a robust and resilient system for StackSets.

Written by

Alexander Mannsfeld
Cloud Solution Architect
David Krohn
AWS Solution Architect @globaldatanet
