AWS CodeDeploy, Auto Scaling Groups, and debugging recursive deployment limbo

AWS CodeDeploy is a beautifully designed service. Here is a brief overview of the architecture we deployed:

CodeDeploy has Applications, and each Application has Deployment Groups. A Deployment Group can be pulled into a release cycle managed by AWS CodePipeline.

While configuring a Deployment Group, we can choose to release to an Auto Scaling Group (ASG). When an ASG is selected, CodeDeploy creates Lifecycle Hooks on that Auto Scaling Group.

The Lifecycle Hooks integrate the Auto Scaling Group with CodeDeploy, so that a deployment is triggered whenever a scale-out activity launches a new instance.

Lifecycle Hooks under the Auto Scaling Group
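
The same hooks can be listed outside the Console. The sketch below is an illustration, not part of our original setup: it assumes boto3 is installed and credentials are configured, and the helper names `launch_hooks` and `fetch_launch_hooks` are hypothetical. It keeps only the hooks that fire on scale-out (instance launch).

```python
from typing import Any, Dict, List

# Lifecycle transition that Auto Scaling uses for scale-out (instance launch).
LAUNCH_TRANSITION = "autoscaling:EC2_INSTANCE_LAUNCHING"


def launch_hooks(response: Dict[str, Any]) -> List[str]:
    """Return the names of hooks that fire when an instance is launching.

    `response` has the shape returned by the Auto Scaling
    DescribeLifecycleHooks API.
    """
    return [
        hook["LifecycleHookName"]
        for hook in response.get("LifecycleHooks", [])
        if hook.get("LifecycleTransition") == LAUNCH_TRANSITION
    ]


def fetch_launch_hooks(asg_name: str) -> List[str]:
    """Hypothetical wrapper: query AWS for the hooks of one ASG."""
    import boto3  # imported lazily so launch_hooks() works without boto3

    client = boto3.client("autoscaling")
    response = client.describe_lifecycle_hooks(AutoScalingGroupName=asg_name)
    return launch_hooks(response)
```

For the example in this article you would call `fetch_launch_hooks("asg-prod-api")`. A hook created by CodeDeploy should show up in that list; an empty list would suggest the Deployment Group is not linked to the ASG.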

Below is a sample configuration that releases to an Auto Scaling Group, say “asg-prod-api”. The group is linked to an Application Load Balancer, which routes requests to the instances through a Target Group, say “tg-prod-api”, and the deployment configuration is One at a Time.

Life-cycle Hook

One point to notice: once you have selected the Auto Scaling Group, the Console displays “Matching instances” with a “Click here for details” link, which pops up the list of instances matching the selection at that moment. That is, the instances currently deployed in that Auto Scaling Group.

The integration is seamless and works like poetry. It is by far one of the most seamless pieces of infrastructure management software written, providing a rather simple interface to a relatively complex release cycle.

Many combinations of settings are possible, so there are many ways for a release to go wrong. To debug a deployment that is not working, start with a few basic sanity checks. Check that:

a) The designated instance is passing the Health Check in the Target Group.
b) The CodeDeploy Agent is installed on the instance and its service is running.
c) The instance has the designated CodeDeploy IAM Role attached, granting it the necessary permissions.
d) The instance is running the latest version of the CodeDeploy Agent.
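
Check (a) can be scripted. The following is a minimal sketch, assuming boto3 and the Elastic Load Balancing v2 DescribeTargetHealth API; the helper names and the idea of passing the Target Group ARN in are our own illustration.

```python
from typing import Any, Dict


def unhealthy_targets(response: Dict[str, Any]) -> Dict[str, str]:
    """Map instance ID -> health state for every target that is not healthy.

    `response` has the shape returned by the elbv2 DescribeTargetHealth API.
    """
    result = {}
    for description in response.get("TargetHealthDescriptions", []):
        state = description["TargetHealth"]["State"]
        if state != "healthy":
            result[description["Target"]["Id"]] = state
    return result


def fetch_unhealthy_targets(target_group_arn: str) -> Dict[str, str]:
    """Hypothetical wrapper: ask AWS for the health of one Target Group."""
    import boto3  # imported lazily so unhealthy_targets() works without boto3

    client = boto3.client("elbv2")
    response = client.describe_target_health(TargetGroupArn=target_group_arn)
    return unhealthy_targets(response)
```

An empty result means every registered target is passing its Health Check. Checks (b) to (d) still have to be verified on the instance itself.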

Generally, if the above checks pass, a deployment should run. If it doesn’t, log into the instance and check the CodeDeploy Agent logs. Remember to search the logs for your instance ID, in case you are running a golden AMI that carries logs from previous runs.
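
Filtering the log by instance ID can be as simple as the sketch below. It assumes a Linux instance with the agent log at its usual location, `/var/log/aws/codedeploy-agent/codedeploy-agent.log`; adjust the path if your AMI differs. The function names are hypothetical.

```python
from typing import Iterable, List

# Usual agent log location on Linux; treat it as an assumption for your AMI.
AGENT_LOG = "/var/log/aws/codedeploy-agent/codedeploy-agent.log"


def lines_for_instance(lines: Iterable[str], instance_id: str) -> List[str]:
    """Keep only the log lines that mention the given instance ID."""
    return [line.rstrip("\n") for line in lines if instance_id in line]


def read_agent_log(instance_id: str, path: str = AGENT_LOG) -> List[str]:
    """Read the agent log from disk and filter it for one instance ID."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return lines_for_instance(handle, instance_id)
```

On a golden AMI this drops the entries left behind by the instance the image was built from, so only the current instance's activity remains.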

Note: To install the CodeDeploy Agent using User Data, you could refer to windows | linux.

If you still cannot find the issue with the deployment, look at the state of your instances. Recently, while working with some deployments, we came across an interesting issue, which prompted this article.

Atypical Scenario:

There are times when you roll out a new Launch Configuration. You would typically detach an instance and let a new instance kick in and take over its traffic. Once the earlier instance is detached, you terminate it manually if you no longer need it.

Say you detach an instance and, while the replacement is still being spun up, you terminate the instance you detached, but the detach action has not yet completed. The instance gets terminated without the detach having removed its instance ID from the Auto Scaling Group. You could witness two interesting events, depending on when you terminate the instance:

a) The Lifecycle Hook stalls the scale-out process, and the Activity tab of the Auto Scaling Group shows something like: “Reason: Instance failed to complete users Lifecycle Action: Lifecycle Action with token xxx was abandoned.”

b) A new machine does spin up, but CodeDeploy doesn’t respond, although the instance is running fine.
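
Event (a) can also be spotted without opening the Activity tab. The sketch below is a hedged illustration: it assumes boto3 and the Auto Scaling DescribeScalingActivities API, and because we are not certain which response field carries the message, it searches the status message, cause, and description together. The helper names are hypothetical.

```python
from typing import Any, Dict, List

# Phrase taken from the message quoted above.
MARKER = "was abandoned"


def abandoned_activities(response: Dict[str, Any]) -> List[str]:
    """Return the IDs of scaling activities that mention an abandoned action.

    `response` has the shape returned by DescribeScalingActivities.
    """
    hits = []
    for activity in response.get("Activities", []):
        text = " ".join(
            str(activity.get(field, ""))
            for field in ("StatusMessage", "Cause", "Description")
        )
        if MARKER in text.lower():
            hits.append(activity["ActivityId"])
    return hits


def fetch_abandoned_activities(asg_name: str) -> List[str]:
    """Hypothetical wrapper: scan the recent activities of one ASG."""
    import boto3  # imported lazily so abandoned_activities() works without it

    client = boto3.client("autoscaling")
    response = client.describe_scaling_activities(AutoScalingGroupName=asg_name)
    return abandoned_activities(response)
```

Only the first page of activities is inspected here, which is usually enough for a scale-out that stalled recently.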

To get past issue (a), you stop the running deployment. Stopping the deployment sends a kill action to the lifecycle hook, which releases the terminated machine and spins up a new one.

Now issue (b) occurs. The new machine is perfectly fine, but the deployment sits there without a response. You log into the machine and find that the CodeDeploy Agent is running fine. You check the CodeDeploy Agent’s logs and find nothing. The instance’s IAM Role is in order and everything seems fine. However, CodeDeploy is just waiting, and no deployment events have been fired.

If this happens, go back to CodeDeploy > Applications > Deployment Group, select the active Deployment Group, and hit Edit to view the Matching Instances.

There is a possibility that the instance you terminated still shows up as one of the matching instances that the current deployment is trying to reach.

If this is the case, you’ve run into a classic recursive loop. We suspect that the instance you tried to detach never completed the detach action and is still referenced by the Auto Scaling Group.
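
One way to test that suspicion is to compare the instances the Auto Scaling Group still lists against their actual EC2 state. This is a sketch under assumptions: boto3 is available, the group is small enough that pagination can be ignored, and the stale reference is visible through the Auto Scaling API at all, which we have not confirmed. The helper names are hypothetical.

```python
from typing import Any, Dict, List

# EC2 states that mean the instance is gone or on its way out.
GONE_STATES = {"shutting-down", "terminated"}


def stale_members(asg_response: Dict[str, Any],
                  instance_states: Dict[str, str]) -> Dict[str, str]:
    """Return ASG members whose EC2 instance is terminated or unknown.

    `asg_response` has the shape returned by DescribeAutoScalingGroups;
    `instance_states` maps instance ID -> EC2 state name.
    """
    stale = {}
    for group in asg_response.get("AutoScalingGroups", []):
        for member in group.get("Instances", []):
            instance_id = member["InstanceId"]
            state = instance_states.get(instance_id, "missing")
            if state in GONE_STATES or state == "missing":
                stale[instance_id] = state
    return stale


def fetch_stale_members(asg_name: str) -> Dict[str, str]:
    """Hypothetical wrapper: look up one ASG and the state of its members."""
    import boto3  # imported lazily so stale_members() works without boto3

    asg = boto3.client("autoscaling").describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    ids: List[str] = [
        member["InstanceId"]
        for group in asg["AutoScalingGroups"]
        for member in group["Instances"]
    ]
    states: Dict[str, str] = {}
    if ids:
        # A filter (rather than InstanceIds=) avoids an error for unknown IDs.
        ec2 = boto3.client("ec2").describe_instances(
            Filters=[{"Name": "instance-id", "Values": ids}]
        )
        for reservation in ec2["Reservations"]:
            for instance in reservation["Instances"]:
                states[instance["InstanceId"]] = instance["State"]["Name"]
    return stale_members(asg, states)
```

A non-empty result for “asg-prod-api” would point at the terminated instance that the deployment keeps waiting on.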

Every time you stop the deployment, the newest instance gets terminated, because it is tied to the lifecycle hook. The lifecycle hook receives an abandon for its last action, terminates the machine, and launches a new one. Essentially, you have walked yourself into a recursive deployment limbo.

The only workaround we found is to create another Tag and add it to the new launch configuration. This stops the matching instances from picking up the old terminated instance from the Auto Scaling Group, which breaks the recursion. Alternatively, wait a few hours for the terminated instance to be cleaned out; the deployment works once the internal cleanup has kicked in.

In parallel, we have reported this issue to the AWS team, who are already working on it. They believe that this is “a condition where the Auto Scale Group is facing a race condition”.

The dharma.h Engineering Team experienced this atypical scenario during the execution of a deployment.
