I recently changed jobs and inherited infrastructure built with Terraform, along with an application that deploys to EC2 via AWS CodeDeploy. The ASGs and target groups were all built and managed by Terraform, and CodeDeploy used in-place deployments. An in-place deployment means that the deployment automation does this:
- Stop the application on the existing servers, which brings the website down for customers
- Install the new version of the application on to the same existing servers
- Start the application
Obviously, stopping the application and bringing the website down is a huge anti-pattern, so one of the first tasks assigned to me at the new job was to implement blue/green deployments, roughly like this:
- Leave the existing application on the existing servers alone, so there’s no downtime
- Spin up new servers
- Install the new version on the new servers
- Redirect traffic to the new servers
- Kill the old servers
And all of the above can be done without downtime for the website or the customers.
AWS explains this (kinda poorly) in its docs. But I ran into a bunch of gotchas that I didn’t find documented anywhere on the internet, so hopefully I can help the next person by documenting them here.
A quick suggestion before getting into it
Everything described in this blog post kind of sucks and I wouldn’t recommend using this if you have other options. If you can deploy containers on ECS without using EC2 or CodeDeploy at all, that’s a way better experience and you should do that. You only want to go down this road if you’re in a position where you’re already locked in to EC2 and already locked in to CodeDeploy and you can’t do anything about it.
Okay, now we can get into it.
CodeDeploy Blue/Green on EC2 plays terribly with Terraform by default
The most common way of using CodeDeploy Blue/Green with EC2 is to let CodeDeploy create and delete your ASGs, which it does like this:
- You initially create an ASG on your own, presumably using Terraform
- Then each time your application deploys, CodeDeploy comes along and copies that ASG to a whole new ASG
- CodeDeploy deploys to the instances in the new ASG
- CodeDeploy deletes the old ASG
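The above pattern is entirely incompatible with Terraform, because after the first deployment the ASG that Terraform created is gone, and the one actually running your application was created by CodeDeploy outside of Terraform.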
Here are the Terraform resources I ended up using to work around the above pattern:
# You only need one load balancer
resource "aws_lb" "default" {
  ...
}

# You only want one target group, which works for blue and for green instances
resource "aws_lb_target_group" "default" {
  ...
}

# You need two ASGs, one for blue and one for green.
# Both should be hooked up to the same target group.
# And you'll need to `ignore_changes` on min_size/max_size/desired_capacity
# because when green is serving all the traffic then
# blue will scale down to zero, and vice versa.

resource "aws_autoscaling_group" "blue" {
  ...
  target_group_arns = [aws_lb_target_group.default.arn]
  lifecycle {
    ignore_changes = [
      min_size,
      max_size,
      desired_capacity,
    ]
  }
}

resource "aws_autoscaling_group" "green" {
  ...
  target_group_arns = [aws_lb_target_group.default.arn]
  lifecycle {
    ignore_changes = [
      min_size,
      max_size,
      desired_capacity,
    ]
  }
}

resource "aws_codedeploy_deployment_group" "default" {
  ...

  # This is super non-intuitive, but you only want the blue ASG here,
  # and you will ignore_changes on this attribute
  autoscaling_groups = [
    aws_autoscaling_group.blue.name,
  ]

  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }

  load_balancer_info {
    target_group_info {
      name = aws_lb_target_group.default.name
    }
  }

  blue_green_deployment_config {
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }

    green_fleet_provisioning_option {
      # The other option is `COPY_AUTO_SCALING_GROUP`, which is basically
      # incompatible with Terraform
      action = "DISCOVER_EXISTING"
    }

    terminate_blue_instances_on_deployment_success {
      # The other option is `TERMINATE`, which doesn't mean to terminate
      # individual instances, it means to delete the entire old ASG after
      # deployment, which is basically incompatible with Terraform
      action = "KEEP_ALIVE"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore this because this will bounce between blue and green with
      # each deployment and isn't really managed by Terraform
      autoscaling_groups,
    ]
  }
}
You can see in the above snippet that there’s a lot of stuff that Terraform isn’t managing, which I’ll walk through below:
- The `lifecycle` blocks on both ASGs: Terraform should `ignore_changes` on the ASG min/max/desired capacity, because when the green ASG is serving all the traffic, you will want the blue ASG to scale to zero, and vice versa.
- `autoscaling_groups` on the deployment group: we only specify the blue ASG, which is super non-intuitive. The reason is that in a blue/green setup where you’re alternating between ASGs, the deployment group doesn’t control which ASG will be used for any given deployment. That gets decided by the `aws deploy create-deployment` API call that you make for each deployment. See #5 in the next section for more on this.
- `DISCOVER_EXISTING` means that you want to deploy to already-running instances, and CodeDeploy should not provision its own instances on which to deploy. It’ll be up to you to provision those instances prior to each deployment. See #3 in the next section for more on this.
- `KEEP_ALIVE` keeps the old instances around after deployment, which means you’ll need to terminate the old instances yourself after each deployment. See #7 in the next section for more on this. The other option is `TERMINATE`, which doesn’t mean to terminate individual instances, it means to delete the entire old ASG after deployment, which is basically incompatible with Terraform.
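Since Terraform ignores `autoscaling_groups` on the deployment group, the deployment group is where you look to find out which ASG went last. Here’s a minimal sketch of reading that back and picking the other color. The application, deployment group, and ASG names are hypothetical placeholders, and `current_asg` / `other_asg` are helper names made up for this example.

```shell
#!/usr/bin/env bash

# Hypothetical placeholders: substitute your own names.
APP_NAME="${APP_NAME:-my-app}"
DEPLOYMENT_GROUP="${DEPLOYMENT_GROUP:-my-app-group}"
BLUE_ASG="${BLUE_ASG:-my-app-blue}"
GREEN_ASG="${GREEN_ASG:-my-app-green}"

# Ask CodeDeploy which ASG the deployment group currently points at.
# This is the attribute Terraform is told to ignore.
current_asg() {
  aws deploy get-deployment-group \
    --application-name "$APP_NAME" \
    --deployment-group-name "$DEPLOYMENT_GROUP" \
    --query 'deploymentGroupInfo.autoScalingGroups[0].name' \
    --output text
}

# Given the ASG used by the previous deployment, print the other one.
# Fails on anything that is neither the blue nor the green ASG.
other_asg() {
  case "$1" in
    "$BLUE_ASG")  echo "$GREEN_ASG" ;;
    "$GREEN_ASG") echo "$BLUE_ASG" ;;
    *) echo "unexpected ASG: $1" >&2; return 1 ;;
  esac
}
```

In a pipeline you would combine the two, something like `target="$(other_asg "$(current_asg)")"`.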
How to do deployments
Okay, so you have the above infrastructure created, and you can see from the notes above that CodeDeploy is not managing a bunch of stuff, so how do you actually deploy? All of the API calls which CodeDeploy isn’t managing for you will need to be made on your own as part of your own automation. Here are the API calls which you will need to deploy, and presumably you would automate these API calls in your deploy pipeline.
1. Figure out if the blue ASG or the green ASG was used for the previous deployment, so that you can then deploy to the other one: `aws deploy get-deployment-group ...`. You can find the previous deployment’s ASG name in `.deploymentGroupInfo.autoScalingGroups[0].name`.
2. Now you know which ASG to deploy to, and it will probably have leftover lifecycle hooks from the prior deployment. You’ll want to delete those lifecycle hooks using `aws autoscaling delete-lifecycle-hook ...`. If you don’t delete the lifecycle hooks before spinning up instances in the to-be-deployed ASG then your instances will hang in a `Pending:Wait` state forever.
3. Scale up instances in the to-be-deployed ASG by way of `aws autoscaling update-auto-scaling-group --min-size X --max-size Y --desired-capacity Z ...`. Then wait for the instances to reach `InService`, either with a `sleep` or by polling the `describe-auto-scaling-groups` API.
4. You’ll have more reliable deployments if you temporarily suspend the Launch and Terminate processes in both ASGs before deploying, which you can do with `aws autoscaling suspend-processes ...`.
5. Create a deployment, and this is where you tell CodeDeploy which ASG should be deployed to: `aws deploy create-deployment --target-instances={\"autoScalingGroups\":[\"$green_auto_scaling_group\"]} ...`. Shout out to this person on Stack Overflow for the inspiration on this one.
6. Wait for the deployment to complete. You can use `aws deploy get-deployment ...` to poll the status.
7. If the deployment to that ASG succeeded, you can then resume the Launch and Terminate processes on both ASGs (`aws autoscaling resume-processes ...`) and then scale the other ASG to zero (`aws autoscaling update-auto-scaling-group --min-size 0 --max-size 0 --desired-capacity 0 ...`).
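Once you know which ASG is the target and which one served the previous deployment, the rest of the sequence above might be strung together something like this. Treat it as a rough sketch, not a drop-in pipeline: the function names, the 15-second polling interval, and the `APP_NAME` / `DEPLOYMENT_GROUP` defaults are all assumptions for illustration, the polling loops have no timeouts, and it deletes every lifecycle hook on the target ASG, so filter that list if you have hooks of your own.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical placeholders: use your own names.
APP_NAME="${APP_NAME:-my-app}"
DEPLOYMENT_GROUP="${DEPLOYMENT_GROUP:-my-app-group}"
POLL_SECONDS=15  # arbitrary polling interval

# Delete lifecycle hooks left over from the prior deployment.
# Note: this removes every hook on the ASG, not only CodeDeploy's.
delete_leftover_hooks() {
  local asg="$1" hook
  for hook in $(aws autoscaling describe-lifecycle-hooks \
      --auto-scaling-group-name "$asg" \
      --query 'LifecycleHooks[].LifecycleHookName' --output text); do
    aws autoscaling delete-lifecycle-hook \
      --auto-scaling-group-name "$asg" --lifecycle-hook-name "$hook"
  done
}

# usage: scale_asg <asg> <min> <max> <desired>
scale_asg() {
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$1" \
    --min-size "$2" --max-size "$3" --desired-capacity "$4"
}

# Print how many instances in the ASG are InService.
in_service_count() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query "length(AutoScalingGroups[0].Instances[?LifecycleState=='InService'])" \
    --output text
}

# usage: set_processes <suspend|resume> <asg>...
set_processes() {
  local action="$1" asg
  shift
  for asg in "$@"; do
    aws autoscaling "${action}-processes" \
      --auto-scaling-group-name "$asg" \
      --scaling-processes Launch Terminate
  done
}

# Build the --target-instances JSON that names the ASG to deploy to.
target_instances_json() {
  printf '{"autoScalingGroups":["%s"]}' "$1"
}

# Poll until the deployment succeeds; fail if it ends as Failed or Stopped.
wait_for_deployment() {
  local state
  while :; do
    state="$(aws deploy get-deployment --deployment-id "$1" \
      --query 'deploymentInfo.status' --output text)"
    case "$state" in
      Succeeded) return 0 ;;
      Failed|Stopped) echo "deployment $1 ended as $state" >&2; return 1 ;;
    esac
    sleep "$POLL_SECONDS"
  done
}

# usage: deploy <target_asg> <previous_asg> <min> <max> <desired> [create-deployment args...]
deploy() {
  local target="$1" previous="$2" min="$3" max="$4" desired="$5" id
  shift 5
  delete_leftover_hooks "$target"
  scale_asg "$target" "$min" "$max" "$desired"
  until [ "$(in_service_count "$target")" -ge "$desired" ]; do
    sleep "$POLL_SECONDS"
  done
  set_processes suspend "$target" "$previous"
  id="$(aws deploy create-deployment \
    --application-name "$APP_NAME" \
    --deployment-group-name "$DEPLOYMENT_GROUP" \
    --target-instances "$(target_instances_json "$target")" \
    "$@" --query 'deploymentId' --output text)"
  # If this fails, set -e stops the script with processes still suspended
  # and both ASGs left as they are, so you can inspect before cleaning up.
  wait_for_deployment "$id"
  set_processes resume "$target" "$previous"
  scale_asg "$previous" 0 0 0
}

# Only run when called with arguments, so the file can also be sourced.
if [ "$#" -ge 5 ]; then
  deploy "$@"
fi
```

The first five arguments are the target ASG, the previous ASG, and the min/max/desired sizes for the target. Anything after that (your usual revision flags, for example) is passed straight through to `aws deploy create-deployment`.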
Conclusion
As you can see, this is not a “just works” experience where CodeDeploy manages alternating between ASGs and scaling instances up and down, as one might expect. You have to manage all of that yourself with a bunch of API calls.
I personally will be happy to get away from this and move to ECS ASAP.