by Kender Elford
First, A Little Bit of Background
For some time now, my team has been well versed in building JVM-based services that run in EC2. We adopted continuous delivery years ago and are now comfortable deploying our software in a canary style, which has saved our bacon on several occasions. Recently, we’ve begun to dip our toes into newer AWS technologies, including Lambda and Amazon SQS FIFO (First-In-First-Out) queues. We have a long history of using SNS (Simple Notification Service) to deliver events to our various pipelines, but Amazon SNS lacks the ability for a subscription to write a “group id” for FIFOs. We felt this was a good place for us to start experimenting with Lambdas. We ended up building a re-usable AWS Lambda component to take care of these subscriptions for us that can be deployed using our existing deployment infrastructure.
One piece of the puzzle was still missing for us: how were we going to address canary deployments? Here, I’ll share how we solved this.
Canaries for Lambda Functions?
If you’ve been practicing continuous delivery for your AWS EC2 based services, you’re probably already deploying changes as canaries. In order to limit potential impact to your customers, you likely send a portion of your traffic to new code. This same process can be applied to Lambda Functions.
Canaries for Lambda are based on two abstractions over the Function: Version and Alias. As presented in the AWS console, these can be a little confusing. Both appear under the “Qualifiers” menu.
A Version is exactly what it sounds like: a historical record of a code artifact. By default, whenever you upload new code to a function, it’s applied to a special Version called “$LATEST”. A new Version can be created by selecting the $LATEST version from Qualifiers, then selecting “Publish new version” from the “Actions” menu.
The Alias abstraction allows you to create a name that refers to one or two Versions. There is a default Alias called “Unqualified” that refers to the $LATEST Version. When you select two Versions for an Alias to use, you can also specify how much traffic the second one will receive from its event source (e.g., Amazon SNS, Amazon Kinesis, API Gateway).
An important aspect to understand about Versions and Aliases is that any of them can be configured to receive events, and they all operate independently from each other. For our use case, we create a single Alias to use for our production canary and configure only that Alias to receive events. Make sure that none of your Versions, nor the Unqualified Alias, are configured to receive events.
Once you get to the point of understanding how Versions and Aliases work, setting up a canary in the console is pretty self-evident:
Select your production canary Alias
Pick the stable Version in the first box
Pick the canary Version in the second box
Select the traffic distribution.
Check it out in the logs! You’ll see the Version in the log stream name, in the square brackets.
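The console steps above map onto the parameters of a single alias update. Here is a minimal sketch in Python, assuming you would hand the result to boto3's Lambda client `update_alias` call; the function name, alias name and version numbers are hypothetical, and the helper itself is only an illustration.

```python
def canary_alias_params(function_name, alias_name, stable_version,
                        canary_version, canary_weight):
    """Build keyword arguments for an alias update that splits traffic.

    Key names follow the Lambda UpdateAlias API as exposed by boto3's
    `update_alias`. All names passed in below are hypothetical.
    """
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0.0 and 1.0")
    return {
        "FunctionName": function_name,
        "Name": alias_name,
        # First box in the console: the stable Version.
        "FunctionVersion": stable_version,
        # Second box plus the traffic split: the canary Version and its share.
        "RoutingConfig": {
            "AdditionalVersionWeights": {canary_version: canary_weight}
        },
    }


params = canary_alias_params("sns-to-fifo", "production-canary", "1", "2", 0.1)
print(params)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("lambda").update_alias(**params)
```

Keeping the parameter construction separate from the API call makes the split easy to review before anything touches production.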
Automate this Business!
On our team the AWS automation tool of choice is AWS CloudFormation. This allows you to declare resources and their configurations however you want and lets AWS figure out how to make that happen.
To make our usage of CloudFormation repeatable, we use Python and Troposphere with our deployment framework and reusable recipes. We have a reusable Lambda Function for connecting an SNS topic to an SQS FIFO queue, and a recipe for deploying, configuring and canarying it.
CloudFormation has three types of resources for deploying Lambda Functions in a canary style:
Function: This is where the “code” for your function is maintained. What “code” means in this case depends on the runtime type for your function. In the case of Java, the “code” is a reference to a .jar file in S3 that contains all of the assembled .class files and resources to execute in the JVM. When this resource is created for the first time, or updated with new code any subsequent time, the “$LATEST” Version is automatically updated, as in the console.
Version: A Version can be created or removed, but not updated. Whenever it’s created, it becomes a “copy” of whatever “$LATEST” currently is. CloudFormation is smart enough to know that an update to the Function comes first. The Version also allows you to specify a base64-encoded SHA-256 hash of the expected function code, so you can’t accidentally create a Version of something you didn’t expect (e.g., create a new Version without updating the Function, or upload the wrong artifact to the Function). CloudFormation will automatically create incrementing Version numbers for you.
Alias: A name that refers to one Version and, optionally, routes a weighted share of traffic to a second Version. This is the resource your event source is subscribed to, and the one whose routing configuration you edit to start or finish a canary.
Here’s a diagram that describes how all of these Resources interact with one another:
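To make the relationship between the resources concrete, here is a rough sketch of a template built as a plain Python dictionary (not Troposphere, and not necessarily how our recipe does it). The logical IDs, bucket, key, role ARN, handler and runtime are all placeholder assumptions; the resource types and property names are the CloudFormation ones.

```python
import json


def initial_template(bucket, key, code_sha256, role_arn, handler, runtime="java17"):
    """Return a CloudFormation template with a Function, a Version and an Alias.

    Logical IDs ("Function", "VersionOne", "ProductionCanaryAlias") are
    hypothetical; pick whatever naming suits your stack.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "Function": {
                "Type": "AWS::Lambda::Function",
                "Properties": {
                    # For Java, the "code" is a reference to a .jar in S3.
                    "Code": {"S3Bucket": bucket, "S3Key": key},
                    "Handler": handler,
                    "Role": role_arn,
                    "Runtime": runtime,
                },
            },
            "VersionOne": {
                "Type": "AWS::Lambda::Version",
                "Properties": {
                    "FunctionName": {"Ref": "Function"},
                    # Guards against publishing unexpected code.
                    "CodeSha256": code_sha256,
                },
            },
            "ProductionCanaryAlias": {
                "Type": "AWS::Lambda::Alias",
                "Properties": {
                    "FunctionName": {"Ref": "Function"},
                    "Name": "production-canary",
                    # With no routing configuration, all traffic goes here.
                    "FunctionVersion": {"Fn::GetAtt": ["VersionOne", "Version"]},
                },
            },
        },
    }


template = initial_template(
    bucket="my-artifacts",                      # placeholder
    key="sns-to-fifo/1.0.0/app.jar",            # placeholder
    code_sha256="<base64 SHA-256 of the jar>",  # placeholder
    role_arn="arn:aws:iam::123456789012:role/example-lambda-role",  # placeholder
    handler="com.example.Handler::handleRequest",                   # placeholder
)
print(json.dumps(template, indent=2))
```

The Version and the Alias both reference the Function, and the Alias references the Version, which is what lets CloudFormation work out the order of operations.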
Putting This All Together, In Practice
So, now that you understand what CloudFormation Resources to use and how they will behave, what does the process look like for working with Lambdas?
Deploy the Initial Version
So, you have your code ready to go.
Upload your code artifact to S3 so that it can be referenced by your Function in your CloudFormation template. Make sure you capture the SHA256!
Create your CloudFormation template with a Function, Version and Alias. Make sure to name your Alias something obvious, like “production-canary.” The Alias is only going to reference the single Version that you’ve created.
Create your CloudFormation stack with that template, and wait for all of the resources to be created.
Use your method of choice to subscribe your Lambda Function’s “canary” Alias to your event source. We typically use the console for this step to “flip the switch” and connect our code to the ecosystem.
You’re done! Check out the logs, and notice all the things you messed up!
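For the “capture the SHA256” part of the first step, a small sketch of computing the base64-encoded SHA-256 of an artifact with the Python standard library. The helper names and the file path in the comment are hypothetical.

```python
import base64
import hashlib


def code_sha256(artifact_bytes):
    """Return the base64-encoded SHA-256 digest of the artifact bytes."""
    digest = hashlib.sha256(artifact_bytes).digest()
    return base64.b64encode(digest).decode("ascii")


def code_sha256_of_file(path, chunk_size=1024 * 1024):
    """Same digest, streamed from disk so large .jar files are not loaded whole."""
    hasher = hashlib.sha256()
    with open(path, "rb") as artifact:
        for chunk in iter(lambda: artifact.read(chunk_size), b""):
            hasher.update(chunk)
    return base64.b64encode(hasher.digest()).decode("ascii")


# Example with in-memory bytes; for a real build you might call
# code_sha256_of_file("build/libs/app.jar") (hypothetical path).
print(code_sha256(b"example artifact contents"))
```

The resulting string is what you would place in the Version’s CodeSha256 property.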
Deploy an Update
Ok, so it wasn’t perfect. But now it is! . . . Right? Probably not. You’ll want to limit the impact to customers by deploying this as a canary.
Upload your new code artifact to a new location in S3.
Update your CloudFormation template: change the reference to the code in your Function, which triggers CloudFormation to update the $LATEST Version; add a new Version, leaving the existing Version alone; and reference the new Version in the existing Alias’s routing configuration, with an appropriately small weight set on the canary Version.
Update your existing stack with the new template. You should create a change set and review before executing the change.
Observe the log streams for your Lambda. You’ll see that some of them are for your canary.
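The template changes in these steps can be sketched as a function over a template dictionary. This is an illustration under assumptions rather than our actual recipe: the logical IDs, S3 locations and hash values are placeholders, and the Function’s other required properties are omitted for brevity.

```python
import copy

# Hypothetical logical IDs of resources already in the template.
FUNCTION_ID = "Function"
ALIAS_ID = "ProductionCanaryAlias"


def start_canary(template, new_version_id, new_code, new_code_sha256, canary_weight):
    """Return a copy of the template with a canary Version wired into the Alias."""
    updated = copy.deepcopy(template)
    resources = updated["Resources"]
    # 1. Point the Function at the new artifact, which updates $LATEST.
    resources[FUNCTION_ID]["Properties"]["Code"] = new_code
    # 2. Add a new Version, leaving the existing Version alone.
    resources[new_version_id] = {
        "Type": "AWS::Lambda::Version",
        "Properties": {
            "FunctionName": {"Ref": FUNCTION_ID},
            "CodeSha256": new_code_sha256,
        },
    }
    # 3. Route a small share of the Alias's traffic to the new Version.
    resources[ALIAS_ID]["Properties"]["RoutingConfig"] = {
        "AdditionalVersionWeights": [
            {
                "FunctionVersion": {"Fn::GetAtt": [new_version_id, "Version"]},
                "FunctionWeight": canary_weight,
            }
        ]
    }
    return updated


current = {
    "Resources": {
        "Function": {
            "Type": "AWS::Lambda::Function",
            # Handler, Role and Runtime omitted for brevity.
            "Properties": {"Code": {"S3Bucket": "my-artifacts", "S3Key": "1.0.0/app.jar"}},
        },
        "VersionOne": {
            "Type": "AWS::Lambda::Version",
            "Properties": {"FunctionName": {"Ref": "Function"}, "CodeSha256": "<old hash>"},
        },
        "ProductionCanaryAlias": {
            "Type": "AWS::Lambda::Alias",
            "Properties": {
                "FunctionName": {"Ref": "Function"},
                "Name": "production-canary",
                "FunctionVersion": {"Fn::GetAtt": ["VersionOne", "Version"]},
            },
        },
    }
}

canary = start_canary(
    current,
    new_version_id="VersionTwo",
    new_code={"S3Bucket": "my-artifacts", "S3Key": "1.1.0/app.jar"},
    new_code_sha256="<new hash>",
    canary_weight=0.05,
)
```

Working on a deep copy keeps the previous template intact, which is handy when you want to compare the two before creating a change set.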
Oh no! It’s still not right! Don’t panic!
Remove the canary Version from your template along with its canary weight.
After the stack has completed updating (should be fast), everything should be as it was before you deployed the canary.
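As a sketch, the rollback amounts to two deletions from the template. The logical IDs below are hypothetical, and the helper only does what the step describes; it leaves the Function’s code reference untouched.

```python
import copy


def rollback_canary(template, alias_id, canary_version_id):
    """Return a copy of the template with the canary Version and its weight removed."""
    updated = copy.deepcopy(template)
    resources = updated["Resources"]
    # Drop the canary Version resource.
    resources.pop(canary_version_id, None)
    # Drop the weighted routing so the Alias serves only the stable Version.
    resources[alias_id]["Properties"].pop("RoutingConfig", None)
    return updated


# A trimmed template mid-canary (hypothetical IDs, Function omitted for brevity).
mid_canary = {
    "Resources": {
        "VersionOne": {"Type": "AWS::Lambda::Version", "Properties": {}},
        "VersionTwo": {"Type": "AWS::Lambda::Version", "Properties": {}},
        "ProductionCanaryAlias": {
            "Type": "AWS::Lambda::Alias",
            "Properties": {
                "Name": "production-canary",
                "FunctionVersion": {"Fn::GetAtt": ["VersionOne", "Version"]},
                "RoutingConfig": {
                    "AdditionalVersionWeights": [
                        {
                            "FunctionVersion": {"Fn::GetAtt": ["VersionTwo", "Version"]},
                            "FunctionWeight": 0.05,
                        }
                    ]
                },
            },
        },
    }
}

rolled_back = rollback_canary(mid_canary, "ProductionCanaryAlias", "VersionTwo")
```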
Alright! Everything is fixed up!
Change the Version reference in the Alias to the new Version in your template.
The previous Version should no longer be referenced anywhere in your template, so you can remove that.
Remove the routing configuration (the RoutingConfig property) from your Alias, so that all traffic is now going to the new Version.
Update your stack.
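The promotion steps above can be sketched the same way, as a transformation of the template dictionary. Again, the logical IDs are hypothetical and this is an illustration rather than our actual recipe.

```python
import copy


def promote_canary(template, alias_id, old_version_id, new_version_id):
    """Return a copy of the template where the Alias serves only the new Version."""
    updated = copy.deepcopy(template)
    resources = updated["Resources"]
    alias_props = resources[alias_id]["Properties"]
    # Point the Alias at the new Version.
    alias_props["FunctionVersion"] = {"Fn::GetAtt": [new_version_id, "Version"]}
    # With no weighted routing left, all traffic goes to the new Version.
    alias_props.pop("RoutingConfig", None)
    # The previous Version is no longer referenced, so it can be removed.
    resources.pop(old_version_id, None)
    return updated


# A trimmed template mid-canary (hypothetical IDs, Function omitted for brevity).
mid_canary = {
    "Resources": {
        "VersionOne": {"Type": "AWS::Lambda::Version", "Properties": {}},
        "VersionTwo": {"Type": "AWS::Lambda::Version", "Properties": {}},
        "ProductionCanaryAlias": {
            "Type": "AWS::Lambda::Alias",
            "Properties": {
                "Name": "production-canary",
                "FunctionVersion": {"Fn::GetAtt": ["VersionOne", "Version"]},
                "RoutingConfig": {
                    "AdditionalVersionWeights": [
                        {
                            "FunctionVersion": {"Fn::GetAtt": ["VersionTwo", "Version"]},
                            "FunctionWeight": 0.05,
                        }
                    ]
                },
            },
        },
    }
}

promoted = promote_canary(mid_canary, "ProductionCanaryAlias", "VersionOne", "VersionTwo")
```

After this update the template has the same shape as the initial deployment, just with a different Version, so the next canary starts from a clean slate.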
Once you understand how the resources interact with one another, deploying a Lambda Function in a canary style with CloudFormation is a fairly simple matter.
Here is an example template which illustrates a complete canary stack: Snippet-Link