Forum Posts

ariel-moreno12
Mar 20, 2019
In Cloud
A few months ago, I gave a talk at Nexus User Conference 2018 on how to build a fully automated CI/CD platform on AWS using Terraform, Packer and Ansible. The session illustrated how concepts like infrastructure as code, immutable infrastructure, serverless, cluster discovery, etc. can be used to build a highly available and cost-effective pipeline. The platform I built is represented in the following diagram:

📷

The platform has a Jenkins cluster with a dedicated Jenkins master and workers inside an autoscaling group. Each push event to the code repository triggers the Jenkins master, which schedules a new build on one of the available slave nodes. The slave nodes are responsible for running the unit and pre-integration tests, building the Docker image, storing the image in a private registry and deploying a container based on that image to a Docker Swarm cluster. If you missed my talk, you can watch it again on YouTube.

In this post, I will walk through how to deploy the Jenkins cluster on AWS using the latest automation tools. The cluster will be deployed into a VPC with 2 public and 2 private subnets across 2 availability zones. The stack will consist of an autoscaling group of Jenkins workers in the private subnets and a private instance for the Jenkins master sitting behind an Elastic Load Balancer. To add or remove Jenkins workers on demand, the CPU utilization of the autoscaling group will be used to trigger a scale-out (CPU > 80%) or scale-in (CPU < 20%) event.

📷

To get started, we will create 2 AMIs (Amazon Machine Images) for our instances. To do so, we will use Packer, which allows you to bake your own image. The first AMI will be used to create the Jenkins master instance. It uses the Amazon Linux image as a base image, and the provisioning part is handled by a simple shell script. The shell script installs the necessary dependencies, packages and security patches, installs the latest stable version of Jenkins and configures its settings:

- Create a Jenkins admin user.
- Create SSH, GitHub and Docker registry credentials.
- Install all needed plugins (Pipeline, Git plugin, Multi-branch Project, etc.).
- Disable remote CLI, JNLP and unnecessary protocols.
- Enable CSRF (Cross-Site Request Forgery) protection.
- Install the Telegraf agent for collecting resource and Docker metrics.

The second AMI will be used to create the Jenkins workers. Similarly to the first AMI, it uses the Amazon Linux image as a base image and a script to provision the instance. A Jenkins worker requires the Java JDK and Git to be installed. In addition, the Docker community edition (for building Docker images) and a data collector (for monitoring) will be installed.

Now that our Packer template files are defined, issue the Packer build commands to start baking the AMIs. Packer will launch a temporary EC2 instance from the base image specified in the template file, provision the instance with the given shell script and, finally, create an image from the instance. The following is an example of the output:

📷

Sign in to the AWS Management Console, navigate to "EC2 Dashboard" and click on "AMIs"; 2 new AMIs should have been created as below:

📷

Now that our AMIs are ready to use, let's deploy our Jenkins cluster to AWS. To achieve that, we will use an infrastructure-as-code tool called Terraform, which allows you to describe your entire infrastructure in template files.
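As an aside before diving into the Terraform templates: to make the scale-out/scale-in thresholds mentioned above concrete, here is a minimal sketch of the two CPU alarms expressed with the AWS SDK for Go. The post itself defines these alarms in Terraform; the ASG name and scaling-policy ARNs below are placeholders.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func putCPUAlarm(cw *cloudwatch.CloudWatch, name, operator string, threshold float64, policyARN string) error {
	// One alarm per direction: CPU > 80% triggers the scale-out policy,
	// CPU < 20% triggers the scale-in policy.
	_, err := cw.PutMetricAlarm(&cloudwatch.PutMetricAlarmInput{
		AlarmName:          aws.String(name),
		Namespace:          aws.String("AWS/EC2"),
		MetricName:         aws.String("CPUUtilization"),
		Statistic:          aws.String("Average"),
		Period:             aws.Int64(300),
		EvaluationPeriods:  aws.Int64(2),
		ComparisonOperator: aws.String(operator),
		Threshold:          aws.Float64(threshold),
		Dimensions: []*cloudwatch.Dimension{{
			Name:  aws.String("AutoScalingGroupName"),
			Value: aws.String("jenkins-workers-asg"), // placeholder ASG name
		}},
		AlarmActions: []*string{aws.String(policyARN)}, // placeholder scaling policy ARN
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	cw := cloudwatch.New(sess)

	if err := putCPUAlarm(cw, "jenkins-workers-scale-out", "GreaterThanThreshold", 80, "arn:aws:autoscaling:region:account:scalingPolicy/scale-out"); err != nil {
		log.Fatal(err)
	}
	if err := putCPUAlarm(cw, "jenkins-workers-scale-in", "LessThanThreshold", 20, "arn:aws:autoscaling:region:account:scalingPolicy/scale-in"); err != nil {
		log.Fatal(err)
	}
}
```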
I have divided each component of my infrastructure into its own template file. The first template file is responsible for creating an EC2 instance from the Jenkins master's AMI built earlier. Another template file is used as a reference to each AMI built with Packer. The Jenkins workers (aka slaves) will be placed in an autoscaling group with a minimum of 3 instances; the instances will be created from a launch configuration based on the Jenkins slave's AMI.

To leverage the power of automation, we will make each worker instance join the cluster automatically (cluster discovery) using the Jenkins RESTful API. At boot time, the user-data script is invoked, the instance's private IP address is retrieved from the instance metadata, and a Groovy script is executed to make the node join the cluster.

Moreover, to be able to scale out and scale in instances on demand, I have defined 2 CloudWatch metric alarms based on the CPU utilization of the autoscaling group. Finally, an Elastic Load Balancer will be created in front of the Jenkins master's instance and a new DNS record pointing to the ELB domain will be added to Route 53.

Once the stack is defined, provision the infrastructure with the terraform apply command. The command takes an additional parameter: a variables file with the AWS credentials and VPC settings (you can create a new VPC with Terraform from here). Terraform will display an execution plan (the list of resources that will be created) in advance; type yes to confirm and the stack will be created in a few seconds:

📷

Jump back to the EC2 dashboard and a list of EC2 instances will have been created:

📷

In the terminal session, under the Outputs section, the Jenkins URL will be displayed:

📷

Point your favorite browser to that URL and the Jenkins login screen will appear. Sign in using the credentials provided while baking the Jenkins master's AMI:

📷

If you click on "Credentials" in the navigation pane, a set of credentials should have been created out of the box:

📷

The same goes for "Plugins": the list of needed packages will have been installed as well:

📷

Once the autoscaling group has finished creating the EC2 instances, the instances will join the cluster automatically, as you can see in the following screenshot:

📷

You should now be ready to create your own CI/CD pipeline!

📷

You can take this further and build a dynamic dashboard in your favorite visualization tool, like Grafana, to monitor your cluster resource usage based on the metrics collected by the agent installed on each EC2 instance:

📷 https://github.com/mlabouardy/grafana-dashboards

Credit: Mohamed Labouardy @ Medium
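A footnote on the cluster-discovery step described above: the post implements it with a user-data shell script that reads the instance metadata and runs a Groovy script against the Jenkins RESTful API. The sketch below is only an illustration of the same idea in Go, not the original code; the master URL, credentials and the Groovy snippet are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Read this worker's private IP from the EC2 instance metadata service.
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/local-ipv4")
	if err != nil {
		log.Fatal(err)
	}
	ipBytes, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	privateIP := strings.TrimSpace(string(ipBytes))

	// Illustrative Groovy snippet that registers this instance as a node.
	// The real script in the post also configures the launcher, labels and executors.
	groovy := fmt.Sprintf(`
import jenkins.model.Jenkins
import hudson.slaves.DumbSlave
def node = new DumbSlave("worker-%s", "/home/ec2-user/jenkins", new hudson.slaves.JNLPLauncher())
Jenkins.instance.addNode(node)
`, privateIP)

	// Submit the script to the master's script console. If CSRF protection is
	// enabled (as in the post), a crumb header would also be required.
	form := url.Values{"script": {groovy}}
	req, err := http.NewRequest("POST",
		"http://jenkins.example.com:8080/scriptText", // placeholder master URL
		strings.NewReader(form.Encode()))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.SetBasicAuth("admin", "ADMIN_API_TOKEN") // placeholder credentials

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println("master responded with", res.Status)
}
```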
ariel-moreno12
Mar 20, 2019
In CI CD
There are a variety of techniques to deploy new applications to production, so choosing the right strategy is an important decision that weighs the impact of change on the system and on the end users. In this post, we are going to talk about the following strategies:

- Recreate: Version A is terminated, then version B is rolled out.
- Ramped (also known as rolling-update or incremental): Version B is slowly rolled out, replacing version A.
- Blue/Green: Version B is released alongside version A, then the traffic is switched to version B.
- Canary: Version B is released to a subset of users, then proceeds to a full rollout.
- A/B testing: Version B is released to a subset of users under specific conditions.
- Shadow: Version B receives real-world traffic alongside version A and doesn't impact the response.

Let's take a look at each strategy and see which one would fit best for a particular use case. For the sake of simplicity, we used Kubernetes and tested the examples against Minikube. Example configurations and step-by-step approaches for each strategy can be found in this git repository.

Recreate

The recreate strategy is a dummy deployment which consists of shutting down version A, then deploying version B after version A is turned off. This technique implies downtime of the service that depends on both the shutdown and boot duration of the application.

Pros:
- Easy to set up.
- Application state entirely renewed.

Cons:
- High impact on the user; expect downtime that depends on both the shutdown and boot duration of the application.

Ramped

The ramped deployment strategy consists of slowly rolling out a version of an application by replacing instances one after the other until all the instances are rolled out. It usually follows this process: with a pool of version A behind a load balancer, one instance of version B is deployed. When the service is ready to accept traffic, the instance is added to the pool. Then, one instance of version A is removed from the pool and shut down. Depending on the system taking care of the ramped deployment, you can tweak the following parameters to increase the deployment time:

- Parallelism, max batch size: Number of concurrent instances to roll out.
- Max surge: How many instances to add in addition to the current amount.
- Max unavailable: Number of unavailable instances during the rolling update procedure.

Pros:
- Easy to set up.
- Version is slowly released across instances.
- Convenient for stateful applications that can handle rebalancing of the data.

Cons:
- Rollout/rollback can take time.
- Supporting multiple APIs is hard.
- No control over traffic.

Blue/Green

The blue/green deployment strategy differs from a ramped deployment: version B (green) is deployed alongside version A (blue) with exactly the same number of instances. After testing that the new version meets all the requirements, the traffic is switched from version A to version B at the load-balancer level.

Pros:
- Instant rollout/rollback.
- Avoids versioning issues; the entire application state is changed in one go.

Cons:
- Expensive as it requires double the resources.
- Proper testing of the entire platform should be done before releasing to production.
- Handling stateful applications can be hard.

Canary

A canary deployment consists of gradually shifting production traffic from version A to version B. Usually the traffic is split based on weight. For example, 90 percent of the requests go to version A and 10 percent go to version B.
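To make that weighted split concrete, here is a minimal sketch of a weight-based router. This is not from the original article, whose examples live as Kubernetes configuration in the linked repository; the backend addresses and the hard-coded 10% weight are placeholders.

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Stable version A and canary version B (placeholder addresses).
	versionA, _ := url.Parse("http://127.0.0.1:8081")
	versionB, _ := url.Parse("http://127.0.0.1:8082")

	proxyA := httputil.NewSingleHostReverseProxy(versionA)
	proxyB := httputil.NewSingleHostReverseProxy(versionB)

	canaryWeight := 0.10 // 10% of requests go to version B

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Route each request according to the canary weight.
		if rand.Float64() < canaryWeight {
			proxyB.ServeHTTP(w, r)
			return
		}
		proxyA.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```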
This technique is mostly used when the tests are lacking or not reliable, or if there is little confidence about the stability of the new release on the platform.

Pros:
- Version released to a subset of users.
- Convenient for error-rate and performance monitoring.
- Fast rollback.

Con:
- Slow rollout.

A/B testing

A/B testing deployments consist of routing a subset of users to a new functionality under specific conditions. It is usually a technique for making business decisions based on statistics rather than a deployment strategy. However, it is related and can be implemented by adding extra functionality to a canary deployment, so we will briefly discuss it here. This technique is widely used to test the conversion of a given feature and only roll out the version that converts the most. Here is a list of conditions that can be used to distribute traffic amongst the versions:

- Browser cookie
- Query parameters
- Geolocation
- Technology support: browser version, screen size, operating system, etc.
- Language

Pros:
- Several versions run in parallel.
- Full control over the traffic distribution.

Cons:
- Requires an intelligent load balancer.
- Hard to troubleshoot errors for a given session; distributed tracing becomes mandatory.

Shadow

A shadow deployment consists of releasing version B alongside version A, forking version A's incoming requests and sending them to version B as well, without impacting production traffic. This is particularly useful for testing production load on a new feature. A rollout of the application is triggered when stability and performance meet the requirements.

This technique is fairly complex to set up and has special requirements, especially with egress traffic. For example, given a shopping-cart platform, if you shadow-test the payment service you can end up having customers pay twice for their order. In this case, you can solve it by creating a mocking service that replicates the response from the provider.

Pros:
- Performance testing of the application with production traffic.
- No impact on the user.
- No rollout until the stability and performance of the application meet the requirements.

Cons:
- Expensive as it requires double the resources.
- Not a true user test and can be misleading.
- Complex to set up.
- Requires a mocking service for certain cases.

To Sum Up

There are multiple ways to deploy a new version of an application and it really depends on the needs and budget. When releasing to development/staging environments, a recreate or ramped deployment is usually a good choice. When it comes to production, a ramped or blue/green deployment is usually a good fit, but proper testing of the new platform is necessary. Blue/green and shadow strategies have more impact on the budget as they require double the resource capacity. If the application lacks tests or if there is little confidence about the impact/stability of the software, then a canary, A/B testing or shadow release can be used. If your business requires testing a new feature amongst a specific pool of users that can be filtered depending on parameters like geolocation, language, operating system or browser features, then you may want to use the A/B testing technique. Last but not least, a shadow release is complex and requires extra work to mock egress traffic, which is mandatory when calling external dependencies with mutable actions (email, bank, etc.). However, this technique can be useful when migrating to a new database technology, using shadow traffic to monitor system performance under load.
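As a final illustration, the request forking that the shadow strategy relies on can be sketched as a small proxy that serves every request from version A and replays a copy to version B in the background, discarding B's response. Again, this is not from the original article (which demonstrates the strategies with Kubernetes); the backend addresses are placeholders.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	primary, _ := url.Parse("http://127.0.0.1:8081") // version A (placeholder)
	shadow := "http://127.0.0.1:8082"                // version B (placeholder)

	proxy := httputil.NewSingleHostReverseProxy(primary)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Buffer the body so it can be sent to both versions.
		body, _ := io.ReadAll(r.Body)
		r.Body.Close()

		// Clone the header before handing it to the background goroutine.
		hdr := r.Header.Clone()

		// Replay a copy of the request to version B, ignoring its response
		// so production traffic is not impacted.
		go func(method, path string, payload []byte, header http.Header) {
			req, err := http.NewRequest(method, shadow+path, bytes.NewReader(payload))
			if err != nil {
				return
			}
			req.Header = header
			if resp, err := http.DefaultClient.Do(req); err == nil {
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
			}
		}(r.Method, r.URL.RequestURI(), body, hdr)

		// Serve the user's request from version A as usual.
		r.Body = io.NopCloser(bytes.NewReader(body))
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```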
Below is a diagram to help you choose the right strategy.

Depending on the cloud provider or platform, the following docs can be a good start for understanding deployment: Amazon Web Services, Docker Swarm, Google Cloud, Kubernetes.

I hope this was useful. If you have any questions or feedback, feel free to comment below.

Originally published at: https://thenewstack.io/deployment-strategies/
ariel-moreno12
Mar 20, 2019
In Cloud
These days, people really value the instant gratification they get from leveraging Infrastructure as a Service for use cases that would otherwise take significant effort to build in house. Yes, we are talking about the power of cloud computing, which present-day software product and service companies use extensively. It all works perfectly well and keeps your life easy until, one fine day, you start dealing with things at SCALE. You start facing problems like storing and organising your data at large scale, making internal as well as customer-facing systems highly available, and building efficient monitoring, logging and alerting mechanisms. All of these things can really be a matter of a few clicks if you use public cloud solutions seriously. But, as the legends say, with great power comes great responsibility: you can understand and work around the engineering-level responsibilities, but the next big thing that hits individuals and organisations is that you need to pay the bills for all the resources you consume on the public cloud. Like it or not, your cloud costs can be a bomb if you don't plan them well.

I work extensively with AWS. To be a little more specific, at Indix HQ, a Data as a Service company, we are building the world's largest cloud catalog of structured marketplace product information. The data runs to hundreds of TBs, which we deliver to our customers through APIs and bulk feeds. As a DevOps and infrastructure person, I was responsible for designing, developing and maintaining a stable, reliable and secure cloud infrastructure for our internal and customer-facing apps on AWS. But AWS, like any other public cloud provider, comes at a cost. When your cloud infrastructure is serving and processing data at large scale, you are sure to use resources extensively, and nothing is cheap then; your bills run into thousands of dollars. This is where your finance and engineering ninjas adopt cost-control measures and try to standardise the use of cloud resources as per common best practices. Some of the standards that we follow at Indix are:

- Choosing the right instance types for deploying applications, so that there is no under-utilisation of the computation power offered by the selected instance type.
- Using Spot Instances for systems which are not mission-critical and Reserved Instances for mission-critical ones.
- Proper tagging of all the resources. This reduces the risk of anything going unmonitored.
- Keeping track of service usage through Cost Explorer.

One of our biggest challenges in controlling costs on AWS, despite following all these standard best practices, was that we weren't able to track, and hence respond to, unexpected events. To elaborate, think of the following possibilities:

- An unexpected autoscaling event at 3 AM which scales your cluster up and then back down, which you would want to track for cost-control purposes.
- Human error, such as EC2 instances being left idle.
- Untracked jobs which have the potential to incur huge costs due to high data transfer.

Though AWS's Cost Explorer tool helps us analyse cost very efficiently, we were not sure if we could leverage it programmatically to build our own custom solutions on top of it.
One fine day we came across an article from Jeff Barr of AWS which talks about how the AWS Cost & Usage Report (CUR) can be used to analyse the cost distribution on AWS. Though the blog says it all, we wanted to build an automated system which could help us track our costs at the desired time granularity. The CUR, although just a CSV file, is really complex, and it was going to take ages for us to parse it and extract the required cost data for our use case. Instead, we imported our CSV data into Amazon Redshift, where we can store large-scale data in a Postgres-compatible database without caring about the underlying infrastructure. This way we could store our CUR data as a SQL table and fetch the required data from it using simple or complex queries. It was a fair win over the effort otherwise involved in writing parsing scripts for the CUR. Added to that, AWS lets us do this entire import process in just a few clicks: it allows you to store your billing reports in an S3 bucket and import them into Redshift.

📷 Configuring CUR granularity and S3 bucket location. Image source: AWS blog

Once the reports are configured, you'll get CURs delivered to your S3 bucket. We further configured our reports to be made available to a Redshift cluster by providing a table schema for the report (you have the option to do this in the same billing console). This meant we just needed a new Redshift cluster and we were done with the cost-data loading part. Once we had the data in our Redshift cluster, we started running queries on it right away.

📷 Redshift console. Image source: AWS blog

Everything we expected after being inspired by the blog article was working fine for us. The journey was only half complete, though. What we needed was an end-to-end automated system which could do the above jobs for us and also alert our engineers or engineering managers whenever a certain cost threshold is crossed for monitored resources on AWS. We decided to leverage the power of AWS Lambda, the serverless compute service provided by AWS. We came up with the following design for the Lambda functions:

- Lambda function #1: Automates the process of fetching the CUR from S3 and uploading it to Redshift.
- Lambda functions #2 and #3: Read a set of specified configurations from a JSON file, frame queries from that information, compare the costs obtained from the queries with threshold data, and then send Slack alerts if the thresholds are crossed.

Let's take a deep dive into Lambda functions #2 and #3. These are cron-based functions which are responsible for alerting people on Slack (which we use for our internal office communication) whenever a resource on AWS incurs more than a certain specified cost.

📷 Configuration read by the Lambda functions for creating queries

The image above shows the kind of configuration our Lambda functions #2 and #3 read for framing queries. For the configuration shown, the query formed goes something like this:

select sum(cast(lineitem_unblendedcost as float)) from #TABLE_NAME where #TAG_NAME='#TAG_VALUE';

The result of this query gives us a sum which is compared against the specified Threshold value; if the threshold is crossed, an alert is sent to the specified Slack Channel, tagging the concerned engineering manager (EM_Name).
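The post doesn't include the function code itself, so the following is only a rough sketch of what the core of Lambda functions #2/#3 could look like: run the cost query against Redshift over its Postgres-compatible interface, compare the result with the configured threshold, and post to Slack when it is exceeded. The connection string, table name, tag, threshold and webhook URL are all placeholders.

```go
package main

import (
	"bytes"
	"database/sql"
	"encoding/json"
	"fmt"
	"net/http"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
	_ "github.com/lib/pq" // Redshift is reachable over the Postgres wire protocol
)

// Monitor mirrors one entry of the JSON configuration described in the post.
type Monitor struct {
	TagName   string  `json:"tag_name"`
	TagValue  string  `json:"tag_value"`
	Threshold float64 `json:"threshold"`
	Channel   string  `json:"channel"`
	EMName    string  `json:"em_name"`
}

func handler() error {
	// Placeholder connection string read from the environment.
	db, err := sql.Open("postgres", os.Getenv("REDSHIFT_DSN"))
	if err != nil {
		return err
	}
	defer db.Close()

	// In the real setup this entry would be read from the JSON configuration file;
	// hard-coded here purely as a placeholder.
	m := Monitor{TagName: "resourcetags_user_team", TagValue: "search", Threshold: 500, Channel: "#cost-alerts", EMName: "@em"}

	// Same shape as the query shown in the post, with the table and tag filled in.
	query := fmt.Sprintf(
		"select sum(cast(lineitem_unblendedcost as float)) from %s where %s='%s'",
		os.Getenv("CUR_TABLE"), m.TagName, m.TagValue)

	var cost sql.NullFloat64
	if err := db.QueryRow(query).Scan(&cost); err != nil {
		return err
	}

	// Compare against the threshold and alert on Slack if it is crossed.
	if cost.Valid && cost.Float64 > m.Threshold {
		text := fmt.Sprintf("%s %s: %s=%s has incurred $%.2f (threshold $%.2f)",
			m.Channel, m.EMName, m.TagName, m.TagValue, cost.Float64, m.Threshold)
		payload, _ := json.Marshal(map[string]string{"text": text})
		_, err = http.Post(os.Getenv("SLACK_WEBHOOK_URL"), "application/json", bytes.NewReader(payload))
		return err
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```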
Following is one of the sample alerts:

📷

So that's it :) Over the day, if the cost of our monitored resources shows unexpected behaviour, we get to know for sure :) The overall graphical representation of the architecture looks like this:

📷 Architecture diagram for the cost alerting system

I named it PLUTUS :) Soon enough, at OpsLyft, we are going to release this as a full-fledged product which you can use within your AWS environment.

Credit: Aayush Kumar @ Medium
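A small footnote on Lambda function #1 from the post above (fetching the CUR from S3 and loading it into Redshift): at its core this boils down to issuing a Redshift COPY command. The sketch below is not the original code; the connection string, table, bucket path and IAM role are placeholders read from the environment.

```go
package main

import (
	"database/sql"
	"fmt"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
	_ "github.com/lib/pq" // Redshift speaks the Postgres wire protocol
)

func handler() error {
	db, err := sql.Open("postgres", os.Getenv("REDSHIFT_DSN")) // placeholder connection string
	if err != nil {
		return err
	}
	defer db.Close()

	// COPY the gzipped CUR CSV from S3 into the billing table.
	// Bucket, key prefix, table and IAM role are placeholders.
	copyStmt := fmt.Sprintf(
		"COPY %s FROM 's3://my-billing-bucket/cur/' IAM_ROLE '%s' FORMAT AS CSV GZIP IGNOREHEADER 1",
		os.Getenv("CUR_TABLE"), os.Getenv("REDSHIFT_COPY_ROLE_ARN"))

	_, err = db.Exec(copyStmt)
	return err
}

func main() {
	lambda.Start(handler)
}
```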
ariel-moreno12
Mar 20, 2019
In Cloud
Credit: Marius @ Medium

At my workplace we are heavily using GitLab CI pipelines to orchestrate infrastructure as code on AWS, so we were looking for a lightweight solution to run simple GitLab pipelines on AWS. This article will give deep insights into the proof-of-concept workflow of making the first serverless GitLab runner work, as well as some general experience of trying out Lambda Layers for the first time. You can find the entire code for this experiment here: https://gitlab.com/msvechla/gitlab-lambda-runner

To reduce the blast radius inside AWS, we isolate data and their workloads in different AWS accounts. As using IAM users with static API credentials is an anti-pattern (static credentials can leak or accidentally get committed to a repository), a different solution is needed to authorize pipeline runs. Inside the pipeline we mainly use terraform and terragrunt to set up our infrastructure. Internally, both tools use the AWS SDK for Go to authenticate and authorize against AWS. This gives us multiple options to specify credentials, e.g. via environment variables, a shared credentials file or an IAM role for Amazon EC2.

When looking at existing open-source projects that deploy gitlab-runner on AWS, I stumbled upon npalm/terraform-aws-gitlab-runner. This is an awesome Terraform module to run gitlab-runner on EC2 spot instances. However, it originally did not support "native" IAM authentication of builds, as the module uses gitlab-runner with the docker+machine executor under the hood. After creating a pull request that enables the configuration of a "runners_iam_instance_profile", we can now use this "hack" to inject the IAM credentials of the EC2 instance profile from the metadata service as environment variables into the runner:

CRED=$(curl http://169.254.169.254:80/latest/meta-data/iam/security-credentials/my-iam-instance-profile/); export AWS_SECRET_ACCESS_KEY=`echo $CRED | jq -r .SecretAccessKey`; export AWS_ACCESS_KEY_ID=`echo $CRED | jq -r .AccessKeyId`; export AWS_SESSION_TOKEN=`echo $CRED | jq -r .Token`

While this solution works and is perfect for running heavy application builds cost-efficiently on EC2 spot instances, it has a lot of overhead. It needs at least two EC2 instances and Docker to be able to perform builds, which is a lot of components to manage in each of our AWS accounts, just for running terraform. This is what sparked the idea of running builds completely serverless on Lambda.

A proof of concept

While there are some obvious hard limitations, like the Lambda execution timeout of 15 minutes, it still made sense to give it a try, as the idea of running builds on Lambda in theory looked much leaner than managing multiple EC2 instances in all our accounts. Also, a quick Google search revealed that running GitLab builds on Lambda apparently had not been done yet, so this seemed like an interesting proof of concept.

📷

So what exactly did we want to evaluate:

- GitLab builds can be executed on Lambda
- The GitLab executor can inherit IAM permissions based on the assigned role
- Additional binaries and their dependencies can be executed during the build (e.g. terraform)

The idea was then to finally evaluate the solution by creating a simple S3 bucket with terraform from within a pipeline that was executed on Lambda. As the boundaries for the experiment were set, it was now time to figure out a way to make this work. Looking at the existing GitLab runner executors, a Lambda-based executor did of course not exist yet.
There was, however, a shell executor, which looked like it could be re-purposed to run builds inside a serverless function. The shell executor is a simple executor that allows you to execute builds locally on the machine where the runner is installed. In our case this "machine" would simply be our Lambda function. Going over some more gitlab-runner documentation, the run-single command could then be used to execute a single build and exit. Additionally, this command does not need a configuration file, as all options can be specified via parameters, which is perfect for our Lambda function.

While in theory this should work, we still needed a solution for actually running all of this from within a Lambda function. AWS recently released Lambda Layers, which allows us to share "libraries, a custom runtime, or other dependencies" with multiple Lambda functions. This was the last missing piece of the puzzle. The rough concept was now to trigger the GitLab runner in run-single mode with the shell executor from a simple golang Lambda function. This GitLab runner would then execute assigned jobs as usual, while our dependencies, like gitlab-runner itself or a terraform binary, would be provided by Lambda Layers.

Building the Lambda Layers

Getting started with Lambda Layers is as simple as uploading a zip file with the content you want to share with your functions. The content will then be available in the Lambda execution context within the /opt/ directory. After uploading the gitlab-runner binary to a Lambda Layer and writing a simple golang Lambda function to execute it with the necessary parameters, I experienced the first minor success: when invoking the Lambda function, the builds that were tagged with the Lambda runner started to execute!

📷 first contact

While this showed that the connection was working, the initial pre-build step of checking out the git repository was failing. Also, there were a lot of confusing "id" outputs on the screen, but we will get to that later. Apparently the Lambda runtime does not come with git pre-installed, which of course makes sense. As providing gitlab-runner to the function with Lambda Layers had worked flawlessly, I attempted to do the same with the git binary. According to the Lambda docs, the runtime is based on the Amazon Linux AMI, so getting a compatible binary was straightforward. Next try.

📷

While the git binary itself started cloning the repository, it looked like some dependencies were missing. After a lot of debugging, I noticed a missing "git-core" binary. During the debug session I stumbled upon the aws-lambda-container-image-converter, aka "img2lambda". This looked like an awesome tool to build custom Lambda Layers based on Dockerfiles. After giving it a try and adding the missing git-core binary to the Lambda Layer, the error message was gone, but the content was still not available on the filesystem. I quickly realized that copying over one missing dependency after another would take forever, so I needed a different approach. This is when I found git-lambda-layer, a ready-to-use Lambda Layer with git installed. I highly recommend looking at their Dockerfile, as it gives great insight into how layers with multiple dependencies can be built. Switching to this layer worked wonders and I finally got the first successful build.

📷 echo from the other side

Running a terraform job on Lambda

As the basics were now finally working, the last step was to actually run a terraform job to create an S3 bucket in one of our AWS accounts.
Img2lambda came in very handy here again, so adding the terraform binary was straightforward. I also added a simple main.tf file to create the bucket, assigned an IAM role with full S3 access to the Lambda function and, voila, mission accomplished:

📷 first s3 bucket created from a serverless gitlab runner

Evaluation and Further Work

In the end this proof of concept was a full success, as all the criteria defined upfront were evaluated successfully. However, there are of course still issues that need to be solved. The main problem is triggering the Lambda function when a build job requires it. During the proof of concept I manually triggered the Lambda function after a build job started. This could be solved by regularly triggering the Lambda function via a CloudWatch schedule, however it would not be very efficient. Ideally the Lambda should be triggered from the GitLab server directly. Another solution would be to implement a GitLab runner "lambda" executor that listens for incoming jobs and then triggers the Lambda function. Further possibilities can be evaluated in a future proof of concept.

There is also still the issue of the "id" error messages during pipeline execution. Running the "id" command manually shows that there is indeed no name for the Lambda user's group:

uid=488(sbx_user1059) gid=487 groups=487

I traced the id calls down to the /etc/profile inside the Lambda function, which of course gets triggered by the shell executor, but I have not found a solution here yet. Feel free to leave a comment if you have an idea on how to solve this.

Try it out yourself

As a lot of people are excited to run builds on Lambda, I open-sourced all the necessary files, as well as a deployment script to get you started. Check it out here: https://gitlab.com/msvechla/gitlab-lambda-runner

While it does not make sense to run most traditional build jobs inside a Lambda function, there are use cases where it can make sense. If you are interested in working on this, feel free to open a PR or let me know what you think about this on Twitter.
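To give a flavour of what the golang Lambda function described above does, here is a minimal sketch. It is not the author's actual code; the GitLab URL and runner token are placeholders, and it assumes gitlab-runner is provided via a Lambda Layer under /opt. The handler simply shells out to gitlab-runner in run-single mode with the shell executor.

```go
package main

import (
	"os"
	"os/exec"

	"github.com/aws/aws-lambda-go/lambda"
)

func handler() error {
	// gitlab-runner is expected to be available via a Lambda Layer under /opt.
	cmd := exec.Command("/opt/gitlab-runner", "run-single",
		"--url", "https://gitlab.com/",       // placeholder GitLab instance
		"--token", os.Getenv("RUNNER_TOKEN"), // placeholder runner token
		"--executor", "shell",                // re-purposed shell executor
		"--max-builds", "1",                  // pick up a single job, then exit
		"--wait-timeout", "30",               // give up if no job arrives in time
	)
	// Builds need a writable working directory; /tmp is the only one in Lambda.
	cmd.Dir = "/tmp"
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	lambda.Start(handler)
}
```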