
16 November 2023

AWS cost savings with a simple solution

Leo Howard

Scaling Down Redshift Clusters and RDS DBs

Overview

    In my team, I have been responsible for migrating the old AWS account to a new account with new environments (dev/staging/uat/prod). The old account had been built through a mix of manual creation and CloudFormation.
    My approach to the new account has been to do as much as possible via Terraform.
    All four envs now look pretty much identical and have the likes of Lambdas, Buckets, Redshift Clusters, RDS DBs, VPCs etc. all up and running.
    However, since this new account isn’t being officially used yet, there is no point having costly AWS features running.
    The biggest expenditures for us are EC2 Instances, Redshift Clusters and RDS DBs. We need our instances up at all times, but we decided to control when the Redshift Clusters and RDS DBs would come on.
    It was decided as follows:
  • DEV – Redshift would be paused between 1800 and 0630 UTC, and the RDS DB would be turned on for one hour per day for maintenance reasons
  • STAGING – Redshift would be turned off until further notice, and the RDS DB would be turned on for one hour per day for maintenance reasons
  • UAT – Redshift would be turned off until further notice, and the RDS DB would be turned on for one hour per day for maintenance reasons
  • PROD – Redshift would be kept on all day, and the RDS DB would be turned on for one hour per day for maintenance reasons
    I wasn’t able to find everything I needed in one place online to solve these issues, so hopefully this will help anyone in the future who is looking for a quick fix.

Redshift

    Assuming your Cluster is up and running normally, this is all you need to implement in order to satisfy the above requirements for each env.
    Here we have the pause action:
resource "aws_redshift_scheduled_action" "redshift_pause_scheduled_action" {
  name     = "tf-redshift-pause-scheduled-action"
  schedule = "cron(0 18 * * ? *)"
  iam_role = aws_iam_role.redshift_pause_resume_role.arn
  enable   = var.env == "prod" ? false : true
  target_action {
    pause_cluster {
      cluster_identifier = "redshift-cluster-${var.env}"
    }
  }
}
    The cron job is set to pause the cluster at 1800 UTC each night.
    Since I want to keep the cluster on constantly in prod, there is a conditional statement in there that allows the prod env to ignore the pause action.
    Ensure you put the name of your cluster in the right place.
    In my case, this is done through cluster_identifier = "redshift-cluster-${var.env}"
    The var.env relates to a pre-defined variable:
variable "env" {
  description = "Environment infrastructure is being deployed to"
  type        = string
  default     = "dev"
}
    Then in my .tfvars file for each env, this variable is hardcoded:
env         = "dev"
    Next we have the resume action:
resource "aws_redshift_scheduled_action" "redshift_resume_scheduled_action" {
  name     = "tf-redshift-resume-scheduled-action"
  schedule = "cron(30 6 * * ? *)"
  iam_role = aws_iam_role.redshift_pause_resume_role.arn
  #  enable   = var.env == "prod" ? false : true # when we want to start up staging and uat again, use this line and remove one below
  enable = var.env == "dev" ? true : false

  target_action {
    resume_cluster {
      cluster_identifier = "redshift-cluster-${var.env}"
    }
  }
}
    The cron job is set to resume the cluster at 0630 UTC each morning.
    The original conditional statement (which is now commented out for later use) was the same as in the pause schedule where it would only apply to envs that were not prod.
    Now, because we want prod on constantly and we only want dev to resume again (and leave staging/uat paused), I changed the statement to only apply to the dev env since it automatically does not apply to prod since prod was never paused.
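    The two enable conditionals can be summarised as a pair of predicates. Below is a hypothetical Python sketch of that logic (the helper names are mine, not part of the Terraform), just to make the per-env behaviour explicit:

```python
# Sketch of the `enable` logic in the two scheduled actions.
# Hypothetical helper names; the real logic lives in the Terraform conditionals.

def pause_enabled(env: str) -> bool:
    # enable = var.env == "prod" ? false : true
    return env != "prod"

def resume_enabled(env: str) -> bool:
    # enable = var.env == "dev" ? true : false
    return env == "dev"

for env in ("dev", "staging", "uat", "prod"):
    print(env, pause_enabled(env), resume_enabled(env))
# dev pauses and resumes; staging/uat pause and stay paused; prod never pauses.
```

    Running this confirms the intended outcome: only dev cycles daily, staging and uat pause once and stay off, and prod is untouched.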
    Finally, we have the necessary IAM role to grant permissions allowing the actions to actually pause the cluster:
resource "aws_iam_role" "redshift_pause_resume_role" {
  name        = "${var.customer_name}-${var.project_name}-redshift-scheduled-action-role-${var.env}"
  description = "Redshift Pause Resume Scheduled Role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "scheduler.redshift.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      },
    ]
  })

  path = "/"
  managed_policy_arns = [
    "arn:aws:iam::aws:policy/AmazonRedshiftFullAccess"
  ]
}
    Hopefully that was all pretty straightforward (assuming you are familiar with Terraform and AWS).

RDS DB

    This one was a bit more tricky. It involved creating a lambda in Python, then in Terraform I needed to set up the lambda IaC, IAM permissions and CloudWatch event rules.
    Let’s start with the Python lambda:
import os

import boto3


def lambda_handler(event, context):
    rds = boto3.client('rds')
    env = os.getenv("env")

    db_instance_identifier = f'sql-server-ingestion-data-{env}'

    # Check the current state of the RDS instance
    response = rds.describe_db_instances(DBInstanceIdentifier=db_instance_identifier)
    db_instance = response['DBInstances'][0]
    db_instance_status = db_instance['DBInstanceStatus']

    # Start the RDS instance if not running
    if db_instance_status != 'available':
        rds.start_db_instance(DBInstanceIdentifier=db_instance_identifier)
        return f"Starting RDS instance: {db_instance_identifier}"

    # Stop the RDS instance if running
    else:
        rds.stop_db_instance(DBInstanceIdentifier=db_instance_identifier)
        return f"Stopping RDS instance: {db_instance_identifier}"
    Feel free to use this, but remember to set the db_instance_identifier to whatever name yours is called. In my case, the RDS DB name is the same across the envs but with the specific env tacked on to the end each time. I get this from var.env to ensure the correct DB is queried (as explained in the Redshift Cluster section).
    Basically, at the time of each cron trigger, the Lambda checks whether the DB is currently available (on). If it isn’t, it gets started; if it is, it gets stopped. So the morning trigger starts the DB and the next trigger an hour later stops it again.
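    You can exercise this toggle behaviour locally with a stub client. FakeRDS below is a hypothetical stand-in for boto3.client('rds') (same method names, no AWS calls), purely to illustrate how the handler flips state on each invocation:

```python
# Hypothetical stub standing in for boto3.client('rds'), to show how
# the handler's decision logic toggles the instance on each trigger.
class FakeRDS:
    def __init__(self, status):
        self.status = status

    def describe_db_instances(self, DBInstanceIdentifier):
        return {'DBInstances': [{'DBInstanceStatus': self.status}]}

    def start_db_instance(self, DBInstanceIdentifier):
        self.status = 'available'

    def stop_db_instance(self, DBInstanceIdentifier):
        self.status = 'stopped'


def toggle(rds, db_id):
    # Same decision logic as the Lambda handler above.
    response = rds.describe_db_instances(DBInstanceIdentifier=db_id)
    status = response['DBInstances'][0]['DBInstanceStatus']
    if status != 'available':
        rds.start_db_instance(DBInstanceIdentifier=db_id)
        return 'starting'
    rds.stop_db_instance(DBInstanceIdentifier=db_id)
    return 'stopping'


rds = FakeRDS('stopped')
print(toggle(rds, 'sql-server-ingestion-data-dev'))  # starting (the 0800 trigger)
print(toggle(rds, 'sql-server-ingestion-data-dev'))  # stopping (the 0900 trigger)
```

    The same Lambda serves both triggers; the schedule times, not the code, determine whether a given invocation starts or stops the DB.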
    Here is the IaC lambda set up:
data "archive_file" "rds_start_stop_lambda_archive" {
  type        = "zip"
  source_dir  = "src/rds/"
  output_path = "archive/rds.zip"
}

resource "aws_lambda_function" "rds_start_stop_lambda" {

  depends_on = [data.archive_file.rds_start_stop_lambda_archive]

  filename         = data.archive_file.rds_start_stop_lambda_archive.output_path
  function_name    = "${var.customer_name}-${var.project_name}-rds-start-stop-lambda-${var.env}"
  description      = "Lambda to Start/Stop RDS DB ${var.env}"
  role             = aws_iam_role.rds_start_stop_lambda_role.arn
  handler          = "app.lambda_handler"
  source_code_hash = data.archive_file.rds_start_stop_lambda_archive.output_base64sha256
  runtime          = "python3.9"
  timeout          = 60
  memory_size      = 128
  layers = [
  ]

  environment {
    variables = {
      "env" = var.env
    }
  }
}
    Pretty standard Terraform Lambda setup. This is a structure I really like, and I have made it the standard across all the team repos containing IaC.
    Next we need the start and stop triggers under the lambda set up above.
    Remember to create two event triggers, one for starting and one for stopping the DB. I completely forgot these and it left me scratching my head for a while.
    Here is the start trigger:
resource "aws_lambda_permission" "rds_start_trigger" {
  statement_id  = "AllowExecutionFromEventBridgeStart"
  action        = "lambda:InvokeFunction"
  principal     = "events.amazonaws.com"
  function_name = aws_lambda_function.rds_start_stop_lambda.function_name
  source_arn    = aws_cloudwatch_event_rule.rds_start_lambda_schedule_rule.arn
}
    And here is the stop trigger:
resource "aws_lambda_permission" "rds_stop_trigger" {
  statement_id  = "AllowExecutionFromEventBridgeStop"
  action        = "lambda:InvokeFunction"
  principal     = "events.amazonaws.com"
  function_name = aws_lambda_function.rds_start_stop_lambda.function_name
  source_arn    = aws_cloudwatch_event_rule.rds_stop_lambda_schedule_rule.arn
}
    Note that the statement_id must be different in each trigger (essential). Use the same action and principal values as I have set in my own code. Attach each trigger to the Lambda you created earlier via function_name, and attach the respective rule (see next section) to each trigger via source_arn.
    Now we add the CloudWatch start/stop rules and targets.
    Below is the rule and target to start the DB. As you can see, the cron job will trigger at 0800 UTC each day:
# Event for starting RDS DB

resource "aws_cloudwatch_event_rule" "rds_start_lambda_schedule_rule" {
  name                = "${var.customer_name}-${var.project_name}-rds-start-schedule-rule-${var.env}"
  description         = "Schedule to start the RDS DB"
  schedule_expression = "cron(0 8 * * ? *)"
}

resource "aws_cloudwatch_event_target" "rds_start_lambda_target" {
  rule      = aws_cloudwatch_event_rule.rds_start_lambda_schedule_rule.name
  target_id = "${var.customer_name}-${var.project_name}-rds-start-schedule-target-${var.env}"
  arn       = aws_lambda_function.rds_start_stop_lambda.arn
}
    Below is the rule and target to stop the DB. As you can see, the cron job will trigger at 0900 UTC each day:
# Event for stopping RDS DB

resource "aws_cloudwatch_event_rule" "rds_stop_lambda_schedule_rule" {
  name                = "${var.customer_name}-${var.project_name}-rds-stop-schedule-rule-${var.env}"
  description         = "Schedule to stop the RDS DB"
  schedule_expression = "cron(0 9 * * ? *)"
}

resource "aws_cloudwatch_event_target" "rds_stop_lambda_target" {
  rule      = aws_cloudwatch_event_rule.rds_stop_lambda_schedule_rule.name
  target_id = "${var.customer_name}-${var.project_name}-rds-stop-schedule-target-${var.env}"
  arn       = aws_lambda_function.rds_start_stop_lambda.arn
}
    The reason for this one-hour window for the DB in all the envs is to allow for necessary maintenance. AWS also automatically restarts an RDS instance that has been stopped for more than seven days (precisely for this maintenance need), so it cannot simply be left off indefinitely anyway.
    Remember to give your rule and target a clearly identifiable name and description, and ensure the cron expression matches your intentions. AWS cron expressions differ from the standard five-field crontab format produced by most online generators.
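    For reference, EventBridge cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year) rather than the usual five, and one of the two day fields must be ?. A quick, hypothetical sanity check:

```python
def looks_like_aws_cron(expr: str) -> bool:
    # EventBridge schedule expressions wrap six fields in cron(...),
    # unlike standard five-field crontab entries, and require a '?'
    # in either day-of-month or day-of-week.
    if not (expr.startswith("cron(") and expr.endswith(")")):
        return False
    fields = expr[len("cron("):-1].split()
    return len(fields) == 6 and "?" in (fields[2], fields[4])

print(looks_like_aws_cron("cron(0 8 * * ? *)"))  # True  (EventBridge format)
print(looks_like_aws_cron("0 8 * * *"))          # False (standard crontab)
```

    This is why an expression copied straight from an online crontab generator usually fails validation until you add the year field and the ?.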
    Finally we have the required IAM role and policies to get this all working:
# IAM for starting and stopping RDS DB

resource "aws_iam_role" "rds_start_stop_lambda_role" {

  name        = "${var.customer_name}-${var.project_name}-rds-start-stop-role-${var.env}"
  description = "RDS DB Start Stop Role ${var.env}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Sid    = ""
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })

  path = "/"
  managed_policy_arns = [
    "arn:aws:iam::aws:policy/AmazonRDSDataFullAccess",
    "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
    "arn:aws:iam::aws:policy/service-role/AWSLambdaRole",
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  ]
}

resource "aws_iam_policy" "rds_policy" {
  name        = "${var.customer_name}-${var.project_name}-rds-policy-${var.env}"
  description = "Policy for RDS permissions ${var.env}"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        "Action" : [
          "rds:DescribeDBInstances",
          "rds:DescribeDBClusters",
          "rds:StartDBInstance",
          "rds:StopDBInstance",
          "rds:StartDBCluster",
          "rds:StopDBCluster",
          "rds:ListTagsForResource"
        ],
        Effect   = "Allow",
        Resource = "arn:aws:rds:eu-west-2:${var.aws_account}:db:*"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "describe_db_instances_attachment" {
  policy_arn = aws_iam_policy.rds_policy.arn
  role       = aws_iam_role.rds_start_stop_lambda_role.name
}
    You probably won’t need as many managed policy ARNs as I have included, but I like to cover the bases.
    I have also scoped the RDS policy to a specific DB resource ARN for security reasons. Just like with var.env earlier in the article, here I pass in var.aws_account, which supplies the 12-digit AWS account ID assigned to each env in the .tfvars file.
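    The resource ARN in the policy follows the standard RDS instance ARN format. A small, hypothetical helper (the function name is mine) showing how the pieces combine:

```python
def rds_db_arn(account_id: str, region: str = "eu-west-2", db: str = "*") -> str:
    # RDS instance ARN format: arn:aws:rds:<region>:<account-id>:db:<db-instance-identifier>
    return f"arn:aws:rds:{region}:{account_id}:db:{db}"

print(rds_db_arn("123456789012"))  # arn:aws:rds:eu-west-2:123456789012:db:*
```

    Using "*" as the identifier, as in the policy above, scopes access to all DB instances within that one account and region rather than everywhere.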
    That is pretty much it!
    Push your code, get it into your environments and check that it all works.
    Below you can see the status of the Redshift cluster in the dev, staging, uat and prod envs respectively:
    Redshift Cluster in Dev env:

    Redshift Cluster in Staging env:

    Redshift Cluster in UAT env:

    Redshift Cluster in Prod env:

    And below are some screenshots of the Lambda triggers in the Prod env and the resulting status of the RDS DB:

    EventBridge triggers for starting/stopping RDS DB:

    Status of RDS DB in Prod env after applying the Start/Stop Lambda:

    I wrote this mostly for my own benefit, as it was a good little project, but since it is such a simple and useful thing to do, I expect others out there will appreciate an easy guide to saving their company money.
    Bring it up as your own idea and take all the praise when you show how much you save the team per month!