[AWS] Discovering how to design Cluster Autoscaler on EKS.

Posted on September 29, 2022 (updated September 20, 2024) by nim

One fine day you notice that your EKS cluster automatically increases the number of nodes and then scales back down again.
If you don't know why, let's explore the Cluster Autoscaler on EKS.

Contents

  • 1) Look into Cluster Autoscaler on EKS
  • 2) Install Cluster Autoscaler on EKS
  • 3) Cluster Autoscaler Testing
  • Issues
    • Prevent the Cluster Autoscaler from evicting the pods when scaling down nodes
  • Pods announce "X Insufficient cpu, 1 Insufficient memory".
  • How does the cluster-autoscaler delete nodes efficiently?

1) Look into Cluster Autoscaler on EKS

When pods go into an insufficient-resource (Pending) state, the Cluster Autoscaler scales the node group up.
When a node goes into an idle state and no pods are left running on it, the Cluster Autoscaler announces that it will delete that node.

2) Install Cluster Autoscaler on EKS

To follow this part, I think you should read the article below first.

[AWS] EKS IAM Roles for Service Accounts (IRSA) using Terraform
# Resource: IAM Policy for Cluster Autoscaler
resource "aws_iam_policy" "cluster_autoscaler_iam_policy" {
  name        = "${local.name}-AmazonEKSClusterAutoscalerPolicy"
  path        = "/"
  description = "EKS Cluster Autoscaler Policy"

  # Terraform's "jsonencode" function converts a
  # Terraform expression result to valid JSON syntax.
  policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions",
                "ec2:DescribeInstanceTypes"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
})
}

# Resource: IAM Role for Cluster Autoscaler
## Create IAM Role and associate it with Cluster Autoscaler IAM Policy
resource "aws_iam_role" "cluster_autoscaler_iam_role" {
  name = "${local.name}-cluster-autoscaler"

  # Terraform's "jsonencode" function converts a Terraform expression result to valid JSON syntax.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Sid    = ""
        Principal = {
          Federated = "${data.terraform_remote_state.eks.outputs.aws_iam_openid_connect_provider_arn}"
        }
        Condition = {
          StringEquals = {
            "${data.terraform_remote_state.eks.outputs.aws_iam_openid_connect_provider_extract_from_arn}:sub": "system:serviceaccount:kube-system:cluster-autoscaler"
          }
        }        
      },
    ]
  })

  tags = {
    tag-key = "cluster-autoscaler"
  }
}


# Associate IAM Policy to IAM Role
resource "aws_iam_role_policy_attachment" "cluster_autoscaler_iam_role_policy_attach" {
  policy_arn = aws_iam_policy.cluster_autoscaler_iam_policy.arn 
  role       = aws_iam_role.cluster_autoscaler_iam_role.name
}

output "cluster_autoscaler_iam_role_arn" {
  description = "Cluster Autoscaler IAM Role ARN"
  value = aws_iam_role.cluster_autoscaler_iam_role.arn
}

resource "aws_iam_policy" "cluster_autoscaler_iam_policy" {}
==> Creates the IAM policy that allows describing and acting on the Auto Scaling groups.

resource "aws_iam_role" "cluster_autoscaler_iam_role" {}
==> Creates the IAM role (assumed via the OIDC web identity) that lets the cluster-autoscaler service account use those Auto Scaling permissions.

Now proceed to get the cluster credentials for the Helm install.

# Datasource: EKS Cluster Auth 
data "aws_eks_cluster_auth" "cluster" {
  name = data.terraform_remote_state.eks.outputs.cluster_id
}

# HELM Provider
provider "helm" {
  kubernetes {
    host                   = data.terraform_remote_state.eks.outputs.cluster_endpoint
    cluster_ca_certificate = base64decode(data.terraform_remote_state.eks.outputs.cluster_certificate_authority_data)
    token                  = data.aws_eks_cluster_auth.cluster.token
  }
}
# Install Cluster Autoscaler using HELM

# Resource: Helm Release 
resource "helm_release" "cluster_autoscaler_release" {
  depends_on = [aws_iam_role.cluster_autoscaler_iam_role ]            
  name       = "${local.name}-ca"

  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"

  namespace = "kube-system"   

  set {
    name  = "cloudProvider"
    value = "aws"
  }

  set {
    name  = "autoDiscovery.clusterName"
    value = data.terraform_remote_state.eks.outputs.cluster_id
  }

  set {
    name  = "awsRegion"
    value = var.aws_region
  }

  set {
    name  = "rbac.serviceAccount.create"
    value = "true"
  }

  set {
    name  = "rbac.serviceAccount.name"
    value = "cluster-autoscaler"
  }

  set {
    name  = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = "${aws_iam_role.cluster_autoscaler_iam_role.arn}"
  }
  # Additional Arguments (Optional) - To Test How to pass Extra Args for Cluster Autoscaler
  #set {
  #  name = "extraArgs.scan-interval"
  #  value = "20s"
  #}    
   
}

Now you have the cluster-autoscaler Helm chart applied to the Kubernetes cluster.

Outputs:

cluster_autoscaler_helm_metadata = tolist([
  {
    "app_version" = "1.23.0"
    "chart" = "cluster-autoscaler"
    "name" = "sap-dev-ca"
    "namespace" = "kube-system"
    "revision" = 1
    "values" = "{\"autoDiscovery\":{\"clusterName\":\"SAP-dev-eksdemo\"},\"awsRegion\":\"us-east-1\",\"cloudProvider\":\"aws\",\"rbac\":{\"serviceAccount\":{\"annotations\":{\"eks.amazonaws.com/role-arn\":\"arn:aws:iam::250887682577:role/SAP-dev-cluster-autoscaler\"},\"create\":true,\"name\":\"cluster-autoscaler\"}}}"
    "version" = "9.21.0"
  },
])
cluster_autoscaler_iam_role_arn = "arn:aws:iam::250887682577:role/SAP-dev-cluster-autoscaler"

Now recheck a few things:

 kubectl -n kube-system get pods

You can view the logs of the cluster-autoscaler pod:

kubectl -n kube-system logs -f $(kubectl -n kube-system get pods | egrep -o 'hr-dev-ca-aws-cluster-autoscaler-[A-Za-z0-9-]+')

Check the ServiceAccount:

root@work-space-u20:~# kubectl -n kube-system describe sa cluster-autoscaler
Name:                cluster-autoscaler
Namespace:           kube-system
Labels:              app.kubernetes.io/instance=sap-dev-ca
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=aws-cluster-autoscaler
                     helm.sh/chart=cluster-autoscaler-9.21.0
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::250887682577:role/SAP-dev-cluster-autoscaler
                     meta.helm.sh/release-name: sap-dev-ca
                     meta.helm.sh/release-namespace: kube-system
Image pull secrets:  <none>
Mountable secrets:   cluster-autoscaler-token-bzdgg
Tokens:              cluster-autoscaler-token-bzdgg
Events:              <none>

3) Cluster Autoscaler Testing

cluster-autoscaler-sample-app.yaml
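The original cluster-autoscaler-sample-app.yaml is not reproduced here; the sketch below is an assumed, minimal equivalent (the nginx image, name, and request sizes are illustrative, not the author's exact file). What matters is that the pods declare CPU/memory requests, because the Cluster Autoscaler reacts to pending pods that no longer fit on the existing nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler-sample-app
  template:
    metadata:
      labels:
        app: cluster-autoscaler-sample-app
    spec:
      containers:
        - name: nginx
          image: nginx:stable        # illustrative image
          resources:
            requests:
              cpu: "200m"            # the requests are what the autoscaler sums up
              memory: "256Mi"
            limits:
              cpu: "400m"
              memory: "512Mi"

Scale it out, for example with kubectl scale deployment cluster-autoscaler-sample-app --replicas=30, then watch the Pending pods and the node count.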


You can notice that once many pods are started, the nodes begin to scale up.

https://github.com/mrnim94/terraform-aws/tree/master/elk-cluster-autoscaler
You can refer to the files in this repo.

Issues

Prevent the Cluster Autoscaler from evicting the pods when scaling down nodes

The Cluster Autoscaler evaluates nodes for scale-down:

  • Underutilization Check: The CA periodically checks if nodes are underutilized (i.e., the resources are not being fully utilized by the pods running on them). This involves evaluating the resource requests of all pods on each node against the node’s total capacity.
  • Candidate Nodes: Nodes that have been underutilized for a configurable period (default is 10 minutes) are marked as candidates for scale down.

Next, the Cluster Autoscaler simulates pod eviction.

  • Eviction Simulation: For each candidate node, the CA simulates the eviction of all pods on that node. It then tries to reschedule these pods onto other nodes in the cluster, taking into account the resource requests, affinity/anti-affinity rules, taints, and tolerations.
  • Safety Checks: During simulation, the CA checks several conditions to ensure that scaling down is safe, including:
    • Ensuring pod disruption budgets (PDBs) are respected.
    • Verifying that no critical system pods are evicted, or if they are, that they can be successfully rescheduled.
    • Checking if pods are marked with cluster-autoscaler.kubernetes.io/safe-to-evict: "false", indicating they should not be evicted.

So, if you do not want the autoscaler to scale down a node that contains an important pod of yours, you can use the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false", as in the sketch below.
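A minimal sketch (the deployment name and image are illustrative) showing where the annotation goes; note that it must be set on the pod template, not on the Deployment metadata:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: important-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: important-app
  template:
    metadata:
      labels:
        app: important-app
      annotations:
        # Tell the Cluster Autoscaler not to evict this pod,
        # so the node running it will not be scaled down.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: app
          image: nginx:stable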

Refer to:
https://github.com/kubernetes/autoscaler/issues/3183

Pods announce "X Insufficient cpu, 1 Insufficient memory".

Although ...

How does the cluster-autoscaler delete nodes efficiently?

You see the Cluster Autoscaler scaling up too many nodes, and you see the log below:

I0920 08:15:36.290137       1 eligibility.go:162] Node ip-10-195-95-231.us-west-2.compute.internal unremovable: memory requested (56.8488% of allocatable) is above the scale-down utilization threshold

From the log above you can see that the Cluster Autoscaler takes the total memory requested by the pods on a node and divides it by the node's allocatable memory.

By default, the cluster autoscaler is set to:
scale-down-utilization-threshold: 0.5
This means the requested-to-allocatable ratio must be below 50% before the node can be removed; in the log above it is 56.85%, so the node is marked unremovable.

For a development environment, you should raise this to 0.8 to save more cost; a sketch of how to pass the flag is shown below.
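With the community cluster-autoscaler Helm chart, this flag can be passed through extraArgs, the same mechanism as the commented-out scan-interval example in the Terraform helm_release above (for example set { name = "extraArgs.scale-down-utilization-threshold", value = "0.8" }). As a Helm values sketch, assuming 0.8 is the value you want for a dev cluster:

# Raise the scale-down threshold so nodes with up to 80% of their
# allocatable resources requested can still be considered for removal.
extraArgs:
  scale-down-utilization-threshold: 0.8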

