NimTechnology

Explaining CLOUD technologies in an easy-to-understand way.


When K8s pods are stuck mounting large volumes

Posted on April 7, 2023 by nim

Refer: https://blog.devgenius.io/when-k8s-pods-are-stuck-mounting-large-volumes-2915e6656cb8

Recently we ran into the following problem with our Loki deployment on AWS/EKS: on every deployment or restart of a Loki Pod, mounting the persistent volume took longer and longer. It started with a delay of a few minutes and ended up at nearly 25 minutes on our production cluster. With no solution at hand we avoided new deployments where possible, knowing this was not an acceptable workaround.

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               23m50s               default-scheduler        Successfully assigned default/filecr34t0r-0 to ip-100-64-8-204.eu-central-1.compute.internal
  Normal   SuccessfulAttachVolume  23m48s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ef3366b8-464c-11ed-b878-0242ac120002"
  Warning  FailedMount             5m43s (x6 over 18m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol], unattached volumes=[vol kube-api-access-7wzcs]: timed out waiting for the condition
  Normal   Pulled                  106s                 kubelet                  Container image "grafana/loki:2.6.1" already present on machine
  Normal   Created                 106s                 kubelet                  Created container loki
  Normal   Started                 106s                 kubelet                  Started container loki

Then I began to investigate the matter. On test and prod we use automatically provisioned gp3 volumes. The AWS volume monitoring showed heavy I/O activity during the mount time. The volume on test had about 1.3 million files and the mount took about 7 minutes. On prod the volume had 4.3 million files and needed 24 minutes to mount. OK, it seems to correlate with the number of files. With gp3's baseline of 3,000 IOPS we can do the following calculation:

  • Test: 1,300,000 / 3,000 / 60 ≈ 7.2 minutes
  • Prod: 4,300,000 / 3,000 / 60 ≈ 23.9 minutes
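The back-of-the-envelope estimate above can be sketched as follows; the file counts and the 3,000 IOPS gp3 baseline are taken from the measurements above, and `est_minutes` is just an illustrative helper name:

```python
# Rough mount-delay estimate: roughly one metadata operation (chown/chmod)
# per file, throttled by the volume's IOPS limit.

def est_minutes(files: int, iops: int = 3000) -> float:
    """Estimated duration (in minutes) of the recursive permission walk."""
    return files / iops / 60

print(f"test: {est_minutes(1_300_000):.1f} min")  # ~7.2
print(f"prod: {est_minutes(4_300_000):.1f} min")  # ~23.9
```

The estimate matches the observed 7 and 24 minutes closely enough to support the hypothesis.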

By searching K8s docs and blogs I found the solution: Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod’s securityContext when that volume is mounted. For large volumes, checking and changing ownership and permissions can take a lot of time, slowing Pod startup.
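To illustrate why this cost scales with the number of files, here is a much-simplified, hypothetical sketch of such a recursive walk. The real logic lives in the kubelet's volume manager and also adjusts permission bits; `recursive_fsgroup` is our own illustrative name, and `os.lchown` requires a POSIX system:

```python
import os

def recursive_fsgroup(path: str, gid: int) -> int:
    """Walk `path`, change the group of every entry to `gid`,
    and return the number of entries touched."""
    touched = 0
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            full = os.path.join(root, name)
            os.lchown(full, -1, gid)  # -1 leaves the owning uid unchanged
            touched += 1
    return touched
```

Every single file costs at least one metadata I/O, which is why millions of files on an IOPS-limited volume add up to many minutes.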

With the fsGroupChangePolicy field inside a securityContext you can control the way that Kubernetes checks and manages ownership and permissions for a volume. Possible values:

  • OnRootMismatch: only change permissions and ownership if the permissions and ownership of the root directory do not match the volume's expected values. This can drastically shorten the time it takes to change ownership and permissions of a volume.
  • Always: always change permissions and ownership of the volume when the volume is mounted.
template:
  spec:
    containers:
      ...
    securityContext:
      fsGroup: 10001
      runAsGroup: 10001
      runAsNonRoot: true
      runAsUser: 10001
      fsGroupChangePolicy: "OnRootMismatch"

With this modification the startup of our Loki instance went back to under two minutes.

This all only applies if your Deployment or StatefulSet has configured a securityContext, which you hopefully have done. 😉

Addendum: the huge number of files resulted from Loki producing many chunks, which in turn was caused by an overly liberal use of custom labels. We have since reduced the number of labels, and as the number of files shrinks, queries in Grafana are getting faster too.
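The article doesn't show how the labels were reduced, but as a hypothetical illustration: if the logs are shipped with promtail, Prometheus-style relabeling can cap the label set before streams reach Loki. The job name and label names below are assumptions, not the actual config:

```yaml
# Hypothetical promtail scrape_configs fragment: keep only a small, bounded
# label set so Loki creates fewer streams, and therefore fewer chunk files.
scrape_configs:
  - job_name: kubernetes-pods
    relabel_configs:
      - action: labelkeep
        regex: (namespace|app|container)
```

Fewer, lower-cardinality labels mean fewer streams, fewer chunks on the volume, and faster queries.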



Copyright © 2026 NimTechnology.