replicaset controller analysis

Introduction to the replicaset controller

The replicaset controller is one of the many controllers in the kube-controller-manager component and is the controller for the replicaset resource object. It watches two kinds of resources, replicasets and pods; whenever either of them changes, the replicaset controller is triggered to reconcile the corresponding replicaset object toward its desired replica count, creating pods when the actual number of pods falls short of the desired count and deleting pods when it exceeds it.

The main job of the replicaset controller is to compare the number of pods desired by a replicaset object with the number of pods that currently exist, and then create or delete pods based on that comparison until the number of existing pods matches the desired count.

replicaset controller architecture diagram

The rough composition and processing flow of the replicaset controller are shown in the diagram below: the replicaset controller registers event handlers for pod and replicaset objects; when an event is watched, the corresponding replicaset object is put into a queue; the syncReplicaSet method, which holds the core reconciliation logic, then takes replicaset objects out of the queue and reconciles them.

The replicaset controller analysis is split into three parts:

(1) initialization and startup of the replicaset controller;

(2) the core processing logic of the replicaset controller;

(3) the expectations mechanism of the replicaset controller.

This post covers the core processing logic of the replicaset controller.

Analysis of the replicaset controller core processing logic

Based on Kubernetes v1.17.4.

From the earlier analysis of the replicaset controller's initialization and startup, we know that it watches add, update and delete events of replicaset and pod objects and then reconciles the corresponding replicaset object. Here we continue with that reconciliation (core processing) logic, using rsc.syncHandler as the entry point.

rsc.syncHandler

rsc.syncHandler is the rsc.syncReplicaSet method. Its main logic:

(1) get the replicaset object and the list of pods it owns;

(2) call rsc.expectations.SatisfiedExpectations to check whether the pod create/delete operations expected by the previous round of reconciliation have completed, i.e. whether the rsc.manageReplicas call made during the previous reconciliation of this replicaset object has finished its work;

(3) if the create/delete operations expected in the previous round are done and the replicaset object's DeletionTimestamp field is nil, call rsc.manageReplicas to do the core reconciliation of the desired replica count, i.e. create or delete pods;

(4) call calculateStatus to compute the replicaset's status and update it.

    // syncReplicaSet will sync the ReplicaSet with the given key if it has had its expectations fulfilled,
    // meaning it did not expect to see any more of its pods created or deleted. This function is not meant to be
    // invoked concurrently with the same key.
    func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
        startTime := time.Now()
        defer func() {
            klog.V(4).Infof("Finished syncing %v %q (%v)", rsc.Kind, key, time.Since(startTime))
        }()

        namespace, name, err := cache.SplitMetaNamespaceKey(key)
        if err != nil {
            return err
        }
        rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
        if errors.IsNotFound(err) {
            klog.V(4).Infof("%v %v has been deleted", rsc.Kind, key)
            rsc.expectations.DeleteExpectations(key)
            return nil
        }
        if err != nil {
            return err
        }

        rsNeedsSync := rsc.expectations.SatisfiedExpectations(key)
        selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
        if err != nil {
            utilruntime.HandleError(fmt.Errorf("error converting pod selector to selector: %v", err))
            return nil
        }

        // list all pods to include the pods that don't match the rs`s selector
        // anymore but has the stale controller ref.
        // TODO: Do the List and Filter in a single pass, or use an index.
        allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())
        if err != nil {
            return err
        }
        // Ignore inactive pods.
        filteredPods := controller.FilterActivePods(allPods)

        // NOTE: filteredPods are pointing to objects from cache - if you need to
        // modify them, you need to copy it first.
        filteredPods, err = rsc.claimPods(rs, selector, filteredPods)
        if err != nil {
            return err
        }

        var manageReplicasErr error
        if rsNeedsSync && rs.DeletionTimestamp == nil {
            manageReplicasErr = rsc.manageReplicas(filteredPods, rs)
        }
        rs = rs.DeepCopy()
        newStatus := calculateStatus(rs, filteredPods, manageReplicasErr)

        // Always updates status as pods come up or die.
        updatedRS, err := updateReplicaSetStatus(rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace), rs, newStatus)
        if err != nil {
            // Multiple things could lead to this update failing. Requeuing the replica set ensures
            // Returning an error causes a requeue without forcing a hotloop
            return err
        }
        // Resync the ReplicaSet after MinReadySeconds as a last line of defense to guard against clock-skew.
        if manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 &&
            updatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) &&
            updatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) {
            rsc.queue.AddAfter(key, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second)
        }
        return manageReplicasErr
    }

1 rsc.expectations.SatisfiedExpectations

This method checks whether the pod create/delete operations expected by the previous round of reconciliation have completed, i.e. whether the rsc.manageReplicas call made during the previous reconciliation of the replicaset object has finished its work. Only after the previous round of pod creations/deletions has completed may rsc.manageReplicas be called again.

It returns true if rsc.manageReplicas has never been called while reconciling this replicaset object, if the pod creations/deletions expected by the previous round have all been observed, or if the expectations have expired since the last rsc.manageReplicas call (the timeout is 5 minutes); true means the previous round of pod creations/deletions is considered finished and the next rsc.manageReplicas call may proceed. Otherwise it returns false.

expectations records how many pods a replicaset object expects to create/delete in a given round of reconciliation. As those pod creations/deletions complete, the expected counts are decremented; once the expected create and delete counts are both less than or equal to 0, the expectations of the previous round are considered met and true is returned.

The expectations mechanism is analyzed in detail later (a simplified sketch of the idea follows the source below).

    // pkg/controller/controller_utils.go
    func (r *ControllerExpectations) SatisfiedExpectations(controllerKey string) bool {
        if exp, exists, err := r.GetExpectations(controllerKey); exists {
            if exp.Fulfilled() {
                klog.V(4).Infof("Controller expectations fulfilled %#v", exp)
                return true
            } else if exp.isExpired() {
                klog.V(4).Infof("Controller expectations expired %#v", exp)
                return true
            } else {
                klog.V(4).Infof("Controller still waiting on expectations %#v", exp)
                return false
            }
        } else if err != nil {
            klog.V(2).Infof("Error encountered while checking expectations %#v, forcing sync", err)
        } else {
            // When a new controller is created, it doesn't have expectations.
            // When it doesn't see expected watch events for > TTL, the expectations expire.
            //   - In this case it wakes up, creates/deletes controllees, and sets expectations again.
            // When it has satisfied expectations and no controllees need to be created/destroyed > TTL, the expectations expire.
            //   - In this case it continues without setting expectations till it needs to create/delete controllees.
            klog.V(4).Infof("Controller %v either never recorded expectations, or the ttl expired.", controllerKey)
        }
        // Trigger a sync if we either encountered and error (which shouldn't happen since we're
        // getting from local store) or this controller hasn't established expectations.
        return true
    }

    func (exp *ControlleeExpectations) isExpired() bool {
        return clock.RealClock{}.Since(exp.timestamp) > ExpectationsTimeout // ExpectationsTimeout = 5 * time.Minute
    }

2 The core pod create/delete method: rsc.manageReplicas

This is the core method that creates and deletes pods. It compares the number of pods the replicaset desires with the number of pods that currently exist, and creates or deletes pods according to that comparison so that the two eventually match. Note in particular that each call to rsc.manageReplicas creates or deletes at most 500 pods.

rsc.manageReplicas is not necessarily executed on every reconciliation of a replicaset object: it is only called when rsc.expectations.SatisfiedExpectations returns true and the replicaset object's DeletionTimestamp is nil.

Let's take a quick look at the code first; a detailed walkthrough of the logic follows.

    // pkg/controller/replicaset/replica_set.go
    func (rsc *ReplicaSetController) manageReplicas(filteredPods []*v1.Pod, rs *apps.ReplicaSet) error {
        diff := len(filteredPods) - int(*(rs.Spec.Replicas))
        rsKey, err := controller.KeyFunc(rs)
        if err != nil {
            utilruntime.HandleError(fmt.Errorf("Couldn't get key for %v %#v: %v", rsc.Kind, rs, err))
            return nil
        }
        if diff < 0 {
            diff *= -1
            if diff > rsc.burstReplicas {
                diff = rsc.burstReplicas
            }
            // TODO: Track UIDs of creates just like deletes. The problem currently
            // is we'd need to wait on the result of a create to record the pod's
            // UID, which would require locking *across* the create, which will turn
            // into a performance bottleneck. We should generate a UID for the pod
            // beforehand and store it via ExpectCreations.
            rsc.expectations.ExpectCreations(rsKey, diff)
            glog.V(2).Infof("Too few replicas for %v %s/%s, need %d, creating %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)
            // Batch the pod creates. Batch sizes start at SlowStartInitialBatchSize
            // and double with each successful iteration in a kind of "slow start".
            // This handles attempts to start large numbers of pods that would
            // likely all fail with the same error. For example a project with a
            // low quota that attempts to create a large number of pods will be
            // prevented from spamming the API service with the pod create requests
            // after one of its pods fails. Conveniently, this also prevents the
            // event spam that those failures would generate.
            successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {
                boolPtr := func(b bool) *bool { return &b }
                controllerRef := &metav1.OwnerReference{
                    APIVersion:         rsc.GroupVersion().String(),
                    Kind:               rsc.Kind,
                    Name:               rs.Name,
                    UID:                rs.UID,
                    BlockOwnerDeletion: boolPtr(true),
                    Controller:         boolPtr(true),
                }
                err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, controllerRef)
                if err != nil && errors.IsTimeout(err) {
                    // Pod is created but its initialization has timed out.
                    // If the initialization is successful eventually, the
                    // controller will observe the creation via the informer.
                    // If the initialization fails, or if the pod keeps
                    // uninitialized for a long time, the informer will not
                    // receive any update, and the controller will create a new
                    // pod when the expectation expires.
                    return nil
                }
                return err
            })

            // Any skipped pods that we never attempted to start shouldn't be expected.
            // The skipped pods will be retried later. The next controller resync will
            // retry the slow start process.
            if skippedPods := diff - successfulCreations; skippedPods > 0 {
                glog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for %v %v/%v", skippedPods, rsc.Kind, rs.Namespace, rs.Name)
                for i := 0; i < skippedPods; i++ {
                    // Decrement the expected number of creates because the informer won't observe this pod
                    rsc.expectations.CreationObserved(rsKey)
                }
            }
            return err
        } else if diff > 0 {
            if diff > rsc.burstReplicas {
                diff = rsc.burstReplicas
            }
            glog.V(2).Infof("Too many replicas for %v %s/%s, need %d, deleting %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)

            // Choose which Pods to delete, preferring those in earlier phases of startup.
            podsToDelete := getPodsToDelete(filteredPods, diff)

            // Snapshot the UIDs (ns/name) of the pods we're expecting to see
            // deleted, so we know to record their expectations exactly once either
            // when we see it as an update of the deletion timestamp, or as a delete.
            // Note that if the labels on a pod/rs change in a way that the pod gets
            // orphaned, the rs will only wake up after the expectations have
            // expired even if other pods are deleted.
            rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))

            errCh := make(chan error, diff)
            var wg sync.WaitGroup
            wg.Add(diff)
            for _, pod := range podsToDelete {
                go func(targetPod *v1.Pod) {
                    defer wg.Done()
                    if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
                        // Decrement the expected number of deletes because the informer won't observe this deletion
                        podKey := controller.PodKey(targetPod)
                        glog.V(2).Infof("Failed to delete %v, decrementing expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)
                        rsc.expectations.DeletionObserved(rsKey, podKey)
                        errCh <- err
                    }
                }(pod)
            }
            wg.Wait()

            select {
            case err := <-errCh:
                // all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.
                if err != nil {
                    return err
                }
            default:
            }
        }

        return nil
    }

diff = number of existing pods - desired number of pods

    diff := len(filteredPods) - int(*(rs.Spec.Replicas))

(1) When there are fewer existing pods than desired, pods need to be created, and the create-pods code path is taken.

(2) When there are more existing pods than desired, pods need to be deleted, and the delete-pods code path is taken.

Within a single sync, the number of pods created or deleted in bulk is capped at rsc.burstReplicas, i.e. 500.

    // pkg/controller/replicaset/replica_set.go
    const (
        // Realistic value of the burstReplica field for the replica set manager based off
        // performance requirements for kubernetes 1.0.
        BurstReplicas = 500

        // The number of times we retry updating a ReplicaSet's status.
        statusUpdateRetries = 1
    )

    if diff > rsc.burstReplicas {
        diff = rsc.burstReplicas
    }

Next, let's look at the create and delete code paths.

2.1 The create-pods code path

Main logic:

(1) compute the number of pods to create, capped at 500;

(2) call rsc.expectations.ExpectCreations to record in expectations the number of pods this round of reconciliation expects to create;

(3) call the slowStartBatch function to handle the pod creations;

(4) after slowStartBatch returns, compute the number of pods whose creation failed (including those that were never attempted), and call rsc.expectations.CreationObserved that many times to decrement the number of pods this round expects to create.

Why decrement? expectations records how many pods the replicaset object expects to create/delete in a given round of reconciliation. When a pod creation/deletion completes, the replicaset controller watches the resulting pod create/delete event and calls rsc.expectations.CreationObserved (or DeletionObserved) to decrement the expected count. When a pod creation/deletion fails, however, the controller will never see a corresponding pod event, so it must decrement the expected count for this round itself; otherwise the expected create/delete count could never drop to 0 or below, and rsc.expectations.SatisfiedExpectations would only return true again once the expectations timeout expires.

    diff *= -1
    if diff > rsc.burstReplicas {
        diff = rsc.burstReplicas
    }
    rsc.expectations.ExpectCreations(rsKey, diff)
    glog.V(2).Infof("Too few replicas for %v %s/%s, need %d, creating %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)
    successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {
        boolPtr := func(b bool) *bool { return &b }
        controllerRef := &metav1.OwnerReference{
            APIVersion:         rsc.GroupVersion().String(),
            Kind:               rsc.Kind,
            Name:               rs.Name,
            UID:                rs.UID,
            BlockOwnerDeletion: boolPtr(true),
            Controller:         boolPtr(true),
        }
        err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, controllerRef)
        if err != nil && errors.IsTimeout(err) {
            // Pod is created but its initialization has timed out.
            // If the initialization is successful eventually, the
            // controller will observe the creation via the informer.
            // If the initialization fails, or if the pod keeps
            // uninitialized for a long time, the informer will not
            // receive any update, and the controller will create a new
            // pod when the expectation expires.
            return nil
        }
        return err
    })

    if skippedPods := diff - successfulCreations; skippedPods > 0 {
        glog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for %v %v/%v", skippedPods, rsc.Kind, rs.Namespace, rs.Name)
        for i := 0; i < skippedPods; i++ {
            // Decrement the expected number of creates because the informer won't observe this pod
            rsc.expectations.CreationObserved(rsKey)
        }
    }
    return err
2.1.1 slowStartBatch

Looking at slowStartBatch, the pod creation algorithm is:

(1) pods are created in batches whose sizes grow exponentially: 1, 2, 4, 8, ...; within each batch, one goroutine per pod to be created is started to do the creation.

(2) the batches proceed one after another in that 1, 2, 4, 8, ... progression; if any creation in a batch fails (e.g. the apiserver throttles and drops the request; timeouts are excluded, because initialization may legitimately time out), the remaining batches are skipped and the call returns (see the small batch-size demo after the source below).

    // pkg/controller/replicaset/replica_set.go
    // slowStartBatch tries to call the provided function a total of 'count' times,
    // starting slow to check for errors, then speeding up if calls succeed.
    //
    // It groups the calls into batches, starting with a group of initialBatchSize.
    // Within each batch, it may call the function multiple times concurrently.
    //
    // If a whole batch succeeds, the next batch may get exponentially larger.
    // If there are any failures in a batch, all remaining batches are skipped
    // after waiting for the current batch to complete.
    //
    // It returns the number of successful calls to the function.
    func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {
        remaining := count
        successes := 0
        for batchSize := integer.IntMin(remaining, initialBatchSize); batchSize > 0; batchSize = integer.IntMin(2*batchSize, remaining) {
            errCh := make(chan error, batchSize)
            var wg sync.WaitGroup
            wg.Add(batchSize)
            for i := 0; i < batchSize; i++ {
                go func() {
                    defer wg.Done()
                    if err := fn(); err != nil {
                        errCh <- err
                    }
                }()
            }
            wg.Wait()
            curSuccesses := batchSize - len(errCh)
            successes += curSuccesses
            if len(errCh) > 0 {
                return successes, <-errCh
            }
            remaining -= batchSize
        }
        return successes, nil
    }
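
As a quick illustration of the batch-size progression, the standalone sketch below (illustrative only; intMin is a hypothetical helper, and the real initial batch size is controller.SlowStartInitialBatchSize, i.e. 1) mimics slowStartBatch's loop: for count = 13 it prints batch sizes 1, 2, 4, 6.

    // Standalone sketch of slowStartBatch's batch-size progression (not controller source code).
    package main

    import "fmt"

    func intMin(a, b int) int {
        if a < b {
            return a
        }
        return b
    }

    func main() {
        count := 13 // e.g. 13 pods need to be created this round
        remaining := count
        for batchSize := intMin(remaining, 1); batchSize > 0; batchSize = intMin(2*batchSize, remaining) {
            fmt.Println("batch size:", batchSize) // prints 1, 2, 4, 6
            // In the real slowStartBatch, one goroutine per item in the batch calls fn() here,
            // and any failure in the batch stops the loop once the current batch has completed.
            remaining -= batchSize
        }
    }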

rsc.podControl.CreatePodsWithControllerRef

The method invoked to create pods, as seen above, is rsc.podControl.CreatePodsWithControllerRef:

    func (r RealPodControl) CreatePodsWithControllerRef(namespace string, template *v1.PodTemplateSpec, controllerObject runtime.Object, controllerRef *metav1.OwnerReference) error {
        if err := validateControllerRef(controllerRef); err != nil {
            return err
        }
        return r.createPods("", namespace, template, controllerObject, controllerRef)
    }

    func (r RealPodControl) createPods(nodeName, namespace string, template *v1.PodTemplateSpec, object runtime.Object, controllerRef *metav1.OwnerReference) error {
        pod, err := GetPodFromTemplate(template, object, controllerRef)
        if err != nil {
            return err
        }
        if len(nodeName) != 0 {
            pod.Spec.NodeName = nodeName
        }
        if len(labels.Set(pod.Labels)) == 0 {
            return fmt.Errorf("unable to create pods, no labels")
        }
        newPod, err := r.KubeClient.CoreV1().Pods(namespace).Create(pod)
        if err != nil {
            // only send an event if the namespace isn't terminating
            if !apierrors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
                r.Recorder.Eventf(object, v1.EventTypeWarning, FailedCreatePodReason, "Error creating: %v", err)
            }
            return err
        }
        accessor, err := meta.Accessor(object)
        if err != nil {
            klog.Errorf("parentObject does not have ObjectMeta, %v", err)
            return nil
        }
        klog.V(4).Infof("Controller %v created pod %v", accessor.GetName(), newPod.Name)
        r.Recorder.Eventf(object, v1.EventTypeNormal, SuccessfulCreatePodReason, "Created pod: %v", newPod.Name)
        return nil
    }

2.2 The delete-pods code path

Main logic:

(1) compute the number of pods to delete, capped at 500;

(2) based on the number of pods to scale down by, call the getPodsToDelete function to pick the list of pods to delete;

(3) call rsc.expectations.ExpectDeletions to record in expectations the pods this round of reconciliation expects to delete;

(4) start one goroutine per pod, each calling rsc.podControl.DeletePod to delete its pod;

(5) for every pod whose deletion fails, call rsc.expectations.DeletionObserved to decrement the number of pods this round expects to delete.

The reason for decrementing is the same as explained for the create-pods code path above.

(6) wait for all goroutines to finish, then return (a standalone sketch of this fan-out pattern follows the source below).

    if diff > rsc.burstReplicas {
        diff = rsc.burstReplicas
    }
    glog.V(2).Infof("Too many replicas for %v %s/%s, need %d, deleting %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)
    // Choose which Pods to delete, preferring those in earlier phases of startup.
    podsToDelete := getPodsToDelete(filteredPods, diff)
    rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))
    errCh := make(chan error, diff)
    var wg sync.WaitGroup
    wg.Add(diff)
    for _, pod := range podsToDelete {
        go func(targetPod *v1.Pod) {
            defer wg.Done()
            if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
                // Decrement the expected number of deletes because the informer won't observe this deletion
                podKey := controller.PodKey(targetPod)
                glog.V(2).Infof("Failed to delete %v, decrementing expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)
                rsc.expectations.DeletionObserved(rsKey, podKey)
                errCh <- err
            }
        }(pod)
    }
    wg.Wait()
    select {
    case err := <-errCh:
        // all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.
        if err != nil {
            return err
        }
    default:
    }
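
For readers less familiar with this Go fan-out idiom, the standalone sketch below (deletePod is a hypothetical stub standing in for rsc.podControl.DeletePod) shows the same pattern: one goroutine per pod, an error channel buffered to the number of pods so goroutines never block, wg.Wait to join them, and a non-blocking select that surfaces only the first error.

    // Standalone sketch of the fan-out / first-error pattern used above (illustration only).
    package main

    import (
        "fmt"
        "sync"
    )

    // deletePod is a hypothetical stand-in for rsc.podControl.DeletePod.
    func deletePod(name string) error {
        if name == "pod-2" {
            return fmt.Errorf("delete %s: simulated apiserver error", name)
        }
        return nil
    }

    func main() {
        podsToDelete := []string{"pod-1", "pod-2", "pod-3"}
        errCh := make(chan error, len(podsToDelete)) // buffered so failed goroutines never block on send
        var wg sync.WaitGroup
        wg.Add(len(podsToDelete))
        for _, name := range podsToDelete {
            go func(name string) {
                defer wg.Done()
                if err := deletePod(name); err != nil {
                    errCh <- err
                }
            }(name)
        }
        wg.Wait() // join all deletions before deciding the result

        // Non-blocking receive: report the first error if any deletion failed, otherwise fall through.
        select {
        case err := <-errCh:
            fmt.Println("first error:", err)
        default:
            fmt.Println("all deletions succeeded")
        }
    }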

2.2.1 getPodsToDelete

getPodsToDelete: given the number of pods to scale down by, it returns the list of pods that should be deleted.

    // pkg/controller/replicaset/replica_set.go
    func getPodsToDelete(filteredPods, relatedPods []*v1.Pod, diff int) []*v1.Pod {
        // No need to sort pods if we are about to delete all of them.
        // diff will always be <= len(filteredPods), so not need to handle > case.
        if diff < len(filteredPods) {
            podsWithRanks := getPodsRankedByRelatedPodsOnSameNode(filteredPods, relatedPods)
            sort.Sort(podsWithRanks)
        }
        return filteredPods[:diff]
    }

    func getPodsRankedByRelatedPodsOnSameNode(podsToRank, relatedPods []*v1.Pod) controller.ActivePodsWithRanks {
        podsOnNode := make(map[string]int)
        for _, pod := range relatedPods {
            if controller.IsPodActive(pod) {
                podsOnNode[pod.Spec.NodeName]++
            }
        }
        ranks := make([]int, len(podsToRank))
        for i, pod := range podsToRank {
            ranks[i] = podsOnNode[pod.Spec.NodeName]
        }
        return controller.ActivePodsWithRanks{Pods: podsToRank, Rank: ranks}
    }

Logic for selecting which pods to delete

Pods are ranked according to the rules below, evaluated from top to bottom; the rules are mutually exclusive, and the ordering of a pair of pods is decided by the first rule that applies:

(1) pods not yet bound to a node are deleted first;

(2) pods in the Pending phase are deleted first, then Unknown, and Running pods are deleted last;

(3) pods that are not ready are deleted before pods that are ready;

(4) pods are ranked by how many pods of the same replicaset run on the same node, and pods on nodes with more such pods are deleted first;

(5) among ready pods, the ones that became ready more recently (shorter ready time) are deleted first;

(6) pods whose containers have higher restart counts are deleted first;

(7) more recently created pods (shorter lifetime) are deleted first.

    // pkg/controller/controller_utils.go
    func (s ActivePodsWithRanks) Less(i, j int) bool {
        // 1. Unassigned < assigned
        // If only one of the pods is unassigned, the unassigned one is smaller
        if s.Pods[i].Spec.NodeName != s.Pods[j].Spec.NodeName && (len(s.Pods[i].Spec.NodeName) == 0 || len(s.Pods[j].Spec.NodeName) == 0) {
            return len(s.Pods[i].Spec.NodeName) == 0
        }
        // 2. PodPending < PodUnknown < PodRunning
        if podPhaseToOrdinal[s.Pods[i].Status.Phase] != podPhaseToOrdinal[s.Pods[j].Status.Phase] {
            return podPhaseToOrdinal[s.Pods[i].Status.Phase] < podPhaseToOrdinal[s.Pods[j].Status.Phase]
        }
        // 3. Not ready < ready
        // If only one of the pods is not ready, the not ready one is smaller
        if podutil.IsPodReady(s.Pods[i]) != podutil.IsPodReady(s.Pods[j]) {
            return !podutil.IsPodReady(s.Pods[i])
        }
        // 4. Doubled up < not doubled up
        // If one of the two pods is on the same node as one or more additional
        // ready pods that belong to the same replicaset, whichever pod has more
        // colocated ready pods is less
        if s.Rank[i] != s.Rank[j] {
            return s.Rank[i] > s.Rank[j]
        }
        // TODO: take availability into account when we push minReadySeconds information from deployment into pods,
        //       see https://github.com/kubernetes/kubernetes/issues/22065
        // 5. Been ready for empty time < less time < more time
        // If both pods are ready, the latest ready one is smaller
        if podutil.IsPodReady(s.Pods[i]) && podutil.IsPodReady(s.Pods[j]) {
            readyTime1 := podReadyTime(s.Pods[i])
            readyTime2 := podReadyTime(s.Pods[j])
            if !readyTime1.Equal(readyTime2) {
                return afterOrZero(readyTime1, readyTime2)
            }
        }
        // 6. Pods with containers with higher restart counts < lower restart counts
        if maxContainerRestarts(s.Pods[i]) != maxContainerRestarts(s.Pods[j]) {
            return maxContainerRestarts(s.Pods[i]) > maxContainerRestarts(s.Pods[j])
        }
        // 7. Empty creation time pods < newer pods < older pods
        if !s.Pods[i].CreationTimestamp.Equal(&s.Pods[j].CreationTimestamp) {
            return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
        }
        return false
    }
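
To see the first few rules in action, here is a minimal illustration with hand-built, hypothetical pod objects; because pkg/controller is an internal package, a snippet like this would only compile inside the kubernetes repo (for example as a test under pkg/controller/replicaset). With all ranks set to 0, rules 1 and 2 alone decide the order: the unscheduled pod sorts first, the Pending pod second, and the running, ready pod last, which is the order getPodsToDelete would pick deletion victims in.

    // Illustrative snippet (hand-built example pods); compiles only inside the kubernetes repo.
    package replicaset

    import (
        "sort"
        "testing"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/kubernetes/pkg/controller"
    )

    func TestDeletionOrderIllustration(t *testing.T) {
        pods := []*v1.Pod{
            {ObjectMeta: metav1.ObjectMeta{Name: "running-ready"}, Spec: v1.PodSpec{NodeName: "node-a"},
                Status: v1.PodStatus{Phase: v1.PodRunning, Conditions: []v1.PodCondition{{Type: v1.PodReady, Status: v1.ConditionTrue}}}},
            {ObjectMeta: metav1.ObjectMeta{Name: "pending-on-node"}, Spec: v1.PodSpec{NodeName: "node-a"},
                Status: v1.PodStatus{Phase: v1.PodPending}},
            {ObjectMeta: metav1.ObjectMeta{Name: "unscheduled"},
                Status: v1.PodStatus{Phase: v1.PodPending}},
        }
        // All ranks are 0, so only rules 1-3 of Less decide the order here.
        sort.Sort(controller.ActivePodsWithRanks{Pods: pods, Rank: []int{0, 0, 0}})
        // Resulting order: "unscheduled" (rule 1), "pending-on-node" (rule 2), "running-ready";
        // a scale-down of 1 would therefore delete "unscheduled" first.
        for _, pod := range pods {
            t.Log(pod.Name)
        }
    }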

2.2.2 rsc.podControl.DeletePod

The method that deletes a pod.

    // pkg/controller/controller_utils.go
    func (r RealPodControl) DeletePod(namespace string, podID string, object runtime.Object) error {
        accessor, err := meta.Accessor(object)
        if err != nil {
            return fmt.Errorf("object does not have ObjectMeta, %v", err)
        }
        klog.V(2).Infof("Controller %v deleting pod %v/%v", accessor.GetName(), namespace, podID)
        if err := r.KubeClient.CoreV1().Pods(namespace).Delete(podID, nil); err != nil && !apierrors.IsNotFound(err) {
            r.Recorder.Eventf(object, v1.EventTypeWarning, FailedDeletePodReason, "Error deleting: %v", err)
            return fmt.Errorf("unable to delete pods: %v", err)
        }
        r.Recorder.Eventf(object, v1.EventTypeNormal, SuccessfulDeletePodReason, "Deleted pod: %v", podID)
        return nil
    }

3 calculateStatus

The calculateStatus function computes and returns the replicaset object's status.

How is the status computed?

(1) Based on the number of existing pods, the number of Ready pods, the number of Available pods and so on, it fills in the Replicas, ReadyReplicas, AvailableReplicas and related fields of the replicaset object's status;

(2) Based on the conditions already present in the replicaset object's status and on whether the preceding rsc.manageReplicas call returned an error, it decides whether to add or remove the ReplicaFailure condition (apps.ReplicaSetReplicaFailure) in the status:

if rsc.manageReplicas returned an error and the replicaset object's status does not yet contain a ReplicaFailure condition, such a condition is added, indicating that creating/deleting pods for this replicaset failed;

if rsc.manageReplicas returned no error and the replicaset object's status contains a ReplicaFailure condition, the condition is removed, indicating that creating/deleting pods for this replicaset succeeded.

    func calculateStatus(rs *apps.ReplicaSet, filteredPods []*v1.Pod, manageReplicasErr error) apps.ReplicaSetStatus {
        newStatus := rs.Status
        // Count the number of pods that have labels matching the labels of the pod
        // template of the replica set, the matching pods may have more
        // labels than are in the template. Because the label of podTemplateSpec is
        // a superset of the selector of the replica set, so the possible
        // matching pods must be part of the filteredPods.
        fullyLabeledReplicasCount := 0
        readyReplicasCount := 0
        availableReplicasCount := 0
        templateLabel := labels.Set(rs.Spec.Template.Labels).AsSelectorPreValidated()
        for _, pod := range filteredPods {
            if templateLabel.Matches(labels.Set(pod.Labels)) {
                fullyLabeledReplicasCount++
            }
            if podutil.IsPodReady(pod) {
                readyReplicasCount++
                if podutil.IsPodAvailable(pod, rs.Spec.MinReadySeconds, metav1.Now()) {
                    availableReplicasCount++
                }
            }
        }

        failureCond := GetCondition(rs.Status, apps.ReplicaSetReplicaFailure)
        if manageReplicasErr != nil && failureCond == nil {
            var reason string
            if diff := len(filteredPods) - int(*(rs.Spec.Replicas)); diff < 0 {
                reason = "FailedCreate"
            } else if diff > 0 {
                reason = "FailedDelete"
            }
            cond := NewReplicaSetCondition(apps.ReplicaSetReplicaFailure, v1.ConditionTrue, reason, manageReplicasErr.Error())
            SetCondition(&newStatus, cond)
        } else if manageReplicasErr == nil && failureCond != nil {
            RemoveCondition(&newStatus, apps.ReplicaSetReplicaFailure)
        }

        newStatus.Replicas = int32(len(filteredPods))
        newStatus.FullyLabeledReplicas = int32(fullyLabeledReplicasCount)
        newStatus.ReadyReplicas = int32(readyReplicasCount)
        newStatus.AvailableReplicas = int32(availableReplicasCount)
        return newStatus
    }

4 updateReplicaSetStatus

Main logic:

(1) check whether the newly computed status fields such as Replicas, ReadyReplicas, AvailableReplicas and Conditions are identical to those of the existing replicaset object's status; if they are, no update is needed and the function returns directly;

(2) otherwise, call c.UpdateStatus to update the replicaset's status.

    // pkg/controller/replicaset/replica_set_utils.go
    func updateReplicaSetStatus(c appsclient.ReplicaSetInterface, rs *apps.ReplicaSet, newStatus apps.ReplicaSetStatus) (*apps.ReplicaSet, error) {
        // This is the steady state. It happens when the ReplicaSet doesn't have any expectations, since
        // we do a periodic relist every 30s. If the generations differ but the replicas are
        // the same, a caller might've resized to the same replica count.
        if rs.Status.Replicas == newStatus.Replicas &&
            rs.Status.FullyLabeledReplicas == newStatus.FullyLabeledReplicas &&
            rs.Status.ReadyReplicas == newStatus.ReadyReplicas &&
            rs.Status.AvailableReplicas == newStatus.AvailableReplicas &&
            rs.Generation == rs.Status.ObservedGeneration &&
            reflect.DeepEqual(rs.Status.Conditions, newStatus.Conditions) {
            return rs, nil
        }

        // Save the generation number we acted on, otherwise we might wrongfully indicate
        // that we've seen a spec update when we retry.
        // TODO: This can clobber an update if we allow multiple agents to write to the
        // same status.
        newStatus.ObservedGeneration = rs.Generation

        var getErr, updateErr error
        var updatedRS *apps.ReplicaSet
        for i, rs := 0, rs; ; i++ {
            klog.V(4).Infof(fmt.Sprintf("Updating status for %v: %s/%s, ", rs.Kind, rs.Namespace, rs.Name) +
                fmt.Sprintf("replicas %d->%d (need %d), ", rs.Status.Replicas, newStatus.Replicas, *(rs.Spec.Replicas)) +
                fmt.Sprintf("fullyLabeledReplicas %d->%d, ", rs.Status.FullyLabeledReplicas, newStatus.FullyLabeledReplicas) +
                fmt.Sprintf("readyReplicas %d->%d, ", rs.Status.ReadyReplicas, newStatus.ReadyReplicas) +
                fmt.Sprintf("availableReplicas %d->%d, ", rs.Status.AvailableReplicas, newStatus.AvailableReplicas) +
                fmt.Sprintf("sequence No: %v->%v", rs.Status.ObservedGeneration, newStatus.ObservedGeneration))

            rs.Status = newStatus
            updatedRS, updateErr = c.UpdateStatus(rs)
            if updateErr == nil {
                return updatedRS, nil
            }
            // Stop retrying if we exceed statusUpdateRetries - the replicaSet will be requeued with a rate limit.
            if i >= statusUpdateRetries {
                break
            }
            // Update the ReplicaSet with the latest resource version for the next poll
            if rs, getErr = c.Get(rs.Name, metav1.GetOptions{}); getErr != nil {
                // If the GET fails we can't trust status.Replicas anymore. This error
                // is bound to be more interesting than the update failure.
                return nil, getErr
            }
        }

        return nil, updateErr
    }

c.UpdateStatus

    // staging/src/k8s.io/client-go/kubernetes/typed/apps/v1/replicaset.go
    func (c *replicaSets) UpdateStatus(replicaSet *v1.ReplicaSet) (result *v1.ReplicaSet, err error) {
        result = &v1.ReplicaSet{}
        err = c.client.Put().
            Namespace(c.ns).
            Resource("replicasets").
            Name(replicaSet.Name).
            SubResource("status").
            Body(replicaSet).
            Do().
            Into(result)
        return
    }

Summary

replicaset controller architecture diagram

The rough composition and processing flow of the replicaset controller are shown in the diagram below: the replicaset controller registers event handlers for pod and replicaset objects; when an event is watched, the corresponding replicaset object is put into a queue; the syncReplicaSet method, which holds the core reconciliation logic, then takes replicaset objects out of the queue and reconciles them.

replicaset controller core processing logic

The core logic of the replicaset controller is to compare the number of pods a replicaset object desires with the number of pods that currently exist. When the desired count is higher than the existing count, the pod creation algorithm is invoked to create new pods until the desired count is reached; when the desired count is lower, the pod deletion algorithm is invoked, which sorts the existing pods by a set of rules, picks the surplus pods in that order and deletes them until the desired count is reached.

replicaset controller pod creation algorithm

The pod creation algorithm creates pods in multiple batches whose sizes grow as 1, 2, 4, 8, ... (with an upper limit of 500 pod creations per reconciliation; anything beyond the limit is created in the next reconciliation). If any creation in a batch fails (e.g. the apiserver throttles and drops the request; timeouts are excluded, because initialization may time out), the remaining batches are not attempted, and pod creation resumes only when the replicaset object is reconciled again, continuing until the desired count is reached.

replicaset controller pod deletion algorithm

The pod deletion algorithm first sorts the existing pod list by a set of rules, picks the required number of pods in that order, then starts one goroutine per pod to delete it (with an upper limit of 500 pod deletions per reconciliation) and waits for all goroutines to finish. If some deletions fail (e.g. the apiserver throttles and drops the request) or there are more surplus pods than the 500-pod limit, pod deletion resumes when the replicaset object is reconciled again, continuing until the desired count is reached.

Logic for selecting which pods to delete

Pods are ranked according to the rules below, evaluated from top to bottom; the rules are mutually exclusive, and the ordering of a pair of pods is decided by the first rule that applies:

(1) pods not yet bound to a node are deleted first;

(2) pods in the Pending phase are deleted first, then Unknown, and Running pods are deleted last;

(3) pods that are not ready are deleted before pods that are ready;

(4) pods are ranked by how many pods of the same replicaset run on the same node, and pods on nodes with more such pods are deleted first;

(5) among ready pods, the ones that became ready more recently (shorter ready time) are deleted first;

(6) pods whose containers have higher restart counts are deleted first;

(7) more recently created pods (shorter lifetime) are deleted first.

The expectations mechanism

The expectations mechanism will be analyzed in the next post.
