Why the Webhook Receives Three DELETE Requests When a Pod Is Deleted
While experimenting with admission webhooks recently, I ran into a curious phenomenon: I configured a ValidatingWebhookConfiguration to intercept Pod DELETE operations, and found that every time a Pod was deleted, the webhook received three DELETE requests.
From the logs, the first DELETE request came from the kubectl client, while the latter two came from the node the Pod was running on. Why would the webhook receive three DELETE requests?
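To see where each request comes from, the webhook only needs to log the operation and the requesting user from each AdmissionReview it receives. The sketch below decodes just those fields using a hand-written subset of the AdmissionReview schema (the field names match the real `admission/v1` API, but these structs and the sample usernames are illustrative, not the official types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A minimal subset of the admission/v1 AdmissionReview schema --
// just enough to see who sent a DELETE and for which object.
type AdmissionReview struct {
	Request *AdmissionRequest `json:"request"`
}

type AdmissionRequest struct {
	UID       string   `json:"uid"`
	Operation string   `json:"operation"`
	Name      string   `json:"name"`
	Namespace string   `json:"namespace"`
	UserInfo  UserInfo `json:"userInfo"`
}

type UserInfo struct {
	Username string `json:"username"`
}

// describeRequest extracts a one-line summary from a serialized AdmissionReview.
func describeRequest(body []byte) (string, error) {
	var review AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil {
		return "", err
	}
	r := review.Request
	if r == nil {
		return "", fmt.Errorf("AdmissionReview has no request")
	}
	return fmt.Sprintf("%s %s/%s by %s", r.Operation, r.Namespace, r.Name, r.UserInfo.Username), nil
}

func main() {
	// Hypothetical payloads standing in for the observed requests: one from
	// the kubectl client, one from the node the Pod runs on.
	samples := [][]byte{
		[]byte(`{"request":{"uid":"1","operation":"DELETE","name":"mypod","namespace":"default","userInfo":{"username":"kubernetes-admin"}}}`),
		[]byte(`{"request":{"uid":"2","operation":"DELETE","name":"mypod","namespace":"default","userInfo":{"username":"system:node:node-1"}}}`),
	}
	for _, s := range samples {
		line, err := describeRequest(s)
		if err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```

Logging `userInfo.username` this way is what distinguishes the client-originated request from the node-originated ones in the first place.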
The process of deleting a Pod
After reading through the kube-apiserver and kubelet source code, I summarized the deletion of a Pod in the flowchart below; the three requests highlighted in bold red are the three DELETE requests the webhook receives.
kube-apiserver handles the first DELETE request
First, the DELETE request sent by kubectl goes through kube-apiserver's admission controllers for admission review. Since we defined an admission webhook, kube-apiserver wraps the information about this request in an `AdmissionReview` structure and sends it to the webhook. This is the first DELETE request the webhook receives.
As an HTTP server, kube-apiserver registers its handlers in the `registerResourceHandlers` function in `staging/src/k8s.io/apiserver/pkg/endpoints/installer.go`. The handler for `DELETE` requests is `restfulDeleteResource`:
```go
case "DELETE": // Delete a resource.
	// ...
	handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, deprecated, removedRelease, restfulDeleteResource(gracefulDeleter, isGracefulDeleter, reqScope, admit))
	// ...
```
`restfulDeleteResource` calls `DeleteResource`, which in turn calls the `Delete` method in `staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go` to delete the object:
```go
func restfulDeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction {
	return func(req *restful.Request, res *restful.Response) {
		handlers.DeleteResource(r, allowsOptions, &scope, admit)(res.ResponseWriter, req.Request)
	}
}
```
```go
func DeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope *RequestScope, admit admission.Interface) http.HandlerFunc {
	// ...
	trace.Step("About to delete object from database")
	wasDeleted := true
	userInfo, _ := request.UserFrom(ctx)
	staticAdmissionAttrs := admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Delete, options, dryrun.IsDryRun(options.DryRun), userInfo)
	result, err := finishRequest(timeout, func() (runtime.Object, error) {
		obj, deleted, err := r.Delete(ctx, name, rest.AdmissionToValidateObjectDeleteFunc(admit, staticAdmissionAttrs, scope), options)
		wasDeleted = deleted
		return obj, err
	})
	if err != nil {
		scope.err(err, w, req)
		return
	}
	trace.Step("Object deleted from database")
	// ...
}
```
Inside the `Delete` method, the `BeforeDelete` function decides whether graceful deletion is needed. The criterion is whether `DeletionGracePeriodSeconds` is 0: if it is non-zero, the deletion is graceful and kube-apiserver does not immediately remove the API object from etcd; otherwise the object is deleted right away.
For a Pod, `DeletionGracePeriodSeconds` defaults to 30 seconds, so kube-apiserver does not delete it immediately here. Instead, it sets `DeletionTimestamp` to the current time and `DeletionGracePeriodSeconds` to the default of 30 seconds.
kubelet kills the containers
Once kube-apiserver has set the `DeletionTimestamp` and `DeletionGracePeriodSeconds` fields, the kubelet observes the Pod update through its watch. So how is the kubelet's list-watch mechanism implemented?
In the `makePodSourceConfig` function, the kubelet subscribes to three sources of Pods: static Pods defined by config files on the local filesystem, static Pods fetched from a web URL, and Pods from kube-apiserver. We mainly care about the third.
After the kubelet's reflector observes a change to a Pod resource, the Pod and its change are passed through a channel into the `syncLoop` main control loop for processing; the kubelet does not use the informer + workqueue pattern here.
The kubelet's main control loop is the `syncLoopIteration` function in `pkg/kubelet/kubelet.go`:
```go
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
	select {
	case u, open := <-configCh:
		// Update from a config source; dispatch it to the right handler
		// callback.
		if !open {
			klog.Errorf("Update channel is closed. Exiting the sync loop.")
			return false
		}

		switch u.Op {
		case kubetypes.ADD:
			klog.V(2).Infof("SyncLoop (ADD, %q): %q", u.Source, format.Pods(u.Pods))
			// After restarting, kubelet will get all existing pods through
			// ADD as if they are new pods. These pods will then go through the
			// admission process and *may* be rejected. This can be resolved
			// once we have checkpointing.
			handler.HandlePodAdditions(u.Pods)
		case kubetypes.UPDATE:
			klog.V(2).Infof("SyncLoop (UPDATE, %q): %q", u.Source, format.PodsWithDeletionTimestamps(u.Pods))
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.REMOVE:
			klog.V(2).Infof("SyncLoop (REMOVE, %q): %q", u.Source, format.Pods(u.Pods))
			handler.HandlePodRemoves(u.Pods)
		case kubetypes.RECONCILE:
			klog.V(4).Infof("SyncLoop (RECONCILE, %q): %q", u.Source, format.Pods(u.Pods))
			handler.HandlePodReconcile(u.Pods)
		case kubetypes.DELETE:
			klog.V(2).Infof("SyncLoop (DELETE, %q): %q", u.Source, format.Pods(u.Pods))
			// DELETE is treated as a UPDATE because of graceful deletion.
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.RESTORE:
			klog.V(2).Infof("SyncLoop (RESTORE, %q): %q", u.Source, format.Pods(u.Pods))
			// These are pods restored from the checkpoint. Treat them as new
			// pods.
			handler.HandlePodAdditions(u.Pods)
		case kubetypes.SET:
			// TODO: Do we want to support this?
			klog.Errorf("Kubelet does not support snapshot update")
		}
		// ...
```
When the Pod's `DeletionTimestamp` is set, the kubelet enters the `kubetypes.DELETE` branch and eventually reaches the `syncPod` function in `pkg/kubelet/kubelet.go`, the kubelet's core processing function. `syncPod` calls the container runtime's `KillPod` method, which in turn kills all of the Pod's containers in parallel, each in its own goroutine, using the `killContainer` method defined in `pkg/kubelet/kuberuntime/kuberuntime_container.go`. The implementation of `killContainer` is shown below:
```go
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
	// ...

	// From this point, pod and container must be non-nil.
	gracePeriod := int64(minimumGracePeriodInSeconds)
	switch {
	case pod.DeletionGracePeriodSeconds != nil:
		gracePeriod = *pod.DeletionGracePeriodSeconds
	case pod.Spec.TerminationGracePeriodSeconds != nil:
		gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
	}

	if len(message) == 0 {
		message = fmt.Sprintf("Stopping container %s", containerSpec.Name)
	}
	m.recordContainerEvent(pod, containerSpec, containerID.ID, v1.EventTypeNormal, events.KillingContainer, message)

	// Run internal pre-stop lifecycle hook
	if err := m.internalLifecycle.PreStopContainer(containerID.ID); err != nil {
		return err
	}

	// Run the pre-stop lifecycle hooks if applicable and if there is enough time to run it
	if containerSpec.Lifecycle != nil && containerSpec.Lifecycle.PreStop != nil && gracePeriod > 0 {
		gracePeriod = gracePeriod - m.executePreStopHook(pod, containerID, containerSpec, gracePeriod)
	}
	// always give containers a minimal shutdown window to avoid unnecessary SIGKILLs
	if gracePeriod < minimumGracePeriodInSeconds {
		gracePeriod = minimumGracePeriodInSeconds
	}
	if gracePeriodOverride != nil {
		gracePeriod = *gracePeriodOverride
		klog.V(3).Infof("Killing container %q, but using %d second grace period override", containerID, gracePeriod)
	}

	klog.V(2).Infof("Killing container %q with %d second grace period", containerID.String(), gracePeriod)

	err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
	if err != nil {
		klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
	} else {
		klog.V(3).Infof("Container %q exited normally", containerID.String())
	}

	m.containerRefManager.ClearRef(containerID)

	return err
}
```
This method first runs the preStop hook and then kills the container process via `runtimeService.StopContainer`; the whole procedure must finish within `DeletionGracePeriodSeconds`. Note that the preStop hook is never retried: if it fails, the kubelet ignores the failure and kills the container regardless.
statusManager sends the DELETE request
The kubelet runs a `statusManager` in its own goroutine. Its job is to watch for Pod status changes periodically and then execute `func (m *manager) syncPod(uid types.UID, status versionedPodStatus)`. Inside `syncPod`, note the following logic:
```go
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
	// ...

	if m.canBeDeleted(pod, status.status) {
		deleteOptions := metav1.DeleteOptions{
			GracePeriodSeconds: new(int64),
			// Use the pod UID as the precondition for deletion to prevent deleting a
			// newly created pod with the same name and namespace.
			Preconditions: metav1.NewUIDPreconditions(string(pod.UID)),
		}
		err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, deleteOptions)
		if err != nil {
			klog.Warningf("Failed to delete status for pod %q: %v", format.Pod(pod), err)
			return
		}
		klog.V(3).Infof("Pod %q fully terminated and removed from etcd", format.Pod(pod))
		m.deletePodStatus(uid)
	}
}
```
In other words, once the `statusManager` determines that the Pod can be deleted, it calls the clientset's Delete API to remove the Pod resource from kube-apiserver. So when can a Pod be deleted? Naturally, once the kubelet has cleaned up all of the Pod's resources in the previous step: its containers, volumes, cgroups, sandbox, and so on.
This is when the webhook receives the second DELETE request, and this time `GracePeriodSeconds` is set to 0, which signals that kube-apiserver may remove the Pod from etcd once it receives this DELETE request.
The third DELETE request
Why the webhook receives a third DELETE request is a question that puzzled me for quite a while.
Judging from the serviceAccount information in the logs, it looked as though some component on the node had sent yet another DELETE request. Was it the kubelet? kube-proxy? But examining the relevant logs and code turned up nothing suspicious.
As it turns out, the third DELETE request is sent by kube-apiserver itself.
In the first part, I mentioned that after receiving a DELETE request, kube-apiserver eventually calls the `Delete` method in `staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go`. Because the deletion was graceful, that method returned right after updating the Pod's `DeletionTimestamp` and `DeletionGracePeriodSeconds` fields.
Now the second DELETE request sets `GracePeriodSeconds` to 0, so the actual deletion can finally proceed:
```go
func (e *Store) Delete(ctx context.Context, name string, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
	// ...
	// delete immediately, or no graceful deletion supported
	klog.V(6).Infof("going to delete %s from registry: ", name)
	out = e.NewFunc()
	if err := e.Storage.Delete(ctx, key, out, &preconditions, storage.ValidateObjectFunc(deleteValidation), dryrun.IsDryRun(options.DryRun)); err != nil {
		// Please refer to the place where we set ignoreNotFound for the reason
		// why we ignore the NotFound error.
		if storage.IsNotFound(err) && ignoreNotFound && lastExisting != nil {
			// The lastExisting object may not be the last state of the object
			// before its deletion, but it's the best approximation.
			out, err := e.finalizeDelete(ctx, lastExisting, true)
			return out, true, err
		}
		return nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name)
	}
	// ...
}
```
The `e.Storage.Delete` call is passed a `storage.ValidateObjectFunc(deleteValidation)` argument. A careful read of its implementation reveals that, before performing the deletion, kube-apiserver runs admission control on the delete operation one more time, i.e. Validating and Mutating. The logic is in the `conditionalDelete` function in `staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go`:
```go
func (s *store) conditionalDelete(ctx context.Context, key string, out runtime.Object, v reflect.Value, preconditions *storage.Preconditions, validateDeletion storage.ValidateObjectFunc) error {
	// ...
	for {
		origState, err := s.getState(getResp, key, v, false)
		if err != nil {
			return err
		}
		if preconditions != nil {
			if err := preconditions.Check(key, origState.obj); err != nil {
				return err
			}
		}
		if err := validateDeletion(ctx, origState.obj); err != nil {
			return err
		}
		startTime := time.Now()
		txnResp, err := s.client.KV.Txn(ctx).If(
			clientv3.Compare(clientv3.ModRevision(key), "=", origState.rev),
		).Then(
			clientv3.OpDelete(key),
		).Else(
			clientv3.OpGet(key),
		).Commit()
		// ...
	}
}
```
`validateDeletion` is where this DELETE admission check happens; it necessarily invokes the validating webhook, which produces the third DELETE request. As for why admission control runs one more time here, I am not entirely sure; presumably the object may have changed during the graceful-deletion window, so validating again inside the retry loop ensures admission sees the object's final state just before it is removed.