Why the webhook receives three delete requests when a Pod is deleted

While experimenting with admission webhooks recently, I ran into a puzzling phenomenon: I configured a ValidatingWebhookConfiguration to intercept Pod DELETE operations, and every time I deleted a Pod, the webhook received three delete requests.
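
For reference, a configuration roughly like the following is enough to reproduce this behavior. It is a minimal sketch rather than my actual setup; the object name, the "pod-delete-webhook" service reference and the /validate path are made up for illustration. Expressed with the Go API types, it registers a validating webhook for Pod DELETE operations:

package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podDeleteWebhookConfig builds a ValidatingWebhookConfiguration that routes
// every Pod DELETE through a webhook served by the (hypothetical)
// "pod-delete-webhook" service in the default namespace.
func podDeleteWebhookConfig() *admissionregistrationv1.ValidatingWebhookConfiguration {
	failurePolicy := admissionregistrationv1.Ignore
	sideEffects := admissionregistrationv1.SideEffectClassNone
	path := "/validate"
	return &admissionregistrationv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-delete-watcher"},
		Webhooks: []admissionregistrationv1.ValidatingWebhook{{
			Name:                    "pod-delete.example.com",
			AdmissionReviewVersions: []string{"v1"},
			SideEffects:             &sideEffects,
			FailurePolicy:           &failurePolicy,
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "default",
					Name:      "pod-delete-webhook",
					Path:      &path,
				},
			},
			// Intercept DELETE on core/v1 pods only.
			Rules: []admissionregistrationv1.RuleWithOperations{{
				Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Delete},
				Rule: admissionregistrationv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
				},
			}},
		}},
	}
}

func main() {
	fmt.Println(podDeleteWebhookConfig().Name)
}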

From the webhook's log output, the first delete request comes from the kubectl client, while the latter two come from the node the Pod is running on. Why three delete requests?

The process of deleting a Pod

After reading the kube-apiserver and kubelet source code, I summarized the Pod deletion process in the flowchart below; the three requests highlighted in bold red are the three delete requests the webhook receives.

kube-apiserver handles the first delete request

First, the delete request sent by kubectl goes through kube-apiserver's admission controllers for an admission check. Since we have defined an admission webhook, kube-apiserver wraps the information about this request in an AdmissionReview and sends it to the webhook. This is the first time the webhook receives a delete request.
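
To make those three requests concrete, here is a bare-bones sketch of what such a webhook server can look like; it is an illustrative assumption rather than my actual webhook, and the /validate path and TLS file names are placeholders. Each intercepted DELETE arrives as an AdmissionReview at this handler:

package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
)

// handleAdmission decodes the AdmissionReview sent by kube-apiserver, logs the
// operation and the requesting user, and always allows the request.
func handleAdmission(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "bad AdmissionReview", http.StatusBadRequest)
		return
	}
	req := review.Request
	// Each of the three DELETE requests discussed in this article lands here.
	log.Printf("operation=%s pod=%s/%s user=%s", req.Operation, req.Namespace, req.Name, req.UserInfo.Username)

	review.Response = &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", handleAdmission)
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}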

kube-apiserver is essentially an HTTP server; its handlers are registered in the registerResourceHandlers function in staging/src/k8s.io/apiserver/pkg/endpoints/installer.go. The handler for DELETE requests is restfulDeleteResource:

1case "DELETE": // Delete a resource.
2    // ...
3
4    handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, deprecated, removedRelease, restfulDeleteResource(gracefulDeleter, isGracefulDeleter, reqScope, admit))
5
6    ...

restfulDeleteResource calls DeleteResource, which in turn calls the Delete method in staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go to delete the object:

func restfulDeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction {
	return func(req *restful.Request, res *restful.Response) {
		handlers.DeleteResource(r, allowsOptions, &scope, admit)(res.ResponseWriter, req.Request)
	}
}

func DeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope *RequestScope, admit admission.Interface) http.HandlerFunc {
	// ...

	trace.Step("About to delete object from database")
	wasDeleted := true
	userInfo, _ := request.UserFrom(ctx)
	staticAdmissionAttrs := admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Delete, options, dryrun.IsDryRun(options.DryRun), userInfo)
	result, err := finishRequest(timeout, func() (runtime.Object, error) {
		obj, deleted, err := r.Delete(ctx, name, rest.AdmissionToValidateObjectDeleteFunc(admit, staticAdmissionAttrs, scope), options)
		wasDeleted = deleted
		return obj, err
	})
	if err != nil {
		scope.err(err, w, req)
		return
	}
	trace.Step("Object deleted from database")

	...
}

Inside the Delete method, the BeforeDelete function decides whether graceful deletion is needed. The criterion is whether DeletionGracePeriodSeconds is zero: if it is non-zero, the deletion is treated as graceful and kube-apiserver does not remove the API object from etcd right away; otherwise the object is deleted immediately.

For a Pod, DeletionGracePeriodSeconds defaults to 30 seconds, so kube-apiserver does not delete it immediately here. Instead, it sets DeletionTimestamp to the expected deletion time (the current time plus the grace period) and sets DeletionGracePeriodSeconds to the default of 30 seconds.
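
A simplified paraphrase of that decision, not the real rest.BeforeDelete code, looks roughly like this:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markPodForGracefulDeletion paraphrases what kube-apiserver does on the first
// DELETE: with a non-zero grace period the Pod is only stamped, not removed.
// Illustrative sketch only.
func markPodForGracefulDeletion(pod *corev1.Pod, graceSeconds int64) (deleteImmediately bool) {
	if graceSeconds == 0 {
		return true // no grace period: remove the object from etcd right away
	}
	deadline := metav1.NewTime(time.Now().Add(time.Duration(graceSeconds) * time.Second))
	pod.DeletionTimestamp = &deadline
	pod.DeletionGracePeriodSeconds = &graceSeconds
	return false // keep the object; kubelet will now see the update
}

func main() {
	pod := &corev1.Pod{}
	immediate := markPodForGracefulDeletion(pod, 30)
	fmt.Println(immediate, pod.DeletionTimestamp, *pod.DeletionGracePeriodSeconds)
}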

kubelet kills the containers

Once kube-apiserver has set the DeletionTimestamp and DeletionGracePeriodSeconds fields, kubelet observes the Pod update through its watch. So how is kubelet's list-watch mechanism implemented?

In its makePodSourceConfig function, kubelet watches Pods from three sources: static Pods defined by manifest files on the local filesystem, static Pods defined by manifests fetched over HTTP, and Pods from kube-apiserver. We only care about the third source here.

When kubelet's reflector observes a change to a Pod, the Pod and its change are handed to the syncLoop main control loop through a channel for processing; kubelet does not use the usual informer + workqueue pattern here.
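
The sketch below shows the shape of this pattern using plain client-go primitives. It is my approximation, not kubelet's actual source, and the node name "node-1" is a placeholder: a reflector lists and watches the pods bound to one node and pushes full snapshots into a channel, and a select loop consumes them.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Channel carrying full snapshots of one node's pods, roughly what
	// kubelet's apiserver pod source produces (simplified).
	updates := make(chan []*corev1.Pod, 50)

	// List/watch only the pods scheduled to one node.
	lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll,
		fields.OneTermEqualSelector("spec.nodeName", "node-1"))

	// An UndeltaStore invokes the push function with the complete current
	// state every time the reflector applies a change.
	store := cache.NewUndeltaStore(func(objs []interface{}) {
		pods := make([]*corev1.Pod, 0, len(objs))
		for _, o := range objs {
			pods = append(pods, o.(*corev1.Pod))
		}
		updates <- pods
	}, cache.MetaNamespaceKeyFunc)

	stopCh := make(chan struct{})
	go cache.NewReflector(lw, &corev1.Pod{}, store, 0).Run(stopCh)

	// The consuming side: a plain select loop, no workqueue.
	for {
		select {
		case pods := <-updates:
			for _, p := range pods {
				if p.DeletionTimestamp != nil {
					fmt.Printf("pod %s/%s is being gracefully deleted\n", p.Namespace, p.Name)
				}
			}
		case <-time.After(time.Minute):
			// periodic housekeeping, analogous to kubelet's housekeepingCh
		}
	}
}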

kubelet's main control loop lives in the syncLoopIteration function in pkg/kubelet/kubelet.go:

func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
	select {
	case u, open := <-configCh:
		// Update from a config source; dispatch it to the right handler
		// callback.
		if !open {
			klog.Errorf("Update channel is closed. Exiting the sync loop.")
			return false
		}

		switch u.Op {
		case kubetypes.ADD:
			klog.V(2).Infof("SyncLoop (ADD, %q): %q", u.Source, format.Pods(u.Pods))
			// After restarting, kubelet will get all existing pods through
			// ADD as if they are new pods. These pods will then go through the
			// admission process and *may* be rejected. This can be resolved
			// once we have checkpointing.
			handler.HandlePodAdditions(u.Pods)
		case kubetypes.UPDATE:
			klog.V(2).Infof("SyncLoop (UPDATE, %q): %q", u.Source, format.PodsWithDeletionTimestamps(u.Pods))
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.REMOVE:
			klog.V(2).Infof("SyncLoop (REMOVE, %q): %q", u.Source, format.Pods(u.Pods))
			handler.HandlePodRemoves(u.Pods)
		case kubetypes.RECONCILE:
			klog.V(4).Infof("SyncLoop (RECONCILE, %q): %q", u.Source, format.Pods(u.Pods))
			handler.HandlePodReconcile(u.Pods)
		case kubetypes.DELETE:
			klog.V(2).Infof("SyncLoop (DELETE, %q): %q", u.Source, format.Pods(u.Pods))
			// DELETE is treated as a UPDATE because of graceful deletion.
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.RESTORE:
			klog.V(2).Infof("SyncLoop (RESTORE, %q): %q", u.Source, format.Pods(u.Pods))
			// These are pods restored from the checkpoint. Treat them as new
			// pods.
			handler.HandlePodAdditions(u.Pods)
		case kubetypes.SET:
			// TODO: Do we want to support this?
			klog.Errorf("Kubelet does not support snapshot update")
		}

		...

When the Pod's DeletionTimestamp is set, kubelet takes the kubetypes.DELETE branch and eventually reaches the syncPod function in pkg/kubelet/kubelet.go, kubelet's core reconciliation function. syncPod calls the container runtime's KillPod method, which in turn spawns goroutines that kill all containers in parallel using the killContainer method defined in pkg/kubelet/kuberuntime/kuberuntime_container.go. killContainer is implemented as follows:

func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
	...

	// From this point, pod and container must be non-nil.
	gracePeriod := int64(minimumGracePeriodInSeconds)
	switch {
	case pod.DeletionGracePeriodSeconds != nil:
		gracePeriod = *pod.DeletionGracePeriodSeconds
	case pod.Spec.TerminationGracePeriodSeconds != nil:
		gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
	}

	if len(message) == 0 {
		message = fmt.Sprintf("Stopping container %s", containerSpec.Name)
	}
	m.recordContainerEvent(pod, containerSpec, containerID.ID, v1.EventTypeNormal, events.KillingContainer, message)

	// Run internal pre-stop lifecycle hook
	if err := m.internalLifecycle.PreStopContainer(containerID.ID); err != nil {
		return err
	}

	// Run the pre-stop lifecycle hooks if applicable and if there is enough time to run it
	if containerSpec.Lifecycle != nil && containerSpec.Lifecycle.PreStop != nil && gracePeriod > 0 {
		gracePeriod = gracePeriod - m.executePreStopHook(pod, containerID, containerSpec, gracePeriod)
	}
	// always give containers a minimal shutdown window to avoid unnecessary SIGKILLs
	if gracePeriod < minimumGracePeriodInSeconds {
		gracePeriod = minimumGracePeriodInSeconds
	}
	if gracePeriodOverride != nil {
		gracePeriod = *gracePeriodOverride
		klog.V(3).Infof("Killing container %q, but using %d second grace period override", containerID, gracePeriod)
	}

	klog.V(2).Infof("Killing container %q with %d second grace period", containerID.String(), gracePeriod)

	err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
	if err != nil {
		klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
	} else {
		klog.V(3).Infof("Container %q exited normally", containerID.String())
	}

	m.containerRefManager.ClearRef(containerID)

	return err
}

This method first runs the preStop hook and then kills the container process via runtimeService.StopContainer; the whole procedure must finish within DeletionGracePeriodSeconds. Note that the preStop hook is never retried: if it fails, kubelet doesn't care and the container is killed anyway.
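
As a back-of-the-envelope illustration of the budgeting in killContainer above (the function below is my paraphrase, not kubelet code; the 2-second minimum matches kubelet's minimumGracePeriodInSeconds constant): with a 30-second grace period and a preStop hook that takes 10 seconds, the runtime is left with roughly 20 seconds between SIGTERM and SIGKILL.

package main

import "fmt"

// remainingGracePeriod paraphrases the arithmetic in killContainer: the time
// spent in the preStop hook is subtracted from the grace period, but the
// runtime always gets at least the minimum shutdown window before SIGKILL.
func remainingGracePeriod(total, preStopSpent, minimum int64) int64 {
	remaining := total - preStopSpent
	if remaining < minimum {
		remaining = minimum
	}
	return remaining
}

func main() {
	// 30s grace period, preStop hook ran for 10s -> StopContainer gets 20s.
	fmt.Println(remainingGracePeriod(30, 10, 2)) // 20
	// A slow preStop hook eats almost the whole budget -> still the 2s minimum.
	fmt.Println(remainingGracePeriod(30, 29, 2)) // 2
}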

statusManager sends a delete request

kubelet also runs a statusManager in its own goroutine; its job is to watch for Pod status changes and then execute func (m *manager) syncPod(uid types.UID, status versionedPodStatus). Inside this syncPod, note the following logic:

func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
	...

	if m.canBeDeleted(pod, status.status) {
		deleteOptions := metav1.DeleteOptions{
			GracePeriodSeconds: new(int64),
			// Use the pod UID as the precondition for deletion to prevent deleting a
			// newly created pod with the same name and namespace.
			Preconditions: metav1.NewUIDPreconditions(string(pod.UID)),
		}
		err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, deleteOptions)
		if err != nil {
			klog.Warningf("Failed to delete status for pod %q: %v", format.Pod(pod), err)
			return
		}
		klog.V(3).Infof("Pod %q fully terminated and removed from etcd", format.Pod(pod))
		m.deletePodStatus(uid)
	}
}

In other words, when the statusManager determines that the Pod can be deleted, it calls the clientset's delete API to remove the Pod resource from kube-apiserver. So when can the Pod be deleted? Naturally, once the previous step has finished: after kubelet has cleaned up all of the Pod's containers, volumes, cgroups and sandbox, the Pod is ready to be deleted.
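
Spelled out as code, that condition reads roughly like this. It is a simplified paraphrase of canBeDeleted and the resource-reclaimed check, not the real implementation; the boolean parameters stand in for the checks kubelet performs against its runtime, volume manager and cgroup state:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// canFinallyDelete paraphrases the statusManager's check: the pod must already
// be marked for deletion, and every node-side resource (running containers,
// mounted volumes, the pod cgroup/sandbox) must be gone.
func canFinallyDelete(pod *corev1.Pod, containersRunning, volumesInUse, cgroupExists bool) bool {
	if pod.DeletionTimestamp == nil {
		return false // not marked for deletion yet
	}
	if containersRunning || volumesInUse || cgroupExists {
		return false // kubelet has not finished tearing the pod down
	}
	return true
}

func main() {
	fmt.Println(canFinallyDelete(&corev1.Pod{}, false, false, false)) // false: no DeletionTimestamp
}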

This is where the webhook receives the second delete request. Note that this request sets GracePeriodSeconds to 0, which tells kube-apiserver that, upon receiving this DELETE request, it may remove the Pod from etcd.

The third delete request

Why the webhook receives a third delete request puzzled me for quite a while.

Judging from the serviceAccount information in the logs, it looks as if a component on the node sent yet another DELETE request. Is it kubelet? Or kube-proxy? But after going through the relevant logs and code, I found nothing suspicious.

In fact, the third DELETE request is issued by kube-apiserver itself.

In the first part, I mentioned that after kube-apiserver receives a DELETE request it eventually calls the Delete method in staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go. Because that first deletion is graceful, the method simply updates the Pod's DeletionTimestamp and DeletionGracePeriodSeconds fields and returns.

Now the second DELETE request sets GracePeriodSeconds to 0, so the actual deletion can finally be carried out:

func (e *Store) Delete(ctx context.Context, name string, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
	...
	// delete immediately, or no graceful deletion supported
	klog.V(6).Infof("going to delete %s from registry: ", name)
	out = e.NewFunc()
	if err := e.Storage.Delete(ctx, key, out, &preconditions, storage.ValidateObjectFunc(deleteValidation), dryrun.IsDryRun(options.DryRun)); err != nil {
		// Please refer to the place where we set ignoreNotFound for the reason
		// why we ignore the NotFound error .
		if storage.IsNotFound(err) && ignoreNotFound && lastExisting != nil {
			// The lastExisting object may not be the last state of the object
			// before its deletion, but it's the best approximation.
			out, err := e.finalizeDelete(ctx, lastExisting, true)
			return out, true, err
		}
		return nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name)
	}
	...
}

The e.Storage.Delete call takes a storage.ValidateObjectFunc(deleteValidation) parameter. Reading its implementation carefully, it turns out that before actually deleting the object, kube-apiserver runs the delete operation through admission validation one more time (which is how our validating webhook gets invoked again). The logic lives in the conditionalDelete function in staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go:

func (s *store) conditionalDelete(ctx context.Context, key string, out runtime.Object, v reflect.Value, preconditions *storage.Preconditions, validateDeletion storage.ValidateObjectFunc) error {
	...
	for {
		origState, err := s.getState(getResp, key, v, false)
		if err != nil {
			return err
		}
		if preconditions != nil {
			if err := preconditions.Check(key, origState.obj); err != nil {
				return err
			}
		}
		if err := validateDeletion(ctx, origState.obj); err != nil {
			return err
		}
		startTime := time.Now()
		txnResp, err := s.client.KV.Txn(ctx).If(
			clientv3.Compare(clientv3.ModRevision(key), "=", origState.rev),
		).Then(
			clientv3.OpDelete(key),
		).Else(
			clientv3.OpGet(key),
		).Commit()
	...

}

validateDeletion is exactly where the DELETE admission check is performed, and this check inevitably calls the validating webhook, which gives us the third delete request. As for why admission control has to run once more at this point, I honestly don't know.