鳩小屋


Kubelet pod creation (dockershim, containerd)

A rough memo from walking through the kubelet and CRI pod-creation code.

What's inside a pod

Within a pod, the app containers share every namespace except the user namespace and cgroup namespace.

A pod runs not only the app containers but also a pause container.

https://speakerdeck.com/devinjeon/kubernetes-pod-internals-with-the-fundamentals-of-containers

The pause container provides the ipc and network namespaces, apparently so that the pod's network settings survive even when an app container crashes.

So a pod consists of one or more app containers plus the pause container, all sharing the ipc and network namespaces.

www.suse.com

Pod creation overview

Pods are created through the CRI (Container Runtime Interface), the contract between the kubelet running on a k8s worker node and the container runtime.
The main CRI calls involved in pod creation are:
1. RunPodSandbox creates the pod (pause container).
2. CreateContainer creates an app container.
3. StartContainer starts an app container.
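The call order can be modeled with a toy interface. This is an illustrative sketch, not the real CRI API (the real RPCs take protobuf request/response messages); RunPodSandbox hands back a sandbox ID that the later container calls reference:

```go
package main

import "fmt"

// Runtime is a toy model of the pod-creation call order (illustrative only).
type Runtime interface {
	RunPodSandbox(name string) string // create the sandbox (pause container)
	CreateContainer(sandboxID, image string) string
	StartContainer(containerID string)
}

type fakeRuntime struct{ log []string }

func (r *fakeRuntime) RunPodSandbox(name string) string {
	r.log = append(r.log, "RunPodSandbox "+name)
	return "sandbox-1"
}
func (r *fakeRuntime) CreateContainer(sandboxID, image string) string {
	r.log = append(r.log, "CreateContainer in "+sandboxID)
	return "ctr-1"
}
func (r *fakeRuntime) StartContainer(id string) {
	r.log = append(r.log, "StartContainer "+id)
}

// createPod drives the same order kubelet uses: sandbox first,
// then create and start each app container inside it.
func createPod(r Runtime, name string, images []string) {
	sb := r.RunPodSandbox(name)
	for _, img := range images {
		id := r.CreateContainer(sb, img)
		r.StartContainer(id)
	}
}

func main() {
	r := &fakeRuntime{}
	createPod(r, "mypod", []string{"nginx"})
	for _, l := range r.log {
		fmt.Println(l)
	}
}
```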

kubernetes.io

Container creation (containerd and runc) was covered in earlier posts, so this time we roughly follow step 1 of pod creation.
RunPodSandbox is, as the name says, the CRI call that creates the pod environment, and it appears to set up things like the pod network.
Note that what a "pod" is differs per container runtime, so the processing differs between VMM-based runtimes like Kata Containers and kernel-sharing runtimes like containerd (runc).

service RuntimeService {

    // Sandbox operations.

    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}  
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}  
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}  
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}  
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}  

    // Container operations.  
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}  
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}  
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}  
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}  
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}  
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}

    ...  
}

Pod creation flow

The kubelet creates a pod roughly as follows:
1. When a pod-creation event is received, HandlePodAdditions is invoked.
2. Eventually kubeRuntimeManager.SyncPod creates the pod sandbox and containers via dockershim or another CRI runtime.

toutiao.io

syncLoopIteration

func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
	select {
	case u, open := <-configCh:
		// Update from a config source; dispatch it to the right handler
		// callback.
		if !open {
			klog.ErrorS(nil, "Update channel is closed, exiting the sync loop")
			return false
		}

		switch u.Op {
		case kubetypes.ADD:
			klog.V(2).InfoS("SyncLoop ADD", "source", u.Source, "pods", format.Pods(u.Pods))
			// After restarting, kubelet will get all existing pods through
			// ADD as if they are new pods. These pods will then go through the
			// admission process and *may* be rejected. This can be resolved
			// once we have checkpointing.
			handler.HandlePodAdditions(u.Pods)
		case kubetypes.UPDATE:
			klog.V(2).InfoS("SyncLoop UPDATE", "source", u.Source, "pods", format.Pods(u.Pods))
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.REMOVE:
			klog.V(2).InfoS("SyncLoop REMOVE", "source", u.Source, "pods", format.Pods(u.Pods))
			handler.HandlePodRemoves(u.Pods)
		case kubetypes.RECONCILE:
			klog.V(4).InfoS("SyncLoop RECONCILE", "source", u.Source, "pods", format.Pods(u.Pods))
			handler.HandlePodReconcile(u.Pods)
		case kubetypes.DELETE:
			klog.V(2).InfoS("SyncLoop DELETE", "source", u.Source, "pods", format.Pods(u.Pods))
			// DELETE is treated as a UPDATE because of graceful deletion.
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.SET:
			// TODO: Do we want to support this?
			klog.ErrorS(nil, "Kubelet does not support snapshot update")
		default:
			klog.ErrorS(nil, "Invalid operation type received", "operation", u.Op)
		}
	...
}
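The syncLoopIteration pattern above — block on event channels and dispatch on the update's Op — can be reduced to a stdlib sketch (the types here are made-up stand-ins, not kubelet's):

```go
package main

import "fmt"

type Op int

const (
	ADD Op = iota
	UPDATE
	REMOVE
)

// PodUpdate is a simplified stand-in for kubetypes.PodUpdate.
type PodUpdate struct {
	Op   Op
	Pods []string
}

// iterate models one syncLoopIteration: receive an update from the
// config source and route it to the handler. It returns false when
// the channel is closed, mirroring kubelet's exit condition.
func iterate(configCh <-chan PodUpdate, handle func(Op, []string)) bool {
	u, open := <-configCh
	if !open {
		return false
	}
	switch u.Op {
	case ADD, UPDATE, REMOVE:
		handle(u.Op, u.Pods)
	default:
		fmt.Println("invalid operation")
	}
	return true
}

func main() {
	ch := make(chan PodUpdate, 2)
	ch <- PodUpdate{Op: ADD, Pods: []string{"pod-a"}}
	close(ch)
	for iterate(ch, func(op Op, pods []string) {
		fmt.Println(op, pods)
	}) {
	}
}
```

The real loop selects over several sources (config updates, sync ticks, housekeeping, PLEG events); this sketch keeps only the config-channel case.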

HandlePodAdditions

Pod creation is invoked along the call chain HandlePodAdditions → dispatchWork → podWorkers.UpdatePod.

func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
	start := kl.clock.Now()
	sort.Sort(sliceutils.PodsByCreationTime(pods))
	for _, pod := range pods {
		existingPods := kl.podManager.GetPods()
		kl.podManager.AddPod(pod)

		if kubetypes.IsMirrorPod(pod) {
			kl.handleMirrorPod(pod, start)
			continue
		}

		if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
			// We failed pods that we rejected, so activePods include all admitted
			// pods that are alive.
			activePods := kl.filterOutInactivePods(existingPods)

			// Check if we can admit the pod; if not, reject it.
			if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
				kl.rejectPod(pod, reason, message)
				continue
			}
		}
		mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
		// Dispatch the asynchronous pod start processing
		kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
		// Start Startup/Liveness/Readiness probe workers for each container the pod runs
		kl.probeManager.AddPod(pod)
	}
}
func (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {
	// Run the sync in an async worker.
	kl.podWorkers.UpdatePod(UpdatePodOptions{
		Pod:        pod,
		MirrorPod:  mirrorPod,
		UpdateType: syncType,
		StartTime:  start,
	})
	// Note the number of containers for new pods.
	if syncType == kubetypes.SyncPodCreate {
		metrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))
	}
}

UpdatePod

Next, control moves UpdatePod → managePodLoop (goroutine) → syncPodFn, and syncPodFn carries out the actual container startup work.
syncPodFn syncs the pod status with the API server, creates the pod's data directories, and so on.
After all that, it calls kubeRuntimeManager.SyncPod.

func (p *podWorkers) UpdatePod(options UpdatePodOptions) {
...
	if podUpdates, exists = p.podUpdates[uid]; !exists {
		// We need to have a buffer here, because checkForUpdates() method that
		// puts an update into channel is called from the same goroutine where
		// the channel is consumed. However, it is guaranteed that in such case
		// the channel is empty, so buffer of size 1 is enough.
		podUpdates = make(chan podWork, 1)
		p.podUpdates[uid] = podUpdates

		// Creating a new pod worker either means this is a new pod, or that the
		// kubelet just restarted. In either case the kubelet is willing to believe
		// the status of the pod for the first pod worker sync. See corresponding
		// comment in syncPod.
		go func() {
			defer runtime.HandleCrash()
			p.managePodLoop(podUpdates)
		}()
	}
...
func (p *podWorkers) managePodLoop(podUpdates <-chan podWork) {
	var lastSyncTime time.Time
	for update := range podUpdates {
		pod := update.Options.Pod

		klog.V(4).InfoS("Processing pod event", "pod", klog.KObj(pod), "podUID", pod.UID, "updateType", update.WorkType)
		err := func() error {
			// The worker is responsible for ensuring the sync method sees the appropriate
			// status updates on resyncs (the result of the last sync), transitions to
			// terminating (no wait), or on terminated (whatever the most recent state is).
			// Only syncing and terminating can generate pod status changes, while terminated
			// pods ensure the most recent status makes it to the api server.
			var status *kubecontainer.PodStatus
			var err error
			switch {
			case update.Options.RunningPod != nil:
				// when we receive a running pod, we don't need status at all
			default:
				// wait until we see the next refresh from the PLEG via the cache (max 2s)
				// TODO: this adds ~1s of latency on all transitions from sync to terminating
				//  to terminated, and on all termination retries (including evictions). We should
				//  improve latency by making the the pleg continuous and by allowing pod status
				//  changes to be refreshed when key events happen (killPod, sync->terminating).
				//  Improving this latency also reduces the possibility that a terminated
				//  container's status is garbage collected before we have a chance to update the
				//  API server (thus losing the exit code).
				status, err = p.podCache.GetNewerThan(pod.UID, lastSyncTime)
			}
			if err != nil {
				// This is the legacy event thrown by manage pod loop all other events are now dispatched
				// from syncPodFn
				p.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
				return err
			}

			ctx := p.contextForWorker(pod.UID)

			// Take the appropriate action (illegal phases are prevented by UpdatePod)
			switch {
			case update.WorkType == TerminatedPodWork:
				err = p.syncTerminatedPodFn(ctx, pod, status)

			case update.WorkType == TerminatingPodWork:
				...
			default:
				err = p.syncPodFn(ctx, update.Options.UpdateType, pod, update.Options.MirrorPod, status)
			}

			lastSyncTime = time.Now()
			return err
		}()
...
}

syncPodFn

func (kl *Kubelet) syncPod(ctx context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) error {
...
	// Record pod worker start latency if being created
	// TODO: make pod workers record their own latencies
	if updateType == kubetypes.SyncPodCreate {
		if !firstSeenTime.IsZero() {
			// This is the first time we are syncing the pod. Record the latency
			// since kubelet first saw the pod if firstSeenTime is set.
			metrics.PodWorkerStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
		} else {
			klog.V(3).InfoS("First seen time not recorded for pod",
				"podUID", pod.UID,
				"pod", klog.KObj(pod))
		}
	}

	// Generate final API pod status with pod and status manager status
	apiPodStatus := kl.generateAPIPodStatus(pod, podStatus)
	// The pod IP may be changed in generateAPIPodStatus if the pod is using host network. (See #24576)
	// TODO(random-liu): After writing pod spec into container labels, check whether pod is using host network, and
	// set pod IP to hostIP directly in runtime.GetPodStatus
	podStatus.IPs = make([]string, 0, len(apiPodStatus.PodIPs))
	for _, ipInfo := range apiPodStatus.PodIPs {
		podStatus.IPs = append(podStatus.IPs, ipInfo.IP)
	}

	if len(podStatus.IPs) == 0 && len(apiPodStatus.PodIP) > 0 {
		podStatus.IPs = []string{apiPodStatus.PodIP}
	}

	// If the pod should not be running, we request the pod's containers be stopped. This is not the same
	// as termination (we want to stop the pod, but potentially restart it later if soft admission allows
	// it later). Set the status and phase appropriately
	runnable := kl.canRunPod(pod)
	if !runnable.Admit {
		// Pod is not runnable; and update the Pod and Container statuses to why.
		if apiPodStatus.Phase != v1.PodFailed && apiPodStatus.Phase != v1.PodSucceeded {
			apiPodStatus.Phase = v1.PodPending
		}
		apiPodStatus.Reason = runnable.Reason
		apiPodStatus.Message = runnable.Message
		// Waiting containers are not creating.
		const waitingReason = "Blocked"
		for _, cs := range apiPodStatus.InitContainerStatuses {
			if cs.State.Waiting != nil {
				cs.State.Waiting.Reason = waitingReason
			}
		}
		for _, cs := range apiPodStatus.ContainerStatuses {
			if cs.State.Waiting != nil {
				cs.State.Waiting.Reason = waitingReason
			}
		}
	}

	// Record the time it takes for the pod to become running.
	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
	if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
		!firstSeenTime.IsZero() {
		metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
	}

	kl.statusManager.SetPodStatus(pod, apiPodStatus)

	// Pods that are not runnable must be stopped - return a typed error to the pod worker
	if !runnable.Admit {
		klog.V(2).InfoS("Pod is not runnable and must have running containers stopped", "pod", klog.KObj(pod), "podUID", pod.UID, "message", runnable.Message)
		var syncErr error
		p := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
		if err := kl.killPod(pod, p, nil); err != nil {
			kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
			syncErr = fmt.Errorf("error killing pod: %v", err)
			utilruntime.HandleError(syncErr)
		} else {
			// There was no error killing the pod, but the pod cannot be run.
			// Return an error to signal that the sync loop should back off.
			syncErr = fmt.Errorf("pod cannot be run: %s", runnable.Message)
		}
		return syncErr
	}

	// If the network plugin is not ready, only start the pod if it uses the host network
	if err := kl.runtimeState.networkErrors(); err != nil && !kubecontainer.IsHostNetworkPod(pod) {
		kl.recorder.Eventf(pod, v1.EventTypeWarning, events.NetworkNotReady, "%s: %v", NetworkNotReadyErrorMsg, err)
		return fmt.Errorf("%s: %v", NetworkNotReadyErrorMsg, err)
	}

	...

	// Make data directories for the pod
	if err := kl.makePodDataDirs(pod); err != nil {
		...
	}

	// Volume manager will not mount volumes for terminating pods
	// TODO: once context cancellation is added this check can be removed
	if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
		...
	}

	// Fetch the pull secrets for the pod
	pullSecrets := kl.getPullSecretsForPod(pod)

	// Call the container runtime's SyncPod callback
	result := kl.containerRuntime.SyncPod(pod, podStatus, pullSecrets, kl.backOff)
	kl.reasonCache.Update(pod.UID, result)
	if err := result.Error(); err != nil {
		// Do not return error if the only failures were pods in backoff
		for _, r := range result.SyncResults {
			if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
				// Do not record an event here, as we keep all event logging for sync pod failures
				// local to container runtime so we get better errors
				return err
			}
		}

		return nil
	}

	return nil
}

kubeRuntimeManager.SyncPod

At last the relevant processing comes into view. Steps
4. Create sandbox if necessary.
6. Create init containers.
7. Create normal containers.
are where the pod sandbox is created and the app containers are started.

// SyncPod syncs the running pod into the desired pod by executing following steps:
//
//  1. Compute sandbox and container changes.
//  2. Kill pod sandbox if necessary.
//  3. Kill any containers that should not be running.
//  4. Create sandbox if necessary.
//  5. Create ephemeral containers.
//  6. Create init containers.
//  7. Create normal containers.
func (m *kubeGenericRuntimeManager) SyncPod(pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {
	// Step 1: Compute sandbox and container changes.
	...

	// Step 2: Kill the pod if the sandbox has changed.
	...
		// Step 3: kill any running containers in this pod which are not to keep.
		...

	// Step 4: Create a sandbox for the pod if necessary.
	podSandboxID := podContainerChanges.SandboxID
	if podContainerChanges.CreateSandbox {
		var msg string
		var err error

		klog.V(4).InfoS("Creating PodSandbox for pod", "pod", klog.KObj(pod))
		metrics.StartedPodsTotal.Inc()
		createSandboxResult := kubecontainer.NewSyncResult(kubecontainer.CreatePodSandbox, format.Pod(pod))
		result.AddSyncResult(createSandboxResult)
		podSandboxID, msg, err = m.createPodSandbox(pod, podContainerChanges.Attempt)
		if err != nil {
			// createPodSandbox can return an error from CNI, CSI,
			// or CRI if the Pod has been deleted while the POD is
			// being created. If the pod has been deleted then it's
			// not a real error.
			//
			// SyncPod can still be running when we get here, which
			// means the PodWorker has not acked the deletion.
			if m.podStateProvider.IsPodTerminationRequested(pod.UID) {
				klog.V(4).InfoS("Pod was deleted and sandbox failed to be created", "pod", klog.KObj(pod), "podUID", pod.UID)
				return
			}
			metrics.StartedPodsErrorsTotal.WithLabelValues(err.Error()).Inc()
			createSandboxResult.Fail(kubecontainer.ErrCreatePodSandbox, msg)
			klog.ErrorS(err, "CreatePodSandbox for pod failed", "pod", klog.KObj(pod))
			ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
			if referr != nil {
				klog.ErrorS(referr, "Couldn't make a ref to pod", "pod", klog.KObj(pod))
			}
			m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedCreatePodSandBox, "Failed to create pod sandbox: %v", err)
			return
		}
		klog.V(4).InfoS("Created PodSandbox for pod", "podSandboxID", podSandboxID, "pod", klog.KObj(pod))

		podSandboxStatus, err := m.runtimeService.PodSandboxStatus(podSandboxID)
		if err != nil {
			ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
			if referr != nil {
				klog.ErrorS(referr, "Couldn't make a ref to pod", "pod", klog.KObj(pod))
			}
			m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedStatusPodSandBox, "Unable to get pod sandbox status: %v", err)
			klog.ErrorS(err, "Failed to get pod sandbox status; Skipping pod", "pod", klog.KObj(pod))
			result.Fail(err)
			return
		}

		// If we ever allow updating a pod from non-host-network to
		// host-network, we may use a stale IP.
		if !kubecontainer.IsHostNetworkPod(pod) {
			// Overwrite the podIPs passed in the pod status, since we just started the pod sandbox.
			podIPs = m.determinePodSandboxIPs(pod.Namespace, pod.Name, podSandboxStatus)
			klog.V(4).InfoS("Determined the ip for pod after sandbox changed", "IPs", podIPs, "pod", klog.KObj(pod))
		}
	}

	// the start containers routines depend on pod ip(as in primary pod ip)
	// instead of trying to figure out if we have 0 < len(podIPs)
	// everytime, we short circuit it here
	podIP := ""
	if len(podIPs) != 0 {
		podIP = podIPs[0]
	}

	// Get podSandboxConfig for containers to start.
	configPodSandboxResult := kubecontainer.NewSyncResult(kubecontainer.ConfigPodSandbox, podSandboxID)
	result.AddSyncResult(configPodSandboxResult)
	podSandboxConfig, err := m.generatePodSandboxConfig(pod, podContainerChanges.Attempt)
	if err != nil {
		message := fmt.Sprintf("GeneratePodSandboxConfig for pod %q failed: %v", format.Pod(pod), err)
		klog.ErrorS(err, "GeneratePodSandboxConfig for pod failed", "pod", klog.KObj(pod))
		configPodSandboxResult.Fail(kubecontainer.ErrConfigPodSandbox, message)
		return
	}

	// Helper containing boilerplate common to starting all types of containers.
	// typeName is a description used to describe this type of container in log messages,
	// currently: "container", "init container" or "ephemeral container"
	// metricLabel is the label used to describe this type of container in monitoring metrics.
	// currently: "container", "init_container" or "ephemeral_container"
	start := func(typeName, metricLabel string, spec *startSpec) error {
		startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, spec.container.Name)
		result.AddSyncResult(startContainerResult)

		isInBackOff, msg, err := m.doBackOff(pod, spec.container, podStatus, backOff)
		if isInBackOff {
			startContainerResult.Fail(err, msg)
			klog.V(4).InfoS("Backing Off restarting container in pod", "containerType", typeName, "container", spec.container, "pod", klog.KObj(pod))
			return err
		}

		metrics.StartedContainersTotal.WithLabelValues(metricLabel).Inc()
		klog.V(4).InfoS("Creating container in pod", "containerType", typeName, "container", spec.container, "pod", klog.KObj(pod))
		// NOTE (aramase) podIPs are populated for single stack and dual stack clusters. Send only podIPs.
		if msg, err := m.startContainer(podSandboxID, podSandboxConfig, spec, pod, podStatus, pullSecrets, podIP, podIPs); err != nil {
			// startContainer() returns well-defined error codes that have reasonable cardinality for metrics and are
			// useful to cluster administrators to distinguish "server errors" from "user errors".
			metrics.StartedContainersErrorsTotal.WithLabelValues(metricLabel, err.Error()).Inc()
			startContainerResult.Fail(err, msg)
			// known errors that are logged in other places are logged at higher levels here to avoid
			// repetitive log spam
			switch {
			case err == images.ErrImagePullBackOff:
				klog.V(3).InfoS("Container start failed in pod", "containerType", typeName, "container", spec.container, "pod", klog.KObj(pod), "containerMessage", msg, "err", err)
			default:
				utilruntime.HandleError(fmt.Errorf("%v %+v start failed in pod %v: %v: %s", typeName, spec.container, format.Pod(pod), err, msg))
			}
			return err
		}

		return nil
	}

	// Step 5: start ephemeral containers
	...

	// Step 6: start the init container.
	if container := podContainerChanges.NextInitContainerToStart; container != nil {
		// Start the next init container.
		if err := start("init container", metrics.InitContainer, containerStartSpec(container)); err != nil {
			return
		}

		// Successfully started the container; clear the entry in the failure
		klog.V(4).InfoS("Completed init container for pod", "containerName", container.Name, "pod", klog.KObj(pod))
	}

	// Step 7: start containers in podContainerChanges.ContainersToStart.
	for _, idx := range podContainerChanges.ContainersToStart {
		start("container", metrics.Container, containerStartSpec(&pod.Spec.Containers[idx]))
	}

	return
}

createPodSandbox

generatePodSandboxConfig builds the pod's sandbox config, and m.runtimeService.RunPodSandbox issues the RunPodSandbox CRI call.

// createPodSandbox creates a pod sandbox and returns (podSandBoxID, message, error).
func (m *kubeGenericRuntimeManager) createPodSandbox(pod *v1.Pod, attempt uint32) (string, string, error) {
	podSandboxConfig, err := m.generatePodSandboxConfig(pod, attempt)
	if err != nil {
		message := fmt.Sprintf("Failed to generate sandbox config for pod %q: %v", format.Pod(pod), err)
		klog.ErrorS(err, "Failed to generate sandbox config for pod", "pod", klog.KObj(pod))
		return "", message, err
	}

	// Create pod logs directory
	err = m.osInterface.MkdirAll(podSandboxConfig.LogDirectory, 0755)
	if err != nil {
		message := fmt.Sprintf("Failed to create log directory for pod %q: %v", format.Pod(pod), err)
		klog.ErrorS(err, "Failed to create log directory for pod", "pod", klog.KObj(pod))
		return "", message, err
	}

	runtimeHandler := ""
	if m.runtimeClassManager != nil {
		runtimeHandler, err = m.runtimeClassManager.LookupRuntimeHandler(pod.Spec.RuntimeClassName)
		if err != nil {
			message := fmt.Sprintf("Failed to create sandbox for pod %q: %v", format.Pod(pod), err)
			return "", message, err
		}
		if runtimeHandler != "" {
			klog.V(2).InfoS("Running pod with runtime handler", "pod", klog.KObj(pod), "runtimeHandler", runtimeHandler)
		}
	}

	podSandBoxID, err := m.runtimeService.RunPodSandbox(podSandboxConfig, runtimeHandler)
	if err != nil {
		message := fmt.Sprintf("Failed to create sandbox for pod %q: %v", format.Pod(pod), err)
		klog.ErrorS(err, "Failed to create sandbox for pod", "pod", klog.KObj(pod))
		return "", message, err
	}

	return podSandBoxID, "", nil
}

RunPodSandbox (CRI client)

RunPodSandbox is defined as a gRPC method, and the CRI client side is implemented as follows.

func (r *remoteRuntimeService) RunPodSandbox(config *runtimeapi.PodSandboxConfig, runtimeHandler string) (string, error) {
	// Use 2 times longer timeout for sandbox operation (4 mins by default)
	// TODO: Make the pod sandbox timeout configurable.
	timeout := r.timeout * 2

	klog.V(10).InfoS("[RemoteRuntimeService] RunPodSandbox", "config", config, "runtimeHandler", runtimeHandler, "timeout", timeout)

	ctx, cancel := getContextWithTimeout(timeout)
	defer cancel()

	resp, err := r.runtimeClient.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{
		Config:         config,
		RuntimeHandler: runtimeHandler,
	})
	if err != nil {
		klog.ErrorS(err, "RunPodSandbox from runtime service failed")
		return "", err
	}

	if resp.PodSandboxId == "" {
		errorMessage := fmt.Sprintf("PodSandboxId is not set for sandbox %q", config.GetMetadata())
		err := errors.New(errorMessage)
		klog.ErrorS(err, "RunPodSandbox failed")
		return "", err
	}

	klog.V(10).InfoS("[RemoteRuntimeService] RunPodSandbox Response", "podSandboxID", resp.PodSandboxId)

	return resp.PodSandboxId, nil
}
service RuntimeService {
    // Version returns the runtime name, runtime version, and runtime API version.
    rpc Version(VersionRequest) returns (VersionResponse) {}

    // RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure
    // the sandbox is in the ready state on success.
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    // StopPodSandbox stops any running process that is part of the sandbox and
    // reclaims network resources (e.g., IP addresses) allocated to the sandbox.
    // If there are any running containers in the sandbox, they must be forcibly
    // terminated.
    // This call is idempotent, and must not return an error if all relevant
    // resources have already been reclaimed. kubelet will call StopPodSandbox
    // at least once before calling RemovePodSandbox. It will also attempt to
    // reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
    // multiple StopPodSandbox calls are expected.
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
    // RemovePodSandbox removes the sandbox. If there are any running containers
    // in the sandbox, they must be forcibly terminated and removed.
    // This call is idempotent, and must not return an error if the sandbox has
    // already been removed.
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
    // PodSandboxStatus returns the status of the PodSandbox. If the PodSandbox is not
    // present, returns an error.
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    // ListPodSandbox returns a list of PodSandboxes.
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    // CreateContainer creates a new container in specified PodSandbox
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    // StartContainer starts the container.
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    // StopContainer stops a running container with a grace period (i.e., timeout).
    // This call is idempotent, and must not return an error if the container has
    // already been stopped.
    // TODO: what must the runtime do after the grace period is reached?
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
    // RemoveContainer removes the container. If the container is running, the
    // container must be forcibly removed.
    // This call is idempotent, and must not return an error if the container has
    // already been removed.
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    // ListContainers lists all containers by filters.
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
    // ContainerStatus returns status of the container. If the container is not
    // present, returns an error.
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
    // UpdateContainerResources updates ContainerConfig of the container.
    rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
    // ReopenContainerLog asks runtime to reopen the stdout/stderr log file
    // for the container. This is often called after the log file has been
    // rotated. If the container is not running, container runtime can choose
    // to either create a new log file and return nil, or return an error.
    // Once it returns error, new container log file MUST NOT be created.
    rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}

    // ExecSync runs a command in a container synchronously.
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
    // Exec prepares a streaming endpoint to execute a command in the container.
    rpc Exec(ExecRequest) returns (ExecResponse) {}
    // Attach prepares a streaming endpoint to attach to a running container.
    rpc Attach(AttachRequest) returns (AttachResponse) {}
    // PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}

    // ContainerStats returns stats of the container. If the container does not
    // exist, the call returns an error.
    rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
    // ListContainerStats returns stats of all running containers.
    rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}

    // PodSandboxStats returns stats of the pod sandbox. If the pod sandbox does not
    // exist, the call returns an error.
    rpc PodSandboxStats(PodSandboxStatsRequest) returns (PodSandboxStatsResponse) {}
    // ListPodSandboxStats returns stats of the pod sandboxes matching a filter.
    rpc ListPodSandboxStats(ListPodSandboxStatsRequest) returns (ListPodSandboxStatsResponse) {}

    // UpdateRuntimeConfig updates the runtime configuration based on the given request.
    rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}

    // Status returns the status of the runtime.
    rpc Status(StatusRequest) returns (StatusResponse) {}
}

func (c *runtimeServiceClient) RunPodSandbox(ctx context.Context, in *RunPodSandboxRequest, opts ...grpc.CallOption) (*RunPodSandboxResponse, error) {
	out := new(RunPodSandboxResponse)
	err := c.cc.Invoke(ctx, "/runtime.v1alpha2.RuntimeService/RunPodSandbox", in, out, opts...)
	if err != nil {
		return nil, err
	}
	return out, nil
}

RunPodSandbox (docker)

Step 2 (ds.client.CreateContainer) creates the pause container, and Step 4 (ds.client.StartContainer) starts it.
From there, containers are created in the same way the docker CLI sends requests to dockerd.
The function also overrides the resolv.conf generated by docker and sets up the pod network via CNI.

kubernetes/pkg/kubelet/dockershim/docker_sandbox.go

// RunPodSandbox creates and starts a pod-level sandbox. Runtimes should ensure
// the sandbox is in ready state.
// For docker, PodSandbox is implemented by a container holding the network
// namespace for the pod.
// Note: docker doesn't use LogDirectory (yet).
func (ds *dockerService) RunPodSandbox(ctx context.Context, r *runtimeapi.RunPodSandboxRequest) (*runtimeapi.RunPodSandboxResponse, error) {
	config := r.GetConfig()

	// Step 1: Pull the image for the sandbox.
	image := defaultSandboxImage
	podSandboxImage := ds.podSandboxImage
	if len(podSandboxImage) != 0 {
		image = podSandboxImage
	}

	// NOTE: To use a custom sandbox image in a private repository, users need to configure the nodes with credentials properly.
	// see: https://kubernetes.io/docs/user-guide/images/#configuring-nodes-to-authenticate-to-a-private-registry
	// Only pull sandbox image when it's not present - v1.PullIfNotPresent.
	if err := ensureSandboxImageExists(ds.client, image); err != nil {
		return nil, err
	}

	// Step 2: Create the sandbox container.
	if r.GetRuntimeHandler() != "" && r.GetRuntimeHandler() != runtimeName {
		return nil, fmt.Errorf("RuntimeHandler %q not supported", r.GetRuntimeHandler())
	}
	createConfig, err := ds.makeSandboxDockerConfig(config, image)
	if err != nil {
		return nil, fmt.Errorf("failed to make sandbox docker config for pod %q: %v", config.Metadata.Name, err)
	}
	createResp, err := ds.client.CreateContainer(*createConfig)
	if err != nil {
		createResp, err = recoverFromCreationConflictIfNeeded(ds.client, *createConfig, err)
	}

	if err != nil || createResp == nil {
		return nil, fmt.Errorf("failed to create a sandbox for pod %q: %v", config.Metadata.Name, err)
	}
	resp := &runtimeapi.RunPodSandboxResponse{PodSandboxId: createResp.ID}

	ds.setNetworkReady(createResp.ID, false)
	defer func(e *error) {
		// Set networking ready depending on the error return of
		// the parent function
		if *e == nil {
			ds.setNetworkReady(createResp.ID, true)
		}
	}(&err)

	// Step 3: Create Sandbox Checkpoint.
	if err = ds.checkpointManager.CreateCheckpoint(createResp.ID, constructPodSandboxCheckpoint(config)); err != nil {
		return nil, err
	}

	// Step 4: Start the sandbox container.
	// Assume kubelet's garbage collector would remove the sandbox later, if
	// startContainer failed.
	err = ds.client.StartContainer(createResp.ID)
	if err != nil {
		return nil, fmt.Errorf("failed to start sandbox container for pod %q: %v", config.Metadata.Name, err)
	}

	// Rewrite resolv.conf file generated by docker.
	// NOTE: cluster dns settings aren't passed anymore to docker api in all cases,
	// not only for pods with host network: the resolver conf will be overwritten
	// after sandbox creation to override docker's behaviour. This resolv.conf
	// file is shared by all containers of the same pod, and needs to be modified
	// only once per pod.
	if dnsConfig := config.GetDnsConfig(); dnsConfig != nil {
		containerInfo, err := ds.client.InspectContainer(createResp.ID)
		if err != nil {
			return nil, fmt.Errorf("failed to inspect sandbox container for pod %q: %v", config.Metadata.Name, err)
		}

		if err := rewriteResolvFile(containerInfo.ResolvConfPath, dnsConfig.Servers, dnsConfig.Searches, dnsConfig.Options); err != nil {
			return nil, fmt.Errorf("rewrite resolv.conf failed for pod %q: %v", config.Metadata.Name, err)
		}
	}

	// Do not invoke network plugins if in hostNetwork mode.
	if config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetNetwork() == runtimeapi.NamespaceMode_NODE {
		return resp, nil
	}

	// Step 5: Setup networking for the sandbox.
	// All pod networking is setup by a CNI plugin discovered at startup time.
	// This plugin assigns the pod ip, sets up routes inside the sandbox,
	// creates interfaces etc. In theory, its jurisdiction ends with pod
	// sandbox networking, but it might insert iptables rules or open ports
	// on the host as well, to satisfy parts of the pod spec that aren't
	// recognized by the CNI standard yet.
	cID := kubecontainer.BuildContainerID(runtimeName, createResp.ID)
	networkOptions := make(map[string]string)
	if dnsConfig := config.GetDnsConfig(); dnsConfig != nil {
		// Build DNS options.
		dnsOption, err := json.Marshal(dnsConfig)
		if err != nil {
			return nil, fmt.Errorf("failed to marshal dns config for pod %q: %v", config.Metadata.Name, err)
		}
		networkOptions["dns"] = string(dnsOption)
	}
	err = ds.network.SetUpPod(config.GetMetadata().Namespace, config.GetMetadata().Name, cID, config.Annotations, networkOptions)
	if err != nil {
		errList := []error{fmt.Errorf("failed to set up sandbox container %q network for pod %q: %v", createResp.ID, config.Metadata.Name, err)}

		// Ensure network resources are cleaned up even if the plugin
		// succeeded but an error happened between that success and here.
		err = ds.network.TearDownPod(config.GetMetadata().Namespace, config.GetMetadata().Name, cID)
		if err != nil {
			errList = append(errList, fmt.Errorf("failed to clean up sandbox container %q network for pod %q: %v", createResp.ID, config.Metadata.Name, err))
		}

		err = ds.client.StopContainer(createResp.ID, defaultSandboxGracePeriod)
		if err != nil {
			errList = append(errList, fmt.Errorf("failed to stop sandbox container %q for pod %q: %v", createResp.ID, config.Metadata.Name, err))
		}

		return resp, utilerrors.NewAggregate(errList)
	}

	return resp, nil
}
RunPodSandbox(containerd)

containerdのサーバ側の処理も確認してみると、docker-shimにもあったコンテナ作成やネットワーク設定を行っているらしき処理が見えます。
containerdではコンテナのプロセスをtaskとして管理するため、サンドボックスコンテナの作成・起動処理はcontainer.NewTaskとtask.Startが該当すると思われます。

// RunPodSandbox creates and starts a pod-level sandbox. Runtimes should ensure
// the sandbox is in ready state.
func (c *criService) RunPodSandbox(ctx context.Context, r *runtime.RunPodSandboxRequest) (_ *runtime.RunPodSandboxResponse, retErr error) {
	config := r.GetConfig()
	log.G(ctx).Debugf("Sandbox config %+v", config)

	// Generate unique id and name for the sandbox and reserve the name.
	id := util.GenerateID()
	metadata := config.GetMetadata()
	if metadata == nil {
		return nil, errors.New("sandbox config must include metadata")
	}
	name := makeSandboxName(metadata)
	log.G(ctx).Debugf("Generated id %q for sandbox %q", id, name)
	// Reserve the sandbox name to avoid concurrent `RunPodSandbox` request starting the
	// same sandbox.
	if err := c.sandboxNameIndex.Reserve(name, id); err != nil {
		return nil, errors.Wrapf(err, "failed to reserve sandbox name %q", name)
	}
	defer func() {
		// Release the name if the function returns with an error.
		if retErr != nil {
			c.sandboxNameIndex.ReleaseByName(name)
		}
	}()

	// Create initial internal sandbox object.
	sandbox := sandboxstore.NewSandbox(
		sandboxstore.Metadata{
			ID:             id,
			Name:           name,
			Config:         config,
			RuntimeHandler: r.GetRuntimeHandler(),
		},
		sandboxstore.Status{
			State: sandboxstore.StateUnknown,
		},
	)

	// Ensure sandbox container image snapshot.
	image, err := c.ensureImageExists(ctx, c.config.SandboxImage, config)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to get sandbox image %q", c.config.SandboxImage)
	}
	containerdImage, err := c.toContainerdImage(ctx, *image)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to get image from containerd %q", image.ID)
	}

	ociRuntime, err := c.getSandboxRuntime(config, r.GetRuntimeHandler())
	if err != nil {
		return nil, errors.Wrap(err, "failed to get sandbox runtime")
	}
	log.G(ctx).Debugf("Use OCI %+v for sandbox %q", ociRuntime, id)

	podNetwork := true

	if goruntime.GOOS != "windows" &&
		config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetNetwork() == runtime.NamespaceMode_NODE {
		// Pod network is not needed on linux with host network.
		podNetwork = false
	}
	if goruntime.GOOS == "windows" &&
		config.GetWindows().GetSecurityContext().GetHostProcess() {
		//Windows HostProcess pods can only run on the host network
		podNetwork = false
	}

	if podNetwork {
		// If it is not in host network namespace then create a namespace and set the sandbox
		// handle. NetNSPath in sandbox metadata and NetNS is non empty only for non host network
		// namespaces. If the pod is in host network namespace then both are empty and should not
		// be used.
		var netnsMountDir = "/var/run/netns"
		if c.config.NetNSMountsUnderStateDir {
			netnsMountDir = filepath.Join(c.config.StateDir, "netns")
		}
		sandbox.NetNS, err = netns.NewNetNS(netnsMountDir)
		if err != nil {
			return nil, errors.Wrapf(err, "failed to create network namespace for sandbox %q", id)
		}
		sandbox.NetNSPath = sandbox.NetNS.GetPath()
		defer func() {
			if retErr != nil {
				deferCtx, deferCancel := ctrdutil.DeferContext()
				defer deferCancel()
				// Teardown network if an error is returned.
				if err := c.teardownPodNetwork(deferCtx, sandbox); err != nil {
					log.G(ctx).WithError(err).Errorf("Failed to destroy network for sandbox %q", id)
				}

				if err := sandbox.NetNS.Remove(); err != nil {
					log.G(ctx).WithError(err).Errorf("Failed to remove network namespace %s for sandbox %q", sandbox.NetNSPath, id)
				}
				sandbox.NetNSPath = ""
			}
		}()

		// Setup network for sandbox.
		// Certain VM based solutions like clear containers (Issue containerd/cri-containerd#524)
		// rely on the assumption that CRI shim will not be querying the network namespace to check the
		// network states such as IP.
		// In future runtime implementation should avoid relying on CRI shim implementation details.
		// In this case however caching the IP will add a subtle performance enhancement by avoiding
		// calls to network namespace of the pod to query the IP of the veth interface on every
		// SandboxStatus request.
		if err := c.setupPodNetwork(ctx, &sandbox); err != nil {
			return nil, errors.Wrapf(err, "failed to setup network for sandbox %q", id)
		}
	}

	// Create sandbox container.
	// NOTE: sandboxContainerSpec SHOULD NOT have side
	// effect, e.g. accessing/creating files, so that we can test
	// it safely.
	spec, err := c.sandboxContainerSpec(id, config, &image.ImageSpec.Config, sandbox.NetNSPath, ociRuntime.PodAnnotations)
	if err != nil {
		return nil, errors.Wrap(err, "failed to generate sandbox container spec")
	}
	log.G(ctx).Debugf("Sandbox container %q spec: %#+v", id, spew.NewFormatter(spec))
	sandbox.ProcessLabel = spec.Process.SelinuxLabel
	defer func() {
		if retErr != nil {
			selinux.ReleaseLabel(sandbox.ProcessLabel)
		}
	}()

	// handle any KVM based runtime
	if err := modifyProcessLabel(ociRuntime.Type, spec); err != nil {
		return nil, err
	}

	if config.GetLinux().GetSecurityContext().GetPrivileged() {
		// If privileged don't set selinux label, but we still record the MCS label so that
		// the unused label can be freed later.
		spec.Process.SelinuxLabel = ""
	}

	// Generate spec options that will be applied to the spec later.
	specOpts, err := c.sandboxContainerSpecOpts(config, &image.ImageSpec.Config)
	if err != nil {
		return nil, errors.Wrap(err, "failed to generate sanbdox container spec options")
	}

	sandboxLabels := buildLabels(config.Labels, image.ImageSpec.Config.Labels, containerKindSandbox)

	runtimeOpts, err := generateRuntimeOptions(ociRuntime, c.config)
	if err != nil {
		return nil, errors.Wrap(err, "failed to generate runtime options")
	}

	snapshotterOpt := snapshots.WithLabels(snapshots.FilterInheritedLabels(config.Annotations))
	opts := []containerd.NewContainerOpts{
		containerd.WithSnapshotter(c.config.ContainerdConfig.Snapshotter),
		customopts.WithNewSnapshot(id, containerdImage, snapshotterOpt),
		containerd.WithSpec(spec, specOpts...),
		containerd.WithContainerLabels(sandboxLabels),
		containerd.WithContainerExtension(sandboxMetadataExtension, &sandbox.Metadata),
		containerd.WithRuntime(ociRuntime.Type, runtimeOpts)}

	container, err := c.client.NewContainer(ctx, id, opts...)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create containerd container")
	}
	defer func() {
		if retErr != nil {
			deferCtx, deferCancel := ctrdutil.DeferContext()
			defer deferCancel()
			if err := container.Delete(deferCtx, containerd.WithSnapshotCleanup); err != nil {
				log.G(ctx).WithError(err).Errorf("Failed to delete containerd container %q", id)
			}
		}
	}()

	// Create sandbox container root directories.
	sandboxRootDir := c.getSandboxRootDir(id)
	if err := c.os.MkdirAll(sandboxRootDir, 0755); err != nil {
		return nil, errors.Wrapf(err, "failed to create sandbox root directory %q",
			sandboxRootDir)
	}
	defer func() {
		if retErr != nil {
			// Cleanup the sandbox root directory.
			if err := c.os.RemoveAll(sandboxRootDir); err != nil {
				log.G(ctx).WithError(err).Errorf("Failed to remove sandbox root directory %q",
					sandboxRootDir)
			}
		}
	}()
	volatileSandboxRootDir := c.getVolatileSandboxRootDir(id)
	if err := c.os.MkdirAll(volatileSandboxRootDir, 0755); err != nil {
		return nil, errors.Wrapf(err, "failed to create volatile sandbox root directory %q",
			volatileSandboxRootDir)
	}
	defer func() {
		if retErr != nil {
			// Cleanup the volatile sandbox root directory.
			if err := c.os.RemoveAll(volatileSandboxRootDir); err != nil {
				log.G(ctx).WithError(err).Errorf("Failed to remove volatile sandbox root directory %q",
					volatileSandboxRootDir)
			}
		}
	}()

	// Setup files required for the sandbox.
	if err = c.setupSandboxFiles(id, config); err != nil {
		return nil, errors.Wrapf(err, "failed to setup sandbox files")
	}
	defer func() {
		if retErr != nil {
			if err = c.cleanupSandboxFiles(id, config); err != nil {
				log.G(ctx).WithError(err).Errorf("Failed to cleanup sandbox files in %q",
					sandboxRootDir)
			}
		}
	}()

	// Update sandbox created timestamp.
	info, err := container.Info(ctx)
	if err != nil {
		return nil, errors.Wrap(err, "failed to get sandbox container info")
	}

	// Create sandbox task in containerd.
	log.G(ctx).Tracef("Create sandbox container (id=%q, name=%q).",
		id, name)

	taskOpts := c.taskOpts(ociRuntime.Type)
	// We don't need stdio for sandbox container.
	task, err := container.NewTask(ctx, containerdio.NullIO, taskOpts...)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create containerd task")
	}
	defer func() {
		if retErr != nil {
			deferCtx, deferCancel := ctrdutil.DeferContext()
			defer deferCancel()
			// Cleanup the sandbox container if an error is returned.
			if _, err := task.Delete(deferCtx, WithNRISandboxDelete(id), containerd.WithProcessKill); err != nil && !errdefs.IsNotFound(err) {
				log.G(ctx).WithError(err).Errorf("Failed to delete sandbox container %q", id)
			}
		}
	}()

	// wait is a long running background request, no timeout needed.
	exitCh, err := task.Wait(ctrdutil.NamespacedContext())
	if err != nil {
		return nil, errors.Wrap(err, "failed to wait for sandbox container task")
	}

	nric, err := nri.New()
	if err != nil {
		return nil, errors.Wrap(err, "unable to create nri client")
	}
	if nric != nil {
		nriSB := &nri.Sandbox{
			ID:     id,
			Labels: config.Labels,
		}
		if _, err := nric.InvokeWithSandbox(ctx, task, v1.Create, nriSB); err != nil {
			return nil, errors.Wrap(err, "nri invoke")
		}
	}

	if err := task.Start(ctx); err != nil {
		return nil, errors.Wrapf(err, "failed to start sandbox container task %q", id)
	}

	if err := sandbox.Status.Update(func(status sandboxstore.Status) (sandboxstore.Status, error) {
		// Set the pod sandbox as ready after successfully start sandbox container.
		status.Pid = task.Pid()
		status.State = sandboxstore.StateReady
		status.CreatedAt = info.CreatedAt
		return status, nil
	}); err != nil {
		return nil, errors.Wrap(err, "failed to update sandbox status")
	}

	// Add sandbox into sandbox store in INIT state.
	sandbox.Container = container

	if err := c.sandboxStore.Add(sandbox); err != nil {
		return nil, errors.Wrapf(err, "failed to add sandbox %+v into store", sandbox)
	}

	// start the monitor after adding sandbox into the store, this ensures
	// that sandbox is in the store, when event monitor receives the TaskExit event.
	//
	// TaskOOM from containerd may come before sandbox is added to store,
	// but we don't care about sandbox TaskOOM right now, so it is fine.
	c.eventMonitor.startSandboxExitMonitor(context.Background(), id, task.Pid(), exitCh)

	return &runtime.RunPodSandboxResponse{PodSandboxId: id}, nil
}

startContainer

startContainerでは、イメージのプル、コンテナの作成、コンテナの起動が行われます。
なお、デフォルトのサンドボックスイメージにはgcr.io/google_containers/pause-amd64:3.0のようにpauseコンテナのイメージが指定されているらしいです。

// startContainer starts a container and returns a message indicates why it is failed on error.
// It starts the container through the following steps:
// * pull the image
// * create the container
// * start the container
// * run the post start lifecycle hooks (if applicable)
func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, spec *startSpec, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, podIPs []string) (string, error) {
	container := spec.container

	// Step 1: pull the image.
	imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets, podSandboxConfig)
	...

	// Step 2: create the container.
	// For a new container, the RestartCount should be 0
	restartCount := 0
	containerStatus := podStatus.FindContainerStatusByName(container.Name)
	...

	target, err := spec.getTargetID(podStatus)

	containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)

	// Step 3: start the container.
	err = m.runtimeService.StartContainer(containerID)

	m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, fmt.Sprintf("Started container %s", container.Name))

	containerMeta := containerConfig.GetMetadata()
	sandboxMeta := podSandboxConfig.GetMetadata()
	legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,
		sandboxMeta.Namespace)
	containerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)

	// Step 4: execute the post start hook.
	if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
		kubeContainerID := kubecontainer.ContainerID{
			Type: m.runtimeName,
			ID:   containerID,
		}
		msg, handlerErr := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)
	}
	return "", nil
}

GitOps:GitHub Actions+Argo CD+Kubernetes

kurobato.hateblo.jp
前回はDockerfileのリポジトリを更新すると自動ビルドされたコンテナイメージがdocker hubにプッシュされる環境を用意しました。
次は、Argo CDというツールを使って、k8sマニフェストリポジトリを更新すると自動でコンテナがデプロイされる環境を構築します。

argoproj.github.io

環境

実機ホストOS:ubuntu
Kubernetes(minikube+virtualbox)

参考:k8s構築手順
Istio:マイクロサービス基盤入門 - 鳩小屋

Argo CDインストール

argocdネームスペースを作成してminikube k8s上にargoCDをインストールします。

#k8sロードバランサの用意(argocd-serverへアクセス用)
#minikubeで提供されるtunnelを起動します。
$ minikube tunnel
Status:	
	machine: minikube
	pid: 4324
	route: 10.96.0.0/12 -> 192.168.99.100
	minikube: Running
	services: [argocd-server, gitops-service, istio-ingressgateway]

#k8s node確認
$ kubectl get nodes -o wide
NAME           STATUS   ROLES                  AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE               KERNEL-VERSION   CONTAINER-RUNTIME
minikube       Ready    control-plane,master   11d    v1.21.2   192.168.99.100   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6
minikube-m02   Ready    <none>                 136m   v1.21.2   192.168.99.101   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6
minikube-m03   Ready    <none>                 135m   v1.21.2   192.168.99.102   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6

#argocdのインストール
$ kubectl create namespace argocd
$ kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

ポッドの確認

いろいろポッドがデプロイされています。
argocd-serverがユーザアクセス用のサーバみたいです。

$ kubectl get pod -n argocd
NAME                                  READY   STATUS    RESTARTS   AGE
argocd-application-controller-0       1/1     Running   0          10m
argocd-dex-server-68c7bf5fdd-flk9c    1/1     Running   0          10m
argocd-redis-7547547c4f-pcmlk         1/1     Running   0          10m
argocd-repo-server-58f87478b8-lhg78   1/1     Running   0          10m
argocd-server-6f4fcdc5dc-bpmgc        1/1     Running   0          10m

サービスの確認

argocd-serverはデフォルトでは外部公開されていません。
argocd-serverに接続できるようにサービスタイプをLoadBalancerに変更します。

$ kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
$ kubectl get services -n argocd
NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
argocd-dex-server       ClusterIP      10.98.145.53    <none>         5556/TCP,5557/TCP,5558/TCP   3h5m
argocd-metrics          ClusterIP      10.110.187.11   <none>         8082/TCP                     3h5m
argocd-redis            ClusterIP      10.96.238.136   <none>         6379/TCP                     3h5m
argocd-repo-server      ClusterIP      10.107.230.1    <none>         8081/TCP,8084/TCP            3h5m
argocd-server           LoadBalancer   10.107.71.74    10.107.71.74   80:30364/TCP,443:32221/TCP   3h5m
argocd-server-metrics   ClusterIP      10.107.149.9    <none>         8083/TCP                     3h5m

これでargocd-serverサービス経由でargocd-serverポッドにアクセスできます。

CLIのダウンロード

sudo curl -sSL -o /usr/local/bin/argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
sudo chmod +x /usr/local/bin/argocd

argocd-server へのアクセス

10.107.71.74:80にアクセスします。

f:id:FallenPigeon:20210821094108p:plain

初期パスワードの確認(初期ユーザはadmin)

$ kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d && echo
XXnTZyfLjyehwude

リポジトリ登録

f:id:FallenPigeon:20210821105042p:plain

k8sマニフェストのあるリポジトリをargocdに登録します。

f:id:FallenPigeon:20210821104641p:plain

k8sにデプロイするサンプルアプリとサービス

Hello GitOps!!を返すwebサーバコンテナをk8sのdeployment podとしてデプロイし、ロードバランサで公開します。

package main

import (
  "fmt"
  "net/http"
)

func handler(w http.ResponseWriter, r *http.Request){
  fmt.Fprintf(w,"Hello GitOps!!")
}

func main(){
  http.HandleFunc("/",handler)
  http.ListenAndServe(":8080",nil)
}
# Stage-1
FROM golang:1.16 as builder
COPY ./app/main.go ./
RUN go build -o /gitops-go-app ./main.go

# Stage-2
FROM ubuntu
EXPOSE 8080
COPY --from=builder /gitops-go-app /.
ENTRYPOINT ["./gitops-go-app"]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitops-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gitops
  template:
    metadata:
      labels:
        app: gitops
    spec:
      containers:
      - name: gitops
        image: docker.io/l3j5g7d9/gitops-go-app:latest
        imagePullPolicy: IfNotPresent
---
apiVersion: v1
kind: Service
metadata:
  name: gitops-service
spec:
  type: LoadBalancer
  ports:
    - name: gitops
      protocol: TCP
      port: 80
      targetPort: 8080
  selector:
    app: gitops

f:id:FallenPigeon:20210821104711p:plain

マニフェストが認識されました。

手動同期

設定が手動同期になっているため、同期ボタンをポチります。
f:id:FallenPigeon:20210821105145p:plain

すると、deployment podが展開されたような表示になります。
f:id:FallenPigeon:20210821105158p:plain

k8s上で動いているか確認します。

$ kubectl get pods -n default
NAME                                 READY   STATUS    RESTARTS   AGE
gitops-deployment-64879cfb89-n48hc   2/2     Running   0          21m
gitops-deployment-64879cfb89-q878s   2/2     Running   0          21m
gitops-deployment-64879cfb89-v58gk   2/2     Running   0          21m

$ kubectl get services -n default
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
gitops-service   LoadBalancer   10.106.186.67   10.106.186.67   80:32018/TCP   2m45s
kubernetes       ClusterIP      10.96.0.1       <none>          443/TCP        11d

$ curl 10.106.186.67:80
Hello GitOps!!

ポッドとサービスが作成され、ロードバランサ経由でwebサーバにアクセスできています。

自動同期

次は自動同期を有効化します。これでk8sリポジトリを変更すると自動検出してk8sにデプロイしてくれるようになるっぽいです。
f:id:FallenPigeon:20210821111202p:plain

リポジトリk8sマニフェストのポッド名を変更してみます。
f:id:FallenPigeon:20210821111732p:plain

すると、argoとk8s上で変更内容が自動反映されました。

f:id:FallenPigeon:20210821112103p:plain

$ kubectl get pods -n default
NAME                                         READY   STATUS    RESTARTS   AGE
gitops-deployment-changed-64879cfb89-q2gnj   2/2     Running   0          5m21s
gitops-deployment-changed-64879cfb89-rk6cn   2/2     Running   0          5m21s
gitops-deployment-changed-64879cfb89-trjn9   2/2     Running   0          5m21s

$ kubectl get services -n default
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
gitops-service   LoadBalancer   10.106.186.67   10.106.186.67   80:32018/TCP   43m
kubernetes       ClusterIP      10.96.0.1       <none>          443/TCP        11d

$ curl 10.106.186.67:80
Hello GitOps!!

これにてCD完成です。GUIポチポチできるので楽ちんですね。

CICDパイプラインの作成

ここまでGithub Actions(CI)とArgo CD(CD)の環境を構築してきました。
最後にこれらを統合してCICDパイプラインにします。
つまり、アプリケーションのソースコードとDockerfileをリポジトリにプッシュするとビルドからデプロイまで全自動で行われるようにします。

1.Github Actions:Dockerfileのリポジトリ更新をトリガとして自動ビルド+コンテナイメージのプッシュ
2.Argo CD:k8sマニフェストリポジトリ更新をトリガとして自動デプロイ

方法としては、Github Actionsのワークフローにk8sマニフェストリポジトリを更新するステップを加えることで、Argo CDの同期処理が起動するようにします。
処理としてはapp.yamlのポッド名をgitops-deployment[デプロイ番号]に書き換えてプッシュするだけです。
実際には、Helmも組み込んでKubernetesマニフェストを管理するのが無難っぽいのですが、少しややこしくなるので今回は省きます。

余談ですがyqコマンドなんてあったんですね。最近インフラレイヤはyamlだらけなので重宝しそうです。

code/.github/workflows$ nano main.yml 
name: Github Action CI

on:
  push:
    branches: [ main ]

jobs:
  build:
    name: GitOps Workflow
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      #Buildkitによるイメージビルド
      - name: Build an image from Dockerfile
        run: |
          DOCKER_BUILDKIT=1 docker image build . -f app/Dockerfile --tag ${{ secrets.DOCKERUSER }}/gitops-go-app:latest
      #Trivyによるイメージスキャン  
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: '${{ secrets.DOCKERUSER }}/gitops-go-app:latest'
          format: 'table'
          exit-code: '1'
          ignore-unfixed: true
          severity: 'CRITICAL,HIGH'
      #DockerHubにイメージプッシュ
      - name: Push Image
        run: |
          docker login docker.io --username ${{ secrets.DOCKERUSER }} --password ${{ secrets.DOCKERPASSWORD }}
          docker image push ${{ secrets.DOCKERUSER }}/gitops-go-app:latest
      #Kubernetesマニフェストの更新
      - name: Change Pod Name
        run: |
          echo -e "machine github.com\nlogin ${{ secrets.GITHUBUSER }}\npassword ${{ secrets.GITHUBTOKEN }}" > ~/.netrc
          git config --global user.email ${{ secrets.EMAIL }}
          git config --global user.name ${{ secrets.GITHUBUSER }}
          git clone https://github.com/${{ secrets.GITHUBUSER }}/config.git
          cd config/manifest
          yq e '.metadata.name = "gitops-deployment${{ github.run_number }}"' -i app.yaml
          git add app.yaml
          git commit -m ${{ github.run_number }} -a
          git push origin main

Github Actionsのワークフロー完了後にリポジトリのapp.yamlを確認するとポッド名(Deployment名)が更新されています。

f:id:FallenPigeon:20210821170355p:plain

f:id:FallenPigeon:20210821170410p:plain

次にArgo CDのコンソールを確認すると、更新されたポッド名が反映されています。

f:id:FallenPigeon:20210821170834p:plain

k8sでもちゃんと動いています。

$ kubectl get pods -n default
NAME                                   READY   STATUS    RESTARTS   AGE
gitops-deployment31-64879cfb89-6ds4x   2/2     Running   0          24m
gitops-deployment31-64879cfb89-hfhh2   2/2     Running   0          24m
gitops-deployment31-64879cfb89-tmlf9   2/2     Running   0          24m

$ kubectl get services -n default
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
gitops-service   LoadBalancer   10.106.186.67   10.106.186.67   80:32018/TCP   6h31m
kubernetes       ClusterIP      10.96.0.1       <none>          443/TCP        12d

$ curl 10.106.186.67:80
Hello GitOps!!

これでCICDパイプラインの完成です。
とりあえず動いたので満足。

Github Actions:CI


CICDはAWS CodeXシリーズしか触ったことがなかったのでGithub Actions(CI)を少し動かしてみました。
Dockerfileやアプリケーションを更新すると自動でイメージビルドとイメージプッシュが行われるシンプルな環境を構成します。

アプリケーションとDockerfileの用意

user@user-HP:~/gitops/code/app$ ls
Dockerfile  main.go

user@user-HP:~/gitops/code/app$ cat main.go 
package main

import (
  "fmt"
  "net/http"
)

func handler(w http.ResponseWriter, r *http.Request){
  fmt.Fprintf(w,"Hello GitOps!!")
}

func main(){
  http.HandleFunc("/",handler)
  http.ListenAndServe(":8080",nil)
}
user@user-HP:~/gitops/code/app$ cat Dockerfile 
# Stage-1
FROM golang:1.16 as builder
COPY ./app/main.go ./
RUN go build -o /gitops-go-app ./main.go

# Stage-2
FROM ubuntu
EXPOSE 8080
COPY --from=builder /gitops-go-app /.
ENTRYPOINT ["./gitops-go-app"]

上記が格納されたgitリポジトリを用意します。
f:id:FallenPigeon:20210815180703p:plain

GitHub Actionsのセットアップ

f:id:FallenPigeon:20210815180844p:plain
f:id:FallenPigeon:20210815181101p:plain

Secretの登録

ワークフローの設定で参照するシークレットを作成します。
f:id:FallenPigeon:20210815181437p:plain

ローカルリポジトリとの同期

user@user-HP:~/gitops/code$ git pull

ワークフローの設定

user@user-HP:~/gitops/code/.github/workflows$ nano main.yml                                                                       

name: Github Action CI

# mainブランチへのプッシュをトリガーにする
on:
  push:
    branches: [ main ]

jobs:
  build:
    name: GitOps Workflow
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      #Buildkitによるイメージビルド
      - name: Build an image from Dockerfile
        run: |
          DOCKER_BUILDKIT=1 docker image build . -f app/Dockerfile --tag ${{ secrets.DOCKERUSER }}/gitops-go-app:latest
      #Trivyによるイメージスキャン  
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: '${{ secrets.DOCKERUSER }}/gitops-go-app:latest'
          format: 'table'
          exit-code: '1'
          ignore-unfixed: true
          severity: 'CRITICAL,HIGH'
      #Docker Hubにイメージプッシュ
      - name: Push Image
        run: |
          docker login docker.io --username ${{ secrets.DOCKERUSER }} --password ${{ secrets.DOCKERPASSWORD }}
          docker image push ${{ secrets.DOCKERUSER }}/gitops-go-app:latest

リポジトリへのプッシュ

user@user-HP:~/gitops/code$ git add .
user@user-HP:~/gitops/code$ git commit -m "main.yml change"
user@user-HP:~/gitops/code$ git push -u origin main

ワークフローの確認

自動でワークフローが起動します。

f:id:FallenPigeon:20210815182859p:plain

エラーが出たとき
f:id:FallenPigeon:20210815183412p:plain

成功時
f:id:FallenPigeon:20210815183128p:plain

Docker Hubの確認

ワークフローが完了するとlatestタグのついたイメージが格納されているのが確認できます。
f:id:FallenPigeon:20210815183607p:plain

まとめ

セキュリティ観点では、ソースコード、Dockerfile、コンテナイメージの診断処理も実施すると良さそうです。
Argo CD等でKubernetesと連携すれば、コンテナ型CICDパイプラインが完成するはず。

Istio:マイクロサービス基盤入門

Istioデモ環境構築のメモです。
minikubeでvirtualbox上にKubernetesを構築し、Istioをインストールします。

virtualboxのインストール

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"

#virtualboxのインストール
$ sudo apt-get install virtualbox

kubectlのインストール

#curlのインストール
$ sudo apt install curl

#kubectlのダウンロード
$ curl -LO "https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl"
#実行権限付与
$ chmod +x ./kubectl
#パスを通す
$ sudo mv ./kubectl /usr/local/bin/kubectl

kubectlの動作確認
$ kubectl version --client
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

dockerのインストール(VirtualBox版Kubernetesでは不要)

#パッケージ更新
$ sudo apt update
#必要なパッケージをインストール
$ sudo apt install apt-transport-https ca-certificates software-properties-common
#dockerリポジトリの追加
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
#パッケージ更新
$ sudo apt update
#dockerのインストール
$ sudo apt install docker-ce
#dockerの動作確認
$ systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-08-09 14:18:39 JST; 3h 4min ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 1436 (dockerd)
      Tasks: 12
     Memory: 12.2M
     CGroup: /system.slice/docker.service
             └─1436 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

$ docker version
Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:54:27 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

#sudoを省略するためにuserをdockerグループに追加
$ sudo usermod -aG docker user

#グループ変更を反映するため再ログインし、dockerを再起動
$ sudo systemctl restart docker

minikubeのインストール

#minikubeバイナリのダウンロード
$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
#実行権限付与
$ chmod +x minikube
#パス設定
$ sudo mkdir -p /usr/local/bin/
$ sudo install minikube /usr/local/bin/

kubernetesの構築

virtualbox

node数を3とすると、マスターノードが1つ、ワーカノードが2つ、合計3つのVMが起動します。
今回のIstio構築では、実機構成に近いvirtualboxパターンを利用します。

#kubernetesの構築
$ minikube start --vm-driver=virtualbox --nodes 3

f:id:FallenPigeon:20210809175104p:plain

#ノード情報を確認
$ kubectl get nodes -o wide
NAME           STATUS   ROLES                  AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE               KERNEL-VERSION   CONTAINER-RUNTIME
minikube       Ready    control-plane,master   3h48m   v1.21.2   192.168.99.100   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6
minikube-m02   Ready    <none>                 3h21m   v1.21.2   192.168.99.101   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6
minikube-m03   Ready    <none>                 3h      v1.21.2   192.168.99.102   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6

CONTAINER-RUNTIMEがdockerになっていますが、これはホストOSのDockerではなく、virtualboxのVM上で動作しているDockerになります。

docker版

minikube start --driver=docker

Istioの構築

以降、公式の手順でIstio環境を構築します。
Istio / Getting Started

Istioパッケージのダウンロード

$ curl -L https://istio.io/downloadIstio | sh -
$ cd istio-1.10.3
#istioctlのパス設定
$ sudo mv bin/istioctl /usr/local/bin

Applying the sample configuration

$ istioctl install --set profile=demo -y
# Enable automatic Envoy sidecar proxy injection
$ kubectl label namespace default istio-injection=enabled

Sample application

$ kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml

Let's look inside the configuration file. Four (Kubernetes) Services are defined: Details, Ratings, Reviews, and Productpage.
Concretely, kind: Deployment represents the application pods (containers), kind: ServiceAccount the service account assigned to the pods (similar to an AWS IAM role), and kind: Service the port binding to the containers.
In particular, each Service (ClusterIP) binds port 9080 of a virtual IP address valid only inside Kubernetes (the ClusterIP assigned to the Service) to port 9080 of the container itself.
As a result, any pod inside Kubernetes can reach the container's port 9080 by accessing ClusterIP:9080. There is, however, no connectivity to the outside.
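
The label-selector binding described above can be made concrete with a small Python sketch of how a Service picks its backend pods (this is not how kube-proxy is implemented, and the pod names are made up for illustration):

```python
# Conceptual sketch of Service-to-Pod binding via label selectors.
# Not the actual kube-proxy implementation; pod names are hypothetical.
def select_backends(service, pods):
    """Return the pods whose labels match all of the Service's selector labels."""
    selector = service["selector"]
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

details_service = {"name": "details", "port": 9080, "selector": {"app": "details"}}
pods = [
    {"name": "details-v1-xxx", "labels": {"app": "details", "version": "v1"}},
    {"name": "ratings-v1-yyy", "labels": {"app": "ratings", "version": "v1"}},
]

# Traffic to ClusterIP:9080 is spread over the matching pods on port 9080.
print(select_backends(details_service, pods))  # ['details-v1-xxx']
```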

$ less samples/bookinfo/platform/kube/bookinfo.yaml
##################################################################################################
# Details service
##################################################################################################
apiVersion: v1
kind: Service
metadata:
  name: details
  labels:
    app: details
    service: details
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: details
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bookinfo-details
  labels:
    account: details
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: details-v1
  labels:
    app: details
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: details
      version: v1
  template:
    metadata:
      labels:
        app: details
        version: v1
    spec:
      serviceAccountName: bookinfo-details
      containers:
      - name: details
        image: docker.io/istio/examples-bookinfo-details-v1:1.16.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9080
        securityContext:
          runAsUser: 1000
---
##################################################################################################
# Ratings service
##################################################################################################
apiVersion: v1
kind: Service
metadata:
  name: ratings
  labels:
    app: ratings
    service: ratings
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: ratings
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bookinfo-ratings
  labels:
    account: ratings
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ratings-v1
  labels:
    app: ratings
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ratings
      version: v1
  template:
    metadata:
      labels:
        app: ratings
        version: v1
    spec:
      serviceAccountName: bookinfo-ratings
      containers:
      - name: ratings
        image: docker.io/istio/examples-bookinfo-ratings-v1:1.16.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9080
        securityContext:
          runAsUser: 1000
---
##################################################################################################
# Reviews service
##################################################################################################
apiVersion: v1
kind: Service
metadata:
  name: reviews
  labels:
    app: reviews
    service: reviews
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bookinfo-reviews
  labels:
    account: reviews
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v1
  labels:
    app: reviews
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v1
  template:
    metadata:
      labels:
        app: reviews
        version: v1
    spec:
      serviceAccountName: bookinfo-reviews
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v1:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
        securityContext:
          runAsUser: 1000
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v2
  labels:
    app: reviews
    version: v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v2
  template:
    metadata:
      labels:
        app: reviews
        version: v2
    spec:
      serviceAccountName: bookinfo-reviews
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v2:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
        securityContext:
          runAsUser: 1000
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v3
  labels:
    app: reviews
    version: v3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v3
  template:
    metadata:
      labels:
        app: reviews
        version: v3
    spec:
      serviceAccountName: bookinfo-reviews
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v3:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
        securityContext:
          runAsUser: 1000
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
---
##################################################################################################
# Productpage services
##################################################################################################
apiVersion: v1
kind: Service
metadata:
  name: productpage
  labels:
    app: productpage
    service: productpage
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: productpage
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bookinfo-productpage
  labels:
    account: productpage
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: productpage-v1
  labels:
    app: productpage
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: productpage
      version: v1
  template:
    metadata:
      labels:
        app: productpage
        version: v1
    spec:
      serviceAccountName: bookinfo-productpage
      containers:
      - name: productpage
        image: docker.io/istio/examples-bookinfo-productpage-v1:1.16.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        securityContext:
          runAsUser: 1000
      volumes:
      - name: tmp
        emptyDir: {}

The configuration file above deploys the following application.

f:id:FallenPigeon:20210809183650p:plain

Checking the initial configuration

Checking the Kubernetes namespaces here, we see that a namespace called istio-system has been created.
The rest are Kubernetes default namespaces. Unless otherwise specified, application pods are deployed to the default namespace.
Note that these namespaces are unrelated to Linux kernel namespaces; they are logical units specific to Kubernetes.

$ kubectl get namespaces
NAME              STATUS   AGE
default           Active   4h12m
istio-system      Active   131m
kube-node-lease   Active   4h12m
kube-public       Active   4h12m
kube-system       Active   4h12m

Let's check the pods deployed in each namespace.

$ kubectl get pods -o wide -n istio-system
NAME                                   READY   STATUS    RESTARTS   AGE    IP           NODE           NOMINATED NODE   READINESS GATES
istio-egressgateway-5547fcc8fc-zpd72   1/1     Running   0          136m   10.244.2.4   minikube-m03   <none>           <none>
istio-ingressgateway-8f568d595-s68gh   1/1     Running   0          136m   10.244.2.5   minikube-m03   <none>           <none>
istiod-568d797f55-6rvp5                1/1     Running   0          136m   10.244.1.3   minikube-m02   <none>           <none>

In the istio-system namespace, istio-egressgateway (exit) and istio-ingressgateway (entrance) are deployed as pods.
Communication with the outside of the Kubernetes cluster goes through these endpoints.

# List pods
$ kubectl get pods -o wide -n default
NAME                              READY   STATUS    RESTARTS   AGE    IP            NODE           NOMINATED NODE   READINESS GATES
details-v1-79f774bdb9-dxbmb       2/2     Running   0          130m   10.244.1.7    minikube-m02   <none>           <none>
productpage-v1-6b746f74dc-mt5sn   2/2     Running   0          130m   10.244.1.9    minikube-m02   <none>           <none>
ratings-v1-b6994bb9-9gmdq         2/2     Running   0          130m   10.244.2.9    minikube-m03   <none>           <none>
reviews-v1-545db77b95-zhr6v       2/2     Running   0          130m   10.244.1.8    minikube-m02   <none>           <none>
reviews-v2-7bf8c9648f-4n9c6       2/2     Running   0          130m   10.244.0.3    minikube       <none>           <none>
reviews-v3-84779c7bbc-8trsh       2/2     Running   0          130m   10.244.2.10   minikube-m03   <none>           <none>

# Pod details
$ kubectl describe pod productpage-v1-6b746f74dc-mt5sn
Name:         productpage-v1-6b746f74dc-mt5sn
Namespace:    default
Priority:     0
Node:         minikube-m02/192.168.99.101
Start Time:   Mon, 09 Aug 2021 16:11:35 +0900
Labels:       app=productpage
...
Containers:
  productpage:
    Container ID:   docker://25ae5f0dbd0984a2d8a675ffbe71592279ce936215b12c417aa06d543e201919
    Image:          docker.io/istio/examples-bookinfo-productpage-v1:1.16.2
    Image ID:       docker-pullable://istio/examples-bookinfo-productpage-v1@sha256:63ac3b4fb6c3ba395f5d044b0e10bae513afb34b9b7d862b3a7c3de7e0686667
    Port:           9080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 09 Aug 2021 16:11:36 +0900
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f7hl9 (ro)
  istio-proxy:
    Container ID:  docker://f48cac26a0bcce1c056044843c44389ec3cb7d55e477d7a39014b8c0746c9b61
    Image:         docker.io/istio/proxyv2:1.10.3
    Image ID:      docker-pullable://istio/proxyv2@sha256:a78b7a165744384d95f75d157c34e02d6b4355aaf8fe2a2c75914832bdf764e8
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      productpage.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Mon, 09 Aug 2021 16:11:37 +0900
...


The default namespace contains the Details, Ratings, Reviews, and Productpage pods that make up the application.
Looking further at the configuration of the productpage-v1-6b746f74dc-mt5sn pod, two containers are running: productpage and istio-proxy.
The productpage container is the application itself, while istio-proxy is the proxy that handles pod-to-pod routing. In other words, all traffic of an Istio application pod goes through istio-proxy.

Next, let's also check the Services assigned to the Details, Ratings, Reviews, and Productpage pods.
Each ClusterIP is a virtual IP valid only inside the cluster, in a different address space from the node (VM) IPs. Traffic to and from outside the cluster is bridged by the istio-ingressgateway (entrance) and istio-egressgateway (exit) described later.

$ kubectl get services -n default
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
details       ClusterIP   10.102.12.28     <none>        9080/TCP   3h7m
kubernetes    ClusterIP   10.96.0.1        <none>        443/TCP    5h14m
productpage   ClusterIP   10.103.127.69    <none>        9080/TCP   3h7m
ratings       ClusterIP   10.102.176.254   <none>        9080/TCP   3h7m
reviews       ClusterIP   10.109.191.110   <none>        9080/TCP   3h7m

# Check node information
$ kubectl get nodes -o wide
NAME           STATUS   ROLES                  AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE               KERNEL-VERSION   CONTAINER-RUNTIME
minikube       Ready    control-plane,master   3h48m   v1.21.2   192.168.99.100   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6
minikube-m02   Ready    <none>                 3h21m   v1.21.2   192.168.99.101   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6
minikube-m03   Ready    <none>                 3h      v1.21.2   192.168.99.102   <none>        Buildroot 2020.02.12   4.19.182         docker://20.10.6

Exposing the application externally

We now associate the deployed application Services (pods) with the Istio gateway.
This means applying the Gateway and VirtualService resources below.

# Apply the association
$ kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml

# Contents of the configuration file
$ less samples/bookinfo/networking/bookinfo-gateway.yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: bookinfo-gateway
spec:
  selector:
    istio: ingressgateway # use istio default controller
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookinfo
spec:
  hosts:
  - "*"
  gateways:
  - bookinfo-gateway
  http:
  - match:
    - uri:
        exact: /productpage
    - uri:
        prefix: /static
    - uri:
        exact: /login
    - uri:
        exact: /logout
    - uri:
        prefix: /api/v1/products
    route:
    - destination:
        host: productpage
        port:
          number: 9080
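
The URI matching of the VirtualService above (exact vs. prefix) can be sketched in Python as follows (a conceptual sketch only, not Envoy's matching engine):

```python
# Conceptual sketch of the VirtualService http.match rules above.
# A request URI matching any exact/prefix rule is routed to productpage:9080.
RULES = [
    {"exact": "/productpage"},
    {"prefix": "/static"},
    {"exact": "/login"},
    {"exact": "/logout"},
    {"prefix": "/api/v1/products"},
]
DESTINATION = ("productpage", 9080)

def route(uri):
    """Return the destination (host, port) on a match, otherwise None."""
    for rule in RULES:
        if "exact" in rule and uri == rule["exact"]:
            return DESTINATION
        if "prefix" in rule and uri.startswith(rule["prefix"]):
            return DESTINATION
    return None

print(route("/productpage"))       # ('productpage', 9080)
print(route("/static/css/x.css"))  # ('productpage', 9080)
print(route("/admin"))             # None
```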

Checking the istio-ingressgateway Service, we can see that the http2 NodePort is 30800 and the https NodePort is 31633.
This means that accessing node IP:30800 forwards traffic to port 8080 (the targetPort) of the istio-ingressgateway pod's container.

$ kubectl -n istio-system get service istio-ingressgateway -o json
...
    "spec": {
        "clusterIP": "10.106.165.119",
        "clusterIPs": [
            "10.106.165.119"
        ],
        "externalTrafficPolicy": "Cluster",
        "ipFamilies": [
            "IPv4"
        ],
        "ipFamilyPolicy": "SingleStack",
        "ports": [
            {
                "name": "status-port",
                "nodePort": 31759,
                "port": 15021,
                "protocol": "TCP",
                "targetPort": 15021
            },
            {
                "name": "http2",
                "nodePort": 30800,
                "port": 80,
                "protocol": "TCP",
                "targetPort": 8080
            },
            {
                "name": "https",
                "nodePort": 31633,
                "port": 443,
                "protocol": "TCP",
                "targetPort": 8443
            },
            {
                "name": "tcp",
                "nodePort": 32007,
                "port": 31400,
                "protocol": "TCP",
                "targetPort": 31400
            },
            {
                "name": "tls",
                "nodePort": 30666,
                "port": 15443,
                "protocol": "TCP",
                "targetPort": 15443
            }
        ],
        "selector": {
            "app": "istio-ingressgateway",
            "istio": "ingressgateway"
        },
        "sessionAffinity": "None",
        "type": "LoadBalancer"
...
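
As a sanity check, here is a small Python sketch that resolves which container port a given NodePort ends up at, using an excerpt of the ports list from the kubectl output above (conceptual only):

```python
# Excerpt of the istio-ingressgateway Service ports, transcribed from the
# kubectl output above. NodePort -> Service port -> container targetPort.
PORTS = [
    {"name": "status-port", "nodePort": 31759, "port": 15021, "targetPort": 15021},
    {"name": "http2",       "nodePort": 30800, "port": 80,    "targetPort": 8080},
    {"name": "https",       "nodePort": 31633, "port": 443,   "targetPort": 8443},
]

def resolve(node_port):
    """Return the container port that traffic to node IP:node_port reaches."""
    for p in PORTS:
        if p["nodePort"] == node_port:
            return p["targetPort"]
    return None

print(resolve(30800))  # 8080
```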

Next, let's check the container of the istio-ingressgateway pod.
It is indeed listening on 8080. This container is the body of istio-ingressgateway (Envoy).

$ kubectl describe pod -n istio-system istio-ingressgateway-8f568d595-s68gh
Name:         istio-ingressgateway-8f568d595-s68gh
Namespace:    istio-system
Priority:     0
Node:         minikube-m03/192.168.99.102
Start Time:   Mon, 09 Aug 2021 16:06:47 +0900
Labels:       app=istio-ingressgateway
...
Status:       Running
IP:           10.244.2.5
IPs:
  IP:           10.244.2.5
Controlled By:  ReplicaSet/istio-ingressgateway-8f568d595
Containers:
  istio-proxy:
    Container ID:  docker://1396e222cd2aa759fff144ef299f4a19910b47ea4891e8d22caee5ad9e973ee8
    Image:         docker.io/istio/proxyv2:1.10.3
    Image ID:      docker-pullable://istio/proxyv2@sha256:a78b7a165744384d95f75d157c34e02d6b4355aaf8fe2a2c75914832bdf764e8
    Ports:         15021/TCP, 8080/TCP, 8443/TCP, 31400/TCP, 15443/TCP, 15090/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
...

A sidecar proxy's configuration consists of listeners (receive), routes (mapping), clusters (send), and endpoints (nodes).
First, let's check the listening ports of istio-ingressgateway.
Traffic received on 8080 is handed to Route: http.80.
As a supplement, the ingressgateway Service configuration maps port 80 to container port 8080.
So traffic destined for port 80 from the outside is received by the ingressgateway pod on 8080 and processed by the http.80 route.

$ istioctl proxy-config listener istio-ingressgateway-8f568d595-s68gh -n istio-system
ADDRESS PORT  MATCH DESTINATION
0.0.0.0 8080  ALL   Route: http.80
0.0.0.0 15021 ALL   Inline Route: /healthz/ready*
0.0.0.0 15090 ALL   Inline Route: /stats/prometheus*

(Reference) The ingressgateway port mapping from the gateway profile configuration:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 10m
            memory: 40Mi
        service:
          ports:
            ## You can add custom gateway ports in user values overrides, but it must include those ports since helm replaces.
            # Note that AWS ELB will by default perform health checks on the first port
            # on this list. Setting this to the health check port will ensure that health
            # checks always work. https://github.com/istio/istio/issues/12503
            - port: 15021
              targetPort: 15021
              name: status-port
            - port: 80
              targetPort: 8080
              name: http2
            - port: 443
              targetPort: 8443
              name: https
            - port: 31400
              targetPort: 31400
              name: tcp
              # This is the port where sni routing happens
            - port: 15443
              targetPort: 15443
              name: tls


Next, checking the routes, each path of http.80 is bound to the VIRTUAL SERVICE (bookinfo.default).
That is, the VirtualService destination (80→9080) forwards the traffic to port 9080 of the productpage service.

$ istioctl proxy-config route istio-ingressgateway-8f568d595-s68gh -n istio-system
NAME        DOMAINS     MATCH                  VIRTUAL SERVICE
http.80     *           /productpage           bookinfo.default
http.80     *           /static*               bookinfo.default
http.80     *           /login                 bookinfo.default
http.80     *           /logout                bookinfo.default
http.80     *           /api/v1/products*      bookinfo.default
            *           /stats/prometheus*     
            *           /healthz/ready*        

The VIRTUAL SERVICE delivers traffic to port 9080 of the productpage service, but it is not handed directly to the application container.

Checking the sidecar proxy configuration of the productpage pod, there are listening ports 15001 and 15006.
All of the pod's inbound and outbound traffic is redirected by iptables rules: outbound traffic to the proxy's 15001, inbound traffic to the proxy's 15006.
In this case the traffic matches 15006 Trans: raw_buffer; Addr: *:9080, so the proxy forwards it to productpage's port 9080.
With this, the request finally reaches the application pod.
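
The redirect-and-match flow can be sketched in Python as follows (a simplification of the iptables redirect plus the 15006 listener matching, not the actual Envoy logic; the port classification is reduced to the two cases above):

```python
# Conceptual sketch of the sidecar's iptables redirect (outbound -> 15001,
# inbound -> 15006) and the 15006 listener's address match.
OUTBOUND_PORT = 15001  # Envoy's outbound listener
INBOUND_PORT = 15006   # Envoy's inbound listener

def redirect(direction):
    """Return the proxy port that iptables redirects traffic to ('in'/'out')."""
    return INBOUND_PORT if direction == "in" else OUTBOUND_PORT

def inbound_cluster(dst_port):
    """Pick the Envoy cluster for traffic arriving on the 15006 listener."""
    if dst_port == 9080:
        return "inbound|9080||"             # matches Addr: *:9080
    return "InboundPassthroughClusterIpv4"  # everything else passes through

print(redirect("in"))         # 15006
print(inbound_cluster(9080))  # inbound|9080||
```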

https://speakerdeck.com/110y/tour-of-istio?slide=39

$ istioctl proxy-config listener productpage-v1-6b746f74dc-mt5sn
...
0.0.0.0       15001 ALL                                                                                             PassthroughCluster
0.0.0.0       15001 Addr: *:15001                                                                                   Non-HTTP/Non-TCP
0.0.0.0       15006 Addr: *:15006                                                                                   Non-HTTP/Non-TCP
0.0.0.0       15006 Trans: tls; App: istio-http/1.0,istio-http/1.1,istio-h2; Addr: 0.0.0.0/0                        InboundPassthroughClusterIpv4
0.0.0.0       15006 Trans: raw_buffer; App: HTTP; Addr: 0.0.0.0/0                                                   InboundPassthroughClusterIpv4
0.0.0.0       15006 Trans: tls; App: TCP TLS; Addr: 0.0.0.0/0                                                       InboundPassthroughClusterIpv4
0.0.0.0       15006 Trans: raw_buffer; Addr: 0.0.0.0/0                                                              InboundPassthroughClusterIpv4
0.0.0.0       15006 Trans: tls; Addr: 0.0.0.0/0                                                                     InboundPassthroughClusterIpv4
0.0.0.0       15006 Trans: tls; App: istio,istio-peer-exchange,istio-http/1.0,istio-http/1.1,istio-h2; Addr: *:9080 Cluster: inbound|9080||
0.0.0.0       15006 Trans: raw_buffer; Addr: *:9080                                                                 Cluster: inbound|9080||
...

$ istioctl proxy-config route productpage-v1-6b746f74dc-mt5sn
NAME                                                          DOMAINS                               MATCH                  VIRTUAL SERVICE
kube-dns.kube-system.svc.cluster.local:9153                   kube-dns.kube-system                  /*                     
80                                                            istio-egressgateway.istio-system      /*                     
80                                                            istio-ingressgateway.istio-system     /*                     
9080                                                          details                               /*                     
9080                                                          productpage                           /*                     
9080                                                          ratings                               /*                     
9080                                                          reviews                               /*                     
15010                                                         istiod.istio-system                   /*                     
15014                                                         istiod.istio-system                   /*                     
istio-ingressgateway.istio-system.svc.cluster.local:15021     istio-ingressgateway.istio-system     /*                     
                                                              *                                     /stats/prometheus*     
InboundPassthroughClusterIpv4                                 *                                     /*                     
inbound|9080||                                                *                                     /*                     
inbound|9080||                                                *                                     /* 


$ istioctl proxy-config cluster productpage-v1-6b746f74dc-k5lwk
SERVICE FQDN                                            PORT      SUBSET          DIRECTION     TYPE             DESTINATION RULE
                                                        9080      -               inbound       ORIGINAL_DST  

We have now confirmed routing through the transparent proxy.
Opening 192.168.99.100:30800 (the NodePort) in a browser shows the following site.

f:id:FallenPigeon:20210809214650p:plain

Tracee: a container tracer

This time the topic is tracee, a container tracer led by Aqua Security.
Since it uses eBPF-based system call hooking, Falco, led by Sysdig, is a comparable tool.
aquasecurity.github.io

Installation

# Build the eBPF module
$ git clone --recursive https://github.com/aquasecurity/tracee.git
$ cd tracee
$ make bpf
$ ls
tracee.bpf.5_11_0-25-generic.v0_6_0-7-g26a9eb2.o  tracee.bpf.core.o

Mount the eBPF module and start the tracee container.

# docker run --privileged -it -v /path/in/host/tracee.bpf.5_11_0-25-generic.v0_6_0-7-g26a9eb2.o:/path/in/container/tracee.bpf.o -e TRACEE_BPF_FILE=/path/in/container/tracee.bpf.o aquasec/tracee
Loaded signature(s):  [TRC-1 TRC-2 TRC-3 TRC-4 TRC-5 TRC-6 TRC-7]

Seven signatures appear to have been loaded, but nothing in particular is printed.

Trying it out

The signature feature is still under active development.
The current signatures cover malware-like behaviors such as anti-debugging and code injection.

Signature feature

We are currently working on creating a library of behavioral signature detections. Currently, the following are available:

- Standard Input/Output Over Socket: Redirection of process's standard input/output to socket
- Anti-Debugging: Process uses anti-debugging technique to block debugger
- Code injection: Possible code injection into another process
- Dynamic Code Loading: Writing to executable allocated memory region
- Fileless Execution: Executing a process from memory, without a file in the disk
- Kernel module loading: Attempt to load a kernel module detection
- LD_PRELOAD: Usage of LD_PRELOAD to allow hooks on process

This time, let's try the anti-debugging detection.

# Start an arbitrary container
$ sudo docker run --rm -it ubuntu
root@d23f0851e601:/#

# Prepare a debugger-detection program
$ nano antidebug.c
#include <stdio.h>
#include <sys/ptrace.h>

int main()
{
        if (ptrace(PTRACE_TRACEME, 0, 1, 0) < 0) {
                printf("Debugging Detected!\n");
                return 1;
        }
        printf("Normal Execution\n");
        return 0;
}

$ gcc antidebug.c

# Copy the detection program into the container
$ docker cp a.out d23f0851e601:/a.out

# Run the detection program in the container
root@d23f0851e601:/# ./a.out 
Normal Execution

When the detection program runs in the container, an Anti-Debugging alert is printed on the tracee container's console.
Nice, it works.

# docker run --privileged -it -v /path/in/host/tracee.bpf.5_11_0-25-generic.v0_6_0-7-g26a9eb2.o:/path/in/container/tracee.bpf.o -e TRACEE_BPF_FILE=/path/in/container/tracee.bpf.o aquasec/tracee
Loaded signature(s):  [TRC-1 TRC-2 TRC-3 TRC-4 TRC-5 TRC-6 TRC-7]

Detection
Time: 2021-08-08T12:52:30Z
Signature ID: TRC-2
Signature: Anti-Debugging
Data: map[]
Command: a.out
Hostname: d23f0851e601

The signatures appear to be written in the Rego language.
Since MITRE ATT&CK is also referenced, the implementation approach may be similar to Falco's.

package tracee.TRC_2

__rego_metadoc__ := {
    "id": "TRC-2",
    "version": "0.1.0",
    "name": "Anti-Debugging",
    "description": "Process uses anti-debugging technique to block debugger",
    "tags": ["linux", "container"],
    "properties": {
        "Severity": 3,
        "MITRE ATT&CK": "Defense Evasion: Execution Guardrails",
    }
}

tracee_selected_events[eventSelector] {
	eventSelector := {
		"source": "tracee",
		"name": "ptrace"
	}
}

tracee_match {
    input.eventName == "ptrace"
    arg := input.args[_]
    arg.name == "request"
    arg.value == "PTRACE_TRACEME"
}
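
For comparison, the tracee_match rule above expressed in Python would look roughly like this (a conceptual re-implementation for readability, not part of tracee):

```python
# Python equivalent of the Rego tracee_match above: detect a ptrace event
# whose "request" argument is PTRACE_TRACEME.
def tracee_match(event):
    """Return True when the event is ptrace(PTRACE_TRACEME, ...)."""
    if event.get("eventName") != "ptrace":
        return False
    return any(a.get("name") == "request" and a.get("value") == "PTRACE_TRACEME"
               for a in event.get("args", []))

event = {"eventName": "ptrace",
         "args": [{"name": "request", "value": "PTRACE_TRACEME"}]}
print(tracee_match(event))  # True
```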

Trace feature

Running the trace subcommand prints logs resembling a system call trace.
This looks useful for forensics.
As the name Tracee suggests, tracing seems to be the main feature, but with a richer signature set it could also serve as an EDR tool.

# docker run --privileged -it -v /path/in/host/tracee.bpf.123.o:/path/in/container/tracee.bpf.o -e TRACEE_BPF_FILE=/path/in/container/tracee.bpf.o aquasec/tracee trace
TIME             UID    COMM             PID     TID     RET              EVENT                ARGS
...
13:36:50:056945  1000   dbus-daemon      1835    1835    0                close                fd: 9
13:36:50:059291  1000   snap-store       2279    2279    0                security_file_open   pathname: /usr/share/zoneinfo/Japan, flags: O_RDONLY|O_LARGEFILE, dev: 7340033, inode: 9322
13:36:50:059272  1000   snap-store       2279    2279    16               openat               dirfd: -100, pathname: /etc/localtime, flags: O_RDONLY, mode: 0
13:36:50:059416  1000   snap-store       2279    2279    0                fstat                fd: 16, statbuf: 0x7FFFBE8FF1D0
13:36:50:059466  1000   snap-store       2279    2279    0                close                fd: 16
13:36:50:059629  1000   snap-store       2279    2279    0                security_file_open   pathname: /usr/share/zoneinfo/Japan, flags: O_RDONLY|O_LARGEFILE, dev: 7340033, inode: 9322
13:36:50:059617  1000   snap-store       2279    2279    16               openat               dirfd: -100, pathname: /etc/localtime, flags: O_RDONLY, mode: 0
13:36:50:059712  1000   snap-store       2279    2279    0                fstat                fd: 16, statbuf: 0x7FFFBE8FF1D0
13:36:50:059746  1000   snap-store       2279    2279    0                close                fd: 16
13:36:50:106080  118    pool-whoopsie    838     8281    0                security_file_open   pathname: /etc/services, flags: O_RDONLY|O_LARGEFILE, dev: 8388611, inode: 1704147
13:36:50:106070  118    pool-whoopsie    838     8281    11               openat               dirfd: -100, pathname: /etc/services, flags: O_RDONLY|O_CLOEXEC, mode: 0
13:36:50:106162  118    pool-whoopsie    838     8281    0                close                fd: 11
13:36:50:106192  118    pool-whoopsie    838     8281    0                security_file_open   pathname: /etc/services, flags: O_RDONLY|O_LARGEFILE, dev: 8388611, inode: 1704147
13:36:50:106187  118    pool-whoopsie    838     8281    11               openat               dirfd: -100, pathname: /etc/services, flags: O_RDONLY|O_CLOEXEC, mode: 0
13:36:50:106238  118    pool-whoopsie    838     8281    0                close                fd: 11
13:36:50:106402  118    pool-whoopsie    838     8281    0                security_socket_create family: AF_NETLINK, type: SOCK_RAW, protocol: 0, kern: 0
13:36:50:106398  118    pool-whoopsie    838     8281    11               socket               domain: AF_NETLINK, type: SOCK_RAW|SOCK_CLOEXEC, protocol: 0
13:36:50:106429  118    pool-whoopsie    838     8281    0                bind                 sockfd: 11, addr: {'sa_family': 'AF_NETLINK'}, addrlen: 12
13:36:50:106465  118    pool-whoopsie    838     8281    0                getsockname          sockfd: 11, addr: {'sa_family': 'AF_NETLINK'}, addrlen: 0x7F9770D5A3D4
13:36:50:106538  118    pool-whoopsie    838     8281    0                close                fd: 11
13:36:50:106577  118    pool-whoopsie    838     8281    0                security_file_open   pathname: /etc/hosts, flags: O_RDONLY|O_LARGEFILE, dev: 8388611, inode: 1704099
13:36:50:106572  118    pool-whoopsie    838     8281    11               openat               dirfd: -100, pathname: /etc/hosts, flags: O_RDONLY|O_CLOEXEC, mode: 0
13:36:50:106632  118    pool-whoopsie    838     8281    0                close                fd: 11
13:36:50:106668  118    pool-whoopsie    838     8281    0                security_socket_create family: AF_INET, type: SOCK_DGRAM, protocol: 0, kern: 0
13:36:50:106665  118    pool-whoopsie    838     8281    11               socket               domain: AF_INET, type: SOCK_DGRAM|SOCK_NONBLOCK|SOCK_CLOEXEC, protocol: 0
13:36:50:106707  118    pool-whoopsie    838     8281    0                security_socket_connect sockfd: 11, remote_addr: {'sa_family': 'AF_INET','sin_port': '53','sin_addr': '127.0.0.53'}
13:36:50:106702  118    pool-whoopsie    838     8281    0                connect              sockfd: 11, addr: {'sa_family': 'AF_INET','sin_port': '53','sin_addr': '127.0.0.53'}, addrlen: 16
13:36:50:107084  101    systemd-resolve  592     592     0                security_socket_create family: AF_INET, type: SOCK_DGRAM, protocol: 0, kern: 0
13:36:50:107081  101    systemd-resolve  592     592     16               socket               domain: AF_INET, type: SOCK_DGRAM|SOCK_NONBLOCK|SOCK_CLOEXEC, protocol: 0
13:36:50:107142  101    systemd-resolve  592     592     0                security_socket_connect sockfd: 16, remote_addr: {'sin_addr': '192.168.11.1','sa_family': 'AF_INET','sin_port': '53'}
13:36:50:107138  101    systemd-resolve  592     592     0                connect              sockfd: 16, addr: {'sa_family': 'AF_INET','sin_port': '53','sin_addr': '192.168.11.1'}, addrlen: 16
13:36:50:138031  101    systemd-resolve  592     592     0                close                fd: 16
13:36:50:138165  118    pool-whoopsie    838     8281    0                close                fd: 11
13:36:50:138236  118    pool-whoopsie    838     8281    0                security_socket_create family: AF_INET, type: SOCK_DGRAM, protocol: 0, kern: 0
13:36:50:138200  118    pool-whoopsie    838     8281    11               socket               domain: AF_INET, type: SOCK_DGRAM|SOCK_CLOEXEC, protocol: 0
13:36:50:138273  118    pool-whoopsie    838     8281    0                security_socket_connect sockfd: 11, remote_addr: {'sin_addr': '162.213.33.108','sa_family': 'AF_INET','sin_port': '0'}
13:36:50:138268  118    pool-whoopsie    838     8281    0                connect              sockfd: 11, addr: {'sa_family': 'AF_INET','sin_port': '0','sin_addr': '162.213.33.108'}, addrlen: 16
13:36:50:138301  118    pool-whoopsie    838     8281    0                getsockname          sockfd: 11, addr: {'sa_family': 'AF_INET','sin_port': '35285','sin_addr': '10.0.2.15'}, addrlen: 0x7F9770D5A540
13:36:50:138317  118    pool-whoopsie    838     8281    0                connect              sockfd: 11, addr: {'sa_family': 'AF_UNSPEC'}, addrlen: 16
13:36:50:138334  118    pool-whoopsie    838     8281    0                security_socket_connect sockfd: 11, remote_addr: {'sa_family': 'AF_INET','sin_port': '0','sin_addr': '162.213.33.132'}
13:36:50:138330  118    pool-whoopsie    838     8281    0                connect              sockfd: 11, addr: {'sin_addr': '162.213.33.132','sa_family': 'AF_INET','sin_port': '0'}, addrlen: 16
13:36:50:138349  118    pool-whoopsie    838     8281    0                getsockname          sockfd: 11, addr: {'sa_family': 'AF_INET','sin_port': '57882','sin_addr': '10.0.2.15'}, addrlen: 0x7F9770D5A540
13:36:50:138371  118    pool-whoopsie    838     8281    0                close                fd: 11
13:36:50:181517  1000   gnome-shell      1982    1982    0                cap_capable          cap: CAP_SYS_ADMIN
13:36:50:181618  1000   gnome-shell      1982    1982    0                cap_capable          cap: CAP_SYS_ADMIN
13:36:50:181730  1000   gnome-shell      1982    1982    0                cap_capable          cap: CAP_SYS_ADMIN
13:36:50:181908  1000   gnome-shell      1982    1982    0                cap_capable          cap: CAP_SYS_ADMIN

Falco: dynamic threat alerting for containers

Some notes from playing with Falco a bit.
Falco is an OSS tool that provides syscall-based threat alerting.
It also appears to incorporate content from MITRE ATT&CK.

falco.org
Matrix - Enterprise | MITRE ATT&CK®

インストール

Install it by following the documentation.
Install | Falco

#Trust the falcosecurity GPG key, configure the apt repository, and update the package list
curl -s https://falco.org/repo/falcosecurity-3672BA8F.asc | apt-key add -
echo "deb https://download.falco.org/packages/deb stable main" | tee -a /etc/apt/sources.list.d/falcosecurity.list
apt-get update -y

#Install kernel headers
apt-get -y install linux-headers-$(uname -r)

#Install Falco
apt-get install -y falco

#Run Falco as a service
systemctl start falco

Configuration files

$ ls /etc/falco/
falco.yaml  falco_rules.local.yaml  falco_rules.yaml  k8s_audit_rules.yaml  rules.available  rules.d

The default rules are defined in falco_rules.yaml, and you can add your own user-defined rules by appending them to falco_rules.local.yaml.

The definition below is the rule that detects writes directly under / and /root.
The condition: field is the condition expression, listing the target file paths and the exclusions.
The output: field is the alert message, which includes things like the command and the container ID.
Internally it is all syscall-based, but the user-facing rules abstract that away, so you can get by without knowing the underlying system calls.

$ less /etc/falco/falco_rules.yaml
...
- rule: Write below root
  desc: an attempt to write to any file directly below / or /root
  condition: >
    root_dir and evt.dir = < and open_write and proc_name_exists
    and not fd.name in (known_root_files)
    and not fd.directory pmatch (known_root_directories)
    and not exe_running_docker_save
    and not gugent_writing_guestagent_log
    and not dse_writing_tmp
    and not zap_writing_state
    and not airflow_writing_state
    and not rpm_writing_root_rpmdb
    and not maven_writing_groovy
    and not chef_writing_conf
    and not kubectl_writing_state
    and not cassandra_writing_state
    and not galley_writing_state
    and not calico_writing_state
    and not rancher_writing_root
    and not runc_writing_exec_fifo
    and not mysqlsh_writing_state
    and not known_root_conditions
    and not user_known_write_root_conditions
    and not user_known_write_below_root_activities
  output: >
    File below / or /root opened for writing (user=%user.name user_loginuid=%user.loginuid
    command=%proc.cmdline parent=%proc.pname file=%fd.name program=%proc.name
    container_id=%container.id image=%container.image.repository)
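A user-defined rule appended to falco_rules.local.yaml follows the same shape. Here is a minimal hypothetical sketch; the rule name, the /etc path condition, and the output text are my own invention, though `open_write`, `container`, and the `%`-style output fields are standard Falco building blocks:

```yaml
# Hypothetical custom rule for falco_rules.local.yaml (not part of the default ruleset)
- rule: Write below etc (custom)
  desc: detect writes under /etc from inside a container
  condition: >
    open_write and container and fd.name startswith /etc
  output: >
    File under /etc opened for writing in a container
    (command=%proc.cmdline file=%fd.name container_id=%container.id image=%container.image.repository)
  priority: WARNING
```

Restarting the falco service should pick the rule up alongside the defaults.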

Trying it out

Let's spin up a throwaway container and run a few commands.

$ sudo docker run --rm -it ubuntu
# touch unchi
# apt-get update
# apt-get install curl

Check Falco's output.

# sudo journalctl -f -u falco.service
 8月 07 15:28:23 user-VirtualBox falco[6718]: 15:28:23.754459162: Error File below / or /root opened for writing (user=root user_loginuid=1000 command=nano /etc/falco/falco.yaml parent=bash file=/root/.local/share/nano/search_history program=nano container_id=host image=)
...
 8月 07 15:36:27 user-VirtualBox falco[6718]: 15:36:27.977718775: Error File below / or /root opened for writing (user=root user_loginuid=-1 command=touch unchi parent=bash file=/unchi program=touch container_id=e5e80beb3428 image=ubuntu)
...
 8月 07 15:55:59 user-VirtualBox falco[6718]: 15:55:59.199264022: Error Package management process launched in container (user=root user_loginuid=-1 command=apt-get install curl container_id=e5e80beb3428 container_name=magical_kepler image=ubuntu:latest)

Writes under /root and package manager launches are both detected.
Also, container_id=host versus container_id=&lt;container ID&gt; lets you distinguish host OS activity from container activity.

Even a casual default install seems reasonably useful, but Falco only alerts; blocking malicious behavior would need your own implementation on top. The commercial Sysdig product apparently offers blocking as well, though.
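As a rough idea of what a DIY blocking layer on top of the alerts could look like: run Falco with JSON output (`falco -o json_output=true`) and have a small watcher parse each alert line and kill the offending container. This is only a sketch under my own assumptions; the rule names in BLOCK_RULES and the `docker kill` response are illustrative, while the "rule" and "output_fields" keys are what Falco's JSON output actually carries:

```python
import json
import subprocess

# Rule names we want to react to (assumption: taken from the default ruleset)
BLOCK_RULES = {"Write below root", "Launch Package Management Process in Container"}

def container_to_kill(alert_line):
    """Given one JSON alert line from `falco -o json_output=true`,
    return the container id worth killing, or None."""
    alert = json.loads(alert_line)
    if alert.get("rule") not in BLOCK_RULES:
        return None
    cid = alert.get("output_fields", {}).get("container.id")
    if cid and cid != "host":   # never try to "kill" the host itself
        return cid
    return None

def block(alert_line):
    """Best-effort response: kill the container that triggered the alert."""
    cid = container_to_kill(alert_line)
    if cid:
        subprocess.run(["docker", "kill", cid], check=False)
```

Feeding Falco's stdout into this line by line would get you crude containment, though a real deployment would want rate limiting and an allowlist before killing anything.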

Falco on its own is probably not enough for production use, but it looks handy for things like sandbox testing during the test phase.