2021-05-09

Commercial Market Share of Container Engines

References
Container Runtimes
Container Orchestrators
Summary

References

www.scsk.jp
www.stackrox.com
docs.microsoft.com
cloud.google.com
www.redhat.com

Container Runtimes

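[Figure: container runtime usage among Sysdig users, 2020]
This is a 2020 survey of container runtime usage among Sysdig users. Sysdig is a container security vendor that develops tools such as Falco, a container threat detection tool. Since the respondents have already adopted a container-native EDR product, it is reasonable to assume they are an unusually forward-leaning group of users.

Docker fell from 79% last year to 50% this year, while containerd and CRI-O both grew substantially over the same period.
containerd was split out of the Docker project, and now that its CRI support has stabilized, it appears to be used more and more often as a Kubernetes runtime. It is also worth noting that the Kubernetes project has deprecated Docker as a runtime: dockershim was deprecated in Kubernetes v1.20 (December 2020), and at the time of writing its removal is planned for no earlier than late 2021. The reason is that Docker does not natively support the Kubernetes CRI, so the maintenance cost and overhead of dockershim, which translates between the Docker API and the CRI, had become a problem. The move from Docker to containerd as the Kubernetes runtime is also mentioned in the documentation of managed Kubernetes services such as Azure Kubernetes Service (AKS) and Google Kubernetes Engine (GKE).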

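CRI-O, on the other hand, is a runtime developed with optimization for Kubernetes as its central goal. Its commercial share probably owes a great deal to OpenShift, a commercial orchestrator that builds on Kubernetes and adds features such as CI/CD pipelines. In particular, Red Hat OpenShift Container Platform 4 uses CRI-O as its default runtime, so adoption of OpenShift Container Platform 4 is likely one driver of CRI-O's growing commercial share.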

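These results might make it look as though Docker is finished, but there is no need to jump to that conclusion. The survey covers only the container runtime engine, that is, the execution of containers. Container tooling also has to manage container images, and in that area Docker is still ahead of containerd in some respects. Documentation and support also matter for commercial adoption, and here Docker, backed by Docker, Inc., retains a certain advantage.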

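containerd works well enough as a Kubernetes runtime, but its usability and documentation are comparatively weak. The containerd community is actively working on both. For example, containerd ships a CLI client called ctr with a limited feature set, but nerdctl, a CLI tool modeled on the docker command, is also under development. containerd also incorporates newer technologies that Docker lacks, such as OCI image encryption. User documentation is improving as well, and depending on how that work progresses, containerd could pull ahead of Docker overall.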

Container Orchestrators

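[Figure: container orchestrator usage surveys by Sysdig and StackRox]

These are surveys of container orchestrator usage by Sysdig and StackRox. StackRox, like Sysdig, is a container security vendor. As an aside, it was reportedly acquired by Red Hat earlier this year.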

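The takeaway can be summed up in one phrase: Kubernetes dominates. Adding up the Kubernetes-based offerings, including the cloud services and OpenShift, shows a wide lead over Docker Swarm, Amazon ECS, and the rest. As befits the flagship project of the Cloud Native Computing Foundation (CNCF), Kubernetes reigns as the de facto standard among orchestrators. Its ecosystem of extensions and its pace of development are unmatched, and barring a disruptive change in the container world, this dominance looks set to continue.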

Summary

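For container runtimes, the Docker era appears to be drawing to a close. For orchestrators, by contrast, Kubernetes looks set to stay dominant. Trends in runtime usage also reflect trends in the orchestrator (Kubernetes), and in that sense Kubernetes may be becoming the ruler of the container world.
That said, container applications rarely depend on a particular container runtime, so there is something to the view that anything that works will do. It is probably enough not to dwell on the above and to consult runtime and orchestrator market share only when you are unsure which runtime to choose.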

2021-05-05

High-level Container Runtime: containerd Internals

Container runtime, containerd, containers

References
Architecture
Scope
tasks service:Create
TaskManager.Create
- startShim
shim.Create
- NewContainer

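This post covers the internals of containerd, a high-level container runtime.
It reflects the state of the code as of May 2021.

References

An overview of containerd and its recent features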

github.com
event.cloudnativedays.jp
techblog.ap-com.co.jp

Architecture

[Figure: containerd architecture diagrams]

containerd adopts what is called a Modular Monolith Architecture, which resembles a microservice architecture.
Concretely, a single binary is divided internally into multiple Services, and the Services access one another through APIs.
Each Service is registered as a Plugin at startup, and its API is implemented mainly with gRPC, a remote procedure call (RPC) system originally developed by Google.
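
The registration idea above can be pictured with a small stdlib-only sketch. Everything in it (the Service interface, Register, Registered, and the service names) is hypothetical and is not containerd's actual plugin API; it only illustrates a single process hosting several services that are registered at startup and looked up by name.

```go
package main

import (
	"fmt"
	"sort"
)

// Service is a hypothetical stand-in for one internal service of a modular monolith.
type Service interface {
	Name() string
}

// namedService is a toy Service that carries nothing but its name.
type namedService string

func (s namedService) Name() string { return string(s) }

// registry maps service names to instances; all services live in one process.
var registry = map[string]Service{}

// Register adds a service to the registry and rejects duplicate names.
func Register(s Service) error {
	if _, ok := registry[s.Name()]; ok {
		return fmt.Errorf("service %q already registered", s.Name())
	}
	registry[s.Name()] = s
	return nil
}

// Registered returns the registered service names in sorted order.
func Registered() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Illustrative names only; they are not a list of containerd's plugins.
	for _, n := range []string{"content", "images", "tasks"} {
		if err := Register(namedService(n)); err != nil {
			fmt.Println("register:", err)
		}
	}
	fmt.Println(Registered())
}
```

In a real modular monolith each registered service would additionally expose its API (gRPC in containerd's case); the sketch leaves that part out.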

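The main containerd clients are kubelet, the Kubernetes node agent, and the dedicated client that ships with containerd.
The former talks to containerd through the CRI (Container Runtime Interface), the API Kubernetes defines for container runtimes, and the latter uses the containerd API; both are implemented on top of gRPC.
Because containerd supports the CRI through its CRI Service, its presence as a Kubernetes runtime keeps growing.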

With containerd, a container is started through the following steps.

Pull the image: client.Pull
Create the Container: client.NewContainer
Create the Task: container.NewTask
Obtain a channel for status notifications: task.Wait
Start the Task: task.Start

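From the server's point of view, when a container image is pulled, the image data is stored in Content and managed through metadata such as Images and Containers.
A snapshot, which is the container's root filesystem handed to the low-level container runtime, is then created, and the container is started as a Task.

As noted above, containerd has two notions of a container, Container and Task, and it is the Task that actually controls the low-level container runtime.
When a container is executed, the Task Service launches a server called a shim for each container, and the shim starts the container by controlling runc.

Container: a container as metadata in containerd
Task: a container as a process in containerd

Communication between the Task Service and the shim uses ttRPC, a separate remote procedure call mechanism.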

Scope

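containerd offers a wide range of functionality, from managing container images to running containers.
This post focuses on Task creation, that is, the creation of the container's process.
Specifically, it covers the path from the Task Service, which exposes the Task creation API, launching the shim, up to the shim running the runc command to create the container.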

tasks service:Create

In containerd, the Services that provide APIs are implemented under containerd/services.
The API definitions live in the .proto files under containerd/api/services.

/* containerd/api/services/tasks/v1/tasks.proto */
service Tasks {
	rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
	rpc Start(StartRequest) returns (StartResponse);
	rpc Delete(DeleteTaskRequest) returns (DeleteResponse);
	rpc DeleteProcess(DeleteProcessRequest) returns (DeleteResponse);
	...
}

message CreateTaskRequest {
	...
}

message CreateTaskResponse {
	...
}

When the gRPC Create method of the tasks service is called, the Create method implemented in containerd/services/tasks/local.go handles it.
Internally, TaskManager.Create performs the actual Task creation.
Once it completes, the gRPC response is sent back to the gRPC client.

/*containerd/services/tasks/local.go*/
type local struct {
	runtimes   map[string]runtime.PlatformRuntime
	containers containers.Store
	store      content.Store
	publisher  events.Publisher

	monitor   runtime.TaskMonitor
	v2Runtime *v2.TaskManager
}
...
func (l *local) Create(ctx context.Context, r *api.CreateTaskRequest, _ ...grpc.CallOption) (*api.CreateTaskResponse, error) {
	container, err := l.getContainer(ctx, r.ContainerID)
	...
	// Set up the options for TaskManager.Create
	opts := runtime.CreateOpts{
		Spec: container.Spec,
		IO: runtime.IO{
			Stdin:    r.Stdin,
			Stdout:   r.Stdout,
			Stderr:   r.Stderr,
			Terminal: r.Terminal,
		},
		Checkpoint:     checkpointPath,
		Runtime:        container.Runtime.Name,
		RuntimeOptions: container.Runtime.Options,
		TaskOptions:    r.Options,
	}
	...
	// Get the TaskManager
	rtime, err := l.getRuntime(container.Runtime.Name)
	...
	// Create the task with TaskManager.Create
	c, err := rtime.Create(ctx, r.ContainerID, opts)
	...
	// gRPC response
	return &api.CreateTaskResponse{
		ContainerID: r.ContainerID,
		Pid:         c.PID(),
	}, nil
}

TaskManager.Create

TaskManager.Create creates the OCI runtime bundle and launches containerd-shim.

/* containerd/runtime/v2/manager.go */
func (m *TaskManager) Create(ctx context.Context, id string, opts runtime.CreateOpts) (_ runtime.Task, retErr error) {

	// Create the OCI runtime bundle
	bundle, err := NewBundle(ctx, m.root, m.state, id, opts.Spec.Value)
	if err != nil {
		return nil, err
	}
	...
	// Launch containerd-shim
	shim, err := m.startShim(ctx, bundle, id, opts)
	if err != nil {
		return nil, err
	}
	...
}

startShim

startShim launches the shim and returns a shim instance that holds a ttRPC client.

/* containerd/runtime/v2/manager.go */
func (m *TaskManager) startShim(ctx context.Context, bundle *Bundle, id string, opts runtime.CreateOpts) (*shim, error) {
	ns, err := namespaces.NamespaceRequired(ctx)
	topts := opts.TaskOptions

	b := shimBinary(ctx, bundle, opts.Runtime, m.containerdAddress, m.containerdTTRPCAddress, m.events, m.tasks)

	// Launch the shim and get back a shim instance connected over ttRPC
	shim, err := b.Start(ctx, topts, func() {
		...
	})
	...
	return shim, nil
}

/* containerd/runtime/v2/binary.go */
func shimBinary(ctx context.Context, bundle *Bundle, runtime, containerdAddress string, containerdTTRPCAddress string, events *exchange.Exchange, rt *runtime.TaskList) *binary {
	return &binary{
		bundle:                 bundle,
		runtime:                runtime,
		containerdAddress:      containerdAddress,
		containerdTTRPCAddress: containerdTTRPCAddress,
		events:                 events,
		rtTasks:                rt,
	}
}

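The Start method of binary (the type returned by shimBinary) builds the command that launches the shim and then runs it.
The launch command carries the shim's options along with information about the OCI bundle that will be passed to runc, the low-level runtime.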

/* containerd/runtime/v2/binary.go */
func (b *binary) Start(ctx context.Context, opts *types.Any, onClose func()) (_ *shim, err error) {
	args := []string{"-id", b.bundle.ID}
	args = append(args, "start")

	// Build the shim launch command (shim binary, OCI bundle information, etc.)
	cmd, err := client.Command(
		ctx,
		b.runtime,
		b.containerdAddress,
		b.containerdTTRPCAddress,
		b.bundle.Path,
		opts,
		args...,
	)
	// Run the shim launch command
	// out is the ttRPC address to connect to
	out, err := cmd.CombinedOutput()

	// Extract the address
	address := strings.TrimSpace(string(out))
	// Create the connection
	conn, err := client.Connect(address, client.AnonDialer)

	// Create the ttRPC client used to talk to the shim
	client := ttrpc.NewClient(conn, ttrpc.WithOnClose(onCloseWithShimLog))
	return &shim{
		bundle:  b.bundle,
		client:  client,
		task:    task.NewTaskClient(client),
		events:  b.events,
		rtTasks: b.rtTasks,
	}, nil
}

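The shim launch command is run by the CombinedOutput method, which starts the shim through osexec.Cmd internally.
Its output provides the ttRPC address that the TaskManager uses to communicate with the shim.
Using the address obtained from CombinedOutput, the code then creates a connection to the shim and builds a ttRPC client on top of it.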

func (cmd *cmdWrapper) CombinedOutput() ([]byte, error) {
	out, err := (*osexec.Cmd)(cmd).CombinedOutput()
	return out, handleError(err)
}
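
The handshake described above (start a helper process, read the address it prints, then connect to that address) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not containerd's code: parseAddress and startHelper are hypothetical names, echo stands in for the shim binary (so a Unix-like system is assumed), and the socket path is made up.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// parseAddress trims the helper's output down to the address it printed.
func parseAddress(out []byte) (string, error) {
	addr := strings.TrimSpace(string(out))
	if addr == "" {
		return "", errors.New("helper printed no address")
	}
	return addr, nil
}

// startHelper runs a helper command to completion and returns the address
// found in its combined stdout/stderr output.
func startHelper(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("%s: %w (output: %s)", name, err, out)
	}
	return parseAddress(out)
}

func main() {
	// "echo" is a stand-in for the shim binary; the socket path is hypothetical.
	addr, err := startHelper("echo", "unix:///run/example/shim.sock")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	// A real client would now dial this address and speak RPC over the connection.
	fmt.Println("would dial:", addr)
}
```

Passing the address back on standard output keeps the launcher independent of how the helper chooses its socket, at the cost of requiring the helper to print nothing else.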

Finally, shim.Create registers the task (creates the container).
Inside the shim, runc is invoked and the container is created, but it is not yet executed.

/* containerd/runtime/v2/manager.go */
func (m *TaskManager) Create(ctx context.Context, id string, opts runtime.CreateOpts) (_ runtime.Task, retErr error) {

	// Create the OCI runtime bundle
	bundle, err := NewBundle(ctx, m.root, m.state, id, opts.Spec.Value)
	if err != nil {
		return nil, err
	}
	...
	// Launch containerd-shim and connect over ttRPC
	shim, err := m.startShim(ctx, bundle, id, opts)
	if err != nil {
		return nil, err
	}
	// Register the task with the shim over ttRPC (the container is not started yet)
	t, err := shim.Create(ctx, opts)

	return t, nil
}

shim.Create

From here we follow the path from shim.Create to the execution of the runc create command that creates the container.
In shim.Create, s.task.Create invokes the Create ttRPC method on the shim.

/* containerd/runtime/v2/shim.go */
type shim struct {
	bundle  *Bundle
	client  *ttrpc.Client
	task    task.TaskService
	taskPid int
	events  *exchange.Exchange
	rtTasks *runtime.TaskList
}

/* containerd/runtime/v2/shim.go */
func (s *shim) Create(ctx context.Context, opts runtime.CreateOpts) (runtime.Task, error) {
	topts := opts.TaskOptions
	...
	request := &task.CreateTaskRequest{
		ID:         s.ID(),
		Bundle:     s.bundle.Path,
		Stdin:      opts.IO.Stdin,
		Stdout:     opts.IO.Stdout,
		Stderr:     opts.IO.Stderr,
		Terminal:   opts.IO.Terminal,
		Checkpoint: opts.Checkpoint,
		Options:    topts,
	}
	for _, m := range opts.Rootfs {
		request.Rootfs = append(request.Rootfs, &types.Mount{
			Type:    m.Type,
			Source:  m.Source,
			Options: m.Options,
		})
	}
	response, err := s.task.Create(ctx, request)
	...
	s.taskPid = int(response.Pid)
	return s, nil
}

The ttRPC task.Create is defined under containerd/runtime/v2/task.

/*containerd/runtime/v2/task/shim.proto*/
service Task {
	rpc State(StateRequest) returns (StateResponse);
	rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
	rpc Start(StartRequest) returns (StartResponse);
	rpc Delete(DeleteRequest) returns (DeleteResponse);
	rpc Pids(PidsRequest) returns (PidsResponse);
	rpc Pause(PauseRequest) returns (google.protobuf.Empty);
	rpc Resume(ResumeRequest) returns (google.protobuf.Empty);
	rpc Checkpoint(CheckpointTaskRequest) returns (google.protobuf.Empty);
	rpc Kill(KillRequest) returns (google.protobuf.Empty);
	rpc Exec(ExecProcessRequest) returns (google.protobuf.Empty);
	rpc ResizePty(ResizePtyRequest) returns (google.protobuf.Empty);
	rpc CloseIO(CloseIORequest) returns (google.protobuf.Empty);
	rpc Update(UpdateTaskRequest) returns (google.protobuf.Empty);
	rpc Wait(WaitRequest) returns (WaitResponse);
	rpc Stats(StatsRequest) returns (StatsResponse);
	rpc Connect(ConnectRequest) returns (ConnectResponse);
	rpc Shutdown(ShutdownRequest) returns (google.protobuf.Empty);
}

message CreateTaskRequest {
	string id = 1;
	string bundle = 2;
	repeated containerd.types.Mount rootfs = 3;
	bool terminal = 4;
	string stdin = 5;
	string stdout = 6;
	string stderr = 7;
	string checkpoint = 8;
	string parent_checkpoint = 9;
	google.protobuf.Any options = 10;
}

func (c *taskClient) Create(ctx context.Context, req *CreateTaskRequest) (*CreateTaskResponse, error) {
	var resp CreateTaskResponse
	if err := c.client.Call(ctx, "containerd.task.v2.Task", "Create", req, &resp); err != nil {
		...
	}
	return &resp, nil
}

On the shim side, task.Create is implemented in containerd/runtime/v2/runc/v2/service.go.

/* containerd/runtime/v2/runc/v2/service.go */
// Create a new initial process and container with the underlying OCI runtime
func (s *service) Create(ctx context.Context, r *taskAPI.CreateTaskRequest) (_ *taskAPI.CreateTaskResponse, err error) {
	
	// Create the container
	container, err := runc.NewContainer(ctx, s.platform, r)
	...
	// ttRPC response
	return &taskAPI.CreateTaskResponse{
		Pid: uint32(container.Pid()),
	}, nil
}

//runc.Container
type Container struct {
	mu sync.Mutex

	// ID of the container
	ID string
	// Bundle path
	Bundle string

	// cgroup is either cgroups.Cgroup or *cgroupsv2.Manager
	cgroup          interface{}
	process         process.Process
	processes       map[string]process.Process
	reservedProcess map[string]struct{}
}

NewContainer

NewContainer creates a runc.Container.
Two steps in NewContainer matter most:
① newInit, which creates the instance that manages the container's init process
② Create, through which the Init instance created above creates the container

/* containerd/runtime/v2/runc/container.go */
package runc

// NewContainer returns a new runc container
func NewContainer(ctx context.Context, platform stdio.Platform, r *task.CreateTaskRequest) (_ *Container, retErr error) {
	ns, err := namespaces.NamespaceRequired(ctx)
	...
	var mounts []process.Mount
	for _, m := range r.Rootfs {
		mounts = append(mounts, process.Mount{
			Type:    m.Type,
			Source:  m.Source,
			Target:  m.Target,
			Options: m.Options,
		})
	}

	rootfs := ""
	if len(mounts) > 0 {
		rootfs = filepath.Join(r.Bundle, "rootfs")
		if err := os.Mkdir(rootfs, 0711); err != nil && !os.IsExist(err) {
			return nil, err
		}
	}

	config := &process.CreateConfig{
		ID:               r.ID,
		Bundle:           r.Bundle,
		Runtime:          opts.BinaryName,
		Rootfs:           mounts,
		Terminal:         r.Terminal,
		Stdin:            r.Stdin,
		Stdout:           r.Stdout,
		Stderr:           r.Stderr,
		Checkpoint:       r.Checkpoint,
		ParentCheckpoint: r.ParentCheckpoint,
		Options:          r.Options,
	}

	// Instance representing the container's init process
	p, err := newInit(
		ctx,
		r.Bundle,
		filepath.Join(r.Bundle, "work"),
		ns,
		platform,
		config,
		&opts,
		rootfs,
	)

	// Create the container through the Init instance's Create method
	if err := p.Create(ctx, config); err != nil {
		return nil, errdefs.ToGRPC(err)
	}

	container := &Container{
		ID:              r.ID,
		Bundle:          r.Bundle,
		process:         p,
		processes:       make(map[string]process.Process),
		reservedProcess: make(map[string]struct{}),
	}
	...
	return container, nil
}

newInit

newInit creates an Init instance.
Looking at the code, NewRunc creates a Runc instance, and process.New then creates the Init instance.

/* containerd/runtime/v2/runc/container.go */
func newInit(ctx context.Context, path, workDir, namespace string, platform stdio.Platform,
	r *process.CreateConfig, options *options.Options, rootfs string) (*process.Init, error) {

	// Create the Runc instance
	runtime := process.NewRunc(options.Root, path, namespace, options.BinaryName, options.CriuPath, options.SystemdCgroup)
	p := process.New(r.ID, runtime, stdio.Stdio{
		Stdin:    r.Stdin,
		Stdout:   r.Stdout,
		Stderr:   r.Stderr,
		Terminal: r.Terminal,
	})
	p.Bundle = r.Bundle
	p.Platform = platform
	p.Rootfs = rootfs
	p.WorkDir = workDir
	p.IoUID = int(options.IoUid)
	p.IoGID = int(options.IoGid)
	p.NoPivotRoot = options.NoPivotRoot
	p.NoNewKeyring = options.NoNewKeyring
	p.CriuWorkPath = options.CriuWorkPath
	if p.CriuWorkPath == "" {
		// if criu work path not set, use container WorkDir
		p.CriuWorkPath = p.WorkDir
	}
	return p, nil
}

NewRunc

NewRunc creates a Runc instance.
A Runc instance has methods corresponding to runc's subcommands and is able to execute runc.
In that sense, the Runc instance is the embodiment of runc inside containerd.

/* containerd/pkg/process/init.go */
// NewRunc returns a new runc instance for a process
func NewRunc(root, path, namespace, runtime, criu string, systemd bool) *runc.Runc {
	if root == "" {
		root = RuncRoot
	}
	return &runc.Runc{
		Command:       runtime,
		Log:           filepath.Join(path, "log.json"),
		LogFormat:     runc.JSON,
		PdeathSignal:  unix.SIGKILL,
		Root:          filepath.Join(root, namespace),
		Criu:          criu,
		SystemdCgroup: systemd,
	}
}

/* vendor/github.com/containerd/go-runc/runc_unix.go */
// Runc is the client to the runc cli
type Runc struct {
	//If command is empty, DefaultCommand is used
	Command       string
	Root          string
	Debug         bool
	Log           string
	LogFormat     Format
	PdeathSignal  unix.Signal
	Setpgid       bool
	Criu          string
	SystemdCgroup bool
	Rootless      *bool // nil stands for "auto"
}

/* containerd/vendor/github.com/containerd/go-runc/runc.go */

// List returns all containers created inside the provided runc root directory
func (r *Runc) List(context context.Context) ([]*Container, error) {
	...
}

// Create creates a new container and returns its pid if it was created successfully
func (r *Runc) Create(context context.Context, id, bundle string, opts *CreateOpts) error {
	args := []string{"create", "--bundle", bundle}
	...
}

// Kill sends the specified signal to the container
func (r *Runc) Kill(context context.Context, id string, sig int, opts *KillOpts) error {
	...
}

// Runs runc start
// Start will start an already created container
func (r *Runc) Start(context context.Context, id string) error {
	return r.runOrError(r.command(context, "start", id))
}

process.New

process.New creates an Init instance.
The Init instance provides methods such as Create and Start for operating on the init process.

/* containerd/pkg/process/init.go */
//process.New
func New(id string, runtime *runc.Runc, stdio stdio.Stdio) *Init {
	p := &Init{
		id:        id,
		runtime:   runtime,
		pausing:   new(atomicBool),
		stdio:     stdio,
		status:    0,
		waitBlock: make(chan struct{}),
	}
	...
	return p
}

/* containerd/pkg/process/init.go */
// Init represents an initial process for a container
type Init struct {
	...
	id       string
	Bundle   string
	console  console.Console
	Platform stdio.Platform
	io       *processIO
	runtime  *runc.Runc
	status       int
	exited       time.Time
	pid          int
	stdin        io.Closer
	stdio        stdio.Stdio
	Rootfs       string
	IoUID        int
	IoGID        int
	NoPivotRoot  bool
}

// Create the process with the provided config
func (p *Init) Create(ctx context.Context, r *CreateConfig) error {
	...
}

// Start the init process
func (p *Init) Start(ctx context.Context) error {
	...
}

Init.Create

Let us return to NewContainer.
After newInit creates the Init instance, its Create method is called.
This is the (p *Init) Create method shown earlier.

func NewContainer(ctx context.Context, platform stdio.Platform, r *task.CreateTaskRequest) (_ *Container, retErr error) {
	...
	// Instance representing the container's init process
	p, err := newInit(
		ctx,
		r.Bundle,
		filepath.Join(r.Bundle, "work"),
		ns,
		platform,
		config,
		&opts,
		rootfs,
	)
	...
	// Create the container through the Init instance's Create method
	if err := p.Create(ctx, config); err != nil {
		return nil, errdefs.ToGRPC(err)
	}
	...
}

Runc.Create

Looking at the code, it in turn calls p.runtime.Create internally.
This runtime is the Runc instance, which defines a Create method of its own.

// Create the process with the provided config
func (p *Init) Create(ctx context.Context, r *CreateConfig) error {
	...
	if err := p.runtime.Create(ctx, r.ID, r.Bundle, opts); err != nil {
		...
	}
	...
	pid, err := pidFile.Read()
	...
	p.pid = pid
	return nil
}

Looking at Runc.Create, it begins by building the command runc create --bundle bundle.
This is the command with which runc creates a container from an OCI runtime bundle (see also "OCI Images and OCI Runtime Bundles - 鳩小屋").
Next, Monitor.Start is called with cmd as its argument.

// Create creates a new container and returns its pid if it was created successfully
func (r *Runc) Create(context context.Context, id, bundle string, opts *CreateOpts) error {
	// Build the runc create command
	args := []string{"create", "--bundle", bundle}
	...
	cmd := r.command(context, append(args, id)...)
	...
	ec, err := Monitor.Start(cmd)
	...
	status, err := Monitor.Wait(cmd, ec)
	...
	return err
}
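
To make the argument handling above concrete, here is a minimal stdlib-only sketch that builds, but never runs, an equivalent command line. createArgs and createCmd are hypothetical helper names rather than go-runc functions, and the bundle path and container ID are made up.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// createArgs returns the arguments for "runc create" in the order used above:
// the subcommand, --bundle with its path, then the container ID.
func createArgs(bundle, id string) []string {
	args := []string{"create", "--bundle", bundle}
	return append(args, id)
}

// createCmd builds the command without starting it. The binary name is a
// parameter so that a different OCI runtime binary could be substituted.
func createCmd(ctx context.Context, binary, bundle, id string) *exec.Cmd {
	return exec.CommandContext(ctx, binary, createArgs(bundle, id)...)
}

func main() {
	// Hypothetical bundle path and container ID; nothing is executed here.
	cmd := createCmd(context.Background(), "runc", "/tmp/ubuntu-bundle", "ubuntu-container")
	fmt.Println(cmd.Args)
}
```

Keeping argument construction separate from execution makes the command line easy to inspect and test without having runc installed.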

Monitor.Start calls exec.Cmd.Start, which is what actually runs runc create.
With that, the container has been created.

// Start starts the command a registers the process with the reaper
func (m *Monitor) Start(c *exec.Cmd) (chan runc.Exit, error) {
	ec := m.Subscribe()
	if err := c.Start(); err != nil {
		...
	}
	return ec, nil
}

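All that remains is for the shim to return the ttRPC response to the Task Service and for the Task Service to return the gRPC response to the client, which completes the sequence.
Note that Task Create creates the container but does not run it.
To run the container, the client calls Task Start on the Task Service.
That call presumably follows the same path, with runc driven through the shim.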

2021-05-02

Low-level Container Runtime: Runc Internals

Containers, runc, container runtime

References
Recap
runc architecture
- file
  - main.go and command
- process
runc create
runc init
- nsenter
  - nsexec
- runc init(After nsexec)
  - linuxStandardInit.Init
runc start

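This post summarizes the internals of runc, a low-level container runtime.

[Figure: runc process flow]

References

As of May 2021, the code showed no major differences from the runc overview given at Container Runtime Meetup.
If you only want the key points in a short time, the material below may serve you better.
medium.com

Understanding runc requires basic knowledge of container technology and the Go language.
This post is aimed at advanced readers for whom operating Docker is no longer enough.
My own understanding is shaky in places, so please forgive any mistakes (´･ω･`)
github.com

Recap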

# OCI runtime bundle
$ ls ubuntu-bundle/
config.json rootfs
# Create the container
$ runc run --bundle ubuntu-bundle ubuntu-container
root@umoci-default:/#

/*config.json*/
{
    "process": {
        "terminal":false,
        "user": {
            "uid": 0,
            "gid": 0
        },
        "args": [
            "sh"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"  
        ],
        "cwd": "/"
    },
    "root": {
        "path": "rootfs",
        "readonly": true
    },   
    "linux": {
        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "network"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            }
        ]
    }
}

runc architecture

file

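runc consists mainly of the Go files in the repository root, one per command, and libcontainer, which directly manages, starts, and creates containers.
When runc is executed, main.go dispatches to the command that matches the arguments. Each command controls containers by accessing Linux features through libcontainer.
[Figure: runc structure (command files and libcontainer)]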

runc
├── main.go
├── create.go
├── delete.go
├── events.go
├── exec.go
├── init.go
├── kill.go
├── man
├── notify_socket.go
├── pause.go
├── ps.go
├── restore.go
├── rlimit_linux.go
├── rootless_linux.go
├── run.go
├── signals.go
├── spec.go
├── start.go
├── state.go
├── tty.go
├── update.go
├── utils.go
├── utils_linux.go
├── libcontainer
│   ├── README.md
│   ├── SPEC.md
│   ├── apparmor
│   ├── capabilities
│   ├── cgroups
│   ├── configs
│   ├── console_linux.go
│   ├── container.go
│   ├── container_linux.go
│   ├── container_linux_test.go
│   ├── criu_opts_linux.go
│   ├── devices
│   ├── error.go
│   ├── error_test.go
│   ├── factory.go
│   ├── factory_linux.go
│   ├── factory_linux_test.go
│   ├── generic_error.go
│   ├── generic_error_test.go
│   ├── init_linux.go
│   ├── integration
│   ├── intelrdt
│   ├── keys
│   ├── logs
│   ├── message_linux.go
│   ├── network_linux.go
│   ├── notify_linux.go
│   ├── notify_linux_test.go
│   ├── notify_linux_v2.go
│   ├── nsenter
│   ├── process.go
│   ├── process_linux.go
│   ├── restored_process.go
│   ├── rootfs_linux.go
│   ├── rootfs_linux_test.go
│   ├── seccomp
│   ├── setns_init_linux.go
│   ├── specconv
│   ├── stacktrace
│   ├── standard_init_linux.go
│   ├── state_linux.go
│   ├── state_linux_test.go
│   ├── stats_linux.go
│   ├── sync.go
│   ├── system
│   ├── user
│   ├── userns
│   └── utils

main.go and command

runc's commands are implemented as a Go CLI application. The list of commands is defined in the cli.Command slice, and app.Run(os.Args) runs the subcommand given on the command line.

/* main.go */
func main() {
        app := cli.NewApp()
        app.Name = "runc"
        app.Usage = usage
...
        app.Commands = []cli.Command{
        checkpointCommand,
        createCommand,
        deleteCommand,
        eventsCommand,
        execCommand,
        initCommand,
        killCommand,
        listCommand,
        pauseCommand,
        psCommand,
        restoreCommand,
        resumeCommand,
        runCommand,
        specCommand,
        startCommand,
        stateCommand,
        updateCommand,
        }
...
        if err := app.Run(os.Args); err != nil {
                fatal(err)
        }
}

The commands listed in cli.Command are defined in the Go files in the root directory. Taking createCommand as an example, it is defined in create.go and its subcommand name is create.

Running runc create executes the Action: func(context *cli.Context) in create.go.

/*create.go*/

var createCommand = cli.Command{
        Name: "create",
        Usage: "create a container",
        ArgsUsage: `

...
        Action: func(context *cli.Context) error {
                if err := checkArgs(context, 1, exactArgs); err != nil {
                        return err
                }
                if err := revisePidFile(context); err != nil {
                        return err
                }
                spec, err := setupSpec(context)
                if err != nil {
                        return err
                }
                status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
                if err != nil {
                        return err
                }
                // exit with the container's exit status so any external supervisor is
                // notified of the exit with the correct exit status.
                os.Exit(status)
                return nil
        },
}
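
The dispatch pattern above (a list of named commands, each with an Action, selected by the first argument) can be reduced to a few lines of standard-library Go. This is a sketch under stated assumptions, not the CLI library runc uses: the command type and run function are hypothetical, and a fixed argv replaces os.Args so the example behaves the same however it is launched.

```go
package main

import (
	"errors"
	"fmt"
)

// command is a hypothetical, pared-down analogue of a CLI subcommand definition.
type command struct {
	Name   string
	Action func(args []string) error
}

// run looks up argv[1] among the commands and calls its Action with the rest.
func run(cmds []command, argv []string) error {
	if len(argv) < 2 {
		return errors.New("no subcommand given")
	}
	for _, c := range cmds {
		if c.Name == argv[1] {
			return c.Action(argv[2:])
		}
	}
	return fmt.Errorf("unknown command %q", argv[1])
}

func main() {
	cmds := []command{
		{Name: "create", Action: func(args []string) error {
			fmt.Println("create called with", args)
			return nil
		}},
	}
	// A real program would pass os.Args here.
	argv := []string{"demo", "create", "ubuntu-container"}
	if err := run(cmds, argv); err != nil {
		fmt.Println("error:", err)
	}
}
```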

process

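Viewed in terms of process structure, runc works as follows.
First, the runc create command creates the container's management objects and related state.
Next, runc init performs the concrete initialization, such as setting up namespaces, and the container is created.
runc init is particularly intricate, so pay attention to its management data and synchronization.

[Figure: runc process flow]

From here we follow runc's processing in the order runc create → runc init → runc start.
Because of the synchronization involved, the explanation jumps back and forth in places; it helps to keep the flow above in view while reading the code.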

runc create

The important parts of the create command are setupSpec and startContainer.

/*create.go*/
        Action: func(context *cli.Context) error {
...
                spec, err := setupSpec(context)
...
                status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
...

From here we follow the create command by walking the function hierarchy below.

 setupSpec(context)
 startContainer(context, spec, CT_ACT_CREATE, nil) 
   |- createContainer
      |- specconv.CreateLibcontainerConfig
      |- loadFactory(context)
         |- libcontainer.New(......)
      |- factory.Create(id, config)
   |- runner.run(spec.Process)
      |- newProcess(*config, r.init) 
      |- r.container.Start(process)
         |- c.createExecFifo()
         |- c.start(process)
            |- c.newParentProcess(process)
            |- parent.start()

setupSpec

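setupSpec reads the runtime bundle specified by the user and stores its configuration in a data structure that runc works with.
1. It looks at the OCI bundle directory given with -b on the command line. If the option is omitted, the current directory is used.
2. It reads config.json and converts the settings into specs.Spec, the Go data structure defined in https://github.com/opencontainers/runtime-spec/blob/master/specs-go/config.go. The contents all conform to the OCI specification.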
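
As a rough illustration of step 2, the sketch below decodes a config.json document into a Go struct with encoding/json. The spec type is a hypothetical, heavily reduced subset written for this example; it is not the specs.Spec type from runtime-spec, and parseSpec is not a runc function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spec is a hypothetical, reduced subset of the OCI runtime configuration,
// just enough to decode the fields used in the recap's config.json.
type spec struct {
	Process struct {
		Args []string `json:"args"`
		Cwd  string   `json:"cwd"`
	} `json:"process"`
	Root struct {
		Path     string `json:"path"`
		Readonly bool   `json:"readonly"`
	} `json:"root"`
	Linux struct {
		Namespaces []struct {
			Type string `json:"type"`
		} `json:"namespaces"`
	} `json:"linux"`
}

// parseSpec decodes config.json content into the reduced spec.
func parseSpec(data []byte) (*spec, error) {
	var s spec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("parse config.json: %w", err)
	}
	return &s, nil
}

func main() {
	data := []byte(`{
		"process": {"args": ["sh"], "cwd": "/"},
		"root": {"path": "rootfs", "readonly": true},
		"linux": {"namespaces": [{"type": "pid"}, {"type": "mount"}]}
	}`)
	s, err := parseSpec(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(s.Process.Args, s.Root.Path, len(s.Linux.Namespaces))
}
```

Note that encoding/json is strict about syntax, so a config.json with a trailing comma after the last member of an object or array is rejected.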

startContainer

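Because we are on the Linux platform, the function actually called is startContainer() in utils_linux.go.
startContainer instantiates linuxContainer, which serves as the template for a container, and LinuxFactory, which is responsible for creating containers.

The third argument is CT_ACT_CREATE, which means the container is only created. startContainer calls createContainer to build the linuxContainer object (the container template) and then hands it to runner.run, which drives the creation.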

/* utils_linux.go */
func startContainer(context *cli.Context, spec *specs.Spec, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
	id := context.Args().First()
...
	// Build runc's container object (linuxContainer) from the spec
	container, err := createContainer(context, id, spec)
...
	r := &runner{
		enableSubreaper: !context.Bool("no-subreaper"),
		shouldDestroy:   true,
		container:       container,//linuxContainer
		listenFDs:       listenFDs,
		notifySocket:    notifySocket,
		consoleSocket:   context.String("console-socket"),
		detach:          context.Bool("detach"),
		pidFile:         context.String("pid-file"),
		preserveFDs:     context.Int("preserve-fds"),
		action:          action,
		criuOpts:        criuOpts,
		init:            true, // used to set the process.Init field
		logLevel:        logLevel,
	}
	return r.run(spec.Process)
}

Before explaining createContainer and runner.run, let us look at linuxContainer, the container template, and LinuxFactory, which creates containers.

linuxContainer

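runc has an abstract interface named Container that represents a container object. It embeds the BaseContainer interface. The method names show that these interfaces define the container management operations.
linuxContainer is the concrete implementation of the abstract interface. Its definition is shown below; the initPath field is the important one.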

/* libcontainer/container.go */
type BaseContainer interface {
	ID() string
	Status() (Status, error)
	State() (*State, error)
	Config() configs.Config
	Processes() ([]int, error)
	Stats() (*Stats, error)
	Set(config configs.Config) error
	Start(process *Process) (err error)
	Run(process *Process) (err error)
	Destroy() error
	Signal(s os.Signal, all bool) error
	Exec() error
}

/* libcontainer/container_linux.go */
type Container interface {
	BaseContainer

	Checkpoint(criuOpts *CriuOpts) error
	Restore(process *Process, criuOpts *CriuOpts) error
	Pause() error
	Resume() error
	NotifyOOM() (<-chan struct{}, error)
	NotifyMemoryPressure(level PressureLevel) (<-chan struct{}, error)
}

type linuxContainer struct {
	id                   string
	root                 string
	config               *configs.Config
	cgroupManager        cgroups.Manager
	intelRdtManager      intelrdt.Manager
	initPath             string
	initArgs             []string
	initProcess          parentProcess
	initProcessStartTime uint64
	criuPath             string
	newuidmapPath        string
	newgidmapPath        string
	m                    sync.Mutex
	criuVersion          int
	state                containerState
	created              time.Time
	fifo                 *os.File
}
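
The relationship between the two interfaces and the concrete type relies on Go's interface embedding. The following stdlib-only sketch shows the pattern with reduced, hypothetical interfaces; demoContainer and the method set are invented for illustration and carry none of libcontainer's behavior.

```go
package main

import "fmt"

// BaseContainer is a reduced, hypothetical version of a platform-independent interface.
type BaseContainer interface {
	ID() string
	Start() error
}

// Container embeds BaseContainer and adds a platform-specific operation.
type Container interface {
	BaseContainer
	Pause() error
}

// demoContainer is a toy implementation that only tracks state in memory.
type demoContainer struct {
	id     string
	status string
}

func (c *demoContainer) ID() string { return c.id }

func (c *demoContainer) Start() error {
	c.status = "running"
	return nil
}

func (c *demoContainer) Pause() error {
	if c.status != "running" {
		return fmt.Errorf("container %s is not running", c.id)
	}
	c.status = "paused"
	return nil
}

// Compile-time check that *demoContainer satisfies Container,
// and therefore BaseContainer as well.
var _ Container = (*demoContainer)(nil)

func main() {
	var c Container = &demoContainer{id: "demo"}
	fmt.Println("pause before start:", c.Pause())
	if err := c.Start(); err != nil {
		fmt.Println("start:", err)
	}
	fmt.Println("pause after start:", c.Pause())
}
```

Callers that hold only the interface value never see the struct fields, which is what lets a platform-specific type sit behind a common API.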

LinuxFactory

In runc, every container is created by a container factory. The factory is an abstract interface defined as follows, with four methods. LinuxFactory is its concrete implementation.

/*libcontainer/factory.go*/
type Factory interface {
	Create(id string, config *configs.Config) (Container, error)
	Load(id string) (Container, error)
	StartInitialization() error
	Type() string
}

/*libcontainer/factory_linux.go*/
type LinuxFactory struct {
	// Root directory for the factory to store state.
	Root string

	// InitPath is the path for calling the init responsibilities for spawning
	// a container.
	InitPath string

	// InitArgs are arguments for calling the init responsibilities for spawning
	// a container.
	InitArgs []string

	// CriuPath is the path to the criu binary used for checkpoint and restore of
	// containers.
	CriuPath string

	// New{u,g}idmapPath is the path to the binaries used for mapping with
	// rootless containers.
	NewuidmapPath string
	NewgidmapPath string

	// Validator provides validation to container configurations.
	Validator validate.Validator

	// NewCgroupsManager returns an initialized cgroups manager for a single container.
	NewCgroupsManager func(config *configs.Cgroup, paths map[string]string) cgroups.Manager

	// NewIntelRdtManager returns an initialized Intel RDT manager for a single container.
	NewIntelRdtManager func(config *configs.Config, id string, path string) intelrdt.Manager
}

createContainer

createContainer, called from startContainer, first stores the libcontainer configuration in config, a value of type configs.Config.
It then loads the factory from context, producing a LinuxFactory.
Finally, the LinuxFactory's Create method creates the linuxContainer.

func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
	rootlessCg, err := shouldUseRootlessCgroupManager(context)
	if err != nil {
		return nil, err
	}
	// Build the libcontainer config from the OCI spec
	config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
		CgroupName:       id,
		UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
		NoPivotRoot:      context.Bool("no-pivot"),
		NoNewKeyring:     context.Bool("no-new-keyring"),
		Spec:             spec,
		RootlessEUID:     os.Geteuid() != 0,
		RootlessCgroups:  rootlessCg,
	})
	if err != nil {
		return nil, err
	}
	// Create the LinuxFactory.
	factory, err := loadFactory(context)
	if err != nil {
		return nil, err
	}
	// Create the linuxContainer with the factory's Create method.
	return factory.Create(id, config)
}

CreateLibcontainerConfig

Let's look at how the libcontainer config is built.

/*libcontainer/specconv/spec_linux.go*/
func CreateLibcontainerConfig(opts *CreateOpts) (*configs.Config, error) {
	// runc's working directory is the current directory, which holds the runtime bundle.
	rcwd, err := os.Getwd()
	if err != nil {
		return nil, err
	}
	...
	// Resolve the rootfs directory given in config.json.
	rootfsPath := spec.Root.Path
	if !filepath.IsAbs(rootfsPath) {
		rootfsPath = filepath.Join(cwd, rootfsPath)
	}
	labels := []string{}
	for k, v := range spec.Annotations {
		labels = append(labels, k+"="+v)
	}
	// Assemble the base config from the spec and the CreateOpts.
	config := &configs.Config{
		Rootfs:          rootfsPath,
		NoPivotRoot:     opts.NoPivotRoot,
		Readonlyfs:      spec.Root.Readonly,
		Hostname:        spec.Hostname,
		Labels:          append(labels, "bundle="+cwd),
		NoNewKeyring:    opts.NoNewKeyring,
		RootlessEUID:    opts.RootlessEUID,
		RootlessCgroups: opts.RootlessCgroups,
	}
	// Convert each entry of the config.json mounts field into a libcontainer mount,
	// e.g. /proc, /dev, /dev/pts, /dev/shm, /dev/mqueue, /sys, /sys/fs/cgroup.
	for _, m := range spec.Mounts {
		config.Mounts = append(config.Mounts, createLibcontainerMount(cwd, m))
	}

	// Build the device list: the default AllowedDevices plus the devices listed in the OCI spec.
	// AllowedDevices https://github.com/opencontainers/runc/blob/master/libcontainer/specconv/spec_linux.go
	defaultDevs, err := createDevices(spec, config)
	if err != nil {
		return nil, err
	}

	// cgroup v1 subsystems handled by the cgroup manager (list shown here for reference).
	legacySubsystems = []subsystem{
		&fs.CpusetGroup{},
		&fs.DevicesGroup{},
		&fs.MemoryGroup{},
		&fs.CpuGroup{},
		&fs.CpuacctGroup{},
		&fs.PidsGroup{},
		&fs.BlkioGroup{},
		&fs.HugetlbGroup{},
		&fs.PerfEventGroup{},
		&fs.FreezerGroup{},
		&fs.NetPrioGroup{},
		&fs.NetClsGroup{},
		&fs.NameGroup{GroupName: "name=systemd"},
	}
	// Build the cgroup configuration.
	c, err := CreateCgroupConfig(opts, defaultDevs)
	if err != nil {
		return nil, err
	}

	config.Cgroups = c

	// Apply the Linux-specific settings.
	if spec.Linux != nil {
		...

		// Load the namespaces listed in the spec (by default pid, network, ipc, uts, and mount).
		for _, ns := range spec.Linux.Namespaces {
			t, exists := namespaceMapping[ns.Type]
			if !exists {
				return nil, fmt.Errorf("namespace %q does not exist", ns)
			}
			if config.Namespaces.Contains(t) {
				return nil, fmt.Errorf("malformed spec file: duplicated ns %q", ns)
			}
			config.Namespaces.Add(t, ns.Path)
		}
		if config.Namespaces.Contains(configs.NEWNET) && config.Namespaces.PathOf(configs.NEWNET) == "" {
			config.Networks = []*configs.Network{
				{
					Type: "loopback",
				},
			}
		}
		// If a user namespace is requested, set up the UID and GID mappings.
		if config.Namespaces.Contains(configs.NEWUSER) {
			if err := setupUserNamespace(spec, config); err != nil {
				return nil, err
			}
		}
		...
		// Intel RDT (Resource Director Technology) settings.
		if spec.Linux.IntelRdt != nil {
			config.IntelRdt = &configs.IntelRdt{}
			if spec.Linux.IntelRdt.L3CacheSchema != "" {
				config.IntelRdt.L3CacheSchema = spec.Linux.IntelRdt.L3CacheSchema
			}
			if spec.Linux.IntelRdt.MemBwSchema != "" {
				config.IntelRdt.MemBwSchema = spec.Linux.IntelRdt.MemBwSchema
			}
		}
	}
	if spec.Process != nil {
		// OOM score adjustment.
		config.OomScoreAdj = spec.Process.OOMScoreAdj
		// privileges
		config.NoNewPrivileges = spec.Process.NoNewPrivileges
 		// umask
		config.Umask = spec.Process.User.Umask
		// selinux
		if spec.Process.SelinuxLabel != "" {
			config.ProcessLabel = spec.Process.SelinuxLabel
		}
		// Capabilities granted to the container process.
		if spec.Process.Capabilities != nil {
			config.Capabilities = &configs.Capabilities{
				Bounding:    spec.Process.Capabilities.Bounding,
				Effective:   spec.Process.Capabilities.Effective,
				Permitted:   spec.Process.Capabilities.Permitted,
				Inheritable: spec.Process.Capabilities.Inheritable,
				Ambient:     spec.Process.Capabilities.Ambient,
			}
		}
	}
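	// Container lifecycle hooks.
	/*
	Prestart        : run before the user process is executed (deprecated in the OCI spec).
	CreateRuntime   : run during create, after the runtime environment exists and before pivot_root.
	CreateContainer : run during create, after CreateRuntime and before pivot_root, inside the container namespaces.
	StartContainer  : run during start, before the user process is executed (the container is still in the created state).
	Poststart       : run after the user process has been started.
	Poststop        : run after the container has been deleted.
	*/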
	createHooks(spec, config)
	config.Version = specs.Version
	return config, nil
}
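
The rootfs and label handling at the top of CreateLibcontainerConfig can be tried in isolation. The following is a minimal sketch, not runc's code: resolveRootfs and buildLabels are hypothetical helper names, the bundle directory is made up, and the keys are sorted only so that the output is deterministic.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// resolveRootfs mirrors the rootfs handling shown above: a relative
// root.path is interpreted relative to the bundle directory (cwd).
func resolveRootfs(cwd, rootfsPath string) string {
	if !filepath.IsAbs(rootfsPath) {
		return filepath.Join(cwd, rootfsPath)
	}
	return rootfsPath
}

// buildLabels turns annotations into "key=value" labels and appends the
// bundle label. Sorting the keys is an assumption of this sketch.
func buildLabels(annotations map[string]string, cwd string) []string {
	keys := make([]string, 0, len(annotations))
	for k := range annotations {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	labels := []string{}
	for _, k := range keys {
		labels = append(labels, k+"="+annotations[k])
	}
	return append(labels, "bundle="+cwd)
}

func main() {
	cwd := "/tmp/bundle" // hypothetical bundle directory
	fmt.Println(resolveRootfs(cwd, "rootfs"))
	fmt.Println(buildLabels(map[string]string{"owner": "demo"}, cwd))
}
```

The bundle label is what later lets tools find the bundle directory again from the container state.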

loadFactory

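loadFactory takes the CLI context and returns a LinuxFactory, which it builds by calling libcontainer.New(). The returned factory has InitPath set to "/proc/self/exe" (the currently running executable, i.e. runc itself). InitArgs is {os.Args[0], "init"}; os.Args[0] is the path of the running runc, so it points at essentially the same binary as InitPath. In other words, the init command is runc init.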

/* utils/utils_linux.go */
func loadFactory(context *cli.Context) (libcontainer.Factory, error) {
        ...
        return libcontainer.New(abs, cgroupManager, intelRdtManager,
                libcontainer.CriuPath(context.GlobalString("criu")),
                libcontainer.NewuidmapPath(newuidmap),
                libcontainer.NewgidmapPath(newgidmap))
}

/* libcontainer/factory_linux.go */
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
        ...
	l := &LinuxFactory{
		Root:      root,	
		InitPath:  "/proc/self/exe", /* runc itself */
		InitArgs:  []string{os.Args[0], "init"},	 /*runc init*/
		Validator: validate.New(),
		CriuPath:  "criu",
	}
        ...
        return l, nil
}
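
libcontainer.New accepts its settings as a variadic list of func(*LinuxFactory) error values, the functional options pattern. The sketch below shows the pattern on a simplified stand-in: Factory, Option, WithCriuPath, and NewFactory are hypothetical names for illustration, not libcontainer's API.

```go
package main

import (
	"errors"
	"fmt"
)

// Factory is a simplified stand-in for LinuxFactory (hypothetical type).
type Factory struct {
	Root     string
	InitPath string
	InitArgs []string
	CriuPath string
}

// Option has the same shape as the options accepted by libcontainer.New.
type Option func(*Factory) error

// WithCriuPath is a hypothetical option that overrides the default criu path.
func WithCriuPath(p string) Option {
	return func(f *Factory) error {
		if p == "" {
			return errors.New("empty criu path")
		}
		f.CriuPath = p
		return nil
	}
}

// NewFactory sets the defaults first and then applies each option in order.
func NewFactory(root, self string, opts ...Option) (*Factory, error) {
	f := &Factory{
		Root:     root,
		InitPath: "/proc/self/exe",
		InitArgs: []string{self, "init"},
		CriuPath: "criu",
	}
	for _, opt := range opts {
		if opt == nil {
			continue
		}
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := NewFactory("/run/runc", "runc", WithCriuPath("/usr/sbin/criu"))
	fmt.Println(f.InitPath, f.InitArgs, f.CriuPath, err)
}
```

Because defaults are set before the options run, a caller only names the fields it wants to change, which is why loadFactory can pass just the criu and uid/gid map paths.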

factory.Create

factory.Create copies the InitPath and InitArgs recorded in the LinuxFactory into a new linuxContainer and returns it.

/*libcontainer/factory_linux.go*/
func (l *LinuxFactory) Create(id string, config *configs.Config) (Container, error) {
        ...
	c := &linuxContainer{
		id:            id, // container ID
		root:          containerRoot, /* where the container state files are kept; /run/runc/{container id}/ by default */
		config:        config,
		initPath:      l.InitPath, /* /proc/self/exe (runc itself) */
		initArgs:      l.InitArgs, /* runc init */
		criuPath:      l.CriuPath,
		newuidmapPath: l.NewuidmapPath,
		newgidmapPath: l.NewgidmapPath,
		cgroupManager: l.NewCgroupsManager(config.Cgroups, nil),
	}
        ...
        return c, nil
}

This completes the path by which createContainer, called from startContainer, produces the linuxContainer.
Next comes runner.run, which is also called from startContainer.

runner.run

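The runner loads the settings prepared so far and launches the runc init process internally.
This runc init process is the key player: it runs nsexec, which sets up the namespaces, and it ultimately becomes the container itself by calling execve.

The details follow later, but in short the runc create process and the runc init process (which splits into three processes) cooperate until the container is finally executed (exec).

① The runc create process, currently executing runner.run: drives the container creation flow from the top.
② The runc init parent process (nsexec), started by ①: a middle manager that relays to runc create and synchronizes its descendants.
③ The runc init child process (nsexec), started by ②: actually sets up the namespaces.
④ The runc init grandchild process (nsexec in C → Go → container via exec), started by ③: survives after ② and ③ have done their job and called exit(), and finally becomes the container process.

The input to runner.run is a spec.Process struct, whose contents come from config.json.
spec.Process is simply the process section of config.json converted into a Go struct, so it needs no special attention.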

/* utils_linux.go */
func startContainer(context *cli.Context, spec *specs.Spec, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
	id := context.Args().First()
...
	container, err := createContainer(context, id, spec)
...
	// The runner loads the settings prepared so far and runs the runc init process (nsexec, then the container itself).
	r := &runner{
		enableSubreaper: !context.Bool("no-subreaper"),
		shouldDestroy:   true,
		container:       container,
		listenFDs:       listenFDs,
		notifySocket:    notifySocket,
		consoleSocket:   context.String("console-socket"),
		detach:          context.Bool("detach"),
		pidFile:         context.String("pid-file"),
		preserveFDs:     context.Int("preserve-fds"),
		action:          action,
		criuOpts:        criuOpts,
		init:            true,
		logLevel:        logLevel,
	}
	return r.run(spec.Process)
}

runner.run consists of the following two steps.
1. It calls newProcess() to build a libcontainer.Process from spec.Process. The second argument (r.init) is true, which means the new process will be the first process in the container.
2. It decides how to operate on the libcontainer.Process according to the value of r.action (here CT_ACT_CREATE).

/*utils_linux.go*/
func (r *runner) run(config *specs.Process) (int, error) {
        ...
	process, err := newProcess(*config, r.init, r.logLevel)
        ...
	switch r.action {
	case CT_ACT_CREATE:
		err = r.container.Start(process)
	case CT_ACT_RESTORE:
		err = r.container.Restore(process, r.criuOpts)
	case CT_ACT_RUN:
		err = r.container.Run(process)
	default:
		panic("Unknown action")
        ...
	return status, err
}

newProcess

The libcontainer.Process struct is defined in libcontainer/process.go, and most of its fields come from spec.Process (the process section of config.json).

/*libcontainer/process.go*/
package libcontainer
...
type Process struct {
	Args []string
	Env []string
	User string
	AdditionalGroups []string
	Cwd string
	Stdin io.Reader
	Stdout io.Writer
	Stderr io.Writer
	ExtraFiles []*os.File
	ConsoleWidth  uint16
	ConsoleHeight uint16
	Capabilities *configs.Capabilities
	AppArmorProfile string
	Label string
	NoNewPrivileges *bool
	Rlimits []configs.Rlimit
	ConsoleSocket *os.File
	Init bool
	ops processOperations
	LogLevel string
}

linuxContainer.Start

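linuxContainer.Start (called as r.container.Start) performs the following steps.
1. Create the FIFO: it creates a named pipe called exec.fifo, which is used later.
2. Call start() (a different method from Start(); only the case of the first letter differs).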

/*libcontainer/container_linux.go*/
func (c *linuxContainer) Start(process *Process) error {
        ...
	if process.Init {
		if err := c.createExecFifo(); err != nil { // create the FIFO used later to trigger exec
			return err
		}
	}
	if err := c.start(process); err != nil { 
		if process.Init {
			c.deleteExecFifo()
		}
		return err
	}
	return nil
}

start(), called from Start(), performs the following steps.
1. Create a parentProcess.
2. Call that parentProcess's start() method.

func (c *linuxContainer) start(process *Process) (retErr error) {
	parent, err := c.newParentProcess(process)
        ...
	err = parent.start() // start the runc init parent process
        ...
	if process.Init {
		if c.config.Hooks != nil {
			s, err := c.currentOCIState()
			if err != nil {
				return err
			}
			// Run the poststart hooks.
			if err := c.config.Hooks[configs.Poststart].RunHooks(s); err != nil {
				if err := ignoreTerminateErrors(parent.terminate()); err != nil {
					logrus.Warn(errorsf.Wrapf(err, "Running Poststart hook"))
				}
				return err
			}
		}
	}
	...
}

In runc, parentProcess is an abstract interface, defined as follows.

/*libcontainer/process_linux.go*/
type parentProcess interface {
	// pid returns the pid for the running process.
	pid() int

	// start starts the process execution.
	start() error

	// send a SIGKILL to the process and wait for the exit.
	terminate() error

	// wait waits on the process returning the process state.
	wait() (*os.ProcessState, error)

	// startTime returns the process start time.
	startTime() (uint64, error)

	signal(os.Signal) error

	externalDescriptors() []string

	setExternalDescriptors(fds []string)
}

newParentProcess

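parentProcess has two implementations, initProcess and setnsProcess. The former creates the first process in a container, and the latter creates a new process inside an existing container. newParentProcess() chooses between them; since p.Init is true here, an initProcess is created.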

/*libcontainer/container_linux.go*/
func (c *linuxContainer) newParentProcess(p *Process) (parentProcess, error) {
	// Create the socket pair used for communication between this runc process (①) and runc init.
	parentPipe, childPipe, err := utils.NewSockPair("init")
	// Build the exec.Cmd.
	cmd, err := c.commandTemplate(p, childPipe)

	if !p.Init {
		return c.newSetnsProcess(p, cmd, parentPipe, childPipe)
	}
	// Add exec.fifo to cmd.ExtraFiles.
	if err := c.includeExecFifo(cmd); err != nil {
		return nil, newSystemErrorWithCause(err, "including execfifo in cmd.Exec setup")
	}
	// Create the initProcess object:
	// - set _LIBCONTAINER_INITTYPE to standard,
	// - generate bootstrapData, which mainly carries the namespaces and the OOM score,
	// - set up the communication channel between the parent side and the init side.
	return c.newInitProcess(p, cmd, parentPipe, childPipe)
}

newParentProcess() has four steps. The first three prepare for step 4, which creates the initProcess.

1. Create a SocketPair. It is handed to the initProcess.

2. Create the exec.Cmd, as shown below. The executable is c.initPath and the arguments are c.initArgs, i.e. the "/proc/self/exe" and "init" stored in the LinuxFactory. The program being launched is therefore runc itself, with init as its argument. The childPipe of the SocketPair created in step 1 is appended to cmd.ExtraFiles, and _LIBCONTAINER_INITPIPE=%d is added to cmd.Env.

/*libcontainer/container_linux.go*/
func (c *linuxContainer) commandTemplate(p *Process, childPipe *os.File) (*exec.Cmd, error) {
	cmd := exec.Command(c.initPath, c.initArgs[1:]...)
	cmd.Args[0] = c.initArgs[0]
	// The std streams are handed to runc init and eventually passed on to the container.
	cmd.Stdin = p.Stdin
	cmd.Stdout = p.Stdout
	cmd.Stderr = p.Stderr
	cmd.ExtraFiles = append(cmd.ExtraFiles, p.ExtraFiles...)
	// childPipe is used to communicate with the parent (the current runc process).
	cmd.ExtraFiles = append(cmd.ExtraFiles, childPipe)
	// The fd number is passed to runc init through the _LIBCONTAINER_INITPIPE environment variable.
	// The std streams occupy the first three fds (0, 1, 2), so 3 (stdioFdCount) is added to the index.
	cmd.Env = append(cmd.Env,
		fmt.Sprintf("_LIBCONTAINER_INITPIPE=%d", stdioFdCount+len(cmd.ExtraFiles)-1),
	)
	...
	return cmd, nil
}
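
The fd arithmetic in commandTemplate follows from how os/exec numbers inherited files: entry i of ExtraFiles becomes fd 3+i in the child. A minimal sketch of both sides, assuming the pipe was just appended as the last entry (initPipeEnv and parseFd are hypothetical helper names):

```go
package main

import (
	"fmt"
	"strconv"
)

const stdioFdCount = 3 // stdin, stdout, stderr occupy fds 0, 1, 2

// initPipeEnv returns the environment entry for a pipe that was just
// appended as the last element of ExtraFiles (numExtraFiles counts it).
func initPipeEnv(numExtraFiles int) string {
	return "_LIBCONTAINER_INITPIPE=" + strconv.Itoa(stdioFdCount+numExtraFiles-1)
}

// parseFd is what the child side does: convert the value back to an int.
func parseFd(value string) (int, error) {
	return strconv.Atoi(value)
}

func main() {
	fmt.Println(initPipeEnv(1))
	fd, err := parseFd("3")
	fmt.Println(fd, err)
}
```

With no other extra files the pipe ends up as fd 3; every file placed before it in ExtraFiles shifts it up by one.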

3. includeExecFifo() opens the FIFO created earlier, appends its fd to cmd.ExtraFiles, and records _LIBCONTAINER_FIFOFD=%d in cmd.Env.

4. newInitProcess() creates the initProcess. It first adds _LIBCONTAINER_INITTYPE="standard" to cmd.Env, then reads from the config which namespaces the new container needs and serializes them into the data variable (bootstrapData). Finally it builds the initProcess from the resources and variables prepared so far.

/*libcontainer/container_linux.go*/
func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, parentPipe, childPipe *os.File) (*initProcess, error) {
	// The init type is set to standard (initStandard) through the _LIBCONTAINER_INITTYPE environment variable.
	cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
	nsMaps := make(map[configs.NamespaceType]string)
	for _, ns := range c.config.Namespaces {
		if ns.Path != "" {
			nsMaps[ns.Type] = ns.Path
		}
	}
	_, sharePidns := nsMaps[configs.NEWPID]
	data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps)
	if err != nil {
		return nil, err
	}
	return &initProcess{
		cmd:       cmd,
		childPipe:     childPipe,
		parentPipe:   parentPipe,
		manager:     c.cgroupManager,
		intelRdtManager:   c.intelRdtManager,
		config:     c.newInitConfig(p),
		container:   c,
		process:   p,          
		bootstrapData:  data,
		sharePidns:   sharePidns,
	}, nil
}
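
newInitProcess treats namespaces in two ways: those with a path are joined (recorded in nsMaps), and those without a path are created via clone flags. The sketch below reproduces that split under stated assumptions: Namespace, cloneFlagOf, and splitNamespaces are hypothetical names, the path is made up, and the flag values are the Linux CLONE_NEW* constants copied in so that no extra package is needed.

```go
package main

import "fmt"

type NamespaceType string

// Namespace mirrors the Type/Path pair used in the loop above.
type Namespace struct {
	Type NamespaceType
	Path string
}

// Linux CLONE_NEW* values, copied here so the sketch needs no extra packages.
var cloneFlagOf = map[NamespaceType]uintptr{
	"NEWNS":     0x00020000,
	"NEWCGROUP": 0x02000000,
	"NEWUTS":    0x04000000,
	"NEWIPC":    0x08000000,
	"NEWUSER":   0x10000000,
	"NEWPID":    0x20000000,
	"NEWNET":    0x40000000,
}

// splitNamespaces returns the clone flags for namespaces that must be
// created and the paths of namespaces that must be joined.
func splitNamespaces(nss []Namespace) (uintptr, map[NamespaceType]string) {
	var flags uintptr
	paths := make(map[NamespaceType]string)
	for _, ns := range nss {
		if ns.Path != "" {
			paths[ns.Type] = ns.Path
			continue
		}
		flags |= cloneFlagOf[ns.Type]
	}
	return flags, paths
}

func main() {
	flags, paths := splitNamespaces([]Namespace{
		{Type: "NEWNS"},
		{Type: "NEWNET"},
		{Type: "NEWPID", Path: "/proc/1234/ns/pid"}, // hypothetical path
	})
	_, sharePidns := paths["NEWPID"]
	fmt.Printf("%#x %v %v\n", flags, paths, sharePidns)
}
```

sharePidns is true exactly when the PID namespace is joined from a path instead of being newly created.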

So far we have seen how linuxContainer's start() method creates the parentProcess object.
Next, the parentProcess's start() method is called.

func (c *linuxContainer) start(process *Process) error {
	parent, err := c.newParentProcess(process) // create the parentProcess

	err = parent.start() // start the parentProcess
	...

parentProcess.start()

As described above, newParentProcess() creates the initProcess object according to the settings derived from config.json.
This initProcess holds the following information.

cmd records the executable to run, namely "/proc/self/exe init" (runc init).
cmd.Env records the inherited file descriptors: _LIBCONTAINER_FIFOFD=%d is the descriptor of the named pipe exec.fifo, and _LIBCONTAINER_INITPIPE=%d is the childPipe side of the SocketPair. _LIBCONTAINER_INITTYPE="standard" means that the process is the container's init process.
bootstrapData records which namespaces are to be created for the new container.

/* libcontainer/process_linux.go */
func (p *initProcess) start() error {
	p.cmd.Start() // run the runc init command (spawns the process)
	io.Copy(p.parentPipe, p.bootstrapData) // hand the settings to the runc init process just started
	...
}

runc init

parentProcess.start() launches the executable set in cmd, "/proc/self/exe init" (runc init).
p.cmd.Start() spawns a new process to run the command (the runc init parent process ②).
Because /proc/self/exe is the runc program itself, this amounts to running runc init.

io.Copy sends the contents of p.bootstrapData through p.parentPipe to runc init (nsexec).
A new process is needed because the container being created has to run in different namespaces.
Joining them is done with the setns() system call, and the setns man page states the following.

A multithreaded process may not change user namespace with setns().

The Go runtime is multithreaded, so the namespaces have to be set up before it starts; this is done by C code embedded through cgo that runs ahead of the Go runtime.
This is explained in the nsenter README. The runc init command imports the nsenter package at the top of init.go.

/* init.go */
import (
	"os"
	"runtime"

	"github.com/opencontainers/runc/libcontainer"
	_ "github.com/opencontainers/runc/libcontainer/nsenter"
	"github.com/urfave/cli"
)

nsenter

The nsenter package consists of C code embedded via cgo, and it calls nsexec().

/* nsenter.go */
package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
	nsexec();
}
*/
import "C"

nsexec

Next, nsexec() creates the new namespaces for the container.
As an aside, it also contains the mitigation for CVE-2019-5736.

/* libcontainer/nsenter/nsexec.c */
void nsexec(void)
{
	int pipenum;
	jmp_buf env;
	int sync_child_pipe[2], sync_grandchild_pipe[2];
	struct nlconfig_t config = { 0 };

	// Get the fd number of the child pipe from the _LIBCONTAINER_INITPIPE environment variable.
	pipenum = initpipe();
	if (pipenum == -1)
		return;

	// To mitigate CVE-2019-5736, make sure we are running from a cloned copy of the binary,
	// so that the container cannot reach the host binary through /proc/self/exe.
	if (ensure_cloned_binary() < 0)
		bail("could not ensure we are a cloned binary");

	// Read the namespace configuration from the init pipe.
	nl_parse(pipenum, &config);
   
	...

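In the C code above, initpipe() obtains the pipe set up by the parent (the file descriptor recorded in _LIBCONTAINER_INITPIPE), and nl_parse reads the settings from that pipe into the config variable. The peer is the parent process (① runc create). As the following figure shows, runc create sends the new container's configuration to runc init (nsexec) through this pipe.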

f:id:FallenPigeon:20210501113928p:plain

The data sent was wrapped as a netlink-format message by linuxContainer's bootstrapData() function.
At this point the child process has the namespace configuration from the parent.
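
To give a feel for what "netlink format" means here, the following sketch encodes a generic netlink attribute: a 2-byte length, a 2-byte type, the payload, and zero padding up to a 4-byte boundary. It is an illustration only; the attribute type numbers are placeholders and not runc's, and little-endian byte order is assumed (netlink uses the host's native order).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const attrHeaderLen = 4 // 2-byte length + 2-byte type

// encodeAttr serializes one attribute: header, payload, then zero padding to
// a 4-byte boundary. The length field covers header and payload, not padding.
func encodeAttr(attrType uint16, payload []byte) []byte {
	length := attrHeaderLen + len(payload)
	padded := (length + 3) &^ 3
	buf := make([]byte, padded)
	binary.LittleEndian.PutUint16(buf[0:2], uint16(length))
	binary.LittleEndian.PutUint16(buf[2:4], attrType)
	copy(buf[attrHeaderLen:], payload)
	return buf
}

// encodeUint32Attr encodes a 32-bit value such as a set of clone flags.
func encodeUint32Attr(attrType uint16, v uint32) []byte {
	p := make([]byte, 4)
	binary.LittleEndian.PutUint32(p, v)
	return encodeAttr(attrType, p)
}

func main() {
	const hypotheticalCloneFlagsAttr = 1 // placeholder type number, not runc's
	fmt.Printf("% x\n", encodeUint32Attr(hypotheticalCloneFlagsAttr, 0x40020000))
	fmt.Printf("% x\n", encodeAttr(2, []byte("/proc/1/ns/net")))
}
```

A type-length-value layout like this lets the C side walk the message attribute by attribute without knowing the Go structures.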
Next, nsexec() creates two more socket pairs to communicate with its own child and grandchild.

/*libcontainer/nsenter/nsexec.c*/
void nsexec(void)
{
	...
	/* Create a socketpair so that the child (③) can be told when setup is complete. */
	if (socketpair(AF_LOCAL, SOCK_STREAM, 0, sync_child_pipe) < 0) 
		bail("failed to setup sync pipe between parent and child");

	/* Create another socketpair to synchronize with the grandchild (④). */
	if (socketpair(AF_LOCAL, SOCK_STREAM, 0, sync_grandchild_pipe) < 0)
		bail("failed to setup sync pipe between parent and grandchild");
   
}

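In the switch statement below, the current process (② runc init) creates a child process (③ runc init) with the clone system call, and the child in turn creates a grandchild process (④ runc init) with clone().

runc init therefore consists of three processes.

② runc init parent process: the first process reads bootstrapData and completes the user mappings for the second process.
③ runc init child process: the second process creates or joins the namespaces.
④ runc init grandchild process: the third process completes the cgroup namespace setup in nsexec, prepares the environment inside the container in the Go part of runc init, and finally executes the container's entry point.

Both ② and ③ eventually terminate with exit(0), but ④ does not; it goes on to run the second half of the runc init command (the Go implementation). As a result, only the original ① runc create process and the ④ runc init process remain.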

f:id:FallenPigeon:20210502181030p:plain

enum sync_t {
	SYNC_USERMAP_PLS = 0x40,	/* Request parent to map our users. */
	SYNC_USERMAP_ACK = 0x41,	/* Mapping finished by the parent. */
	SYNC_RECVPID_PLS = 0x42,	/* Tell parent we're sending the PID. */
	SYNC_RECVPID_ACK = 0x43,	/* PID was correctly received by parent. */
	SYNC_GRANDCHILD = 0x44,	/* The grandchild is ready to run. */
	SYNC_CHILD_FINISH = 0x45,	/* The child or grandchild has finished. */
};
...
	switch (current_stage) {

	// ② runc init parent process
	// Creates the child (STAGE_CHILD) process and writes its uid_map and gid_map.
	// The child in turn creates the grandchild and sends its PID back.
	case STAGE_PARENT:{
			...
			stage1_pid = clone_parent(&env, STAGE_CHILD); // clone the runc init child process
			while (!stage1_complete) {
				...
				switch (s) {
				case SYNC_USERMAP_PLS: // the child asks for its user mappings to be set
					...
					update_uidmap(config.uidmappath, stage1_pid, config.uidmap, config.uidmap_len);
					update_gidmap(config.gidmappath, stage1_pid, config.gidmap, config.gidmap_len);
					...
					s = SYNC_USERMAP_ACK;
					if (write(syncfd, &s, sizeof(s)) != sizeof(s)) { // notify completion
						...
					}
					break;
				case SYNC_RECVPID_PLS:
					...
					/* Receive the grandchild's PID. */
					if (read(syncfd, &stage2_pid, sizeof(stage2_pid)) != sizeof(stage2_pid)) {
						sane_kill(stage1_pid, SIGKILL);
						sane_kill(stage2_pid, SIGKILL);
						bail("failed to sync with stage-1: read(stage2_pid)");
					}

					/* Send ACK. */
					s = SYNC_RECVPID_ACK;
					if (write(syncfd, &s, sizeof(s)) != sizeof(s)) { // notify completion
						...
					}
					...
					// Send the PIDs of the child and the grandchild to runc create.
					len =
					    dprintf(pipenum, "{\"stage1_pid\":%d,\"stage2_pid\":%d}\n", stage1_pid,
						    stage2_pid);
					...
					break;
				case SYNC_CHILD_FINISH: // the child reports that it has finished
					write_log(DEBUG, "stage-1 complete");
					stage1_complete = true;
					break;
				}
			}
			...
				write_log(DEBUG, "signalling stage-2 to run");
				s = SYNC_GRANDCHILD;
				if (write(syncfd, &s, sizeof(s)) != sizeof(s)) { // tell the grandchild to start synchronizing
					...
				}

				if (read(syncfd, &s, sizeof(s)) != sizeof(s))
						...

				switch (s) {
				case SYNC_CHILD_FINISH: // the grandchild reports that it has finished
					write_log(DEBUG, "stage-2 complete");
					stage2_complete = true;
					break;
				default:
					bail("unexpected sync value: %u", s);
				}
				...
		}
		break;

	// ③ runc init child process
	// Unshares the requested namespaces.
	// In particular, when CLONE_NEWUSER (user namespace) is requested, it asks the parent to set up the user mappings.
	// It then creates the grandchild inside the new PID namespace and sends the grandchild's PID to the parent.
	case STAGE_CHILD:{

			// Set up the user namespace first, because it is the context for the other namespaces and for privilege checks.
			if (config.cloneflags & CLONE_NEWUSER) {
				write_log(DEBUG, "unshare user namespace");
				if (unshare(CLONE_NEWUSER) < 0) // create the user namespace
						...
				// The child has no permission to write its own user mappings, so it asks the parent to do it.
				s = SYNC_USERMAP_PLS;
				if (write(syncfd, &s, sizeof(s)) != sizeof(s))
					...

				/* ... wait for mapping ... */
				write_log(DEBUG, "request stage-0 to map user namespace");
				if (read(syncfd, &s, sizeof(s)) != sizeof(s)) // receive the parent's completion notice
					...
				/* Become root inside the user namespace with setresuid. */
				if (setresuid(0, 0, 0) < 0)
					bail("failed to become root in user namespace");
			}

			// Unshare the remaining namespaces. The cgroup namespace is deferred to the grandchild,
			// and a new PID namespace only takes effect for processes forked afterwards.
			if (unshare(config.cloneflags & ~CLONE_NEWCGROUP) < 0)
				...

			// Fork once more for the PID namespace. setns(2) and unshare(2) do not change the caller's own PID namespace,
			// because doing so would change the caller's idea of its own PID and break applications and libraries.
			// The new PID namespace applies to the forked process (one reason why there are three runc init processes).

			stage2_pid = clone_parent(&env, STAGE_INIT); // create the grandchild process

			s = SYNC_RECVPID_PLS;
			// Tell the parent that the grandchild's PID is about to be sent.
			if (write(syncfd, &s, sizeof(s)) != sizeof(s)) {
					...
			}
			// Send the grandchild's PID to the parent.
			if (write(syncfd, &stage2_pid, sizeof(stage2_pid)) != sizeof(stage2_pid)) {
				...
			}

			// Receive the parent's acknowledgement.
			if (read(syncfd, &s, sizeof(s)) != sizeof(s)) {
				...
			}
			if (s != SYNC_RECVPID_ACK) {
				...
			}

			// Tell the parent that the child has finished.
			s = SYNC_CHILD_FINISH;
			if (write(syncfd, &s, sizeof(s)) != sizeof(s)) {
				...
			}
		}
		break;

	// ④ runc init grandchild process
	// Waits until the top-level parent has finished the cgroup setup in p.manager.Apply().
	case STAGE_INIT:{

			//cgroup namespace
			if (config.cloneflags & CLONE_NEWCGROUP) {
				uint8_t value;
				if (read(pipenum, &value, sizeof(value)) != sizeof(value))
					bail("read synchronisation value failed");
				if (value == CREATECGROUPNS) {
					write_log(DEBUG, "unshare cgroup namespace");
					if (unshare(CLONE_NEWCGROUP) < 0)
						bail("failed to unshare cgroup namespace");
				} else
					bail("received unknown synchronisation value");
			}

			// Tell the parent that the grandchild has finished.
			s = SYNC_CHILD_FINISH;
			if (write(syncfd, &s, sizeof(s)) != sizeof(s))
				bail("failed to sync with patent: write(SYNC_CHILD_FINISH)");

			// The grandchild does not exit after nsexec completes, so execution continues in the Go part of runc init.
			write_log(DEBUG, "<= nsexec container setup");
			write_log(DEBUG, "booting up go runtime ...");
			return;
		}
		break;
	default:
		bail("unknown stage '%d' for jump value", current_stage);
	}
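
The request/acknowledge exchange between the parent (②) and the child (③) can be modelled without any processes at all. The sketch below replays the same message order over Go channels; it is a simulation under stated assumptions, with runParent, runChild, and handshake as hypothetical names and the PID a made-up number.

```go
package main

import "fmt"

// Sync values as in the sync_t enum above.
const (
	syncUsermapPls  = 0x40
	syncUsermapAck  = 0x41
	syncRecvpidPls  = 0x42
	syncRecvpidAck  = 0x43
	syncChildFinish = 0x45
)

// runParent plays the STAGE_PARENT role: it answers requests from the child
// until the child reports that it has finished, and returns the PID it was sent.
func runParent(fromChild <-chan int, toChild chan<- int) (int, []string) {
	var grandchildPid int
	var trace []string
	for {
		switch s := <-fromChild; s {
		case syncUsermapPls:
			trace = append(trace, "map users")
			toChild <- syncUsermapAck
		case syncRecvpidPls:
			grandchildPid = <-fromChild
			trace = append(trace, "recv pid")
			toChild <- syncRecvpidAck
		case syncChildFinish:
			trace = append(trace, "child finished")
			return grandchildPid, trace
		default:
			panic(fmt.Sprintf("unexpected sync value: %#x", s))
		}
	}
}

// runChild plays the STAGE_CHILD role and sends the requests in order.
func runChild(toParent chan<- int, fromParent <-chan int, grandchildPid int) {
	toParent <- syncUsermapPls
	if ack := <-fromParent; ack != syncUsermapAck {
		panic("expected SYNC_USERMAP_ACK")
	}
	toParent <- syncRecvpidPls
	toParent <- grandchildPid
	if ack := <-fromParent; ack != syncRecvpidAck {
		panic("expected SYNC_RECVPID_ACK")
	}
	toParent <- syncChildFinish
}

// handshake wires both roles together over unbuffered channels.
func handshake(pid int) (int, []string) {
	up := make(chan int)
	down := make(chan int)
	go runChild(up, down, pid)
	return runParent(up, down)
}

func main() {
	pid, trace := handshake(4242) // 4242 is a made-up PID
	fmt.Println(pid, trace)
}
```

Go's switch does not fall through by default, whereas each case in the C version needs its own break to keep the requests separate.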

Now let's return to the runc create process.

func (p *initProcess) start() (retErr error) {
	defer p.messageSockPair.parent.Close()
	// Run runc init (which executes nsexec).
	err := p.cmd.Start()
	...
	// Place the child process in its cgroups.
	if err := p.manager.Apply(p.pid()); err != nil {
		return newSystemErrorWithCause(err, "applying cgroup configuration for process")
	}
	if p.intelRdtManager != nil {
		if err := p.intelRdtManager.Apply(p.pid()); err != nil {
			return newSystemErrorWithCause(err, "applying Intel RDT configuration for process")
		}
	}
	// Send bootstrapData to the runc init process, which uses it to set up its namespaces and so on.
	if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
		return newSystemErrorWithCause(err, "copying bootstrap data to pipe")
	}

	// Obtain the child PID via the init pipe.
	childPid, err := p.getChildPid()

	// Get the paths behind the child's file descriptors.
	fds, err := getPipeFds(childPid)

	// Tell the init process to create a new cgroup namespace for the container.
	if p.config.Config.Namespaces.Contains(configs.NEWCGROUP) && p.config.Config.Namespaces.PathOf(configs.NEWCGROUP) == "" {
		if _, err := p.messageSockPair.parent.Write([]byte{createCgroupns}); err != nil {
			return newSystemErrorWithCause(err, "sending synchronization value to init process")
		}
	}

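	// Wait for the nsexec process to exit.
	// Its PID was obtained through the init pipe.
	// nsexec receives bootstrapData and sets up the namespaces for the process;
	// after that, control moves to the Go part of runc init.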
	if err := p.waitForChildExit(childPid); err != nil {
		return newSystemErrorWithCause(err, "waiting for our first child to exit")
	}
	
	...
	// Send the init config to the init process.
 	if err := p.sendConfig(); err != nil {
		return newSystemErrorWithCause(err, "sending config to init process")
	}
	var (
		sentRun    bool
		sentResume bool
	)

	// Synchronize progress with the init process.
	// parseSync loops until the socket is closed.
	ierr := parseSync(p.messageSockPair.parent, func(sync *syncT) error {
		switch sync.Type {
		// The init process reports that it is ready.
		case procReady:
			// Set the rlimits.
			if err := setupRlimits(p.config.Rlimits, p.pid()); err != nil {
				return newSystemErrorWithCause(err, "setting rlimits for ready process")
			}
			// Hooks are run here only when there is no mount namespace.
			// Normally a mount namespace is present, and the hooks run in the procHooks case instead.
			if !p.config.Config.Namespaces.Contains(configs.NEWNS) {
				// Setup cgroup before the hook, so that the prestart and CreateRuntime hook could apply cgroup permissions.
				if err := p.manager.Set(p.config.Config); err != nil {
					return newSystemErrorWithCause(err, "setting cgroup config for ready process")
				}
				...
				if p.config.Config.Hooks != nil {
					s, err := p.container.currentOCIState()
					if err != nil {
						return err
					}
					// Record the PID of the child process.
					s.Pid = p.cmd.Process.Pid
					// Set the status to "creating".
					s.Status = specs.StateCreating
					hooks := p.config.Config.Hooks
					
					if err := hooks[configs.Prestart].RunHooks(s); err != nil {
						return err
					}
					if err := hooks[configs.CreateRuntime].RunHooks(s); err != nil {
						return err
					}
				}
			}

			// generate a timestamp indicating when the container was started
			p.container.created = time.Now().UTC()
			p.container.state = &createdState{
				c: p.container,
			}

			state, uerr := p.container.updateState(p)
			if uerr != nil {
				return newSystemErrorWithCause(err, "store init state")
			}
			p.container.initProcessStartTime = state.InitProcessStartTime

			// Tell the child process to keep running.
			if err := writeSync(p.messageSockPair.parent, procRun); err != nil {
				return newSystemErrorWithCause(err, "writing syncT 'run'")
			}
			sentRun = true

		// The init process has sent the hooks signal; it is about to execute pivot_root.
		case procHooks:
			// Apply the cgroup settings for the process.
			if err := p.manager.Set(p.config.Config); err != nil {
				return newSystemErrorWithCause(err, "setting cgroup config for procHooks process")
			}

			if p.intelRdtManager != nil {
				if err := p.intelRdtManager.Set(p.config.Config); err != nil {
					return newSystemErrorWithCause(err, "setting Intel RDT config for procHooks process")
				}
			}
			// Run the hooks.
			if p.config.Config.Hooks != nil {
				s, err := p.container.currentOCIState()
				if err != nil {
					return err
				}
				s.Pid = p.cmd.Process.Pid
				s.Status = specs.StateCreating
				hooks := p.config.Config.Hooks

				if err := hooks[configs.Prestart].RunHooks(s); err != nil {
					return err
				}
				if err := hooks[configs.CreateRuntime].RunHooks(s); err != nil {
					return err
				}
			}
			// Tell the init process to resume and go on to pivot_root.
			if err := writeSync(p.messageSockPair.parent, procResume); err != nil {
				return newSystemErrorWithCause(err, "writing syncT 'resume'")
			}
			sentResume = true
		}

		return nil
	})
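	// The init process must have reported ready (procReady) by now;
	// otherwise container initialization failed.
	if !sentRun {
		return newSystemErrorWithCause(ierr, "container init")
	}
	// With a mount namespace, the hooks exchange (procHooks) must have completed as well.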
	if p.config.Config.Namespaces.Contains(configs.NEWNS) && !sentResume {
		return newSystemError(errors.New("could not synchronise after executing prestart and CreateRuntime hooks with container process"))
	}
	// Shut down the write side of the init pipe.
	if err := unix.Shutdown(int(p.messageSockPair.parent.Fd()), unix.SHUT_WR); err != nil {
		return newSystemErrorWithCause(err, "shutting down init pipe")
	}

	if ierr != nil {
		p.wait()
		return ierr
	}
	return nil
}

runc init(After nsexec)

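At this point the first and second runc init processes (② and ③) have already terminated with exit().
The third runc init process (④), whose namespaces were set up by nsexec, performs the remaining container setup in Go and finally executes (exec) the container's entry point.
It first creates a LinuxFactory through libcontainer.New and then calls the factory's StartInitialization() method.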

...
var initCommand = cli.Command{
	Name:  "init",
	Usage: `initialize the namespaces and launch the process (do not call it outside of runc)`,
	Action: func(context *cli.Context) error {
		factory, _ := libcontainer.New("")
		if err := factory.StartInitialization(); err != nil {
			os.Exit(1)
		}
		panic("libcontainer: container init failed to exec")
	},
}

func (l *LinuxFactory) StartInitialization() (err error) {
	// Get the file descriptor of the init pipe.
	envInitPipe := os.Getenv("_LIBCONTAINER_INITPIPE")
	pipefd, err := strconv.Atoi(envInitPipe)
	if err != nil {
		return fmt.Errorf("unable to convert _LIBCONTAINER_INITPIPE=%s to int: %s", envInitPipe, err)
	}
	// Wrap it as the pipe used to talk to the parent process.
	pipe := os.NewFile(uintptr(pipefd), "pipe")
	defer pipe.Close()

	// For the standard init type (set by runc create), also read the exec.fifo descriptor.
	fifofd := -1
	envInitType := os.Getenv("_LIBCONTAINER_INITTYPE")
	it := initType(envInitType)
	if it == initStandard {
		envFifoFd := os.Getenv("_LIBCONTAINER_FIFOFD")
		if fifofd, err = strconv.Atoi(envFifoFd); err != nil {
			return fmt.Errorf("unable to convert _LIBCONTAINER_FIFOFD=%s to int: %s", envFifoFd, err)
		}
	}

	// Clear the inherited process environment.
	os.Clearenv()
	...
	// Here the type is standard, so linuxStandardInit is returned; for runc exec, linuxSetnsInit is returned instead.
	i, err := newContainerInit(it, pipe, consoleSocket, fifofd)
	if err != nil {
		return err
	}
	// Enter Init.
	return i.Init()
}

linuxStandardInit.Init

/*libcontainer/standard_init_linux.go*/
func (l *linuxStandardInit) Init() error {
	// Set up the network.
	if err := setupNetwork(l.config); err != nil {
		return err
	}
	// Set up the routes.
	if err := setupRoute(l.config.Config); err != nil {
		return err
	}

	// SELinux setup.
	selinux.GetEnabled()
	// Prepare the root filesystem: mount the root directory and external volumes, and create the devices.
	// This also signals runc create to run the prestart hooks, which are processed
	// before pivot_root or chroot isolates the root directory.
	if err := prepareRootfs(l.pipe, l.config); err != nil {
		return err
	}
	...

	// Finalize the root filesystem, mainly the remaining mount operations.
	if l.config.Config.Namespaces.Contains(configs.NEWNS) {
		if err := finalizeRootfs(l.config.Config); err != nil {
			return err
		}
	}

	// Set the hostname.
	if hostname := l.config.Config.Hostname; hostname != "" {
		if err := unix.Sethostname([]byte(hostname)); err != nil {
			return errors.Wrap(err, "sethostname")
		}
	}
	// Apply the AppArmor profile.
	if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
		return errors.Wrap(err, "apply apparmor profile")
	}

	// Write the sysctl settings, e.g. net.ipv4.ip_forward is written to /proc/sys/net/ipv4/ip_forward.
	for key, value := range l.config.Config.Sysctl {
		if err := writeSystemProperty(key, value); err != nil {
			return errors.Wrapf(err, "write sysctl key %s", key)
		}
	}
	...
	if err != nil {
		return errors.Wrap(err, "get pdeath signal")
	}
	// no_new_privs setting (blocks privilege escalation).
	if l.config.NoNewPrivileges {
		if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
			return errors.Wrap(err, "set nonewprivileges")
		}
	}

	// Tell runc create that basic initialization is complete and init is ready to exec.
	if err := syncParentReady(l.pipe); err != nil {
		return errors.Wrap(err, "sync ready")
	}
	...
	...
	// seccomp setup.
	if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
		if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
			return err
		}
	}
	// Finalize capabilities, user, and working directory for the container.
	if err := finalizeNamespace(l.config); err != nil {
		return err
	}
	...
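	// The container context and root filesystem are ready, so check that the executable exists in the container.
	// Looking it up now, inside the container's root filesystem, lets a missing binary be reported at create time.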
	name, err := exec.LookPath(l.config.Args[0])
	if err != nil {
		return err
	}
	// Close the init pipe.
	l.pipe.Close()
	// Before executing the container's start command, wait for runc start to open the exec.fifo pipe.
	// /proc/self/fd/ holds the link fd -> /run/runc//.
	fd, err := unix.Open("/proc/self/fd/"+strconv.Itoa(l.fifoFd), unix.O_WRONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return newSystemErrorWithCause(err, "open exec fifo")
	}
	//exec.fifoパイプラインにデータを書き込むと、init processがブロックされ、runc start呼出しを待機します。
	if _, err := unix.Write(fd, byte("0")); err != nil {
		return newSystemErrorWithCause(err, "write 0 exec fifo")
	}
	//exec.fifoをクローズ
	unix.Close(l.fifoFd)

	s := l.config.SpecState
	s.Pid = unix.Getpid()

	// 状態をcreatedに設定
	s.Status = specs.StateCreated
	if err := l.config.Config.Hooks[configs.StartContainer].RunHooks(s); err != nil {
		return err
	}
	// コンテナの開始
	// execでコンテナ処理に移行
	if err := unix.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
		return newSystemErrorWithCause(err, "exec user process")
	}
	return nil
}

runc start

/*start.go*/
...
		// Look up the container.
		container, err := getContainer(context)
		if err != nil {
			return err
		}
		...
		switch status {
		case libcontainer.Created:
			...
			// Have the container process exec'd in place of init.
			if err := container.Exec(); err != nil {
				return err
			}
			if notifySocket != nil {
				return notifySocket.waitForContainer(container)
			}
			return nil
		case libcontainer.Stopped:
			return errors.New("cannot start a container that has stopped")
		case libcontainer.Running:
			return errors.New("cannot start an already running container")
		default:
			return fmt.Errorf(
...

func (c *linuxContainer) exec() error {
	path := filepath.Join(c.root, execFifoFilename)
	pid := c.initProcess.pid()
	// Read /run/runc//exec.fifo.
	blockingFifoOpenCh := awaitFifoOpen(path)
	// Either obtain the contents of exec.fifo or detect that the process has become a zombie.
	for {
		select {
		case result := <-blockingFifoOpenCh:
			// handleFifoResult reads the contents and then removes exec.fifo.
			return handleFifoResult(result)

		case <-time.After(time.Millisecond * 100):
			stat, err := system.Stat(pid)
			if err != nil || stat.State == system.Zombie {
				if err := handleFifoResult(fifoOpen(path, false)); err != nil {
					return errors.New("container process is already dead")
				}
				return nil
			}
		}
	}
}

*1:constructor

2021-04-29

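OCI Images and OCI Runtime Bundles

Container / Container image / Container runtime

References
OCI image(image-spec)
OCI runtime bundle(runtime-spec)
Starting a container with runc

This is the second post on container images.
The previous post introduced Docker images and container runtimes.

kurobato.hateblo.jp

This post is organized as follows.
・Convert a Docker image into an OCI image
・Convert the OCI image into an OCI runtime bundle
・Start a container from the OCI runtime bundle with runc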

f:id:FallenPigeon:20210429181504j:plain

References

github.com

OCI image(image-spec)

We use a tool called skopeo to convert the ubuntu Docker image into an OCI image.

$ skopeo copy docker://ubuntu:latest oci:ubuntu:latest

$ cd ubuntu

$ tree
.
├── blobs
│   └── sha256
│       ├── 345e3491a907bb7c6f1bdddcf4a94284b8b6ddd77eb7d93f09432b17b20f2bbe
│       ├── 57671312ef6fdbecf340e5fed0fb0863350cd806c92b1fdd7978adbd02afc5c3
│       ├── 5e9250ddb7d0fa6d13302c7c3e6a0aa40390e42424caed1e5289077ee4054709
│       ├── 687ea84638c380e1856de9f1d923412efd27fe95d781e3fff07258b7f75cd9b7
│       └── 7c6bc520685c7c84faf88a21861b77456900f20ef4e31d0ee387595951f990de
├── index.json
└── oci-layout

An OCI image is structured as index.json → image manifest → config and layers.

index.json holds the reference to the manifest.

$ jq . index.json
{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:687ea84638c380e1856de9f1d923412efd27fe95d781e3fff07258b7f75cd9b7",
      "size": 658,
      "annotations": {
        "org.opencontainers.image.ref.name": "latest"
      }
    }
  ]
}

$ jq . blobs/sha256/687ea84638c380e1856de9f1d923412efd27fe95d781e3fff07258b7f75cd9b7
{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:7c6bc520685c7c84faf88a21861b77456900f20ef4e31d0ee387595951f990de",
    "size": 2423
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:345e3491a907bb7c6f1bdddcf4a94284b8b6ddd77eb7d93f09432b17b20f2bbe",
      "size": 28539626
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:57671312ef6fdbecf340e5fed0fb0863350cd806c92b1fdd7978adbd02afc5c3",
      "size": 851
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:5e9250ddb7d0fa6d13302c7c3e6a0aa40390e42424caed1e5289077ee4054709",
      "size": 187
    }
  ]
}

The manifest holds references to the config and the layers, all of which are stored as blobs.
This is almost identical to a Docker image.

$ jq . blobs/sha256/7c6bc520685c7c84faf88a21861b77456900f20ef4e31d0ee387595951f990de
{
  "created": "2021-04-23T22:21:37.49442735Z",
  "architecture": "amd64",
  "os": "linux",
  "config": {
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "/bin/bash"
    ]
  },
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:ccdbb80308cc5ef43b605ac28fac29c6a597f89f5a169bbedbb8dec29c987439",
      "sha256:63c99163f47292f80f9d24c5b475751dbad6dc795596e935c5c7f1c73dc08107",
      "sha256:2f140462f3bcf8cf3752461e27dfd4b3531f266fa10cda716166bd3a78a19103"
    ]
  },
  "history": [
    {
      "created": "2021-04-23T22:21:34.1865992Z",
      "created_by": "/bin/sh -c #(nop) ADD file:5c44a80f547b7d68b550b0e64aef898b361666857abf9a5c8f3f8d0567b8e8e4 in / "
    },
    {
      "created": "2021-04-23T22:21:35.354865637Z",
      "created_by": "/bin/sh -c set -xe \t\t&& echo '#!/bin/sh' > /usr/sbin/policy-rc.d \t&& echo 'exit 101' >> /usr/sbin/policy-rc.d \t&& chmod +x /usr/sbin/policy-rc.d \t\t&& dpkg-divert --local --rename --add /sbin/initctl \t&& cp -a /usr/sbin/policy-rc.d /sbin/initctl \t&& sed -i 's/^exit.*/exit 0/' /sbin/initctl \t\t&& echo 'force-unsafe-io' > /etc/dpkg/dpkg.cfg.d/docker-apt-speedup \t\t&& echo 'DPkg::Post-Invoke { \"rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true\"; };' > /etc/apt/apt.conf.d/docker-clean \t&& echo 'APT::Update::Post-Invoke { \"rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true\"; };' >> /etc/apt/apt.conf.d/docker-clean \t&& echo 'Dir::Cache::pkgcache \"\"; Dir::Cache::srcpkgcache \"\";' >> /etc/apt/apt.conf.d/docker-clean \t\t&& echo 'Acquire::Languages \"none\";' > /etc/apt/apt.conf.d/docker-no-languages \t\t&& echo 'Acquire::GzipIndexes \"true\"; Acquire::CompressionTypes::Order:: \"gz\";' > /etc/apt/apt.conf.d/docker-gzip-indexes \t\t&& echo 'Apt::AutoRemove::SuggestsImportant \"false\";' > /etc/apt/apt.conf.d/docker-autoremove-suggests"
    },
    {
      "created": "2021-04-23T22:21:36.274883825Z",
      "created_by": "/bin/sh -c [ -z \"$(apt-get indextargets)\" ]",
      "empty_layer": true
    },
    {
      "created": "2021-04-23T22:21:37.334286535Z",
      "created_by": "/bin/sh -c mkdir -p /run/systemd && echo 'docker' > /run/systemd/container"
    },
    {
      "created": "2021-04-23T22:21:37.49442735Z",
      "created_by": "/bin/sh -c #(nop)  CMD [\"/bin/bash\"]",
      "empty_layer": true
    }
  ]
}

OCI runtime bundle(runtime-spec)

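So far we have looked at the structure of an OCI image (image-spec). However, a container cannot be created from an OCI image as it is.
For a low-level container runtime to create a container, the OCI image has to be converted into a format called an OCI runtime bundle.

This time we use a tool called umoci to convert the ubuntu OCI image into an OCI runtime bundle.

$ cd ..
$ umoci unpack --image ubuntu:latest ubuntu-bundle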

$ ls ubuntu-bundle/
config.json  rootfs  sha256_687ea84638c380e1856de9f1d923412efd27fe95d781e3fff07258b7f75cd9b7.mtree  umoci.json

rootfs contains the ubuntu root filesystem.

$ ls ubuntu-bundle/rootfs
bin  boot  dev	etc  home  lib	lib32  lib64  libx32  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

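config.json describes the settings used when the container runs.
You will find container-related terms such as namespaces and cgroup throughout the file.
You can also see settings that make parts of /proc and /sys read-only or hide them, since exposing them could lead to compromise of the host OS.

runc creates and runs the container according to config.json.

$ cat ubuntu-bundle/config.json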
{
	"ociVersion": "1.0.0",
	"process": {
		"terminal": true,
		"user": {
			"uid": 0,
			"gid": 0
		},
		"args": [
			"/bin/bash"
		],
		"env": [
			"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
			"TERM=xterm",
			"HOME=/root"
		],
		"cwd": "/",
		"capabilities": {
			"bounding": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"effective": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"inheritable": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"permitted": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"ambient": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			]
		},
		"rlimits": [
			{
				"type": "RLIMIT_NOFILE",
				"hard": 1024,
				"soft": 1024
			}
		],
		"noNewPrivileges": true
	},
	"root": {
		"path": "rootfs"
	},
	"hostname": "umoci-default",
	"mounts": [
		{
			"destination": "/proc",
			"type": "proc",
			"source": "proc"
		},
		{
			"destination": "/dev",
			"type": "tmpfs",
			"source": "tmpfs",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/pts",
			"type": "devpts",
			"source": "devpts",
			"options": [
				"nosuid",
				"noexec",
				"newinstance",
				"ptmxmode=0666",
				"mode=0620",
				"gid=5"
			]
		},
		{
			"destination": "/dev/shm",
			"type": "tmpfs",
			"source": "shm",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"mode=1777",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/mqueue",
			"type": "mqueue",
			"source": "mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev"
			]
		},
		{
			"destination": "/sys",
			"type": "sysfs",
			"source": "sysfs",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"ro"
			]
		},
		{
			"destination": "/sys/fs/cgroup",
			"type": "cgroup",
			"source": "cgroup",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"relatime",
				"ro"
			]
		}
	],
	"annotations": {
		"org.opencontainers.image.architecture": "amd64",
		"org.opencontainers.image.author": "",
		"org.opencontainers.image.created": "2021-04-23T22:21:37.49442735Z",
		"org.opencontainers.image.exposedPorts": "",
		"org.opencontainers.image.os": "linux",
		"org.opencontainers.image.stopSignal": ""
	},
	"linux": {
		"resources": {
			"devices": [
				{
					"allow": false,
					"access": "rwm"
				}
			]
		},
		"namespaces": [
			{
				"type": "pid"
			},
			{
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			}
		],
		"maskedPaths": [
			"/proc/kcore",
			"/proc/latency_stats",
			"/proc/timer_list",
			"/proc/timer_stats",
			"/proc/sched_debug",
			"/sys/firmware",
			"/proc/scsi"
		],
		"readonlyPaths": [
			"/proc/asound",
			"/proc/bus",
			"/proc/fs",
			"/proc/irq",
			"/proc/sys",
			"/proc/sysrq-trigger"
		]
	}
}
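
To see which isolation settings a bundle requests without reading the whole file, config.json can be decoded into a small struct. This is only a sketch: runtimeSpec and namespaceTypes are hypothetical names, the struct covers just a handful of the fields shown above, and it is not how runc itself loads the spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runtimeSpec covers only the config.json fields inspected here (hypothetical type).
type runtimeSpec struct {
	Process struct {
		Args            []string `json:"args"`
		NoNewPrivileges bool     `json:"noNewPrivileges"`
	} `json:"process"`
	Linux struct {
		Namespaces []struct {
			Type string `json:"type"`
		} `json:"namespaces"`
		MaskedPaths   []string `json:"maskedPaths"`
		ReadonlyPaths []string `json:"readonlyPaths"`
	} `json:"linux"`
}

// namespaceTypes returns the namespace types requested in config.json.
func namespaceTypes(configJSON []byte) ([]string, error) {
	var spec runtimeSpec
	if err := json.Unmarshal(configJSON, &spec); err != nil {
		return nil, err
	}
	types := make([]string, 0, len(spec.Linux.Namespaces))
	for _, ns := range spec.Linux.Namespaces {
		types = append(types, ns.Type)
	}
	return types, nil
}

// sampleConfig is a heavily trimmed version of the config.json above.
const sampleConfig = `{
  "process": {"args": ["/bin/bash"], "noNewPrivileges": true},
  "linux": {
    "namespaces": [{"type": "pid"}, {"type": "network"}, {"type": "ipc"}, {"type": "uts"}, {"type": "mount"}],
    "maskedPaths": ["/proc/kcore"],
    "readonlyPaths": ["/proc/sys"]
  }
}`

func main() {
	types, err := namespaceTypes([]byte(sampleConfig))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("namespaces:", types)
}
```

Pointing the same function at the contents of ubuntu-bundle/config.json should list the five namespaces shown in the file.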

Starting a container with runc

Finally, let's run the OCI runtime bundle we created with runc.

$ sudo runc -h
NAME:
   runc - Open Container Initiative runtime

runc is a command line client for running applications packaged according to
the Open Container Initiative (OCI) format and is a compliant implementation of the
Open Container Initiative specification.

runc integrates well with existing process supervisors to provide a production
container runtime environment for applications. It can be used with your
existing process monitoring tools and the container will be spawned as a
direct child of the process supervisor.

Containers are configured using bundles. A bundle for a container is a directory
that includes a specification file named "config.json" and a root filesystem.
The root filesystem contains the contents of the container.

To start a new instance of a container:

    # runc run [ -b bundle ] <container-id>

Where "<container-id>" is your name for the instance of the container that you
are starting. The name you provide for the container instance must be unique on
your host. Providing the bundle directory using "-b" is optional. The default
value for "bundle" is the current directory.

USAGE:
   runc [global options] command [command options] [arguments...]

VERSION:
   1.0.0-rc93
commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
spec: 1.0.2-dev
go: go1.13.15
libseccomp: 2.5.1

COMMANDS:
   checkpoint  checkpoint a running container
   create      create a container
   delete      delete any resources held by the container often used with detached container
   events      display container events such as OOM notifications, cpu, memory, and IO usage statistics
   exec        execute new process inside the container
   init        initialize the namespaces and launch the process (do not call it outside of runc)
   kill        kill sends the specified signal (default: SIGTERM) to the container's init process
   list        lists containers started by runc with the given root
   pause       pause suspends all processes inside the container
   ps          ps displays the processes running inside a container
   restore     restore a container from a previous checkpoint
   resume      resumes all processes that have been previously paused
   run         create and run a container
   spec        create a new specification file
   start       executes the user defined process in a created container
   state       output the state of a container
   update      update container resource constraints
   help, h     Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug             enable debug output for logging
   --log value         set the log file path where internal debug information is written
   --log-format value  set the format used by logs ('text' (default), or 'json') (default: "text")
   --root value        root directory for storage of container state (this should be located in tmpfs) (default: "/run/runc")
   --criu value        path to the criu binary used for checkpoint and restore (default: "criu")
   --systemd-cgroup    enable systemd cgroup support, expects cgroupsPath to be of form "slice:prefix:name" for e.g. "system.slice:runc:434234"
   --rootless value    ignore cgroup permission errors ('true', 'false', or 'auto') (default: "auto")
   --help, -h          show help
   --version, -v       print the version

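The OCI runtime specification defines creating a container and starting it as two independent operations, and runc implements the corresponding create and start subcommands.
Here we use the run subcommand, which combines the two.
We pass the --bundle option to the run subcommand to specify the runtime bundle.
Running the command below starts the container and gives you its shell.

$ sudo runc run --bundle ubuntu-bundle ubuntu-container
root@umoci-default:/# cat /etc/os-release                 // inside the container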
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"

We have confirmed that runc can start the container.
The list subcommand shows the containers.

$ sudo runc list
ID                 PID         STATUS      BUNDLE                     CREATED                         OWNER
ubuntu-container   17925       running     /home/user/ubuntu-bundle   2021-04-29T08:42:50.57736707Z   root

A running container can be stopped with the kill subcommand.

$ sudo runc kill ubuntu-container KILL
$ sudo runc list
ID          PID         STATUS      BUNDLE      CREATED     OWNER

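This post went from generating an OCI image to running a container with runc.
The OCI specifications are the standard for containers, so understanding them should give you a foothold for other container runtimes. (Probably.)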

2021-04-29

Structure of a Docker Image

Container / Container image

Contents of a Docker image
manifest.json
Config
Layers
Container runtimes

Let's look at what a Docker image contains.
f:id:FallenPigeon:20210425161556p:plain

Contents of a Docker image

Download a Docker image and save it in tar format.

$ docker image pull ubuntu:latest

$ mkdir docker-image

$ cd docker-image

$ docker save ubuntu:latest --output ubuntu.tar

Extracting the archive shows that it consists of JSON files and several directories.

$ tar -xf ubuntu.tar

$ tree
.
├── 26b77e58432b01665d7e876248c9056fa58bf4a7ab82576a024f5cf3dac146d6.json
├── 59e53182e47ea02b7d77268a13d8262c19efb4d3dd1eb6cba3af51116b9d0260
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 75ce788460b10c20b25c56456cb658024953645afc1d838abacae3ec3f94cd7b
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 7c7a59dd75533bba2f4d2c48a08a3e6b5de89ee96f1416981936ff7a9369e8ac
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── manifest.json
├── repositories
└── ubuntu.tar

manifest.json

manifest.json is the top-level file that describes how the Docker image is put together.
Looking at it gives you an overview of the whole image.

$ jq . manifest.json
[
  {
    "Config": "26b77e58432b01665d7e876248c9056fa58bf4a7ab82576a024f5cf3dac146d6.json",
    "RepoTags": [
      "ubuntu:latest"
    ],
    "Layers": [
      "59e53182e47ea02b7d77268a13d8262c19efb4d3dd1eb6cba3af51116b9d0260/layer.tar",
      "75ce788460b10c20b25c56456cb658024953645afc1d838abacae3ec3f94cd7b/layer.tar",
      "7c7a59dd75533bba2f4d2c48a08a3e6b5de89ee96f1416981936ff7a9369e8ac/layer.tar"
    ]
  }
]

Config

"Config" (26b77e58432b01665d7e876248c9056fa58bf4a7ab82576a024f5cf3dac146d6.json) mainly records the environment information from when the image was built.
If you have built an image from a Dockerfile, some of the settings may look familiar.

$ jq . 26b77e58432b01665d7e876248c9056fa58bf4a7ab82576a024f5cf3dac146d6.json
{
  "architecture": "amd64",
  "config": {
    "Hostname": "",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "/bin/bash"
    ],
    "Image": "sha256:0543ca860eea0b5b793ae01f44ffc8d126f2d9dbd5092bb395091d292af8464b",
    "Volumes": null,
    "WorkingDir": "",
    "Entrypoint": null,
...

Layers

"Layers" holds the container's image layers as archives.
Extracting a layer's tar file shows that it contains OS (ubuntu) files.

$ cd 59e53182e47ea02b7d77268a13d8262c19efb4d3dd1eb6cba3af51116b9d0260
$ tar -xf layer.tar
$ ls
VERSION    boot       etc        json       lib        lib64      media      opt        root       sbin       sys        usr
bin        dev        home       layer.tar  lib32      libx32     mnt        proc       run        srv        tmp        var
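
A layer can also be inspected without extracting it onto disk. The sketch below walks a tar stream with Go's standard archive/tar package and collects the entry names. listLayer and sampleLayer are hypothetical helpers, and the in-memory archive is a stand-in so the example runs without a real layer.tar.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// listLayer returns the entry names found in a tar stream such as layer.tar.
func listLayer(r io.Reader) ([]string, error) {
	var names []string
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil // end of archive
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

// sampleLayer builds a tiny in-memory archive standing in for a real layer.
func sampleLayer() (*bytes.Buffer, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("ID=ubuntu\n")
	entries := []tar.Header{
		{Name: "etc/", Typeflag: tar.TypeDir, Mode: 0755},
		{Name: "etc/os-release", Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(body))},
	}
	for i := range entries {
		if err := tw.WriteHeader(&entries[i]); err != nil {
			return nil, err
		}
		if entries[i].Typeflag == tar.TypeReg {
			if _, err := tw.Write(body); err != nil {
				return nil, err
			}
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	layer, err := sampleLayer()
	if err != nil {
		fmt.Println("build error:", err)
		return
	}
	names, err := listLayer(layer)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println(names)
}
```

To run it against a real layer, open layer.tar with os.Open and pass the file to listLayer in place of the sample buffer.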

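A container is run from the runtime information in Config and the filesystem assembled from the image layers.

Container runtimes

Containers are run by software called a container runtime.

Container runtimes are divided into high-level container runtimes and low-level container runtimes.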

f:id:FallenPigeon:20210429105755g:plain
www.publickey1.jp

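The high-level container runtime (containerd) is the runtime that talks directly to Kubernetes and Docker. Its main job is to manage containers as logical units. In development-process terms, its granularity is roughly that of the design level. The low-level container runtime (runc), on the other hand, runs containers by issuing system calls and accessing OS features directly. One difference between the two is how much they depend on the OS. The high-level container runtime (containerd) does not depend on a particular OS, so in theory it works on both Linux and Windows. The low-level container runtime (runc) depends heavily on Linux kernel features and therefore does not run on Windows. On Windows, HCS plays the role that runc plays on Linux. The concept-level specification is the same, but the implementation is completely different; for example, the parts handled by iptables are built on Windows Firewall.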

f:id:FallenPigeon:20210429122407p:plain

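Either way, it is the low-level container runtime that actually runs the container. The Docker image's config and file bundle are therefore handed along like a relay, from the high-level container runtime to the low-level one. The specifications around the low-level container runtime are defined by an organization called the Open Container Initiative, and the container image is defined as the OCI image (image-spec). Docker images are highly compatible with OCI images, so their structure does not differ much.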

github.com

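Incidentally, Docker, containerd, runc and, for that matter, Kubernetes are all written in Go. If you want to do container research or develop a container engine, you will need to know Go. Conversely, if you only use containers (an interface-level understanding), you do not need to care about it. From a security point of view as well, vulnerabilities in the container engine layer tend to be fixed quickly, so they are not that critical. That said, this knowledge may be needed to respond promptly to vulnerability disclosures and incidents.
The Windows container runtime is a black-box implementation, so understanding it requires reverse-engineering skills. In fact, I came across an article in which even someone at NTT gave up tracing the processing with WinDBG, so it seems to be a deep rabbit hole. If a klutz like me took it on, I would certainly end up a wreck.

I plan to write about OCI and runc another time.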

2021-04-26

Cloud One:RASP (Runtime Application Self Protection)

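Cloud security / Serverless

These are my notes from attending a Trend Micro Cloud One seminar.

The topic is RASP (Runtime Application Self Protection), which provides runtime protection for serverless services.

Having attended the seminar

I can't say I see it in sharp outline, but dimly, it came to me. The features of "RASP".

First, the shared responsibility model for Lambda. As you can see, the user's responsibility is centered on the application layer.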

f:id:FallenPigeon:20210426220138p:plain

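This is an overview of RASP. Before the seminar I had assumed it was an antivirus, but the feature list looks like that of a WAF. According to the explanation that followed, the protected layer is mainly the application layer and everything else is out of scope. Things like Apache are apparently left out.

The supported languages appear to be interpreted languages.

Another selling point is easy deployment: you reportedly just download the agent and import it.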

f:id:FallenPigeon:20210426220408p:plain

f:id:FallenPigeon:20210426220515p:plain

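The installation steps were published as follows, so I checked them.

Python

To install the agent, follow these steps.

Add the Cloud One Application Security package trend_app_protect to requirements.txt.
Run pip to install the package: pip install -r requirements.txt
Import the trend_app_protect.start module at the top of your WSGI script: import trend_app_protect.start
The agent key and secret (TREND_AP_KEY and TREND_AP_SECRET) can be configured through environment variables.
Optionally, you can provide the configuration in a file named trend_app_protect.ini in the project root or under /etc, but environment variables still take precedence.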

[trend_app_protect]
key = my-key
secret = my-secret

PHP

1. Download the agent package for your PHP version and platform.

2. Stop Apache as follows.
sudo service apache2 stop

3. Move trend_app_protect-*.so into the PHP extension directory.

$ mv /path/to/trend_app_protect-*.so "$(php -r 'echo ini_get ("extension_dir");')"/trend_app_protect.so

4. Add the following to your php.ini file (or to a new .ini file under /etc/php.d/ if that directory exists).

; Enable the extension
extension = trend_app_protect.so

; Add key and secret from the Application Protection dashboard
trend_app_protect.key = <your key>
trend_app_protect.secret = <your secret>

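That does look easy. It seems to attach itself to the interpreter and hook APIs (?).

The thing that actually executes a script is the script engine, you know. Surprisingly few people are aware of that.

What I really want to know, though, is why RASP exists.

Judging from the slides,

I think application security cannot stay the way it is. That is exactly why I believe we must introduce RASP.

According to Trend Micro, they want you to deploy both a WAF and RASP to achieve defense in depth. There may be cases where only one of the two, HTTP-based or API-based, can detect an attack. They also pointed out that RASP can defend against attacks that do not pass through the WAF, such as lateral movement.

Cloud One consists of several modules besides RASP. The real aim therefore seems to be to cut off the attack flow per service, such as S3 or IaaS, with RASP positioned as one of those modules.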

f:id:FallenPigeon:20210427200448p:plain

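In conclusion, RASP is a module that delivers WAF functionality through an antivirus-like mechanism. When deploying it, be careful not to assume it offers the same functions as an antivirus.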

2021-04-26

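Trends in ISMAP and the Government Common Standards

Cloud security

ISMAP-registered services
Government guidelines
Cloud security guidelines
CSA and ISMAP
Proposed revision of the Government Common Standards

ISMAP-registered services

ISMAP (Information system Security Management and Assessment Program) is a program that evaluates and registers, in advance, cloud services that meet the security requirements set by the government. Its purpose is to ensure the security level of government procurement and to make cloud service adoption more efficient. Cloud services (providers) registered with ISMAP can shorten the bidding process thanks to their cozy rela... er, trust with government agencies. In March of this year, the first wave of the ISMAP registration list was published.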

f:id:FallenPigeon:20210426194014p:plain

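It is surprising that Microsoft Azure is missing, but apparently being in the first wave was not a priority for them.

SaaS and PaaS vendors such as Salesforce Service and Heroku Services are also on the list. Since these services are built on AWS IaaS, this shows that certification can be obtained even when a service runs on a third-party vendor's infrastructure.

The giant vendors will presumably obtain it sooner or later; the ones facing a harder decision are probably the mid-sized vendors. The reason is simply cost-effectiveness. ISMAP requires compliance with roughly 1,000 items, ranging from governance to technical measures. For a vendor that already complies with the international CSA framework, or that does not focus on Japanese government projects, the cost of certification may not pay off. The registration trend going forward is worth watching.

ISMAP is a regulation whose keywords are "government" and "cloud security", so let's also look at the related standards.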

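Government guidelines

First, there are the "Government Common Standards", which serve as the procurement standard for government agencies and consist of controls for information security in general, written with on-premises environments in mind. Because they are the standard applied when delivering a system to the government, the measures must be implemented whether the prime contractor is a system integrator or a cloud service provider.

ISMAP, by contrast, consists of controls aimed only at cloud service providers. As explained above, ISMAP is a program for certifying cloud services, so it does not include measures for cloud users. When a system integrator is the prime contractor, it therefore has to implement the user-side measures regardless of whether the cloud vendor is ISMAP certified.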

f:id:FallenPigeon:20210427195328p:plain

f:id:FallenPigeon:20210426205526p:plain

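Incidentally, ISMAP appears to be based on ISO standards and the Government Common Standards, with the remainder filled in from the US FedRAMP program.

① International standards
  Information security governance (ISO 27014)
  Information security management (ISO 27001, ISO 27002)
  Information security for cloud services (ISO 27017)
② Government Common Standards
  Adds items not covered by ① that are needed to satisfy the Common Standards
  Rewritten for CSPs while preserving their intent
③ US standards
  Adds perspectives not covered by ① and ②
  Adds items centered on FedRAMP content related to incident response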

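Cloud security guidelines

Next, let's look at cloud security guidelines. As the figure shows, they fall into three broad types.

① Measures for the cloud as a whole

② Measures for cloud users only

③ Measures for cloud vendors only

This is the confusing part of cloud security: cloud users mainly use ②, which gathers their own responsibilities, and cloud vendors mainly use ③. The party ordering the system, on the other hand, has little interest in ② or ③ and cares about ①, where security is ensured for the system as a whole. Cloud security guidelines therefore have to be used differently depending on your position. You can imagine how much more complicated it becomes when a third-party cloud vendor is involved, as with Salesforce Service and Heroku Services.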

f:id:FallenPigeon:20210427195345p:plain

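CSA and ISMAP

CSA is one of the well-known bodies that define cloud security guidance. Besides publishing guidelines, CSA also runs a certification program for cloud vendors.

CSA Security Guidance

      General approach to cloud security

CSA CCM (Cloud Controls Matrix) / CAIQ

      Positioned as implementation criteria for cloud service providers
      Includes mappings to other international and industry standards

CSA STAR certification

      Registration program for compliant providers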

f:id:FallenPigeon:20210426205839p:plain

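Let's compare the base guidelines behind ISMAP and CSA CCM, both of which target cloud vendors. (I only googled this casually, so it may be wrong.)

ISMAP covers the ISO family broadly, whereas CSA is based on other, equivalent standards.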

f:id:FallenPigeon:20210426210408p:plain

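As an effort involving ISMAP and CCM, CSA JAPAN is attempting to map the two to each other. Internationally, ISMAP is regarded as a Galapagos (Japan-only) standard, and the aim is said to be to raise its profile by tying it to CSA, which has established itself as an international standard.

Judging from the interim report, however, the results so far do not look promising. Differences in the base guidelines and in categorization may be the cause.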

f:id:FallenPigeon:20210426210827p:plain

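Because CSA and ISMAP rest on different base guidelines, holding both would be unbeatable, but some vendors will surely obtain only one for cost reasons. It may be interesting to check the compliance status when you use a cloud service.

Proposed revision of the Government Common Standards

On April 26, the call for public comments on the revision of the Government Common Standards opened.

The proposed revision lists "① measures for cloud users", "② zero trust (?)" and "③ telework". Item ① in particular may be good news.

Until now, measures from the cloud user's point of view required mapping controls written for on-premises environments onto the cloud; from now on that should no longer be necessary.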

f:id:FallenPigeon:20210428204538p:plain