
Container lifecycle

2022-06-24 05:55:00 【mariolu】

One. What is a container

Strictly speaking, there is no such thing as a container. What we call a container is built from two Linux primitives:

Namespaces
Control groups (cgroups)

Before looking at what a container is, it is important to understand how new processes are created and managed in Linux.

In the diagram above, the parent process can be thought of as an active shell session, and the child process as any command run in that shell, for example ls or pwd. Whenever a new command is run, a new process is created. The parent process does this by calling the fork function. fork creates a new, independent process and returns the child's process ID (PID) to the parent process that called it. From that point on, both parent and child can carry on with their own work and terminate. The child's PID matters because it is how the parent keeps track of the newly created process.
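
The fork behavior described above can be sketched in Python with the standard library's os.fork wrapper. The helper name fork_once and the pipe used to send the child's own PID back to the parent are illustrative choices, not part of the original walkthrough.

```python
import os


def fork_once():
    """Fork one child; return (pid fork() gave the parent, pid the child saw)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: fork() returned 0. Report our own PID through the pipe.
        os.close(read_fd)
        os.write(write_fd, str(os.getpid()).encode())
        os.close(write_fd)
        os._exit(0)  # leave immediately, skipping the parent's code path
    # Parent: fork() returned the child's PID.
    os.close(write_fd)
    child_view = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)  # collect the child's exit status
    return pid, child_view


if __name__ == "__main__":
    from_fork, from_child = fork_once()
    print("fork() returned", from_fork, "and the child reported", from_child)
```

The two numbers match: the value fork hands to the parent is exactly the PID the child sees for itself.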

Two. Namespaces

Next, let's look at which namespaces Linux provides.

A namespace is an isolation primitive that lets us isolate various types of resources. On the system used here, this can be done for seven different types of resources (kernels from 5.6 onward add an eighth, the time namespace). They are, in no particular order:

Network namespace
Mount namespace
UTS or hostname namespace
Process ID (PID) namespace
Inter-process communication (IPC) namespace
cgroup namespace
User namespace

By default, one instance of each of these namespaces already exists on the system.

All information about a process is exposed through procfs, which is usually mounted at /proc. Running echo $$ prints the PID of the current shell process:

$ echo $$
448884

Look in /proc/<PID>/ns and you will see the list of namespaces the process belongs to. For example:

$ ls /proc/448884/ns -lh
total 0
lrwxrwxrwx 1 root root 0 Feb 23 19:00 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Feb 23 19:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Feb 23 19:00 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Feb 23 19:00 net -> 'net:[4026532008]'
lrwxrwxrwx 1 root root 0 Feb 23 19:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Feb 23 19:00 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Feb 23 19:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Feb 23 19:00 uts -> 'uts:[4026531838]'

Each namespace has a file here, which is a symbolic link pointing to the namespace ID. For the network namespace in the example above, the namespace ID is net:[4026532008], where 4026532008 is the inode number. Two processes in the same namespace see the same number.
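
As a minimal sketch of how these links can be read from a program, the Python helpers below split a link target into its type and inode number and compare two processes. The function names are hypothetical, and reading the links of another user's process may require extra privileges.

```python
import os
import re


def parse_ns_link(target):
    """Split a link target such as 'net:[4026532008]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError("unexpected namespace link: " + repr(target))
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {file name: (type, inode)} for the entries in /proc/<pid>/ns."""
    ns_dir = "/proc/{}/ns".format(pid)
    return {name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
            for name in sorted(os.listdir(ns_dir))}


def same_namespace(pid_a, pid_b, name="net"):
    """True when both processes share the given namespace (same inode)."""
    return namespaces_of(pid_a)[name] == namespaces_of(pid_b)[name]


if __name__ == "__main__":
    # Link target copied from the listing above.
    print(parse_ns_link("net:[4026532008]"))  # ('net', 4026532008)
```

On a Linux system, namespaces_of() with no argument lists the namespaces of the calling process, and same_namespace(pid_a, pid_b) applies the "same inode means same namespace" rule.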

On Linux, a new namespace can be created with the unshare system call, which is also available as a command-line tool of the same name. To create a new network namespace with the tool, pass the -n flag. So, in a shell session with root privileges, we run:

# unshare -n

We can look at the /proc/<PID>/ns directory to verify that a new namespace was actually created:

# ls -l /proc/$$/ns/net
lrwxrwxrwx 1 root root 0 Feb 23 18:46 /proc/447612/ns/net -> 'net:[4026533490]'
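
The unshare command is a wrapper around the unshare(2) system call, so the same step can be sketched from a program. The sketch below assumes Python 3.12 or newer, where os.unshare and os.CLONE_NEWNET are available, and enough privileges to create a network namespace. The helper name try_new_netns and its outcome strings are illustrative. The call is made in a forked child so that the calling process keeps its own namespace.

```python
import os


def try_new_netns():
    """Try to enter a new network namespace in a child; report the outcome."""
    if not hasattr(os, "unshare"):
        return "unsupported"  # Python older than 3.12, or not Linux
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        outcome = "failed"  # used when the kernel refuses (e.g. not root)
        try:
            before = os.readlink("/proc/self/ns/net")
            os.unshare(os.CLONE_NEWNET)
            after = os.readlink("/proc/self/ns/net")
            outcome = "changed" if after != before else "unchanged"
        except OSError:
            pass
        finally:
            os.write(write_fd, outcome.encode())
            os._exit(0)
    os.close(write_fd)
    outcome = os.read(read_fd, 32).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    return outcome


if __name__ == "__main__":
    print(try_new_netns())
```

Run as root, the child should report that its net link changed, which mirrors the different namespace ID seen in the listing above.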

The namespace ID differs from that of the host network namespace we saw above. Running ip link now shows only the loopback interface:

# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Any other network interfaces, such as a Wi-Fi card or an Ethernet port, do not show up at all. In fact, even ping 127.0.0.1, which is normally taken for granted, does not work:

# ping 127.0.0.1
ping: connect: Network is unreachable

Why does this happen?

Creating the new network namespace isolated it from the network resources that exist in the default namespace. The only interface available in the new namespace is the loopback interface, and it has not been assigned an IP address yet:

# ip address
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

The output shows that the interface has no IP address and that its state is DOWN. Both can be fixed by running the following commands:

# ip address add dev lo local 127.0.0.1/8
# ip link set lo up

The first command assigns the IP address 127.0.0.1 to the interface, and the second sets its state to UP so that it can receive incoming packets. ping now works as expected:

# ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.060 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.071 ms

To explore isolation further, we will now try to get this new network namespace (call it CHILD) to talk to the host network namespace, and vice versa.

To make things easier to follow, set the shell's PS1 variable to something easy to recognize:

# export PS1="[netns: CHILD]# "
[netns: CHILD]#

Also open a new terminal with root access, whose shell belongs to the host network namespace. Set PS1 there as well so that the host namespace is easy to identify:

# export PS1="[netns: HOST]# "
[netns: HOST]#

Running ip link in this terminal shows the network interfaces currently present on the system. For example:

[netns: HOST]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 0e:94:18:de:da:b3 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:ad:0f:83:cc brd ff:ff:ff:ff:ff:ff
11: wlp61s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DORMANT group default qlen 1000
    link/ether fa:3d:a9:90:95:5d brd ff:ff:ff:ff:ff:ff

To list all network namespaces on the system, we can run:

[netns: HOST]# ip netns list

This produces empty output. Does that mean the command does not work, or that we did something wrong, even though we created a new network namespace earlier? The answer to both questions is no. In UNIX everything is a file, and the ip command looks for network namespaces in the directory /var/run/netns. That directory is currently empty. So we first create an empty file there and then run the command again:

[netns: HOST]# touch /var/run/netns/child
[netns: HOST]# ip netns list
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
child

The child namespace does show up in the list, but so does an error. This is because nothing yet maps the namespace of the shell running in the new namespace to this file. To fix that, we bind-mount the /proc/<PID>/ns/net file onto the file created above. This is done by running the following commands in the shell that lives in the new namespace:

[netns: CHILD]# mount -o bind /proc/$$/ns/net /var/run/netns/child
[netns: CHILD]# ip netns list
child

This time the command that lists network namespaces works without errors. The namespace with ID 4026533490 is now associated with the file /var/run/netns/child, and the namespace is now persistent.

Now we need a way for the host and child network namespaces to talk to each other. For this, we create a pair of virtual Ethernet devices in the host network namespace:

[netns: HOST]# ip link add veth0 type veth peer name veth1

This command creates a new virtual Ethernet device called veth0. The other end of the pair is called veth1.

[netns: HOST]# ip link | grep veth
35: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
36: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

For now, both devices exist in the host namespace. Running ip link in the child network namespace still shows only the loopback interface, as before:

[netns: CHILD]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

So how do we make one of the veth devices appear in the child namespace? We run the following command in the host network namespace, because that is where the veth devices currently live:

[netns: HOST]# ip link set veth1 netns child

This assigns the veth1 network device to the child namespace. Running ip link in the host namespace no longer shows the veth1 device:

[netns: HOST]# ip link | grep veth
36: veth0@if35: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

veth1, on the other hand, now appears in the child network namespace:

[netns: CHILD]# ip link | grep veth
35: veth1@if36: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

Two more steps are needed before the two ends can communicate: each veth device has to be assigned an IP address and have its state set to up:

[netns: HOST]# ip address add dev veth0 local 10.16.8.1/24
[netns: HOST]# ip link set veth0 up

The result can be verified with the following command:

[netns: HOST]# ip address | grep veth -A 5
36: veth0@if35: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
    link/ether 32:c7:79:c7:e2:e0 brd ff:ff:ff:ff:ff:ff link-netns child
    inet 10.16.8.1/24 scope global veth0
       valid_lft forever preferred_lft forever

The same is done in the child namespace:

[netns: CHILD]# ip address add dev veth1 local 10.16.8.2/24
[netns: CHILD]# ip link set veth1 up

[netns: CHILD]# ip address | grep veth -A 5
35: veth1@if36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 5a:62:dd:40:a6:f1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.16.8.2/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5862:ddff:fe40:a6f1/64 scope link
       valid_lft forever preferred_lft forever

Finally, the two sides should be able to ping each other:

[netns: HOST]# ping 10.16.8.2
PING 10.16.8.2 (10.16.8.2) 56(84) bytes of data.
64 bytes from 10.16.8.2: icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from 10.16.8.2: icmp_seq=2 ttl=64 time=0.099 ms
64 bytes from 10.16.8.2: icmp_seq=3 ttl=64 time=0.100 ms

[netns: CHILD]# ping 10.16.8.1
PING 10.16.8.1 (10.16.8.1) 56(84) bytes of data.
64 bytes from 10.16.8.1: icmp_seq=1 ttl=64 time=0.057 ms
64 bytes from 10.16.8.1: icmp_seq=2 ttl=64 time=0.090 ms
64 bytes from 10.16.8.1: icmp_seq=3 ttl=64 time=0.118 ms
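
The pings work without any manually added routes because both addresses were assigned with a /24 prefix, which places them on the same directly connected network. A small Python check with the standard ipaddress module makes this explicit; the function name on_same_network is just for illustration.

```python
import ipaddress

# Addresses assigned in the walkthrough above.
host = ipaddress.ip_interface("10.16.8.1/24")   # veth0 in the host namespace
child = ipaddress.ip_interface("10.16.8.2/24")  # veth1 in the child namespace


def on_same_network(a, b):
    """True when both interface addresses belong to the same IP network."""
    return a.network == b.network


if __name__ == "__main__":
    print(host.network)                  # 10.16.8.0/24
    print(on_same_network(host, child))  # True
```

Had the child been given an address in a different network, for example 10.16.9.2/24, the check would return False and an explicit route would be needed.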

Three. Control groups (cgroups)

Next come cgroups. They control the amount of a resource that a process can consume; CPU and memory are the best examples. The typical use case is to prevent a process from accidentally using up all available CPU or memory and leaving the rest of the system unable to do anything else. cgroups live under the /sys/fs/cgroup directory. Let's take a look at its contents:

# ls /sys/fs/cgroup/ -lh
total 0
dr-xr-xr-x 5 root root  0 Feb 17 01:05 blkio
lrwxrwxrwx 1 root root 11 Feb 17 01:05 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Feb 17 01:05 cpuacct -> cpu,cpuacct
dr-xr-xr-x 5 root root  0 Feb 17 01:05 cpu,cpuacct
dr-xr-xr-x 2 root root  0 Feb 17 01:05 cpuset
dr-xr-xr-x 5 root root  0 Feb 17 01:05 devices
dr-xr-xr-x 2 root root  0 Feb 17 01:05 freezer
dr-xr-xr-x 2 root root  0 Feb 17 01:05 hugetlb
dr-xr-xr-x 9 root root  0 Feb 20 00:24 memory
lrwxrwxrwx 1 root root 16 Feb 17 01:05 net_cls -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Feb 17 01:05 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Feb 17 01:05 net_prio -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Feb 17 01:05 perf_event
dr-xr-xr-x 5 root root  0 Feb 17 01:05 pids
dr-xr-xr-x 2 root root  0 Feb 17 01:05 rdma
dr-xr-xr-x 5 root root  0 Feb 17 01:05 systemd
dr-xr-xr-x 5 root root  0 Feb 17 01:06 unified

Each directory is a resource that can be controlled. To create a new cgroup, we create a new directory under one of these resources. For example, to create a new cgroup that controls memory usage, we create a new directory (the name is up to us) under /sys/fs/cgroup/memory. Let's do that:

# mkdir /sys/fs/cgroup/memory/demo

# ls -lh /sys/fs/cgroup/memory/demo/
total 0
-rw-r--r-- 1 root root 0 Feb 24 12:29 cgroup.clone_children
--w--w--w- 1 root root 0 Feb 24 12:29 cgroup.event_control
-rw-r--r-- 1 root root 0 Feb 24 12:29 cgroup.procs
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.failcnt
--w------- 1 root root 0 Feb 24 12:29 memory.force_empty
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Feb 24 12:29 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.memsw.failcnt
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.memsw.limit_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.memsw.max_usage_in_bytes
-r--r--r-- 1 root root 0 Feb 24 12:29 memory.memsw.usage_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Feb 24 12:29 memory.numa_stat
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.oom_control
---------- 1 root root 0 Feb 24 12:29 memory.pressure_level
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Feb 24 12:29 memory.stat
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.swappiness
-r--r--r-- 1 root root 0 Feb 24 12:29 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Feb 24 12:29 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Feb 24 12:29 notify_on_release
-rw-r--r-- 1 root root 0 Feb 24 12:29 tasks

The operating system populates each new directory with a set of files. Let's look at one of them:

# cat /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
9223372036854771712

The value in this file is the maximum amount of memory a process can use if it is part of this cgroup. Let's set it to something much smaller, for example 4 MB, expressed in bytes:

# echo 4000000 > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

Let's read the file back:

# cat /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
3997696

This is not exactly what we wrote to the file, but it is about 3.99 MB. My guess is that this is related to memory alignment managed by the operating system.
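
The alignment guess can be checked with a little arithmetic. Assuming a page size of 4096 bytes (an assumption; the real value can be read with os.sysconf("SC_PAGE_SIZE")), rounding the written value down to a whole number of pages reproduces both numbers seen in this file. This is consistent with the guess but is only a sketch, not a statement of the kernel's exact logic.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def round_down_to_page(value, page_size=PAGE_SIZE):
    """Round a byte count down to a whole number of pages."""
    return (value // page_size) * page_size


if __name__ == "__main__":
    print(round_down_to_page(4000000))    # 3997696
    print(round_down_to_page(2**63 - 1))  # 9223372036854771712
```

4000000 bytes is 976.5625 pages, and 976 × 4096 = 3997696. The initial "unlimited" value is the largest signed 64-bit number rounded down in the same way.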

Now start a new process in a new hostname (UTS) namespace:

# unshare -u

This starts a new shell process. Let's try running a command, wget, which I know needs more than 4 MB of memory to run:

# wget wikipedia.org
URL transformed to HTTPS due to an HSTS policy
--2020-02-24 12:36:58--  https://wikipedia.org/

Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving wikipedia.org (wikipedia.org)... 103.102.166.224, 2001:df2:e500:ed1a::1
Connecting to wikipedia.org (wikipedia.org)|103.102.166.224|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.wikipedia.org/ [following]
--2020-02-24 12:36:58--  https://www.wikipedia.org/
Resolving www.wikipedia.org (www.wikipedia.org)... 103.102.166.224, 2001:df2:e500:ed1a::1
Connecting to www.wikipedia.org (www.wikipedia.org)|103.102.166.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76776 (75K) [text/html]
Saving to: ‘index.html’

index.html                   100%[============================================>]  74.98K   362KB/s    in 0.2s

2020-02-24 12:36:59 (362 KB/s) - ‘index.html’ saved [76776/76776]

The command succeeded. This is because the process still belongs to the default cgroup. To make it part of the new cgroup, its PID has to be written to the cgroup.procs file:

# echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs

Let's look at the contents of this file:

# cat /sys/fs/cgroup/memory/demo/cgroup.procs
468401
468464

There are two entries here. The first is the PID of the shell process that we wrote to the file. The other is the PID of the cat process we just ran. This is because, by default, all child processes belong to the same cgroup as their parent. Once a process terminates, its PID is automatically removed from the file. If we run the same command again, we still see two entries, but the second one is different:

# cat /sys/fs/cgroup/memory/demo/cgroup.procs
468401
468464
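
Membership can also be checked from the process side: /proc/<PID>/cgroup lists one line per hierarchy in the form hierarchy-ID:controller-list:cgroup-path. The Python sketch below parses that format. The function name and the sample text are made up for illustration and only modeled on the demo cgroup used here.

```python
def parse_proc_cgroup(text):
    """Map each controller name to the cgroup path the process is in."""
    membership = {}
    for line in text.splitlines():
        if not line:
            continue
        # Split only twice: the path itself may contain ':' characters.
        hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


# Hypothetical contents of /proc/<PID>/cgroup for the shell above.
SAMPLE = "9:memory:/demo\n4:cpu,cpuacct:/\n0::/user.slice\n"

if __name__ == "__main__":
    print(parse_proc_cgroup(SAMPLE)["memory"])  # /demo
```

On a live system, passing the contents of /proc/self/cgroup to parse_proc_cgroup shows which cgroup the calling process belongs to for each controller. The entry with an empty controller name corresponds to the unified (cgroup v2) hierarchy.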

Now try running the wget command again:

# wget wikipedia.org
URL transformed to HTTPS due to an HSTS policy
--2020-02-24 12:44:26--  https://wikipedia.org/
Killed

The process is killed immediately because it tries to use more memory than the current cgroup allows.
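
A program that manages cgroups performs the same two writes as the echo commands above. The sketch below is a hedged illustration with hypothetical helper names. It takes the cgroup directory as a parameter so that it can be dry-run against a scratch directory; on the system above the directory would be /sys/fs/cgroup/memory/demo and the writes would need root.

```python
import os
import tempfile


def set_memory_limit(cgroup_dir, limit_bytes):
    """Write the memory limit, like: echo N > memory.limit_in_bytes"""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))


def add_process(cgroup_dir, pid):
    """Move a process into the cgroup, like: echo PID > cgroup.procs"""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


def read_value(cgroup_dir, name):
    """Read one control file back as a stripped string."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return f.read().strip()


if __name__ == "__main__":
    # Dry run against a scratch directory, so no root access is needed.
    with tempfile.TemporaryDirectory() as scratch:
        set_memory_limit(scratch, 4000000)
        add_process(scratch, os.getpid())
        print(read_value(scratch, "memory.limit_in_bytes"))
```

In the scratch directory the value reads back exactly as written. On a real cgroup the kernel interprets the write, which is why the limit read back earlier was rounded.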

Four. Summary

So namespaces and cgroups, which isolate resources and control their use, together form what is commonly known as a container. Two further mechanisms are usually involved as well:

Capabilities: They limit the use of root privileges. Sometimes a process needs elevated privileges to do one particular thing, but running it as root is a security risk, because the process could then do almost anything on the system. To limit this, capabilities provide a way to grant specific privileges without giving the process system-wide root privileges. For example, a program that needs to manage network interfaces and related operations can be granted the CAP_NET_ADMIN capability.
Seccomp: It limits the use of system calls. To tighten security further, it can be used to block system calls that might cause additional damage. For example, blocking the kill system call prevents a process from terminating, or sending signals to, other processes.
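
To see capabilities in practice, a process can inspect its own effective capability set, which Linux exposes as a hexadecimal bit mask on the CapEff line of /proc/<PID>/status. The sketch below tests the bit for CAP_NET_ADMIN, which is capability number 12 in linux/capability.h. The helper names are hypothetical.

```python
CAP_NET_ADMIN = 12  # bit number from <linux/capability.h>


def effective_caps_hex(status_text):
    """Extract the hex mask from the CapEff line of a status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def has_capability(cap_mask_hex, cap_number):
    """True when the given capability bit is set in the hex mask."""
    return bool((int(cap_mask_hex, 16) >> cap_number) & 1)


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as status_file:
            mask = effective_caps_hex(status_file.read())
        print("CAP_NET_ADMIN effective:", has_capability(mask, CAP_NET_ADMIN))
    except (OSError, ValueError):
        print("could not read capabilities on this system")
```

An unprivileged process typically has an all-zero mask, while a root shell usually has every bit set, including bit 12.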

So while namespaces let us isolate types of resources, cgroups let us control the amount of resources a process uses. Capabilities limit the use of root privileges by breaking them down into distinct kinds of privilege. Finally, seccomp helps prevent processes from making unwanted system calls. Together these concepts form a container, which is a better abstraction than having to worry about all of them at once.

Five. And finally

The fork diagram at the start of this article is somewhat incomplete. Here is a more complete one:

As mentioned earlier, fork returns the child's PID to the parent process, which uses that PID to wait for the child to finish executing. This is done with the waitpid system call. Doing so is important to avoid zombie processes and is known as reaping. Once the child terminates, it is the parent's responsibility to make sure that all resources allocated to the child are cleaned up. In short, this is the job of a container runtime or container engine: it spawns new containers, or child processes, and makes sure resources are cleaned up after the container terminates.
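
Reaping can be sketched in a few lines of Python using os.fork and os.waitpid from the standard library. The helper name run_and_reap is illustrative. The second waitpid call is there only to show that, once reaped, the child is no longer known to the kernel as a child of this process.

```python
import os


def run_and_reap(exit_code):
    """Fork a child that exits with exit_code, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: terminate at once with the given code
    # Parent: block until the child exits and collect its status.
    reaped_pid, status = os.waitpid(pid, 0)
    try:
        os.waitpid(pid, 0)  # nothing is left to wait for
        already_gone = False
    except ChildProcessError:
        already_gone = True
    return reaped_pid == pid, os.WEXITSTATUS(status), already_gone


if __name__ == "__main__":
    print(run_and_reap(7))  # (True, 7, True)
```

Between the child's exit and the first waitpid call the child is a zombie: it has stopped running, but its entry stays in the process table until the parent collects the exit status.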

Copyright notice
This article was written by [mariolu]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2021/07/20210731020347196I.html