"may need to increase max user processes" error when creating a pod
Summary
After Kubernetes has been running for a while, some pods fail their health checks and, while being recreated, report resource temporarily unavailable and may need to increase max user processes errors.
Troubleshooting
Run kubectl describe against the pod to check its events; the following errors show up:
failed to start container "log-pilot": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:338: starting init process command caused: fork/exec /proc/self/exe: resource temporarily unavailable: unknown
error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "05fb8ec819c5e503836d0f67b765359438fe4203fb69e560f89f33b14145d193" network for pod "log-pilot-nshdg": networkPlugin cni failed to set up pod "log-pilot-nshdg_kube-system" network: netplugin failed:
runtime: failed to create new OS thread (have 4 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 5 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
......
# or errors like these
Liveness probe failed: /pilot/healthz: line 4: can't fork: Resource temporarily unavailable
Liveness probe failed: OCI runtime exec failed: exec failed: unable to start container process: read init-p: connection reset by peer: unknown
Error: failed to start container "log-pilot": Error response from daemon: OCI runtime create failed: runc create failed: unable to start container process: can't get final child's PID from pipe: EOF: unknown
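For reference, a minimal sketch of how these events can be pulled; the pod name and namespace are taken from the log above and will differ in other environments:
kubectl describe pod log-pilot-nshdg -n kube-system
# or list recent events in the namespace, sorted by time
kubectl get events -n kube-system --sort-by=.lastTimestamp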
Some searching suggests this happens when the maximum number of processes/threads has been exhausted, so let's verify.
Log in to node 108 and check the kernel parameter that caps the total number of process/thread IDs:
sysctl kernel.pid_max
It is still at the default value here; the output is: kernel.pid_max = 32768
Use ps to count the current total number of threads:
ps -eLf | wc -l
The output is: 29853
Since the issue started a while ago and this count is already very close to the limit, some service was most likely creating a large number of processes/threads, which triggered the errors above.
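To track down which service is responsible, one quick approach is to group the thread count by PID; the commands below are only a sketch, and <PID> is a placeholder:
# count threads per PID (column 2 of ps -eLf) and list the biggest consumers
ps -eLf | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn | head
# map a suspicious PID back to its thread count and command line
ps -p <PID> -o pid,nlwp,cmd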
Adjust the parameter.
To change it immediately: sysctl -w kernel.pid_max=4194303
Or edit /etc/sysctl.conf and append kernel.pid_max = 4194303 to the end of the file so the change persists.
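A minimal sketch combining both approaches (run as root; the value is the same one used above):
# takes effect immediately, but is lost on reboot
sysctl -w kernel.pid_max=4194303
# make it persistent, then reload the sysctl settings
echo 'kernel.pid_max = 4194303' >> /etc/sysctl.conf
sysctl -p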
After running for a while the problem was still there; cat /sys/fs/cgroup/pids/kubepods.slice/pids.max showed that the limit was still the old value.
Further reading showed that kubelet has to be restarted after the change for it to take effect (see the related GitHub issue).
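A sketch of the verification and restart steps, assuming kubelet runs as a systemd service (the cgroup path is the one from above and can differ with other cgroup drivers or versions):
# pids limit applied to the pods cgroup; still the old value before the restart
cat /sys/fs/cgroup/pids/kubepods.slice/pids.max
# restart kubelet so it picks up the new kernel.pid_max (assumes systemd-managed kubelet)
systemctl restart kubelet
# check the limit again after the restart
cat /sys/fs/cgroup/pids/kubepods.slice/pids.max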
Conclusion
After the change, restart the problematic pods and the errors go away. We'll keep watching to confirm the issue is really fixed.
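For completeness, a hedged example of recreating an affected pod; the name and namespace come from the events above, and this assumes the pod is managed by a controller that will recreate it:
kubectl delete pod log-pilot-nshdg -n kube-system
# the controller recreates the pod; check that it comes back up
kubectl get pods -n kube-system | grep log-pilot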