
Kubernetes troubleshooting: problems caused by the kernel parameter pid_max (maximum number of PIDs/threads)

Creating a pod fails with a "may need to increase max user processes" error.

Overview

After Kubernetes has been running for a while, some pods fail their health checks, and while those pods are being recreated the errors "resource temporarily unavailable" and "may need to increase max user processes" are reported.

Troubleshooting

Running describe against the pod shows the following events:

failed to start container "log-pilot": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:338: starting init process command caused: fork/exec /proc/self/exe: resource temporarily unavailable: unknown

error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "05fb8ec819c5e503836d0f67b765359438fe4203fb69e560f89f33b14145d193" network for pod "log-pilot-nshdg": networkPlugin cni failed to set up pod "log-pilot-nshdg_kube-system" network: netplugin failed: 
runtime: failed to create new OS thread (have 4 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 5 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
......
# or errors like these
Liveness probe failed: /pilot/healthz: line 4: can't fork: Resource temporarily unavailable
Liveness probe failed: OCI runtime exec failed: exec failed: unable to start container process: read init-p: connection reset by peer: unknown
Error: failed to start container "log-pilot": Error response from daemon: OCI runtime create failed: runc create failed: unable to start container process: can't get final child's PID from pipe: EOF: unknown

A bit of research suggests this happens when the maximum number of PIDs/threads has been exhausted; let's verify that.

Log in to host 108 and check the current kernel parameter to see the thread limit:

sysctl kernel.pid_max still shows the default value; the output is: kernel.pid_max = 32768

Use ps to count the current total number of threads:

ps -eLf | wc -l outputs: 29853
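The two checks above can be combined into a small script. This is a minimal sketch for Linux, assuming procps-style ps; the 90% warning threshold is an arbitrary choice, not from the article:

```shell
#!/bin/sh
# Read the kernel-wide PID/thread limit (no root needed).
pid_max=$(cat /proc/sys/kernel/pid_max)
# ps -eLf prints one line per thread plus one header line.
threads=$(($(ps -eLf | wc -l) - 1))
echo "pid_max=${pid_max} threads=${threads}"
# Warn when usage exceeds roughly 90% of the limit (arbitrary threshold).
if [ "$threads" -ge $((pid_max * 9 / 10)) ]; then
  echo "WARNING: thread count is close to kernel.pid_max"
fi
```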

Since the incident happened a while ago, and this count is already close to the limit, most likely some service created a large number of processes or threads and triggered the errors above.
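To identify the culprit service, rank processes by thread count. A sketch assuming procps-style ps, where the NLWP column is the number of threads per process:

```shell
# List the ten processes with the most threads (plus the header line).
# NLWP = number of lightweight processes (threads) per process.
top_by_threads=$(ps -eo nlwp,pid,comm --sort=-nlwp | head -n 11)
echo "$top_by_threads"
```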

Adjust the parameter:

Change it at runtime: sysctl -w kernel.pid_max=4194303

Or persist it by appending kernel.pid_max = 4194303 to the end of /etc/sysctl.conf.

After running for a while the problem persisted; cat /sys/fs/cgroup/pids/kubepods.slice/pids.max showed that the pids limit of the kubepods cgroup was still the old value.
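To see which limit is actually in effect, compare the kernel parameter with the pids limit on the kubepods cgroup. A sketch; the first path is the cgroup v1 layout from the article, the second is an assumption for cgroup v2 hosts:

```shell
# Kernel-wide limit.
kernel_pid_max=$(cat /proc/sys/kernel/pid_max)
echo "kernel.pid_max: ${kernel_pid_max}"
# Pids limit on the kubepods cgroup: cgroup v1 path (as in the article),
# then the cgroup v2 path (an assumption); skip whichever is absent.
for f in /sys/fs/cgroup/pids/kubepods.slice/pids.max \
         /sys/fs/cgroup/kubepods.slice/pids.max; do
  if [ -r "$f" ]; then
    echo "$f: $(cat "$f")"
  fi
done
```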

Further research shows that kubelet must be restarted after the change for it to take effect (see the related GitHub issue).

Conclusion

After the change, restart the affected pods; the problem is resolved. We will keep observing to confirm the issue is truly fixed.