Kubernetes Basics: The Traffic Path from a Pod to a Service ClusterIP under IPVS

2023-05-07 · about 2969 characters · estimated reading time: 6 minutes

Based on the release-1.22 branch of Kubernetes.

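Preface

After studying kube-proxy we know that it creates a large number of iptables and ipvs rules, but not yet how these rules work together or which path the forwarded traffic actually takes. Let us go through the iptables rules in detail and work out how traffic forwarding is really implemented.

This article assumes the cluster already runs kube-proxy in ipvs mode.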

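Before we start

ipvs hooks

Before starting, let us look at where the main hook functions that trigger ipvs processing are located.

Remember the dummy interface kube-ipvs0 mentioned in the kube-proxy article? Let us review it.

We can inspect it with the ip a command: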

...
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether 86:2b:90:0f:dd:4c brd ff:ff:ff:ff:ff:ff
    inet 192.168.242.177/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 192.168.23.138/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 192.168.120.111/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
...

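What it is for:

The ipvs netfilter hook that performs DNAT is attached to the INPUT chain. Binding every ClusterIP to the dummy interface makes the kernel treat that IP as a local address, so a packet sent to a ClusterIP enters the INPUT chain, where the hook function ip_vs_in() handles it and forwards it to the POSTROUTING chain.
An IPVS virtual server corresponds to a Service, and its real servers correspond to the Service's endpoints.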
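To make the role of kube-ipvs0 concrete, here is a minimal sketch of the routing decision, under the simplifying assumption that "local" just means "bound to some interface on the host". The function name chainFor is made up for this illustration; the addresses come from the ip a output above.

```go
package main

import (
	"fmt"
	"net"
)

// chainFor is a toy model of the routing decision taken after PREROUTING:
// a packet whose destination is one of the host's own addresses is delivered
// locally (INPUT); anything else is forwarded (FORWARD).
func chainFor(dst string, localAddrs []string) string {
	ip := net.ParseIP(dst)
	if ip == nil {
		return "INVALID"
	}
	for _, a := range localAddrs {
		if local := net.ParseIP(a); local != nil && local.Equal(ip) {
			return "INPUT"
		}
	}
	return "FORWARD"
}

func main() {
	// Addresses bound to kube-ipvs0 in the `ip a` output above.
	kubeIPVS0 := []string{"192.168.242.177", "192.168.23.138", "192.168.120.111"}

	fmt.Println(chainFor("192.168.242.177", kubeIPVS0)) // INPUT
	fmt.Println(chainFor("192.168.242.177", nil))       // FORWARD
}
```

Without the address on the dummy interface, the same packet would take the FORWARD path and never reach the ipvs hook on INPUT.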

ipvsadm

Before starting, we need to install ipvsadm so that the ipvs rules are easy to inspect.

yum install ipvsadm -y

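The normal case

Let us look at the iptables rules first; they can be listed with iptables -t nat -nL.

Here is the path a packet takes:

Request --> PREROUTING --> KUBE-SERVICES. In KUBE-SERVICES the packet is matched against the ipset KUBE-CLUSTER-IP and, on a match, sent to the KUBE-MARK-MASQ chain for processing. In the listing below this rule also carries a source match, !10.80.0.0/16, so only packets whose source address lies outside that range jump to KUBE-MARK-MASQ; the walkthrough that follows traces a packet that does get marked.

Note

A word on match-set KUBE-CLUSTER-IP dst,dst: dst,dst means destination IP and destination port. The rule checks whether the destination IP and destination port of the request are present in the ipset KUBE-CLUSTER-IP.

The members of this ipset can be listed with ipset list KUBE-CLUSTER-IP.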

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-LOAD-BALANCER  all  --  0.0.0.0/0            0.0.0.0/0            /* Kubernetes service lb portal */ match-set KUBE-LOAD-BALANCER dst,dst
KUBE-MARK-MASQ  all  -- !10.80.0.0/16         0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER dst,dst

# ipset list KUBE-CLUSTER-IP|head -n 20
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 35000
References: 2
Number of entries: 774
Members:
192.168.144.37,tcp:80
192.168.195.39,tcp:80
192.168.58.204,tcp:80
192.168.219.108,tcp:80
192.168.6.176,tcp:80
192.168.188.233,tcp:80
192.168.87.54,tcp:80
192.168.167.123,tcp:80
192.168.221.73,tcp:80
192.168.38.179,tcp:80
192.168.98.174,tcp:80
192.168.75.93,tcp:80
......
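The dst,dst lookup can be pictured as a plain set membership test. The following is only a rough sketch: the types entry and ipPortSet are invented for this illustration and have nothing to do with the kernel's ipset implementation; the two members are taken from the listing above.

```go
package main

import "fmt"

// entry is one member of a hash:ip,port set, e.g. "192.168.144.37,tcp:80".
type entry struct {
	IP    string
	Proto string
	Port  uint16
}

// ipPortSet is a toy stand-in for an ipset of type hash:ip,port.
type ipPortSet map[entry]struct{}

func (s ipPortSet) add(ip, proto string, port uint16) {
	s[entry{ip, proto, port}] = struct{}{}
}

// matchDstDst mimics "match-set <set> dst,dst": the first dst selects the
// packet's destination IP, the second its destination port (with protocol).
func (s ipPortSet) matchDstDst(dstIP, proto string, dstPort uint16) bool {
	_, ok := s[entry{dstIP, proto, dstPort}]
	return ok
}

func main() {
	clusterIPs := ipPortSet{}
	clusterIPs.add("192.168.144.37", "tcp", 80)
	clusterIPs.add("192.168.195.39", "tcp", 80)

	fmt.Println(clusterIPs.matchDstDst("192.168.144.37", "tcp", 80))   // true
	fmt.Println(clusterIPs.matchDstDst("192.168.144.37", "tcp", 8080)) // false
}
```

Both the IP and the port have to match: a request to a ClusterIP on a port the Service does not expose is not in the set.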

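Once the packet reaches the KUBE-MARK-MASQ chain it is tagged with the mark 0x4000/0x4000. After marking, the routing decision finds that the destination IP is a local address, so the packet enters the INPUT chain.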

Chain KUBE-MARK-MASQ (3 references)
target     prot opt source               destination         
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000
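The mark is just one bit in a 32-bit value. As a small sketch (the helper names are made up for this illustration), this is what "MARK or 0x4000", "MARK xor 0x4000" and "mark match value/mask" amount to; the xor and match forms appear in the KUBE-POSTROUTING listing further down.

```go
package main

import "fmt"

// masqBit is the bit set by KUBE-MARK-MASQ.
const masqBit uint32 = 0x4000

// markOr mirrors "MARK or 0x4000": set the given bits, leave the others alone.
func markOr(mark, bits uint32) uint32 { return mark | bits }

// markXor mirrors "MARK xor 0x4000": toggle the given bits.
func markXor(mark, bits uint32) uint32 { return mark ^ bits }

// markMatch mirrors "mark match value/mask": compare only the masked bits.
func markMatch(mark, value, mask uint32) bool { return mark&mask == value }

func main() {
	var mark uint32 = 0x1 // some unrelated bit that is already set

	mark = markOr(mark, masqBit)
	fmt.Printf("%#x masq=%v\n", mark, markMatch(mark, masqBit, masqBit))

	mark = markXor(mark, masqBit)
	fmt.Printf("%#x masq=%v\n", mark, markMatch(mark, masqBit, masqBit))
}
```

Because only the masked bit is touched, other marks on the same packet survive both operations.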

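The INPUT chain hands the packet to KUBE-FIREWALL, which decides whether to let it through. Looking at what KUBE-FIREWALL does: a packet carrying the mark 0x8000/0x8000 is dropped, and a packet sent from outside 127.0.0.0/8 to an address inside 127.0.0.0/8 is dropped unless its ctstate is RELATED, ESTABLISHED or DNAT.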

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-NODE-PORT  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes health check rules */
ACCEPT     udp  --  0.0.0.0/0            169.254.20.10        udp dpt:53
ACCEPT     tcp  --  0.0.0.0/0            169.254.20.10        tcp dpt:53
KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0 

Chain KUBE-FIREWALL (2 references)
target     prot opt source               destination         
DROP       all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
DROP       all  -- !127.0.0.0/8          127.0.0.0/8          /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT
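The two KUBE-FIREWALL rules can be read as the following sketch. It is a simplification: the packet type and function names are invented here, the conntrack state is modelled as a single string although real packets can carry several state bits at once, and 10.80.1.5 is a hypothetical pod IP.

```go
package main

import (
	"fmt"
	"net"
)

// packet holds only the fields the two rules look at.
type packet struct {
	Src, Dst string
	Mark     uint32
	CTState  string // simplified: "NEW", "ESTABLISHED", "RELATED" or "DNAT"
}

var loopback = mustCIDR("127.0.0.0/8")

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

func inLoopback(ip string) bool {
	p := net.ParseIP(ip)
	return p != nil && loopback.Contains(p)
}

// kubeFirewall returns "DROP" if one of the two rules matches, otherwise
// "RETURN" (the packet falls through to the rest of the INPUT chain).
func kubeFirewall(p packet) string {
	// Rule 1: mark match 0x8000/0x8000
	if p.Mark&0x8000 == 0x8000 {
		return "DROP"
	}
	// Rule 2: source !127.0.0.0/8, destination 127.0.0.0/8,
	//         ! ctstate RELATED,ESTABLISHED,DNAT
	tracked := p.CTState == "RELATED" || p.CTState == "ESTABLISHED" || p.CTState == "DNAT"
	if !inLoopback(p.Src) && inLoopback(p.Dst) && !tracked {
		return "DROP"
	}
	return "RETURN"
}

func main() {
	// A new connection to a ClusterIP, carrying only the masquerade mark.
	toClusterIP := packet{Src: "10.80.1.5", Dst: "192.168.144.37", Mark: 0x4000, CTState: "NEW"}
	fmt.Println(kubeFirewall(toClusterIP)) // RETURN
}
```

A packet to a ClusterIP with only the 0x4000 mark matches neither rule, so it continues on to the ipvs hook.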

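Next the ipvs hook runs and performs DNAT.

The destination IP is rewritten from the Service's ClusterIP to the IP of one of the Service's backend pods, and the destination port from the Service port to that pod's port. Source IP and source port are left unchanged. After DNAT the packet is sent straight to the POSTROUTING chain.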
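As a rough model of this step, the sketch below rewrites only the destination and picks backends in round-robin order, like the rr scheduler that shows up in the ipvsadm output later in this article. All type names are made up; 10.80.6.20 and the client address 10.80.1.5:40000 are hypothetical values added so that there is something to rotate over.

```go
package main

import (
	"errors"
	"fmt"
)

type addr struct {
	IP   string
	Port uint16
}

type packet struct {
	Src, Dst addr
}

// virtualServer is a toy model of an IPVS virtual server with rr scheduling.
type virtualServer struct {
	VIP         addr
	RealServers []addr
	next        int // index of the real server to use next
}

var errNoDestination = errors.New("no real server available")

// dnat rewrites only the destination of p; the source is left untouched.
func (vs *virtualServer) dnat(p packet) (packet, error) {
	if p.Dst != vs.VIP {
		return p, nil // not addressed to this virtual server
	}
	if len(vs.RealServers) == 0 {
		return p, errNoDestination
	}
	p.Dst = vs.RealServers[vs.next%len(vs.RealServers)]
	vs.next++
	return p, nil
}

func main() {
	vs := &virtualServer{
		VIP:         addr{"192.168.144.37", 80},
		RealServers: []addr{{"10.80.5.174", 80}, {"10.80.6.20", 80}},
	}
	req := packet{Src: addr{"10.80.1.5", 40000}, Dst: vs.VIP}

	for i := 0; i < 3; i++ {
		out, err := vs.dnat(req)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s:%d -> %s:%d\n", out.Src.IP, out.Src.Port, out.Dst.IP, out.Dst.Port)
	}
}
```

Note the empty-backend case: a virtual server without real servers cannot translate anything, which matters for the special case discussed later.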

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
MASQUERADE  all  --  169.254.123.0/24     0.0.0.0/0           
RETURN     all  --  10.80.0.0/16         10.80.0.0/16        
MASQUERADE  all  --  10.80.0.0/16        !224.0.0.0/4         
RETURN     all  -- !10.80.0.0/16         10.80.1.128/25      
MASQUERADE  all  -- !10.80.0.0/16         10.80.0.0/16

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination         
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose */ match-set KUBE-LOOP-BACK dst,dst,src
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */

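In the POSTROUTING chain the packet goes through MASQUERADE (SNAT).

At this point the source IP is the IP of the network device used for the next hop, the source port is a random port, and the destination IP and port are those of the selected pod. A conntrack entry is also recorded so that, on the reply, the source IP can be changed from the backend pod IP back to the Service's ClusterIP that was used before DNAT.

The next hop is then chosen according to the routing table.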
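The combination of DNAT, SNAT and the conntrack record can be sketched as below. This is a toy model and not how the kernel stores connections: the conntrack type is a plain map invented for this illustration, and the client address 10.80.1.5:40000 and node address 172.16.0.10:51000 are hypothetical.

```go
package main

import "fmt"

type addr struct {
	IP   string
	Port uint16
}

type tuple struct {
	Src, Dst addr
}

// conntrack maps the reply tuple seen on the wire to the tuple the
// original sender expects to receive.
type conntrack map[tuple]tuple

// forward applies DNAT (ClusterIP -> backend) and, if masq is set, SNAT
// (client -> node address), and records how to undo both for the reply.
func (ct conntrack) forward(orig tuple, backend, nodeAddr addr, masq bool) tuple {
	out := tuple{Src: orig.Src, Dst: backend}
	if masq {
		out.Src = nodeAddr
	}
	reply := tuple{Src: out.Dst, Dst: out.Src}      // what the backend sends back
	ct[reply] = tuple{Src: orig.Dst, Dst: orig.Src} // what the client must see
	return out
}

// reverse rewrites a reply packet using the recorded entry, if there is one.
func (ct conntrack) reverse(reply tuple) (tuple, bool) {
	t, ok := ct[reply]
	return t, ok
}

func main() {
	ct := conntrack{}
	orig := tuple{Src: addr{"10.80.1.5", 40000}, Dst: addr{"192.168.144.37", 80}}

	out := ct.forward(orig, addr{"10.80.5.174", 80}, addr{"172.16.0.10", 51000}, true)
	fmt.Printf("on the wire: %+v\n", out)

	back, ok := ct.reverse(tuple{Src: out.Dst, Dst: out.Src})
	fmt.Printf("reply after un-NAT: %+v (found=%v)\n", back, ok)
}
```

The client pod only ever sees the ClusterIP as its peer; the backend pod only ever sees the SNAT address.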

Info

To recap the path:

Request --> PREROUTING --> KUBE-SERVICES (match ipset KUBE-CLUSTER-IP) --> KUBE-MARK-MASQ --> INPUT --> KUBE-FIREWALL --> IPVS --> DNAT --> POSTROUTING --> KUBE-POSTROUTING --> ROUTE

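Summary

The packet leaves the pod's network namespace and enters the host's network namespace. Source IP is the pod IP, source port is a random port, destination IP is the cluster IP, and destination port is the requested port.
In the host network namespace the packet enters the PREROUTING chain.
In the PREROUTING chain it is matched against the ipset KUBE-CLUSTER-IP and marked.
The host network namespace contains the network device kube-ipvs0 with all cluster IPs bound to it. Since the destination IP of the packet sent by the pod is a cluster IP that is bound to kube-ipvs0, the packet enters the INPUT chain.
In the INPUT chain the packet is rewritten by the ipvs kernel rules (which can be viewed with ipvsadm), completing DNAT, and is then sent straight to the POSTROUTING chain. At this point source IP is the pod IP, source port is a random port, and destination IP and port are those of the selected pod.
In the POSTROUTING chain the packet passes through the KUBE-POSTROUTING target and is MASQUERADEd (SNAT). Now the source IP is the IP of the network device used for the next hop, source port is a random port, and destination IP and port are those of the selected pod.
The packet's next hop is chosen according to the routing table of the host network namespace.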

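A special case

Traffic is handled differently when the service type is LoadBalancer and spec.externalTrafficPolicy = Local. (This case is based on Alibaba Cloud ACK and Alibaba Cloud CLB.)

For example, PodA reaches PodD, a service in the same cluster, through a LoadBalancer (think of it as a service; it is one way a service is implemented), and the LoadBalancer has externalTrafficPolicy = Local.

It turns out that PodA can reach the backend PodD only when PodA runs on the same ECS instance as one of the LoadBalancer's backend pods (for example, if the LoadBalancer's backend is the nginx ingress controller, the backend pods are the nginx pods).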

[Figure: local_policy (local_policy.png)]

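Some sharp readers will ask: why access it this way at all, why not use the svc domain name or the clusterIP from inside the cluster? Don't ask; the answer is "business requirements".

Why does it behave like this? We have to start from the kube-proxy source code.

When syncProxyRules() synchronizes the ipvs rules, it calls svcInfo.NodeLocalExternal() to determine whether the svc's traffic policy is Local. syncEndpoint() then makes sure that ipvs real servers, which DNAT needs, are created only on nodes that actually run a backend pod.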

func (proxier *Proxier) syncProxyRules() {
  ...
        // ipvs call
        serv := &utilipvs.VirtualServer{
          Address:   net.ParseIP(ingress),
          Port:      uint16(svcInfo.Port()),
          Protocol:  string(svcInfo.Protocol()),
          Scheduler: proxier.ipvsScheduler,
        }
        if svcInfo.SessionAffinityType() == v1.ServiceAffinityClientIP {
          serv.Flags |= utilipvs.FlagPersistent
          serv.Timeout = uint32(svcInfo.StickyMaxAgeSeconds())
        }
        if err := proxier.syncService(svcNameString, serv, true, bindedAddresses); err == nil {
          activeIPVSServices[serv.String()] = true
          activeBindAddrs[serv.Address.String()] = true
          if err := proxier.syncEndpoint(svcName, svcInfo.NodeLocalExternal(), serv); err != nil {
            klog.ErrorS(err, "Failed to sync endpoint for service", "service", serv)
          }
        } else {
          klog.ErrorS(err, "Failed to sync service", "service", serv)
        }
...
}

func (proxier *Proxier) syncEndpoint(svcPortName proxy.ServicePortName, onlyNodeLocalEndpoints bool, vs *utilipvs.VirtualServer) error {
  ...
  for _, epInfo := range endpoints {
    if epInfo.IsReady() {
      readyEndpoints.Insert(epInfo.String())
    }

    if onlyNodeLocalEndpoints && epInfo.GetIsLocal() {
      if epInfo.IsReady() {
        localReadyEndpoints.Insert(epInfo.String())
      } else if epInfo.IsServing() && epInfo.IsTerminating() {
        localReadyTerminatingEndpoints.Insert(epInfo.String())
      }
    }
  }
  ...
  // If onlyNodeLocalEndpoints is true, newEndpoints = localReadyEndpoints
  newEndpoints := readyEndpoints
  if onlyNodeLocalEndpoints {
    newEndpoints = localReadyEndpoints

    if utilfeature.DefaultFeatureGate.Enabled(features.ProxyTerminatingEndpoints) {
      if len(newEndpoints) == 0 && localReadyTerminatingEndpoints.Len() > 0 {
        newEndpoints = localReadyTerminatingEndpoints
      }
    }
  }
  ...
    newDest := &utilipvs.RealServer{
      Address: net.ParseIP(ip),
      Port:    uint16(portNum),
      Weight:  1,
    }

    if curEndpoints.Has(ep) {
      // check if newEndpoint is in gracefulDelete list, if true, delete this ep immediately
      uniqueRS := GetUniqueRSName(vs, newDest)
      if !proxier.gracefuldeleteManager.InTerminationList(uniqueRS) {
        continue
      }
      klog.V(5).InfoS("new ep is in graceful delete list", "uniqueRS", uniqueRS)
      err := proxier.gracefuldeleteManager.MoveRSOutofGracefulDeleteList(uniqueRS)
      if err != nil {
        klog.ErrorS(err, "Failed to delete endpoint in gracefulDeleteQueue", "endpoint", ep)
        continue
      }
    }
    // Add the ipvs real server (destination)
    err = proxier.ipvs.AddRealServer(appliedVirtualServer, newDest)
    if err != nil {
      klog.ErrorS(err, "Failed to add destination", "newDest", newDest)
      continue
    }
  ...
}
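Stripped of the ipvs calls, the selection logic in syncEndpoint() boils down to something like the following standalone sketch. It is a simplified restatement of the excerpt above and not the kube-proxy code itself: the endpoint struct, the function selectEndpoints and the proxyTerminating flag (standing in for the ProxyTerminatingEndpoints feature gate) are made up for this illustration.

```go
package main

import "fmt"

// endpoint carries only the flags the selection looks at.
type endpoint struct {
	Addr        string
	Ready       bool
	Serving     bool
	Terminating bool
	Local       bool // true if the pod runs on this node
}

// selectEndpoints returns the endpoints that would become ipvs real servers
// on this node.
func selectEndpoints(eps []endpoint, onlyNodeLocal, proxyTerminating bool) []string {
	var ready, localReady, localReadyTerminating []string
	for _, ep := range eps {
		if ep.Ready {
			ready = append(ready, ep.Addr)
		}
		if onlyNodeLocal && ep.Local {
			if ep.Ready {
				localReady = append(localReady, ep.Addr)
			} else if ep.Serving && ep.Terminating {
				localReadyTerminating = append(localReadyTerminating, ep.Addr)
			}
		}
	}

	selected := ready
	if onlyNodeLocal {
		selected = localReady
		if proxyTerminating && len(selected) == 0 && len(localReadyTerminating) > 0 {
			selected = localReadyTerminating
		}
	}
	return selected
}

func main() {
	// One ready nginx pod that runs on some other node.
	eps := []endpoint{{Addr: "10.80.5.174:80", Ready: true, Serving: true, Local: false}}

	fmt.Println(len(selectEndpoints(eps, false, false))) // 1: Cluster policy uses every ready endpoint
	fmt.Println(len(selectEndpoints(eps, true, false)))  // 0: Local policy, no local pod on this node
}
```

With the Local policy, a node that hosts none of the backend pods ends up with an empty list, so no real server is added to the virtual server there.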

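Let us test it. For security reasons I am not using real IPs here; assume the lb IP is 111.111.111.111. If you want to verify this yourself, substitute the values from your own environment.

First we deploy a busybox container to test whether traffic can reach the internal service through the lb.

Pick a node that is not running an nginx pod and run ipvsadm -Ln|grep -A 2 111.111.111.111 on it.

The output clearly shows that the lb's ipvs virtual servers have no backend entries.

Schedule the busybox container onto this node.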

ipvsadm -Ln|grep -A 2 111.111.111.111

TCP  111.111.111.111:80 rr
TCP  111.111.111.111:443 rr

# curl yourdomain.com
curl: (7) Failed connect to yourdomain.com:80; Connection refused

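Running curl against the lb's domain name from inside the busybox container, the connection is refused, because there is no backend to DNAT to.

Next, look at the ipvs rules on an ECS instance that does run an nginx pod. There the lb's ipvs virtual servers do have backend entries.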

ipvsadm -Ln|grep -A 2 111.111.111.111

TCP  111.111.111.111:80 rr
  -> 10.80.5.174:80               Masq    1      0          0         
TCP  111.111.111.111:443 rr
  -> 10.80.5.174:80               Masq    1      0          0

# curl yourdomain.com
{"responseInfo":{"tips":"Token不合法"},"successful":false}

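After scheduling the busybox container onto this node, curl against the lb's domain name succeeds.

The reason: because externalTrafficPolicy = Local, kube-proxy adds no ipvs real servers for the lb IP on a node that has no local backend pod. When the packet reaches the ipvs DNAT step in the INPUT chain there is no destination to translate to, so the connection is refused.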
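The two test results can be summarised in a small per-node model. This is only a sketch under simplifying assumptions: nodeIPVS and connect are invented names, scheduling is reduced to "take the first backend", and the table contents mirror the two ipvsadm outputs above.

```go
package main

import (
	"errors"
	"fmt"
)

var errRefused = errors.New("connection refused")

// nodeIPVS maps a virtual server ("ip:port") to its real servers on one node.
type nodeIPVS map[string][]string

// connect models a pod on this node dialing vip. It returns the address the
// packet ends up at, or errRefused if ipvs owns the VIP but has no backend.
func (n nodeIPVS) connect(vip string) (string, error) {
	backends, isVirtualServer := n[vip]
	if !isVirtualServer {
		return vip, nil // ipvs does not own this address; the packet leaves the node unchanged
	}
	if len(backends) == 0 {
		return "", errRefused
	}
	return backends[0], nil // scheduling is ignored in this sketch
}

func main() {
	nodeWithoutNginx := nodeIPVS{"111.111.111.111:80": nil}
	nodeWithNginx := nodeIPVS{"111.111.111.111:80": {"10.80.5.174:80"}}

	_, err := nodeWithoutNginx.connect("111.111.111.111:80")
	fmt.Println("node without nginx:", err)

	backend, err := nodeWithNginx.connect("111.111.111.111:80")
	fmt.Println("node with nginx:", backend, err)
}
```

The virtual server exists on both nodes because the lb IP is bound to kube-ipvs0 everywhere; only the real servers differ.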

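Summary

I learned ipvs while writing this, so corrections are welcome if anything in the article is wrong.
I did not understand this pitfall at first either. Only after an incident did I sit down to study ipvs, the related iptables rules and the traffic path from a pod to a service, and then it finally made sense.
We found that an lb's external-ip, whether public or private, is bound to the kube-ipvs0 interface, which means its traffic enters the local host through the INPUT chain and is forwarded by ipvs. If the lb uses a private IP and an EIP is then attached to it, the EIP is not bound to the dummy interface. When an in-cluster service reaches another in-cluster service through the EIP, the traffic detours over the public network and bypasses ipvs forwarding, so this is another way to reach the internal service.
Other svc types are left for later study, for example nodePort: with the Local traffic policy, a request succeeds only if the ECS instance being accessed runs a backend pod of the service.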
