Inspecting and Understanding the Kubernetes (k8s) Service Network

A Kubernetes Service object provides a stable network endpoint in front of a set of Pods and load-balances traffic across them.

Always put a service in front of a set of pods that do the same job (they run the same container images). For example, you can put one service in front of the web front-end pods and another in front of the authentication pods. You never put a service in front of pods that do different jobs (they run different container images).
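As a minimal sketch of this idea (the names here are illustrative, not part of this article's example), a Service selects its pods purely by label:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: web-frontend-svc
spec:
  selector:
    app: web-frontend   # matches only pods carrying this label
  ports:
  - port: 80
    targetPort: 8080
EOF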

Clients talk to the Service and the Service load-balances traffic across the Pods.

Fig. 1

In the diagram above, the Pods at the bottom may come and go as scaling operations, updates, failures, and other events occur, and the Service tracks these changes. However, the name, IP, and port of the Service never change.

Anatomy of a Kubernetes service

It's useful to think of a Kubernetes service as having a front-end and a back-end:

  • Front-end: a name, IP, and port that never change
  • Back-end: the Pods that match a label selector

The front-end is stable and reliable: the name, IP, and port number are guaranteed not to change for the life of the Service. This also means you don't need to worry about stale entries on clients that cache DNS results for longer than the standards recommend.

The back-end is very dynamic: the Service load-balances traffic across all pods in the cluster whose labels match the selector the Service is configured with.

Fig. 2

The load balancing in this situation is simple L4 round-robin load balancing. It works at the "connection" level: all requests over the same connection go to the same Pod. This means two things:

  1. Multiple requests from the same browser will always hit the same Pod, because browsers send all requests over a single connection that is kept open with keepalives. Tools like curl open a new connection for each request and can therefore land on different Pods (the quick check after this list illustrates this).
  2. The load balancing is unaware of application-layer (L7) concepts such as HTTP headers and cookie-based session affinity.
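One way to observe the per-connection behaviour (a sketch; it assumes the nginx-svc service and its ClusterIP 192.168.251.24 created later in this article, and the conntrack tool on the node): generate several separate connections, then look at the connection-tracking table on the worker node, where each tracked connection shows the pod IP it was NAT-ed to.

# from a pod or node, open several separate connections:
for i in 1 2 3 4 5; do curl -s -o /dev/null http://192.168.251.24; done

# on the worker node: each tracked connection shows the DNAT-ed pod IP
conntrack -L | grep 192.168.251.24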

Recap of the introduction

Applications run in containers, which in turn run inside Pods. Every Pod in a Kubernetes cluster has its own IP address and is connected to the same flat Pod network, so all Pods can talk directly to all other Pods. However, Pods are unreliable and come and go as scaling operations, rolling updates, rollbacks, failures, and other events occur. Fortunately, Kubernetes provides a stable network endpoint called a Service that sits in front of a collection of similar Pods and offers a stable name, IP, and port. Clients connect to the Service and the Service load-balances traffic across the Pods.

When a new service is created, it is assigned a virtual IP address called the ClusterIP. This is automatically registered for the service's name in the cluster's internal DNS, and the relevant Endpoints objects (or EndpointSlices) are created to hold the list of healthy pods the service will load-balance across.
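You can see all three pieces for the nginx-svc service used later in this article (names assumed from that example):

kubectl get service nginx-svc        # the stable ClusterIP and port
kubectl get endpoints nginx-svc      # healthy pod IP:port pairs behind it
kubectl get endpointslices -l kubernetes.io/service-name=nginx-svc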

At the same time, all nodes in the cluster are configured with iptables/IPVS rules that match traffic sent to this ClusterIP and redirect it to real pod IPs. The flow is summarized in the image below, although the order of some events may differ slightly.

Fig. 3

When a Pod needs to connect to another Pod, it usually does so through a Service. It queries the cluster's DNS to resolve the service name to its ClusterIP, and then sends traffic to that ClusterIP. The ClusterIP lives on a special network called the service network. However, there are no routes to the service network, so the Pod sends the traffic to its default gateway. The traffic is forwarded to an interface on the node the Pod is running on, and finally toward the node's default gateway. Along the way, the node's kernel traps the address and rewrites the destination IP field in the packet header (using iptables/IPVS) so that the packet goes to the IP of a healthy pod.

This is summarized in the image below.

Fig. 4
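To watch the first step of this flow yourself, you can resolve a service name from inside the cluster; a throwaway busybox pod works (a sketch, assuming the nginx-svc service created later in this article and the default namespace):

kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup nginx-svc.default.svc.cluster.local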

That was a lot of theory to understand the service network; now let's inspect the real service network of a Kubernetes cluster.

I provisioned a 3-node GKE cluster for this purpose. The Pod and service network configuration is shown below:

Fig. 5

Then connect to the cluster using kubectl in Cloud Shell. Authorize and review the cluster configuration.

kubectl get pods -A
kubectl get nodes
kubectl get node -o custom-columns=NAME:'{.metadata.name}',\
PrivateIP:'{.status.addresses[?(@.type=="InternalIP")].address}'

Fig. 6

Let's ssh into the GKE worker nodes and review how kube-proxy updates the iptables/IPVS rules when services are created or when we scale the deployment behind a service.

For that, we first need to create a deployment and a service to expose the deployment's pods.

Let's create a deployment with 3 replicas:

# on cloudshell with kubectl access
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
kubectl get deployment -o wide
kubectl get pods -o wide
kubectl get pods -o custom-columns=NAME:'{.metadata.name}',\
HOSTIP:'{.status.hostIP}',PODIP:'{.status.podIP}'

Fig. 7

The 3 pods are scheduled on 3 different worker nodes and receive the Pod IPs 192.168.0.6, 192.168.1.6, and 192.168.2.5.

Now let’s create a service of type ClusterIP

kubectl expose deployment nginx-deployment --name=nginx-svc \
  --port=80 --target-port=80 --selector='app=nginx'
kubectl get service

Fig. 8

The service is created with ClusterIP 192.168.251.24.

Reviewing the kube-proxy config

kube-proxy runs as a static pod in GKE.

Let's ssh into one of the worker nodes (I'm connecting to the one with IP 10.128.0.4). It doesn't matter whether a pod from the nginx-deployment deployment is running on that node.
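On GKE, one way to get a shell on a worker node is gcloud (the node name and zone below are placeholders; list yours first):

kubectl get nodes -o wide   # find node names and internal IPs
gcloud compute ssh gke-cluster-1-default-pool-abcd1234-xyz --zone us-central1-a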

Fig. 9

Reviewing the kube-proxy log, we see that since no proxy mode was specified in the kube-proxy start command, the default mode (iptables) is assumed:

# grep "proxy mode" /var/log/kube-proxy.log
W0802 20:09:49.428959 1 server_others.go:565] Unknown proxy mode "", assuming iptables proxy

So we will be looking at iptables rules, since that is the default mode used by kube-proxy.

The ClusterIP service

A regular ClusterIP service exposes the service on a cluster-internal IP. With this type, the service is only reachable from within the cluster. This is the default ServiceType.

On a worker node (the one with IP 10.128.0.4 in my case), run the following command to find the rules related to the service we created. Let's review the iptables "nat" table and look for "nginx-svc" (the service name).

iptables -t nat -L | grep -i nginx-svc

Fig. 10

We see a lot of output, but it's hard to make sense of it (especially when you're new to iptables).

From the PREROUTING and OUTPUT chains we can see that all packets entering or leaving Pods hit the KUBE-SERVICES chain as their starting point.

Fig. 11.1

Let's look first at the KUBE-SERVICES chain, since it is the entry point for service packets: it matches on destination IP:port and dispatches the packet to the corresponding KUBE-SVC-* chain.

Fig. 11.2
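To cut through the noise, you can list the KUBE-SERVICES chain directly on the node and filter on the ClusterIP from Fig. 8 (run as root):

iptables -t nat -L KUBE-SERVICES -n | grep 192.168.251.24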

Since KUBE-SVC-HL5LMXD5JFHQZ6LN is the next chain, we will inspect it.

Fig. 12

In this particular KUBE-SVC-HL5LMXD5JFHQZ6LN chain, we see four rules:

  • The first says that if any traffic originating outside the podCIDR of "this" node is destined for the nginx service on port 80 (http), a Netfilter mark is added to the packet; marked packets are later modified by a rule in the KUBE-POSTROUTING chain to use source network address translation (SNAT), with the node's IP as the source address. See Fig. 12 and Fig. 13.

Fig. 13

  • The remaining KUBE-SVC-* rules act as a load balancer, distributing packets across the KUBE-SEP-* chains. The number of KUBE-SEP-* chains equals the number of endpoints behind the service (i.e. the number of running pods), which is three here. Which KUBE-SEP-* chain is chosen is determined at random; we can see this in Fig. 12. The KUBE-SEP-* rules are all similar, so we will discuss only one. We will cover the "statistic mode random probability" clauses later in this article.

KUBE-SVC-HL5LMXD5JFHQZ6LN sends packets to KUBE-SEP-7EX3YM24AF6XH4A3 and the 2 other KUBE-SEP-* chains at random.

Each KUBE-SEP-* chain represents one pod (endpoint).

KUBE-SEP-7EX3YM24AF6XH4A3 has two rules:

  1. The first adds a Netfilter mark to the packet; marked packets are later modified by a rule in the KUBE-POSTROUTING chain. KUBE-MARK-MASQ marks a packet for masquerading (SNAT, so that the packet appears to come from the node's IP) when the packet originates from the very pod this chain points to (the hairpin case).
  2. The second rule DNATs the packet to the podIP of the backing pod on the target port (80 in this case). You can inspect the chain directly with the command below.
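On the worker node (chain name taken from Fig. 12):

iptables -t nat -L KUBE-SEP-7EX3YM24AF6XH4A3 -n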

Fig. 14

Similar rules also apply to the other two KUBE-SEP-* chains (Fig. 15).

Fig. 15

If we scale the deployment from 3 to 4 replicas, another KUBE-SEP-* chain is created and a rule pointing to it is added to KUBE-SVC-HL5LMXD5JFHQZ6LN. We will do exactly that later in this article.

The NodePort service

A NodePort service exposes the service on each node's IP at a static port (the NodePort). A ClusterIP service, to which the NodePort service routes, is created automatically. You can reach the NodePort service from outside the cluster by requesting <NodeIP>:<NodePort>.

There are 2 variants of the NodePort service:

  • the default (externalTrafficPolicy: Cluster)
  • externalTrafficPolicy: Local

We will discuss the default NodePort service (externalTrafficPolicy: Cluster).

To test it, I edited the existing service and changed its type to "NodePort" with a nodePort of 30010.

kubectl edit service nginx-svc

Fig. 16
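If you prefer a non-interactive change, a patch along these lines should be equivalent (a sketch using the values from this example):

kubectl patch service nginx-svc -p \
  '{"spec": {"type": "NodePort", "ports": [{"port": 80, "nodePort": 30010}]}}'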

From the perspective of iptables, two sets of chains and rules are added, to the KUBE-SERVICES and KUBE-NODEPORTS chains respectively:

Fig. 17

In the KUBE-SERVICES chain, if no KUBE-SVC-* rule matches a packet, the last rule in the chain applies: a jump to KUBE-NODEPORTS.

KUBE-NODEPORTS says that all packets arriving on port 30010 go to the KUBE-SVC-HL5LMXD5JFHQZ6LN chain, where they are first marked for SNAT (via KUBE-MARK-MASQ) and then forwarded to the KUBE-SEP-* chains, which select a pod to route to.
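On a worker node you can inspect these dispatch rules directly:

iptables -t nat -L KUBE-NODEPORTS -n --line-numbers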

Fig. 18

UPDATE: The way NodePort routes packets has changed. As of March 2023, packets are no longer sent directly to the KUBE-SVC-* chains; they are forwarded to a KUBE-EXT-* chain, which in turn jumps to KUBE-SVC-*, and the rest works as described above.

Here’s an example:
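A hedged sketch of the newer layout (the KUBE-EXT-* hash shown mirrors the KUBE-SVC-* hash of this example; the exact rules on your cluster will differ):

# KUBE-NODEPORTS -> KUBE-EXT-HL5LMXD5JFHQZ6LN
# KUBE-EXT-*     -> KUBE-MARK-MASQ, then KUBE-SVC-HL5LMXD5JFHQZ6LN
# KUBE-SVC-*     -> KUBE-SEP-* (one chain per endpoint, as before)
iptables -t nat -L KUBE-EXT-HL5LMXD5JFHQZ6LN -n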

Why an extra chain? In my opinion, the KUBE-EXT-* chain gives kube-proxy the ability to reuse the chain.

The LoadBalancer service

A LoadBalancer service exposes the service externally using a cloud provider's load balancer (GCP in this case). The NodePort and ClusterIP services, to which the external load balancer routes, are created automatically.

If we change the service type from NodePort to LoadBalancer, nothing changes at the iptables level. It uses the same iptables chains and only adds an OSI layer 4 (TCP) load balancer in front of the node ports.

This is not true for the GKE LoadBalancer service: the GKE load balancer does not forward traffic to the nodes' nodePorts. Instead, incoming packets destined for the load balancer IP and service port are handled by the KUBE-SVC-* chain for that service (on each node in the instance groups), exactly as happens when traffic arrives at the ClusterIP. In the KUBE-SVC-* chain, packets are first marked for SNAT (via KUBE-MARK-MASQ) and then forwarded to the KUBE-SEP-* chains, which select a pod to route to. This behaviour is specific to GKE and the GCP load balancer.

To load-balance traffic across the available endpoints, iptables attaches a "statistic mode random probability x.xxxxx" clause to each KUBE-SEP-* rule in the KUBE-SVC-* chain.

The iptables engine is deterministic and the first matching rule always wins. Without these clauses, KUBE-SEP-7EX3YM24AF6XH4A3 (Fig. 18) would receive all the connections, but we want to load-balance across the available endpoints.

To work around this problem, iptables includes a module called statistic that matches or skips a rule based on a statistical condition. The statistic module supports two different modes:

  • random: the rule is matched based on a probability
  • nth: the rule is matched based on a round-robin algorithm

Random balancing

Look at Fig. 18 and notice that 3 different probabilities are defined, not 0.33 everywhere. The reason is that the rules are evaluated sequentially.

With a probability of 0.33333, the first KUBE-SEP-* rule is matched 33% of the time and skipped 67% of the time.

With a probability of 0.5, the second rule is matched 50% of the time and skipped 50% of the time. But since this rule sits after the first one, it is only evaluated for the remaining 67% of traffic, so it applies to just 50% of that 67% ≈ 33% of all requests.

Since only the remaining ~33% of traffic reaches the last rule, that rule always matches. The arithmetic generalizes to any number of endpoints, as the quick check below shows.
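A quick sanity check of this arithmetic (plain shell/awk, no cluster needed): rule i out of n matches 1/(n-i+1) of the traffic that reaches it (the last rule carries no clause and always matches), which gives every endpoint an equal 1/n share.

awk 'BEGIN {
  n = 3; remaining = 1.0
  for (i = 1; i <= n; i++) {
    p = 1 / (n - i + 1)   # probability used on rule i (last rule: 1.0)
    printf "rule %d: probability %.5f -> %.1f%% of all traffic\n", i, p, remaining * p * 100
    remaining *= (1 - p)  # traffic that falls through to the next rule
  }
}'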

If we scale the replicas for this deployment from 3 to 4, what changes in the service's configuration at the iptables layer?

Fig. 19

More pods means more endpoint objects, so the number of KUBE-SEP-* rules in the KUBE-SVC-* chain also increases. You can trigger and observe this with the commands below.
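Reproducing this is straightforward (chain name from Fig. 12; run the second command on a worker node):

kubectl scale deployment nginx-deployment --replicas=4
# then, on a worker node:
iptables -t nat -L KUBE-SVC-HL5LMXD5JFHQZ6LN -n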

Fig. 20

Compare Fig. 20 with Fig. 18: now the first KUBE-SEP-* rule is matched for 25% of all packets; the second is matched for 33% of the remaining 75%, which is again 25% of the total; the third runs for 50% of the remaining 50%; and the last rule picks up everything that is left, again 25% of the total.

There are certain service configurations that we don't discuss in this article:

  1. External IP service: if there are external IPs that route to one or more cluster nodes, Kubernetes services can be exposed on those external IPs. Traffic entering the cluster with the external IP (as destination IP) on the service port is routed to one of the service endpoints.
  2. Session affinity: Kubernetes supports ClientIP-based session affinity, which causes requests from the same client to always be routed to the same pod.
  3. Service with no endpoints: a ClusterIP service uses a "selector" to choose its backend pods. If pods matching the selector are found, Kubernetes creates an endpoint object mapping to the pods' IP:Port; otherwise the service has no endpoints.
  4. Headless service: sometimes you need neither load balancing nor a single service IP. In this case you can create a "headless" service by specifying "None" for the cluster IP (.spec.clusterIP); see the sketch after this list.
  5. NodePort service with externalTrafficPolicy: Local: using "externalTrafficPolicy: Local" preserves the client source IP and drops packets on worker nodes that have no local endpoint.
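For the headless case, a minimal sketch (the service name is illustrative; the selector matches this article's deployment): with .spec.clusterIP set to None, a DNS lookup for the service returns the individual pod IPs instead of one virtual IP.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-headless
spec:
  clusterIP: None       # headless: no virtual IP, no kube-proxy rules
  selector:
    app: nginx
  ports:
  - port: 80
EOF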

I highly recommend reviewing my post on a special case that arises when using the GCP firewall with the GKE LoadBalancer service.