You see I didn't mention "fixing" a bug, because I think finding the root cause of any bug has value per se. Fixing them is a separate adventure.
Anyway this week I was entirely bamboozled by this: a Docker container was re-deployed with a different networking configuration. In particular, with Compose, I was dedicating a network for that container (let's say 172.18.0.0/24) separate from the default Docker network (e.g. 172.17.0.0/24).
- subnet: 172.18.0.0/24
In the "service" definition I just added this:
No need to be more specific with names because Compose will add the project name to the network, so in the Docker network list you see something like: MYPROJECT_apps.
This change worked perfectly on two previous deployments - in fact they were deployments on intermediate stages, in preparation for a deployment on a third stage. Everything went smoothly there, no big deal, Compose updating the networking as the new configuration was brought up.
This particular container exposes a UDP port (say 8888) and it's expected to receive UDP packets from the VLAN via the host interface. Packets arrive at the host's UDP 8888 and are forwarded to the container's UDP 8888, where they are processed. Very simple.
But when I did the deployment on the third environment, reasonably similar to the first two, things just didn't work. tcpdump was showing that the UDP packets were still received by the host but not forwarded to the container.
The forwarding rules - inspected with iptables - were correct. The container had the expected IP Address. ip route was correct. Where are those packets going then?
I ssh'd into one of the machines that were sending UDP traffic to the receiving host, opened a netcat connections to that host, port 8888, and sent some data. Those packets were correctly being received by the container! So some traffic with similar characteristics was received, some wasn't.
Perhaps an issue with data length? Nope. I tried sending big UDP packets and they were correctly received too. It was a weak lead anyway, because the source traffic was of various lengths while not a single packet was forwarded.
Errors logged somewhere? None.
iptables stats were confirming the arrival of the packets, and that they were not routed to the container's virtual interface.
At this point I was enough out of ideas to start tapping the shoulder of some colleague. "I'm observing an interesting behaviour!". "Have you tried...": "Yes". "How about... ": "Checked." "WTF?": "Exactly".
It was indeed a WTF moment. Surely there were some differences in the list of virtual interfaces on the problematic host, but none seemed to have an impact.
We looked deeper into the network traces. And there, glooming in their "light green on black" magnificence there were the ICMP Destination Unreachable responses from the receiving host to the producing hosts. A confirmation that packets were attempted to be sent to an unreachable destination. Why? The forwarding rules were not just right, but also easily testable. It was just the "official" traffic that wasn't forwarded...
Then a colleague, one of the brightest I've ever had, started talking about tracked connections (/proc/net/nf_conntrack). We could see our netcat connections were tracked, as they were the ones from the other hosts producing data. But there was an important detail. The connections from the "official" producing hosts were associated with the old container IP address (e.g. 172.17.0.2 instead of the new one, 172.18.0.2)!
The usage of conntrack is visible from iptables' output, in particular for the FORWARDING table:
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
Why didn't we observe this behaviour in the other environments? Simple: there the traffic was more sporadic, and so the tracked connections had the time to expire. In the failing environment the continuous flow of UDP packets kept refreshing the tracked connections, even though the forwarding was actually failing.
Only at that point I had the right scenario described, and I could find this Docker issue. From there:
When you restart the container, container's ip has changed, so the DNAT rule, which will route to the new address. But the old connection's state in conntrack is not cleared. So when a packet arrives, it will not go through NAT table again, because it is not "the first" packet. So the solution is clearing the conntrack, [...].
It was a fun day :) I hope this may save some hours of head scratching to somebody in the future.