Everyone gets a Kubernetes war story, right? Well, here’s mine.
At work we have a huge, business-critical C# service that I helped to port to Linux (from the Windows-only .NET framework) to run under Kubernetes. This service emits a large number of logs.
Like, a lot of logs. Many gigabytes of logs per day, in sequentially-numbered text files of 50MB each.
For various reasons the developers of this application preferred to be able to look at these logs as actual files for post-mortem analysis rather than through our existing centralised logging tools. Under Windows they would remote into a server and inspect a mounted drive with the logs.
As part of the porting effort they asked me to emulate this workflow.
Log to the Network Fileshare Like It’s 1999
Since we’re running Azure’s Kubernetes offering, I had the pods mount an Azure Files share as a persistent volume – which is basically a classic network fileshare – and write their logs to that. Developers could then mount that fileshare to their local Windows computers via Azure’s SMB support. Very convenient.
However, it turns out that writing all of these logs to this network fileshare introduced so much latency to normal operations that the application ran about 3x slower on Kubernetes than it ran on Windows, which really undermined the effort we’d made porting it.
To “fix” this I introduced a very simple workaround: I had the app write logs to the local disk, and then I ran a dead-simple rsync command (in a sidecar container) every minute or so to sync these logs to the network fileshare for later inspection.
That’s how we did it in the old days.
– Our CTO, amused.
This worked nicely and brought the performance up to par (better, thankfully, than it had been on Windows).
Soon, we started having a problem with this application: Kubernetes would frequently evict the service’s pods for consuming too much memory and destabilising the cluster, essentially terminating them without warning.
Evictions might be fine for misbehaving stateless applications, but this service was heavily, expensively stateful. Unexpected shutdowns lost us real revenue.
What was puzzling about this was that Kubernetes was claiming that our service was using far more memory than I’d anticipated – think 4GB to an expected 1.5. During the porting effort I’d carefully set the pods' memory requests to match observed memory usage on Windows and test runs on Linux.
At first I wondered if Kubernetes was measuring something incorrectly, so I found a pod with high memory usage that would likely be evicted soon and opened a shell into it. You can check a container’s memory usage once inside by inspecting Sysfs directly:
root@<container># cat /sys/fs/cgroup/memory/memory.usage_in_bytes 4248424448
So it clearly wasn’t measurement error. To dig a little deeper I looked at the full memory usage statistics for the container:
root@<container># cat /sys/fs/cgroup/memory/memory.stat cache 2910048256 rss 1322893312 <snip> active_file 2884849664 <snip>
rss entry (for resident set
size) is close to what we had
used to measure memory usage on
and looked like what I would expect for the service under that particular
So what are those
It turns out that Linux aggressively caches recently-used files in memory, in what is called the Page Cache, in order to improve filesystem performance. Normally this does not constrain applications from using memory, because when the system starts to run low Linux will happily flush older parts of this cache to make the reclaimed RAM available.
But where could 2.8GB of recently-used file memory be coming from?
I was able to confirm (using a handy tool I found on GitHub) that all of the individual log files were indeed in the page cache (about 55 of them in the case above).
Caches to Caches
With this in hand a picture of what was happening emerges:
The log files for the service are cached in memory by Linux, but rarely flushed unless the machine it’s running on happens to be low on total memory.
Kubernetes (rightly) views page cache usage as part of a container’s workload, so it counts when making resource allocation/scheduling decisions.
Scheduling decisions also include kicking overly greedy pods off the cluster, but in this case it’s actually a Linux kernel optimisation that makes the application look more memory-hungry than it really is.
I discovered much later that we’re not the only ones with this problem: including the page cache in the memory calculations for an eviction is considered a longstanding bug in Kubernetes itself that has received ample commentary. As you’ll discover in that thread, it’s a tricky issue to solve, which is probably why it hasn’t been.
It does turn out to be possible to remove individual files from the page cache
manually by using a Linux-specific
flag. Years ago
Tobi Oetiker actually proposed modifying
itself to do this so that the files it copied
did not end up in the page cache – but this was never merged upstream.
We briefly considered using a patched version of
rsync or even running a
second sidecar to periodically
posix_fadvise() the log files to get them out
of the cache.
Eventually we settled on a less obscure method: we
the C# service to keep only the three most recent log files (though files still
have a unique suffix to prevent collisions). Provided it doesn’t log more than
150MB in the one-minute window between
rsync runs, this means we end up with
all of the logs on the network fileshare, while still deleting them on the local
disk to free up space in the page cache.
All of this just reinforced for me the wisdom of Charity Majors’s collection of lies we believe about logging:
“Logs are cheap.”
“They’re just logs, they can’t impact production reliability.”
“They’re just logs, they can’t possibly impact my code efficiency.”