Debugging Memory Leak for Python Asyncio Application

19 Dec 2018

Recently I had a chance to debug memory leaks in a program. This was not the first time I had to debug a leaking program, but this one was a little special: it was a Python asyncio web server. It was special because traditional tools (like valgrind) do not work well with Python, and an asyncio application can be quite tricky to debug due to the complex interactions among coroutines.

In general, the steps to tackle a memory leak problem (or, really, any type of bug) are as follows:

1. Gather information to pinpoint where the leak is.
2. Figure out why it is leaking.
3. Fix it.

Usually the first step is the trickiest of the three: you need to know what information is available and useful for diagnosis and where to find it, or, put differently, what tools are available to help you and how to use them. Even once you have the necessary information, you need to know how to interpret it (perhaps by visualizing some data) and use it to dig further if necessary.

I will explain what I did to tackle the aforementioned Python asyncio memory leak. Before that, though, let's take a step back and think about what could cause a Python program to leak memory.

If you know Python well enough, you probably know that it is a garbage-collected language, just like many other modern programming languages. That means, as a programmer (a user of the language), you normally do not need to worry about allocating memory and then forgetting to free it later on, because the garbage collector does that for you in the background. The presence of a garbage collector greatly reduces the chances of memory leaks. However, it is still possible for a Python program to leak memory. How? you might ask.
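To make this concrete, here is a tiny sketch of my own (not from any particular codebase) showing the reference-counting side of CPython's memory management, where most objects are reclaimed as soon as the last reference to them disappears:

```python
import sys

# CPython frees most objects through reference counting: the moment the
# last reference goes away, the memory is reclaimed. The cyclic garbage
# collector only has to deal with what reference counting cannot handle.
data = [0] * 1_000_000
print(sys.getrefcount(data))  # 2: the name `data` plus getrefcount's own argument
data = None                   # refcount drops to 0, the list is freed right here
```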

There are several scenarios I can think of:

1. Memory leaks in C/C++ extension code used by the program.
2. Reference cycles between objects whose classes implement a custom __del__ method.
3. The application (or a library it uses) holding on to memory longer than expected.
4. Memory fragmentation.

Let’s discuss each scenario in more detail.

The first one is obvious: there is no garbage collector for C/C++ code, so memory leak bugs often arise in that part of the program.

The second scenario is actually not applicable anymore (since Python 3.4+). Before Python 3.4 it was possible to leak memory through reference cycles, when there were circular references between objects AND some of those objects' classes implemented a custom __del__ method. After safe object finalization was introduced (PEP 442), that particular way of leaking should not happen anymore. Having said that, reference cycles should still be avoided when writing an application: Python's cyclic GC only runs from time to time (triggered by allocation heuristics) and works generation by generation, so cyclic isolates are likely NOT to be collected immediately and may linger for a while, which in turn can temporarily contribute to a 'high-water mark' effect. In addition, a GC run has a performance penalty, since it pauses the entire interpreter, and a collection of the oldest generation is the most costly.
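As a rough illustration of why cycles linger (a minimal sketch of my own, not code from any of the linked material): two objects that reference each other are invisible to reference counting and sit in memory until the cyclic collector gets around to them.

```python
import gc

class Node:
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # circular reference: refcounts never drop to zero

make_cycle()         # the pair is now unreachable, but not yet freed
print(gc.collect())  # forcing a full collection reports the unreachable cyclic objects
```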

For a more in-depth explanation of garbage collection in Python (3.6), I recommend having a read of this post.

The third scenario is more interesting, and many (application-specific) situations can trigger it. For example, in an asyncio application, many coroutines run at the same time, many of them are blocked on some IO operation, and each of them is still holding the memory allocated to it, as in this case study. Or, you may have allocated a large number of objects of immutable types (like ints and floats); Python keeps them in so-called free lists for performance reasons, even after they are no longer referenced during the program run. These examples are not really memory leaks, but if they happen in a long-running Python process, you may still observe a monotonic increase in memory usage or a high-water-mark effect, because the memory has not yet been released by the interpreter back to the OS.
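To illustrate the coroutine case, here is a toy sketch of my own (not the linked case study): each handler holds a buffer while it waits on slow IO, so a thousand of them in flight pin roughly a gigabyte without anything leaking in the classic sense.

```python
import asyncio

async def handler(i: int) -> int:
    buffer = bytearray(1024 * 1024)  # ~1 MiB of state kept alive while we wait
    await asyncio.sleep(3600)        # stands in for a slow upstream IO call
    return len(buffer) + i

async def main() -> None:
    tasks = [asyncio.create_task(handler(i)) for i in range(1000)]
    await asyncio.gather(*tasks)     # ~1 GiB stays resident until these finish

# asyncio.run(main())  # not actually run here: it would block for an hour
```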

To better understand free lists and memory management in Python, I found this post, this one and this one helpful. With such an understanding, you can write Python programs that interact well with Python’s memory management algorithms.

The fourth scenario is not necessarily Python-specific: it is caused by memory fragmentation, and it seems to have improved since Python 3.3. Memory fragmentation usually happens when a long-running program (like a server) allocates a mix of short-lived and long-lived objects over time. As a Python application developer, you normally do not need to worry about it. Fragmentation shows up as a high-water-mark pattern in memory profiling.

Enough said, let’s turn back to how I tackled the memory leak problem in the Python asyncio web server.

With the four scenarios in mind, you might make an educated guess that the first and third scenarios are the most likely. But we are engineers; we do not just guess, we need proof.

First of all, I knew I had a memory leak because a monitoring service was watching the node the leaking asyncio web server was running on, and it alerted me to an unusual memory usage pattern.

So, the problem did exist; now, where exactly was it? I tried a few tools:

For my problem, I found the first two tools unhelpful. They seem to be good tools for estimating the overall memory usage of an application, and if the leaked objects are instances of the same type and that type is only instantiated in a few places (for example, some custom classes in the application), these tools can identify that. objgraph is only useful once you have already identified which object(s) in the code are leaking, and it is usually used to visualize the reference relationships around those objects.
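For reference, this is roughly how objgraph tends to be used once you already have a suspect; this is my own sketch, and MyCustomClass is a hypothetical class name:

```python
import objgraph

objgraph.show_most_common_types(limit=10)      # counts of the most common live types
objgraph.show_growth(limit=10)                 # growth in counts since the previous call

suspects = objgraph.by_type('MyCustomClass')   # grab live instances of the suspect type
if suspects:
    # render the chain of references keeping the first few instances alive
    objgraph.show_backrefs(suspects[:3], max_depth=4, filename='backrefs.png')
```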

After playing with the first three tools, all I saw were some generic types, and most of them were garbage collected when I shut down the application server on my local machine. The cause of the problem was still a mystery.

Then I realized that the way I debugged the problem was wrong: I invoked those tools at the server program’s entry point when the server started (this part was correct), and displayed the memory profile when the server shut down (this part was wrong), on my local machine (this was also wrong). It was wrong in two ways. First, if the leak was caused by the third scenario described above, this way of debugging would never catch it, because those dangling memory chunks would very likely be collected right before the server shuts down. Second, many memory problems manifest themselves only in a long-running process under real-world load, which local experiments may not easily reproduce. To do it correctly, I really needed to be able to take snapshots of the live production server’s memory profile at run time: a memory profiling API endpoint to query while the server is running.

Then I found this wonderful post, which was exactly what I needed. In it, the author used tracemalloc and some asyncio introspection utilities. It turns out that tracemalloc is well suited for exactly this situation: identifying the top memory holders in a large application. And it has been part of the standard library since Python 3.4! I adapted the code snippets in that post for my own use and implemented a memory profiling endpoint for the leaking web server (and, of course, deployed that endpoint to the live production server).
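Here is a minimal sketch of what such an endpoint can look like, adapted in spirit rather than copied from that post; the route name, the number of traceback frames, and the plain-text output format are my own choices:

```python
import tracemalloc
from aiohttp import web

tracemalloc.start(25)  # record up to 25 frames of traceback per allocation

async def memory_profile(request: web.Request) -> web.Response:
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")[:20]  # 20 biggest allocation sites
    return web.Response(text="\n".join(str(stat) for stat in top_stats))

app = web.Application()
app.router.add_get("/debug/memory", memory_profile)  # hypothetical route

if __name__ == "__main__":
    web.run_app(app, port=8080)
```

You can then hit the endpoint periodically while the server is under real traffic and compare the top allocation sites over time.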

Guess what? One day after the endpoint was released, I was able to (very easily) pinpoint exactly which object in the code was leaking memory, based on the profiling information provided by the endpoint (step 1). It was an object in the lower-level code of aiohttp holding on to a big chunk of memory. After some googling around, it turned out that there was indeed a recent memory leak bug in multidict, a library used by aiohttp (step 2). And multidict is indeed a Python C extension! So, after knowing the why, the solution became obvious and simple: either downgrade or upgrade aiohttp to a version that does not have the multidict memory leak bug (step 3). Mission accomplished!

In summary, I personally found tracemalloc to be a very useful tool for debugging memory leaks in Python programs. And if you’re debugging a web server, a profiling endpoint is really a good idea!