A day in the life of a memory leak hunter

By Attila Szegedi, on Friday, 4th February 2005

Well, today was a rather dynamic end-of-work-week day. If I happened to be geographically close enough to any of my workmates that going for a beer after work were feasible, I'd probably invite them for one (even if I actually prefer wine) and tell them about my day. Lacking this, I'm still sitting in front of the machine, attempting to satisfy my social needs by typing this away. Well, don't feel sorry for me - I actually got to play a few rounds of a board game with my son this evening, so it's not as if I don't have a life. But he doesn't really appreciate tales of software development adventures. Well, at least not yet. But I digress. As far as digressions go, I might as well fetch myself a cup of Jägermeister first.

Actually, it all started yesterday when I arrived home from my car purchase trip. I was incredibly happy about the car deal; actually, I'm still incredibly happy about it (unfortunately I can only pick it up at the end of next week), but my overall happiness level suffered a noticeable drop when I got a mail in my inbox from a coworker saying that the logs of a particular service in our dev cluster were littered with OutOfMemoryErrors.

Now, admittedly, Java can spoil us developers. Whatever goes wrong, you get a nice execution stack trace that usually pinpoints the place of failure, and the bug is usually somewhere in close proximity to it. Bugfixing has never been so easy. Really.

Well, not with Mr. java.lang.OutOfMemoryError. OOME is the most hideous, most sinister, most evil java.lang.Throwable you can encounter. Well, its siblings java.lang.InternalError and java.lang.UnknownError are probably just as bad, but I never encountered them in the wild and don't know anyone else who did. So, aside from the JVM's own native code crashing on you, OOME is the worst thing you can get to debug in Java land. It's nonlocal - the location of the bug and the location of the failure usually have nothing to do with each other. The code where it is thrown is probably innocent; it just couldn't allocate memory because some totally unrelated code is not behaving nicely: it keeps a strong reference to some object it shouldn't, and thereby prevents some big chunk of the object graph (long forgotten and thought to be abandoned for good by the rest of the code) from being reclaimed by the garbage collector. That's the closest you can get to a memory leak in Java - a "loitering object" (more often "a loitering bunch of interconnected objects that, taken together, amount to a rather impressive memory footprint"). That's an object that should have been taken away by the garbage collector, but some other piece of code secretly still holds a reference to it, and doesn't let the GC have it.
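To make the "loitering object" notion a bit more tangible, here's a contrived minimal sketch (nothing to do with our actual service; the names are made up purely for illustration): a static cache that nobody ever clears keeps every payload strongly reachable, so the GC can never take them away, no matter how long ago the rest of the code forgot about them.

    import java.util.ArrayList;
    import java.util.List;

    public class LoiteringExample
    {
        // never cleared, so everything ever added stays reachable forever
        private static final List CACHE = new ArrayList();

        static void process(byte[] payload)
        {
            CACHE.add(payload); // the "secret" strong reference outlives the caller
            // ... do something useful with payload ...
        }
    }

Run process() in a loop with big enough payloads and you'll meet OOME soon enough, even though no live code cares about those payloads anymore.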

So, how do you debug this beast? After the initial phases of disbelief, depression and despair (in arbitrary order) you have a few choices that can actually work.

First you need to find out what the loitering objects are. There's no hard rule for this one - you need to analyze logs, add debug statements, and plug stuff like

protected void finalize() throws Throwable
{
    System.err.println(this + " finalized");
    super.finalize();
}

into suspect classes - as to what classes are suspects, well, you need to rely on your intuition. A few years of active experience help :-). With some luck, the problem can be narrowed down to a fairly small reproducible testcase. Then you can proceed with running the narrowed testcase under a memory profiler. Lacking the intuition (or time) for narrowing, another possible approach is to start with the memory profiler and run the full-blown system under it. So after a bit of investigation and identification of a subset of code, I fired up the Free and Open Source and Could Even Make You A Coffee If You Ask It Really Nice profiler plugin in Eclipse and then exercised the system, took a heap snapshot, waited until it went idle and the resources that are expected to be cleaned up after a certain idle time were destroyed, exercised the system some more, watched the resources get recreated, waited until they again expired and were destroyed, took another heap snapshot, and looked at what that gave me.

Now, one problem with all memory profilers is that sometimes they want to make your life too convenient. Not just the commercial ones, where you can expect a certain percentage of the requirements spec to come from the marketing department, but even this free one. They'll happily show you references between "ordinary" objects, but they'll also assume you're not too interested in the references between ClassLoader and Class objects, or between Class objects and their static fields. In case I didn't mention it, those "resources" mentioned above are actually reloadable code pulled in by a separate class loader and then disposed of by releasing all references to the class loader and all its classes. So I was pretty much certain that whoever is guilty of Illegal Possession of an Object Reference He Does Not Own is actually holding onto a ClassLoader instance.

Nevertheless, eventually I spotted it - and it was none other than Apache Jakarta Commons Logging. Between you and me, it's a huge pile of crap. It's so bad that I don't even use it in any code I write. How can it then be there, you ask? Well, unfortunately, some of the 3rd party libraries we load on demand through the class loader do rely on it, so it's not like we have a choice. Our code doesn't use it, but it must be there as an additional dependency of certain 3rd party libs that didn't know better. So, it's there, and it's manically hoarding all the class loaders that I'm trying to throw away.

Now, holding a reference to a class loader is a big deal. If you hold a class loader, you also hold all the Class objects it loaded, and all objects reachable through the static fields of those classes. That can be a fairly big bunch. As it turned out, it is not a bug that Commons Logging manically clings with a strong (instead of weak) reference to each and every class loader it ever sees. It's actually a - surprise - feature! The consequence of its terrible design is that it must do it. The guys who wrote it were at least nice enough to provide a

public static void release(ClassLoader classLoader)

method in the LogFactory class, so we can tell it to let go of that particular class loader we're about to retire. Now, it's an incredibly ugly solution, because now I had to import org.apache.commons.logging.LogFactory as a compile-time dependency and add a LogFactory.release(classLoader) call to my own class loader manager class's destroy() method, regardless of the fact that my code has nothing to do with commons-logging. Reminds me of those good ol' days of C programming where each library had its own memory reclaiming function (often named "release" - what a coincidence) with a more-or-less bizarre set of rules about when and how to call each in order to keep everything ticking. So, commons-logging is a big slap in the face for automatic Java memory management, I tell you. Thank you guys for bringing back the days of explicit release() calls into Java.
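In case you're curious what that looks like, here's a minimal sketch of the shape my class loader manager ended up with (the class and field names here are illustrative, not the actual ones from our code):

    import org.apache.commons.logging.LogFactory;

    public class ReloadableModule
    {
        private ClassLoader moduleClassLoader;

        public void destroy()
        {
            // Tell commons-logging to let go of its strong reference to the
            // class loader we're retiring; without this it never becomes
            // eligible for garbage collection.
            LogFactory.release(moduleClassLoader);
            moduleClassLoader = null;
        }
    }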

At this point I have to interrupt my account of today's happenings - if we were physically together in some pub, this'd be the equivalent of my wife calling me on the phone to tell me she's really tired by this late hour, but would really appreciate seeing me for half an hour to talk about her day before collapsing into bed. Of course, the reality is a bit more convenient, as she doesn't need to actually call me on the phone, and I don't have to walk home from the pub on a chilly night. Well, anyway, before leaving I promise to hook up with you tomorrow and finish the story.


It's tomorrow already (isn't time just flying when you read a good story?). I decided not to publish the above part written yesterday, so I wouldn't leave you in suspense and jeopardize your night's sleep with it. I also have to mention that I got so busy with writing yesterday that in the end I forgot about my anticipated cup of Jägermeister. And since drinking for breakfast is not my idea of a lifestyle, I'm compensating for it this morning with a big mug of coffee-in-a-hot-chocolate.

Anyway, we left off yesterday when I had to let a piece of commons-logging gain a foothold in my class loader management code. Well, it's an ugly thing, and I thought that maybe I'd get around to ranting about it on the Jakarta Commons mailing list, and maybe I'd even send them a patch, but for the time being I was happy because the offender guilty of withholding custody of my class loader was caught and dealt with. Just to be sure, I ran everything again, and - oh, no; there's still an OutOfMemoryError showing up. At this point I could feel a slight paranoia and suspected the computer might be mocking me. Check if we're indeed running with the fixed code. We are. Hm... Okay, back to the memory profiler, which pretty much tells me the ClassLoader still isn't garbage collected.

So, commons-logging was not a lone offender, no sir. He's got a partner. I should have known that good detective stories don't have such an easy ending. Now, this partner of his is hard to track down. Sometimes I can catch a heap snapshot where he holds the class loader, on other occasions he doesn't. He's playing hide-and-seek with me. On further investigation, it turns out the culprit lies in sun.net.www.http.KeepAliveCache. Well, stuff with "keep alive" in its name is suspicious at best. It turns out it's a caching mechanism for HTTP 1.1 persistent connections. Whenever the code loaded through the class loader makes an HTTP request, the sun.net.www.http.* stuff leaves the TCP connection open so it can be reused on a subsequent HTTP request. It spins up a thread that will close the connection after a timeout. Unfortunately, as a side effect the thread also stores a strong reference to the class loader (in its inherited access control context field, which has a reference to a bunch of protection domains, which refer to the class loader). The timeout explains why it was a now-you-see-me-now-you-don't affair - the thread eventually terminated and let go of the class loader, but sometimes not fast enough to prevent an OOME.

If you're in for a quick mind exercise, think about how you would fix this. There's no option to turn off persistent connections in the sun.net.www.http.* stuff; you pretty much can't configure it or mess with it in any legal or illegal way. Granted, you can always write a URL stream handler that'll use, say, Jakarta Commons HttpClient (I'm growing suspicious of Jakarta Commons stuff, btw), but it's a very heavyweight approach.

Well, after banging my head against the wall for a while, I eventually tackled the problem at the other end. Since I was lucky enough to control the HTTP server code too, I just added a

Connection: close

header to all HTTP responses it generates. It forcibly terminates the TCP connection after each request. No more persistent connections. Simple. Works. Problem solved.[1]
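Just for illustration, if the server side happened to be servlet-based (this is only a sketch of the idea, not necessarily what our server code looks like), the workaround amounts to something like this:

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;

    public class ConnectionCloseFilter implements Filter
    {
        public void init(FilterConfig config) {}
        public void destroy() {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException
        {
            // Forbid HTTP 1.1 keep-alive: the connection goes away after each
            // request, so the JDK's KeepAliveCache thread on the client side
            // has nothing to keep alive.
            ((HttpServletResponse) res).setHeader("Connection", "close");
            chain.doFilter(req, res);
        }
    }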

Or so I thought.

The class loader still wasn't being unloaded. Gosh, not only were commons-logging and sun.net.www.http partners in crime, there's also a third offender somewhere! I'm facing a freaking crime syndicate!

The third guy is even more shady. He holds the reference so stealthily that the Eclipse profiler can't spot it at all. It's not straightforward business as with commons-logging, which visibly held the reference, nor the trickier business as with sun.net.www.http, which only held it for a period of time (still long enough to cause an OOME in certain circumstances). It's completely invisible.

I attribute this to the incompetence of the profiler, which doesn't even see the class loader as being reachable at all (it is; I can tell from the memory usage and from the fact that after submitting two workloads to the service with a 16M heap limit the OOME arrives). So I go see if I can obtain another evaluation license for a certain commercial profiler. I notice there's now a freeware edition of it, so I grab it, install it, and fire it up. I'm immediately disappointed, as the freeware edition is only good for diagnosing performance problems - it's a CPU profiler only. The memory profiler is not available in the freeware edition. What the heck, let's check out a competing product. Another download, after a sell-your-soul-to-us type of evaluation license agreement, another install and config, and we're ready to go. (A particular reason I'm not mentioning the names of either of them is that I'm not sure whether I'm allowed to publish any results of their evaluation under the click-through license; better to be on the safe side.)

I must say I quite like this commercial profiler after this little exercise in memory leak investigation. However, it isn't too helpful either. Well, at least it shows that the class loader is still there in memory, but it also fails to pinpoint why it is still there. It suffers from the same problem of not considering references to Class objects significant.

Well, I'm out of luck. There's another offender at large in my JVM, and even the commercial investigation tool can't help me find him. As the generations-old engineering saying goes: when the high-tech approach fails, try the low-tech approach.

So I go back to my code and start methodically butchering out parts of it. As the bulk of the service is actually the Mozilla Rhino JavaScript interpreter, I cut out all of the JavaScript execution logic and see if we still have a leak (having abandoned the profilers, I now run the service from the command line with -verbose:gc and watch for when I see a bunch of classes being unloaded - a sure sign the leak was cleared up). Okay, there's no leak after it's cut out. I narrow the cut-out region more and more, running it in each iteration and watching whether I see the classes unloading. The tracks indeed lead to Rhino, and eventually...

... I have him. It's the most amazing case of a dangling reference I have encountered so far in my career, I tell you. Even when I had it in front of my eyes, it took me a while to realize why it is a dangling reference at all. In a way, it's so subtle it's downright beautiful.

It's in Rhino's org.mozilla.javascript.Context class. It uses a ThreadLocal to store the current Context instance for the thread. This is harmless in itself, as the correct usage of the Context API is

   Context.enter();
   try
   {
       doJavaScriptStuff();
   }
   finally
   {
       Context.exit();
   }

and Context.exit() clears the ThreadLocal contents, so there's no remaining instance in it.

But! Whoever wrote the code is not just storing the Context in the ThreadLocal. No, he chose to store a one-element array of Contexts (new Context[1]) in the ThreadLocal on first access. That's done so that once you retrieve it, you can easily set a new value into it - you just do

contextStorage[0] = context;

instead of

threadLocal.set(context);

Okay, it's not significantly simpler, but it's faster to get/set an array element than to call threadLocal.get() and threadLocal.set() all the time - those both involve at least a Thread.currentThread() call and a map operation. The Context[] array is never removed from the thread local; only its 0th element is set to a value in Context.enter() and to null in Context.exit(). The array itself remains there all the time. And that's the problem.
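Here's a simplified paraphrase of that pattern - this is NOT Rhino's actual code, just the shape of it with made-up names, so you can see where the array lives:

    public class ContextLikeSketch
    {
        private static final ThreadLocal threadLocal = new ThreadLocal();

        private static ContextLikeSketch[] getStorage()
        {
            ContextLikeSketch[] storage = (ContextLikeSketch[]) threadLocal.get();
            if (storage == null)
            {
                storage = new ContextLikeSketch[1]; // cached once per thread...
                threadLocal.set(storage);           // ...and never removed again
            }
            return storage;
        }

        public static void enter(ContextLikeSketch cx)
        {
            getStorage()[0] = cx;   // fast: just an array store
        }

        public static void exit()
        {
            getStorage()[0] = null; // the slot is cleared, but the array stays behind
        }
    }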

So, why does having a Context[] in the ThreadLocal cause a dangling reference to the ClassLoader that loaded Rhino? To understand this, you need to understand how ThreadLocal is implemented. Each thread has a map (effectively a WeakHashMap) where the keys are ThreadLocal instances and the values are the values of those ThreadLocals for that thread. A rule for a WeakHashMap - where keys are held with weak references and values with strong references - is that a value MUST NOT hold a direct or indirect strong reference to its key, otherwise the entry can't be cleared until the thread dies. As a matter of fact, our threads never die, as they're permanent workers created by JMS sessions to execute javax.jms.MessageListener.onMessage(), so if there's a reference from the value to the key, it'll never clear up. And anyway, getting rid of dangling references by periodically recreating all the threads is uncool. Now watch this chain of references from the value to the key ("-->" means a direct strong reference) in the thread's map of thread locals:

key   == the ThreadLocal instance in question
value == the Context[] instance

value == Context[] instance --> Context[].class --> Context.class
      --> the Context.threadLocal static variable --> the ThreadLocal instance in question == key

or, in short: value --> ... --> key

BINGO! Naturally, it's also true that Context.class --> its own ClassLoader, so that's why it couldn't be unloaded.
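If you want to see that rule in action in isolation, here's a tiny contrived demo using a plain java.util.WeakHashMap (the thread's thread-local map isn't literally a WeakHashMap, but it holds its keys weakly in the same fashion); the value holding a strong reference back to its key pins the entry forever:

    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakHashMapPinned
    {
        static class Holder
        {
            final Object key; // value --> key: this is the forbidden back-reference
            Holder(Object key) { this.key = key; }
        }

        public static void main(String[] args)
        {
            Map map = new WeakHashMap();
            Object key = new Object();
            map.put(key, new Holder(key));
            key = null;      // drop our own reference to the key...
            System.gc();     // ...but the key stays strongly reachable through the value
            System.out.println(map.size()); // prints 1 - the entry can never go away
        }
    }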

All that was left was to fix it. And here's the stunning part. That Context[] array was used only to temporarily store a Context instance between Context.enter() and Context.exit(). Now, thanks to Java polymorphism as applied to arrays (a Context stores into an Object[] just fine), we can use an Object[] array instead for the purpose of temporary storage. So, by simply storing a new Object[1] in the ThreadLocal instead of a new Context[1], the memory leak was dead. Gone. Kaputt. Running the service again with -verbose:gc finally showed that all the classes belonging to the custom class loader get unloaded after a period of idleness. It was what I was waiting for all day. It was bliss to see it.
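In terms of the little sketch above, the whole fix boils down to changing the component type of the cached array, nothing else:

    // before: storage = new ContextLikeSketch[1];
    Object[] storage = new Object[1];
    threadLocal.set(storage);
    // Object[].class is defined by the bootstrap class loader, so the cached
    // array no longer keeps the script engine's classes (or their class
    // loader) reachable through its own class.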

So, we sacrificed a bit of static type safety by using a more generic type Object[] instead of a more specific type Context[], but we gained a fix for an incredibly subtle memory leak. I sent the patch to Mozilla (if you're really interested), and it is already incorporated into the Rhino CVS HEAD. As amazing as this dangling reference was, the fix was even more amazing. I fixed a memory leak by changing the component type of an array! How often do you get to do that?

Hope you understand why I felt like sharing this with someone over a beer.

Oh, and the bottom lines:

Attila.

Update: as much as I loathed it happening, I have Yet another day in the life of a memory leak hunter to tell you about.


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

Endnotes

[1] Eugene Grant kindly pointed out to me that in case you can't control the headers, you can resort to using the -Dhttp.keepAlive=false command line switch when launching your JVM.