Yet another day in the life of a memory leak hunter

By Attila Szegedi, on Tuesday, 8th February 2005

It looks like the rather taxing activity of debugging Java memory leaks wasn't over last Friday. The OutOfMemoryError unfortunately reared its ugly head again on Monday night. Since I was now at least sure that all the offenders rounded up last Friday were no longer causing trouble, I was free to focus on scrutinizing other areas of the system.

Regardless, the immediate feeling of being lost and helpless when faced with an OOME wasn't absent this time either. I started working on it slowly, the one apparent thing being that the more trivial test cases produced no OOME. Only when the full-blown system with complex scripts was set in motion did it come out. I was again suspicious of Rhino, thinking it could perhaps be hiding another dangling reference somewhere.

I pretty soon discovered an offender. After Jakarta commons-logging, I found another one in commons-pool (which is used by commons-dbcp, a database connection pooling library). I had accidentally switched on its "idle evictor", which starts up a thread with a Runnable that periodically checks whether the "idle" objects in the pool should be "evicted" from it.

Now, check out part of the code for the evictor class:

    class Evictor implements Runnable {
        // _delay, evict() and ensureMinIdle() are defined elsewhere in commons-pool (elided here)
        private boolean _cancelled = false;

        void cancel() {
            System.err.println("Cancelled evictor " + this);
            _cancelled = true;
        }

        public void run() {
            while(!_cancelled) {
                try {
                    Thread.sleep(_delay);
                } catch(Exception e) {
                    // ignored
                }
                try {
                    evict();
                } catch(Exception e) {
                    // ignored
                }
                try {
                    ensureMinIdle();
                } catch(Exception e) {
                    // ignored
                }
            }
            ...
        }
    }

The cancel() method is called externally when the thread running the evictor should die. Now, if you know the Java memory model, you'll immediately see the problem with this code: under the Java memory model it is perfectly legal for the thread running the loop in run() to never observe the fact that another thread switched the _cancelled flag to true. As it turns out, the JIT in Sun's server VM apparently generates code that exploits this. If there were a synchronized block around reading and writing the flag in both cancel() and run(), or even if the field were declared volatile, this couldn't happen. In this case I got to experience the Java memory model at work firsthand, as the looping thread never observed the updated value of the flag. The thread kept running indefinitely, pinning the class loader via the already well-known reference path (inherited access control context holding a protection domain holding the class loader).
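For illustration, here's a minimal sketch of the shape such a fix would take — declaring the flag volatile (or guarding both accesses with synchronized blocks) forces the write in cancel() to become visible to the looping thread. This is not the actual commons-pool code, just the shape of it:

    class Evictor implements Runnable {
        // volatile guarantees that a write by one thread becomes visible
        // to subsequent reads of the field by other threads
        private volatile boolean _cancelled = false;

        void cancel() {
            _cancelled = true;
        }

        public void run() {
            while(!_cancelled) { // now guaranteed to eventually see the update
                // ... sleep, evict(), ensureMinIdle() as before ...
            }
        }
    }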

Actually, fixing this was easy: since the evicting behavior had been turned on by accident, I just had to turn it off.
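In commons-dbcp this comes down to a single pool setting — a non-positive timeBetweenEvictionRunsMillis (which is the default) means no evictor thread is started at all. A minimal sketch, assuming the pool is configured directly through commons-dbcp's BasicDataSource (the driver class and URL here are made up):

    import org.apache.commons.dbcp.BasicDataSource;

    public class PoolSetup {
        static BasicDataSource createDataSource() {
            BasicDataSource ds = new BasicDataSource();
            ds.setDriverClassName("org.hsqldb.jdbcDriver"); // hypothetical driver
            ds.setUrl("jdbc:hsqldb:mem:example");           // hypothetical URL
            // a non-positive value means the idle evictor thread is never started
            ds.setTimeBetweenEvictionRunsMillis(-1L);
            return ds;
        }
    }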

Just as on Friday, it was too early to rejoice after catching the first source of the leak. Another offender was still at large: the custom loaded classes still weren't unloaded after I had released all known references to the class loader.

I even employed a trick to ensure that the leak was not associated with thread-held resources (security contexts, thread locals, the thread context class loader, and the like). As early as possible in the application logic, I replaced the main entry point into the bulk of the execution:

    doStuff();

with

    Thread t = new Thread(new Runnable()
    {
        public void run()
        {
            doStuff();
        }
    });
    t.start();
    t.join(); // join() throws InterruptedException; handle or declare it as appropriate

After I did this, the leak still didn't disappear. Now, this doesn't mean the leak isn't thread-related; it merely means that it definitely has a non-thread-related cause (though it can still have a thread-related cause as well).

I'll spare you the tedious details of a lost afternoon, hair pulled out, and almost getting run over by a car for lack of attention to the physical world while going to pick up the kids from kindergarten (well, at least I wasn't so absent-minded as to drag home somebody else's kids). When, a few hours later, I finally discovered it, I had to admit it was a lesson learned the hard way. Here I'm going to spare you a similar afternoon if you ever find yourself in my shoes. Here's the lesson, which I'll put into a nice red box for emphasis:

Never, ever, load a JDBC driver through a class loader that is meant to be eventually thrown away. Move the JDBC driver classes into a location where a permanent class loader (the system or application class loader) will pick them up.

Actually, I should have known better than to fall into this trap in the first place. Unfortunately, all the abstraction layers (Spring, connection pooling, etc.) make it easy to forget that sooner or later an instance of the JDBC driver class gets registered with java.sql.DriverManager, which is a system class loaded from the boot class path and is therefore JVM-global. This gives another very good reason not to load JDBC driver classes through a disposable class loader: you might end up with several instances of them, all claiming responsibility for the same JDBC URL prefix, and that can also spell trouble in some circumstances (e.g. if they are not all the same version of the driver, only the first one loaded gets to handle all connections for the prefix, and some code might break because it expects a later loaded, essentially inaccessible version to handle the connections).
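To make the reference chain concrete, here's a minimal sketch of how the leak arises (the driver class name and JAR path are made up). The driver's static initializer registers an instance with the JVM-global DriverManager, and that instance's class pins the disposable class loader:

    import java.net.URL;
    import java.net.URLClassLoader;

    public class DriverLeakDemo {
        public static void main(String[] args) throws Exception {
            // a class loader we intend to throw away later
            URLClassLoader disposable = new URLClassLoader(
                    new URL[] { new URL("file:/path/to/driver.jar") }); // hypothetical path

            // Loading the driver runs its static initializer, which typically calls
            // DriverManager.registerDriver(new SomeDriver()). DriverManager is a system
            // class on the boot class path, so it is JVM-global and now holds a strong
            // reference to a Driver instance. That instance's Class references the
            // disposable loader, so neither the loader nor any class it loaded can
            // ever be garbage collected.
            Class.forName("com.example.SomeDriver", true, disposable); // hypothetical driver

            disposable = null; // releasing our own reference doesn't help; DriverManager still pins it
        }
    }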

The bottom line: JVM-wide singletons are bad for scalability, and java.sql.DriverManager is no exception. By being forced to move the JDBC driver's JAR file higher up in the class loader chain, I lost the ability to use hot-reloading class loaders to upgrade the JDBC driver on the fly, without restarting the JVM.

Well, this little article turned out a bit terse, but believe me, it is the product of hunting these problems down for a full workday. I hope somebody else learns from my experience too. Speaking of experience, the commercial profiler was again of no use in tracking this one down. Then again, maybe it's just me not knowing how to use it.

Attila.

The saga goes on in "Memory leak hunting still doesn't end, or: Serialization also requires a bit of manually imposed amnesia"


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.