Pitfalls of executing Java code from remote JAR files

By Attila Szegedi, on Thursday, 2nd June 2005

Foreword

Sun Microsystems had - or maybe still has - the slogan "The Network is the Computer". Java is one of their premier technologies, and they market it (as much as you market something you give away for free, anyway) as an ideal development and runtime environment for networked systems. It's only when you start to develop something that is really fundamentally networked in nature that you figure out that the myth of Java being born for networking is even less true than the "Write Once Run Anywhere" promise. While there are many facets to this, many of them probably unknown to me yet, in this article I'll concentrate on a single aspect: running code from a JAR file that is accessed over HTTP.

Use case

Why would I want to do it anyway? Why not download the JAR file and run it locally? In a word: centralized deployment. Imagine a networked system where many machines are running the same code. Do you really want to deploy and configure the software on each one of them? I don't want to. Nor do the ops staff at the company I work for find much amusement in such a repetitive and error-prone task. Instead, we have a centralized HTTP server, and all JVMs pull their code from there. We can even have all JVMs hot-update their code by simply updating it on the central HTTP server - but I'm jumping too far ahead. For now, suffice it to say we're just doing centralized deployment.

The failure of java -jar

The first thing you'd be inclined to attempt is to package your code into executable JAR files (those declaring the Main-Class attribute in their manifest) and use java -jar to launch them:

java -jar http://centralserver/mycode.jar

Sounds intuitive, doesn't it? Too bad you get this:

Unable to access jarfile http://centralserver/mycode.jar

And indeed, the tools documentation in the JDK docs explicitly says that the -jar switch works only with a local file, not with any URL. That's especially a shame considering that the most used class loader in Java (the one servicing the app and ext classpaths) is URLClassLoader, capable of loading code from whatever type of URL the JVM can handle - and considering it took me only two hours to whip up a small component that emulates the -jar switch and can run a JAR file from an arbitrary URL. Emulating the -jar feature requires reading the JAR file's manifest to see if it declares any dependencies in the standard Class-Path attribute, making sure the URLs of those JAR files (and their further dependencies, until we reach the transitive closure) are all enumerated, packing them all (plus the original URL specified on the command line, of course) into an array of URL objects, creating a URLClassLoader over that array, and finally loading the JAR file's main class and invoking its main() method. Not a big deal, but it would have been really nice if Sun had provided this trivial piece of functionality out of the box instead of forcing me to write it. (I must admit, though, that it was fun to write...)
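To make the above description a bit more concrete, here is a rough, hypothetical sketch of such a launcher - not the actual urlLauncher.jar, just an illustration with made-up names and no error handling:

import java.lang.reflect.Method;
import java.net.JarURLConnection;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class UrlLauncher {
    public static void main(String[] args) throws Exception {
        URL jarUrl = new URL(args[0]);
        List urls = new ArrayList();
        Set seen = new HashSet();
        // Enumerate the JAR and its transitive Class-Path dependencies,
        // remembering the Main-Class of the top-level JAR.
        String mainClass = collect(jarUrl, urls, seen);
        URLClassLoader loader = new URLClassLoader(
                (URL[]) urls.toArray(new URL[urls.size()]));
        // The remaining command-line arguments belong to the launched program.
        String[] mainArgs = new String[args.length - 1];
        System.arraycopy(args, 1, mainArgs, 0, mainArgs.length);
        Method main = loader.loadClass(mainClass)
                .getMethod("main", new Class[] { String[].class });
        main.invoke(null, new Object[] { mainArgs });
    }

    // Reads the manifest of jarUrl, follows its Class-Path entries recursively
    // and returns its Main-Class attribute (or null if there is none).
    private static String collect(URL jarUrl, List urls, Set seen) throws Exception {
        if (!seen.add(jarUrl.toExternalForm())) {
            return null; // already visited
        }
        urls.add(jarUrl);
        // Opening a jar: URL connection reads the manifest (and, as discussed
        // later in this article, triggers the JVM's local caching of the JAR).
        JarURLConnection conn = (JarURLConnection)
                new URL(jarUrl, "jar:" + jarUrl + "!/").openConnection();
        Manifest mf = conn.getManifest();
        if (mf == null) {
            return null;
        }
        Attributes attrs = mf.getMainAttributes();
        String classPath = attrs.getValue(Attributes.Name.CLASS_PATH);
        if (classPath != null) {
            for (StringTokenizer t = new StringTokenizer(classPath); t.hasMoreTokens();) {
                // Class-Path entries are URLs relative to the referencing JAR.
                collect(new URL(jarUrl, t.nextToken()), urls, seen);
            }
        }
        return attrs.getValue(Attributes.Name.MAIN_CLASS);
    }
}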

After packing this newly developed class into an executable JAR file of its own, named urlLauncher.jar, the command line becomes very similar to the non-working one above, with only a single small addition - and this time it works as expected:

java -jar urlLauncher.jar http://centralserver/mycode.jar

We do need to deploy a single 3K urlLauncher.jar file to each machine participating in our networked distributed system though, so our dreams of true zero local deployment are crushed. On the bright side, we still have near-zero local deployment.

Something positive [1]: versatile URL operations

In order to say something positive as well: Java's URL class is rather versatile and supports all the operations you'd probably need, including - naturally - opening a stream to the resource described by the URL, and constructing URLs relative to another URL. It is also rather easy to figure out the URL a class was loaded from, so if you need to access configuration files and whatnot that are located in a well-known place relative to the JAR file, the code can easily reach those. In this regard, Java really lives up to the promise of a web-ready system (as long as we associate "web" with a collection of resources described by URLs, which fits our purpose of accessing JAR files over HTTP).
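For illustration, here is a small, hypothetical example of the kind of URL arithmetic described above; the class and resource names are made up:

import java.io.InputStream;
import java.net.URL;

public class ConfigLocator {
    // Opens a resource that lives next to the JAR (or directory) the given
    // class was loaded from. Note that getCodeSource() can return null for
    // classes that don't come from a regular code source.
    public static InputStream openRelativeResource(Class clazz, String name)
            throws Exception {
        // e.g. http://centralserver/mycode.jar
        URL codeSource = clazz.getProtectionDomain().getCodeSource().getLocation();
        // e.g. new URL(codeSource, "config.properties") resolves to
        // http://centralserver/config.properties
        return new URL(codeSource, name).openStream();
    }
}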

Default JAR caching

Okay, suppose you create such a URL launcher and you are now able to launch executable JAR files from an arbitrary URL. What Sun's implementation of the jar: protocol handler does for remote JAR files is copy them to a local file in the java.io.tmpdir directory, named jar_cache12345 or similar, with the deleteOnExit flag set, so the file gets deleted when the JVM exits. This is reasonably sane behavior, since this way it can use random access file operations to find and load entries from the JAR. So far so good. You'd think it's as good as it gets.

Well, it is, unless you ever plan to implement code reloading.

A little detour for a piece of advice: leveraging default JAR caching

A word of advice: if you manipulate the remote JAR file in any way (e.g. read its manifest) and expect that you'll need to access it again later (e.g. pass its URL to a class loader so it loads classes from it), then open a JAR URL connection to it. That will trigger the local caching immediately, you'll read the manifest from the local copy, and the class loader will also use the same local copy later on to load classes from. This means only one HTTP request is made to the HTTP server for that JAR file during the lifetime of the JVM. To translate the above advice into code, don't do this:

URL urlToJar = ...; // some HTTP URL
JarInputStream in = new JarInputStream(urlToJar.openStream());
Manifest mf = in.getManifest();
...

do this instead:

URL urlToJar = ...; // some HTTP URL
URL jarUrl = new URL(urlToJar, "jar:" + urlToJar + "!/");
JarURLConnection conn = (JarURLConnection)jarUrl.openConnection();
Manifest mf = conn.getManifest();
...

Another little detour for another piece of advice: preserving URL stream handlers

You might have noticed that I used a peculiar way to construct the jar: URL in the example above. That is, instead of simply:

URL jarUrl = new URL("jar:" + urlToJar + "!/");

I used:

URL jarUrl = new URL(urlToJar, "jar:" + urlToJar + "!/");

This little trick causes the new URL to use the same java.net.URLStreamHandler that the old URL uses, rather than the JVM's default URL stream handler. The Java platform allows you to plug in custom URL stream handlers at various points: the URL class has constructors that take one, and the URLClassLoader class has a constructor that takes a custom URL stream handler factory, so all URLs handed out from its getResource() method will use handlers created by that factory. To be on the safe side, it is best to adopt the practice of always using this two-argument constructor when creating one URL from another, so you can be sure the new URL will use the same stream handler as the old one. It is also a very good reason not to pass URLs around your code as strings whenever you can avoid it, but to always use actual java.net.URL instances.
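As a small, hypothetical illustration of the two plug-in points mentioned above (SomeCustomHandler stands in for whatever URLStreamHandler implementation you supply):

// Plug-in point 1: construct a URL with an explicit stream handler.
URLStreamHandler handler = new SomeCustomHandler();
URL remoteJar = new URL(null, "http://centralserver/mycode.jar", handler);

// Plug-in point 2: give URLClassLoader a stream handler factory; the URLs it
// creates internally and hands out from getResource() are built with handlers
// obtained from this factory.
URLStreamHandlerFactory factory = new URLStreamHandlerFactory() {
    public URLStreamHandler createURLStreamHandler(String protocol) {
        // Returning null lets the JVM fall back to its default handler.
        return "jar".equals(protocol) ? new SomeCustomHandler() : null;
    }
};
ClassLoader cl = new URLClassLoader(
        new URL[] { remoteJar }, ClassLoader.getSystemClassLoader(), factory);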

How default JAR caching bungles code reloading

Imagine now that we use a URL class loader to load code from a JAR file hosted on an HTTP server:

URL urlToJar = ...; // some HTTP URL
ClassLoader cl = new URLClassLoader(new URL[] { urlToJar });
Class clazz = cl.loadClass(...);
... etc. ...

Imagine further that you update one or more classes in the JAR file (i.e. to fix a serious bug), and redeploy the JAR file on the HTTP server. Imagine that you even have some external notification mechanism that you use to notify all the JVMs using your JAR file that it has been updated. What does your code running in these JVMs do? Well, it abandons the old class loader, and then again executes the above code to construct a new URLClassLoader instance to the same URL. The new class loader is created with no classes loaded, and it will have to reload all the classes from the JAR file, this time with updated code, right?

Ha ha. Wishful thinking.

It will still load the old classes. Since the JVM cached the JAR file when it first accessed it, a new class loader created for that URL will again read the classes from the already cached (and now outdated) JAR file. And the bad news is that there is no way to tell the caching mechanism to invalidate a cached JAR file. (Well, at least in the current crop of Sun's virtual machines.)

This is where the real adventure begins - we want to somehow overcome this limitation of the caching. Unfortunately, all solutions to this problem lead us into the sun.* package hierarchy, so they won't work for non-Sun JVMs (but then, maybe some non-Sun JVMs don't have this problem in the first place).

Let's disable the caching

The first idea would be to disable the JAR caching altogether and see where it leads us. It turns out that an internal Sun class, sun.net.www.protocol.jar.JarFileFactory, is in charge of JAR file caching. However, if it receives a java.net.URLConnection object (representing the JAR file) that returns false from a call to getUseCaches(), it will not cache that JAR file locally. Now, the problem is that it's not easy to intercept the creation of these URLConnection objects to set their useCaches property to false. It takes writing a custom URL stream handler and plugging it, through a stream handler factory, into the URLClassLoader that's using the JAR file. I've done that; here's a short summary, paraphrased from my own blog entry:

The fastest solution turned out to be to replicate the Sun built-in factory's behavior (namely, for each protocol it dynamically loads the class sun.net.www.protocol.protocolName.Handler; e.g. for HTTP it's sun.net.www.protocol.http.Handler). Then I used CGLIB to dynamically subclass each handler and add a method interceptor to the openConnection(URL) method that invoked the original method and then disabled caching on the new java.net.URLConnection objects before returning them to the caller. And guess what, it worked flawlessly at once - whenever the JAR files got modified on the HTTP server, the worker JVMs reloaded the code, and this time it was the new code (the HTTP logs showed that the JARs were downloaded again).

As you can see, the task was not exactly trivial: dynamic bytecode generation, method interception, other assorted goodies. Fun! Ok, once this is in place, I can have a URLClassLoader that bypasses caching of JAR files. By the way, with a custom URL stream handler in place, it was now crucial to make every effort to preserve it when constructing new URLs from old ones; remember my little advice from above?
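For the curious, here is a heavily simplified, hypothetical sketch of the same interception idea without the CGLIB machinery - not my actual code, just an illustration of the shape, with made-up names: a stream handler that delegates to the JVM's built-in handler for the protocol and merely flips useCaches off on every connection it opens.

import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class NonCachingHandler extends URLStreamHandler {
    protected URLConnection openConnection(URL u) throws IOException {
        // Re-parsing the URL without an explicit handler makes the JVM pick
        // its built-in handler for the protocol, so we don't recurse into
        // ourselves here; we only disable caching on the resulting connection.
        URLConnection conn = new URL(u.toExternalForm()).openConnection();
        conn.setUseCaches(false);
        return conn;
    }

    // Builds a class loader whose jar: URLs go through the non-caching
    // handler above.
    public static URLClassLoader newNonCachingLoader(URL[] urls, ClassLoader parent) {
        URLStreamHandlerFactory factory = new URLStreamHandlerFactory() {
            public URLStreamHandler createURLStreamHandler(String protocol) {
                // Returning null for other protocols lets the JVM fall back
                // to its default handlers.
                if ("jar".equals(protocol) || "http".equals(protocol)) {
                    return new NonCachingHandler();
                }
                return null;
            }
        };
        return new URLClassLoader(urls, parent, factory);
    }
}

Whichever way the interception is done - generated CGLIB subclasses or a hand-rolled delegating handler like the one above - the key point is the same: every connection the class loader opens for the JAR must report false from getUseCaches().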

However, it later turned out that we had unfortunately thrown out the baby with the bathwater. Sure, we can now load new code whenever the JAR file changes.

But...

Looking at the HTTP server logs while the program is running shows us that the JAR file is now downloaded in its entirety whenever any single class is loaded from it, or whenever a single classpath resource is accessed from it. Gee, that's a lot of HTTP traffic. A question briefly crosses my mind: "what would it have cost them to cache the file even when caching is disabled, but on each access do an HTTP GET request with an appropriate If-Modified-Since header?". That question will be left unanswered, I'm afraid. Anyway, disabling caching ensured that we can load the new code, but it came at quite a high cost in network traffic, as the JAR file is redundantly downloaded again and again. (And it is cached in a temporary file regardless, only that file is now never reused after a single entry has been read from it, so after a while we have a decent pile of temporary files named "jar_cache" cluttering the temp directory.) This is plainly unacceptable as well, so we need to seek yet another solution.
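For what it's worth, the conditional request I'm wishing for would look something like this hypothetical sketch, where lastDownloadTime stands for the timestamp remembered from the previous download of the JAR:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class JarFreshness {
    // Returns true if the server reports that the JAR at urlToJar has changed
    // since lastDownloadTime. On a 304 Not Modified answer no JAR body is
    // transferred; on 200 the caller could go on to read the new JAR from the
    // connection's input stream.
    public static boolean hasChanged(URL urlToJar, long lastDownloadTime)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) urlToJar.openConnection();
        conn.setUseCaches(false);
        conn.setIfModifiedSince(lastDownloadTime);
        return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}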

Let's plug in our own cache

After rummaging through the sun.net.www.protocol.jar package some more, I notice an interface named URLJarFileCallBack. It turns out this is just what I was looking for - an interface with a single method returning a local JarFile instance for a URL. Apparently it's in there so that the Java browser plugin can customize the JAR file caching; at least that's what the source code comment says. Cool. This offers some (well, quite possibly unjustified) hope that the interface will remain in the JDK for some time. There's also a

public static void setCallBack(URLJarFileCallBack cb)

method in the sun.net.www.protocol.jar.URLJarFile class to install a custom callback. So, I go and implement a callback class that does its own local caching of JAR files, but also has a method for invalidating a cached entry. My code can call the invalidator method whenever it gets notified that the JAR file has been updated. All is peachy.
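Here is a hypothetical sketch of the shape such a callback can take - not my actual implementation; the names are made up, and the retrieve(URL) method name is how I recall the interface's single method, which, being a sun.* internal, may differ between JDK versions:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashMap;
import java.util.Map;
import java.util.jar.JarFile;

public class InvalidatingJarCache implements sun.net.www.protocol.jar.URLJarFileCallBack {
    private final Map cache = new HashMap(); // URL string -> local File

    // Called by URLJarFile whenever a JarFile is needed for a URL (assumed
    // signature of the callback interface's single method).
    public synchronized JarFile retrieve(URL url) throws IOException {
        String key = url.toExternalForm();
        File local = (File) cache.get(key);
        if (local == null) {
            // Download the remote JAR into a temporary file we control.
            local = File.createTempFile("jarcache", ".jar");
            local.deleteOnExit();
            URLConnection conn = url.openConnection();
            conn.setUseCaches(false);
            InputStream in = conn.getInputStream();
            try {
                OutputStream out = new FileOutputStream(local);
                try {
                    byte[] buf = new byte[8192];
                    int read;
                    while ((read = in.read(buf)) != -1) {
                        out.write(buf, 0, read);
                    }
                } finally {
                    out.close();
                }
            } finally {
                in.close();
            }
            cache.put(key, local);
        }
        return new JarFile(local);
    }

    // Called from the code-update notification: forget the local copy so the
    // next retrieve() downloads the JAR again.
    public synchronized void invalidate(URL url) {
        File local = (File) cache.remove(url.toExternalForm());
        if (local != null) {
            local.delete();
        }
    }
}

It gets installed through the setCallBack() method shown above:

sun.net.www.protocol.jar.URLJarFile.setCallBack(new InvalidatingJarCache());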

I need to keep my custom URL stream handler though - the one that sets the useCaches property of JarURLConnection objects to false. If the property were set to true, the callback would get called only once for each URL, and then the JarFileFactory would still cache the obtained JarFile internally, and we again couldn't programmatically invalidate the cached copy.

In an ideal world

In an ideal world, Sun would add a static method, preferably to java.net.JarURLConnection, with the signature

public static void invalidateCachedJarFile(URL url)

so that the code can tell it to invalidate a cached JAR file. It'd definitely be possible, as I implemented it myself with a few hours of coding a custom solution. Heck, I just went and raised an RFE for this :-)

Discussion

If you feel like discussing anything related to this article, you can post a comment to my related blog entry.


Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
1 "Something Positive" is an ingenious webcomic of the bitter-sarcastic variety that I'm a big fan of.