Archive for the 'Java' Category
Leak Testing
So, here we are. We have found lots of memory leaks, and drastically improved the redeploy experience for WLP 10.2 (and the fixes are being backported to previous releases).
However, this is not the first time we’ve been here. But it is hopefully the last.
We are putting in place a test environment to keep these leaks plugged. Actually, to be more precise, we’re adding some instrumentation to our integration test suite that drives tests via the IDE. So this test will behave just like a developer - although it’s a rather anti-social and uninteresting developer :).
Note that this technique is not restricted to WLP or redeployment testing, but hopefully you can find it useful for other situations.
Counting ClassLoader Instances
The basis of this technique is that we know that the leaks we are interested in happen to always hold on to the application or webapp ClassLoader. Further, we also know that these ClassLoaders are instances of GenericClassLoader (for the app) or ChangeAwareClassLoader (for the webapp).
So, our instrumentation relies on being able to count the instances of these classes. Fortunately, JRockit can do this easily, without any extra messing about with the startup settings or whatnot.
JRockit comes with this handy jrcmd command. It lets you interact with a running JRockit JVM, like for example the WLS server. You can use it to do all sorts of interesting things:
- find out the JVM’s command line (how it was started)
- print thread dumps
- change the verbosity of the JVM on the fly (things like garbage collector stats)
- trigger a garbage collection
- print a GC report
- print diagnostics about memory usage
All sorts of cool things - and none of these require you to change anything about your running process. No command line switches or anything messy like that.
So, we are looking for a count of the instances of our ClassLoader objects. We do that by running:
jrcmd <wls server pid> heap_diagnostics
(You get the wls server pid by running just jrcmd - it will print the process ids of running JRockit instances.)
If you run that, you’ll see tons and tons of output - including a listing of every class loaded by the system and how many instances there are. So all we need to do is search for the class we are interested in:
jrcmd <wls server pid> heap_diagnostics | grep '%.*GenericClassLoader
That will grep for the one line we are interested in (turns out it contains a percent sign; why is not important here). We’ll get output something like this:
0.0% 10k 150 +10k weblogic/utils/classloaders/GenericClassLoader
In the above, we see that there are 150 instances of the GenericClassLoader.
To get a more accurate count, we also want to trigger a full garbage collection and finalization before counting anything. We can do that with jrcmd also:
jrcmd <wls server pid> runfinalization
Using this information, here’s what we are doing for WLP testing:
- Start the server, with the application undeployed.
- Run finalization and count the GenericClassLoader and ChangeAwareClassLoader instances. This gives us a baseline for the number of instances used by the server itself.
- Deploy the application and use it (i.e. wander through the portal in a browser)
- Count instances again.
- Redeploy the application, use it, and recount.
- Repeat the redeploy step several times.
- Undeploy the application.
- Count instances.
All these counts (and additionally the [re-]deploy times) are all being recorded and analyzed. The most important metric is that the first and last counts are the same - after the application is undeployed, the ClassLoaders used by the application should be gone. We are collecting the intermediate results because we expect they might be useful in failure analysis.
The one last step is that we are running the JVM instrumented for YourKit, and configured to dump a memory snapshot on exit. So if the ClassLoader counts don’t look right, a developer will have the memory snapshot for analysis, without having to rerun the whole test suite.
Hopefully some of this information can be useful if you have leaks you are hunting.
Technorati Tags: JRockit, leak, memory, portal, YourKit, WebLogic
No commentsLeaks We Plugged
In this post is a dump of most of the leaks that we have found and plugged for our upcoming 10.2 release. I present it here not so you can see exactly what was leaking or whatever, but rather to give me a format where I can make some notes about what we encountered, and how we fixed things. Hopefully, that might prove useful to someone.
Note that many of these leaks were never released, but were only present in internal revisions. But the ones that did make it out are being back-ported to patches for 10.0 and 9.2.
I’ve also kind-of-categorized them into a few groupings, where similar patterns have emerged. Some did not fit clearly into just one category, but the breakdown should help other leak-hunters in their quests.
Clean Up After Yourself!
This class of problems is caused by creating references to objects without clearing them when you are done with them. The Thread and ThreadLocal issues I discussed in earlier posts generally fall under this category.
Timers not released: If an app created a Timer (via commonj.timers.TimerManager), and the timer did not stop before the application undeployed, it was not cleaned up (and thus leaked). The culprit was a misused WeakHashMap. The Timers were stored like this:
WeakHashMap timers = new WeakHashMap(); ... timers.put(theTimer, theTimer);
The map was just being used as a Set, but the insertion of the timer into the map value ensured that the same value in the keys were never unreferenced, thus loosing the entire benefit of the Weak keys in the first place (see the “Implementation note” in the WeakHashMap javadoc). The fix was simple:
timers.put(theTimer, null);
WSEE Runtime MBeans: WSEE (the web services stack) was creating MBeans to hold the configuration of handlers. They were not cleared when the application was undeployed, which in itself represents a minor leak. Unfortunately, the MBeans also (indirectly) held a reference to the Application’s ClassLoader - turning a small problem into a rather large one.
GroupSpace use of Beehive: GroupSpace was calling Beehive’s beginContext without a matching endContext. Behind the scenes, Beehive created some ThreadLocals, but it does the right thing and removes them when (if) you call endContext. This was fixed by balancing the context usage within a request.
Portal Framework service manager: - Internal class in the System ClassLoader has a Map that was holding references of services to be used. One of the services was loaded by the Application’s ClassLoader. This was fixed by unregistering services when the application was undeployed. Lesson: Keep track of what you are using and let it go when it is no longer needed.
Lesson: The Garbage Collector is not your Mom - you have to clean up after yourself around here.
Hold Me Tight!
Where the above references were necessary but simply not cleaned up, this grouping is more caches that are holding unnecessary or extra information.
Log Buffer: WLS Logging keeps a buffer that holds the last 50 or so log messages for efficient display by console. But the implementation was holding the application-scoped logger objects (rather than simply holding the text it needed to display). The fix was to copy information from the application-scoped objects to generic system-scoped objects for the buffer.
Resource initialization optimization: The ResourceBase class that is the base class for our Entitlement resources, was keeping a collection of previously created objects. It then used data from these objects to initialize fields in new instances referring to the same resource. But our Entitlement resource objects were application-scoped, and thus this collection caused leaks. We fixed this by overriding the base object with our own intermediate implementation. We kept the caches, but wrapped the objects in Weak and Soft references. This is not an ideal solution, so we have planned to revisit the whole design in a future release.
BeanELResolver cache: The javax.el.BeanELResolver (used when EL is used in a JSP) caches bean properties in a static Map. Since these bean objects are generally webapp-scoped, there is a leak. We fixed it the same way Glassfish fixed it by clearing the map on undeploy. Since BeanELResolver does not have a public method to clear the map or uncache entries, this had to be done by reflection of the private map field (ugh). The interesting bit is that BeanELResolver does have methods to provide the close/clear function, but they are private (and never referenced).
Commons Logging in System Classpath: Commons Logging holds a map of loggers, which it may load via the application or webapp ClassLoader. Thus simply having the commons logging jar in your System ClassPath will trigger a leak (if logging is used from an application - as it is with Beehive). For us, the jar was added to the system classloader inadvertently (via jar manifest entries), so the fix was to simply remove it. However, we really can’t do anything about it if a customer decides to stick commons logging in their classpath.
Glassfish JAXB: The Glassfish implementation of JAXB caches some reflection lookups. It has a WeakHashMap with Class objects as keys. Good try, however the values in that map are Constructor objects, which themselves hold an instance to the Class (which therefore means the keys are strongly held, and the Map’s weak keys can never be released). We fixed this in our local version by wrapping the Constructor in a WeakReference. We filed the ssue with Sun, and they adopted our fix in version 2.1.6.
LeaseManager holding a TimerManager: The LeaseManager (a new internal class) was holding a static reference to an app-scoped commonj.timers.TimerManager. This one was fun because it required tripping a race condition in order to trigger the leak. Also, in addition to being a leak, this was an error because the static reference tied the LeaseManager to a single deployed instance of a single application, so it would not work properly with multiple applications. We fixed this by looking up the TimerManager as needed (from JNDI, like you’re supposed to do). Yet another example of premature optimization (i.e. presuming that the JNDI lookup was too slow).
Lesson: You gotta know when to hold ‘em; Know when to fold ‘em.
The Ties that Bind
I’ve covered the Thread and ThreadLocal issues previously, but here are the details to complete the list.
ThreadLocals in Portal Framework: Some of the Framework JDBC code was using ThreadLocal but not cleaning up at end of request. This was fixed by simply removing the ThreadLocal (in a finally block) at the end of the request
new Threads in JSP compiler: See the previous post. It turned out that the thread was unnecessary: a leftover from when the code was ported to the server from a different environment. So it was just removed.
ThreadLocals in Content Management: Content was using ThreadLocal and not removing them after a request. The ThreadLocals were added as a (possibly unnecessary?) performance optimization, and the fix was simply to remove them.
Lesson: Don’t let your code get all tied up in knots - unravel those threads.
Final Thoughts
If you’ve taken the time to look at these issues, you can see that the problems spanned across multiple subsystems: WLP, WLS, WLW, Apache, and Sun. It was quite a project, and quite a fun challenge to track all these down.
Next, I’ll outline how we intend to keep this from happening again, and show you some tricks for your own memory leak testing.
Technorati Tags: leak, memory, portal, WebLogic
No commentsLeakPatterns: ThreadLocal
Another common source of leaks we found was related to the use of ThreadLocal. A ThreadLocal is a way to attach some data to a Thread, so you can get at it later. They are really useful for a couple of cases.
First, you can cache some value (presumably expensive to compute) and reuse it later. Since the value is associated with the Thread, it is thread-safe (baring any issues with the attached objects themselves). This makes the ThreadLocal an important tool in performance optimizations.
Another good use of ThreadLocal is to hold some context-related information. For example, maybe you need some data that is on the HTTP Request (like URLs, attributes, query parameters) in order to make an authorization decision for access to certain data. Normally, this would mean that all that data (or the Request itself) would need to appear on the authorizer API - and on any API that called that one - and on any API calling those APIs, etc. Or, you could toss the Request data into a ThreadLocal and retrieve it in the authorizer. It becomes context for the call to the authorizer. This is another really useful technique for loosely tying subsystems together.
But in a server environment (like WebLogic), the Threads are usually pooled and reused for request after request. The ThreadLocal is associated with a Thread, and the Thread is associated with a Request - so that is good. Except that the Thread is only temporarily associated with the Request.
So, if ThreadLocal objects are not managed correctly (such that their life-cycle matches that of the Request), you can get all sorts of evil things happening. First, if you were using ThreadLocal for performance, it just might not work since one user session might be handled by many different Threads. Worse, you might leak data from one user to another (when a user’s Request reuses a Thread containing another user’s data).
And in the cases I saw, if that ThreadLocal holds objects that otherwise should be garbage collected, you will get a leak.
The solution is to ensure that your ThreadLocal usages are properly managed. The ThreadLocal should have the same life cycle as the Request, and you should use ThreadLocal.remove() – preferably in a proper finally block - to clean up at the end of your Request.
Technorati Tags: leak, memory, portal, ThreadLocal, WebLogic
1 commentLeak Patterns: new Thread
One of the most surprising memory leaks I found looked like this:
Thread t = new Thread();
Of course, that’s not exactly the whole story, but it illustrates the point that just knowing where the leaky object is referenced is not always enough to find the cause of the leak.
So why does new Thread() leak? It turns out that WLS puts some important (internal) information on the thread as InheritableThreadLocal instances. Some of these references may contain references to the application’s ClassLoader. Since they are inheritable, they will be passed to the new thread.
And if that new thread outlives the application ClassLoader (that is, the application is undeployed before the thread stops), then you have a leak.
Of course, a good J2EE developer will recognize that creating threads is not allowed by the specification - you should instead use something like WorkManager. And that is correct.
But I am in a somewhat different environment, where we are building what amounts to a container. So, while we are doing J2EE, we are also building the container to support J2EE components. So we sometimes walk a fine line on issues like this.
Technorati Tags: leak, memory, WebLogic
No commentsSetting up YourKit for WebLogic
Setting up YourKit for WLS is pretty easy, and reasonably clear in their documentation. Since I work with not-yet-released versions of WLS, the automatic configuration scripts for various profilers usually don’t work for me, so I usually configure things manually.
For WLS and YourKit, you just need to do 3 things:
- Set the library path
- Set the agentlb option
- Start the server with debugging off
To do this in Linux, it looks something like this:
$ export LD_LIBRARY_PATH='/opt/yjp-6.0.16/bin/linux-x86-32/' $ export JAVA_OPTIONS='-agentlib:yjpagent' $ ./startWeblogic.sh nodebug
Once the server starts, you can connect to it from the YourKit UI, trigger the leak, and take a memory snapshot.
I have been looking for application ClassLoader leaks, and I know that the WLS GenericClassLoader has a String called annotation that holds the application name. So with the snapshot open, I pick Memory -> Strings by Pattern and enter the application name.
I happen to know how the annotation is formatted so I acually search for the regular expression “^myApp@$”. That will find the string, and thus the GenericClassLoader that is leaked. And you can easily see then what is holding the ClassLoader and why it is a leak.
Technorati Tags: YourKit, WebLogic
No comments