Monday, 13 July 2009

JVM Tuning from the Trenches

This article is a follow-up to http://jojovedder.blogspot.com/2009/06/slow-weblogic-part-6-jvm-heap-analysis.html. Please read that one first for the basics on JVM heap, parameters and flags.

Also remember that these tips worked for the specific server settings and issues described below; applying them blindly to your own server will not produce the same results. Measure first, then tune based on what you measured.
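To make that measuring possible, GC logging should be on before any tuning starts. A minimal sketch, assuming a 1.4.1+ HotSpot JVM and WebLogic's standard JAVA_OPTIONS variable (your start script may differ):

```shell
# Capture GC behaviour before and after each tuning change.
# Output goes to the managed server's stdout log.
JAVA_OPTIONS="$JAVA_OPTIONS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```

Comparing these logs across restarts is what tells you whether a flag change actually helped.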

Problem:
Platform running WebLogic 8.1 on Sun V880 servers with 32 GB of total RAM on the machine.
2 GB assigned to the managed server JVM heap, running JDK 1.4.

Initial settings:
-XX:+AggressiveHeap -Xms2048m -Xmx2048m  -XX:SurvivorRatio=32 -XX:MaxPermSize=128m 


But still there were up to 20 Full GCs per hour at peak times, before the server crashed.


Analysis

1. We decided to reduce the SurvivorRatio to 4 and restart with some additional flags.

The size of ONE Survivor Space is calculated as

SurvivorSpace = NewSize / (SurvivorRatio + 2)

Keeping SurvivorRatio at 32 makes the Survivor spaces too small to hold objects copied out of Eden, so objects get promoted prematurely into the Old generation. Hence we reduced it to 4, which allows for larger Survivor spaces.
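As a sanity check, the formula can be worked through for the 512 MB NewSize we end up using later in this article:

```java
// Worked example of the survivor-space formula above,
// using the 512 MB NewSize configured later in this article.
public class SurvivorSizing {

    // SurvivorSpace = NewSize / (SurvivorRatio + 2), all sizes in MB
    static long survivorSpaceMb(long newSizeMb, long survivorRatio) {
        return newSizeMb / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        long newSizeMb = 512;
        // Original SurvivorRatio=32: roughly 15 MB per survivor space
        System.out.println(survivorSpaceMb(newSizeMb, 32));
        // Reduced SurvivorRatio=4: roughly 85 MB per survivor space
        System.out.println(survivorSpaceMb(newSizeMb, 4));
    }
}
```

So the change grows each survivor space from about 15 MB to about 85 MB, giving objects room to age before promotion.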

2. As per Sun Bug ID 6218833, setting -XX:+AggressiveHeap before the heap size flags (-Xmx and -Xms) can confuse the JVM. Reverse the order so that -Xms and -Xmx come before -XX:+AggressiveHeap, or drop the flag altogether.
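A sketch of the corrected ordering, assuming WebLogic's standard JAVA_OPTIONS variable (the exact start script varies per installation):

```shell
# Heap sizes first, then AggressiveHeap (or omit AggressiveHeap entirely).
JAVA_OPTIONS="-Xms2048m -Xmx2048m -XX:+AggressiveHeap -XX:SurvivorRatio=4 -XX:MaxPermSize=128m"
```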

3. The application has 180+ EJBs with pools of beans. Hence we set -Dsun.rmi.dgc.client.gcInterval=3600000 (1 hour) instead of the default 60000 (1 minute). More on this here: http://docs.sun.com/source/817-2180-10/pt_chap5.html
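The interval is passed as a system property on the server start command. The tuning guides of that era usually raise the server-side interval alongside the client one; including the server property here is an assumption, not something measured in this exercise:

```shell
# Run RMI distributed GC once an hour instead of once a minute,
# so DGC stops forcing frequent full collections.
JAVA_OPTIONS="$JAVA_OPTIONS -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000"
```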

4. The site is restarted once a week at 4:30 AM. The pattern stays normal for two days and then degrades into full GCs.

5. The Old space is nearly full. At every minor collection, room must first be freed in the Old space for promotion from Young to Old to take place.

6. The Permanent space is also nearly full; the JVM keeps loading more and more classes (could the growing number of JSPs per release be the cause?).
Hence we increased the permanent space from 128 MB to 256 MB.

7. Ensure we are running the server JVM by using the -server flag

8. Use OptimizeIt or a similar profiling tool to see memory usage and find code bottlenecks.


The settings were now:

-server -Xms2048m -Xmx2048m  -XX:MaxNewSize=512m -XX:NewSize=512m -XX:SurvivorRatio=4 -XX:MaxPermSize=256m -Xincgc -XX:+DisableExplicitGC -XX:+AggressiveHeap -XX:-OmitStackTraceInFastThrow



This reduced the Full GCs to one a day.

Error Logs

As the server ran out of memory prior to a crash, the logs filled with repeated errors (up to 100 repetitions) of this sort:

java.lang.NullPointerException
 <<no stack trace available>>


Adding the -XX:-OmitStackTraceInFastThrow flag resolved this problem. The root cause of the NPE itself still has to be tracked down, but we no longer have logs flooded with trace-less repeated exceptions.

We could now see the stack trace as

java.lang.NullPointerException
 at java.util.StringTokenizer.<init>(StringTokenizer.java:117)
 at java.util.StringTokenizer.<init>(StringTokenizer.java:133)
 at jsp_servlet._framework._security.__login._jspService(login.jsp:294)
 at weblogic.servlet.jsp.JspBase.service(JspBase.java:27)
 at weblogic.servlet.internal.ServletStubImpl$ServletInvocationAction.run(ServletStubImpl.java:1075)


This seems to match a known Sun bug.
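For illustration, the first two frames of that trace can be reproduced by passing null to StringTokenizer, which is the likely shape of the bug in login.jsp (the null header value is an assumption, not taken from the actual JSP source):

```java
import java.util.StringTokenizer;

public class NpeRepro {
    public static void main(String[] args) {
        String value = null; // e.g. a missing request parameter or header

        try {
            // StringTokenizer's constructor dereferences the string
            // immediately, so a null input throws the NPE from inside
            // StringTokenizer.<init>, exactly as in the trace above.
            new StringTokenizer(value);
        } catch (NullPointerException e) {
            e.printStackTrace();
        }
    }
}
```

This is why the top frames sit in the JDK rather than in application code: the bug is in whatever fails to null-check the value before tokenizing it.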