Tuesday, 25 November 2008

Simultaneous Solaris T2000 core dumps on local zones

Issue:

Our applications run on Sun T2000 servers, each configured with 4 or 5 Local zones per Global Zone.

On 17 November at 05:00, an issue on one Global Zone caused 5 separate, simultaneous core dumps on the server, one core in each Local zone.




Impact:
The 5 processes that crashed, one per Local zone, were:

LZ 15 – core created by the wlshell java process. Low impact.
LZ 27 – core created by the wlshell java process. Low impact.
LZ 25 – core created by HM server java process. Was manually restarted today. Medium impact.
LZ 26 – core created by PA server java process. This was automatically restarted by the HM. High impact.
LZ 37 – core created by HM server java process. Was manually restarted today. Medium impact.




All 5 crashes above were in Library=/platform/sun4v/lib/libc_psr.so.1

Each core dump is accompanied by an hs_err_.log file, which contains the following details:

An unexpected exception has been detected in native code outside the VM.
Unexpected Signal : 10 occurred at PC=0xFF2709F0
Function=_memset+0x70
Library=/platform/sun4v/lib/libc_psr.so.1

Current Java thread:
at java.lang.Thread.start(Native Method)
- locked <0xf1f3a378> (a java.util.logging.LogManager$Cleaner)
at java.lang.Shutdown.runHooks(Shutdown.java:126)
at java.lang.Shutdown.sequence(Shutdown.java:165)
at java.lang.Shutdown.exit(Shutdown.java:210)
- locked <0xf5998498> (a java.lang.Class)
at java.lang.Runtime.exit(Runtime.java:90)
at java.lang.System.exit(System.java:715)
at ConsoleScriptRunner.main(ConsoleScriptRunner.java:64)
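
To confirm that all five crashes share the same faulting library, the header of each hs_err log can be compared mechanically. The following is a minimal sketch, assuming every log carries the same four header lines as the excerpt above; the function name parse_hs_err_header is hypothetical, and the sample text is copied from that excerpt. Signal 10 is SIGBUS on Solaris.

```python
import re

# Header lines copied from the hs_err excerpt above.
SAMPLE = """\
An unexpected exception has been detected in native code outside the VM.
Unexpected Signal : 10 occurred at PC=0xFF2709F0
Function=_memset+0x70
Library=/platform/sun4v/lib/libc_psr.so.1
"""


def parse_hs_err_header(text):
    """Return the signal number, PC, function and library found in the text.

    Hypothetical helper: it assumes the JDK 1.4.2 header layout shown in
    the excerpt and returns only the fields it actually finds.
    """
    fields = {}
    m = re.search(
        r"Unexpected Signal\s*:\s*(\d+) occurred at PC=(0x[0-9A-Fa-f]+)", text
    )
    if m:
        fields["signal"] = int(m.group(1))
        fields["pc"] = m.group(2)
    for key in ("Function", "Library"):
        m = re.search(r"^%s=(.+)$" % key, text, re.MULTILINE)
        if m:
            fields[key.lower()] = m.group(1).strip()
    return fields


if __name__ == "__main__":
    print(parse_hs_err_header(SAMPLE))
```

Running the same function over the log from each Local zone and comparing the "library" and "function" values shows at a glance whether the crashes are really identical.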


Investigations


1. The server appears to have been running low on swap at the time of the crash. Memory: 31.9G real, 4.0G free, 37.7G swap in use, 2.4G swap free
Update: added 50G of swap on 19/11. Memory: 31.9G real, 3.4G free, 38.0G swap in use, 52.8G swap free
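
To put the two Memory lines above in proportion, the share of swap still free can be computed from each. This is a rough sketch, assuming total swap is simply "swap in use" plus "swap free" as reported; the function name swap_free_percent is hypothetical.

```python
import re

# Conversion factors to gigabytes for the unit suffixes in the Memory line.
_TO_GB = {"K": 1.0 / (1024 * 1024), "M": 1.0 / 1024, "G": 1.0}


def _gigabytes(match):
    """Convert a (number, unit) regex match to gigabytes."""
    return float(match.group(1)) * _TO_GB[match.group(2)]


def swap_free_percent(memory_line):
    """Return free swap as a percentage of (swap in use + swap free)."""
    used = re.search(r"([\d.]+)([KMG]) swap in use", memory_line)
    free = re.search(r"([\d.]+)([KMG]) swap free", memory_line)
    if not (used and free):
        raise ValueError("no swap figures found in: %r" % memory_line)
    used_gb = _gigabytes(used)
    free_gb = _gigabytes(free)
    return 100.0 * free_gb / (used_gb + free_gb)


BEFORE = "Memory: 31.9G real, 4.0G free, 37.7G swap in use, 2.4G swap free"
AFTER = "Memory: 31.9G real, 3.4G free, 38.0G swap in use, 52.8G swap free"

if __name__ == "__main__":
    print("before: %.1f%% swap free" % swap_free_percent(BEFORE))
    print("after:  %.1f%% swap free" % swap_free_percent(AFTER))
```

On these figures, roughly 6% of swap was free at the time of the crash, against roughly 58% after the extra swap was added.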


2. Application Support is identifying what the wlshell process does. It appears to be a WebLogic server monitoring script run from cron every 10 minutes.


3. Library=/platform/sun4v/lib/libc_psr.so.1: are there any known issues with this library on Solaris 10 or the T2000? Support to query Sun.


The libc_psr libraries implement platform-specific, optimized versions of block copy and move routines from libc, such as memcpy(). On UltraSPARC machines, these routines are coded in assembler and use block load and store ASIs, prefetch, and other tricks for better performance.


Root Cause

Sun identified that the issue was related to known bugs in JDK 1.4.2_03 and to running an old version of libc_psr.so.

The type (a) error, as seen in hs_err_pid6216_18111030.log, is a known Java bug in j2sdk1.4.2_03, identified as:


Bug ID: 4927116 (fixed in 1.4.2_04). Synopsis: Regression: 1.4.2 JVM core dumps in ClassLoader.defineClass0()

Actions

1. The immediate action is to upgrade to JDK 1.4.2_18.

2. We also identified that all 5 core dumps were created by sh scripts running under crontab. Further analysis showed cron jobs running every 15 minutes in each of the zones. Starting them all at once loads system resources, particularly JVM memory consumption, and enough unhandled exceptions between the Java virtual machine and the operating system could cause the by then deadlocked JVM to terminate.


The second resolution was therefore to offset the start time of each cron job, to try to ensure that no two jobs start on the same Global Zone at the same time.


Old Timings:

0,15,30,45 * * * * /export/home/Handlers.sh > /tmp/Handcheck
0,15,30,45 * * * * /export/home/Average_response_time.sh > /tmp/Avgcheck
0,15,30,45 * * * * /export/home/Msg_In_and_Out_count.sh > /tmp/Mescheck
0 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 * * * /log/zipper.sh


New Timings:

1,16,31,46 * * * * /export/home/Handlers.sh > /tmp/Handcheck
2,17,32,47 * * * * /export/home/Average_response_time.sh > /tmp/Avgcheck
3,18,33,48 * * * * /export/home/Msg_In_and_Out_count.sh > /tmp/Mescheck
4 0,2,4,6,8,10,12,14,16,18,20,22 * * * /log/zipper.sh
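
The effect of the new timings can be checked by expanding each entry into its (hour, minute) start slots and looking for slots shared by more than one job. The sketch below is illustrative only: the helper names are hypothetical, it handles just the "*" and comma-separated forms used in the entries above (no ranges or steps), and it compares start times only, so jobs that run for longer than a minute can still overlap. To check a whole Global Zone, the crontabs of all its Local zones would need to be concatenated first.

```python
from collections import defaultdict

OLD = """
0,15,30,45 * * * * /export/home/Handlers.sh > /tmp/Handcheck
0,15,30,45 * * * * /export/home/Average_response_time.sh > /tmp/Avgcheck
0,15,30,45 * * * * /export/home/Msg_In_and_Out_count.sh > /tmp/Mescheck
0 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 * * * /log/zipper.sh
"""

NEW = """
1,16,31,46 * * * * /export/home/Handlers.sh > /tmp/Handcheck
2,17,32,47 * * * * /export/home/Average_response_time.sh > /tmp/Avgcheck
3,18,33,48 * * * * /export/home/Msg_In_and_Out_count.sh > /tmp/Mescheck
4 0,2,4,6,8,10,12,14,16,18,20,22 * * * /log/zipper.sh
"""


def expand(field, lo, hi):
    """Expand a cron field that is '*' or a comma-separated list of numbers."""
    if field == "*":
        return set(range(lo, hi + 1))
    return {int(value) for value in field.split(",")}


def start_slots(line):
    """Return (set of (hour, minute) start slots, script path) for one entry."""
    minute, hour, _dom, _month, _dow, command = line.split(None, 5)
    slots = {(h, m) for h in expand(hour, 0, 23) for m in expand(minute, 0, 59)}
    return slots, command.split()[0]


def collisions(crontab):
    """Map each start slot used by more than one job to the jobs sharing it."""
    by_slot = defaultdict(list)
    for line in crontab.strip().splitlines():
        slots, script = start_slots(line)
        for slot in slots:
            by_slot[slot].append(script)
    return {slot: jobs for slot, jobs in by_slot.items() if len(jobs) > 1}


if __name__ == "__main__":
    print("old timings: %d shared start slots" % len(collisions(OLD)))
    print("new timings: %d shared start slots" % len(collisions(NEW)))
```

With the old timings, every quarter-hour slot is shared by at least three scripts; with the new timings, no two entries in this crontab start in the same minute.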


We are also removing unnecessary cron jobs from the box to reduce the load.