
Friday, 24 August 2012

Identifying memory leaks using Eclipse MAT - Part 3





So, back to finding out what this class weblogic.xml.query.xdbc.Context was and how we could solve the memory leak.

A quick search in Eclipse showed this class gets loaded from com.bea.core.xquery_1.3.0.0.jar and the Eclipse calling trace showed a call to org.apache.xmlbeans.impl.store.XqrlImpl$CompiledQuery.

 

We knew we were using Apache XMLBeans in our application, and a check of our Weblogic start classpath showed these jars had been added to the classpath since we had runtime issues without them.

 /wls_domains/com.bea.core.xquery.xmlbeans-interop_1.3.0.0.jar:/wls_domains/com.bea.core.xquery_1.3.0.0.jar

 
But we did not want to use the Weblogic version of XMLBeans, so we tried the Apache one instead.

So the changes made were to remove the above jars from the start classpath and add the Saxon jars instead. We also edited weblogic-application.xml to tell Weblogic to prefer the Apache packages:


<prefer-application-packages>
    <package-name>org.apache.xmlbeans.*</package-name>
</prefer-application-packages>
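For context, this element lives in the application's META-INF/weblogic-application.xml. A minimal sketch of the full descriptor is below; the namespace URI is what I recall for WLS 10 and should be checked against your version:

<weblogic-application xmlns="http://www.bea.com/ns/weblogic/weblogic-application">
    <prefer-application-packages>
        <package-name>org.apache.xmlbeans.*</package-name>
    </prefer-application-packages>
</weblogic-application>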
   


We then repeated our performance tests and monitored the heap dumps to verify the problem had gone away.

There was no instance at all of the weblogic.xml.query.xdbc.Context class which had caused the earlier heap blockage, indicating our fix was applied correctly.

Thursday, 23 August 2012

Identifying memory leaks using Eclipse MAT - Part 2


As I wrote in Part 1, we identified that there was some sort of memory leak happening within our JEE application - so the next steps were to obtain a heap dump and run it through Eclipse MAT.

So what is a Heap Dump? From the Eclipse docs:

A heap dump is a snapshot of the memory of a Java process at a certain point of time. There are different formats for persisting this data, and depending on the format it may contain different pieces of information, but in general the snapshot contains information about the java objects and classes in the heap at the moment the snapshot was triggered.
The Memory Analyzer is able to work with HPROF binary heap dumps, IBM system dumps (after preprocessing them), and IBM portable heap dumps (PHD) from a variety of platforms.

Typical information which can be found in heap dumps (once more - depending on the heap dump type) is:

  • All Objects
Class, fields, primitive values and references
  • All Classes
Classloader, name, super class, static fields
  • Garbage Collection Roots
Objects defined to be reachable by the JVM
  • Thread Stacks and Local Variables

How do we get a heap dump into a .hprof file?

Set the following flags on the java process:

• -XX:+HeapDumpOnOutOfMemoryError writes a heap dump on OutOfMemoryError

• -XX:+HeapDumpOnCtrlBreak writes a heap dump together with a thread dump on CTRL+BREAK

Or you can fire one via tools such as jmap or jconsole:

• Sun JMap: <jdkhome>/bin/jmap -dump:format=b,file=HeapDumpFilename.hprof <pid>
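You can also trigger a dump from Java code. Below is a minimal sketch using the HotSpotDiagnosticMXBean, which is available on Sun/Oracle HotSpot JDKs from Java 6; the class name and file name here are just examples:

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        // Proxy to the HotSpot diagnostic MBean registered in the platform MBean server
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // true = dump only live objects, which triggers a full GC first (like jmap -dump:live)
        diag.dumpHeap("HeapDumpFilename.hprof", true);
    }
}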

 

Our idea was to take multiple heap dumps every 2 hours or so and observe the objects in the heap. As with a thread dump, a heap dump is a static view of the live objects in the heap at that time - so it's not possible to make a definitive judgment from just one heap dump. Hence space them out over some time and observe whether the same objects (sometimes these have the same memory address location) keep growing, indicating a possible leak.

Also remember – at the point of taking a heap dump, Java runs a full GC, so you are left with the live objects in the heap.

We took the first heap dumps every 2 hours, but nothing much was happening for the first 12 hours or so.

Once you have the hprof file, and have installed the Eclipse Memory Analyzer Tool (MAT), just open the hprof in Eclipse and wait for it to do its thing.

The page will open with an Overview.

This lists important stuff like the size of the heap - so in our case, though we have a 2 GB heap, the retained size after GC is 473.1 MB.

The graph of “Biggest Objects by Retained Size” will give you immediate clues and this is also reflected in tabular format in the “Dominator Tree” report.




[Screenshot: the MAT Overview page]
But the best thing about MAT is the Leak Suspects report, which clearly points out potential memory leaks.

[Screenshots: the MAT Leak Suspects report]
So in our hprof, 55% of the heap was being retained by one instance of weblogic.xml.query.xdbc.Context, i.e. if this object were reclaimed by GC we could get back 273 MB of memory. This object was residing at memory location weblogic.xml.query.xdbc.Context @ 0xb2477930.
The important thing to see in this view is the Retained Heap.
In Memory Analyzer, the term shallow size means the size of an object itself, without counting and accumulating the sizes of other objects referenced from it.
The retained heap is the total memory that would be freed if this object were reclaimed - i.e. the object itself plus everything it is holding onto.

We took the next snapshot 2 hours after the first heap dump, and it showed the same instance of weblogic.xml.query.xdbc.Context (@0x7c1642f8) had now grown to 301.6 MB, with the total heap now at 524.7 MB.


 
Our last snapshot, confirming our problem, was taken after a further 2 hours - a total of 4 hours since the first heap dump - and it again showed this object.

The heap had now grown to 616.5 MB, with 54% of it occupied by the same object.
So MAT had now shown us what was causing the leak. Next we had to analyze how this class weblogic.xml.query.xdbc.Context was being used in the app and how we could prevent the leak. I will post that in Part 3.

Saturday, 28 July 2012

JVM Heap analysis – Identifying memory leaks using Eclipse MAT


As part of the regular pre-Production testing it is common to conduct Performance tests for load under Peak and Soak conditions.

So we had this Weblogic platform on which all Peak tests were good and so also were the 12 hour Soak tests.

The JVM memory usage graph plotted showed regular GC clearing up memory and no obvious leaks. A simple graph plotted in Excel showing JVM Used Heap over time is shown below. It shows a pattern fairly close to an ideal sawtooth.

[Graph: JVM Used Heap over time - a near-ideal sawtooth]
An additional test planned was an Extended Soak – mainly letting the system run as per expected normal volumes over 7 or 10 days to see if the JVM would throw up any unusual memory usage patterns or memory leaks.

And we came across some interesting issues!

We did not have much of a problem for the first 24 hours - but after 48 hours (2 whole days) of continuous load, the graph showed that the JVM was unable to clear up any memory even after GC.




[Graph: JVM used heap staying at around 80% despite repeated GCs]

As the graphs above show, in spite of multiple GCs the Weblogic server JVM was not reducing the utilized memory down to the usual 30 - 40% or so. It stayed at around 80% until, after 7 days, the servers went OutOfMemory and just crashed.



So in order to analyze this we decided to take JVM heap dumps and inspect them using VisualVM or Eclipse MAT.

In this instance Eclipse MAT, an easy-to-install Eclipse plugin, gave instant feedback, and we were able to narrow down the Leak Suspects and actually find the root cause of the problem.

I will elaborate on that in Part 2 of this series.

Wednesday, 4 April 2012

Weblogic - Accessing Server Runtime details using JMX

We had a requirement to access certain runtime info on the Weblogic server to carry out some admin tasks.

One way to do this is via the JMX API, which in Weblogic is available as an extensive set of MBeans.

There are plenty of tutorials and examples of using these on the net.

Now within your Weblogic domain, if you run as part of a Cluster, then you would access Domain information on the Admin server when you need to identify the running servers in the cluster.

This uses the weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean object, which is bound only on the Admin server.
One full example of this is at http://middlewaremagic.com/weblogic/?p=210

If you try this on a managed server directly, you will get this error

java.io.IOException: Unable to resolve 'weblogic.management.mbeanservers.domainruntime'. Resolved 'weblogic.management.mbeanservers'
        at weblogic.management.remote.common.ClientProviderBase.makeConnection(ClientProviderBase.java:195)
        at weblogic.management.remote.common.ClientProviderBase.newJMXConnector(ClientProviderBase.java:83)
        at javax.management.remote.JMXConnectorFactory.newJMXConnector(JMXConnectorFactory.java:338)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:247)


So if you do not have the Admin server IP address and port, how can you still identify the other members in the cluster?

A use case for this is if you want to run and access Cluster details from one of the managed servers, when you only have the current running server's IP and port and not that of the Admin server - which is usually the case.

There is an MBean for this which is available on the individual managed server as well,
weblogic.management.mbeanservers.runtime.RuntimeServiceMBean

Using this you can get the details of the Admin server and then from there get access to the DomainRuntimeServiceMBean.

A working code example is below

This runs from a JSP where we are able to access the server IP and port via request variables


<%@page import="javax.xml.bind.DatatypeConverter"%>
<%@page import="java.io.IOException"%>
<%@page import="java.net.MalformedURLException"%>
<%@page import="java.util.Hashtable"%>
<%@page import="javax.management.MBeanServerConnection"%>
<%@page import="javax.management.MalformedObjectNameException"%>
<%@page import="javax.management.ObjectName"%>
<%@page import="javax.management.remote.JMXConnector"%>
<%@page import="javax.management.remote.JMXConnectorFactory"%>
<%@page import="javax.management.remote.JMXServiceURL"%>
<%@page import="javax.naming.Context"%>
<%@page import="java.net.URLConnection"%>
<%@page import="java.net.HttpURLConnection"%>
<%@page import="java.net.URL"%>
<%@page import="sun.misc.BASE64Encoder"%>
<%@page import="java.io.InputStream"%>
<%@page import="javax.management.MBeanServer"%>
<%@page import="javax.management.ObjectName"%>
<%@page import="javax.naming.InitialContext"%>

<%
 MBeanServer mBeanServer = null;
 InitialContext ctx = null;
 String administrationURL = null;
 int adminPort = 0;
 try {
  ctx = new InitialContext();
  mBeanServer = (MBeanServer) ctx.lookup("java:comp/env/jmx/runtime");
  
  //Get Admin Server and Port
  ObjectName runtimeService = new ObjectName(
     "com.bea:Name=RuntimeService,Type=weblogic.management.mbeanservers.runtime.RuntimeServiceMBean");
  String managedServerName = (String) mBeanServer.getAttribute(runtimeService, "ServerName");
  ObjectName msServerRuntime = new ObjectName("com.bea:Name="+ managedServerName + ",Type=ServerRuntime");
  administrationURL = (String) mBeanServer.getAttribute(msServerRuntime, "AdminServerHost");
  adminPort = (Integer) mBeanServer.getAttribute(msServerRuntime, "AdminServerListenPort");
  System.out.println(administrationURL + adminPort);
 } catch (Exception ex) {
  System.out.println("Caught Exception while fetching Admin Server information : "+ ex);
  ex.printStackTrace();
 } finally {
  if (ctx != null) {
   try {
    ctx.close();
   } catch (Exception ex) {
    ex.printStackTrace();
   }
  }
 }

 //Connect via JMX using Admin Server credentials
 String protocol = "t3";
 Integer portInteger = Integer.valueOf(request.getServerPort());
 int port = portInteger.intValue();

 String jndiroot = "/jndi/";
 String mserver = "weblogic.management.mbeanservers.domainruntime";
 JMXServiceURL serviceURL = new JMXServiceURL(protocol, administrationURL, adminPort, jndiroot + mserver);
 Hashtable h = new Hashtable();
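 // weblogicuser / weblogicpassword are the Admin credentials - assumed to be defined elsewhere in the page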
 h.put(Context.SECURITY_PRINCIPAL, weblogicuser);
 h.put(Context.SECURITY_CREDENTIALS, weblogicpassword);
 h.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES,"weblogic.management.remote");
 JMXConnector connector = JMXConnectorFactory.connect(serviceURL, h);
 MBeanServerConnection connection = connector.getMBeanServerConnection();

 ObjectName domainRuntimeService = new ObjectName(
 "com.bea:Name=DomainRuntimeService,Type=weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean");
 ObjectName[] serverRT = (ObjectName[]) connection.getAttribute(domainRuntimeService, "ServerRuntimes");
 Hashtable server_states = new Hashtable();
 for (ObjectName ser : serverRT) {
  server_states.put((String) connection.getAttribute(ser,"Name"), (String) connection.getAttribute(ser,"State"));
 }

 ObjectName domain1 = (ObjectName) connection.getAttribute(
    domainRuntimeService, "DomainConfiguration");
 ObjectName[] cluster_list = (ObjectName[]) connection
    .getAttribute(domain1, "Clusters");
 for (ObjectName cl : cluster_list) {
  System.out.println("\n\nCluster Name: " + (String) connection.getAttribute(cl, "Name"));
  try {
   ObjectName[] servers = (ObjectName[]) connection.getAttribute(cl, "Servers");
   for (ObjectName ser : servers) {
    String server_name = (String) connection.getAttribute(ser, "Name");
    try {
     String server_URL = (String) connection.getAttribute(ser, "ListenAddress");
     Integer server_port = (Integer) connection.getAttribute(ser, "ListenPort");
     System.out.println("Server Name: " + server_name + ", Server State: "
        + server_states.get(server_name)
        + ", :" + server_URL + ":" + server_port);
    } catch (Exception ex) {
     // a server entry may be missing listen details - log and carry on
     ex.printStackTrace();
    }
   }
  } catch (Exception ex) {
   ex.printStackTrace();
  }
 }
 connector.close();
%>


Saturday, 9 October 2010

Some info on Weblogic clustering

Recently this question regarding Weblogic clustering came up on Stackoverflow.

I put some time into researching the answer I provided, and thought of putting this up on the blog as well.

Question from user Russell

I've read http://download.oracle.com/docs/cd/E11035_01/wls100/cluster/overview.html and searched this topic on the internet but still had a hard time understanding some of weblogic's clustering concepts.

Can anybody confirm/correct my understandings below?

• a cluster contains one or more logical servers which can reside on one or many physical servers
• when deploying a j2ee app to a cluster, it is tied to one server in that cluster
• external users of the deployed app aren't aware of clustering
• the log file of that app is located on the server it's deployed
• if the server hosting the app fails, it's okay because the app is in a cluster and another server will pick up the work?
• if the server hosting the app fails, what happens to logging?
Maybe I got the whole concept wrong. Could anybody point me into the correct directions?




Answer

It's good to understand the concept of Domain first.



The Domain is the parent of a Cluster. It contains typically one Admin and one or more Managed servers.
Now the Cluster is a grouping of some or all of these managed servers within the domain.



This diagram should help in understanding the relation between Domains and Clusters:

[Diagram: a Domain containing an Admin Server and Managed Servers grouped into a Cluster]
Once you configure a Domain and a Cluster yourself on a development environment, you'll get to know more about it.



Now here are the answers to your specific questions




• a cluster contains one or more logical servers which can reside on one or many physical servers




True. But let's clarify what you mean by 'logical' servers.
In the Cluster you typically have two or more Managed servers. These servers run in their own JVMs and can be started independently and serve requests independently. Each server will have a unique IP:port address, and it can be directly accessed from the browser. But these server instances can reside over multiple physical servers.




• when deploying a j2ee app to a cluster, it is tied to one server in that cluster




No it is not tied to one server. When you deploy a J2EE app to the Cluster, it will get deployed in turn to each server in that cluster. The JNDI is cluster-wide and each server maintains a local copy of the JNDI.



You can look up the object (say an EJB) via JNDI on the Cluster or on the individual server. Also see what types of Objects can be clustered.
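For illustration, here is a minimal lookup sketch; the cluster address, port and JNDI name below are hypothetical placeholders, not values from this post:

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.InitialContext;

public class ClusterLookup {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
        // Cluster address (DNS name or comma-separated host list) - hypothetical
        env.put(Context.PROVIDER_URL, "t3://mycluster.example.com:7001");
        InitialContext ctx = new InitialContext(env);
        // The JNDI name is an assumption - use whatever your object is bound under
        Object home = ctx.lookup("ejb/MyServiceHome");
        System.out.println("Looked up: " + home);
    }
}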




• external users of the deployed app aren't aware of clustering




True.



But in this case you should have an Apache web server or a load balancer or DNS server which takes the request from the browser and internally maps it to one of the servers in the cluster. If you don't have any of these, you would have to define the cluster address as a DNS name or IP address for the client. See the section "Avoiding Listen Address Problems" on http://download.oracle.com/docs/cd/E13222_01/wls/docs103/cluster/setup.html#wp682940




• the log file of that app is located on the server it's deployed




True, one weblogic log per server.




• if the server hosting the app fails, it's okay because the app is in a cluster and another server will pick up the work?




Not by default - you have to configure it for failover and replication. This is a huge topic which needs separate reading.




• if the server hosting the app fails, what happens to logging?




Logging stops. You'll see some shutdown or heartbeat errors in the log, or OutOfMemory or whatever the reason for failure was. You'll have to restart the server - and logging continues in a new file (depending on your logging settings).




My original answer is here on stackoverflow for those interested.

Wednesday, 17 February 2010

Weblogic JDK 1.6 Error Could not reserve enough space for object heap

Weblogic 10 running on HP-UX throws this error on startup when we try to give 3 GB Xms and Xmx.


-Dweblogic.system.BootIdentityFile=/portalapp/bea/user_projects/domains/MSD4/servers/MS4/data/nodemanager/boot.properties -Dweblogic.nodemanager.ServiceEnabled=true -Dweblogic.security.SSL.ignoreHostnameVerification=false -Dweblogic.ReverseDNSAllowed=false -server -Xms3g -Xmx3g -Xmn1024m -XX:+AggressiveHeap -XX:+UseParallelGC -XX:ParallelGCThreads=20 -XX:PermSize=1024m -XX:MaxPermSize=1024m -XX:MaxTenuringThreshold=3 -XX:LargePageSizeInBytes=4m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:SurvivorRatio=8 -XX:-UseAdaptiveSizePolicy -Xloggc:/portal_logs/gclog/MS4_gclog.log weblogic.Server >

<Feb 17, 2010 10:56:06 AM> <Info> <NodeManager> <Working directory is "/portalapp/bea/user_projects/domains/MSD4">

<Feb 17, 2010 10:56:06 AM> <Info> <NodeManager> <Server output log file is "/portalapp/bea/user_projects/domains/MSD4/servers/MS4/logs/MS4.out">
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

<Feb 17, 2010 10:56:06 AM> <Debug> <NodeManager> <Waiting for the process to die: 13663>
<Feb 17, 2010 10:56:06 AM> <Info> <NodeManager> <Server failed during startup so will not be restarted>




Additional information:

os.name = HP-UX
os.version = B.11.31

java.version = 1.6.0.04
java.vm.info = mixed mode
java.vm.name = Java HotSpot(TM) Server VM
java.vm.specification.name = Java Virtual Machine Specification
java.vm.specification.vendor = Sun Microsystems Inc.
java.vm.specification.version = 1.0
java.vm.vendor = "Hewlett-Packard Company"
java.vm.version = 11.3-b02-jre1.6.0.04-rc2
java.runtime.version = 1.6.0.04-jinteg_28_apr_2009_04_46-b00


Analysis:


Though we are using a 64-bit server which has 16 GB RAM, we are unable to assign more than 2 GB to the Weblogic heap.

This value in the startup log shows that the JVM is a 32-bit one.

sun.arch.data.model = 32
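
A quick way to confirm this programmatically is to print the property; a minimal sketch (sun.arch.data.model is exposed by Sun and HP HotSpot JVMs, and may be absent on other vendors' JVMs):

public class Bitness {
    public static void main(String[] args) {
        // Prints "32" or "64" on Sun/HP JVMs
        System.out.println("sun.arch.data.model = " + System.getProperty("sun.arch.data.model"));
        System.out.println("os.arch = " + System.getProperty("os.arch"));
    }
}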


In a 32-bit JVM, the process cannot address more than 4 GB in all. But as per the Sun site, in reality we cannot get more than about 2 GB.

As per this URL: http://java.sun.com/docs/hotspot/HotSpotFAQ.html#gc_heap_32bit

Why can't I get a larger heap with the 32-bit JVM?

The maximum theoretical heap limit for the 32-bit JVM is 4G. Due to various additional constraints such as available swap, kernel address space usage, memory fragmentation, and VM overhead, in practice the limit can be much lower. On most modern 32-bit Windows systems the maximum heap size will range from 1.4G to 1.6G. On 32-bit Solaris kernels the address space is limited to 2G.

So we tuned the heap down to 2 GB for now, i.e. set Xms and Xmx to 2g, and that solved it.

Friday, 12 February 2010

More tuning tips on slow Weblogic 10

In the earlier parts of this article I described some technical tuning tips specific to a slow-performing Weblogic JEE server.

Another earlier post has looked at how to set the JSP check seconds values.

This is a follow-up article specifically for tuning the similar settings for Weblogic 10 - since the syntax has changed.


In weblogic.xml, ensure the following are set as below:

<wls:session-descriptor>
<wls:encode-session-id-in-query-params>true</wls:encode-session-id-in-query-params>
<wls:cookie-comment>mysessioncookie</wls:cookie-comment>
<wls:cookie-max-age-secs>-1</wls:cookie-max-age-secs>
<wls:cookie-name>JSESSIONID</wls:cookie-name>
<wls:cookie-path>/</wls:cookie-path>
<wls:cookies-enabled>true</wls:cookies-enabled>
<wls:timeout-secs>1200</wls:timeout-secs>
<wls:cookie-domain>@MY_COOKIE_DOMAIN@</wls:cookie-domain>
</wls:session-descriptor>




Assuming that no JSPs are being hot deployed or dropped directly into the production server:

<wls:jsp-descriptor>
<wls:keepgenerated>true</wls:keepgenerated>
<wls:precompile>true</wls:precompile>
<wls:page-check-seconds>-1</wls:page-check-seconds>
</wls:jsp-descriptor>



and this one

<wls:container-descriptor>
<wls:servlet-reload-check-secs>-1</wls:servlet-reload-check-secs>
</wls:container-descriptor>



Ensure all of these are inserted at the right place within weblogic.xml to avoid any startup error messages.

Check against the reference here http://download.oracle.com/docs/cd/E12840_01/wls/docs103/webapp/weblogic_xml.html

Thursday, 6 August 2009

Weblogic JMS Performance Tuning Tips

Here are a couple of real-life tips on tuning JMS performance.

Problem:
The application creates and sends JMS messages on an outgoing queue for consumption by another application.

It was observed that when the consuming application was offline for a period of time, the number of messages that could be retained on the queue before the JVM heap filled up was quite low. This was tested to be roughly 7000 messages, after which OutOfMemory exceptions began to occur. Given that the consuming application could realistically be offline for a period - hence the use of an asynchronous queue - we needed to increase the number of messages that could realistically be stored in the queue.

The exception we get is shown below:


Start server side stack trace:
java.lang.OutOfMemoryError:

Start server side stack trace:
java.lang.OutOfMemoryError
<<no stack trace available>>
End server side stack trace
at weblogic.rmi.internal.BasicOutboundRequest.sendReceive(BasicOutboundRequest.java:109)
at weblogic.rmi.internal.BasicRemoteRef.invoke(BasicRemoteRef.java:127)
at weblogic.jms.dispatcher.DispatcherImpl_WLStub.dispatchSyncFuture(Unknown Source)
at weblogic.jms.dispatcher.DispatcherWrapperState.dispatchSync(DispatcherWrapperState.java:286)
at weblogic.jms.client.JMSSession.createProducer(JMSSession.java:1484)
at weblogic.jms.client.JMSSession.createSender(JMSSession.java:1335)
...


The GC logs also show frequent Full GCs before the server goes out of memory.


Solution Steps:


1. Enabling JMS Paging

Paging had not been enabled for the queue. Despite this queue being persistent, this meant that every message was stored in the JVM memory heap in its entirety. Enabling message paging for this queue means that only the headers for paged messages are kept in memory, significantly reducing the amount of heap utilized.

As the messages were being persisted via a JDBCStore to a database, this functioned as a paging store as well. However, a FileStore must still be specified as the paging store for the JMS Server, or the JMS Server will not deploy at WLS server start time; this is due to the need to cater for any non-persistent destinations when paging is enabled. If non-persistent messages are not paged, the size of this FileStore will be negligible. Paged messages still occupy some space on the memory heap as the message headers are still kept in memory.

On enabling paging for the specific queue, the test could cater for roughly 15000 messages before OutOfMemory exceptions occurred. The threshold at which paging begins was deliberately set low, at 100 messages. Recovering from the page store does incur a certain performance cost, so in Production this was set to a more reasonable number based on the peak number of messages expected in the queue under normal conditions.
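
For reference, these paging settings live on the JMS Server element in the WLS 8.1 config.xml. The sketch below is from memory, so verify the attribute names against your console; the store names are hypothetical:

<JMSServer Name="MyJMSServer"
    Store="MyJDBCStore"
    PagingStore="MyPagingFileStore"
    MessagesPagingEnabled="true"
    MessagesThresholdHigh="100"/>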

Despite the gain in the number of messages that could be catered for, the heap utilisation graphs were very similar to those from before paging was enabled. They showed no minor GCs, only full GCs at fairly frequent intervals.


2. JVM Settings and Garbage Collection Tuning

The untuned JVM heap size was 512Mb. This value could have been increased, but test results after tuning the JVM settings indicate that it was probably more than adequate.

Examining the current JVM settings uncovered some settings that needed to be changed.

The most significant issue with the JVM settings was the NewSize value. This was set very high, at 384Mb out of the total heap of 512Mb.
A reasonable New Generation area would normally be 20-25% of the total heap size, and setting it larger than the Tenured Generation area is guaranteed to cause unhealthy GC operations. In addition, it is good practice to use NewRatio rather than NewSize to avoid fixing an absolute size. A NewRatio of 3 (1:3, i.e. 25% of heap) or 4 is considered the most appropriate for WebLogic Server applications. NewSize was therefore dropped and NewRatio was set to 4 (20% of total heap).

The SurvivorRatio value was set to a reasonable value of 3.
TargetSurvivorRatio, however, was unset meaning that the default of 50 applies. 80 would probably be a better setting, meaning that the switch between survivor spaces in the JVM heap would occur at 80% rather than 50%. The performance improvement from this should be noticeable in the frequency of minor GC, though not dramatic.

The PermSize values were high, with PermSize and MaxPermSize both set to 384Mb. These were reduced to 64Mb and 128Mb respectively, which should be more than adequate. Though these changes are unlikely to improve performance, having the PermSize set too high will needlessly consume memory.
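
Putting the changes above together, the tuned flag set looked something like this (a sketch assembled from the steps described, not a copy of the actual start script):

-Xms512m -Xmx512m -XX:NewRatio=4 -XX:SurvivorRatio=3 -XX:TargetSurvivorRatio=80 -XX:PermSize=64m -XX:MaxPermSize=128m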

The effect of setting a good NewRatio value was dramatic. Many minor GCs were now the rule, with infrequent full GCs. 60,000 messages were added to the queue before the heap was approaching full. We ran an overnight test, and somewhere between 60,000 and 70,000 messages the OutOfMemory exceptions occurred.

Scaling this up to the Production environment, which has a 1Gb heap, shows that the server could easily cater for the expected load.

Monday, 13 July 2009

JVM Tuning from the Trenches

This article is a follow-up to http://jojovedder.blogspot.com/2009/06/slow-weblogic-part-6-jvm-heap-analysis.html. Please read that one first for the basics on JVM heap, parameters and flags.

Also remember these tips have worked for the server settings and issues described below, but blindly using them on your server will not produce the same results. You have to measure first, and tune based on what you measure.

Problem:
Platform running Weblogic 8.1 on Sun V880 servers. Total RAM of 32 Gb on the machine.
2 Gb assigned to the managed server JVM heap. JDK 1.4

Initial settings:
-XX:+AggressiveHeap -Xms2048m -Xmx2048m  -XX:SurvivorRatio=32 -XX:MaxPermSize=128m 


But still there were 20 Full GCs per hour at peak times, before the server crashed.


Analysis

1. It was decided to reduce the SurvivorRatio to 4 and restart with some more flags.

The size of ONE Survivor Space is calculated as

SurvivorSpace = NewSize / (SurvivorRatio + 2)

Keeping SurvivorRatio at 32 means the Survivor spaces are too small for promoting objects from Eden. Hence we reduced it to 4, which allows for larger Survivor spaces.
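
To put numbers on that, using the 512m NewSize from the final settings below: with SurvivorRatio=32 each survivor space is 512 / (32 + 2) = ~15Mb, whereas with SurvivorRatio=4 it is 512 / (4 + 2) = ~85Mb.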

2. As per Sun Bug ID 6218833, setting AggressiveHeap before the heap sizes (Xmx and Xms) can confuse the JVM. Reorder so that -Xms and -Xmx come before -XX:+AggressiveHeap, or do not use it at all.

3. The application has 180+ EJBs with pools of beans. Hence set the -Dsun.rmi.dgc.client.gcInterval=3600000 (1 hour) instead of the default 60000 (1 min). More on this here: http://docs.sun.com/source/817-2180-10/pt_chap5.html

4. The site is restarted once a week at 4:30AM. The pattern stays normal for 2 days - and then degrades into full GCs.

5. The Old space is pretty much full - at every minor collection, the Old space must have room for promotion from Young to Old to take place.

6. Permanent space is pretty much full - it keeps loading more and more classes (could that be a problem - the difference in the number of JSPs per release?).
Hence we increased the PermSpace from 128M to 256M.

7. Ensure we are running the server JVM by using the -server flag

8. Use OptimizeIt or similar profiling tool to see the memory usage and find code bottlenecks.


The settings now were

-server -Xms2048m -Xmx2048m  -XX:MaxNewSize=512m -XX:NewSize=512m -XX:SurvivorRatio=4 -XX:MaxPermSize=256m -Xincgc -XX:+DisableExplicitGC -XX:+AggressiveHeap -XX:-OmitStackTraceInFastThrow



This reduced the Full GCs to one a day.

Error Logs

At the time of the server going out of memory prior to a crash, the logs are filled with repeated errors (up to 100 repetitions) of this sort

java.lang.NullPointerException
 <<no stack trace available>>


Adding the -XX:-OmitStackTraceInFastThrow flag resolves this problem; the root cause of the NPE itself still has to be tracked down, but we no longer have the issue of huge numbers of repeated stack-trace-less exceptions.

We could now see the stack trace as

java.lang.NullPointerException
 at java.util.StringTokenizer.<init>(StringTokenizer.java:117)
 at java.util.StringTokenizer.<init>(StringTokenizer.java:133)
 at jsp_servlet._framework._security.__login._jspService(login.jsp:294)
 at weblogic.servlet.jsp.JspBase.service(JspBase.java:27)
 at weblogic.servlet.internal.ServletStubImpl$ServletInvocationAction.run(ServletStubImpl.java:1075)


This seems to be a Sun bug described here.

Tuesday, 7 July 2009

More on Weblogic Muxers

This is a follow-up to an earlier article which detailed the Weblogic Socket Muxer pattern in thread dumps.


Another ERROR pattern associated with Muxers is shown below


####<Jun 23, 2009 9:25:24 AM BST> <Error> <Socket> <d001> <managed1> <ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'> <<WLS Kernel>>
<BEA-000403> <IOException occurred on socket: Socket[addr=/10.23.22.1,port=1950,localport=61002]
java.net.SocketException: Connection refused.
java.net.SocketException: Connection refused
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at weblogic.socket.SocketMuxer.readReadySocketOnce(SocketMuxer.java:672)
at weblogic.socket.SocketMuxer.readReadySocket(SocketMuxer.java:648)
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:123)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)



As written in an earlier post, each incoming request on a TCP socket uses one File Descriptor.

We monitored the File Descriptors available to this Weblogic server (pid = 8359) using the commands below.

The allocated limit:

/usr/proc/bin/pfiles 8359 | grep rlimit
  1024

The number currently in use:

ls /proc/8359/fd | wc -l
  140



The first command shows that 1024 FDs are allocated to this process, and the second shows how many are being used in real-time.

A plot of the second value during high load times shows the FDs go up to 900+ in use and hence a resolution used in this case is to increase the number of FDs.

A recommended value is to increase it to 8192.
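
For example, on the shell that starts the server (the hard limit may also need raising, e.g. in /etc/system on Solaris - check with your sysadmin):

ulimit -n 8192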

Monday, 15 June 2009

Slow Weblogic Part 6 - JVM Heap Analysis using GCViewer

Basics


In an earlier article, I had listed the review of JVM memory parameters as one of the important checks for tuning the JEE server platform.

The basic primer for JDK 1.4 is at http://java.sun.com/docs/hotspot/gc1.4.2/


The key points you need to know are:

Total JVM Heap = Young + Tenured(also called Old)


Young = Eden + From (SS1) + To (SS2)




In the diagram below [taken from the Sun website], "From" and "To" are the names of the two Survivor Spaces (SS) within the "Young".

Perm Space (and code cache): stores JVM’s own stuff. This is outside the Heap you assign using Xms and Xmx. A good explanation of this is available here

The JVM Heap is at default initial 2Mb and max 64Mb (for JDK 1.4 on Solaris).
Default Perm Size is 16MB (for JDK 1.4 on Solaris)
The defaults change for each JDK and are different on each OS - so look up the values on the respective websites.






[Diagram: the heap generations - Young (Eden plus two Survivor Spaces), Tenured, and Perm]

The ratios are as shown below:

[Diagram: default sizing ratios between the generations]

Now the object life cycle and garbage collection occurs like this:

1. Objects when created are always first allocated to Eden.
2. When Eden fills up, a fast but not comprehensive GC (minor collection) is run over the young generation only.
3. All surviving objects are moved from Eden into one Survivor Space.
4. In consequent minor collections, new objects move from Eden into the other Survivor Space, plus everything from the first Survivor Space (survivors from the previous minor collection) is also moved into the second Survivor Space. Thus one survivor should be empty at that time.
5. When objects in Survivor Space are old enough (or survivor fills up), they are moved to Tenured. By default the long-lived objects may be copied up to 31 times between the Survivor Spaces before they are finally promoted to the Old generation.
6. When tenured fills up, a Full GC collection is run that is comprehensive: the entire heap is analyzed, all objects that can be destroyed are killed and memory is reclaimed.

Note: the above lifecycle changes slightly when advanced options such as ConcurrentMarkSweep etc are enabled.

Look Closer

What do these values mean?

A full list of options available at http://java.sun.com/docs/hotspot/VMOptions.html and http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp


The absolute basic ones are listed in the table below. Note: This is for JDK 1.4
Some of these have changed in JDK 1.6











-Xms1536m -Xmx1536m : These represent the total heap (minus Perm space). Xms is the Initial Heap, set to 1.5Gb in this case. Xmx is the Max Heap. It is good practice to set Xms = Xmx. The max heap is limited by the RAM available on the server.

-XX:NewSize=512m : This specifies the initial size of the Young generation, set to 512Mb in this example. It is better to set this as a percentage of the Heap using -XX:NewRatio.

-XX:MaxNewSize=512m : This specifies the maximum size of the Young generation, set to 512Mb in this example. It is better to set this as a percentage of the Heap using -XX:MaxNewRatio.

-XX:PermSize=64m -XX:MaxPermSize=128m : These values are the minimum and maximum sizes of the permanent generation heap space. Optimally, set PermSize equal to MaxPermSize to avoid the heap having to be adjusted when the permanent area grows. As specified earlier, this area of memory is over and above the total Heap set using Xms.

-XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 : The New generation area is divided into three sub-areas: Eden, and two survivor spaces that are equal in size. Use the -XX:SurvivorRatio=X option to configure the ratio of the Eden/survivor space size. In the above example, setting it to 8 means the ratio of Eden:SS1:SS2 is 8:1:1. So for a NewSize of 512Mb, the two SS will be 51Mb each, and Eden will be 512 - (51 + 51) = 410Mb. A TargetSurvivorRatio of 90 allows 90% of the survivor spaces to be occupied instead of the default 50%, allowing better utilization of the survivor space memory.

-XX:MaxTenuringThreshold=10 : This switch determines how many times the objects are hopped between the Survivor spaces before getting promoted to the older generation. The default value is 31.

-XX:+DisableExplicitGC : Prevents application code from forcing a full GC via System.gc().

-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -Xloggc:/log/gc.log : These are GC-specific settings asking for GC details and the log file name in which these details should be captured.


If you have an appetite for more, read this http://java.sun.com/performance/reference/whitepapers/tuning.html

GCViewer

The link below explains how to download and use the GCViewer tool. This is quite a useful tool for viewing the number of GCs and Full GCs and how the JVM is behaving over time.

http://www.javaperformancetuning.com/tools/gcviewer/index.shtml

The most important things to look at in the GCViewer analysis are:

* Acc Pauses - Accumulated Pause Time (total time the app was stopped for GC). Pauses are the times when an application appears unresponsive because garbage collection is occurring.
* Total Time - Total time the application runs.
* Throughput - Time the application runs and is not busy with GC. Greater than 99% is fantastic. Throughput is the percentage of total time not spent in garbage collection, considered over long periods of time.
* Footprint - Overall memory consumption, ideally as low as possible. This is the working set of a process, measured in pages and cache lines. On systems with limited physical memory or many processes, footprint may dictate scalability. This usually reflects the size of the total Heap allocated via Xms and Xmx.
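
As a worked example of throughput: if the application ran for 1000 seconds, of which 5 seconds were accumulated GC pauses, the throughput is (1000 - 5) / 1000 = 99.5%.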

This diagram is taken from the above site:

[Screenshot: GCViewer chart window]
Tuning Example From the Trenches - Frequent GC due to Perm Space getting Full

JVM Parameters already set

java -server -Xms1024m -Xmx1024m -XX:MaxPermSize=340m -XX:NewSize=340m -XX:MaxNewSize=340m
-XX:SurvivorRatio=9 -XX:TargetSurvivorRatio=90 -XX:+UseParNewGC
-Xloggc:/wls_domains/gclog/jms.gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
-XX:+JavaMonitorsInStackTrace -Dweblogic.system.BootIdentityFile=/wls_domains/xx/boot.properties
-Dweblogic.management.server=http://xx.xx.xx.xx:6000 -Dweblogic.Name=xxxxxx
-Dweblogic.ProductionModeEnabled=true
-Djava.security.policy=/opt/bea/wls/8.1sp4/weblogic81/server/lib/weblogic.policy
weblogic.Server


Full GC Pattern

The GC log shows


0.000: [Full GC 0.000: [Tenured: 0K->9794K(700416K), 0.8050952 secs] 134607K->9794K(1016960K), [Perm
: 20479K->20479K(20480K)], 0.8053527 secs]


The first Full GC happens at 0.000 sec from the start of the server. The figures before and after '->' represent the size of live objects before and after the GC. The number in parenthesis indicates the total available space. So from the above numbers, Tenured was not full but Perm was almost full (20479K of 20480K).



6579.013: [Full GC 6579.013: [Tenured: 9794K->18941K(700416K), 0.9677233 secs]
155600K->18941K(1016960K), [Perm : 24575K->24575K(24576K)], 0.9679896 secs]


The same thing happens here. The second Full GC took place at 6579.013 sec (1hr 49mins) from startup of the server.
Again a Full GC is triggered, though the Tenured space was not full: Tenured is now 18.9Mb out of 700Mb - but the Perm Space has grown to 24.5Mb and is not getting cleared.


9363.515: [Full GC 9363.516: [Tenured: 18941K->19463K(700416K), 0.6532332 secs] 36950K->19463K(1016960K), [Perm : 28672K->26462K(28672K)], 0.6536095 secs]



At the 3rd Full GC, 9363 seconds after server startup, the Perm space grew to 28.6Mb and recovered only marginally to 26.4Mb.

Observing this over a long period of time, we concluded that at startup around 20Mb is allocated to Perm, which with each Full GC keeps growing up to 30Mb and later shrinks back to 25Mb, and the cycle continues.

The pattern is highlighted below:


0.000: [Full GC 0.000: [Tenured: 0K->9794K(700416K), 0.8050952 secs] 134607K->9794K(1016960K), [Perm : 20479K->20479K(20480K)], 0.8053527 secs]
6579.013: [Full GC 6579.013: [Tenured: 9794K->18941K(700416K), 0.9677233 secs] 155600K->18941K(1016960K), [Perm : 24575K->24575K(24576K)], 0.9679896 secs]
9363.515: [Full GC 9363.516: [Tenured: 18941K->19463K(700416K), 0.6532332 secs] 36950K->19463K(1016960K), [Perm : 28672K->26462K(28672K)], 0.6536095 secs]
13483.233: [Full GC 13483.233: [Tenured: 19463K->16962K(700416K), 0.9783693 secs] 26678K->16962K(1016960K), [Perm : 30719K->21330K(30720K)], 0.9857390 secs]
17308.829: [Full GC 17308.830: [Tenured: 16962K->17312K(700416K), 1.0578872 secs] 88025K->17312K(1016960K), [Perm : 25600K->25600K(25600K)], 1.0581738 secs]
21237.810: [Full GC 21237.810: [Tenured: 17312K->17814K(700416K), 1.4728764 secs] 302290K->17814K(1016960K), [Perm : 29695K->26719K(29696K)], 1.4801234 secs]
30079.672: [Full GC 30079.672: [Tenured: 17814K->18676K(700416K), 1.0282446 secs] 83159K->18676K(1016960K), [Perm : 30975K->27564K(30976K)], 1.0349869 secs]




Solution:
Though MaxPermSize=340m is provided, the initially available Perm Size keeps getting full and the JVM invokes a Full GC to free up memory. The default initial PermSize is 16Mb, and hence the Perm Space keeps resizing itself as the JVM grows.

Not sure if this behaviour is a bug, but adding an initial PermSize of 64Mb using the flag -XX:PermSize=64m resolved this issue of frequent Full GCs.
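
With that, the relevant part of the start command becomes (a sketch of the changed flags only):

java -server -Xms1024m -Xmx1024m -XX:PermSize=64m -XX:MaxPermSize=340m -XX:NewSize=340m -XX:MaxNewSize=340m ...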


UPDATE: Another example is published at http://jojovedder.blogspot.com/2009/07/jvm-tuning-from-trenches.html

Sunday, 17 May 2009

Weblogic - Socket Muxers in Thread Dumps

What are these weblogic.socket.Muxer threads seen in thread dumps?

Note: for a basic primer on taking thread dumps and analyzing them, see this earlier article

Socket Reader Threads accept the incoming request from the Listen Thread Queue and put it on the Execute Thread Queue.

In WL 8.1, there are 3 socket reader threads by default.
In WL 9 and 10, WebLogic allocates 33% of server threads to act as socket readers by default. This need not be changed usually.

One socket reader thread is usually in the poll function, while the others are available to process requests.
The polling thread is highlighted in the thread dump below.


"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x016b2148 nid=0x42 waiting for monitor entry [5997f000..5997fc28]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
- waiting to lock <0x94846b40> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)


"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x00683c28 nid=0x41 waiting for monitor entry [59a7f000..59a7fc28]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
- waiting to lock <0x94846b40> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)


"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x0079e5b0 nid=0x40 runnable [59b7f000..59b7fc28]
at weblogic.socket.PosixSocketMuxer.poll(Native Method)
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:100)
- locked <0x94846b40> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)



In an earlier support case on Stuck Threads, we asked BEA:

Should we worry about the weblogic.socket.Muxer threads, which always show 2 threads waiting for a lock and the 3rd thread locking the same object?

The Muxer TD is attached. This shows same behaviour on all our Weblogic servers.


Full thread dump Java HotSpot(TM) Server VM (1.4.2_05-b04 mixed mode):

"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x0151
c588 nid=0x1b4 waiting for monitor entry [ad57f000..ad57fc28]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
- waiting to lock <0xd9331760> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)



"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x0161
d608 nid=0x1b3 runnable [ad67f000..ad67fc28]
at weblogic.socket.PosixSocketMuxer.poll(Native Method)
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:100)
- locked <0xd9331760> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)



"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x01bb
6730 nid=0x1b2 waiting for monitor entry [ad77f000..ad77fc28]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
- waiting to lock <0xd9331760> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)



The reply from BEA Support was that the above pattern of weblogic.socket.Muxer threads are not a cause of stuck threads.


Why do they mostly show as being Stuck in the Samurai TD analyzer?




[Screenshot: Samurai thread dump view flagging the muxer threads as blocked]

As the image shows, when you analyze thread dumps using Samurai, the muxer threads are shown as being Stuck since they're all locked on the same object. This is probably treated as a deadlock condition.


"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'"
- waiting to lock <0xd9b61098> (a java.lang.String)


"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'"
- waiting to lock <0xd9b61098> (a java.lang.String)


"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'"
- locked <0xd9b61098> (a java.lang.String)



But you will see the same in any Thread dump even on a development instance with no requests.
The locks mentioned do show up as red in Samurai - but they aren't deadlocks, just regular locks.

A thread gains an exclusive lock on an object to perform some action, then frees it allowing the next thread to gain access.

Additionally, if you look at the thread dumps over time, you'll see that these specific locks are not always present - they move between the threads, which is indicative of their transitory nature.



I want to know more details on Muxers


The socket Muxer manages the server’s existing socket connections.
It first determines which sockets have incoming requests waiting to be processed. It then reads enough data to determine the protocol and dispatches the socket to an appropriate runtime layer based on the protocol.
In the runtime layer, the socket muxer threads determine which execute thread queue to be used and delegates the request accordingly.

From the documentation on http://edocs.bea.com/wls/docs100/perform/WLSTuning.html#wp1152246 ,
Weblogic has two versions of the socket muxer, one is the Java version and the other uses a native library which makes better use of operating system calls. The Enable Native IO checkbox on the server’s configuration settings tells the server which version to use. This is ON by default for most platforms.

Native muxers provide superior scalability because they implement a non-blocking thread model. When a native muxer is used, the server creates a fixed number of threads dedicated to reading incoming requests. Oracle recommends using the default setting of true for the Enable Native IO parameter which allows the server to automatically select the appropriate muxer to use.

To use Native I/O, you must ensure the native library is present in the server's shared library path; this is set up by the default scripts.
When the server does not find the native library, it throws the error
java.lang.UnsatisfiedLinkError: no muxer in java.library.path
and then falls back to the Java version of the muxer.

Confirm the LD library path is okay and pointing to the Solaris LD path. Check the startup log when starting a managed server: what is the value of java.library.path?
This is where the JVM actually gets the library from.

http://m-button.blogspot.com/2008/08/how-does-weblogic-handle-socket-muxers.html has a good example of how to identify Native vs Java muxer in a thread dump.

The Thread Dump I’ve used in my examples above uses the Native muxer (weblogic.socket.PosixSocketMuxer) on Solaris.

Solaris has another native muxer called weblogic.socket.DevPollSocketMuxer.
An example TD using this muxer is shown below.


"ExecuteThread: '4' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)



"ExecuteThread: '3' for queue: 'weblogic.socket.Muxer'" RUNNABLE native
weblogic.socket.DevPollSocketMuxer.doPoll(Native Method)
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:96)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)



weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)
"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)



"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)



"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)





To change the number of Muxers from the default, follow the instructions given at http://e-docs.bea.com/wls/docs92/ConsoleHelp/taskhelp/tuning/TuningSocketReaders.html




See http://jojovedder.blogspot.com/2009/07/more-on-weblogic-muxers.html for an update on Muxers.

Additionally, on Oracle JRockit JVMs there is some information in the thread dumps which points out the same behaviour in a different manner.

After the normal stack dumps, BEA JRockit performs deadlock detection. This is done by finding "lock chains" in the Java application. If a lock chain is found to be circular, the application is considered caught in a deadlock.

A detailed explanation of the 3 types of lock chains in JRockit is given here


What is relevant for us is the example of Muxers which are shown as:


Blocked lock chains
===================
Chain 2:
"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" id=129 idx=0x218 tid=4079 waiting for java/lang/String@0x37804000 held by:
"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" id=127 idx=0x210 tid=4077 in chain 1

Open lock chains
================
Chain 1:
"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" id=128 idx=0x214 tid=4078 waiting for java/lang/String@0x37804000 held by:
"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" id=127 idx=0x210 tid=4077 (active)


As per the explanation, the Open lock chain shows Thread 1 waiting for Thread 0. This is not a deadlock, only a straight dependency.

Since Thread 0 is already part of the Open lock chain, the fact that Thread 2 is also waiting on the same Thread 0 is treated as a "Blocked lock chain".
In this case this is not a problem.

Update 15th Feb 2011

I'm glad this blog entry on weblogic muxers has made it onto the Oracle forums with a mention from James Bayer.

Tuesday, 12 May 2009

Slow Server Response Part 4 - Platform Checks and Action Plan



In the earlier parts of this article I described some technical tuning tips specific to a slow-performing Weblogic JEE server.

Another earlier post has looked at analyzing thread dumps and prstats when the Weblogic/JEE server consumes high CPU.




This article provides a sequence of actions a Support team should carry out when faced with a critical situation - the site is down / servers running high CPU - and management teams want quick updates - and an action plan.

So here is (from experience) the Rapid Action Plan:


Technical Checklist for the Platform

1. Start a diary and mark out each of the steps given below as well as any following action with the timestamp at which it was carried out and the result/status.

2. Get the relevant technical experts on a conference call + Netmeeting/LiveMeeting/Desktop Sharing/Remote Admin.

3. Does the application have any traps or thresholds set which are configured to automatically raise alarms to the Support teams? Have any of the traps been exceeded? e.g. Server CPU, memory utilization, number of threads.

4. Can we narrow down to a problem area - web server, application server, database, OS - based on log files, error messages and Support team or user input?

If a particular Weblogic Managed Server is identified as a point of failure, does the configuration allow that server process to be shut down for a while - thereby reducing Cluster capacity but still providing an acceptable quality of service?

Are all the Managed Servers running with equal threads? If not, this can cause a load balancing issue.

Sometimes the bottleneck can be the Web server plugin, which is not able to properly load balance the requests across the Weblogic cluster. This is usually the case when users complain of loss of session, spontaneous logout etc. The problem can be that the user has been bounced from one Weblogic server to another in the cluster - and the session might not be replicated across the servers.

Are there any redundant JDBC connection pools, i.e. those configured with a high capacity when monitoring shows they don't need that many?
Then reduce the Capacity of that pool so that it does not hold on to unnecessary connections on the database.

5. From the log files, identify whether a particular application or code area is causing an issue. eg: EJB throwing errors, Spring bean configuration missing.

6. Are the log files too large (> 500 Mb), or not getting rotated via the Weblogic rotation policy or Unix archiving?

7. Check the downstream back-end systems which the server connects to - via DBLink, web service, XML/HTTP, JMS, HTTP screen scraping etc. Any known issues or planned outages? There should be error logs pointing in that direction. Contact their support teams to find out whether their system is available and returning responses as per the SLA.

8. Can the problem be replicated on Reference/Test instances?
A Dev or Test team can try in parallel to see whether the issue is replicable.

If Yes, is it code related or configuration related?

If the issue is not replicable, then can it be data related? Perhaps a particular set of data exists on Production which is not on the Test instance - and that could be the problem. Can the data be brought into Test to try and replicate the problem?

9. Can it be content related? Does the platform have a Content Management System? Is the link from the CMS to the server working or broken? Is the Content correctly getting deployed into the database + file system as per daily process?

Check if there was a content deployment carried out, and any records of whether it failed or passed. Is content deployment happening during business hours and utilizing system and CPU resources - which chokes the JEE server?

Can a resource-hungry content deployment process be moved to out-of-business hours?

10. Test broken user journeys on the site.

Can the problem be seen while running HttpHeaders, HttpAnalyzer, Fiddler etc.? Do they show any change in HTTP parameters such as Cookies or Session timeouts?
Compare these against the Test environment and see whether any mismatches could cause the problem.
If user sessions are bouncing between managed servers, this will be visible in the Weblogic JSESSIONID, whose value will keep changing on the client browser.
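
The session cookie can also be inspected from the command line (a sketch; the URL is hypothetical). The Weblogic JSESSIONID encodes the primary and secondary server ids after the '!' separators, so a changing suffix between requests points to session bouncing:

curl -s -D - -o /dev/null http://www.mysite.com/app/home | grep -i "set-cookie"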

11. What were the last few changes to the platform?

Check the latest release or configuration change as per the Support Team Diary of Events. Could these have caused an issue, and should they be rolled back?

Were these properly tested and signed off before going into Production?

eg: any new Database driver, changes to TCP parameters, JTA timeouts increased?
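
If the domain configuration is backed up or version-controlled, a straight diff against the last known-good copy shows exactly what changed (a sketch; both paths are hypothetical):

diff /backup/mydomain/config/config.xml /wls_domains/mydomain/config/config.xml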

12. Check the last few support cases raised. See if there were any problems reported by business or end customers.

13. Solaris/OS checks

Is the platform running on the latest OS patch levels and the JDK version/settings recommended by Sun?

a. No of processes running. Use
ps -ef | wc -l

b. Ping the boxes, to check if they are alive

c. CPU utilization

prstat

d. Memory utilization

vmstat 3

Swap space utilization, and the amount of space in /tmp - is any old file or core dump occupying the directory used as swap space? On Solaris, /tmp is backed by swap, so large files there eat into virtual memory. We once moved old EAR files out of /tmp on the server; memory utilization went from approx 90% down to 65%.
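
On Solaris, a quick sketch of these checks:

swap -s                  # summary of swap used vs available
df -k /tmp               # tmpfs usage - large files here consume swap
ls -lt /tmp | head -20   # recently written files or stray core dumps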

e. Disk space

df -k

f. No of File descriptors
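
A quick sketch for comparing open descriptors against the limit (the PID placeholder stands for the Weblogic JVM's process id):

ulimit -n                   # per-process file descriptor limit
ls /proc/<PID>/fd | wc -l   # descriptors currently open by the process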


14. Weblogic/Web server checks

a. Thread utilization - any Stuck Threads?

Analyze thread dumps - at least 4 sets taken 5 seconds apart when a stuck thread is observed. See here for more details on what to look for in the thread dumps. Use Samurai or TDA to view them.
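
A minimal capture sketch, assuming a Sun/HotSpot JDK and a single Weblogic JVM on the box (kill -3 writes the dump to the server's stdout log; jstack writes wherever you redirect it):

PID=`ps -ef | grep '[w]eblogic.Server' | awk '{print $2}'`
for i in 1 2 3 4
do
    jstack $PID > threaddump_$i.txt   # or: kill -3 $PID
    sleep 5
done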

b. CPU %

c. Access and error logs - any CRITICAL messages in the logs? Any Connection_Refused errors indicating the server was not able to accept new requests?

d. No of open sockets to weblogic

netstat -a | grep <listen-port> | wc -l   # <listen-port> is a placeholder for the managed server's listen port

e. Memory utilization via Weblogic console

f. Check via console if all the managed servers are up and running

g. Connection pool utilization - are the pools hitting their peak values?

h. Frequent Garbage collection shown in the console?
Frequency of GC, GC pattern. Has the JVM been tuned to allow optimum garbage collection? See this URL for more.
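
If GC behaviour is suspect, GC logging can be enabled at the next restart with the HotSpot flags of this era (the log path is hypothetical):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/wls_domains/mydomain/logs/gc.log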

i. Check the values in weblogic.xml for jsp pageCheckSeconds and servlet-reload-check-secs - if these are at the default of 1, the server will check every second whether the JSP should be recompiled - this is horribly slow in Production.
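
A sketch of the weblogic.xml entries to disable these checks (element names as per the Weblogic 9.x/10.x descriptor schema; verify against your version):

<jsp-descriptor>
    <page-check-seconds>-1</page-check-seconds>
</jsp-descriptor>
<container-descriptor>
    <servlet-reload-check-secs>-1</servlet-reload-check-secs>
</container-descriptor>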

j. Cron job logs - any failures.

k. No of weblogic sessions per server - the more HttpSessions there are, the more memory (RAM) gets used.

l. Is a large part of the application journey over SSL? When supporting the cryptography operations in the SSL protocol, WebLogic Server cannot handle as many simultaneous connections.
Typically, for every SSL connection that the server can handle, it can handle three non-SSL connections. SSL reduces the capacity of the server by about 33-50%, depending upon the strength of encryption used in the SSL connections.
(Source: http://edocs.bea.com/wlp/docs92/capacityplanning/capacityplanning.html#wp1080286)
Consider reducing the SSL journeys on the site.

m. Disk space taken by Weblogic and other logs such as Log4J.
Is Log4J running at DEBUG level and writing out loads of logs? This will also slow the server down horribly.

15. Database checks

a. SQL Server locks (Call out DBA)

b. Database stuck/locked processes

c. Any DB link down

d. Any issues with open cursors or cached cursors?

e. Is the database running at very high Memory Utilization?


16. Search Engine processing - check the log for the day.


17. Any MIS such as Webtrends / Omniture analysis of application usage? Has there been a sudden rise in users on the site - eg a marketing campaign or a new feature gone live - causing a rise in usage which the infrastructure cannot cope with?

18. Any application cached data which was wiped out and took time to rebuild - causing slow service in the interim? eg: is a database table with a lot of rows being cached?
Or conversely, is there incorrect data in a certain cache, and will clearing the cache help?

19. SMTP email delivery failures due to any problems on the OS?

20. Any planned backup processes running on the OS which take up a lot of CPU?
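
To see which processes are eating CPU outside the JVM (a sketch):

ps -eo pcpu,pid,user,args | sort -rn | head -10   # top 10 processes by CPU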



Remedial actions

1. Make a list of possible changes based on the above checks to address these problems.

2. Only change one setting on any system at a time. Test and record the desired effects and observed effects. Be clear on why a particular change is being made.

3. If it doesn't work, rework the plan to get to the root cause of the failure.

4. Be aware that reactive changes will be made directly to the production environment by various parties, and that significant changes will be made purely to enable investigation and diagnosis of issues.
The lack of up-to-date documentation creates risk. Maintain a documented rationale for each design decision, configuration choice, or system parameter; this reduces the likelihood that mistakes will be repeated. Documentation is a key communication tool; without it, intent may be miscommunicated within the team. If key staff members leave, knowledge will be lost to the extent that the platform may become unmanageable.

5. Add additional tests to the regression test suite. Increase its coverage, focusing on simulating live system interaction.


6. Over the long term, identify and re-architect towards removing Single Points of Failure - such that loss of a single machine or process would not lead to a loss of service.

Examples:
· Single web server machine, hosting the Apache/SunOne instances.
· Single application server machine, hosting the Weblogic application server.
· Single database server instance.

The system runs the risk of a lengthy service outage if any one of these components fails. If a hardware failure occurred and one of the servers was lost, alternative hardware would need to be installed and initialised from back-up tapes.
This needs to be fixed and stabilized over the long term.

7. Medium to long term remedial actions include code review. Use your tools of choice for Java, .NET, front-end Angular/React or back-end NodeJS - such as SonarQube, lint4j, SonarScanner for .NET, JSLint, ESLint etc.

8. While the analysis is going on, a member of the Support team should circulate the key metrics on an hourly basis to the technical community. Ensure this is the important dataset, and not so much information that it just becomes noise.


Example in the table below:

Metric                                 Server 1   Server 2   Server 3
Idle Threads                           11         14         17
JMS Errors                             10         1          7
IOExceptions                           3          0          1
Stuck Threads                          0          0          7
netstat -a | grep TIME_WAIT | wc -l    186        289        69
CPU Utilization (%)                    12         1.8        3.6
Memory Utilization (%)                 3.9        2.9        2.1
500 Internal Server Errors             3          0          0



No of logged in Users: 260

JMS Pending Messages: 0



Any queries or clarifications, leave me a comment and I'll try to get back.
