You Never Know: stuck thread

Showing posts with label stuck thread. Show all posts

Tuesday, 7 July 2009

More on Weblogic Muxers

This is a follow-up to an earlier article which detailed the Weblogic Socket Muxer pattern in thread dumps.

Another ERROR pattern associated with Muxers is shown below


####<Jun 23, 2009 9:25:24 AM BST> <Error> <Socket> <d001> <managed1> <ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'> <<WLS Kernel>> 
 <BEA-000403> <IOException occurred on socket: Socket[addr=/10.23.22.1,port=1950,localport=61002]
 java.net.SocketException: Connection refused.
java.net.SocketException: Connection refused
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at weblogic.socket.SocketMuxer.readReadySocketOnce(SocketMuxer.java:672)
        at weblogic.socket.SocketMuxer.readReadySocket(SocketMuxer.java:648)
        at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:123)
        at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
        at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
        at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)

As written in an earlier post, each incoming request on a TCP socket uses one File Descriptor.

We monitored the File Descriptors available to this Weblogic server using



Here the Weblogic pid = 8359

/usr/proc/bin/pfiles 8359 | grep rlimit


1024

ls /proc/8359/fd | wc –l


140

The first command shows that 1024 FDs are allocated to this process, and the second shows how many are being used in real-time.

A plot of the second value during high load times shows the FDs go up to 900+ in use and hence a resolution used in this case is to increase the number of FDs.

A recommended value is to increase it to 8192.

Sunday, 17 May 2009

Weblogic - Socket Muxers in Thread Dumps

What are these weblogic.socket.Muxer threads seen in thread dumps ?

Note: for a basic primer on taking thread dumps and analyzing them, see this earlier article

Socket Reader Threads accept the incoming request from the Listen Thread Queue and put it on the Execute Thread Queue.

In WL 8.1, there are 3 socket reader threads by default.
In WL 9 and 10, WebLogic allocates 33% of server threads to act as socket readers by default. This need not be changed usually.

One socket reader thread is usually in the poll function, while the others are available to process requests.
The polling thread is highlighted in the thread dump below.


"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x016b2148 nid=0x42 waiting for monitor entry [5997f000..5997fc28]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
- waiting to lock <0x94846b40> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)


"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x00683c28 nid=0x41 waiting for monitor entry [59a7f000..59a7fc28]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
- waiting to lock <0x94846b40> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)


"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x0079e5b0 nid=0x40 runnable [59b7f000..59b7fc28]
at weblogic.socket.PosixSocketMuxer.poll(Native Method)
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:100)
- locked <0x94846b40> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)

In an earlier support case on Stuck Threads, we asked BEA:

Should we worry about the Weblogic.socket.Muxer threads which always show 2 threads waiting for lock and 3rd thread locking the same object?

The Muxer TD is attached. This shows same behaviour on all our Weblogic servers.


Full thread dump Java HotSpot(TM) Server VM (1.4.2_05-b04 mixed mode):

"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x0151
c588 nid=0x1b4 waiting for monitor entry [ad57f000..ad57fc28]
        at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
        - waiting to lock <0xd9331760> (a java.lang.String)
        at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
        at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
        at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)



"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x0161
d608 nid=0x1b3 runnable [ad67f000..ad67fc28]
        at weblogic.socket.PosixSocketMuxer.poll(Native Method)
        at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:100)
        - locked <0xd9331760> (a java.lang.String)
        at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
        at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
        at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)



"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" daemon prio=5 tid=0x01bb
6730 nid=0x1b2 waiting for monitor entry [ad77f000..ad77fc28]
        at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:91)
        - waiting to lock <0xd9331760> (a java.lang.String)
        at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
        at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:219)
        at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:178)

The reply from BEA Support was that the above pattern of weblogic.socket.Muxer threads are not a cause of stuck threads.

Why do they mostly show as being Stuck in Samurai TD analyzer ?

As the image shows, when you analyze thread dumps using Samurai, the muxer threads are shown as being Stuck since they're all locked on the same object. This is probably treated as a deadlock condition.


"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'"
- waiting to lock <0xd9b61098> (a java.lang.String)

"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'"
- waiting to lock <0xd9b61098> (a java.lang.String)

"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'"
- locked <0xd9b61098> (a java.lang.String)

But you will see the same in any Thread dump even on a development instance with no requests.
The locks mentioned do show up as red in Samurai - but they aren't deadlocks just regular locks.

A thread gains an exclusive lock on an object to perform some action, then frees it allowing the next thread to gain access.

Additionally, if you look at the thread dumps over time, you'll see that these specific locks are not always present - they are moving between the threads which is indicative of their transitory nature.

I want to know more details on Muxers

The socket Muxer manages the server’s existing socket connections.
It first determines which sockets have incoming requests waiting to be processed. It then reads enough data to determine the protocol and dispatches the socket to an appropriate runtime layer based on the protocol.
In the runtime layer, the socket muxer threads determine which execute thread queue to be used and delegates the request accordingly.

From the documentation on http://edocs.bea.com/wls/docs100/perform/WLSTuning.html#wp1152246 ,
Weblogic has two versions of the socket muxer, one is the Java version and the other uses a native library which makes better use of operating system calls. The Enable Native IO checkbox on the server’s configuration settings tells the server which version to use. This is ON by default for most platforms.

Native muxers provide superior scalability because they implement a non-blocking thread model. When a native muxer is used, the server creates a fixed number of threads dedicated to reading incoming requests. Oracle recommends using the default setting of true for the Enable Native IO parameter which allows the server to automatically select the appropriate muxer to use.

You must ensure that to use Native I/O, the native library must be present in the server’s shared library path . This is set up with the default scripts.
When the server does not find the native library, it throws an error
java.lang.UnsatisfiedLinkError: no muxer in java.library.path
and then loads the Java version of the muxer.

Confirm the LD library path is okay and pointing to the Solaris LD path. Check the startup log when starting a managed server. What is the value of java.library.path?
This is where the JVM actually get's the library from.

http://m-button.blogspot.com/2008/08/how-does-weblogic-handle-socket-muxers.html has a good example of how to identify Native vs Java muxer in a thread dump.

The Thread Dump I’ve used in my examples above uses the Native muxer (weblogic.socket.PosixSocketMuxer) on Solaris.

Solaris has another Native muxer called the weblogic.socket.DevPollSocketMuxer
An example TD using this muxer is shown below.


"ExecuteThread: '4' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)



"ExecuteThread: '3' for queue: 'weblogic.socket.Muxer'" RUNNABLE native
weblogic.socket.DevPollSocketMuxer.doPoll(Native Method)
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:96)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)



weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)
"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)



"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)



"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" waiting for lock java.lang.String@4edf4f BLOCKED
weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:95)
weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)

To change the number of Muxers from the default, follow the instructions given at http://e-docs.bea.com/wls/docs92/ConsoleHelp/taskhelp/tuning/TuningSocketReaders.html

See http://jojovedder.blogspot.com/2009/07/more-on-weblogic-muxers.htmlfor an update on Muxers

Additionally on Oracle JRockit JVMs - there are some information in the thread dumps which point out the same problem in a different manner.

After the normal stack dumps, BEA JRockit performs a deadlock detection. This is done by finding "lock chains" in the Java application. If a lock chain is found to be circular, the application is considered caught in a deadlock.

A detailed explanation of the 3 types of lock chains in JRockit is given here

What is relevant for us is the example of Muxers which are shown as:

Blocked lock chains
===================
Chain 2:
"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" id=129 idx=0x218 tid=4079 waiting for java/lang/String@0x37804000 held by:
"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" id=127 idx=0x210 tid=4077 in chain 1

Open lock chains
================
Chain 1:
"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" id=128 idx=0x214 tid=4078 waiting for java/lang/String@0x37804000 held by:
"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" id=127 idx=0x210 tid=4077 (active)

As per the explanation, the Open lock chain depicts Thread 1 waiting for Thread 0. This is not a deadlock, only a straight dependency.

Since Thread 0 is already part of the Open lock chain, the fact that Thread 2 is also waiting on the same Thread 0 is treated as a "Blocked lock chain".
In this case this is not a problem.

Update 15th Feb 2011

I'm glad this blog entry on weblogic muxers has made it onto the Oracle forums with a mention from James Bayer.

Tuesday, 12 May 2009

Slow Server Response Part 4 - Platform Checks and Action Plan

In the earlier parts of this article I described some technical tuning tips specific to a slow-performing Weblogic JEE server.

Another earlier post has looked at analyzing thread dumps and prstats when the Weblogic/JEE server consumes high CPU.

This article provides a sequence of actions a Support team should carry out when faced with a critical situation - the site is down / servers running high CPU - and management teams want quick updates - and an action plan.

So here is (from experience) the Rapid Action Plan:

Technical Checklist for the Platform

1. Start a diary and mark out each of the steps given below as well as any following action with the timestamp at which it was carried out and the result/status.

2. Get the relevant technical experts on a conference call + Netmeeting/LiveMeeting/Desktop Sharing/Remote Admin.

3. Does the application have any traps or thresholds set which are configured to automatically raise alarms to the Support teams? Have any of the traps set been exceeded ? eg: Server CPU, Memory utilization, No of Threads

4. Can we narrow down to a problem area - Web server, Application server, Database, OS - based on log files, error messages and Support team or User input.

If a particular Weblogic Managed Server is identified as a point of failure, does the Configuration allow that server process to be shutdown for a while - thereby reducing Cluster capability but still providing acceptable Quality of service.

Are all the Managed Servers running with equal threads? If not, this can cause a load balancing issue.

Sometimes the bottleneck can be the Web server plugin which is not able to properly load balance the requests across the Weblogic cluster. This is usually the case when users complain of loss of session, spontaneous logout etc. The problem can be the user has been bounced from one Weblogic server to another in the cluster - and the session might not be replicated across the servers.

Any redundant JDBC connection pools - i.e. those configured with a high capacity but monitoring shows they dont need that many.
Then reduce the Capacity of that pool so that it does not hold on to unnecessary connections on the database.

5. From the log files, identify whether a particular application or code area is causing an issue. eg: EJB throwing errors, Spring bean configuration missing.

6. Are the log files too large (> 500 Mb) or not getting rotated via Weblogic rotation policy or Unix archiving ?

7. Check the Downstream Back-end systems which the server connects to - via DBLink, Web service, XML/Http, JMS, HTTP screen scraping etc.? Any known issues or planned outages? There should be error logs pointing in that direction. Contact their support teams to know if their system is available and returning responses as per the SLA.

8. Can the problem be replicated on Reference/Test instances?
A Dev or Test team can in parallel try out to see whether the issue is replicable.

If Yes, is it code related or configuration related?

If the issue is not replicable, then can it be data related ? Perhaps a particular set of data exists on Production which is not on the Test instance - and that could be the problem. Can the data be brought into Test to try and replicate the problem ?

9. Can it be content related? Does the platform have a Content Management System? Is the link from the CMS to the server working or broken? Is the Content correctly getting deployed into the database + file system as per daily process?

Check if there was a content deployment carried out and any records whether they failed or passed. Is content deployment happening during the business timings, and utilizing system and CPU resources - which chokes the JEE server?

Can a resource-hungry content deployment process be moved to out-of-business hours.

10. Test broken user journeys on the site.

Can the problem be seen while running HttpHeaders, HttpAnalyzer, Fiddler etc ? Does it show any change in HTTP parameters such as Cookies, Session timeouts?
Compare these against the Test environment and see whether any mismatches which could cause the problem.
If there is bouncing of user sessions between managed servers, this will be visible in the weblogic JSessionID which will be different on the client browser.

11. What were the last few changes to the platform ?

Check latest release or configuration change as per Support Team Diary of Events. Could these have caused an issue and should these be rolled back?

Were these properly tested and signed off before going into Production.

eg: any new Database driver, changes to TCP parameters, JTA timeouts increased?

12. Check the last few support cases raised? See if there were any problems reported by business or end customers.

13. Solaris/OS checks

Is the platform running on the latest OS patch levels and JDK settings as recommended by Sun.

a. No of processes running. Use
ps -ef | wc -l

b. Ping the boxes, to check if they are alive

c. CPU utilization

prstat

d. Memory utilization

vmstat 3

Swap space utilization, amount of space in /tmp - is there any old file or core dump occupying the directory used as swap space. We once moved old EAR files from /tmp on the server; memory utilisation went from approx 90% down to 65%.

e. Disk space

df -ek

f. No of File descriptors

14. Weblogic/Web server checks

a. Thread utilization - any Stuck Threads

Analyze Thread dumps , at least 4 sets of Thread dumps taken 5 seconds apart when stuck thread is observed. See here for more details on what to look for in the thread dumps. Use Samurai or TDA

b. CPU %

c. Access and error logs - Any CRITICAL messages in the logs. Any Connection_Refused errors indicating the threads were not able to accept new requests.

d. No of open sockets to weblogic

netstat -a | grep

e. Memory utilization via Weblogic console

f. Check via console if all the managed servers are up and running

g. Connection pool utilization, Are they hitting the peak values.

h. Frequent Garbage collection shown in the console?
Frequency of GC, GC pattern. Has the JVM been tuned to allow optimum garbage collection. See this URL for more.

i. Check for the values in weblogic.xml for jsp-pageCheckSeconds and servlet-reload-check-secs - if these are at the default of 1, the server will check each second to see whether the JSP should be recompiled - this is horribly slow

j. Cron job logs - any failures.

k. No of weblogic sessions per server - the more the number of HttpSessions, the higher the memory (RAM) that gets used.

l. Is a large part of the application journey over SSL. When supporting the cryptography operations in the SSL protocol, WebLogic Server cannot handle as many simultaneous connections.
Typically, for every SSL connection that the server can handle, it can handle three non-SSL connections. SSL reduces the capacity of the server by about 33-50% depending upon the strength of encryption used in the SSL connections.
(Source: http://edocs.bea.com/wlp/docs92/capacityplanning/capacityplanning.html#wp1080286)
Consider reducing the SSL journeys on the site.

m. Disk space taken by Weblogic and other logs such as Log4J.
Is log4j running in DEBUG and writing out loads of logs ? This will also slow down the server horribly.

15. Database checks

a. SQL Server locks (Call out DBA)

b. Database stuck/locked processes

c. Any DB link down

d. Any issues with open cursors, cached cursors ?

e. Is the database running at very high Memory Utilization?

16. Search Engine processing - check the log for the day.

17. Any MIS such as Webtrends / Omniture Analysis - for application usage. Has there been a sudden rise in users on the site - eg a marketing campaign or a new feature gone live - causing a rise in usage which the infrastructure cannot cope with.

18. Any application cached data which was wiped out and took time to rebuild - causing slow service in the interim period. eg: is any database table with a lot of rows being cached.
Or conversely, is there incorrect data in a certain cache and will clearing the cache help ?

19. SMTP email delivery failures due to any problems on the OS ?

20. Any planned backup processes running on the OS which takes up a lot of CPU.

Remedial actions

1. Make a list of possible changes based on the above checks to address these problems.

2. Only change one setting on any system at a time. Test and record the desired effects and observed effects. Be clear on why a particular change is being made.

3. If it doesn't work rework the plan to get to root cause of failure.

4. Be aware that reactive changes will be made directly to the production
environment by various parties. Significant changes will be made purely to enable investigation and diagnosis of issues.
The lack of up-to-date documentation creates risk. Maintain a documented rationale for a particular design decision, configuration choice, or system parameter; this reduces the likelihood that mistakes will be repeated. Documentation is a key communication tool, without it intent may be miscommunicated within the team. If key staff members leave, knowledge will be lost to the extent that the platform may become unmanageable.

5. Add additional tests to the regression test
suite. Increase the coverage of the regression test suite, focussing on
simulating live system interaction.

6. Over the long term, identify and re-architect towards removing Single Points of Failure - such that loss of a single machine or process would not lead to a loss of service.

Examples:
· Single web server machine, hosting the Apache/SunOne instances.
· Single application server machine, hosting the Weblogic application server.
· Single database server instance.

The system runs at risk of lengthy service outage if any one of these components fails. If a hardware failure occurred and one of the servers was lost, alternative hardware would need be installed and initialised from back-up tapes.
This needs to be fixed and stabilized over the long term.

6. Medium to long term remedial action include code review. Use your tools of choice for Java. .NET, front end Angular/React or backend NodeJS such as sonarQube, lint4j, sonarscanner for .NET, jsLint, eslint etc

7. While the analysis is going on, a member of Support team should circulate the KEY Metrics on an hourly basis to the TECHNICAL community. Ensure this is the important dataset and not too much info which just becomes noise.

Example in the table below:

Servers	Server 1	Server 2	Server 3
Idle Threads	11	14	17
JMSErrors	1	0	17
IOExceptions	3	0	1
Stuck Threads	0	0	7
JMSErrors	1	0	17
netstat -a \| grep TIME_WAIT \| wc -l	186	289	69
CPU Utilization (%)	12	1.8	3.6
Memory Utilization (%)	3.9	2.9	2.1
500 Internal Server Error	3	0	0

No of logged in Users: 260

JMS Pending Messages: 0

Any queries or clarifications, leave me a comment and I'll try to get back.

Wednesday, 6 May 2009

Slow Weblogic Response Part 2 - Overall Tuning Considerations

Part 1 of this article was related to using the JSP precompilation and tuning the recompilation settings provided in Weblogic.

Now, lets take a look at the other parameters that require tuning and setting correctly, in order for the Weblogic server to be generating acceptable performance of your site.

Usually the problem statement is similar to this:

1. The site regularly stops responding and all available execute threads are consumed. All future requests fail to be handled and a server restart is required.
2. The page response times of the entire site are too high and need to be brought down to a more useable level.

Note: An additional post has been published which provides a skeleton Action Plan for analyzing the entire slow site/high CPU issue including managing stakeholder expectations and appropriate reporting.

Operating System Review
A review of the Operating System configuration needs to look at
a) The number of file descriptors available and whether that matches the value recommended for Weblogic
b) Various TCP settings (also called NDD parameters) which will affect how long it takes to recycle a closed or dropped connection.
2 top settings are:
• File Descriptor limit: increase to 8192 if not 65536. A detailed follow-up on File Descriptors is published here.
• tcp_time_wait_interval: change to 60000 [The default in Solaris 8 is 4 minutes used to keep a socket open in TIME_WAIT state, even after the response is provided to the client, set this down to 1 minute. The default in Solaris 9 is 1 minute]
Check the Oracle site for the latest recommended values

Database Usage and JDBC Drivers
a) If you get Out Of Memory errors occurring in JDBC calls, it is recommended that the JDBC driver be upgraded to the latest version.

b) Prepared Statement caching is a feature that allows prepared statements to be held in a cache on the database so they do not have to be re-parsed for each call. This functionality needs to be enabled per Connection Pool and can have a significant impact on the performance of the pools. But this needs to be validated with a focussed round of Performance testing.

It should be noted that for every Prepared Statement that is held in cache, a cursor is held open on the database for each connection in the pool. So if a cache size of 10 is used on the abcPool, and the pool has a size of 50 then 500 open cursors will be required. Repeatable load tests will highlight any gains achieved by enabling this caching.

Review JDBC Pool Sizes
Review connection pool versus the number of Execute threads. Usually keep Pool size close to Execute thread size. Note: This applies to versions earlier than Weblogic 9. See detailed explanation below.

If the JDBC pool size is quite less compared to the Thread size, there is the potential to negatively impact performance quite dramatically, since threads have to wait for connections to be returned to the pool.
Your most frequently used pool should have their minimum (initial) and maximum sizes increased to the number of Execute threads plus one. This will mean there is an available connection for every thread.
One comment on pool sizing it is beneficial where ever possible to have the initial and max connections set to the same size for a JDBCPool as this avoids expanding/shrinking work that can be costly, for both the establishment of new connections while expanding the pool and housekeeping work for the pool.

However it is also recommended to monitor the JDBC pools during peak hours and see how many connections are being used at maximum. If you are not hitting the MaxCapacity, it is useful to reduce the MaxCapacity to avoid unnecessary open cursors on the database.

Note: As of Weblogic 9 and higher, Execute Queues are now replaced by Work Managers. Work Managers can be used for JDBC pools by defining the max-threads-constraint to define how many threads to allocate for a particular Datasource.
It is possible to run Weblogic 9 and 10 with the Execute Queues as available earlier. This is not recommended since Work Managers are self-tuning and more advanced than Execute Threads.

WebLogic Server, Version 8.1, implemented Execute Queues to handle thread management in which you created thread-pools to determine how workload was handled. WebLogic Server still
provides Execute Queues for backward compatibility, primarily to facilitate application
migration. However, when developing new applications, you should use Work Managers to perform thread management more efficiently.
You can enable Execute Queues in the following ways:
􀁺 Using the command line option
-Dweblogic.Use81StyleExecuteQueues=true
􀁺 Setting the Use81StyleExecuteQueues property via the Kernel MBean in config.xml.
Enabling Execute Queues disables all Work Manager configuration and thread self tuning.
Execute Queues behave exactly as they did in WebLogic Server 8.1.

Database Persistent JMS Queues

Verify whether the database architecture is such that persistent JMS queues use the same database instance as a message store as your Weblogic portal uses for data.
As the volumes on these queues increase this could significantly degrade the performance of the portal by competing for valuable CPU cycles on the database server.

1. Move the message store for persistent queues to a separate database instance from that used by most of the JDBC pools belonging to the Weblogic server. This will prevent increases in message volumes from adversely affecting the performance of the database, which would also slow the portal applications down and vice-versa.

2. Implement paging with a file-store. This allows the amount of memory consumed by JMS queues to be restricted by paging message contents to disk and only holding headers in memory. Note that this does not provide failover protection in the way persistence does and performs better with a paging file-store than a paging JDBC store.

3. It is recommended that a review is also undertaken to determine exactly which queues are persisted and whether they truly need to be. The performance gains from switching to non-persisted queues are substantial, and guaranteed delivery is not always required.

Review Number of Execute Threads
A common mistake made by Support teams when seeing Stuck threads is to increase the number of execute threads in a single ‘default’ queue. At one time, I have worked in a project which ran the Weblogic server with 95 threads.

This figure is very high and results in a large amount of context-switching as load increases, which consumes valuable CPU cycles. Because threads consume memory, you can degrade performance by increasing the value of the Thread Count attribute unnecessarily.

Taking a Thread Dump when the server is not responding will show what the threads are doing, and help identify whether there is an application coding issue or deadlock occuring. Use Samurai to analyze these as earlier posted
It is recommended that regular monitoring of the number of idle threads and the length of queued requests for each execute queue is set up via MBeans. This allows the Support teams to plot a graph of utilization and validation of any changed values.
Note: As of Weblogic 9 and higher, Execute Queues are now replaced by Work Managers. Another good link is here

Use Dedicated Execute Thread Queue for Intensive applications

As the number of threads is small, and if a particular application is seen to utilize a majority of the execute, the following 2 approaches are suggested to resolve the issue:
1. Review the design of the offending application to determine whether it really needs so many threads.
2. Move the offending application to a dedicated execute queue, with enough threads allocated to this queue. This will prevent it from starving the main server of threads and allow the ‘default’ queue to remain with a lower number of Execute threads. This split of Execute Queue can be done at servlet or webapp level. Mail me if you need an example, we've done both successfully in WL 8.1 and 9.

Note: However, as of Weblogic 9 and higher, Execute Queues are now replaced by Work Managers. You can use a Work Manager to dedicate resources at Application, Web App, EJB level.

Review Java VM Settings
Tuning the JVM settings for the Total Heap, and Young/Old Generations is essential to regulate the frequency of Garbage Collection on the servers. The basic primer is on the Sun website, and a follow up of actual values and learnings is published here. The most essential ones are Xms, Xmx for the total Heap and NewSize, NewRatio for the Young Generation. Also set PermSize and MaxPermSize appropriately to avoid consuming high memory

Other Areas

1. To speed up server start times, do not delete the .wlnotdelete directories at startup - Unless you are deploying changed application jars and code.
Be aware you might occasionally see a problem shutting down the server which goes into an UNKNOWN state due to too many old temp files and wl_internal files.
You will get the dreaded error below which can only be resolved by killing the process and clearing out all temp files, .lck files etc within the domain. The files are under DOMAIN_HOME/servers//



weblogic.management.NoAccessRuntimeException: Access not allowed for subject: principals=[weblogic, Deployers], on ResourceType: ServerLifeCycleRuntime Action: execute, Target: shutdown
        at weblogic.rjvm.ResponseImpl.unmarshalReturn(ResponseImpl.java:195)
        at weblogic.rmi.internal.BasicRemoteRef.invoke(BasicRemoteRef.java:224)

2. Avoid the URLEncoder.encode method call as much as possible. This calls is not optimal in most JDKs < 1.5 and is often found as memory and CPU hotspot issue.

3. Check the network connection between WebLogic and the database. If Thread dumps show that threads are often in a suspended state (waiting for so long that they were suspended) while doing a socket read from the database.
The DBA wouldn't see this as a long-running SQL statement. This needs to be checked out at the network level.

4. Switch off all DEBUG options on Production on app server as well as web server and web server plugins.

5. Ensure log files are rotated so that they can be backed up and moved off the main log directory. Define a rotation policy based on file size or fixed time (like 24 hours)
However also note that: On certain platforms, if some application is tailing
the log at the time of rotation, the rotation fails. Stop the application tailing and reopen the tail after the rotation is complete.

6. Do not use "Emulate Two-Phase Commit for non-XA Driver" for DataSources.
It is not a good idea to use emulated XA. It can result in loss of data integrity and can cause heuristic exceptions. This option is only meant for use with databases for which there is no XA driver so that the datasources for these pools can still participate in XA transactions.
If an XA driver is available (and there is for Oracle), it should be used. If this option is selected because of problems with Oracle's Thin XA driver, try the newest version, or pick a different XA driver.

7. The <save-sessions-enabled> element in web.xml controls whether session data is cleaned up during redeploy or undeploy.
It affects memory and replicated sessions. The default is false. Setting the value to true means session data is saved and this is an overhead.

8. If firewalls are present between Weblogic server and the database or an external system connecting via JMS, this can cause transactional and session timeout issues. The session timeout on the firewall shoudl be configured to allow for normal transaction times.
In the case of JMS, transactional interoperation between the two servers can be compromised and hence it is beneficial to open the firewall between the two servers so that RMI/T3 connections can be made freely.

Any queries or clarifications, leave me a comment and I'll try to get back.

Slow Weblogic response - JSP and Servlet Reload

The top Weblogic tuning tip I've used over the last few years is the one below. This article deals with Weblogic 8 and 9.

You know the symptoms:

JEE Application works okay on Unit Integration and Test servers.
But once you're into Performance and Stress Testing - it's totally choked up - pages are returned so slowly - Project Manager is asking for a code review to find the blockers - etc.

The first thing to do on a slow JEE application is to take a thread dump, see more on that here

Now, the thing to look for in the Thread Dump, is multiple threads doing this:


weblogic.servlet.jsp.JspStub.checkForReload(JspStub.java:144)

or this:


waiting for monitor entry [c4f7f000..c4f7fc28]
     at weblogic.servlet.internal.ServletStubImpl.checkForReload(ServletStubImpl.java:771)

This is not a stuck thread within your code – but this is a weblogic.xml setting for checking each time if your servlet class or JSP has been changed.

As documented here:

http://e-docs.bea.com/wls/docs92/webapp/weblogic_xml.html#wp1038491

Sets the interval, in seconds, at which WebLogic Server checks to see if JSP files have changed and need recompiling. Dependencies are also checked and recursively reloaded if changed.

If set to 0, pages are checked on every request. If set to -1, page checking and recompiling is disabled.

Most users set this to -1 to disable in Production unless they're altering jsps on the fly (which is not normal practice on a production system).

The correct syntax for this is:

<weblogic-web-app>
<jsp-descriptor>
<jsp-param>
<param-name>pageCheckSeconds</param-name>
<param-value>-1</param-value>
</jsp-param>
</jsp-descriptor>

The similar value for servlets is set using the servlet-reload-check-secs within the tags below.

<container-descriptor>
<servlet-reload-check-secs>-1</servlet-reload-check-secs>
</container-descriptor>
</weblogic-web-app>

On non-production environments, you can set this to a reasonable value such as 600 (in seconds, which is 10 minutes) to allow for changed JSPs you drop in/FTP to the deployed environment, and want those to reflect sooner.

The other thing we must do on a production system is to precompile the JSPs as part of the build, and before deploying to Live.

Else, even the first time JSP compilation is quite slow and this must be avoided on the Production system.

For the same solution on Weblogic 10, look at an updated article.

Tuesday, 13 January 2009

App Server High CPU or Low Threads - Support Team Actions

Occasionally Weblogic/JEE app servers run into High CPU (>70% or as per your app-defined thresholds) , then below is a list of actions and investigations that the support teams need to do.

Check prstat –L –p Capture into a file if required.

Output kill -3 into a file i.e. take thread dump. On Weblogic 10,you can do this from the Admin console as well

Output pstack into a file

Repeat this 4 times at 5 second intervals. At the end, you will have 5 sets of prstat, TD and pstack files.

Run the TDs through Samurai or TDA to identify deadlocks. These tools show a graphical view of what each thread is doing at each point of time (when the thread dump was taken).
The image below from Samurai shows RED blocks which are Stuck, and Green with < in them. Both of these need to be analyzed by clicking on the link in blue and seeing what line the thread is at. The Green ones with < might not be stuck, but they were at the same line as in previous thread dump and that indicates a potential problem.
Then analyze the prstat and pstack for any particular thread issues.

On the prstat, look for the LWP number on the right hand corner for high cpu and any long time in action.

Convert the lwp number to hexa decimal value as follows:
echo nawk '{printf("%d=0x%x\n",$1,$1)}'

In the thread dump, search for "nid=" these threads could be the issue for the application.

In the pstack outputs, check for the LWP numbers and confirm that the threads show the java trace.

Below is an example - Note this is indicative and not a proper instance of a slow-performing server.

1. prstat -a shows the java process consuming 42.6% and it has 180 LWP (Light-weight processes) running.

PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP

20677 wlsuser 1116M 526M sleep 1 0 2:20:54 42.6% java/180
11922 wlsuser 232M 110M sleep 1 0 3:41:59 0.1% java/157
29861 noaccess 198M 72M sleep 8 0 32:25:49 0.0% java/55
18133 root 630M 55M sleep 27 0 12:25:57 0.0% java/60
12107 wlsuser 927M 288M sleep 1 0 3:51:28 0.0% java/187
29654 wlsuser 1270M 206M sleep 1 0 0:54:26 0.0% java/158
2633 wlsuser 899M 217M sleep 1 0 7:06:10 0.0% java/382
7431 wlsuser 880M 253M sleep 5 0 1:44:24 0.0% java/186
1058 wlsuser 881M 241M sleep 1 0 3:11:35 0.0% java/182
9864 wlsuser 4336K 3952K cpu21 59 0 0:00:00 0.0% prstat/1
13789 wlsuser 1090M 208M sleep 17 0 0:30:44 0.0% java/210
13147 wlsuser 635M 168M sleep 15 0 7:17:05 0.0% java/180
7724 root 5728K 4112K sleep 60 0 33:54:37 0.0% vasd/1
18131 root 2272K 1088K sleep 59 0 8:18:11 0.0% wrapper/1
15397 root 2272K 1296K sleep 59 0 3:33:59 0.0% procmon/1

2. prstat -Lp 20677 gives the breakdown of these 180 LWPs
The top 2 can be seen to be consuming 10% CPU each and are running for longer than 15 minutes.

PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID

20677 wlsuser  1116M  526M sleep   58    0   0:16:03 10.3% java/120

20677 wlsuser  1116M  526M sleep   24    0   0:15:15 10.1% java/123

20677 wlsuser  1116M  526M sleep    3    0   0:03:36 4.0% java/23

20677 wlsuser  1116M  526M sleep    5    0   0:01:42 2.0% java/21

20677 wlsuser  1116M  526M sleep    9    0   0:03:38 2.0% java/20

20677 wlsuser  1116M  526M sleep    9    0   0:00:38 1.7% java/19

20677 wlsuser  1116M  526M sleep    4    0   0:00:42 0.0% java/18

20677 wlsuser  1116M  526M sleep    4    0   0:00:36 0.0% java/17

20677 wlsuser  1116M  526M sleep    4    0   0:00:34 0.0% java/16

20677 wlsuser  1116M  526M sleep    1    0   0:00:42 0.0% java/15

20677 wlsuser  1116M  526M sleep    4    0   0:00:42 0.0% java/10

20677 wlsuser  1116M  526M sleep    6    0   0:07:43 0.0% java/8

20677 wlsuser  1116M  526M sleep    3    0   0:00:39 0.0% java/7

20677 wlsuser  1116M  526M sleep    1    0   0:00:40 0.0% java/4

20677 wlsuser  1116M  526M sleep    8    0   0:00:39 0.0% java/2

20677 wlsuser  1116M  526M sleep   29    0   0:04:55 0.0% java/121

20677 wlsuser  1116M  526M sleep    1    0   0:11:52 0.0% java/126

20677 wlsuser  1116M  526M sleep    1    0   0:00:05 0.0% java/119

20677 wlsuser  1116M  526M sleep   43    0   0:00:27 0.0% java/122

20677 wlsuser  1116M  526M sleep   28    0   0:00:03 0.0% java/118

20677 wlsuser  1116M  526M sleep    1    0   0:12:43 0.0% java/125

3. Run the cmd to convert the LWP decimal to Hex
echo 120 nawk '{printf("%d=0x%x\n",$1,$1)}' 120=0x78
echo 123 nawk '{printf("%d=0x%x\n",$1,$1)}' 123=0x7b
Now we need to investigate what these 2 threads are doing.

4. Check the Thread dump for the nid=0x78. It shows

"ExecuteThread: '83' for queue: 'wlr3Queue'" daemon prio=5 tid=0x00a886f0 nid=0x78 waiting for monitor entry [c3afe000..c3affc28]

at org.apache.log4j.Category.callAppenders(Category.java:204)

- waiting to lock <0xdc8e9080> (a org.apache.log4j.spi.RootLogger)

at org.apache.log4j.Category.forcedLog(Category.java:391)

at org.apache.log4j.Category.debug(Category.java:260)

...

5. Running the above TD through Samurai also shows that Execute Thread 83, has been stuck for a while. It also shows that the same object <0xdc8e9080> is being locked by another Thread which happens to be the nid=0x7b

"ExecuteThread: '86' for queue: 'wlr3Queue'" daemon prio=5 tid=0x00a85730 nid=0x7b runnable [c3efd000..c3effc28]

at java.io.FileOutputStream.writeBytes(Native Method)

at java.io.FileOutputStream.write(FileOutputStream.java:260)

at java.io.BufferedOutputStream.write(BufferedOutputStream.java:106)

- locked <0xd8b55a70> (a java.io.BufferedOutputStream)

at java.io.PrintStream.write(PrintStream.java:258)

- locked <0xd88512c8> (a java.io.PrintStream)

at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)

at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(StreamEncoder.java:404)

at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:408)

at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)

- locked <0xdce03ae8> (a java.io.OutputStreamWriter)

at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)

at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:58)

at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:316)

at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)

at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)

- locked <0xdc8eeb60> (a org.apache.log4j.ConsoleAppender)

at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)

at org.apache.log4j.Category.callAppenders(Category.java:206)

- locked <0xdc8e9080> (a org.apache.log4j.spi.RootLogger)

6. Thus there is an issue around the log4j writing into log files, which needs further investigation.

In this case, one issue identified was that the log4j was in DEBUG mode which should not be the case in a Production platform. Thus there was unnecessary logging taking place.

7. Finally, a look at the pstack outputs for the two LWP 120 and 123 confirms what the LWPs were doing.

----------------- lwp# 120 / thread# 120 --------------------
ff341758 lwp_mutex_timedlock (f66e68, 0)
fedbbb60 __1cLOptoRuntimebAcomplete_monitor_locking_C6FpnHoopDesc_pnJBasicLock_pnKJavaThread__v_ (da5a6848, c3aff0b0, a886f0, da5b6980, d6d15e88, 0) + 84
f9c342dc ???????? (da5a6848, c3aff0b0, da5b6980, d6d15e70, 0, c3aff0f8)
fa643500 ???????? (da5c9fc0, 0, da5c9fc0, da5b6980, db3b3af8, 0)
fa5c0e2c ???????? (da5c9fc0, f1c8e7a0, da5b6980, d6d15e70, 0, c3aff0f8)
fa762524 ???????? (da5b65f8, d6d15e70, 0, f9c15ea0, 8, c3aff178)
f9c4eec4 ???????? (da5b65f8, f52b9bf0, f0ab1c94, f9c160d0, 8, c3aff0f8)
f9c05a8c ???????? (c3aff29c, b6, 0, f9c15ea0, 8, c3aff178)
f9c05804 ???????? (c3aff33c, b6, 0, f9c15e50, c, c3aff210)
f9c05804 ???????? (c3aff3cc, b6, 0, f9c15e50, c, c3aff2d8)
f9c05804 ???????? (c3aff464, b6, 0, f9c15e50, c, c3aff358)
f9c05804 ???????? (c3aff4f4, f1a8a5c8, 0, f9c15e50, c, c3aff400)
f9c05a8c ???????? (c3aff58c, f084f340, 0, f9c163d0, c, c3aff490)
f9c059d8 ???????? (c3aff674, b6, 0, f9c163d0, 4, c3aff518)
f9c05804 ???????? (c3aff6fc, b6, 0, f9c15e50, 10, c3aff608)
f9c05804 ???????? (c3aff760, d883f8c8, 0, f9c15e50, c, c3aff690)
f9c4f278 ???????? (d6d15bc0, d883f8c8, d884f5a0, f0ac6f08, 4, c3aff720)
fa0a23a4 ???????? (d884f5a0, d883f8c8, d6d15bc0, f9c15e50, c, c3aff7e8)
f9c30f48 ???????? (d884f5a0, b6, 82bb, f9c16250, 82b9, 0)
f9c05750 ???????? (c3aff8dc, b8, f373459c, f9c15e50, c, c3aff7e8)
f9c05750 ???????? (c3aff98c, b6, 0, f9c16250, c, c3aff868)
f9c05804 ???????? (c3aff9fc, d68129b0, 0, f9c15e50, c, c3aff920)
f9c4947c ???????? (d68129b0, d8808160, f81fc000, f0818618, 8, c3aff9b8)
fa114558 ???????? (d8808160, db3d24f0, f3745c98, f9c15e50, 8, c3affab0)
fa14b064 ???????? (d8808160, f096a2a0, f81fc000, f0818618, d8808160, 9)
f9cd2c38 ???????? (c3affb9c, c3affcf0, f3745c98, f9c15e50, 8, c3affab0)
f9c0010c ???????? (c3affc28, c3affe90, a, f112bc70, 4, c3affb40)
fed5bcf8 __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (c3affe88, c3affcf0, c3affda8, a886f0, a886f0, c3affd00) + 27c
fee4a3e4 __1cJJavaCallsMcall_virtual6FpnJJavaValue_nLKlassHandle_nMsymbolHandle_4pnRJavaCallArguments_pnGThread__v_ (ff182000, a88c98, c3affd9c, c3affd98, c3affda8, a886f0) + 164
fee5d5a8 __1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_ (c3affe88, c3affe84, c3affe7c, c3affe74, c3affe6c, a886f0) + 6c
fee6e8f8 __1cMthread_entry6FpnKJavaThread_pnGThread__v_ (a886f0, a886f0, c69ff8, a88c98, 31a08c, fee67ed8) + 128
fee67f00 __1cKJavaThreadDrun6M_v_ (a886f0, 78, 40, 0, 40, 0) + 284
fee643e0 _start (a886f0, c3b00000, 0, 0, fee642ac, 1) + 134
ff3404f4 _lwp_start (0, 0, 0, 0, 0, 0)

----------------- lwp# 123 / thread# 123 --------------------
ff341a08 _write (1, c3afab10, 6d, 0, 0, 22) + c
fed54288 JVM_Write (1, c3afab10, 6d, 6d, c3afab10, daa519f8) + 68
fe85e348 ???????? (c3afab10, c3afcbd0, c3afcbcc, 0, 6d, 22)
fe85e1c8 Java_java_io_FileOutputStream_writeBytes (a88784, c3afcbd0, c3afcbcc, 0, 6d, daa519f8) + 34
fa1e1608 ???????? (d8b55ad0, d8a5f6e8, 0, 6d, 0, c3afce50)
fa1e3dc0 ???????? (d8b55a70, daa726a8, 0, 6d, d351afb0, daa519f8)
f9dfae70 ???????? (d88512c8, daa726a8, 0, 6d, 0, c3afce50)
fa033244 ???????? (daa63ed8, d351afb0, da845a80, 0, d351afb0, daa519f8)
fa5fab54 ???????? (daa63ed8, d353dca0, da735c30, d351af98, 0, c3afce50)
fa599d40 ???????? (da845a80, d351afb0, da845a80, 0, d351afb0, daa519f8)
fa5977b8 ???????? (da845a80, d351afb0, da735c30, d351af98, 0, c3afce50)
fa4945b8 ???????? (2, 0, da845a80, 0, d351afb0, daa519f8)
fa5b98ec ???????? (db112a08, f1c8e7a0, da735c30, d351af98, 0, c3afce50)
fa1a906c ???????? (db146dd8, f42a9fd0, d351af98, 1, f4262f28, f81fc000)
f9c45dc8 ???????? (db146dd8, b7, 0, f9c152a0, 8, c3afcee0)
f9c05804 ???????? (c3afd05c, b6, 0, f9c16118, c, c3afcf60)
f9c05804 ???????? (c3afd0e4, b6, 0, f9c15e98, 10, c3afcff0)
f9c05804 ???????? (c3afd174, b6, 0, f9c15e98, 10, c3afd078)
f9c05804 ???????? (c3afd1dc, d34e93e0, 0, f9c15e98, 10, c3afd108)
f9c63804 ???????? (d34756e0, d34e83d8, 0, f9c, 10, c3afd190)
fa4bd36c ???????? (f9c, d34e93e0, f96, f95, d, c3afd268)
f9c63e4c ???????? (d34e83a8, b6, c3afd338, 1, 4, 0)
f9c05804 ???????? (c3afd338, f2ee27c0, 0, f9c15e98, 10, c3afd268)
f9c46bb8 ???????? (d34e9388, f0866a58, f2ee27c0, 1, 4, c3afd2f8)
fa6c56d8 ???????? (d34e9388, f2ee27c0, 2f, f0866a58, 0, c3afd478)
fa6c6c70 ???????? (d34e9388, f0866a58, f2ee27c0, 1, 10, c3afd540)
fa6ccbb8 ???????? (d34e9bc0, 0, 1, f0866a58, 8, c3afd478)
fa67bf28 ???????? (d34e9bc0, d3502ac0, d34e82d8, 1, 10, c3afd540)
fa67c168 ???????? (d34e82d8, e3c6b060, f0846b40, f0866a58, 8, c3afd478)
fa64773c ???????? (d3502ab0, d3502ac0, d34e82d8, 1, 10, c3afd540)
f9dc820c ???????? (d3502ab0, b6, c3afd628, f9c15e98, 10, 0)
f9c05804 ???????? (c3afd634, b6, 0, f9c15e98, 10, c3afd540)
f9c05804 ???????? (c3afd6b4, b6, 0, f9c15e98, 10, c3afd5c8)
f9c05804 ???????? (c3afd73c, b8, 0, f9c15e50, c, c3afd650)
f9c05804 ???????? (c3afd7c4, b6, 0, f9c16298, c, c3afd6d8)
f9c05804 ???????? (c3afd844, b6, 0, f9c15e98, c, c3afd760)
f9c05804 ???????? (c3afd8cc, b7, 0, f9c15e50, 8, c3afd7e0)
f9c05804 ???????? (c3afd95c, b6, 0, f9c160d0, c, c3afd868)
f9c05804 ???????? (c3afd9e4, b6, 0, f9c15e98, c, c3afd8f8)
f9c05804 ???????? (c3afda74, f4405ee8, 0, f9c15e50, 8, c3afd978)
f9c05a8c ???????? (c3afdb14, f45c37a0, 0, f9c163d0, c, c3afda08)
f9c05a8c ???????? (c3afdba4, f47f5e30, 0, f9c16420, 14, c3afdaa8)
f9c05a8c ???????? (c3afdc34, b6, 0, f9c163d0, 8, c3afdb38)
f9c056e4 ???????? (c3afdcd4, b6, 0, f9c15e50, c, c3afdbd0)
f9c05804 ???????? (c3afdd64, f4783158, 0, f9c15e50, c, c3afdc70)
f9c059fc ???????? (c3afddfc, b7, 0, f9c163d0, c, c3afdcf0)
f9c05804 ???????? (c3afdea4, b6, 0, f9c160d0, 8, c3afdd90)
f9c05804 ???????? (c3afdf3c, b6, 0, f9c15e50, 8, c3afde30)
f9c05774 ???????? (c3afdfc4, b6, 0, f9c15e50, 10, c3afded0)
f9c05774 ???????? (c3afe064, b6, 0, f9c15e50, c, c3afdf58)
f9c05804 ???????? (c3afe10c, b7, 0, f9c15e50, 14, c3afdfe8)
f9c05804 ???????? (c3afe19c, b6, 0, f9c160d0, 10, c3afe0a0)
f9c05804 ???????? (c3afe22c, b7, 0, f9c16250, 10, c3afe130)
f9c05804 ???????? (c3afe2cc, f4591318, 0, f9c160d0, 10, c3afe1c0)
f9c059d8 ???????? (c3afe374, b8, 0, f9c163d0, 8, c3afe260)
f9c05804 ???????? (c3afe40c, b6, 0, f9c16250, 4, c3afe308)
f9c05750 ???????? (c3afe4a4, b6, 0, f9c15e50, 8, c3afe398)
f9c05804 ???????? (c3afe53c, b6, 0, f9c15e98, 8, c3afe440)
f9c05804 ???????? (c3afe5e4, b6, 0, f9c15e98, 8, c3afe4c0)
f9c05750 ???????? (c3afe684, f43eee20, 0, f9c15e50, 8, c3afe580)
f9c059d8 ???????? (c3afe6e8, daa8e080, 0, f9c163d0, 8, c3afe618)
f9c30d40 ???????? (ddb9b518, daa8e080, d1afafe8, f9c15e50, c, c3afe6a0)
fa026230 ???????? (db146b88, daa8e080, d1afafe8, f9c16250, c, 0)
fa0182bc ???????? (daf02128, daa8e080, d1afafe8, f9c15e50, c, c3afe7b0)
f9c30f48 ???????? (daf02128, b6, c3afe89c, f9c16250, c, 0)
f9c05750 ???????? (c3afe8a4, b8, 0, f9c15e50, c, c3afe7b0)
f9c05750 ???????? (c3afe92c, b6, 0, f9c16250, c, c3afe840)
f9c05750 ???????? (c3afe990, d1afaff8, 0, f9c15e50, 4, c3afe8c0)
f9c4f278 ???????? (d1afaff8, daa8e5c0, d1b0b920, d1afb0f0, 4, c3afe950)
fa7400a0 ???????? (61a80, d1afaff8, 1, daa00d48, 2710, 34)
f9c30b04 ???????? (daf9f1a8, f441b8a8, c3afeb54, f9c163d0, f441b8a8, f441bd18)
f9c059d8 ???????? (c3afeb54, f45f12b0, 0, f9c163d0, 8, c3afea48)
f9c059d8 ???????? (c3afec1c, f0d5e7d8, 0, f9c163d0, 4, c3afead8)
f9c059d8 ???????? (c3afecac, f43eee20, 0, f9c16420, 10, c3afeb98)
f9c059d8 ???????? (c3afed54, f43eef20, 0, f9c163d0, 8, c3afec40)
f9c059d8 ???????? (c3afedf4, b6, 0, f9c163d0, 8, c3afece0)
f9c05750 ???????? (c3afee7c, b6, 0, f9c15e50, 8, c3afed88)
f9c05750 ???????? (c3afef04, b6, 0, f9c15e50, c, c3afee18)
f9c05750 ???????? (c3afef8c, b6, 0, f9c15e50, 14, c3afee98)
f9c05750 ???????? (c3aff01c, b6, 0, f9c15e50, 14, c3afef20)
f9c05750 ???????? (c3aff0ac, f43e7f08, 0, f9c15e50, c, c3afefb0)
f9c059d8 ???????? (c3aff13c, f429d368, 0, f9c163d0, c, c3aff048)
f9c059d8 ???????? (c3aff1e4, b6, 0, f9c15e50, 10, c3aff0c8)
f9c05804 ???????? (c3aff29c, b6, 0, f9c15e50, c, c3aff178)
f9c05804 ???????? (c3aff33c, b6, 0, f9c15e50, c, c3aff210)
f9c05804 ???????? (c3aff3cc, b6, 0, f9c15e50, c, c3aff2d8)
f9c05804 ???????? (c3aff464, b6, 0, f9c15e50, c, c3aff358)
f9c05804 ???????? (c3aff4f4, f1a8a5c8, 0, f9c15e50, c, c3aff400)
f9c05a8c ???????? (c3aff58c, f084f340, 0, f9c163d0, c, c3aff490)
f9c059d8 ???????? (c3aff674, b6, 0, f9c163d0, 4, c3aff518)
f9c05804 ???????? (c3aff6fc, b6, 0, f9c15e50, 10, c3aff608)
f9c05804 ???????? (c3aff760, d883f8c8, 0, f9c15e50, c, c3aff690)
f9c4f278 ???????? (d19577a8, d883f8c8, d884f5a0, f0ac6f08, 4, c3aff720)
fa0a23a4 ???????? (d884f5a0, d883f8c8, d19577a8, f9c15e50, c, c3aff7e8)
f9c30f48 ???????? (d884f5a0, b6, 786b, f9c16250, 7869, 0)
f9c05750 ???????? (c3aff8dc, b8, f375bd34, f9c15e50, c, c3aff7e8)
f9c05750 ???????? (c3aff98c, b6, 0, f9c16250, c, c3aff868)
f9c05804 ???????? (c3aff9fc, d1a10f80, 0, f9c15e50, c, c3aff920)
f9c4947c ???????? (d1a10f80, d8808160, f81fc000, f0818618, 8, c3aff9b8)
fa114558 ???????? (d8808160, d1a10f80, f3745c98, f9c15e50, 8, c3affab0)
fa14b064 ???????? (d8808160, f096a2a0, f81fc000, f0818618, d8808160, 9)
f9cd2c38 ???????? (c3affb9c, c3affcf0, f3745c98, f9c15e50, 8, c3affab0)
f9c0010c ???????? (c3affc28, c3affe90, a, f112bc70, 4, c3affb40)
fed5bcf8 __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (c3affe88, c3affcf0, c3affda8, a886f0, a886f0, c3affd00) + 27c
fee4a3e4 __1cJJavaCallsMcall_virtual6FpnJJavaValue_nLKlassHandle_nMsymbolHandle_4pnRJavaCallArguments_pnGThread__v_ (ff182000, a88c98, c3affd9c, c3affd98, c3affda8, a886f0) + 164
fee5d5a8 __1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_ (c3affe88, c3affe84, c3affe7c, c3affe74, c3affe6c, a886f0) + 6c
fee6e8f8 __1cMthread_entry6FpnKJavaThread_pnGThread__v_ (a886f0, a886f0, c69ff8, a88c98, 31a08c, fee67ed8) + 128
fee67f00 __1cKJavaThreadDrun6M_v_ (a886f0, 78, 40, 0, 40, 0) + 284
fee643e0 _start (a886f0, c3b00000, 0, 0, fee642ac, 1) + 134
ff3404f4 _lwp_start (0, 0, 0, 0, 0, 0)

Another thing to check is for any other processes that are running at the same time on the server.

We faced a situation in the past, when backup processes ran at a fixed time on a particular directory, and caused CPU spikes. This was due to there being a huge number of old files in that directory.

We ran a du on the /data filesystem and as soon as it hits /data/software/xyz/log it stops dead and the CPU goes up to 0% free. This filesystem is used for deployments as the name implies. The reason was when the backup hits this directory, it will attempt to do an ls or something similar, right? As it encounters the 188994 logs that were there, it completely clogs up the CPU trying to process them. This is what is causing the CPU spikes. Whenever anyone or anything accesses this filesystem, everything stops.

Tuesday, 25 November 2008

Simultaneous Solaris T2000 core dumps on local zones

Issue:

Our applications run on Sun T2000 servers which are configured with 4 or 5 Local zones on each Global Zone.

17th November 05:00 there was an issue on one Global Zone which caused 5 separate and simultaneous core dumps to get created on the server. One core in each Local zone.

Impact:
The 5 processes which crashed on each Local zone were

LZ 15 – core created by the wlshell java process. Low impact
LZ 27 – one core created by wlshell java processes. Low impact.
LZ 25 – core created by HM server java process. Was manually restarted today. Medium impact
LZ 26 – core created by PA server java process. This was automatically restarted by the HM. High impact.
LZ 37 – core created by HM server java process. Was manually restarted today. Medium impact

All 5 crashes above were in Library=/platform/sun4v/lib/libc_psr.so.1

Each core exits with a hs_err_.log file - which contains the foll details

An unexpected exception has been detected in native code outside the VM.
Unexpected Signal : 10 occurred at PC=0xFF2709F0
Function=_memset+0x70
Library=/platform/sun4v/lib/libc_psr.so.1

Current Java thread:
at java.lang.Thread.start(Native Method)
- locked <0xf1f3a378> (a java.util.logging.LogManager$Cleaner)
at java.lang.Shutdown.runHooks(Shutdown.java:126)
at java.lang.Shutdown.sequence(Shutdown.java:165)
at java.lang.Shutdown.exit(Shutdown.java:210)
- locked <0xf5998498> (a java.lang.Class)
at java.lang.Runtime.exit(Runtime.java:90)
at java.lang.System.exit(System.java:715)
at ConsoleScriptRunner.main(ConsoleScriptRunner.java:64)

Investigations

1. Seems to be running out of swap at the time of the crash. Memory: 31.9G real, 4.0G free, 37.7G swap in use, 2.4G swap free
Update: added 50 G swap on 19/11 - Memory: 31.9G real, 3.4G free, 38.0G swap in use, 52.8G swap free

2. Application Support is identifying what the wlshell process does – it seems to be a Weblogic server monitoring script running every 10 minutes under cron.

3. Library=/platform/sun4v/lib/libc_psr.so.1 Are there any known issues with this library on solaris 10 or T2000? Support to query SUN

The libc_psr libraries implement platform-specific, optimized versions of block copy and move routines from libc, such as memcpy(). On UltraSPARC machines, these routines are coded in assembler, and use block load and store ASI's, prefetch, and other tricks for better performance.

Root Cause

Sun identified that the issue was related to known bugs in JDK 142_03 and running an old version of libc_psr.so

The ( a) type error as seen in hs_err_pid6216_18111030.log the error was a know java bug with the j2sdk1.4.2_03 and the bug is identified as :

Bug ID: 4927116 fixed in 1.4.2_04 Synopsis Regression: 1.4.2 JVM core dumps in ClassLoader.defineClass0()

Actions

1. Immediate action is to upgrade to JDK 142_18

2. Also identified that all 5 core dumps were created by sh scripts running under crontab. Further analysis showed that there were cron jobs running every 15 mins on each of the zones - and this puts a load on system resources especially around memory consumption in the JVM and sufficient unhanded exceptions between the java virtual machine and the operating system could cause the then dead locked JVM to terminate .

Thus the 2nd resolution was to offset the start time of each cron job - to try and ensure no two jobs run on the same Global Zone at the same time.

Old Timings:

0,15,30,45 * * * * /export/home/Handlers.sh > /tmp/Handcheck
0,15,30,45 * * * * /export/home/Average_response_time.sh > /tmp/Avgcheck
0,15,30,45 * * * * /export/home/Msg_In_and_Out_count.sh > /tmp/Mescheck
0 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 * * * /log/zipper.sh

New Timings:

1,16,31,46 * * * * /export/home/Handlers.sh > /tmp/Handcheck
2,17,32,47 * * * * /export/home/Average_response_time.sh > /tmp/Avgcheck
3,18,33,48 * * * * /export/home/Msg_In_and_Out_count.sh > /tmp/Mescheck
4 0,2,4,6,8,10,12,14,16,18,20,22 * * * /log/zipper.sh

Also knocking off unnecessary cron jobs from the box to reduce the load.

You Never Know