
Thursday, 6 August 2009

Weblogic JMS Performance Tuning Tips

Here are a couple of real-life tips on tuning JMS performance.

Problem:
The application creates and sends JMS messages on an outgoing queue for consumption by another application.

It was observed that when the consuming application was offline for a period of time, the number of messages that could be retained on the queue before the JVM heap filled up was quite low. This was tested to be roughly 7000 messages, after which OutOfMemory exceptions began to occur. Given that the consuming application could realistically be offline for some time (hence the use of an asynchronous queue), we needed to increase the number of messages that could be stored on the queue.

The exception we get is shown below:


Start server side stack trace:
java.lang.OutOfMemoryError
<<no stack trace available>>
End server side stack trace
at weblogic.rmi.internal.BasicOutboundRequest.sendReceive(BasicOutboundRequest.java:109)
at weblogic.rmi.internal.BasicRemoteRef.invoke(BasicRemoteRef.java:127)
at weblogic.jms.dispatcher.DispatcherImpl_WLStub.dispatchSyncFuture(Unknown Source)
at weblogic.jms.dispatcher.DispatcherWrapperState.dispatchSync(DispatcherWrapperState.java:286)
at weblogic.jms.client.JMSSession.createProducer(JMSSession.java:1484)
at weblogic.jms.client.JMSSession.createSender(JMSSession.java:1335)
...


The GC logs also show frequent Full GCs before the server goes out of memory.


Solution Steps:


1. Enabling JMS Paging

Paging had not been enabled for the queue. Even though the queue was persistent, this meant that every message was held in the JVM heap in its entirety. Enabling message paging for the queue means that only the headers of paged messages are kept in memory, significantly reducing the amount of heap utilized.

As the messages were being persisted via a JDBCStore to a database, this functioned as a paging store as well; however, a FileStore must still be specified as the paging store for the JMS Server, or the JMS Server will not deploy at WLS server start time. This is because any non-persistent destinations must also be catered for when paging is enabled. If non-persistent messages are not paged, the size of this FileStore will be negligible. Paged messages still occupy some space on the heap, as the message headers are kept in memory.

On enabling paging for the specific queue, the test could cater for roughly 15000 messages before OutOfMemory exceptions occurred. The message threshold at which paging begins was deliberately set low, to 100. Recovering messages from the paging store does incur a performance cost, so in Production this threshold was set to a more reasonable number based on the peak number of messages expected in the queue under normal conditions.
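For reference, a minimal sketch of how paging might be switched on in a WebLogic 8.1-style config.xml is shown below. The attribute and element names are as per WLS 8.1; MyJMSServer, MyJDBCStore and MyPagingStore are placeholder names, and the thresholds should be set to suit your own queue depths.


<JMSFileStore Name="MyPagingStore" Directory="/wls_domains/mydomain/paging"/>

<JMSServer Name="MyJMSServer" Targets="server1"
    Store="MyJDBCStore"
    PagingStore="MyPagingStore"
    MessagesPagingEnabled="true"
    BytesPagingEnabled="true"
    MessagesThresholdHigh="100"
    MessagesThresholdLow="50"/>


The thresholds can also be set per destination (JMSQueue) if only a specific queue needs to page.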

Despite the gain in the number of messages that could be handled, the heap utilisation graphs were very similar to those before paging was enabled: no minor GCs, only full GCs at fairly frequent intervals.


2. JVM Settings and Garbage Collection Tuning

The untuned JVM heap size was 512Mb. This value could be increased, but test results after tuning the JVM settings indicated that 512Mb was probably more than adequate.

Examining the current JVM settings uncovered some settings that needed to be changed.

The most significant issue with the JVM settings was the NewSize value, which was set very high at 384Mb out of the total heap of 512Mb.
A reasonable New Generation area would normally be 20-25% of the total heap size, and setting it larger than the Tenured Generation area is guaranteed to cause unhealthy GC behaviour. In addition, it is good practice to use NewRatio rather than NewSize, to avoid fixing an absolute size. A NewRatio of 3 (1:3, i.e. 25% of heap) or 4 is considered the most appropriate for WebLogic Server applications. NewSize was therefore dropped and NewRatio was set to 4 (20% of the total heap).

The SurvivorRatio value was already set to a reasonable value of 3.
TargetSurvivorRatio, however, was unset, meaning that the default of 50 applied. 80 would probably be a better setting, allowing the survivor spaces to fill to 80% rather than 50% before objects are promoted. The performance improvement from this should be noticeable in the frequency of minor GCs, though not dramatic.

The PermSize values were high, with PermSize and MaxPermSize both set to 384Mb. These were reduced to 64Mb and 128Mb respectively, which should be more than adequate. Though these changes are unlikely to improve performance, having PermSize set too high needlessly consumes memory.
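Putting the above together, the tuned JVM arguments looked roughly like this (an illustrative summary of the settings discussed, not the exact production command line):


-Xms512m -Xmx512m -XX:NewRatio=4 -XX:SurvivorRatio=3 -XX:TargetSurvivorRatio=80 -XX:PermSize=64m -XX:MaxPermSize=128m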

The effect of setting a good NewRatio value was dramatic. Many minor GCs became the rule, with infrequent full GCs. 60,000 messages were added to the queue before the heap approached full. We ran an overnight test, and somewhere between 60,000 and 70,000 messages the OutOfMemory exceptions occurred.

Scaling this up to the Production environment, which has a 1Gb heap, shows that the server could easily cater for the expected load.

Monday, 15 June 2009

Slow Weblogic Part 6 - JVM Heap Analysis using GCViewer

Basics


In an earlier article, I had listed the review of JVM memory parameters as one of the important checks for tuning the JEE server platform.

The basic primer for JDK 1.4 is at http://java.sun.com/docs/hotspot/gc1.4.2/


The key points you need to know are:

Total JVM Heap = Young + Tenured (also called Old)


Young = Eden + From (SS1) + To (SS2)




In the diagram below [taken from the Sun website], "From" and "To" are the names of the two Survivor Spaces (SS) within the "Young".

Perm Space (and code cache): stores the JVM's own data, such as loaded class metadata. This is outside the heap you assign using Xms and Xmx. A good explanation of this is available here

The JVM Heap is at default initial 2Mb and max 64Mb (for JDK 1.4 on Solaris).
Default Perm Size is 16MB (for JDK 1.4 on Solaris)
The defaults change for each JDK and are different on each OS - so look up the values on the respective websites.






The ratios are as shown below






Now the object life cycle and garbage collection occurs like this:

1. Objects, when created, are always first allocated to Eden.
2. When Eden fills up, a fast but not comprehensive GC (minor collection) is run over the young generation only.
3. All surviving objects are moved from Eden into one Survivor Space.
4. In subsequent minor collections, new objects move from Eden into the other Survivor Space, and everything in the first Survivor Space (survivors of the previous minor collection) is also moved into that other Survivor Space. Thus one survivor space should always be empty.
5. When objects in the Survivor Space are old enough (or the survivor space fills up), they are moved to Tenured. By default, long-lived objects may be copied up to 31 times between the Survivor Spaces before they are finally promoted to the Old generation.
6. When Tenured fills up, a Full GC is run that is comprehensive: the entire heap is analyzed, all unreachable objects are discarded and memory is reclaimed.

Note: the above lifecycle changes slightly when advanced options such as ConcurrentMarkSweep etc are enabled.

Look Closer

What do these values mean ?

A full list of options available at http://java.sun.com/docs/hotspot/VMOptions.html and http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp


The absolute basic ones are listed in the table below. Note: this is for JDK 1.4; some of these have changed in JDK 1.6.











-Xms1536m -Xmx1536m
These represent the total heap (excluding Perm space). Xms is the initial heap, set to 1.5Gb in this case; Xmx is the maximum heap. It is good practice to set Xms = Xmx. The maximum heap is limited by the RAM available on the server.

-XX:NewSize=512m
This specifies the initial size of the Young generation, set to 512Mb in this example. It is better to set this as a proportion of the heap using -XX:NewRatio.

-XX:MaxNewSize=512m
This specifies the maximum size of the Young generation, set to 512Mb in this example. Again, it is better to set this as a proportion of the heap using -XX:NewRatio.

-XX:PermSize=64m -XX:MaxPermSize=128m
These are the minimum and maximum sizes of the permanent generation. Optimally, set PermSize equal to MaxPermSize to avoid the area having to be resized as the permanent generation grows. As noted earlier, this area of memory is over and above the total heap set using Xms and Xmx.

-XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90
The New generation area is divided into three sub-areas: Eden, and two survivor spaces that are equal in size. Use the -XX:SurvivorRatio=X option to configure the ratio of Eden to a survivor space. In the example above, setting it to 8 means the ratio of Eden:SS1:SS2 is 8:1:1. So for a NewSize of 512Mb, the two survivor spaces will be 51Mb each, and Eden will be 512 minus (51 + 51) = 410Mb.
A TargetSurvivorRatio of 90 allows 90% of the survivor spaces to be occupied instead of the default 50%, allowing better utilization of the survivor space memory.

-XX:MaxTenuringThreshold=10
This switch determines how many times objects are copied between the survivor spaces before being promoted to the older generation. The default value is 31.

-XX:+DisableExplicitGC
This tells the JVM to ignore explicit System.gc() calls made from application code, leaving garbage collection entirely to the JVM.

-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -Xloggc:/log/gc.log
These are GC logging settings asking for GC details and specifying the log file in which these details should be captured.


If you have an appetite for more, read this http://java.sun.com/performance/reference/whitepapers/tuning.html

GCViewer

The link below explains how to download and use the GCViewer tool. It is quite a useful tool for viewing the number of GCs and Full GCs and how the JVM is behaving over time.

http://www.javaperformancetuning.com/tools/gcviewer/index.shtml

The most important things to look at in the GCViewer analysis are:

* Acc Pauses - Accumulated Pause Time (total time the app was stopped for GC). Pauses are the times when an application appears unresponsive because garbage collection is occurring.
* Total Time - Total time the application runs.
* Throughput - Time the application runs and is not busy with GC. Greater than 99% is fantastic. Throughput is the percentage of total time not spent in garbage collection, considered over long periods of time.
* Footprint - Overall memory consumption - ideally as low as possible. This is the working set of a process, measured in pages and cache lines. On systems with limited physical memory or many processes, footprint may dictate scalability. This usually reflects the size of the total heap allocated via Xms and Xmx.

This diagram is taken from the above site:



Tuning Example From the Trenches - Frequent GC due to Perm Space getting Full

JVM Parameters already set

java -server -Xms1024m -Xmx1024m -XX:MaxPermSize=340m -XX:NewSize=340m -XX:MaxNewSize=340m
-XX:SurvivorRatio=9 -XX:TargetSurvivorRatio=90 -XX:+UseParNewGC
-Xloggc:/wls_domains/gclog/jms.gc -XX:+PrintGCDetails -XX:+UseParNewGC
-XX:+PrintGCTimeStamps -XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
-XX:+JavaMonitorsInStackTrace -Dweblogic.system.BootIdentityFile=/wls_domains/xx/boot.properties
-Dweblogic.management.server=http://xx.xx.xx.xx:6000 -Dweblogic.Name=xxxxxx
-Dweblogic.ProductionModeEnabled=true
-Djava.security.policy=/opt/bea/wls/8.1sp4/weblogic81/server/lib/weblogic.policy
weblogic.Server


Full GC Pattern

The GC log shows


0.000: [Full GC 0.000: [Tenured: 0K->9794K(700416K), 0.8050952 secs] 134607K->9794K(1016960K), [Perm : 20479K->20479K(20480K)], 0.8053527 secs]


The first Full GC happens at 0.000 sec from the start of the server. The figures before and after '->' represent the size of live objects before and after the GC, and the number in parentheses indicates the total available space. So from the numbers above, Tenured was not full but Perm was almost full (20479K of 20480K).



6579.013: [Full GC 6579.013: [Tenured: 9794K->18941K(700416K), 0.9677233 secs]
155600K->18941K(1016960K), [Perm : 24575K->24575K(24576K)], 0.9679896 secs]


The same thing happens here. The second Full GC took place at 6579.013 sec (1hr 49mins) after startup of the server.
Again a Full GC is triggered even though Tenured was not full. Tenured is now 18.9 Mb out of 700 Mb - but the Perm space has grown to 24.5 Mb and is not getting cleared.


9363.515: [Full GC 9363.516: [Tenured: 18941K->19463K(700416K), 0.6532332 secs] 36950K->19463K(1016960K), [Perm : 28672K->26462K(28672K)], 0.6536095 secs]



At the 3rd Full GC at 9363 seconds after server startup, the Perm space grew to 28.6 Mb and recovered marginally to 26.4 Mb.

Observing this over a long period of time, we concluded that around 20Mb is allocated to Perm at startup; with each Full GC it keeps growing towards 30Mb, later shrinks back to about 25Mb, and the cycle continues.

The pattern is highlighted below:


0.000: [Full GC 0.000: [Tenured: 0K->9794K(700416K), 0.8050952 secs] 134607K->9794K(1016960K), [Perm : 20479K->20479K(20480K)], 0.8053527 secs]
6579.013: [Full GC 6579.013: [Tenured: 9794K->18941K(700416K), 0.9677233 secs] 155600K->18941K(1016960K), [Perm : 24575K->24575K(24576K)], 0.9679896 secs]
9363.515: [Full GC 9363.516: [Tenured: 18941K->19463K(700416K), 0.6532332 secs] 36950K->19463K(1016960K), [Perm : 28672K->26462K(28672K)], 0.6536095 secs]
13483.233: [Full GC 13483.233: [Tenured: 19463K->16962K(700416K), 0.9783693 secs] 26678K->16962K(1016960K), [Perm : 30719K->21330K(30720K)], 0.9857390 secs]
17308.829: [Full GC 17308.830: [Tenured: 16962K->17312K(700416K), 1.0578872 secs] 88025K->17312K(1016960K), [Perm : 25600K->25600K(25600K)], 1.0581738 secs]
21237.810: [Full GC 21237.810: [Tenured: 17312K->17814K(700416K), 1.4728764 secs] 302290K->17814K(1016960K), [Perm : 29695K->26719K(29696K)], 1.4801234 secs]
30079.672: [Full GC 30079.672: [Tenured: 17814K->18676K(700416K), 1.0282446 secs] 83159K->18676K(1016960K), [Perm : 30975K->27564K(30976K)], 1.0349869 secs]




Solution:
Though MaxPermSize=340m is specified, the initially available Perm space keeps getting full and the JVM invokes a Full GC to free up memory. The default initial PermSize is 16Mb, and hence the Perm space keeps resizing itself as the JVM grows.

Not sure if this behaviour is a bug, but adding an initial Perm size of 64Mb using the flag -XX:PermSize=64m resolved this issue of frequent Full GCs.
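With that flag added, the relevant part of the startup command shown above would look something like this (all other flags unchanged):


java -server -Xms1024m -Xmx1024m -XX:PermSize=64m -XX:MaxPermSize=340m -XX:NewSize=340m -XX:MaxNewSize=340m ... weblogic.Server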


UPDATE: Another example is published at http://jojovedder.blogspot.com/2009/07/jvm-tuning-from-trenches.html

Tuesday, 12 May 2009

Slow Server Response Part 4 - Platform Checks and Action Plan



In the earlier parts of this article I described some technical tuning tips specific to a slow-performing Weblogic JEE server.

Another earlier post has looked at analyzing thread dumps and prstats when the Weblogic/JEE server consumes high CPU.




This article provides a sequence of actions a Support team should carry out when faced with a critical situation - the site is down / servers running high CPU - and management teams want quick updates - and an action plan.

So here is (from experience) the Rapid Action Plan:


Technical Checklist for the Platform

1. Start a diary and mark out each of the steps given below as well as any following action with the timestamp at which it was carried out and the result/status.

2. Get the relevant technical experts on a conference call + Netmeeting/LiveMeeting/Desktop Sharing/Remote Admin.

3. Does the application have any traps or thresholds set which are configured to automatically raise alarms to the Support teams? Have any of the traps set been exceeded ? eg: Server CPU, Memory utilization, No of Threads

4. Can we narrow down to a problem area - Web server, Application server, Database, OS - based on log files, error messages and Support team or User input?

If a particular Weblogic Managed Server is identified as a point of failure, does the configuration allow that server process to be shut down for a while - thereby reducing cluster capacity but still providing an acceptable quality of service?

Are all the Managed Servers running with equal threads? If not, this can cause a load balancing issue.

Sometimes the bottleneck can be the Web server plugin which is not able to properly load balance the requests across the Weblogic cluster. This is usually the case when users complain of loss of session, spontaneous logout etc. The problem can be the user has been bounced from one Weblogic server to another in the cluster - and the session might not be replicated across the servers.

Are there any redundant JDBC connection pools, i.e. pools configured with a high capacity even though monitoring shows they don't need that many connections?
If so, reduce the Capacity of that pool so that it does not hold on to unnecessary connections on the database.

5. From the log files, identify whether a particular application or code area is causing an issue. eg: EJB throwing errors, Spring bean configuration missing.

6. Are the log files too large (> 500 Mb) or not getting rotated via Weblogic rotation policy or Unix archiving ?

7. Check the downstream back-end systems which the server connects to - via DBLink, Web service, XML/HTTP, JMS, HTTP screen scraping etc. Are there any known issues or planned outages? There should be error logs pointing in that direction. Contact their support teams to find out if their system is available and returning responses as per the SLA.

8. Can the problem be replicated on Reference/Test instances?
A Dev or Test team can in parallel try out to see whether the issue is replicable.

If Yes, is it code related or configuration related?

If the issue is not replicable, then can it be data related ? Perhaps a particular set of data exists on Production which is not on the Test instance - and that could be the problem. Can the data be brought into Test to try and replicate the problem ?

9. Can it be content related? Does the platform have a Content Management System? Is the link from the CMS to the server working or broken? Is the Content correctly getting deployed into the database + file system as per daily process?

Check if a content deployment was carried out and whether there are any records of it failing or passing. Is content deployment happening during business hours, utilizing system and CPU resources and thereby choking the JEE server?

Can a resource-hungry content deployment process be moved outside business hours?

10. Test broken user journeys on the site.

Can the problem be seen while running HttpHeaders, HttpAnalyzer, Fiddler etc ? Does it show any change in HTTP parameters such as Cookies, Session timeouts?
Compare these against the Test environment and see whether there are any mismatches which could cause the problem.
If user sessions are bouncing between managed servers, this will be visible in the Weblogic JSESSIONID, which will keep changing on the client browser.

11. What were the last few changes to the platform ?

Check latest release or configuration change as per Support Team Diary of Events. Could these have caused an issue and should these be rolled back?

Were these properly tested and signed off before going into Production?

eg: any new Database driver, changes to TCP parameters, JTA timeouts increased?

12. Check the last few support cases raised. See if there were any problems reported by business or end customers.

13. Solaris/OS checks

Is the platform running on the latest OS patch levels and JDK settings as recommended by Sun?

a. No of processes running. Use
ps -ef | wc -l

b. Ping the boxes, to check if they are alive

c. CPU utilization

prstat

d. Memory utilization

vmstat 3

Swap space utilization, amount of space in /tmp - is there any old file or core dump occupying the directory used as swap space. We once moved old EAR files from /tmp on the server; memory utilisation went from approx 90% down to 65%.

e. Disk space

df -ek

f. No of File descriptors - a quick check is sketched below
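A quick way to check the limit and current usage on Solaris/Unix (where <pid> is a placeholder for the Weblogic server process id):

ulimit -n                    # per-process file descriptor limit for the current shell
ls /proc/<pid>/fd | wc -l    # file descriptors currently open by the process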


14. Weblogic/Web server checks

a. Thread utilization - any Stuck Threads

Analyze thread dumps - take at least 4 sets of thread dumps, 5 seconds apart, when a stuck thread is observed (e.g. using the command below). See here for more details on what to look for in the thread dumps. Use Samurai or TDA.
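On Unix, a thread dump can be requested by sending SIGQUIT to the Weblogic JVM; the dump is written to the server's stdout/stderr log. <weblogic_pid> is a placeholder for the server process id:

kill -3 <weblogic_pid>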

b. CPU %

c. Access and error logs - Any CRITICAL messages in the logs. Any Connection_Refused errors indicating the threads were not able to accept new requests.

d. No of open sockets to weblogic

netstat -a | grep <weblogic listen port> | wc -l

e. Memory utilization via Weblogic console

f. Check via console if all the managed servers are up and running

g. Connection pool utilization - are they hitting their peak values?

h. Frequent Garbage collection shown in the console?
Frequency of GC, GC pattern. Has the JVM been tuned to allow optimum garbage collection? See this URL for more.

i. Check the values in weblogic.xml for jsp pageCheckSeconds and servlet-reload-check-secs - if these are at the default of 1, the server will check every second whether the JSP or servlet should be recompiled/reloaded, which is horribly slow. A sketch of the relevant entries is shown below.
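For reference, a sketch of the relevant weblogic.xml entries (element names as per the WLS 9.x schema; in 8.1 the jsp-descriptor uses jsp-param name/value pairs instead). Setting the values to -1 disables the per-request checks:

<jsp-descriptor>
    <page-check-seconds>-1</page-check-seconds>
</jsp-descriptor>
<container-descriptor>
    <servlet-reload-check-secs>-1</servlet-reload-check-secs>
</container-descriptor>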

j. Cron job logs - any failures.

k. No of weblogic sessions per server - the more the number of HttpSessions, the higher the memory (RAM) that gets used.

l. Is a large part of the application journey over SSL? When supporting the cryptography operations in the SSL protocol, WebLogic Server cannot handle as many simultaneous connections.
Typically, for every SSL connection that the server can handle, it can handle three non-SSL connections. SSL reduces the capacity of the server by about 33-50% depending upon the strength of encryption used in the SSL connections.
(Source: http://edocs.bea.com/wlp/docs92/capacityplanning/capacityplanning.html#wp1080286)
Consider reducing the SSL journeys on the site.

m. Disk space taken by Weblogic and other logs such as Log4J.
Is log4j running in DEBUG and writing out loads of logs ? This will also slow down the server horribly.

15. Database checks

a. SQL Server locks (Call out DBA)

b. Database stuck/locked processes

c. Any DB link down

d. Any issues with open cursors, cached cursors ?

e. Is the database running at very high Memory Utilization?


16. Search Engine processing - check the log for the day.


17. Any MIS such as Webtrends / Omniture Analysis - for application usage. Has there been a sudden rise in users on the site - eg a marketing campaign or a new feature gone live - causing a rise in usage which the infrastructure cannot cope with.

18. Any application cached data which was wiped out and took time to rebuild - causing slow service in the interim period. eg: is any database table with a lot of rows being cached.
Or conversely, is there incorrect data in a certain cache and will clearing the cache help ?

19. SMTP email delivery failures due to any problems on the OS ?

20. Any planned backup processes running on the OS which take up a lot of CPU?



Remedial actions

1. Make a list of possible changes based on the above checks to address these problems.

2. Only change one setting on any system at a time. Test and record the desired effects and observed effects. Be clear on why a particular change is being made.

3. If it doesn't work, rework the plan to get to the root cause of the failure.

4. Be aware that reactive changes will be made directly to the production environment by various parties, and that significant changes will be made purely to enable investigation and diagnosis of issues.
The lack of up-to-date documentation creates risk. Maintain a documented rationale for each design decision, configuration choice, or system parameter; this reduces the likelihood that mistakes will be repeated. Documentation is a key communication tool; without it, intent may be miscommunicated within the team. If key staff members leave, knowledge will be lost to the extent that the platform may become unmanageable.

5. Add additional tests to the regression test suite. Increase the coverage of the regression test suite, focussing on simulating live system interaction.


6. Over the long term, identify and re-architect towards removing Single Points of Failure - such that loss of a single machine or process would not lead to a loss of service.

Examples:
· Single web server machine, hosting the Apache/SunOne instances.
· Single application server machine, hosting the Weblogic application server.
· Single database server instance.

The system runs the risk of a lengthy service outage if any one of these components fails. If a hardware failure occurred and one of the servers was lost, alternative hardware would need to be installed and initialised from back-up tapes.
This needs to be fixed and stabilized over the long term.

7. Medium to long term remedial actions include code review. Use your tools of choice for Java, .NET, front-end Angular/React or back-end NodeJS, such as SonarQube, lint4j, SonarScanner for .NET, JSLint, ESLint, etc.

8. While the analysis is going on, a member of the Support team should circulate the KEY metrics on an hourly basis to the TECHNICAL community. Ensure this is the important dataset and not too much info, which just becomes noise.


Example in the table below:












Metric                                  Server 1   Server 2   Server 3
Idle Threads                            11         14         17
JMSErrors                               10         1          7
IOExceptions                            3          0          1
Stuck Threads                           0          0          7
netstat -a | grep TIME_WAIT | wc -l     186        289        69
CPU Utilization (%)                     12         1.8        3.6
Memory Utilization (%)                  3.9        2.9        2.1
500 Internal Server Error               3          0          0



No of logged in Users: 260

JMS Pending Messages: 0



Any queries or clarifications, leave me a comment and I'll try to get back.

Wednesday, 6 May 2009

Slow Weblogic Response Part 2 - Overall Tuning Considerations




Part 1 of this article covered JSP precompilation and tuning the recompilation settings provided in Weblogic.

Now, let's take a look at the other parameters that need to be tuned and set correctly for the Weblogic server to deliver acceptable performance for your site.

Usually the problem statement is similar to this:

1. The site regularly stops responding and all available execute threads are consumed. All future requests fail to be handled and a server restart is required.
2. The page response times of the entire site are too high and need to be brought down to a more useable level.

Note: An additional post has been published which provides a skeleton Action Plan for analyzing the entire slow site/high CPU issue including managing stakeholder expectations and appropriate reporting.

Operating System Review
A review of the Operating System configuration needs to look at
a) The number of file descriptors available and whether that matches the value recommended for Weblogic
b) Various TCP settings (also called NDD parameters) which will affect how long it takes to recycle a closed or dropped connection.
2 top settings are:
• File Descriptor limit: increase to 8192 if not 65536. A detailed follow-up on File Descriptors is published here.
• tcp_time_wait_interval: change to 60000 (1 minute). This controls how long a socket is kept open in the TIME_WAIT state after the response has been provided to the client. The default in Solaris 8 is 4 minutes; set this down to 1 minute. The default in Solaris 9 is already 1 minute. An example command is shown after this list.
Check the Oracle site for the latest recommended values
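On Solaris these NDD parameters can be inspected and changed with the ndd command (run as root; the change does not survive a reboot unless added to a startup script), for example:

ndd /dev/tcp tcp_time_wait_interval          # show the current value in milliseconds
ndd -set /dev/tcp tcp_time_wait_interval 60000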

Database Usage and JDBC Drivers
a) If you get Out Of Memory errors occurring in JDBC calls, it is recommended that the JDBC driver be upgraded to the latest version.

b) Prepared Statement caching is a feature that allows prepared statements to be held in a cache on each pooled connection, so they do not have to be re-parsed by the database for each call. This functionality needs to be enabled per Connection Pool and can have a significant impact on the performance of the pools, but it needs to be validated with a focussed round of performance testing.

It should be noted that for every Prepared Statement that is held in cache, a cursor is held open on the database for each connection in the pool. So if a cache size of 10 is used on the abcPool, and the pool has a size of 50 then 500 open cursors will be required. Repeatable load tests will highlight any gains achieved by enabling this caching.

Review JDBC Pool Sizes
Review connection pool versus the number of Execute threads. Usually keep Pool size close to Execute thread size. Note: This applies to versions earlier than Weblogic 9. See detailed explanation below.

If the JDBC pool size is much smaller than the number of execute threads, there is the potential to impact performance quite dramatically, since threads have to wait for connections to be returned to the pool.
Your most frequently used pools should have their minimum (initial) and maximum sizes increased to the number of Execute threads plus one. This means there is an available connection for every thread.
One comment on pool sizing: it is beneficial, wherever possible, to have the initial and maximum connections set to the same size for a JDBC pool, as this avoids expanding/shrinking work that can be costly - both the establishment of new connections while expanding the pool and the housekeeping work of shrinking it.

However it is also recommended to monitor the JDBC pools during peak hours and see how many connections are being used at maximum. If you are not hitting the MaxCapacity, it is useful to reduce the MaxCapacity to avoid unnecessary open cursors on the database.
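As an illustration, a WLS 8.1-style JDBCConnectionPool entry in config.xml combining the sizing and statement-cache recommendations above might look like the sketch below. The abcPool name is from the earlier example; the driver, URL, user and the figure of 26 (assuming 25 execute threads plus one) are placeholders to be replaced with your own values:

<JDBCConnectionPool Name="abcPool"
    Targets="server1"
    DriverName="oracle.jdbc.driver.OracleDriver"
    URL="jdbc:oracle:thin:@dbhost:1521:MYSID"
    Properties="user=appuser"
    InitialCapacity="26"
    MaxCapacity="26"
    PreparedStatementCacheSize="10"/>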

Note: As of Weblogic 9 and higher, Execute Queues are replaced by Work Managers. Work Managers can be used for JDBC pools by defining a max-threads-constraint to control how many threads are allocated for a particular Datasource.
It is possible to run Weblogic 9 and 10 with Execute Queues as in earlier releases, but this is not recommended since Work Managers are self-tuning and more advanced than Execute Queues.

WebLogic Server 8.1 implemented Execute Queues to handle thread management, in which you created thread pools to determine how workload was handled. WebLogic Server still provides Execute Queues for backward compatibility, primarily to facilitate application migration. However, when developing new applications, you should use Work Managers to perform thread management more efficiently.
You can enable Execute Queues in the following ways:
• Using the command line option -Dweblogic.Use81StyleExecuteQueues=true
• Setting the Use81StyleExecuteQueues property via the Kernel MBean in config.xml.
Enabling Execute Queues disables all Work Manager configuration and thread self-tuning. Execute Queues behave exactly as they did in WebLogic Server 8.1.
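As a sketch of the Work Manager alternative mentioned above (WLS 9.x+ descriptor elements; the work manager name is a placeholder and abcPool is the pool from the earlier example), a max-threads-constraint can be tied to a connection pool so that the number of threads never exceeds the number of available connections, and the work manager is then referenced from the web application via wl-dispatch-policy:

<!-- in weblogic.xml (work managers can also be defined in weblogic-application.xml or weblogic-ejb-jar.xml) -->
<work-manager>
    <name>OrderProcessingWM</name>
    <max-threads-constraint>
        <name>abcPool-constraint</name>
        <pool-name>abcPool</pool-name>
    </max-threads-constraint>
</work-manager>

<wl-dispatch-policy>OrderProcessingWM</wl-dispatch-policy>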


Database Persistent JMS Queues

Verify whether persistent JMS queues use the same database instance for their message store as the Weblogic portal uses for its data.
As the volumes on these queues increase, this could significantly degrade the performance of the portal by competing for valuable CPU cycles on the database server.

1. Move the message store for persistent queues to a separate database instance from that used by most of the JDBC pools belonging to the Weblogic server. This will prevent increases in message volumes from adversely affecting the performance of the database, which would also slow the portal applications down and vice-versa.

2. Implement paging with a file store. This allows the amount of memory consumed by JMS queues to be restricted by paging message contents to disk and only holding headers in memory. Note that paging does not provide failover protection in the way persistence does, and it performs better with a paging file store than with a paging JDBC store.

3. It is recommended that a review is also undertaken to determine exactly which queues are persisted and whether they truly need to be. The performance gains from switching to non-persisted queues are substantial, and guaranteed delivery is not always required.

Review Number of Execute Threads
A common mistake made by Support teams when seeing stuck threads is to increase the number of execute threads in the single 'default' queue. I once worked on a project which ran the Weblogic server with 95 threads.

This figure is very high and results in a large amount of context-switching as load increases, which consumes valuable CPU cycles. Because threads consume memory, you can degrade performance by increasing the value of the Thread Count attribute unnecessarily.

Taking a thread dump when the server is not responding will show what the threads are doing, and help identify whether there is an application coding issue or a deadlock occurring. Use Samurai to analyze these, as posted earlier.
It is recommended that regular monitoring of the number of idle threads and the length of queued requests for each execute queue is set up via MBeans. This allows the Support teams to plot a graph of utilization and to validate any changed values; an example command is shown below.
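On WLS 8.1, one way to capture these figures is the weblogic.Admin GET command against the ExecuteQueueRuntime MBeans (URL and credentials below are placeholders); look at attributes such as ExecuteThreadCurrentIdleCount and PendingRequestCurrentCount:

java weblogic.Admin -url t3://localhost:7001 -username system -password <password> GET -pretty -type ExecuteQueueRuntime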
Note: As of Weblogic 9 and higher, Execute Queues are now replaced by Work Managers. Another good link is here

Use Dedicated Execute Thread Queue for Intensive applications

As the number of threads is small, if a particular application is seen to utilize a majority of the execute threads, the following 2 approaches are suggested to resolve the issue:
1. Review the design of the offending application to determine whether it really needs so many threads.
2. Move the offending application to a dedicated execute queue, with enough threads allocated to this queue. This will prevent it from starving the main server of threads and allow the ‘default’ queue to remain with a lower number of Execute threads. This split of Execute Queue can be done at servlet or webapp level. Mail me if you need an example, we've done both successfully in WL 8.1 and 9.

Note: However, as of Weblogic 9 and higher, Execute Queues are now replaced by Work Managers. You can use a Work Manager to dedicate resources at Application, Web App, EJB level.


Review Java VM Settings
Tuning the JVM settings for the total heap and the Young/Old generations is essential to regulate the frequency of garbage collection on the servers. The basic primer is on the Sun website, and a follow-up of actual values and learnings is published here. The most essential settings are Xms and Xmx for the total heap, and NewSize or NewRatio for the Young generation. Also set PermSize and MaxPermSize appropriately to avoid consuming excessive memory.

Other Areas

1. To speed up server start times, do not delete the .wlnotdelete directories at startup - unless you are deploying changed application jars and code.
Be aware you might occasionally see a problem shutting down the server which goes into an UNKNOWN state due to too many old temp files and wl_internal files.
You will get the dreaded error below which can only be resolved by killing the process and clearing out all temp files, .lck files etc within the domain. The files are under DOMAIN_HOME/servers//



weblogic.management.NoAccessRuntimeException: Access not allowed for subject: principals=[weblogic, Deployers], on ResourceType: ServerLifeCycleRuntime Action: execute, Target: shutdown
at weblogic.rjvm.ResponseImpl.unmarshalReturn(ResponseImpl.java:195)
at weblogic.rmi.internal.BasicRemoteRef.invoke(BasicRemoteRef.java:224)



2. Avoid the URLEncoder.encode method call as much as possible. This call is not optimal in most JDKs below 1.5 and is often found to be a memory and CPU hotspot.

3. Check the network connection between WebLogic and the database if thread dumps show that threads are often in a suspended state (waiting for so long that they were suspended) while doing a socket read from the database.
The DBA wouldn't see this as a long-running SQL statement, so it needs to be checked out at the network level.

4. Switch off all DEBUG options on Production on app server as well as web server and web server plugins.

5. Ensure log files are rotated so that they can be backed up and moved off the main log directory. Define a rotation policy based on file size or fixed time (like 24 hours)
However also note that: On certain platforms, if some application is tailing
the log at the time of rotation, the rotation fails. Stop the application tailing and reopen the tail after the rotation is complete.

6. Do not use "Emulate Two-Phase Commit for non-XA Driver" for DataSources.
It is not a good idea to use emulated XA. It can result in loss of data integrity and can cause heuristic exceptions. This option is only meant for use with databases for which there is no XA driver so that the datasources for these pools can still participate in XA transactions.
If an XA driver is available (and there is for Oracle), it should be used. If this option is selected because of problems with Oracle's Thin XA driver, try the newest version, or pick a different XA driver.

7. The <save-sessions-enabled> element in weblogic.xml controls whether session data is cleaned up during redeploy or undeploy.
It affects in-memory and replicated sessions. The default is false. Setting the value to true means session data is saved across redeploys, which is an overhead.
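For reference, the element sits under container-descriptor in weblogic.xml; a sketch with the default value:

<container-descriptor>
    <save-sessions-enabled>false</save-sessions-enabled>
</container-descriptor>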

8. If firewalls are present between the Weblogic server and the database, or an external system connecting via JMS, this can cause transactional and session timeout issues. The session timeout on the firewall should be configured to allow for normal transaction times.
In the case of JMS, transactional interoperation between the two servers can be compromised, hence it is beneficial to open the firewall between the two servers so that RMI/T3 connections can be made freely.


Any queries or clarifications, leave me a comment and I'll try to get back.
