Tuesday, 12 May 2009

Slow Server Response Part 4 - Platform Checks and Action Plan



In the earlier parts of this article I described some technical tuning tips specific to a slow-performing Weblogic JEE server.

An earlier post looked at analyzing thread dumps and prstat output when the Weblogic/JEE server consumes high CPU.




This article lists the sequence of actions a Support team should carry out in a critical situation - the site is down or the servers are running at high CPU - when management teams want quick updates and an action plan.

So here is (from experience) the Rapid Action Plan:


Technical Checklist for the Platform

1. Start a diary and mark out each of the steps given below as well as any following action with the timestamp at which it was carried out and the result/status.

2. Get the relevant technical experts on a conference call + Netmeeting/LiveMeeting/Desktop Sharing/Remote Admin.

3. Does the application have any traps or thresholds configured to automatically raise alarms to the Support teams? Have any of them been exceeded? eg: Server CPU, Memory utilization, No of Threads

4. Can we narrow down to a problem area - Web server, Application server, Database, OS - based on log files, error messages and Support team or User input?

If a particular Weblogic Managed Server is identified as a point of failure, does the Configuration allow that server process to be shut down for a while - thereby reducing Cluster capacity but still providing acceptable Quality of service?

Are all the Managed Servers configured with the same number of threads? If not, this can cause a load balancing issue.

Sometimes the bottleneck can be the Web server plugin which is not able to properly load balance the requests across the Weblogic cluster. This is usually the case when users complain of loss of session, spontaneous logout etc. The likely cause is that the user has been bounced from one Weblogic server to another in the cluster and the session has not been replicated across the servers.

Are there any oversized JDBC connection pools - i.e. those configured with a high capacity when monitoring shows they don't need that many connections?
If so, reduce the Capacity of that pool so that it does not hold on to unnecessary connections on the database.

5. From the log files, identify whether a particular application or code area is causing an issue. eg: EJB throwing errors, Spring bean configuration missing.

6. Are the log files too large (> 500 MB), or not getting rotated via the Weblogic rotation policy or Unix archiving?
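
A quick way to spot oversized logs is a find over the log directories. This is a minimal sketch, assuming a POSIX shell such as ksh or bash; the function name find_large_logs and the example path are made up for illustration, and the 500 MB default simply mirrors the threshold above.

```shell
# find_large_logs DIR [MIN_MB]
# Lists regular files under DIR that are larger than MIN_MB megabytes
# (default 500). find_large_logs is a hypothetical helper name.
find_large_logs() {
    dir=$1
    min_mb=${2:-500}
    # POSIX find measures -size in 512-byte blocks, so 1 MB = 2048 blocks.
    blocks=$((min_mb * 2048))
    find "$dir" -type f -size +"$blocks" -print
}

# Example (hypothetical path):
#   find_large_logs /path/to/domain/servers 500
```

Anything this reports is a candidate for rotation or archiving before it fills the disk.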

7. Check the Downstream Back-end systems which the server connects to - via DBLink, Web service, XML/HTTP, JMS, HTTP screen scraping etc. Are there any known issues or planned outages? There should be error logs pointing in that direction. Contact their support teams to find out whether their system is available and returning responses as per the SLA.

8. Can the problem be replicated on Reference/Test instances?
A Dev or Test team can try in parallel to replicate the issue.

If Yes, is it code related or configuration related?

If the issue is not replicable, could it be data related? Perhaps a particular set of data exists on Production which is not on the Test instance - and that could be the problem. Can the data be brought into Test to try and replicate the problem?

9. Can it be content related? Does the platform have a Content Management System? Is the link from the CMS to the server working or broken? Is the Content correctly getting deployed into the database + file system as per daily process?

Check whether a content deployment was carried out and whether the records show it passed or failed. Is content deployment happening during business hours and using system and CPU resources - which chokes the JEE server?

Can a resource-hungry content deployment process be moved to out-of-business hours?

10. Test broken user journeys on the site.

Can the problem be seen while running HttpHeaders, HttpAnalyzer, Fiddler etc.? Does it show any change in HTTP parameters such as Cookies or Session timeouts?
Compare these against the Test environment and check for any mismatches which could cause the problem.
If user sessions are bouncing between managed servers, this will be visible in the Weblogic JSESSIONID cookie on the client browser, whose value will change.
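
To check for bouncing, it helps to pull the server identifiers out of the session cookie captured at two points in the journey. The sketch below assumes the commonly documented Weblogic cookie layout sessionid!primary_server_id!secondary_server_id; the function name jsession_servers and the sample cookie value are hypothetical, so compare against what your own capture tool shows.

```shell
# jsession_servers COOKIE_VALUE
# Splits a session cookie value on "!" and prints the second and third
# parts, assumed here to be the primary and secondary server ids.
jsession_servers() {
    printf '%s\n' "$1" | awk -F'!' '{ print "primary=" $2, "secondary=" $3 }'
}

# Example with a made-up cookie value:
#   jsession_servers 'Hx7kQ2vLp9!-1234567890!987654321'
```

If the primary value changes between two requests in the same browser session, the user has been moved to another managed server.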

11. What were the last few changes to the platform ?

Check latest release or configuration change as per Support Team Diary of Events. Could these have caused an issue and should these be rolled back?

Were these properly tested and signed off before going into Production?

eg: any new Database driver, changes to TCP parameters, increased JTA timeouts?

12. Check the last few support cases raised. Were any problems reported by business or end customers?

13. Solaris/OS checks

Is the platform running on the latest OS patch levels and JDK settings as recommended by Sun?

a. No of processes running. Use
ps -ef | wc -l

b. Ping the boxes, to check if they are alive

c. CPU utilization

prstat

d. Memory utilization

vmstat 3

Check swap space utilization and the amount of space in /tmp (on Solaris, /tmp is normally a tmpfs backed by swap) - is any old file or core dump occupying the directory used as swap space? We once moved old EAR files out of /tmp on the server; memory utilisation went from approx 90% down to 65%.
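
To see what is actually filling /tmp, listing the largest entries is usually enough. A minimal sketch, assuming a POSIX shell; largest_entries is a hypothetical helper name, not a system command.

```shell
# largest_entries DIR [N]
# Prints the N largest entries directly under DIR (default 10),
# sizes in KB, biggest last.
largest_entries() {
    dir=$1
    n=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$n"
}

# Example:
#   largest_entries /tmp 10
```

Old EAR files and core dumps tend to show up at the bottom of this list.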

e. Disk space

df -ek

f. No of File descriptors
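
One way to count the descriptors held by the Weblogic process is to list its fd directory under /proc, which exists on both Solaris and Linux; on Solaris, pfiles on the process id gives the detailed view, and ulimit -n shows the limit for the current shell. The helper name fd_count and its optional second argument (an alternative /proc root, handy for trying the function out) are assumptions for illustration.

```shell
# fd_count PID [PROC_ROOT]
# Counts the open file descriptors of process PID by listing its fd
# directory. PROC_ROOT defaults to /proc.
fd_count() {
    pid=$1
    proc_root=${2:-/proc}
    ls "$proc_root/$pid/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Example (hypothetical process id):
#   fd_count 12345
#   ulimit -n
```

A count close to the ulimit value means the server is about to run out of descriptors.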


14. Weblogic/Web server checks

a. Thread utilization - any Stuck Threads

Analyze Thread dumps: take at least 4 sets of Thread dumps, 5 seconds apart, when a stuck thread is observed. See here for more details on what to look for in the thread dumps. Use Samurai or TDA to read them.
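
Taking the dumps by hand under pressure is error-prone, so a small loop helps. The sketch below sends signal 3 (SIGQUIT) to the JVM process, which on Sun JVMs writes a thread dump to the server's standard output without stopping it. The function name take_thread_dumps, its defaults of 4 dumps 5 seconds apart, and the KILL_CMD override (there only so the loop can be dry-run) are assumptions for illustration.

```shell
# take_thread_dumps PID [COUNT] [INTERVAL_SECONDS]
# Sends signal 3 to PID COUNT times (default 4), sleeping
# INTERVAL_SECONDS (default 5) between signals.
take_thread_dumps() {
    pid=$1
    count=${2:-4}
    interval=${3:-5}
    i=1
    while [ "$i" -le "$count" ]; do
        # KILL_CMD is a hypothetical override for dry runs; normally unset.
        ${KILL_CMD:-kill} -3 "$pid" || return 1
        # No need to wait after the final dump.
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
}

# Example (hypothetical process id of the managed server JVM):
#   take_thread_dumps 12345 4 5
```

The dumps land wherever the server's standard output is redirected, so check that file before loading it into Samurai or TDA.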

b. CPU %

c. Access and error logs - Any CRITICAL messages in the logs. Any Connection_Refused errors indicating the threads were not able to accept new requests.

d. No of open sockets to weblogic

netstat -a | grep
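
To see how the sockets on the Weblogic listen port break down by state, the netstat output can be summarised with awk. A minimal sketch: socket_states is a hypothetical helper, 7001 is only an example port, and the match relies on the state being the last column of netstat -an output for TCP lines.

```shell
# socket_states PORT
# Reads netstat -an output on stdin and prints a count per TCP state for
# lines where a local or remote address ends in PORT.
socket_states() {
    awk -v port="$1" '
        # Addresses end in ".PORT" on Solaris and ":PORT" on Linux.
        $0 ~ ("[.:]" port "([ \t]|$)") { count[$NF]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Example (7001 is an example port):
#   netstat -an | socket_states 7001
```

A TIME_WAIT count that dwarfs ESTABLISHED usually points at many short-lived connections rather than a shortage of threads.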

e. Memory utilization via Weblogic console

f. Check via console if all the managed servers are up and running

g. Connection pool utilization - are the pools hitting their peak values?

h. Is frequent Garbage collection shown in the console?
Check the frequency of GC and the GC pattern. Has the JVM been tuned to allow optimum garbage collection? See this URL for more.

i. Check the values in weblogic.xml for the JSP pageCheckSeconds and servlet-reload-check-secs settings - if these are at the default of 1, the server will check each second to see whether the JSP should be recompiled or the servlet reloaded - this is horribly slow.
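
The settings can be located quickly with grep. This sketch assumes a grep that accepts multiple -e patterns (on Solaris that may mean /usr/xpg4/bin/grep), and it searches for two spellings of the JSP setting since the element name differs between Weblogic releases; show_reload_checks and the example path are hypothetical.

```shell
# show_reload_checks FILE
# Prints, with line numbers, the lines of a weblogic.xml that mention the
# JSP page check or servlet reload check settings.
show_reload_checks() {
    grep -n -e 'pageCheckSeconds' \
            -e 'page-check-seconds' \
            -e 'servlet-reload-check-secs' "$1"
}

# Example (hypothetical path):
#   show_reload_checks /path/to/app/WEB-INF/weblogic.xml
```

No output means the file does not set them explicitly, so the server defaults apply.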

j. Cron job logs - any failures.

k. No of weblogic sessions per server - the more the number of HttpSessions, the higher the memory (RAM) that gets used.

l. Is a large part of the application journey over SSL? When supporting the cryptography operations in the SSL protocol, WebLogic Server cannot handle as many simultaneous connections.
Typically, for every SSL connection that the server can handle, it can handle three non-SSL connections. SSL reduces the capacity of the server by about 33-50% depending upon the strength of encryption used in the SSL connections.
(Source: http://edocs.bea.com/wlp/docs92/capacityplanning/capacityplanning.html#wp1080286)
Consider reducing the SSL journeys on the site.

m. Disk space taken by Weblogic and other logs such as Log4J.
Is log4j running in DEBUG and writing out loads of logs? This will also slow down the server horribly.

15. Database checks

a. SQL Server locks (Call out DBA)

b. Database stuck/locked processes

c. Any DB link down

d. Any issues with open cursors, cached cursors ?

e. Is the database running at very high Memory Utilization?


16. Search Engine processing - check the log for the day.


17. Any MIS such as Webtrends / Omniture Analysis - for application usage. Has there been a sudden rise in users on the site - eg a marketing campaign or a new feature gone live - causing a rise in usage which the infrastructure cannot cope with?

18. Was any application cached data wiped out which then took time to rebuild - causing slow service in the interim period? eg: is any database table with a lot of rows being cached?
Or conversely, is there incorrect data in a certain cache, and will clearing the cache help?

19. Are there SMTP email delivery failures due to any problems on the OS?

20. Are any planned backup processes running on the OS which take up a lot of CPU?



Remedial actions

1. Make a list of possible changes based on the above checks to address these problems.

2. Only change one setting on any system at a time. Test and record the desired effects and observed effects. Be clear on why a particular change is being made.

3. If it doesn't work, rework the plan to get to the root cause of the failure.

4. Be aware that reactive changes will be made directly to the production environment by various parties. Significant changes will be made purely to enable investigation and diagnosis of issues.
The lack of up-to-date documentation creates risk. Maintain a documented rationale for each design decision, configuration choice, or system parameter; this reduces the likelihood that mistakes will be repeated. Documentation is a key communication tool; without it, intent may be miscommunicated within the team. If key staff members leave, knowledge will be lost to the extent that the platform may become unmanageable.

5. Add additional tests to the regression test suite. Increase the coverage of the regression test suite, focussing on simulating live system interaction.


6. Over the long term, identify and re-architect towards removing Single Points of Failure - such that loss of a single machine or process would not lead to a loss of service.

Examples:
· Single web server machine, hosting the Apache/SunOne instances.
· Single application server machine, hosting the Weblogic application server.
· Single database server instance.

The system runs at risk of lengthy service outage if any one of these components fails. If a hardware failure occurred and one of the servers was lost, alternative hardware would need to be installed and initialised from back-up tapes.
This needs to be fixed and stabilized over the long term.

7. While the analysis is going on, a member of the Support team should circulate the KEY Metrics on an hourly basis to the TECHNICAL community. Ensure this is only the important dataset and not so much info that it just becomes noise.


Example in the table below:












Servers                               Server 1   Server 2   Server 3
Idle Threads                          11         14         17
JMSErrors                             1017
IOExceptions                          3          0          1
Stuck Threads                         0          0          7
netstat -a | grep TIME_WAIT | wc -l   18628969
CPU Utilization (%)                   121.83.6
Memory Utilization (%)                3.9        2.9        2.1
500 Internal Server Error             3          0          0



No of logged in Users: 260

JMS Pending Messages: 0
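
Some of these numbers can be pulled straight from the logs each hour. As one example, this sketch counts 500 responses in an access log, assuming the common log format where the HTTP status is the ninth whitespace-separated field; count_http_500 and the sample log path are hypothetical.

```shell
# count_http_500 ACCESS_LOG
# Prints the number of requests in ACCESS_LOG whose status field is 500.
count_http_500() {
    awk '$9 == 500 { n++ } END { print n + 0 }' "$1"
}

# Example (hypothetical path), run once per managed server:
#   count_http_500 /path/to/access.log
```

Running the same helper against each server's log gives the per-server row for the hourly update.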



Any queries or clarifications, leave me a comment and I'll try to get back.