Crash on AIX produces no core or a truncated core
Troubleshooting
Problem
This document outlines what needs to be done to ensure that a full core file is produced on AIX if WebSphere Application Server crashes.
Resolving The Problem
WebSphere Application Server should generate a system core dump file during a crash, when a dump is manually triggered, and in some OutOfMemory situations. A good system core dump is needed to diagnose crashes, some OutOfMemory issues, and other problems as needed. A few conditions can cause core dumps to be truncated and unusable.
NOTE: There is a different technote that discusses issues where the process does not record a crash event.
1. SET ULIMITS
See Also: Guidelines for Setting Ulimits
The ulimits for core and fsize need to be tuned so that both the hard and soft limits are set to unlimited. Changing them may require root access.
To set them globally, edit the /etc/security/limits file and change the core and fsize settings for both the hard and soft limits. However, if the application server is started by the init process at boot, these settings will not take effect; in that case, set the limits with the ulimit command directly in the init.d startup script.
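As a minimal sketch (the stanza and the ulimit calls shown here are illustrative; adjust them to your own stanzas and startup script):
In /etc/security/limits, a value of -1 means unlimited:
default:
        core = -1
        core_hard = -1
        fsize = -1
        fsize_hard = -1
In an init.d startup script, before the server is launched:
ulimit -c unlimited
ulimit -f unlimited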
If you want to validate an already running application server process, capture a javacore (kill -3 PID), open it with a text editor and check for "RLIMIT_CORE" and "RLIMIT_FSIZE".
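For example, assuming the default javacore file naming, a quick check from the command line:
grep -E "RLIMIT_CORE|RLIMIT_FSIZE" javacore.*.txt
Both the soft and hard limits should be reported as unlimited if the settings took effect.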
** NOTE: If the application server is associated with a node agent, BOTH the node agent and the application server MUST be restarted to pick up the change. If the installation does not have a node agent, restarting the application server alone is sufficient.
2. CONFIGURE FULL CORE ON THE OPERATING SYSTEM
Check your OS configuration (in the SMIT tool) to see if the fullcore option is set to true.
If this option is not set, the IBM SDK notifies you in the native_stderr.log (or wherever your standard error output is directed) with the following message when a core dump is generated:
Note: "Enable full CORE dump" in smit is set to FALSE and as a result there will be limited threading information in core file.
If you do not have access to the SMIT administration tool, the following flag can be set from the command line (as the root user):
To set full core generation:
chdev -l sys0 -a fullcore=true
To verify full core is set:
lsattr -El sys0 | grep fullcore
3. DISK SPACE
Check the partitions where WebSphere Application Server resides and make sure there is enough free space for the dump to be written. If the core could not be written, an error message is usually logged in the native_stderr.log.
To check all of your partitions, execute this command (the -k is for kilobytes):
df -k
=======================================================
** Stop after step 3
Only perform the steps below if specifically instructed by IBM Support.
4. DISABLE SIGNAL HANDLERS
To force the operating system to handle all signals sent to the JVM process, you can disable all JVM signal handlers.
For IBM SDK 6.0 and later, set this JVM argument:
-Xrs
NOTE: On SDK 6.0 and later, to prevent unintentional crashes due to SIGTRAP, clear the shared class cache by executing <WAS_HOME>/bin/clearClassCache.sh
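For example, a minimal sequence for a stand-alone profile (the <PROFILE_HOME> placeholder and the server name server1 are illustrative; -Xrs is added to the server's generic JVM arguments before the restart):
<WAS_HOME>/bin/clearClassCache.sh
<PROFILE_HOME>/bin/stopServer.sh server1
<PROFILE_HOME>/bin/startServer.sh server1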
5. EXECUTE "pdump.sh" SCRIPT
In cases where core files are still not being produced, you can execute the pdump.sh script to extract information from the running process. This is especially helpful if you suspect the process is in a zombie state and does not respond to any signals.
You can download the latest version from this location:
ftp://ftp.software.ibm.com/aix/tools/debug/pdump.sh
pdump.sh <Java_PID>
This creates a file named pdump.java.###.txt. Locate the line containing the string "sigcatch". If SEGV is listed in the output, the signal is being caught. Both SEGV and SIGSEGV refer to signal 11.
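For example, assuming the default output file name:
grep -i sigcatch pdump.java.*.txt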
Additional Questions:
What happens if I do not have write permission in the profile's root directory, or in the directory to which I am redirecting javacores, heapdumps, and system core files?
Writing these files will fail. Check the native_stderr.log for an error; the JVM may try to write the dump to an alternate directory (such as /tmp).
Why are core files truncated at 2 GB even though all ulimit settings are set to unlimited?
There is a limitation on 32-bit processes that can be worked around by enabling large file support.
Using a 64-bit version of WebSphere Application Server also resolves this limitation, although the dump can still be truncated if you run out of disk space.
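As an illustrative check for a JFS file system (the mount point shown is only an example; "bf: true" in the output indicates large file support is enabled, and JFS2 file systems support large files by default):
lsfs -q /opt/IBM/WebSphere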
Can I test my configuration to see if a core can be generated?
Yes, you can simulate a crash by sending signal 11 to the JVM process. Note that this terminates the process.
kill -11 PID
An alternative is to use the gencore command. This produces a core file and keeps the process running.
gencore PID
Related Information
Submitting information to IBM support