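The previous post covered how to generate a core dump for a Java process. This post summarizes, step by step, how to analyze what happened after a Java process has crashed.
1. Fatal Error Log
Continuing with yesterday's example, this is what the directory contains after the core dump: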
```
[xxxx@VM_72_32_tlinux ~/fanyy]$ ls
core.23687  CoreDumpTest.class  CoreDumpTest.h  CoreDumpTest.java  crash.c  crash.so  hs_err_pid23687.log
```
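In most cases, when a Java process crashes, the JVM writes an error log named hs_err_pidxxxx.log (where xxxx is the process ID) into the program's working directory. This is the first file to read when investigating, because it records a great deal of useful information: thread stacks, heap usage, JVM arguments and environment variables, VM state, operating system details, and more. It is comprehensive enough to narrow the problem down considerably.
Start with the comment block at the top of the file, which gives a rough indication of why the process crashed: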
```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f37956cd554, pid=23687, tid=139881182439168
#
# JRE version: 6.0_22-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode linux-amd64 )
# Problematic frame:
# C  [crash.so+0x554]  Java_CoreDumpTest_crash+0x18
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
```
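Lines 8 and 9 of this header (the "Problematic frame" entry) show that the faulting frame is the function Java_CoreDumpTest_crash in the shared library crash.so. Next, look at the detailed stack information: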
```
Stack: [0x00007f38a020a000,0x00007f38a030b000],  sp=0x00007f38a0309850,  free space=3fe0000000000000018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [crash.so+0x554]  Java_CoreDumpTest_crash+0x18
j  CoreDumpTest.crash()V+0
j  CoreDumpTest.main([Ljava/lang/String;)V+7
v  ~StubRoutines::call_stub
V  [libjvm.so+0x3e75fd]
V  [libjvm.so+0x5f6fe9]
V  [libjvm.so+0x3e7435]
V  [libjvm.so+0x4206d1]
V  [libjvm.so+0x40eb50]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  CoreDumpTest.crash()V+0
j  CoreDumpTest.main([Ljava/lang/String;)V+7
v  ~StubRoutines::call_stub
```
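The first section, Native frames, lists the frames of every kind of code (native, VM, and Java), up to a limit of 100 frames. It shows clearly that the program crashed when CoreDumpTest.crash() called the native function Java_CoreDumpTest_crash in crash.so. The second section, Java frames, contains only the Java part of the stack. Together, these two sections are enough to roughly locate the problem.
2. Debugging with GDB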
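The fields discussed above can also be pulled out of the log mechanically. The following is a minimal sketch, not a JDK facility: the class name HsErrSummary and its methods are hypothetical, and the patterns assume the header and frame layout shown in this post's sample log, which may differ between JVM versions.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the JDK) for skimming an hs_err_pid*.log file. */
final class HsErrSummary {

    // Assumed layout of the signal line, taken from the sample log:
    // "#  SIGSEGV (0xb) at pc=0x00007f37956cd554, pid=23687, tid=139881182439168"
    private static final Pattern SIGNAL_LINE = Pattern.compile(
            "^#\\s+(\\w+) \\((0x[0-9a-fA-F]+)\\) at pc=(0x[0-9a-fA-F]+), pid=(\\d+), tid=(\\d+)");

    /** Returns {signal, pc, pid} from the header, or null if no signal line is found. */
    static String[] signalInfo(List<String> lines) {
        for (String line : lines) {
            Matcher m = SIGNAL_LINE.matcher(line);
            if (m.find()) {
                return new String[] { m.group(1), m.group(3), m.group(4) };
            }
        }
        return null;
    }

    /** Returns the line after "# Problematic frame:" with its leading "# " removed, or null. */
    static String problematicFrame(List<String> lines) {
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (lines.get(i).startsWith("# Problematic frame:")) {
                return lines.get(i + 1).replaceFirst("^#\\s*", "");
            }
        }
        return null;
    }

    /** Maps the one-letter prefix of a frame line to the meaning given in the log's own legend. */
    static String frameKind(String frameLine) {
        if (frameLine.isEmpty()) {
            return "unknown";
        }
        switch (frameLine.charAt(0)) {
            case 'J': return "compiled Java code";
            case 'j': return "interpreted";
            case 'V':
            case 'v': return "VM code";
            case 'C': return "native code";
            default:  return "unknown";
        }
    }

    public static void main(String[] args) throws java.io.IOException {
        // Usage: java HsErrSummary hs_err_pid23687.log
        List<String> lines = java.nio.file.Files.readAllLines(java.nio.file.Paths.get(args[0]));
        String[] info = signalInfo(lines);
        if (info != null) {
            System.out.println("signal=" + info[0] + " pc=" + info[1] + " pid=" + info[2]);
        }
        String frame = problematicFrame(lines);
        if (frame != null) {
            System.out.println("problematic frame (" + frameKind(frame) + "): " + frame);
        }
    }
}
```

A frame kind of "native code" on the problematic frame is the quick signal that the crash happened in a JNI library rather than in the JVM itself.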
```
gdb /usr/local/jdk/jre/bin/java core.23687
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/jdk/jre/bin/java...(no debugging symbols found)...done.
...
...
...
(gdb) bt
#0  0x00007f38a1069625 in raise () from /lib64/libc.so.6
#1  0x00007f38a106ae05 in abort () from /lib64/libc.so.6
#2  0x00007f38a0b82517 in os::abort(bool) () from /usr/local/jdk/jre/lib/amd64/server/libjvm.so
#3  0x00007f38a0cbf78d in VMError::report_and_die() () from /usr/local/jdk/jre/lib/amd64/server/libjvm.so
#4  0x00007f38a0b88515 in JVM_handle_linux_signal () from /usr/local/jdk/jre/lib/amd64/server/libjvm.so
#5  0x00007f38a0b84e3e in signalHandler(int, siginfo*, void*) () from /usr/local/jdk/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007f37956cd554 in Java_CoreDumpTest_crash () from /home/astd/fanyy/javacoredump/crash.so
#8  0x00007f389899ac88 in ?? ()
#9  0x00007f389c006000 in ?? ()
#10 0x00007f389898fa42 in ?? ()
#11 0x00007f38a0309870 in ?? ()
#12 0x00007f3798336358 in ?? ()
#13 0x00007f38a03098d0 in ?? ()
#14 0x00007f3798336898 in ?? ()
#15 0x0000000000000000 in ?? ()
```
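GDB only resolves the stack at the native-code level; further up, every function name is a question mark. Why? Note that line 11 of the output reads "Reading symbols from /usr/local/jdk/jre/bin/java...(no debugging symbols found)...done.", which means GDB did not find a symbol table in the executable /usr/local/jdk/jre/bin/java. Function names, variable names, and the like are stored in the symbol table, so without a symbol table that covers an address, GDB cannot turn that address into a name.
Then why could GDB print function names when debugging the C program in the previous post? There, the executable passed to GDB was test, the very program that crashed, so its symbols were available. Here the executable is java, which naturally does not contain symbols for the program you wrote: the names of your classes and methods live in your own Java program, not in any symbol table GDB reads. And GDB cannot simply be pointed at the Java program instead, because a Java program runs only on the JVM and cannot be executed directly by the operating system.
The conclusion is that, because it cannot print the Java-level stack, GDB is of limited use for analyzing a Java core dump. For example, suppose the crash is reported on this line: in Java_java_net_SocketInputStream_socketRead0 () from /usr/local/jdk/jre/lib/amd64/libnet.so. This is the native function that reads from a network socket, and far too many places in Java code call it. Without the Java stack above it, knowing this frame does little to locate the problem.
3. jstack
There are still ways to get the Java-level stack, though. One is to point the powerful debugging tool jstack at the core file.
1). Using jstack directly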
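Although GDB cannot show the Java callers, the name of a JNI frame such as Java_CoreDumpTest_crash does at least identify which Java native method it implements, because JNI derives the symbol from the class and method names. The sketch below reverses that naming for the common cases. The class JniSymbol is hypothetical, it assumes a well-formed symbol, and it drops the "__" signature suffix used for overloaded native methods instead of decoding it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that maps a JNI symbol name back to "class#method". */
final class JniSymbol {

    /** Returns e.g. "java.net.SocketInputStream#socketRead0", or null if not a JNI-style name. */
    static String toJavaMethod(String symbol) {
        if (!symbol.startsWith("Java_")) {
            return null;
        }
        String body = symbol.substring("Java_".length());

        // "__" separates the method name from the argument signature of an overloaded native method.
        int signatureStart = body.indexOf("__");
        if (signatureStart >= 0) {
            body = body.substring(0, signatureStart);
        }

        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '_') {
                current.append(c);
                continue;
            }
            char next = i + 1 < body.length() ? body.charAt(i + 1) : '\0';
            if (next == '1') {
                // "_1" is the escape for a literal underscore in a name.
                current.append('_');
                i++;
            } else if (next == '0' && i + 6 <= body.length()) {
                // "_0xxxx" is the escape for a Unicode character given as four hex digits.
                current.append((char) Integer.parseInt(body.substring(i + 2, i + 6), 16));
                i += 5;
            } else {
                // A plain underscore separates package, class, and method components.
                parts.add(current.toString());
                current.setLength(0);
            }
        }
        parts.add(current.toString());

        if (parts.size() < 2) {
            return null;
        }
        String method = parts.remove(parts.size() - 1);
        return String.join(".", parts) + "#" + method;
    }
}
```

This only names the native method itself. It says nothing about which Java code called it, which is exactly the information GDB cannot provide here.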
```
[xxxx@VM_16_59_tlinux /app/opbak]$ jstack /usr/local/jdk/jre/bin/java core.23687
Attaching to core core.23687 from executable /usr/local/jdk/jre/bin/java, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.80-b11
Deadlock Detection:

No deadlocks found.

Thread 25037: (state = BLOCKED)


Thread 25036: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Interpreted frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=36, line=209 (Interpreted frame)


Thread 25035: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133 (Interpreted frame)


Thread 25025: (state = IN_NATIVE)
 - CoreDumpTest.crash() @bci=0 (Interpreted frame)
 - CoreDumpTest.main(java.lang.String[]) @bci=7, line=12 (Interpreted frame)
```
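In the output above, the thread of interest is the one in state IN_NATIVE, since the crash happened inside a native call. On a real application with hundreds of threads, picking those out by eye is tedious, so here is a minimal sketch of filtering the dump by state. The class SaJstackThreads is hypothetical, and the parsing assumes the "Thread <id>: (state = <STATE>)" and " - <frame>" layout shown in this post, which is not a documented format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that filters a jstack-on-core dump by thread state. */
final class SaJstackThreads {

    // Assumed thread header layout, e.g. "Thread 25025: (state = IN_NATIVE)".
    private static final Pattern THREAD_HEADER =
            Pattern.compile("^Thread (\\d+): \\(state = (\\w+)\\)");

    /** Returns thread id -> frames (top frame first) for every thread in the given state. */
    static Map<String, List<String>> framesByThread(List<String> lines, String state) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null; // frames of the thread being read, or null if it is filtered out
        for (String raw : lines) {
            String line = raw.trim();
            Matcher m = THREAD_HEADER.matcher(line);
            if (m.find()) {
                if (m.group(2).equals(state)) {
                    current = new ArrayList<>();
                    result.put(m.group(1), current);
                } else {
                    current = null;
                }
            } else if (current != null && line.startsWith("- ")) {
                current.add(line.substring(2));
            }
        }
        return result;
    }
}
```

Calling framesByThread(lines, "IN_NATIVE") on the dump above would leave only thread 25025, whose top frame is CoreDumpTest.crash().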
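2). Using jsadebugd
jsadebugd attaches to a Java process or a core file and acts as a debug server. Client tools such as jstack, jmap, and jinfo can then connect to this server over RMI to do the debugging.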
```
[xxxx@VM_16_59_tlinux /app/opbak]$ jsadebugd /usr/local/jdk/jre/bin/java core.23687 TempServer
Attaching to core core.23687 from executable /usr/local/jdk/jre/bin/java and starting RMI services, please wait...
Debugger attached and RMI services started.
```
Then run jstack against the server:
```
jstack TempServer@localhost
```
The stack output is the same as when using jstack directly, so it is not repeated here.
4. jmap
Since jstack can analyze a core file, jmap can as well.
```
[xxxx@VM_16_59_tlinux /app/opbak]$ jmap /usr/local/jdk/jre/bin/java core.23687
Attaching to core core.23687 from executable /usr/local/jdk/jre/bin/java, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.80-b11
0x0000000000400000      7K      /usr/local/jdk/jre/bin/java
0x00007f6ac1a66000      42K     /lib64/libonion.so
0x00007f6ac173a000      139K    /lib64/libpthread.so.0
0x00007f6ac1525000      96K     /usr/local/jdk1.7.0_80/bin/../lib/amd64/jli/libjli.so
0x00007f6ac1321000      19K     /lib64/libdl.so.2
0x00007f6ac0f8d000      1876K   /lib64/libc.so.6
0x00007f6ac1957000      150K    /lib64/ld-linux-x86-64.so.2
0x00007f6ac0113000      14879K  /usr/local/jdk1.7.0_80/jre/lib/amd64/server/libjvm.so
0x00007f6abfe8f000      582K    /lib64/libm.so.6
0x00007f6abfb86000      42K     /lib64/librt.so.1
0x00007f6abf978000      63K     /usr/local/jdk1.7.0_80/jre/lib/amd64/libverify.so
0x00007f6abf74d000      214K    /usr/local/jdk1.7.0_80/jre/lib/amd64/libjava.so
0x00007f6abf535000      107K    /usr/local/jdk1.7.0_80/jre/lib/amd64/libzip.so
0x00007f6aa7cfe000      5K      /home/fanyy/crash.so
```
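The listing above gives the load address of each shared library, which is what a notation like crash.so+0x554 in the error log is relative to: the offset is the faulting address minus the library's base address. The sketch below applies that arithmetic to the jmap lines. The class LibraryMap is hypothetical, and it assumes the size column approximates the extent of each mapping, which may not hold exactly. The address used in the usage note is made up for illustration and does not come from the logs in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that resolves an address against the library list printed by jmap. */
final class LibraryMap {

    // Assumed line layout: "<base address in hex> <size>K <path>".
    private static final Pattern ENTRY =
            Pattern.compile("^(0x[0-9a-fA-F]+)\\s+(\\d+)K\\s+(\\S+)$");

    private static final class Entry {
        final long base;
        final long sizeBytes;
        final String path;

        Entry(long base, long sizeBytes, String path) {
            this.base = base;
            this.sizeBytes = sizeBytes;
            this.path = path;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Builds the map from jmap output, ignoring lines that are not library entries. */
    static LibraryMap parse(List<String> jmapLines) {
        LibraryMap map = new LibraryMap();
        for (String raw : jmapLines) {
            Matcher m = ENTRY.matcher(raw.trim());
            if (m.matches()) {
                long base = Long.parseLong(m.group(1).substring(2), 16);
                long sizeBytes = Long.parseLong(m.group(2)) * 1024L;
                map.entries.add(new Entry(base, sizeBytes, m.group(3)));
            }
        }
        return map;
    }

    /** Returns "<path>+0x<offset>" for the library containing the address, or null if none does. */
    String resolve(long address) {
        for (Entry e : entries) {
            if (address >= e.base && address - e.base < e.sizeBytes) {
                return e.path + "+0x" + Long.toHexString(address - e.base);
            }
        }
        return null;
    }
}
```

With the listing above, resolving the illustrative address 0x00007f6aa7cfe554 would yield /home/fanyy/crash.so+0x554, the same form the error log uses for its problematic frame.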
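One caveat: whether through jsadebugd or by running jstack or jmap directly against the core file, every attempt failed under JDK 1.6.0_22. After switching to a 1.7 JDK, it worked. I don't know whether every JDK older than 1.7 has this problem, whether only this particular 1.6 build does, or whether something was misconfigured. Pinning this down would mean testing too many combinations, so I did not try them all.
The error looks like this: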
```
[xxxx@VM_72_32_tlinux ~/fanyy]$ jstack /usr/local/jdk/jre/bin/java core.23687
Attaching to core core.23687 from executable /usr/local/jdk/jre/bin/java, please wait...
Error attaching to core file: Can't attach to the core file
```
The JDK version in use when the error occurred:
```
[xxxx@VM_72_32_tlinux ~/fanyy]$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
```