Crash dump on Solaris x86 (with NMI interrupt)
I'm still surprised to see Solaris servers configured without any way of taking a crash dump. The next few lines explain how to set this up, especially on Solaris x86, where a hung system can be forced to dump with an NMI (non-maskable interrupt).
To achieve this, several elements of the Solaris system have to be configured:
How do you boot Solaris 10 x86 in debug mode? It's really simple: add the "kadb" parameter at the end of the "multiboot" line (the kernel$ line) in the file "menu.lst", then reboot.
As you can see below:
# pwd
/rpool/boot/grub
# cat menu.lst
[...]
title s10x_u9wos_14a
bootfs rpool/ROOT/s10x_u9wos_14a
findroot (pool_rpool,0,a)
kernel$ /platform/i86pc/multiboot -B console=ttyb,$ZFS-BOOTFS kadb
module /platform/i86pc/boot_archive
[...]
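The check above can be scripted. The following is only a minimal sketch: `check_debug_boot` is a hypothetical helper name, and it assumes the debugger is requested with a standalone "kadb", "kmdb" or "-k" word on the kernel line. It is written for bash or ksh rather than the legacy /bin/sh.

```shell
# Hypothetical helper: print every kernel line of a GRUB menu.lst and say
# whether it requests the kernel debugger.
# Assumption: the debugger is requested with a standalone "kadb", "kmdb"
# or "-k" word on the kernel line.
check_debug_boot() {
    menu=${1:-/rpool/boot/grub/menu.lst}
    awk '
        $1 == "kernel" || $1 == "kernel$" {
            found = "no"
            for (i = 2; i <= NF; i++)
                if ($i == "kadb" || $i == "kmdb" || $i == "-k")
                    found = "yes"
            print "debugger=" found ": " $0
        }' "$menu"
}
```

Called without argument, it reads the menu.lst path shown above; any entry reported with "debugger=no" will boot without the kernel debugger.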
How do you configure the right settings for taking a dump? Just use the "dumpadm" command. Two things to check: the dump device and the savecore directory must exist and be large enough. The required size depends on the RAM, and there are two policies on this subject: either the dump device is as large as the RAM, or it is smaller.
For example:
# dumpadm
Dump content: kernel pages
Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/zlap
Savecore enabled: no
Save compressed: on
# prtconf | grep -i memory
Memory size: 16384 Megabytes
# zfs get volsize rpool/dump
NAME PROPERTY VALUE SOURCE
rpool/dump volsize 512M -
# df -h /var/crash/zlap
Filesystem Size Used Avail Use% Mounted on
rpool/ROOT/solaris 54G 9.9G 16G 39% /
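To compare the dump device with the RAM under the first policy (dump device as large as the RAM), something like the sketch below could be used. `size_to_mb` and `dump_fits_ram` are hypothetical helper names, and the script assumes size strings with a K, M, G or T suffix as printed by `zfs get`. A "small" result is not necessarily an error if you follow the second policy.

```shell
# Convert a ZFS size string such as 512M or 16G to whole megabytes.
size_to_mb() {
    echo "$1" | awk '{
        n = $1 + 0                      # numeric prefix, e.g. 512 from "512M"
        u = substr($1, length($1), 1)   # last character = unit suffix
        if (u == "K") n = n / 1024
        else if (u == "G") n = n * 1024
        else if (u == "T") n = n * 1024 * 1024
        else if (u != "M") n = n / 1024 / 1024   # no suffix: plain bytes
        printf "%d\n", n
    }'
}

# Compare the dump device size ($2, e.g. 512M) with the RAM in MB ($1).
dump_fits_ram() {
    ram_mb=$1
    dump_mb=$(size_to_mb "$2")
    if [ "$dump_mb" -ge "$ram_mb" ]; then
        echo "ok: dump device ${dump_mb} MB >= RAM ${ram_mb} MB"
    else
        echo "small: dump device ${dump_mb} MB < RAM ${ram_mb} MB"
    fi
}

# Possible use on the Solaris host (values taken from the live system):
#   ram_mb=$(prtconf | awk '/^Memory size/ {print $3}')
#   vol=$(zfs get -H -o value volsize rpool/dump)
#   dump_fits_ram "$ram_mb" "$vol"
```

If the device has to grow, `zfs set volsize=<size> rpool/dump` resizes it, and `dumpadm -y` turns on the automatic savecore at boot.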
How do you configure the system to react to an NMI? Just add the following lines to the file /etc/system, then reboot.
For example:
# egrep apic /etc/system
set pcplusmp:apic_kmdb_on_nmi=1
set pcplusmp:apic_panic_on_nmi=1
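Before relying on the NMI, it may be worth checking that both lines are really active. This is a minimal sketch: `check_nmi_settings` is a hypothetical helper name, and it assumes the lines are written exactly as above (no extra spaces). Lines commented out with "*" do not match.

```shell
# Hypothetical helper: report whether both NMI tunables are active in
# /etc/system (or in the file given as first argument).
# Assumption: lines are written exactly as "set pcplusmp:<name>=1".
check_nmi_settings() {
    file=${1:-/etc/system}
    rc=0
    for var in apic_kmdb_on_nmi apic_panic_on_nmi; do
        if grep "^set pcplusmp:${var}=1" "$file" >/dev/null 2>&1; then
            echo "$var: set"
        else
            echo "$var: missing"
            rc=1
        fi
    done
    return $rc
}
```

The function returns a non-zero status as soon as one of the two settings is missing, so it can be used in a larger check script.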
Now, if the system hangs, you can send an NMI and thus take a dump. Either you use the "ipmitool" command (if IPMI is available on the server's ILOM), or you use the web interface of the server's ILOM to generate the NMI.
For example (ipmitool command):
# ipmitool -I lanplus -H server-rsc -U root chassis power diag
A simple demonstration...
On the server:
$ ssh zlap
# uname -a
SunOS zlap 5.10 Generic_142910-17 i86pc i386 i86pc
# prtdiag
System Configuration: HP ProLiant DL360 G5
BIOS Configuration: HP P58 05/18/2009
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)
[...]
On the iLO (RILO) of the server:
$ ssh admin@zlap-rsc
root@zlap-rsc's password:
User:root logged-in to ILOLD75MU6996.(XXX.XXX.XXX.XX)
iLO 2 Advanced 2.05 at 15:38:15 Dec 17 2009
Server Name: DL360G5P-34-13
Server Power: On
</>hpiLO->
</>hpiLO-> nmi server
</>hpiLO-> vsp
Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.
</>hpiLO-> Virtual Serial Port active: IO=0x02F8 INT=3
[1]>
[1]> ::showrev
Hostname: zlap
Release: 5.10
Kernel architecture: i86pc
Application architecture: amd64
Kernel version: SunOS 5.10 i86pc Generic_142910-17
Platform: i86pc
[1]> $<systemdump
nopanicdebug: 0 = 0x1
panic[cpu1]/thread=fffffe80005e0c60: BAD TRAP: type=e (#pf Page fault) rp=fffffe80005e0980 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
sched: #pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0x0, sp=0xfffffe80005e0a78, eflags=0x10002
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
cr2: 0 cr3: 14717000 cr8: c
rdi: fffffffffbc7eab0 rsi: 2f8 rdx: 2f8
rcx: a r8: 0 r9: ffffffffa4f71c90
rax: fffffffffbcecbe0 rbx: ffffffffef8f1250 rbp: fffffe80005e0a80
r10: fffffe80005e09c0 r11: 0 r12: fffffe80005e0af0
r13: 1 r14: fffffffffbc561c0 r15: 1
fsb: 0 gsb: ffffffffa44fb800 ds: 43
es: 43 fs: 0 gs: 1c3
trp: e err: 0 rip: 0
cs: 28 rfl: 10002 rsp: fffffe80005e0a78
ss: 30
fffffe80005e0890 unix:die+da ()
fffffe80005e0970 unix:trap+5e6 ()
fffffe80005e0980 unix:cmntrap+140 ()
fffffe80005e0a80 0 ()
fffffe80005e0a90 genunix:kdi_dvec_enter+d ()
fffffe80005e0ab0 unix:debug_enter+66 ()
fffffe80005e0ac0 pcplusmp:apic_nmi_intr+94 ()
fffffe80005e0ae0 unix:av_dispatch_nmivect+1f ()
fffffe80005e0af0 unix:nmiint+17e ()
fffffe80005e0be0 unix:i86_mwait+d ()
fffffe80005e0c20 unix:cpu_idle_mwait+125 ()
fffffe80005e0c40 unix:idle+89 ()
fffffe80005e0c50 unix:thread_start+8 ()
syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
0:01 100% done
100% done: 320938 pages dumped, dump succeeded
rebooting...
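Since the dumpadm output earlier shows "Savecore enabled: no", the dump stays on the dump device after the reboot and has to be saved by hand. The sketch below is one way to do it: `save_pending_dump` is a hypothetical helper name, and the directory is the savecore directory from that dumpadm output.

```shell
# Hypothetical helper, to run as root once the system is back up.
save_pending_dump() {
    dir=${1:-/var/crash/zlap}    # savecore directory from the dumpadm output
    if ! command -v savecore >/dev/null 2>&1; then
        echo "savecore not found: run this on the Solaris host" >&2
        return 1
    fi
    # Copy the dump from the dump device into $dir (-v: verbose).
    savecore -v "$dir" || return 1
    # With "Save compressed: on" a vmdump.N file is expected here,
    # otherwise a unix.N / vmcore.N pair.
    ls -l "$dir"
}
```

A compressed vmdump.N file can then be expanded with `savecore -vf vmdump.N`, and the resulting unix.N / vmcore.N pair opened with `mdb unix.N vmcore.N` (for instance `::status`, `::msgbuf` and `$C` for a first look).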
It's very simple, isn't it?
For your general knowledge, here are some links on the topic: