Working with Predictive Self Healing (PSH) on Oracle & Fujitsu Sparc Enterprise Server

Apr 26, 2015 4:49 PM

# Working with Predictive Self Healing (PSH) on Oracle Sparc Enterprise Server
# Suwardi - line.console49@gmail.com
# @ 26/04/2015 16:49:00
# Copyright (C) Suwardi 2015

|-- 0x0 Background

When we maintain Oracle and Fujitsu SPARC Enterprise Servers running Solaris, we will sooner or later face system problems and hardware problems. To identify, analyze and repair these problems, the servers come with an Oracle facility called Predictive Self Healing (PSH). PSH is a command line facility for dealing with system problems, and it is very useful.

|-- 0x1 Working with PSH on ILOM and Solaris OS

We can check the hardware fault status and clear faults from ILOM. Log in to the ILOM command line through the serial management port at 9600 baud, or over the network management port. Start the PSH shell with the following command:
-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp>
Check the hardware fault status with 'fmadm faulty':
faultmgmtsp> fmadm faulty
----------------------------------------------------------
Time                UUID           msgid          Severity
----------------------------------------------------------
2015-04-21/07:08:18 196e61d5-cf75-67cb-ae6e-909cdc3dbe38 PCIEX-8000-YJ Unknown
Fault class : fault.io.pciex.device-pcie-ce
FRU         : /SYS/MB
              (Part Number: 7096745)
              (Serial Number: 465769T+1503BH05AD)
Description : A fault was diagnosed by the Host Operating System.
Action      : Please refer to the associated reference document at
              http://support.oracle.com/msg/PCIEX-8000-YJ for a complete, detailed description and the latest service procedures and
              policies regarding this diagnosis.
-------------------------------------------------------------
Time   UUID                          msgid          Severity
-------------------------------------------------------------
2015-04-21/07:08:18 196e61d5-cf75-67cb-ae6e-909cdc3dbe38 PCIEX-8000-YJ Unknown
Fault class : fault.io.pciex.device-pcie-ce
FRU         : /SYS/MB/RISER1/PCIE1
              (Part Number: unknown)
              (Serial Number: unknown)
Description : A fault was diagnosed by the Host Operating System.
Action      : Please refer to the associated reference document at
              http://support.oracle.com/msg/PCIEX-8000-YJ for a complete, detailed description and the latest service procedures and
              policies regarding this diagnosis.
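
When several records come back, it can help to pull out only the message ID, fault class and FRU of each one. The following is a minimal Python sketch, assuming the output has been captured as text in the layout shown above. The sample string is copied from that output with the Description and Action lines left out, and the function name `parse_ilom_faults` is made up for this illustration.

```python
import re

# Sample copied from the ILOM output above (Description/Action lines omitted).
ILOM_OUTPUT = """\
2015-04-21/07:08:18 196e61d5-cf75-67cb-ae6e-909cdc3dbe38 PCIEX-8000-YJ Unknown
Fault class : fault.io.pciex.device-pcie-ce
FRU         : /SYS/MB
              (Part Number: 7096745)
              (Serial Number: 465769T+1503BH05AD)
2015-04-21/07:08:18 196e61d5-cf75-67cb-ae6e-909cdc3dbe38 PCIEX-8000-YJ Unknown
Fault class : fault.io.pciex.device-pcie-ce
FRU         : /SYS/MB/RISER1/PCIE1
              (Part Number: unknown)
              (Serial Number: unknown)
"""

# A record starts with: timestamp, UUID, message ID, severity.
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+([0-9a-f-]{36})\s+(\S+)\s+(\S+)\s*$"
)
# Only the two fields we want to report are picked up.
FIELD = re.compile(r"^(Fault class|FRU)\s*:\s*(.+?)\s*$")


def parse_ilom_faults(text):
    """Return one dict per fault record found in the text."""
    records = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            time, uuid, msgid, severity = header.groups()
            records.append(
                {"time": time, "uuid": uuid, "msgid": msgid, "severity": severity}
            )
            continue
        field = FIELD.match(line)
        if field and records:
            key = "fault_class" if field.group(1) == "Fault class" else "fru"
            records[-1][key] = field.group(2)
    return records


faults = parse_ilom_faults(ILOM_OUTPUT)
for fault in faults:
    print(fault["msgid"], fault["fru"], fault["fault_class"])
```

The sketch only reads text, so it can be run on any workstation against a saved console log. It does not talk to ILOM itself.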
In the output above, the fault was detected on the HBA card in slot PCIE1. The fault management shell provides three commands:

fmadm  - Administers the fault management service, such as fault detection and repair
fmdump - Displays the contents of the fault and ereport/error logs
fmstat - Displays statistics on fault management operations

We can check the fault status with PSH from ILOM/ALOM, XSCF and the operating system (Solaris). Because it lets us detect problems, it is a very important facility for maintenance.

|-- 0x2 How to Identify a Fault

Identifying a fault on a SPARC machine is simple. For ILOM, log in to ILOM and type the command below:
-> start /SP/faultmgmt/shell
The system enters the fault management shell. Faults can then be checked with the command below:
faultmgmtsp> fmadm faulty
In Solaris, log in as root and type the same command:
# fmadm faulty
---------------------------------------------------------------
TIME            EVENT-ID               MSG-ID         SEVERITY
---------------------------------------------------------------
Feb 06 20:55:21 a86ebcf7-c8bf-4434-e604-95791d143dca  PCIEX-8000-3S  Critical
Host        : War49
Platform    : SUNW,SPARC-Enterprise    Chassis_id  : BDF1235E68
Product_sn  :

Fault class : fault.io.pciex.device-interr max 50%
              fault.io.pciex.bus-linkerr 25%
Affects     : dev:////pci@3,700000/pci@0
              dev:////pci@3,700000
                  faulted but still in service
FRU         : "iou#0-pci#4" (hc:///component=iou#0-pci#4) faulty
Description : A problem has been detected on one of the specified devices or on one of the specified connecting buses.
Response    : One or more device instances may be disabled
Impact      : Loss of services provided by the device instances associated with this fault
Action      : Use 'fmadm faulty' to provide a more detailed view of this event. If a plug-in card is involved check for badly-seated cards or bent pins. Please refer to the associated reference document at http://sun.com/msg/PCIEX-8000-3S for the latest service procedures and policies regarding this diagnosis.
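
The fields that matter most in this record are the event ID, the suspected fault classes with their percentages, and the affected devices. As a rough sketch, assuming the record has been saved as text in the layout above, they can be extracted in Python as follows. The sample string is a shortened copy of the output above, and the helper names `parse_sections` and `summarize` are hypothetical.

```python
import re

# Shortened copy of the Solaris 'fmadm faulty' record shown above.
FMADM_OUTPUT = """\
Feb 06 20:55:21 a86ebcf7-c8bf-4434-e604-95791d143dca  PCIEX-8000-3S  Critical
Fault class : fault.io.pciex.device-interr max 50%
              fault.io.pciex.bus-linkerr 25%
Affects     : dev:////pci@3,700000/pci@0
              dev:////pci@3,700000
                  faulted but still in service
FRU         : "iou#0-pci#4" (hc:///component=iou#0-pci#4) faulty
"""

UUID = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")
# A suspect line: fault class, optional "max", then a percentage.
SUSPECT = re.compile(r"^(fault\.\S+)\s+(?:max\s+)?(\d+)%$")


def parse_sections(text):
    """Group lines by label; indented lines continue the previous label."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace() and " : " in line:
            label, value = line.split(" : ", 1)
            current = label.strip()
            sections[current] = [value.strip()]
        elif line[0].isspace() and current is not None:
            sections[current].append(line.strip())
    return sections


def summarize(text):
    """Pick out the event ID, suspect classes and affected devices."""
    sections = parse_sections(text)
    event = UUID.search(text)
    suspects = []
    for entry in sections.get("Fault class", []):
        match = SUSPECT.match(entry)
        if match:
            suspects.append((match.group(1), int(match.group(2))))
    devices = [e for e in sections.get("Affects", []) if e.startswith("dev:")]
    return {
        "event_id": event.group(0) if event else None,
        "suspects": suspects,
        "devices": devices,
    }


summary = summarize(FMADM_OUTPUT)
print("event  :", summary["event_id"])
for fault_class, percent in summary["suspects"]:
    print("suspect:", fault_class, str(percent) + "%")
for device in summary["devices"]:
    print("device :", device)
```

The status line "faulted but still in service" is deliberately skipped by the device filter, because it describes the devices rather than naming one.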
The fault class lists fault.io.pciex.device-interr at up to 50%, and the affected PCI device is dev:////pci@3,700000/pci@0. "Faulted but still in service" means that the fault has been detected but the hardware is still able to operate. Even so, replacing the part is recommended.

|-- 0x3 How to identify when a fault occurred and was repaired

With a simple command from the fault management facility, we can see when a fault occurred and when it was repaired. The 'fmdump' command is used for this, as in the example below.
On the ILOM command line:
faultmgmtsp> fmdump
On the Solaris command line:
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Jun 05 2013 10:17:55 fea08ed2-c7fb-6824-80f2-d88361608698 FMD-8000-4M Repaired
Jun 05 2013 10:17:55 fea08ed2-c7fb-6824-80f2-d88361608698 FMD-8000-6U Resolved
Jun 05 2013 12:17:55 b46e0b1c-bbfe-6612-f6c5-ca8967a628b9 FMD-8000-4M Repaired
Jun 05 2013 12:17:55 b46e0b1c-bbfe-6612-f6c5-ca8967a628b9 FMD-8000-6U Resolved
Nov 06 2014 11:20:24 5b1bd503-0dc5-c03c-dd70-c1bb4d3368a6 PCIEX-8000-3S
Feb 06 20:55:21.6496 a86ebcf7-c8bf-4434-e604-95791d143da9 PCIEX-8000-3S
Feb 06 21:00:44.9569 5b1bd503-0dc5-c03c-dd70-c1bb4d3368a6 FMD-8000-58 Updated
Feb 06 21:00:44.9601 5b1bd503-0dc5-c03c-dd70-c1bb4d3368a6 FMD-8000-58 Updated
Feb 06 21:00:45.0179 e632f2e6-2d2b-451e-ef79-ebdfea50aac7 SUNOS-8000-FU
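
Because the same UUID appears on several lines, grouping the log by UUID makes the history of each fault easier to follow. Below is a small Python sketch under the assumption that the log has been saved as text in the layout above. The sample is a subset of the lines shown, and the function name `history_by_uuid` is invented for this example. Lines without a trailing state word are stored with a state of None rather than guessing a label for them.

```python
import re
from collections import defaultdict

# Subset of the fmdump output shown above.
FMDUMP_OUTPUT = """\
TIME                 UUID                                 SUNW-MSG-ID
Jun 05 2013 10:17:55 fea08ed2-c7fb-6824-80f2-d88361608698 FMD-8000-4M Repaired
Jun 05 2013 10:17:55 fea08ed2-c7fb-6824-80f2-d88361608698 FMD-8000-6U Resolved
Nov 06 2014 11:20:24 5b1bd503-0dc5-c03c-dd70-c1bb4d3368a6 PCIEX-8000-3S
Feb 06 21:00:44.9569 5b1bd503-0dc5-c03c-dd70-c1bb4d3368a6 FMD-8000-58 Updated
"""

# The UUID is the anchor: text before it is the time, text after it is the
# message ID followed by an optional state word.
LINE = re.compile(
    r"^(?P<time>.+?)\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msgid>\S+)(?:\s+(?P<state>\S+))?\s*$"
)


def history_by_uuid(text):
    """Map each UUID to its (time, msgid, state) entries in log order."""
    history = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line)
        if match:  # the column header line has no UUID and is skipped
            history[match.group("uuid")].append(
                (match.group("time"), match.group("msgid"), match.group("state"))
            )
    return dict(history)


history = history_by_uuid(FMDUMP_OUTPUT)
for uuid, events in history.items():
    print(uuid)
    for time, msgid, state in events:
        print("   ", time, msgid, state or "-")
```

Anchoring on the UUID instead of fixed columns lets the same pattern handle both timestamp styles that appear in the log ("Jun 05 2013 10:17:55" and "Feb 06 21:00:44.9569").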
|-- 0x4 How to manually repair and clear Faults

A fault can be repaired, resolved or cleared. However, if the fault has damaged the hardware, the hardware should be replaced, and the fault should be cleared after the replacement. Example of repairing and clearing the fault shown in the 'fmadm faulty' output above:
# fmadm repair dev:////pci@3,700000/pci@0
# fmadm repair dev:////pci@3,700000
# fmadm acquit a86ebcf7-c8bf-4434-e604-95791d143dca
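
The three commands above follow a pattern: one 'fmadm repair' per affected device, then one 'fmadm acquit' for the event ID. The Python sketch below only builds those command lines as strings and checks whether a captured 'fmadm faulty' output is empty, which is the criterion this article uses for a cleared fault. It does not run anything on the server. The function names `build_clear_commands` and `faults_cleared` are hypothetical, and the pattern itself is taken from this example only, not from a general procedure.

```python
import shlex


def build_clear_commands(event_id, affected):
    """Return the command lines from the example above, in the same order."""
    commands = ["fmadm repair " + shlex.quote(fmri) for fmri in affected]
    commands.append("fmadm acquit " + shlex.quote(event_id))
    return commands


def faults_cleared(fmadm_faulty_output):
    """True when the captured 'fmadm faulty' output contains no text at all."""
    return fmadm_faulty_output.strip() == ""


commands = build_clear_commands(
    "a86ebcf7-c8bf-4434-e604-95791d143dca",
    ["dev:////pci@3,700000/pci@0", "dev:////pci@3,700000"],
)
for command in commands:
    print("# " + command)
```

Printing the commands instead of executing them leaves the final decision with the administrator, which seems the safer choice when clearing faults on production hardware.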
When that is done, check the output of the 'fmadm faulty' command again. If the command prints nothing, the fault has been cleared and resolved. The repair activity is also recorded in the fmdump log.

#EOF
