AIX (Advanced Interactive eXecutive) is a series of proprietary Unix operating systems developed and sold by IBM.
Performance Optimization With Enhanced RISC (POWER) version 7 enables a unique performance advantage for AIX OS.
POWER7 features new capabilities using multiple cores and multiple CPU threads, creating a pool of virtual CPUs.
AIX 7 includes a new built-in clustering capability called Cluster Aware
AIX POWER7 systems include the Active Memory Expansion feature.

Friday, November 25, 2011

Remove a single failed path

I noticed that the VIO server (VIOS) error log was reporting some failed paths for LUNs connecting to the SAN. The VIOS command errlog -ls (the equivalent of the AIX command errpt -a) showed errors on the Fibre Channel adapter fscsi2:

Diagnostic Analysis
Diagnostic Log sequence number: 1126130
Resource tested:        fscsi2
Menu Number:            2603902
Description:


Error Log Analysis has detected multiple communication
errors.  These errors can be caused by attached devices,
a switch, a hub, or a SCSI-to-FC convertor.

If connected to a switch, refer to the Storage Area
Network (SAN) problem determination procedures for
additional problem resolution.

Multiple path Redundancy

Each LUN that had a failed path still had other paths on this VIOS functioning correctly. In addition, each of the LUNs is presented to the VIO client via MPIO through this and another VIO server. That makes for a lot of redundancy, which gave us some breathing space to sort out the real cause of one of the many paths being lost. In the meantime, it's quite easy to remove the failing path on the VIOS using the rmpath command.

View paths for a LUN

First, I used the VIOS lspath command from the VIOS restricted shell to look at a single PV. This showed that there were multiple paths from the VIOS through to the SAN (in this case going to SVC).

lspath -dev hdisk63

Or, from the AIX shell after logging in to the VIOS as padmin and running oem_setup_env:

lspath -l hdisk63

Whichever version of the lspath command you use, here's the output showing several paths for the same disk:
status  name    parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000 <  Four
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000 <  paths
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000 <  via
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000 <  fscsi0
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000 < Another
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000 < four
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000 < from
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000 < fscsi1
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000 < Three good paths on fscsi2
Failed  hdisk63 fscsi2 50050768014025bd,3d000000000000  <--- This failed path needs to be removed or recovered
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000 < Three good paths on fscsi2
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000 < Three good paths on fscsi2
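
If you'd rather not count rows by eye, a small awk filter can tally the paths per adapter and status. This is only a sketch: summarise_paths is a made-up helper name, the sample rows are a trimmed copy of the hdisk63 listing above, and it assumes a shell where awk and sort are available (for example after oem_setup_env).

```shell
#!/bin/sh
# summarise_paths is a hypothetical helper, not a VIOS or AIX command.
# It reads lspath-style rows on stdin and counts paths per parent and status.
summarise_paths() {
    # Skip the header row and blank lines; $1 is the status, $3 the parent.
    awk 'NF >= 3 && $1 != "status" { count[$3 " " $1]++ }
         END { for (k in count) print k, count[k] }' | LC_ALL=C sort
}

# Trimmed sample copied from the hdisk63 listing (annotations removed).
sample='status  name    parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Failed  hdisk63 fscsi2 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000'

printf '%s\n' "$sample" | summarise_paths
```

On a live system the input would come from lspath rather than the sample text, and any adapter that shows a Failed count is the one to look at.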

Option 1: Sledgehammer special

Removing all the paths for hdisk63 via fscsi2 would work, but it would remove the good paths through fscsi2 at the same time. A bit drastic, but let's face it, sledgehammers had to be invented for a reason. Anyway, as there are several other paths to the same LUN (four via fscsi0 and another four via fscsi1), removing the three good paths through fscsi2 along with the one that has failed isn't really a problem. After all four fscsi2 paths are exterminated, you can rediscover the three good paths using the VIOS cfgdev command or the AIX command cfgmgr.

Here are the steps I took to remove all four paths for fscsi2 from hdisk63:

rmpath -dev hdisk63 -pdev fscsi2

lspath -dev hdisk63
status  name    parent connection

Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Defined hdisk63 fscsi2 500507680110239f,3d000000000000
Defined hdisk63 fscsi2 50050768014025bd,3d000000000000
Defined hdisk63 fscsi2 50050768011025bd,3d000000000000
Defined hdisk63 fscsi2 500507680140239f,3d000000000000


Aussie Cultural Lesson
Here's a little aside for the benefit of readers not overly familiar with Australian slang. A "dummy" is a pacifier / comforter sometimes given to babies to, well, pacify them. On occasion some babies have been known to expunge the said dummy with speed and skill of Olympian standards.



Well, the rmpath command didn't actually remove the paths. It kept them Defined in the ODM. When I ran cfgdev (or cfgmgr), the command spat the dummy.

Some error messages may contain invalid information
for the Virtual I/O Server environment.

Method error (/usr/lib/methods/cfgscsidisk -l hdisk63 ):
        0514-082 The requested function could only be performed for some
                 of the specified paths.

At this point, lspath shows that the three good paths have recovered, but the failed path is still Defined, which is what caused the error above.

lspath -dev hdisk63
status  name    parent connection

Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Defined hdisk63 fscsi2 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000

Option 2: Search and destroy

We would have been better off removing the fscsi2 paths from the ODM altogether, using the rmpath command with the -rm flag. This is similar to the -d flag on the rmdev command, as it deletes the references from the ODM.

rmpath -dev hdisk63 -pdev fscsi2 -rm
paths Deleted

Now all the paths via fscsi2 for this hdisk are gone:
lspath -dev hdisk63
status  name    parent connection

Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000

Then when you rediscover the paths via cfgdev / cfgmgr, it only brings back the three good ones. No error message on cfgdev this time:
cfgdev
lspath -dev hdisk63
status  name    parent connection

Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000

Option 3: Can you be more specific?

A better solution would be to remove just the bad path. As hdisk63 is already fixed, let's do it on a different LUN which also has a bad path:

lspath -dev hdisk54
status  name    parent connection

Enabled hdisk54 fscsi0 500507680110239f,34000000000000
Enabled hdisk54 fscsi0 50050768014025bd,34000000000000
Enabled hdisk54 fscsi0 50050768011025bd,34000000000000
Enabled hdisk54 fscsi0 500507680140239f,34000000000000
Enabled hdisk54 fscsi1 500507680130239f,34000000000000
Enabled hdisk54 fscsi1 500507680120239f,34000000000000
Enabled hdisk54 fscsi1 50050768013025bd,34000000000000
Enabled hdisk54 fscsi1 50050768012025bd,34000000000000
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Failed  hdisk54 fscsi2 50050768014025bd,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000
Enabled hdisk54 fscsi2 500507680140239f,34000000000000


The rmpath command allows you to narrow the path you want to remove down to a single connection. Here's an extract from the command documentation for the VIOS rmpath command:

rmpath command

Purpose


Removes from the system a path to an MPIO-capable device.

Syntax

rmpath { [ -dev Name ] [ -pdev Parent ] [ -conn Connection ] } [ -rm ]

Once again, I'll use the -rm flag to remove the path from the ODM. Otherwise the path would simply go from Failed to Defined, and cfgmgr would still report a problem. But this time, I can narrow the path down to a single connection using the -conn flag:

rmpath -dev hdisk54 -pdev fscsi2 -conn "50050768014025bd,34000000000000" -rm

path Deleted

lspath -dev hdisk54
status  name    parent connection

Enabled hdisk54 fscsi0 500507680110239f,34000000000000
Enabled hdisk54 fscsi0 50050768014025bd,34000000000000
Enabled hdisk54 fscsi0 50050768011025bd,34000000000000
Enabled hdisk54 fscsi0 500507680140239f,34000000000000
Enabled hdisk54 fscsi1 500507680130239f,34000000000000
Enabled hdisk54 fscsi1 500507680120239f,34000000000000
Enabled hdisk54 fscsi1 50050768013025bd,34000000000000
Enabled hdisk54 fscsi1 50050768012025bd,34000000000000
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000
Enabled hdisk54 fscsi2 500507680140239f,34000000000000
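
After a removal like this it's handy to confirm that nothing is left in a Failed or Defined state. Here's a minimal sketch of such a check. The name check_paths_enabled is my own invention, the sample rows are cut down from the hdisk54 listings, and it assumes awk is available (for example after oem_setup_env).

```shell
#!/bin/sh
# check_paths_enabled is a hypothetical helper, not a VIOS or AIX command.
# It reads lspath-style rows on stdin, prints every row whose status is not
# Enabled, and returns non-zero if it printed at least one.
check_paths_enabled() {
    awk 'NF >= 3 && $1 != "status" && $1 != "Enabled" { print; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Two rows from before the rmpath, including the failed path.
before='status  name    parent connection
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Failed  hdisk54 fscsi2 50050768014025bd,34000000000000'

# Two rows from after the rmpath.
after='status  name    parent connection
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000'

if printf '%s\n' "$after" | check_paths_enabled; then
    echo "all remaining paths are Enabled"
fi
```

Fed the "before" rows instead, the helper prints the Failed line and returns a non-zero status, so it can be dropped into a larger health-check script.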

Looking for failure


The lspath command can filter paths by their status, so you can list every failed path in one go.

lspath -status failed
status    name    parent connection


Available ses1    sas0   a00,0   < What are these guys
Available ses2    sas0   20a00,0 < doing here?
Failed    hdisk3  fscsi2 50050768014025bd,1000000000000 < This line is where we want to start
Failed    hdisk4  fscsi2 50050768014025bd,2000000000000
Failed    hdisk6  fscsi2 50050768014025bd,19000000000000
Failed    hdisk7  fscsi2 50050768014025bd,1a000000000000
Failed    hdisk8  fscsi2 50050768014025bd,1b000000000000
Failed    hdisk9  fscsi2 50050768014025bd,1c000000000000
Failed    hdisk10 fscsi2 50050768014025bd,1d000000000000
Failed    hdisk11 fscsi2 50050768014025bd,e000000000000
Failed    hdisk12 fscsi2 50050768014025bd,23000000000000
Failed    hdisk13 fscsi2 50050768014025bd,24000000000000
Failed    hdisk16 fscsi2 50050768014025bd,5000000000000
Failed    hdisk17 fscsi2 50050768014025bd,6000000000000
Failed    hdisk18 fscsi2 50050768014025bd,7000000000000
Failed    hdisk20 fscsi2 50050768014025bd,9000000000000
Failed    hdisk22 fscsi2 50050768014025bd,b000000000000
Failed    hdisk32 fscsi2 50050768014025bd,16000000000000
Failed    hdisk21 fscsi2 50050768014025bd,a000000000000
Failed    hdisk25 fscsi2 50050768014025bd,f000000000000
Failed    hdisk26 fscsi2 50050768014025bd,10000000000000
Failed    hdisk27 fscsi2 50050768014025bd,11000000000000
Failed    hdisk28 fscsi2 50050768014025bd,12000000000000
Failed    hdisk29 fscsi2 50050768014025bd,13000000000000
Failed    hdisk33 fscsi2 50050768014025bd,17000000000000
Failed    hdisk34 fscsi2 50050768014025bd,18000000000000
Failed    hdisk35 fscsi2 50050768014025bd,1e000000000000
Failed    hdisk36 fscsi2 50050768014025bd,1f000000000000
Failed    hdisk37 fscsi2 50050768014025bd,20000000000000
Failed    hdisk38 fscsi2 50050768014025bd,21000000000000
Failed    hdisk39 fscsi2 50050768014025bd,22000000000000
Failed    hdisk40 fscsi2 50050768014025bd,26000000000000
Failed    hdisk41 fscsi2 50050768014025bd,27000000000000
Failed    hdisk42 fscsi2 50050768014025bd,28000000000000
Failed    hdisk43 fscsi2 50050768014025bd,29000000000000
Failed    hdisk44 fscsi2 50050768014025bd,2a000000000000
Failed    hdisk47 fscsi2 50050768014025bd,2d000000000000
Failed    hdisk48 fscsi2 50050768014025bd,2e000000000000
Failed    hdisk49 fscsi2 50050768014025bd,2f000000000000
Failed    hdisk50 fscsi2 50050768014025bd,30000000000000
Failed    hdisk51 fscsi2 50050768014025bd,31000000000000
Failed    hdisk5  fscsi2 50050768014025bd,3000000000000
Failed    hdisk45 fscsi2 50050768014025bd,2b000000000000
Failed    hdisk52 fscsi2 50050768014025bd,32000000000000
Failed    hdisk53 fscsi2 50050768014025bd,33000000000000
Failed    hdisk19 fscsi2 50050768014025bd,8000000000000
Failed    hdisk61 fscsi2 50050768014025bd,3b000000000000
Failed    hdisk64 fscsi2 50050768014025bd,3e000000000000
Failed    hdisk65 fscsi2 50050768014025bd,3f000000000000
Failed    hdisk66 fscsi2 50050768014025bd,40000000000000
Failed    hdisk67 fscsi2 50050768014025bd,41000000000000
Failed    hdisk68 fscsi2 50050768014025bd,42000000000000
Failed    hdisk70 fscsi2 50050768014025bd,44000000000000
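
Every Failed row in that listing goes through fscsi2 to the same target port, 50050768014025bd. Grouping the failures by adapter and port makes that kind of pattern obvious without reading fifty lines. The sketch below is an illustration only: failed_by_port is a hypothetical helper name, the sample is a few rows from the listing, and it assumes awk and sort are available (for example after oem_setup_env).

```shell
#!/bin/sh
# failed_by_port is a hypothetical helper, not a VIOS or AIX command.
# It reads lspath-style rows on stdin and counts Failed paths per
# parent adapter and target port.
failed_by_port() {
    awk '$1 == "Failed" {
             split($4, conn, ",")        # conn[1] = target port, conn[2] = LUN
             count[$3 " " conn[1]]++
         }
         END { for (k in count) print k, count[k] }' | LC_ALL=C sort
}

# A few rows copied from the "lspath -status failed" listing.
sample='status    name    parent connection
Available ses1    sas0   a00,0
Failed    hdisk3  fscsi2 50050768014025bd,1000000000000
Failed    hdisk4  fscsi2 50050768014025bd,2000000000000
Failed    hdisk6  fscsi2 50050768014025bd,19000000000000'

printf '%s\n' "$sample" | failed_by_port
```

A single adapter and port pair in the result suggests one link or zone is at fault rather than the individual LUNs, which fits the error log's hint about attached devices or a switch.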

It's easy enough to script this now:

# grep drops the header and the stray Available rows; one rmpath per failed path
lspath -status failed | grep Failed | while read status hdisk parent connection
do
    rmpath -dev "$hdisk" -pdev "$parent" -conn "$connection" -rm
done
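
Before letting a loop like that delete anything, it can be worth printing the commands first and reading them over. Here's a dry-run sketch under a few assumptions: build_rmpath_cmds is a made-up helper name, the sample rows come from the failed-path listing, and the generated lines use the VIOS rmpath flags shown in the syntax extract.

```shell
#!/bin/sh
# build_rmpath_cmds is a hypothetical helper, not a VIOS or AIX command.
# It reads lspath-style rows on stdin and prints one rmpath command per
# Failed row instead of running it.
build_rmpath_cmds() {
    # "rest" soaks up anything after the connection column.
    while read -r status hdisk parent connection rest
    do
        [ "$status" = "Failed" ] || continue
        printf 'rmpath -dev %s -pdev %s -conn "%s" -rm\n' \
            "$hdisk" "$parent" "$connection"
    done
}

# A few rows copied from the "lspath -status failed" listing.
sample='status    name    parent connection
Available ses1    sas0   a00,0
Failed    hdisk3  fscsi2 50050768014025bd,1000000000000
Failed    hdisk4  fscsi2 50050768014025bd,2000000000000'

printf '%s\n' "$sample" | build_rmpath_cmds
```

Once the printed commands look right, they can be run as they are, or the printf can be swapped for the real rmpath call.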

It seems smarter not to throw out the good paths with the bad one and then repair the damage.