I noticed that the VIO server (VIOS) error log was reporting some failed paths for LUNs connecting to the SAN. The VIOS command
errlog -ls (the equivalent of the AIX command errpt -a) showed errors on the Fibre Channel adapter fscsi2:
Diagnostic Analysis
Diagnostic Log sequence number: 1126130
Resource tested: fscsi2
Menu Number: 2603902
Description:
Error Log Analysis has detected multiple communication
errors. These errors can be caused by attached devices,
a switch, a hub, or a SCSI-to-FC convertor.
If connected to a switch, refer to the Storage Area
Network (SAN) problem determination procedures for
additional problem resolution.
Multiple path redundancy
Each LUN that had a failed path still had other paths on this VIOS functioning correctly. In addition, each of the LUNs is presented to the VIO client via MPIO through this and another VIO server. That makes for a lot of redundancy, which gave us some breathing space to sort out the real cause of one of the many paths being lost. In the meantime, it's quite easy to remove the failing path on the VIOS using the rmpath command.
View paths for a LUN
First, I used the VIOS lspath command from the VIOS restricted shell to look at a single PV. This showed that there were multiple paths from the VIOS through to the SAN (in this case going to SVC).
lspath -dev hdisk63
Or via the AIX shell after logging in to the VIOS as padmin and running oem_setup_env:
lspath -l hdisk63
Whichever version of the lspath command you use, here's the output showing several paths for the same disk.
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000 < Four
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000 < paths
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000 < via
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000 < fscsi0
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000 < Another
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000 < four
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000 < from
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000 < fscsi1
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000 < Three good paths on fscsi2
Failed hdisk63 fscsi2 50050768014025bd,3d000000000000 <--- This failed path needs to be removed or recovered
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000 < Three good paths on fscsi2
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000 < Three good paths on fscsi2
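With twelve paths per disk, counting by eye gets tedious. Here is a minimal sketch that tallies paths per parent adapter and status from lspath output. It assumes awk and sort are available, for example in the root shell reached through oem_setup_env. The names summarise_paths and sample_lspath are my own; the fixture is the listing above with the annotations stripped.

```shell
#!/bin/sh
# Tally paths per parent adapter and status.
# Expects "status name parent connection" lines on stdin; the header row is skipped.
summarise_paths() {
    awk '$1 != "status" && NF >= 4 { count[$3 " " $1]++ }
         END { for (key in count) print key, count[key] }' | sort
}

# Fixture: the hdisk63 listing from above, so the sketch runs anywhere.
sample_lspath() {
    cat <<'EOF'
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Failed hdisk63 fscsi2 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000
EOF
}

# On the VIOS (after oem_setup_env) this would be:
#   lspath -l hdisk63 | summarise_paths
sample_lspath | summarise_paths
```

For the sample data this prints one line per adapter and status, with fscsi2 showing three Enabled paths and one Failed.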
Option 1: Sledgehammer special
Removing all the paths for hdisk63 via fscsi2 would work, but it would remove the good paths via fscsi2 at the same time. A bit drastic, but let's face it, sledgehammers were invented for a reason. Anyway, as there are several other paths to the same LUN (four via fscsi0 and another four via fscsi1), removing the three good paths via fscsi2 along with the one that has failed isn't really a problem. After all four fscsi2 paths are exterminated, you can rediscover the three good paths using the VIOS cfgdev command or the AIX command cfgmgr.
Here are the steps I took to remove all four paths for fscsi2 from hdisk63:
rmpath -dev hdisk63 -pdev fscsi2
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Defined hdisk63 fscsi2 500507680110239f,3d000000000000
Defined hdisk63 fscsi2 50050768014025bd,3d000000000000
Defined hdisk63 fscsi2 50050768011025bd,3d000000000000
Defined hdisk63 fscsi2 500507680140239f,3d000000000000
Aussie Cultural Lesson
Here's a little aside for the benefit of readers not overly familiar with Australian slang. A "dummy" is a pacifier / comforter sometimes given to babies to, well, pacify them. On occasion some babies have been known to expunge the said dummy with speed and skill of Olympian standards.
Well, the rmpath command didn't actually remove the paths. It kept them Defined in the ODM. When I ran cfgdev (or cfgmgr), the command spat the dummy.
Some error messages may contain invalid information
for the Virtual I/O Server environment.
Method error (/usr/lib/methods/cfgscsidisk -l hdisk63 ):
0514-082 The requested function could only be performed for some
of the specified paths.
At this point, lspath shows that the three good paths have recovered, but the failed path is still Defined and is the cause of the above error.
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Defined hdisk63 fscsi2 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000
Option 2: Search and destroy
We would have been better off removing the fscsi2 paths from the ODM altogether, using the rmpath command with the -rm flag. This is similar to the -d flag on the rmdev command, as it deletes the definitions from the ODM.
rmpath -dev hdisk63 -pdev fscsi2 -rm
paths Deleted
Now all the paths via fscsi2 for this hdisk are gone:
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Then when you rediscover the paths via cfgdev / cfgmgr, it only brings back the three good ones. No error message on cfgdev this time:
cfgdev
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000
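A quick way to confirm the result without reading every line is to print only the paths that are not Enabled. This is a sketch, assuming awk is available (for instance in the root shell after oem_setup_env). The helper name unhealthy_paths is hypothetical, and the two fixtures are the fscsi2 rows from the Option 1 and Option 2 listings above.

```shell
#!/bin/sh
# Print every path whose status is not Enabled.
# Exit status is 1 if any such path was found, 0 otherwise.
unhealthy_paths() {
    awk '$1 != "status" && NF >= 4 && $1 != "Enabled" { print; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Fixture: fscsi2 paths after Option 1 (rmpath without -rm, then cfgdev).
after_option1() {
    cat <<'EOF'
status name parent connection
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Defined hdisk63 fscsi2 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000
EOF
}

# Fixture: fscsi2 paths after Option 2 (rmpath with -rm, then cfgdev).
after_option2() {
    cat <<'EOF'
status name parent connection
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000
EOF
}

# On the VIOS (after oem_setup_env) this would be:
#   lspath -l hdisk63 | unhealthy_paths
for fixture in after_option1 after_option2; do
    if "$fixture" | unhealthy_paths; then
        echo "$fixture: all paths Enabled"
    else
        echo "$fixture: paths needing attention listed above"
    fi
done
```

Using the exit status rather than just the printed lines means the same check can be dropped into a monitoring script.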
Option 3: Can you be more specific?
A better solution would be to remove just the bad path. As hdisk63 is already fixed, let's do it on a different LUN which also has a bad path:
lspath -dev hdisk54
status name parent connection
Enabled hdisk54 fscsi0 500507680110239f,34000000000000
Enabled hdisk54 fscsi0 50050768014025bd,34000000000000
Enabled hdisk54 fscsi0 50050768011025bd,34000000000000
Enabled hdisk54 fscsi0 500507680140239f,34000000000000
Enabled hdisk54 fscsi1 500507680130239f,34000000000000
Enabled hdisk54 fscsi1 500507680120239f,34000000000000
Enabled hdisk54 fscsi1 50050768013025bd,34000000000000
Enabled hdisk54 fscsi1 50050768012025bd,34000000000000
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Failed hdisk54 fscsi2 50050768014025bd,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000
Enabled hdisk54 fscsi2 500507680140239f,34000000000000
The rmpath command allows you to narrow the path you want to remove down to a single connection. Here's an extract from the command documentation for the VIOS rmpath command:
rmpath command
Removes from the system a path to an MPIO-capable device.
Syntax
rmpath { [ -dev Name ] [ -pdev Parent ] [ -conn Connection ] } [ -rm ]
Once again, I'll use the -rm flag to remove the path from the ODM. Otherwise the path would simply go from Failed to Defined, and cfgdev (or cfgmgr) would still report a problem. But this time, I can narrow the removal down to a single connection using the -conn flag:
rmpath -dev hdisk54 -pdev fscsi2 -conn "50050768014025bd,34000000000000" -rm
path Deleted
lspath -dev hdisk54
status name parent connection
Enabled hdisk54 fscsi0 500507680110239f,34000000000000
Enabled hdisk54 fscsi0 50050768014025bd,34000000000000
Enabled hdisk54 fscsi0 50050768011025bd,34000000000000
Enabled hdisk54 fscsi0 500507680140239f,34000000000000
Enabled hdisk54 fscsi1 500507680130239f,34000000000000
Enabled hdisk54 fscsi1 500507680120239f,34000000000000
Enabled hdisk54 fscsi1 50050768013025bd,34000000000000
Enabled hdisk54 fscsi1 50050768012025bd,34000000000000
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000
Enabled hdisk54 fscsi2 500507680140239f,34000000000000
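Typing the connection string by hand invites typos, so here is a rough sketch that builds the rmpath command from lspath output instead. It assumes awk is available (for instance in the root shell after oem_setup_env) and it only prints the commands, in the VIOS padmin syntax shown above, so they can be reviewed before anyone runs them. The names failed_path_cmds and sample_hdisk54 are hypothetical; the fixture is the fscsi2 rows for hdisk54 before the fix.

```shell
#!/bin/sh
# Turn lspath output into one rmpath command per Failed path.
# Commands are printed, not executed, so they can be checked first.
failed_path_cmds() {
    awk '$1 == "Failed" && NF >= 4 {
        printf "rmpath -dev %s -pdev %s -conn \"%s\" -rm\n", $2, $3, $4
    }'
}

# Fixture: the fscsi2 paths of hdisk54 from the listing before the fix.
sample_hdisk54() {
    cat <<'EOF'
status name parent connection
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Failed hdisk54 fscsi2 50050768014025bd,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000
Enabled hdisk54 fscsi2 500507680140239f,34000000000000
EOF
}

# On the VIOS (after oem_setup_env) this would be:
#   lspath -l hdisk54 | failed_path_cmds
sample_hdisk54 | failed_path_cmds
```

For the fixture this prints the same rmpath command I ran by hand above. Matching on the first column being exactly Failed means rows with any other status are ignored.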
Looking for failure
The lspath command can also filter paths by status, which makes it easy to list every failed path on the VIOS:
lspath -status failed
status name parent connection
Available ses1 sas0 a00,0 < What are these guys
Available ses2 sas0 20a00,0 < doing here?
Failed hdisk3 fscsi2 50050768014025bd,1000000000000 < This line is where we want to start
Failed hdisk4 fscsi2 50050768014025bd,2000000000000
Failed hdisk6 fscsi2 50050768014025bd,19000000000000
Failed hdisk7 fscsi2 50050768014025bd,1a000000000000
Failed hdisk8 fscsi2 50050768014025bd,1b000000000000
Failed hdisk9 fscsi2 50050768014025bd,1c000000000000
Failed hdisk10 fscsi2 50050768014025bd,1d000000000000
Failed hdisk11 fscsi2 50050768014025bd,e000000000000
Failed hdisk12 fscsi2 50050768014025bd,23000000000000
Failed hdisk13 fscsi2 50050768014025bd,24000000000000
Failed hdisk16 fscsi2 50050768014025bd,5000000000000
Failed hdisk17 fscsi2 50050768014025bd,6000000000000
Failed hdisk18 fscsi2 50050768014025bd,7000000000000
Failed hdisk20 fscsi2 50050768014025bd,9000000000000
Failed hdisk22 fscsi2 50050768014025bd,b000000000000
Failed hdisk32 fscsi2 50050768014025bd,16000000000000
Failed hdisk21 fscsi2 50050768014025bd,a000000