I noticed that the VIO server (VIOS) error log was reporting some failed paths for LUNs connecting to the SAN. The VIOS command errlog -ls (the equivalent of the AIX command errpt -a) showed errors on the Fibre Channel adapter fscsi2:
Diagnostic Analysis
Diagnostic Log sequence number: 1126130
Resource tested: fscsi2
Menu Number: 2603902
Description:
Error Log Analysis has detected multiple communication
errors. These errors can be caused by attached devices,
a switch, a hub, or a SCSI-to-FC convertor.
If connected to a switch, refer to the Storage Area
Network (SAN) problem determination procedures for
additional problem resolution.
Multiple path Redundancy
Each LUN that had a failed path still had other paths on this VIOS functioning correctly. In addition, each of the LUNs is presented to the VIO client via MPIO through this and another VIO server. That makes for a lot of redundancy, which gave us some breathing space to sort out the real cause of one of the many paths being lost. In the meantime, it's quite easy to remove the failing path on the VIOS using the rmpath command.
View paths for a LUN
First, I used the VIOS lspath command from the VIOS restricted shell to look at a single PV. This showed that there were multiple paths from the VIOS through to the SAN (in this case going to SVC).
lspath -dev hdisk63
Or via the AIX shell, after logging in to the VIOS as padmin and running oem_setup_env:
lspath -l hdisk63
Whichever version of the lspath command you use, here's the output showing several paths for the same disk.
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000 < Four
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000 < paths
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000 < via
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000 < fscsi0
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000 < Another
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000 < four
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000 < from
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000 < fscsi1
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000 < Three good paths on fscsi2
Failed hdisk63 fscsi2 50050768014025bd,3d000000000000 <--- This failed path needs to be removed or recovered
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000 < Three good paths on fscsi2
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000 < Three good paths on fscsi2
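With a dozen paths per disk, a per-adapter tally can be easier to read than the raw listing. The following is a minimal sketch, assuming lspath output in the column order shown above (status, name, parent, connection) arrives on standard input. The name summarize_paths is a hypothetical helper for this example, not a VIOS or AIX command.

```shell
# Count paths per parent adapter and status.
# Reads lspath output (status name parent [connection]) on standard input.
summarize_paths() {
    awk '
        $1 == "status" { next }            # skip the header line, if present
        NF >= 3 { count[$3 " " $1]++ }     # key is "<parent> <status>"
        END { for (k in count) print k, count[k] }
    ' | sort
}
```

From the AIX shell (after oem_setup_env) it could be used as lspath -l hdisk63 | summarize_paths. For the listing above, that would show four Enabled paths each on fscsi0 and fscsi1, and three Enabled plus one Failed on fscsi2.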
Option 1: Sledgehammer special
Removing all the paths for hdisk63 via fscsi2 would work, but it would remove the good paths via fscsi2 at the same time. A bit drastic, but let's face it, sledgehammers had to be invented for a reason. Anyway, as there are several other paths to the same LUN (four via fscsi0 and another four via fscsi1), removing three good paths from fscsi2 along with the one that has failed isn't really a problem. After all four fscsi2 paths are exterminated, you can rediscover the three good paths using the VIOS cfgdev command or the AIX command cfgmgr.
Here are the steps I took to remove all four paths for fscsi2 from hdisk63:
rmpath -dev hdisk63 -pdev fscsi2
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Defined hdisk63 fscsi2 500507680110239f,3d000000000000
Defined hdisk63 fscsi2 50050768014025bd,3d000000000000
Defined hdisk63 fscsi2 50050768011025bd,3d000000000000
Defined hdisk63 fscsi2 500507680140239f,3d000000000000
Aussie Cultural Lesson
Here's a little aside for the benefit of readers not overly familiar with Australian slang. A "dummy" is a pacifier / comforter sometimes given to babies to, well, pacify them. On occasion some babies have been known to expunge the said dummy with speed and skill of Olympian standards.
Well, the rmpath command didn't actually remove the paths. It kept them Defined in the ODM. When I ran cfgdev (or cfgmgr), the command spat the dummy.
Some error messages may contain invalid information
for the Virtual I/O Server environment.
Method error (/usr/lib/methods/cfgscsidisk -l hdisk63 ):
0514-082 The requested function could only be performed for some
of the specified paths.
At this point, lspath shows that the three good paths have recovered, but the failed path is still Defined and is the cause of the above error.
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Defined hdisk63 fscsi2 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000
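A leftover Defined path like this one is easy to miss in a long listing. As a rough sketch, assuming the same lspath output on standard input, a small filter can show only the paths that are not Enabled. The name non_enabled_paths is made up for this example.

```shell
# Print only the paths whose status is something other than Enabled,
# for example Failed or Defined. Reads lspath output on standard input.
non_enabled_paths() {
    awk '
        $1 == "status" { next }                              # skip the header line
        NF >= 3 && $1 != "Enabled" { print $1, $2, $3, $4 }  # status name parent connection
    '
}
```

Run against the listing above, it would print just the one Defined path on fscsi2.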
Option 2: Search and destroy
It would have been better to remove the fscsi2 paths from the ODM altogether, using the rmpath command with the -rm flag. This is similar to the -d flag on the rmdev command, as it deletes the references from the ODM.
rmpath -dev hdisk63 -pdev fscsi2 -rm
paths Deleted
Now all the paths via fscsi2 for this hdisk are gone:
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Then, when you rediscover the paths via cfgdev / cfgmgr, it only brings back the three good ones. No error message from cfgdev this time:
cfgdev
lspath -dev hdisk63
status name parent connection
Enabled hdisk63 fscsi0 500507680110239f,3d000000000000
Enabled hdisk63 fscsi0 50050768014025bd,3d000000000000
Enabled hdisk63 fscsi0 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi0 500507680140239f,3d000000000000
Enabled hdisk63 fscsi1 500507680130239f,3d000000000000
Enabled hdisk63 fscsi1 500507680120239f,3d000000000000
Enabled hdisk63 fscsi1 50050768013025bd,3d000000000000
Enabled hdisk63 fscsi1 50050768012025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680110239f,3d000000000000
Enabled hdisk63 fscsi2 50050768011025bd,3d000000000000
Enabled hdisk63 fscsi2 500507680140239f,3d000000000000
Option 3: Can you be more specific?
A better solution would be to remove just the bad path. As hdisk63 is already fixed, let's do it on a different LUN which also has a bad path:
lspath -dev hdisk54
status name parent connection
Enabled hdisk54 fscsi0 500507680110239f,34000000000000
Enabled hdisk54 fscsi0 50050768014025bd,34000000000000
Enabled hdisk54 fscsi0 50050768011025bd,34000000000000
Enabled hdisk54 fscsi0 500507680140239f,34000000000000
Enabled hdisk54 fscsi1 500507680130239f,34000000000000
Enabled hdisk54 fscsi1 500507680120239f,34000000000000
Enabled hdisk54 fscsi1 50050768013025bd,34000000000000
Enabled hdisk54 fscsi1 50050768012025bd,34000000000000
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Failed hdisk54 fscsi2 50050768014025bd,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000
Enabled hdisk54 fscsi2 500507680140239f,34000000000000
The rmpath command allows you to narrow the path you want to remove down to a single connection. Here's an extract from the command documentation for the VIOS rmpath command:
rmpath command
Removes from the system a path to an MPIO-capable device.
Syntax
rmpath { [ -dev Name ] [ -pdev Parent ] [ -conn Connection ] } [ -rm ]
Once again, I'll use the -rm flag to remove the path from the ODM. Otherwise it would simply go from Failed to Defined and still report a problem when running cfgmgr. But this time, I can narrow the path down to a single connection using the -conn flag:
rmpath -dev hdisk54 -pdev fscsi2 -conn "50050768014025bd,34000000000000" -rm
path Deleted
lspath -dev hdisk54
status name parent connection
Enabled hdisk54 fscsi0 500507680110239f,34000000000000
Enabled hdisk54 fscsi0 50050768014025bd,34000000000000
Enabled hdisk54 fscsi0 50050768011025bd,34000000000000
Enabled hdisk54 fscsi0 500507680140239f,34000000000000
Enabled hdisk54 fscsi1 500507680130239f,34000000000000
Enabled hdisk54 fscsi1 500507680120239f,34000000000000
Enabled hdisk54 fscsi1 50050768013025bd,34000000000000
Enabled hdisk54 fscsi1 50050768012025bd,34000000000000
Enabled hdisk54 fscsi2 500507680110239f,34000000000000
Enabled hdisk54 fscsi2 50050768011025bd,34000000000000
Enabled hdisk54 fscsi2 500507680140239f,34000000000000
Looking for failure
The lspath command allows you to list paths by their status. This allows you to list all of the failed paths.
lspath -status failed
status name parent connection
Available ses1 sas0 a00,0 < What are these guys
Available ses2 sas0 20a00,0 < doing here?
Failed hdisk3 fscsi2 50050768014025bd,1000000000000 < This line is where we want to start
Failed hdisk4 fscsi2 50050768014025bd,2000000000000
Failed hdisk6 fscsi2 50050768014025bd,19000000000000
Failed hdisk7 fscsi2 50050768014025bd,1a000000000000
Failed hdisk8 fscsi2 50050768014025bd,1b000000000000
Failed hdisk9 fscsi2 50050768014025bd,1c000000000000
Failed hdisk10 fscsi2 50050768014025bd,1d000000000000
Failed hdisk11 fscsi2 50050768014025bd,e000000000000
Failed hdisk12 fscsi2 50050768014025bd,23000000000000
Failed hdisk13 fscsi2 50050768014025bd,24000000000000
Failed hdisk16 fscsi2 50050768014025bd,5000000000000
Failed hdisk17 fscsi2 50050768014025bd,6000000000000
Failed hdisk18 fscsi2 50050768014025bd,7000000000000
Failed hdisk20 fscsi2 50050768014025bd,9000000000000
Failed hdisk22 fscsi2 50050768014025bd,b000000000000
Failed hdisk32 fscsi2 50050768014025bd,16000000000000
Failed hdisk21 fscsi2 50050768014025bd,a000000000000
Failed hdisk25 fscsi2 50050768014025bd,f000000000000
Failed hdisk26 fscsi2 50050768014025bd,10000000000000
Failed hdisk27 fscsi2 50050768014025bd,11000000000000
Failed hdisk28 fscsi2 50050768014025bd,12000000000000
Failed hdisk29 fscsi2 50050768014025bd,13000000000000
Failed hdisk33 fscsi2 50050768014025bd,17000000000000
Failed hdisk34 fscsi2 50050768014025bd,18000000000000
Failed hdisk35 fscsi2 50050768014025bd,1e000000000000
Failed hdisk36 fscsi2 50050768014025bd,1f000000000000
Failed hdisk37 fscsi2 50050768014025bd,20000000000000
Failed hdisk38 fscsi2 50050768014025bd,21000000000000
Failed hdisk39 fscsi2 50050768014025bd,22000000000000
Failed hdisk40 fscsi2 50050768014025bd,26000000000000
Failed hdisk41 fscsi2 50050768014025bd,27000000000000
Failed hdisk42 fscsi2 50050768014025bd,28000000000000
Failed hdisk43 fscsi2 50050768014025bd,29000000000000
Failed hdisk44 fscsi2 50050768014025bd,2a000000000000
Failed hdisk47 fscsi2 50050768014025bd,2d000000000000
Failed hdisk48 fscsi2 50050768014025bd,2e000000000000
Failed hdisk49 fscsi2 50050768014025bd,2f000000000000
Failed hdisk50 fscsi2 50050768014025bd,30000000000000
Failed hdisk51 fscsi2 50050768014025bd,31000000000000
Failed hdisk5 fscsi2 50050768014025bd,3000000000000
Failed hdisk45 fscsi2 50050768014025bd,2b000000000000
Failed hdisk52 fscsi2 50050768014025bd,32000000000000
Failed hdisk53 fscsi2 50050768014025bd,33000000000000
Failed hdisk19 fscsi2 50050768014025bd,8000000000000
Failed hdisk61 fscsi2 50050768014025bd,3b000000000000
Failed hdisk64 fscsi2 50050768014025bd,3e000000000000
Failed hdisk65 fscsi2 50050768014025bd,3f000000000000
Failed hdisk66 fscsi2 50050768014025bd,40000000000000
Failed hdisk67 fscsi2 50050768014025bd,41000000000000
Failed hdisk68 fscsi2 50050768014025bd,42000000000000
Failed hdisk70 fscsi2 50050768014025bd,44000000000000
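Every failed path in that list runs through fscsi2 to the same target port, 50050768014025bd, which suggests a problem with one link or port rather than with the individual LUNs. A quick tally makes that pattern obvious. This is only a sketch, assuming the lspath output is piped in and that the connection field has the form port-name,LUN-id as in the listing. The name failed_by_port is a hypothetical helper.

```shell
# Count Failed paths per parent adapter and target port.
# Reads lspath output (status name parent connection) on standard input.
failed_by_port() {
    awk '
        $1 == "Failed" && NF >= 4 {
            split($4, conn, ",")           # conn[1] = target port name, conn[2] = LUN id
            count[$3 " " conn[1]]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}
```

For the output above, this would collapse the whole list into a single line for fscsi2 and 50050768014025bd, and the Available ses lines would be ignored.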
It's easy enough to script this now:
lspath -status failed | grep '^Failed' | while read -r status hdisk parent connection
do
    # -rm deletes the path from the ODM instead of leaving it Defined
    rmpath -dev "$hdisk" -pdev "$parent" -conn "$connection" -rm
done
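Because -rm deletes path definitions from the ODM, it may be worth reviewing the commands before running them. The sketch below is one way to do that, assuming the same lspath -status failed output on standard input. The name gen_rmpath_cmds is a hypothetical helper that only prints the rmpath commands and runs nothing.

```shell
# Dry run: print one VIOS rmpath command per Failed path instead of executing it.
# Reads lspath output (status name parent connection) on standard input.
gen_rmpath_cmds() {
    awk '$1 == "Failed" && NF >= 4 {
        # $2 = hdisk, $3 = parent adapter, $4 = connection (quoted in the output)
        printf "rmpath -dev %s -pdev %s -conn \"%s\" -rm\n", $2, $3, $4
    }'
}
```

The printed lines use the VIOS flag names (-dev, -pdev, -conn, -rm), so they are meant to be checked by eye and then run from the padmin shell rather than from the AIX shell.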
It seems smarter not to throw out the good paths with the bad one and then repair the damage.