Disk inaccessible following SAN outage and recovery

  • 7021114
  • 20-Jul-2017
  • 24-Jul-2017

Environment

Novell Open Enterprise Server 11 (OES 11) Linux
Novell Open Enterprise Server 2015 (OES 2015) Linux
SUSE Linux Enterprise Server 12
SUSE Linux Enterprise Server 11
SAN or Disk Subsystem Failure

Situation

Following a SAN infrastructure failure in which the path to the SAN was interrupted but no physical disks failed, some or all of the following messages were seen after the SAN fault was addressed and the Linux/OES server was rebooted. They appeared in logs (such as /var/log/messages), in the output of utilities (such as pvscan, vgs, lvs or nssmu) or in a supportconfig report.

In the following examples, /dev/sdx refers to the inaccessible disk.

OES NSS Errors
  • 20204 zERR_WRITE_FAILURE
  • 20812 zERR_POOL_NOT_FOUND
  • StampIO: Error 5 (Input/output error) reading at 0 of device sdx
  • Unable to find all segments of pool <POOL>. Found=<some> Expected=<all>
  • Pool <POOL> missing segment <Segment#>
  • Cannot mount pool <POOL>. No pool device object
Native SLES Errors
  • ldm_validate_partition_table(): Disk read failed.
  • sdx: unable to read partition table
  • /dev/sdx: read failed after 0 of 4096 at 0: Input/output error
  • /dev/sdx: read failed after 0 of 4096 at 1099511562240: Input/output error
However, commands such as mount and fdisk showed the devices as present, even though they could not be read.

In this case, two of the server's four disks were accessible and the other two were not.

Rebooting the server did not help.
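
To check whether a log file contains any of the messages listed above, a search along the following lines could be used. This is a minimal sketch: scan_log is a hypothetical helper name, the patterns are copied from the messages in this document, and the log path is whatever file is passed in (for example /var/log/messages).

```shell
#!/bin/sh
# scan_log is a hypothetical helper, not part of any SUSE or OES tool.
# It prints every line of the given log file that matches one of the
# I/O or pool error messages described in this document.
scan_log() {
    grep -E 'zERR_POOL_NOT_FOUND|StampIO: Error 5|unable to read partition table|read failed after .* Input/output error|Disk read failed' "$1"
}

# Example (path is an assumption; adjust to the log being examined):
#   scan_log /var/log/messages
```

grep exits with a non-zero status when no line matches, so the function can also be used in a conditional to test whether any of these messages are present.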

Resolution

Rebooting the SAN node that the server was connected to resolved the problem.  

Cause

The SAN Support Team were not able to identify why this should resolve the issue.  It may have been a symptom of the original fault that caused the SAN to fail in the first place. 

Additional Information

As a troubleshooting step, the following command was used to determine if the disk was readable at a block level:

dd if=/dev/sdx of=/tmp/sdx.out bs=1024 count=5

On the inaccessible disks, this command failed while attempting to read the first few blocks (5 blocks of 1024 bytes).

The dd command does a block-level read/copy of the disk and is therefore independent of any filesystem on the disk; in other words, even if the filesystem were irreparably corrupt, dd would still be able to read and copy that data (including any corruption). Because other disks on the same server, which use the same driver, could be read successfully, this suggested that the driver was functioning correctly.
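
The same block-level check can be repeated across several disks to see which are readable and which are not. The following is a minimal sketch under stated assumptions: check_readable and report_disks are hypothetical helper names, the device names in the example are placeholders, and the data read is discarded to /dev/null rather than written to a file under /tmp.

```shell
#!/bin/sh
# check_readable is a hypothetical helper: it reads 5 blocks of 1024 bytes
# from the given device and discards them. It returns dd's exit status,
# which is non-zero if the read fails.
check_readable() {
    dd if="$1" of=/dev/null bs=1024 count=5 2>/dev/null
}

# report_disks is a hypothetical helper: it prints one result line
# for each device path passed as an argument.
report_disks() {
    for dev in "$@"; do
        if check_readable "$dev"; then
            echo "$dev: readable"
        else
            echo "$dev: READ FAILED"
        fi
    done
}

# Check whatever devices are passed on the command line, for example:
#   sh check_disks.sh /dev/sda /dev/sdb /dev/sdx
report_disks "$@"
```

Reading raw devices normally requires root privileges. A mix of readable and failed results on disks that share a driver, as seen in this case, points away from the driver and towards the storage path or the SAN itself.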