A couple of weeks ago a customer reported problems with a DPM 2016 server running Modern Backup Storage (MBS). Modern Backup Storage is DPM's new approach to a more efficient and flexible storage pool, without the known limitations of the LDM database. With MBS you use Storage Spaces on the DPM server to combine the disks into a large pool. On top of that storage space you then create one or more volumes that the DPM server can use as storage, instead of adding unallocated disks that DPM manages itself.
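As a rough sketch, building such a pool and volume looks like the following. All friendly names here are hypothetical, and the subsystem lookup assumes a single storage subsystem on the server:

```powershell
# Hypothetical names; adjust to your environment.
# Gather the unallocated disks that are eligible to join a pool.
$disks = Get-PhysicalDisk -CanPool $true

# Create a storage pool from those disks
# (assumes exactly one storage subsystem is present).
New-StoragePool -FriendlyName "DPMPool" `
    -StorageSubSystemFriendlyName (Get-StorageSubSystem).FriendlyName `
    -PhysicalDisks $disks

# Carve a simple (non-resilient) virtual disk out of the pool
# and format it with ReFS for DPM to consume.
New-VirtualDisk -StoragePoolFriendlyName "DPMPool" -FriendlyName "DPMDisk" `
    -ResiliencySettingName Simple -UseMaximumSize |
    Initialize-Disk -PassThru |
    New-Partition -UseMaximumSize -AssignDriveLetter |
    Format-Volume -FileSystem ReFS -NewFileSystemLabel "DPMStorage"
```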
An explanation of MBS is beyond the scope of this blog, but the problem we had with it is related to Storage Spaces and, more specifically, to ReFS volumes. ReFS is the new file system Microsoft has been working on for the last six-plus years. It brings many improvements, several of which are designed to make sure data corruption does not occur and, if it does, to repair it automatically or take action to prevent it.
So why did the volume on this DPM server appear offline, inaccessible and corrupt, showing up as RAW?
In this setup the DPM server is a VM with 20x 1TB .vhdx files on a virtual SCSI controller as the backup storage. The storage pool on the DPM VM was not reporting any errors, and neither was the virtual disk containing the RAW volume.
The event log, however, was being flooded with warnings for disk 22 (the virtual disk on top of the storage space) and a “physical” disk 3. Disk 3 was of course a virtual disk, not a physical one. The Hyper-V host was not logging any disk errors, so the focus shifted to the DPM VM.
Log Name:      System
Source:        Disk
Date:          7-6-2017 12:08:15
Event ID:      153
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      xxxx-vsvr-20.domain.local
Description:
The IO operation at logical block address 0x618c4318 for Disk 3 (PDO name: \Device\0000030) was retried.

Log Name:      System
Source:        Disk
Date:          7-6-2017 12:08:16
Event ID:      153
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      xxxx-vsvr-20.domain.local
Description:
The IO operation at logical block address 0x3cc919f18 for Disk 22 (PDO name: \Device\Space1) was retried.
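If you want to collect all of these retry warnings in one go instead of scrolling through the System log, something like the following query should work (the provider name matches the classic "Disk" source shown in the events above):

```powershell
# Pull all disk retry warnings (Event ID 153) from the System log.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Disk'; Id = 153 } |
    Select-Object TimeCreated, Message |
    Format-Table -AutoSize -Wrap
```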
After running Get-PhysicalDisk the problem with disk 3 was displayed once again:
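The check itself can be reproduced with something along these lines; the exact columns and values will of course differ per system:

```powershell
# List the physical disks in the pool with their health and operational status.
Get-PhysicalDisk |
    Select-Object FriendlyName, DeviceId, HealthStatus, OperationalStatus, Usage |
    Sort-Object DeviceId |
    Format-Table -AutoSize
```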
To fix the problem
It turned out not to be disk corruption at all, but rather a health-handling issue in DPM. DPM checks the disks for their health, and because this disk reported health errors DPM flagged it as unhealthy and took the volume offline. To fix the problem I removed the disk with the IO errors from the pool. Storage Spaces then started moving all data from the problematic disk to the other disks, after which the volume became healthy again, was brought back online and was accessible.
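The remove-and-rebuild steps can be sketched as follows. The pool and disk friendly names are hypothetical; match them to your own Get-PhysicalDisk and Get-StoragePool output, and let the repair jobs finish before removing the disk:

```powershell
# Hypothetical names; match them to your environment.
# Mark the failing disk as retired so Storage Spaces stops allocating to it.
$bad = Get-PhysicalDisk -FriendlyName "PhysicalDisk3"
Set-PhysicalDisk -InputObject $bad -Usage Retired

# Rebuild the virtual disk(s) so the data moves to the remaining disks.
Get-StoragePool -FriendlyName "DPMPool" | Get-VirtualDisk | Repair-VirtualDisk

# Once the repair jobs have finished, remove the retired disk from the pool.
Remove-PhysicalDisk -StoragePoolFriendlyName "DPMPool" -PhysicalDisks $bad
```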
The big question remains: why does a volume go offline or end up in a RAW state, and do I actually have disk errors…
After some deep digging in the environment it turned out to be a performance problem, which you can read all about in this blog…