A couple of weeks ago a customer reported problems with a DPM 2016 server using Modern Backup Storage (MBS). Modern Backup Storage is DPM's new approach to a more efficient and flexible storage pool, without the known limitations of the LDM database. With MBS you use Storage Spaces on the DPM server to add the disks to one large pool. Then, on top of the storage space, you create one or more volumes that DPM uses as its backup storage, instead of handing DPM unallocated disks to manage itself.
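As a rough sketch of that setup (all the friendly names and the drive label below are placeholders, not this customer's actual configuration), building such a pool and an ReFS volume on top of it could look like this in PowerShell:

# Pool every disk that is eligible for pooling (placeholder names throughout)
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName 'DPMPool' `
    -StorageSubSystemFriendlyName (Get-StorageSubSystem).FriendlyName `
    -PhysicalDisks $disks

# Carve a virtual disk out of the pool and format it with ReFS for DPM
New-VirtualDisk -StoragePoolFriendlyName 'DPMPool' -FriendlyName 'DPMDisk' `
    -ResiliencySettingName Simple -UseMaximumSize
Get-VirtualDisk -FriendlyName 'DPMDisk' | Get-Disk |
    Initialize-Disk -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem ReFS -NewFileSystemLabel 'DPMStorage'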
A full explanation of MBS is outside the scope of this blog, but the problem we had with it is related to Storage Spaces and, more specifically, to ReFS volumes. ReFS is the new file system Microsoft has been cooking on for the last 6+ years. It brings many improvements, several of which are aimed at preventing data corruption and at repairing it automatically when it does occur.
So why does the volume on this DPM server appear offline and inaccessible, with the file system showing as RAW?
In this setup the DPM server is a VM with 20x 1 TB .vhdx files attached to a virtual SCSI controller as the backup storage. The storage pool inside the DPM VM was not reporting any errors, and neither was the virtual disk containing the RAW volume.
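Attaching that many data disks is easily scripted on the Hyper-V host; a minimal sketch, assuming a VM name and storage path (both placeholders):

# Create and attach 20 x 1 TB VHDX files to the DPM VM's SCSI controller
1..20 | ForEach-Object {
    $path = "D:\DPMStorage\DPMDisk$_.vhdx"
    New-VHD -Path $path -SizeBytes 1TB -Dynamic | Out-Null
    Add-VMHardDiskDrive -VMName 'DPM01' -ControllerType SCSI -Path $path
}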
The event log, however, was being flooded with warnings about disk 22 (the virtual disk on top of the storage space) and a "physical" disk 3. Disk 3 was of course a virtual disk, not a physical one. The Hyper-V host was not reporting any disk errors, so the focus was on the DPM VM.
Log Name: System
Source: Disk
Date: 7-6-2017 12:08:15
Event ID: 153
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: xxxx-vsvr-20.domain.local
Description:
The IO operation at logical block address 0x618c4318 for Disk 3 (PDO name: \Device\0000030) was retried.
Log Name: System
Source: Disk
Date: 7-6-2017 12:08:16
Event ID: 153
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: xxxx-vsvr-20.domain.local
Description:
The IO operation at logical block address 0x3cc919f18 for Disk 22 (PDO name: \Device\Space1) was retried.
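These warnings are easy to pull out of the System log in one go, for example:

# List the most recent Event ID 153 retry warnings from the Disk source
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Disk'
    Id           = 153
} -MaxEvents 50 | Format-Table TimeCreated, Message -AutoSize -Wrap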
Running Get-PhysicalDisk showed the problem with disk 3 once again:
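A filter like this brings the unhealthy member straight to the surface:

# Show each pool disk's health; the failing disk stands out immediately
Get-PhysicalDisk |
    Format-Table DeviceId, FriendlyName, HealthStatus, OperationalStatus, Usage -AutoSize

# Or list only the disks that are not healthy
Get-PhysicalDisk | Where-Object { $_.HealthStatus -ne 'Healthy' }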
To fix the problem
It turned out not to be disk corruption at all, but rather a health-handling issue in DPM. DPM checks its disks for health, and because this disk was reporting errors, DPM flagged it as unhealthy and brought the volume offline. To fix the problem I removed the disk with the IO errors from the pool. Storage Spaces then started moving all data from the problematic disk to the other disks, after which the volume became healthy again, came back online, and was accessible.
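In PowerShell terms the removal looks roughly like this; a hedged sketch, assuming the failing disk is still readable and the pool has enough free capacity left to take over its data (the pool name is a placeholder):

# Mark the failing disk as retired so no new data lands on it
$bad = Get-PhysicalDisk | Where-Object { $_.HealthStatus -ne 'Healthy' }
Set-PhysicalDisk -InputObject $bad -Usage Retired

# Move the data off the retired disk onto the remaining pool members
Get-VirtualDisk | Repair-VirtualDisk

# Once the repair has finished, take the disk out of the pool
Remove-PhysicalDisk -PhysicalDisks $bad -StoragePoolFriendlyName 'DPMPool'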
The big question remains: why does a volume go offline or into a RAW state, and do I actually have disk errors…
After some deep digging in the environment it turned out to be a performance problem, which you can read all about in this blog…
Good luck!
Pascal Slijkerman
SR-IOV was my problem. Windows Server 2016, DPM 2016, 1 TB VHDX files on a hosted SAN drive, Data Deduplication on the hosts.
With SR-IOV enabled:
Drives were lost mid-backup; a detach and re-attach fixed them.
Eventually a drive was lost entirely, changing from ReFS to RAW.
With SR-IOV disabled on the backup VM's NICs:
Everything is working, and I'm adding the final backup jobs now. It has been stable for over a week without loss.
435 jobs, online Azure Backup for some SQL, 147 jobs running incrementals at 15-minute intervals.
12 TB of storage,
and dedup is hitting 25% now.
Sorry, no edit button. 34% and gaining. Also, storage is using VHDX: 3x 1 TB drives per storage volume, Simple/Fixed. 48 VHDX files in total, 16 storage volumes.
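For anyone who wants to replicate that fix: disabling SR-IOV on a VM's network adapters comes down to setting the IOV weight to 0 on the Hyper-V host (the VM name is a placeholder):

# Disable SR-IOV on all network adapters of the backup VM
Set-VMNetworkAdapter -VMName 'DPM01' -IovWeight 0

# Verify that SR-IOV is off
Get-VMNetworkAdapter -VMName 'DPM01' | Select-Object Name, IovWeight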