VM Loses Connection During Snapshot Removal
During the snapshot removal phase of a Druva backup, the source virtual machine temporarily loses connectivity.
- Druva does not remove the snapshot itself. Druva only sends an API call to the VMware platform requesting the snapshot action to be performed.
- The snapshot removal process significantly lowers the total IOPS that can be delivered to the VM. The increase in metadata updates places additional locks on the VMFS storage, and the snapshot removal process adds its own I/O load.
- In most environments, if the target storage is already at 30-40% I/O load, which is not uncommon with a busy application server, the snapshot removal process can easily push that past 80% and likely much higher.
- Most storage arrays incur a significant latency penalty once I/O load passes 80%, which degrades application performance.
- To isolate the VMware snapshot removal event, Druva suggests the following isolation test. Perform it when connectivity to the VM is not critical, for instance, during off-peak hours:
- Create a snapshot on the VM in question.
- Leave the snapshot on the VM for the duration of time that a Druva job runs against that VM.
- Remove the snapshot.
- Observe the VM during the snapshot removal.
- If, while performing the test above, you observe the same connectivity issues as during the Druva job run, the issue likely lies within the VMware environment itself. Review the following troubleshooting steps and known issues. If none of them resolves the issue, contact VMware support directly regarding the snapshot removal issue.
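The "observe the VM" step of the isolation test can be made measurable with a small probe script. The following is a minimal sketch, not a Druva or VMware tool: it assumes the VM exposes some TCP service (for example SSH or RDP) that can be probed from another machine, and the function names `is_reachable`, `find_outages`, and `monitor` are hypothetical names chosen for this illustration.

```python
import socket
import time


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_outages(samples):
    """Collapse (timestamp, reachable) samples into (start, end) outage windows.

    An outage starts at the first failed probe and ends at the next
    successful probe. An outage still open at the last sample ends there.
    """
    outages = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            outages.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        outages.append((start, last_ts))
    return outages


def monitor(host, port, duration_s, interval_s=1.0):
    """Probe host:port every interval_s seconds for roughly duration_s seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.time(), is_reachable(host, port)))
        time.sleep(interval_s)
    return find_outages(samples)
```

Starting `monitor` against the VM's address and port shortly before removing the snapshot returns a list of outage windows. Those windows can then be compared with the connectivity loss seen during the Druva job run.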
Snapshot Stun Troubleshooting / Solutions
- If the VM being stunned is stored on an NFS 3.0 Datastore, please refer to the following documents:
- Check for snapshots on the VM while no Druva job is running and remove any that are found.
- Druva can back up a VM that has snapshots present. However, it has been observed that when VMware attempts to remove the snapshot created during a Druva job operation, and there was a snapshot present on the VM before the Druva job, snapshot stun may occur.
- Check for orphaned snapshots on the VM. (See: http://kb.vmware.com/kb/1005049)
- Reduce the number of concurrent tasks that are occurring within Druva. This will reduce the number of active snapshot tasks on the datastores.
- Move the VM to a datastore with more available IOPS, or split the VM's disks across multiple datastores to spread the load more evenly.
- If the VM's CPU usage spikes heavily during snapshot consolidation, consider increasing the CPU reservation for that VM.
- Ensure you are on the latest build of your current version of vSphere, hypervisors, VMware Tools, and SAN firmware when applicable.
- Move the VM to a host with more available resources.
- If possible, change the time of day that the VM gets backed up or replicated to a time when the least storage activity occurs.
- Use a workingDir to redirect snapshots to a datastore other than the one the VM resides on. (See: http://kb.vmware.com/kb/1002929)
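The check for existing snapshots in the list above can be scripted. The sketch below only walks a snapshot tree and prints each snapshot name with indentation. It assumes nodes that expose `name` and `childSnapshotList`, mirroring the shape of pyVmomi's `vim.vm.SnapshotTree` objects. The helper name `flatten_snapshots` and the sample snapshot names are hypothetical, and the stand-in data replaces a live vCenter connection so the example runs on its own.

```python
from types import SimpleNamespace


def flatten_snapshots(snapshot_trees, depth=0):
    """Yield (depth, name) for every node in a list of snapshot tree nodes.

    Each node is assumed to expose `name` and `childSnapshotList`,
    mirroring pyVmomi's vim.vm.SnapshotTree objects.
    """
    for node in snapshot_trees or []:
        yield depth, node.name
        yield from flatten_snapshots(node.childSnapshotList, depth + 1)


# Stand-in data for illustration only. With pyVmomi the input would be
# vm.snapshot.rootSnapshotList (vm.snapshot is None when no snapshot exists).
leftover = SimpleNamespace(name="pre-upgrade", childSnapshotList=[
    SimpleNamespace(name="manual-test", childSnapshotList=[]),
])
found = list(flatten_snapshots([leftover]))
for depth, name in found:
    print("  " * depth + name)
```

A non-empty result while no Druva job is running indicates snapshots that should be reviewed and removed before the next backup. This walk covers only snapshots listed in the VM's snapshot tree, so orphaned snapshot files still need the separate check described in the VMware KB article referenced above.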