
VMware snapshot of vSZ causing AP disconnects

david_henderson
Contributor II
We have two Ruckus Virtual SmartZone controllers, both running 3.5.1.
We use Veeam for backup and replication. As part of the backup and replication process, each VM gets a VMware snapshot, the snapshot stays open for about 3 minutes, and then the snapshot is deleted.

Not every time, but quite often, the VMware snapshot process causes AP disconnects. APs disconnect for 20-30 seconds, then reconnect. The APs do not restart; they just lose the connection to the controller for a short period of time.

Has anyone else seen this?
3 REPLIES 3

dave_watkins_74
Contributor
At a guess, you're seeing VM stun either when the snapshot is created or, more likely, when it's being consolidated after deletion. What version of VMware are you running? ESXi 6 had significant improvements in VM stun around snapshots.

The other factor affecting VM stun is the speed of your storage. The faster the storage, the smaller the effect.

david_henderson
Contributor II
We are running ESXi 6, update 3a.
We are using a Nimble all-flash array in production, which has very high IOPS and very low latency.
I thought about stun as well, but it only takes a second to take a snapshot, and even when the snapshot is deleted and consolidation occurs, it only takes a second.

In the Ruckus controller, under events, I am seeing lots of "AP lost heartbeat" entries, which does make sense. My guess is that the AP loses the heartbeat to the controller for just a second or two. I would not think this is long enough to cause AP disconnects. We have been running this setup for about 9 months, and it is only recently that we started seeing this behavior. We were running Ruckus firmware 3.4.x for much of that time before upgrading to 3.5.0 and finally to 3.5.1, which is the latest.

david_henderson
Contributor II
Here are the exact times from yesterday's snapshot, which resulted in a large number of AP disconnects:

Create virtual machine snapshot
Requested Start Time - 4:17:23
Start Time - 4:17:23
Completed Time - 4:17:24

Remove snapshot
Requested Start Time - 4:20:24
Start Time - 4:20:24
Completed Time - 4:20:26

When a VM gets a snapshot, or when a snapshot is removed, the VM is stunned for a period of time and no I/O happens. The snapshot took 1 second to take and 2 seconds to remove. Seeing "AP lost heartbeat" during this time did not surprise me. A one- or two-second stun should not be long enough to cause an AP disconnect.
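
For anyone who wants to line these task times up against their own controller events, here is a rough Python sketch of the arithmetic. The timestamps are the ones listed above; the helper names (`seconds_between`, `in_window`) and the 30-second slack value are my own assumptions for illustration, not anything that comes from vCenter or the vSZ. Note that task duration is only an upper bound on the picture vCenter gives you; it is not a direct measurement of the stun itself.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"  # time-of-day format used in the task list above


def seconds_between(start: str, end: str) -> int:
    """Whole seconds from start to end (same day assumed)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def in_window(event: str, start: str, end: str, slack_s: int = 30) -> bool:
    """True if an event time falls within [start, end + slack_s].

    slack_s is a hypothetical allowance for events logged shortly
    after the task completes; tune it to your own environment.
    """
    t = datetime.strptime(event, FMT)
    lo = datetime.strptime(start, FMT)
    hi = datetime.strptime(end, FMT) + timedelta(seconds=slack_s)
    return lo <= t <= hi


# Task times copied from the post above
create_s = seconds_between("4:17:23", "4:17:24")  # snapshot creation
open_s = seconds_between("4:17:24", "4:20:24")    # snapshot left open
remove_s = seconds_between("4:20:24", "4:20:26")  # snapshot removal

print(f"create: {create_s}s, open: {open_s}s, remove: {remove_s}s")
```

Feeding each "AP lost heartbeat" timestamp from the controller into `in_window` with the create and remove task times would show whether the events cluster around creation, removal, or both.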