We recently noticed a recurring issue in which numerous access points (APs) randomly disconnected from the control interface of the vSZ-H and then hopped between cluster nodes. Use the checklist below to understand the issue and to find a workaround.
Numerous APs disconnected from the control planes and bounced to different cluster nodes. The node changes were frequent but irregular: the same APs sometimes hopped between nodes every minute, while at other times the issue reappeared only after several hours.
Issue observed in vSZ-H version 184.108.40.206.959.
Review the AP event logs to assess the frequency of the issue, focusing on event code 311 (AP changed control plane).
Examine the total number of control planes on the Cluster Summary page, observing patterns in AP node changes. Determine whether APs are switching nodes within specific interfaces or across all interfaces in the cluster.
Assess the uptime of control planes from the Control Plane section, verifying the accuracy of leader and follower status. Check for recent role changes.
Ensure a reasonable balance of APs per node.
In the event of a known or recent outage, and if recovery seems inadequate, schedule a maintenance window. Reboot the cluster node to prevent sync issues or any lingering lock-up state.
Validate whether uptime aligns with the reboot time after the activity. Confirm that services on all nodes are online and operational.
Ensure the VM resources meet the requirements for the number of APs connected to the vSZ-H, as per the configuration guide.
Verify ping reachability and latency between the APs and the vSZ, and confirm node-to-node connectivity within the cluster.
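The first checklist item (assessing how often event code 311 occurs per AP) can be sketched in a few lines of Python. This is only an illustration: the CSV layout, the column names (timestamp, ap_mac, event_code, description), the MAC addresses, the placeholder event code 999 and the threshold of 3 are all assumptions, not the actual vSZ-H export format. Adapt them to whatever your event export looks like.

```python
import csv
import io
from collections import Counter

# Hypothetical event export; the column names and rows are assumptions
# made for this sketch, not the real vSZ-H export layout.
SAMPLE_CSV = """timestamp,ap_mac,event_code,description
2023-09-27T18:20:00+00:00,AA:BB:CC:00:00:01,311,AP changed control plane
2023-09-27T18:21:00+00:00,AA:BB:CC:00:00:01,311,AP changed control plane
2023-09-27T18:22:00+00:00,AA:BB:CC:00:00:01,311,AP changed control plane
2023-09-27T18:22:30+00:00,AA:BB:CC:00:00:02,311,AP changed control plane
2023-09-27T18:23:00+00:00,AA:BB:CC:00:00:02,999,Placeholder for an unrelated event
"""


def count_control_plane_changes(csv_text, event_code="311"):
    """Count how many times each AP logged the given event code."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["event_code"].strip() == event_code:
            counts[row["ap_mac"].strip()] += 1
    return counts


def flapping_aps(counts, threshold=3):
    """Return the APs whose event count reaches the (arbitrary) threshold."""
    return sorted(mac for mac, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    counts = count_control_plane_changes(SAMPLE_CSV)
    for mac, n in counts.most_common():
        print(f"{mac}: {n} control plane change(s)")
    print("APs to investigate first:", flapping_aps(counts))
```

Sorting APs by hop count makes it easier to pick the two or three APs whose support info files are worth downloading first.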
The impact can range from minimal to severe, depending on the total number of APs in the cluster or the number of APs associated with the affected cluster node. A change of cluster node triggers a restart of the SSH tunnels so that the APs can download the updated configuration from the zone. This AP bounce and configuration download can disrupt Wi-Fi operation, and the impact escalates significantly as the number of affected APs grows.
Logs to confirm:
Look for the following log entries in the AP support info files:
2023-09-27T18:22:18+00:00 vszxxxxxx sshd: Connection closed by 10.x.x.x.x port 49404
2023-09-27T18:29:54+00:00 vszxxxxxx sshtunnel_auth_key: Can not find key ssh-rsa
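When several support info files have to be checked, a small script can search for these two messages. The sketch below assumes each log line follows the "timestamp host process: message" shape of the two samples above; the regular expression, the labels in PATTERNS and the function name are illustrative choices, not part of any vSZ-H tooling.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the two sample entries:
#   <timestamp> <host> <process>: <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<proc>[\w-]+):\s+(?P<msg>.*)$"
)

# Substrings taken from the sample logs; the labels are arbitrary.
PATTERNS = {
    "key_not_found": "Can not find key",
    "ssh_closed": "Connection closed by",
}

SAMPLE_LINES = [
    "2023-09-27T18:22:18+00:00 vszxxxxxx sshd: Connection closed by 10.x.x.x.x port 49404",
    "2023-09-27T18:29:54+00:00 vszxxxxxx sshtunnel_auth_key: Can not find key ssh-rsa",
]


def classify(lines):
    """Count log lines that contain each message of interest."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed shape
        for label, needle in PATTERNS.items():
            if needle in match.group("msg"):
                hits[label] += 1
    return hits


if __name__ == "__main__":
    for label, count in sorted(classify(SAMPLE_LINES).items()):
        print(f"{label}: {count}")
```

To scan a real file, pass an open file object to classify() instead of SAMPLE_LINES; lines that do not match the assumed shape are skipped.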
Based on these logs and conditions, a support engineer can narrow down the cause of the "Can not find key" error by testing various combinations of conditions. In our specific case, the analysis revealed one potential trigger: a situation where the Command Line Interface to keycached is shorter than the timeout duration set for keycached.
Direction to resolve:
If you observe similar behavior, work through the checklist above and gather the following information:
Your findings from the checklist, with screenshots of anything relevant.
The AP support info files for at least two or three of the affected APs.
Download snapshot logs from both the leader node and the node associated with the identified APs.
Open a support case and attach the information requested above. The support team will then validate the observed behavior and either provide a KSP or recommend a suitable release that resolves the issue.