Random APs are frequently changing vSZ nodes

Ahamed1
RUCKUS Team Member

We recently noticed a recurring issue where numerous Access Points experienced random disconnections from the control interface of the vSZ-H. The affected APs repeatedly disconnect and hop across cluster nodes. Refer to the checklist below to understand the issue and find a workaround to address the problem.

Issue noticed: 

Numerous APs were disconnecting from the control planes and bouncing to different cluster nodes. The frequency of these node changes was notably high: the same APs were observed hopping between nodes every minute, while at other times the issue reappeared only after several hours. The behavior was highly random.

Issue observed in vSZ-H version 6.1.1.0.959.

Checklist:

  1. Review the AP event logs to assess the frequency of the issue, focusing on event code 311 (AP changed control plane).
  2. Examine the total number of control planes on the Cluster Summary page, observing patterns in AP node changes. Determine whether APs are switching nodes within specific interfaces or across all interfaces in the cluster.
  3. Assess the uptime of control planes from the Control Plane section, verifying the accuracy of leader and follower status. Check for recent role changes.
  4. Ensure a reasonable balance of APs per node.
  5. In the event of a known or recent outage, and if recovery seems inadequate, schedule a maintenance window. Reboot the cluster node to prevent sync issues or any lingering lock-up state.
  6. Validate whether uptime aligns with the reboot time after the activity. Confirm that services on all nodes are online and operational.
  7. Ensure that the VM resources match the number of APs connected to the vSZ-H, as specified in the configuration guide.
  8. Verify ping reachability, latency, and node-to-node connectivity between the APs and the vSZ (a minimal check sketch follows this list).
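
As a quick aid for step 8, below is a minimal Python sketch, assuming a Linux host with the standard iputils ping and Python 3 available; the node IP addresses are placeholders, not values from this case.

#!/usr/bin/env python3
"""Sketch for checklist step 8: check reachability and average latency
from the AP management network to each vSZ node's control interface."""

import re
import subprocess

# Placeholder control-interface IPs of the cluster nodes -- replace with your own.
VSZ_NODES = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]

def ping_summary(host: str, count: int = 5) -> str:
    """Run a quiet ping and return the average RTT, or an error note."""
    try:
        out = subprocess.run(
            ["ping", "-c", str(count), "-q", host],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        return f"{host}: ping timed out"
    if out.returncode != 0:
        return f"{host}: unreachable (packet loss or ICMP blocked)"
    # Linux iputils prints: rtt min/avg/max/mdev = a/b/c/d ms
    match = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", out.stdout)
    if match:
        return f"{host}: avg RTT {match.group(2)} ms (max {match.group(3)} ms)"
    return f"{host}: reachable, RTT summary not parsed"

if __name__ == "__main__":
    for node in VSZ_NODES:
        print(ping_summary(node))

Running this periodically during the problem window helps confirm whether loss or latency toward a specific node correlates with the AP node changes.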

Observed Impact:

The impact can range from minimal to severe, depending on the total number of APs associated with the affected cluster node. A change in cluster node triggers a restart of the SSH tunnel, after which the AP downloads the updated configuration from the zone. This AP bounce and configuration download can affect Wi-Fi operation, and the impact escalates significantly when a larger number of APs is involved.

Logs to confirm:

The following logs can be observed in the AP support info files:

Sep 14 23:14:05 xxxxxxx-AP daemon.err idm: httpRecv receive fail

Sep 14 23:14:05 xxxxxxx-AP daemon.info rsmd_func[9098]: SSH Tunnel Stopped

Sep 14 23:14:05 xxxxxxx-AP daemon.notice rsmd[45]: sshclient ....... [stopped] (0.686)

Sep 14 23:14:05 xxxxxxx-AP daemon.err rsmd_func[9115]: SSHtunnel: Cannot start SSH-Tunnel. rsm_ip6_sgetSettingsWrapper() function failed to execute. RSM API Return value = 35 : unknown err code 35

 

Sep 9 00:08:55 xxxxxxx-AP user.err syslog: dbclient - Restarting SSH tunnel due to dbclient restarting itself

Sep 9 00:08:55 xxxxxxx-AP user.info syslog: /usr/bin/dbclient: Connection to sshtunnel@172.25.207.192:22 exited: No auth methods could be used.
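
For bulk review, a minimal Python sketch like the one below can count how often these SSH-tunnel signatures appear in an AP support info file. The signature strings come from the excerpts above; the script and file names are placeholders only.

#!/usr/bin/env python3
"""Count SSH-tunnel failure signatures in an AP support info text file."""

from collections import Counter
import sys

SIGNATURES = [
    "SSH Tunnel Stopped",
    "Cannot start SSH-Tunnel",
    "Restarting SSH tunnel due to dbclient restarting itself",
    "No auth methods could be used",
]

def count_signatures(path: str) -> Counter:
    hits = Counter()
    with open(path, errors="replace") as fh:
        for line in fh:
            for sig in SIGNATURES:
                if sig in line:
                    hits[sig] += 1
    return hits

if __name__ == "__main__":
    # Usage: python3 scan_ap_logs.py ap_support_info.txt
    for sig, count in count_signatures(sys.argv[1]).items():
        print(f"{count:5d}  {sig}")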

 

From the SZ logs, we observed an AuthorizedKeysCommand error, and the connection was closed by the AP.

 

2023-09-27T18:22:18+00:00 vszxxxxx sshd[12236]: error: AuthorizedKeysCommand /usr/bin/sshtunnel_auth_key.sh sshtunnel ssh-rsa 

2023-09-27T18:22:18+00:00 vszxxxxxx sshd[12236]: Connection closed by 10.x.x.x.x port 49404

2023-09-27T18:29:54+00:00 vszxxxxxx sshtunnel_auth_key[29977]: Can not find key ssh-rsa
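
To line these controller-side errors up against the AP event history, a small sketch like the following can extract the timestamps of the "Can not find key" entries from a snapshot log so they can be compared with the event code 311 timestamps; the log file name is a placeholder.

#!/usr/bin/env python3
"""Extract timestamps of "Can not find key" entries from a vSZ snapshot log."""

import re
import sys

# Matches the ISO-8601 timestamp at the start of the vSZ log lines shown above.
PATTERN = re.compile(r"^(\S+)\s+\S+\s+sshtunnel_auth_key\[\d+\]: Can not find key")

def key_error_timestamps(path: str):
    with open(path, errors="replace") as fh:
        for line in fh:
            match = PATTERN.match(line)
            if match:
                yield match.group(1)

if __name__ == "__main__":
    # Usage: python3 scan_sz_logs.py sz_snapshot_messages.log
    for ts in key_error_timestamps(sys.argv[1]):
        print(ts)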

 

Based on the above logs and conditions, a support engineer can narrow down the cause of the "Can not find key" error by exploring the various combinations. In our specific case, the analysis revealed that a potential trigger is a situation where the timeout of the Command Line Interface call to keycached is shorter than the timeout duration set for keycached itself.

 

Direction to resolve:

If you observe similar behavior, please follow the checklist below and gather the following information:

  1. Follow the provided checklist and include screenshots of any relevant references.
  2. Download the AP support info files for a minimum of 2 to 3 APs under scrutiny.
  3. Download snapshot logs from both the leader node and the node associated with the identified APs.
  4. Initiate a support case, attaching the requested information. The support team will then validate the observed behavior and provide either a KSP or recommend a suitable release to address and resolve the identified issue.