We have two SZ-100 controllers and 925 APs. We normally get one or two APs disconnecting daily (not rebooting, just getting the disconnected message) but over the last week or two, the number of AP disconnects has jumped to 20-30 a day.
Question 1: what does it mean when an AP is disconnected? I know they're not rebooting so is it just that the controller can't reach them?
Question 2: What happens to the devices that are connected to them? Does their network access pause until the AP reconnects?
Question 3: What could be causing this? A single SmartZone is supposed to be able to handle 1000 APs but looking at our two, the memory usage is steady at 60-70% and the CPUs hit 100% a lot. And normally our 20-30 disconnects are spread over three or four of our higher traffic sites but this morning, for example, 34 disconnects in two hours at ONE site!
It's probably significant that the disconnects over the last few weeks are ALL coming from the same controller.
1. All communication is initiated by the AP. The AP sends a heartbeat every 40 seconds; if it does not get a response from the controller, it starts sending one every 5 seconds. If 12 heartbeat responses are not received by the AP (about 90 seconds, depending on version), the AP goes into discovery and connects to another node in a multinode cluster. If Geo Redundancy is employed (a second, backup cluster is defined), the AP will also try to connect to the backup controller.
The controller also keeps track of heartbeat messages. A heartbeat event indicates that a heartbeat was not received from the AP within the expected time frame. If no heartbeats are received within about 5 minutes, the AP is listed as Disconnected and an alarm is raised.
Most commonly this is a network issue, but there is the possibility that the AP is hung and a (more remote) possibility that the controller is not able to respond due to congestion or service issues. 100% CPU on the controller, node-out-of-service events, or controller port saturation could cause loss of AP-to-SZ connectivity and raise disconnect alarms.
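As a rough illustration of the heartbeat timing described above, here is a minimal Python sketch of how long an AP might take to enter discovery after losing the controller. The function name is made up for this example, and the way misses are counted (the first unanswered heartbeat counts toward the 12, the rest follow at 5-second spacing) is an assumption, not something confirmed for any particular firmware version.

```python
def seconds_until_discovery(normal_interval=40, retry_interval=5, missed_limit=12):
    """Return (best_case, worst_case) seconds from link loss to AP discovery.

    Assumption: the first unanswered heartbeat counts as miss number one and
    the remaining misses are spaced retry_interval seconds apart.
    """
    retry_phase = (missed_limit - 1) * retry_interval
    # Link fails just before a heartbeat is due: no idle wait at all.
    best_case = retry_phase
    # Link fails just after a successful heartbeat: wait one full interval first.
    worst_case = normal_interval + retry_phase
    return best_case, worst_case


best, worst = seconds_until_discovery()
print(f"AP enters discovery roughly {best}-{worst} s after losing the controller")
```

Under these assumptions the upper end lands close to the "about 90 seconds" figure, well inside the roughly 5 minutes the controller waits before raising a Disconnected alarm.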
2. If you are using open or PSK WLANs in Local Breakout (no tunnel) mode, service continues for connected clients. Clients that rely on controller services such as captive portal or MAC authentication may not be able to connect if the AP cannot communicate with the controller, but clients that are already connected and using Local Breakout (not a tunnel) should keep communicating until a session timeout requires reauthentication.
3. Most lost-heartbeat and AP disconnect issues are due to network problems, so standard IP connectivity checks from the AP CLI or SZ CLI are the best place to start. Check the status of the AP first: has it rebooted, and can it ping the controller? Access the AP via SSH and check ping/traceroute to the controllers (the control plane IP on 3-IP systems) and the state of the connection (get scg, get ipaddr wan, get eth, get netstats eth0). ALWAYS try to get the AP support info file for any AP (or client) issue. If the AP is in the connected state, you can do this from the GUI by highlighting the AP and using the "more" button at the top to download the support info file. For APs that are disconnected, enable logging in your SSH tool and run two commands: support (generates the support info file) and support show (scrolls the support info to the SSH tool's log).
Diagnosing issues on the controller side is more complicated. Put the "communicator" or "core" log (for systems with fewer than 8 cores) into debug level in the GUI under Administration::Diagnostics::Application Logs, wait for the problem to happen, and save the communicator log (click the number next to the log name in the application list, then click the file to download it). GET the AP support info!! Also generate a diagnostic snapshot, either from the "more" button on the GUI application log page or from the SZ CLI (en; diagnostics; exec all). Diagnostic snapshots created from the CLI are stored in the HDU and can be downloaded from the GUI application logs under the Diagnostics list; snapshots created from the GUI are exported directly to the browser and not saved.
For this issue I recommend you open a support case with Ruckus Networks support; check the Contact Us section of the Support portal.
Thanks very much, all of this was immensely helpful.
Just to make sure I understand. The AP sends a heartbeat signal every 40 seconds which I'm guessing is a unicast to the controller IP address? Then, if it doesn't get a reply, it tries the other controller? Or would it broadcast to try and find a different controller? And if ALL our disconnect emails list the B controller as the node IP, does that imply an issue with the B controller?
And does this look like a normal Resource Utilization graph for 1 of 2 controllers with 900+ APs?
We don't have the tunnel enabled for any of the SSIDs we're broadcasting but, as I understand it, the one SSID that we have RADIUS authentication on gets proxied back through the controller, so if the AP disconnected, that would de-authenticate any clients and interrupt their service, correct?
I'm working on creating the support case right now.
The heartbeat message from the AP and the reply from the controller are in the SSH control tunnel.
When an AP joins a controller, it is sent the list of IPs of all nodes (control plane IPs on the 3-interface version), in a pseudo-random order to "spread" the APs across nodes. You can check this "C-list" using the AP CLI command get scg. This is the key command to verify whether the AP is connected. The AP tries the first IP in the list and moves down the list if it does not reply.
rkscli: get scg
------ SCG Information ------
SCG Service is enabled.
AP is managed by SCG.
Server List: 10.1.4.11,12.xx.xx.xx
SSH tunnel connected to 12.xx.xx.xx
Failover List: Not found
Failover Max Retry: 2
DHCP Opt43 Code: 6
Server List from DHCP (Opt43/Opt52): Not found
SCG default URL: RuckusController
SCG config|heartbeat intervals: 300|30
SCG gwloss|serverloss timeouts: 1800|7200
-----------------------------
This example is a single-node cluster but with both internal and public NAT IPs. My AP is connected to the public IP (hidden to protect the innocent!).
Check this command on a few APs to see which controller they are connected to. Also check the SZ GUI under System::Cluster to see the distribution of APs between nodes.
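If you save the get scg output from several APs (for example from your SSH tool's log), a small script can tally which controller each one is connected to. This is only a sketch: the function names are made up for illustration, and the parsing assumes the field wording matches the sample output above, which may differ between firmware versions.

```python
import re
from collections import Counter


def parse_get_scg(text):
    """Extract the server list and the connected controller IP from get scg output."""
    # "Server List:" must be followed directly by the colon, so the
    # "Server List from DHCP" line is not matched here.
    servers = re.search(r"Server List:\s*(\S+)", text)
    tunnel = re.search(r"SSH tunnel connected to\s+(\S+)", text)
    return {
        "server_list": servers.group(1).split(",") if servers else [],
        "connected_to": tunnel.group(1) if tunnel else None,
    }


def controller_distribution(outputs):
    """Count how many saved outputs report each controller IP (None = no tunnel line)."""
    return Counter(parse_get_scg(o)["connected_to"] for o in outputs)


# Sample taken from the output shown above (public IP masked as in the post).
sample = (
    "Server List: 10.1.4.11,12.xx.xx.xx\n"
    "SSH tunnel connected to 12.xx.xx.xx\n"
    "Failover List: Not found\n"
)
info = parse_get_scg(sample)
print(info["server_list"], info["connected_to"])
```

A heavily lopsided count from controller_distribution would be one hint that APs are "favoring" one node.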
I am not sure if the email message is being sent from your "B" node or if it is indicating that the APs are disconnecting from that node. You can check for AP disconnect messages on the Events and Alarms page, or if you select a specific AP from Monitor::Access Points you can see the events for that AP, which node it is connecting to, and whether it is getting disconnected from one particular node or the other.
The CPU graph you provided does not appear to show a node in crisis. It has some peak CPU usage but the green average is still below 50%.
If you are using RADIUS Proxy, then if the AP is disconnected from both clusters, authentication will fail.
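Once you have collected the disconnect events, counting them per node and per AP makes it easier to see whether one node or a handful of APs account for most of them. The sketch below assumes you have already reduced each event to an (AP name, node name) pair by hand or from an export; the record format and the sample values are hypothetical and not a SmartZone export format.

```python
from collections import Counter


def tally_disconnects(events):
    """Count disconnect events per node and per AP.

    events: iterable of (ap_name, node_name) pairs, one per disconnect event.
    """
    events = list(events)
    by_node = Counter(node for _, node in events)
    by_ap = Counter(ap for ap, _ in events)
    return by_node, by_ap


# Hypothetical records, for illustration only.
events = [("ap-101", "B"), ("ap-102", "B"), ("ap-101", "B"), ("ap-230", "A")]
by_node, by_ap = tally_disconnects(events)
print(by_node.most_common())
print(by_ap.most_common(3))
```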
BUT the AP should only stay disconnected if BOTH nodes are refusing to control it. The AP should automatically fail over to the other node within about 90 seconds if one is in trouble. If the AP is in the disconnected state, then it must not be able to connect to either node. Licenses are shared between nodes, so unless you hit the max APs per node (10K), it should be connected.
BTW, in the SZ/SCG architecture neither node has precedence over the other; they are "peers" and support Active-Active redundancy. The "leader" node is just the one node that manages time synchronization for the cluster and APs. It can change without affecting operation.
If APs are going into the disconnected state, then I would look to the network or the APs (check uptime in the support info). If APs are "favoring" one node over another, then there may be some issue with the node that will require remote access by a Ruckus support engineer to diagnose further.
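The failover behavior can be pictured as walking the server list in order and taking the first node that answers. This is a simplified sketch of that idea, not the AP's actual implementation; the function name, the reachability callback, and the IP addresses are all made up for the example.

```python
def select_controller(server_list, is_reachable):
    """Return the first controller IP that answers, or None if none do."""
    for ip in server_list:
        if is_reachable(ip):
            return ip
    # No node answered: this is the case that ends up reported as Disconnected.
    return None


# Hypothetical two-node list where only the second node answers.
nodes = ["10.0.0.1", "10.0.0.2"]
up = {"10.0.0.2"}
print(select_controller(nodes, lambda ip: ip in up))
print(select_controller(nodes, lambda ip: False))
```

In this picture an AP only ends up with no controller when every entry in its list fails, which is why a lasting disconnect points at the path between the AP and both nodes rather than at a single node.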
Just to be sure I understand: if we're getting an email about a disconnected AP and it's roughly five minutes between the disconnect and the alert generation from the controller, then the inference is that the AP has tried unsuccessfully to contact both controllers, right?
And if it was one controller that was having issues (as all the disconnects coming from the B controller would seem to indicate), then wouldn't the number of APs on the A controller keep increasing? And if the numbers stayed consistent, then that sounds to me like there's something not controller related that is "blocking" them from contacting the controllers. But if it was something like network congestion, then wouldn't all the APs at a site (or in that subnet) disconnect instead of just an assortment? And wouldn't the disconnects all happen at about the same time (our disconnects seem to happen every few minutes, one at a time)?
Sorry, lots of "ands" and "buts" but I'm trying to wrap my head around all this.