GTench Posted June 19, 2023
I noticed that if I disconnect from the internet for a short time and then reconnect, all node servers are in a failed state and I need to restart PG3x. Is this normal, or will PG3x reconnect by itself after some period of time?
bpwwer Posted June 19, 2023
You should be able to restart the node servers without having to restart PG3x. However, PG3x won't try to restart them automatically. The "failed" state means that the node server failed and disconnected from PG3x unexpectedly. In your case, the node servers are likely crashing when they get disconnected from the internet. Most are not designed to recover gracefully when that happens. This is something that would have to be fixed in each of the node servers.
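As a rough illustration of what recovering gracefully could look like inside a node server, the sketch below retries a network call with exponential backoff instead of letting the exception end the process. It is not taken from any existing node server: `call_with_backoff` and the `fetch` callable are hypothetical names, and catching `OSError` assumes the outage surfaces as a socket-level error.

```python
import time


def call_with_backoff(fetch, retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch(); on OSError wait and retry, doubling the wait each time.

    fetch is a hypothetical zero-argument callable that performs the network
    request. The last failure is re-raised once all attempts are used up.
    """
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait between attempts
```

A node server's polling routine could wrap its cloud request in this helper and keep reporting the last known values while the retries run. How many attempts and how long to wait are judgment calls that depend on the service being polled.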
GTench (Author) Posted June 19, 2023
Thanks Bob. Since restarting PG3x does seem to solve the problem, is it possible to add an option/feature/flag to restart PG3x in this situation?
bpwwer Posted June 19, 2023
PG3x doesn't know the internet connection was off-line, so it can't detect that situation. It doesn't know why the node servers failed, just that they have. The problem is with specific node servers. Some don't handle the internet disconnect well; others do.
GTench (Author) Posted June 19, 2023
OK thanks. Just my opinion, but I think if a node server needs/uses/relies on an internet connection then it should be able to recover from an internet outage, not just crash. I have 5 node servers and they all fail if there is an internet outage. I guess I could check UD Mobile notifications for a failed node server and do a restart, but it would be simpler if it was all automatic.
bpwwer Posted June 19, 2023 (Solution)
49 minutes ago, GTench said: OK thanks. Just my opinion but I think if a node server needs/uses/relies on an internet connection then it should be able to recover from an internet outage not just crash. I have 5 node servers and they all fail if there is an internet outage.
Yes, I agree, they shouldn't crash. You can help by reporting the issues to the node server authors. And there's a good chance that some of them may be my node servers. I know many/most of mine don't handle loss of network connectivity well, and it's on my list of things to fix eventually.
GTench (Author) Posted June 20, 2023
OK thanks... will do. Yes, Weatherflow is one of the ones that crashed.
GTench (Author) Posted June 21, 2023
Bob, the node server authors are passing the issue back to PG3x.
39 minutes ago, Goose66 said: All those errors are from the udi_interface (the PG3/PG3x) side of the node server interface. @bpwwer would be the man to investigate here.
31 minutes ago, Goose66 said: The log file contains the same errors here as in the EnvisaLink-DSC log file you posted earlier. Again, this appears to be a PG3/PG3x issue. I suggest you post a message with the log file(s) in the PG3 or PG3x support forum regarding this issue.
EnvisaLink-DSC_6-21-2023_10902_PM.zip YoLink_6-21-2023_11016_PM.zip
Goose66 Posted June 21, 2023 (edited)
@bpwwer, AFAIK from the two attached node server log files, both node servers are running along swimmingly, and then udi_interface logs this message:

2023-06-21 13:05:54,611 MQTT udi_interface.interface INFO interface:_disconnect: MQTT Unexpected disconnection. Trying reconnect. rc: 16

About 10 seconds later, udi_interface throws this error:

2023-06-21 13:06:09,269 MQTT udi_interface.interface ERROR interface:_disconnect: MQTT Connection error: An exception of type timeout occured. Arguments: ('_ssl.c:1117: The handshake operation timed out',)
Traceback (most recent call last):
  File "/var/polyglot/pg3/ns/0021b90261c8_1/.local/lib/python3.9/site-packages/udi_interface/interface.py", line 460, in _disconnect
    self._mqttc.reconnect()
  File "/var/polyglot/pg3/ns/0021b90261c8_1/.local/lib/python3.9/site-packages/paho/mqtt/client.py", line 1073, in reconnect
    sock.do_handshake()
  File "/usr/local/lib/python3.9/ssl.py", line 1310, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:1117: The handshake operation timed out

It then appears that the node server is "restarted," or at least all of the initialization routines are performed again. Another 10 seconds later, udi_interface throws the same "handshake operation timed out" error. The node server continues to run and attempts to update the node driver values (it's initializing) but gets dozens of warnings and errors like these:

2023-06-21 13:06:37,775 MainThread udi_interface.interface WARNING interface:send: MQTT Send waiting on connection :: {'config': {'version': '3.0.8'}}
2023-06-21 13:06:40,782 MainThread udi_interface.interface ERROR interface:send: MQTT Send timeout :: {'config': {'version': '3.0.8'}}.

That goes on for a few minutes and then the user evidently pulled the plug. These events are almost exactly the same in both node server log files, with similar timestamps.

Note that my node server (EnvisaLink-DSC) doesn't rely on or have a connection to an Internet service, just a direct connection to the Envisalink device over the local network. It appears to me that it was PG3x (or the MQTT broker) that yacked when the Internet connection went down, and the node servers (specifically, udi_interface) were never able to reconnect. Perhaps you can get @GTench to give you a PG3 log file.
Edited June 21, 2023 by Goose66
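For anyone comparing the two log files, a small script can pull the timestamp, thread, and level out of each line and measure the gap between the disconnect and the first handshake error. This is only a sketch: the regular expression is an assumption based on the few lines quoted in this thread, `parse_line` and `seconds_between` are made-up helper names, and the error line's year is assumed to be 2023 like the lines around it.

```python
import re
from datetime import datetime

# Layout assumed from the quoted lines: timestamp, thread, logger, level, message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<thread>\S+) (?P<logger>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None for lines such as traceback frames."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return record


def seconds_between(first, second):
    """Seconds elapsed between two parseable log lines."""
    return (parse_line(second)["ts"] - parse_line(first)["ts"]).total_seconds()


disconnect = (
    "2023-06-21 13:05:54,611 MQTT udi_interface.interface INFO "
    "interface:_disconnect: MQTT Unexpected disconnection. Trying reconnect. rc: 16"
)
error = (
    "2023-06-21 13:06:09,269 MQTT udi_interface.interface ERROR "
    "interface:_disconnect: MQTT Connection error: An exception of type timeout occured. "
    "Arguments: ('_ssl.c:1117: The handshake operation timed out',)"
)

if __name__ == "__main__":
    print(seconds_between(disconnect, error))
```

On the two quoted lines the gap comes out just under 15 seconds. Running the same parser over both node server logs and filtering on the `MQTT` thread would make it easier to confirm that the events line up across node servers.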
bpwwer Posted June 21, 2023
I looked at the files too and noticed that. This doesn't appear to be an internet outage but a local network outage, which is a very different thing. There's a lot of local network communication between components, and if that is disrupted for more than milliseconds, it may be unrecoverable. The MQTT connection between a node server and PG3x is happening on 127.0.0.1. That's not even external to the machine, so for that to get a handshake error implies that something is actually blocking network activity on the local machine. I'll have to do some investigation, but even unplugging the network connection should not affect that connection.
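One way to test that while the cable is unplugged is to check whether a plain TCP connection to the broker on 127.0.0.1 still opens at all. The sketch below uses only the Python standard library. `can_connect` is a hypothetical helper, and the port has to be supplied by whoever runs it, since the thread does not say which port the broker listens on.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

This only exercises the TCP connect, not the TLS handshake that timed out in the logs. A `True` result during the outage would therefore suggest the problem lies in the handshake or the broker rather than in loopback networking as such.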
Goose66 Posted June 21, 2023
It's also possible this is a PG3x thing and not a PG3 thing, so that may narrow down where to look.
GTench (Author) Posted June 22, 2023
I don't know if this will help, but here are a few more details concerning what I did. I have an incoming fiber connection to a modem. The modem is connected to a unifi switch with an ethernet cable. The modem has its wifi turned off. Wifi is provided throughout the house by unifi access points connected to the switch. I never enabled wifi on the eisy. Eisy is connected directly to the unifi switch with an ethernet cable.
My testing consisted of unplugging the ethernet cable from the modem to the unifi switch for a couple of minutes, then plugging it back in. If the disconnect/reconnect time is, say, 15 to maybe 30 seconds, the node servers do not show fail, but I have not timed this. Disconnect/reconnect of a few minutes caused a fail of all node servers.
I am using the latest releases of PG3x and IOX.
GTench (Author) Posted June 22, 2023
The modem is also the DHCP server and the router.
GTench (Author) Posted June 22, 2023
Here is the PG3x log file based on a test that I just did.
pg3_6-22-2023_124130_PM.zip
bpwwer Posted June 22, 2023
Thanks for the details on your setup and what you did to test it. The node server to PG3 communication should not be affected by external network issues, but it clearly appears to be. I'll have to do some investigation to see if I can figure out why.