johnnyt Posted July 21, 2022 Posted July 21, 2022 Hi, I had an ISY 994i lock up last night so power cycled it and Polisy to recover from that. Recovery did not go so well for OWM NS. I've attached today's log file. ISY lock up occurred at or shortly after 01:33:15 AM last night and was power cycled at 09:53 AM. Polisy turned off as well around 9:53 but only turned back on at 10:10 to let ISY settle. One of my ISY programs subsequently power cycled Polisy around 11:00AM because OWM had not updated ISY data for too long. It was still seen as disconnected as I started writing this but a subsequent ISY program initiated power cycle (about 11:44) fixed it. (As you know in another post I discuss needing ability to restart NS or PG3. This is an example of why it's needed, at least until all NS' can detect and recover from ISY lock ups, restarts and other issues. This weather data affects multiple HVAC related program decisions so I need to make sure the data is fresh enough. I do get notified when these Polisy restarts are called so I can look into them but, until I can fix the root cause, the show must go on and a self healing mechanism needs to be available because I can't always fix issues as soon as they occur. When I move to IoP, these power cycles will become suicide missions for ISY, which is really not ideal) OpenWeatherMap_7-21-2022_115802_AM.zip
bpwwer Posted July 21, 2022 Posted July 21, 2022 Thanks for providing such detailed information! Would it be possible to get the PG3 log file for the that time as well? You can PM it to me if you don't want to post that. From the node server log. Until 09:50:05, the node server was working properly and sending data to PG3. It was unaware that there were issues with the ISY. At this point, the node server stopped (I'm guessing because you power cycled the Polisy. The node server starts again at 10:10:54. There are no issues with it starting and it runs and is sending data to PG3 until 10:50:59 when it is again stopped. I'm not sure how you are determining that no data was sent. At 10:50:52 - 10:50:59 the log shows it sending: Successfully set controller :: CLITEMP to 26.93 UOM 4 Successfully set controller :: GV2 to 28.58 UOM 4 Successfully set controller :: DEWPT to 20.52 UOM 4 Successfully set forecast_0 :: DEWPT to 20.3 UOM 4 Successfully set forecast_0 :: GV0 to 27.0 UOM 4 Successfully set forecast_0 :: GV20 to 3.12 UOM 106 Successfully set forecast_0 :: GV2 to 26.0 UOM 4 This says the node server successfully sent those updates to PG3. With the info I have, I can't tell you what PG3 did with them. Also, the node server only sends values that changed. In this case, for the main node it was only the temperature, dewpoint and feels like temperature that changed so that's what was sent. How often stuff changes is really dependent on how often OpenWeatherMap updates the info for your location, not your node server polling intervals. The node server starts again at 11:03:29. And again, there are no issues with it starting and running. From the log, the node server was running properly the whole time. I see no indication that a restart was needed.
bpwwer Posted July 21, 2022 Posted July 21, 2022 I just saw you posted the PG3 log in another thread so I can provide more updates here. Looking at the PG3 log. At 10:50:52 PG3 scheduled the request to send the temperature value to the ISY: 7/21/2022, 10:50:52 [pg3] debug: Scheduling http://192.168.200.251:80/rest/ns/8/nodes/n008_s_2930073475/report/status/CLITEMP/23.6/4 to group statusGroup It got a response back from the ISY indicating it was successful 7/21/2022, 10:50:52 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 1526.422921ms - http://192.168.200.251:80/rest/ns/4/nodes/n004_controller/report/status/CLITEMP/26.93/4 So again, I see no indication that anything is wrong. PG3 is getting updates from the OWM node server and passing those on to the ISY successfuly. 56 minutes ago, johnnyt said: Hi, I had an ISY 994i lock up last night so power cycled it and Polisy to recover from that. Recovery did not go so well for OWM NS. I've attached today's log file. ISY lock up occurred at or shortly after 01:33:15 AM last night and was power cycled at 09:53 AM. I'm confused as the PG3 shows PG3 communicating just fine with the ISY until 02:13:52, that's the first time a request from PG3 to the ISY failed. It continued failing until 09:53:05 when it looks like the Polisy was rebooted. PG3 restarted started at 10:10:33 At 10:10:54 PG3 sends some info to the ISY and the ISY responds with success. At 10:10:57 PG3 sends a /rest/nodes request and the ISY fails to respond. But then other requests after that are successful. At 10:11:07 PG3 tries to add the OWM controller node to the ISY, this fails with a "bad request" error, however, it's not a bad request. This continues for about a minute with various "bad request" responses, successful responses and no responses to the PG3. This is all happening while node servers are trying to initializing. It appears that the ISY is struggling to keep up. It mostly clears up after that, but the ISY is still not responding to the /rest/nodes requests from PG3. For the next few minutes there brief periods where the ISY fails to respond to PG3 but then it starts responding normally. Probably as node servers finish their initialization. From that point everything is normal until it is restarted at 11:03. After the restart, I see the same errors happening where the ISY fails to respond to PG3. PG3 restarts place a heavy load on the ISY as node servers start and initialize. On your system, it seems like it takes 5-10 minutes after a PG3 restart before the load on the ISY decreases to what it can handle.
johnnyt Posted July 21, 2022 Author Posted July 21, 2022 (edited) Just for reference the Polisy power cycle is something ISY does when it reboots, and it waits 10 mins after turning the (WebSwitch outlet) for Polisy off before turning it back on so the ISY has time to stabilize. I had to build complicated, multi program startup routine to do things is chunks otherwise I get "full queue" messages and a variety of things don't happen as they should. My startup routine works 60% of the time, and for the other 40%, I think I've been able to minimize the things that don't happen. I digress a bit to provide some context. The problem right now is that the data in ISY has not updated since before the spontaneous reboot at 11:53 (about 7.5 hours ago). Here are the variables that get updated when ISY sees changes in values such as temperature and humidity Here are two programs that should have set these variables to something that's not 0. There are others too that aren't working. Set Current Humidity - [ID 0303][Parent 0330] If '1-HVAC and Backyard / Climate / OpenWeatherMap' Humidity > 0% Then $sHum.Outdoor.Weather = '1-HVAC and Backyard / Climate / OpenWeatherMap' Humidity % Else - No Actions - (To add one, press 'Action') Set Current Temp Weather Service - [ID 00B1][Parent 0330] If '1-HVAC and Backyard / Climate / OpenWeatherMap' Temperature > -50.0°F Then Wait 3 seconds $sTemp.Outdoor.WeatherService = '1-HVAC and Backyard / Climate / OpenWeatherMap' Temperature °C Else - No Actions - (To add one, press 'Action') Interestingly when I saved data from the ISY event viewer and separated out the data for OWM there are regular updates being seen by the ISY. See attached spreadsheet. However the displayed data has not changed accordingly (and shows node server is not online) and even if I run the program for temp update manually, it doesn't change the variable. The program is "True" (green) but either the data is 0 or ISY is badly broken and the program isn't actually performing the action. Here's data right now in ISY that I believe is from before 11:50 this morning. This is very strange. OWM_EventViewerData.zip Edited July 22, 2022 by johnnyt
johnnyt Posted July 21, 2022 Author Posted July 21, 2022 So after investigating and capturing the above info, I went ahead and restarted just the OWM NS and all is fine now. This is good example of how an ISY initiated NS restart function would eliminate the need for the "sledge hammer"/ IoP suicide mission of doing a Polsy power cycle, which actually didn't fix the problem this time (probably my ISY missed something at restart despite the 10 min delay in restarting Polisy) Maybe when I move to IoP I won't have these issues but no one is telling me that (including UDI) so it remains to be seen. Even then, will I be limited in what else I can ask ISY to do for me, if not initially then 2-3 other node servers and 200-300 programs down the road? I won't have the luxury of delaying start of PG3 when I move to IoP unless I buy a separate Polisy just for IoP...
bpwwer Posted July 22, 2022 Posted July 22, 2022 49 minutes ago, johnnyt said: Just for reference the Polisy power cycle is something ISY does when it reboots, and it waits 10 mins after turning the (WebSwitch outlet) for Polisy off before turning it back on so the ISY has time to stabilize. I had to build complicated, multi program startup routine to do things is chunks otherwise I get "full queue" messages and a variety of things don't happen as they should. My startup routine works 60% of the time, and for the other 40%, I think I've been able to minimize the things that don't happen. So I'm clear, the only thing running on the Polisy right now is PG3? I'm assuming you did some experiments that prompted you to set up the Polisy reboot when the ISY stops/starts, why? What happens if you don't? PG3 shouldn't care if the ISY stops or crashes, other than it will fail to get info from it and fail to send information to it. I don't believe it opens any connections to the ISY that it expects to say open until it is restarted. If it's not able to restore operation when the ISY comes back up, that would be a bug in PG3. 56 minutes ago, johnnyt said: The problem right now is that the data in ISY has not updated since before the spontaneous reboot at 11:53 (about 7.5 hours ago). Here are the variables that get updated when ISY sees changes in values such as temperature and humidity Interestingly when I saved data from the ISY event viewer and separated out the data for OWM there are regular updates being seen by the ISY. See attached spreadsheet. However the displayed data has not changed accordingly (and shows node server is not online) and even if I run the program for temp update manually, it doesn't change the variable. The program is "True" (green) but either the data is 0 or ISY is badly broken and the program isn't actually performing the action. Clearly data is being sent to the ISY, the event viewer log shows that as do the node server and PG3 logs. If the ISY isn't doing anything with that data, then it's a problem with the ISY. As I mentioned previous, the PG3 log shows a lot of failures communicating with the ISY when it starts up. When PG3 starts, it starts all the node servers which can generate a quite a bit of traffic as it tries to make sure the ISY is fully up to date with each node server's current status. With all the random failures it's getting, it's not really surprising that you're having issues. That would explain why restarting the node server when things are a lot less busy allows it to fully update the ISY and work. You obviously have a complicated setup on the ISY which means it's going to be difficult or impossible for anyone to replicate what you're seeing. There are a few things you could try that may help narrow down the problem. 1. You could stop all but one node server and restart PG3 and see if you still get the communication failures on startup. (If you manually stop a node server, it won't automatically restart when PG3 starts). You could try this with each node server, maybe only one really impacts the ISY. 2. With a good backup of the ISY, you could try disabling or removing all programs and then restart PG3 and see if it still has failures starting. Maybe a program or set of programs is causing so much load that it can't handle starting multiple node servers. Based on what I've seen in the logs it seems like the ISY is just overloaded and there may not be anything anyone can do about that.
johnnyt Posted July 22, 2022 Author Posted July 22, 2022 Thanks @bpwwer. You make a really good point about how all NS' are starting at once when Polisy/PG3 is restarted. That alone (along with programs it triggers) may overload my 994i, even if it has finished doing its own stuff at startup. I may try what you're suggesting but I'm pretty sure I'll find there's no problem if I simply turn stuff off. I think the situation is just that there's too much being asked of the 994i. Hopefully IoP will resolve that. For now I've turned off WeatherFlow NS as it can send a lot of data in quick succession and I haven't really begun to leverage that data as I just got the device and NS a few weeks ago. So far so good. Bottom line, though, it seems I'm firmly at a point where the only thing that works right now is to not automate stuff I'd like to automate. In the interest of making the Polisy/IoP able to handle problems when they happen (while the humans are at work, sleeping or away) as well as handle the workload demands of up to 2000 programs, I wonder if the following features could be added to PG3 and ISY, where applicable: (I know you don't control either but believe you have influence with the ones who do) ability for ISY to stop/start/restart individual node servers. Not just for working around problems, but also consider that maybe some NS data/functionality is not needed while sleeping or on vacation, while some is only needed when sleeping or away ability to stagger the startup of NS' using a configurable delay by NS/slot or as a single global setting. Ideally there would be a way to select NS startup order, but as a starting point just the ability to manage the bombardment of ISY at startup would be good. ability to throttle the messages being sent to ISY, e.g. make all NS' put stuff for delivery to ISY on a single PG3 queue (if that isn't already the case) and set a configurable delay between each item being sent to ISY. Retries would see the item being added to bottom of queue rather than compounding what may be a processing capacity issue. Ideally there would be a way to set a message (or NS) priority as some NS data is needed faster than other. Also, there would be a way to adjust the throttle via an ISY program. I think this feature could potentially negate the need for #2 as ISY would be able to adjust throttle to a very high delay between messages when it's restarting. To give a bit more context on my set up because my need for more processing capacity and better processing efficiency isn't going away, and because I've had a few people in the past politely suggest that surely better programming on my part would fix my problems. My biggest source of programs (estimated at 70-75%) are related to HVAC. Unlike lighting, there are many more conditions and data points affecting HVAC decisions/functions. It's easy to underestimate all the conditions. Believe me. Especially when there's huge differences in seasons where one lives (I'm in Eastern Ontario, Canada) For example, here's some of the data I have to get and act on: indoor and outdoor temp and humidity, forecasted temp (today and tomorrow), temp differences within the house, temp diff inside vs outside, humidity differences, time of day, season (based on temp, forecasted temp, and time of year), home vs. away vs. on vacation. (OWM provides data that's critical to proper functioning, which is why I'm looking for some self healing tools if things aren't working between it and ISY for whatever reason.) Aside from controlling heat, AC, furnace fan (low/high), HRV (low/high), HEPA filter (low/high), duct fan (low/medium/high), humidifier in winter, two dehumidifiers in summer, and 2 bathroom fans based on the above data, I also have 7 Airthings devices that I use to trigger ventilation when VOC or CO2 levels are high. I also have to deal with humans wanting to manually run something or change the thermostat mode/set point without the automation coming right behind and undoing it, yet eventually taking over again. It's way more complicated than you might think, and certainly than I expected when I started on the journey. Over past 12-18 months, I've tried multiple times (and am always looking) to reduce the number of programs. I've had some successes but not on scale that would be needed. Last year I completely scrapped and started over the chunk controlling my HRV to see if I could lean things up and reduce the number of programs. In the end I ended up with just as many programs, if not more, and probably 40+ hours of my life I would have rather used to add automation not fix what I already had. In the mean time, to cope with overloads at start up and minimize them at other times, I've added different 'wait' commands throughout most programs to help reduce concurrent load to some extent. Of course that's a bit of a guessing game when you have 900 programs. Also in the category of "hours of my life I would have rather used to add automation not fix what I already had", I've had to build a complicated startup sequence for things to work mostly reliably, otherwise I would get "queue full" messages and (unknown to anyone) actions would be dropped. Informal scan suggests it works 60% of the time and that, for the other 40%, there are only a few missed actions that seem to fix themselves later. I never had a startup crash for some reason (maybe because I delay Polisy start?). I've just had stuff not working properly after a startup (most recently the interaction with OWM NS.) Key for a large majority of my programs is that very few things need to be done instantaneously with HVAC and other stuff. A lot of it can easily wait a bit. Most lighting automation needs quick response but even some lighting can wait, e.g. outside light ON when it's dark. What's another 30 seconds when turning on outside light, HRV or Humidifier? I propose the ability to manage work with throttling and prioritization would allow for demand/supply imbalances to be handed gracefully for 1000 programs today and up to 2000 programs eventually. It feels like some of that is already in place within ISY, e.g. backups clearly run at lower priority, but not with respect to the crucial interactions between PG3 and ISY. Personally I think HVAC automation is the next big thing in HA, and HVAC enthusiasts (nerds) like me don't mind and even like programming an ISY, so for the benefit of the ISY and the UDI model, I'm really hoping Polisy/IoP will be able to handle 2000 programs, especially when a majority doesn't need to be processed instantaneously in the interest of being able to do the stuff that does need rapid response and allow for everything one wants to automate. I mention all this believing you have influence in what PG3 (and possibly ISY) is/will be capable of doing, but if I'm barking up the wrong tree, let me know. Thanks again.
bpwwer Posted July 22, 2022 Posted July 22, 2022 1 hour ago, johnnyt said: I may try what you're suggesting but I'm pretty sure I'll find there's no problem if I simply turn stuff off. I think the situation is just that there's too much being asked of the 994i. Hopefully IoP will resolve that. For now I've turned off WeatherFlow NS as it can send a lot of data in quick succession and I haven't really begun to leverage that data as I just got the device and NS a few weeks ago. So far so good. Bottom line, though, it seems I'm firmly at a point where the only thing that works right now is to not automate stuff I'd like to automate. I only suggest that to try and determine if there is something specific your doing or using that causes this. It may well be that the ISY is just overloaded. It has limited processing power and fairly limited TCP/IP stack. The IoP will certainly help as it has significantly more processing power. 1 hour ago, johnnyt said: In the interest of making the Polisy/IoP able to handle problems when they happen (while the humans are at work, sleeping or away) as well as handle the workload demands of up to 2000 programs, I wonder if the following features could be added to PG3 and ISY, where applicable: (I know you don't control either but believe you have influence with the ones who do) ability for ISY to stop/start/restart individual node servers. Not just for working around problems, but also consider that maybe some NS data/functionality is not needed while sleeping or on vacation, while some is only needed when sleeping or away ability to stagger the startup of NS' using a configurable delay by NS/slot or as a single global setting. Ideally there would be a way to select NS startup order, but as a starting point just the ability to manage the bombardment of ISY at startup would be good. ability to throttle the messages being sent to ISY, e.g. make all NS' put stuff for delivery to ISY on a single PG3 queue (if that isn't already the case) and set a configurable delay between each item being sent to ISY. Retries would see the item being added to bottom of queue rather than compounding what may be a processing capacity issue. Ideally there would be a way to set a message (or NS) priority as some NS data is needed faster than other. Also, there would be a way to adjust the throttle via an ISY program. I think this feature could potentially negate the need for #2 as ISY would be able to adjust throttle to a very high delay between messages when it's restarting. I have a lot of influence over what is done in PG3 as I'm the only developer working on it. But not much influence over the ISY software. #1 PG3 does support the ability for a node server to restart itself as that's currently the only way it can be implemented. The ISY doesn't have any built in methods to control node servers and it would be a pretty big project to add that. We can discuss this, but I'll make no promises. #2 is a possibility. Right now, I don't know of anyone else having problems with the current startup process and there are folks that run quite a few node servers. I've had near 25 starting and have not seen the issues you have. Of course this is with IoP with no programs running on it. Because node servers can vary quite a bit in how long they take to initialize, it may be hard to tune something like this. It can't be too long, a couple of minutes and you could have systems that take 30-60 minutes to start. A few seconds may or may not help. #3 is already being done. PG3 has both a per-node server queue and a global queue for messages being sent to the ISY with retries if they fail. In your case, the ISY either doesn't respond (which will cause PG3 to give up) or it is responding with an error that indicates a retry wouldn't help. You log did not show any cases where PG3 was retrying requests. It didn't look like there was enough traffic to trigger throttling. I do understand about all the programming needed to deal with HVAC. For a while I had a lot of programming for mine that took into account windows being open, inside vs. outside temperature diff, etc. The data was coming from WeatherFlow, my alarm sensors, and the thermostat. It controlled the thermostat and an whole house fan. I disabling most of it at one point when a window sensor failed and I couldn't get the whole house fan to turn off.
Recommended Posts