johnnyt Posted August 15, 2022

Hi @JimboAutomates, would it be possible to provide a configurable delay between device reports to ISY? Meaning the NS would poll Airthings and then send the data for each device (6 sensors per device) to ISY with a configurable delay between devices. I have 7 devices, and the update process can take 12-13 seconds under normal circumstances. An eternity in computer terms.

I only request updates every 4 minutes (which I think should be the default, BTW) to avoid exceeding the Airthings rate limit, and because updates to the cloud from each device are 5 minutes apart (though not all at the same time, of course). Technically that's 42 sensor readings. Scanning my log files, I can see that not all 42 sensors typically end up being updated every 4 minutes, but most queries update 30 or more sensor readings.

Now, I do have an overloaded ISY (with close to 1000 programs) and might be the only one, or one of a few (which really surprises me), but please consider adding this anyway next time you "open up" the NS. The data does not have to hit ISY in real time, and allowing ISY to do other stuff in between device readings, if needed, is generally a good idea, IMO.

I'm also in the process of planning an Indoor Air Quality automation package to resell a combination of Airthings sensors with a pre-programmed Polisy/IoP and your NS to capture data and automate HVAC functions. I'm just waiting to see if the next few months bring major improvements in ISY Z-Wave support (and Matter, I'm reading) and, for my own use, the long-awaited Z-Wave migration tool from 994i to IoP, to confirm a number of things.
Jimbo.Automates Posted August 15, 2022

Hi, updating the devices puts very little load on the Polisy; all processes in PG3 are threaded, so they don't even block each other. I have 19 node servers running, with a lot of devices and programs, with no issues. But the query of all devices is run on each short poll, so just changing the short poll and restarting the node server controls how often it is done. I also emailed their support and asked them to provide one API call to get data for all devices, and they said that will not be done for Consumer devices.
johnnyt (Author) Posted August 15, 2022

21 minutes ago, JimboAutomates said:
Hi, updating the devices puts very little load on the Polisy; all processes in PG3 are threaded, so they don't even block each other. I have 19 node servers running, with a lot of devices and programs, with no issues. But the query of all devices is run on each short poll, so just changing the short poll and restarting the node server controls how often it is done. I also emailed their support and asked them to provide one API call to get data for all devices, and they said that will not be done for Consumer devices.

So I can see (and I think Bob explained) that all NS events go into a single PG3 queue. That could certainly help manage the push of data to IoP, but do IoP events also end up on that queue? If not, IoP could be really busy handling its own events and then have PG3 bombarding it with a steady stream of data (which triggers more ISY events, which is the problem, not the processing of getting the data itself) for an extended period of time, with no breaks or slowdowns, no?

I recently turned the WeatherFlow NS back on after turning it and ST-Inventory off due to multiple ISY overloads, to the point of crashing/rebooting, when I first moved to PG3 and added it, ST-Inventory (which I now leave off), and the Airthings NS. See this post for all the gory details. Since I've turned WeatherFlow back on, I've noticed most of my ISY lighting events are delayed, e.g. a motion sensor that triggers lights takes forever to do its thing. WeatherFlow is a very chatty NS and also takes 10-12 seconds to update all the data (not blaming the NS for all the data). I just saw an opportunity, thanks to multiple Airthings devices, to give ISY a break from what is my second most chatty NS. If the Airthings NS needs to call for each device's data one device at a time anyway, that would be an easy place to put in a break, I would think.
That said, I get the point that if everything ends up on the PG3 queue and there is a lot of data on the queue, a break might not make a difference. Perhaps that means I need to ask Bob to consider adding a configurable "throttle" for PG3 queue processing? Of course, that's not good for things where quick responses are desired, so then we would need to get into message priority settings per NS. The idea of putting a delay in the Airthings NS was to make it kind of run at a lower priority, or at least a lower speed.

For sure, I don't know how PG3 and IoP work, separately and together, so I'm no doubt missing a fair amount of the details that may make my suggestions completely 'out to lunch', but I'm throwing them out there anyway in case they lead to some tweaks for folks like me who love their ISY and have given/want to give it as much work as possible. I've got plenty more ideas I want to throw ISY's way. I'm just waiting to be able to move to IoP so I can get past the 1000-program limit of the 994i. I just hope I won't be overloading IoP too.
Jimbo.Automates Posted August 16, 2022

>> but do IoP events also end up on that queue
IoP events do not go on the PG3 queue; it can talk to itself directly. I have WeatherFlow running as well, with Rapid Wind disabled so it sends less data. I don't notice any delays; I have lights triggered based on doors opening using the ELK node server, and it happens very fast. I've heard of issues with ST-Inventory filling up ISY subscription connections, but not crashing it.

>> If the Airthings NS needs to call for each device's data one device at a time anyway, that would be an easy place to put in a break
Each query to the Airthings server takes a little time, so that is a built-in break between each one. I could add an optional user delay in between each one, but that will take some effort to avoid other potential issues it may cause with node server functionality. As I mentioned, I have 19 node servers running, and a couple are very busy, like WirelessTags, and I have not noticed issues with ISY on Polisy.
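For what it's worth, the optional per-device delay being discussed is conceptually tiny. A minimal sketch of the idea, assuming a hypothetical `query_fn` that makes one Airthings cloud call per device (this is not the actual node server code, just an illustration of where the pause would go):

```python
import time

def query_all_devices(devices, query_fn, delay=0.0):
    """Query each device in turn, pausing `delay` seconds between devices
    so the resulting ISY updates trickle in rather than arrive in one burst.
    delay=0.0 preserves the current all-at-once behavior."""
    results = {}
    for i, dev in enumerate(devices):
        results[dev] = query_fn(dev)      # one cloud API call per device
        if delay > 0 and i < len(devices) - 1:
            time.sleep(delay)             # breathing room for the ISY
    return results
```

With 7 devices and a 1-second delay, a full poll would stretch by only 6 seconds, which seems acceptable for data that the cloud itself only refreshes every 5 minutes.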
bpwwer Posted August 16, 2022

PG3 has multiple queues and a queue hierarchy. Each node server has its own queue, and those queues feed into a common queue. It's mostly the common queue that will determine how fast requests are sent to the ISY. It is certainly possible to push the ISY beyond what it is capable of handling, and it seems like you have. If it is really taking 10-12 seconds to update just the WeatherFlow data, then I think you're pushing the ISY way too hard. The WeatherFlow hub sends out data once a minute (that's not configurable), and on mine, all of the data that needs updating is handled within 1 second.

You can look at the PG3 log for lines like:

ISY Response: [Try: 1] [00:0d:b9:4e:3a:44] :: [200] :: 5.08479ms .....

Those lines provide information on:

1) How many tries it took before the ISY accepted and responded to the request. PG3 will make up to 3 attempts. If you're seeing a lot of 2- or 3-try responses, that's because the ISY is dropping the requests PG3 is sending.

2) How long it took the ISY to respond to the request. The example above took 5ms. The time it takes the ISY to respond should correlate with how busy the ISY is.

It may be interesting to look for periods where it takes a long time to respond and see if you can determine what the ISY was doing (from the ISY log/event viewer) at those points in time. That may give some insight into what's really causing the issues.
johnnyt (Author) Posted August 16, 2022

Darn, I changed the short poll on WeatherFlow from 60 to 150 thinking it would delay updates to every 2 1/2 minutes, but I now see it didn't have the anticipated effect. Any chance you could change that, @bpwwer? Attached are sample log entries for both WeatherFlow and Airthings for about the last hour.

I downloaded the PG3 log, which only includes today's entries, and searched it for Try: 2 and Try: 3. I only found one Try: 2, but all the tries (Try: 1) I visually scanned are taking hundreds and even thousands of milliseconds. Here's a sample:

8/15/2022, 24:00:18 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 448.735586ms
8/15/2022, 24:00:19 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 723.250318ms
8/15/2022, 24:00:19 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 1021.064203ms
8/15/2022, 24:00:25 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 703.886677ms
8/15/2022, 24:00:25 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 743.913189ms
8/15/2022, 24:00:26 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 1055.351757ms
8/15/2022, 24:00:26 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 1103.256553ms

Couldn't attach the PG3 log here, I think because it exceeds the max for an attachment (it's about 1.5M compressed), but if it would help, I can grab a chunk and provide it.

WeatherflowAirthingsLogEntries-sample.zip
bpwwer Posted August 16, 2022

@johnnyt I guess you missed it or I wasn't clear. The WeatherFlow hub sends the data every minute; that's not configurable in the hub or in the node server, it's just the way it works. You can configure how often the WeatherFlow node server queries for forecast data, and that's it.

Since I don't have any PG3 node servers running on a 994i, I don't have any way to directly compare, but those times seem very long for a response. On my IoP, I will occasionally see requests take 200-300ms, but most of the time it's 4 or 5ms. From the log snippet, it seems like the ISY is always busy and responding slowly. There's not really anything the node servers or PG3 can do to make the ISY respond faster. Slowing down node servers or having them send less data won't really help; it doesn't appear that they are the cause.
johnnyt (Author) Posted August 16, 2022

2 hours ago, bpwwer said:
@johnnyt I guess you missed it or I wasn't clear. The WeatherFlow hub sends the data every minute; that's not configurable in the hub or in the node server, it's just the way it works. You can configure how often the WeatherFlow node server queries for forecast data, and that's it. Since I don't have any PG3 node servers running on a 994i, I don't have any way to directly compare, but those times seem very long for a response. On my IoP, I will occasionally see requests take 200-300ms, but most of the time it's 4 or 5ms. From the log snippet, it seems like the ISY is always busy and responding slowly. There's not really anything the node servers or PG3 can do to make the ISY respond faster. Slowing down node servers or having them send less data won't really help; it doesn't appear that they are the cause.

@bpwwer No, I didn't miss your mention of the Tempest hub feeding the data to the NS every 60 seconds. I just thought there might be the ability in the NS to control when it sends the data to ISY, i.e. store the hub data in an NS table, then every x seconds (or whatever the user sets), potentially with y seconds of delay in between, send the data to ISY. I don't know what's possible and not possible, so I'm just asking. Sorry for my ignorance; I didn't mean to offend or annoy.

I'm hopeful the move to IoP will significantly improve things. I am, however, preparing myself for the possibility that I may need a second HA controller. I'm sure separating my HVAC from my lighting would solve my problems even if I stayed with the 994i, but two controllers do have several drawbacks and added costs. Thanks for all the info you've been patiently providing.
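To make the store-then-send idea concrete, here is a rough sketch of what johnnyt is describing: cache the latest hub reading per sensor and only forward a snapshot to the ISY when a timer fires. This is purely hypothetical and not how the WeatherFlow NS is actually written; `send_fn` stands in for whatever callback a node server would use to push a driver value.

```python
import time
import threading

class ThrottledReporter:
    """Cache the latest value per sensor; forward to the ISY only on flush().

    Newer hub readings overwrite older unsent ones, so the ISY sees at most
    one update per sensor per flush interval instead of every hub packet."""

    def __init__(self, send_fn, per_item_delay=0.0):
        self.send_fn = send_fn              # e.g. a setDriver-style callback
        self.per_item_delay = per_item_delay
        self._latest = {}
        self._lock = threading.Lock()

    def update(self, sensor, value):
        """Called for every incoming hub packet; cheap, no ISY traffic."""
        with self._lock:
            self._latest[sensor] = value

    def flush(self):
        """Called on a timer (e.g. every short poll); sends one snapshot."""
        with self._lock:
            snapshot, self._latest = self._latest, {}
        for sensor, value in snapshot.items():
            self.send_fn(sensor, value)
            if self.per_item_delay > 0:
                time.sleep(self.per_item_delay)  # optional pause between sends
```

The trade-off is latency: a reading can sit in the cache for up to one flush interval before the ISY sees it, which is the whole point for slow-moving weather data but would be wrong for motion sensors.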
bpwwer Posted August 16, 2022

While anything is possible, it's just not practical to try to reduce the data sent by the node server. WeatherFlow may be one of your more prolific node servers, but it is far from the most prolific. I've created stress-test node servers that send far more data to the ISY than the WeatherFlow node server, probably at least 100x more, without problems. I don't believe your issues are solvable by changes to PG3 or the node servers. Perhaps there's a way to profile what the ISY is doing, to isolate what's really causing it to be so busy and re-work that; I don't know. Without data to determine what the ISY is really doing that causes it to be so slow in responding, there's not much we can do about it. Maybe @Michel Kohanim has a way to debug issues like this, or extended log info that would help.
johnnyt (Author) Posted August 17, 2022

17 hours ago, bpwwer said:
While anything is possible, it's just not practical to try to reduce the data sent by the node server. WeatherFlow may be one of your more prolific node servers, but it is far from the most prolific. I've created stress-test node servers that send far more data to the ISY than the WeatherFlow node server, probably at least 100x more, without problems. I don't believe your issues are solvable by changes to PG3 or the node servers. Perhaps there's a way to profile what the ISY is doing, to isolate what's really causing it to be so busy and re-work that; I don't know. Without data to determine what the ISY is really doing that causes it to be so slow in responding, there's not much we can do about it. Maybe @Michel Kohanim has a way to debug issues like this, or extended log info that would help.

I wouldn't want @Michel Kohanim/UDI to spend any time trying to fix problems that are simply horsepower related (i.e. not enough of it on the 994i). I'm very keen to move to IoP and 700-series Z-Wave for the added horsepower, and because a number of my Z-Wave devices are not working well and the vendor is basically saying it's a problem with my controller. So getting the latest dongle and the Z-Wave migration tool are the top priority. That said, if UDI (or anyone else) knows IoP will not fix my issues, I'd like to know sooner rather than later. I'm guessing, though, the only way to know is for me to move to IoP and find out.

One of the things that hasn't been discussed so far, and that you might be able to provide insight on, is the delay inherent in communicating with the various devices (Insteon and Z-Wave) and how that may (or won't) cause bottlenecks/performance issues in dealing with NSs using IoP.
No one can change the fact that device communication is slow and can be a relative eternity if there are weaknesses in the network connections, or lots of incoming and outgoing traffic at the same time. I think the waiting for/processing of Insteon and Z-Wave events is a factor in my performance problems using the 994i. How much of a factor might this be when I move to IoP? Is IoP going to be able to handle all PG3 traffic even while it waits separately for Insteon or Z-Wave events to finish?

I've heard lots of mention of "I have tons of NSs and no problem," but no one (elsewhere either) has yet said "I have (as many or more than) 177 Insteon devices and 128 Z-Wave devices and a ton of NSs using a 994i, and I'm not having any problems." Throw in mention of 900+ programs, and that's when I'll know someone's experience is relevant to my situation. It's clear to me I'm an outlier in terms of the demand I put on ISY, so I'm not asking you to fix stuff that would be just for me. I might, however, be a canary in a coal mine. I hope so, because I'd like to see Polisy gain huge popularity and end up being put to heavy use like mine by many. Thanks again for having patiently addressed my questions and comments.
bpwwer Posted August 17, 2022

I understand that you have a very large system, and it's possible that no one else has built one out to the same level. Unfortunately, folks won't normally be monitoring threads like this if they aren't having issues, so we may never know.

My point is that the current assumption is that the box is just overloaded and something with more "horsepower" will work better. But what if that assumption is false? What if you have one program out of those 900+ that's doing something that takes 90% of the CPU resources and causes everything else to wait, and changing that one program makes everything better? What if there's a bug in the ISY firmware that you just happen to trigger with one of your programs, causing the ISY to sit in a busy loop for 30 seconds over and over? I like to know the root cause before trying to come up with solutions.

It may be that you've simply overloaded the box, but Insteon, Z-Wave, and node server devices typically put a very low resource drain on the system because of the communication delays. In the time it takes an Insteon device to receive a command and return a status response, the ISY can likely process hundreds of TCP network requests. But we're seeing the ISY take 100-200x the normal time to respond to TCP requests. What is it doing for all that time?
apostolakisl Posted August 17, 2022

@johnnyt I would start by watching your program summary page and looking for programs that are actively running. With 900 programs, that will be many screens of programs, so it's tough. But I really doubt the ISY is hanging on device communications; I suspect it is a program or group of programs. You could also go through and disable folders of programs and see what happens. Maybe disable 1/4 of your programs at a time, then drill down.
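If clicking through folders in the Admin Console gets tedious, the bisection can also be scripted against the ISY's REST interface, which (as I understand it) can enable or disable a program or folder by ID via GET /rest/programs/&lt;id&gt;/enable or .../disable with basic auth. A tiny sketch of the bookkeeping side; the endpoint layout should be double-checked against your firmware's REST documentation, and the IDs below are made up:

```python
def program_cmd_url(base, program_id, enable):
    """Build the ISY REST URL that enables or disables one program/folder."""
    cmd = "enable" if enable else "disable"
    return f"{base}/rest/programs/{program_id}/{cmd}"

def bisect_batches(program_ids, batch_size):
    """Split program IDs into batches so you can disable one batch at a
    time, watch the PG3 response times, and narrow down the culprit."""
    return [program_ids[i:i + batch_size]
            for i in range(0, len(program_ids), batch_size)]
```

Each URL would then be fetched with a basic-auth GET (e.g. curl -u admin:pass followed by the URL).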
Jimbo.Automates Posted August 17, 2022

Also, make sure it's not trying to constantly write updates to wireless devices; that can make the ISY very slow if there are a lot of them. I disable that option on each startup.
johnnyt (Author) Posted August 17, 2022

Yes, programs are certainly part of the problem. Don't think I haven't looked at that. I've got the Admin Console open all the time. It takes so long to start (5 minutes, I think) that I don't close it, despite having been told multiple times it's not designed for that. It crashes sometimes, but generally runs for me for several days.

I've re-written entire groups of programs to try to streamline/manage the demand. I add lots of waits of different lengths in my programs. I realize that isn't a silver bullet, since simply delaying an overload isn't preventing it, but I did use to see race conditions in the past that made it hard just to start the AC, and I haven't seen one of those for well over a year, maybe two. I created a sDISABLEPROGRAMS.all variable, set on the top folder, that I can call with REST to get out of trouble if needed, though even that command did not always succeed in the worst cases. I also have SDISABLEPROGRAMS.lots that are peppered over several folders to help manage startups.

Yes, I had the crashes/reboots that occurred after moving to PG3 and adding WeatherFlow, Airthings, and ST-Inventory, but that was an unusual thing. I've never had 3 crashes in 2 days before like I did in this case. It hasn't happened since I turned off ST-Inventory, and I can manually run ST-Inventory until it updates and things don't crash.

It hasn't really helped to disable a group of programs; I've tried that. Yes, it reduces the load, but then I've just cut off a whole group of functionality; it just takes off a load that removing any group of programs would. Certainly I could simply do less with ISY, which is why I'm thinking of getting a second controller. Probably another ISY, but I won't lie, it's got me thinking maybe I should get something else. The next month or two will be critical to this decision. I don't usually get crashes any more, just some things that don't work or are delayed.
I see socket failure messages in the AC at random, and full queues in the error log at startup (a process which I've programmed as much as possible to stagger events and programs, but I can only do so much).

That said, you gave me a clue as to what to look for that I didn't have before, or, more accurately stated, that I didn't know/think about before: the time it takes for PG3 to process requests. While it's not inside-ISY stuff, it could provide clues, so maybe I could get useful info by disabling groups of programs.

What I think would be nice is to have the event viewer entries saved to SD like the log/error log, so I could see what's happening just before a crash. Are they, and I just don't know where to get them? That said, the log and error log rarely help after crashes, so perhaps logging is the first thing to go during an overload situation, meaning I wouldn't get any info from an event viewer log either.

Another thing I just thought of would be having the PG3 response time in an ISY device that I could save in a variable and periodically append to a web page with a timestamp. Then I could link PG3 load entries, log entries, error log entries, and event viewer entries to maybe get more info to help determine the root cause. To date, my issue is that more often than not I can't determine the root cause of specific failures.
johnnyt (Author) Posted August 17, 2022

2 hours ago, JimboAutomates said:
Also, make sure it's not trying to constantly write updates to wireless devices; that can make the ISY very slow if there are a lot of them. I disable that option on each startup.

Yes, I've seen this issue after putting in / restoring a new PLM, until I can go over to every battery device and "Write Updates to Device". The only way I know to disable that is by doing it in the AC. Are you referring to a way of doing this in a program that I could run first thing at startup?
bpwwer Posted August 17, 2022

Pulling the ISY response times from the PG3 logs isn't that hard. Someone better at unix shell scripting than me could probably whip up a script that dumps just the timestamp and response time into a file that could then be imported into a spreadsheet to graph it. Just seeing how it changes over time may provide some insight: e.g. if every day at 4pm the responses get slower, then you'd know to look at programs that start at 4pm. PG3 saves 2 weeks' worth of daily logs, so you do have some historical data available to correlate with other events/logs.

You can dump all the response time entries for the current day with the command:

grep "ISY Response" /var/polyglot/pg3/logs/pg3-current.log

Or, to get just the time and response time:

grep "ISY Response" /var/polyglot/pg3/logs/pg3-current.log | cut -d ' ' -f 1,2,13

This is from a ssh shell on the Polisy running PG3.
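If shell isn't your thing, the same extraction is easy in Python. A sketch that pulls the timestamp, try count, and response time out of each "ISY Response" line; the regex is written against the log lines quoted earlier in this thread, so adjust it if your PG3 version formats them differently:

```python
import re

# Matches PG3 log lines like:
# 8/15/2022, 24:00:18 [pg3] debug: ISY Response: [Try: 1] [00:21:b9:02:55:cd] :: [200] :: 448.735586ms
LINE_RE = re.compile(
    r"^(?P<ts>\S+\s+\S+).*ISY Response: \[Try: (?P<try>\d+)\].*:: (?P<ms>[\d.]+)ms"
)

def response_times(lines):
    """Yield (timestamp, tries, milliseconds) for each ISY Response line."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            yield m["ts"], int(m["try"]), float(m["ms"])
```

Writing those tuples out as CSV gives a file that graphs nicely in a spreadsheet, which makes the slow-period hunting much easier than eyeballing the raw log.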