oskrypuch
Everything posted by oskrypuch
-
Why did multiple devices become unresponsive?
And, I'll keep the device event viewer/log rolling for now. Orest
-
Why did multiple devices become unresponsive?
Well then, thanks to @kclenden too! And, thanks to @Guy Lavoie and @paulbates who helped get the ball rolling. The time frame of that errant code insertion, does match up with the observed faulting issue, so good circumstantial evidence. I'm going to sit with just the few devices I have restored for now, and confirm that they stay fully working -- a day should do it. There are only 32 PLM table entries right now, so easy to download/compare. If no further resets/issues, will do a full PLM restore, and as you suggest, follow that up with a DEVICE RESTORE on each device, for good measure! Thanks again. And, will report back. Orest
-
Why did multiple devices become unresponsive?
@IndyMike I am so grateful for your time and expertise on this! I think we are well on the way to solving this issue. Having the PLM (errantly) reset from time to time would explain everything we are seeing. Do you feel there is a good chance that this comm overload triggered the errant reception and reset? The thermostat polling every thirty seconds was not intentional. There was a (recent) bit of code I added, and it looks like it got into a loop from time to time. I've squashed that, and checking the device comm log, the severe polling of the stats has stopped. I also monitor stats that are calling for heat and cool to "count" the time of operation, but this is not new, and it executes once a minute, only if a stat is calling. For good measure I've disabled this for now. If I re-enable it, I will move it out to five-minute intervals, which is accurate enough. Runaway code is obviously a bad thing in general, to say nothing of triggering an ALL-ON issue! As a test, I have DEVICE RESTORED some usually affected devices, and saved out the PLM TABLE for comparison/reference. And now I know precisely what command to look for in the log, if I see that fault again. If that sorts it out, I'm just a PLM restore away from being back to normal! Will report back, and happy to hear any other suggestions you may have. Orest
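Moving the run-time counter from once a minute to five-minute intervals amounts to putting a minimum-interval guard in front of the poll. A minimal sketch of that idea in Python (this is not ISY program syntax; the `PollThrottle` name and the 300 second figure are illustrative assumptions), showing how such a guard would keep a looping caller from flooding the PLM:

```python
import time


class PollThrottle:
    """Allow an action at most once per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the guard can be tested without waiting
        self._last = None        # time of the last allowed poll, None if never polled

    def allow(self):
        """Return True if enough time has passed since the last allowed poll."""
        now = self._clock()
        if self._last is None or now - self._last >= self.min_interval:
            self._last = now
            return True
        return False


# Usage: even if the calling code loops, the poll itself runs at most every 5 minutes.
throttle = PollThrottle(min_interval=300)
if throttle.allow():
    pass  # query the thermostat here (the query call itself is not shown)
```

With a guard like this, a runaway loop still spins, but it cannot generate Insteon traffic faster than the chosen interval.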
-
Why did multiple devices become unresponsive?
... and that single DEVICE RESTORE restored the function of that one device. So, I am thinking that the 70 or so entries of the PLM table (copy uploaded) really just represent the PLM entries from the few devices I had just DEVICE RESTOREd. Given the number of devices I have, I would expect to see many hundreds of PLM entries. But what is causing the mass clearing of the PLM table!? Is a failing Polisy the problem? Orest P.S. And, there are NO [ 02 6F ] entries in the log.
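Checking a saved event log for a particular command, such as the [ 02 6F ] mentioned in the P.S., can be done with a small text search rather than by eye. A rough sketch in Python, assuming only that the log is plain text with bytes written as two-digit hex separated by spaces; the function name and the sample lines are made up for illustration and do not reproduce the real log layout:

```python
import re


def find_command(lines, command="02 6F"):
    """Return (line_number, text) for every line containing the byte sequence.

    Bytes are matched case-insensitively, and the match must not be part of a
    longer hex run (so "102 6F" or "02 6F1" do not count).
    """
    tokens = command.split()
    pattern = re.compile(
        r"(?<![0-9A-Fa-f])"
        + r"\s+".join(re.escape(tok) for tok in tokens)
        + r"(?![0-9A-Fa-f])",
        re.IGNORECASE,
    )
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


# Usage with made-up lines (not copied from a real log):
sample = [
    "10:00:01 : [ 02 6F 40 ] example entry",
    "10:00:02 : 02 62 1A 2B 3C example entry",
]
for number, text in find_command(sample):
    print(number, text)
```

To run it against a saved log, pass an open file object instead of the list; the file name would be whatever the event viewer export was saved as.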
-
Why did multiple devices become unresponsive?
@Guy Lavoie Correct, to be absolutely clear, all scheduled commands from the Polisy to devices are working ok. -- AND, the fault trigger just occurred, all devices can no longer transmit, including the ones that I just recently DEVICE RESTORED. AND, I looked at the PLM table, it was BLANK! Hit START a number of times, no change. I then DEVICE RESTOREd one device, and some PLM links now show up in the PLM table, eleven of them. Well, if one device gives eleven links, clearly the PLM table was way underpopulated, even when I thought it was "full". DEVICE RESTORE a second device, now there are 29 entries in the PLM. Something is clearly "rotten" in the PLM table (or access to the PLM table by the Polisy) when the fault occurs, and that is shutting down devices communicating!! @IndyMike Attached is the original 70 entry PLM table, which is likely "light", and the event log that covers today, including the period when the fault occurred, which appears to be an emptying of the PLM table!
PLM Links Table.v5.4.4__Sat 2026.06.20 11.50.29 AM.xml
ISY-Events-Log.v5.4.4__Sat 2026.06.20 02.05.23 PM.txt
-
Why did multiple devices become unresponsive?
For a temporal overview of this ...
1) State: normal, all working -> some kind of trigger occurs (seems to happen every day or so, now)
2) State: all devices can no longer successfully transmit packets to the PLM/Polisy, no other functions affected -> individual RESTORE DEVICE command on a given device
3) State: that one given device is restored to normal function, all other devices still remain faulted -> some kind of trigger occurs, again
4) State: ALL devices once again can no longer transmit to the PLM/Polisy.
Orest
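The four states above can be written down as a tiny model, which makes it easier to say exactly what each observation rules in or out. This is only a sketch of the observed behaviour in Python, not of anything inside the PLM or Polisy; the class name, method names and device names are all invented for illustration:

```python
class FaultModel:
    """Tracks which devices the controller can currently hear (observed behaviour only)."""

    def __init__(self, devices):
        self.devices = set(devices)
        self.heard = set(devices)   # state 1: normal, every device is heard

    def trigger(self):
        """The unknown trigger: every device stops being heard at once."""
        self.heard.clear()

    def restore_device(self, device):
        """RESTORE DEVICE on one device brings back that device only."""
        if device not in self.devices:
            raise KeyError(device)
        self.heard.add(device)

    def state(self):
        if self.heard == self.devices:
            return "normal"
        if not self.heard:
            return "all faulted"
        return "partially restored"


# Walking through states 1) to 4):
model = FaultModel(["keypad", "switch", "motion"])
model.trigger()                  # 1) -> 2)
model.restore_device("keypad")   # 2) -> 3)
model.trigger()                  # 3) -> 4)
print(model.state())
```

The point of the model is that the trigger acts on everything at once while the restore acts on one device at a time, which is what points at something shared rather than at the individual devices.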
-
Why did multiple devices become unresponsive?
@IndyMike Do not apologize for a loooong post, I love it! The PLM table shows 70 entries (table is now saved out), presume that means it sees 70 Insteon devices. That is about right, there are some more ZWave devices as well which obviously are not reflected there. Cable - this fault has survived through two PLMs, and also through two different cables, one the DB9 serial cable, and now the USB cable, so that eliminates cable, connectors and plugs as an issue. A second PLM restore (or even just a serial device restore on each device) is on the plate here for sure. I obviously did one PLM restore, when I installed the new 2413U PLM. For now, I once again device restored one 2334 keypad (yesterday) to monitor, it is still working some 14 hrs later, will see if a full 24 hours results in the fault. Also, today I device restored two more 2334's, four 2477S switches and one 2844 motion. Why those? ... The 2334s partly display scene status (that works fine), but some of the buttons also trigger programs so send a packet to the PLM/Polisy. The 2477S manual switch status is used as a logical trigger for some programs, so that requires status transmission to the PLM/Polisy. And, the motion of course sends packets to the PLM/Polisy for action. Those were chosen as a sampling, as anything that requires a packet from a device, to tx to the PLM and the Polisy, fails when the fault occurs. Once device restored, they resumed full normal functioning, for now. As noted, I have now saved out the PLM table, 70 entries, but curiously no NULL end of list item, should there be one? Perhaps I don't have the full listing captured, but I did try a few times. I am running the event viewer in mode 3, there will be a lot of data there! The one huge advantage with this otherwise annoying issue is that it is 100% consistent, so potentially can be debugged.
100% consistent: when the fault state occurs (no idea what triggers this), all devices in the system that need to transmit a packet to the PLM, and then Polisy, for action don't, or the packet is lost/dropped in transmission. I was watching the event viewer, and pushing buttons on devices when faulted shows no device activity, which makes sense. Because all devices that use this mode (Tx) lose this ability all at once, as has been suggested, it surely has to be something upstream, the PLM or Polisy. Also 100% consistent: once all devices are faulted, a DEVICE RESTORE on a given device will restore it to normal function (for a while, until the next "trigger"), and it will now get its transmission packet received by the Polisy, which then carries out its action. Nothing else that I have been able to discern is affected. The system otherwise is working. Programs are running, executing commands, scenes external to the Polisy continue to work, as does direct control of lights from manual operation of switches. The one obvious exception to programs running properly during the fault is those that rely on a report of state of devices (switch positions, motions, etc.), where the logic may fail as the state is not correctly updated at the PLM/Polisy. And using the UD mobile app to trigger programs works fine, as a surrogate for example, of a curtain open/close button push of a 2334 keypad. But, you would expect that. Will report again tomorrow, and post the logs. Orest
-
Why did multiple devices become unresponsive?
Although I thought earlier, that the change in the device link table was pertinent, presently, with no change to the link table the fault occurs. There were some stray and duplicated entries previously, but they were not significant to this issue. And now (with the device link table "fixed"), the device restore function doesn't actually end up changing anything in the device's link table, yet results in an immediate restoration of the function of the device. I understand in a global sense what the device restore function does, but perhaps there is some internal data table or cache fix up that happens as well. Sure I am grasping at straws, to try to explain (and mitigate) what I am seeing. I just now have "device restored" one 8 button panel, it now works, for example one of the buttons opens/closes the curtain, button pushes from this device are now seen by Polisy and the correct program is triggered with effect. With that device restore the device link table did not change, so it must have done or triggered something else to happen. This fix up effect is immediate, and so far, 100% consistent -- once the fault occurs. I will be monitoring it for function as well as any change in the device link table. If it continues acting as devices have in the last few days, by tomorrow it will be no longer working. I'll also look at giving the Polisy a 24hr power off "rest", to see what happens. Orest
-
Why did multiple devices become unresponsive?
Just thinking, the other valuable data-point would be to know precisely what the RESTORE DEVICE command does, as it does reset the problem, even if only temporarily. But, for that, I think we would need to have the developers comment. @Michel Kohanim Orest
-
Why did multiple devices become unresponsive?
As I say, I'm game for anything, it is very frustrating and puzzling. We can do a dumb home for a day, it is half-dumb now! But, what is your thinking on this? To confirm/exclude that it is something the Polisy is doing or internally faulting, or that the Polisy is doing/faulting after running for a while? Orest
-
Why did multiple devices become unresponsive?
@IndyMike Thanks so much for explaining the tables. YES, 58.23.C7 is the old PLM. Not sure why it was not removed with the PLM replacement procedure, but perhaps that doesn't happen. But, regardless, having both control devices listed in there would not be part of the issue, as the problem existed before the new PLM replaced the old. FWIW, doing a device restore does remove that duplication in the table. But, having the cleaned up device link table doesn't make a difference either. Even though I did device restores on several devices as a test, cleaned up the tables, the problem recurred overnight in these devices, and the tables did not corrupt -- as you suggest, that shouldn't happen, and it didn't. So, boiling it down even further...
1) with this fault condition, any device that needs to transmit state to the PLM fails to do so.
2) performing a device restore on a given device alone immediately allows it to function again (temporarily), and information (a keypress, motion detected, etc.) is once again seen by the Polisy from the device
3) function is restored for less than 24 hours, and then it returns to fault state "1" above -- this point in particular is a real head scratcher to me -- this is probably the key to the puzzle, understanding why this happens
4) outward (Polisy -> device, ON/OFF etc.) controls of the devices work normally and are not affected
5) grouped scenes of devices, with one device a control device, and actions independent of the PLM/Polisy work normally
6) this is not related to the state of the device table, that was a canard, even though the tables were "fixed" as of today in those devices I was testing, the fault pattern recurred.
I do have one set of routines that changes the illumination of the LED keys on the 8 button panels, but only a few of them. The fault state affects EVERY device, wired, wireless, across the entire system, with the common point being that they transmit information/state to the Polisy.
I am now starting to wonder if there is a problem with the Polisy, perhaps some bad memory developing or similar, or a corrupted firmware, or both. And, somehow, the "device restore" cleans up something in a table in the Polisy as well. I'm really not sure where to go. I would hate to replace the Polisy (a huge undertaking), and then find out that was not the cause, but there seems little else to attack at the moment. A bit tongue in cheek, but if I did a device restore on every affected device, once every six hours or so, that might step around the issue! Does not seem practical, and not even sure how you would do that, clearly not possible to do for the wireless devices. Orest
-
New eisy lite
Interesting, I wondered if you could just plug it in to the new EISY. Would save a lot of work. Orest
-
New eisy lite
Looks like both are unavailable currently (aartech).
eisy r2 - $839 "out of stock"
eisy lite - $630 "on order with vendor"
I had been thinking of making the move to eisy r2 from the Polisy (5.4.4) -- but when is r3 coming out? ;-) The Insteon side is little more than a simple backup/restore, but I would also need to rebuild my 700 series USB dongle ZWave inventory at the same time, moving it to the new ZWave hardware, and there is no easy path for that. Orest
-
Why did multiple devices become unresponsive?
And, it is the same pattern with all the devices I've looked at. There are one or two extra lines in the device table, sometimes with no high water mark null line. Otherwise, all the lines match. A device restore updates the device table to match the ISY table, and then the device starts fully working (that is, its transmissions are then seen by Polisy). I've fixed a dozen or so, and will monitor to see if they "unfix", like earlier with the old PLM still in place. Orest
-
Why did multiple devices become unresponsive?
So, perhaps the old failing PLM, continually farkled up the device tables to some extent, taking a bit of time to corrupt most of the devices -- enough that device transmissions failed. A replacement of the PLM, just carried that bad data (sourced from the old PLM table) into the new PLM, much as @paulbates is suggesting. Finally, the device RESTORE (vice just the PLM replacement procedure), from the Polisy, WITH a new PLM in place finally fixes it as there is no new/further corruption process. Maybe wishful thinking, but so far the observations would be consistent. Fingers crossed, will report back. Orest
-
Why did multiple devices become unresponsive?
Poking around the link tables, here is a capture of the device and ISY tables for a given device, when it is not working correctly ... And here it is, AFTER a device RESTORE (as per above) of that device (and return to functioning) ... They match now, and it is the device that was changed, the last three entries were deleted, and a high water mark was added. I don't know exactly what that means, but probably someone here can decode that. Now, to see if the disparity in the tables (and loss of function) recurs. We might be getting somewhere. Orest
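Comparing the device table against the ISY table by eye gets tedious across many devices, so the comparison could be scripted. A rough sketch in Python, assuming each table has already been copied out as a list of text rows; the row format and addresses below are placeholders, not the real link record layout, and `diff_link_tables` is a made-up helper name:

```python
from collections import Counter


def normalise(row):
    """Upper-case a row and collapse its whitespace so cosmetic differences are ignored."""
    return " ".join(row.upper().split())


def diff_link_tables(device_rows, isy_rows):
    """Return (extra_in_device, missing_from_device) as sorted lists of rows.

    Counter is used instead of set so that duplicated entries show up as extras.
    """
    device = Counter(normalise(row) for row in device_rows)
    isy = Counter(normalise(row) for row in isy_rows)
    extra = sorted((device - isy).elements())
    missing = sorted((isy - device).elements())
    return extra, missing


# Usage with placeholder rows:
device_table = ["a2 00 AA.BB.CC", "A2 00 11.22.33", "A2 00 11.22.33"]
isy_table = ["A2 00 11.22.33", "00 00 00.00.00"]
extra, missing = diff_link_tables(device_table, isy_table)
print("extra in device:", extra)
print("missing from device:", missing)
```

Running the same comparison again after a day would show directly whether the disparity in the tables recurs along with the loss of function.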
-
Why did multiple devices become unresponsive?
When you developed a "similar" issue, with a gradual loss of your PLM, was that the ultimate solution for you? Orest
-
Why did multiple devices become unresponsive?
@paulbates It may well be that the PLM replacement function uses a different data table from the device restore, which restores both the device and PLM. Orest
-
Why did multiple devices become unresponsive?
Yes, worth a shot. The current outlet is an isolated circuit, straight to the panel, for small current electronics only. And it would be odd to hit only inbound PLM traffic, but we are into the Twilight Zone on this one. Orest
-
Why did multiple devices become unresponsive?
Thanks! An interesting idea, and I'm game for anything. So, Polisy has two sources of link data? That is, if it copies (possibly faulted) data from the PLM to its table, where would it get fresh "correct" data to restore with? The time lag, from working (when restored), to not is a little hard to explain. With the old modem in place, doing a RESTORE DEVICE, (always) worked but only for a day or so. But, I actually have done exactly that (RESTORE DEVICE) on several devices, with the new PLM in place, more just to restore their function! But, I was going to watch them to see when/if they stopped functioning. Will report back. Orest
-
Why did multiple devices become unresponsive?
So, in a nutshell, a large, mature Polisy install, with a few ZWave devices controlled by the old 700 series dongle. Just recently, noticed a fault, 100% consistent, and I've now defined it definitively:
* Any transmissions from peripheral Insteon devices do not reach the Polisy, and are not acted on. That includes both wireless and wired devices, so motions, remotes, but also panel buttons
* Transmissions between Insteon devices (shared scenes etc.) work fine
* Control of Insteon devices, that is Insteon device reception of signals from Polisy, works fine
* Activating programs or scenes from either the admin UI or the android app works fine
* ZWave devices are not affected, work 100%
* A DEVICE RESTORE of a given Insteon device will result in restoration of function, but only briefly, perhaps 12 to 24 hours.
Thought the PLM was going bad, replaced it with a new 2413U. Fully restored all functions, then bizarrely the next day back to the identical fault condition. I rebooted the system a couple more times, for good measure, that actually made no change. Network noise seems unlikely, as it is only one way communications that is an issue, and there is no delay in that, it is "crisp". Further, I used a remote button device within a couple of feet of the PLM, so the hop would have been wireless entirely, and it did not work. I'm not an expert on the event viewer, but from what I could see the system event viewer did not show any inbound activity when triggered, but normal outbound, reflecting what I am seeing with the system. It strikes me as unusual that it takes a while for the fault to recur, once temporarily mitigated, some sort of buffer over-run or something? Yet a reboot didn't help. @Guy Lavoie -- you were right to call this one an odd one, it may be the odd one of the year! I was thinking of refreshing the IOX install, but would have to find an old copy of 5.4.4, can't easily move beyond that as the OS in the Polisy is older.
The early ZWave controller has been a bit of an anchor in that regard, as those definitions would have to be all redone, if I update. Is the Polisy dying, bad memory, or some component thereof? Running out of things that could be at fault. Disheartening, on top of losing the function of the smart home niceties. Orest
-
Why did multiple devices become unresponsive?
"Smart" house is coming back alive. Still have to crawl around and get the wireless water sensors, but then it will be done. Just needed to replace the PLM. Bought a new 2413U, got it connected to the Polisy, and initialized as per the procedure. EDIT: Not great, identical problem recurred after 24 hours with the brand new PLM. I am back to square ONE.
-
Ghosted by Michel?
By "ISY", you mean to say the older ISY 994 platform, or the newer POLISY as well? Orest
-
Why did multiple devices become unresponsive?
A Polisy will still accept the DB9 serial adapter, from a 2413S, that is actually what I have currently. There was a serial/USB adapter for the Eisy as well, but not sure if it is available any more, either. But for sure, best to get a 2413U, rather than trying to ebay out a 2413S combo. Orest
-
Why did multiple devices become unresponsive?
@auger66 Well, I think you are in the "interlude", at the moment. 😉 I would suggest, if you have a recurrence, buy a new PLM and carefully do the full replacement procedure (as per above). OTOH, if it doesn't recur soon, if you have an older PLM (pre v2.4 or so), just get a new PLM anyway, and have it in inventory for the future. I have ordered a new 2413U, should be here in a couple of days, and will report back. For now my smart house is semi-dumb! It seems that the fault state affects any device where the system expects to hear from it with a Tx. Obviously that includes wireless motions, remotes and water detectors, but that also includes the 6/8 button panels and similar. And the fact that a device restore (temporarily) corrected the issue, to me, strongly suggests it is a hardware rather than logic/software issue. All points to the PLM, the usual weak link. Orest