-
Why did multiple devices become unresponsive?
And, I'll keep the device event viewer/log rolling for now. Orest
-
Why did multiple devices become unresponsive?
Well then, thanks to @kclenden too! And thanks to @Guy Lavoie and @paulbates, who helped get the ball rolling. The time frame of that errant code insertion does match up with the observed faulting issue, so that is good circumstantial evidence. I'm going to sit with just the few devices I have restored for now, and confirm that they stay fully working -- a day should do it. There are only 32 PLM table entries right now, so it is easy to download/compare. If there are no further resets/issues, I will do a full PLM restore and, as you suggest, follow that up with a DEVICE RESTORE on each device, for good measure! Thanks again. And, will report back. Orest
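For the download/compare step, a minimal Python sketch of the comparison might look like the following. It assumes each PLM table entry has already been reduced to a (device address, group, flags) tuple. The helper name, the tuple layout, and the sample addresses are all hypothetical, and parsing the saved XML is left out because its layout is not shown here.

```python
from typing import Iterable, List, Tuple

# Hypothetical record shape: (device address, group, link flags).
LinkEntry = Tuple[str, str, str]


def diff_link_tables(
    before: Iterable[LinkEntry], after: Iterable[LinkEntry]
) -> Tuple[List[LinkEntry], List[LinkEntry]]:
    """Return (missing, added) entries between two PLM table snapshots."""
    before_set, after_set = set(before), set(after)
    missing = sorted(before_set - after_set)  # in the saved copy, gone now
    added = sorted(after_set - before_set)    # present now, not in the saved copy
    return missing, added


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    saved = [("11.22.33", "01", "E2"), ("44.55.66", "01", "E2")]
    current = [("11.22.33", "01", "E2")]
    missing, added = diff_link_tables(saved, current)
    print("missing:", missing)
    print("added:", added)
```

With only a few dozen entries, an empty "missing" list after a day would suggest the table has not been cleared again.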
-
Why did multiple devices become unresponsive?
@IndyMike I am so grateful for your time and expertise on this! I think we are well on the way to solving this issue. Having the PLM (errantly) reset from time to time would explain everything we are seeing. Do you feel there is a good chance that this comm overload triggered the errant reception and reset? The every-thirty-seconds thermostat polling was not intentional. There was a (recent) bit of code I added, and it looks like it got into a loop from time to time. I've squashed that and, checking the device comm log, the severe polling of the stats has stopped. I also monitor stats that are calling for heat and cool to "count" the time of operation, but this is not new, and it executes once a minute, only if a stat is calling. For good measure I've disabled this for now. If I re-enable it, I will move it out to five-minute intervals; that is accurate enough. Runaway code is obviously a bad thing in general, to say nothing of triggering an ALL-ON issue! As a test, I have DEVICE RESTORED some usually affected devices, and saved out the PLM TABLE for comparison/reference. And now I know precisely what command to look for in the log, if I see that fault again. If that sorts it out, I'm just a PLM restore away from being back to normal! Will report back, and happy to hear any other suggestions you may have. Orest
-
Why did multiple devices become unresponsive?
... and that single DEVICE RESTORE restored the function of that one device. So, I am thinking that the 70 or so entries of the PLM table (copy uploaded) really just represent the PLM entries from the few devices I had just DEVICE RESTOREd. Given the number of devices I have, I would expect to see many hundreds of PLM entries. But, what is causing the mass clearing of the PLM table!? Is a failing Polisy the problem? Orest P.S. And, there are NO [ 02 6F ] entries in the log.
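The check for [ 02 6F ] entries can be repeated on each saved log with a small script. This is only a sketch: it assumes the saved event log is plain text and that command bytes appear as a bracketed, space-separated hex run like the one quoted above. The function name and the sample lines in the usage note are made up.

```python
import re
from typing import Iterable, List, Tuple


def find_command(lines: Iterable[str], command: str = "02 6F") -> List[Tuple[int, str]]:
    """Return (line number, text) for lines whose bracketed hex run starts with `command`."""
    # Build a pattern such as \[\s*02\s+6F\b so that spacing and letter case may vary.
    byte_pattern = r"\s+".join(re.escape(b) for b in command.split())
    pattern = re.compile(r"\[\s*" + byte_pattern + r"\b", re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
```

Usage would be along the lines of `with open(log_path, errors="replace") as f: hits = find_command(f)`, where `log_path` is whatever the saved event log is called. An empty list means the command never shows up at the start of a bracketed run in that log.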
-
Why did multiple devices become unresponsive?
@Guy Lavoie Correct. To be absolutely clear, all scheduled commands from the Polisy to devices are working OK. -- AND, the fault trigger just occurred: all devices can no longer transmit, including the ones that I just recently DEVICE RESTORED. AND, I looked at the PLM table, and it was BLANK! Hit START a number of times, no change. I then DEVICE RESTOREd one device, and some PLM links now show up in the PLM table, eleven of them. Well, if one device gives eleven links, clearly the PLM table was way underpopulated, even when I thought it was "full". DEVICE RESTORE a second device, and now there are 29 entries in the PLM. Something is clearly "rotten" in the PLM table (or in the Polisy's access to the PLM table) when the fault occurs, and that is shutting down devices communicating!! @IndyMike Attached is the original 70-entry PLM table, which is likely "light", and the event log that covers today, including the period when the fault occurred, which appears to be an emptying of the PLM table!
PLM Links Table.v5.4.4__Sat 2026.06.20 11.50.29 AM.xml
ISY-Events-Log.v5.4.4__Sat 2026.06.20 02.05.23 PM.txt
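To catch the moment the table empties, the entry count could be noted each time the PLM table is read and the series checked for a sudden fall. A rough Python sketch, assuming the counts are collected by hand; the function name and the 50% threshold are arbitrary choices, not anything the ISY software provides.

```python
from typing import List, Sequence, Tuple


def find_sudden_drops(
    counts: Sequence[int], max_loss_ratio: float = 0.5
) -> List[Tuple[int, int, int]]:
    """Return (index, previous, current) wherever the count falls by more than the ratio."""
    drops = []
    for i in range(1, len(counts)):
        previous, current = counts[i - 1], counts[i]
        # Skip readings that follow an empty table; there is nothing left to lose.
        if previous > 0 and (previous - current) / previous > max_loss_ratio:
            drops.append((i, previous, current))
    return drops


if __name__ == "__main__":
    # Counts reported in this thread: 70 entries, then blank, then 11 and 29 after restores.
    print(find_sudden_drops([70, 0, 11, 29]))
```

Pairing each flagged reading with its timestamp would narrow down which stretch of the event log to search.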
-
Why did multiple devices become unresponsive?
For a temporal overview of this ...
1) State: normal, all working -> some kind of trigger occurs (seems to happen every day or so, now)
2) State: all devices can no longer successfully transmit packets to the PLM/Polisy, no other functions affected -> individual RESTORE DEVICE command on a given device
3) State: that one given device is restored to normal function, all other devices still remain faulted -> some kind of trigger occurs, again
4) State: ALL devices once again can no longer transmit to the PLM/Polisy.
Orest
-
Why did multiple devices become unresponsive?
@IndyMike Do not apologize for a loooong post, I love it! The PLM table shows 70 entries (table is now saved out); I presume that means it sees 70 Insteon devices. That is about right. There are some more ZWave devices as well, which obviously are not reflected there.

Cable - this fault has survived through two PLMs, and also through two different cables, one the DB9 serial cable, and now the USB cable, so that eliminates cable, connectors and plugs as an issue. A second PLM restore (or even just a serial device restore on each device) is on the plate here for sure. I obviously did one PLM restore, when I installed the new 2413U PLM.

For now, I once again device restored one 2334 keypad (yesterday) to monitor. It is still working some 14 hrs later; will see if a full 24 hours results in the fault. Also, today I device restored two more 2334s, four 2477S switches and one 2844 motion. Why those? ... The 2334s partly display scene status (that works fine), but some of the buttons also trigger programs, so they send a packet to the PLM/Polisy. The 2477S manual switch status is used as a logical trigger for some programs, so that requires status transmission to the PLM/Polisy. And the motion of course sends packets to the PLM/Polisy for action. Those were chosen as a sampling, as anything that requires a packet from a device to tx to the PLM and the Polisy fails when the fault occurs. Once device restored, they resumed full normal functioning, for now.

As noted, I have now saved out the PLM table, 70 entries, but curiously no NULL end-of-list item; should there be one? Perhaps I don't have the full listing captured, but I did try a few times. I am running the event viewer in mode 3, so there will be a lot of data there! The one huge advantage with this otherwise annoying issue is that it is 100% consistent, so it can potentially be debugged.

100% consistent: when the fault state occurs (no idea what triggers this), all devices in the system that need to transmit a packet to the PLM, and then the Polisy, for action don't, or the packet is lost/dropped in transmission. I was watching the event viewer, and pushing buttons of devices when faulted shows no device activity, which makes sense. Because all devices that use this mode (Tx) lose this ability all at once, as has been suggested, it surely has to be something upstream, the PLM or Polisy.

Also 100% consistent: once all devices are faulted, a DEVICE RESTORE on a given device will restore it to normal function (for a while, until the next "trigger"), and it will now get its transmission packet received by the Polisy, which then carries out its action.

Nothing else that I have been able to discern is affected. The system otherwise is working. Programs are running and executing commands, and scenes external to the Polisy continue to work, as does direct control of lights from manual operation of switches. The one obvious exception to programs running properly during the fault is those that rely on a report of the state of devices (switch positions, motions, etc.); the logic may fail as the state is not correctly updated at the PLM/Polisy. And using the UD mobile app to trigger programs works fine, as a surrogate, for example, for a curtain open/close button push on a 2334 keypad. But, you would expect that. Will report again tomorrow, and post the logs. Orest
-
Why did multiple devices become unresponsive?
Although I thought earlier that the change in the device link table was pertinent, presently the fault occurs with no change to the link table. There were some stray and duplicated entries previously, but they were not significant to this issue. And now (with the device link table "fixed"), the device restore function doesn't actually end up changing anything in the device's link table, yet results in an immediate restoration of the function of the device. I understand in a global sense what the device restore function does, but perhaps there is some internal data table or cache fix-up that happens as well. Sure, I am grasping at straws to try to explain (and mitigate) what I am seeing. I just now have "device restored" one 8-button panel, and it now works. For example, one of the buttons opens/closes the curtain; button pushes from this device are now seen by the Polisy and the correct program is triggered with effect. With that device restore the device link table did not change, so it must have done or triggered something else. This fix-up effect is immediate and, so far, 100% consistent -- once the fault occurs. I will be monitoring it for function as well as any change in the device link table. If it continues acting as devices have in the last few days, by tomorrow it will no longer be working. I'll also look at giving the Polisy a 24hr power-off "rest", to see what happens. Orest
-
Why did multiple devices become unresponsive?
Just thinking, the other valuable data-point would be to know precisely what the RESTORE DEVICE command does, as it does reset the problem, even if only temporarily. But, for that, I think we would need to have the developers comment. @Michel Kohanim Orest
-
Why did multiple devices become unresponsive?
As I say, I'm game for anything, it is very frustrating and puzzling. We can do a dumb home for a day, it is half-dumb now! But, what is your thinking on this? To confirm/exclude that it is something the Polisy is doing or internally faulting, or that the Polisy is doing/faulting after running for a while? Orest
-
Why did multiple devices become unresponsive?
@IndyMike Thanks so much for explaining the tables. YES, 58.23.C7 is the old PLM. Not sure why it was not removed with the PLM replacement procedure, but perhaps that doesn't happen. Regardless, having both control devices listed in there would not be part of the issue, as the problem existed before the new PLM replaced the old. FWIW, doing a device restore does remove that duplication in the table. But having the cleaned-up device link table doesn't make a difference either. Even though I did device restores on several devices as a test and cleaned up the tables, the problem recurred overnight in these devices, and the tables did not corrupt -- as you suggest, that shouldn't happen, and it didn't. So, boiling it down even further...
1) with this fault condition, any device that needs to transmit state to the PLM fails to do so.
2) performing a device restore on a given device alone immediately allows it to function again (temporarily), and information (a keypress, motion detected, etc.) is once again seen by the Polisy from the device
3) function is restored for less than 24 hours, and then it returns to fault state "1" above -- this point in particular is a real head scratcher to me -- this is probably the key to the puzzle, understanding why this happens
4) outward (Polisy -> device, ON/OFF etc.) controls of the devices work normally and are not affected
5) grouped scenes of devices, with one device a control device, and actions independent of the PLM/Polisy work normally
6) this is not related to the state of the device table, that was a canard; even though the tables were "fixed" as of today in those devices I was testing, the fault pattern recurred.
I do have one set of routines that changes the illumination of the LED keys on the 8-button panels, but only a few of them. The fault state affects EVERY device, wired, wireless, across the entire system, with the common point being that they transmit information/state to the Polisy.
I am now starting to wonder if there is a problem with the Polisy, perhaps some bad memory developing or similar, or corrupted firmware, or both. And, somehow, the "device restore" cleans up something in a table in the Polisy as well. I'm really not sure where to go. I would hate to replace the Polisy (a huge undertaking), and then find out that was not the cause, but there seems little else to attack at the moment. A bit tongue in cheek, but if I did a device restore on every affected device, once every six hours or so, that might step around the issue! It does not seem practical, and I am not even sure how you would do that; it is clearly not possible for the wireless devices. Orest
-
New eisy lite
Interesting, I wondered if you could just plug it in to the new EISY. Would save a lot of work. Orest
-
New eisy lite
Looks like both are unavailable currently (aartech).
eisy r2 - $839 "out of stock"
eisy lite - $630 "on order with vendor"
I had been thinking of making the move to eisy r2 from the Polisy (5.4.4) -- but when is r3 coming out? ;-) The Insteon side is little more than a simple backup/restore, but I would also need, at the same time, to rebuild my 700 series USB dongle ZWave inventory and move it to the new ZWave hardware, and there is no easy path for that. Orest
-
Why did multiple devices become unresponsive?
And, it is the same pattern with all the devices I've looked at. There are one or two extra lines in the device table, sometimes with no high water mark null line. Otherwise, all the lines match. A device restore updates the device table to match the ISY table, and then the device starts fully working (that is, its transmissions are then seen by the Polisy). I've fixed a dozen or so, and will monitor to see if they "unfix", like earlier with the old PLM still in place. Orest
-
Why did multiple devices become unresponsive?
So, perhaps the old failing PLM, continually farkled up the device tables to some extent, taking a bit of time to corrupt most of the devices -- enough that device transmissions failed. A replacement of the PLM, just carried that bad data (sourced from the old PLM table) into the new PLM, much as @paulbates is suggesting. Finally, the device RESTORE (vice just the PLM replacement procedure), from the Polisy, WITH a new PLM in place finally fixes it as there is no new/further corruption process. Maybe wishful thinking, but so far the observations would be consistent. Fingers crossed, will report back. Orest
oskrypuch