Michel Kohanim Posted December 22, 2011 Hi everyone, With your help, Chris figured out the problem and we are currently working on a fix. As LeeG noted (quite ingeniously, I might add), the problem has to do with how we handle group cleanup messages. The problem is isolated and only happens in cases of extreme communication problems. With kind regards, Michel
apostolakisl Author Posted December 22, 2011 I am not very good with these Insteon communication logs, but it would appear to me that this particular switch needs 2 hops to get messages back and forth. I don't mean to be a whiner, but I would think we should accept that Insteon may need "hops" to get the message around; it was, after all, built on the premise that perfect comm is no longer a necessity (as opposed to X10). If we are going to forgo the "hops", then we are giving up a big feature of Insteon. Having said that, I appreciate that you guys think you have a solution to this problem and are working on a fix. Thanks.
LeeG Posted December 22, 2011 It is not the hop count that causes this. If it were as simple as taking 3 hops to get from the KPL to the PLM, this would not be a problem. The Group protocol normally consists of a Group Broadcast message followed by what should have been a single Group Cleanup Direct message, even if it took 3 hops for the Group Cleanup Direct to be received by the PLM. Here there are three Group Cleanup Direct messages from the KPL, each received, processed, and ACKed by the PLM and passed to the application (ISY). The first, with max hops 1, was successfully received by the PLM. The second, with max hops 2, was successfully received by the PLM. The third, with max hops 3, was successfully received by the PLM. None of the ACKs sent by the PLM in acknowledgement of those three messages made it back to the KPL. It is a very unusual situation for traffic to get from the KPL to the PLM with such success while traffic from the PLM to the KPL fails so totally. The problem is that each additional Group Cleanup Direct message looks like another button press, and could be another button press based on the message itself. If comm were poor in the other direction, with messages having trouble getting from the KPL to the PLM, it is very possible to miss the initial Group Broadcast altogether, because it has no ACK and no retry associated with it. That is why Insteon designed the Group protocol with a follow-up Group Cleanup Direct to each Responder. Since this message is sent to a specific device, it must be ACKed and can be retried by the KPL if an ACK is not received. From the message sequence, it could actually be three distinct button presses with very poor comm. That is a very unusual sequence, which is likely why the ISY analyzed it incorrectly.
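The sequence LeeG describes can be sketched in a few lines of Python. This is only an illustrative model, not Insteon or ISY code: the `Message` class, the `traffic_for_one_press` function, the `ack_reaches_keypad` callback, and group number 1 are all hypothetical names chosen for this sketch. The hop schedule (1, 2, 3) is taken from the trace described above, and only the PLM is modeled as a Responder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Message:
    kind: str                 # "broadcast" or "cleanup"
    group: int
    command: str
    max_hops: Optional[int]   # None where the thread gives no value


def traffic_for_one_press(
    group: int,
    command: str,
    ack_reaches_keypad: Callable[[int], bool],
    hop_schedule=(1, 2, 3),   # assumption: mirrors the three retries in the trace
) -> List[Message]:
    """Messages the PLM would see for ONE button press in this toy model."""
    # The Group Broadcast has no ACK and no retry, so it is sent exactly once.
    traffic = [Message("broadcast", group, command, None)]
    for hops in hop_schedule:
        # The Group Cleanup Direct is addressed to a specific device and is
        # retried until the keypad hears an ACK (or the schedule runs out).
        traffic.append(Message("cleanup", group, command, hops))
        if ack_reaches_keypad(hops):
            break
    return traffic


# Healthy return path: one broadcast, one cleanup.
good = traffic_for_one_press(1, "off", lambda hops: True)

# Every ACK lost on the way back: one broadcast, three cleanups.
bad = traffic_for_one_press(1, "off", lambda hops: False)

print(len(good), len(bad))
```

In the second case the PLM has received four messages for a single press, and nothing in the cleanup messages themselves marks them as retries, which is the ambiguity described above.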
apostolakisl Author Posted December 22, 2011 OK, but I don't get it. How can the message go one direction so successfully and go the other direction so poorly?
But still, even if it thought there were three button presses, none of them should have been a button that ran that program.
LeeG Posted December 22, 2011 I have no idea what is blocking PLM to KPL communication intermittently. With the Group Broadcast and all three Group Cleanup Direct messages received from the KPL okay, one would think the ACKs from the PLM would get back to the KPL. Once the source of interference is identified it may make more sense. As far as triggering the Program, Michel acknowledged that is a bug which UDI is fixing.
apostolakisl Author Posted December 22, 2011 Hopefully the filter I ordered last weekend will arrive today. I ran a number of scene tests with that circuit off and they went perfectly; then they failed with it on. It was reproducible about 5 times each way during a 15 minute span, so I feel like filtering the circuit should help. Also, after I moved the PLM off of that branch panel, things worked better as well.
apostolakisl Author Posted December 23, 2011 I got the filter and installed it on the breaker. I ran scene tests by the boatload: 100% success. The program still misbehaves. In fact, the program design "if status is off and switched off, then turn on to 25%" misbehaves in general. When one of the lights associated with one of these programs is on and I turn it off, the program will still run and the light goes back on to 25%. It does this randomly, about 50% of the time. I am thinking that this entire program structure is going to need to be abandoned.
LeeG Posted December 23, 2011 An Event Trace with Device communications events selected, captured when it fails, is needed to determine whether the filter resolved the problem of the device not receiving ACKs from the PLM. EDIT: the symptom with the KeypadLinc was that the wrong node was being posted with an Off when there is a series of Group Cleanup Direct messages from a device. With a single-node device that is sending multiple Group Cleanup Direct messages because the ACKs are not being received, the only node would be posted with an additional Off command. The end symptom is different because of the difference in the number of nodes a particular device has, but the underlying problem is likely the same. The lack of ACKs back to the device looks like multiple button presses when the Insteon messages are evaluated programmatically.
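To make the single-node case concrete, here is a rough Python sketch of how an extra Off command could trigger a program of the form "if status is off and switched off, then turn on to 25%". It is a toy model under stated assumptions, not the ISY's actual logic and not the fix UDI shipped: `program_runs`, `collapse_retries`, the 2-second window, and the timestamps are all hypothetical, and the model assumes the program condition is checked against the status seen before each Off command is applied.

```python
def program_runs(initial_status: int, off_commands: int) -> int:
    """Count runs of 'if status is Off and switched Off then set 25%'.

    initial_status is the light level (0 = off) before the first Off command.
    """
    status = initial_status
    runs = 0
    for _ in range(off_commands):
        if status == 0:
            runs += 1
            status = 25   # the program turns the light on to 25%
        else:
            status = 0    # an ordinary Off: the status simply becomes off
    return runs


def collapse_retries(timestamps, window=2.0):
    """Hypothetical heuristic: cleanup messages that arrive within `window`
    seconds of the previous one are treated as retries of the same press."""
    presses = 0
    last = None
    for t in sorted(timestamps):
        if last is None or t - last > window:
            presses += 1
        last = t
    return presses


# Light is on, user presses Off once, all ACKs are lost: three cleanups arrive.
cleanup_times = [0.0, 0.4, 0.9]            # made-up arrival times in seconds

naive = program_runs(100, len(cleanup_times))                 # each cleanup is a press
deduped = program_runs(100, collapse_retries(cleanup_times))  # retries merged first

print(naive, deduped)
```

In this model the naive count runs the program once even though the light was on when the button was pressed, while the merged count does not run it at all. The trade-off is the one LeeG points out: a time window would also merge genuine rapid presses, since the messages themselves do not distinguish the two.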
apostolakisl Author Posted December 23, 2011 The lack of ACKs back to the device has got to be something more than a simple lack of comm. Reasons to say this: 1) The comm going the other direction never has an issue. 2) ISY-initiated events always get through (i.e., programs run on the ISY and execute events at the switch without fail). 3) Tests of my comm (scene tests) are now 100% successful. I have to wonder if this ACK is actually being sent in the first place, or if it is being formatted improperly so the switch doesn't recognize it. I will also note that the problem is now consistent between 2 PLMs, and that I didn't always have this problem. The biggest change to my system since this problem began is the ISY firmware.
LeeG Posted December 23, 2011 I cannot explain why comm in one direction is good and not in the other. For sure there is some aspect of this that has not been identified yet. Did you run an Event Trace and confirm that there are three Group Cleanup Direct messages being received when the light turns back On? There is no command or option that I have ever seen that will force the PLM to send an ACK or suppress it. That part of the process is all driven by the PLM firmware.
Michel Kohanim Posted December 23, 2011 apostolakisl/LeeG, I think it would be safe to say: a. The way we have handled cleanups and ACKs has NOT changed for as far back as 2.8.16. b. The way we currently handle cleanups "may" cause all the problems apostolakisl is experiencing. As such, I recommend waiting for our next release so that we can have a better understanding of where the actual problems may be. With kind regards, Michel
apostolakisl Author Posted December 23, 2011 I have been having problems with this program type for several months now, perhaps 6. So, if a change occurred around 2.8.16, that could have something to do with it.
apostolakisl Author Posted December 30, 2011 It would appear that 3.1.16 has fixed the problem. I have tried to get it to screw up and it has not, even after many, many attempts. Only time will tell for sure, but I would certainly have seen dozens of screw-ups, as it was before, in the number of tests I did.
apostolakisl Author Posted January 11, 2012 It has been a couple of weeks now with not a single malfunction. I believe you have fixed it!