
Infinite Loop - What conditions to look for?


Teken


Over the last few months the ISY has become very unresponsive. To a lay person, the symptoms appear to be the following:

 

- The UI freezes, but the system clock functions and ticks away.

 

- Programs appear to not execute.

 

- Changes in the UI do not always take hold. 

 

- Reboots are common to recover from a stuck state.

 

- The entire interface sometimes requires a CTRL ALT DEL to kill the process and allow the user to shut down. At this point the entire system is actually frozen and the system clock has stopped.

 

- Logging in sometimes requires 4-10 tries.

 

- The Java console does not always fully populate. Normally the user gets prompted to enter the user name / password, but sometimes this doesn't happen.

 

- I have seen programs flicker between true and false like a strobe light, turning on and off even though the program's conditions are not in a state that should cause this.

 

 

I am currently working with Michel at UDI and have been asked to turn off all programs from the master folder tree. I am going to log in the next 999999999999 times and watch how the system behaves, while also watching the energy module to see if my power readings stop being updated.

 

This time 48 hours ago the system was in a critical state. All of my programs literally just went poof and disappeared! I was in the middle of adding UDI to a program for an alert when I noticed the system was in some unknown state, meaning there was no system busy flag, just that clicking on any area seemed to take minutes to hours to respond. The strange thing during that complete meltdown was that all of the people on the e-mail list were gone.

 

Existing programs that had e-mail aliases now showed numbers in their place.

 

Anyway, I am looking for insight into things I should consider or keep in mind that may cause such an infinite loop / race condition. Up until two months ago my system was rock solid, and I want to address anything on my end to restore the system to that same state.

 

Your feedback and insight are greatly appreciated.

Link to comment

Hi Teken

Sounds like a tough spot and sorry to hear that. Some thoughts:

 

Something could be wrong with the unit and perhaps not your code?

  • Rule out a firmware release candidate problem. Did you move to a new release candidate around the time the trouble started?
  • Reset the ISY to factory defaults and reload your configuration and code. Being relatively new, I don't know how hard that is to do. The base OS/firmware could have gotten trashed, and that should be eliminated as a possibility.
  • Finally and hopefully not, could the ISY have a hardware problem caused by a power spike or something else? Is it possible to run diagnostics on it in its current state?

If those things don't pan out, I think Michel is on the right track:

  • shut down all programs and see what happens,
  • then turn them on a group at a time until it breaks.

Hopefully that pinpoints where the problem is occurring; then turn programs on / off within that group until you narrow it down. If it's code, it could be one program, or two programs that unintentionally interact and 'run away' with the ISY.
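
The narrowing-down procedure described here can be sketched as a simple halving search. This is only an illustration in Python, not something that runs on the ISY: `find_culprit` and the `is_stable` callback are hypothetical names, `is_stable` stands in for the manual step of enabling a set of folders and watching the Admin Console, and the folder names are made up. It also assumes exactly one group is at fault; two programs that only misbehave together would need a different approach.

```python
from typing import Callable, List, Sequence


def find_culprit(groups: Sequence[str],
                 is_stable: Callable[[List[str]], bool]) -> str:
    """Narrow down the one group that destabilises the system.

    `is_stable(enabled)` is a hypothetical stand-in for the manual test:
    enable only `enabled`, observe for a while, return True if all is well.
    Assumes exactly one group in `groups` is at fault.
    """
    candidates = list(groups)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if is_stable(half):
            # The first half is clean, so the fault is in the rest.
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0]


if __name__ == "__main__":
    # Made-up folder names, purely for illustration.
    folders = ["Lighting", "Energy", "Alerts", "SecureRoom", "HVAC"]
    # Pretend observation: anything containing "SecureRoom" misbehaves.
    observe = lambda enabled: "SecureRoom" not in enabled
    print(find_culprit(folders, observe))
```

Halving keeps the number of observation periods small, which matters when each one means waiting long enough for the slowdown to show up.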

 

Paul

Link to comment

Thanks for the feedback and insight. Moments ago I started down the path of isolating which programs could be causing me so much grief. Below is a video I just took of one program that was just created; I have no clue why it would act this way.

 

This gives you the overall feel of the strobing effect I spoke of.  :?   :shock:

 

http://vid941.photobucket.com/albums/ad254/EVIL_Teken/ISY%20Infinite%20Loop/newfile2_zps4f32f305.mp4

Link to comment

The strobing of program status is something I have seen. Disabling the offending program caused it to stop; then I could examine the "if" conditions for items controlled by its own "then" or "else" section. Of course, if this involves a "round robin" of several interlinked programs, it gets a lot more complicated. State variables can be dangerous this way.
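
The check described above, looking for "if" conditions that depend on something the same program (or a chain of programs) changes, can be sketched as a cycle search. This is a Python illustration under stated assumptions, not ISY code: the `Programs` mapping, the function name `find_loop`, and the example program and variable names are all hypothetical, and the watched/changed sets would have to be written down by hand from the Admin Console. It flags potential loops only; whether one actually runs away depends on the values involved.

```python
from typing import Dict, List, Optional, Set, Tuple

# program name -> (items its "if" watches, items its "then"/"else" changes)
Programs = Dict[str, Tuple[Set[str], Set[str]]]


def find_loop(programs: Programs) -> Optional[List[str]]:
    """Return one chain of programs that can re-trigger itself, or None."""
    # P -> Q when something P changes is something Q's "if" watches.
    edges = {
        p: [q for q, (watched, _) in programs.items() if changed & watched]
        for p, (_, changed) in programs.items()
    }
    visiting: List[str] = []   # current depth-first path
    done: Set[str] = set()     # programs fully explored without a loop

    def walk(p: str) -> Optional[List[str]]:
        if p in visiting:
            # Back on a program already on the path: that is the loop.
            return visiting[visiting.index(p):] + [p]
        if p in done:
            return None
        visiting.append(p)
        for q in edges[p]:
            loop = walk(q)
            if loop:
                return loop
        visiting.pop()
        done.add(p)
        return None

    for name in programs:
        loop = walk(name)
        if loop:
            return loop
    return None


if __name__ == "__main__":
    # Hypothetical programs and variables, for illustration only.
    example: Programs = {
        "Alert": ({"s.Temp"}, {"s.Override"}),
        "Override": ({"s.Override"}, {"s.Temp"}),
        "Logger": ({"s.Temp"}, {"i.Count"}),
    }
    print(find_loop(example))
```

A program that watches and changes the same item shows up as a one-step loop, which is the single-program case; the "round robin" case shows up as a longer chain.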

 

Check your events for repeated device codes.
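
One way to act on this, sketched in Python rather than anything ISY-specific: count how often each source appears in the event output and list the busiest ones. The function name `noisy_sources`, the threshold, and the assumption that the source identifiers have already been pulled out of the log by hand are all illustrative; the sample addresses are made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_sources(sources: Iterable[str], threshold: int) -> List[Tuple[str, int]]:
    """Return (source, count) pairs seen at least `threshold` times, busiest first."""
    counts = Counter(sources)
    return [(s, n) for s, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Made-up device addresses standing in for entries copied from the event log.
    seen = ["1A.2B.3C", "4D.5E.6F", "1A.2B.3C", "1A.2B.3C", "4D.5E.6F", "7A.8B.9C"]
    for source, count in noisy_sources(seen, threshold=2):
        print(source, count)
```

A source that dominates the list over a short capture is a reasonable first candidate for whatever is feeding a loop.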

 

After watching your video I see those two programs are already disabled. It would appear they call each other in a loop. Given how slowly they do this, I would guess there is another looping structure going on somewhere else as well. Maybe the video doesn't show the actual speed, or maybe the Admin Console doesn't either.

 

 

Another thought I had: a firmware add-on or 3rd party software could be stuffing more variables into the ISY than you have defined, or overlapping and writing to state variables it isn't supposed to.

Link to comment

@Teken-

 

How frequent are updates from the Autelis unit? I have to agree with Larry that it looks like these programs are stuck in a loop.  Can you shut down the Autelis for a few minutes when this happens and see if it stops?

 

I'm going to review the programs and see if I missed something that might cause this.

 

-Xathros

Link to comment

@Teken-

 

Is i.NotificationSuspend possibly defined as a State variable instead of an Integer?  If so, there's your cause!
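
The reasoning behind this question is that a change to a State variable re-triggers programs that test it, while a change to an Integer variable does not. A toy Python model of that difference is below. It is not ISY code: `count_runs` is a hypothetical name, the program is reduced to "the then-section writes to the variable the if-section tests", and the run cap exists only so the demonstration terminates.

```python
def count_runs(kind: str, max_runs: int = 50) -> int:
    """Toy model: a program whose "then" changes the variable its "if" tests.

    `kind` is "state" or "integer". Only a write to a state variable queues
    the program again; this mirrors the behaviour discussed in the thread.
    """
    value = 0
    pending = 1          # one external update arrives and queues the program
    runs = 0
    while pending and runs < max_runs:
        pending -= 1
        runs += 1
        value += 1       # the "then" section writes to the variable
        if kind == "state":
            pending += 1  # the write itself re-triggers the program
    return runs


if __name__ == "__main__":
    print("integer:", count_runs("integer"))
    print("state:  ", count_runs("state"))
```

With an integer variable the program runs once per external update; with a state variable it keeps re-queuing itself until something stops it, which is what a strobing status looks like from the outside.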

 

-Xathros

Link to comment


I have no clue how often the temperature is pushed into the ISY. But I can certainly shut it down and see what happens. Please keep in mind I am also seeing similar *flicker* on programs that are not in a true / false state. They too flicker randomly (no predictable pattern), but not like in the video you just saw.

For those unsure, what you see is exactly what I see in front of the monitor. Whether those fast pulses are faster / slower on the ISY itself, I have no clue.

 


I copied exactly what you provided and have since deleted the program. I will review the code and double check in a few moments whether the variable is indeed a State vs an Integer.

Link to comment


Check the Variables tab and see if i.NotifySuspend is on the State or Integer sub tab. Any of my vars that start with "i." should be defined on the Integer tab, and any beginning with "s." should be on the State tab.

 

-Xathros

Link to comment

I just checked the variable tabs and they are correct. Here is a screen capture of both sections in the ISY. I am going to recreate the program and see what happens now.

Good.  That rules out a loop caused by that at least.

 

I'd be interested to see some of the other flickering programs to see if there is a common element.

 

-Xathros

Link to comment


I am monitoring it right this minute and will record what I see and find and share with the group. I replied back to the thread in question which has this program.

 

So far it seems to be working fine.

 

But I have modified it to be less sensitive and to have a larger window for change / variance, in hopes this buffer will be the solution to the data bombardment.
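
The "larger window for change" idea is commonly called a deadband: a new reading is acted on only if it differs enough from the last one that was acted on. Below is a minimal Python sketch of that idea, assuming nothing about how the actual ISY program is written; `apply_deadband`, the window size, and the sample temperatures are hypothetical.

```python
from typing import Iterable, List


def apply_deadband(readings: Iterable[float], window: float) -> List[float]:
    """Keep a reading only when it differs from the last kept one by at least `window`."""
    kept: List[float] = []
    for value in readings:
        if not kept or abs(value - kept[-1]) >= window:
            kept.append(value)
    return kept


if __name__ == "__main__":
    # Made-up temperature readings with small jitter around slow changes.
    temps = [21.0, 21.1, 20.9, 21.2, 22.1, 22.0, 23.2]
    print(apply_deadband(temps, window=1.0))
```

A wider window means fewer triggers but coarser alerts, so the right size depends on how much variation actually matters for the alert.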

 

I don't like to say it's the data, as opposed to user programming (error), but time will tell.  :oops:

Link to comment

Well, that didn't last very long.  :cry:  The system has completely gone downhill and is stalling out. I have removed the Ethernet cable from the Autelis device and the problem still persists.

 

I have deleted the program and redone it from scratch, and even bumped up the values to allow a larger buffer. Yet this program still does not behave correctly.

 

Am I missing something here?

 

Below are the last two videos of this problem: http://vid941.photobucket.com/albums/ad254/EVIL_Teken/ISY%20Infinite%20Loop/secureroom_zpsf83f872d.mp4

 

http://vid941.photobucket.com/albums/ad254/EVIL_Teken/ISY%20Infinite%20Loop/secureroom2_zpsd54ccc2c.mp4

Link to comment

Hi Teken,

 

Unfortunately this is a very hard problem to solve. My recommendations:

1. Make sure you are accessing your ISY locally (not through the remote IP address)

2. Create a new Folder with no conditions such as My New Programs

3. Disable My Programs Folder by having an On Never condition. This will stop all your programs in that folder

4. Right mouse click | Copy each folder in the original My Programs and then move it to My New Programs

5. Observe for however long it usually takes for the slowdown to appear; if OK, then repeat step 4 for the other folders

 

Good luck.

 

With kind regards,

Michel

Link to comment

@teken

I found an error in my code that is responsible for this. I'm so sorry that I have put you through so much grief trying to track this down. I will post revisions shortly that will resolve the loop. The problem is in calculating the AlertDelta. The first step in the calculation is triggering the override.

I feel horrible that I have caused you so much grief.

-Xathros

 

EDIT: Never mind.  After another review, the problem I thought I saw is a non-issue.  i.AlertDelta is an integer variable and does not cause a trigger.  Also, these programs run fine on my test system.  The only difference is that I don't have an Autelis updating the SecureRoom variable; I do that manually.

 

-Xathros

Link to comment

@Teken-

 

Any updates?  Have you worked out what's going wrong yet?

 

-Xathros

 

Hello Xathros,

 

After a grueling 48 hours, I was finally able to bring the ISY-994iZ controller back from a completely bricked state.  :x  :?  I had endless login issues, Java pop-up error messages, and never-ending upgrade failure messages saying XYZ, whether the upgrade had reached 2% or 80+%.

 

The system literally indicated that five separate backup versions were invalid?? I was basically about to throw this box out the window at the 30 hour mark.

 

After cooler heads prevailed, I took a few moments to read through some of the similar error pop-up messages that other members have reported in the past, and determined that part of my problem could be a failing PLM.

 

I replaced the almost two year old PLM with one of two brand new backup PLMs I had on hand, loaded the ISY with a 4.1.X backup copy, and worked my way up to 4.2.8.

 

I think what was really troubling and frustrating at the time was that every version of the restore would always indicate 4.2.8?? :?   :shock:  I have never personally experienced this behaviour before, and both the UI and the system indicated the same even after selecting all three options in the Java app for deletion.

 

There is a bug in 4.2.8 in that respect, because it should never hold the value once a restore is installed in its place. The only way I knew the restore was valid was from screen shots of each backup and a marker I placed in a file indicating the version.

 

Regardless, after my 48 hour marathon to bring the ISY back from this state, I have observed a few things I believe are not related to the program you crafted for me. For your reference, I have not created any of the last three programs you provided.

 

I did this to see how the system behaves with a brand new PLM, a 4.2.8 base, and the existing programs in place.

 

I have seen extremely slow web page loads, and my existing GEM and Dash Box show delayed or missing energy reading updates. I believe my problem lies with the Autelis Bridge and the fact that it may be overloading my LAN with endless streams of data.

 

I have Wireshark running to measure the volume of packets being streamed over the LAN for review and analysis. Since moving the Autelis to another network things seem to be better, but I still see some lag in connectivity.

 

After running almost a mile of CAT6 Ethernet this long weekend and trying to bring the ISY back from the brink of death, I am one tired puppy!

 

I shall report back once I can sit down and really monitor the network traffic and see the cause and effect of the Autelis Bridge on my LAN. I have since asked the vendor to upgrade the firmware so the end user can adjust the send interval. Even if this turns out not to be the root cause, the ability to manage the send interval and balance the network traffic load would be a very important feature for the admin.
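
To make the "adjustable send interval" request concrete, here is a minimal Python sketch of a throttle that forwards at most one update per interval. It is not based on how the Autelis firmware works; `throttle`, the `Update` tuple, and the sample timings are hypothetical and only illustrate the effect such a setting would have on traffic volume.

```python
from typing import Iterable, List, Tuple

Update = Tuple[float, float]  # (timestamp in seconds, value)


def throttle(updates: Iterable[Update], min_interval: float) -> List[Update]:
    """Forward at most one update per `min_interval` seconds, dropping the rest."""
    sent: List[Update] = []
    for stamp, value in updates:
        if not sent or stamp - sent[-1][0] >= min_interval:
            sent.append((stamp, value))
    return sent


if __name__ == "__main__":
    # One made-up reading per second for a minute, throttled to one every 15 s.
    stream = [(float(t), 20.0 + t * 0.01) for t in range(60)]
    forwarded = throttle(stream, min_interval=15.0)
    print(len(stream), "updates in,", len(forwarded), "forwarded")
```

This simple version drops the in-between readings outright. A real implementation might instead send the most recent value when the interval expires, so the receiver never sits on stale data for long.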

 

I thank you so very much for the follow up and your continued support in helping me craft such powerful and helpful programs for my energy monitoring efforts. 


Link to comment

@Teken-

 

Wow!  Sounds like quite an ordeal.

 

It's possible that I'm just misunderstanding your terminology, but a few things jumped out at me.  First, restoring an ISY backup from an older firmware will not change the current firmware or UI versions, as the firmware is not contained in the backup. The UI version depends on which UI you download, from either your ISY or the UDI website, and on what version may already be in your Java cache until you clear it.  When pulling a new UI from the ISY, the UI will always equal the firmware version.  The backup contains only your settings, devices, scenes, programs, variables, network resources, custom notifications, etc.  To go back to an older firmware you would have to "Manually Upgrade" and select an older firmware image. To go back to an older UI, clear your Java cache and pull the UI from the UDI website using a URL specifying the version.

 

Another thing to note is that the error messages displayed during a backup / restore are often unrelated to the backup / restore process itself and are instead related to other devices attempting to connect to the ISY over the network.  I get the impression that the ISY gets too busy to respond to some network traffic during backup / restore operations, so connections may time out. I often see socket timeouts during a backup.  Just acknowledge them and continue on.  The backup is still happening behind the messages.  If the message says "Upgrade failed", "Backup Failed", or "Restore Failed", then that's another story.

 

All in all it sounds like a rough few days.  I'm glad to hear you have got it back under control now.

 

-Xathros

Link to comment


Hello Xathros,

 

After reading your reply this makes a lot more sense to me. The one part I am unclear on is what happened after I cleared the Java cache and used only the direct IP to the ISY.

The UI still showed the 4.2.8 firmware version. Is this the expected behaviour? I was indeed seeing endless backup failures, restore failures, and never-ending socket and Java errors.

 

What made this whole process even harder was that the system would not even let me log in. I am uncertain whether the PLM had any relation to the problems I had, but I decided to replace it proactively, given that it was nearing its 2 year shelf life, to eliminate it as a possible culprit in my troubles.

 

I will be adding your programs back in once I have been able to monitor the three systems and the network health, before I move forward with adding the programs for HA activities.

 

I thank you once again for your insight and guidance.

Link to comment

The one part I am unclear on is what happened after I cleared the Java cache and used only the direct IP to the ISY.

The UI still showed the 4.2.8 firmware version. Is this the expected behaviour?

 

Yes.  If the Firmware in the ISY was at 4.2.8 then any UI pulled from the ISY would also be 4.2.8.

 

This sounds more and more like network problems to me.

 

-Xathros

Link to comment

Archived

This topic is now archived and is closed to further replies.

