
OpenWeatherMap not showing as online though it is & I can't query it in a program



Hi,

After an ISY restart today I noticed that while the OWM NS is updating data values in ISY, it has not updated its status, and I can't query it in a program because there is no option to do so (only 'Remove Notices').

[Screenshot: OpenWeatherMap program options]

I'm counting on the 'online' status to do (and not do) certain things. In the 'do' category is, potentially, restarting PG3 if things are stuck for too long. I noticed that while the program is true, it actually did not run at startup. There's likely some troubleshooting on my part there, but I will need to be able to query the NS in a program.

[Screenshot: OpenWeatherMap showing Disconnected]

[Screenshot: OpenWeatherMap not showing as Online]

 

So can 'query' be added as a program action?

In the meantime, how often is that status sent to ISY? Perhaps ISY was too busy to process it at startup but is it also sent periodically? I know I've seen a value there as I was writing up the program but it's been 2.5 hrs since restart and there's nothing there.

Thanks.

Link to comment
2 hours ago, johnnyt said:

Hi,

After an ISY restart today I noticed that while the OWM NS is updating data values in ISY, it has not updated its status, and I can't query it in a program because there is no option to do so (only 'Remove Notices').

In the meantime, how often is that status sent to ISY? Perhaps ISY was too busy to process it at startup but is it also sent periodically? I know I've seen a value there as I was writing up the program but it's been 2.5 hrs since restart and there's nothing there.

The on-line status isn't doing what you probably think it does.   That value represents the status of the connection between the node server and PG3 only.  It is set only when the status of that connection changes. 

When  the node server is stopped, it is set to 'disconnected'.

If the node server crashes and drops the connection, it is set to 'failed'.

When the node server starts it is set to 'connected'.

The connection status does not indicate the node server's status beyond the connection to PG3, nor is it any kind of heartbeat indicator.
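If it helps to picture it, here's a rough sketch of that event-driven behavior. This is illustrative Python only, not PG3's actual code; the event names and numeric values are made up.

    # Illustrative sketch only -- NOT PG3's actual code or values.
    # It shows the idea described above: the on-line status is pushed to the
    # ISY only when the PG3<->node-server connection changes state, never on
    # a timer, so it is not a heartbeat.
    DISCONNECTED, CONNECTED, FAILED = 0, 1, 2   # hypothetical encodings

    def on_connection_change(event, push_to_isy):
        if event == "started":
            push_to_isy(CONNECTED)      # node server established its link to PG3
        elif event == "stopped":
            push_to_isy(DISCONNECTED)   # node server was stopped cleanly
        elif event == "dropped":
            push_to_isy(FAILED)         # connection lost unexpectedly (crash)
        # No periodic resend: if the ISY reboots and loses the value, the
        # field stays blank until the next connection state change.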

2 hours ago, johnnyt said:

So can 'query' be added as a program action?

This node server's function is to query the OWM server for weather data and pass that data on to the ISY. A query action would logically force it to query the OWM server immediately instead of waiting for the next query interval. But that's not what you are really asking for. I believe you're asking for the ability to query the status of the node server itself. But that doesn't really work. The node server could only ever report that it is running. If it's not running, it can't report that it's not running.

I suspect that what you want is the ability to query PG3 for the status of the node server, but the ISY doesn't have a way to do that. 
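Just to illustrate the first sense of 'query' above, it would boil down to something like this. This is a sketch only, not the OWM node server's actual code; the push_to_isy helper and the driver names are placeholders.

    # Illustrative only -- not the actual OWM node server code.
    # Shows what a "Query" program action would do in the first sense:
    # hit the OpenWeatherMap API immediately and push fresh values.
    import requests

    def query_now(api_key, lat, lon, push_to_isy):
        resp = requests.get(
            "https://api.openweathermap.org/data/2.5/weather",
            params={"appid": api_key, "lat": lat, "lon": lon, "units": "metric"},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        push_to_isy("CLITEMP", data["main"]["temp"])      # driver names here are
        push_to_isy("BARPRES", data["main"]["pressure"])  # just placeholders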

Link to comment

Thanks for the explanation. I can look for data changes as my indicator that the NS is working. I was doing that before but was hoping to move to the NS Online Status. How come 'blank' = Disconnected, at least according to my program condition (when it wasn't actually disconnected)?
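For the record, the 'data changes' check I mean is basically this watchdog idea, sketched in Python just to show the logic; on the ISY it would be a program that restarts a Wait timer whenever an OWM value changes. The 30-minute threshold is my own assumption.

    # Conceptual sketch of the "watch for data changes" fallback, not real code.
    import time

    STALE_AFTER = 30 * 60           # assume OWM should update at least every 30 min
    last_update = time.monotonic()  # refreshed by whatever receives OWM values

    def on_owm_value_changed(_value):
        global last_update
        last_update = time.monotonic()

    def node_server_looks_alive():
        return (time.monotonic() - last_update) < STALE_AFTER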

Is there a command I can send to restart this (or any) NS in PG3? And is there a command I can send to restart PG3? Right now I'm using a sledgehammer to restart Polisy, namely a WebSwitch with a network command from the 994i to power cycle the outlet Polisy is plugged into. But when I move to IoP that approach will be suicide for the ISY, so it's not one I ever want to have to use - not to mention it's not ideal from a data-corruption-risk perspective.

Thanks again

Link to comment
17 minutes ago, johnnyt said:

Thanks for the explanation. I can look for data changes as my indicator that the NS is working. I was doing that before but was hoping to move to the NS Online Status. How come 'blank' = Disconnected, at least according to my program condition (when it wasn't actually disconnected)?

Because when you restart the ISY, it forgets everything it's been told about nodes. The query on restart is a work-around to try and restore everything's current state. The connection status is kind of a hack to provide some indication of the node server state. The ISY was designed to monitor the state of nodes, not the state of the node's controller - i.e., where is the Insteon on-line status or the Z-Wave on-line status? So basically, the ISY doesn't know the on-line connection status until PG3 sends it an update, which it won't do until the status changes.

PG3 will continue to get updates. One of the things we're looking at is adding more status communication between node servers and PG3 and better error reporting for when things do go wrong.  But since the ISY doesn't have any support for this, I'm not sure how this will be exposed to users yet.

Also, none of these components were designed to be restarted.  I understand the desire to try and automatically recover from catastrophic events, but  just how often are you experiencing these?  I know we'll never get to 100% uptime, but events that require a restart/reboot should be very rare.   And, yeah, I know we're not there yet.  But if everyone simply reboots or restarts automatically whenever there's an issue and the issue doesn't get reported, we'll never get there because developers won't know there's an issue to fix.

You can restart a node server by clicking the restart button in PG3, and you can restart PG3 itself via a menu option.

Maybe I'm just lucky, but my production Polyglot instance has been up for over 3 months - that's how long the Polisy box has been running since I last cut power to it while doing electrical work. Yes, I do occasionally have issues with node servers, and most of the time it is because the device the node server is for has issues and the node server doesn't handle the device failure very well.

Link to comment

So I had the ISY crash 3 times in 2 days right after moving to PG3, another time a few days later, and again today. For sure it hasn't been this bad before, but yes, you are lucky. I think it's overloaded, and I really, really hope IoP will fix my long-standing performance issues. (I have 980 programs, and way more in my head for when I move to IoP, where the program limit will go from 1000 to 2000.)

I started by blaming ST-Inventory - see all the gory details here:

While it does have to read all my programs (which is taxing) and it did show intermittent errors, I don't think it's really an ST-Inventory issue specifically. For example, for today's crash there was no ST-Inventory activity in its logs at the time of the crash.

I'm okay with querying a NS not working quite as I had hoped, but what about being able to call for a NS (or PG3) restart from a program?

Even if the need (or hope) for restarts will truly be a rare 'edge' case someday, we're not there today, and it's not like there will never be restarts needed. Power outages are only expected to increase with climate change. I would think there's value in being able to do NS (or PG3) restarts via programs at ISY startup or if something goes wrong (e.g. no data updates for too long). Correct me if I'm wrong but that would reload data into ISY (unlike a query).

Not providing a restart capability because an "issue (may not) get reported" is not a good reason. It's not a customer-centric perspective, for one, and I also believe folks who use the ISY - or at least enough of us - will report issues. Who would want crashes, failed node servers, and restarts to be part of a weekly or monthly routine?

 

Link to comment

@johnnyt Thanks for the well-written response. You make some very good points!

In my defense, I didn't have anything to do with the original design of PG2 or PG3. I originally signed up to help test PG3 and now I'm the only developer working on it.  There's enough work that I do have a tendency to try and avoid major design changes if I can.  

My concerns are specifically around adding the capability to automatically reboot/restart. Once you can automate a work-around for a problem, it tends to not be a problem for you anymore. Sure, no one would ask for a weekly restart, but if it happens and you don't really know it happened, are you really going to care? I specifically didn't say anything about power outages because the system should recover from those without any user intervention. If it doesn't, we really need to understand why.

I'm not completely opposed to automatic restarts. PG3 will automatically restart if it crashes, and a lot of work has gone into making it recover and continue when it does. There's not really anything you can set up that would improve that process. Eventually, we should get to the same point with node servers, but we're a long way from that now.

Keep in mind that the production release of PG3 is just a little over 2 months old. It can take many iterations of a software product to work out the issues. With the number of different node servers and use cases, it is impossible for us to test more than a small fraction of what it's capable of doing.

I've been watching your other thread on ISY-Inventory with interest because I do want to know if there's anything we need to do in PG3 to prevent it from happening again. The key is to figure out what is really happening. A node server should not be able to crash an ISY. I do sometimes do stress testing with PG3 to see if I can make bad things happen. It's been a while since I've done any of that with an i994. It is possible to overload the i994's network interface, but I've not seen that crash the ISY. If you haven't, I highly recommend you submit a trouble ticket to UDI for the ISY crashes. Neither @simplextech nor I have access to the ISY code to debug it. The best we can do is help define a reproducible test case.

 

Link to comment
1 hour ago, bpwwer said:

I've been watching your other thread on ISY-Inventory with interest because I do want to know if there's anything we need to do in PG3 to prevent it from happening again. The key is to figure out what is really happening. A node server should not be able to crash an ISY. I do sometimes do stress testing with PG3 to see if I can make bad things happen. It's been a while since I've done any of that with an i994. It is possible to overload the i994's network interface, but I've not seen that crash the ISY. If you haven't, I highly recommend you submit a trouble ticket to UDI for the ISY crashes. Neither @simplextech nor I have access to the ISY code to debug it. The best we can do is help define a reproducible test case.

 

Thanks for the reply and all your good work on PG3, on top of all the NSs you built. Yes, I've been working with UDI on the crashes. While we're still working to get to the bottom of it, it appears right now that ST-Inventory requesting all my info (most notably the ~980 programs) may be too much for the 994i when a lot of other stuff is going on too. I decided I will only use it 'manually' until I move to IoP (once Z-Wave migration is available).

Aside from more horsepower behind IoP, if I heard/understood correctly, UDI is considering leveraging that extra horsepower to compress the ISY data so it loads faster in the admin console and in node servers (since I'm told they use the same API). I know at this point it's just a 'consideration', but personally I'm really hopeful that one day I will not have to wait almost 3 minutes to load the AC, with 2:30 of that spent loading programs. When I think about it, unless one can subscribe to node and program number changes as they happen, 2.5-3 minutes of sustained draw on the (under-powered) 994i is a huge window of time for it to get overloaded providing ST-Inventory the info it needs while potentially having many other things to do at the same time (hence my decision to leave it mostly off).

 

Link to comment

I'm glad that UDI has reviewed the issue and at least has some idea of the cause. 980 programs is a lot of programs to format and send out, and I knew the AC was slow at loading programs, but I thought that was because the AC had to process them, more than because the ISY had to send them.

It can be difficult to understand the effects of scale for something like this without doing the math. 3 minutes seemed like a long time, so I did a quick test. I only have 88 programs, but it takes about 300ms to send them. Scaling that up to 980, I get close to 3.5 minutes.
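If anyone wants to repeat that measurement outside the browser, something like this works. It's a sketch that assumes the ISY's /rest/programs endpoint over HTTP basic auth; the address and credentials below are placeholders for your own.

    # Rough way to repeat the test without the browser's developer tools.
    import time
    import requests

    ISY = "http://192.168.1.10"   # placeholder address
    AUTH = ("admin", "admin")     # placeholder credentials

    start = time.perf_counter()
    resp = requests.get(f"{ISY}/rest/programs", auth=AUTH, timeout=60)
    elapsed = time.perf_counter() - start
    print(f"{resp.status_code}: {len(resp.content)} bytes in {elapsed*1000:.0f} ms")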

It will probably take more than just compressing the data to get reasonable performance with 2000 programs.  I'm thinking it will need to reduce the size of the data by 20x or more.  But looking at the data, it may be possible simply by reformatting the data into something far less verbose. 

An interesting problem for sure.  Thanks for the effort to help resolve it.

Link to comment
4 hours ago, bpwwer said:

I'm glad that UDI has reviewed the issue and at least has some idea of the cause. 980 programs is a lot of programs to format and send out, and I knew the AC was slow at loading programs, but I thought that was because the AC had to process them, more than because the ISY had to send them.

It can be difficult to understand the effects of scale for something like this without doing the math. 3 minutes seemed like a long time, so I did a quick test. I only have 88 programs, but it takes about 300ms to send them. Scaling that up to 980, I get close to 3.5 minutes.

It will probably take more than just compressing the data to get reasonable performance with 2000 programs.  I'm thinking it will need to reduce the size of the data by 20x or more.  But looking at the data, it may be possible simply by reformatting the data into something far less verbose. 

An interesting problem for sure.  Thanks for the effort to help resolve it.

While Java is far from real-time, I find it hard to believe the AC running on a modern CPU is the bottleneck in loading performance. Just as interesting are backups, which can take well over 6 minutes. Fortunately (though not for the time it takes) I can tell the backup runs at a low priority, because the time it takes varies greatly. That would help explain why I've never seen a crash during one of those, and I think I would most certainly have seen one by now otherwise. Data-wise, a backup in my case is 632 KB of uncompressed data (413 KB compressed). I don't know how much of that is programs, for a straight comparison, but at 6 minutes (when things aren't too busy otherwise) that's about 1.7 KB/s of throughput, roughly equivalent to 14.4 kbps dial-up modem speed. So I'm not seeing the problem as related to the amount of data.
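Spelling out that arithmetic, for anyone checking:

    # The backup throughput math above, spelled out.
    backup_kb = 632          # uncompressed backup size, KB
    seconds = 6 * 60         # ~6 minute backup
    kb_per_s = backup_kb / seconds
    kbps = kb_per_s * 8      # kilobytes/s -> kilobits/s
    print(f"{kb_per_s:.2f} KB/s (~{kbps:.0f} kbps)")  # -> 1.76 KB/s, ~14 kbps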

While I confess I have no idea how things work in the background, it's really hard to believe my i7 CPU with hyperthreading (4 cores/8 threads) running at 3.4 GHz (turbo up to 3.8 GHz), connected via Gb Ethernet, wouldn't be able to support the AC processing way more data per second than that.

While I'm pretty sure the Polisy does not have anything close to an i7, I was really hoping that both the CPU and network speed improvements of the Polisy/IoP were going to mean a HUGE improvement in overall data processing speed - with 2 aspects to this: 1) IoP loading the console, and 2) IoP talking to NSs.

  1. Does/will the console loading not benefit from the Polisy hardware improvements? Could the 300ms to load the first 88 programs largely be attributed to "startup" processing, and end up notably more efficient for the next 900, or 1900?
  2. Please at least tell me IoP talks to node servers at something approaching internal bus/memory speed (minus PG3 overhead) and that it will make a noticeable difference.

One of my ideas has been to put my lighting-related Insteon on one ISY (either 994i or IoP) and, on the other, put Z-Wave with some Insteon (all HVAC-related). I'd rather not have to do that, but if I'm deluding myself thinking that Polisy/IoP will fix my issues, I may go there.

Your advice would be welcomed.

 

Link to comment

Hi Guys,

I've also been following a few threads and tickets with similar issues. I think this may be related to a client request timeout on the 994i while data transmission is in progress. While the 994i recovers most of the time within a minute, if there is a loop of large requests and timeouts, this could cause abnormal behavior or a crash, similar to a denial of service. The same issue may exist on Polisy, but there the system recovers in about 10 seconds and may drop the request for a graceful socket disconnect, causing 817 (Already subscribed).
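Roughly the failure mode I mean, sketched out below. This is illustrative only, not actual client or ISY code: a client timeout shorter than the time the ISY needs to answer, combined with automatic retries, piles up overlapping large requests much like a denial of service.

    # Illustrative sketch of the failure mode described above, not real code.
    import requests

    def fetch_with_retries(url, auth, timeout_s=5, retries=5):
        for attempt in range(retries):
            try:
                # If the ISY needs, say, 30 s to build the response, every
                # attempt times out and the loop immediately fires another
                # request while the ISY is still busy with the previous ones.
                return requests.get(url, auth=auth, timeout=timeout_s)
            except requests.Timeout:
                continue
        raise TimeoutError(f"gave up after {retries} attempts on {url}")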

Link to comment
13 hours ago, johnnyt said:

While I'm pretty sure the Polisy does not have anything close to an i7, I was really hoping that both the CPU and network speed improvements of the Polisy/IoP were going to mean a HUGE improvement in overall data processing speed - with 2 aspects to this: 1) IoP loading the console, and 2) IoP talking to NSs.

  1. Does/will the console loading not benefit from the Polisy hardware improvements? Could the 300ms to load the first 88 programs largely be attributed to "startup" processing, and end up notably more efficient for the next 900, or 1900?
  2. Please at least tell me IoP talks to node servers at something approaching internal bus/memory speed (minus PG3 overhead) and that it will make a noticeable difference.

If it wasn't obvious, my test and maths seemed to disprove my original theory that the AC was the cause of slow program loading. For the 88 programs, the time was split almost evenly between waiting for the ISY to prepare the data and the actual download of the data. I used the developer tools in the browser to get the timing info for the /rest/programs request from the ISY.

So I just tried the same thing with IOP running on a PolisyPro. Unfortunately, I don't have as many programs on that, so I'm not sure this is a valid comparison, but it was many times faster: 26 programs in 4.5ms, again split fairly evenly between processing time and content download time. If the time scales linearly (a big assumption), it would be able to send 980 programs in about 170ms. That's about 20x faster, per program, than the rate I measured on the i994.

These tests were done between the device and my browser over a wired ethernet connection. 

Since PG3 can work with both IOP and i994 ISYs, it communicates with them over the network interface. It may be a bit more optimized if it's using the localhost address, but it still uses the network drivers. However, that may be offset by the fact that one network driver is handling both ends of the communication (twice as much work) when both IOP and PG3 are running on a single Polisy.

Link to comment
This topic is now closed to further replies.
