Health timeout Issue #687
Replies: 7 comments 8 replies
-
Difficult to say what's going on without knowing more about your environment. It's a fairly sizeable Sense setup, but nothing Butler and Butler SOS cannot handle. Some thoughts/questions:
-
Hi Goran
Thanks for the quick reply!
My comments on your questions below:
On Wed, Dec 13, 2023 at 9:00 AM Göran Sander wrote:

> Difficult to say what's going on without knowing more about your environment.
> It's a fairly sizeable Sense setup, but nothing Butler and Butler SOS cannot handle.
> Some thoughts/questions:
>
> - Where is Butler SOS running? Close to the Sense servers (from a network perspective)? Is there high load on the server where Butler SOS is running?

I'm running Butler-SOS on my failover central node. It is a passive node, so I don't think high load is the issue here.

> - Any errors or warnings in the Butler SOS log files?

The only errors coming through look like this:

2023-12-13T03:36:57.937Z error: HEALTH: Error when calling health check API: AxiosError: timeout of 5000ms exceeded

It was also difficult to figure out which servers were struggling, so I added ${host} to the error message, which now reads:

2023-12-13T07:30:54.387Z error: HEALTH: Error when calling health check API for pcdwqs9zatcwi.vodacom.corp:4747: AxiosError: timeout of 10000ms exceeded

As you can see, I played with the timeout value and went up to 15000, but that doesn't seem to have had any effect.

> - More Sense nodes mean that Butler SOS has more to do (obviously).

I must say it doesn't look like we are struggling here.

> - The polling interval may be too short. This could in theory lead to Butler SOS not being able to capture all data until it's time to gather the next set of data, eventually leading to things timing out and loss of metrics.
>   - I would expect errors in the Butler SOS logs in this case.
>   - Polling for new data every 30 seconds should be fine, I'd say. Has worked well in other similar sized Sense environments. But you could try increasing poll time to a minute just to test the hypothesis.
>   - Poll time is controlled via the config file setting Butler-SOS.serversToMonitor.pollingInterval. Remember to restart Butler SOS after making changes to the config file!
> - Network flakiness could be a reason for the symptoms you see. Unlikely maybe, but I did once see similar things in a network where there turned out to be routing issues. Some packets took the looong way to the destination (many timed out en route) and some packets were routed as planned.

Are you aware of a way to check if there are issues like this in my case?

Thanks again for the reply and looking forward to chatting some more. I am
looking to deploy Butler (To integrate to Pagerduty) and Butler-Spyglass
(for lineage, we have a store of about 3Tb)
--
Chris du Plessis
-
I was testing SOS on one server and wanted to deploy it on our central node, as I know the ports are open between that central node and the other nodes (trying to fix the earlier issue where engine health data was not coming through). Now the SOS instance there won't start, failing with:

Error: Cannot find module 'node:http'

Any ideas? It looks like it is trying to find a module called node:http, which for some reason I can't install with npm:

c:\tools\butler-sos\src>npm i node:http
npm ERR! A complete log of this run can be found in: C:\Users\qliksens\AppData\Local\npm-cache_logs\2023-12-14T04_35_01_457Z-debug-0.log

From what I've seen looking around, I should try http-node and http, but I had no success installing those either. Just running npm i goes through fine, though:

up to date in 2m
120 packages are looking for funding
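
A note on the error above: `node:http` is not an npm package but Node.js' built-in `http` module addressed with the `node:` prefix, which is why `npm i` cannot fetch it. `require()` only understands that prefix from Node.js 14.18.0 (on the 14.x line) and from 16.0.0 onwards, so one likely cause, assuming nothing else is wrong with the install, is that the Node.js version on the central node is older than the one on the test server. A minimal sketch for checking this follows; the helper names are made up for illustration.

```javascript
// Parse a Node.js version string such as "v14.17.6" into numbers.
function parseVersion(version) {
  const [major, minor, patch] = version.replace(/^v/, '').split('.').map(Number);
  return { major, minor, patch };
}

// require('node:...') is supported from 14.18.0 on the 14.x line
// and from 16.0.0 onwards; 15.x and anything older does not support it.
function supportsNodePrefix(version) {
  const { major, minor } = parseVersion(version);
  if (major >= 16) return true;
  return major === 14 && minor >= 18;
}

console.log(`Running Node.js ${process.version}`);
console.log(`require('node:http') expected to work: ${supportsNodePrefix(process.version)}`);
```

Running this with the same `node` executable that starts Butler SOS shows whether the runtime is new enough. If it is not, upgrading Node.js on that server is the thing to try, rather than installing packages named `http` or `http-node`.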
Beta Was this translation helpful? Give feedback.
-
Nope, service monitoring is in Butler only.
There was a major rework of that feature in the Butler version released last night, btw.

On Thu, 14 Dec 2023, 14:11 Chrisdup8710 wrote:

> Great ok I'll see what I get if I merge those.
> Also wanted to find out, the service monitoring you have in butler, can that be done in SOS?
-
Here is a sample XML log appender file for the Sense scheduler. It has sections for both Butler and Butler SOS, with the assumption that both tools run on a server with IP 10.11.12.13. The file should be usable as-is (change the IP and possibly the UDP ports).

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Appender for detecting reload task failures. Only the last of potentially several retries is reported -->
  <appender name="TaskFailureLogger" type="log4net.Appender.UdpAppender">
    <filter type="log4net.Filter.StringMatchFilter">
      <param name="stringToMatch" value="Max retries reached" />
    </filter>
    <filter type="log4net.Filter.DenyAllFilter" />
    <param name="remoteAddress" value="10.11.12.13" />
    <param name="remotePort" value="9998" />
    <param name="encoding" value="utf-8" />
    <layout type="log4net.Layout.PatternLayout">
      <converter>
        <param name="name" value="hostname" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.HostNamePatternConverter" />
      </converter>
      <param name="conversionpattern" value="/scheduler-reload-failed/;%hostname;%property{TaskName};%property{AppName};%property{User};%property{TaskId};%property{AppId};%date;%level;%property{ExecutionId};%message" />
    </layout>
  </appender>

  <!-- Appender for detecting aborted reloads -->
  <appender name="AbortedReloadTaskLogger" type="log4net.Appender.UdpAppender">
    <filter type="log4net.Filter.StringMatchFilter">
      <param name="stringToMatch" value="Execution State Change to Aborting" />
    </filter>
    <filter type="log4net.Filter.DenyAllFilter" />
    <param name="remoteAddress" value="10.11.12.13" />
    <param name="remotePort" value="9998" />
    <param name="encoding" value="utf-8" />
    <layout type="log4net.Layout.PatternLayout">
      <converter>
        <param name="name" value="hostname" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.HostNamePatternConverter" />
      </converter>
      <param name="conversionpattern" value="/scheduler-reload-aborted/;%hostname;%property{TaskName};%property{AppName};%property{User};%property{TaskId};%property{AppId};%date;%level;%property{ExecutionId};%message" />
    </layout>
  </appender>

  <!-- Appender for detecting successful reload tasks -->
  <appender name="ReloadTaskSuccessLogger" type="log4net.Appender.UdpAppender">
    <filter type="log4net.Filter.StringMatchFilter">
      <param name="stringToMatch" value="Reload complete" />
    </filter>
    <filter type="log4net.Filter.DenyAllFilter" />
    <param name="remoteAddress" value="10.11.12.13" />
    <param name="remotePort" value="9998" />
    <param name="encoding" value="utf-8" />
    <layout type="log4net.Layout.PatternLayout">
      <converter>
        <param name="name" value="hostname" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.HostNamePatternConverter" />
      </converter>
      <param name="conversionpattern" value="/scheduler-reloadtask-success/;%hostname;%property{TaskName};%property{AppName};%property{User};%property{TaskId};%property{AppId};%date;%level;%property{ExecutionId};%message" />
    </layout>
  </appender>

  <!-- Generic appender for detecting warnings and errors -->
  <appender name="LogEvent" type="log4net.Appender.UdpAppender">
    <param name="threshold" value="warn" />
    <param name="remoteAddress" value="10.11.12.13" />
    <param name="remotePort" value="9996" />
    <param name="encoding" value="utf-8" />
    <layout type="log4net.Layout.PatternLayout">
      <converter>
        <param name="name" value="rownum" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.CounterPatternConverter" />
      </converter>
      <converter>
        <param name="name" value="hostname" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.HostNamePatternConverter" />
      </converter>
      <converter>
        <param name="name" value="longIso8601date" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.Iso8601TimeOffsetPatternConverter" />
      </converter>
      <converter>
        <param name="name" value="user" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.ServiceUserNameCachedPatternConverter" />
      </converter>
      <converter>
        <param name="name" value="encodedmessage" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.EncodedMessagePatternConverter" />
      </converter>
      <converter>
        <param name="name" value="encodedexception" />
        <param name="type" value="Qlik.Sense.Logging.log4net.Layout.Pattern.EncodedExceptionPatternConverter" />
      </converter>
      <param name="conversionpattern" value="/qseow-scheduler/;%rownum{9999};%longIso8601date;%date;%level;%hostname;%logger;%user;%encodedmessage;%encodedexception;%property{UserDirectory};%property{UserId};%property{User};%property{TaskName};%property{AppName};%property{TaskId};%property{AppId};%property{ExecutionId}" />
    </layout>
  </appender>

  <!-- Send UDP message to Butler SOS on warnings and errors -->
  <logger name="Service">
    <appender-ref ref="LogEvent" />
  </logger>
  <logger name="System">
    <appender-ref ref="LogEvent" />
  </logger>

  <!-- Send message to Butler on task failure -->
  <!-- Send message to Butler on task abort -->
  <logger name="System.Scheduler.Scheduler.Master.Task.TaskSession">
    <appender-ref ref="TaskFailureLogger" />
    <appender-ref ref="AbortedReloadTaskLogger" />
  </logger>

  <!-- Send message to Butler on reload task success -->
  <logger name="System.Scheduler.Scheduler.Slave.Tasks.ReloadTask">
    <appender-ref ref="ReloadTaskSuccessLogger" />
  </logger>
</configuration>
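
To make the appender output more concrete: each of the three reload task appenders above sends one UDP datagram whose fields are separated by semicolons, in the order given by its `conversionpattern`. The sketch below splits such a message into named fields. It is an illustration only, not Butler's actual parsing code; the function name `parseReloadTaskMessage`, the field names and the sample values are assumptions. Because the log message itself may contain semicolons, everything after the tenth separator is treated as the message.

```javascript
// Field order taken from the reload task conversionpattern values above.
// The names themselves are hypothetical, chosen for readability.
const FIELD_NAMES = [
  'messageType', // e.g. /scheduler-reload-failed/
  'host',        // %hostname
  'taskName',    // %property{TaskName}
  'appName',     // %property{AppName}
  'user',        // %property{User}
  'taskId',      // %property{TaskId}
  'appId',       // %property{AppId}
  'logTimestamp', // %date
  'logLevel',    // %level
  'executionId', // %property{ExecutionId}
];

// Split one semicolon-separated appender message into an object.
function parseReloadTaskMessage(raw) {
  const parts = raw.split(';');
  if (parts.length < FIELD_NAMES.length + 1) {
    throw new Error(`Expected at least ${FIELD_NAMES.length + 1} fields, got ${parts.length}`);
  }
  const result = {};
  FIELD_NAMES.forEach((name, i) => {
    result[name] = parts[i];
  });
  // %message comes last and may itself contain semicolons.
  result.message = parts.slice(FIELD_NAMES.length).join(';');
  return result;
}

// Made-up sample message, for illustration only.
const sample = [
  '/scheduler-reload-failed/', 'sense-node-1', 'Nightly sales reload', 'Sales',
  'LAB\\svc-sense', 'task-id-123', 'app-id-456', '2023-12-14 04:35:01,457',
  'ERROR', 'exec-id-789', 'Max retries reached; giving up',
].join(';');

console.log(parseReloadTaskMessage(sample));
```

The generic `LogEvent` appender uses a different, longer pattern, so it would need its own field list.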
-
By the way, you may find the blog posts at https://ptarmiganlabs.com of interest too.
-
Closing old ticket.
-
Good day
I've recently implemented Butler-SOS in our environment, where we have 2 central nodes, 2 presentation nodes (Proxy running through main central), and 7 reload nodes.
I'm currently struggling with the health check calls to 6 of the servers, where Butler-SOS reports that the requests keep timing out. Some of those requests do go through, but only intermittently. Any ideas on where I should start looking for the issue?
I must also add that we do get some of the log events coming through from these servers.