Unfortunately, MID server cluster management in ServiceNow is not very mature. When a MID agent goes down, the same failover agent is always selected, so load balancing no longer takes place. If jobs are running during the outage, reassigning them to a failover agent can also cause duplicates in import jobs.
When a MID server is down, the MIDServerCluster script include always selects the first failover agent, so jobs are never load balanced across a set of agents. A better way to handle an outage is to ensure that the designed number of load balance agents is always available.
Current OOB rule:
- MID down: select the first failover agent.
- MID up: select a random load balance agent.
To be rule:
- Find all load balance agents (up or down), find a failover agent for every load balance agent that is down, then select a random agent from the resulting set.
- If no load balance agent is defined, select a random failover agent.
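The rule in the last two bullets above (load balance agents backed by failover agents) can be sketched in plain JavaScript. This is a minimal illustration, not the code shipped in the update set: the `selectAgent` function, the `{ name, status }` agent objects and the injectable `random` argument are assumptions made for this example, and the real script include reads cluster membership from GlideRecord queries instead.

```javascript
// Hypothetical agent model for this sketch: { name: 'mid1', status: 'Up' | 'Down' }.
function selectAgent(loadBalanceAgents, failoverAgents, random) {
    random = random || Math.random; // injectable so the choice can be made deterministic

    function isUp(agent) {
        return agent.status === 'Up';
    }

    function pick(list) {
        return list.length ? list[Math.floor(random() * list.length)] : null;
    }

    // Failover agents that are up and not already part of the load balance set.
    var loadBalanceNames = loadBalanceAgents.map(function (agent) {
        return agent.name;
    });
    var spares = failoverAgents.filter(function (agent) {
        return isUp(agent) && loadBalanceNames.indexOf(agent.name) === -1;
    });

    // No load balance agent defined: select randomly from the failover agents.
    if (loadBalanceAgents.length === 0) {
        return pick(spares);
    }

    // Every load balance agent that is up stays in the pool; every one that is
    // down is replaced by the next unused failover agent, if one is left.
    var pool = [];
    loadBalanceAgents.forEach(function (agent) {
        if (isUp(agent)) {
            pool.push(agent);
        } else if (spares.length > 0) {
            pool.push(spares.shift());
        }
    });

    return pick(pool);
}
```

With one of two load balance agents down and one failover agent up, the pool still holds two agents, so jobs keep being spread over the designed number of agents. The function returns `null` when no agent is available.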
The OOB capability check compares capabilities one by one. If the failing agent has, for example, two capabilities, a failover agent with a single ALL capability will not be selected.
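A capability check that treats ALL as a wildcard could look like the sketch below. It is only an illustration under simplifying assumptions: capabilities are modelled as plain strings, and `coversCapabilities` is a hypothetical helper name, not a function from the OOB or the replacement script include.

```javascript
// Hypothetical model: capabilities are plain strings such as 'SSH' or 'ALL'.
function coversCapabilities(requiredCapabilities, candidateCapabilities) {
    // A candidate with the ALL capability can take over any job.
    if (candidateCapabilities.indexOf('ALL') !== -1) {
        return true;
    }
    // Otherwise the candidate must offer every capability of the failing agent.
    return requiredCapabilities.every(function (capability) {
        return candidateCapabilities.indexOf(capability) !== -1;
    });
}
```

For a failing agent with the capabilities `['SSH', 'SNMP']`, a candidate offering only `['ALL']` is accepted here, whereas a one-by-one comparison would reject it.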
There are two options to install the Better MID Server Cluster Management:
If you're ok with changing the OOB script include OR use the MID discovery feature, use this version.
The BetterMIDServerCluster script include implements the same API as the OOB script include MIDServerCluster, but it implements the 'To be rule' and fixes the capability issue. It needs only two GlideRecord queries to look up the corresponding agent (on 'ecc_agent_cluster_member_m2m' and 'ecc_agent_capability_m2m'), and it works natively with the Fail over MID server event.
To install, follow these steps:
- Install the Better MID Server Cluster update set, which contains the following changes:
  - Disable the OOB MIDServerCluster script include.
  - Install the BetterMIDServerCluster replacement.
  - Add the 'better_mid_server.cluster.debug' system property to control the debug log.
- Add a comment to the platform upgrade run book to document the change and describe how to deal with upgrade conflicts in the future.
If you don't want to change the OOB script include and don't use the MID discovery feature, use this version.
The findLoadBalancerForAgent script implements the same logic as the script include version above, but it does not replace any existing OOB scripts.
To install, follow these steps:
- Disable (active = false) the MID Server Cluster Management Business Rule on 'ecc_queue'
- Install the Better MID Server Cluster Management update set
- Add a comment to the platform upgrade run book to document the change and describe how to deal with upgrade conflicts in the future.
ServiceNow can fail over jobs that have already started to a declared failover MID server. However, this can cause duplicate imports: the rows processed before the MID server went down are not cleaned up automatically, and the new job on the failover MID server starts again from the beginning. There is no alternative, because by the nature of SQL and the database the response can differ between runs.
If a MID server goes down, the 'mid_server.down' event is triggered. The script action Fail over MID server listens for this event and reassigns the failed job to a failover MID server.
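// Excerpt from the Fail over MID server script action; the 'continue' implies an enclosing loop.
var msc = new MIDServerCluster(current, "Failover");
if (!msc.clusterExists())
    continue; // no failover cluster exists, so there is nothing to reassign to
newMidName = msc.getClusterAgent();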
This causes problems for the failed import set run: the import SQL statement is re-run on the failover MID server, which leads to duplicates in the ECC queue.
To prevent duplicates from being imported, do ONE of the following:
- Handle duplicates by declaring coalesce fields on the transform map.
- Unique constraint
  - Create a unique constraint on the import set table.
  - Clean the import set table at the start of the script. Because imports will fail due to the constraint, add a cleanup script to the import set map or the scheduler.
// Transform Script: onStart()
var ic = new ImportSetCleaner(map.source_table);
ic.setDataOnly(true);
ic.clean();

// Scheduled Data Import: Pre script
var ic = new ImportSetCleaner('u_import_set_table');
ic.setDataOnly(true);
ic.clean();
- Disable the failover event script (suggested, unless you use discovery). Reassigning started jobs to another agent causes more issues than it solves. To prevent the platform from doing this, set active=false on Fail over MID server. The only downside is that the DiscoveryAgents are then not reassigned either.
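To illustrate why coalescing protects against a re-run import, the sketch below replays the same source rows twice, once appending blindly and once matching on a key. It is a plain JavaScript model, not ServiceNow code: `importRows`, the `id` key and the sample rows are assumptions used for illustration only.

```javascript
// Hypothetical model: the target table is an array of row objects.
function importRows(target, rows, coalesceField) {
    rows.forEach(function (row) {
        var match = null;
        if (coalesceField) {
            // With a coalesce field, look for an existing row with the same key.
            match = target.filter(function (existing) {
                return existing[coalesceField] === row[coalesceField];
            })[0] || null;
        }
        // Update the matching row in place, or build a new row if none matched.
        var destination = match || {};
        Object.keys(row).forEach(function (field) {
            destination[field] = row[field];
        });
        if (!match) {
            target.push(destination);
        }
    });
    return target;
}

var sourceRows = [{ id: 1, name: 'alpha' }, { id: 2, name: 'beta' }];

// The job fails over and the same rows are imported a second time.
var withoutCoalesce = importRows(importRows([], sourceRows), sourceRows);
var withCoalesce = importRows(importRows([], sourceRows, 'id'), sourceRows, 'id');

console.log(withoutCoalesce.length); // 4
console.log(withCoalesce.length);    // 2
```

Without a key, the second run doubles the rows. With the key, the second run only updates the rows that already exist.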
Set the system property better_mid_server.cluster.debug to true to enable the debug log in Better MID Server Cluster Management (gs.debug() is used).
Set the system property mid_server.cluster.debug to true to enable logging in MIDServerCluster.
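A property-gated debug log can be sketched as follows. How the update set actually reads the property is not shown here, so treat this as an assumption: `debugLog` and the message prefix are hypothetical, and `gs` is passed in as an argument only so that the sketch can be run outside ServiceNow with a stub.

```javascript
// 'gs' is passed in so the sketch can run outside ServiceNow with a stub.
function debugLog(gs, message) {
    // Coerce the property value to a JavaScript string before comparing.
    var enabled = String(gs.getProperty('better_mid_server.cluster.debug', 'false')) === 'true';
    if (enabled) {
        gs.debug('[BetterMIDServerCluster] ' + message);
    }
}
```

Inside a server-side script the global `gs` object would be passed as the first argument, for example `debugLog(gs, 'selected agent: ' + name)`.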
- MIDServerCluster OOB, called by:
  - MID Server Cluster Management OOB Business Rule
  - Fail over MID server OOB Script Action
- Better MID Server Cluster, as a replacement for the above, also called by:
  - MID Server Cluster Management OOB Business Rule
  - Fail over MID server OOB Script Action
- MID Server Cluster Management on 'ecc_queue', to reassign the job to a load balance or failover agent
- Better MID Server Cluster Management on 'ecc_queue', as a replacement for the above
- Fail over MID server OOB 'mid_server.down' event, to reassign all 'ready' or 'processing' jobs to a failover agent