
Harden and Document Quartz.NET Clustering Support #109

@sfmskywalker

Description

Elsa’s Quartz.NET integration must be enhanced and documented to properly support clustered, multi-instance deployments and to prevent race conditions observed in real-world Kubernetes environments.

Cluster Mode

The current Quartz.NET integration already exposes direct access to Quartz.NET’s configuration APIs, so advanced users can configure clustered mode manually. However, there is no first-class, opinionated configuration path. To improve usability and reduce misconfiguration risk, the integration should expose a convenience method (e.g. EnableClustering) that configures sensible defaults for clustered operation (such as instanceId = AUTO, a clustered job store, and check-in intervals), while still allowing optional parameters to be overridden where appropriate.
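
For reference, the defaults mentioned above correspond roughly to the following standard Quartz.NET configuration keys. This is only a sketch of what an EnableClustering-style helper might set, not Elsa’s actual implementation; the instance name, the SQL Server provider, the connection string, and the interval value are placeholder assumptions.

```properties
# All nodes of one cluster must share the same instance name (placeholder value).
quartz.scheduler.instanceName = ElsaScheduler
# AUTO lets each node generate its own unique instance id.
quartz.scheduler.instanceId = AUTO

# Clustering requires a persistent ADO.NET job store shared by all nodes.
quartz.jobStore.type = Quartz.Impl.AdoJobStore.JobStoreTX, Quartz
quartz.jobStore.clustered = true
# How often (in milliseconds) each node checks in with the cluster.
quartz.jobStore.clusterCheckinInterval = 7500

# Assumed database: SQL Server; substitute the provider and delegate for your database.
quartz.jobStore.driverDelegateType = Quartz.Impl.AdoJobStore.SqlServerDelegate, Quartz
quartz.jobStore.dataSource = default
quartz.dataSource.default.provider = SqlServer
quartz.dataSource.default.connectionString = <your connection string>
# JSON serialization requires the Quartz.Serialization.Json package.
quartz.serializer.type = json
```

A convenience method would presumably apply equivalent values through Quartz.NET’s configuration APIs and expose the check-in interval as an optional parameter.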

Documentation

In parallel, Elsa’s documentation must be updated to clearly explain clustered Quartz.NET usage in distributed hosting scenarios. Specifically, the existing documentation section on Distributed Hosting and Quartz.NET clustered mode (https://docs.elsaworkflows.io/hosting/distributed-hosting#id-4.-quartz.net-clustered-mode) should be revised to reflect the current Elsa 3.6 architecture, in which the Quartz.NET integration is an optional add-on in elsa-extensions. The revised section should clearly describe when clustering is required and provide guidance aligned with the new convenience configuration API.

Race Conditions

Additionally, the Quartz.NET Elsa integration module contains potential race conditions during job and trigger registration when multiple pods start concurrently. Current logic performs a “check-then-act” sequence without distributed locking, for example:

  • In RegisterJobsTask.cs, checking for job existence before calling AddJob.
  • In QuartzWorkflowScheduler.cs, checking for trigger existence before calling ScheduleJob.

In clustered or multi-pod deployments, this can result in multiple instances attempting to register the same job or trigger simultaneously, leading to ObjectAlreadyExistsException errors. This behavior was observed in production during an incident, where multiple pods attempted to register identical Quartz triggers during tenant activation. The lack of effective cluster coordination (and/or insufficient guarding against concurrent registration) directly contributed to workflow interruptions and scheduler instability.
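
One possible way to make registration idempotent, sketched here under assumptions rather than taken from the Elsa codebase, is to drop the existence check and treat ObjectAlreadyExistsException as "another node got there first". The class and method names below (IdempotentSchedulingExtensions, TryAddJobAsync, TryScheduleJobAsync) are hypothetical; only the IScheduler calls and the exception type come from Quartz.NET.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Quartz;

// Hypothetical helper; not part of Elsa or Quartz.NET.
public static class IdempotentSchedulingExtensions
{
    // Adds a job without a prior existence check.
    // Returns true if this node stored the job, false if it already existed.
    // Note: jobs added without a trigger must be durable.
    public static async Task<bool> TryAddJobAsync(
        this IScheduler scheduler, IJobDetail job, CancellationToken ct = default)
    {
        try
        {
            await scheduler.AddJob(job, false, ct);
            return true;
        }
        catch (ObjectAlreadyExistsException)
        {
            // Another node registered the same job concurrently; treat as success.
            return false;
        }
    }

    // Schedules a trigger for an already stored job.
    // Returns true if this node stored the trigger, false if it already existed.
    public static async Task<bool> TryScheduleJobAsync(
        this IScheduler scheduler, ITrigger trigger, CancellationToken ct = default)
    {
        try
        {
            await scheduler.ScheduleJob(trigger, ct);
            return true;
        }
        catch (ObjectAlreadyExistsException)
        {
            // Another node scheduled the same trigger concurrently; nothing to do.
            return false;
        }
    }
}
```

This relies on the job store to reject duplicates atomically instead of on a separate check, so the window between "check" and "act" disappears. Whether swallowing the exception is sufficient, or whether a distributed lock around tenant activation is also needed, is a design decision for the actual fix.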

This requirement therefore covers:

  • Introducing a clustering convenience API to reduce configuration errors.
  • Updating official Elsa documentation to clearly describe clustered Quartz.NET usage in distributed hosting.
  • Hardening the Quartz.NET integration against concurrent job/trigger registration across multiple nodes, ensuring idempotent and cluster-safe behavior even during simultaneous startup or tenant activation.

Addressing these items will significantly improve reliability, reduce operational risk in Kubernetes and other distributed environments, and align Elsa Workflows’ Quartz.NET integration with real-world clustered deployment expectations.

Metadata


Labels

No labels

Projects

Status

In Progress

Relationships

None yet

Development

No branches or pull requests
