Harden and Document Quartz.NET Clustering Support

Elsa’s Quartz.NET integration must be enhanced and documented to properly support clustered, multi-instance deployments and to prevent race conditions observed in real-world Kubernetes environments.

### Cluster Mode
The current Quartz.NET integration already exposes direct access to Quartz.NET’s configuration APIs, so advanced users can configure clustered mode manually. However, there is no first-class, opinionated configuration path. To improve usability and reduce misconfiguration risk, the integration should expose a convenience method (e.g. `EnableClustering`) that configures sensible defaults for clustered operation (such as `instanceId = AUTO`, a clustered job store, and check-in intervals) while still allowing individual settings to be overridden through optional parameters.
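As a rough illustration of what such a convenience method could look like, the sketch below wraps Quartz.NET's own dependency-injection configurator (`IServiceCollectionQuartzConfigurator` from `Quartz.Extensions.DependencyInjection`). The class name, the method signature, and the fallback interval values are assumptions made for this example; they are not taken from Elsa or `elsa-extensions`, and the final API may differ.

```csharp
using System;
using Quartz;

// Hypothetical helper: the name and signature are illustrative, not Elsa's actual API.
public static class QuartzClusteringExtensions
{
    public static void EnableClustering(
        this IServiceCollectionQuartzConfigurator quartz,
        Action<SchedulerBuilder.PersistentStoreOptions> configureStore,
        TimeSpan? checkinInterval = null,
        TimeSpan? checkinMisfireThreshold = null)
    {
        // "AUTO" asks Quartz.NET to generate a unique instance id for each node.
        quartz.SchedulerId = "AUTO";

        quartz.UsePersistentStore(store =>
        {
            store.UseProperties = true;

            // Turn on clustered mode. The fallback values are assumptions for this sketch.
            store.UseClustering(cluster =>
            {
                cluster.CheckinInterval = checkinInterval ?? TimeSpan.FromSeconds(10);
                cluster.CheckinMisfireThreshold = checkinMisfireThreshold ?? TimeSpan.FromSeconds(20);
            });

            // The caller supplies the database provider and serializer,
            // since clustering requires a shared persistent job store.
            configureStore(store);
        });
    }
}
```

A host would call this inside `services.AddQuartz(q => q.EnableClustering(store => ...))`, with the store callback selecting a database provider (for example `UseSqlServer`) and a serializer. Keeping the provider choice in the callback lets the helper set clustering defaults without tying it to one database.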

### Documentation
In parallel, Elsa’s documentation must be updated to clearly explain clustered Quartz.NET usage in distributed hosting scenarios. Specifically, the existing section on [Distributed Hosting: Quartz.NET clustered mode](https://docs.elsaworkflows.io/hosting/distributed-hosting#id-4.-quartz.net-clustered-mode) should be revised to reflect the current Elsa 3.6 architecture (optional add-on in `elsa-extensions`), to describe when clustering is required, and to provide guidance aligned with the new convenience configuration API.

### Race Conditions
Additionally, the Quartz.NET Elsa integration module contains potential race conditions during job and trigger registration when multiple pods start concurrently. Current logic performs a “check-then-act” sequence without distributed locking, for example:

*   In `RegisterJobsTask.cs`, checking for job existence before calling `AddJob`.    
*   In `QuartzWorkflowScheduler.cs`, checking for trigger existence before calling `ScheduleJob`.
    
In clustered or multi-pod deployments, this can result in multiple instances attempting to register the same job or trigger simultaneously, leading to `ObjectAlreadyExistsException` errors. This was observed during a production incident in which multiple pods attempted to register identical Quartz triggers during tenant activation. The lack of effective cluster coordination (and/or insufficient guarding against concurrent registration) directly contributed to workflow interruptions and scheduler instability.
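One possible way to make registration idempotent is to skip the separate existence check, attempt the write directly, and treat `ObjectAlreadyExistsException` as "another node got there first". The sketch below uses only the standard Quartz.NET `IScheduler` calls; the helper class and method names are hypothetical and do not reflect the actual code in `RegisterJobsTask.cs` or `QuartzWorkflowScheduler.cs`.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Quartz;

// Hypothetical helpers: names are illustrative, not part of Elsa or Quartz.NET.
public static class IdempotentQuartzRegistration
{
    // Returns true if this node stored the job, false if it already existed.
    public static async Task<bool> TryAddJobAsync(
        IScheduler scheduler, IJobDetail job, CancellationToken ct = default)
    {
        try
        {
            // replace: false, storeNonDurableWhileAwaitingScheduling: true
            await scheduler.AddJob(job, false, true, ct);
            return true;
        }
        catch (ObjectAlreadyExistsException)
        {
            // Another node registered the same job between startup and this call.
            return false;
        }
    }

    // Returns true if this node stored the trigger, false if it already existed.
    // The trigger is assumed to reference an existing job (e.g. built with ForJob).
    public static async Task<bool> TryScheduleAsync(
        IScheduler scheduler, ITrigger trigger, CancellationToken ct = default)
    {
        try
        {
            await scheduler.ScheduleJob(trigger, ct);
            return true;
        }
        catch (ObjectAlreadyExistsException)
        {
            // Losing the race is treated as success: the trigger is in the store.
            return false;
        }
    }
}
```

This approach relies on the job store rejecting duplicate keys rather than on a prior `CheckExists` call, so it stays correct when several pods start at once. If the stored definition must be updated rather than left untouched, passing `replace: true` or adding an explicit distributed lock would be alternative designs to evaluate.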

This requirement therefore covers:

*   Introducing a clustering convenience API to reduce configuration errors.    
*   Updating official Elsa documentation to clearly describe clustered Quartz.NET usage in distributed hosting.
*   Hardening the Quartz.NET integration against concurrent job/trigger registration across multiple nodes, ensuring idempotent and cluster-safe behavior even during simultaneous startup or tenant activation.
    
Addressing these items will significantly improve reliability, reduce operational risk in Kubernetes and other distributed environments, and align Elsa Workflows’ Quartz.NET integration with real-world clustered deployment expectations.

Harden and Document Quartz.NET Clustering Support #109
