@valentijnscholten valentijnscholten commented Oct 21, 2025

Traditionally, Defect Dojo has deduplicated (new) findings one by one. This works well for small imports and has the benefit of an easy-to-understand codebase and test suite.

For larger imports, however, performance is poor and resource usage is (very) high. An import with 1000+ findings can cause a Celery worker to spend minutes on deduplication.

This PR changes the deduplication process for import and reimport to run in batches. The biggest benefit is that there will now be 1 database query per batch (1000 findings) instead of 1 query per finding (1000 queries).
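
To illustrate what the batching buys us, here is a minimal, hypothetical sketch of a batched candidate lookup using the Django ORM. The function name and the hash_code-only matching are simplifying assumptions and not the PR's actual code:

```python
from collections import defaultdict

from dojo.models import Finding


def find_candidate_duplicates_batched(new_findings, product):
    """Fetch possible duplicate candidates for a whole batch in one query.

    Instead of one query per new finding, collect all hash_codes first,
    filter with hash_code__in, and group the candidates per hash_code in
    memory. (Sketch only: real matching also depends on the deduplication
    algorithm configured per parser.)
    """
    hash_codes = {f.hash_code for f in new_findings if f.hash_code}

    candidates = Finding.objects.filter(
        test__engagement__product=product,
        hash_code__in=hash_codes,
    ).exclude(id__in=[f.id for f in new_findings])

    candidates_by_hash = defaultdict(list)
    for candidate in candidates:
        candidates_by_hash[candidate.hash_code].append(candidate)
    return candidates_by_hash
```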

During the development of the PR I realized:

Although batching dedupe sounds like a simple PR, the caveat is that with one-by-one deduplication the result for the first finding in a report can have an effect on the deduplication result of the next findings (if there are duplicates inside the same report). This should be a corner case and usually means the deduplication configuration needs some fine-tuning. Nevertheless, we wanted to make sure not to cause unexpected/different behavior here. The new tests should cover this.
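
For that in-report corner case, one possible way to preserve the one-by-one result inside a batch is to process findings in report order and let an earlier finding act as the original for a later duplicate. This is an assumption about one approach (reusing the candidates_by_hash mapping from the sketch above), not necessarily what the PR does:

```python
def resolve_in_report_duplicates(new_findings, candidates_by_hash):
    """Return (finding, original) pairs, letting an earlier finding in the
    same batch serve as the original for a later duplicate in the same
    report, so batching mimics the one-by-one result. Illustrative only."""
    matches = []
    seen_in_batch = {}  # hash_code -> first finding in this batch with that hash

    for finding in new_findings:  # keep report order
        existing = candidates_by_hash.get(finding.hash_code, [])
        if existing:
            matches.append((finding, existing[0]))
        elif finding.hash_code in seen_in_batch:
            matches.append((finding, seen_in_batch[finding.hash_code]))
        else:
            seen_in_batch[finding.hash_code] = finding
    return matches
```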

The PR splits the deduplication process into three parts (a rough sketch follows the list):

  1. Finding possible candidates
  2. Match the (new) finding against the candidates
  3. Act upon it if a match is found
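
A rough sketch of this three-part split; the function names are illustrative, and while the Finding fields and the dedupe_option save kwarg exist in Defect Dojo's model, treat the details as assumptions rather than this PR's actual API:

```python
def find_candidates(new_findings, product):
    """Step 1: collect possible duplicate candidates for the whole batch,
    e.g. with the batched hash_code__in query sketched earlier."""
    ...


def match_finding(new_finding, candidates):
    """Step 2: return the existing finding the new one matches, or None."""
    for candidate in candidates:
        if candidate.hash_code == new_finding.hash_code:
            return candidate
    return None


def act_on_match(new_finding, original):
    """Step 3: mark the new finding as a duplicate of the matched original."""
    new_finding.duplicate = True
    new_finding.duplicate_finding = original
    new_finding.active = False
    new_finding.save(dedupe_option=False)  # avoid re-triggering dedupe on save
```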

One of the reasons for doing this is that we want to use the exact same matching logic for the reimport process. Currently reimport has an almost identical matching algorithm, but with minor unintentional differences. Once this PR has proven itself, we will adjust the reimport process. Next to the "reimport matching", the reimport process also deduplicates new findings; that part already uses the batchwise deduplication from this PR.

A quick test with the jfrog_xray_unified/very_many_vulns.json sample scan (10k findings) shows a huge improvement in deduplication time. Please note that we're not only doing this for performance, but also to reduce the resources (cloud cost) needed to run Defect Dojo.

| branch | import time | dedupe time | total time |
| --- | --- | --- | --- |
| dev | ~200s | ~400s | ~600s |
| dedupe-batching | ~190s | ~12s | ~200s |

Imagine what this can do for reimport performance if we switch that to batch mode.

@valentijnscholten valentijnscholten added this to the 2.53.0 milestone Oct 23, 2025
github-actions bot commented

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot commented Nov 1, 2025

Conflicts have been resolved. A maintainer will review the pull request shortly.

@valentijnscholten valentijnscholten marked this pull request as ready for review November 7, 2025 21:03

dryrunsecurity bot commented Nov 7, 2025

DryRun Security

This pull request introduces a configurable IMPORT_REIMPORT_DEDUPE_BATCH_SIZE used when post-processing imported findings, but it lacks validation of a minimum value. An administrator could set it to a very small number (e.g., 1) and cause thousands of Celery tasks to be dispatched for a large import, potentially overwhelming the broker/workers and causing a denial of service for background processing. It’s recommended to enforce sensible bounds or rate-limit task dispatch to prevent resource exhaustion.

Denial of Service via Misconfiguration in dojo/importers/default_importer.py
Vulnerability: Denial of Service via Misconfiguration
Description: The IMPORT_REIMPORT_DEDUPE_BATCH_SIZE setting, which controls the batch size for post-processing findings, can be configured by an administrator via an environment variable. While it defaults to 1000, there is no validation to enforce a minimum value. If an administrator sets this value to 1, importing a large report (e.g., 10,000 findings) would result in 10,000 individual Celery tasks being dispatched almost simultaneously. This flood of tasks can overwhelm the message broker and Celery workers, leading to resource exhaustion and a denial of service for all background processing within the application.

```python
batch_max_size = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
"""
Saves findings in memory that were parsed from the scan report into the database.
"""
```
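
One possible mitigation for the bot's recommendation would be to clamp the configured value to sensible bounds when reading the setting. This is a hedged sketch, not code from the PR; the bounds and the helper name are made up:

```python
from django.conf import settings

# Illustrative bounds; not part of the PR or of Defect Dojo's settings.
MIN_DEDUPE_BATCH_SIZE = 100
MAX_DEDUPE_BATCH_SIZE = 10_000


def get_dedupe_batch_size():
    """Read IMPORT_REIMPORT_DEDUPE_BATCH_SIZE and clamp it to sane bounds so a
    misconfigured value (e.g. 1) cannot flood the broker with tiny batch tasks."""
    configured = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
    return max(MIN_DEDUPE_BATCH_SIZE, min(int(configured), MAX_DEDUPE_BATCH_SIZE))
```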


All finding details can be found in the DryRun Security Dashboard.

@github-actions github-actions bot added the settings_changes label (Needs changes to settings.py based on changes in settings.dist.py included in this PR) Nov 8, 2025