@valentijnscholten valentijnscholten commented Oct 21, 2025

Traditionally, Defect Dojo has deduplicated (new) findings one by one. This works well for small imports and has the benefit of an easy-to-understand codebase and test suite.

For larger imports, however, performance is poor and resource usage is (very) high. An import with 1000+ findings can cause a Celery worker to spend minutes on deduplication.

This PR changes the deduplication process for import and reimport to run in batches. The biggest benefit is that there will now be 1 database query per batch (1000 findings) instead of 1 query per finding (1000 queries).
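
To illustrate what the batching buys us, here is a minimal, hypothetical sketch of a batched candidate lookup using the Django ORM. The function name and the hash_code-only matching are simplifying assumptions and not the PR's actual code:

```python
from collections import defaultdict

from dojo.models import Finding


def find_candidate_duplicates_batched(new_findings, product):
    """Fetch possible duplicate candidates for a whole batch in one query.

    Instead of one query per new finding, collect all hash_codes first,
    filter with hash_code__in, and group the candidates per hash_code in
    memory. (Sketch only: real matching also depends on the deduplication
    algorithm configured per parser.)
    """
    hash_codes = {f.hash_code for f in new_findings if f.hash_code}

    candidates = Finding.objects.filter(
        test__engagement__product=product,
        hash_code__in=hash_codes,
    ).exclude(id__in=[f.id for f in new_findings])

    candidates_by_hash = defaultdict(list)
    for candidate in candidates:
        candidates_by_hash[candidate.hash_code].append(candidate)
    return candidates_by_hash
```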

During the development of the PR I realized:

Although batching dedupe sounds like a simple PR, the caveat is that with one-by-one deduplication the result for the first finding in a report can have an effect on the deduplication result of the next findings (if there are duplicates inside the same report). This should be a corner case and usually means the deduplication configuration needs some fine-tuning. Nevertheless, we wanted to make sure not to cause unexpected/different behavior here. The new tests should cover this.
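
For that in-report corner case, one possible way to preserve the one-by-one result inside a batch is to process findings in report order and let an earlier finding act as the original for a later duplicate. This is an assumption about one approach (reusing the candidates_by_hash mapping from the sketch above), not necessarily what the PR does:

```python
def resolve_in_report_duplicates(new_findings, candidates_by_hash):
    """Return (finding, original) pairs, letting an earlier finding in the
    same batch serve as the original for a later duplicate in the same
    report, so batching mimics the one-by-one result. Illustrative only."""
    matches = []
    seen_in_batch = {}  # hash_code -> first finding in this batch with that hash

    for finding in new_findings:  # keep report order
        existing = candidates_by_hash.get(finding.hash_code, [])
        if existing:
            matches.append((finding, existing[0]))
        elif finding.hash_code in seen_in_batch:
            matches.append((finding, seen_in_batch[finding.hash_code]))
        else:
            seen_in_batch[finding.hash_code] = finding
    return matches
```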

The PR splits the deduplication process into three parts (a rough sketch follows the list):

  1. Finding possible candidates
  2. Match the (new) finding against the candidates
  3. Act upon it if a match is found
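
A rough sketch of this three-part split; the function names are illustrative, and while the Finding fields and the dedupe_option save kwarg exist in Defect Dojo's model, treat the details as assumptions rather than this PR's actual API:

```python
def find_candidates(new_findings, product):
    """Step 1: collect possible duplicate candidates for the whole batch,
    e.g. with the batched hash_code__in query sketched earlier."""
    ...


def match_finding(new_finding, candidates):
    """Step 2: return the existing finding the new one matches, or None."""
    for candidate in candidates:
        if candidate.hash_code == new_finding.hash_code:
            return candidate
    return None


def act_on_match(new_finding, original):
    """Step 3: mark the new finding as a duplicate of the matched original."""
    new_finding.duplicate = True
    new_finding.duplicate_finding = original
    new_finding.active = False
    new_finding.save(dedupe_option=False)  # avoid re-triggering dedupe on save
```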

One of the reasons for doing this is that we want to use the exact same matching logic for the reimport process. Currently reimport has an almost identical matching algorithm, but with minor unintentional differences. Once this PR has proven itself, we will adjust the reimport process. Next to the "reimport matching", the reimport process also deduplicates new findings; that part already uses the batchwise deduplication from this PR.

A quick test with the jfrog_xray_unified/very_many_vulns.json sample scan (10k findings) shows a huge improvement in deduplication time. Please note that we're not only doing this for performance, but also to reduce the resources (cloud cost) needed to run Defect Dojo.

| branch | import time | dedupe time | total time |
| --- | --- | --- | --- |
| dev | ~200s | ~400s | ~600s |
| dedupe-batching | ~190s | ~12s | ~200s |

Imagine what this can do for reimport performance if we switch that to batch mode.

@valentijnscholten valentijnscholten added this to the 2.53.0 milestone Oct 23, 2025
github-actions bot commented

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot commented Nov 1, 2025

Conflicts have been resolved. A maintainer will review the pull request shortly.

@valentijnscholten valentijnscholten marked this pull request as ready for review November 7, 2025 21:03

dryrunsecurity bot commented Nov 7, 2025

DryRun Security

This pull request introduces a configurable IMPORT_REIMPORT_DEDUPE_BATCH_SIZE used when post-processing imported findings, but it lacks validation of a minimum value. An administrator could set it to a very small number (e.g., 1) and cause thousands of Celery tasks to be dispatched for a large import, potentially overwhelming the broker/workers and causing a denial of service for background processing. It’s recommended to enforce sensible bounds or rate-limit task dispatch to prevent resource exhaustion.

Denial of Service via Misconfiguration in dojo/importers/default_importer.py
Vulnerability: Denial of Service via Misconfiguration
Description: The IMPORT_REIMPORT_DEDUPE_BATCH_SIZE setting, which controls the batch size for post-processing findings, can be configured by an administrator via an environment variable. While it defaults to 1000, there is no validation to enforce a minimum value. If an administrator sets this value to 1, importing a large report (e.g., 10,000 findings) would result in 10,000 individual Celery tasks being dispatched almost simultaneously. This flood of tasks can overwhelm the message broker and Celery workers, leading to resource exhaustion and a denial of service for all background processing within the application.

```python
batch_max_size = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
"""
Saves findings in memory that were parsed from the scan report into the database.
"""
```
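
One possible mitigation for the bot's recommendation would be to clamp the configured value to sensible bounds when reading the setting. This is a hedged sketch, not code from the PR; the bounds and the helper name are made up:

```python
from django.conf import settings

# Illustrative bounds; not part of the PR or of Defect Dojo's settings.
MIN_DEDUPE_BATCH_SIZE = 100
MAX_DEDUPE_BATCH_SIZE = 10_000


def get_dedupe_batch_size():
    """Read IMPORT_REIMPORT_DEDUPE_BATCH_SIZE and clamp it to sane bounds so a
    misconfigured value (e.g. 1) cannot flood the broker with tiny batch tasks."""
    configured = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
    return max(MIN_DEDUPE_BATCH_SIZE, min(int(configured), MAX_DEDUPE_BATCH_SIZE))
```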


All finding details can be found in the DryRun Security Dashboard.

@github-actions github-actions bot added the settings_changes label (Needs changes to settings.py based on changes in settings.dist.py included in this PR) Nov 8, 2025