
fix: deduplicate vulnerabilities in recommend endpoint #2160

Merged
Strum355 merged 1 commit into guacsec:main from Strum355:nsc/recommend-dedup
Dec 4, 2025

Member

@Strum355 Strum355 commented Dec 2, 2025

If multiple advisories referenced the same vulnerability, the /api/v2/purl/recommend endpoint would return duplicate vulnerabilities as there is no advisory ID included to differentiate them from each other.

Summary by Sourcery

Deduplicate vulnerabilities returned by the purl recommendation endpoint when multiple advisories reference the same underlying vulnerability.

Bug Fixes:

  • Ensure the /api/v2/purl/recommend endpoint returns each vulnerability only once per package even if multiple advisories reference it.

Enhancements:

  • Adjust vulnerability status collection to operate on unique vulnerability identifiers across advisories.
  • Update VEX status modeling to allow an untagged "Other" status variant for more flexible (de)serialization.

Tests:

  • Add an integration test that verifies the recommendation endpoint returns a single vulnerability entry when duplicated advisories are ingested, along with corresponding test data.

@Strum355 Strum355 requested a review from dejanb December 2, 2025 16:32
Contributor

sourcery-ai bot commented Dec 2, 2025

Reviewer's Guide

This PR ensures the /api/v2/purl/recommend endpoint returns each vulnerability only once even when multiple advisories reference the same vulnerability, by deduplicating vulnerability statuses in the service layer, adjusting the VexStatus model to accept arbitrary string statuses, and adding a regression test (with new test data) that covers the deduplication behavior for a real-world OSV/RustSec advisory pair.
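The flatten-then-deduplicate step described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `Advisory`, `AdvisoryStatus`, and `unique_vulnerabilities` are simplified, hypothetical stand-ins, and a `HashSet` takes the place of the `unique_by` adaptor named in the guide.

```rust
use std::collections::HashSet;

/// Simplified stand-in for one status entry of an advisory (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub struct AdvisoryStatus {
    pub vulnerability_id: String,
    pub status: String,
}

/// Simplified stand-in for an advisory (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Advisory {
    pub statuses: Vec<AdvisoryStatus>,
}

/// Flatten the statuses of all advisories and keep only the first entry
/// seen for each vulnerability identifier, preserving encounter order.
pub fn unique_vulnerabilities(advisories: &[Advisory]) -> Vec<AdvisoryStatus> {
    let mut seen: HashSet<&str> = HashSet::new();
    advisories
        .iter()
        .flat_map(|advisory| advisory.statuses.iter())
        // `HashSet::insert` returns false when the identifier is already present.
        .filter(|status| seen.insert(status.vulnerability_id.as_str()))
        .cloned()
        .collect()
}
```

With this approach the first advisory to mention a vulnerability decides which status survives, so the order of `advisories` matters whenever two advisories disagree.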

Sequence diagram for deduplicated vulnerabilities in recommend endpoint

sequenceDiagram
    actor Client
    participant ApiV2PurlRecommend as ApiV2PurlRecommendHandler
    participant PurlService
    participant AdvisoryRepo as AdvisoryRepository

    Client->>ApiV2PurlRecommend: POST /api/v2/purl/recommend (JSON body with purls)
    ApiV2PurlRecommend->>PurlService: recommend(purl)
    PurlService->>AdvisoryRepo: fetch_purl_details(purl)
    AdvisoryRepo-->>PurlService: PurlDetails{ advisories }

    Note over PurlService: Iterate advisories, flat_map advisory.status, unique_by vulnerability.identifier

    PurlService-->>ApiV2PurlRecommend: RecommendResponse{ vulnerabilities (deduplicated) }
    ApiV2PurlRecommend-->>Client: 200 OK

Updated class diagram for purl recommendation vulnerability modeling

classDiagram
    class PurlService {
        +recommend(purl)
    }

    class PurlDetails {
        +Vec~Advisory~ advisories
    }

    class Advisory {
        +Vec~AdvisoryStatus~ status
    }

    class AdvisoryStatus {
        +Vulnerability vulnerability
        +VexStatus status
    }

    class Vulnerability {
        +String identifier
    }

    class VulnerabilityStatus {
        +String id
        +Option~VexStatus~ status
        +Option~String~ justification
    }

    class VexStatus {
        <<enum>>
        Affected
        Fixed
        NotAffected
        UnderInvestigation
        Recommended
        Other(String)
    }

    PurlService --> PurlDetails : builds_from
    PurlDetails --> Advisory : has_many
    Advisory --> AdvisoryStatus : has_many_status
    AdvisoryStatus --> Vulnerability : refers_to
    AdvisoryStatus --> VexStatus : uses
    PurlService --> VulnerabilityStatus : produces_deduplicated
    VulnerabilityStatus --> VexStatus : wraps

    %% Deduplication behavior in PurlService
    class DeduplicationLogic {
        +collect_unique_vulnerabilities(advisories)
    }

    PurlService --> DeduplicationLogic : uses
    DeduplicationLogic --> AdvisoryStatus : iterates_over
    DeduplicationLogic --> Vulnerability : unique_by_identifier

File-Level Changes

Change: Deduplicate vulnerabilities per PURL recommendation so that multiple advisories pointing to the same vulnerability produce a single vulnerability entry in the response.
Details:
  • Change vulnerability collection from nested iteration over advisories and their statuses to iterating advisory statuses directly.
  • Apply uniqueness by vulnerability identifier when mapping statuses to VulnerabilityStatus objects.
  • Preserve existing mapping of status into VulnerabilityStatus while avoiding duplicates in the final vector.
Files:
  • modules/fundamental/src/purl/service/mod.rs

Change: Allow arbitrary string values for VEX status deserialization to support unexpected or non-enumerated status values from ingested data.
Details:
  • Annotate the Other variant of VexStatus with serde(untagged) to accept and deserialize raw string statuses.
  • Ensure VexStatus continues to support known enum cases alongside open-ended string statuses without breaking existing serialized formats.
Files:
  • modules/fundamental/src/purl/model/mod.rs

Change: Add a regression test (and test data) verifying that the recommend endpoint returns deduplicated vulnerabilities when multiple advisories reference the same CVE.
Details:
  • Ingest a qualified Cargo package and two OSV/RustSec documents that reference the same vulnerability identifier.
  • Call the /api/v2/purl/recommend endpoint with a PURL missing the build metadata and parse the JSON response.
  • Assert that only one vulnerability appears in the recommendation and that its id matches the expected CVE.
  • Introduce a duplicate RustSec OSV JSON fixture used by the new regression test.
Files:
  • modules/fundamental/src/purl/endpoints/test.rs
  • etc/test-data/osv/RUSTSEC-2021-0079-DUPLICATE.json

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • The #[serde(untagged)] attribute on the Other(String) variant of VexStatus looks misplaced; untagged is an enum-level attribute and putting it on a single variant is likely to be ignored or fail to compile—if the intention is to change how VexStatus is deserialized, consider moving this to the enum or clarifying why it’s variant-scoped.
  • When deduplicating vulnerabilities with unique_by(|status| &status.vulnerability.identifier), you now implicitly assume that all advisories for a given vulnerability share the same effective status; if differing statuses per advisory are possible, consider documenting this behavior or revisiting whether deduplication should merge or prioritize statuses explicitly.
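One way to make the "merge or prioritize" option concrete is to keep, for each vulnerability identifier, the status with the highest priority instead of the first one seen. The sketch below is only an illustration under stated assumptions: the `Status` enum, its priority order (later variants win), and `merge_by_priority` are hypothetical and do not come from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical status type. The derived `Ord` follows declaration order,
/// so later variants are treated as higher priority (an assumed ordering).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    NotAffected,
    Fixed,
    UnderInvestigation,
    Affected,
}

/// Merge `(vulnerability id, status)` pairs so that each identifier appears
/// once, carrying the highest-priority status reported for it.
/// Identifiers keep the order in which they were first seen.
pub fn merge_by_priority(entries: &[(&str, Status)]) -> Vec<(String, Status)> {
    let mut order: Vec<&str> = Vec::new();
    let mut best: HashMap<&str, Status> = HashMap::new();
    for &(id, status) in entries {
        best.entry(id)
            // Identifier already known: keep whichever status ranks higher.
            .and_modify(|current| *current = (*current).max(status))
            // First sighting: remember the encounter order as well.
            .or_insert_with(|| {
                order.push(id);
                status
            });
    }
    order
        .into_iter()
        .map(|id| (id.to_string(), best[id]))
        .collect()
}
```

Compared with first-wins deduplication, the result no longer depends on advisory order, at the cost of committing to an explicit ranking of statuses.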
## Individual Comments

### Comment 1
<location> `modules/fundamental/src/purl/model/mod.rs:159-160` </location>
<code_context>
     NotAffected,
     UnderInvestigation,
     Recommended,
+    #[serde(untagged)]
     Other(String),
 }

</code_context>

<issue_to_address>
**issue (bug_risk):** Using `#[serde(untagged)]` on a single enum variant is likely incorrect and may not yield the intended serialization format.

`#[serde(untagged)]` is meant for whole enums, not individual variants, so mixing a single untagged variant with tagged ones is likely to behave unexpectedly or fail to compile. If you need to capture unknown string values, consider a more explicit approach (e.g., a custom `Deserialize` impl or a string-backed enum with `#[serde(rename = "...")]` for known variants) and align it with the intended JSON schema and Serde’s `untagged` docs.
</issue_to_address>

### Comment 2
<location> `modules/fundamental/src/purl/endpoints/test.rs:385` </location>
<code_context>
+
+#[test_context(TrustifyContext)]
+#[test(actix_web::test)]
+async fn get_recommendations_dedup(ctx: &TrustifyContext) -> Result<(), anyhow::Error> {
+    ctx.ingestor
+        .graph()
</code_context>

<issue_to_address>
**suggestion (testing):** Add a complementary test to ensure different vulnerabilities are not merged by the deduplication logic.

You already cover multiple advisories for the same vulnerability ID. Please also add a test where two advisories reference different vulnerability IDs for the same PURL and assert that both are returned. This will help catch any future over-deduplication that might merge distinct vulnerabilities.

Suggested implementation:

```rust
#[test_context(TrustifyContext)]
#[test(actix_web::test)]
async fn get_recommendations_dedup(ctx: &TrustifyContext) -> Result<(), anyhow::Error> {
    ctx.ingestor
        .graph()
        .ingest_qualified_package(
            &Purl::from_str("pkg:cargo/hyper@0.14.1-redhat-00001")?,
            &ctx.db,
        )
        .await?;

    // Two advisories that refer to the same vulnerability ID for the same package.
    ctx.ingest_documents([
        "osv/RUSTSEC-2021-0079.json",
        "osv/RUSTSEC-2021-0079-DUPLICATE.json",
    ])
    .await?;

    let app = caller(ctx).await?;
    let body: Value = app
        .call_and_read_body_json(
            TestRequest::post()
                .uri("/api/v2/purl/recommend")
                .set_json(json!({"purls": ["pkg:cargo/hyper@0.14.1"]}))
                // Build the request; the test service expects a `Request`, not a `TestRequest`.
                .to_request(),
        )
        .await;

    // Recommendations are keyed by the requested PURL.
    let entries = body["recommendations"]["pkg:cargo/hyper@0.14.1"]
        .as_array()
        .expect("recommendations for the PURL should be an array");
    assert_eq!(entries.len(), 1, "expected a single recommended package");

    // Ensure that multiple advisories for the same vulnerability yield one entry.
    let vulns = entries[0]["vulnerabilities"]
        .as_array()
        .expect("vulnerabilities should be an array");
    assert_eq!(
        vulns.len(),
        1,
        "expected duplicate advisories for the same vulnerability to be deduplicated"
    );

    Ok(())
}

#[test_context(TrustifyContext)]
#[test(actix_web::test)]
async fn get_recommendations_no_overdedup(ctx: &TrustifyContext) -> Result<(), anyhow::Error> {
    // Same package as in the deduplication test.
    ctx.ingestor
        .graph()
        .ingest_qualified_package(
            &Purl::from_str("pkg:cargo/hyper@0.14.1-redhat-00001")?,
            &ctx.db,
        )
        .await?;

    // Two advisories that reference the same PURL but represent *different* vulnerability IDs.
    // The second fixture name is a placeholder; see the notes below.
    ctx.ingest_documents([
        "osv/RUSTSEC-2021-0079.json",
        "osv/RUSTSEC-2021-0080.json",
    ])
    .await?;

    let app = caller(ctx).await?;
    let body: Value = app
        .call_and_read_body_json(
            TestRequest::post()
                .uri("/api/v2/purl/recommend")
                .set_json(json!({"purls": ["pkg:cargo/hyper@0.14.1"]}))
                .to_request(),
        )
        .await;

    let entries = body["recommendations"]["pkg:cargo/hyper@0.14.1"]
        .as_array()
        .expect("recommendations for the PURL should be an array");
    assert_eq!(entries.len(), 1, "expected a single recommended package");

    // Ensure that the two distinct vulnerabilities are both returned and not merged.
    let vulns = entries[0]["vulnerabilities"]
        .as_array()
        .expect("vulnerabilities should be an array");
    assert_eq!(
        vulns.len(),
        2,
        "expected two distinct vulnerabilities for the same PURL"
    );

    let vuln_ids: std::collections::HashSet<&str> =
        vulns.iter().filter_map(|v| v["id"].as_str()).collect();
    assert_eq!(vuln_ids.len(), 2, "expected two distinct vulnerability IDs");

    Ok(())
}

```

- Ensure `serde_json::json` is imported at the top of the file (e.g. `use serde_json::{json, Value};`) if it is not already.
- The endpoint path (`"/api/v2/purl/recommend"`), request shape, and response field paths (`"recommendations"`, `"vulnerabilities"`, `"id"`) follow the `get_recommendations_other_status` test posted in this conversation; adjust them if they differ in your codebase.
- Replace `"osv/RUSTSEC-2021-0080.json"` with an advisory fixture that:
  1. Affects the same PURL (`pkg:cargo/hyper@0.14.1-redhat-00001`), and
  2. Has a different vulnerability identifier than `RUSTSEC-2021-0079`,
  so that the new test truly verifies that different vulnerabilities are not merged by the deduplication logic.
</issue_to_address>


@Strum355 Strum355 force-pushed the nsc/recommend-dedup branch from 02e9785 to 0f63fe1 Compare December 2, 2025 16:35

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.18%. Comparing base (8338c24) to head (476ada3).
⚠️ Report is 139 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2160   +/-   ##
=======================================
  Coverage   68.17%   68.18%           
=======================================
  Files         375      375           
  Lines       21048    21046    -2     
  Branches    21048    21046    -2     
=======================================
  Hits        14350    14350           
+ Misses       5836     5828    -8     
- Partials      862      868    +6     


NotAffected,
UnderInvestigation,
Recommended,
#[serde(untagged)]
Contributor

Looks good in general. I would also note that serde untagged is neither documented nor tested behaviour. Do we have an example of an "Other" status that we can test with?

Member Author

I had a test, but it seems I forgot to git add the change. QE had examples of it during testing, but I wasn't able to reproduce it locally, so I had to work around that in the unit test by manually adding it to Postgres:

#[test_context(TrustifyContext)]
#[test(actix_web::test)]
async fn get_recommendations_other_status(ctx: &TrustifyContext) -> Result<(), anyhow::Error> {
    use sea_orm::{ActiveModelTrait, ColumnTrait, EntityTrait, QueryFilter, Set};
    use trustify_entity::{purl_status, status};

    ctx.ingestor
        .graph()
        .ingest_qualified_package(
            &Purl::from_str("pkg:cargo/hyper@0.14.1-redhat-00001")?,
            &ctx.db,
        )
        .await?;

    ctx.ingest_documents(["osv/RUSTSEC-2021-0079.json"]).await?;

    let custom_status_id = Uuid::new_v4();
    let custom_status = status::ActiveModel {
        id: Set(custom_status_id),
        slug: Set("custom_status".to_string()),
        name: Set("Custom Status".to_string()),
        description: Set(Some("A custom status for testing".to_string())),
    };
    status::Entity::insert(custom_status).exec(&ctx.db).await?;

    let purl_statuses = purl_status::Entity::find()
        .filter(purl_status::Column::VulnerabilityId.eq("CVE-2021-32714"))
        .all(&ctx.db)
        .await?;

    assert!(!purl_statuses.is_empty());

    for ps in purl_statuses {
        let mut active: purl_status::ActiveModel = ps.into();
        active.status_id = Set(custom_status_id);
        active.update(&ctx.db).await?;
    }

    let app = caller(ctx).await?;
    let recommendations: Value = app
        .call_and_read_body_json(
            TestRequest::post()
                .uri("/api/v2/purl/recommend")
                .set_json(json!({"purls": ["pkg:cargo/hyper@0.14.1"]}))
                .to_request(),
        )
        .await;

    log::info!("{recommendations:#?}");

    let entry =
        &recommendations["recommendations"].as_object().unwrap()["pkg:cargo/hyper@0.14.1"][0];
    let vulns = entry["vulnerabilities"].as_array().unwrap();
    let vuln = vulns
        .iter()
        .find(|v| v["id"].as_str().unwrap() == "CVE-2021-32714")
        .unwrap();

    assert_eq!(vuln["status"], "custom_status");

    Ok(())
}

Without the variant being untagged, this test yields:

2025-12-03T11:30:22.239191Z  INFO trustify_module_fundamental::purl::endpoints::test: Object {
    "recommendations": Object {
        "pkg:cargo/hyper@0.14.1": Array [
            Object {
                "package": String("pkg:cargo/hyper@0.14.1-redhat-00001"),
                "vulnerabilities": Array [
                    Object {
                        "id": String("CVE-2021-32714"),
                        "status": Object {
                            "Other": String("custom_status"),
                        },
                    },
                ],
            },
        ],
    },
}

thread 'purl::endpoints::test::get_recommendations_other_status' (599692) panicked at modules/fundamental/src/purl/endpoints/test.rs:483:5:
assertion `left == right` failed
  left: Object {"Other": String("custom_status")}
 right: "custom_status"
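The two JSON shapes in the failure above can be reproduced by hand to make the difference explicit. The sketch below does not use serde; `other_externally_tagged` and `other_untagged` are hypothetical helpers that format the `Other` payload the way the log shows it with and without the variant-level attribute, assuming the value needs no JSON escaping. Serde's variant attribute documentation is worth checking for how a variant-level `untagged` behaves in the serde version in use.

```rust
/// JSON for an `Other(value)` variant in the externally tagged form seen in
/// the failing run: an object keyed by the variant name.
/// Assumes `value` contains no characters that need JSON escaping.
pub fn other_externally_tagged(value: &str) -> String {
    format!("{{\"Other\":\"{value}\"}}")
}

/// JSON for the same variant when only its payload is written, which is the
/// shape the test expects: a bare string.
/// Assumes `value` contains no characters that need JSON escaping.
pub fn other_untagged(value: &str) -> String {
    format!("\"{value}\"")
}
```

For `custom_status`, the first helper yields the object form from the panic message and the second yields the bare string the assertion compares against.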

@Strum355 Strum355 force-pushed the nsc/recommend-dedup branch from 0f63fe1 to 2d8ee62 Compare December 4, 2025 12:41
@Strum355 Strum355 force-pushed the nsc/recommend-dedup branch from 2d8ee62 to 476ada3 Compare December 4, 2025 13:05
Contributor

@dejanb dejanb left a comment

Looks good. Let's try to find a real case when status can be "other" and test with that. If that is not possible, we need to refactor the enum.

@Strum355 Strum355 added this pull request to the merge queue Dec 4, 2025
Merged via the queue into guacsec:main with commit eaaf627 Dec 4, 2025
6 checks passed
@Strum355 Strum355 deleted the nsc/recommend-dedup branch December 4, 2025 15:30
@dejanb dejanb added the backport release/0.4.z Backport (0.4.z) label Feb 4, 2026
Contributor

dejanb commented Feb 4, 2026

/backport

@trustify-ci-bot

Successfully created backport PR for release/0.4.z:
