RFC: Designing a Graph-Based Popularity Metric for Packages in PurlDB #833

@mohityadav8

I have set up PurlDB locally and explored the data model, particularly Package and DependentPackage in packagedb/models.py.

From my understanding:
- Dependencies are stored via the DependentPackage model.
- Each dependency links to a Package via a package ForeignKey (the source package).
- The target dependency is stored as a PURL string (purl field), not as a direct ForeignKey to another Package.

This suggests that dependency relationships form a directed graph:

Package A ---> Package B

where A is the source package and B is the package resolved from the dependency PURL.
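
To make the edge definition concrete, here is a minimal sketch of turning (source PURL, dependency PURL) pairs into version-less edges. The helper names `package_identity` and `build_edges` are hypothetical and not part of PurlDB. The parsing is a simplified stand-in for a real PURL parser such as the packageurl-python library, and the PyPI name normalization (lowercase, `_` treated as `-`) is an assumption about how identities should be compared.

```python
def package_identity(purl: str) -> str:
    """Return a version-less identity such as 'pkg:pypi/requests' (hypothetical helper)."""
    # Drop the optional '#subpath' and '?qualifiers' parts first.
    base = purl.split("#", 1)[0].split("?", 1)[0]
    scheme, _, path = base.partition(":")
    segments = path.strip("/").split("/")
    if scheme != "pkg" or len(segments) < 2:
        raise ValueError(f"not a PURL: {purl!r}")
    # The version, if present, follows the last '@' in the final path segment.
    last = segments[-1]
    at = last.rfind("@")
    if at > 0:
        segments[-1] = last[:at]
    segments[0] = segments[0].lower()
    if segments[0] == "pypi":
        # Assumption: PyPI names compare case-insensitively, with '_' equal to '-'.
        segments[-1] = segments[-1].lower().replace("_", "-")
    return "pkg:" + "/".join(segments)


def build_edges(rows):
    """Collapse (source_purl, dependency_purl) pairs into a set of identity edges."""
    edges = set()
    for source, target in rows:
        a, b = package_identity(source), package_identity(target)
        if a != b:  # skip self-loops created by ignoring versions
            edges.add((a, b))
    return edges
```

In this sketch, several versions of A depending on B collapse into a single edge A ---> B, which is the effect of computing popularity at the package identity level.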

Proposal: Graph-Based Popularity Metric

I would like to explore building a popularity metric based on:
- Dependency graph centrality (PageRank-style algorithm)
- In-degree (number of reverse dependencies)
- An optional freshness/activity decay factor (e.g., based on release_date or mining_level)
- Possibly ignoring versions and computing popularity at the package identity level
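
One possible shape for the optional freshness factor is an exponential decay on the age of the latest release. This is only a sketch: the function name `freshness_factor`, the one-year half-life, and the idea of multiplying it into the graph score are all assumptions, not decisions made in this proposal.

```python
from datetime import date


def freshness_factor(release_date: date, today: date, half_life_days: float = 365.0) -> float:
    """Return a multiplier in (0, 1] that halves every `half_life_days` (assumed default)."""
    # Release dates in the future are treated as age 0, so the factor never exceeds 1.
    age_days = max((today - release_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


def decayed_score(graph_score: float, release_date: date, today: date) -> float:
    """Combine a graph-based score with the freshness factor (one option among many)."""
    return graph_score * freshness_factor(release_date, today)
```
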

Proposed Approach (PoC)

- Start with a single ecosystem (e.g., PyPI)
- Resolve dependency PURLs to canonical package identities
- Build a directed graph:
  - Nodes = Packages (ignoring version)
  - Edges = Dependency relationships
- Compute:
  - In-degree
  - PageRank score
- Store the result as a popularity_score field on Package
- Expose the score via the REST API
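
The compute step could look roughly like the following pure-Python sketch. It assumes edges are (dependent, dependency) pairs of package identities, so rank flows toward packages that are depended upon. The function names, the damping value of 0.85, and the handling of packages with no outgoing edges are illustrative choices; a graph library such as NetworkX could be used instead.

```python
from collections import defaultdict


def in_degree(edges):
    """Count reverse dependencies per node; nodes with none get 0."""
    counts = defaultdict(int)
    for source, target in edges:
        counts[source] += 0  # make sure the source node is present
        counts[target] += 1
    return dict(counts)


def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank over (dependent, dependency) edges."""
    out_links = defaultdict(set)
    nodes = set()
    for source, target in edges:
        out_links[source].add(target)
        nodes.update((source, target))
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in out_links.items():
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

The resulting dictionary maps each package identity to a score that sums to 1 across the graph, which a batch job could then write to the proposed popularity_score field.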

Questions

- Should popularity be computed as a periodic batch job, or dynamically?
- Should we store resolved dependency edges in a normalized table, or resolve PURLs during computation?
- Is ignoring versions appropriate for the initial PoC?

I would appreciate feedback before proceeding with implementation.
