feat: dc-import-tool mcf resolution was slow for large number of mcfs #486

Draft
rohitkumarbhagat wants to merge 1 commit into datacommonsorg:master from rohitkumarbhagat:dc-import-tool-slowness

@rohitkumarbhagat
Contributor

No description provided.

…d a multi-round local reference resolution test, and improve resolver logging.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily addresses performance bottlenecks in the dc-import-tool's MCF resolution process, especially when dealing with a large number of MCFs. It refactors the internal MCF resolution logic to be more efficient by minimizing redundant graph updates. Additionally, it enhances the flexibility of API interactions by allowing users to configure Data Commons API endpoints and API keys through environment variables, moving away from hardcoded values and improving adaptability across different deployment environments.

Highlights

  • Performance Optimization for MCF Resolution: The MCF resolution process has been significantly optimized, particularly for large datasets, by implementing a snapshot-based approach and deferring node updates to the end of each resolution round, which reduces redundant graph manipulations.
  • Configurable API Endpoints and Keys: The tool now supports configurable Data Commons API endpoints and API keys, allowing users to specify different environments (e.g., autopush, production) and manage their API credentials more flexibly via environment variables.
  • Improved Logging for Resolution Progress: Progress logging has been added to the MCF resolution rounds, providing better visibility into the processing of large numbers of nodes and helping to diagnose performance issues.
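
The snapshot-and-deferred-update idea from the first highlight can be illustrated with a minimal sketch. This is not the code in McfResolver: the class `RoundResolverSketch`, its `Node` type, and the placeholder DCID scheme are all hypothetical, and the graph is reduced to local IDs and the local references each node depends on. Each round reads only from a snapshot taken at the start of the round and writes all new DCIDs in one batch at the end.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of round-based resolution; not the actual McfResolver.
final class RoundResolverSketch {
  // A node reduced to its DCID (null until resolved) and the local IDs it references.
  static final class Node {
    String dcid;
    final List<String> localRefs;

    Node(String dcid, List<String> localRefs) {
      this.dcid = dcid;
      this.localRefs = localRefs;
    }
  }

  // Runs rounds until a round resolves nothing; returns the number of productive rounds.
  static int resolve(Map<String, Node> graph) {
    int rounds = 0;
    while (true) {
      // Snapshot: the DCIDs known at the start of this round. All reads go here.
      Map<String, String> snapshot = new HashMap<>();
      for (Map.Entry<String, Node> e : graph.entrySet()) {
        if (e.getValue().dcid != null) snapshot.put(e.getKey(), e.getValue().dcid);
      }

      // Deferred updates: collected during the round, applied once at the end.
      Map<String, String> updates = new LinkedHashMap<>();
      for (Map.Entry<String, Node> e : graph.entrySet()) {
        Node node = e.getValue();
        if (node.dcid == null && snapshot.keySet().containsAll(node.localRefs)) {
          // Placeholder DCID scheme (assumption); the real tool derives DCIDs differently.
          updates.put(e.getKey(), "sketch/" + e.getKey());
        }
      }

      if (updates.isEmpty()) return rounds;
      // Single batch write, so the graph is touched once per round.
      updates.forEach((localId, dcid) -> graph.get(localId).dcid = dcid);
      rounds++;
    }
  }
}
```

With a chain where A references B, B references C, and only C starts with a DCID, this sketch needs two rounds: B resolves in the first, A in the second. Because reads come from the snapshot, the result does not depend on the order in which nodes are visited within a round.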

Changelog
  • tool/src/main/java/org/datacommons/tool/Main.java
    • Updated the description for the doExistenceChecks command-line flag to reflect the new configurable API endpoint.
  • util/src/main/java/org/datacommons/util/ApiHelper.java
    • Refactored API endpoint and key constants to support configurable roots and keys.
    • Extracted API request building logic into a new buildPropertyValuesRequest static method.
    • Introduced helper methods getNodeApiEndpoint, getApiKey, and normalizeApiRoot for dynamic API configuration.
  • util/src/main/java/org/datacommons/util/McfResolver.java
    • Added ROUND_PROGRESS_LOG_INTERVAL constant for logging.
    • Introduced AssignmentMode enum and RoundState class to manage the state of resolution rounds more efficiently.
    • Refactored assignDcids and replaceLocalRefs methods to use RoundState, snapshotting the graph at the start of a round and applying updates in a batch at the end.
    • Added progress logging within resolution rounds.
    • Introduced new helper methods: newRoundState, finishNode, findFirstUnresolvedLocalRef, getAssignmentMode, generateDcid, writeNodeUpdates, and logRoundProgress.
    • Updated the getLocalId method to accept TypedValueOrBuilder.
  • util/src/main/java/org/datacommons/util/ReconClient.java
    • Added API_KEY_ENV constant for API key environment variable.
    • Introduced a new buildRequest static method to construct HTTP requests, incorporating API key handling.
  • util/src/test/java/org/datacommons/util/ApiHelperTest.java
    • Added new test cases to verify the correct behavior of buildPropertyValuesRequest with different API root and key configurations.
  • util/src/test/java/org/datacommons/util/McfResolverTest.java
    • Added a new test multiRoundLocalRefChain to validate multi-round local reference resolution.
  • util/src/test/java/org/datacommons/util/ReconClientRequestTest.java
    • Added a new test file to specifically test the ReconClient.buildRequest method for API key handling.
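
The ApiHelper and ReconClient entries above describe building requests from a configurable API root and an optional API key. A rough sketch of that pattern follows, using the JDK's `java.net.http` classes. Everything specific here is an assumption rather than something taken from the PR: the class name `ApiConfigSketch`, the environment variable names `DC_API_ROOT` and `DC_API_KEY`, the default root, the `/v2/node` path, and the `X-API-Key` header. The method names only echo those listed in the changelog; their real signatures are not shown on this page.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;

// Hypothetical sketch of env-driven API configuration; names and values are assumptions.
final class ApiConfigSketch {
  static final String ROOT_ENV = "DC_API_ROOT";   // assumed variable name
  static final String KEY_ENV = "DC_API_KEY";     // assumed variable name
  static final String DEFAULT_ROOT = "https://api.datacommons.org"; // assumed default

  // Trims whitespace and trailing slashes; falls back to the default when empty.
  static String normalizeApiRoot(String raw) {
    if (raw == null || raw.isBlank()) return DEFAULT_ROOT;
    String root = raw.trim();
    while (root.endsWith("/")) root = root.substring(0, root.length() - 1);
    return root.isEmpty() ? DEFAULT_ROOT : root;
  }

  // Builds a POST request; the key header is added only when a key is configured.
  static HttpRequest buildRequest(Map<String, String> env, String jsonBody) {
    String root = normalizeApiRoot(env.get(ROOT_ENV));
    HttpRequest.Builder builder =
        HttpRequest.newBuilder(URI.create(root + "/v2/node")) // assumed endpoint path
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody));
    String key = env.get(KEY_ENV);
    if (key != null && !key.isBlank()) {
      builder.header("X-API-Key", key.trim()); // assumed header name
    }
    return builder.build();
  }
}
```

Passing the environment in as a map (for example `ApiConfigSketch.buildRequest(System.getenv(), body)`) keeps the request builder a pure function, which is one way to make it unit-testable without touching real environment variables.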

Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the API interaction logic and McfResolver for flexibility and performance. API endpoints and keys in ApiHelper and ReconClient are now configurable via environment variables; the request-building logic is extracted into new helper methods and covered by new unit tests. McfResolver's assignDcids and replaceLocalRefs methods now use a RoundState to manage per-round state, which includes graph snapshots, centralized DCID generation based on AssignmentMode, and progress logging. A new test case covers multi-round local reference resolution.
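
For the progress logging mentioned in the review, interval-based logging inside a round might look roughly like the sketch below. The class `RoundProgressSketch`, the message format, and passing the interval as a parameter are assumptions made for illustration; the PR's ROUND_PROGRESS_LOG_INTERVAL value and its logRoundProgress signature are not shown on this page.

```java
import java.util.function.Consumer;

// Hypothetical sketch of interval-based progress logging; not the PR's implementation.
final class RoundProgressSketch {
  // Emits one message every `interval` nodes, plus a final message when the round ends.
  static void processRound(String roundName, int total, int interval, Consumer<String> sink) {
    for (int done = 1; done <= total; done++) {
      // Per-node resolution work would happen here.
      if (done % interval == 0 || done == total) {
        sink.accept(roundName + ": processed " + done + " of " + total + " nodes");
      }
    }
  }
}
```

A call such as `RoundProgressSketch.processRound("assignDcids", 250, 100, System.out::println)` reports at 100, 200, and 250 nodes, so the log volume grows with the node count divided by the interval rather than with the node count itself.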
