Skip to content

Update DCP write api to use BaseDC spanner schema#17

Open
clincoln8 wants to merge 55 commits intodatacommonsorg:mainfrom
clincoln8:schema2
Open

Update DCP write api to use BaseDC spanner schema#17
clincoln8 wants to merge 55 commits intodatacommonsorg:mainfrom
clincoln8:schema2

Conversation

@clincoln8
Copy link
Contributor

No description provided.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on migrating the DCP write API to a new Spanner schema, enhancing data integrity and storage efficiency. The changes involve significant refactoring of the GraphService, updates to dependencies, and the addition of comprehensive tests to validate the new implementation.

Highlights

  • BaseDC Spanner Schema: The pull request updates the Data Commons Platform (DCP) write API to use the BaseDC Spanner schema, which involves changes to data models, ingestion logic, and extraction processes.
  • Dependency Updates: Includes updates to dependencies such as sqlalchemy-spanner and python-dotenv, along with the addition of development dependencies like pytest.
  • Code Refactoring: Refactors the GraphService to align with the new schema, including changes to node and edge handling, batch processing, and data transformation.
  • Test Suite Enhancement: Adds new unit and integration tests to ensure the correctness of the changes, covering areas such as ID normalization, content abstraction, and proxy synthesis.
Changelog
  • .gitignore
    • Added .env and *.generated.yaml to the list of ignored files.
  • README.md
    • Updated instructions for running tests and starting the Data Commons API server, including options for development mode and overriding Spanner credentials via CLI.
  • build/cloudbuild.yaml
    • Replaced $COMMIT_SHA with $SHORT_SHA for container image tagging.
  • packages/datacommons-api/datacommons_api/api_cli.py
    • Updated the Uvicorn run command to specify the app module directly.
  • packages/datacommons-api/datacommons_api/core/config.py
    • Added python-dotenv to load environment variables from a .env file.
  • packages/datacommons-api/datacommons_api/services/graph_service.py
    • Refactored node and edge handling to align with the BaseDC Spanner schema.
    • Implemented ID normalization and content coercion for efficient data storage.
    • Enhanced batch processing and data transformation logic.
    • Added logic for synthesizing proxy nodes for external references.
    • Improved error handling and logging.
  • packages/datacommons-api/pyproject.toml
    • Updated dependencies, including sqlalchemy-spanner and python-dotenv, and added development dependencies like pytest.
  • packages/datacommons-db/datacommons_db/models/edge.py
    • Updated EdgeModel to EdgeRecord and modified relationships and indexes to align with the BaseDC Spanner schema.
  • packages/datacommons-db/datacommons_db/models/node.py
    • Updated NodeModel to NodeRecord and modified relationships and indexes to align with the BaseDC Spanner schema.
  • packages/datacommons-db/datacommons_db/models/observation.py
    • Updated ObservationModel to ObservationRecord and modified relationships and indexes to align with the BaseDC Spanner schema.
  • packages/datacommons-db/datacommons_db/session.py
    • Added DDL for creating a property graph and a function to create the property graph schema in the database.
  • packages/datacommons-db/pyproject.toml
    • Added development dependencies like pytest.
  • packages/datacommons-schema/pyproject.toml
    • Added development dependencies like pytest.
Activity
  • The pull request involves a significant refactoring of the GraphService to align with the new Spanner schema.
  • It includes updates to dependencies and the addition of new unit and integration tests.
  • The changes aim to improve data integrity, storage efficiency, and overall system reliability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

curl -X POST "http://localhost:5000/nodes/" -H "Content-Type: application/json" -d @examples/people.jsonld
curl -X POST "http://localhost:5000/nodes" -H "Content-Type: application/json" -d @examples/people.jsonld

# With default provenance
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a major and well-executed refactoring of the data ingestion and storage layer to use a more robust, relational graph model on Spanner, aligning with the BaseDC schema. The changes are extensive, touching everything from database models and services to configuration and testing. The introduction of synthesized nodes for literals and external entities is a solid architectural improvement, and the new comprehensive test suite is a great addition.

I've identified a critical issue in the drop_tables function which appears incomplete in the provided changes, and some minor inconsistencies in the updated README.md documentation. My detailed comments are below.

Comment on lines 752 to 754
logger.info("Dropping table %s", EDGE_TABLE_NAME)
query = f"DROP TABLE {EDGE_TABLE_NAME}"
self.session.execute(text(query))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The drop_tables function is incomplete as shown in the diff. It only drops the Edge table. This will leave the Node table behind, which can cause issues when re-initializing the database, as Edge has foreign keys to Node. The Node table should also be dropped, and the transaction should be committed to ensure a clean state.

Suggested change
logger.info("Dropping table %s", EDGE_TABLE_NAME)
query = f"DROP TABLE {EDGE_TABLE_NAME}"
self.session.execute(text(query))
logger.info("Dropping table %s", EDGE_TABLE_NAME)
query = f"DROP TABLE {EDGE_TABLE_NAME}"
self.session.execute(text(query))
logger.info("Dropping table %s", NODE_TABLE_NAME)
query = f"DROP TABLE {NODE_TABLE_NAME}"
self.session.execute(text(query))
self.session.commit()
logger.info("Successfully dropped Node and Edge tables")

Comment on lines 182 to 188
curl -X POST "http://localhost:5000/nodes/" -H "Content-Type: application/json" -d @examples/person-schema.jsonld

# With default provenance
curl -X POST "http://localhost:5000/nodes" \
-H "Content-Type: application/json" \
-d @examples/person-schema.jsonld
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The curl examples for uploading the schema are a bit redundant and inconsistent. The command on line 182 uses a trailing slash (/nodes/), while the new example on lines 185-187 does not. To avoid confusion and improve clarity, I recommend removing the first single-line command and keeping only the more descriptive multi-line version.

Comment on lines +193 to 199
curl -X POST "http://localhost:5000/nodes" -H "Content-Type: application/json" -d @examples/people.jsonld

# With default provenance
curl -X POST "http://localhost:5000/nodes" \
-H "Content-Type: application/json" \
-d @examples/people.jsonld
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the schema upload example, the curl examples for uploading data are redundant. To keep the documentation concise and clear, I suggest removing the single-line command on line 193 and keeping only the multi-line version that follows, which includes an explanatory comment.

@dwnoble dwnoble self-requested a review March 9, 2026 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant