Update DCP write api to use BaseDC spanner schema by clincoln8 · Pull Request #17 · datacommonsorg/datacommons

clincoln8 · 2026-03-09T21:37:11Z

No description provided.

gemini-code-assist · 2026-03-09T21:38:36Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on migrating the DCP write API to a new Spanner schema, enhancing data integrity and storage efficiency. The changes involve significant refactoring of the GraphService, updates to dependencies, and the addition of comprehensive tests to validate the new implementation.

Highlights

BaseDC Spanner Schema: The pull request updates the Data Commons Platform (DCP) write API to use the BaseDC Spanner schema, which involves changes to data models, ingestion logic, and extraction processes.
Dependency Updates: Includes updates to dependencies such as sqlalchemy-spanner and python-dotenv, along with the addition of development dependencies like pytest.
Code Refactoring: Refactors the GraphService to align with the new schema, including changes to node and edge handling, batch processing, and data transformation.
Test Suite Enhancement: Adds new unit and integration tests to ensure the correctness of the changes, covering areas such as ID normalization, content abstraction, and proxy synthesis.

Changelog

.gitignore
- Added .env and *.generated.yaml to the list of ignored files.
README.md
- Updated instructions for running tests and starting the Data Commons API server, including options for development mode and overriding Spanner credentials via CLI.
build/cloudbuild.yaml
- Replaced $COMMIT_SHA with $SHORT_SHA for container image tagging.
packages/datacommons-api/datacommons_api/api_cli.py
- Updated the Uvicorn run command to specify the app module directly.
packages/datacommons-api/datacommons_api/core/config.py
- Added python-dotenv to load environment variables from a .env file.
packages/datacommons-api/datacommons_api/services/graph_service.py
- Refactored node and edge handling to align with the BaseDC Spanner schema.
- Implemented ID normalization and content coercion for efficient data storage.
- Enhanced batch processing and data transformation logic.
- Added logic for synthesizing proxy nodes for external references.
- Improved error handling and logging.
packages/datacommons-api/pyproject.toml
- Updated dependencies, including sqlalchemy-spanner and python-dotenv, and added development dependencies like pytest.
packages/datacommons-db/datacommons_db/models/edge.py
- Updated EdgeModel to EdgeRecord and modified relationships and indexes to align with the BaseDC Spanner schema.
packages/datacommons-db/datacommons_db/models/node.py
- Updated NodeModel to NodeRecord and modified relationships and indexes to align with the BaseDC Spanner schema.
packages/datacommons-db/datacommons_db/models/observation.py
- Updated ObservationModel to ObservationRecord and modified relationships and indexes to align with the BaseDC Spanner schema.
packages/datacommons-db/datacommons_db/session.py
- Added DDL for creating a property graph and a function to create the property graph schema in the database.
packages/datacommons-db/pyproject.toml
- Added development dependencies like pytest.
packages/datacommons-schema/pyproject.toml
- Added development dependencies like pytest.

Activity

The pull request involves a significant refactoring of the GraphService to align with the new Spanner schema.
It includes updates to dependencies and the addition of new unit and integration tests.
The changes aim to improve data integrity, storage efficiency, and overall system reliability.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

clincoln8 · 2026-03-09T21:41:28Z

README.md

-curl -X POST "http://localhost:5000/nodes/" -H "Content-Type: application/json" -d @examples/people.jsonld
+curl -X POST "http://localhost:5000/nodes" -H "Content-Type: application/json" -d @examples/people.jsonld
+
+# With default provenance


gemini-code-assist

Code Review

This pull request introduces a major and well-executed refactoring of the data ingestion and storage layer to use a more robust, relational graph model on Spanner, aligning with the BaseDC schema. The changes are extensive, touching everything from database models and services to configuration and testing. The introduction of synthesized nodes for literals and external entities is a solid architectural improvement, and the new comprehensive test suite is a great addition.

I've identified a critical issue in the drop_tables function which appears incomplete in the provided changes, and some minor inconsistencies in the updated README.md documentation. My detailed comments are below.

gemini-code-assist · 2026-03-09T21:43:10Z

packages/datacommons-api/datacommons_api/services/graph_service.py

        logger.info("Dropping table %s", EDGE_TABLE_NAME)
        query = f"DROP TABLE {EDGE_TABLE_NAME}"
        self.session.execute(text(query))


The drop_tables function is incomplete as shown in the diff. It only drops the Edge table. This will leave the Node table behind, which can cause issues when re-initializing the database, as Edge has foreign keys to Node. The Node table should also be dropped, and the transaction should be committed to ensure a clean state.

Suggested change

logger.info("Dropping table %s", EDGE_TABLE_NAME)

query = f"DROP TABLE {EDGE_TABLE_NAME}"

self.session.execute(text(query))

logger.info("Dropping table %s", EDGE_TABLE_NAME)

query = f"DROP TABLE {EDGE_TABLE_NAME}"

self.session.execute(text(query))

logger.info("Dropping table %s", NODE_TABLE_NAME)

query = f"DROP TABLE {NODE_TABLE_NAME}"

self.session.execute(text(query))

self.session.commit()

logger.info("Successfully dropped Node and Edge tables")

gemini-code-assist · 2026-03-09T21:43:10Z

README.md

 curl -X POST "http://localhost:5000/nodes/" -H "Content-Type: application/json" -d @examples/person-schema.jsonld
+
+# With default provenance
+curl -X POST "http://localhost:5000/nodes" \
+  -H "Content-Type: application/json" \
+  -d @examples/person-schema.jsonld
 ```


The curl examples for uploading the schema are a bit redundant and inconsistent. The command on line 182 uses a trailing slash (/nodes/), while the new example on lines 185-187 does not. To avoid confusion and improve clarity, I recommend removing the first single-line command and keeping only the more descriptive multi-line version.

gemini-code-assist · 2026-03-09T21:43:10Z

README.md

+curl -X POST "http://localhost:5000/nodes" -H "Content-Type: application/json" -d @examples/people.jsonld
+
+# With default provenance
+curl -X POST "http://localhost:5000/nodes" \
+  -H "Content-Type: application/json" \
+  -d @examples/people.jsonld
 ```


Similar to the schema upload example, the curl examples for uploading data are redundant. To keep the documentation concise and clear, I suggest removing the single-line command on line 193 and keeping only the multi-line version that follows, which includes an explanatory comment.

clincoln8 added 30 commits February 13, 2026 22:56

models

87bc6df

create_node_model

571e34a

fix pytest stuff

881f814

udpate extract_edges_from_node

7a3a43c

fix warning

b4e386a

load edges

3ffe023

read values from .env file

5166464

person-schema.jsonld loads

c10bb5a

person example loaded

cdceadb

deployment

9fb561b

nit

9b12b3f

deploy dir structure plus spanner config

fe591c5

fix env vars

e5c0f06

nits

befacc6

fix deploy script

9dd6945

nit

f6c52d9

reorder template

901ba21

fix volume mounting

0edae40

fix

fbe2fc7

fix

24471af

debug attempt -- increase startup time + nl server debug logs

cf8b258

commit

79a087c

Merge branch 'main' into schema-update

fff19c8

deploy dcp

4eadcd7

Merge branch 'main' into schema-update

01155ce

clean up schema models

452e860

coerce_node_record

44da0cc

create_node_record

c1b97d0

all tests passing

8c6e17a

delete old code

be73b5f

clincoln8 added 24 commits March 9, 2026 05:24

nits

8047d60

strip namespace and separate literal nodes from extracted edges

0cc8fde

decode bytes when converting from record to graphnode

3f04e9b

read edge level provenance

b211664

add back context header

fcff4f2

add logging and method name nit

9e71137

comments

d0d9125

optimize insert_batch

438c458

fix literal id inconsistency

b094df7

fix noderecord->graphnode output

b452a56

revert changes to node_router.py

62feded

potentially handle remote nodes

713a5d6

successful remote nodes

2d0b9ae

remove drop for nonexistent index

a7fc895

remote nodes again

b1dcefc

update yaml

3157894

update dev env

18bfdbd

update dcp image

6b75ae9

semi

a94238b

revert yamls

3768206

format

8d9adc9

remove deploy code

c2a0bb8

remove legacy default prov

68aa3bd

revert bootstrapper

17ddf3e

fix lint workflow

0ed9ede

clincoln8 commented Mar 9, 2026

View reviewed changes

gemini-code-assist bot reviewed Mar 9, 2026

View reviewed changes

dwnoble self-requested a review March 9, 2026 22:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update DCP write api to use BaseDC spanner schema#17

Update DCP write api to use BaseDC spanner schema#17
clincoln8 wants to merge 55 commits intodatacommonsorg:mainfrom
clincoln8:schema2

clincoln8 commented Mar 9, 2026

Uh oh!

gemini-code-assist bot commented Mar 9, 2026

Uh oh!

clincoln8 Mar 9, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 9, 2026

Uh oh!

gemini-code-assist bot Mar 9, 2026

Uh oh!

gemini-code-assist bot Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

clincoln8 commented Mar 9, 2026

Uh oh!

gemini-code-assist bot commented Mar 9, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

clincoln8 Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant