feat: support reingestion of a published dataset #7443

Open
wants to merge 16 commits into
base: feature/schema-5-3

Conversation

Bento007
Contributor

@Bento007 Bento007 commented Feb 20, 2025

Reason for Change

  • Support re-ingesting a published dataset's artifacts into a private revision if the published dataset's artifacts are part of one of the dataset versions of the canonical dataset.

Changes

  • Add support for ingesting a published dataset from the public dataset URL

Testing steps

  • unit tests have been updated to reflect the change

Checklist 🛎️

Notes for Reviewer

@Bento007 Bento007 requested a review from ivirshup February 20, 2025 20:35
Contributor

@ivirshup ivirshup left a comment

The main thing is that there are a couple of methods that are not used anywhere. Other than that, just minor naming and error message things.

Comment on lines 550 to 551
def is_public_uri(self, uri):
    return str(uri).startswith(CorporaConfig().dataset_assets_base_url)
Contributor

I was initially confused by the naming of this function. I don't know if "public" is the correct term here, since we are really just checking if it's an asset we already ingested.

E.g. I think a dataset could be somewhere totally different that is publicly accessible.

Maybe is_already_ingested?
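
A minimal sketch of what the suggested rename might look like, assuming the check stays a plain prefix comparison. `ASSETS_BASE_URL` is a hypothetical stand-in for `CorporaConfig().dataset_assets_base_url`, its value is made up, and the trailing-slash normalization is an added assumption that the original snippet does not have.

```python
# Hypothetical stand-in for CorporaConfig().dataset_assets_base_url.
ASSETS_BASE_URL = "https://assets.example.org"


def is_already_ingested(uri, base_url: str = ASSETS_BASE_URL) -> bool:
    """Return True when the URI points at an asset under our own assets base URL.

    This says nothing about whether the URI is publicly reachable; it only
    checks that the asset lives where previously ingested assets are served.
    """
    # Assumption: normalize the base so that a look-alike host such as
    # "https://assets.example.org.evil.com" does not pass the prefix check.
    prefix = base_url.rstrip("/") + "/"
    return str(uri).startswith(prefix)
```

The name now describes what is actually checked (the asset sits under the ingestion bucket's URL) rather than whether the location is publicly accessible.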

Contributor Author

I can go for that.

Comment on lines 455 to 463
def check_artifact_is_part_of_dataset(self, dataset_version_id: DatasetVersionId, artifact_id: DatasetArtifactId):
    dataset_version = self.datasets_versions[dataset_version_id.id]
    return any(artifact.id == artifact_id for artifact in dataset_version.artifacts)

def get_artifact_by_uri_suffix(self, uri_suffix: str) -> Optional[DatasetArtifact]:
    for artifact in self.dataset_artifacts.values():
        if artifact.uri.endswith(uri_suffix):
            return artifact

Contributor

I don't think either of these methods is called anywhere.

Contributor

Same goes for the other definitions of these methods too

Contributor Author

They are called when we run the backend tests outside integration mode.

Contributor

Can you link to that? When I search your branch for usages, I don't see any:

(two screenshots attached)

Contributor Author

There are a few places where the unit tests fork and use either DatabaseProviderMock or DatabaseProvider, for example here.

Contributor

Oh sorry for the confusion.

My point here is not about the class. It is that I do not see these methods (the implementation here, or from any other class) being called anywhere. Like, if I do a search of the full codebase for the string "check_artifact_is_part_of_dataset" only the definitions show up. So these methods are currently dead code branches.
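
The kind of search described above can be approximated with a short script. This is only a sketch: `find_usages` is a hypothetical helper, not something in the repository, and it matches names textually, so comments and strings count as references.

```python
import re
from pathlib import Path


def find_usages(root: str, name: str) -> dict:
    """Split textual occurrences of `name` in .py files into definitions and other references."""
    definition = re.compile(rf"^\s*def\s+{re.escape(name)}\b")
    reference = re.compile(rf"\b{re.escape(name)}\b")
    found = {"definitions": [], "references": []}
    for path in sorted(Path(root).rglob("*.py")):
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if definition.search(line):
                # A "def <name>" line is a definition, not a call site.
                found["definitions"].append((str(path), lineno))
            elif reference.search(line):
                # Any other mention is treated as a possible usage.
                found["references"].append((str(path), lineno))
    return found
```

If `find_usages(".", "check_artifact_is_part_of_dataset")["references"]` comes back empty, only definitions exist and the method has no callers in the scanned tree.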

Contributor Author

Oh, you're right, they are part of my next PR

Comment on lines +889 to +890
with self.assertRaises(InvalidIngestionManifestException):
    self.business_logic.ingest_dataset(revision.version_id, url, None, dataset_version.version_id)
Contributor

This is optional, but I think two meaningful improvements that could be made to these tests are:

  • Checking text of the error to make sure the correct helpful message is given
  • Ideally I would check that this message is actually returned in the response, e.g. make sure the curator can actually see the helpful error message
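
The first suggestion could be sketched with `assertRaisesRegex` from the standard `unittest` module. Everything else here is a stand-in: the exception class is redefined locally, `ingest_dataset` is a hypothetical stub for `business_logic.ingest_dataset`, and the message text is invented for illustration. Checking what the curator sees in the API response is not covered.

```python
import unittest


class InvalidIngestionManifestException(Exception):
    """Local stand-in for the backend's exception class."""


def ingest_dataset(url: str) -> None:
    # Hypothetical stub; the real method takes more arguments and the
    # real message text may differ.
    raise InvalidIngestionManifestException(f"{url} is not part of the canonical dataset")


class TestIngestErrorMessage(unittest.TestCase):
    def test_error_message_is_helpful(self):
        url = "https://assets.example.org/other-dataset/local.h5ad"
        # assertRaisesRegex checks the exception type and searches the
        # exception's string form for the given pattern.
        with self.assertRaisesRegex(
            InvalidIngestionManifestException, "not part of the canonical dataset"
        ):
            ingest_dataset(url)
```

Matching on a short, stable fragment of the message keeps the test from breaking on minor rewording while still catching a wrong or empty message.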

@Bento007 Bento007 requested a review from ivirshup February 21, 2025 17:32
Contributor

@ivirshup ivirshup left a comment

Main thing is the methods which I believe are not being called.

I think what happened here is that these were meant to go with another batch of changes, and were accidentally included in this PR.

Co-authored-by: Isaac Virshup <ivirshup@gmail.com>

codecov bot commented Feb 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (feature/schema-5-3@77d0000). Learn more about missing BASE report.

Additional details and impacted files
@@                  Coverage Diff                  @@
##             feature/schema-5-3    #7443   +/-   ##
=====================================================
  Coverage                      ?   93.07%           
=====================================================
  Files                         ?      196           
  Lines                         ?    16791           
  Branches                      ?        0           
=====================================================
  Hits                          ?    15628           
  Misses                        ?     1163           
  Partials                      ?        0           
Flag        Coverage Δ
unittests   93.07% <100.00%> (?)


@Bento007 Bento007 requested a review from ivirshup February 21, 2025 23:46