Include orphans in verbatim manifest for AnVIL #6529
Comments
Spike to review design.
I believe this statement is false, given the definition, because all biosamples will materialize as outer entities, since they form the basis of what we currently call primary bundles. Biosamples therefore cannot be orphans. Donors may be orphans if they are not associated with a biosample, but their status as orphans isn't affected by whether they're associated with a file.
I don't understand why 8 is the value chosen for
This seems backwards to me. I think it should say, "Replica bundles don't contribute any entities to the index".
If it's possible for the |
Currently, the replicas index contains two replicas for every dataset row: one containing the BigQuery row, and one containing the DUOS description.
I presume this upserts the DUOS replica, as opposed to the BigQuery one. But perhaps the intent was for it to upsert both, or perhaps to combine the two.
If the primary bundle does not emit the BigQuery dataset replica, then it must be emitted by the DUOS bundle. But I think it would be more straightforward for the DUOS bundle to upsert the DUOS replica and for the primary bundle to upsert a replica for every entity, including the dataset (providing its BigQuery replica). This is how the current implementation works.
Assignee to revise design.
Regarding the last section (Eliminate RepositoryPlugin.list_partitions):
We decided instead to use the sandbox/personal heuristic for IT catalogs in main deployments, thus invalidating both of the previous points.
Assignee to update description with everything discussed between @nadove-ucsc and @hannes-ucsc. |
Definitions
All replicas will refer to the dataset (project) they are part of as one of their hubs.
TBD: Should we make the hub_ids field of each replica a list of entity references? I'm leaning towards not doing so because entity IDs are UUIDs and should be unique without the entity type qualifier. If we needed to look up entities by a replica's hub IDs, the qualifier would help in determining the target index. At the moment, such a lookup is not needed.
An orphan is an entity that does not materialize as an inner or outer entity in the contributions index. By that definition, AnVIL's supplementary files are not orphans. An orphan may have non-null references (foreign keys) to other entities, or other entities may refer to it, but if Azul does not follow those references, the entity is still considered orphaned under this definition.
Note that this definition of orphan is Azul-centric. A more universal definition would define orphans as entities that don't relate to a file, or, even more restrictively, that don't relate to any other entity. However, it is immaterial whether a relation exists or whether Azul just doesn't index it. We know that there are many snapshots with donor and biosample entities that don't relate to a file; the Azul-centric definition of orphan includes these entities, and the resolution to this ticket will cover them as well.
While HCA may have orphans, this issue is only concerned with AnVIL orphans. The implementor is free to choose whether their changes increase the number of entities included in the verbatim manifests for HCA; their changes just mustn't reduce that number, and their choice should be driven by expediency.
AnVIL has study-specific tables that are not governed by the AnVIL table schema. Azul is not aware of the relations between these tables, so all rows in these tables are orphans under the above definition.
The replica of an orphan will have no hub other than the dataset (optional: project) it is part of.
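To make the definitions concrete, here is a hypothetical sketch of the two replica shapes. The field names `entity_id`, `replica_type` and `hub_ids` are illustrative assumptions, not necessarily Azul's actual document schema:

```python
# Hypothetical replica documents (field names are illustrative assumptions).
# A file replica has its file and dataset as hubs; an orphan's replica has
# only the containing dataset as a hub, per the definition above.
file_replica = {
    'entity_id': '<file_row_id>',
    'replica_type': 'anvil_file',
    'hub_ids': ['<file_id>', '<dataset_id>'],
}
orphan_replica = {
    'entity_id': '<orphan_row_id>',
    'replica_type': 'anvil_donor',
    'hub_ids': ['<dataset_id>'],  # no hubs other than the dataset
}
```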
Service
The intent behind this issue is to include the replicas of orphans in verbatim manifests for AnVIL, if the user is filtering exclusively by dataset.
Currently, when a verbatim manifest is prepared, the filter is applied to the aggregate files index and the matching documents are retrieved. This is the left side of the join. Then all replicas referring to those files as hubs are retrieved; that's the right side of the join. For each file, we also determine its dataset (project) ID and use that to retrieve the dataset (project) replica (the implicit hub, somewhat of a misnomer). Note that this is not the same as retrieving all replicas that have the dataset (project) as a hub. Doing so would extend the set of returned replicas to that of all replicas of every matching file's dataset (project). That's not what we want when the left side of the join is the aggregate files index.
When the filter only specifies the dataset ID field, it will be applied to the datasets index as the left side of the join. Because every replica, orphaned or not, lists the containing dataset as a hub, this effectively yields all replicas in that dataset. We don't need the implicit hub lookup in that case since the dataset replica refers to itself as a hub.
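As a rough sketch of the two query shapes described above, assuming an Elasticsearch-style query DSL and a `hub_ids` field on replica documents (both are assumptions about details not spelled out here):

```python
def replicas_for_files(file_ids: list[str], dataset_id: str) -> dict:
    # Right side of the join when the left side is the aggregate files
    # index: replicas listing any matched file as a hub, plus the single
    # replica of the implicit hub (the dataset itself) -- deliberately NOT
    # all replicas that merely have the dataset as a hub.
    return {
        'bool': {
            'should': [
                {'terms': {'hub_ids': file_ids}},
                {'term': {'entity_id': dataset_id}},
            ],
            'minimum_should_match': 1,
        }
    }

def replicas_for_dataset(dataset_id: str) -> dict:
    # When the filter names only the dataset ID field, every replica
    # (orphaned or not) lists the dataset as a hub, so a single terms
    # query yields all replicas in the dataset; no implicit-hub lookup is
    # needed because the dataset replica refers to itself as a hub.
    return {'terms': {'hub_ids': [dataset_id]}}
```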
The query against the datasets index will need to enforce the source filter. Otherwise, we would return replicas from managed-access snapshots to unauthorized users. This is in contrast to the `/index/datasets` endpoint, which is also backed by that index but does not enforce the source filter when serving requests. That endpoint only returns project aggregates, which are deemed public, even for managed-access snapshots.

Indexer
In order to populate the replica index with the replicas of orphans, we'll introduce a new type of "replica" bundle. I don't think that type will manifest explicitly as a new subclass of `TDRAnvilBundleFQID`, but simply as a set of instances of that class, sharing a particular combination of attributes, as defined below.

We generalize a `TDRAnvilBundleFQID` instance to represent a batch of rows in a particular table. The batch will be defined in terms of a particular prefix of the `datarepo_row_id` of these rows. All rows in a batch will share the same prefix and will come from the same table. The name of the table will be represented by the bundle FQID's `entity_type` attribute. The batch prefix will be represented by the bundle's UUID and a new attribute called `batch_prefix_length`, such that the first `batch_prefix_length` characters of the bundle UUID appear at the beginning of the `datarepo_row_id` column of every row in the batch. This idea was first proposed by Noa (@nadove-ucsc).

The `entity_type` attribute of `TDRAnvilBundleFQID` will be renamed to `table_name` and its type will be changed from `BundleEntityType` (an Enum subclass) to a plain string. `BundleEntityType` will be renamed to `BundleType`.

The `table_name` attribute of legacy (primary, supplementary, and DUOS) bundle FQIDs will be `'anvil_biosample'` (`BundleType.primary`), `'anvil_file'` (`BundleType.supplementary`) and `'anvil_dataset'` (`BundleType.duos`), respectively.

The UUID of legacy bundles is derived bijectively from the `datarepo_row_id` by switching the UUID version. The `batch_prefix_length` attribute for legacy bundles will be fixed at 8. The actual batch prefix manifests as the first eight characters of the bundle's UUID. The batch size of legacy bundles is 1, and an assertion will fire if there is more than one row with that prefix in the respective table.

The batch prefix of a replica bundle is at least the partition prefix, with more characters added as needed to break up larger tables into multiple batches, one batch per bundle. The UUID of these replica bundles begins with the batch prefix and has the same version as the legacy bundle UUIDs. All other "random" bits are 0. Replica bundles don't contribute any entities to the index, only replicas, so their UUIDs will not persist.
The `list_bundles` endpoint will enumerate all tables in the snapshot, using the `INFORMATION_SCHEMA` or a BigQuery Python API method. For tables backing replica bundles, the method will then use the following query to determine the optimal batch size. As mentioned above, the batch size for legacy bundles is fixed at 1.

The central element of this query is the `GROUP BY ROLLUP` clause of the `counts` query. It essentially tries a fixed number of different batch prefix lengths and compiles the number of matching rows for each such prefix. A subsequent query averages the bundle count over all prefixes of a specific length. The example query above hard-codes the maximum prefix length that it tries to be 4. The implementor should write a function that generates the query given a table name, a partition prefix and a maximum batch prefix length.

Given the result of the query, the `list_bundles` method will pick a batch prefix length that produces batches of around 256 rows. The method then emits another query against the same table to enumerate all batches, and emits the bundle FQIDs for them.

Indexing a supplementary bundle, i.e., a bundle with an FQID whose `table_name` is equal to `BundleType.supplementary.value`, will upsert a file replica and, if the file row's `is_supplementary` column is set, contribute a file entity and a dataset (project) entity. Both the file entity and the associated dataset (project) entity will be hubs of the file replica.

Indexing a DUOS bundle, i.e., a bundle with an FQID whose `table_name` is equal to `BundleType.duos.value`, will upsert a `dataset` replica and contribute the DUOS description to the dataset entity. The dataset replica will list the dataset entity as a hub.

TBD: DUOS description in dataset replica? Special replica for DUOS description?
Indexing a primary bundle, i.e., a bundle with an FQID whose `table_name` is equal to `BundleType.primary.value`, will contribute the entities discovered during graph traversal starting at that biosample (including the dataset), and upsert a replica of each of the contributed entities, except for the dataset. The dataset (project) entity will be added as a hub of every replica upserted by this bundle.

Any other bundle is a replica bundle, and indexing such a bundle will upsert a replica for every row in the batch. Each replica will have exactly one hub: the dataset. This means that the `datarepo_row_id` of the dataset entity needs to be known, which implies that the indexer needs to fetch the singleton row from the `anvil_dataset` table in the snapshot.

Eliminate `RepositoryPlugin.list_partitions` (#6531)

The `list_partitions` method will be removed. The average duration of the Lambda invocation that handles the `reindex` notification is under ten seconds in `prod` and `anvilprod`, so there is a lot of leeway for not worrying about the partition size too much. We will stop tracking the number of bundles in each snapshot in the source tree, and eliminate (move to the attic) the `update_subgraphs_count.py` script and any logic in `environment.py` files that is used to compute the common and partition prefix lengths. Typically, source specs won't reflect the common prefix or the partition prefix length, while still allowing for manual overrides (see below). Instead, we'll add a repository plugin method called `partition_source` that determines both values dynamically. This will be done in a separate, first PR. The method will be called by the remote reindex functionality in `azulclient.py`. The initial implementation of that method for HCA will count the rows in the `links` table. The initial implementation for AnVIL will count the rows in the `anvil_biosample` table plus the supplementary rows in the `anvil_file` table.

In main deployments, the method returns a common prefix of `''` and heuristically computes a partition prefix length that yields partitions of up to 8,192 bundles. In other deployments (sandbox or personal), the method heuristically computes a common prefix length that yields between 2 and 32 bundles, and a partition prefix length of 1 (yes, some of the partitions could be empty, but we want the IT to cover multiple partitions in those deployments). While the length of the common prefix is computed, the concrete digits should be derived pseudorandomly but deterministically, e.g., by hashing the source spec.

The `post_deploy` make target must ensure that, if a common prefix is defined, the source contains between 1 and 512 bundles. The operator manual should document that sources failing that requirement can be fixed by manually overriding the common prefix in `environment.py`: one digit shorter than the computed prefix if that prefix is empty, and one digit longer if the computed prefix contains more bundles than allowed.

When a subsequent PR modifies the `list_bundles` implementation for AnVIL as described above, the `partition_source` method for AnVIL will count all rows in all tables and estimate a common prefix and partition prefix length accordingly.
method for AnVIL will count all rows in all tables and estimate a common prefix and partition prefix length accordingly.The text was updated successfully, but these errors were encountered: