deleters, importers ... update

progenetix · Sep 16, 2024 · 4dcee29
1 parent 51c4ebc
commit 4dcee29

Showing 13 changed files with 256 additions and 216 deletions.
diff --git a/docs/applications.md b/docs/applications.md
@@ -39,52 +39,7 @@ as well as some other statistics (e.g. CNV coverage per chromosomal arms ...).
 * `bin/analysesStatusmapsRefresher.py -d progenetix --filters "pgx:icdom-81703"`
 * `bin/analysesStatusmapsRefresher.py -d cellz --filters "cellosaurus:CVCL_0312"`
 
---------------------------------------------------------------------------------
-
-### `collationsCreator`
-
-The `collationsCreator` script updates the dataset specific `collations` collections
-which provide the aggregated data (sample numbers, hierarchy trees etc.) for all
-individual codes belonging to one of the entities defined in the `filter_definitions`
-in the `bycon` configuration. The (optional) hierarchy data is provided
-in `rsrc/classificationTrees/__filterType__/numbered-hierarchies.tsv` as a list
-of ordered branches in the format `code | label | depth | order`.
-
-**TBD** The filter definition should be one of the configuration where users can
-provide additions and overrides in the `byconaut/local` directory.
-
-#### Arguments
-
-* `-d`, `--datasetIds` ... to select the dataset (only one per run)
-* `--filters` ... to (optionally) limit the processing to a subset of samples
-  (e.g. after a limited update)
-
-#### Use
-
-* `bin/collationsCreator.py -d progenetix`
-* `bin/collationsCreator.py -d examplez --collationTypes "PMID"`
-
---------------------------------------------------------------------------------
-
-### `frequencymapsCreator`
-
-This app creates the frequency maps for the "collations" collection. Basically,
-all samples matching any of the collation codes and representing CNV analyses
-are selected and the frequencies of CNVs per genomic bin are aggregated. The
-result contains teh gain and loss frquencies for all genomic intervals, for the
-given entity.
-
-#### Arguments
-
-* `-d`, `--datasetIds` ... to select the dataset (only one per run)
-* `--collationTypes` ... to (optionally) limit the processing to a selected
-  collation types (e.g. `NCIT`, `PMID`, `icdom` ...)
-
-#### Use
-
-* `bin/frequencymapsCreator.py -d progenetix`
-* `bin/frequencymapsCreator.py -d examplez --collationTypes "icdot"`
-
+-------------------------------------------------------------------------------
 
 ## Utility apps
 

diff --git a/docs/housekeeping.md b/docs/housekeeping.md
@@ -72,3 +72,13 @@ Records are deleted by providing a standard pgx-style tab-delimited metadata fil
 where only the corresponding `..._id` column is essential. As example, the 
 `deleteIndividuals.py` app will take a table which includes a column `individual_id`
 and use these values to delete the matching records.
+
+### Deleting variants
+
+Variant `id` values are generated upon insertion and are not meant to be
+stable or recoverable, so variants can only be managed sensibly at the
+`analysis` level. To delete variants, remove the corresponding analyses
+together with their variants using the `deleteAnalysesWDS.py` app.
+Also, `importers/variantsInserter.py` by default purges all existing variants
+belonging to any of the `analysis_id` values listed in the variants file
+before inserting the new variants.
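
The analysis-level purge described in this hunk can be illustrated with a minimal sketch. It is not the code of `variantsInserter.py` or `deleteAnalysesWDS.py`; the helper names `analysis_ids_from_tsv` and `variants_purge_filter` are hypothetical, and it assumes that each stored variant carries an `analysis_id` field. The sketch only collects the distinct `analysis_id` values from a tab-delimited file and builds a MongoDB-style filter document from them.

```python
import csv
import io


def analysis_ids_from_tsv(handle):
    """Collect the distinct analysis_id values from a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    ids = []
    for row in reader:
        a_id = (row.get("analysis_id") or "").strip()
        if a_id and a_id not in ids:
            ids.append(a_id)
    return ids


def variants_purge_filter(analysis_ids):
    """Build a filter matching all variants of the given analyses.

    Assumption: variants store their analysis in an `analysis_id` field.
    """
    return {"analysis_id": {"$in": list(analysis_ids)}}


# Two variant rows of the same analysis (ids taken from the docs example).
example = (
    "biosample_id\tanalysis_id\n"
    "pgxbs-kftva59y\tpgxcs-kftvldsu\n"
    "pgxbs-kftva59y\tpgxcs-kftvldsu\n"
)
ids = analysis_ids_from_tsv(io.StringIO(example))
purge_filter = variants_purge_filter(ids)
print(purge_filter)
```

With `pymongo`, a filter of this shape could be passed to `delete_many` on the variants collection. Deleting by analysis rather than by variant `id` matches the reasoning above, since the variant ids are regenerated on every insertion.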
diff --git a/docs/index.md b/docs/index.md
@@ -33,6 +33,33 @@ mongorestore --db $database .../mongodump/examplez/
 
 ### Option B: Create your own databases
 
+#### Core Data
+
+A basic setup for a Beacon compatible database - as supported by the `bycon` package -
+consists of the core data collections mirroring the Beacon default data model:
+
+* `variants`
+* `analyses` (which covers parameters from both Beacon `analysis` and `run` entity schemas)
+* `biosamples`
+* `individuals`
+
+Databases are created in an existing MongoDB setup by importing tab-delimited
+data files with the utility applications in the `importers` directory. In
+principle, only two import files are needed for inserting and updating records:
+
+* a file for the non-variant metadata[^1] with specific header values, which at
+  a minimum has to provide the id values for the different entities
+* a file for genomic variants, again with specific headers but also containing
+  the upstream ids for the corresponding analysis, biosample and individual
+Examples:
+
+```
+individual_id   biosample_id    analysis_id
+pgxind-kftx25eh pgxbs-kftva59y  pgxcs-kftvldsu
+```
+
+#### Further and optional procedures
+
 1. Create database and variants collection
 2. update the local `bycon` installation for your database information and local parameters
     * database name(s)
@@ -50,3 +77,7 @@ mongorestore --db $database .../mongodump/examplez/
 
 Please see the [helper apps documentation](applications/#data-transformation-database-maintenance).
 
+
+
+[^1]: Metadata in biomedical genomics is "everything but the sequence variation"
+
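
As a rough illustration of the minimal metadata file described under "Core Data", the following sketch checks a tab-delimited table for the three id columns shown in the example. The function `check_metadata_table` and the `REQUIRED_ID_COLUMNS` tuple are hypothetical and not part of `bycon` or the `importers` tools. Treating all three columns as mandatory is an assumption drawn from the example table.

```python
import csv
import io

# Assumption: these three headers are the minimum, as in the docs example.
REQUIRED_ID_COLUMNS = ("individual_id", "biosample_id", "analysis_id")


def check_metadata_table(handle, required=REQUIRED_ID_COLUMNS):
    """Return the id values per row, or raise ValueError on missing ids."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in required if c not in header]
    if missing:
        raise ValueError(f"missing header columns: {', '.join(missing)}")
    rows = []
    # Data rows start on line 2, directly after the header line.
    for line_no, row in enumerate(reader, start=2):
        empty = [c for c in required if not (row.get(c) or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty values for {', '.join(empty)}")
        rows.append({c: row[c].strip() for c in required})
    return rows


table = (
    "individual_id\tbiosample_id\tanalysis_id\n"
    "pgxind-kftx25eh\tpgxbs-kftva59y\tpgxcs-kftvldsu\n"
)
records = check_metadata_table(io.StringIO(table))
print(records)
```

Checking the header before the rows means that a wrongly named column is reported once, instead of as an empty value on every line.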
diff --git a/housekeepers/deleteAnalyses.py b/housekeepers/deleteAnalyses.py
diff --git a/housekeepers/deleteAnalysesWDS.py b/housekeepers/deleteAnalysesWDS.py
diff --git a/housekeepers/deleteBiosamples.py b/housekeepers/deleteBiosamples.py
diff --git a/housekeepers/deleteBiosamplesWDS.py b/housekeepers/deleteBiosamplesWDS.py
diff --git a/housekeepers/deleteIndividuals.py b/housekeepers/deleteIndividuals.py
diff --git a/housekeepers/deleteIndividualsWDS.py b/housekeepers/deleteIndividualsWDS.py
diff --git a/housekeepers/recordsMoverWDS.py b/housekeepers/recordsMoverWDS.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python3
+
+import sys
+from os import pardir, path
+from bycon import *
+
+# Make the importers' helper library importable from this script's location.
+loc_path = path.dirname( path.abspath(__file__) )
+lib_path = path.join(loc_path , pardir, "importers", "lib")
+sys.path.append( lib_path )
+from importer_helpers import *
+
+# Usage: ./housekeepers/recordsMoverWDS.py -d progenetix --output cellz -i ./imports/1kdeltest.tsv --testMode false
+
+################################################################################
+################################################################################
+################################################################################
+
+def main():
+    initialize_bycon_service()
+    BI = ByconautImporter()
+    BI.move_individuals_and_downstream()
+
+
+################################################################################
+################################################################################
+################################################################################
+
+if __name__ == '__main__':
+    main()
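
The diff does not show the body of `move_individuals_and_downstream()`. The method name and the usage line (`-d progenetix --output cellz`) suggest that records for the listed individuals, plus their downstream biosamples, analyses and variants, are moved from one dataset to another. The sketch below shows that idea on plain in-memory dictionaries. It is a hypothetical stand-in, not the `ByconautImporter` implementation, and it assumes that every downstream record carries an `individual_id` field.

```python
def move_individuals_and_downstream(source, target, individual_ids):
    """Move individuals and their linked downstream records.

    `source` and `target` map collection names to lists of record dicts.
    Returns the number of moved records per collection.
    """
    wanted = set(individual_ids)
    moved = {}
    for collection in ("individuals", "biosamples", "analyses", "variants"):
        # Individuals are matched on their own id, everything else on the
        # upstream individual_id (an assumption of this sketch).
        key = "id" if collection == "individuals" else "individual_id"
        keep, move = [], []
        for record in source.get(collection, []):
            (move if record.get(key) in wanted else keep).append(record)
        target.setdefault(collection, []).extend(move)
        source[collection] = keep
        moved[collection] = len(move)
    return moved


source_db = {
    "individuals": [{"id": "pgxind-kftx25eh"}, {"id": "ind-other"}],
    "biosamples": [{"id": "pgxbs-kftva59y", "individual_id": "pgxind-kftx25eh"}],
    "analyses": [{"id": "pgxcs-kftvldsu", "individual_id": "pgxind-kftx25eh"}],
    "variants": [{"id": "v1", "individual_id": "pgxind-kftx25eh"}],
}
target_db = {}
counts = move_individuals_and_downstream(source_db, target_db, ["pgxind-kftx25eh"])
print(counts)
```

Only the records linked to the selected individual leave the source, so unrelated entries such as `ind-other` stay in place.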