From 6dffe9f567487734f05e5652bb36a939c794af75 Mon Sep 17 00:00:00 2001
From: Ruslan Forostianov
Date: Fri, 31 May 2024 17:43:32 +0200
Subject: [PATCH 1/2] Add Incremental Data Loading Documentation

---
 docs/Data-Loading-Maintaining-Studies.md   |  8 +++
 docs/Data-Loading.md                       |  5 ++
 docs/Incremental-Data-Loading.md           | 61 ++++++++++++++++++++++
 docs/Using-the-metaImport-script.md        | 13 ++++-
 docs/deployment/docker/example_commands.md |  3 ++
 5 files changed, 89 insertions(+), 1 deletion(-)
 create mode 100644 docs/Incremental-Data-Loading.md

diff --git a/docs/Data-Loading-Maintaining-Studies.md b/docs/Data-Loading-Maintaining-Studies.md
index d0c247ef35d..037372c36d7 100644
--- a/docs/Data-Loading-Maintaining-Studies.md
+++ b/docs/Data-Loading-Maintaining-Studies.md
@@ -27,6 +27,14 @@ For example:
 ./cbioportalImporter.py -s ../../../test/scripts/test_data/study_es_0/
 ```
+## Importing part of the data
+To import only some new or updated data entries, you can specify the `-d` option instead of `-s`:
+```
+./cbioportalImporter.py -d
+```
+Although the `-d` option accepts a directory that follows the same structure as the study directory, not all data types are supported for incremental upload.
+For more details on incremental data loading, see [this page](./Incremental-Data-Loading.md).
+
 ## Deleting a study
 To remove a study, run:
 ```
diff --git a/docs/Data-Loading.md b/docs/Data-Loading.md
index 39f27d9b385..9857a92aab2 100644
--- a/docs/Data-Loading.md
+++ b/docs/Data-Loading.md
@@ -53,6 +53,11 @@ The validation can be run standalone, but it is also integrated into the [metaIm
 ## Loading Data
 To load the data into cBioPortal, the [metaImport script](/Using-the-metaImport-script.md) has to be used. This script first validates the data and, if validation succeeds, loads the data.
+### Incremental Loading
+
+You can incorporate data entries of certain data types without re-uploading the whole study.
+To do this, you have to specify the `--data_directory` (or `-d`) option instead of `--study_directory` (or `-s`) for the [metaImport script](./Using-the-metaImport-script.md).
+
 ## Removing a Study
 To remove a study, the [cbioportalImporter script](/Data-Loading-Maintaining-Studies.md#deleting-a-study) can be used.
diff --git a/docs/Incremental-Data-Loading.md b/docs/Incremental-Data-Loading.md
new file mode 100644
index 00000000000..6f32c0a64cf
--- /dev/null
+++ b/docs/Incremental-Data-Loading.md
@@ -0,0 +1,61 @@
+# Incremental Data Loading
+
+To add or update a few entries (patient/sample/genetic profile) more quickly, especially for larger studies, you can use incremental data loading instead of re-uploading the entire study.
+
+## Granularity of Incremental Data Loading
+
+Think of updating an entry as a complete swap of data for a particular data type for this entry (patient/sample/genetic profile).
+When you update an entry, you must provide the complete data for this data type for this entry again.
+For example, if you want to add or update the `Gender` attribute of a patient by incrementally uploading the `PATIENT_ATTRIBUTES` data type, you have to supply **all** other attributes of this patient again.
+Note that in this case, you don't have to supply all sample information or molecular data types for this patient again as those are separate data types, and the rule applies to them in their own turn.
+
+**Note:** Although incremental upload will create a genetic profile (name, description, etc.) when you upload molecular data for the first time, it does not update the profile attributes on subsequent uploads.
+It simply reuses the genetic profile if none of the identifying attributes (`cancer_study_identifier`, `genetic_alteration_type`, `datatype` and `stable_id`) have changed.
+
+## Usage
+To load data incrementally, you have to specify the `--data_directory` (or `-d`) option instead of `--study_directory` (or `-s`) when running the [metaImport script](./Using-the-metaImport-script.md) or the `cbioportalImporter.py` script.
+
+The data directory follows the same structure and data format as the study directory.
+The data files should contain complete information about entries you want to add or update.
+
+## Supported Data Types
+Please note that incremental upload is supported for subset of data types only.
+Not supported data types have to be omitted from the directory.
+
+Here is the list of data types as they specified in `datatype` attribute of meta file.
+
+- `CASE_LIST`
+- `CNA_CONTINUOUS`
+- `CNA_DISCRETE`
+- `CNA_DISCRETE_LONG`
+- `CNA_LOG2`
+- `EXPRESSION`
+- `GENERIC_ASSAY_BINARY` (sample level only; `patient_level: false`)
+- `GENERIC_ASSAY_CATEGORICAL` (sample level only; `patient_level: false`)
+- `GENERIC_ASSAY_CONTINUOUS` (sample level only; `patient_level: false`)
+- `METHYLATION`
+- `MUTATION`
+- `MUTATION_UNCALLED`
+- `PATIENT_ATTRIBUTES`
+- `PROTEIN`
+- `SAMPLE_ATTRIBUTES`
+- `SEG`
+- `STRUCTURAL_VARIANT`
+- `TIMELINE` (aka clinical events)
+
+You might want to check the `INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES` variable of the `cbioportal_common.py` module of the `cbioportal-core` project to ensure the list is up to date.
+
+These are the known data types for which incremental upload is not currently supported:
+
+- `CANCER_TYPE`
+- `GENERIC_ASSAY_BINARY` (patient level; `patient_level: true`)
+- `GENERIC_ASSAY_CATEGORICAL` (patient level; `patient_level: true`)
+- `GENERIC_ASSAY_CONTINUOUS` (patient level; `patient_level: true`)
+- `GISTIC_GENES`
+- `GSVA_PVALUES`
+- `GSVA_SCORES`
+- `PATIENT_RESOURCES`
+- `RESOURCES_DEFINITION`
+- `SAMPLE_RESOURCES`
+- `STUDY_RESOURCES`
+- `STUDY`
\ No newline at end of file
diff --git a/docs/Using-the-metaImport-script.md b/docs/Using-the-metaImport-script.md
index 261bca54bcb..9bb4ca4dee7 100644
--- a/docs/Using-the-metaImport-script.md
+++ b/docs/Using-the-metaImport-script.md
@@ -11,7 +11,7 @@ and then run the following command:
 This will tell you the parameters you can use:
 ```
 $./metaImport.py -h
-usage: metaImport.py [-h] -s STUDY_DIRECTORY
+usage: metaImport.py [-h] [-s STUDY_DIRECTORY | -d DATA_DIRECTORY]
  [-u URL_SERVER | -p PORTAL_INFO_DIR | -n] [-jar JAR_PATH] [-html HTML_TABLE] [-v] [-o] [-r] [-m]
@@ -22,6 +22,8 @@ optional arguments:
  -h, --help show this help message and exit
  -s STUDY_DIRECTORY, --study_directory STUDY_DIRECTORY
  path to directory.
+ -d DATA_DIRECTORY, --data_directory DATA_DIRECTORY
+ path to data directory for incremental upload.
  -u URL_SERVER, --url_server URL_SERVER
  URL to cBioPortal server. You can set this if your URL is not http://localhost/cbioportal
@@ -68,5 +70,14 @@ This example imports the study to the localhost, creates an html report and show
 By adding `-o`, warnings will be overridden and import will start after validation.
+#### Incremental Upload
+
+You have to specify `--data_directory` (or `-d`) instead of `--study_directory` (or `-s`) option to load data incrementally.
+Incremental upload means incorporate data entries of certain data types without re-uploading the whole study.
+The data directory follows the same structure and data format as the study directory.
+It should contain complete information about entries you want to add or update.
+Please note that some data types like study are not supported and must not be present in the data directory.
+[Here](./Incremental-Data-Loading.md) you can find more details.
+
 ## Development / debugging mode
 For developers and specific testing purposes, an extra script, cbioportalImporter.py, is available which imports data regardless of validation results. Check [this](Data-Loading-For-Developers.md) page for more information on how to use it.
diff --git a/docs/deployment/docker/example_commands.md b/docs/deployment/docker/example_commands.md
index f9f26f56009..84a1f4292ca 100644
--- a/docs/deployment/docker/example_commands.md
+++ b/docs/deployment/docker/example_commands.md
@@ -27,6 +27,9 @@ docker-compose run \
 :warning: after importing a study, remember to restart `cbioportal-container` to see the study on the home page. Run `docker-compose restart cbioportal`.
+To load data incrementally, specify the `-d` option instead of `-s`.
+For more details on incremental data loading, see [this page](/Incremental-Data-Loading.md).
+
 #### Using cached portal side-data ####
 In some setups the data validation step may not have direct access to the web API, for instance when the web API is only accessible to authenticated browser sessions. You can use this command to generate a cached folder of files that the validation script can use instead. Make sure to replace `` with the absolute path where the cached folder is going to be generated.
From 02e8f0782dd8eda66ed1bd6303c14e133f22a6d0 Mon Sep 17 00:00:00 2001
From: Ruslan Forostianov
Date: Fri, 28 Jun 2024 15:02:55 +0200
Subject: [PATCH 2/2] Apply suggestions from code review

Co-authored-by: pieterlukasse
---
 docs/Incremental-Data-Loading.md    | 4 ++--
 docs/Using-the-metaImport-script.md | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/Incremental-Data-Loading.md b/docs/Incremental-Data-Loading.md
index 6f32c0a64cf..ce24a1fa482 100644
--- a/docs/Incremental-Data-Loading.md
+++ b/docs/Incremental-Data-Loading.md
@@ -9,7 +9,7 @@ When you update an entry, you must provide the complete data for this data type
 For example, if you want to add or update the `Gender` attribute of a patient by incrementally uploading the `PATIENT_ATTRIBUTES` data type, you have to supply **all** other attributes of this patient again.
 Note that in this case, you don't have to supply all sample information or molecular data types for this patient again as those are separate data types, and the rule applies to them in their own turn.
-**Note:** Although incremental upload will create a genetic profile (name, description, etc.) when you upload molecular data for the first time, it does not update the profile attributes on subsequent uploads.
+**Note:** Although incremental upload will create a genetic profile (name, description, etc.) when you upload molecular data for the first time, it does not update the profile (metadata) attributes on subsequent uploads.
 It simply reuses the genetic profile if none of the identifying attributes (`cancer_study_identifier`, `genetic_alteration_type`, `datatype` and `stable_id`) have changed.
 ## Usage
@@ -20,7 +20,7 @@ The data files should contain complete information about entries you want to add
 ## Supported Data Types
 Please note that incremental upload is supported for subset of data types only.
-Not supported data types have to be omitted from the directory.
+Unsupported data types have to be omitted from the directory.
 Here is the list of data types as they specified in `datatype` attribute of meta file.
diff --git a/docs/Using-the-metaImport-script.md b/docs/Using-the-metaImport-script.md
index 9bb4ca4dee7..bc8d3aa1ec6 100644
--- a/docs/Using-the-metaImport-script.md
+++ b/docs/Using-the-metaImport-script.md
@@ -73,7 +73,7 @@ By adding `-o`, warnings will be overridden and import will start after validati
 #### Incremental Upload
 You have to specify `--data_directory` (or `-d`) instead of `--study_directory` (or `-s`) option to load data incrementally.
-Incremental upload means incorporate data entries of certain data types without re-uploading the whole study.
+Incremental upload enables data entries of certain data types to be updated without the need to re-upload the whole study.
 The data directory follows the same structure and data format as the study directory.
 It should contain complete information about entries you want to add or update.
 Please note that some data types like study are not supported and must not be present in the data directory.