Database details

Edwin Huang edited this page Mar 31, 2025 · 2 revisions

What the database looks like

CAPER DB layout

Every "record" in MongoDB represents a project. When you use PyMongo to filter and query the database, each record is returned as a Python dictionary.
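As a rough illustration of querying a project record with PyMongo, the sketch below wraps a single find_one call. The helper name find_project_runs and the "project_name" field are assumptions made for this example, not confirmed parts of the CAPER schema; the connection details in the trailing comments are hypothetical too.

```python
def find_project_runs(collection, project_name):
    """Return the "runs" dictionary of the first project matching project_name.

    `collection` is expected to be a pymongo Collection. The "project_name"
    field is an assumption about the schema, used here for illustration.
    """
    # The filter is a plain dict; the projection limits the result to "runs".
    record = collection.find_one({"project_name": project_name}, {"runs": 1})
    # find_one returns None when nothing matches the filter.
    return record.get("runs") if record else None


# Example wiring (hypothetical URI, database, and collection names):
# from pymongo import MongoClient
# collection = MongoClient("mongodb://localhost:27017")["caper"]["projects"]
# runs = find_project_runs(collection, "demo_project")
```

Passing the collection in as an argument keeps the helper independent of how the connection is set up.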

How to query sample information

Above is a screenshot of the "runs" field in a small project.
The "runs" field holds n samples.
Each sample, sample_n, can have m features.
Each feature, feature_m, has the following fields:
- ["AA_PDF_file", "AA_PNG_file", "AA_amplicon_number", "AA_directory", "AA_summary_file", "All_genes", "CNV_BED_file", "Captured_interval_length", "Classification", "Complexity_score", "Feature_BED_file", "Feature_ID", "Feature_maximum_copy_number", "Feature_median_copy_number", "Filter_flag", "Location", "Oncogenes", "Reference_version", "Run_metadata_JSON", "Sample_metadata_JSON", "Sample_name", "Sample_type", "Tissue_of_origin", "cnvkit_directory", "extra_metadata_from_csv"]
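The nesting described above (runs, then samples, then features) can be pictured with a small hand-written dictionary. This is only a sketch: it assumes each sample maps to a list of feature dictionaries, and the sample names and values are invented for illustration.

```python
# Made-up "runs" dictionary: sample name -> list of feature dictionaries.
# Only two of the feature fields are shown to keep the example short.
runs = {
    "sample_1": [
        {"Feature_ID": "sample_1_amplicon1_ecDNA_1", "Classification": "ecDNA"},
        {"Feature_ID": "sample_1_amplicon2_BFB_1", "Classification": "BFB"},
    ],
    "sample_2": [
        {"Feature_ID": "sample_2_amplicon1_Linear_1", "Classification": "Linear"},
    ],
}

# Number of features (m) for each of the n samples.
features_per_sample = {name: len(features) for name, features in runs.items()}

# Classification of every feature, grouped by sample.
classifications = {
    name: [feature["Classification"] for feature in features]
    for name, features in runs.items()
}

print(features_per_sample)
print(classifications)
```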
To query and filter on specific conditions, either write a PyMongo query directly, or fetch the project by its ID and filter with pandas.
- For example:
1. Call get_one_project with the project ID to get the project. The result should be a Python dictionary.
2. Read the 'runs' field of the project dictionary, which contains the features.
3. Call replace_space_to_underscore to obtain a list of features, then wrap it in pd.DataFrame.
- Code example:
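A minimal sketch of the three steps above. Because the signatures of get_one_project and replace_space_to_underscore are not shown on this page, the sketch uses a hard-coded project dictionary in place of the get_one_project result and a hypothetical stand-in, flatten_runs, in place of replace_space_to_underscore. It also assumes each sample maps to a list of feature dictionaries; all values are invented.

```python
import pandas as pd

# Stand-in for the dictionary that get_one_project would return (made-up data).
project = {
    "project_name": "demo_project",
    "runs": {
        "sample_1": [
            {"Sample name": "sample_1", "Feature ID": "sample_1_amplicon1_ecDNA_1",
             "Classification": "ecDNA", "Feature maximum copy number": 42.0},
            {"Sample name": "sample_1", "Feature ID": "sample_1_amplicon2_BFB_1",
             "Classification": "BFB", "Feature maximum copy number": 8.5},
        ],
        "sample_2": [
            {"Sample name": "sample_2", "Feature ID": "sample_2_amplicon1_ecDNA_1",
             "Classification": "ecDNA", "Feature maximum copy number": 6.0},
        ],
    },
}


def flatten_runs(runs):
    """Hypothetical stand-in for replace_space_to_underscore.

    Flattens {sample: [feature, ...]} into one list of feature dictionaries
    and replaces spaces in the field names with underscores.
    """
    features = []
    for sample_features in runs.values():
        for feature in sample_features:
            features.append({key.replace(" ", "_"): value
                             for key, value in feature.items()})
    return features


# Steps 2 and 3: take the "runs" field, flatten it, and wrap it in a DataFrame.
features_df = pd.DataFrame(flatten_runs(project["runs"]))

# Filter with pandas: ecDNA features with a maximum copy number above 10.
ecdna = features_df[
    (features_df["Classification"] == "ecDNA")
    & (features_df["Feature_maximum_copy_number"] > 10)
]
print(ecdna[["Sample_name", "Feature_ID"]])
```

Once the features are in a DataFrame, any further filtering is ordinary pandas boolean indexing on the feature fields.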

Notes:

DB as of 11/20/24
"runs" is a key in each project, and its value is a dictionary.
- The keys are: [sample_1, sample_2, ... sample_n]
- Each sample contains the feature fields listed under "How to query sample information"
"sample_data" is a dictionary in each project.