
Database details

Edwin Huang edited this page Mar 31, 2025 · 2 revisions

What the database looks like

CAPER DB layout

  • Every "record" in MongoDB represents a project. When you use PyMongo to filter and query the database, each record is returned as a Python dictionary.

How to query sample information

image
  • Above is a screenshot of the "runs" field in a small project.
  • In the "runs" field, there may be n samples.
  • Each sample, sample_n, can have m features.
  • Each feature_m has the following fields:
    • ["AA_PDF_file", "AA_PNG_file", "AA_amplicon_number", "AA_directory", "AA_summary_file", "All_genes", "CNV_BED_file", "Captured_interval_length", "Classification", "Complexity_score", "Feature_BED_file", "Feature_ID", "Feature_maximum_copy_number", "Feature_median_copy_number", "Filter_flag", "Location", "Oncogenes", "Reference_version", "Run_metadata_JSON", "Sample_metadata_JSON", "Sample_name", "Sample_type", "Tissue_of_origin", "cnvkit_directory", "extra_metadata_from_csv"]
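To make the nesting concrete, here is a minimal sketch of walking the "runs" field in plain Python. The project dictionary is made-up illustration data showing only a few of the fields, and `iter_features` is a hypothetical helper, not part of the codebase. The sketch assumes each sample maps to a list of feature dictionaries; check this against an actual record, since the stored layout may differ.

```python
# Hypothetical project record: sample names and values are made up,
# and only a few of the feature fields are included.
project = {
    "runs": {
        "sample_1": [
            {"Sample_name": "sample_1", "Classification": "ecDNA",
             "Feature_maximum_copy_number": 42.5},
            {"Sample_name": "sample_1", "Classification": "BFB",
             "Feature_maximum_copy_number": 8.1},
        ],
        "sample_2": [
            {"Sample_name": "sample_2", "Classification": "ecDNA",
             "Feature_maximum_copy_number": 15.0},
        ],
    }
}


def iter_features(project):
    """Yield (sample_key, feature_dict) pairs from a project's "runs" field."""
    for sample_key, features in project.get("runs", {}).items():
        for feature in features:
            yield sample_key, feature


# Count how many features each sample has.
feature_counts = {}
for sample_key, feature in iter_features(project):
    feature_counts[sample_key] = feature_counts.get(sample_key, 0) + 1

print(feature_counts)
```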
  • To query and filter on some condition, you can either write a PyMongo query, or fetch the project by its ID first and then filter with pandas.
    • For example:
    1. Call "get_one_project" to get a project via its project ID. The result should be a Python dictionary.
    2. From the project dictionary, read the "runs" field, which holds the features for each sample.
    3. Call replace_space_to_underscore to obtain a features list, then wrap it using pd.DataFrame.
    • Code example: image
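The steps above can be sketched roughly as follows. This is not the repository's code: `get_one_project` and `replace_space_to_underscore` here are simplified stand-ins written for illustration, and the stand-in for `get_one_project` returns an in-memory dictionary instead of querying MongoDB. The project ID and feature values are made up, and the sketch assumes each sample in "runs" maps to a list of feature dictionaries whose stored field names may contain spaces.

```python
import pandas as pd

# Made-up record standing in for a MongoDB document (assumption for illustration).
_FAKE_DB = {
    "example_project_id": {
        "project_name": "example",
        "runs": {
            "sample_1": [
                {"Sample name": "sample_1", "Classification": "ecDNA",
                 "Feature maximum copy number": 42.5},
                {"Sample name": "sample_1", "Classification": "BFB",
                 "Feature maximum copy number": 8.1},
            ],
            "sample_2": [
                {"Sample name": "sample_2", "Classification": "ecDNA",
                 "Feature maximum copy number": 15.0},
            ],
        },
    }
}


def get_one_project(project_id):
    """Stand-in: the real function fetches the project from MongoDB."""
    return _FAKE_DB[project_id]


def replace_space_to_underscore(runs):
    """Stand-in: flatten "runs" into one list of feature dictionaries,
    replacing spaces in field names with underscores."""
    features = []
    for sample_features in runs.values():
        for feature in sample_features:
            features.append(
                {key.replace(" ", "_"): value for key, value in feature.items()}
            )
    return features


# Step 1: get the project dictionary by ID.
project = get_one_project("example_project_id")
# Steps 2 and 3: read "runs", flatten it, and wrap the list in a DataFrame.
features = replace_space_to_underscore(project["runs"])
df = pd.DataFrame(features)

# Example filter: ecDNA features with a maximum copy number above 20.
high_cn_ecdna = df[
    (df["Classification"] == "ecDNA")
    & (df["Feature_maximum_copy_number"] > 20)
]
print(high_cn_ecdna[["Sample_name", "Feature_maximum_copy_number"]])
```

Once the features are in a DataFrame, one row per feature, any pandas filtering, grouping, or sorting can be applied without writing further database queries.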

Notes:

  • DB as of 11/20/24
  • "runs" is a key in each project; its value is a dictionary.
    • The keys are: [sample_1, sample_2, ... sample_n]
    • Each sample holds the feature fields listed above for the "runs" field.
  • "sample_data" is a dictionary in each project.