Commit 2bbb60d

clarification and accuracy adjustments in refvalidator readme
1 parent f40682d commit 2bbb60d

1 file changed: +9 -17 lines changed

bedboss/refgenome_validator/README.md

Lines changed: 9 additions & 17 deletions
@@ -1,16 +1,12 @@
 # Reference Genome Predictor and Validator
 
 ----
-### Background
-Original Lab Blog Post: https://lab.databio.org/docs/guide/Reference-genome-predictor-tool/
-
 
 ## Research Project Outline
 
 #### Question
-Original: Can we give a BED file as an input and predict the most likely reference genome assembly associated with the BED file?
 
-Modified: Can we give a BED file as an input and then determine a level of compatibility for different reference genome assemblies?
+Can we give a BED file as an input and then determine a level of compatibility for different reference genome assemblies?
 
 
 #### High Level Execution
@@ -24,13 +20,11 @@ Modified: Can we give a BED file as an input and then determine a level of compa
 - How many chrom names in BED file that are _not_ in the size file?
 
 3. Compare regions in bed files with excluded ranges/black list regions for each of reference genome.
-
-- Assumption: during bed file creation, excluded ranges were filtered out. Therefore, if a bed file has more regions in excluded ranges from hg19 vs less from hg38, then the bed file is more likely associated with hg38.
 - Two Tiers:
-1. Gaps/Centromeres/Telomeres
-2. All other Excluded Ranges
+1. Gaps/Centromeres/Telomeres (can be used to assign tiers for compatibility)
+2. All other Excluded Ranges (informational only)
 
-4. Once the above are quantified, we can exclude some reference genomes altogether and give probability of compatibility with the remaining.
+4. Once the above are quantified, we can give probability of compatibility with reference genomes.
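
The chromosome-name check in step 2 can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's implementation: `parse_bed`, `parse_chrom_sizes`, and `chrom_stats` are hypothetical helpers, the size file is assumed to be a two-column name/length file, and the out-of-bounds count is an assumed way to measure size compatibility.

```python
from collections import namedtuple

Region = namedtuple("Region", ["chrom", "start", "end"])


def parse_bed(text):
    """Parse the first three BED columns, skipping header and blank lines."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append(Region(chrom, int(start), int(end)))
    return regions


def parse_chrom_sizes(text):
    """Parse a two-column chrom.sizes file into {name: length}."""
    sizes = {}
    for line in text.splitlines():
        if line.strip():
            name, length = line.split()[:2]
            sizes[name] = int(length)
    return sizes


def chrom_stats(regions, sizes):
    """Count BED chromosome names absent from the size file and
    regions that run past the end of a known chromosome."""
    bed_names = {r.chrom for r in regions}
    missing = bed_names - set(sizes)
    out_of_bounds = sum(
        1 for r in regions if r.chrom in sizes and r.end > sizes[r.chrom]
    )
    return {
        "names_in_bed": len(bed_names),
        "names_not_in_sizes": len(missing),
        "regions_out_of_bounds": out_of_bounds,
    }
```

Running `chrom_stats` once per candidate assembly gives one small dictionary per genome, which can then feed the tier assignment.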
 
 
 #### Detailed Execution Steps
@@ -39,25 +33,23 @@ Modified: Can we give a BED file as an input and then determine a level of compa
 2. Cache relevant BED files which contain excluded ranges using [BBClient](https://docs.bedbase.org/geniml/tutorials/bbclient/)
 3. Build database for each refgenome assembly excluded ranges, gaps centromeres telomeres using [IGD](https://github.com/databio/gtars/tree/dev_igd) (Use rust implementation if finished, else use C++ implementation).
 4. Query "unknown" BED File against chrom size files and the IGDs (using `igd search`).
-5. Obtain overlap stats for each of the IGDs, rank them based on _least overlaps_.
+5. Obtain overlap stats for each of the IGDs
 6. Run on BED files whose ref genomes are _known_ and calculate accuracy of highest probability compatible ref genome.
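
To make the overlap statistics in steps 4 and 5 concrete, here is a small pure-Python stand-in. It does not call IGD or `igd search`; it swaps in a sorted-interval lookup with `bisect` so the idea can be run without any external tool. The function names and the `(chrom, start, end)` tuple layout are assumptions for illustration, and coordinates are treated as half-open, as in BED.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(excluded):
    """Group (chrom, start, end) intervals by chrom and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in excluded:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def count_overlapping(regions, index):
    """Count query regions that overlap at least one indexed interval."""
    hits = 0
    for chrom, start, end in regions:
        merged = index.get(chrom, [])
        # Last merged interval that starts before the query ends.
        i = bisect_right(merged, (end - 1, float("inf"))) - 1
        if i >= 0 and merged[i][1] > start:
            hits += 1
    return hits


def overlap_stats(regions, excluded_by_genome):
    """Map each genome name to the fraction of query regions that
    fall in that genome's excluded ranges."""
    total = len(regions) or 1
    return {
        genome: count_overlapping(regions, build_index(excluded)) / total
        for genome, excluded in excluded_by_genome.items()
    }
```

Because the merged intervals are sorted and non-overlapping, each query needs only one binary search. A real IGD database serves the same purpose at much larger scale.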
 
 
 #### Additional Notes
 - Begin with human bed files and reference genomes for now.
-- predicting between hg38 and hg19 should be the first (simple) task
-- Prediction happens at a high level for mutually exclusive options,e.g. predicting hg38 vs hg19 as the coordinate systems are different
-- this mutual exclusivity would also apply to different types of hg38 (UCSC, ensembl, ncbi).
-- Validation occurs at the next level after basic prediction is done.
+- Compatibility assessed via:
 - different/levels of tiers based on cutoffs wrt specificity and sensitivity
 - These tiers are based on a variety of parameters of increasing complexity:
 - name matching of chromosomes
-- overlaps (chr.sizes, excluded ranges)
+- size overlaps (chr.sizes)
+- overlaps with centromeres/telomeres
 - ML (BED embeddings)
 - Bed annotations (text similarity)
 - Future work could involve machine learning for ref genome prediction. However, we will begin with a simple classifier (which may be sufficient).
 - Future work could add annotations/metadata for making the prediction.
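
One way the cutoff-based tiers could work is sketched below. Everything here is hypothetical: the three input fractions, the cutoff values, and the tier numbering (1 is most compatible) are placeholders rather than values from the README. Real cutoffs would need tuning for sensitivity and specificity on BED files whose genomes are known.

```python
# Hypothetical cutoffs: (min name match, max out-of-bounds, max gap overlap).
# Placeholder values for illustration only.
TIER_CUTOFFS = [
    (1.00, 0.00, 0.00),  # tier 1: all names match, nothing out of place
    (0.90, 0.05, 0.05),  # tier 2
    (0.50, 0.20, 0.20),  # tier 3
]


def assign_tier(name_match, out_of_bounds, gap_overlap, cutoffs=TIER_CUTOFFS):
    """Return the first (best) tier whose cutoffs are all satisfied.

    All inputs are fractions in [0, 1]. When no row matches, the result is
    len(cutoffs) + 1, the least compatible tier.
    """
    for tier, (min_names, max_oob, max_gap) in enumerate(cutoffs, start=1):
        if (
            name_match >= min_names
            and out_of_bounds <= max_oob
            and gap_overlap <= max_gap
        ):
            return tier
    return len(cutoffs) + 1
```

Keeping the cutoffs in a table rather than in nested `if` statements makes it easy to re-tune them once accuracy on known BED files has been measured.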
 
 #### Software Notes
 
-- Create a validator class that can ingest a BED file as well as a genome_model object
+- Create a validator class that can ingest a BED file as well as a GenomeModel object
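
A rough shape for such a class is sketched below. The `GenomeModel` here is a hypothetical stand-in with assumed fields (`genome_alias`, `chrom_sizes`); the project's actual model may look different. The `Validator` name, its `validate` method, and the returned keys are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class GenomeModel:
    """Hypothetical stand-in: an alias plus a {chrom: length} mapping."""

    genome_alias: str
    chrom_sizes: dict


class Validator:
    """Sketch of a validator that holds one genome model."""

    def __init__(self, genome_model):
        self.genome_model = genome_model

    def validate(self, bed_lines):
        """Summarize how BED lines (any iterable of str) fit the genome."""
        sizes = self.genome_model.chrom_sizes
        names, regions, out_of_bounds = set(), 0, 0
        for line in bed_lines:
            fields = line.split()
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            chrom, end = fields[0], int(fields[2])
            names.add(chrom)
            regions += 1
            if chrom in sizes and end > sizes[chrom]:
                out_of_bounds += 1
        return {
            "genome": self.genome_model.genome_alias,
            "regions": regions,
            "names_not_in_genome": len(names - set(sizes)),
            "regions_out_of_bounds": out_of_bounds,
        }
```

Since `validate` accepts any iterable of lines, an open file handle can be passed directly, for example `with open(path) as fh: Validator(model).validate(fh)`.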
