Original Lab Blog Post: https://lab.databio.org/docs/guide/Reference-genome-predictor-tool/
## Research Project Outline
#### Question
-Original: Can we give a BED file as an input and predict the most likely reference genome assembly associated with the BED file?
-Modified: Can we give a BED file as an input and then determine a level of compatibility for different reference genome assemblies?
+Can we give a BED file as an input and then determine a level of compatibility for different reference genome assemblies?
#### High Level Execution
@@ -24,13 +20,11 @@ Modified: Can we give a BED file as an input and then determine a level of compa
- How many chrom names in BED file that are _not_ in the size file?
3. Compare regions in bed files with excluded ranges/black list regions for each of reference genome.
-- Assumption: during bed file creation, excluded ranges were filtered out. Therefore, if a bed file has more regions in excluded ranges from hg19 vs less from hg38, then the bed file is more likely associated with hg38.
- Two Tiers:
-1. Gaps/Centromeres/Telomeres
-2. All other Excluded Ranges
+1. Gaps/Centromeres/Telomeres (can be used to assign tiers for compatibility)
+2. All other Excluded Ranges (informational only)

-4. Once the above are quantified, we can exclude some reference genomes altogether and give probability of compatibility with the remaining.
+4. Once the above are quantified, we can give probability of compatibility with reference genomes.
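
The chromosome-name and chrom-size comparisons in the steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `parse_bed`, `chrom_stats`, and the returned keys are hypothetical names, and counting regions that run past the chromosome end is only one possible reading of the size check.

```python
from typing import Dict, List, Tuple

# (chrom, start, end), 0-based half-open coordinates as in BED
Region = Tuple[str, int, int]


def parse_bed(text: str) -> List[Region]:
    """Parse the first three columns of BED-formatted text."""
    regions = []
    for line in text.splitlines():
        # Skip blank lines, comments, and track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def chrom_stats(regions: List[Region], chrom_sizes: Dict[str, int]) -> Dict[str, int]:
    """Compare BED chromosome names and coordinates with one chrom.sizes table."""
    bed_chroms = {chrom for chrom, _, _ in regions}
    size_chroms = set(chrom_sizes)
    # Regions on a known chromosome whose end exceeds that chromosome's length.
    out_of_bounds = sum(
        1 for chrom, _, end in regions
        if chrom in chrom_sizes and end > chrom_sizes[chrom]
    )
    return {
        "in_both": len(bed_chroms & size_chroms),
        "bed_only": len(bed_chroms - size_chroms),    # names not in the size file
        "size_only": len(size_chroms - bed_chroms),
        "regions_out_of_bounds": out_of_bounds,
    }


# Toy sizes for illustration only; these are not real assembly lengths.
toy_sizes = {"chr1": 1000, "chr2": 500}
toy_bed = "chr1\t0\t100\nchr1\t900\t1100\nchrUn\t5\t10\n"
stats = chrom_stats(parse_bed(toy_bed), toy_sizes)
```

Running the same function once per candidate assembly would give one set of counts per reference genome to compare.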
#### Detailed Execution Steps
@@ -39,25 +33,23 @@ Modified: Can we give a BED file as an input and then determine a level of compa
2. Cache relevant BED files which contain excluded ranges using [BBClient](https://docs.bedbase.org/geniml/tutorials/bbclient/)
3. Build database for each refgenome assembly excluded ranges, gaps centromeres telomeres using [IGD](https://github.com/databio/gtars/tree/dev_igd) (Use rust implementation if finished, else use C++ implementation).
4. Query "unknown" BED File against chrom size files and the IGDs (using `igd search`).
-5. Obtain overlap stats for each of the IGDs, rank them based on _least overlaps_.
+5. Obtain overlap stats for each of the IGDs
6. Run on BED files whose ref genomes are _known_ and calculate accuracy of highest probability compatible ref genome.
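
To make the overlap statistics in step 5 concrete, here is a naive pure-Python stand-in. It does not use IGD or `igd search`; it only shows what is being counted. The function names and the `excluded_by_genome` mapping are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open


def index_by_chrom(regions: List[Region]) -> Dict[str, List[Tuple[int, int]]]:
    """Group intervals by chromosome so each query only scans its own chromosome."""
    index = defaultdict(list)
    for chrom, start, end in regions:
        index[chrom].append((start, end))
    return index


def count_overlapping(query: List[Region], excluded: List[Region]) -> int:
    """Count query regions that overlap at least one excluded range."""
    index = index_by_chrom(excluded)
    hits = 0
    for chrom, start, end in query:
        # Half-open intervals overlap when each one starts before the other ends.
        if any(start < ex_end and ex_start < end
               for ex_start, ex_end in index.get(chrom, [])):
            hits += 1
    return hits


def overlap_stats(query: List[Region],
                  excluded_by_genome: Dict[str, List[Region]]) -> Dict[str, Dict[str, float]]:
    """Overlap count and fraction of the query file, per candidate genome."""
    total = len(query)
    result = {}
    for genome, excluded in excluded_by_genome.items():
        n = count_overlapping(query, excluded)
        result[genome] = {"overlapping": n, "fraction": n / total if total else 0.0}
    return result


# Toy data; "genomeA" and "genomeB" are placeholders, not real assemblies.
query = [("chr1", 0, 100), ("chr1", 500, 600), ("chr2", 10, 20)]
excluded_by_genome = {
    "genomeA": [("chr1", 50, 60)],
    "genomeB": [("chr1", 100, 200), ("chr3", 0, 5)],
}
stats = overlap_stats(query, excluded_by_genome)
```

The linear scan per query region is fine for a toy example; an interval index such as IGD exists precisely to avoid that cost on real excluded-range sets.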
#### Additional Notes
- Begin with human bed files and reference genomes for now.
-- predicting between hg38 and hg19 should be the first (simple) task
-- Prediction happens at a high level for mutually exclusive options,e.g. predicting hg38 vs hg19 as the coordinate systems are different
-- this mutual exclusivity would also apply to different types of hg38 (UCSC, ensembl, ncbi).
-- Validation occurs at the next level after basic prediction is done.
+- Compatibility assessed via:
- different/levels of tiers based on cutoffs wrt specificity and sensitivity
- These tiers are based on a variety of parameters of increasing complexity:
- name matching of chromosomes
-- overlaps (chr.sizes, excluded ranges)
+- size overlaps (chr.sizes)
+- overlaps with centromeres/telomeres
- ML (BED embeddings)
- Bed annotations (text similarity)
- Future work could involve machine learning for ref genome prediction. However, we will begin with a simple classifier (which may be sufficient).
- Future work could add annotations/metadata for making the prediction.
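
One way to picture cutoff-based tiers is a simple rule that counts how many checks a BED file fails against a given genome. Everything in this sketch is hypothetical: the `Cutoffs` fields, their default values, and the "tier 1 is best" convention are placeholders, and real cutoffs would have to be tuned for sensitivity and specificity on BED files with known reference genomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    # Placeholder thresholds (fractions in [0, 1]); not values from the project.
    max_unmatched_names: float = 0.0
    max_out_of_bounds: float = 0.0
    max_gap_overlap: float = 0.01


def assign_tier(unmatched_names: float,
                out_of_bounds: float,
                gap_overlap: float,
                cutoffs: Cutoffs = Cutoffs()) -> int:
    """Return a tier from 1 (passes every check) to 4 (fails all three checks).

    Inputs are fractions: chromosome names missing from the size file, regions
    extending past chromosome ends, and regions overlapping
    gaps/centromeres/telomeres.
    """
    failures = sum([
        unmatched_names > cutoffs.max_unmatched_names,
        out_of_bounds > cutoffs.max_out_of_bounds,
        gap_overlap > cutoffs.max_gap_overlap,
    ])
    return 1 + failures
```

For example, `assign_tier(0.0, 0.0, 0.0)` falls in tier 1 under these placeholder cutoffs, while a file failing all three checks lands in tier 4. A weighted score or a trained classifier could replace the failure count without changing the interface.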
#### Software Notes
-- Create a validator class that can ingest a BED file as well as a genome_model object
+- Create a validator class that can ingest a BED file as well as a GenomeModel object
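
A possible shape for such a validator is sketched below. The notes do not specify the fields of `GenomeModel` or the validator's interface, so the attributes, the `validate` method, and the report keys here are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class GenomeModel:
    """Hypothetical container; the real GenomeModel fields are not given in the notes."""
    name: str
    chrom_sizes: Dict[str, int]


class Validator:
    """Checks the lines of a BED file against one GenomeModel."""

    def __init__(self, genome_model: GenomeModel):
        self.genome_model = genome_model

    def validate(self, bed_lines: Iterable[str]) -> Dict[str, int]:
        """Count regions, unknown chromosomes, and regions past chromosome ends."""
        regions = unknown_chrom = out_of_bounds = 0
        for line in bed_lines:
            fields = line.split()
            # Skip headers, comments, and lines without chrom/start/end.
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            regions += 1
            size = self.genome_model.chrom_sizes.get(fields[0])
            if size is None:
                unknown_chrom += 1
            elif int(fields[2]) > size:
                out_of_bounds += 1
        return {
            "regions": regions,
            "unknown_chrom": unknown_chrom,
            "out_of_bounds": out_of_bounds,
        }


# Toy model; the name and size are placeholders, not a real assembly.
model = GenomeModel(name="toy", chrom_sizes={"chr1": 1000})
report = Validator(model).validate(
    ["chr1\t0\t100\n", "chr1\t900\t1100\n", "chrUn\t0\t5\n"]
)
```

Because `validate` takes any iterable of lines, an open file handle works as well as a list, for example `Validator(model).validate(open(path))`.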