| 1 | +--- |
| 2 | +jupytext: |
| 3 | + formats: md:myst |
| 4 | + text_representation: |
| 5 | + extension: .md |
| 6 | + format_name: myst |
| 7 | + format_version: 0.13 |
| 8 | + jupytext_version: 1.11.3 |
| 9 | +kernelspec: |
| 10 | + display_name: Python 3 |
| 11 | + language: python |
| 12 | + name: python3 |
| 13 | +--- |
| 14 | + |
| 15 | +# Bioframe for bedtools users |
| 16 | + |
| 17 | + |
| 18 | +Bioframe is built around the analysis of genomic intervals as a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in memory, rather than working with tab-delimited text files saved on disk. |
| 19 | + |
| 20 | +Bioframe can read several standard genomics text file formats, including BED files (see [schemas](https://github.com/open2c/bioframe/blob/main/bioframe/io/schemas.py)), via [`read_table`](https://bioframe.readthedocs.io/en/latest/api-fileops.html#bioframe.io.fileops.read_table), which loads them as pandas DataFrames. A complete list of helper functions is [available here](API_fileops). |
| 21 | + |
| 22 | +Any DataFrame object with `'chrom'`, `'start'`, and `'end'` columns will support the genomic [interval operations in bioframe](API_ops). The names of these columns can also be customized via the `cols=` arguments in bioframe functions. |
| 23 | + |
| 24 | +For example, GTF files do not need to be converted to BED files: you can read them directly into pandas (e.g. with [gtfparse](https://github.com/openvax/gtfparse/tree/master)). For GTFs, it is often convenient to rename the `'seqname'` column to `'chrom'`, the default column name used in bioframe. |
| 25 | + |
| 26 | +Finally, if needed, bioframe provides a convenience function to write dataframes to a standard BED file using [`to_bed`](https://bioframe.readthedocs.io/en/latest/api-fileops.html#bioframe.io.bed.to_bed). |
| 27 | + |
| 28 | + |
| 29 | +## `bedtools intersect` |
| 30 | + |
| 31 | +### Original unique entries from the first bed `-u` |
| 32 | + |
| 33 | +```sh |
| 34 | +bedtools intersect -u -a A.bed -b B.bed > out.bed |
| 35 | +``` |
| 36 | + |
| 37 | +```py |
| 38 | +overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True) |
| 39 | +out = A.loc[overlap['index_1'].unique()] |
| 40 | +``` |
| 41 | + |
| 42 | +### Report the number of hits in B `-c` |
| 43 | + |
| 44 | +Reports 0 for A entries that have no overlap with B. |
| 45 | + |
| 46 | +```sh |
| 47 | +bedtools intersect -c -a A.bed -b B.bed > out.bed |
| 48 | +``` |
| 49 | + |
| 50 | +```py |
| 51 | +out = bf.count_overlaps(A, B) |
| 52 | +``` |
| 53 | + |
| 54 | +### Original entries from the first bed for each overlap `-wa` |
| 55 | + |
| 56 | +```sh |
| 57 | +bedtools intersect -wa -a A.bed -b B.bed > out.bed |
| 58 | +``` |
| 59 | + |
| 60 | +```py |
| 61 | +overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True) |
| 62 | +out = A.loc[overlap['index_1']] |
| 63 | + |
| 64 | +# Alternatively |
| 65 | +out = bf.overlap(A, B, how='inner')[A.columns] |
| 66 | +``` |
| 67 | + |
| 68 | +> **Note:** This gives one row per overlap and can contain duplicates. The output dataframe of the former method will use the same pandas index as the input dataframe `A`, while the latter result --- the join output --- will have an integer range index, like a pandas merge. |
| 69 | +
|
| 70 | +### Original entries from the second bed `-wb` |
| 71 | + |
| 72 | +```sh |
| 73 | +bedtools intersect -wb -a A.bed -b B.bed > out.bed |
| 74 | +``` |
| 75 | + |
| 76 | +```py |
| 77 | +overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True) |
| 78 | +out = B.loc[overlap['index_2']] |
| 79 | + |
| 80 | +# Alternatively |
| 81 | +out = bf.overlap(A, B, how='inner', suffixes=("_", ""))[B.columns] |
| 82 | +``` |
| 83 | + |
| 84 | +> **Note:** This gives one row per overlap and can contain duplicates. The output dataframe of the former method will use the same pandas index as the input dataframe `B`, while the latter result --- the join output --- will have an integer range index, like a pandas merge. |
| 85 | +
|
| 86 | +### Intersect with multiple beds |
| 87 | + |
| 88 | +```sh |
| 88 | +bedtools intersect -wa -a A.bed -b B.bed C.bed D.bed > out.bed |
| 90 | +``` |
| 91 | + |
| 92 | +```py |
| 93 | +others = pd.concat([B, C, D]) |
| 94 | +overlap = bf.overlap(A, others, how='inner', suffixes=('_1','_2'), return_index=True) |
| 95 | +out = A.loc[overlap['index_1']] |
| 96 | +``` |
| 97 | + |
| 98 | +### Keep no overlap `-v` |
| 99 | + |
| 100 | +```sh |
| 101 | +bedtools intersect -wa -a A.bed -b B.bed -v > out.bed |
| 102 | +``` |
| 103 | + |
| 104 | +```py |
| 105 | +out = bf.setdiff(A, B) |
| 106 | +``` |
| 107 | + |
| 108 | +### Force strandedness `-s` |
| 109 | + |
| 110 | +For intersection |
| 111 | + |
| 112 | +```sh |
| 113 | +bedtools intersect -wa -a A.bed -b B.bed -s > out.bed |
| 114 | +``` |
| 115 | + |
| 116 | +```py |
| 117 | +overlap = bf.overlap(A, B, on=['strand'], suffixes=('_1','_2'), return_index=True, how='inner') |
| 118 | +out = A.loc[overlap['index_1']] |
| 119 | +``` |
| 120 | + |
| 121 | +For non-intersection `-v` |
| 122 | + |
| 123 | +```sh |
| 124 | +bedtools intersect -wa -a A.bed -b B.bed -v -s > out.bed |
| 125 | +``` |
| 126 | + |
| 127 | +```py |
| 128 | +out = bf.setdiff(A, B, on=['strand']) |
| 129 | +``` |
| 130 | + |
| 131 | +### Minimum overlap as a fraction of A `-f` |
| 132 | + |
| 132 | +Keep the rows of A that are at least 70% covered by elements of B. |
| 134 | + |
| 135 | +```sh |
| 136 | +bedtools intersect -wa -a A.bed -b B.bed -f 0.7 > out.bed |
| 137 | +``` |
| 138 | + |
| 139 | +```py |
| 140 | +cov = bf.coverage(A, B) |
| 140 | +out = A.loc[(cov['coverage'] / (cov['end'] - cov['start'])) >= 0.70] |
| 142 | + |
| 143 | +# Alternatively |
| 144 | +out = bf.coverage(A, B).query('coverage / (end - start) >= 0.7')[A.columns] |
| 145 | +``` |
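
To make the coverage-fraction filter above concrete, here is a minimal plain-Python sketch of the same idea. It is not bioframe's implementation. It assumes all intervals lie on a single chromosome, use half-open `[start, end)` coordinates as in BED, and have positive length. The helper names `merge_intervals` and `covered_fraction` are hypothetical and exist only for this illustration. Intervals in B are merged first so that each base pair of an A interval is counted at most once.

```python
from typing import List, Tuple

# Half-open [start, end) interval on a single chromosome (assumption of this sketch).
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals into a sorted list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(a: Interval, b_intervals: List[Interval]) -> float:
    """Fraction of base pairs in `a` covered by the union of `b_intervals`."""
    a_start, a_end = a
    covered = 0
    for b_start, b_end in merge_intervals(b_intervals):
        # Length of the intersection, or 0 if the intervals do not overlap.
        covered += max(0, min(a_end, b_end) - max(a_start, b_start))
    return covered / (a_end - a_start)


# Toy data: the first A interval is 80% covered, the second only 10%.
A = [(0, 100), (200, 300)]
B = [(0, 50), (40, 80), (290, 400)]

# Keep the A intervals that are at least 70% covered by B.
kept = [a for a in A if covered_fraction(a, B) >= 0.7]
```

With this toy input, only the first interval of `A` passes the 70% threshold, because the two overlapping B intervals `(0, 50)` and `(40, 80)` together cover 80 of its 100 base pairs.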