Commit 6c5c115

gamazeps, Felix Raimundo, nvictus, and pre-commit-ci[bot] authored
docs: Add bedtools intersect conversions (#198)
* initial commit for bedtools
* add entries from second bed
* add bedtools flags
* typo in tools name
* use setdiff
* change page name
* add -f and -s
* change title + mistake
* Update docs/guide-bedtools.md
* add
* add more file formats and -c
* add to_bed
* Add alternative implementations
* forgot the inner
* Remove intidxs and describe differences in indexes in output
* Update docs/guide-bedtools.md
* Notes about indexes
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

---------

Co-authored-by: Felix Raimundo <felix.raimundo@3umassmed.edu>
Co-authored-by: Nezar Abdennur <nabdennur@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent d8788b1 commit 6c5c115

3 files changed: +150 -0 lines changed

docs/api-fileops.rst

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
+.. _API_fileops:
+
 File I/O
 ========

 .. automodule:: bioframe.io.fileops
    :autosummary:
    :members:
+
+.. autofunction:: bioframe.io.bed.to_bed

docs/guide-bedtools.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.11.3
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Bioframe for bedtools users

Bioframe is built around the analysis of genomic intervals as a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in memory, rather than working with tab-delimited text files saved on disk.

Bioframe supports reading a number of standard genomics text file formats via [`read_table`](https://bioframe.readthedocs.io/en/latest/api-fileops.html#bioframe.io.fileops.read_table), including BED files (see [schemas](https://github.com/open2c/bioframe/blob/main/bioframe/io/schemas.py)), and loads them as pandas DataFrames. A complete list of helper functions is [available here](API_fileops).

Any DataFrame object with `'chrom'`, `'start'`, and `'end'` columns will support the genomic [interval operations in bioframe](API_ops). The names of these columns can also be customized via the `cols=` arguments in bioframe functions.
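
As a rough plain-Python sketch of this data model (not bioframe's implementation), each row can be viewed as a record with a chromosome, a start, and an end. Two rows overlap when they sit on the same chromosome and their ranges share at least one base. The sketch assumes BED-style half-open coordinates. The `overlaps` helper, its `cols_a`/`cols_b` parameters, and the `contig`/`left`/`right` names are hypothetical and only mimic the idea behind `cols=`.

```python
DEFAULT_COLS = ("chrom", "start", "end")


def overlaps(a, b, cols_a=DEFAULT_COLS, cols_b=DEFAULT_COLS):
    """Return True if two interval records share at least one base.

    Coordinates are assumed to be half-open: [start, end).
    """
    chrom_a, start_a, end_a = (a[c] for c in cols_a)
    chrom_b, start_b, end_b = (b[c] for c in cols_b)
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


# A record that uses non-default column names (hypothetical names).
gene = {"contig": "chr1", "left": 100, "right": 200}
peak = {"chrom": "chr1", "start": 150, "end": 250}

# Tell the helper which keys hold chromosome, start, and end for `gene`.
hit = overlaps(gene, peak, cols_a=("contig", "left", "right"))

# Under half-open coordinates, intervals that merely touch do not overlap.
adjacent = overlaps(
    {"chrom": "chr1", "start": 0, "end": 100},
    {"chrom": "chr1", "start": 100, "end": 120},
)
```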

For example, GTF files do not need to be converted to BED files: you can read them directly into pandas (e.g. with [gtfparse](https://github.com/openvax/gtfparse/tree/master)). For GTFs, it is often convenient to rename the `'seqname'` column to `'chrom'`, the default column name used in bioframe.

Finally, if needed, bioframe provides a convenience function to write DataFrames to a standard BED file using [`to_bed`](https://bioframe.readthedocs.io/en/latest/api-fileops.html#bioframe.io.bed.to_bed).


## `bedtools intersect`

### Original unique entries from the first bed `-u`

```sh
bedtools intersect -u -a A.bed -b B.bed > out.bed
```

```py
import bioframe as bf

overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True)
out = A.loc[overlap['index_1'].unique()]
```
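
To see why the `.unique()` step matters, here is a small plain-Python sketch of the same logic on toy data. It is not how bioframe computes overlaps: intervals are plain `(chrom, start, end)` tuples, coordinates are assumed to be half-open, and the `overlaps` helper is hypothetical.

```python
def overlaps(a, b):
    """Half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


A = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
B = [("chr1", 50, 60), ("chr1", 90, 250), ("chr2", 100, 200)]

# One (index into A, index into B) pair per overlap, like an inner join.
pairs = [(i, j) for i, a in enumerate(A) for j, b in enumerate(B) if overlaps(a, b)]
index_1 = [i for i, _ in pairs]  # the first A interval appears twice

# Deduplicate while preserving order, so each A interval is reported once.
unique_index = list(dict.fromkeys(index_1))
out = [A[i] for i in unique_index]
```

The first interval of `A` overlaps two intervals of `B`, so it appears twice in `index_1` but only once in `out`.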

### Report the number of hits in B `-c`

Reports 0 for A entries that have no overlap with B.

```sh
bedtools intersect -c -a A.bed -b B.bed > out.bed
```

```py
out = bf.count_overlaps(A, B)
```
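
A plain-Python sketch of what the count means, on toy data: every A interval is kept and gets the number of B intervals it overlaps, including zero. This is only an illustration under assumed half-open coordinates, with a hypothetical `overlaps` helper, not bioframe's implementation.

```python
def overlaps(a, b):
    """Half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


A = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
B = [("chr1", 50, 60), ("chr1", 90, 250), ("chr2", 100, 200)]

# Count hits in B for every A interval; intervals without hits get 0.
counts = [sum(overlaps(a, b) for b in B) for a in A]

# Append the count as an extra field, keeping all rows of A.
out = [a + (n,) for a, n in zip(A, counts)]
```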

### Original entries from the first bed for each overlap `-wa`

```sh
bedtools intersect -wa -a A.bed -b B.bed > out.bed
```

```py
overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True)
out = A.loc[overlap['index_1']]

# Alternatively
out = bf.overlap(A, B, how='inner')[A.columns]
```

> **Note:** This gives one row per overlap and can contain duplicates. The output of the first method keeps the pandas index of the input dataframe `A`, while the second (the join output) has an integer range index, like a pandas merge.

### Original entries from the second bed `-wb`

```sh
bedtools intersect -wb -a A.bed -b B.bed > out.bed
```

```py
overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True)
out = B.loc[overlap['index_2']]

# Alternatively
out = bf.overlap(A, B, how='inner', suffixes=("_", ""))[B.columns]
```

> **Note:** This gives one row per overlap and can contain duplicates. The output of the first method keeps the pandas index of the input dataframe `B`, while the second (the join output) has an integer range index, like a pandas merge.

### Intersect with multiple beds

```sh
bedtools intersect -wa -a A.bed -b B.bed C.bed D.bed > out.bed
```

```py
import pandas as pd

others = pd.concat([B, C, D])
overlap = bf.overlap(A, others, how='inner', suffixes=('_1','_2'), return_index=True)
out = A.loc[overlap['index_1']]
```

### Keep no overlap `-v`

```sh
bedtools intersect -wa -a A.bed -b B.bed -v > out.bed
```

```py
out = bf.setdiff(A, B)
```
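
As a plain-Python sketch on toy data, `-v` keeps exactly those A intervals that overlap nothing in B, so it is the complement of the intervals that have at least one hit. The `overlaps` helper is hypothetical and assumes half-open coordinates. This illustrates the expected result, not bioframe's implementation.

```python
def overlaps(a, b):
    """Half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


A = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
B = [("chr1", 50, 60), ("chr1", 90, 250), ("chr2", 100, 200)]

# Keep A intervals with no overlapping B interval at all.
out = [a for a in A if not any(overlaps(a, b) for b in B)]

# The intervals with at least one hit are the remaining rows of A.
with_hits = [a for a in A if any(overlaps(a, b) for b in B)]
```

Here the only interval kept is the one on `chr2`: a B interval on the same chromosome exists, but the two ranges do not intersect.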

### Force strandedness `-s`

For intersection:

```sh
bedtools intersect -wa -a A.bed -b B.bed -s > out.bed
```

```py
overlap = bf.overlap(A, B, on=['strand'], suffixes=('_1','_2'), return_index=True, how='inner')
out = A.loc[overlap['index_1']]
```

For non-intersection `-v`:

```sh
bedtools intersect -wa -a A.bed -b B.bed -v -s > out.bed
```

```py
out = bf.setdiff(A, B, on=['strand'])
```
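
The effect of requiring matching strands can be sketched in plain Python on toy data: a hit only counts when the two intervals overlap and carry the same strand value. The tuple layout `(chrom, start, end, strand)` and both helpers are hypothetical, coordinates are assumed to be half-open, and this is not bioframe's implementation.

```python
def overlaps(a, b):
    """Half-open intervals share at least one base, ignoring strand."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlaps_same_strand(a, b):
    """Like overlaps(), but the strand field (index 3) must also match."""
    return overlaps(a, b) and a[3] == b[3]


A = [("chr1", 0, 100, "+"), ("chr1", 200, 300, "-")]
B = [("chr1", 50, 150, "-"), ("chr1", 250, 350, "-")]

# Ignoring strand, both A intervals have a hit in B.
unstranded_hits = [a for a in A if any(overlaps(a, b) for b in B)]

# With strand matching, the "+" interval loses its only hit.
stranded_hits = [a for a in A if any(overlaps_same_strand(a, b) for b in B)]

# Stranded non-intersection: A intervals with no same-strand hit.
stranded_misses = [a for a in A if not any(overlaps_same_strand(a, b) for b in B)]
```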

### Minimum overlap as a fraction of A `-f`

We want to keep rows of A that are at least 70% covered by elements from B.

```sh
bedtools intersect -wa -a A.bed -b B.bed -f 0.7 > out.bed
```

```py
cov = bf.coverage(A, B)
# Keep rows of A whose covered fraction is at least 0.7
out = A.loc[(cov['coverage'] / (cov['end'] - cov['start'])) >= 0.70]

# Alternatively
out = bf.coverage(A, B).query('coverage / (end - start) >= 0.7')[A.columns]
```
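
The covered fraction used above can be worked through by hand with a plain-Python sketch. It counts the bases of each A interval covered by the union of B intervals, so overlapping B intervals are not counted twice. The `covered_bases` helper is hypothetical, coordinates are assumed to be half-open, and this is an illustration rather than bioframe's implementation. Note that this is a combined coverage; as far as we know, bedtools applies `-f` to each A/B pair separately, so the two can differ when several B intervals only reach the threshold together, as for the first interval below.

```python
def covered_bases(a, others):
    """Bases of half-open interval `a` covered by the union of `others`."""
    chrom, start, end = a
    # Clip every overlapping interval to the bounds of `a` and sort by start.
    clipped = sorted(
        (max(start, s), min(end, e))
        for c, s, e in others
        if c == chrom and s < end and start < e
    )
    total, reach = 0, start  # `reach` is the rightmost base counted so far
    for s, e in clipped:
        if e > reach:
            total += e - max(s, reach)  # count only the not-yet-covered part
            reach = e
    return total


A = [("chr1", 0, 100), ("chr1", 200, 300)]
B = [("chr1", 0, 50), ("chr1", 40, 80), ("chr1", 250, 300)]

fractions = [covered_bases(a, B) / (a[2] - a[1]) for a in A]
out = [a for a, f in zip(A, fractions) if f >= 0.7]
```

The first A interval is 80% covered (bases 0 to 80, with the 40 to 50 overlap counted once) and is kept, while the second is only 50% covered and is dropped.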

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ bioframe
    guide-recipes.md
    guide-definitions
    guide-specifications
+   guide-bedtools

 .. toctree::
    :maxdepth: 1
