Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bedtools intersect conversions #198

Merged
merged 18 commits into from
Apr 26, 2024
2 changes: 2 additions & 0 deletions docs/api-fileops.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
.. _API_fileops:

File I/O
========

Expand Down
141 changes: 141 additions & 0 deletions docs/guide-bedtools.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
---
jupytext:
formats: md:myst
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.11.3
kernelspec:
display_name: Python 3
language: python
name: python3
---

# Bioframe for bedtools users


bioframe is built around the analysis of genomic intervals as a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in memory, rather than working with tab-delimited text files saved on disk.

Bioframe supports reading a number of standard genomics text file formats via [`read_table`](https://bioframe.readthedocs.io/en/latest/api-fileops.html#bioframe.io.fileops.read_table), including BED files (see [schemas](https://github.com/open2c/bioframe/blob/main/bioframe/io/schemas.py)), which will load them as pandas DataFrames, a complete list of helper functions is [available here](API_fileops).

For example, with gtf files, you do not need to turn them into bed files, you can directly read them into pandas (with e.g. [gtfparse](https://github.com/openvax/gtfparse/tree/master)).
For gtfs, it is often convenient to rename the seqname column into chrom, the default column name used in bioframe.

Any DataFrame object with `'chrom'`, `'start'`, and `'end'` columns will support the genomic [interval operations in bioframe](API_ops).

Finally, if needed, bioframe provides a convenience function to write the back to a bed file using `to_bed`.


## `bedtools intersect`

### Original unique entries from the first bed `-u`

Note that this gives one row per overlap and can contain duplicates,

```sh
bedtools intersect -u -a A.bed -b B.bed > out.bed
```

```py
overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True)
out = A.loc[overlap['index_1'].unique()]
```

### Report the number of hits in B `-c`

Reports 0 for A entries that have no overlap with B.

```sh
bedtools intersect -c -a A.bed -b B.bed > out.bed
```

```py
out = bf.count_overlaps(A, B)
```

### Original entries from the first bed for each overlap`-wa`

Note that this gives one row per overlap and can contain duplicates,

```sh
bedtools intersect -wa -a A.bed -b B.bed > out.bed
```

```py
overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True)
out = A.loc[overlap['index_1']]
# Alternatively
out = A.loc[bioframe.ops._overlap_intidxs(A, B, how='inner')[:,0]]
nvictus marked this conversation as resolved.
Show resolved Hide resolved
```
gamazeps marked this conversation as resolved.
Show resolved Hide resolved

nvictus marked this conversation as resolved.
Show resolved Hide resolved
### Original entries from the second bed `-wb`
nvictus marked this conversation as resolved.
Show resolved Hide resolved

```sh
bedtools intersect -wb -a A.bed -b B.bed > out.bed
```

```py
overlap = bf.overlap(A, B, how='inner', suffixes=('_1','_2'), return_index=True)
out = B.loc[overlap['index_2']]
nvictus marked this conversation as resolved.
Show resolved Hide resolved
```

### Intersect with multiple beds

```sh
bedtools intersect -wa -a A.bed -b B.bed C.bed D.bed> out.bed
```

```py
others = pd.concat([B, C, D])
overlap = bf.overlap(A, others, how='inner', suffixes=('_1','_2'), return_index=True)
out = A.loc[overlap['index_1']]
```

### Keep no overlap `-v`

```sh
bedtools intersect -wa -a A.bed -b B.bed -v > out.bed
```

```py
out = bf.setdiff(A, B)
```

### Force strandedness `-s`

For intersection

```sh
bedtools intersect -wa -a A.bed -b B.bed -s > out.bed
```

```py
overlap = bf.overlap(A, B, on=['strand'], suffixes=('_1','_2'), return_index=True, how='inner')
out = A.loc[overlap['index_1']]
```

For non intersection

```sh
bedtools intersect -wa -a A.bed -b B.bed -v -s > out.bed
```

```py
out = bf.setdiff(A, B, on=['strand'])
```

### Minimum overlap as a fraction of A `-f`

We want to keep rows of A that are covered at least 70% by elements from B

```sh
bedtools intersect -wa -a A.bed -b B.bed -f 0.7 > out.bed
```

```py
cov = bf.coverage(A, B)
out = A.loc[cov['coverage'] / (cov['end'] - cov['start']) ) >= 0.70]
# alternatively
out = bf.coverage(A, B).query('coverage / (end - start) >= 0.7')[A.columns]
```
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ bioframe
guide-recipes.md
guide-definitions
guide-specifications
guide-bedtools

.. toctree::
:maxdepth: 1
Expand Down