Extract then merge #162

Open · wants to merge 8 commits into main

Conversation

@alienzj commented Apr 25, 2024

When processing large-scale sample sets with SemiBin's multi-sample binning mode, data_cov.csv and data_split_cov.csv may require more than 1 TB of memory. This PR extracts the sample-wise contig coverage first and then merges it, which can significantly reduce memory usage.

After testing, I found it was still very slow when processing many (1K+) CSV files, so I updated the code to use Polars to parse the CSV files.
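
Roughly, the idea is captured by the sketch below. The function names are illustrative rather than the PR's actual code, and it assumes the coverage CSVs have an unnamed first column holding contig names prefixed with "<sample>:" (e.g. "sampleA:contig_1"):

import polars as pl

def extract_sample_cov(cov_csv, sample_id):
    # Keep only one sample's rows (contig names prefixed with e.g. "sampleA:"),
    # so the full multi-sample matrix never has to sit in memory at once.
    return (
        pl.scan_csv(cov_csv)
        .rename({"": "contig_name"})
        .filter(pl.col("contig_name").str.starts_with(sample_id))
        .collect()
    )

def merge_sample_cov(cov_csvs, sample_id):
    # Column-wise merge of the per-file slices, aligned on contig_name.
    slices = [extract_sample_cov(p, sample_id) for p in cov_csvs]
    return pl.concat(slices, how="align")

Because each extracted slice only holds one sample's contigs, peak memory should scale with the per-sample slice rather than with the full multi-sample matrix.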

@alienzj (Author) commented Apr 26, 2024

Benchmark data:

sample_id = "sampleA:"

cov_list = [
    "sampleA.sorted.bam_0_data_cov.csv", #  ~1G
    "sampleB.sorted.bam_1_data_cov.csv", #  ~1G
    "sampleC.sorted.bam_2_data_cov.csv", #  ~1G
    "sampleD.sorted.bam_3_data_cov.csv", #  ~1G
    "sampleE.sorted.bam_4_data_cov.csv", #  ~1G
    "sampleF.sorted.bam_5_data_cov.csv", #  ~1G
    "sampleG.sorted.bam_6_data_cov.csv"  #  ~1G
]

@alienzj (Author) commented Apr 26, 2024

Test using pd.read_csv:

%%time
pd_dfs = []

for i in cov_list:
    # Read one coverage table and rename its unnamed index column to contig_name.
    data_cov = pd.read_csv(i, index_col=0, engine="pyarrow")
    data_cov = data_cov.reset_index()
    columns_list = list(data_cov.columns)
    columns_list[0] = 'contig_name'
    data_cov.columns = columns_list

    # Keep only this sample's rows and strip the "sample:" prefix from the index.
    part_data = data_cov[data_cov['contig_name'].str.contains(sample_id, regex=False)]
    part_data = part_data.set_index("contig_name")
    part_data.index.name = None
    part_data.index = [ix.split(":")[1] for ix in part_data.index]

    pd_dfs.append(part_data)

# Merge the per-file slices column-wise and rescale each coverage column
# by its mean rounded up to the nearest hundred.
sample_cov = pd.concat(pd_dfs, axis=1)
sample_cov.index = sample_cov.index.astype(str)

abun_scale = (sample_cov.mean() / 100).apply(np.ceil) * 100
sample_cov = sample_cov.div(abun_scale)

Results:

CPU times: user 49.1 s, sys: 19.4 s, total: 1min 8s
Wall time: 49.3 s

@alienzj (Author) commented Apr 26, 2024

Test using pl.read_csv:

%%time
pl_dfs_read = []

for i in cov_list:
    # Read one coverage table, name the unnamed index column, and keep only this sample's rows.
    data_cov = (
        pl.read_csv(i)
        .rename({"": "contig_name"})
        .filter(pl.col("contig_name").str.contains(sample_id))
    )

    pl_dfs_read.append(data_cov)

# Align the per-file slices on contig_name into one wide table.
contig_cov = pl.concat(pl_dfs_read, how="align")

# Strip the "sample:" prefix from the contig names.
contig_names = [name.split(":")[1] for name in contig_cov["contig_name"]]

contig_cov = contig_cov.drop("contig_name")
headers = ["contig_name"] + list(contig_cov.columns)

# Rescale each coverage column by its mean rounded up to the nearest hundred.
abun_scale = (contig_cov.mean() / 100).map_rows(np.ceil) * 100
divided_columns = [pl.col(col) / abun_scale[0, index] for index, col in enumerate(list(contig_cov.columns))]

result = contig_cov.select(divided_columns)

# Re-attach the cleaned contig names and restore the original column order.
result = result.with_columns(pl.Series("contig_name", contig_names))
result = result.select(headers)

Results:

CPU times: user 19.1 s, sys: 3.31 s, total: 22.4 s
Wall time: 3.49 s

@alienzj (Author) commented Apr 26, 2024

Test using pl.scan_csv:

%%time
pl_dfs_scan = []

for i in cov_list:
    # Lazily scan the coverage table so only this sample's rows are materialised.
    data_cov = (
        pl.scan_csv(i)
        .rename({"": "contig_name"})
        .filter(pl.col("contig_name").str.contains(sample_id))
        .collect()
    )

    pl_dfs_scan.append(data_cov)

# Align the per-file slices on contig_name into one wide table.
contig_cov = pl.concat(pl_dfs_scan, how="align")

# Strip the "sample:" prefix from the contig names.
contig_names = [name.split(":")[1] for name in contig_cov["contig_name"]]

contig_cov = contig_cov.drop("contig_name")
headers = ["contig_name"] + list(contig_cov.columns)

# Rescale each coverage column by its mean rounded up to the nearest hundred.
abun_scale = (contig_cov.mean() / 100).map_rows(np.ceil) * 100
divided_columns = [pl.col(col) / abun_scale[0, index] for index, col in enumerate(list(contig_cov.columns))]

result = contig_cov.select(divided_columns)

# Re-attach the cleaned contig names and restore the original column order.
result = result.with_columns(pl.Series("contig_name", contig_names))
result = result.select(headers)

Results:

CPU times: user 18.6 s, sys: 2.86 s, total: 21.5 s
Wall time: 2.3 s

@alienzj (Author) commented Apr 26, 2024

System information:

OS: AlmaLinux 9.3 (Shamrock Pampas Cat) x86_64
Host: VMware7,1
Kernel: 5.14.0-362.24.2.el9_3.x86_64
Uptime: 8 days, 1 hour, 49 mins
Packages: 2261 (rpm)
Shell: fish 3.3.1
Resolution: 1280x800
Terminal: /dev/pts/0
CPU: AMD EPYC 7763 (100) @ 2.445GHz
GPU: NVIDIA A40
Memory: 27503MiB / 805752MiB

Software information:

Python 3.9.19
pandas 2.2.2
polars 0.20.21

@apcamargo commented Aug 12, 2024

After a quick look at the code, I think there are some potential optimizations that could improve memory usage and performance (though speed may not be the primary concern here).

  • Polars' scan_csv function supports wildcards (e.g., cov_*.csv) or a list of file paths as input. This eliminates the need to append data frames to a list in a loop. However, Polars will generate a single long data frame, which you would need to reshape into a wide format using the pivot function. One caveat is that you would need a suitable column to use as the index for the pivot operation.
  • The Polars streaming API could improve memory usage, as it allows for filtering data without loading the entire dataset into memory.
  • Instead of dropping the contig names, manipulating the coverage values, and then creating a new data frame with the contig names, you can use Polars' selectors to operate directly on the coverage columns while keeping the name column unchanged.
  • Using Polars expressions instead of pure Python (e.g., [i.split(":")[1] for i in contig_cov["contig_name"]]) or NumPy (e.g., (contig_cov.mean() / 100).map_rows(np.ceil) * 100) can reduce the creation of additional objects and improve code efficiency. The first example could be replaced with extract, and the second could potentially be replaced by using Polars' mean expression coupled with over. It might be feasible to convert this entire piece of code to Polars (which is mostly zero-copy, so memory usage should improve). See the sketch after this list.
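
For concreteness, here is a minimal sketch that combines the streaming scan, selector, and expression ideas while keeping the per-file loop and the how="align" concat from the benchmarks above. cov_list and sample_id are the benchmark variables; everything else is illustrative:

import polars as pl
import polars.selectors as cs

frames = [
    pl.scan_csv(path)
    .rename({"": "contig_name"})
    .filter(pl.col("contig_name").str.contains(sample_id))
    .collect(streaming=True)   # stream the scan + filter so the whole file is never in memory at once
    for path in cov_list
]

contig_cov = pl.concat(frames, how="align")

contig_cov = contig_cov.with_columns(
    # Strip the "sample:" prefix with an expression instead of a Python list comprehension.
    pl.col("contig_name").str.extract(r"^[^:]+:(.*)$", 1)
).with_columns(
    # Divide each coverage column by its mean rounded up to the nearest hundred,
    # leaving contig_name untouched.
    [
        pl.col(c) / ((pl.col(c).mean() / 100).ceil() * 100)
        for c in contig_cov.select(cs.numeric()).columns
    ]
)

The glob/pivot idea from the first point is not shown here, since it would also require the per-file coverage columns to share a schema (or a follow-up pivot), as noted above.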

Apologies for the unsolicited advice! I’m particularly interested in this issue and thought I could put the time I spent learning Polars to good use. Please feel free to disregard any of this!

@luispedro (Member) commented:

At first, I was hesitant to add polars as a dependency, but I have also been switching to polars from pandas, so I am less worried about it now.

@apcamargo commented:

I had the same exact experience. I try to avoid extra dependencies as much as possible, but nowadays Polars is one of the few packages that I don't mind depending on. Polars itself also has very minimal dependencies.

@alienzj (Author) commented Sep 9, 2024

Thank you so much for your suggestions, @apcamargo!
I am going to refine the code accordingly.
