Extract then merge #162
Benchmark data:

```python
sample_id = "sampleA:"
cov_list = [
    "sampleA.sorted.bam_0_data_cov.csv",  # ~1G
    "sampleB.sorted.bam_1_data_cov.csv",  # ~1G
    "sampleC.sorted.bam_2_data_cov.csv",  # ~1G
    "sampleD.sorted.bam_3_data_cov.csv",  # ~1G
    "sampleE.sorted.bam_4_data_cov.csv",  # ~1G
    "sampleF.sorted.bam_5_data_cov.csv",  # ~1G
    "sampleG.sorted.bam_6_data_cov.csv",  # ~1G
]
```
Test using `%%time`:

```python
import numpy as np
import pandas as pd

pd_dfs = []
for i in cov_list:
    data_cov = pd.read_csv(i, index_col=0, engine="pyarrow")
    data_cov = data_cov.reset_index()
    # The first (unnamed) column holds the "<sample>:<contig>" identifiers
    columns_list = list(data_cov.columns)
    columns_list[0] = 'contig_name'
    data_cov.columns = columns_list
    # Keep only the rows belonging to this sample
    part_data = data_cov[data_cov['contig_name'].str.contains(sample_id, regex=False)]
    part_data = part_data.set_index("contig_name")
    part_data.index.name = None
    # Strip the "<sample>:" prefix so the index is just the contig name
    part_data.index = [ix.split(":")[1] for ix in part_data.index]
    pd_dfs.append(part_data)
sample_cov = pd.concat(pd_dfs, axis=1)
sample_cov.index = sample_cov.index.astype(str)
# Scale each column by its mean coverage rounded up to the nearest 100
abun_scale = (sample_cov.mean() / 100).apply(np.ceil) * 100
sample_cov = sample_cov.div(abun_scale)
```

Results:
Test using `%%time`:

```python
import numpy as np
import polars as pl

pl_dfs_read = []
for i in cov_list:
    # Rename the unnamed first column and keep only this sample's contigs
    data_cov = pl.read_csv(i)\
        .rename({"": "contig_name"})\
        .filter(pl.col("contig_name").str.contains(sample_id))
    pl_dfs_read.append(data_cov)
contig_cov = pl.concat(pl_dfs_read, how="align")
# Strip the "<sample>:" prefix from the contig identifiers
contig_names = [i.split(":")[1] for i in contig_cov["contig_name"]]
contig_cov = contig_cov.drop("contig_name")
headers = ["contig_name"] + list(contig_cov.columns)
# Scale each column by its mean coverage rounded up to the nearest 100
abun_scale = (contig_cov.mean() / 100).map_rows(np.ceil) * 100
divided_columns = [pl.col(col) / abun_scale[0, index] for index, col in enumerate(list(contig_cov.columns))]
result = contig_cov.select(divided_columns)
result = result.with_columns(pl.Series("contig_name", contig_names))
result = result.select(headers)
```

Results:
Test using `%%time`:

```python
import numpy as np
import polars as pl

pl_dfs_scan = []
for i in cov_list:
    # Lazily scan the CSV, filter to this sample's contigs, then collect
    data_cov = pl.scan_csv(i)\
        .rename({"": "contig_name"})\
        .filter(pl.col("contig_name").str.contains(sample_id))\
        .collect()
    pl_dfs_scan.append(data_cov)
contig_cov = pl.concat(pl_dfs_scan, how="align")
# Strip the "<sample>:" prefix from the contig identifiers
contig_names = [i.split(":")[1] for i in contig_cov["contig_name"]]
contig_cov = contig_cov.drop("contig_name")
headers = ["contig_name"] + list(contig_cov.columns)
# Scale each column by its mean coverage rounded up to the nearest 100
abun_scale = (contig_cov.mean() / 100).map_rows(np.ceil) * 100
divided_columns = [pl.col(col) / abun_scale[0, index] for index, col in enumerate(list(contig_cov.columns))]
result = contig_cov.select(divided_columns)
result = result.with_columns(pl.Series("contig_name", contig_names))
result = result.select(headers)
```

Results:
After a quick look at the code, I think there are some potential optimizations that could improve memory usage and performance (though speed may not be the primary concern here).
Apologies for the unsolicited advice! I’m particularly interested in this issue and thought I could put the time I spent learning Polars to good use. Please feel free to disregard any of this!
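For illustration only (this is a sketch, not necessarily the optimizations referred to above), one direction that often helps in Polars is to express the per-column scaling as expressions instead of materialising an intermediate `abun_scale` frame and looping over it from Python. The toy `contig_cov` frame below is made up so the snippet is self-contained:

```python
import polars as pl

# Toy stand-in for the contig_cov frame built in the snippets above
# (numeric coverage columns only, contig names already dropped).
contig_cov = pl.DataFrame({
    "sampleA.sorted.bam_0": [120.0, 80.0, 310.0],
    "sampleB.sorted.bam_1": [95.0, 410.0, 60.0],
})

# Divide every column by its own mean rounded up to the nearest 100,
# computed entirely inside one select() instead of map_rows + indexing.
scaled = contig_cov.select(
    [
        (pl.col(c) / ((pl.col(c).mean() / 100).ceil() * 100)).alias(c)
        for c in contig_cov.columns
    ]
)
print(scaled)
```

The same expressions also work on a LazyFrame, so the filter, the scaling, and the final collection can be planned as a single query.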
At first, I was hesitant to add Polars as a dependency.
I had the same exact experience. I try to avoid extra dependencies as much as possible, but nowadays Polars is one of the few packages that I don't mind depending on. Polars itself also has very minimal dependencies.
Thank you so much, @apcamargo, for your suggestions!
When processing large-scale samples with the `SemiBin multi` binning mode, `data_cov.csv` and `data_split_cov.csv` may require 1 TB+ of memory. This PR extracts the sample-wise contig coverage first and then merges it, which significantly reduces memory usage. After testing, I found it was still very slow when processing many (1K+) CSV files, so I updated the code to use `polars` to parse the CSV files.
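As a rough sketch of the extract-then-merge idea (the `extract_sample_cov` helper, the `samples` list, and the output file names below are hypothetical, not the PR's actual code), each sample can be handled independently so that peak memory stays close to one sample's slice rather than the full multi-sample tables:

```python
import polars as pl

def extract_sample_cov(sample_id: str, cov_list: list[str]) -> pl.DataFrame:
    """Pull one sample's rows out of every coverage CSV, then merge them."""
    parts = []
    for path in cov_list:
        part = (
            pl.scan_csv(path)
            .rename({"": "contig_name"})
            # Contig identifiers are assumed to be "<sample>:<contig>"
            .filter(pl.col("contig_name").str.starts_with(sample_id))
            .collect()
        )
        parts.append(part)
    return pl.concat(parts, how="align")

# Placeholder; see the benchmark data at the top of this thread.
cov_list = ["sampleA.sorted.bam_0_data_cov.csv"]

# Hypothetical driver loop: process samples one at a time instead of
# loading coverage for every sample at once.
samples = ["sampleA", "sampleB", "sampleC"]
for name in samples:
    sample_cov = extract_sample_cov(f"{name}:", cov_list)
    sample_cov.write_csv(f"{name}_cov.csv")
```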