
Low throughput #10

Open
JulienPeloton opened this issue Mar 9, 2018 · 2 comments
Labels: enhancement, help wanted, IO
Assignees: JulienPeloton

@JulienPeloton (Member) commented on Mar 9, 2018:

The current throughput is around 5-10 MB/s to load FITS data and convert it to a DataFrame.
The decoding library needs to be improved...
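For reference, a minimal sketch of how a throughput figure like this can be measured: time a full load + count over a dataset of known size. The reader call below is an assumption (the "fits" format name and the "hdu" option may differ between spark-fits versions), and the path and dataset size are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ThroughputBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-fits throughput")
      .getOrCreate()

    // Placeholder inputs: location and total size of the FITS dataset.
    val path = "hdfs:///path/to/data.fits"
    val totalMB = 110 * 1024.0  // e.g. 110 GB

    val t0 = System.nanoTime()

    // Assumed spark-fits reader API; adapt to the version in use.
    val df = spark.read
      .format("fits")
      .option("hdu", "1")
      .load(path)
    val nRows = df.count()

    val elapsedS = (System.nanoTime() - t0) / 1e9
    println(f"$nRows%d rows read, ${totalMB / elapsedS}%.1f MB/s end-to-end")

    spark.stop()
  }
}
```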

JulienPeloton added the enhancement and help wanted labels on Mar 9, 2018
@JulienPeloton (Member, Author) commented:

A bit better after #11.
Now around 15 MB/s.

JulienPeloton self-assigned this on Mar 15, 2018
@JulienPeloton (Member, Author) commented on Apr 5, 2018:

Current I/O benchmark (spark.readfits.load.count) on 110 GB (first iteration), with 128 MB partitions:

| Configuration | Median task duration (s) [GC] | Comments |
| --- | --- | --- |
| no reading / no decoding | 0.5 [0.02] | Spark/Hadoop overhead |
| reading / no decoding | 8 [0.04] | Overhead + I/O |
| reading / decoding | 10 [1] | Overhead + I/O + Scala FITSIO |

Most of the time is spent reading from disk (more than 60% of the total). This is the time spent in f.readFully(). Could it be better?
Note that the I/O figure also includes the effect of data locality: tasks are not only reading the file from the local DataNode, but also transferring data from remote DataNodes.
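To check how much of the readFully cost is irreducible, one option is to benchmark the raw HDFS read path in isolation, outside of Spark. The following is a minimal sketch (the file path is a placeholder) that times a single 128 MB positional readFully, mimicking what a partition task does before any decoding:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ReadFullyBench {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val path = new Path("hdfs:///path/to/data.fits")  // placeholder path
    val fs = FileSystem.get(path.toUri, conf)

    // One 128 MB chunk, matching the partition size used in the benchmark above.
    val blockSize = 128 * 1024 * 1024
    val buf = new Array[Byte](blockSize)

    val in = fs.open(path)
    val t0 = System.nanoTime()
    in.readFully(0L, buf)  // positional read of the first 128 MB
    val elapsedS = (System.nanoTime() - t0) / 1e9
    in.close()

    println(f"raw readFully throughput: ${128.0 / elapsedS}%.1f MB/s")
  }
}
```

Running this from a node holding a local replica vs. a remote one would also quantify the data-locality effect mentioned above.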

Decoding is 30% of the total (with large GC time).
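On the decoding side, the GC time is consistent with per-row allocation when converting the big-endian FITS bytes into JVM values. The snippet below is purely illustrative (it is not the spark-fits decoder, and the two-column schema is made up); it shows the kind of ByteBuffer-based row decoding where each row produces boxed values and a fresh collection:

```scala
import java.nio.ByteBuffer

object RowDecoderSketch {
  // Hypothetical row layout: one 64-bit float followed by one 32-bit int.
  // FITS binary tables are big-endian, which is the ByteBuffer default.
  def decodeRow(bytes: Array[Byte]): Seq[Any] = {
    val buf = ByteBuffer.wrap(bytes)
    // Boxing into Any and building a new Seq per row is simple to write
    // but allocation-heavy at scale, hence noticeable GC time.
    Seq(buf.getDouble(), buf.getInt())
  }

  def main(args: Array[String]): Unit = {
    val row = Array[Byte](
      0x40, 0x09, 0x21, 0xfb.toByte, 0x54, 0x44, 0x2d, 0x18,  // pi as an IEEE 754 double
      0x00, 0x00, 0x00, 0x2a)                                  // 42 as a 32-bit int
    println(decodeRow(row))  // List(3.141592653589793, 42)
  }
}
```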
