
Low throughput #10

Open
JulienPeloton opened this issue Mar 9, 2018 · 2 comments
Labels: enhancement, help wanted, IO
Assignees: JulienPeloton

@JulienPeloton (Member) commented on Mar 9, 2018:

The current throughput is around 5-10 MB/s to load FITS data and convert it to a DataFrame.
The decoding library needs to be improved...
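For reference, a minimal sketch of how a throughput figure like this can be measured: time a full load + count over a dataset of known size. The reader call below is an assumption (the "fits" format name and the "hdu" option may differ between spark-fits versions), and the path and dataset size are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ThroughputBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-fits throughput")
      .getOrCreate()

    // Placeholder inputs: location and total size of the FITS dataset.
    val path = "hdfs:///path/to/data.fits"
    val totalMB = 110 * 1024.0  // e.g. 110 GB

    val t0 = System.nanoTime()

    // Assumed spark-fits reader API; adapt to the version in use.
    val df = spark.read
      .format("fits")
      .option("hdu", "1")
      .load(path)
    val nRows = df.count()

    val elapsedS = (System.nanoTime() - t0) / 1e9
    println(f"$nRows%d rows read, ${totalMB / elapsedS}%.1f MB/s end-to-end")

    spark.stop()
  }
}
```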

JulienPeloton added the enhancement and help wanted labels on Mar 9, 2018
@JulienPeloton (Member, Author) commented:

A bit better after #11.
Now around 15 MB/s.

JulienPeloton self-assigned this on Mar 15, 2018
@JulienPeloton (Member, Author) commented on Apr 5, 2018:

Current I/O benchmark (spark.readfits.load.count) on 110 GB (first iteration), with 128 MB partitions:

| Configuration | Median task duration (s) [GC] | Comments |
| --- | --- | --- |
| no reading / no decoding | 0.5 [0.02] | Spark/Hadoop overhead |
| reading / no decoding | 8 [0.04] | Overhead + I/O |
| reading / decoding | 10 [1] | Overhead + I/O + Scala FITSIO |

Most of the time is spent reading from disk (more than 60% of the total). This is the time spent in f.readFully(). Could it be better?
Note that the I/O figure also includes the effect of data locality: tasks are not only reading the file from the local DataNode, but also transferring data from remote DataNodes.
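To check how much of the readFully cost is irreducible, one option is to benchmark the raw HDFS read path in isolation, outside of Spark. The following is a minimal sketch (the file path is a placeholder) that times a single 128 MB positional readFully, mimicking what a partition task does before any decoding:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ReadFullyBench {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val path = new Path("hdfs:///path/to/data.fits")  // placeholder path
    val fs = FileSystem.get(path.toUri, conf)

    // One 128 MB chunk, matching the partition size used in the benchmark above.
    val blockSize = 128 * 1024 * 1024
    val buf = new Array[Byte](blockSize)

    val in = fs.open(path)
    val t0 = System.nanoTime()
    in.readFully(0L, buf)  // positional read of the first 128 MB
    val elapsedS = (System.nanoTime() - t0) / 1e9
    in.close()

    println(f"raw readFully throughput: ${128.0 / elapsedS}%.1f MB/s")
  }
}
```

Running this from a node holding a local replica vs. a remote one would also quantify the data-locality effect mentioned above.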

Decoding is 30% of the total (with large GC time).
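On the decoding side, the GC time is consistent with per-row allocation when converting the big-endian FITS bytes into JVM values. The snippet below is purely illustrative (it is not the spark-fits decoder, and the two-column schema is made up); it shows the kind of ByteBuffer-based row decoding where each row produces boxed values and a fresh collection:

```scala
import java.nio.ByteBuffer

object RowDecoderSketch {
  // Hypothetical row layout: one 64-bit float followed by one 32-bit int.
  // FITS binary tables are big-endian, which is the ByteBuffer default.
  def decodeRow(bytes: Array[Byte]): Seq[Any] = {
    val buf = ByteBuffer.wrap(bytes)
    // Boxing into Any and building a new Seq per row is simple to write
    // but allocation-heavy at scale, hence noticeable GC time.
    Seq(buf.getDouble(), buf.getInt())
  }

  def main(args: Array[String]): Unit = {
    val row = Array[Byte](
      0x40, 0x09, 0x21, 0xfb.toByte, 0x54, 0x44, 0x2d, 0x18,  // pi as an IEEE 754 double
      0x00, 0x00, 0x00, 0x2a)                                  // 42 as a 32-bit int
    println(decodeRow(row))  // List(3.141592653589793, 42)
  }
}
```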
