Sometimes many scenarios need to be run together. When that happens, the disk space needed gets too high. When processing the results, e.g. for plotting, the RAM needed is just as high, since everything is done in memory.
A quick fix for the disk space problem has been implemented in the branch `update/results_parquet` by using Parquet instead of CSV. Parquet stores columnar data in a compressed state. However, there are some problems with the current implementation:
- You cannot plot results or access data while the simulator is still running;
- When plotting or processing data, you still need to do it in memory;
- Even with Parquet compression, the disk space is sometimes still too high.
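For reference, the Parquet quick fix amounts to roughly the following (a minimal sketch, not the exact code in `update/results_parquet`; the column name is made up):

```python
import pandas as pd

# Hypothetical results table with one column per collected sample series.
df = pd.DataFrame({"dl_sinr": [1.2, 3.4, 5.6]})

# CSV: plain text, large on disk.
df.to_csv("results.csv", index=False)

# Parquet: columnar and compressed (needs pyarrow or fastparquet installed);
# typically several times smaller than CSV for float-heavy data.
df.to_parquet("results.parquet", compression="zstd")
```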
We are currently considering:
Using duckdb for storing results:
- It provides state-of-the-art compression algorithms for time series and similar float data. It would probably compress about as well as Parquet for us;
- It works in-process, just like SQLite. That means you don't need to install and run duckdb as a separate process;
- It has close integration with CSV and Parquet files, so it is pretty easy to export data and to create a backwards-compatibility layer;
- There is no need to process everything in memory: we could use SQL to compute CDFs, etc. (see the sketch below);
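As a rough illustration of that workflow, here is a minimal sketch (the `samples` table, its schema, and the `dl_sinr` metric name are hypothetical, not SHARC's actual layout):

```python
import duckdb

# Open (or create) an on-disk database; DuckDB runs in-process,
# so there is no separate server to install or manage.
con = duckdb.connect("results.duckdb")

con.execute("CREATE TABLE IF NOT EXISTS samples (metric VARCHAR, value DOUBLE)")
con.execute("INSERT INTO samples VALUES ('dl_sinr', 3.1), ('dl_sinr', -0.4), ('dl_sinr', 12.7)")

# The CDF is computed entirely inside the database with a window function,
# so the raw samples never have to be loaded into Python memory.
cdf = con.execute("""
    SELECT value, cume_dist() OVER (ORDER BY value) AS cdf
    FROM samples
    WHERE metric = 'dl_sinr'
    ORDER BY value
""").fetchall()

# Exporting for the backwards-compatibility layer is a single statement.
con.execute("COPY samples TO 'samples.parquet' (FORMAT PARQUET)")
```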
The problems with this:
- Compression alone would sometimes still not be enough;
- SQL seems really unnecessary;
So, to occupy less storage, we could:
- Instead of all raw samples, choose a subset of samples to collect and save. Discard all others:
  - We could have either the parameter `general.only_collect_samples: list[str]` or `general.dont_collect_samples: list[str]` to select which samples to collect or not collect (a sketch of such a filter follows below);
Problem:
- This doesn't help when the samples we want to collect are exactly the ones that occupy the most space;
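A minimal sketch of such a filter, applied wherever the simulator appends a sample batch (the parameter values and helper names are hypothetical, not the current SHARC API):

```python
# Would be read from general.only_collect_samples / general.dont_collect_samples.
only_collect: list[str] | None = ["dl_sinr", "ul_tx_power"]
dont_collect: list[str] = []

def should_collect(sample_name: str) -> bool:
    """Return True if this sample series should be kept at all."""
    if only_collect is not None:
        return sample_name in only_collect
    return sample_name not in dont_collect

def append_samples(store: dict[str, list[float]], name: str, values: list[float]) -> None:
    # Series filtered out here are discarded immediately, saving RAM and disk.
    if should_collect(name):
        store.setdefault(name, []).extend(values)
```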
- Instead of raw samples, store the pdf (probability density function) using bins of arbitrary precision:
  - We would add a parameter `general.store: "pdf" | "raw samples"` and `general.store.pdf.bin_width: float` for storing the bin width/precision. The data stored for `pdf` would then be a pair `(bin_floor, frequency)` per bin;
  - Even when using a very precise pdf, this would still be much smaller, considering how limited each column's range of values is. For example, if a sample ranges from -500 to 500 (already an uncommonly wide range), we select `bin_width: 0.001`, and there is at least one sample in each bin, it occupies, without compression, 8 MB. This comes from `n_bins * ( sizeof(float) + sizeof(int) )`, with `n_bins = 1e3/1e-3 = 1e6` and `sizeof(float) = sizeof(int) = 4` bytes;
  - Take note that this ceiling on file size is not always reached (see the sketch at the end);
Problem:
- Backwards compatibility;
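A minimal sketch of the binning itself (the helper and parameter names are made up; only non-empty bins are kept, which is why the size ceiling above is not always reached):

```python
from collections import Counter
import math

def update_pdf_bins(bins: Counter, values: list[float], bin_width: float) -> None:
    """Accumulate raw samples into fixed-width bins instead of storing them.

    Only non-empty bins get an entry; the stored pair would be
    (bin_floor, frequency), with bin_floor = bin_index * bin_width.
    """
    for v in values:
        bins[math.floor(v / bin_width)] += 1

bin_width = 0.001  # would come from general.store.pdf.bin_width
bins: Counter = Counter()
update_pdf_bins(bins, [0.1234, 0.1236, -2.5], bin_width)
# bins == {123: 2, -2500: 1}; written out as (bin_floor, frequency) pairs,
# the file is bounded by n_bins * (sizeof(float) + sizeof(int)) bytes uncompressed.
```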