Sometimes many scenarios need to be run together. When that happens, the disk space needed gets too high. When processing the results, e.g. for plotting, the RAM needed is just as high, since everything is done in memory.
A quick fix for the disk space problem has been implemented in the branch `update/results_parquet` by using Parquet instead of CSV. Parquet stores columnar data in a compressed state. However, there are some problems with the current implementation:
- You cannot plot results or access data while the simulator is still running;
- When plotting or processing data, you still need to do it in memory;
- Even with Parquet compression, the disk space is sometimes still too high.
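For reference, the Parquet quick fix amounts to roughly the following (a minimal sketch, not the exact code in `update/results_parquet`; the column name is made up):

```python
import pandas as pd

# Hypothetical results table with one column per collected sample series.
df = pd.DataFrame({"dl_sinr": [1.2, 3.4, 5.6]})

# CSV: plain text, large on disk.
df.to_csv("results.csv", index=False)

# Parquet: columnar and compressed (needs pyarrow or fastparquet installed);
# typically several times smaller than CSV for float-heavy data.
df.to_parquet("results.parquet", compression="zstd")
```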
We are currently considering:
Using duckdb for storing results:
- It provides state-of-the-art compression algorithms for time series and similar float data. It would probably compress about as well as Parquet for us;
- It works in-process, just like SQLite. That means you don't need to install and run duckdb as a separate process;
- It has close integration with CSV and Parquet files, so it is pretty easy to export data and to create a backwards-compatibility layer;
- There is no need to process everything in memory: we could use SQL to compute CDFs, etc. (see the sketch below);
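As a rough illustration of that workflow, here is a minimal sketch (the `samples` table, its schema, and the `dl_sinr` metric name are hypothetical, not SHARC's actual layout):

```python
import duckdb

# Open (or create) an on-disk database; DuckDB runs in-process,
# so there is no separate server to install or manage.
con = duckdb.connect("results.duckdb")

con.execute("CREATE TABLE IF NOT EXISTS samples (metric VARCHAR, value DOUBLE)")
con.execute("INSERT INTO samples VALUES ('dl_sinr', 3.1), ('dl_sinr', -0.4), ('dl_sinr', 12.7)")

# The CDF is computed entirely inside the database with a window function,
# so the raw samples never have to be loaded into Python memory.
cdf = con.execute("""
    SELECT value, cume_dist() OVER (ORDER BY value) AS cdf
    FROM samples
    WHERE metric = 'dl_sinr'
    ORDER BY value
""").fetchall()

# Exporting for the backwards-compatibility layer is a single statement.
con.execute("COPY samples TO 'samples.parquet' (FORMAT PARQUET)")
```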
The problems with this:
- Compression alone would sometimes still not be enough;
- SQL seems really unnecessary;
So, to occupy less storage, we could:
- Instead of all raw samples, choose a subset of samples to collect and save. Discard all others:
  - We could have either the parameter `general.only_collect_samples: list[str]` or `general.dont_collect_samples: list[str]` to select which samples to collect or not collect (a sketch of such a filter follows below);
Problem:
- This doesn't help when the samples we want to collect are exactly the ones that occupy the most space;
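A minimal sketch of such a filter, applied wherever the simulator appends a sample batch (the parameter values and helper names are hypothetical, not the current SHARC API):

```python
# Would be read from general.only_collect_samples / general.dont_collect_samples.
only_collect: list[str] | None = ["dl_sinr", "ul_tx_power"]
dont_collect: list[str] = []

def should_collect(sample_name: str) -> bool:
    """Return True if this sample series should be kept at all."""
    if only_collect is not None:
        return sample_name in only_collect
    return sample_name not in dont_collect

def append_samples(store: dict[str, list[float]], name: str, values: list[float]) -> None:
    # Series filtered out here are discarded immediately, saving RAM and disk.
    if should_collect(name):
        store.setdefault(name, []).extend(values)
```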
- Instead of raw samples, store the pdf (probability density function) using bins of arbitrary precision:
  - We would add a parameter `general.store: "pdf" | "raw samples"` and `general.store.pdf.bin_width: float` for storing the bin width/precision. The data stored for `pdf` would then be a pair `(bin_floor, frequency)` per bin;
  - Even when using a very precise pdf, this would still be much smaller, considering how limited each column's range of values is. For example, if a sample ranges from -500 to 500 (already an uncommonly wide range), we select `bin_width: 0.001`, and there is at least one sample in each bin, it occupies, without compression, 8 MB. This comes from `n_bins * ( sizeof(float) + sizeof(int) )`, with `n_bins = 1e3/1e-3 = 1e6` and `sizeof(float) = sizeof(int) = 4` bytes;
  - Take note that this ceiling on file size is not always reached (see the sketch at the end);
Problem:
- Backwards compatibility;
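A minimal sketch of the binning itself (the helper and parameter names are made up; only non-empty bins are kept, which is why the size ceiling above is not always reached):

```python
from collections import Counter
import math

def update_pdf_bins(bins: Counter, values: list[float], bin_width: float) -> None:
    """Accumulate raw samples into fixed-width bins instead of storing them.

    Only non-empty bins get an entry; the stored pair would be
    (bin_floor, frequency), with bin_floor = bin_index * bin_width.
    """
    for v in values:
        bins[math.floor(v / bin_width)] += 1

bin_width = 0.001  # would come from general.store.pdf.bin_width
bins: Counter = Counter()
update_pdf_bins(bins, [0.1234, 0.1236, -2.5], bin_width)
# bins == {123: 2, -2500: 1}; written out as (bin_floor, frequency) pairs,
# the file is bounded by n_bins * (sizeof(float) + sizeof(int)) bytes uncompressed.
```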