Cannot use my own datasets with ALP Benchmark #8

Closed
aabduvakhobov opened this issue Sep 11, 2024 · 12 comments

@aabduvakhobov

Hi,

I am trying to use the existing benchmark in the ALP repository with my own datasets. I followed the documentation to plug in my datasets, and all of the methods work except ALP. The new datasets were preprocessed using the same method as in datasets_transformer.ipynb, and I made sure my data does not contain null, NaN, inf, or -inf values. My datasets are also larger than some of the existing datasets in the benchmark.

Thank you!

@azimafroozeh
Collaborator

Thanks for your question! Could you please provide a CSV file containing a portion of your dataset, or give more details about the error you're encountering? I'd be happy to take a closer look. Also, to clarify: null, NaN, inf, and -inf values should not cause any issues, as ALP treats them as exceptions.

@aabduvakhobov
Author

Thanks for the prompt response, and sorry for not elaborating on the issue!
It is a segmentation fault, with the following error message when I run it under gdb:

alp::AlpEncode<double>::find_best_exponent_factor_from_combinations (top_combinations=std::vector of length 0, capacity 0, top_k=5 '\005', input_vector=0x55555562dbe0, input_vector_size=1024, factor=@0x7fffffffd2d1: 9 '\t', exponent=@0x7fffffffd2d0: 14 '\016') at /home/abduvoris/Paper-2-Experiments/Lossless-model-types/ALP/include/alp/encode.hpp:208
208				const int exp_idx    = top_combinations[k].first;

I assume my datasets fail the first sampling step, so an empty array of exponent and factor pairs is returned.
My datasets are under NDA, but I will try to replicate the issue with the open dataset that I have.
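
For reference, here is a minimal standalone reproduction of that crash pattern (illustrative only, not ALP code): the frame at encode.hpp:208 reads top_combinations[k].first from a vector of length 0, which is undefined behavior, and an emptiness guard would turn the segfault into a handled condition.

```cpp
// Minimal reproduction of the crash pattern (illustrative, not ALP code):
// indexing [k] into an empty vector, as the gdb frame shows with
// "std::vector of length 0", is undefined behavior.
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<int, int>> top_combinations;  // empty, as in the trace
    std::size_t k = 0;
    if (top_combinations.empty()) {  // guard that would avoid the segfault
        std::fprintf(stderr, "sampling produced no (exponent, factor) pairs\n");
        return 1;
    }
    return top_combinations[k].first;
}
```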

@azimafroozeh
Collaborator

No worries at all! It would be ideal if you could replicate the issue with an open dataset. I'll be happy to look into it further once you have the dataset ready. Looking forward to hearing back from you!

@aabduvakhobov
Author

Thanks! You can find the sample CSV dataset attached below; I converted it to text format to be able to attach it here. In the experiments, I feed it in as a binary file. So far I am running the benchmark only for compression ratio, so I am not specifying the other parameters, such as exponent, factor, and exception count, in the data/include/double_columns.hpp file. Perhaps I have to compute them to be able to do that.
active_power.csv
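
For completeness, this is roughly how I produce the binary input from the CSV (a minimal sketch; the file names are illustrative, and I assume the benchmark consumes raw, native-endian IEEE 754 doubles, one value per CSV row with no header):

```cpp
// Rough sketch of the CSV -> raw binary conversion described above.
// Assumptions: one numeric value per line, no header row, and the benchmark
// reads raw native-endian IEEE 754 doubles. File names are illustrative.
#include <fstream>
#include <string>

int main() {
    std::ifstream csv("active_power.csv");
    std::ofstream bin("active_power.bin", std::ios::binary);
    std::string line;
    while (std::getline(csv, line)) {
        if (line.empty()) { continue; }  // skip blank lines
        double v = std::stod(line);      // parse one value per row
        bin.write(reinterpret_cast<const char*>(&v), sizeof v);
    }
}
```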

@azimafroozeh
Collaborator

Thank you for providing the dataset. I have submitted a PR (#9) that includes your dataset in our testing framework.

So, there is no issue with ALP itself. ALP consists of two schemes, ALP and ALP_RD, which are selected adaptively based on the data. My guess is that you used tests designed for the ALP scheme itself, whereas your dataset is suited for ALP_RD. You may want to check the tests at:

TEST_F(alp_test, test_alprd_on_whole_datasets)
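
For illustration, here is a skeleton of what such a test looks like; the encode/decode steps are placeholders, not ALP's actual API, so please refer to the real fixture in the test suite:

```cpp
// Hypothetical GoogleTest skeleton mirroring the ALP_RD whole-dataset tests.
// The encode/decode steps are placeholders, NOT ALP's actual API; see the
// real TEST_F(alp_test, test_alprd_on_whole_datasets) fixture in the repo.
#include <gtest/gtest.h>
#include <vector>

class alp_test : public ::testing::Test {};

TEST_F(alp_test, test_alprd_on_my_dataset) {
    std::vector<double> column {10.334324352532, 7.2189, 3.14159265358979};
    // 1) encode the column with the ALP_RD scheme   (placeholder)
    // 2) decode it back                             (placeholder)
    std::vector<double> decoded = column;  // stands in for decode(encode(column))
    // 3) assert an exact, bit-for-bit round trip:
    ASSERT_EQ(decoded, column);
}
```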

@aabduvakhobov
Author

aabduvakhobov commented Sep 12, 2024

Thanks a lot for your significant effort! I can see that the rest of my datasets are also suited for ALP_RD.

  1. Since you also used my dataset, can I ask why it falls back to ALP_RD?
  2. My main intention in benchmarking ALP was to implement it as a compression scheme for 32-bit float values, but I see that in your tests you used ALP_RD for 32-bit floats. What is your opinion on how practical ALP is for 32-bit floats?
  3. Is it safe to run the [de]compression speed benchmark without validating the exceptions and bit_width, as you do with the existing datasets?
  4. The dataset I shared with you comes from real-life turbines, was published as part of an EDBT 2024 paper, and is available on GitHub if you would like to use it for further experiments :)

@azimafroozeh
Collaborator

Thanks for your questions!

  1. ALP works best for doubles representing decimal-like numbers rather than real doubles, which use every bit for precision. For example, 10.3 is a decimal-like double, while 10.334324352532 is a real double. ALP compresses decimal-like doubles by losslessly converting them to integers (see the sketch below this list). ALP_RD, on the other hand, is designed for real doubles, where precision is extremely high and there is little similarity in the right bits. Your dataset consists of real doubles, making ALP_RD the better choice. The fallback occurs because ALP first tries to encode a sample of the data and switches to ALP_RD when no compression opportunities are found. You can find more detailed explanations in our paper here.

  2. The practicality of ALP for 32-bit floats depends on the dataset. In our experiments with real datasets, we found that most floats behave like real floats (similar to real doubles), where ALP_RD is more effective. However, if your floats resemble decimal-like values, ALP could indeed work well. We are considering adding support for ALP for floats as well.

  3. When benchmarking [de]compression speeds, there is no need to validate exceptions or bit widths, as long as the decoded values match the original values after encoding and decoding. Our internal checks are mainly there to ensure that ALP behaves as expected, rather than to validate the final results.

  4. Thank you for sharing the EDBT 2024 paper and datasets. We will include them in our testing.
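
To make point 1 concrete, here is a minimal sketch of the round-trip test behind the decimal/real distinction (simplified: real ALP samples (exponent, factor) combinations on the data and patches non-conforming values as exceptions):

```cpp
// Simplified sketch of ALP's "decimal-like" criterion: a double d survives
// d -> round(d * 10^e) -> integer -> / 10^e losslessly only if it encodes a
// decimal-like value. Real ALP samples (exponent, factor) pairs and patches
// misses as exceptions; this shows only the round-trip idea.
#include <cmath>
#include <cstdio>

static bool decimal_like(double d, int e) {
    double scaled = d * std::pow(10.0, e);
    if (std::fabs(scaled) > 9007199254740992.0) { return false; }  // > 2^53
    double rounded = std::round(scaled);
    return rounded / std::pow(10.0, e) == d;  // exact round trip?
}

int main() {
    std::printf("10.3            -> %d\n", (int)decimal_like(10.3, 1));             // 1
    std::printf("10.334324352532 -> %d\n", (int)decimal_like(10.334324352532, 1));  // 0
}
```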

Feel free to reach out if you have more questions!

@aabduvakhobov
Author

Thanks a lot for your detailed response; I hope to see ALP for floats soon.
Wishing you the best of luck in the future!
Cheers.

@peterboncz

peterboncz commented Sep 12, 2024 via email

@aabduvakhobov
Author

aabduvakhobov commented Sep 12, 2024

Hi, thanks for the idea! We also had the idea of trimming unnecessary decimals for error-bounded lossy compression and then compressing the result with the latest lightweight encoding algorithms; a rough sketch is below. ALP seems to fit this scenario very well.
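
As a sketch of what I mean (a hypothetical helper with an absolute error bound; none of this is ALP API):

```cpp
// Hypothetical sketch of the idea above: quantize each value to the fewest
// decimal digits that stay within an absolute error bound, so the result
// becomes "decimal-like" and a scheme like ALP can compress it losslessly.
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<double> quantize(const std::vector<double>& in, double max_abs_err) {
    std::vector<double> out;
    out.reserve(in.size());
    for (double v : in) {
        double q = v;
        for (int e = 0; e <= 16; ++e) {  // try the coarsest precision first
            double p = std::pow(10.0, e);
            q = std::round(v * p) / p;
            if (std::fabs(q - v) <= max_abs_err) { break; }
        }
        out.push_back(q);
    }
    return out;
}

int main() {
    for (double q : quantize({10.334324352532, 7.21893}, 0.005)) {
        std::printf("%.6f\n", q);  // 10.330000, 7.220000
    }
}
```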

@azimafroozeh
Collaborator

Hi, we have added all the primitives for the float implementation of ALP. Please check here. Moreover, you can use the FastLanes double encoder to encode the double data type with ALP. FastLanes will handle the low-level details of ALP and provide a better API. Please check an example of the FastLanes double encoder here.

@aabduvakhobov
Author

Hi, this is very helpful indeed. I am closing the issue; many thanks for your help!
