Cannot use my own datasets with ALP Benchmark #8

Closed
aabduvakhobov opened this issue Sep 11, 2024 · 12 comments

@aabduvakhobov

Hi,

I am trying to use the existing benchmark in the ALP repository with my own datasets. I followed the documentation to plug in my datasets, and all of the methods work except ALP. The new datasets were preprocessed using the same method as in datasets_transformer.ipynb, and I made sure my data does not contain null, NaN, inf, or -inf values. My datasets are also larger than some of the existing datasets in the benchmark.

Thank you!

@azimafroozeh
Collaborator

Thanks for your question! Could you please provide a CSV file containing a portion of your dataset, or give more details about the error you're encountering? I'd be happy to take a closer look. Also, to clarify: null, NaN, inf, and -inf values should not cause any issues, as ALP treats them as exceptions.

@aabduvakhobov
Author

Thanks for the prompt response, and sorry for not elaborating on the issue!
It is a segmentation fault, with the following error message when I run it under gdb:

alp::AlpEncode<double>::find_best_exponent_factor_from_combinations (top_combinations=std::vector of length 0, capacity 0, top_k=5 '\005', input_vector=0x55555562dbe0, input_vector_size=1024, factor=@0x7fffffffd2d1: 9 '\t', exponent=@0x7fffffffd2d0: 14 '\016') at /home/abduvoris/Paper-2-Experiments/Lossless-model-types/ALP/include/alp/encode.hpp:208
208				const int exp_idx    = top_combinations[k].first;

I assume my datasets fail the first sampling step, so an empty array of exponent and factor pairs is returned.
My datasets are under NDA, but I will try to replicate the issue with the open dataset that I have.
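
For reference, here is a minimal standalone reproduction of that crash pattern (illustrative only, not ALP code): the frame at encode.hpp:208 reads top_combinations[k].first from a vector of length 0, which is undefined behavior, and an emptiness guard would turn the segfault into a handled condition.

```cpp
// Minimal reproduction of the crash pattern (illustrative, not ALP code):
// indexing [k] into an empty vector, as the gdb frame shows with
// "std::vector of length 0", is undefined behavior.
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<int, int>> top_combinations;  // empty, as in the trace
    std::size_t k = 0;
    if (top_combinations.empty()) {  // guard that would avoid the segfault
        std::fprintf(stderr, "sampling produced no (exponent, factor) pairs\n");
        return 1;
    }
    return top_combinations[k].first;
}
```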

@azimafroozeh
Collaborator

No worries at all! It would be ideal if you could replicate the issue with an open dataset. I'll be happy to look into it further once you have the dataset ready. Looking forward to hearing back from you!

@aabduvakhobov
Author

Thanks! You can find the sample CSV dataset attached below; I converted it to text format to be able to attach it here. In the experiments, I feed it in as a binary file. So far I am running the benchmark only for compression ratio, so I am not specifying the other parameters, such as exponent, factor, and exception count, in the data/include/double_columns.hpp file. Perhaps I have to compute them to be able to do that.
active_power.csv
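
For completeness, this is roughly how I produce the binary input from the CSV (a minimal sketch; the file names are illustrative, and I assume the benchmark consumes raw, native-endian IEEE 754 doubles, one value per CSV row with no header):

```cpp
// Rough sketch of the CSV -> raw binary conversion described above.
// Assumptions: one numeric value per line, no header row, and the benchmark
// reads raw native-endian IEEE 754 doubles. File names are illustrative.
#include <fstream>
#include <string>

int main() {
    std::ifstream csv("active_power.csv");
    std::ofstream bin("active_power.bin", std::ios::binary);
    std::string line;
    while (std::getline(csv, line)) {
        if (line.empty()) { continue; }  // skip blank lines
        double v = std::stod(line);      // parse one value per row
        bin.write(reinterpret_cast<const char*>(&v), sizeof v);
    }
}
```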

@azimafroozeh
Collaborator

Thank you for providing the dataset. I have submitted a PR (#9) that includes your dataset in our testing framework.

So, there is no issue with ALP itself. ALP consists of two schemes, ALP and ALP_RD, which are selected adaptively based on the data. My guess is that you used tests designed for the ALP scheme itself, whereas your dataset is suited for ALP_RD. You may want to check the tests at:

TEST_F(alp_test, test_alprd_on_whole_datasets)
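
For illustration, here is a skeleton of what such a test looks like; the encode/decode steps are placeholders, not ALP's actual API, so please refer to the real fixture in the test suite:

```cpp
// Hypothetical GoogleTest skeleton mirroring the ALP_RD whole-dataset tests.
// The encode/decode steps are placeholders, NOT ALP's actual API; see the
// real TEST_F(alp_test, test_alprd_on_whole_datasets) fixture in the repo.
#include <gtest/gtest.h>
#include <vector>

class alp_test : public ::testing::Test {};

TEST_F(alp_test, test_alprd_on_my_dataset) {
    std::vector<double> column {10.334324352532, 7.2189, 3.14159265358979};
    // 1) encode the column with the ALP_RD scheme   (placeholder)
    // 2) decode it back                             (placeholder)
    std::vector<double> decoded = column;  // stands in for decode(encode(column))
    // 3) assert an exact, bit-for-bit round trip:
    ASSERT_EQ(decoded, column);
}
```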

@aabduvakhobov
Author

aabduvakhobov commented Sep 12, 2024

Thanks a lot for your significant effort! I can see that the rest of my datasets are also suited for ALP_RD.

  1. Since you also used my dataset, can I ask why it falls back to ALP_RD?
  2. My main intention in benchmarking ALP was to implement it as a compression scheme for 32-bit float values, but I see that in your tests you used ALP_RD for 32-bit floats. What is your opinion on how practical ALP is for 32-bit floats?
  3. Is it safe to run the [de]compression speed benchmark without validating the exceptions and bit_width, as you do with the existing datasets?
  4. The dataset I shared with you comes from real-life turbines, was published as part of an EDBT 2024 paper, and is available on GitHub if you would like to use it for further experiments :)

@azimafroozeh
Collaborator

Thanks for your questions!

  1. ALP works best for doubles representing decimal-like numbers rather than real doubles, which use every bit for precision. For example, 10.3 is a decimal-like double, while 10.334324352532 is a real double. ALP compresses decimal-like doubles by losslessly converting them to integers (see the sketch below this list). ALP_RD, on the other hand, is designed for real doubles, where precision is extremely high and there is little similarity in the right bits. Your dataset consists of real doubles, making ALP_RD the better choice. The fallback occurs because ALP first tries to encode a sample of the data and switches to ALP_RD when no compression opportunities are found. You can find more detailed explanations in our paper here.

  2. The practicality of ALP for 32-bit floats depends on the dataset. In our experiments with real datasets, we found that most floats behave like real floats (similar to real doubles), where ALP_RD is more effective. However, if your floats resemble decimal-like values, ALP could indeed work well. We are considering adding support for ALP for floats as well.

  3. When benchmarking [de]compression speeds, there is no need to validate exceptions or bit widths, as long as the decoded values match the original values after encoding and decoding. Our internal checks are mainly there to ensure that ALP behaves as expected, rather than to validate the final results.

  4. Thank you for sharing the EDBT 2024 paper and datasets. We will include them in our testing.
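
To make point 1 concrete, here is a minimal sketch of the round-trip test behind the decimal/real distinction (simplified: real ALP samples (exponent, factor) combinations on the data and patches non-conforming values as exceptions):

```cpp
// Simplified sketch of ALP's "decimal-like" criterion: a double d survives
// d -> round(d * 10^e) -> integer -> / 10^e losslessly only if it encodes a
// decimal-like value. Real ALP samples (exponent, factor) pairs and patches
// misses as exceptions; this shows only the round-trip idea.
#include <cmath>
#include <cstdio>

static bool decimal_like(double d, int e) {
    double scaled = d * std::pow(10.0, e);
    if (std::fabs(scaled) > 9007199254740992.0) { return false; }  // > 2^53
    double rounded = std::round(scaled);
    return rounded / std::pow(10.0, e) == d;  // exact round trip?
}

int main() {
    std::printf("10.3            -> %d\n", (int)decimal_like(10.3, 1));             // 1
    std::printf("10.334324352532 -> %d\n", (int)decimal_like(10.334324352532, 1));  // 0
}
```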

Feel free to reach out if you have more questions!

@aabduvakhobov
Author

Thanks a lot for your detailed response; I hope to see ALP for floats soon.
Wishing you the best of luck in the future!
Cheers.

@peterboncz

peterboncz commented Sep 12, 2024 via email

@aabduvakhobov
Author

aabduvakhobov commented Sep 12, 2024

Hi, thanks for the idea! We also had the idea of trimming unnecessary decimals for error-bounded lossy compression and then compressing the result with the latest lightweight encoding algorithms; a rough sketch is below. ALP seems to fit this scenario very well.
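
As a sketch of what I mean (a hypothetical helper with an absolute error bound; none of this is ALP API):

```cpp
// Hypothetical sketch of the idea above: quantize each value to the fewest
// decimal digits that stay within an absolute error bound, so the result
// becomes "decimal-like" and a scheme like ALP can compress it losslessly.
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<double> quantize(const std::vector<double>& in, double max_abs_err) {
    std::vector<double> out;
    out.reserve(in.size());
    for (double v : in) {
        double q = v;
        for (int e = 0; e <= 16; ++e) {  // try the coarsest precision first
            double p = std::pow(10.0, e);
            q = std::round(v * p) / p;
            if (std::fabs(q - v) <= max_abs_err) { break; }
        }
        out.push_back(q);
    }
    return out;
}

int main() {
    for (double q : quantize({10.334324352532, 7.21893}, 0.005)) {
        std::printf("%.6f\n", q);  // 10.330000, 7.220000
    }
}
```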

@azimafroozeh
Collaborator

Hi, we have added all the primitives for the float implementation of ALP. Please check here. Moreover, you can use the FastLanes double encoder to encode the double data type with ALP. FastLanes will handle the low-level details of ALP and provide a better API. Please check an example of the FastLanes double encoder here.

@aabduvakhobov
Author

Hi, this is very helpful indeed. I am closing the issue; many thanks for your help!
