
Create Ishan Darji GSoC Blog #1587

Merged: 4 commits merged on Sep 23, 2024

Conversation

@Ish-D (Contributor) commented Sep 17, 2024

Final blog for HSF 2024 GSoC project "Lossless compression of raw data for the ATLAS experiment at CERN".

@Ish-D (Contributor, author) commented Sep 17, 2024

I do not seem to have the permission to add reviewers, but pinging my mentor @maszyman for visibility.

@maszyman (Contributor) left a comment:

Thanks a lot @Ish-D for this nice blog post (and your work in general)!

Please have a look at a couple of my comments.

_gsocblogs/2024/blog_ATLCompression_IshanDarji.md (outdated comment, resolved)

To assist with the former, I made use of [lzbench](https://github.com/inikep/lzbench), a tool created to benchmark a variety of different LZ77/LZSS/LZMA compression libraries, with the command ``` lzbench -ebrotli/zstd/lzlib/xz/zlib/lz4 decompressed.data``` where decompressed.data is an example ATLAS raw data file (in this case, the data displayed below is gathered from /cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/Tier0ChainTests/TCT_Run3/data22_13p6TeV.00431493.physics_Main.daq.RAW._lb0525._SFO-16._0001.data, although I also tested with varying files to very similar results). All of the libraries and forms of compression use very similar methods to compress and decompress data, although each library will have varying quirks and capabilities that may be better or worse for our purposes here.

Below is a table of data for compressors that were thought to be relevant to look at. One thing that is important to note is that much of the ATLAS experiment data has previously used zlib with compression level 1 for compression, so it will be treated as something of a baseline and the remaining libraries will be compared to it. Below are tables of libraries that were thought to be particularly notable, for each, we take note of the library used and what compression level, the ratio ((compressed size / original size) * 100), as well as compression and decompression speed.
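For concreteness, here is a minimal sketch of the ratio definition used in these tables, computed as (compressed size / original size) * 100 so that lower is better; the file sizes below are made up purely for illustration and are not measured values.

```cpp
// Illustrative only: compute the "ratio" column as defined above,
// i.e. (compressed size / original size) * 100, so lower is better.
#include <cstddef>
#include <cstdio>

double ratioPercent(std::size_t compressedBytes, std::size_t originalBytes) {
    return 100.0 * static_cast<double>(compressedBytes) / static_cast<double>(originalBytes);
}

int main() {
    // Hypothetical example: a 1,000,000,000-byte RAW file compressed to 350,000,000 bytes.
    std::printf("ratio = %.1f%%\n", ratioPercent(350000000, 1000000000)); // prints "ratio = 35.0%"
}
```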

Suggested change
Below is a table of data for compressors that were thought to be relevant to look at. One thing that is important to note is that much of the ATLAS experiment data has previously used zlib with compression level 1 for compression, so it will be treated as something of a baseline and the remaining libraries will be compared to it. Below are tables of libraries that were thought to be particularly notable, for each, we take note of the library used and what compression level, the ratio ((compressed size / original size) * 100), as well as compression and decompression speed.
Below is a table of data for compressors that were thought to be relevant to look at. One thing that is important to note is that much of the ATLAS experiment data has previously used `zlib` with compression level 1 for compression, so it will be treated as something of a baseline and the remaining libraries will be compared to it. Below are tables of libraries that were thought to be particularly notable, for each, we take note of the library used and what compression level, the ratio ((compressed size / original size) * 100), as well as compression and decompression speed.

That's minor, but arguably the text would be easier to read if you put names like zlib, zstd, std::unique_ptr, putPrecompressedData, dataReader etc. in backticks.


The next step was to go through ATLCopyBSEvent.cpp and modernize it, then integrate zstd. I started this process by stripping back the file such that it could only perform the basic operations that it needed: reading, writing, and copying files. Much of the code was removed with the capability to read and write from collections because it was deemed non-essential. Then, starting with this bare-bones version, I ran the program on some existing files that were pre-compressed to various levels, ensuring that it would still be possible to read the files in their varying states and write them out. Once this was ensured, I moved on to modernizing the code. In its old state, there was lots of room for misuse of objects with lots of raw C pointers being passed around, as well as lots of cases of small errors such as passing the raw pointer obtained from std::unique_ptr<T>.get() to a function, rather than moving it. Fixing these issues made the code appear much cleaner, and also meant that we got rid of any potential memory-related issues that could've arisen if the system were used in a larger context. I also added clang-format, to automatically format the code for consistency with the rest of the Atlas codebase.
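As an aside, the kind of ownership cleanup described above might look like the minimal sketch below (the type and function names are hypothetical stand-ins, not the actual ATLAS code): transfer the `std::unique_ptr` itself instead of handing out the raw pointer from `.get()`.

```cpp
#include <memory>
#include <utility>

struct EventFragment { /* payload omitted; hypothetical stand-in type */ };

// Before: ownership is ambiguous; callers tended to pass fragment.get().
void writeFragmentRaw(EventFragment* fragment);

// After: the callee takes ownership explicitly, so callers must std::move the pointer.
void writeFragment(std::unique_ptr<EventFragment> fragment) {
    // ... write *fragment; it is released automatically when this scope ends.
}

int main() {
    auto fragment = std::make_unique<EventFragment>();
    writeFragment(std::move(fragment));  // ownership transferred; no manual delete needed
}
```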

The final step was then to integrate zstd into the system. I had previously moved the logic to write out data into a C++ lambda such that it could more easily be reused. At this step, we have a fragment of data retrieved from our dataReader, I used the simple zstd compression API to take this pre-allocated fragment of data, and then write it out as pre-compressed data with our dataWriter. It is important to use the function putPrecompressedData() rather than putData(), because looking at the internal implementations of the two, we can see that they are wrappers around the same internal function with different settings for whether or not we want to compress the data using zlib. Using putData would result in a file that has been compressed twice which would not be useable in our system. This method of compression is also convenient because it means that individual events remain separate, with all of their metadata being written out as usual. With the decompression step being inserted as the aforementioned fragments are read in by the dataReader, we can decompress each event without having to worry about decompressing and recompressing the full file.
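For illustration, a minimal sketch of what the per-fragment round trip could look like with the `zstd` simple API is shown below; the `dataReader`/`dataWriter` calls and `putPrecompressedData()` are only referenced in comments, since their exact interfaces are not shown in this post, and level 8 is used only because the benchmarks above pair it with zlib level 1.

```cpp
#include <zstd.h>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Compress one event fragment obtained from the reader.
// The result would be handed to the writer's putPrecompressedData(), not putData().
std::vector<uint8_t> compressFragment(const std::vector<uint8_t>& fragment, int level = 8) {
    std::vector<uint8_t> out(ZSTD_compressBound(fragment.size()));
    const size_t written = ZSTD_compress(out.data(), out.size(),
                                         fragment.data(), fragment.size(), level);
    if (ZSTD_isError(written))
        throw std::runtime_error(ZSTD_getErrorName(written));
    out.resize(written);
    return out;
}

// Decompress one fragment as it is read back, so events stay individually accessible.
std::vector<uint8_t> decompressFragment(const std::vector<uint8_t>& compressed) {
    const unsigned long long size =
        ZSTD_getFrameContentSize(compressed.data(), compressed.size());
    if (size == ZSTD_CONTENTSIZE_ERROR || size == ZSTD_CONTENTSIZE_UNKNOWN)
        throw std::runtime_error("not a zstd frame with a known content size");
    std::vector<uint8_t> out(static_cast<size_t>(size));
    const size_t got = ZSTD_decompress(out.data(), out.size(),
                                       compressed.data(), compressed.size());
    if (ZSTD_isError(got))
        throw std::runtime_error(ZSTD_getErrorName(got));
    out.resize(got);
    return out;  // one event fragment, ready for the existing read path
}
```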

Suggested change
The final step was then to integrate zstd into the system. I had previously moved the logic to write out data into a C++ lambda such that it could more easily be reused. At this step, we have a fragment of data retrieved from our dataReader, I used the simple zstd compression API to take this pre-allocated fragment of data, and then write it out as pre-compressed data with our dataWriter. It is important to use the function putPrecompressedData() rather than putData(), because looking at the internal implementations of the two, we can see that they are wrappers around the same internal function with different settings for whether or not we want to compress the data using zlib. Using putData would result in a file that has been compressed twice which would not be useable in our system. This method of compression is also convenient because it means that individual events remain separate, with all of their metadata being written out as usual. With the decompression step being inserted as the aforementioned fragments are read in by the dataReader, we can decompress each event without having to worry about decompressing and recompressing the full file.
The final step was then to integrate zstd into the system. I had previously moved the logic to write out data into a C++ lambda such that it could more easily be reused. At this step, we have a fragment of data retrieved from our dataReader, I used the simple zstd compression API to take this pre-allocated fragment of data, and then write it out as pre-compressed data with our dataWriter. It is important to use the function putPrecompressedData() rather than putData(), because looking at the internal implementations of the two, we can see that they are wrappers around the same internal function with different settings for whether or not we want to compress the data using zlib. Using putData would result in a file that has been compressed twice which would not be usable in our system. This method of compression is also convenient because it means that individual events remain separate, with all of their metadata being written out as usual. With the decompression step being inserted as the aforementioned fragments are read in by the dataReader, we can decompress each event without having to worry about decompressing and recompressing the full file.


Below we plot compression speed against compression ratio, generated from the above data in this [colab](https://colab.research.google.com/drive/12uCIDyMJEIvKDe3O2V4KeugHHjqZYKlV?usp=sharing):

![ratio vs speed](https://github.com/user-attachments/assets/68119bd4-daa7-406e-9787-fc1ab7d3110a)

I know that you defined the ratio in the text, but for people looking only at the plots it may be confusing what you mean by compression ratio (recall one of the questions from your presentation at the ANL meeting). Perhaps you could add a clarification somewhere that you mean the percentage of the original size?

# Lossless compression of raw data for the ATLAS experiment at CERN

### Abstract
The goal of this project is to study the performance and effectiveness of various compression algorithms, specifically on ATLAS RAW data. The ATLAS experiment produces extremely large amounts of data, and it is only expected to increase with future planned upgrades within the LHC. Prior studies into compression of the data has shown that due to the highly redundant nature of the generated data, lossless data compression algorithms are extremely effective in reducing the binary size of ATLAS data. Here, we would like to find an algorithm that has a good balance of compression time, and compressed binary size.

I would add here that this project can guide important decisions regarding the compression scheme to be implemented in the ATLAS experiment. Also mention the fact that ATLAS raw data must be preserved in its original form for an indefinite time. I would also stress that improvement of the compression ratio is the priority, while a speed-up would be a nice advantage.


To assist with the former, I made use of [lzbench](https://github.com/inikep/lzbench), a tool created to benchmark a variety of different LZ77/LZSS/LZMA compression libraries, with the command ``` lzbench -ebrotli/zstd/lzlib/xz/zlib/lz4 decompressed.data``` where decompressed.data is an example ATLAS raw data file (in this case, the data displayed below is gathered from /cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/Tier0ChainTests/TCT_Run3/data22_13p6TeV.00431493.physics_Main.daq.RAW._lb0525._SFO-16._0001.data, although I also tested with varying files to very similar results). All of the libraries and forms of compression use very similar methods to compress and decompress data, although each library will have varying quirks and capabilities that may be better or worse for our purposes here.

Below is a table of data for compressors that were thought to be relevant to look at. One thing that is important to note is that much of the ATLAS experiment data has previously used zlib with compression level 1 for compression, so it will be treated as something of a baseline and the remaining libraries will be compared to it. Below are tables of libraries that were thought to be particularly notable, for each, we take note of the library used and what compression level, the ratio ((compressed size / original size) * 100), as well as compression and decompression speed.

You may want to add the references to papers/presentations that I shared with you. Either inline (like here when you write about zlib usage in ATLAS) or collectively at the end.


Given the data collected, we chose to use zstd as our compression algorithm. This decision came for many reasons and was primarily determined by the analysis above. The goal for this project was to find a faster replacement for the currently in-place Zlib compression system, with particular importance on compression ratio rather than compression speed. The reason for this is that when we look at the scale of data generated by the ATLAS experiment, a reduction of even a few percentages makes a large difference when the filesizes are upwards of exabytes. The two most relevant graphs for this comparison are the two smaller ones looking at zlib, zstd, and brotli at fixed metrics. For example, we see that if we wanted to keep the compression ratio of the system the same, switching from zlib to zstd would still yield improvements of roughly 7 times in compression speed, and 3 times in decompression speed. However, because we are looking to keep the compute resources required roughly the same, we instead hold the compression speed as close to fixed as possible. This gives us the first of the two smaller tables above where we see that, in terms of compression speed, zlib at level 1, zstd at level 8, and brotli at level 4 are very roughly equal. We now see slight differences in the compression ratio, where brotli offers the best compression, followed closely by zstd, and zlib trails behind by a slightly more notable portion.

Given the above comparisons, a natural question may be, why ztd? Brotli at first glance compresses more efficiently than zstd, and since it is the best metric why not choose that? There were a variety of factors that went into this decision: for one, zstd performs as a very good all-rounder. It compresses slightly faster than brotli and decompresses much faster, sacrificing only 0.5% of the original filesize in compression ratio for this win. Additionally, it offers a fairly wide range of potential settings in the advanced API, including using dictionaries to train better compression and streaming compression for files where we gain the input in fragments at a time. This opens up the possibility of future exploration that I was not able to look at this summer. Finally, zstd is guaranteed to be stable. The format used by zstd is very stable and well documented, the project is fully open source, it is guaranteed to be supported in the future and sees frequent maintenance/updates, and it is already pre-bundled in many popular Linux distributions. The last point was particularly important for CERN, where these files are expected to be compressed and kept for long periods of time till they are decompressed and used at some point in the future, so stability is a must.

You may want to cite the RFC document of Zstandard and its usage in the Linux kernel and the ROOT framework to support its stability and safety.


Here we see that if we are looking to get a similar compression ratio, zlib performs the slowest by far at close to a fourth of the speed of brotli, and zstd reaches almost two times brotli's speed. If we are looking to get similar compression times, while we do not have as close of a match as compression ratio, both zstd and brotli result in a reduction of a few percent of the original filesize.

Given the data collected, we chose to use zstd as our compression algorithm. This decision came for many reasons and was primarily determined by the analysis above. The goal for this project was to find a faster replacement for the currently in-place Zlib compression system, with particular importance on compression ratio rather than compression speed. The reason for this is that when we look at the scale of data generated by the ATLAS experiment, a reduction of even a few percentages makes a large difference when the filesizes are upwards of exabytes. The two most relevant graphs for this comparison are the two smaller ones looking at zlib, zstd, and brotli at fixed metrics. For example, we see that if we wanted to keep the compression ratio of the system the same, switching from zlib to zstd would still yield improvements of roughly 7 times in compression speed, and 3 times in decompression speed. However, because we are looking to keep the compute resources required roughly the same, we instead hold the compression speed as close to fixed as possible. This gives us the first of the two smaller tables above where we see that, in terms of compression speed, zlib at level 1, zstd at level 8, and brotli at level 4 are very roughly equal. We now see slight differences in the compression ratio, where brotli offers the best compression, followed closely by zstd, and zlib trails behind by a slightly more notable portion.

I would add a statement that this study suggests improvements of ~6% in compression ratio at the same CPU cost compared to the current method. You may want to add that this is significant, as even modest gains can translate into massive savings in storage costs at exabyte scale.

The final step was then to integrate zstd into the system. I had previously moved the logic to write out data into a C++ lambda such that it could more easily be reused. At this step, we have a fragment of data retrieved from our dataReader, I used the simple zstd compression API to take this pre-allocated fragment of data, and then write it out as pre-compressed data with our dataWriter. It is important to use the function putPrecompressedData() rather than putData(), because looking at the internal implementations of the two, we can see that they are wrappers around the same internal function with different settings for whether or not we want to compress the data using zlib. Using putData would result in a file that has been compressed twice which would not be useable in our system. This method of compression is also convenient because it means that individual events remain separate, with all of their metadata being written out as usual. With the decompression step being inserted as the aforementioned fragments are read in by the dataReader, we can decompress each event without having to worry about decompressing and recompressing the full file.

### Conclusion
All in all, I had a great Summer working with CERN to investigate data compression methods for the ATLAS experiment. It was a very good learning experience, although daunting at times, having the opportunity to work in a large existing codebase on a project with real-world impact. I hope that my research over the Summer will prove useful for future compression-related endeavors at CERN. I would like to extend thanks to my mentor, Maciej Szymanski for this opportunity, as well as all of his very much needed help over the Summer. I definitely could not have completed the project without his guidance and valuable mentorship.

Please also add acknowledgements to Argonne National Lab :-) (since ANL is a participating organization in GSoC, and is the lab of your mentors working on ATLAS@CERN).

### Conclusion
All in all, I had a great Summer working with CERN to investigate data compression methods for the ATLAS experiment. It was a very good learning experience, although daunting at times, having the opportunity to work in a large existing codebase on a project with real-world impact. I hope that my research over the Summer will prove useful for future compression-related endeavors at CERN. I would like to extend thanks to my mentor, Maciej Szymanski for this opportunity, as well as all of his very much needed help over the Summer. I definitely could not have completed the project without his guidance and valuable mentorship.

The final code for the project is available [here](https://github.com/Ish-D/AtlEventProcess)

Suggested change
The final code for the project is available [here](https://github.com/Ish-D/AtlEventProcess)
The final code for the project is available [here](https://github.com/Ish-D/AtlEventProcess).

Ish-D and others added 2 commits September 18, 2024 01:09
@Ish-D (Contributor, author) commented Sep 18, 2024

Updated blog with suggested changes, @maszyman .

@maszyman (Contributor) left a comment:

Thanks @Ish-D for the changes. I've left a few more (minor) comments. Please have a look.

The final step was then to integrate `zstd` into the system. I had previously moved the logic to write out data into a C++ lambda such that it could more easily be reused. At this step, we have a fragment of data retrieved from our `dataReader`, I used the simple `zstd` compression API to take this pre-allocated fragment of data, and then write it out as pre-compressed data with our `dataWriter`. It is important to use the function `putPrecompressedData()` rather than `putData()`, because looking at the internal implementations of the two, we can see that they are wrappers around the same internal function with different settings for whether or not we want to compress the data using `zlib`. Using `putData` would result in a file that has been compressed twice which would not be usable in our system. This method of compression is also convenient because it means that individual events remain separate, with all of their metadata being written out as usual. With the decompression step being inserted as the aforementioned fragments are read in by the `dataReader`, we can decompress each event without having to worry about decompressing and recompressing the full file.

### Conclusion
All in all, I had a great Summer working with CERN and Argonne National Labto investigate data compression methods for the ATLAS experiment. It was a very good learning experience, although daunting at times, having the opportunity to work in a large existing codebase on a project with real-world impact. I hope that my research over the Summer will prove useful for future compression-related endeavors at the Large Hadron Collider. I would like to extend thanks to my mentor, Maciej Szymanski for this opportunity, as well as all of his very much needed help over the Summer. I definitely could not have completed the project without his guidance and valuable mentorship.

Suggested change
All in all, I had a great Summer working with CERN and Argonne National Labto investigate data compression methods for the ATLAS experiment. It was a very good learning experience, although daunting at times, having the opportunity to work in a large existing codebase on a project with real-world impact. I hope that my research over the Summer will prove useful for future compression-related endeavors at the Large Hadron Collider. I would like to extend thanks to my mentor, Maciej Szymanski for this opportunity, as well as all of his very much needed help over the Summer. I definitely could not have completed the project without his guidance and valuable mentorship.
All in all, I had a great Summer working with CERN and Argonne National Lab to investigate data compression methods for the ATLAS experiment. It was a very good learning experience, although daunting at times, having the opportunity to work in a large existing codebase on a project with real-world impact. I hope that my research over the Summer will prove useful for future compression-related endeavors at the Large Hadron Collider. I would like to extend thanks to my mentor, Maciej Szymanski for this opportunity, as well as all of his very much needed help over the Summer. I definitely could not have completed the project without his guidance and valuable mentorship.

# Lossless compression of raw data for the ATLAS experiment at CERN

### Abstract
The goal of this project is to study the performance and effectiveness of various compression algorithms, specifically on ATLAS RAW data. The [ATLAS experiment](https://arxiv.org/abs/2404.06335) produces extremely large amounts of data, and it is only expected to increase with future planned upgrades within the LHC. Prior studies into compression of the data have shown that due to the highly redundant nature of the generated data, lossless data compression algorithms are extremely effective in reducing the binary size of ATLAS data. Here, we would like to find an algorithm that has a good balance of compression time, and compressed binary size, with the compression ratio being of much higher priority than the compression time due to the size of the data being compressed. One additional requirement is that the data must be held in its compressed state for an indefinite period until it is losslessly compressed at some point in the future. Ultimately, the research conducted in this project will be able to guide important decisions regarding the compression scheme implemented in the ATLAS experiment.

I suggest rephrasing

until it is losslessly compressed at some point in the future

for clarity. The thing is that in science reproducibility is a must, so we need to be able to read this data in the future.


![Ratio vs Speed](https://github.com/user-attachments/assets/a24500fa-392d-4600-9882-6176da1642eb)

The ideal library would have high compression speed and low compression ratio. We can see that `brotli` covers the widest range, giving us the option to have the slowest compression speed in exchange for the lowest compression ratio, but also the second-fastest speed with the highest ratio. It is also immediately noticeable that `zlib` consistently gives compression ratios that are higher than desirable for the speed being compressed. Both xz and `lzlib` perform very similarly, where they do not offer the same range of options that `brotli` and `zstd` do, and they only operate in the slower and lower compression ratio end of the spectrum, but they are an improvement over `brotli` and `zstd` within that range.

Suggested change
The ideal library would have high compression speed and low compression ratio. We can see that `brotli` covers the widest range, giving us the option to have the slowest compression speed in exchange for the lowest compression ratio, but also the second-fastest speed with the highest ratio. It is also immediately noticeable that `zlib` consistently gives compression ratios that are higher than desirable for the speed being compressed. Both xz and `lzlib` perform very similarly, where they do not offer the same range of options that `brotli` and `zstd` do, and they only operate in the slower and lower compression ratio end of the spectrum, but they are an improvement over `brotli` and `zstd` within that range.
The ideal library would have high compression speed and low compression ratio. We can see that `brotli` covers the widest range, giving us the option to have the slowest compression speed in exchange for the lowest compression ratio, but also the second-fastest speed with the highest ratio. It is also immediately noticeable that `zlib` consistently gives compression ratios that are higher than desirable for the speed being compressed. Both `xz` and `lzlib` perform very similarly, where they do not offer the same range of options that `brotli` and `zstd` do, and they only operate in the slower and lower compression ratio end of the spectrum, but they are an improvement over `brotli` and `zstd` within that range.


Given the above comparisons, a natural question may be, why ztd? `Brotli` at first glance compresses more efficiently than `zstd`, and since it is the best metric why not choose that? There were a variety of factors that went into this decision: for one, `zstd` performs as a very good all-rounder. It compresses slightly faster than `brotli` and decompresses much faster, sacrificing only 0.5% of the original filesize in compression ratio for this win. Additionally, it offers a fairly wide range of potential settings in the advanced API, including using dictionaries to train better compression and streaming compression for files where we gain the input in fragments at a time. This opens up the possibility of future exploration that I was not able to look at this summer. Finally, `zstd` is guaranteed to be stable. The format used by `zstd` is very stable and well documented, the project is fully open source, it is guaranteed to be supported in the future and sees frequent maintenance/updates, it is already pre-bundled in many popular Linux distributions, and its use in the Linux kernel and CERN ROOT framework are evidence of its ubiquitiousness. For more information about `ztd`'s stability, its [RFC document](https://datatracker.ietf.org/doc/html/rfc8878) provides additional information about how data is compressed, along with security and longevity considerations. The last point was particularly important for CERN, where these files are expected to be compressed and kept for long periods of time till they are decompressed and used at some point in the future, so stability is a must.

The next step was to go through ATLCopyBSEvent.cpp and modernize it, then integrate `zstd`. I started this process by stripping back the file such that it could only perform the basic operations that it needed: reading, writing, and copying files. Much of the code was removed with the capability to read and write from collections because it was deemed non-essential. Then, starting with this bare-bones version, I ran the program on some existing files that were pre-compressed to various levels, ensuring that it would still be possible to read the files in their varying states and write them out. Once this was ensured, I moved on to modernizing the code. In its old state, there was lots of room for misuse of objects with lots of raw C pointers being passed around, as well as lots of cases of small errors such as passing the raw pointer obtained from std::unique_ptr<T>.get() to a function, rather than moving it. Fixing these issues made the code appear much cleaner, and also meant that we got rid of any potential memory-related issues that could've arisen if the system were used in a larger context. I also added clang-format, to automatically format the code for consistency with the rest of the Atlas codebase.

Suggested change
The next step was to go through ATLCopyBSEvent.cpp and modernize it, then integrate `zstd`. I started this process by stripping back the file such that it could only perform the basic operations that it needed: reading, writing, and copying files. Much of the code was removed with the capability to read and write from collections because it was deemed non-essential. Then, starting with this bare-bones version, I ran the program on some existing files that were pre-compressed to various levels, ensuring that it would still be possible to read the files in their varying states and write them out. Once this was ensured, I moved on to modernizing the code. In its old state, there was lots of room for misuse of objects with lots of raw C pointers being passed around, as well as lots of cases of small errors such as passing the raw pointer obtained from std::unique_ptr<T>.get() to a function, rather than moving it. Fixing these issues made the code appear much cleaner, and also meant that we got rid of any potential memory-related issues that could've arisen if the system were used in a larger context. I also added clang-format, to automatically format the code for consistency with the rest of the Atlas codebase.
The next step was to go through `ATLCopyBSEvent`and modernize it, then integrate `zstd`. I started this process by stripping back the file such that it could only perform the basic operations that it needed: reading, writing, and copying files. Much of the code was removed with the capability to read and write from collections because it was deemed non-essential. Then, starting with this bare-bones version, I ran the program on some existing files that were pre-compressed to various levels, ensuring that it would still be possible to read the files in their varying states and write them out. Once this was ensured, I moved on to modernizing the code. In its old state, there was lots of room for misuse of objects with lots of raw C pointers being passed around, as well as lots of cases of small errors such as passing the raw pointer obtained from `std::unique_ptr<T>.get()` to a function, rather than moving it. Fixing these issues made the code appear much cleaner, and also meant that we got rid of any potential memory-related issues that could've arisen if the system were used in a larger context. I also added `clang-format`, to automatically format the code for consistency with the rest of the ATLAS codebase.


Here we see that if we are looking to get a similar compression ratio, `zlib` performs the slowest by far at close to a fourth of the speed of `brotli`, and `zstd` reaches almost two times `brotli`'s speed. If we are looking to get similar compression times, while we do not have as close of a match as compression ratio, both `zstd` and `brotli` result in a reduction of a few percent of the original filesize.

Given the data collected, we chose to use `zstd` as our compression algorithm. This decision came for many reasons and was primarily determined by the analysis above. The goal for this project was to find a faster replacement for the currently in-place `zlib` compression system, with particular importance on compression ratio rather than compression speed. The reason for this is that when we look at the scale of data generated by the ATLAS experiment, a reduction of even a few percentages makes a large difference when the filesizes are upwards of exabytes. The two most relevant graphs for this comparison are the two smaller ones looking at `zlib`, `zstd`, and `brotli` at fixed metrics. For example, we see that if we wanted to keep the compression ratio of the system the same, switching from `zlib` to `zstd` would still yield improvements of roughly 7 times in compression speed, and 3 times in decompression speed. However, because we are looking to keep the compute resources required roughly the same, we instead hold the compression speed as close to fixed as possible. This gives us the first of the two smaller tables above where we see that, in terms of compression speed, `zlib` at level 1, `zstd` at level 8, and `brotli` at level 4 are very roughly equal. We now see slight differences in the compression ratio, where `brotli` offers the best compression, followed closely by `zstd`, and `zlib` trails behind by a slightly more notable portion. This reported 6% decrease in compression ratio between `zstd` and `zlib` at the same CPU cost translates to massive savings when we consider that data is being compressed at exabyte scales.

This reported 6% decrease in compression ratio between zstd and zlib at the same CPU cost translates to massive savings when we consider that data is being compressed at exabyte scales.

decrease may not be clear to the reader. I would suggest something along those lines:

The study suggests that zstd offers 6% better compression ratio compared with zlib at the same CPU cost. It should be noted that 6% storage savings is significant at exabyte scale.


Given the data collected, we chose to use `zstd` as our compression algorithm. This decision came for many reasons and was primarily determined by the analysis above. The goal for this project was to find a faster replacement for the currently in-place `zlib` compression system, with particular importance on compression ratio rather than compression speed. The reason for this is that when we look at the scale of data generated by the ATLAS experiment, a reduction of even a few percentages makes a large difference when the filesizes are upwards of exabytes. The two most relevant graphs for this comparison are the two smaller ones looking at `zlib`, `zstd`, and `brotli` at fixed metrics. For example, we see that if we wanted to keep the compression ratio of the system the same, switching from `zlib` to `zstd` would still yield improvements of roughly 7 times in compression speed, and 3 times in decompression speed. However, because we are looking to keep the compute resources required roughly the same, we instead hold the compression speed as close to fixed as possible. This gives us the first of the two smaller tables above where we see that, in terms of compression speed, `zlib` at level 1, `zstd` at level 8, and `brotli` at level 4 are very roughly equal. We now see slight differences in the compression ratio, where `brotli` offers the best compression, followed closely by `zstd`, and `zlib` trails behind by a slightly more notable portion. This reported 6% decrease in compression ratio between `zstd` and `zlib` at the same CPU cost translates to massive savings when we consider that data is being compressed at exabyte scales.

Given the above comparisons, a natural question may be, why ztd? `Brotli` at first glance compresses more efficiently than `zstd`, and since it is the best metric why not choose that? There were a variety of factors that went into this decision: for one, `zstd` performs as a very good all-rounder. It compresses slightly faster than `brotli` and decompresses much faster, sacrificing only 0.5% of the original filesize in compression ratio for this win. Additionally, it offers a fairly wide range of potential settings in the advanced API, including using dictionaries to train better compression and streaming compression for files where we gain the input in fragments at a time. This opens up the possibility of future exploration that I was not able to look at this summer. Finally, `zstd` is guaranteed to be stable. The format used by `zstd` is very stable and well documented, the project is fully open source, it is guaranteed to be supported in the future and sees frequent maintenance/updates, it is already pre-bundled in many popular Linux distributions, and its use in the Linux kernel and CERN ROOT framework are evidence of its ubiquitiousness. For more information about `ztd`'s stability, its [RFC document](https://datatracker.ietf.org/doc/html/rfc8878) provides additional information about how data is compressed, along with security and longevity considerations. The last point was particularly important for CERN, where these files are expected to be compressed and kept for long periods of time till they are decompressed and used at some point in the future, so stability is a must.

Suggested change
Given the above comparisons, a natural question may be, why ztd? `Brotli` at first glance compresses more efficiently than `zstd`, and since it is the best metric why not choose that? There were a variety of factors that went into this decision: for one, `zstd` performs as a very good all-rounder. It compresses slightly faster than `brotli` and decompresses much faster, sacrificing only 0.5% of the original filesize in compression ratio for this win. Additionally, it offers a fairly wide range of potential settings in the advanced API, including using dictionaries to train better compression and streaming compression for files where we gain the input in fragments at a time. This opens up the possibility of future exploration that I was not able to look at this summer. Finally, `zstd` is guaranteed to be stable. The format used by `zstd` is very stable and well documented, the project is fully open source, it is guaranteed to be supported in the future and sees frequent maintenance/updates, it is already pre-bundled in many popular Linux distributions, and its use in the Linux kernel and CERN ROOT framework are evidence of its ubiquitiousness. For more information about `ztd`'s stability, its [RFC document](https://datatracker.ietf.org/doc/html/rfc8878) provides additional information about how data is compressed, along with security and longevity considerations. The last point was particularly important for CERN, where these files are expected to be compressed and kept for long periods of time till they are decompressed and used at some point in the future, so stability is a must.
Given the above comparisons, a natural question may be, why `zstd`? `Brotli` at first glance compresses more efficiently than `zstd`, and since it is the best metric why not choose that? There were a variety of factors that went into this decision: for one, `zstd` performs as a very good all-rounder. It compresses slightly faster than `brotli` and decompresses much faster, sacrificing only 0.5% of the original file size in compression ratio for this win. Additionally, it offers a fairly wide range of potential settings in the advanced API, including using dictionaries to train better compression and streaming compression for files where we gain the input in fragments at a time. This opens up the possibility of future exploration that I was not able to look at this summer. Finally, `zstd` is guaranteed to be stable. The format used by `zstd` is very stable and well documented, the project is fully open source, it is guaranteed to be supported in the future and sees frequent maintenance/updates, it is already pre-bundled in many popular Linux distributions, and its use in the Linux kernel and CERN ROOT framework are evidence of its ubiquitousness. For more information about `zstd`'s stability, its [RFC document](https://datatracker.ietf.org/doc/html/rfc8878) provides additional information about how data is compressed, along with security and longevity considerations. The last point was particularly important for CERN, where these files are expected to be compressed and kept for long periods of time till they are decompressed and used at some point in the future, so stability is a must.

@Ish-D (Contributor, author) commented Sep 21, 2024

@maszyman updated after feedback from the new round of comments.

@maszyman (Contributor) left a comment:

Thanks @Ish-D, LGTM!

@maszyman (Contributor) commented:

Hi @vvolkl 👋

Feel free to review Ishan's blog post and merge at your convenience; it's all good from my side.

Cheers,
Maciej

@vvolkl (Contributor) commented Sep 23, 2024

Very nice report, thank you!

@vvolkl merged commit 3dd9c81 into HSF:main on Sep 23, 2024
2 checks passed
@maszyman (Contributor) commented:

Thanks @vvolkl for merging! Is there anything that needs to be done in order for this post to be published on https://hepsoftwarefoundation.org/gsoc/2024/blogs.html?

I just noticed that Ishan's markdown document is missing this metadata at the beginning:

```yaml
---
project:
title:
author:
photo:
date:
year: 2024
layout: blog_post
logo:
intro: |
---
```

Is that a problem perhaps?

@hegner (Member) commented Sep 23, 2024

@maszyman - yes, that is absolutely needed!

@maszyman (Contributor) commented:

@Ish-D could you please make a PR to create the info needed to publish the blog post?
