accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)
Loghub:
https://github.com/logpai/loghub
Download these datasets and copy them into Logs/{logname}/{logname}.log.
python >= 3.7.3
regex = 2012.1.8
gcc >= 9.4.0
PCRE2 = 10.34
libboost-iostreams-dev = 1.71.0.0ubuntu2
In our environment (with gcc), we used the following two commands to finish setting up the experimental dependencies:
1. apt install libpcre2-dev
2. apt install libboost-iostreams-dev
- Requirement: A modern CPU with at least 4 cores.
- Recommendation: Intel i5 or AMD Ryzen 5 or better.
- Explanation: The program relies heavily on a multi-core CPU, as it processes work in parallel across multiple threads.
- Minimum Requirement: 4 GB
- Recommended: 8 GB
- Explanation: The program used a maximum resident set size of approximately 2.4 GB of memory during execution. At least 4 GB of RAM is needed to run the program and to provide enough memory for the operating system and other applications.
g++ -O3 -std=c++17 -o denum_compress denum_compress.cpp -lboost_iostreams -lpthread -lpcre2-8
Assume the chunksize is set to 100000, and the target log file is Logs/HDFS/HDFS.log
./denum_compress HDFS 100000 1
This repository already contains the Apache log file, so you can run the following command directly without downloading any dataset:
./denum_compress Apache 100000 1
The last parameter selects the mode, which makes it easy to reproduce the results of each RQ: "1" runs the default Denum, "2" outputs logs without numbers (used for RQ3), and "3" runs Denum without string processing (used for RQ4).
Assume the target log file is Logs/Apache/Apache.log
- cd Denum_Package
- python3 compress.py Apache
Note that datasets with other tag types may raise errors during decompression: we have only implemented recovery of IP addresses for Apache decompression. For specific tags in other datasets, users may need to write their own recovery functions by mimicking the function at line 809. The reason is that although the IP-address pattern is <*>.<*>.<*>.<*>, the value represented by each <*> may have 1-3 digits, so each field is zero-padded during compression and the high-order (leading) zeros are removed during decompression. For example, each field of 1.1.1.1 is stored as 001 during compression. Users therefore need to pay attention to how many digits each <*> in a tag can represent.
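For reference, the following minimal Python sketch illustrates this zero-padding idea. The helper names are hypothetical, and the actual recovery code in this repository may represent the digits differently.

```python
# Minimal sketch of the zero-padding idea described above
# (hypothetical helpers, not the actual functions in this repository).

def pad_ip(ip: str) -> str:
    """Compression side: pad every octet of an IPv4 address to 3 digits."""
    return "".join(f"{int(octet):03d}" for octet in ip.split("."))

def restore_ip(padded: str) -> str:
    """Decompression side: split the padded digits back into octets and
    drop the leading ('high-order') zeros."""
    octets = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    return ".".join(str(int(octet)) for octet in octets)

assert pad_ip("1.1.1.1") == "001001001001"
assert restore_ip(pad_ip("192.168.0.1")) == "192.168.0.1"
```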
- cd Denum_Package
- python3 decompress.py Apache
- cd ..
- python3 lossy_check.py
docker pull docker.io/gaiusyu/denumv1.0:latest
docker run -v /Your/Path/to/Logs:/app/Logs -v /Your/Path/to/output:/app/output -v /Your/Path/to/decompress_output:/app/decompress_output -it gaiusyu/denumv1.0 {logname} {chunksize} {stage}
- Example to start a container:
docker run -v E:/CUHKSZ/Denum_ASE2024/Logs:/app/Logs -v E:/CUHKSZ/Denum_ASE2024/output:/app/output -v E:/CUHKSZ/Denum_ASE2024/decompress_output:/app/decompress_output -it gaiusyu/denumv1.0 Apache 100000 1
Research questions:
• RQ1: What is the compression ratio of Denum?
• RQ2: What is the compression speed of Denum?
• RQ3: Can Denum’s Numeric Token Parsing module improve the performance of other log compressors?
• RQ4: How does each module in Denum affect its compression ratio?
You can also use Docker in place of the execution commands below.
- Download these datasets from Loghub and put them into Logs/{logname}/{logname}.log.
- Compile the code according to the previous instructions, and then run the following command:
./denum_compress {logname} 100000 1
- Perform the above operations for each dataset (see the sketch after this list).
- Record the compression ratio (CR) and compression speed (CS).
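One possible way to automate these runs is sketched below; the dataset list, binary path, and chunk size are assumptions that you may need to adapt to your setup.

```python
# Sketch: run denum_compress on several Loghub datasets and time each run.
# The dataset names, binary path, and chunk size below are assumptions.
import os
import subprocess
import time

DATASETS = ["Apache", "HDFS"]   # extend with the other Loghub datasets
CHUNK_SIZE = "100000"

for name in DATASETS:
    log_path = os.path.join("Logs", name, f"{name}.log")
    original_size = os.path.getsize(log_path)

    start = time.time()
    subprocess.run(["./denum_compress", name, CHUNK_SIZE, "1"], check=True)
    elapsed = time.time() - start

    # CS (compression speed) = original size / compression time.
    # For CR, divide original_size by the size of the archive Denum produced
    # (its location depends on Denum's output layout).
    print(f"{name}: {original_size / 1e6:.1f} MB in {elapsed:.1f} s "
          f"-> CS = {original_size / 1e6 / elapsed:.1f} MB/s")
```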
Results:
CR
Since CR is independent of the execution environment, we sourced the CRs of the other compressors directly from their original papers.
CS
- Reproduce LogShrink and LogReducer according to their instructions.
- Record the CS.
For RQ3, the logs are first passed through Denum's Numeric Token Parsing module to generate logs without numbers, which are then fed to the other log compressors.
- Use mode "2" to generate the logs without numbers:
./denum_compress {logname} 100000 2
The logs without numbers are saved in /output/{logname}.log.
- Compress the logs without numbers according to the instructions of the other log compressors, such as LogReducer, LogZip, and LogShrink.
- Note that the CR and CS reported in the second step are not accurate: the compression time should also include the time taken to generate the logs without numbers, and the reported CR is computed against the logs without numbers rather than the original file. Recompute the CR from the achieved compressed size and the original log file size (see the sketch below).
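The corrected numbers only involve simple arithmetic; the sketch below shows one way to compute them. All file paths and timing values are placeholders for your own measurements and for whatever the pipeline actually produced.

```python
# Sketch of the corrected CR/CS calculation for RQ3.
# All paths, file names, and timing values below are placeholders; achieved_files
# should list everything produced (the compressed numeric data plus the other
# compressor's archive).
import os

original_log = "Logs/Apache/Apache.log"
achieved_files = ["output/Apache_numbers.xz", "output/Apache_no_numbers.logreducer"]

t_number_parsing = 3.2      # seconds spent generating the logs without numbers (mode "2")
t_other_compressor = 10.5   # seconds reported by the other compressor

original_size = os.path.getsize(original_log)
achieved_size = sum(os.path.getsize(f) for f in achieved_files)

cr = original_size / achieved_size                              # corrected CR
cs = original_size / (t_number_parsing + t_other_compressor)    # corrected CS in bytes/s
print(f"CR = {cr:.2f}, CS = {cs / 1e6:.2f} MB/s")
```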
Results:
- The bar for Compressor(LZMA) is sourced from RQ1
- The bar for Number+Compressor(LZMA): Use mode "3" to generate CRs
./denum_compress {logname} 100000 3
- The bar for Number+String+Compressor(LZMA) is the complete Denum
- Different chunk-size settings are used to achieve this:
100K: ./denum_compress {logname} 100000 1
300K: ./denum_compress {logname} 300000 1
1M: ./denum_compress {logname} 1000000 1
3M: ./denum_compress {logname} 3000000 1