
Vector Log Reduction Benchmark

Reproducible benchmark of in-flight telemetry reduction techniques using Vector on public network security datasets (CIC-IDS2017, UNSW-NB15).

This artifact accompanies the paper:

"Context-Aware Security Telemetry Reduction: A Streaming Architecture for Scalable SOC Pipelines"
IEEE Access, 2026
DOI: TBD (assigned upon acceptance)


Techniques

| Profile | Config | Description |
|---|---|---|
| Baseline | `baseline.toml` | No reduction (reference forwarder) |
| Filter-only | `filter.toml` | Pre-ingest rules dropping explicitly benign events |
| Field Pruning | `prune.toml` | Keeps minimal schema (timestamp, src_ip, dst_ip, event_id, proto) |
| Sampling (10%) | `sample.toml` | 10% stratified Bernoulli sampling preserving class balance |
| Template Hashing | `hash.toml` | xxHash64 fingerprint of structural template + `.count` aggregation (T_flush = 60 s) |
| Hybrid | `combined.toml` | Field Pruning → 10% Sampling (sequential) |
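The stratified Bernoulli sampling used by `sample.toml` can be sketched as follows. This is a minimal illustration of the assumed semantics, not the Vector config itself: each class (attack / benign) is sampled independently at the same rate, so the attack:benign mix of the input is preserved in expectation.

```python
import random

def stratified_bernoulli_sample(events, rate=0.10, seed=42):
    """Bernoulli-sample each class independently at `rate`, so the
    attack:benign ratio of the input is preserved in expectation.
    Events are dicts with an `is_attack` field (0 or 1)."""
    rngs = {}   # one RNG per class keeps the two sampling streams independent
    kept = []
    for ev in events:
        cls = ev["is_attack"]
        if cls not in rngs:
            rngs[cls] = random.Random(seed * 2 + cls)
        if rngs[cls].random() < rate:
            kept.append(ev)
    return kept
```

With 10,000 events of which 10% are attacks, roughly 1,000 events survive and roughly 10% of the survivors are attacks.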

Note on hashing algorithm: the paper uses xxHash64: deterministic, O(1) per event, ~13.8 GB/s throughput, an 8-byte key, and an expected collision only after ~2³² distinct templates (64-bit birthday bound). SHA-1 was an earlier placeholder and does not reproduce the paper's results.
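The template-hashing transform can be sketched as below. This is a dependency-free illustration only: the mask rules are hypothetical, and a truncated SHA-256 stands in for xxHash64 (which the real pipeline computes inside Vector), so fingerprints and reduction numbers will not match the paper.

```python
import re
import hashlib
from collections import defaultdict

# Hypothetical mask rules: variable tokens are replaced so that
# structurally identical messages collapse to one template.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template_fingerprint(message: str) -> int:
    """8-byte structural fingerprint of a log message.
    Stand-in hash: truncated SHA-256 instead of xxHash64, to keep
    the sketch free of third-party dependencies."""
    tpl = message
    for pat, token in _MASKS:
        tpl = pat.sub(token, tpl)
    return int.from_bytes(hashlib.sha256(tpl.encode()).digest()[:8], "big")

def aggregate(messages):
    """Count events per template, as would be flushed every T_flush = 60 s."""
    counts = defaultdict(int)
    for m in messages:
        counts[template_fingerprint(m)] += 1
    return counts
```

Two connection messages differing only in IP and port collapse to one template with `.count = 2`, while an unrelated message keeps its own fingerprint.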


Hardware & Software

| Parameter | Value |
|---|---|
| CPU | Xeon Silver 4310 |
| RAM | 64 GB |
| Storage | NVMe SSD |
| OS | Ubuntu 22.04 |
| Vector | v0.41.0 |
| ClickHouse | v23.12 |
| Python | 3.10+ |
| Replay rate | 5,000 EPS (fixed) |
| Warm-up discard | 30 s |
| Repetitions | n = 3 |
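The fixed 5,000 EPS replay rate with a 30 s warm-up discard can be sketched as a simple pacer. This is illustrative only; `scripts/replay.py` is authoritative.

```python
import time

def paced_replay(events, eps=5000, warmup_s=30):
    """Yield (event, in_warmup) at a fixed rate of `eps` events/second.
    Callers discard measurements taken while in_warmup is True, matching
    the 30 s warm-up discard in the table above."""
    interval = 1.0 / eps
    start = time.monotonic()
    for i, ev in enumerate(events):
        # Pace against absolute targets so per-event sleep jitter
        # does not accumulate into rate drift.
        target = start + i * interval
        delay = target - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        yield ev, (time.monotonic() - start) < warmup_s
```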

Quick Start

1) Prerequisites

  • Docker & Docker Compose
  • Python 3.10+

2) Start services

```shell
docker compose up -d
```

3) Initialize ClickHouse (once)

```shell
docker exec -i clickhouse bash -lc "clickhouse-client --multiquery < /docker-entrypoint-initdb.d/init.sql"
```

4) Prepare dataset

Place a CSV or JSONL file in `data/` with the following columns:

`timestamp, src_ip, dst_ip, event_id, proto, message, is_attack, label`
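To smoke-test the pipeline before downloading CIC-IDS2017 or UNSW-NB15, a tiny synthetic file with this schema can be generated. All values below are fabricated and carry no dataset semantics.

```python
import csv
import random

COLUMNS = ["timestamp", "src_ip", "dst_ip", "event_id", "proto",
           "message", "is_attack", "label"]

def write_sample_csv(path, n=1000, attack_frac=0.1, seed=7):
    """Write n synthetic rows matching the expected replay schema."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=COLUMNS)
        w.writeheader()
        for i in range(n):
            attack = 1 if rng.random() < attack_frac else 0
            w.writerow({
                "timestamp": f"2017-07-07T12:00:{i % 60:02d}Z",
                "src_ip": f"10.0.{rng.randrange(256)}.{rng.randrange(256)}",
                "dst_ip": f"192.168.{rng.randrange(256)}.{rng.randrange(256)}",
                "event_id": i,
                "proto": rng.choice(["tcp", "udp"]),
                "message": f"flow event {i}",
                "is_attack": attack,
                "label": "ATTACK" if attack else "BENIGN",
            })
```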

5) Replay (choose a profile)

```shell
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile baseline
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile filter
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile prune
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile sample
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile template-hash
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile combined
```
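The six invocations can also be generated programmatically, e.g. to feed `subprocess.run` when sweeping all profiles in one go. This is a convenience sketch; it assumes `replay.py`'s CLI matches the commands shown above.

```python
import shlex

PROFILES = ["baseline", "filter", "prune", "sample", "template-hash", "combined"]

def replay_commands(input_path="data/cicids_sample.csv", fmt="csv"):
    """Build one replay command string per reduction profile."""
    return [
        f"python scripts/replay.py --input {shlex.quote(input_path)}"
        f" --format {fmt} --profile {p}"
        for p in PROFILES
    ]
```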

6) Query metrics (4 paper metrics)

```sql
-- Throughput: measured externally via the replay.py timer (kEPS)

-- Byte Reduction (%)
SELECT profile,
       (1 - sum(length(message)) / any(input_bytes)) * 100 AS byte_reduction_pct
FROM logs.reduced GROUP BY profile ORDER BY byte_reduction_pct DESC;

-- Attack Coverage (%)
SELECT profile,
       sum(if(is_attack = 1, 1, 0)) AS retained_attacks,
       retained_attacks / (SELECT count() FROM logs.raw WHERE is_attack = 1) * 100 AS coverage_pct
FROM logs.reduced GROUP BY profile;

-- Latency p95 (ms): reported by replay.py
```

See docs/metrics.md for exact metric definitions used in the paper.
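For quick sanity checks outside ClickHouse, the same metrics can be computed directly in Python. Nearest-rank p95 is used here as one common percentile definition; `replay.py` may use another.

```python
def byte_reduction_pct(input_bytes: int, output_bytes: int) -> float:
    """(1 - output/input) * 100, mirroring the SQL query."""
    return (1 - output_bytes / input_bytes) * 100

def attack_coverage_pct(retained_attacks: int, total_attacks: int) -> float:
    """Share of ground-truth attack events surviving reduction."""
    return retained_attacks / total_attacks * 100

def p95_ms(latencies_ms):
    """Nearest-rank 95th percentile, computed with integer
    arithmetic to avoid floating-point edge cases."""
    s = sorted(latencies_ms)
    k = (95 * len(s) + 99) // 100 - 1   # ceil(0.95 * n) - 1
    return s[k]
```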


Repository Layout

```text
vector-log-reduction-benchmark/
├─ README.md
├─ docker-compose.yml
├─ LICENSE
├─ CITATION.cff
├─ requirements.txt
├─ docs/
│  └─ metrics.md          ← Metric definitions (paper Section IV-C)
├─ .github/workflows/ci.yml
├─ clickhouse/init.sql
├─ vector/configs/
│  ├─ baseline.toml
│  ├─ filter.toml
│  ├─ prune.toml
│  ├─ sample.toml
│  ├─ hash.toml            ← Template Hashing (xxHash64, T_flush = 60 s)
│  └─ combined.toml        ← Hybrid: Pruning + Sampling
└─ scripts/replay.py
```

Reproducibility Statement

All Vector TOML configs, replay scripts, and ClickHouse DDL are provided.
A DOI will be minted upon acceptance (Zenodo/GitHub release).
See CITATION.cff for a citable reference.
