Reproducible benchmark of in-flight telemetry reduction techniques using Vector on public network security datasets (CIC-IDS2017, UNSW-NB15).
This artifact accompanies the paper:
"Context-Aware Security Telemetry Reduction: A Streaming Architecture for Scalable SOC Pipelines"
IEEE Access, 2026
DOI: TBD (assigned upon acceptance)
| Profile | Config | Description |
|---|---|---|
| Baseline | baseline.toml | No reduction — reference forwarder |
| Filter-only | filter.toml | Pre-ingest rules dropping explicitly benign events |
| Field Pruning | prune.toml | Keeps minimal schema (timestamp, src_ip, dst_ip, event_id, proto) |
| Sampling (10%) | sample.toml | 10% stratified Bernoulli sampling preserving class balance |
| Template Hashing | hash.toml | xxHash64 fingerprint of structural template + .count aggregation (T_flush = 60s) |
| Hybrid | combined.toml | Field Pruning → 10% Sampling (sequential) |
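The stratified Bernoulli sampling behind sample.toml can be illustrated with a minimal Python sketch (this is not the Vector VRL implementation; the seeded-hash keying and field names are illustrative assumptions). Each class is thinned at the same rate, so the attack/benign ratio is preserved in expectation:

```python
import hashlib

def keep_event(event: dict, rate: float = 0.10, seed: str = "benchmark") -> bool:
    """Stratified Bernoulli sampling: every stratum (class label) is
    thinned at the same rate, preserving class balance in expectation.
    Keying on a hash makes the keep/drop decision deterministic."""
    stratum = event["label"]  # stratify by class label
    key = f"{seed}:{stratum}:{event['event_id']}".encode()
    # Map the 8-byte hash to [0, 1) and keep the event if it falls under `rate`.
    u = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big") / 2**64
    return u < rate

events = [
    {"event_id": str(i), "label": "attack" if i % 10 == 0 else "benign"}
    for i in range(10_000)
]
kept = [e for e in events if keep_event(e)]
print(f"kept {len(kept)}/{len(events)}")
```

Because the decision is a pure function of (seed, stratum, event_id), replays are reproducible across runs, which matters for n = 3 repetitions.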
Note on the hashing algorithm: the paper uses xxHash64 (deterministic, O(1) per event, ~13.8 GB/s throughput, 8-byte keys, birthday-bound collision resistance of ~2³²). SHA-1 was an earlier placeholder and does not reproduce the paper's results.
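The Template Hashing idea — mask variable fields, fingerprint the remaining structure, and aggregate counts per fingerprint within a flush window — can be sketched in a few lines of Python. The masking regexes below are illustrative assumptions, and `blake2b` with an 8-byte digest stands in for xxHash64 (which in practice comes from the third-party `xxhash` package):

```python
import hashlib
import re
from collections import Counter

# Illustrative masking rules: replace variable fields with placeholders
# so messages that share a structure collapse to a single template.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template_fingerprint(message: str) -> str:
    """8-byte structural fingerprint (blake2b stand-in for xxHash64)."""
    for pattern, placeholder in _MASKS:
        message = pattern.sub(placeholder, message)
    return hashlib.blake2b(message.encode(), digest_size=8).hexdigest()

# Per-window aggregation: one (fingerprint, count) pair replaces all
# events sharing a template within T_flush.
window = [
    "Accepted connection from 10.0.0.5 port 443",
    "Accepted connection from 10.0.0.9 port 8080",
    "Failed login for user 42",
]
counts = Counter(template_fingerprint(m) for m in window)
print(counts)  # two fingerprints: the first two messages share a template
```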
| Parameter | Value |
|---|---|
| CPU | Xeon Silver 4310 |
| RAM | 64 GB |
| Storage | NVMe SSD |
| OS | Ubuntu 22.04 |
| Vector | v0.41.0 |
| ClickHouse | v23.12 |
| Python | 3.10+ |
| Replay rate | 5,000 EPS (fixed) |
| Warm-up discard | 30 s |
| Repetitions | n = 3 |
- Docker & Docker Compose
- Python 3.10+
```bash
docker compose up -d
docker exec -i clickhouse bash -lc "clickhouse-client --multiquery < /docker-entrypoint-initdb.d/init.sql"
```

Place a CSV/JSONL file in `data/` with the columns:

```
timestamp, src_ip, dst_ip, event_id, proto, message, is_attack, label
```
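To smoke-test the pipeline without downloading the full datasets, a tiny synthetic file in the expected schema can be generated first (all values below are illustrative; the output filename simply matches the one used in the replay commands):

```python
import csv
import os
from datetime import datetime, timedelta, timezone

COLUMNS = ["timestamp", "src_ip", "dst_ip", "event_id", "proto",
           "message", "is_attack", "label"]

start = datetime(2017, 7, 3, tzinfo=timezone.utc)
rows = []
for i in range(100):
    attack = i % 10 == 0  # every 10th event is a synthetic "attack"
    rows.append({
        "timestamp": (start + timedelta(seconds=i)).isoformat(),
        "src_ip": f"192.168.1.{i % 250 + 1}",
        "dst_ip": "10.0.0.1",
        "event_id": str(i),
        "proto": "tcp",
        "message": f"conn from 192.168.1.{i % 250 + 1} port {1024 + i}",
        "is_attack": int(attack),
        "label": "DoS" if attack else "BENIGN",
    })

os.makedirs("data", exist_ok=True)
with open("data/cicids_sample.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```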
```bash
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile baseline
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile filter
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile prune
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile sample
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile template-hash
python scripts/replay.py --input data/cicids_sample.csv --format csv --profile combined
```

Throughput (kEPS) is measured externally by the `replay.py` timer.
```sql
-- Byte Reduction (%)
SELECT profile,
       (1 - sum(length(message)) / any(input_bytes)) * 100 AS byte_reduction_pct
FROM logs.reduced GROUP BY profile ORDER BY byte_reduction_pct DESC;

-- Attack Coverage (%)
SELECT profile,
       sum(if(is_attack = 1, 1, 0)) AS retained_attacks,
       retained_attacks / (SELECT count() FROM logs.raw WHERE is_attack = 1) * 100 AS coverage_pct
FROM logs.reduced GROUP BY profile;
```

Latency p95 (ms) is reported by `replay.py`. See `docs/metrics.md` for the exact metric definitions used in the paper.
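The two SQL metrics can also be checked offline. A minimal Python sketch mirroring those definitions on in-memory row sets (the toy rows and the every-other-event "sampling" are illustrative only):

```python
def byte_reduction_pct(input_bytes: int, reduced_rows: list) -> float:
    """1 - (reduced message bytes / raw input bytes), as a percentage."""
    reduced_bytes = sum(len(r["message"]) for r in reduced_rows)
    return (1 - reduced_bytes / input_bytes) * 100

def attack_coverage_pct(raw_rows: list, reduced_rows: list) -> float:
    """Share of raw attack events still present after reduction."""
    total_attacks = sum(r["is_attack"] for r in raw_rows)
    retained = sum(r["is_attack"] for r in reduced_rows)
    return retained / total_attacks * 100

# Toy data: 50 raw events of 100 bytes each, every 5th is an attack.
raw = [{"message": "x" * 100, "is_attack": i % 5 == 0} for i in range(50)]
reduced = raw[::2]  # toy reduction: keep every other event
input_bytes = sum(len(r["message"]) for r in raw)

print(byte_reduction_pct(input_bytes, reduced))   # 50.0
print(attack_coverage_pct(raw, reduced))          # 50.0
```

This mirrors the trade-off the benchmark quantifies: a reduction profile is only as good as the attack coverage it retains at a given byte reduction.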
vector-log-reduction-benchmark/
├─ README.md
├─ docker-compose.yml
├─ LICENSE
├─ CITATION.cff
├─ requirements.txt
├─ docs/
│ └─ metrics.md ← Metric definitions (paper Section IV-C)
├─ .github/workflows/ci.yml
├─ clickhouse/init.sql
├─ vector/configs/
│ ├─ baseline.toml
│ ├─ filter.toml
│ ├─ prune.toml
│ ├─ sample.toml
│ ├─ hash.toml ← Template Hashing (xxHash64, T_flush=60s)
│ └─ combined.toml ← Hybrid: Pruning + Sampling
└─ scripts/replay.py
All Vector TOML configs, replay scripts, and ClickHouse DDL are provided.
A DOI will be minted upon acceptance (Zenodo/GitHub release).
See CITATION.cff for a citable reference.