Hi Sven, hope you can help.
I'm interested in measuring IOPS, average latency and maximum latency over time on a Ceph cluster (using the block interface),
so that, for a given read or write workload at a specific block size, I can plot IOPS, average latency and maximum latency over a long period
(e.g. 24 hours). I'd like to be able to plot this at a specific interval, e.g. 1, 2 or 5-10 seconds (obviously going below 1s for IOPS doesn't make sense).
As far as I can see, the only way to do this is with the --livecsv and --livecsvex options, which write an entry with the stats to the CSV file at every interval.
I am using elbencho version 3.0-38. Example command:
/usr/local/bin/elbencho --direct --read --rand --rwmixpct 100 --block 4K --iodepth 16 --threads 16 --timelimit 20 --lat --livecsv out.csv --csvfile out2.csv --livecsvex --allelapsed --latpercent --lathisto --liveint 1000 /dev/rbd0
The livecsv output contains:
ISO Date,Label,Phase,RuntimeMS,Rank,MixType,Done%,DoneBytes,MiB/s,IOPS,Entries,Entries/s,Lat Ent us,Lat IO us,Active,CPU,Service,
2026-02-09T11:18:30.170-0500,,READ,1000,Total,,0,123498496,117,30151,0,0,0,530,4,17,,
2026-02-09T11:18:30.170-0500,,READ,1000,0,,0,31059968,,,0,,,,,,,
2026-02-09T11:18:30.170-0500,,READ,1000,1,,0,31109120,,,0,,,,,,,
2026-02-09T11:18:30.170-0500,,READ,1000,2,,0,30642176,,,0,,,,,,,
2026-02-09T11:18:30.170-0500,,READ,1000,3,,0,30687232,,,0,,,,,,,
2026-02-09T11:18:31.170-0500,,READ,2000,Total,,0,272957440,142,36489,0,0,0,437,4,18,,
2026-02-09T11:18:31.170-0500,,READ,2000,0,,0,68579328,,,0,,,,,,,
2026-02-09T11:18:31.170-0500,,READ,2000,1,,0,68468736,,,0,,,,,,,
2026-02-09T11:18:31.170-0500,,READ,2000,2,,0,68034560,,,0,,,,,,,
2026-02-09T11:18:31.170-0500,,READ,2000,3,,0,67878912,,,0,,,,,,,
2026-02-09T11:18:32.170-0500,,READ,3000,Total,,0,435048448,154,39573,0,0,0,403,4,20,,
2026-02-09T11:18:32.170-0500,,READ,3000,0,,0,109244416,,,0,,,,,,,
2026-02-09T11:18:32.170-0500,,READ,3000,1,,0,108974080,,,0,,,,,,,
2026-02-09T11:18:32.170-0500,,READ,3000,2,,0,108474368,,,0,,,,,,,
2026-02-09T11:18:32.170-0500,,READ,3000,3,,0,108355584,,,0,,,,,,,
2026-02-09T11:18:33.170-0500,,READ,4000,Total,,0,598163456,155,39823,0,0,0,401,4,20,,
2026-02-09T11:18:33.170-0500,,READ,4000,0,,0,150179840,,,0,,,,,,,
2026-02-09T11:18:33.170-0500,,READ,4000,1,,0,149700608,,,0,,,,,,,
2026-02-09T11:18:33.170-0500,,READ,4000,2,,0,149213184,,,0,,,,,,,
2026-02-09T11:18:33.170-0500,,READ,4000,3,,0,149073920,,,0,,,,,,,
2026-02-09T11:18:34.170-0500,,READ,5000,Total,,0,761122816,155,39785,0,0,0,401,4,19,,
2026-02-09T11:18:34.170-0500,,READ,5000,0,,0,190996480,,,0,,,,,,,
2026-02-09T11:18:34.170-0500,,READ,5000,1,,0,190431232,,,0,,,,,,,
2026-02-09T11:18:34.170-0500,,READ,5000,2,,0,189915136,,,0,,,,,,,
2026-02-09T11:18:34.170-0500,,READ,5000,3,,0,189779968,,,0,,,,,,,
2026-02-09T11:18:35.170-0500,,READ,6000,Total,,0,923262976,154,39585,0,0,0,403,4,19,,
2026-02-09T11:18:35.170-0500,,READ,6000,0,,0,231636992,,,0,,,,,,,
2026-02-09T11:18:35.170-0500,,READ,6000,1,,0,230961152,,,0,,,,,,,
2026-02-09T11:18:35.170-0500,,READ,6000,2,,0,230404096,,,0,,,,,,,
2026-02-09T11:18:35.170-0500,,READ,6000,3,,0,230264832,,,0,,,,,,,
2026-02-09T11:18:36.170-0500,,READ,7000,Total,,0,1086955520,156,39964,0,0,0,399,4,20,,
2026-02-09T11:18:36.170-0500,,READ,7000,0,,0,272584704,,,0,,,,,,,
2026-02-09T11:18:36.170-0500,,READ,7000,1,,0,271863808,,,0,,,,,,,
2026-02-09T11:18:36.170-0500,,READ,7000,2,,0,271286272,,,0,,,,,,,
2026-02-09T11:18:36.170-0500,,READ,7000,3,,0,271224832,,,0,,,,,,,
From these statistics it looks like I can only see the average latency in the last 1 second interval.
I'm assuming "Lat IO us" is the average latency of requests completed in the last interval (as set by --liveint)?
I think "Lat Ent us" is only applicable to file/directory access, not block access. Is that right, so for block it will always be zero?
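For reference, here is a minimal Python sketch of how I'm reading the "Total" rows of the livecsv output. It assumes the column names shown in the header above and the 4K block size from my command; the function names (total_rows, derived_iops) are just my own helpers, not anything from elbencho. It derives IOPS for each interval from the change in the cumulative DoneBytes column and compares that with the reported IOPS column. On the sample rows the two agree, which suggests the IOPS column is per interval, but it does not tell me anything about how "Lat IO us" is computed.

```python
import csv
import io

BLOCK_SIZE = 4096  # assumption: matches --block 4K in the command above

# A few rows copied from the livecsv output above (header, Total rows, one rank row).
SAMPLE = """\
ISO Date,Label,Phase,RuntimeMS,Rank,MixType,Done%,DoneBytes,MiB/s,IOPS,Entries,Entries/s,Lat Ent us,Lat IO us,Active,CPU,Service,
2026-02-09T11:18:30.170-0500,,READ,1000,Total,,0,123498496,117,30151,0,0,0,530,4,17,,
2026-02-09T11:18:30.170-0500,,READ,1000,0,,0,31059968,,,0,,,,,,,
2026-02-09T11:18:31.170-0500,,READ,2000,Total,,0,272957440,142,36489,0,0,0,437,4,18,,
2026-02-09T11:18:32.170-0500,,READ,3000,Total,,0,435048448,154,39573,0,0,0,403,4,20,,
"""


def total_rows(csv_text):
    """Yield (runtime_ms, done_bytes, iops, lat_io_us) for each 'Total' row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Per-rank rows leave IOPS and latency empty, so skip them.
        if row["Rank"] != "Total":
            continue
        yield (int(row["RuntimeMS"]), int(row["DoneBytes"]),
               int(row["IOPS"]), int(row["Lat IO us"]))


def derived_iops(rows, block_size=BLOCK_SIZE):
    """Return (runtime_ms, reported_iops, iops_from_bytes, lat_io_us) per interval."""
    prev_ms, prev_bytes = 0, 0
    result = []
    for runtime_ms, done_bytes, reported_iops, lat_io_us in rows:
        seconds = (runtime_ms - prev_ms) / 1000
        from_bytes = (done_bytes - prev_bytes) / block_size / seconds
        result.append((runtime_ms, reported_iops, round(from_bytes), lat_io_us))
        prev_ms, prev_bytes = runtime_ms, done_bytes
    return result


if __name__ == "__main__":
    for line in derived_iops(total_rows(SAMPLE)):
        print(line)
```

The same loop could write the runtime, IOPS and "Lat IO us" values to a plot, but a per-interval maximum latency is the piece I can't get from these columns.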
The reason I want the maximum latency for IOs that completed in each interval (as specified with --liveint X) is so I can see the peaks. For example, if an event occurs (such as a failure or some other system event), I can tie the latency peak to that specific event within the Ceph storage system and to what the system was doing at the time.
Yes, there are the histograms and percentiles that are output at the end, but it would be useful to also have the maximum latency of completed IOs in a given period recorded as the test runs. (FIO and vdbench do this.)
Maybe elbencho does this already, but I can't seem to find it in the options.
https://github.com/breuner/elbencho/blob/master/docs/csv-docs.md hints at showing the maximum, but is describing the output of --csvfile I think.
Thank you!