Graph API benchmarks added by mateuszpn · Pull Request #2560 · oneapi-src/unified-runtime

mateuszpn · 2025-01-14T13:36:22Z

No description provided.

github-actions · 2025-01-14T13:37:32Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12768839561

pbalcer · 2025-01-14T13:45:36Z

scripts/benchmarks/benches/compute.py

+            "--iterations=1000",
+            "--numKernels=100",
+        ]   
+


missing newline

pbalcer · 2025-01-14T13:45:49Z

scripts/benchmarks/benches/compute.py

+
+    def bin_args(self) -> list[str]:
+        return [
+            "--iterations=1000",


is that enough iterations for the benchmark to be stable?
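One way to approach this question is to repeat the benchmark several times and check the run-to-run spread. The sketch below is only an illustration: `is_stable`, the 2% cutoff, and the sample values are hypothetical and do not come from this PR or from the benchmark scripts.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Return sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def is_stable(samples: list[float], max_cv: float = 0.02) -> bool:
    """Treat a benchmark as stable when its run-to-run spread is small.

    max_cv=0.02 is an assumed cutoff, chosen here only to mirror the
    2.00% threshold used in the report; it is not a project setting.
    """
    return coefficient_of_variation(samples) <= max_cv


# Made-up timings (in microseconds) for illustration only.
steady_runs = [100.0, 101.0, 99.0]
noisy_runs = [100.0, 120.0, 80.0]

print(is_stable(steady_runs))
print(is_stable(noisy_runs))
```

If the spread stays above the cutoff at `--iterations=1000`, raising the iteration count or the number of repeated runs would be the natural next step.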

pbalcer · 2025-01-14T13:46:16Z

scripts/benchmarks/benches/compute.py

+    def bin_args(self) -> list[str]:
+        return [
+            "--iterations=1000",
+            "--numKernels=100",


should we add more scenarios with different numbers of kernels? e.g., with 1 kernel, to see the end-to-end cost of the whole machinery.
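A minimal sketch of what such parameterization could look like. `GraphApiScenario`, `graph_api_scenarios`, and the kernel counts `(1, 10, 100)` are hypothetical names and values for illustration; only the `--iterations` and `--numKernels` flags come from the diff above, and the real benchmark class in `compute.py` may be structured differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GraphApiScenario:
    """Hypothetical stand-in for one Graph API benchmark configuration."""

    num_kernels: int
    iterations: int = 1000

    def bin_args(self) -> list[str]:
        # Same flags as in the reviewed diff, with the kernel count made variable.
        return [
            f"--iterations={self.iterations}",
            f"--numKernels={self.num_kernels}",
        ]


def graph_api_scenarios(kernel_counts: tuple[int, ...] = (1, 10, 100)) -> list[GraphApiScenario]:
    """Build one scenario per kernel count (the counts are assumed, not from the PR)."""
    return [GraphApiScenario(num_kernels=n) for n in kernel_counts]


for scenario in graph_api_scenarios():
    print(scenario.bin_args())
```

The single-kernel scenario would isolate the fixed cost of the graph machinery, while the larger counts show how that cost amortizes.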

github-actions · 2025-01-14T14:25:16Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12768839561
Job status: success. Test status: success.

Summary

Total 125 benchmarks in mean.
Geomean 100.642%.
Improved 24 Regressed 8 (threshold 2.00%)


Performance change in benchmark groups

Relative perf in group api (9): 99.869%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_sycl SubmitKernel out of order	23.390000 μs	23.552 μs	100.69%	0.69%	.
api_overhead_benchmark_l0 SubmitKernel out of order	11.218000 μs	11.255 μs	100.33%	0.33%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	101923.000000 instr	101923.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	107041.000000 instr	107041.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order	16.323 μs	16.295000 μs	99.83%	-0.17%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.077 μs	2.070000 μs	99.66%	-0.34%	.
api_overhead_benchmark_ur SubmitKernel out of order	15.576 μs	15.520000 μs	99.64%	-0.36%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.631 μs	1.623000 μs	99.51%	-0.49%	.
api_overhead_benchmark_sycl SubmitKernel in order	25.159 μs	24.949000 μs	99.17%	-0.83%	.

Relative perf in group memory (4): 100.001%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.569000 μs	5.605 μs	100.65%	0.65%	.
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.208000 GB/s	3.201 GB/s	100.22%	0.22%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	252.490 μs	252.056000 μs	99.83%	-0.17%	.
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	133.124 μs	132.214000 μs	99.32%	-0.68%	.
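The group figure above is consistent with a geometric mean of per-benchmark ratios, where time-like units count as lower-is-better and throughput units as higher-is-better. The sketch below recomputes it from the four rows of the memory table. The function names and the `lower_is_better` flag are assumptions made for illustration and are not taken from the benchmark scripts.

```python
import math

# (label, this_pr, baseline, lower_is_better), values copied from the memory table.
MEMORY_GROUP = [
    ("QueueMemcpy D2D, size 1024 (us)", 5.569, 5.605, True),
    ("StreamMemory Triad, size 10240 (GB/s)", 3.208, 3.201, False),
    ("QueueInOrderMemcpy D2D, size 1024 (us)", 252.490, 252.056, True),
    ("QueueInOrderMemcpy H2D, size 1024 (us)", 133.124, 132.214, True),
]


def relative_perf(this_pr: float, baseline: float, lower_is_better: bool) -> float:
    """Ratio above 1.0 means this PR is better than the baseline."""
    return baseline / this_pr if lower_is_better else this_pr / baseline


def geomean(ratios: list[float]) -> float:
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def classify(ratio: float, threshold_pct: float = 2.0) -> str:
    """Label a ratio using the 2.00% threshold quoted in the summary."""
    change_pct = (ratio - 1.0) * 100.0
    if change_pct > threshold_pct:
        return "improved"
    if change_pct < -threshold_pct:
        return "regressed"
    return "unchanged"


ratios = [relative_perf(pr, base, lower) for _, pr, base, lower in MEMORY_GROUP]
for (label, *_), ratio in zip(MEMORY_GROUP, ratios):
    print(f"{label}: {ratio * 100:.2f}% ({classify(ratio)})")
print(f"group geomean: {geomean(ratios) * 100:.3f}%")
```

Because every row in this group stays within the threshold, none of them would count toward the improved or regressed totals in the summary.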

Relative perf in group miscellaneous (1): 94.154%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	858.609 bw GB/s	808.411000 bw GB/s	94.15%	-5.85%	---

Relative perf in group multithread (10): 99.538%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7382.914000 μs	7495.284 μs	101.52%	1.52%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	25681.874000 μs	25995.395 μs	101.22%	1.22%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6922.779000 μs	6974.233 μs	100.74%	0.74%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	40479.072000 μs	40660.910 μs	100.45%	0.45%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	17417.035000 μs	17477.781 μs	100.35%	0.35%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1166.967000 μs	1168.649 μs	100.14%	0.14%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8705.550 μs	8634.909000 μs	99.19%	-0.81%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2073.159 μs	2020.694000 μs	97.47%	-2.53%	-
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	112338.240 μs	109295.935000 μs	97.29%	-2.71%	--
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	48365.917 μs	46977.855000 μs	97.13%	-2.87%	--

Relative perf in group Velocity-Bench (9): 100.125%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench Hashtable	379.627909 M keys/sec	375.506 M keys/sec	101.10%	1.10%	.
Velocity-Bench Easywave	239.000000 ms	240.000 ms	100.42%	0.42%	.
Velocity-Bench dl-mnist	2.730000 s	2.740 s	100.37%	0.37%	.
Velocity-Bench svm	0.135200 s	0.136 s	100.30%	0.30%	.
Velocity-Bench Sobel Filter	533.583000 ms	533.897 ms	100.06%	0.06%	.
Velocity-Bench QuickSilver	118.200000 MMS/CTT	118.140 MMS/CTT	100.05%	0.05%	.
Velocity-Bench CudaSift	203.330 ms	202.738000 ms	99.71%	-0.29%	.
Velocity-Bench dl-cifar	23.437 s	23.339500 s	99.58%	-0.42%	.
Velocity-Bench Bitcracker	35.287 s	35.129800 s	99.56%	-0.44%	.

Relative perf in group Runtime (8): 103.487%

Benchmark	This PR	baseline	Relative perf	Change	-
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	266.359000 ms	287.619 ms	107.98%	7.98%	++++
Runtime_IndependentDAGTaskThroughput_SingleTask	251.902000 ms	268.256 ms	106.49%	6.49%	++++
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	271.741000 ms	280.817 ms	103.34%	3.34%	++
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	271.995000 ms	280.952 ms	103.29%	3.29%	++
Runtime_DAGTaskThroughput_BasicParallelFor	1707.116000 ms	1755.738 ms	102.85%	2.85%	++
Runtime_DAGTaskThroughput_SingleTask	1651.090000 ms	1676.803 ms	101.56%	1.56%	.
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1691.933000 ms	1717.167 ms	101.49%	1.49%	.
Runtime_DAGTaskThroughput_NDRangeParallelFor	1679.178000 ms	1697.583 ms	101.10%	1.10%	.

Relative perf in group MicroBench (14): 100.154%

Benchmark	This PR	baseline	Relative perf	Change	-
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	4.477000 ms	4.568 ms	102.03%	2.03%	+
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.346000 ms	4.411 ms	101.50%	1.50%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	4.608000 ms	4.625 ms	100.37%	0.37%	.
MicroBench_LocalMem_int32_4096	29.834000 ms	29.871 ms	100.12%	0.12%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	4.660000 ms	4.664 ms	100.09%	0.09%	.
MicroBench_LocalMem_fp32_4096	29.864000 ms	29.878 ms	100.05%	0.05%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.092000 ms	618.158 ms	100.01%	0.01%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.124000 ms	618.164 ms	100.01%	0.01%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.423000 ms	617.455 ms	100.01%	0.01%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.472 ms	617.470000 ms	100.00%	-0.00%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.473 ms	4.459000 ms	99.69%	-0.31%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.463 ms	4.447000 ms	99.64%	-0.36%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	4.466 ms	4.438000 ms	99.37%	-0.63%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.500 ms	4.469000 ms	99.31%	-0.69%	.

Relative perf in group Pattern (10): 100.568%

Benchmark	This PR	baseline	Relative perf	Change	-
Pattern_Reduction_NDRange_int32	16.623000 ms	17.271 ms	103.90%	3.90%	++
Pattern_Reduction_Hierarchical_int32	16.741000 ms	16.966 ms	101.34%	1.34%	.
Pattern_SegmentedReduction_NDRange_fp32	2.172000 ms	2.178 ms	100.28%	0.28%	.
Pattern_SegmentedReduction_NDRange_int32	2.170000 ms	2.173 ms	100.14%	0.14%	.
Pattern_SegmentedReduction_NDRange_int64	2.341000 ms	2.344 ms	100.13%	0.13%	.
Pattern_SegmentedReduction_NDRange_int16	2.270000 ms	2.271 ms	100.04%	0.04%	.
Pattern_SegmentedReduction_Hierarchical_int16	11.806000 ms	11.811 ms	100.04%	0.04%	.
Pattern_SegmentedReduction_Hierarchical_fp32	11.597000 ms	11.597 ms	100.00%	0.00%	.
Pattern_SegmentedReduction_Hierarchical_int32	11.598 ms	11.595000 ms	99.97%	-0.03%	.
Pattern_SegmentedReduction_Hierarchical_int64	11.786 ms	11.774000 ms	99.90%	-0.10%	.

Relative perf in group ScalarProduct (6): 100.305%

Benchmark	This PR	baseline	Relative perf	Change	-
ScalarProduct_NDRange_fp32	3.757000 ms	3.822 ms	101.73%	1.73%	.
ScalarProduct_Hierarchical_fp32	10.152000 ms	10.201 ms	100.48%	0.48%	.
ScalarProduct_NDRange_int32	3.865000 ms	3.874 ms	100.23%	0.23%	.
ScalarProduct_Hierarchical_int32	10.543000 ms	10.543 ms	100.00%	0.00%	.
ScalarProduct_Hierarchical_int64	11.496 ms	11.477000 ms	99.83%	-0.17%	.
ScalarProduct_NDRange_int64	5.483 ms	5.459000 ms	99.56%	-0.44%	.

Relative perf in group USM (7): 97.735%

Benchmark	This PR	baseline	Relative perf	Change	-
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.044000 ms	1.061 ms	101.63%	1.63%	.
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.788000 ms	1.811 ms	101.29%	1.29%	.
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.643000 ms	1.658 ms	100.91%	0.91%	.
USM_Allocation_latency_fp32_host	37.405000 ms	37.562 ms	100.42%	0.42%	.
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.197000 ms	1.198 ms	100.08%	0.08%	.
USM_Allocation_latency_fp32_device	0.068 ms	0.067000 ms	98.53%	-1.47%	.
USM_Allocation_latency_fp32_shared	0.064 ms	0.053000 ms	82.81%	-17.19%	----------

Relative perf in group VectorAddition (3): 101.026%

Benchmark	This PR	baseline	Relative perf	Change	-
VectorAddition_fp32	1.456000 ms	1.556 ms	106.87%	6.87%	++++
VectorAddition_int64	3.064000 ms	3.177 ms	103.69%	3.69%	++
VectorAddition_int32	1.554 ms	1.446000 ms	93.05%	-6.95%	----

Relative perf in group Polybench (3): 100.147%

Benchmark	This PR	baseline	Relative perf	Change	-
Polybench_2mm	1.212000 ms	1.223 ms	100.91%	0.91%	.
Polybench_3mm	1.734 ms	1.730000 ms	99.77%	-0.23%	.
Polybench_Atax	6.882 ms	6.866000 ms	99.77%	-0.23%	.

Relative perf in group Kmeans (1): 100.025%

Benchmark	This PR	baseline	Relative perf	Change	-
Kmeans_fp32	16.052000 ms	16.056 ms	100.02%	0.02%	.

Relative perf in group LinearRegressionCoeff (1): 101.258%

Benchmark	This PR	baseline	Relative perf	Change	-
LinearRegressionCoeff_fp32	863.524000 ms	874.384 ms	101.26%	1.26%	.

Relative perf in group MolecularDynamics (1): 103.571%

Benchmark	This PR	baseline	Relative perf	Change	-
MolecularDynamics	0.028000 ms	0.029 ms	103.57%	3.57%	++

Relative perf in group llama.cpp (6): 100.287%

Benchmark	This PR	baseline	Relative perf	Change	-
llama.cpp Prompt Processing Batched 128	799.566647 token/s	776.176 token/s	103.01%	3.01%	++
llama.cpp Text Generation Batched 128	62.658740 token/s	62.610 token/s	100.08%	0.08%	.
llama.cpp Text Generation Batched 256	62.667969 token/s	62.644 token/s	100.04%	0.04%	.
llama.cpp Text Generation Batched 512	62.667643 token/s	62.664 token/s	100.01%	0.01%	.
llama.cpp Prompt Processing Batched 512	446.621 token/s	447.955779 token/s	99.70%	-0.30%	.
llama.cpp Prompt Processing Batched 256	888.583 token/s	898.153602 token/s	98.93%	-1.07%	.

Relative perf in group alloc/max (20): 102.226%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	3813.870000 ns	4491.160 ns	117.76%	17.76%	++++++++++
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 os_provider	1688.180000 ns	1890.160 ns	111.96%	11.96%	+++++++
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 scalable_pool<os_provider>	960.228000 ns	1039.730 ns	108.28%	8.28%	+++++
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 os_provider	2011.990000 ns	2160.870 ns	107.40%	7.40%	++++
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 glibc	850.182000 ns	886.983 ns	104.33%	4.33%	++
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 scalable_pool<os_provider>	956.301000 ns	988.448 ns	103.36%	3.36%	++
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 glibc	712.064000 ns	735.791 ns	103.33%	3.33%	++
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	285.454000 ns	294.862 ns	103.30%	3.30%	++
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 glibc	753.012000 ns	768.256 ns	102.02%	2.02%	+
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	297.490000 ns	302.838 ns	101.80%	1.80%	.
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 glibc	177.480000 ns	179.529 ns	101.15%	1.15%	.
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 os_provider	186.242000 ns	187.761 ns	100.82%	0.82%	.
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 glibc	1230.370000 ns	1238.370 ns	100.65%	0.65%	.
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	215.466000 ns	216.850 ns	100.64%	0.64%	.
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	259.777 ns	258.719000 ns	99.59%	-0.41%	.
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	212.979 ns	211.876000 ns	99.48%	-0.52%	.
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 os_provider	193.173 ns	191.832000 ns	99.31%	-0.69%	.
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	274.346 ns	270.916000 ns	98.75%	-1.25%	.
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 glibc	2643.070 ns	2571.500000 ns	97.29%	-2.71%	--
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	3853.960 ns	3344.700000 ns	86.79%	-13.21%	-------

Relative perf in group multiple (12): 101.311%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 scalable_pool<os_provider>	41338.500000 ns	43389.000 ns	104.96%	4.96%	+++
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 glibc	32260.600000 ns	33605.200 ns	104.17%	4.17%	++
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 scalable_pool<os_provider>	26282.000000 ns	27311.800 ns	103.92%	3.92%	++
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 glibc	135820.000000 ns	139597.000 ns	102.78%	2.78%	++
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 scalable_pool<os_provider>	15162.100000 ns	15421.500 ns	101.71%	1.71%	.
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 glibc	30751.400000 ns	31247.600 ns	101.61%	1.61%	.
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 proxy_pool<os_provider>	158322.000000 ns	159366.000 ns	100.66%	0.66%	.
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 proxy_pool<os_provider>	1139900.000 ns	1139210.000000 ns	99.94%	-0.06%	.
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 scalable_pool<os_provider>	72122.400 ns	71690.300000 ns	99.40%	-0.60%	.
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 os_provider	138284.000 ns	137218.000000 ns	99.23%	-0.77%	.
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 glibc	4186.440 ns	4139.250000 ns	98.87%	-1.13%	.
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 os_provider	1195530.000 ns	1180540.000000 ns	98.75%	-1.25%	.

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00390882 s
bitcracker - total time for whole calculation: 35.2866 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1111 1271 30.1656% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1231 1262 33.4238% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1108 1272 30.0842% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1262 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1106 1278 30.0299% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1239 1274 33.6411% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1221 1272 33.1523% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1099 1259 29.8398% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1152 1278 31.2788% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1105 1272 30.0027% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1236 1270 33.5596% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1220 1262 33.1252% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1220 1259 33.1252% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1232 1270 33.451% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1110 1267 30.1385% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1039 1255 28.2107% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1222 1257 33.1795% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1104 1259 29.9756% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1124 1267 30.5186% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1236 1271 33.5596% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1221 1261 33.1523% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1238 1272 33.6139% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1111 1265 30.1656% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1086 1258 29.4868% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1262 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1264 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1219 1256 33.098% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1224 1260 33.2338% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1113 1263 30.2199% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1210 1256 32.8537% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1147 1253 31.1431% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1223 1258 33.2066% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1131 1271 30.7087% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1108 1276 30.0842% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1231 1265 33.4238% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1199 1256 32.555% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1264 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1123 1268 30.4914% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1272 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1263 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1220 1256 33.1252% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1263 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1102 1268 29.9213% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1237 1271 33.5868% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1050 1266 28.5094% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1161 1250 31.5232% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1221 1257 33.1523% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1094 1273 29.704% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1233 1268 33.4781% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1242 1276 33.7225% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Avg workload time = 203.33 ms

Velocity-Bench Easywave

Environment Variables:

Command:

/home/pmdk/bench_workdir/easywave/easyWave_sycl -grid /home/pmdk/bench_workdir/data/easywave/examples/e2Asean.grd -source /home/pmdk/bench_workdir/data/easywave/examples/BengkuluSept2007.flt -time 120

Output:

MAIN: Starting SYCL main program
MAIN: Attempting to clean up previous eWave tsunami files
MAIN: Clean up completed
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
MAIN: Program successfully completed

Velocity-Bench QuickSilver

Environment Variables:

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Output:

Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version :
Quicksilver Git Hash :
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1

Loading params
Finished loading params
Simulation:
dt: 1e-08
fMax: 0.1
inputFile: /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp
energySpectrum:
boundaryCondition: octant
loadBalance: 1
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 10000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:

Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0

Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 0.1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 0
fissionCrossSectionRatio: 0
scatteringCrossSectionRatio: 1

CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
setting GPU
setting parameters
Building partition 0
Building partition 1
Building partition 2
Building partition 3
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Started copyMaterialDatabase_device
Finished copyMaterialDatabase_device
Finished copyNuclearData_device
Finished copyDomainDevice
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 1000000 0 9000000 0 18533189 0 0 18533189 1151780 8848220 55527935 1.854923e+09 3.711540e-01 6.081480e-01 0.000000e+00
1 8848220 1000000 0 151478 0 34281997 0 0 34281997 1664159 8335539 94633679 5.047651e+09 3.446580e-01 7.451310e-01 0.000000e+00
2 8335539 1000000 0 663717 0 34354432 0 0 34354432 1366771 8632485 95010375 7.705930e+09 3.413480e-01 7.599690e-01 0.000000e+00
3 8632485 1000000 0 367978 0 34302727 0 0 34302727 1242216 8758247 94953591 9.992076e+09 3.677370e-01 8.260470e-01 0.000000e+00
4 8758247 1000000 0 242076 0 34141236 0 0 34141236 1168452 8831871 94599337 1.199834e+10 3.373520e-01 7.986320e-01 0.000000e+00
5 8831871 1000000 0 168070 0 33948724 0 0 33948724 1121156 8878785 94148236 1.377636e+10 3.383960e-01 7.821140e-01 0.000000e+00
6 8878785 1000000 0 120572 0 33760567 0 0 33760567 1089103 8910254 93689264 1.535668e+10 3.422090e-01 7.806090e-01 0.000000e+00
7 8910254 1000000 0 89810 0 33552179 0 0 33552179 1065203 8934861 93216931 1.676993e+10 3.439010e-01 7.824050e-01 0.000000e+00
8 8934861 1000000 0 65491 0 33384605 0 0 33384605 1047720 8952632 92768273 1.804559e+10 3.410210e-01 7.819310e-01 0.000000e+00
9 8952632 1000000 0 47165 0 33198494 0 0 33198494 1033968 8965829 92324678 1.920208e+10 3.366740e-01 7.567830e-01 0.000000e+00

Timer                      Cumulative   Cumulative   Cumulative   Cumulative   Cumulative   Cumulative
Name                           number    microSecs    microSecs    microSecs    microSecs   Efficiency
                             of calls          min          avg          max       stddev       Rating
main                                1    1.109e+07    1.109e+07    1.109e+07    0.000e+00       100.00
cycleInit                          10    3.464e+06    3.464e+06    3.464e+06    0.000e+00       100.00
cycleTracking                      10    7.622e+06    7.622e+06    7.622e+06    0.000e+00       100.00
cycleTracking_Kernel              104    4.913e+06    4.913e+06    4.913e+06    0.000e+00       100.00
cycleTracking_MPI                 117    2.069e+05    2.069e+05    2.069e+05    0.000e+00       100.00
cycleTracking_Test_Done             0    0.000e+00    0.000e+00    0.000e+00    0.000e+00         0.00
cycleFinalize                      20    4.000e+02    4.000e+02    4.000e+02    0.000e+00       100.00
Figure Of Merit 118.20 [Num Mega Segments / Cycle Tracking Time]

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Output:

SYMN: Welcome to the SYCL version of Sobel filter workload.
SYMN: Input image file: /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png
SYMN: Launching SYCL kernel with # of iterations: 5
time to subtract from total: 7.49573 s
sobelfilter - total time for whole calculation: 0.533583 s

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Output:

	Welcome to DL-CIFAR workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: DL NW size type: WORKLOAD_DEFAULT_SIZE
WL PARAMS: ==================================================
WL PARAMS:

dataFileReadTimer->getTotalOpTime(): 8.8e-05 s
dl-cifar - total time for whole calculation: 23.4368 s

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Output:

	Welcome to DL-MNIST workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: Tensor management policy: per_layer
WL PARAMS: Convolution algorithm: ONEDNN_AUTO
WL PARAMS: Dataset reader format: NCHW
WL PARAMS: Dry run: YES
WL PARAMS: OneDNN Conv PD memory format: ONEDNN_CONVPD_ANY
WL PARAMS: No of iterations for inference: 400
WL PARAMS: ==================================================
WL PARAMS:

dl-mnist - total time for whole calculation: 2.73 s

Velocity-Bench svm

Environment Variables:

Command:

/home/pmdk/bench_workdir/svm/svm_sycl /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

Output:

name,iterations,real_time,cpu_time,time_unit,bytes_per_second,items_per_second,label,error_occurred,error_message
"glibc/alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4",800000,2643.07,1769.78,ns,,,,,
"glibc/alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1",200000,720.452,720.452,ns,,,,,
"glibc/alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4",800000,1256.51,1210.39,ns,,,,,
"glibc/alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1",200000,759.477,759.476,ns,,,,,
"glibc/alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4",800000,899.469,838.579,ns,,,,,
"glibc/alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1",200000,177.48,177.476,ns,,,,,
"os_provider/alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4",800000,2064.16,2062.54,ns,,,,,
"os_provider/alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1",200000,185.724,185.719,ns,,,,,
"os_provider/alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4",800000,1841.21,1841.16,ns,,,,,
"os_provider/alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1",200000,187.905,187.9,ns,,,,,
"proxy_pool<os_provider>/alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:4",800000,3592.83,3586.22,ns,,,,,
"proxy_pool<os_provider>/alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:1",200000,261.294,261.286,ns,,,,,
"proxy_pool<os_provider>/alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:4",800000,3969.23,3963.81,ns,,,,,
"proxy_pool<os_provider>/alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:1",200000,285.403,285.396,ns,,,,,
"scalable_pool<os_provider>/alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4",800000,297.49,283.975,ns,,,,,
"scalable_pool<os_provider>/alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1",200000,216.039,215.987,ns,,,,,
"scalable_pool<os_provider>/alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4",800000,262.345,261.416,ns,,,,,
"scalable_pool<os_provider>/alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1",200000,207.696,207.694,ns,,,,,
"scalable_pool<os_provider>/alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4",800000,947.498,937.967,ns,,,,,
"scalable_pool<os_provider>/alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1",200000,951.551,951.539,ns,,,,,
"glibc/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4",8000,32260.6,30779,ns,,,,,
"glibc/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1",2000,4202.52,4202.42,ns,,,,,
"glibc/multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4",8000,135820,87298.3,ns,,,,,
"glibc/multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1",2000,31681.2,31680.9,ns,,,,,
"proxy_pool<os_provider>/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4",8000,1.1399e+06,1.13951e+06,ns,,,,,
"proxy_pool<os_provider>/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1",2000,160162,160157,ns,,,,,
"os_provider/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4",8000,1.14093e+06,1.14044e+06,ns,,,,,
"os_provider/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1",2000,136959,136958,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4",8000,42823.2,41712.7,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1",2000,14736.3,14735.9,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4",8000,72122.4,72105.2,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1",2000,26282,26281.5,ns,,,,,

github-actions · 2025-01-14T14:29:24Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12769778913

pbalcer · 2025-01-14T14:29:04Z

scripts/benchmarks/benches/compute.py

            MemcpyExecute(self, 10, 16, 1024, 10000, 0, 1, 1),
            MemcpyExecute(self, 4096, 1, 1024, 10, 0, 1, 0),
            MemcpyExecute(self, 4096, 4, 1024, 10, 0, 1, 0),
+            GraphApiSinKernelSYCL(self, 0, 1),


hm, this might be too much :D

pbalcer · 2025-01-14T14:30:04Z

scripts/benchmarks/benches/compute.py

+        super().__init__(bench, "graph_api_benchmark_sycl", "SinKernel")
+
+    def name(self):
+        return f"graph_api_benchmark_sycl SinKernel"


Names need to reflect the arguments; otherwise the benchmarks won't be unique in the output.
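
A minimal sketch of what argument-aware naming could look like. The class below is a hypothetical stand-in rather than the actual class from compute.py, and the constructor parameter names are assumptions; the name format mirrors the `graphs:…, numKernels:…` labels that show up in later benchmark results in this thread.

```python
class GraphApiSinKernelGraph:
    """Hypothetical stand-in for the benchmark class under review."""

    def __init__(self, with_graphs: int, num_kernels: int):
        # Assumed parameter names; the real constructor also takes the suite.
        self.with_graphs = with_graphs
        self.num_kernels = num_kernels

    def name(self) -> str:
        # Embed every varying argument so each scenario gets a distinct name.
        return (
            "graph_api_benchmark_sycl SinKernelGraph "
            f"graphs:{self.with_graphs}, numKernels:{self.num_kernels}"
        )


# Four scenarios, four distinct names.
scenarios = [GraphApiSinKernelGraph(g, n) for g in (0, 1) for n in (10, 100)]
names = [s.name() for s in scenarios]
```

With a constant name, all four scenarios above would collapse into a single entry in the results.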

pbalcer

The graph benchmarks didn't run because compute-benchmarks is too old. You need to update the commit being used (line 25).

github-actions · 2025-01-15T10:37:48Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12786535660

github-actions · 2025-01-15T10:55:55Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12786535660
Job status: cancelled. Test status: cancelled.

github-actions · 2025-01-15T10:56:25Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12786836274

github-actions · 2025-01-15T14:38:32Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12790642289

github-actions · 2025-01-15T16:01:45Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12792217891

github-actions · 2025-01-15T16:45:36Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12793013959

pbalcer · 2025-01-16T13:26:33Z

scripts/benchmarks/benches/compute.py

            MemcpyExecute(self, 10, 16, 1024, 10000, 0, 1, 1),
            MemcpyExecute(self, 4096, 1, 1024, 10, 0, 1, 0),
            MemcpyExecute(self, 4096, 4, 1024, 10, 0, 1, 0),
+            GraphApiSinKernelGraphSYCL(self, 0, 10),


Please pick at most 2 or 3 scenarios per benchmark. We need to keep the runtime of the whole job reasonable (I'm aiming for <30 minutes).

pbalcer · 2025-01-16T13:27:45Z

scripts/benchmarks/benches/compute.py

+
+    def bin_args(self) -> list[str]:
+        return [
+            "--iterations=100",


Is this enough iterations for the benchmarks to have reproducible results?
We aim for the stddev between runs (or, rather, the coefficient of variation across all the runs) to be smaller than 2%.
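
For reference, a rough sketch of how that check could be expressed. The function names and the sample values are made up for illustration and are not part of the benchmark scripts; the 2% threshold is the target mentioned above.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean (needs >= 2 samples)."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(samples) / mean


def is_stable(samples: list[float], threshold: float = 0.02) -> bool:
    """True when run-to-run variation stays below the threshold (2% by default)."""
    return coefficient_of_variation(samples) < threshold


# Hypothetical per-run results in microseconds, not measured data.
steady_runs = [100.0, 101.0, 99.0, 100.5, 99.5]
noisy_runs = [100.0, 110.0, 90.0, 105.0, 95.0]

steady_ok = is_stable(steady_runs)  # CV is roughly 0.8%
noisy_ok = is_stable(noisy_runs)    # CV is roughly 7.9%
```

If a scenario fails this kind of check, raising the iteration count is the first thing to try.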

pbalcer · 2025-01-16T13:30:27Z

This failed with:

RequestError [HttpError]: Validation Failed: {"resource":"IssueComment","code":"unprocessable","field":"data","message":"Body is too long (maximum is 65536 characters)"}

If this keeps happening after you've reduced the number of scenarios, I suggest we temporarily remove the output (just comment it out) from the markdown. I plan on eventually creating an HTML file per PR and then linking to it in the markdown, which will give us the ability to have longer content with all the details.
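
As a stopgap, something along these lines could guard the comment length before posting. This is only a sketch: `build_comment` and its parameters are hypothetical names and not part of the existing scripts; the 65536 limit is the one quoted in the error above.

```python
MAX_COMMENT_CHARS = 65536  # limit quoted in the error message above


def build_comment(summary: str, details: str, limit: int = MAX_COMMENT_CHARS) -> str:
    """Return summary plus details, dropping the details if the body is too long."""
    full = summary + "\n\n" + details
    if len(full) <= limit:
        return full
    # Keep the summary and replace the raw output with a short notice.
    note = "\n\n(Benchmark output omitted: comment would exceed the size limit.)"
    return summary[: limit - len(note)] + note


short_body = build_comment("Summary", "Output")
long_body = build_comment("Summary", "x" * 100_000)
```

The summary tables are kept in preference to the raw output because they are what reviewers look at first.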

github-actions · 2025-01-16T13:31:45Z

Compute Benchmarks level_zero run (with params: --filter graph):
https://github.com/oneapi-src/unified-runtime/actions/runs/12809980280

github-actions · 2025-01-16T14:05:09Z

Compute Benchmarks level_zero run (--filter graph):
https://github.com/oneapi-src/unified-runtime/actions/runs/12809980280
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group graph (3): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	86719.406000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:50	248664.883000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	440612.037000 μs	-

Relative perf in group api (9): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.528000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	23.678000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.844000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.118000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.675000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	101923.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.896000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	107041.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.663000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	253.805000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	132.929000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.638000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.151000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	858.609000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6935.535000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17316.620000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	47907.007000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2022.915000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7452.758000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8555.721000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	25543.132000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1157.521000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	40973.625000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	108338.415000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	362.504819 M keys/sec
Velocity-Bench Bitcracker	-	35.129800 s
Velocity-Bench CudaSift	-	201.142000 ms
Velocity-Bench Easywave	-	229.000000 ms
Velocity-Bench QuickSilver	-	117.490000 MMS/CTT
Velocity-Bench Sobel Filter	-	602.045000 ms
Velocity-Bench dl-cifar	-	23.743900 s
Velocity-Bench dl-mnist	-	2.720000 s
Velocity-Bench svm	-	0.139900 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	259.395000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	275.382000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	278.916000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	278.736000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1678.732000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1746.233000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1725.256000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1695.816000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	4.238000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.317000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.322000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.414000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	617.994000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	617.954000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.547000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	4.781000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	4.574000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	4.702000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.523000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.254000 ms
MicroBench_LocalMem_int32_4096	-	29.866000 ms
MicroBench_LocalMem_fp32_4096	-	29.833000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.163000 ms
Pattern_Reduction_Hierarchical_int32	-	16.411000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.264000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.164000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.336000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.163000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.599000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.779000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.589000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.733000 ms
ScalarProduct_NDRange_int64	-	5.456000 ms
ScalarProduct_NDRange_fp32	-	3.759000 ms
ScalarProduct_Hierarchical_int32	-	10.523000 ms
ScalarProduct_Hierarchical_int64	-	11.490000 ms
ScalarProduct_Hierarchical_fp32	-	10.170000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.068000 ms
USM_Allocation_latency_fp32_host	-	37.899000 ms
USM_Allocation_latency_fp32_shared	-	0.066000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.661000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.046000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.814000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.195000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.448000 ms
VectorAddition_int64	-	3.139000 ms
VectorAddition_fp32	-	1.445000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.216000 ms
Polybench_3mm	-	1.727000 ms
Polybench_Atax	-	6.880000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	16.083000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.028000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	838.869803 token/s
llama.cpp Text Generation Batched 128	-	63.338561 token/s
llama.cpp Prompt Processing Batched 256	-	872.377637 token/s
llama.cpp Text Generation Batched 256	-	63.361520 token/s
llama.cpp Prompt Processing Batched 512	-	434.541716 token/s
llama.cpp Text Generation Batched 512	-	63.295460 token/s

Relative perf in group alloc/max (20): cannot calculate

Benchmark	This PR	baseline
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 glibc	-	2589.180000 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 glibc	-	710.936000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 glibc	-	1188.310000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 glibc	-	716.901000 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 glibc	-	861.597000 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 glibc	-	175.935000 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 os_provider	-	2246.790000 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 os_provider	-	187.819000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 os_provider	-	1690.250000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 os_provider	-	189.702000 ns
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	-	4441.700000 ns
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	-	256.696000 ns
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	-	3268.220000 ns
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	-	306.439000 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	-	299.852000 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	-	213.534000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	-	263.904000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	-	197.833000 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 scalable_pool<os_provider>	-	1051.720000 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 scalable_pool<os_provider>	-	952.492000 ns

Relative perf in group multiple (12): cannot calculate

Benchmark	This PR	baseline
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 glibc	-	32574.000000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 glibc	-	4128.530000 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 glibc	-	138399.000000 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 glibc	-	28197.400000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 proxy_pool<os_provider>	-	1161430.000000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 proxy_pool<os_provider>	-	161766.000000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 os_provider	-	1166110.000000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 os_provider	-	141737.000000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 scalable_pool<os_provider>	-	42212.800000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 scalable_pool<os_provider>	-	14889.200000 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 scalable_pool<os_provider>	-	72778.500000 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 scalable_pool<os_provider>	-	27538.700000 ns

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SinKernelGraph(api=sycl numKernels=100 withGraphs=0),441720.361,440612.037,1.42%,430127.709,457010.907,[CPU],[us]

github-actions · 2025-01-16T14:52:14Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12811438249

pbalcer · 2025-01-16T14:56:51Z

All graph-related benchmarks failed with:

Abort was called at 80 line in file:
./shared/source/command_stream/linear_stream.h

and then I think the GPU crashed:

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  No device of requested type available.

Have you seen that before?

github-actions · 2025-01-16T15:02:22Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12811438249
Job status: failure. Test status: failure.

github-actions · 2025-01-20T10:02:21Z

Compute Benchmarks level_zero run (with params: --filter "graph"):
https://github.com/oneapi-src/unified-runtime/actions/runs/12865513647

github-actions · 2025-01-20T10:20:24Z

Compute Benchmarks level_zero run (--filter "graph"):
https://github.com/oneapi-src/unified-runtime/actions/runs/12865513647
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group graph (14): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71750.474000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72598.586000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:50	196881.249000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:50	197215.709000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:200	666400.501000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:200	733155.088000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:20	11444.061000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:20	98.031000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:20	11444.025000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:20	113.459000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:200	116870.724000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:200	1513.660000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:200	117268.269000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:200	1690.024000 μs	-

Relative perf in group api (12): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.848000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.745000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	23.710000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.891000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.143000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.702000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	105463.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.623000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110815.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.859000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	123991.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.425000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	254.865000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	219.808000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.865000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.043000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	861.253000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6931.139000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17007.721000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	47383.460000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2073.904000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7868.958000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	9035.852000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	27237.512000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1194.467000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	42860.412000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	113343.613000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	356.084148 M keys/sec
Velocity-Bench Bitcracker	-	35.118800 s
Velocity-Bench CudaSift	-	204.342000 ms
Velocity-Bench Easywave	-	289.000000 ms
Velocity-Bench QuickSilver	-	117.450000 MMS/CTT
Velocity-Bench Sobel Filter	-	621.173000 ms
Velocity-Bench dl-cifar	-	23.972100 s
Velocity-Bench dl-mnist	-	2.380000 s
Velocity-Bench svm	-	0.140100 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	253.100000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	273.484000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	271.662000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	272.505000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1691.410000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1756.502000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1721.262000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1694.375000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.188000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.967000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.769000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.866000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	618.226000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	618.268000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.919000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	5.115000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	5.140000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.113000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.772000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.628000 ms
MicroBench_LocalMem_int32_4096	-	29.834000 ms
MicroBench_LocalMem_fp32_4096	-	29.857000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.971000 ms
Pattern_Reduction_Hierarchical_int32	-	17.024000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.263000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.164000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.333000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.163000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.587000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.777000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.588000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.734000 ms
ScalarProduct_NDRange_int64	-	5.456000 ms
ScalarProduct_NDRange_fp32	-	3.767000 ms
ScalarProduct_Hierarchical_int32	-	10.555000 ms
ScalarProduct_Hierarchical_int64	-	11.508000 ms
ScalarProduct_Hierarchical_fp32	-	10.174000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.068000 ms
USM_Allocation_latency_fp32_host	-	37.633000 ms
USM_Allocation_latency_fp32_shared	-	0.057000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.717000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.085000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.889000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.256000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.510000 ms
VectorAddition_int64	-	3.066000 ms
VectorAddition_fp32	-	1.460000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.221000 ms
Polybench_3mm	-	1.730000 ms
Polybench_Atax	-	6.855000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	16.091000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	908.423000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.030000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	830.457525 token/s
llama.cpp Text Generation Batched 128	-	62.530663 token/s
llama.cpp Prompt Processing Batched 256	-	872.219855 token/s
llama.cpp Text Generation Batched 256	-	62.524658 token/s
llama.cpp Prompt Processing Batched 512	-	426.427709 token/s
llama.cpp Text Generation Batched 512	-	62.477744 token/s

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): cannot calculate

Benchmark	This PR	baseline
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	-	2475.310000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	-	2120.000000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	-	3068.370000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	-	283.309000 ns

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): cannot calculate

Benchmark	This PR	baseline
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	-	706.837000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	-	197.281000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	-	268.948000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	-	213.433000 ns

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): cannot calculate

Benchmark	This PR	baseline
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	-	1259.770000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	-	1854.120000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	-	3771.150000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	-	253.839000 ns

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): cannot calculate

Benchmark	This PR	baseline
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	-	726.627000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	-	195.246000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	-	308.264000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	-	206.713000 ns

Relative perf in group alloc/min (4): cannot calculate

Benchmark	This PR	baseline
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	-	803.081000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	-	177.090000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	-	978.697000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	-	975.381000 ns

Relative perf in group multiple (12): cannot calculate

Benchmark	This PR	baseline
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	-	33503.600000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	-	4251.600000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	-	141113.000000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	-	30214.100000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	-	1170470.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	-	165011.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	-	1151930.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	-	145356.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	-	42332.700000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	-	15330.800000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	-	75942.600000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	-	25425.600000 ns

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitExecGraph(api=sycl measureSubmit=1 numKernels=200 ioq=1),1691.352,1690.024,1.20%,1637.617,1735.173,[CPU],[us]
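
For anyone post-processing this output by hand, here is a minimal sketch of reading the CSV above with Python's standard `csv` module. The helper name `parse_results` is hypothetical and this is not the parser used by the benchmark scripts. It assumes the unit is always the last field in brackets, since the data row carries one more field (`[us]`) than the header names.

```python
import csv
import io

# Output copied verbatim from the comment above.
raw = """TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitExecGraph(api=sycl measureSubmit=1 numKernels=200 ioq=1),1691.352,1690.024,1.20%,1637.617,1735.173,[CPU],[us]
"""


def parse_results(text):
    """Return (label, mean, unit) tuples for each data row (hypothetical helper)."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, data = rows[0], rows[1:]
    mean_idx = header.index("Mean")
    results = []
    for row in data:
        label = row[0]
        mean = float(row[mean_idx])
        # Assumption: the trailing field holds the unit, e.g. "[us]".
        unit = row[-1].strip("[]")
        results.append((label, mean, unit))
    return results


for label, mean, unit in parse_results(raw):
    print(f"{label}: {mean} {unit}")
```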

github-actions · 2025-01-20T11:38:09Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12867191102

github-actions · 2025-01-20T12:25:48Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12867191102
Job status: success. Test status: success.

Summary

Total 128 benchmarks in mean.
Geomean 100.080%.
Improved 16 Regressed 13 (threshold 2.00%)

(result is better)

Performance change in benchmark groups

Relative perf in group api (12): 99.833%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_l0 SubmitKernel out of order	11.686000 μs	11.848 μs	101.39%	1.39%	.
api_overhead_benchmark_ur SubmitKernel in order	16.647000 μs	16.859 μs	101.27%	1.27%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.690000 μs	1.702 μs	100.71%	0.71%	.
api_overhead_benchmark_sycl SubmitKernel out of order	23.555000 μs	23.710 μs	100.66%	0.66%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	105463.000000 instr	105463.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	110815.000000 instr	110815.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	123991.000000 instr	123991.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.513 μs	21.425000 μs	99.59%	-0.41%	.
api_overhead_benchmark_sycl SubmitKernel in order	24.997 μs	24.891000 μs	99.58%	-0.42%	.
api_overhead_benchmark_l0 SubmitKernel in order	11.827 μs	11.745000 μs	99.31%	-0.69%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.177 μs	2.143000 μs	98.44%	-1.56%	.
api_overhead_benchmark_ur SubmitKernel out of order	16.083 μs	15.623000 μs	97.14%	-2.86%	--
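
The figures in this table are consistent with relative perf being `baseline / this PR` for time-like metrics, and with the group value being the geometric mean of the per-benchmark ratios. The sketch below reproduces that arithmetic. The function names are hypothetical, the `lower_is_better` flag is an assumption about how metrics are oriented, and treating the 2.00% threshold as inclusive is a guess rather than something the report states.

```python
import math


def relative_perf(this_pr, baseline, lower_is_better=True):
    """Ratio above 1.0 means this PR is better than the baseline."""
    return baseline / this_pr if lower_is_better else this_pr / baseline


def geomean(ratios):
    """Geometric mean computed in log space."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def classify(ratio, threshold=0.02):
    """Bucket a ratio using the report's 2.00% threshold (inclusive by assumption)."""
    change = ratio - 1.0
    if change >= threshold:
        return "improved"
    if change <= -threshold:
        return "regressed"
    return "unchanged"


# "Relative perf" column of the api group above, expressed as fractions.
api_group = [1.0139, 1.0127, 1.0071, 1.0066, 1.0, 1.0, 1.0,
             0.9959, 0.9958, 0.9931, 0.9844, 0.9714]

# First row of the table: 11.686 us (this PR) against 11.848 us (baseline).
print(f"{relative_perf(11.686, 11.848):.2%}")
print(f"group geomean: {geomean(api_group):.3%}")
print(classify(0.9714))
```

Because the inputs here are the already-rounded percentages, the recomputed group value can differ from the reported 99.833% in the last digit.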

Relative perf in group memory (4): 100.317%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.066000 GB/s	3.043 GB/s	100.76%	0.76%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.823000 μs	5.865 μs	100.72%	0.72%	.
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	219.341000 μs	219.808 μs	100.21%	0.21%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	255.934 μs	254.865000 μs	99.58%	-0.42%	.

Relative perf in group miscellaneous (1): 107.050%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	804.534000 bw GB/s	861.253 bw GB/s	107.05%	7.05%	+++++

Relative perf in group multithread (10): 100.223%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	26703.389000 μs	27237.512 μs	102.00%	2.00%	++
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8862.915000 μs	9035.852 μs	101.95%	1.95%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7806.548000 μs	7868.958 μs	100.80%	0.80%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2063.263000 μs	2073.904 μs	100.52%	0.52%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	17033.193 μs	17007.721000 μs	99.85%	-0.15%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	43030.247 μs	42860.412000 μs	99.61%	-0.39%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	113819.124 μs	113343.613000 μs	99.58%	-0.42%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6961.545 μs	6931.139000 μs	99.56%	-0.44%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	47670.697 μs	47383.460000 μs	99.40%	-0.60%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1206.371 μs	1194.467000 μs	99.01%	-0.99%	.

Relative perf in group graph (10): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71758.766000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72524.970000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	353498.172000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	353215.904000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	54.135000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	61.707000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	677.085000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	5598.586000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	5599.166000 μs	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	56454.825000 μs	-
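
With no baseline for this group yet, one rough way to sanity-check the new numbers is to divide each SubmitExecGraph time by its `numKernels` and see whether the cost scales roughly linearly. The sketch below does only that arithmetic on the values in the table. It assumes the reported time covers the whole graph, which the table itself does not state, and `per_kernel_us` is a hypothetical helper.

```python
# (ioq, submit, numKernels) -> mean time in microseconds, copied from the table above.
submit_exec_graph_us = {
    (0, 1, 10): 54.135,
    (1, 1, 10): 61.707,
    (1, 1, 100): 677.085,
    (0, 0, 10): 5598.586,
    (1, 0, 10): 5599.166,
    (1, 0, 100): 56454.825,
}


def per_kernel_us(results):
    """Divide each total time by the kernel count stored in its key."""
    return {key: total / key[2] for key, total in results.items()}


for (ioq, submit, num_kernels), cost in per_kernel_us(submit_exec_graph_us).items():
    print(f"ioq:{ioq} submit:{submit} numKernels:{num_kernels} -> {cost:.2f} us/kernel")
```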

Relative perf in group Velocity-Bench (9): 100.043%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench Sobel Filter	598.763000 ms	621.173 ms	103.74%	3.74%	+++
Velocity-Bench dl-cifar	23.931100 s	23.972 s	100.17%	0.17%	.
Velocity-Bench svm	0.140100 s	0.140 s	100.00%	0.00%	.
Velocity-Bench CudaSift	204.385 ms	204.342000 ms	99.98%	-0.02%	.
Velocity-Bench QuickSilver	117.190 MMS/CTT	117.450000 MMS/CTT	99.78%	-0.22%	.
Velocity-Bench dl-mnist	2.390 s	2.380000 s	99.58%	-0.42%	.
Velocity-Bench Easywave	291.000 ms	289.000000 ms	99.31%	-0.69%	.
Velocity-Bench Hashtable	352.898 M keys/sec	356.084148 M keys/sec	99.11%	-0.89%	.
Velocity-Bench Bitcracker	35.546 s	35.118800 s	98.80%	-1.20%	.

Relative perf in group Runtime (8): 98.949%

Benchmark	This PR	baseline	Relative perf	Change	-
Runtime_DAGTaskThroughput_NDRangeParallelFor	1689.594000 ms	1694.375 ms	100.28%	0.28%	.
Runtime_DAGTaskThroughput_SingleTask	1691.536 ms	1691.410000 ms	99.99%	-0.01%	.
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1721.582 ms	1721.262000 ms	99.98%	-0.02%	.
Runtime_DAGTaskThroughput_BasicParallelFor	1760.146 ms	1756.502000 ms	99.79%	-0.21%	.
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	274.722 ms	272.505000 ms	99.19%	-0.81%	.
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	276.324 ms	273.484000 ms	98.97%	-1.03%	.
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	277.470 ms	271.662000 ms	97.91%	-2.09%	--
Runtime_IndependentDAGTaskThroughput_SingleTask	264.863 ms	253.100000 ms	95.56%	-4.44%	---

Relative perf in group MicroBench (14): 101.666%

Benchmark	This PR	baseline	Relative perf	Change	-
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.842000 ms	5.188 ms	107.15%	7.15%	++++++
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	4.837000 ms	5.113 ms	105.71%	5.71%	++++
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.705000 ms	4.919 ms	104.55%	4.55%	++++
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	5.024000 ms	5.115 ms	101.81%	1.81%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.894000 ms	4.967 ms	101.49%	1.49%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.713000 ms	4.769 ms	101.19%	1.19%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	4.816000 ms	4.866 ms	101.04%	1.04%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	5.102000 ms	5.140 ms	100.74%	0.74%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.496000 ms	617.772 ms	100.04%	0.04%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.551000 ms	617.628 ms	100.01%	0.01%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.213000 ms	618.268 ms	100.01%	0.01%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.194000 ms	618.226 ms	100.01%	0.01%	.
MicroBench_LocalMem_fp32_4096	29.856000 ms	29.857 ms	100.00%	0.00%	.
MicroBench_LocalMem_int32_4096	29.857 ms	29.834000 ms	99.92%	-0.08%	.

Relative perf in group Pattern (10): 100.120%

Benchmark	This PR	baseline	Relative perf	Change	-
Pattern_Reduction_NDRange_int32	16.581000 ms	16.971 ms	102.35%	2.35%	++
Pattern_SegmentedReduction_Hierarchical_int64	11.772000 ms	11.777 ms	100.04%	0.04%	.
Pattern_SegmentedReduction_Hierarchical_int16	11.803 ms	11.801000 ms	99.98%	-0.02%	.
Pattern_SegmentedReduction_Hierarchical_fp32	11.594 ms	11.588000 ms	99.95%	-0.05%	.
Pattern_SegmentedReduction_Hierarchical_int32	11.595 ms	11.587000 ms	99.93%	-0.07%	.
Pattern_SegmentedReduction_NDRange_int16	2.266 ms	2.263000 ms	99.87%	-0.13%	.
Pattern_SegmentedReduction_NDRange_int32	2.167 ms	2.164000 ms	99.86%	-0.14%	.
Pattern_SegmentedReduction_NDRange_fp32	2.166 ms	2.163000 ms	99.86%	-0.14%	.
Pattern_SegmentedReduction_NDRange_int64	2.340 ms	2.333000 ms	99.70%	-0.30%	.
Pattern_Reduction_Hierarchical_int32	17.078 ms	17.024000 ms	99.68%	-0.32%	.

Relative perf in group ScalarProduct (6): 99.984%

Benchmark	This PR	baseline	Relative perf	Change	-
ScalarProduct_NDRange_fp32	3.748000 ms	3.767 ms	100.51%	0.51%	.
ScalarProduct_Hierarchical_fp32	10.142000 ms	10.174 ms	100.32%	0.32%	.
ScalarProduct_Hierarchical_int32	10.542000 ms	10.555 ms	100.12%	0.12%	.
ScalarProduct_Hierarchical_int64	11.500000 ms	11.508 ms	100.07%	0.07%	.
ScalarProduct_NDRange_int64	5.457 ms	5.456000 ms	99.98%	-0.02%	.
ScalarProduct_NDRange_int32	3.775 ms	3.734000 ms	98.91%	-1.09%	.

Relative perf in group USM (7): 99.828%

Benchmark	This PR	baseline	Relative perf	Change	-
USM_Allocation_latency_fp32_device	0.067000 ms	0.068 ms	101.49%	1.49%	.
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.708000 ms	1.717 ms	100.53%	0.53%	.
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.082000 ms	1.085 ms	100.28%	0.28%	.
USM_Allocation_latency_fp32_host	37.628000 ms	37.633 ms	100.01%	0.01%	.
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.891 ms	1.889000 ms	99.89%	-0.11%	.
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.277 ms	1.256000 ms	98.36%	-1.64%	.
USM_Allocation_latency_fp32_shared	0.058 ms	0.057000 ms	98.28%	-1.72%	.

Relative perf in group VectorAddition (3): 96.468%

Benchmark	This PR	baseline	Relative perf	Change	-
VectorAddition_int32	1.523 ms	1.510000 ms	99.15%	-0.85%	.
VectorAddition_fp32	1.499 ms	1.460000 ms	97.40%	-2.60%	--
VectorAddition_int64	3.298 ms	3.066000 ms	92.97%	-7.03%	-----

Relative perf in group Polybench (3): 100.469%

Benchmark	This PR	baseline	Relative perf	Change	-
Polybench_Atax	6.690000 ms	6.855 ms	102.47%	2.47%	++
Polybench_3mm	1.738 ms	1.730000 ms	99.54%	-0.46%	.
Polybench_2mm	1.228 ms	1.221000 ms	99.43%	-0.57%	.

Relative perf in group Kmeans (1): 99.863%

Benchmark	This PR	baseline	Relative perf	Change	-
Kmeans_fp32	16.113 ms	16.091000 ms	99.86%	-0.14%	.

Relative perf in group LinearRegressionCoeff (1): 99.105%

Benchmark	This PR	baseline	Relative perf	Change	-
LinearRegressionCoeff_fp32	916.629 ms	908.423000 ms	99.10%	-0.90%	.

Relative perf in group MolecularDynamics (1): 100.000%

Benchmark	This PR	baseline	Relative perf	Change	-
MolecularDynamics	0.030000 ms	0.030 ms	100.00%	0.00%	.

Relative perf in group llama.cpp (6): 99.590%

Benchmark	This PR	baseline	Relative perf	Change	-
llama.cpp Prompt Processing Batched 512	428.682502 token/s	426.428 token/s	100.53%	0.53%	.
llama.cpp Prompt Processing Batched 256	874.309970 token/s	872.220 token/s	100.24%	0.24%	.
llama.cpp Text Generation Batched 512	62.538520 token/s	62.478 token/s	100.10%	0.10%	.
llama.cpp Text Generation Batched 256	62.545637 token/s	62.525 token/s	100.03%	0.03%	.
llama.cpp Text Generation Batched 128	62.534186 token/s	62.531 token/s	100.01%	0.01%	.
llama.cpp Prompt Processing Batched 128	802.960 token/s	830.457525 token/s	96.69%	-3.31%	---

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 96.175%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2115.110000 ns	2120.000 ns	100.23%	0.23%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3074.680 ns	3068.370000 ns	99.79%	-0.21%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	288.322 ns	283.309000 ns	98.26%	-1.74%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2843.650 ns	2475.310000 ns	87.05%	-12.95%	----------

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 100.875%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	189.975000 ns	197.281 ns	103.85%	3.85%	+++
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	212.024000 ns	213.433 ns	100.66%	0.66%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	270.051 ns	268.948000 ns	99.59%	-0.41%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	710.670 ns	706.837000 ns	99.46%	-0.54%	.

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 104.520%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3341.340000 ns	3771.150 ns	112.86%	12.86%	++++++++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1753.880000 ns	1854.120 ns	105.72%	5.72%	++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1234.200000 ns	1259.770 ns	102.07%	2.07%	++
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	259.035 ns	253.839000 ns	97.99%	-2.01%	--

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 98.834%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	191.153000 ns	195.246 ns	102.14%	2.14%	++
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	206.892 ns	206.713000 ns	99.91%	-0.09%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	750.391 ns	726.627000 ns	96.83%	-3.17%	--
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	319.262 ns	308.264000 ns	96.56%	-3.44%	---

Relative perf in group alloc/min (4): 99.819%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	958.483000 ns	975.381 ns	101.76%	1.76%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	175.625000 ns	177.090 ns	100.83%	0.83%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	815.477 ns	803.081000 ns	98.48%	-1.52%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	996.175 ns	978.697000 ns	98.25%	-1.75%	.

Relative perf in group multiple (12): 99.977%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	137101.000000 ns	141113.000 ns	102.93%	2.93%	++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	73947.800000 ns	75942.600 ns	102.70%	2.70%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15022.500000 ns	15330.800 ns	102.05%	2.05%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32925.600000 ns	33503.600 ns	101.76%	1.76%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	41963.600000 ns	42332.700 ns	100.88%	0.88%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	29996.200000 ns	30214.100 ns	100.73%	0.73%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25290.500000 ns	25425.600 ns	100.53%	0.53%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4275.150 ns	4251.600000 ns	99.45%	-0.55%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	166136.000 ns	165011.000000 ns	99.32%	-0.68%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1199480.000 ns	1170470.000000 ns	97.58%	-2.42%	--
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	149207.000 ns	145356.000000 ns	97.42%	-2.58%	--
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1216290.000 ns	1151930.000000 ns	94.71%	-5.29%	----

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00379828 s
bitcracker - total time for whole calculation: 35.546 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1152 1267 31.2788% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1227 1262 33.3152% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1169 1269 31.7404% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1124 1260 30.5186% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1263 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1240 1275 33.6682% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1238 1271 33.6139% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1209 1263 32.8265% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1129 1264 30.6544% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1063 1269 28.8623% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1097 1257 29.7855% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1129 1269 30.6544% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1142 1276 31.0073% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1238 1273 33.6139% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1239 1274 33.6411% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1094 1260 29.704% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1220 1257 33.1252% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1230 1264 33.3967% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1088 1246 29.5411% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1227 1260 33.3152% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1215 1250 32.9894% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1138 1261 30.8987% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1181 1283 32.0662% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1112 1265 30.1928% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1263 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1230 1263 33.3967% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1226 1261 33.2881% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1230 1266 33.3967% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1191 1258 32.3378% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1163 1268 31.5775% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1125 1265 30.5458% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1233 1265 33.4781% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1124 1263 30.5186% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1234 1268 33.5053% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1148 1268 31.1702% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1094 1263 29.704% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1098 1257 29.8127% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1112 1268 30.1928% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1189 1268 32.2835% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1118 1260 30.3557% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1237 1273 33.5868% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1096 1262 29.7583% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1245 1278 33.804% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1241 1276 33.6954% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1061 1266 28.808% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1230 1264 33.3967% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1060 1259 28.7809% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1230 1261 33.3967% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1193 1267 32.3921% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1132 1268 30.7358% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Avg workload time = 204.385 ms

Velocity-Bench Easywave

Environment Variables:

Command:

/home/pmdk/bench_workdir/easywave/easyWave_sycl -grid /home/pmdk/bench_workdir/data/easywave/examples/e2Asean.grd -source /home/pmdk/bench_workdir/data/easywave/examples/BengkuluSept2007.flt -time 120

Output:

MAIN: Starting SYCL main program
MAIN: Attempting to clean up previous eWave tsunami files
MAIN: Clean up completed
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.6.0)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
MAIN: Program successfully completed

Velocity-Bench QuickSilver

Environment Variables:

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Output:

Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version :
Quicksilver Git Hash :
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1

Loading params
Finished loading params
Simulation:
dt: 1e-08
fMax: 0.1
inputFile: /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp
energySpectrum:
boundaryCondition: octant
loadBalance: 1
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 10000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:

Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0

Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 0.1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 0
fissionCrossSectionRatio: 0
scatteringCrossSectionRatio: 1

CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
setting GPU
setting parameters
Building partition 0
Building partition 1
Building partition 2
Building partition 3
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Started copyMaterialDatabase_device
Finished copyMaterialDatabase_device
Finished copyNuclearData_device
Finished copyDomainDevice
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 1000000 0 9000000 0 18533189 0 0 18533189 1151780 8848220 55527935 1.854923e+09 4.316810e-01 6.222860e-01 0.000000e+00
1 8848220 1000000 0 151478 0 34281997 0 0 34281997 1664159 8335539 94633679 5.047651e+09 3.639650e-01 7.642930e-01 0.000000e+00
2 8335539 1000000 0 663717 0 34354432 0 0 34354432 1366771 8632485 95010375 7.705930e+09 3.614890e-01 7.844610e-01 0.000000e+00
3 8632485 1000000 0 367978 0 34302727 0 0 34302727 1242216 8758247 94953591 9.992076e+09 3.684410e-01 8.416660e-01 0.000000e+00
4 8758247 1000000 0 242076 0 34141236 0 0 34141236 1168452 8831871 94599337 1.199834e+10 3.602160e-01 7.946330e-01 0.000000e+00
5 8831871 1000000 0 168070 0 33948724 0 0 33948724 1121156 8878785 94148236 1.377636e+10 3.603510e-01 7.694460e-01 0.000000e+00
6 8878785 1000000 0 120572 0 33760567 0 0 33760567 1089103 8910254 93689264 1.535668e+10 3.596770e-01 7.623760e-01 0.000000e+00
7 8910254 1000000 0 89810 0 33552179 0 0 33552179 1065203 8934861 93216931 1.676993e+10 3.331780e-01 7.835380e-01 0.000000e+00
8 8934861 1000000 0 65491 0 33384605 0 0 33384605 1047720 8952632 92768273 1.804559e+10 3.323780e-01 7.897290e-01 0.000000e+00
9 8952632 1000000 0 47165 0 33198494 0 0 33198494 1033968 8965829 92324678 1.920208e+10 3.325050e-01 7.748610e-01 0.000000e+00

Timer                     Cumulative  Cumulative  Cumulative  Cumulative  Cumulative  Cumulative
Name                          number   microSecs   microSecs   microSecs   microSecs  Efficiency
                            of calls         min         avg         max      stddev      Rating
main                               1   1.129e+07   1.129e+07   1.129e+07   0.000e+00      100.00
cycleInit                         10   3.604e+06   3.604e+06   3.604e+06   0.000e+00      100.00
cycleTracking                     10   7.687e+06   7.687e+06   7.687e+06   0.000e+00      100.00
cycleTracking_Kernel             104   4.919e+06   4.919e+06   4.919e+06   0.000e+00      100.00
cycleTracking_MPI                117   2.132e+05   2.132e+05   2.132e+05   0.000e+00      100.00
cycleTracking_Test_Done            0   0.000e+00   0.000e+00   0.000e+00   0.000e+00        0.00
cycleFinalize                     20   4.290e+02   4.290e+02   4.290e+02   0.000e+00      100.00
Figure Of Merit 117.19 [Num Mega Segments / Cycle Tracking Time]
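
The figure of merit can be cross-checked against the per-cycle table in the same output. The following is a minimal sketch, assuming (from the bracketed label in the log, not from the QuickSilver source) that the figure is the total `num_seg` count in millions divided by the summed `cycleTracking` column taken as seconds. `CYCLES` and `figure_of_merit` are illustrative names, not part of the benchmark scripts.

```python
# (num_seg, cycleTracking) pairs copied from the ten cycle rows in the log.
# Assumption: the cycleTracking column is in seconds.
CYCLES = [
    (55527935, 6.222860e-01),
    (94633679, 7.642930e-01),
    (95010375, 7.844610e-01),
    (94953591, 8.416660e-01),
    (94599337, 7.946330e-01),
    (94148236, 7.694460e-01),
    (93689264, 7.623760e-01),
    (93216931, 7.835380e-01),
    (92768273, 7.897290e-01),
    (92324678, 7.748610e-01),
]


def figure_of_merit(cycles):
    """Return mega-segments processed per second of cycle tracking time."""
    total_segments = sum(segments for segments, _ in cycles)
    tracking_seconds = sum(seconds for _, seconds in cycles)
    return total_segments / 1e6 / tracking_seconds


fom = figure_of_merit(CYCLES)
print(f"Figure of merit: {fom:.2f}")
```

For the ten cycles logged here this works out to about 117.19, in line with the value QuickSilver reports.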

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Output:

SYMN: Welcome to the SYCL version of Sobel filter workload.
SYMN: Input image file: /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png
SYMN: Launching SYCL kernel with # of iterations: 5
time to subtract from total: 7.49288 s
sobelfilter - total time for whole calculation: 0.598763 s

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Output:

	Welcome to DL-CIFAR workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.6.0)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.6.0)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: DL NW size type: WORKLOAD_DEFAULT_SIZE
WL PARAMS: ==================================================
WL PARAMS:

dataFileReadTimer->getTotalOpTime(): 8.8e-05 s
dl-cifar - total time for whole calculation: 23.9311 s

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Output:

	Welcome to DL-MNIST workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.6.0)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.6.0)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: Tensor management policy: per_layer
WL PARAMS: Convolution algorithm: ONEDNN_AUTO
WL PARAMS: Dataset reader format: NCHW
WL PARAMS: Dry run: YES
WL PARAMS: OneDNN Conv PD memory format: ONEDNN_CONVPD_ANY
WL PARAMS: No of iterations for inference: 400
WL PARAMS: ==================================================
WL PARAMS:

dl-mnist - total time for whole calculation: 2.39 s

Velocity-Bench svm

Environment Variables:

Command:

/home/pmdk/bench_workdir/svm/svm_sycl /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

Output:

name,iterations,real_time,cpu_time,time_unit,bytes_per_second,items_per_second,label,error_occurred,error_message
"glibc/alloc/size:10000/0/4096/iterations:200000/threads:4",800000,2843.65,1816.27,ns,,,,,
"glibc/alloc/size:10000/0/4096/iterations:200000/threads:1",200000,696.701,696.703,ns,,,,,
"glibc/alloc/size:10000/100000/4096/iterations:200000/threads:4",800000,1234.17,1176.85,ns,,,,,
"glibc/alloc/size:10000/100000/4096/iterations:200000/threads:1",200000,729.563,729.565,ns,,,,,
"glibc/alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4",800000,815.735,765.849,ns,,,,,
"glibc/alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1",200000,175.625,175.624,ns,,,,,
"os_provider/alloc/size:10000/0/4096/iterations:200000/threads:4",800000,2205.17,2203.59,ns,,,,,
"os_provider/alloc/size:10000/0/4096/iterations:200000/threads:1",200000,195.111,195.065,ns,,,,,
"os_provider/alloc/size:10000/100000/4096/iterations:200000/threads:4",800000,1753.88,1753.24,ns,,,,,
"os_provider/alloc/size:10000/100000/4096/iterations:200000/threads:1",200000,191.296,191.29,ns,,,,,
"proxy_pool<os_provider>/alloc/size:10000/0/4096/iterations:200000/threads:4",800000,3016.85,2968.45,ns,,,,,
"proxy_pool<os_provider>/alloc/size:10000/0/4096/iterations:200000/threads:1",200000,272.267,272.258,ns,,,,,
"proxy_pool<os_provider>/alloc/size:10000/100000/4096/iterations:200000/threads:4",800000,3201.31,3152.02,ns,,,,,
"proxy_pool<os_provider>/alloc/size:10000/100000/4096/iterations:200000/threads:1",200000,319.262,319.256,ns,,,,,
"scalable_pool<os_provider>/alloc/size:10000/0/4096/iterations:200000/threads:4",800000,288.322,287.324,ns,,,,,
"scalable_pool<os_provider>/alloc/size:10000/0/4096/iterations:200000/threads:1",200000,219.311,219.306,ns,,,,,
"scalable_pool<os_provider>/alloc/size:10000/100000/4096/iterations:200000/threads:4",800000,252.841,249.68,ns,,,,,
"scalable_pool<os_provider>/alloc/size:10000/100000/4096/iterations:200000/threads:1",200000,203.539,203.536,ns,,,,,
"scalable_pool<os_provider>/alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4",800000,973.518,953.704,ns,,,,,
"scalable_pool<os_provider>/alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1",200000,975.323,975.257,ns,,,,,
"glibc/multiple_malloc_free/size:10000/4096/iterations:2000/threads:4",8000,33531,30934.2,ns,,,,,
"glibc/multiple_malloc_free/size:10000/4096/iterations:2000/threads:1",2000,4275.15,4274.97,ns,,,,,
"glibc/multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4",8000,143398,89269.9,ns,,,,,
"glibc/multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1",2000,29288.5,29288.3,ns,,,,,
"proxy_pool<os_provider>/multiple_malloc_free/size:10000/4096/iterations:2000/threads:4",8000,1.19948e+06,1.19925e+06,ns,,,,,
"proxy_pool<os_provider>/multiple_malloc_free/size:10000/4096/iterations:2000/threads:1",2000,166136,166132,ns,,,,,
"os_provider/multiple_malloc_free/size:10000/4096/iterations:2000/threads:4",8000,1.23512e+06,1.23445e+06,ns,,,,,
"os_provider/multiple_malloc_free/size:10000/4096/iterations:2000/threads:1",2000,149207,149205,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/size:10000/4096/iterations:2000/threads:4",8000,41503.8,41358.6,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/size:10000/4096/iterations:2000/threads:1",2000,15022.5,15022,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4",8000,76237.9,76215.9,ns,,,,,
"scalable_pool<os_provider>/multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1",2000,25290.5,25289.9,ns,,,,,

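The CSV above pairs each case at `threads:4` and `threads:1`, so the cost of contention can be read off as a ratio of `real_time` values. The following is a minimal sketch of that comparison using only the Python standard library. `SAMPLE` holds four rows copied from the output, and `thread_scaling` is a hypothetical helper, not part of the benchmark scripts in this PR.

```python
import csv
import io

# Header and four rows copied verbatim from the umf-benchmark CSV output.
SAMPLE = '''name,iterations,real_time,cpu_time,time_unit,bytes_per_second,items_per_second,label,error_occurred,error_message
"glibc/alloc/size:10000/0/4096/iterations:200000/threads:4",800000,2843.65,1816.27,ns,,,,,
"glibc/alloc/size:10000/0/4096/iterations:200000/threads:1",200000,696.701,696.703,ns,,,,,
"scalable_pool<os_provider>/alloc/size:10000/0/4096/iterations:200000/threads:4",800000,288.322,287.324,ns,,,,,
"scalable_pool<os_provider>/alloc/size:10000/0/4096/iterations:200000/threads:1",200000,219.311,219.306,ns,,,,,
'''


def thread_scaling(csv_text, many=4, one=1):
    """Map each case name to real_time(many threads) / real_time(one thread)."""
    times = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Split "case/threads:N" into the case name and the thread count.
        case, _, threads = row["name"].rpartition("/threads:")
        times.setdefault(case, {})[int(threads)] = float(row["real_time"])
    return {
        case: by_threads[many] / by_threads[one]
        for case, by_threads in times.items()
        if many in by_threads and one in by_threads
    }


scaling = thread_scaling(SAMPLE)
for case, ratio in scaling.items():
    print(f"{case}: {ratio:.2f}x slower per operation with 4 threads")
```

A ratio near 1 means the per-operation time barely changes with four threads; in the sampled rows the glibc case is about 4.08x and the scalable pool case about 1.31x.
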
pbalcer reviewed Jan 14, 2025

This comment was marked as outdated.

pbalcer reviewed Jan 14, 2025

This comment was marked as outdated.

mateuszpn marked this pull request as ready for review January 16, 2025 12:59

mateuszpn requested a review from a team as a code owner January 16, 2025 12:59

pbalcer reviewed Jan 16, 2025

pbalcer force-pushed the add-graph-bench branch from dda6187 to 4e5223a Compare January 20, 2025 10:00

add graph API benchmarks

3cc0249

pbalcer force-pushed the add-graph-bench branch from 4e5223a to 3cc0249 Compare January 20, 2025 11:37

pbalcer merged commit 64e8089 into oneapi-src:main Jan 20, 2025
10 of 71 checks passed

mateuszpn deleted the add-graph-bench branch February 5, 2025 12:10

==================================
Retrieving Info

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!