Commit cded57b

Update summaries, add longevity results
1 parent 9f26aab commit cded57b

14 files changed, +279 -0 lines changed


tests/results/dp-perf/2.2.0/2.2.0-oss.md

Lines changed: 7 additions & 0 deletions

@@ -20,6 +20,13 @@ GKE Cluster:
- Zone: us-west1-b
- Instance Type: n2d-standard-16

## Summary:

- 4 out of 5 tests showed slight latency increases, consistent with the trend noted in the 2.1.0 summary.
- The latency differences are minimal overall, with most changes under 1%.
- The POST method routing increase of ~2.2% is the most significant change, though still relatively small in absolute terms (~21µs); a quick check of this arithmetic is sketched after the list.
- All tests maintained 100% success rates with similar throughput (~1000 req/s), indicating that the slight latency variations are likely within normal performance variance.
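
The deltas quoted above are plain before/after comparisons of the per-test average latencies. A minimal Go sketch of that arithmetic; the baseline and current values are placeholders chosen to roughly reproduce the ~21µs / ~2.2% POST case, not numbers taken from the raw test output:

```go
package main

import "fmt"

// pctChange returns the absolute and relative change from a baseline
// average latency to the current one, both inputs in microseconds.
func pctChange(baselineUs, currentUs float64) (deltaUs, deltaPct float64) {
	deltaUs = currentUs - baselineUs
	deltaPct = deltaUs / baselineUs * 100
	return deltaUs, deltaPct
}

func main() {
	// Hypothetical POST-routing averages (µs): previous release vs. this run.
	deltaUs, deltaPct := pctChange(950.0, 971.0)
	fmt.Printf("POST routing: +%.1fµs (+%.2f%%)\n", deltaUs, deltaPct)
}
```
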
## Test1: Running latte path based routing

tests/results/dp-perf/2.2.0/2.2.0-plus.md

Lines changed: 10 additions & 0 deletions

@@ -20,6 +20,16 @@ GKE Cluster:
- Zone: us-west1-b
- Instance Type: n2d-standard-16

## Summary:

- Average latency increased across all tests.
- Largest Increase: Header-based routing (+76.461µs, +8.60%)
- Smallest Increase: Path-based routing (+28.988µs, +3.26%)
- Average Overall Increase: ~51.1µs (+5.69% average across all tests); this averaging is sketched after the list
- Most Impacted: Header- and query-based routing (8.60% and 5.91% respectively)
- Method Routing: GET and POST both increased by ~5.3%
- All tests maintained 100% success rate, similar throughput and similar max latencies
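
The ~51.1µs / +5.69% figure above is the arithmetic mean of the five per-test increases. A minimal Go sketch of that averaging; only the header- and path-based µs deltas are quoted in this summary, so the remaining µs values below are placeholders and the printed mean only approximates the reported numbers:

```go
package main

import "fmt"

// delta holds one test's average-latency increase relative to 2.1.0.
type delta struct {
	test string
	us   float64 // increase in microseconds
	pct  float64 // increase as a percentage of the 2.1.0 average
}

// mean returns the arithmetic mean of the µs and percentage increases.
func mean(ds []delta) (us, pct float64) {
	for _, d := range ds {
		us += d.us
		pct += d.pct
	}
	n := float64(len(ds))
	return us / n, pct / n
}

func main() {
	ds := []delta{
		{"header routing", 76.461, 8.60}, // quoted above
		{"path routing", 28.988, 3.26},   // quoted above
		{"query routing", 50.0, 5.91},    // µs value is a placeholder
		{"GET routing", 50.0, 5.3},       // µs value is a placeholder
		{"POST routing", 50.0, 5.3},      // µs value is a placeholder
	}
	us, pct := mean(ds)
	fmt.Printf("average increase: ~%.1fµs (+%.2f%%)\n", us, pct)
}
```
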
## Test1: Running latte path based routing
Lines changed: 92 additions & 0 deletions

@@ -0,0 +1,92 @@
# Results

## Test environment

NGINX Plus: false

NGINX Gateway Fabric:

- Commit: e4eed2dad213387e6493e76100d285483ccbf261
- Date: 2025-10-17T14:41:02Z
- Dirty: false

GKE Cluster:

- Node count: 3
- k8s version: v1.33.5-gke.1080000
- vCPUs per node: 2
- RAM per node: 4015668Ki
- Max pods per node: 110
- Zone: europe-west2-a
- Instance Type: e2-medium

## Summary:

- Still a lot of non-2xx or 3xx responses, but vastly improved compared to the last test run.
- This indicates that while most of the Agent-to-control-plane connection issues have been resolved, some issues remain.
- All the observed 502s happened within a single window of time, which at least indicates the system was able to recover, although it is unclear what triggered the NGINX restart.
- The increase in memory usage for NGF seen in the previous test run appears to have been resolved.
- We observe a steady increase in NGINX memory usage over time, which could indicate a memory leak.
- CPU usage remained consistent with past results.
- Errors seem to be related to a cluster upgrade or some other external factor (excluding the resolved InferencePool status error).

## Traffic

HTTP:

```text
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   202.19ms  150.51ms   2.00s    83.62%
    Req/Sec   272.67    178.26     2.59k    63.98%
  183598293 requests in 5760.00m, 62.80GB read
  Socket errors: connect 0, read 338604, write 82770, timeout 57938
  Non-2xx or 3xx responses: 33893
Requests/sec:    531.24
Transfer/sec:    190.54KB
```

HTTPS:

```text
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   189.21ms  108.25ms   2.00s    66.82%
    Req/Sec   271.64    178.03     1.96k    63.33%
  182905321 requests in 5760.00m, 61.55GB read
  Socket errors: connect 10168, read 332301, write 0, timeout 96
Requests/sec:    529.24
Transfer/sec:    186.76KB
```
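
As a sanity check on the wrk output above, the headline rates follow directly from the totals. A minimal Go sketch using the HTTP block's numbers (wrk divides by the exact elapsed time, so its own figure differs slightly):

```go
package main

import "fmt"

func main() {
	// Totals copied from the HTTP wrk output above; nothing newly measured.
	const (
		requests    = 183598293 // total requests over the run
		non2xx3xx   = 33893     // non-2xx/3xx responses reported by wrk
		durationMin = 5760.0    // 5760 minutes, i.e. 4 days
	)

	rps := float64(requests) / (durationMin * 60)
	errPct := float64(non2xx3xx) / float64(requests) * 100

	fmt.Printf("requests/sec: ~%.1f\n", rps)                 // ~531.2, in line with wrk's 531.24
	fmt.Printf("non-2xx/3xx:  %.3f%% of traffic\n", errPct)  // ~0.018%
}
```
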

## Key Metrics

### Containers memory

![oss-memory.png](oss-memory.png)

### Containers CPU

![oss-cpu.png](oss-cpu.png)

## Error Logs

### nginx-gateway

- msg: Config apply failed, rolling back config; error: error getting file data for name:"/etc/nginx/conf.d/http.conf" hash:"Luqynx2dkxqzXH21wmiV0nj5bHyGiIq7/2gOoM6aKew=" permissions:"0644" size:5430: rpc error: code = NotFound desc = file not found -> happened twice in the 4 days, related to Agent reconciliation during token rotation
- {hashFound: jmeyy1p+6W1icH2x2YGYffH1XtooWxvizqUVd+WdzQ4=, hashWanted: Luqynx2dkxqzXH21wmiV0nj5bHyGiIq7/2gOoM6aKew=, level: debug, logger: nginxUpdater.fileService, msg: File found had wrong hash, ts: 2025-10-18T18:11:24Z}
- The error indicates the Agent requested a file that had since changed.

- msg: Failed to update lock optimistically: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io ngf-longevity-nginx-gateway-fabric-leader-election), falling back to slow path -> same leader election error as on Plus; seems out of scope for our product

- msg: no matches for kind "InferencePool" in version "inference.networking.k8s.io/v1" -> Thousands of these, but fixed in PR 4104

### nginx

Traffic: nearly 34000 502s

- These all happened in the same window of less than a minute (approx 2025-10-18T18:11:11 to 2025-10-18T18:11:50) and resolved once NGINX restarted; a sketch for confirming this window from the access logs follows this list.
- It's unclear what triggered NGINX to restart, though it does appear a memory spike was observed around this time.
- The outage correlates with the config apply error seen in the control plane logs
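
One way to double-check the "single window" observation is to bucket 502 responses by minute straight from the nginx access logs. The sketch below is illustrative only and not part of the test tooling; it assumes nginx's combined log format, as seen in the access-log excerpt further down in this commit, and reads log lines from stdin:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

// Matches e.g.: 10.154.15.240 - - [18/Oct/2025:18:11:24 +0000] "GET /coffee HTTP/1.1" 502 150 "-" "-"
// Capture 1 is the timestamp truncated to the minute, capture 2 is the status code.
var accessRe = regexp.MustCompile(`\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [^\]]+\] "[^"]*" (\d{3}) `)

func main() {
	counts := map[string]int{} // "18/Oct/2025:18:11" -> number of 502s in that minute

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		m := accessRe.FindStringSubmatch(sc.Text())
		if m == nil || m[2] != "502" {
			continue
		}
		counts[m[1]]++
	}

	minutes := make([]string, 0, len(counts))
	for minute := range counts {
		minutes = append(minutes, minute)
	}
	sort.Strings(minutes)
	for _, minute := range minutes {
		fmt.Printf("%s  %d x 502\n", minute, counts[minute])
	}
}
```

Piping the nginx container's access log through it prints one line per minute that saw 502s, which, per the observation above, should collapse to the single ~18:11 window.
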
Lines changed: 96 additions & 0 deletions

@@ -0,0 +1,96 @@
# Results

## Test environment

NGINX Plus: true

NGINX Gateway Fabric:

- Commit: e4eed2dad213387e6493e76100d285483ccbf261
- Date: 2025-10-17T14:41:02Z
- Dirty: false

GKE Cluster:

- Node count: 3
- k8s version: v1.33.5-gke.1080000
- vCPUs per node: 2
- RAM per node: 4015668Ki
- Max pods per node: 110
- Zone: europe-west2-a
- Instance Type: e2-medium

## Summary:

- Total of 5 502s observed across the 4 days of the test run
- The increase in memory usage for NGF seen in the previous test run appears to have been resolved.
- We observe a steady increase in NGINX memory usage over time, which could indicate a memory leak; a rough way to quantify this trend is sketched after the list.
- CPU usage remained consistent with past results.
- Errors seem to be related to a cluster upgrade or some other external factor (excluding the resolved InferencePool status error).
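
The memory-leak suspicion above is a visual read of the memory graph; one way to put a rough number on it (outside the current tooling) is a least-squares slope over periodic memory samples. A minimal sketch with purely illustrative sample values, not data from this run:

```go
package main

import "fmt"

// slopePerHour returns the least-squares slope of memory (MiB) against time (hours).
func slopePerHour(hours, mib []float64) float64 {
	n := float64(len(hours))
	var sumX, sumY, sumXY, sumXX float64
	for i := range hours {
		sumX += hours[i]
		sumY += mib[i]
		sumXY += hours[i] * mib[i]
		sumXX += hours[i] * hours[i]
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

func main() {
	// Hypothetical 6-hourly samples of NGINX container memory over one day.
	hours := []float64{0, 6, 12, 18, 24}
	mib := []float64{20.0, 20.6, 21.1, 21.8, 22.3}

	fmt.Printf("~%.2f MiB/hour growth\n", slopePerHour(hours, mib))
}
```

A slope that stays positive across the whole run, rather than flattening out, would support the leak hypothesis.
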
## Key Metrics

### Containers memory

![plus-memory.png](plus-memory.png)

### Containers CPU

![plus-cpu.png](plus-cpu.png)

## Traffic

HTTP:

```text
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   203.71ms  108.67ms   2.00s    66.92%
    Req/Sec   257.95    167.36     1.44k    63.57%
  173901014 requests in 5760.00m, 59.64GB read
  Socket errors: connect 0, read 219, write 55133, timeout 27
  Non-2xx or 3xx responses: 4
Requests/sec:    503.19
Transfer/sec:    180.96KB
```

HTTPS:

```text
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   203.89ms  108.72ms   1.89s    66.92%
    Req/Sec   257.52    167.02     1.85k    63.64%
  173632748 requests in 5760.00m, 58.61GB read
  Socket errors: connect 7206, read 113, write 0, timeout 0
  Non-2xx or 3xx responses: 1
Requests/sec:    502.41
Transfer/sec:    177.84KB
```

## Error Logs

### nginx-gateway

msg: Failed to update lock optimistically: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io ngf-longevity-nginx-gateway-fabric-leader-election), falling back to slow path -> same leader election error as on OSS; seems out of scope for our product

msg: Get "https://34.118.224.1:443/apis/gateway.networking.k8s.io/v1beta1/referencegrants?allowWatchBookmarks=true&resourceVersion=1760806842166968999&timeout=10s&timeoutSeconds=435&watch=true": context canceled -> possible cluster upgrade?

msg: no matches for kind "InferencePool" in version "inference.networking.k8s.io/v1" -> Thousands of these, but fixed in PR 4104

### nginx

Traffic: 5 502s

```text
INFO 2025-10-19T00:12:04.220541710Z [resource.labels.containerName: nginx] 10.154.15.240 - - [19/Oct/2025:00:12:04 +0000] "GET /coffee HTTP/1.1" 502 150 "-" "-"
INFO 2025-10-19T18:38:18.651520548Z [resource.labels.containerName: nginx] 10.154.15.240 - - [19/Oct/2025:18:38:18 +0000] "GET /coffee HTTP/1.1" 502 150 "-" "-"
INFO 2025-10-20T21:49:05.008076073Z [resource.labels.containerName: nginx] 10.154.15.240 - - [20/Oct/2025:21:49:04 +0000] "GET /tea HTTP/1.1" 502 150 "-" "-"
INFO 2025-10-21T06:43:10.256327990Z [resource.labels.containerName: nginx] 10.154.15.240 - - [21/Oct/2025:06:43:10 +0000] "GET /coffee HTTP/1.1" 502 150 "-" "-"
INFO 2025-10-21T12:13:05.747098022Z [resource.labels.containerName: nginx] 10.154.15.240 - - [21/Oct/2025:12:13:05 +0000] "GET /coffee HTTP/1.1" 502 150 "-" "-"
```

No other errors identified in this test run.

Binary image files changed (136 KB, 163 KB, 135 KB, 124 KB); contents not shown.

tests/results/ngf-upgrade/2.2.0/2.2.0-oss.md

Lines changed: 9 additions & 0 deletions

@@ -20,6 +20,15 @@ GKE Cluster:
- Zone: us-west1-b
- Instance Type: n2d-standard-16

## Summary:

- The 2.2.0 release shows massive improvements in upgrade behavior:
- 2.1.0 Issue: The summary noted significant downtime during upgrades, with a manual uninstall/reinstall workaround recommended
- 2.2.0 Fix: The new readiness probe (mentioned in the 2.1.0 summary as a planned fix) appears to have successfully resolved the upgrade downtime issue
- Remaining Failures: The 11 connection refused errors in 2.2.0 (a 0.18% failure rate) likely represent the minimal unavoidable disruption during pod replacement; the rate arithmetic is sketched after this list
- A 99.82% success rate during the live upgrade is a production-acceptable result
- System maintains near-normal throughput during upgrades
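
The 0.18% / 99.82% split above is simple rate arithmetic over the traffic sent during the upgrade window. A minimal Go sketch; the total request count below is a placeholder (not a reported figure), picked only so that 11 failures reproduce the quoted percentages:

```go
package main

import "fmt"

// rates returns the failure and success percentages for a traffic sample.
func rates(failed, total int) (failurePct, successPct float64) {
	failurePct = float64(failed) / float64(total) * 100
	return failurePct, 100 - failurePct
}

func main() {
	const (
		failedRequests = 11   // connection refused errors during the upgrade
		totalRequests  = 6100 // placeholder total for the upgrade traffic window
	)

	fail, ok := rates(failedRequests, totalRequests)
	fmt.Printf("failure: %.2f%%  success: %.2f%%\n", fail, ok)
}
```
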
## Test: Send http /coffee traffic

tests/results/ngf-upgrade/2.2.0/2.2.0-plus.md

Lines changed: 9 additions & 0 deletions

@@ -20,6 +20,15 @@ GKE Cluster:
- Zone: us-west1-b
- Instance Type: n2d-standard-16

## Summary:

- The 2.2.0 release shows massive improvements in upgrade behavior:
- 2.1.0 Issue: The summary noted significant downtime during upgrades, with a manual uninstall/reinstall workaround recommended
- 2.2.0 Fix: The new readiness probe (mentioned in the 2.1.0 summary as a planned fix) appears to have successfully resolved the upgrade downtime issue
- Remaining Failures: The 19 connection refused errors in 2.2.0 (a 0.32% failure rate) likely represent the minimal unavoidable disruption during pod replacement
- A 99.68% success rate during the live upgrade is a production-acceptable result
- System maintains near-normal throughput during upgrades

## Test: Send http /coffee traffic
