
Network Resiliency - Timeouts, Retries and Circuit Breaking

This sub-module covers the network resiliency features of the Istio service mesh on Amazon EKS, namely Timeouts, Retries and Circuit Breaking, and shows how to test each of them.

Use the following links to quickly jump to the desired section:

  1. Timeouts
  2. Retries
  3. Circuit Breaking

Timeouts

To test the timeout functionality we will make the following two changes:

  1. Add a delay of 5 seconds to the catalogdetail VirtualService
  2. Add a timeout of 2 seconds to the productcatalog VirtualService

Since the productcatalog service calls the catalogdetail service, and catalogdetail will now take about 5 seconds to respond, the 2-second timeout on productcatalog will be triggered.
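For reference, the two changes look roughly like the minimal sketch below. The hosts and durations match the description above; everything else is simplified, and the exact manifests under the ./timeouts/ folder may differ (for example, by routing to specific version subsets).

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalogdetail
  namespace: workshop
spec:
  hosts:
  - catalogdetail
  http:
  - fault:
      delay:
        percentage:
          value: 100   # inject the delay into every request
        fixedDelay: 5s # catalogdetail responses are delayed by ~5 seconds
    route:
    - destination:
        host: catalogdetail
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: productcatalog
  namespace: workshop
spec:
  hosts:
  - productcatalog
  http:
  - route:
    - destination:
        host: productcatalog
    timeout: 2s        # calls through this route are cut off after 2 seconds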

Testing

Testing Delays

Apply the delay configuration to the catalogdetail VirtualService

# This assumes that you are currently in "istio-on-eks/modules/03-network-resiliency" folder
cd timeouts-retries-circuitbreaking

kubectl apply -f ./timeouts/catalogdetail-virtualservice.yaml

Test the delay by running a curl command against the productcatalog service from within the mesh.

# Set the FE_POD_NAME variable to the name of the frontend pod in the workshop namespace
export FE_POD_NAME=$(kubectl get pods -n workshop -l app=frontend -o jsonpath='{.items[].metadata.name}')

# Access the frontend container in the workshop namespace interactively

kubectl exec -it ${FE_POD_NAME} -n workshop -c frontend -- bash

# From the shell prompt inside the frontend container, run:
root@frontend-container-id:/app#

curl http://productcatalog:5000/products/ -s -o /dev/null -w "Time taken to start transfer: %{time_starttransfer}\n"

If the delay configuration is applied correctly, the output should be similar to:

Time taken to start transfer: 5.024133

Testing Timeouts

Apply the timeout configuration to the productcatalog VirtualService

# This assumes that you are currently in "istio-on-eks/modules/03-network-resiliency/timeouts-retries-circuitbreaking" folder

kubectl apply -f ./timeouts/productcatalog-virtualservice.yaml

Test the timeout by running a curl command against the productcatalog service from within the mesh.

# Access the frontend container in the workshop namespace interactively

kubectl exec -it ${FE_POD_NAME} -n workshop -c frontend -- bash
# From the shell prompt inside the frontend container, run:
root@frontend-container-id:/app#

curl http://productcatalog:5000/products/ -s -o /dev/null -w "Time taken to start transfer: %{time_starttransfer}\n"

Output should be similar to:

Time taken to start transfer: 2.006172

Reset the environment

Run the same set of steps as in the Initial state setup to reset the environment.

Retries

To test the retries functionality we will make the following changes to the productcatalog VirtualService:

  1. Add a retry configuration with 2 attempts to the productcatalog VirtualService (see the sketch after this list)
  2. Edit the productcatalog deployment so that its container does nothing but sleep for 1 hour. To achieve this we make the following changes to the deployment:
    • Change the readinessProbe to run the simple command echo hello. Since the
      command always succeeds, the container is immediately marked as ready.
    • Change the livenessProbe to run the simple command echo hello. Since the
      command always succeeds, the container is immediately marked as live.
    • Add a command to the container that causes the main process to sleep for 1 hour
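
For reference, the retry policy added to the productcatalog VirtualService looks roughly like the minimal sketch below. The attempts value matches the description above; perTryTimeout and retryOn are illustrative assumptions, so check the manifest in the ./retries/ folder for the exact settings.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: productcatalog
  namespace: workshop
spec:
  hosts:
  - productcatalog
  http:
  - route:
    - destination:
        host: productcatalog
    retries:
      attempts: 2        # retries performed after the initial request
      perTryTimeout: 1s  # assumed per-attempt timeout, for illustration
      retryOn: gateway-error,connect-failure,refused-stream  # assumed retry conditions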

To apply these changes, run the following command:

# This assumes that you are currently in "istio-on-eks/modules/03-network-resiliency/timeouts-retries-circuitbreaking" folder

kubectl apply -f ./retries/

kubectl get deployment -n workshop productcatalog -o json |
jq '.spec.template.spec.containers[0].readinessProbe={exec:{command:["sh","-c","echo hello"]}}
| .spec.template.spec.containers[0].livenessProbe={exec:{command:["sh","-c","echo hello"]}}
| .spec.template.spec.containers[0]+={command:["sh","-c","sleep 1h"]}' |
kubectl apply --force=true -f -

In another (2nd) terminal, enable debug-level Envoy logging for the productcatalog service with the command below. (Note: this assumes that you are currently in the "istio-on-eks/modules/03-network-resiliency/timeouts-retries-circuitbreaking" folder and that istioctl is installed in the 2nd terminal as well.)

istioctl pc log --level debug -n workshop deploy/productcatalog

Testing Retries

To see the retry functionality in the logs, go back to the original (1st) terminal and run the command below

kubectl -n workshop logs -l app=productcatalog -c istio-proxy -f | 
grep "x-envoy-attempt-count"

Then, from the other (2nd) terminal, run a curl against the productcatalog service from within the mesh using the commands below

export FE_POD_NAME=$(kubectl get pods -n workshop -l app=frontend -o jsonpath='{.items[].metadata.name}')
kubectl exec -it ${FE_POD_NAME} -n workshop -c frontend -- bash
curl http://productcatalog:5000/products/ -s -o /dev/null

You should now see the retry attempts in the logs in the original (1st) terminal, as below:

'x-envoy-attempt-count', '1'
'x-envoy-attempt-count', '1'
'x-envoy-attempt-count', '2'
'x-envoy-attempt-count', '2'
'x-envoy-attempt-count', '3'
'x-envoy-attempt-count', '3'

We see an attempt count of 3 because, in addition to the 2 configured retries, it also includes the very first attempt at connecting to the service.

Reset the environment

Run the same set of steps as in the Initial state setup to reset the environment.

Circuit Breaking

To test the circuit-breaker functionality we will make the following change:

  1. Modify the existing catalogdetail DestinationRule to apply a circuit-breaking configuration (see the sketch after the command below)

# This assumes that you are currently in "istio-on-eks/modules/03-network-resiliency/timeouts-retries-circuitbreaking" folder

kubectl apply -f ./circuitbreaking/
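
For reference, the circuit-breaking settings applied to the catalogdetail DestinationRule look roughly like the minimal sketch below. All threshold values are illustrative assumptions; the exact manifest (including any version subsets) is in the ./circuitbreaking/ folder.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: catalogdetail
  namespace: workshop
spec:
  host: catalogdetail
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1           # allow at most one TCP connection to the backend
      http:
        http1MaxPendingRequests: 1  # allow at most one queued (pending) request
        maxRequestsPerConnection: 1 # force a new connection for every request
    outlierDetection:
      consecutive5xxErrors: 1       # eject an endpoint after a single 5xx response
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100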

Testing

To test the circuit breaker feature we will use a load-testing application called fortio, which we will run as a pod with a single container based on the fortio image.

Run the command below to create a fortio pod in the workshop namespace:

kubectl run fortio --image=fortio/fortio:latest_release -n workshop --annotations='proxy.istio.io/config=proxyStatsMatcher:
  inclusionPrefixes:
  - "cluster.outbound"
  - "cluster_manager"
  - "listener_manager"
  - "server"
  - "cluster.xds-grpc"'

Now, from within the fortio pod, send a single curl request to the catalogdetail service:

kubectl exec fortio -n workshop -c fortio -- /usr/bin/fortio \
curl http://catalogdetail.workshop.svc.cluster.local:3000/catalogDetail

You should see that the request succeeds, as below:

{"ts":1704746733.172561,"level":"info","r":1,"file":"scli.go","line":123,"msg":"Starting","command":"Φορτίο","version":"1.60.3 h1:adR0uf/69M5xxKaMLAautVf9FIVkEpMwuEWyMaaSnI0= go1.20.10 amd64 linux"}
HTTP/1.1 200 OK
x-powered-by: Express
content-type: application/json; charset=utf-8
content-length: 37
etag: W/"25-+DP7kANx3olb0HJqt5zDWgaO2Gg"
date: Mon, 08 Jan 2024 20:45:33 GMT
x-envoy-upstream-service-time: 8
server: envoy

{"version":"1","vendors":["ABC.com"]}%    

Tripping the circuit breaker

We can start testing the circuit breaking functionality by generating traffic to the catalogdetail service with two concurrent connections (-c 2) and by sending a total of 20 requests (-n 20):

kubectl exec fortio -n workshop -c fortio -- \
/usr/bin/fortio load -c 2 -qps 0 -n 20 -loglevel Warning \
http://catalogdetail.workshop.svc.cluster.local:3000/catalogDetail

Output should be similar to below:

{"ts":1704746942.859803,"level":"info","r":1,"file":"logger.go","line":254,"msg":"Log level is now 3 Warning (was 2 Info)"}
Fortio 1.60.3 running at 0 queries per second, 2->2 procs, for 20 calls: http://catalogdetail.workshop.svc.cluster.local:3000/catalogDetail
Starting at max qps with 2 thread(s) [gomax 2] for exactly 20 calls (10 per thread + 0)
{"ts":1704746942.889725,"level":"warn","r":13,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704746942.895346,"level":"warn","r":12,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704746942.912859,"level":"warn","r":13,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704746942.918274,"level":"warn","r":12,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704746942.927350,"level":"warn","r":13,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
Ended after 65.360704ms : 20 calls. qps=305.99
Aggregated Function Time : count 20 avg 0.0063910061 +/- 0.003646 min 0.000391494 max 0.013544634 sum 0.127820122
# range, mid point, percentile, count
>= 0.000391494 <= 0.001 , 0.000695747 , 15.00, 3
> 0.003 <= 0.004 , 0.0035 , 25.00, 2
> 0.004 <= 0.005 , 0.0045 , 30.00, 1
> 0.005 <= 0.006 , 0.0055 , 45.00, 3
> 0.006 <= 0.007 , 0.0065 , 65.00, 4
> 0.007 <= 0.008 , 0.0075 , 70.00, 1
> 0.008 <= 0.009 , 0.0085 , 75.00, 1
> 0.009 <= 0.01 , 0.0095 , 85.00, 2
> 0.011 <= 0.012 , 0.0115 , 90.00, 1
> 0.012 <= 0.0135446 , 0.0127723 , 100.00, 2
# target 50% 0.00625
# target 75% 0.009
# target 90% 0.012
# target 99% 0.0133902
# target 99.9% 0.0135292
Error cases : count 5 avg 0.001746463 +/- 0.001537 min 0.000391494 max 0.003912496 sum 0.008732315
# range, mid point, percentile, count
>= 0.000391494 <= 0.001 , 0.000695747 , 60.00, 3
> 0.003 <= 0.0039125 , 0.00345625 , 100.00, 2
# target 50% 0.000847874
# target 75% 0.00334219
# target 90% 0.00368437
# target 99% 0.00388968
# target 99.9% 0.00391021
# Socket and IP used for each connection:
[0]   3 socket used, resolved to 172.20.233.48:3000, connection timing : count 3 avg 8.8641667e-05 +/- 1.719e-05 min 6.4519e-05 max 0.000103303 sum 0.000265925
[1]   3 socket used, resolved to 172.20.233.48:3000, connection timing : count 3 avg 0.000145408 +/- 7.262e-05 min 8.4711e-05 max 0.000247501 sum 0.000436224
Connection time (s) : count 6 avg 0.00011702483 +/- 5.992e-05 min 6.4519e-05 max 0.000247501 sum 0.000702149
Sockets used: 6 (for perfect keepalive, would be 2)
Uniform: false, Jitter: false, Catchup allowed: true
IP addresses distribution:
172.20.233.48:3000: 6
Code 200 : 15 (75.0 %)
Code 503 : 5 (25.0 %)
Response Header Sizes : count 20 avg 177.75 +/- 102.6 min 0 max 237 sum 3555
Response Body/Total Sizes : count 20 avg 268.9 +/- 16.57 min 241 max 283 sum 5378
All done 20 calls (plus 0 warmup) 6.391 ms avg, 306.0 qps

We can see that most requests succeeded, except for a few. The istio-proxy does allow for some leeway.

Code 200 : 15 (75.0 %)
Code 503 : 5 (25.0 %)

Now rerun the same command, increasing the number of concurrent connections to 3 (-c 3) and the number of calls to 30 (-n 30):

kubectl exec fortio -n workshop -c fortio -- \
/usr/bin/fortio load -c 3 -qps 0 -n 30 -loglevel Warning \
http://catalogdetail.workshop.svc.cluster.local:3000/catalogDetail

Output should be similar to below:

{"ts":1704747392.512409,"level":"info","r":1,"file":"logger.go","line":254,"msg":"Log level is now 3 Warning (was 2 Info)"}
Fortio 1.60.3 running at 0 queries per second, 2->2 procs, for 30 calls: http://catalogdetail.workshop.svc.cluster.local:3000/catalogDetail
Starting at max qps with 3 thread(s) [gomax 2] for exactly 30 calls (10 per thread + 0)
{"ts":1704747392.520178,"level":"warn","r":28,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704747392.520799,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.523468,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.525391,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.527484,"level":"warn","r":27,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704747392.528848,"level":"warn","r":27,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704747392.531731,"level":"warn","r":28,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704747392.533632,"level":"warn","r":28,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704747392.534727,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.536192,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.538116,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.539070,"level":"warn","r":27,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704747392.541580,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.543714,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.545742,"level":"warn","r":29,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704747392.550258,"level":"warn","r":27,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704747392.554057,"level":"warn","r":28,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704747392.560066,"level":"warn","r":28,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
Ended after 50.285039ms : 30 calls. qps=596.6
Aggregated Function Time : count 30 avg 0.0039136975 +/- 0.003494 min 0.00032006 max 0.011210633 sum 0.117410925
# range, mid point, percentile, count
>= 0.00032006 <= 0.001 , 0.00066003 , 26.67, 8
> 0.001 <= 0.002 , 0.0015 , 40.00, 4
> 0.002 <= 0.003 , 0.0025 , 56.67, 5
> 0.003 <= 0.004 , 0.0035 , 60.00, 1
> 0.004 <= 0.005 , 0.0045 , 63.33, 1
> 0.005 <= 0.006 , 0.0055 , 76.67, 4
> 0.007 <= 0.008 , 0.0075 , 80.00, 1
> 0.008 <= 0.009 , 0.0085 , 86.67, 2
> 0.009 <= 0.01 , 0.0095 , 93.33, 2
> 0.01 <= 0.011 , 0.0105 , 96.67, 1
> 0.011 <= 0.0112106 , 0.0111053 , 100.00, 1
# target 50% 0.0026
# target 75% 0.005875
# target 90% 0.0095
# target 99% 0.0111474
# target 99.9% 0.0112043
Error cases : count 18 avg 0.001365681 +/- 0.0008843 min 0.00032006 max 0.003261174 sum 0.024582258
# range, mid point, percentile, count
>= 0.00032006 <= 0.001 , 0.00066003 , 44.44, 8
> 0.001 <= 0.002 , 0.0015 , 66.67, 4
> 0.002 <= 0.003 , 0.0025 , 94.44, 5
> 0.003 <= 0.00326117 , 0.00313059 , 100.00, 1
# target 50% 0.00125
# target 75% 0.0023
# target 90% 0.00284
# target 99% 0.00321416
# target 99.9% 0.00325647
# Socket and IP used for each connection:
[0]   5 socket used, resolved to 172.20.233.48:3000, connection timing : count 5 avg 0.0001612534 +/- 8.84e-05 min 8.515e-05 max 0.000330062 sum 0.000806267
[1]   5 socket used, resolved to 172.20.233.48:3000, connection timing : count 5 avg 0.000214197 +/- 0.0002477 min 5.6614e-05 max 0.00070467 sum 0.001070985
[2]   9 socket used, resolved to 172.20.233.48:3000, connection timing : count 9 avg 0.00011278589 +/- 5.424e-05 min 6.4739e-05 max 0.000243749 sum 0.001015073
Connection time (s) : count 19 avg 0.00015222763 +/- 0.0001462 min 5.6614e-05 max 0.00070467 sum 0.002892325
Sockets used: 19 (for perfect keepalive, would be 3)
Uniform: false, Jitter: false, Catchup allowed: true
IP addresses distribution:
172.20.233.48:3000: 19
Code 200 : 12 (40.0 %)
Code 503 : 18 (60.0 %)
Response Header Sizes : count 30 avg 94.8 +/- 116.1 min 0 max 237 sum 2844
Response Body/Total Sizes : count 30 avg 256 +/- 18.59 min 241 max 283 sum 7680
All done 30 calls (plus 0 warmup) 3.914 ms avg, 596.6 qps

As we increase the traffic towards the catalogdetail microservice, the circuit-breaking functionality starts to kick in. This time only 40% of the requests succeeded, while the remaining 60% were, as expected, trapped by the circuit breaker.

Code 200 : 12 (40.0 %)
Code 503 : 18 (60.0 %)

Now, query the istio-proxy to see the stats of the requests flagged for circuit breaking.

kubectl exec fortio -n workshop -c istio-proxy -- pilot-agent request GET stats | grep catalogdetail | grep pending

The output should be similar to:

cluster.outbound|3000|v1|catalogdetail.workshop.svc.cluster.local.circuit_breakers.default.remaining_pending: 1
cluster.outbound|3000|v1|catalogdetail.workshop.svc.cluster.local.circuit_breakers.default.rq_pending_open: 0
cluster.outbound|3000|v1|catalogdetail.workshop.svc.cluster.local.circuit_breakers.high.rq_pending_open: 0
cluster.outbound|3000|v1|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|3000|v1|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|3000|v1|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_overflow: 0
cluster.outbound|3000|v1|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_total: 0
cluster.outbound|3000|v2|catalogdetail.workshop.svc.cluster.local.circuit_breakers.default.remaining_pending: 1
cluster.outbound|3000|v2|catalogdetail.workshop.svc.cluster.local.circuit_breakers.default.rq_pending_open: 0
cluster.outbound|3000|v2|catalogdetail.workshop.svc.cluster.local.circuit_breakers.high.rq_pending_open: 0
cluster.outbound|3000|v2|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|3000|v2|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|3000|v2|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_overflow: 0
cluster.outbound|3000|v2|catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_total: 0
cluster.outbound|3000||catalogdetail.workshop.svc.cluster.local.circuit_breakers.default.remaining_pending: 1
cluster.outbound|3000||catalogdetail.workshop.svc.cluster.local.circuit_breakers.default.rq_pending_open: 0
cluster.outbound|3000||catalogdetail.workshop.svc.cluster.local.circuit_breakers.high.rq_pending_open: 0
cluster.outbound|3000||catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|3000||catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|3000||catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_overflow: 17
cluster.outbound|3000||catalogdetail.workshop.svc.cluster.local.upstream_rq_pending_total: 34

As can be seen in the output above, upstream_rq_pending_overflow has a value of 17, which means 17 calls so far have been flagged for circuit breaking, confirming that our circuit-breaker configuration on the catalogdetail DestinationRule worked.

Reset the environment

Delete the fortio pod using the following command and then run the same steps as in the Initial state setup to reset the environment one last time.

kubectl delete pod fortio -n workshop