Milestone 3.2 Project report
The following are some of the known issues in our project from Milestone 2:
- Deployment: With our current deployment process it is impossible to upgrade the version of any microservice with zero downtime. In addition, upgrading a service does not take into account the requests it is currently handling, which leaves some requests in an incomplete state forever.
- Overloading of servers: There is a limit to the number of requests our microservices can handle (introduced manually to mimic real systems), and since these systems cannot scale horizontally without bound, the CPU can become overloaded, leading to complete failure of the system and loss of data.
- Customised content delivery: The content delivered through the service is the same for all users, and it is inconvenient to hand out different URLs to different target users in order to deliver different content. We wanted to show certain screens only to specific kinds of users.
- Latency and errors: The message broker (Kafka) in our application can sometimes take longer to initialize, and by the time everything is up and running the request may already be lost. These issues are difficult to debug, since the same symptom can also be caused by an invalid query.
Initially our goal was to tackle only the problems related to deployment and errors. Since we had additional time to contribute, we also tried to address the overloading of servers and the delivery of custom content to users.
Doubts about the problems we set out to solve were discussed during class hours; the professors helped us gain deeper insight into them and pointed us to the right technologies to explore in order to come up with solutions.
- We came to understand that many of the problems we were facing could be solved using the idea of service meshes from the class lecture: https://courses.airavata.org/slides/Spring2020-CS649-ServiceMesh.pdf.
- Since our application is already cloud native, we figured that many issues could be solved simply by using intelligent routing strategies and rollouts.
- So we explored tools such as Istio and Linkerd, which can be used to create a service mesh through sidecar injection.
- We decided to set up Istio and inject sidecar proxies.
Initial Setup
- Initially we installed Istio using Helm, but we faced issues with sidecar injection and with accessing the Kiali dashboard. Hence, we uninstalled it and used istioctl to install Istio instead.
Service Mesh Integration
We enabled sidecar injection in our system so that a sidecar (Envoy proxy) is injected into every microservice pod.
Initially we injected the sidecar manually into the services and tested it on the VM. Once we ensured that everything was working, we enabled automatic injection. We had issues with automatic sidecar injection at first due to the problems in the Istio installation.
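For reference, automatic injection is switched on per namespace; the following is a minimal sketch, assuming our microservices run in the default namespace:

```yaml
# Label the namespace so Istio's mutating webhook injects the Envoy sidecar
# into every pod created in it (assumes the services are deployed in "default").
apiVersion: v1
kind: Namespace
metadata:
  name: default
  labels:
    istio-injection: enabled
```

Pods created before the label is applied keep running without a proxy, so existing deployments have to be restarted (or injected manually with istioctl kube-inject) to pick up the sidecar.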
Canary Deployment
To achieve a canary deployment, we followed the steps below for the UI service.
- Create a gateway for the system using the mesh-gateway.yaml file. We open the gateway to all hosts; this creates the istio-ingress gateway of our system.
- Create virtual services for the microservices and connect them to the gateway.
- Create a new version of the service and upload it to Docker Hub. We changed the welcome message on the UI service home page to tell the two versions apart.
- Add one more Deployment with the new version of the UI in the ui.yaml file and attach it to the same UI Service.
- Add the destination rules file (ui-destination-rules.yaml) to route traffic to the different versions of the service (see the sketch after this list).
- Test the new version by sending it a small share of the traffic, ensure that it does not break, and gradually increase the traffic. This is configured in ui-vs-traffic.yaml.
As a prototype, we carried out all of our experimental implementations on the UI microservice of our system.
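For illustration, the gateway and destination-rule pieces look roughly like the following; this is a minimal sketch, assuming the two UI Deployments carry version: v1 and version: v2 pod labels and are fronted by a Kubernetes Service named ui:

```yaml
# mesh-gateway.yaml (sketch): ingress gateway open to all hosts
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: mesh-gateway
spec:
  selector:
    istio: ingressgateway      # run on Istio's default ingress gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"                  # accept traffic for any host
---
# ui-destination-rules.yaml (sketch): name the two UI versions as subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ui
spec:
  host: ui                     # the Kubernetes Service fronting both Deployments
  subsets:
    - name: v1
      labels:
        version: v1            # pod label of the old Deployment
    - name: v2
      labels:
        version: v2            # pod label of the new (canary) Deployment
```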
To test various traffic routing loads, the following steps were taken.
- Change different combinations of weights for the old and new service versions in the virtual service (a weighted-route sketch follows this list).
- Re-apply the ui-vs-traffic.yaml service configuration to the cluster.
- Hit the URL for the gateway and the UI service to check manually if different versions get displayed.
- Check the Kiali dashboard to see if the services were deployed with multiple versions or not.
- Check the Grafana dashboard to compare the traffic distribution in the old and new versions.
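The weight changes themselves live in the virtual service; here is a minimal sketch of what ui-vs-traffic.yaml might look like for a 90/10 split, reusing the gateway and subsets assumed in the earlier sketch:

```yaml
# ui-vs-traffic.yaml (sketch): send 90% of requests to v1 and 10% to the canary v2
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ui
spec:
  hosts:
    - "*"
  gateways:
    - mesh-gateway             # bind to the ingress gateway defined earlier
  http:
    - route:
        - destination:
            host: ui
            subset: v1
          weight: 90           # stable version
        - destination:
            host: ui
            subset: v2
          weight: 10           # canary version; raise gradually as it proves stable
```

Re-applying this file with kubectl apply shifts the split without restarting any pods.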
Troubleshooting the routing of traffic between the two versions involved the following process.
- The Kiali dashboard provided routing information through features such as traffic animation and edge labels showing the distribution of requests between the versions of the service. This helped us visualize how traffic was actually being distributed between the old and new versions of the UI.
- To narrow the problem down further, the Grafana dashboard was used to monitor the actual amount of traffic routed to each version. The service and workload dashboards chart the traffic directed to each workload, which let us check whether the number of requests reaching each version matched what was defined in the virtual service and destination rule for the UI service. Since a single manual check is not reliable (the split is not precise and contains some randomness), collecting and reading this data over time helped us monitor and assess the actual distribution.
Integrating Observability/Monitoring Tools
Grafana is a general-purpose dashboard and graph composer. It focuses on rich ways to visualize time-series metrics, mainly through graphs, but it also supports other visualizations via a pluggable panel architecture. Kiali is used for service mesh observability and configuration: it is an observability console for Istio with service mesh configuration capabilities. It helps us understand the structure of the service mesh by inferring its topology, and it also reports the health of the mesh.
We installed the demo profile of Istio, which provides Kiali, Grafana, and tracing options.
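For reference, the profile can be selected through an IstioOperator resource; this is a minimal sketch (the file and resource names are illustrative, and the exact command varies with the Istio release: istioctl install -f on newer versions, istioctl manifest apply -f on older ones):

```yaml
# demo-profile.yaml (sketch): selects Istio's demo configuration profile,
# which bundles the Kiali, Grafana, and tracing add-ons.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: demo-install
spec:
  profile: demo
```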
- Enable Kiali. To access it through the public IP, change its Service type to NodePort (a patch sketch follows this list).
- We followed similar steps for Grafana.
- Access Kiali at http://149.165.168.90:31821/kiali/ (username: admin, password: admin) and Grafana at http://149.165.168.90:30894/
- The service itself can be accessed at http://149.165.168.90:31674/
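For reference, switching a dashboard service to NodePort only requires patching its type; a minimal sketch, assuming the demo-profile defaults (a Service named kiali in the istio-system namespace):

```yaml
# nodeport-patch.yaml (sketch); apply with:
#   kubectl -n istio-system patch svc kiali --type merge -p "$(cat nodeport-patch.yaml)"
spec:
  type: NodePort   # was ClusterIP; Kubernetes assigns a node port reachable on the VM's public IP
```

The same patch applied to the grafana Service exposes Grafana in the same way.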
Kiali graph visualisation
Distributed Tracing
Istio leverages Envoy's distributed tracing feature to provide tracing integration out of the box. Specifically, Istio provides options to install various tracing backends, such as Zipkin, Jaeger, and LightStep, and configures the proxies to send trace spans to them automatically. We configured Jaeger as the tracing backend for our application to visualize request traces.
For Envoy-based tracing integrations, Envoy (the sidecar proxy) sends tracing information directly to Jaeger on behalf of the applications being proxied.
Steps:
- Generates request IDs and trace headers (e.g. X-B3-TraceId) for requests as they flow through the proxy
- Generates trace spans for each request based on request and response metadata (e.g. response time)
- Sends the generated trace spans to the tracing backend (Jaeger)
- Forwards the trace headers to the proxied application.
Link: https://istio.io/faq/distributed-tracing/
Zero downtime deployments
We were able to achieve zero-downtime manual deployment of our UI service. This was possible by following the blue-green deployment technique and adjusting the weights of the blue and green deployments in the virtual services using Istio.
Custom routing based on header information
We implemented custom content delivery using virtual services and destination rules in Istio. We achieved this by routing traffic from the gateway to different destinations based on a match against a header field of the incoming request.
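As an illustration, the match goes in the virtual service ahead of the default route; a minimal sketch, assuming we key on the User-Agent header to separate Chrome and Safari users (the header name, regex, and resource name are illustrative):

```yaml
# Sketch: route Chrome users to the v2 UI and everyone else to v1
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ui-header-routing
spec:
  hosts:
    - "*"
  gateways:
    - mesh-gateway
  http:
    - match:
        - headers:
            user-agent:
              regex: ".*Chrome.*"   # requests whose User-Agent mentions Chrome
      route:
        - destination:
            host: ui
            subset: v2
    - route:                        # default route for every other browser
        - destination:
            host: ui
            subset: v1
```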
Result from Chrome
Result from Safari
Handling Server Overloads
Handling server overloads was done by implementing circuit breaker rules in the destination rules. We limited the maximum number of incoming requests and the number of errors that can occur within a second. Once the threshold is reached, the circuit breaker kicks in and returns a 503 error, preventing our system from crashing.
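A minimal sketch of such a rule, with illustrative thresholds, attached to the UI destination rule's traffic policy:

```yaml
# Sketch: connection-pool limits plus outlier detection acting as a circuit breaker
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ui-circuit-breaker
spec:
  host: ui
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap on concurrent connections to the service
      http:
        http1MaxPendingRequests: 50  # queued requests beyond this are rejected with 503
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 5           # eject a pod after 5 consecutive 5xx errors
      interval: 1s                   # how often hosts are scanned
      baseEjectionTime: 30s          # how long an ejected pod stays out of the pool
      maxEjectionPercent: 100
```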
Circuit breaker visualisation in Kiali
Results for the circuit breaker under a JMeter test with 1000 users per second
JMeter result
Circuit breaking with 503 response
Overall, we gained a deep understanding of service mesh technology and sidecar injection, and of how they can be used to address a multitude of problems faced by an application with a cloud-native microservice architecture. We also saw how this opens up a whole new way of thinking about web applications. It was a good experience learning to build scalable, fault-tolerant systems with the help of concepts from distributed systems.
Steps to reproduce our experiments can be found here.
- Adithya Selvaprithiviraj
- Rasmitha Koduru
- Shubhangi Shrivastava