Instrument RustyVault with Prometheus #76

Open: wants to merge 9 commits into base: main


@cybershang cybershang commented Sep 17, 2024

Instrument RustyVault with Prometheus

Design

To monitor the system's performance effectively, I applied both the USE and RED methods for metrics collection in RustyVault.

  • USE Method (Utilization, Saturation, Errors):

    Track resource utilization and detect bottlenecks. Metrics related to system resources have been added to ensure the system's health is continuously monitored:

    • CPU Utilization: Measures the percentage of CPU usage by the RustyVault service.

    • Memory Utilization: Tracks memory usage, including total, free, and cached memory.

    • Disk I/O Saturation: Monitors disk read/write speed and detects potential bottlenecks.

    • Network I/O Saturation: Tracks the amount of data sent and received.

  • RED Method (Rate, Errors, Duration):

    Track the behavior of requests within the application:

    • Rate: We implemented requests_total to track the rate of requests coming into the system. This allows us to monitor the overall throughput.

    • Errors: The errors_total counter tracks the number of failed requests and helps monitor the system's error rate.

    • Duration: Using request_duration_seconds, we measure the time taken to process each request, enabling us to analyze latency and potential performance issues.
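As a rough illustration of how the three RED signals fit together, here is a minimal std-only sketch. It is not the code from this PR: RedStats, record, error_ratio, cumulative, and the bucket bounds are hypothetical names chosen for the example, and counting every status of 400 or above as an error is an assumption. The fields mirror requests_total, errors_total, and request_duration_seconds.

```rust
/// Std-only stand-in for the three RED signals (hypothetical type, not from the PR).
pub struct RedStats {
    /// Mirrors `requests_total`.
    pub requests_total: u64,
    /// Mirrors `errors_total`.
    pub errors_total: u64,
    /// Upper bounds (seconds) of the duration buckets, in ascending order.
    bounds: Vec<f64>,
    /// Per-bucket observation counts; the last slot is the implicit +Inf bucket.
    buckets: Vec<u64>,
    /// Sum of all observed durations, mirroring a histogram's `_sum` series.
    pub duration_sum: f64,
}

impl RedStats {
    pub fn new(bounds: Vec<f64>) -> Self {
        let buckets = vec![0; bounds.len() + 1];
        Self { requests_total: 0, errors_total: 0, bounds, buckets, duration_sum: 0.0 }
    }

    /// Record one finished request.
    /// Assumption: any status >= 400 counts as an error.
    pub fn record(&mut self, status: u16, seconds: f64) {
        self.requests_total += 1;
        if status >= 400 {
            self.errors_total += 1;
        }
        // First bucket whose upper bound covers the observation, else +Inf.
        let idx = self
            .bounds
            .iter()
            .position(|b| seconds <= *b)
            .unwrap_or(self.bounds.len());
        self.buckets[idx] += 1;
        self.duration_sum += seconds;
    }

    /// Fraction of requests that failed; 0.0 when nothing has been recorded.
    pub fn error_ratio(&self) -> f64 {
        if self.requests_total == 0 {
            0.0
        } else {
            self.errors_total as f64 / self.requests_total as f64
        }
    }

    /// Cumulative count of observations at or below bucket `i`,
    /// the way Prometheus exposes `le` buckets.
    pub fn cumulative(&self, i: usize) -> u64 {
        self.buckets[..=i].iter().sum()
    }
}
```

For example, after recording durations of 0.05 s, 0.3 s, and 2.0 s against bounds of 0.1, 0.5, and 1.0, the cumulative counts for the three bounds are 1, 2, and 2, and the third request is only visible in the implicit +Inf bucket.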

Implemented Metrics

  • System Metrics
    • CPU
      • cpu_usage_percent: <Gauge, AtomicU64>
    • Memory
      • total_memory: <Gauge, AtomicU64>
      • used_memory: <Gauge, AtomicU64>
      • free_memory: <Gauge, AtomicU64>
    • Disk
      • total_disk_space: <Gauge, AtomicU64>
      • total_disk_available: <Gauge, AtomicU64>
    • Network
      • network_in_bytes: <Gauge, AtomicU64>
      • network_out_bytes: <Gauge, AtomicU64>
    • Load
      • load_average:
  • HTTP Request Metrics
    • struct HttpLabel {path:String, method:MetricsMethod, status:u16}
    • http_request_count: Family<HttpLabel, Counter>
    • http_request_duration_seconds: Family<HttpLabel, Histogram>
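A Family keyed by HttpLabel keeps one time series per distinct combination of label values. The idea can be sketched with a plain HashMap. This is a std-only stand-in rather than the prometheus-client API: CounterFamily, inc, and series_count are hypothetical names, and the enum lists only the methods mentioned in this PR description.

```rust
use std::collections::HashMap;

/// HTTP methods tracked by the metrics; everything else is folded into `OTHER`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum MetricsMethod {
    GET,
    POST,
    PUT,
    DELETE,
    OTHER,
}

/// Same fields as the `HttpLabel` struct listed above.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct HttpLabel {
    pub path: String,
    pub method: MetricsMethod,
    pub status: u16,
}

/// Hypothetical stand-in for `Family<HttpLabel, Counter>`.
#[derive(Default)]
pub struct CounterFamily {
    series: HashMap<HttpLabel, u64>,
}

impl CounterFamily {
    /// Increment the counter for `label`, creating the series on first use,
    /// and return the new value.
    pub fn inc(&mut self, label: HttpLabel) -> u64 {
        let value = self.series.entry(label).or_insert(0);
        *value += 1;
        *value
    }

    /// Number of distinct label sets seen so far (the series cardinality).
    pub fn series_count(&self) -> usize {
        self.series.len()
    }
}
```

One consequence worth keeping in mind: because path is part of the label set, every distinct path and status combination creates a new series, so labelling with raw request paths can make the number of series grow with the number of paths requested.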

Changes

  1. Dependencies Imported:
    • prometheus-client = "0.22.3"
    • tokio = "1.40.0"
    • sysinfo = "0.31.4"
  2. MetricsManager Implementation:
    • Implemented MetricsManager in manager.rs to store the Prometheus Registry, system metrics (system_metrics), and HTTP API metrics (http_metrics).
    • Integrated metrics_manager into the server in src/cli/command/server.rs by inserting it into app_data.
  3. Implemented metrics_handler:
    • Implemented init_metrics_service in metrics.rs, which sets up the /metrics service by configuring a route in the ServiceConfig. It associates the /metrics route with metrics_handler to handle GET requests and respond with Prometheus metrics in text format.
  4. System Metrics Collection:
    • Implemented SystemMetrics struct in system_metrics.rs to gather CPU, memory, load, and disk metrics using the sysinfo crate.
    • Added collect_metrics function to collect and store system information.
    • Launched the start_collecting method in server.block_on to periodically collect system metrics.
  5. HTTP Middleware:
    • Implemented MetricsMiddleware in middleware.rs as a function middleware to capture HTTP request metrics.
    • Configured the HTTP server in src/cli/command/server.rs to apply the middleware using .wrap(from_fn(metrics_middleware)).
    • Transformed Actix-web's HTTP methods into a custom MetricsMethod enum, tracking GET, POST, PUT, DELETE, and categorizing others as OTHER.
    • Recorded request duration by logging start and end timestamps for each request.
  6. HTTP Metrics:
    • Created HttpMetrics struct in http_metrics.rs to handle HTTP request counting and duration observation.
    • Registered two Prometheus metrics: a counter for requests and a histogram for request durations.
    • Added methods increment_request_count and observe_duration for tracking requests and their durations, labeled by HTTP method and path.
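The timing step of the middleware can be sketched as follows. This is a simplified, synchronous std-only illustration: the real middleware is asynchronous and wraps the Actix-web service call, while Observation, observe, and as_seconds are hypothetical names, and the handler is modelled as a closure that returns a status code.

```rust
use std::time::{Duration, Instant};

/// What the middleware records for one request (hypothetical type).
#[derive(Debug)]
pub struct Observation {
    pub status: u16,
    pub elapsed: Duration,
}

/// Run `handler`, measure its wall-clock duration with a monotonic clock,
/// and return the handler's status code together with the elapsed time.
pub fn observe<F>(handler: F) -> Observation
where
    F: FnOnce() -> u16,
{
    let start = Instant::now();
    let status = handler();
    let elapsed = start.elapsed();
    Observation { status, elapsed }
}

/// Convert the elapsed time to seconds, the unit a
/// `*_duration_seconds` histogram expects.
pub fn as_seconds(elapsed: Duration) -> f64 {
    elapsed.as_secs_f64()
}
```

Instant is used instead of a wall-clock timestamp because it is monotonic, so a system clock adjustment during a request cannot produce a negative or inflated duration. The status is read after the handler returns, which is what allows it to be used as a label alongside path and method.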

Testing Steps

  1. Start RustyVault Service:
    • Ensure that Prometheus integration is enabled in the configuration.
  2. Access Metrics Endpoint:
    • Open a browser or use curl to visit http://localhost:<PORT>/metrics.
    • Verify that Prometheus metrics are correctly displayed.
  3. Trigger Various Requests:
    • Successful Requests:
      • Send valid requests to endpoints like /login and /register.
      • Confirm that requests_total and request_duration_seconds increment appropriately.
    • Failed Requests:
      • Send invalid or malformed requests to induce errors.
      • Check that errors_total increments accordingly.
  4. Integrate with Prometheus Server:
    • Add RustyVault's /metrics endpoint to the Prometheus configuration.
  5. Using Grafana Dashboard:
    • Use a Grafana dashboard to visualize the collected metrics and demonstrate the data.

image
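The checks in steps 2 and 3 can also be scripted instead of read by eye. The helper below is a minimal sketch that looks up one sample in the text returned by /metrics. sample_value is a hypothetical name, and the parsing assumes simple sample lines of the form `<series> <value>` with no trailing timestamp and no spaces inside label values. Depending on the encoder, counters may be exposed with a _total suffix, so the exact series name should be taken from the real output.

```rust
/// Find the value of the first sample whose series name is `name`,
/// with or without a `{...}` label block, skipping comment lines
/// such as `# HELP`, `# TYPE`, and `# EOF`.
pub fn sample_value(exposition: &str, name: &str) -> Option<f64> {
    exposition
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .find_map(|line| {
            // A sample line is `<series> <value>`; under the stated
            // assumptions, splitting on the last space separates the two.
            let (series, value) = line.rsplit_once(' ')?;
            if series == name || series.starts_with(&format!("{name}{{")) {
                value.parse::<f64>().ok()
            } else {
                None
            }
        })
}
```

A test script could fetch /metrics before and after sending a request, call this helper on both bodies, and assert that the counter value increased.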

@CLAassistant

CLAassistant commented Sep 17, 2024

CLA assistant check
All committers have signed the CLA.

src/cli/command/server.rs (outdated, resolved)
src/cli/command/server.rs (outdated, resolved)
src/http/metrics.rs (outdated, resolved)
use prometheus_client::metrics::counter::Counter;
use prometheus_client::metrics::family::Family;
use prometheus_client::metrics::histogram::{linear_buckets, Histogram};
use prometheus_client::registry::Registry;
Collaborator

This file should also be formatted with cargo fmt.

Author

Formatted with cargo fmt.

src/metrics/http_metrics.rs (resolved)
}

pub async fn start_collecting(self: Arc<Self>) {
    let mut interval = time::interval(Duration::from_secs(5));
Collaborator

Can the collection interval be set in the configuration file?

Author

@cybershang cybershang Sep 18, 2024

@wa5i

Decision item: Use a single interval for all system metrics?

The current sysinfo release provides separate refresh calls only for CPU, memory, and process metrics, so network and disk metrics cannot be refreshed individually.

Decision: Interval configuration is supported through the configuration file, but currently limited to a single interval.

    Method::GET => MetricsMethod::GET,
    Method::POST => MetricsMethod::POST,
    Method::PUT => MetricsMethod::PUT,
    Method::DELETE => MetricsMethod::DELETE,
Collaborator

Here, LIST is missing.

Author

LIST added.
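LIST is not one of the standard HTTP methods, so one way to handle it is to match on the method name. The sketch below is an assumption about how the mapping could look after this change, written against a plain string so it stays std-only. The PR itself matches on Actix-web's Method type, and from_name is a hypothetical helper.

```rust
/// Label values for the HTTP method, with `LIST` added as requested in review.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum MetricsMethod {
    GET,
    POST,
    PUT,
    DELETE,
    LIST,
    OTHER,
}

impl MetricsMethod {
    /// Map an HTTP method name to its label value.
    /// Method names are case-sensitive, so "get" is not treated as GET.
    pub fn from_name(name: &str) -> Self {
        match name {
            "GET" => Self::GET,
            "POST" => Self::POST,
            "PUT" => Self::PUT,
            "DELETE" => Self::DELETE,
            "LIST" => Self::LIST,
            // Anything else shares one label value to keep cardinality bounded.
            _ => Self::OTHER,
        }
    }
}
```

Folding unknown methods into OTHER keeps the set of label values fixed, so a client sending arbitrary method names cannot create new time series.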

@@ -36,6 +36,8 @@ pub struct Config {
    pub daemon_user: String,
    #[serde(default)]
    pub daemon_group: String,
    #[serde(default)]
    pub collection_interval: u64,
Collaborator

The collection_interval has no default value.

Author

Default value added through fn default_collection_interval() -> u64.
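Some background on why the default matters: a bare #[serde(default)] on a u64 field falls back to u64::default(), which is 0, and tokio's time::interval panics when given a zero period. With serde, a named default is attached with #[serde(default = "default_collection_interval")]. The sketch below shows the fallback logic using std only. The value 15 is a placeholder, since the PR does not state which default it uses, and effective_interval is a hypothetical helper.

```rust
use std::time::Duration;

/// Default collection interval in seconds.
/// Placeholder value: the PR does not state the default it actually uses.
pub fn default_collection_interval() -> u64 {
    15
}

/// Turn the configured number of seconds into a usable `Duration`.
/// A missing or zero value falls back to the default, because a zero
/// period is not a valid timer interval.
pub fn effective_interval(configured: Option<u64>) -> Duration {
    let secs = match configured {
        Some(0) | None => default_collection_interval(),
        Some(secs) => secs,
    };
    Duration::from_secs(secs)
}
```

Guarding against an explicit 0 in the configuration file is a design choice of this sketch. Rejecting the configuration with an error at startup would be an equally reasonable alternative.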

@wa5i
Collaborator

wa5i commented Sep 20, 2024

  1. Three files in the pull request have conflicts that need to be resolved.
  2. Test cases are missing.

@cybershang
Author

  1. Conflicts resolved
