**`docs/observability/README.md`** — 173 additions, 0 deletions (new file)
## Observability Example Configuration (`examples/observability`)

This directory provides a complete, Docker Compose–based observability example environment, including:

* **Prometheus**: Metrics collection
* **Grafana**: Metrics visualization
* **OpenTelemetry Collector**: Distributed tracing data ingestion and processing

Developers can use this example to **launch a local monitoring and tracing system with a single command**.

---

### Prerequisites

Please make sure the following components are installed in advance:

* Docker
* Docker Compose (or a newer Docker CLI version that supports `docker compose`)

---

### Usage

#### Start All Services

Enter the directory:

```bash
cd examples/observability
```

Run the following command to start the complete monitoring and tracing stack:

```bash
docker compose -f docker-compose.yaml up -d
```

After startup, you can access:

* **Prometheus**: [http://localhost:9090](http://localhost:9090)
* **Grafana**: [http://localhost:3000](http://localhost:3000)
* **OTLP receiver**: applications send trace data to the OTel Collector's default OTLP ports:
  * gRPC: `4317`
  * HTTP: `4318`
* **Jaeger UI**: [http://localhost:16686](http://localhost:16686)

**Notes:**

* Update the Prometheus scrape targets to match your actual application endpoints.
* Map Grafana’s service port to a port that is accessible on your machine.
* Map the Jaeger UI port to a port that is accessible on your machine.
* When starting the full stack, there is no need to start individual sub-services separately.

---

#### Start Metrics Services Only

Enter the directory:

```bash
cd examples/observability/metrics
```

Run the following command:

```bash
docker compose -f prometheus_compose.yaml up -d
```

After startup, you can access:

* **Grafana**: [http://localhost:3000](http://localhost:3000)

---

#### Start Tracing Services Only

Enter the directory:

```bash
cd examples/observability/tracing
```

Run the following command:

```bash
docker compose -f tracing_compose.yaml up -d
```

After startup, you can access:

* **OTLP receiver**: applications send trace data to the OTel Collector's default OTLP ports:
  * gRPC: `4317`
  * HTTP: `4318`
* **Jaeger UI**: [http://localhost:16686](http://localhost:16686)

---

### Directory Structure and File Descriptions

#### Core Startup File

| File Name | Purpose | Description |
| --------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docker-compose.yaml` | Main entry | Defines and starts the full observability stack (Prometheus, Grafana, OTel Collector, and Jaeger). This is the single entry point to launch the entire environment. |

---

#### Metrics and Monitoring Configuration

| File / Directory | Purpose | Description |
| --------------------------------------------------- | ------------------------ | ------------------------------------------------------------------------------------------------------------------------- |
| `metrics` | Metrics root directory | Contains all Prometheus- and metrics-related configurations. |
| `prometheus.yaml` | Prometheus main config | Defines scrape targets, global scrape parameters, and optional recording rules. All monitored endpoints are defined here. |
| `prometheus_compose.yaml` | Prometheus Docker config | Defines the Prometheus container, volume mounts, and network settings. |
| `grafana/datasources/datasource.yaml` | Datasource configuration | Configures how Grafana connects to Prometheus. |
| `grafana/dashboards/config/dashboard.yaml` | Dashboard provisioning | Specifies the locations of dashboard JSON files to be loaded. |
| `grafana/dashboards/json/fastdeploy-dashboard.json` | Dashboard definition | Contains visualization layouts and queries for `fastdeploy` monitoring metrics. |
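
For reference, a Grafana datasource provisioning file for Prometheus typically looks like the sketch below. This is a hedged example rather than the exact contents of the shipped `datasource.yaml`; in particular, the `url` assumes the Prometheus service is reachable as `prometheus:9090` on the Compose network.

```yaml
# Sketch of a Grafana datasource provisioning entry (the shipped file may differ).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Assumes the Prometheus service is named "prometheus" in docker-compose.yaml.
    url: http://prometheus:9090
    isDefault: true
```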

---

#### Distributed Tracing Configuration

| File / Directory       | Purpose                | Description                                                             |
| ---------------------- | ---------------------- | ----------------------------------------------------------------------- |
| `tracing`              | Tracing root directory | Contains all configurations related to distributed tracing.             |
| `opentelemetry.yaml`   | OTel Collector config  | Defines the Collector data pipelines (see the breakdown below).          |
| `tracing_compose.yaml` | Tracing Docker config  | Defines the container configuration for the OTel Collector and Jaeger.  |

The pipeline sections in `opentelemetry.yaml`:

* **receivers**: receive OTLP data (traces, metrics, logs)
* **processors**: data processing and batching
* **exporters**: export data to tracing backends (such as Jaeger) or files
* **extensions**: health check, pprof, and zpages
* **pipelines**: define the complete processing flows for traces, metrics, and logs

---

### Customization

#### Modify Metrics Scrape Targets

If your application’s metrics endpoint, port, or path changes, edit:

```plain
metrics/prometheus.yaml
```
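
For example, a scrape job in `metrics/prometheus.yaml` generally follows the sketch below. The job name, target address, and metrics path are placeholders for your application; the shipped file may use different values.

```yaml
# Hedged sketch of a scrape job; point the target and path at your application's
# actual metrics endpoint.
scrape_configs:
  - job_name: "fastdeploy"
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["host.docker.internal:8000"]   # assumed host:port, adjust as needed
```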

---

#### Adjust Tracing Sampling Rate or Processing Logic

Edit:

```plain
tracing/opentelemetry.yaml
```
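
As a rough illustration, trace sampling can be configured with a sampling processor wired into the traces pipeline. The snippet below is a sketch only: it assumes the Collector image includes the `probabilistic_sampler` processor (shipped with the contrib distribution) and that the receiver/exporter names match those already defined in `opentelemetry.yaml`.

```yaml
# Sketch: sample roughly 10% of traces. Processor availability and component
# names are assumptions; align them with the existing opentelemetry.yaml.
processors:
  batch: {}
  probabilistic_sampler:
    sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]                        # assumes an "otlp" receiver is defined
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]                        # replace with the exporters defined in the file
```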

---

#### Add Custom Grafana Dashboards

1. Add the new dashboard JSON file to:

```plain
grafana/dashboards/json/
```

2. Register the dashboard so Grafana can load it automatically by editing:

```plain
grafana/dashboards/config/dashboard.yaml
```
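
A dashboard provider entry in `dashboard.yaml` typically follows Grafana's file-provider format, as in the sketch below. The provider name, folder, and container path are assumptions; keep `path` consistent with how the JSON directory is mounted in the Compose file.

```yaml
# Hedged sketch of a Grafana dashboard provider entry.
apiVersion: 1
providers:
  - name: fastdeploy-dashboards         # arbitrary provider name
    folder: FastDeploy                  # Grafana folder to place the dashboards in
    type: file
    options:
      # Assumed mount point of grafana/dashboards/json inside the container.
      path: /etc/grafana/provisioning/dashboards/json
```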
---

**`docs/observability/trace.md`** — 202 additions, 0 deletions (new file)
# FastDeploy Tracing with OpenTelemetry

**FastDeploy** exports request tracing data through the **OpenTelemetry Collector**.
Tracing can be enabled when starting the server using the `--trace-enable` flag, and the OpenTelemetry Collector endpoint can be configured via `--otlp-traces-endpoint`.

---

## Setup Guide

### 1. Install Dependencies

```bash
# Manual installation
pip install opentelemetry-sdk \
    opentelemetry-api \
    opentelemetry-exporter-otlp \
    opentelemetry-exporter-otlp-proto-grpc
```

---

### 2. Start OpenTelemetry Collector and Jaeger

```bash
docker compose -f examples/observability/tracing/tracing_compose.yaml up -d
```

---

### 3. Start FastDeploy Server with Tracing Enabled

#### Configure FastDeploy Environment Variables

```shell
# Enable tracing
export TRACES_ENABLE="true"

# Service name
export FD_SERVICE_NAME="FastDeploy"

# Instance name
export FD_HOST_NAME="trace_test"

# Exporter type
export TRACES_EXPORTER="otlp"

# OTLP endpoint:
#   gRPC: 4317
#   HTTP: 4318
export EXPORTER_OTLP_ENDPOINT="http://localhost:4317"

# Optional headers
export EXPORTER_OTLP_HEADERS="Authentication=Txxxxx"

# Export protocol
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL="grpc"
```

#### Start FastDeploy

Start the FastDeploy server with the above configuration and ensure that tracing is enabled.
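
A hedged launch sketch follows. Only the `--trace-enable` and `--otlp-traces-endpoint` flags come from the introduction above; the `fastdeploy.entrypoints.openai.api_server` module path and the model argument are assumptions about a typical deployment and may differ from yours.

```bash
# Assumed entrypoint and model path; adjust to your actual FastDeploy deployment.
# Run this in the same shell where the environment variables above were exported.
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --trace-enable \
    --otlp-traces-endpoint http://localhost:4317
```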

---

### 4. Send Requests and View Traces

* Open the **Jaeger UI** in your browser (port `16686`) to visualize request traces.
* The OpenTelemetry Collector will also export the trace data to a local file:

```plain
/tmp/otel_trace.json
```

---

## Adding Tracing to Your Own Code

FastDeploy already inserts tracing points at the most critical execution stages.
Developers can use the APIs provided in `trace.py` to add finer-grained tracing.

---

### 4.1 Initialize Tracing

Each **process** involved in tracing must call:

```python
process_tracing_init()
```

Each **thread** that participates in a traced request must call:

```python
trace_set_thread_info("thread_label", tp_rank, dp_rank)
```

* `thread_label`: identifier used for visual distinction of threads.
* `tp_rank` / `dp_rank`: optional values to label tensor parallelism or data parallelism ranks.
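
Putting the two calls together, a worker process might initialize tracing as in the sketch below. The import path, thread label, and rank values are placeholders; import from wherever `trace.py` lives in your codebase.

```python
# Hedged sketch: initialize tracing once per process, then label each traced thread.
from threading import Thread

from trace import process_tracing_init, trace_set_thread_info  # import path is an assumption


def worker_loop(tp_rank: int, dp_rank: int):
    # Label this thread so its spans can be told apart in the Jaeger UI;
    # "decode_worker" is a placeholder label.
    trace_set_thread_info("decode_worker", tp_rank, dp_rank)
    ...  # process requests for this rank


if __name__ == "__main__":
    process_tracing_init()  # once per participating process
    Thread(target=worker_loop, args=(0, 0), daemon=True).start()
```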

---

### 4.2 Mark Request Start and Finish

```python
trace_req_start(rid, bootstrap_room, ts, role)
trace_req_finish(rid, ts, attrs)
```

* Creates both a **Bootstrap Room Span** and a **Root Span**.
* Supports inheritance from spans created by the **FastAPI Instrumentor** (context copying).
* `attrs` can be used to attach additional attributes to the request span.
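
A minimal usage sketch follows. The import path, timestamp format, and the `bootstrap_room`, `role`, and `attrs` values are placeholders; match them to what the surrounding FastDeploy code actually passes to these APIs.

```python
# Hedged sketch: mark the lifetime of a single request.
import time

from trace import trace_req_finish, trace_req_start  # import path is an assumption

rid = "req-123"                                      # placeholder request id
trace_req_start(rid, None, time.time(), "prefill")   # bootstrap_room=None, role="prefill" are placeholders

# ... the request is scheduled, prefilled, decoded, and post-processed ...

trace_req_finish(rid, time.time(), {"output_tokens": 128})  # attrs are illustrative
```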

---

### 4.3 Add Tracing for Slices

#### Standard Slice

```python
trace_slice_start("slice_name", rid)
trace_slice_end("slice_name", rid)
```

#### Mark Thread Completion

The last slice in a thread can mark the thread span as finished:

```python
trace_slice_end("slice_name", rid, thread_finish_flag=True)
```

---

### 4.4 Trace Context Propagation Across Threads

#### Sender Side (ZMQ)

```python
# Capture the current trace context and attach it to the request object
# before it is sent to another process/thread over ZMQ.
trace_context = trace_get_proc_propagate_context(rid)
req.trace_context = trace_context
```

#### Receiver Side (ZMQ)

```python
# Restore the propagated context on the receiving side before creating new slices.
trace_set_proc_propagate_context(rid, req.trace_context)
```

---

### 4.5 Add Events and Attributes

#### Events (recorded on the current slice)

```python
trace_event("event_name", rid, ts, attrs)
```

#### Attributes (attached to the current slice)

```python
trace_slice_add_attr(rid, attrs)
```
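
Combining sections 4.3 and 4.5, a single processing stage could be instrumented as in the sketch below. The import path, slice and event names, and attribute values are placeholders.

```python
# Hedged sketch: one slice covering a stage, with an event and extra attributes.
import time

from trace import (  # import path is an assumption
    trace_event,
    trace_slice_add_attr,
    trace_slice_end,
    trace_slice_start,
)

rid = "req-123"  # placeholder request id

trace_slice_start("decode", rid)
# ... run the decode step ...
trace_event("first_token_emitted", rid, time.time(), {"token_id": 42})  # illustrative event
trace_slice_add_attr(rid, {"batch_size": 8})                            # illustrative attributes
trace_slice_end("decode", rid)
```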

---

## Extending the Tracing Framework

### 5.1 Trace Context Hierarchy

* Two levels of Trace Context:
  * **`TraceReqContext`** – request-level context
  * **`TraceThreadContext`** – thread-level context
* Three-level Span hierarchy:
  * `req_root_span`
  * `thread_span`
  * `slice_span`

---

### 5.2 Available Span Name Enum (`TraceSpanName`)

```python
FASTDEPLOY
PREPROCESS
SCHEDULE
PREFILL
DECODE
POSTPROCESS
```

* Use these enum values as slice names to keep span naming consistent, as in the sketch below.
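
For example, assuming `TraceSpanName` is importable alongside the slice APIs (an assumption about the module layout):

```python
# Hedged sketch: pass enum members as slice names instead of free-form strings
# (depending on the implementation, TraceSpanName.PREFILL.value may be required).
trace_slice_start(TraceSpanName.PREFILL, rid)   # rid: request id, as in the earlier examples
# ... prefill work ...
trace_slice_end(TraceSpanName.PREFILL, rid)
```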

---

### 5.3 Important Notes

1. Each **thread span must be closed** when the final slice of that thread finishes.
2. Spans created by **FastAPI Instrumentor** are automatically inherited by the internal tracing context.