Replace built-in metrics by custom instrumentation middleware
closes #5943
lubosmj committed Nov 4, 2024
1 parent ee10ce6 commit 79da4c6
Showing 17 changed files with 314 additions and 129 deletions.
3 changes: 3 additions & 0 deletions CHANGES/5943.feature
@@ -0,0 +1,3 @@
Replaced built-in OpenTelemetry metrics with custom middlewares. Additionally, introduced a new
setting `OTEL_ENABLED` that toggles the OpenTelemetry instrumentation on and off. It defaults to
`False`.
53 changes: 13 additions & 40 deletions docs/admin/learn/architecture.md
@@ -139,56 +139,29 @@ An example payload:

## Telemetry Support

Pulp can produce OpenTelemetry data, like the number of requests, active connections and latency response for
`pulp-api` and `pulp-content` using OpenTelemetry. You can read more about
[OpenTelemetry here](https://opentelemetry.io).
Pulp can produce telemetry data, like the response latency, using OpenTelemetry. You can read more
about [OpenTelemetry here](https://opentelemetry.io). The telemetry is **disabled by default**.

!!! attention
    This feature is provided as a tech preview and could change in backwards incompatible
    ways in the future.

If you are using [Pulp in One Container](site:pulp-oci-images/docs/admin/tutorials/quickstart/#single-container)
or [Pulp Operator](site:pulp-operator/) and want to enable it, you will need to set the following environment variables:
To enable telemetry, set `OTEL_ENABLED=True` in the settings file and follow these
steps:

- `PULP_OTEL_ENABLED` set to `True`.
- `OTEL_EXPORTER_OTLP_ENDPOINT` set to the address of your OpenTelemetry Collector instance
ex. `http://otel-collector:4318`.
- `OTEL_EXPORTER_OTLP_PROTOCOL` set to `http/protobuf`.
- Spin up a new instance of the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/).
- Configure the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable to point to the address of the
  OpenTelemetry Collector instance (e.g., `http://otel-collector:4318`).
- Set the `OTEL_EXPORTER_OTLP_PROTOCOL` environment variable to `http/protobuf`.

If you are using other type of installation maybe you will need to manually initialize Pulp using the
[OpenTelemetry automatic instrumentation](https://opentelemetry.io/docs/instrumentation/python/getting-started/#instrumentation)
and set the following environment variables:

- `OTEL_EXPORTER_OTLP_ENDPOINT` set to the address of your OpenTelemetry Collector instance
ex. `http://otel-collector:4318`.
- `OTEL_EXPORTER_OTLP_PROTOCOL` set to `http/protobuf`.

!!! note
A quick example on how it would run using this method:

```bash
/usr/local/bin/opentelemetry-instrument --service_name pulp-api /usr/local/bin/pulpcore-api \
--bind "127.0.0.1:24817" --name pulp-api --workers 4 --access-logfile -
```


You will need to run an instance of OpenTelemetry Collector. You can read more about the [OpenTelemetry
Collector here](https://opentelemetry.io/docs/collector/).

**At the moment, the following data is recorded by Pulp:**

- Access to every API endpoint (an HTTP method, target URL, status code, and user agent).
- Access to every requested package (an HTTP method, target URL, status code, and user agent).
- Disk usage within a specific domain (total used disk space and the reference to a domain). Currently disabled.
- Latency of API endpoints (along with an HTTP method, URL, status code, and unique worker name).
- Latency of delivering requested packages (an HTTP method, status code, and unique worker name).
- Disk usage within a specific domain (total used disk space and the reference to a domain).
- The size of served artifacts (total count of served data and the reference to a domain).

The information above is sent to the collector in the form of spans and metrics. Thus, the data is
emitted either based on the user interaction with the system or on a regular basis. Consult
[OpenTelemetry Traces](https://opentelemetry.io/docs/concepts/signals/traces/) and
The information above is sent to the collector in the form of metrics. Thus, the data is emitted
either based on the user interaction with the system or on a regular basis. Consult
[OpenTelemetry Metrics](https://opentelemetry.io/docs/concepts/signals/metrics/) to learn more.

!!! note
    It is highly recommended to set the [`OTEL_METRIC_EXPORT_INTERVAL`](https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#periodic-exporting-metricreader)
    environment variable to `300000` (5 minutes) to reduce the frequency of queries executed on
    Pulp's backend. This value is the interval, in milliseconds, between emitted metrics and should
    be set before runtime.
8 changes: 8 additions & 0 deletions docs/admin/reference/settings.md
@@ -459,3 +459,11 @@ Defaults to `False`.
Timeout in seconds for the kafka producer polling thread's `poll` calls.

Defaults to `0.1`.


### OTEL_ENABLED

Toggles the activation of OpenTelemetry instrumentation for monitoring and tracing the application's
performance.

Defaults to `False`.
12 changes: 12 additions & 0 deletions pulpcore/app/settings.py
@@ -358,6 +358,9 @@
KAFKA_SASL_USERNAME = None
KAFKA_SASL_PASSWORD = None

# opentelemetry settings
OTEL_ENABLED = False

# HERE STARTS DYNACONF EXTENSION LOAD (Keep at the very bottom of settings.py)
# Read more at https://www.dynaconf.com/django/
from dynaconf import DjangoDynaconf, Validator # noqa
@@ -449,6 +452,14 @@
messages={"is_type_of": "{name} must be a dictionary."},
)


def otel_middleware_hook(settings):
    data = {"dynaconf_merge": True}
    if settings.OTEL_ENABLED:
        data["MIDDLEWARE"] = ["pulpcore.middleware.DjangoMetricsMiddleware"]
    return data


settings = DjangoDynaconf(
__name__,
ENVVAR_PREFIX_FOR_DYNACONF="PULP",
@@ -467,6 +478,7 @@
json_header_auth_validator,
authentication_json_header_openapi_security_scheme_validator,
],
post_hooks=otel_middleware_hook,
)
# HERE ENDS DYNACONF EXTENSION LOAD (No more code below this line)
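
The post-hook above can be exercised without Dynaconf. The sketch below copies `otel_middleware_hook` from the diff and calls it with `types.SimpleNamespace` objects standing in for the loaded settings. Those stand-ins are an assumption made for illustration; the real hook receives the Dynaconf settings object. The sketch only shows what the hook returns. Merging the returned list into the existing `MIDDLEWARE` is left to Dynaconf, which, as far as I understand its hook convention, treats the `dynaconf_merge` key as a request to merge rather than replace.

```python
from types import SimpleNamespace


def otel_middleware_hook(settings):
    # Copied from the diff: always ask for a merge, and only contribute the
    # metrics middleware when the OTEL_ENABLED setting is truthy.
    data = {"dynaconf_merge": True}
    if settings.OTEL_ENABLED:
        data["MIDDLEWARE"] = ["pulpcore.middleware.DjangoMetricsMiddleware"]
    return data


# Hypothetical stand-ins for the loaded settings object.
disabled = SimpleNamespace(OTEL_ENABLED=False)
enabled = SimpleNamespace(OTEL_ENABLED=True)

print(otel_middleware_hook(disabled))
print(otel_middleware_hook(enabled))
```

With telemetry disabled the hook contributes nothing beyond the merge marker, so the middleware stack stays as defined earlier in `settings.py`.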

46 changes: 17 additions & 29 deletions pulpcore/app/util.py
@@ -483,7 +483,7 @@ def configure_periodic_telemetry():
dispatch_interval = timedelta(minutes=5)
name = "Emit OpenTelemetry metrics periodically"

if os.getenv("PULP_OTEL_ENABLED", "").lower() == "true" and settings.DOMAIN_ENABLED:
if settings.OTEL_ENABLED and settings.DOMAIN_ENABLED:
models.TaskSchedule.objects.update_or_create(
name=name, defaults={"task_name": task_name, "dispatch_interval": dispatch_interval}
)
@@ -613,34 +613,6 @@ def cache_key(base_path):
return base_path


class MetricsEmitter:
    """
    A builder class that initializes an emitter.
    If Open Telemetry is enabled, the builder configures a real emitter capable of sending data to
    the collector. Otherwise, a no-op emitter is initialized. The real emitter may utilize the
    global settings to send metrics.
    By default, the emitter sends data to the collector every 60 seconds. Adjust the environment
    variable OTEL_METRIC_EXPORT_INTERVAL accordingly if needed.
    """

    class _NoopEmitter:
        def __call__(self, *args, **kwargs):
            return self

        def __getattr__(self, *args, **kwargs):
            return self

    @classmethod
    def build(cls, *args, **kwargs):
        otel_enabled = os.getenv("PULP_OTEL_ENABLED", "").lower() == "true"
        if otel_enabled and settings.DOMAIN_ENABLED:
            return cls(*args, **kwargs)
        else:
            return cls._NoopEmitter()


@lru_cache(maxsize=1)
def get_worker_name():
return f"{os.getpid()}@{socket.gethostname()}"
@@ -672,3 +644,19 @@ def __exit__(self, exc_type, exc_value, traceback):
released = cursor.fetchone()[0]
if not released:
raise RuntimeError("Lock not held.")


def normalize_http_status(status):
    """Convert the HTTP status code to 2xx, 3xx, etc., normalizing the last two digits."""
    if 100 <= status < 200:
        return "1xx"
    elif 200 <= status < 300:
        return "2xx"
    elif 300 <= status < 400:
        return "3xx"
    elif 400 <= status < 500:
        return "4xx"
    elif 500 <= status < 600:
        return "5xx"
    else:
        return ""
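
A short usage sketch for `normalize_http_status`, with the function copied from the diff so the block runs on its own. The sample status codes are chosen for illustration only. Grouping codes into classes keeps the number of distinct `http.status_code` attribute values small, which is presumably the motivation here.

```python
def normalize_http_status(status):
    """Convert the HTTP status code to 2xx, 3xx, etc., normalizing the last two digits."""
    if 100 <= status < 200:
        return "1xx"
    elif 200 <= status < 300:
        return "2xx"
    elif 300 <= status < 400:
        return "3xx"
    elif 400 <= status < 500:
        return "4xx"
    elif 500 <= status < 600:
        return "5xx"
    else:
        return ""


# Codes inside 100..599 map to their class; anything else maps to "".
for code in (200, 302, 404, 503, 99, 600):
    print(code, repr(normalize_http_status(code)))
```

Note that values outside the 100 to 599 range produce an empty string rather than an error, so callers never have to handle an exception from this helper.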
29 changes: 0 additions & 29 deletions pulpcore/app/wsgi.py
@@ -7,40 +7,11 @@
https://docs.djangoproject.com/en/3.2/howto/deployment/wsgi/
"""

import os
from django.core.wsgi import get_wsgi_application
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware
from opentelemetry.exporter.otlp.proto.http.metric_exporter import (
OTLPMetricExporter,
)
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

from pulpcore.app.entrypoint import using_pulp_api_worker
from pulpcore.app.util import get_worker_name
from opentelemetry.sdk.resources import Resource

if not using_pulp_api_worker.get(False):
raise RuntimeError("This app must be executed using pulpcore-api entrypoint.")


class WorkerNameMetricsExporter(OTLPMetricExporter):
    def export(self, metrics_data, timeout_millis=10_000, **kwargs):
        for resource_metric in metrics_data.resource_metrics:
            for scope_metric in resource_metric.scope_metrics:
                for metric in scope_metric.metrics:
                    if metric.data.data_points:
                        point = metric.data.data_points[0]
                        point.attributes["worker.name"] = get_worker_name()

        return super().export(metrics_data, timeout_millis, **kwargs)


exporter = WorkerNameMetricsExporter()
reader = PeriodicExportingMetricReader(exporter)
resource = Resource(attributes={"service.name": "pulp-api"})
provider = MeterProvider(metric_readers=[reader], resource=resource)

application = get_wsgi_application()
if os.getenv("PULP_OTEL_ENABLED", "").lower() == "true":
application = OpenTelemetryMiddleware(application, meter_provider=provider)
9 changes: 5 additions & 4 deletions pulpcore/content/__init__.py
@@ -3,13 +3,10 @@
from importlib import import_module
import logging
import os
import socket

from asgiref.sync import sync_to_async
from aiohttp import web

from opentelemetry.instrumentation.aiohttp_server import middleware as instrumentation

import django


@@ -27,12 +24,16 @@
from pulpcore.app.util import get_worker_name # noqa: E402: module level not at top of file

from .handler import Handler # noqa: E402: module level not at top of file
from .instrumentation import instrumentation # noqa: E402: module level not at top of file
from .authentication import authenticate # noqa: E402: module level not at top of file


log = logging.getLogger(__name__)

app = web.Application(middlewares=[authenticate, instrumentation])
if settings.OTEL_ENABLED:
    app = web.Application(middlewares=[authenticate, instrumentation()])
else:
    app = web.Application(middlewares=[authenticate])

CONTENT_MODULE_NAME = "content"

49 changes: 49 additions & 0 deletions pulpcore/content/instrumentation.py
@@ -0,0 +1,49 @@
import time

from aiohttp import web

from pulpcore.metrics import init_otel_meter
from pulpcore.app.util import get_worker_name, normalize_http_status


def instrumentation(exporter=None, reader=None, provider=None):
    meter = init_otel_meter("pulp-content", exporter=exporter, reader=reader, provider=provider)
    request_duration_histogram = meter.create_histogram(
        name="content.request_duration",
        description="Tracks the duration of HTTP requests",
        unit="ms",
    )

    @web.middleware
    async def middleware(request, handler):
        start_time = time.time()

        try:
            response = await handler(request)
            status_code = response.status
        except web.HTTPException as exc:
            status_code = exc.status
            response = exc

        duration_ms = (time.time() - start_time) * 1000

        request_duration_histogram.record(
            duration_ms,
            attributes={
                "http.method": request.method,
                "http.status_code": normalize_http_status(status_code),
                "http.route": _get_view_request_handler_func(request),
                "worker.name": get_worker_name(),
            },
        )

        return response

    return middleware


def _get_view_request_handler_func(request):
    try:
        return request.match_info.handler.__name__
    except AttributeError:
        return "unknown"
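
The timing pattern used by the middleware above can be sketched with the standard library alone. Everything named in this block is hypothetical: `RecordingHistogram` stands in for the OpenTelemetry histogram, `HandlerError` stands in for aiohttp's `web.HTTPException`, and requests and responses are plain dictionaries. One deliberate difference from the diff: the sketch measures with `time.perf_counter()`, a monotonic clock, instead of `time.time()`.

```python
import asyncio
import time


class RecordingHistogram:
    """Hypothetical stand-in for a histogram; it only stores what was recorded."""

    def __init__(self):
        self.records = []

    def record(self, value, attributes=None):
        self.records.append((value, attributes or {}))


class HandlerError(Exception):
    """Hypothetical stand-in for an HTTP exception that carries a status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


def make_middleware(histogram):
    async def middleware(request, handler):
        start = time.perf_counter()
        try:
            response = await handler(request)
            status_code = response["status"]
        except HandlerError as exc:
            # The failure is still measured and then passed on as the response.
            status_code = exc.status
            response = exc
        duration_ms = (time.perf_counter() - start) * 1000
        histogram.record(
            duration_ms,
            attributes={"http.method": request["method"], "http.status_code": status_code},
        )
        return response

    return middleware


async def ok_handler(request):
    await asyncio.sleep(0.01)
    return {"status": 200}


async def failing_handler(request):
    raise HandlerError(404)


histogram = RecordingHistogram()
middleware = make_middleware(histogram)
request = {"method": "GET"}
asyncio.run(middleware(request, ok_handler))
asyncio.run(middleware(request, failing_handler))

for duration_ms, attributes in histogram.records:
    print(f"{duration_ms:.1f} ms", attributes)
```

Both the successful and the failing request produce exactly one measurement, which mirrors the `try`/`except` structure of the middleware in the diff.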
49 changes: 46 additions & 3 deletions pulpcore/metrics.py
@@ -1,11 +1,54 @@
from opentelemetry import metrics
from functools import lru_cache

from pulpcore.app.util import MetricsEmitter, get_domain, get_worker_name
from django.conf import settings

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource

from pulpcore.app.util import get_domain, get_worker_name


@lru_cache(maxsize=1)
def init_otel_meter(service_name, exporter=None, reader=None, provider=None):
    exporter = exporter or OTLPMetricExporter()
    reader = reader or PeriodicExportingMetricReader(exporter)
    resource = Resource(attributes={"service.name": service_name})
    provider = provider or MeterProvider(metric_readers=[reader], resource=resource)
    return provider.get_meter("pulp.metrics")
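
One property of `@lru_cache(maxsize=1)` worth keeping in mind when reading `init_otel_meter`: the cache key is built from the arguments as they are passed. In CPython, a call with only the service name and a call that also passes `exporter=None` are therefore separate cache entries, and with `maxsize=1` the second call evicts the first. The sketch below demonstrates this with a hypothetical `build_meter` stand-in that returns a fresh object on every cache miss; whether this matters for Pulp depends on how the callers in each process invoke `init_otel_meter`, which this diff only partly shows.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def build_meter(service_name, exporter=None):
    # Hypothetical stand-in for init_otel_meter: a new object per cache miss.
    return object()


first = build_meter("pulp-content")
same = build_meter("pulp-content")
# Same service name, but the explicit keyword argument changes the cache key.
other = build_meter("pulp-content", exporter=None)

print(first is same)
print(first is other)
print(build_meter.cache_info())
```

If a single shared meter per process is the intent, calling the function with one consistent argument pattern avoids the extra cache miss.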


class MetricsEmitter:
    """
    A builder class that initializes an emitter.
    If Open Telemetry is enabled, the builder configures a real emitter capable of sending data to
    the collector. Otherwise, a no-op emitter is initialized. The real emitter may utilize the
    global settings to send metrics.
    By default, the emitter sends data to the collector every 60 seconds. Adjust the environment
    variable OTEL_METRIC_EXPORT_INTERVAL accordingly if needed.
    """

    class _NoopEmitter:
        def __call__(self, *args, **kwargs):
            return self

        def __getattr__(self, *args, **kwargs):
            return self

    @classmethod
    def build(cls, *args, **kwargs):
        if settings.OTEL_ENABLED and settings.DOMAIN_ENABLED:
            return cls(*args, **kwargs)
        else:
            return cls._NoopEmitter()
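
The no-op builder pattern in `MetricsEmitter` can be illustrated without Django. In the sketch below, the class attribute `enabled` is a hypothetical replacement for the `settings.OTEL_ENABLED and settings.DOMAIN_ENABLED` check, and `SizeCounter` is an invented subclass used only to show the behaviour. The point is that callers can use the object returned by `build()` the same way whether telemetry is on or off.

```python
class Emitter:
    class _NoopEmitter:
        # Every call and every attribute access returns the same no-op object,
        # so arbitrary chains such as noop.counter.add(5) silently do nothing.
        def __call__(self, *args, **kwargs):
            return self

        def __getattr__(self, *args, **kwargs):
            return self

    enabled = False  # hypothetical stand-in for the settings check

    @classmethod
    def build(cls, *args, **kwargs):
        if cls.enabled:
            return cls(*args, **kwargs)
        return cls._NoopEmitter()


class SizeCounter(Emitter):
    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount
        return self.total


noop = SizeCounter.build()
print(noop.add(10) is noop)
print(noop.counter.add(5) is noop)

SizeCounter.enabled = True
real = SizeCounter.build()
real.add(10)
print(type(real).__name__, real.total)
```

The trade-off of this design is that typos in attribute names are swallowed while telemetry is disabled, since the no-op object accepts any attribute.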


class ArtifactsSizeCounter(MetricsEmitter):
def __init__(self):
self.meter = metrics.get_meter("artifacts.size.meter")
self.meter = init_otel_meter("pulp-content")
self.counter = self.meter.create_counter(
"artifacts.size.counter",
unit="Bytes",