Commit

Merge pull request #2 from waggle-sensor/develop
Utilizing tegrastats to expose full tegrastats metrics
gemblerz authored Dec 4, 2023
2 parents 1f9b605 + 72edcc1 commit 742d9cf
Showing 11 changed files with 622 additions and 49 deletions.
20 changes: 17 additions & 3 deletions Dockerfile
@@ -1,11 +1,25 @@
FROM golang:1.17-alpine as builder
FROM golang:1.20 as builder
ARG TARGETARCH
COPY . .
RUN mkdir -p /app \
&& unset GOPATH \
&& GOOS=linux GOARCH=${TARGETARCH} go build -o /app/jetson-exporter
&& CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -o /app/jetson-exporter

FROM waggle/plugin-base:1.1.1-base

RUN apt-get update \
&& apt-get install -y \
gnupg \
ca-certificates \
nano

COPY etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
/etc/apt/sources.list.d/nvidia-l4t-apt-source.list
RUN apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc \
&& apt-get update \
&& apt-get install --no-install-recommends -y \
nvidia-l4t-tools

FROM golang:1.17-alpine
COPY --from=builder /app/ /app/
WORKDIR /app
CMD ["/app/jetson-exporter"]
4 changes: 2 additions & 2 deletions Makefile
@@ -1,6 +1,6 @@
build:
go build -o ./out/jetson-exporter jetson_exporter.go
CGO_ENABLED=0 go build -o ./out/jetson-exporter .

build-arm64:
GOOS=linux GOARCH=arm64 go build -o ./out/jetson-exporter jetson_exporter.go
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o ./out/jetson-exporter .

10 changes: 2 additions & 8 deletions README.md
@@ -2,20 +2,14 @@
Jetson exporter is a metric provider for Jetson Tegra GPU. Scrapers can hit `/metrics` endpoint to get Prometheus-formatted metrics.

# Metrics
Provided metrics include,
- **sys.metrics.gpu.average.1s**: exponential moving average of GPU utilization over last 1 second
- **sys.metrics.gpu.average.5s**: exponential moving average of GPU utilization over last 5 second
- **sys.metrics.gpu.average.15s**: exponential moving average of GPU utilization over last 15 second
Provided metrics can be found in [tegrastats.go](./tegrastats.go)

# Kubernetes
The exporter can be deployed as Kubernetes DaemonSet to provide the metrics per Jetson device.
The Jetson exporter can be deployed as a Kubernetes DaemonSet to provide the metrics per Jetson device.

# Main Advantage
As of Sep 2022, the Jetson platform implements CUDA GPU support differently from the desktop (amd64) CUDA platform, which prevents Jetson users from using the full feature set of Nvidia's device-monitoring tools. `tegrastats` provides only a snapshot of GPU utilization, which makes it difficult to monitor usage while CUDA-enabled programs are running. This exporter aggregates GPU utilization and gives a wider picture of how the CUDA GPU performs.

# Limitation
- Jetson GPU shares memory with CPU such that this exporter does not provide GPU memory usage
- We have not found a way to map GPU utilization with a process ID to identify which process is using the resource. This means that GPU utilization does not necessarily come from a particular program, but could come from other program running at the same time.

# Developer Note
Current provided metrics are limited to a few metrics. More metrics may be added if there are needs.
76 changes: 76 additions & 0 deletions data/test_tegrastats.txt
@@ -0,0 +1,76 @@
# HELP tegra_cpu_frequency_hz CPU Clock frequency
# TYPE tegra_cpu_frequency_hz gauge
tegra_cpu_frequency_hz{cpu="1"} 1.42e+09
tegra_cpu_frequency_hz{cpu="2"} 1.42e+09
tegra_cpu_frequency_hz{cpu="3"} 1.42e+09
tegra_cpu_frequency_hz{cpu="4"} 1.42e+09
tegra_cpu_frequency_hz{cpu="5"} 1.42e+09
tegra_cpu_frequency_hz{cpu="6"} 1.42e+09
# HELP tegra_cpu_util_percentage Utilization of CPU in percentage
# TYPE tegra_cpu_util_percentage gauge
tegra_cpu_util_percentage{cpu="1"} 47
tegra_cpu_util_percentage{cpu="2"} 23
tegra_cpu_util_percentage{cpu="3"} 32
tegra_cpu_util_percentage{cpu="4"} 22
tegra_cpu_util_percentage{cpu="5"} 31
tegra_cpu_util_percentage{cpu="6"} 96
# HELP tegra_emc_frequency_hz External memory controller clock frequency
# TYPE tegra_emc_frequency_hz gauge
tegra_emc_frequency_hz 1.6e+09
# HELP tegra_emc_util_percentage Utilization of external memory controller in percentage
# TYPE tegra_emc_util_percentage gauge
tegra_emc_util_percentage 2
# HELP tegra_gpu_frequency_hz GPU clock frequency
# TYPE tegra_gpu_frequency_hz gauge
tegra_gpu_frequency_hz 1.109e+09
# HELP tegra_gpu_util_percentage Utilization of GPU in percentage
# TYPE tegra_gpu_util_percentage gauge
tegra_gpu_util_percentage 0
# HELP tegra_last_updated_timestamp_epoch An epoch time of when the stats were collected from the system
# TYPE tegra_last_updated_timestamp_epoch gauge
tegra_last_updated_timestamp_epoch 1.701465532e+09
# HELP tegra_lfb_nblock_count Count of largest free block
# TYPE tegra_lfb_nblock_count gauge
tegra_lfb_nblock_count 7
# HELP tegra_lfb_size_bytes Size of largest free block
# TYPE tegra_lfb_size_bytes gauge
tegra_lfb_size_bytes 4.194304e+06
# HELP tegra_mts_bg_percentage Time spent in foreground tasks
# TYPE tegra_mts_bg_percentage gauge
tegra_mts_bg_percentage 9
# HELP tegra_mts_fg_percentage Time spent in background tasks
# TYPE tegra_mts_fg_percentage gauge
tegra_mts_fg_percentage 1
# HELP tegra_ram_total_bytes Total memory
# TYPE tegra_ram_total_bytes gauge
tegra_ram_total_bytes 8.148484096e+09
# HELP tegra_ram_used_bytes Current used memory
# TYPE tegra_ram_used_bytes gauge
tegra_ram_used_bytes 5.500829696e+09
# HELP tegra_swap_cached_bytes Current swap cache memory
# TYPE tegra_swap_cached_bytes gauge
tegra_swap_cached_bytes 2.9360128e+08
# HELP tegra_swap_total_bytes Total swap memory
# TYPE tegra_swap_total_bytes gauge
tegra_swap_total_bytes 2.1253586944e+10
# HELP tegra_swap_used_bytes Current swap used memory
# TYPE tegra_swap_used_bytes gauge
tegra_swap_used_bytes 1.030750208e+09
# HELP tegra_temperature_celcius Temperature reading in Celcius
# TYPE tegra_temperature_celcius gauge
tegra_temperature_celcius{sensor="ao"} 29
tegra_temperature_celcius{sensor="aux"} 30
tegra_temperature_celcius{sensor="cpu"} 33.5
tegra_temperature_celcius{sensor="gpu"} 31.5
tegra_temperature_celcius{sensor="pmic"} 100
tegra_temperature_celcius{sensor="thermal"} 31.350000381469727
# HELP tegra_wattage_average_milliwatts Averaged Watts of the hardware
# TYPE tegra_wattage_average_milliwatts gauge
tegra_wattage_average_milliwatts{sensor="vdd_cpu_gpu_cv"} 2119
tegra_wattage_average_milliwatts{sensor="vdd_in"} 5510
tegra_wattage_average_milliwatts{sensor="vdd_soc"} 1051
# HELP tegra_wattage_current_milliwatts Current Watts of the hardware
# TYPE tegra_wattage_current_milliwatts gauge
tegra_wattage_current_milliwatts{sensor="vdd_cpu_gpu_cv"} 2706
tegra_wattage_current_milliwatts{sensor="vdd_in"} 6140
tegra_wattage_current_milliwatts{sensor="vdd_soc"} 1074
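The sample above is in the Prometheus text exposition format: `# HELP` and `# TYPE` comment lines, followed by sample lines of the form `name{label="value"} number`. As a rough illustration of how a test might read such a fixture back, here is a minimal sketch using only the Go standard library. The `sample` type and `parseSample` function are hypothetical names invented for this example and are not part of the exporter; the parser only covers the simple lines shown in this file (no escaped quotes, no commas inside label values, no timestamps).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample is a hypothetical container for one parsed line.
// It is not a type defined by jetson-exporter.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseSample parses one sample line of the simple shape found in
// data/test_tegrastats.txt: a metric name, an optional {k="v",...}
// label set, a space, and a float value. Comment lines are rejected.
func parseSample(line string) (sample, error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return sample{}, fmt.Errorf("not a sample line: %q", line)
	}
	// The value is everything after the last space.
	idx := strings.LastIndex(line, " ")
	if idx < 0 {
		return sample{}, fmt.Errorf("missing value: %q", line)
	}
	value, err := strconv.ParseFloat(line[idx+1:], 64)
	if err != nil {
		return sample{}, fmt.Errorf("bad value in %q: %w", line, err)
	}
	s := sample{Name: line[:idx], Labels: map[string]string{}, Value: value}
	// Split off the label set, if there is one.
	if open := strings.Index(s.Name, "{"); open >= 0 {
		if !strings.HasSuffix(s.Name, "}") {
			return sample{}, fmt.Errorf("unterminated label set: %q", line)
		}
		body := s.Name[open+1 : len(s.Name)-1]
		s.Name = s.Name[:open]
		if body != "" {
			for _, pair := range strings.Split(body, ",") {
				key, val, ok := strings.Cut(pair, "=")
				if !ok {
					return sample{}, fmt.Errorf("bad label %q in %q", pair, line)
				}
				s.Labels[key] = strings.Trim(val, `"`)
			}
		}
	}
	return s, nil
}

func main() {
	lines := []string{
		`tegra_cpu_util_percentage{cpu="6"} 96`,
		`tegra_emc_frequency_hz 1.6e+09`,
	}
	for _, line := range lines {
		s, err := parseSample(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s.Name, s.Labels, s.Value)
	}
}
```

For anything beyond a quick fixture check, a maintained parser for the exposition format is likely a better choice than hand-rolled string splitting like this.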
2 changes: 2 additions & 0 deletions etc/apt/sources.list.d/nvidia-l4t-apt-source.list
@@ -0,0 +1,2 @@
deb https://repo.download.nvidia.com/jetson/common r32.4 main
deb https://repo.download.nvidia.com/jetson/t194 r32.4 main
3 changes: 2 additions & 1 deletion go.mod
@@ -1,6 +1,6 @@
module github.com/waggle-sensor/jetson-exporter

go 1.17
go 1.20

require (
github.com/influxdata/influxdb-client-go/v2 v2.11.0
@@ -10,6 +10,7 @@ require (
require (
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.1.2 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/deepmap/oapi-codegen v1.8.2 // indirect
github.com/golang/protobuf v1.5.2 // indirect
github.com/influxdata/line-protocol v0.0.0-20200327222509-2487e7298839 // indirect
7 changes: 0 additions & 7 deletions go.sum
@@ -121,7 +121,6 @@ github.com/google/go-cmp v0.5.1/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/
github.com/google/go-cmp v0.5.4/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.8 h1:e6P7q2lk1O+qJJb4BtCQXlK8vWEO8V1ZeuEdJNOqZyg=
github.com/google/go-cmp v0.5.8/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
github.com/google/martian v2.1.0+incompatible/go.mod h1:9I4somxYTbIHy5NJKHRl3wXiIaQGbYVAs8BPL6v8lEs=
github.com/google/martian/v3 v3.0.0/go.mod h1:y5Zk1BBys9G+gd6Jrk0W3cC1+ELVxBWuIGO+w/tUAp0=
@@ -216,14 +215,11 @@ github.com/sirupsen/logrus v1.4.2/go.mod h1:tLMulIdttU9McNUspp0xgXVQah82FyeX6Mwd
github.com/sirupsen/logrus v1.6.0/go.mod h1:7uNnSEd1DgxDLC74fIahvMZmmYsHGZGEOFrfsX/uA88=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=
github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0 h1:pSgiaMZlXftHpm5L7V1+rVB+AZJydKsMxsQBIJw4PKk=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
github.com/valyala/bytebufferpool v1.0.0/go.mod h1:6bBcMArwyJ5K/AmCkWv1jt77kVWyCJ6HpOuEn7z0Csc=
github.com/valyala/fasttemplate v1.0.1/go.mod h1:UQGH1tvbgY+Nz5t2n7tXsz52dQxojPUpymEIMZ47gx8=
github.com/valyala/fasttemplate v1.2.1/go.mod h1:KHLXt3tVN2HBp8eijSv/kGJopbvo7S+qRAEEKiv+SiQ=
@@ -322,7 +318,6 @@ golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJ
golang.org/x/sync v0.0.0-20200317015054-43a5402ce75a/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20200625203802-6e8e738ad208/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20201207232520-09787c993a3a/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20220601150217-0de741cfad7f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sys v0.0.0-20180830151530-49385e6e1522/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20180905080454-ebe1bf3edb33/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20181116152217-5ac8a444bdc5/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
@@ -516,9 +511,7 @@ gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.2.5/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.3.0/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
honnef.co/go/tools v0.0.0-20190102054323-c2f93a96b099/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.0-20190106161140-3f1c8253044a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.0-20190418001031-e561f6794a2a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
4 changes: 4 additions & 0 deletions influxdb.go
@@ -1,5 +1,9 @@
package main

// This is archived to minimize functionality of jetson exporter
// If metrics need to be published use metrics collection agents
// like Grafana agent, Telegraf, Fluentd, etc.

import (
"time"

70 changes: 42 additions & 28 deletions jetson_exporter.go
@@ -6,6 +6,8 @@ import (
"log"
"net/http"
"os"
"os/signal"
"syscall"

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/collectors"
@@ -23,36 +25,48 @@ func main() {
var port string
metricsPath := "/metrics"
flag.StringVar(&port, "port", getenv("PORT", "9091"), "Port number to listen")
var collectorConfig TegraGPUCollectorConfig
flag.IntVar(&collectorConfig.CollectionIntervalInMilli, "sampling", 100, "Sampling interval in milliseconds")
flag.StringVar(&collectorConfig.LoadPath, "loadpath", "/sys/devices/gpu.0/load", "Path to GPU load")
flag.StringVar(&collectorConfig.CurrentDeviceFrqPathRex, "devfreqpathrex", "/sys/devices/gpu.0/devfreq/*/cur_freq", "Path described in Regression to current frequency of GPU device")
var publisherConfig PublisherConfig
flag.StringVar(&publisherConfig.NodeName, "nodename", getenv("KUBENODE", ""), "Name of the Kubernetes node")
flag.StringVar(&publisherConfig.InfluxDBURL, "influxdb-url", getenv("INFLUXDB_URL", ""), "InfluxDB URL")
flag.StringVar(&publisherConfig.InfluxDBToken, "influxdb-token", getenv("INFLUXDB_TOKEN", ""), "InfluxDB token")
flag.StringVar(&publisherConfig.InfluxDBOrganization, "influxdb-org", getenv("INFLUXDB_ORG", "waggle"), "InfluxDB organization")
flag.StringVar(&publisherConfig.InfluxDBBucket, "influxdb-bucket", getenv("INFLUXDB_BUCKET", "waggle"), "InfluxDB bucket")
flag.IntVar(&publisherConfig.InfluxDBPublishInterval, "influxdb-interval", 1, "InlufxDB publishing interval in seconds")
flag.Parse()
fmt.Println("Jetson exporter started")
fmt.Println("Parameters are:")
fmt.Printf("\t Sampling Interval: %d millisecond\n", collectorConfig.CollectionIntervalInMilli)
fmt.Printf("\t Loadpath: %s\n", collectorConfig.LoadPath)
fmt.Printf("\t Endpoint: %s\n", metricsPath)
collector := NewTegraGPUCollector(&collectorConfig)
collector.Configure()
stopCh := make(chan bool, 1)
tegrastats := NewTegraStats()

log.Println("Jetson exporter starts...")
log.Println("Parameters are:")
log.Printf("\t Endpoint: %s\n", metricsPath)
log.Printf("\t TegraStats command: %v", tegrastats.GetTegraStatsCommandWithArguments())

// watch signals to terminate external programs cleanly.
sigc := make(chan os.Signal, 1)
signal.Notify(sigc,
syscall.SIGHUP,
syscall.SIGINT,
syscall.SIGTERM,
syscall.SIGQUIT)

log.Println("Executing the tegrastats command in the background...")
err := tegrastats.Start()
if err != nil {
panic(err)
}
reg := prometheus.NewRegistry()
reg.MustRegister(collectors.NewGoCollector())
reg.MustRegister(collector)
go collector.RunUntil(stopCh)
if publisherConfig.InfluxDBURL != "" {
fmt.Println("InfluxDB URL is provided. Metrics will be published.")
fmt.Printf("\t Publishing Interval: %d second(s) \n", publisherConfig.InfluxDBPublishInterval)
publisher := NewInfluxDBPublisher(publisherConfig, collector)
go publisher.RunUntil(stopCh)
}
reg.MustRegister(tegrastats)
http.Handle(metricsPath, promhttp.HandlerFor(reg, promhttp.HandlerOpts{EnableOpenMetrics: true}))
log.Fatal(http.ListenAndServe(fmt.Sprintf("0.0.0.0:%s", port), nil))
sige := make(chan error, 1)
go func() {
err := http.ListenAndServe(fmt.Sprintf("0.0.0.0:%s", port), nil)
sige <- err
}()

for {
select {
case err := <-sige:
log.Println("HTTP listener returned with an error")
log.Printf("%s\n", err)
tegrastats.Close()
return
case <-sigc:
log.Printf("OS signal received. Gracefully terminating...")
tegrastats.Close()
return
}
}
}