NATS Monitoring, Simplified.
NATS surveyor polls the NATS server for Statz messages to generate data for
Prometheus. This allows a single exporter to connect to any NATS server and
get an entire picture of a NATS deployment without requiring extra monitoring
components or sidecars. Surveyor has been used extensively by Synadia.
System accounts must be enabled to use surveyor.
Usage:
nats-surveyor [flags]
Flags:
--accounts Export per account metrics
-a, --addr string Network host to listen on. (default "0.0.0.0")
--config string config file (default is ./nats-surveyor.yaml)
-c, --count int Expected number of servers (-1 for undefined). (default 1)
--creds string Credentials File
-h, --help help for nats-surveyor
--http-pass string Set the password for HTTP scrapes. NATS bcrypt supported.
--http-tlscacert string Client certificate CA for verification (used with HTTPS).
--http-tlscert string Server certificate file (Enables HTTPS).
--http-tlskey string Private key for server certificate (used with HTTPS).
--http-user string Enable basic auth and set user name for HTTP scrapes.
--jetstream string Listen for JetStream Advisories based on config files in a directory.
--jwt string User JWT. Use in conjunction with --seed
--log-level string Log level, one of: trace|debug|info|warn|error|fatal|panic (default "info")
--nkey string Nkey Seed File
--observe string Listen for observation statistics based on config files in a directory.
--password string NATS user password
-p, --port int Port to listen on. (default 7777)
--prefix string Replace the default prefix for all the metrics.
--seed string Private key (nkey seed). Use in conjunction with --jwt
--server-discovery-timeout duration Maximum wait time between responses from servers during server discovery. Use in conjunction with -count=-1. (default 500ms)
-s, --servers string NATS Cluster url(s) (default "nats://127.0.0.1:4222")
--timeout duration Polling timeout (default 3s)
--tlscacert string Client certificate CA on NATS connections.
--tlscert string Client certificate file for NATS connections.
--tlskey string Client private key for NATS connections.
--tlsfirst bool Whether to use TLS First connections.
--user string NATS user name or token
-v, --version version for nats-surveyor
At this time, NATS 2.0 System credentials are required for meaningful usage. Those can be provided in 2 ways:
- using the --creds option to supply a chained credentials file (containing JWT and NKey seed):
./nats-surveyor --creds ./test/SYS.creds
2019/10/14 21:35:40 Connected to NATS Deployment: 127.0.0.1:4222
2019/10/14 21:35:40 No certificate file specified; using http.
2019/10/14 21:35:40 Prometheus exporter listening at http://0.0.0.0:7777/metrics
- using the --jwt and --seed options to provide the user JWT and NKey seed directly:
./nats-surveyor --jwt $NATS_USER_JWT --seed $NATS_NKEY_SEED
2019/10/14 21:35:40 Connected to NATS Deployment: 127.0.0.1:4222
2019/10/14 21:35:40 No certificate file specified; using http.
2019/10/14 21:35:40 Prometheus exporter listening at http://0.0.0.0:7777/metrics
Surveyor uses Viper to read configs, so it supports all file types that Viper supports (JSON, TOML, YAML, HCL, envfile, and Java properties).
To use a config file, pass the --config flag. The defaults are /etc/nats-surveyor/nats-surveyor[.ext] and ./nats-surveyor[.ext] with one of the supported extensions.
The config is simple: set each flag in the config file. Example nats-surveyor.yaml:
servers: nats://127.0.0.1:4222
accounts: true
log-level: debug
Environment variables are also taken into account. Any environment variable that is prefixed with NATS_SURVEYOR_ will be read.
Each flag has a matching environment variable. To derive its name, convert the flag name to uppercase, replace dashes with underscores, and add the prefix. Example:
NATS_SURVEYOR_SERVERS=nats://127.0.0.1:4222
NATS_SURVEYOR_ACCOUNTS=true
NATS_SURVEYOR_LOG_LEVEL=debug
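The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this README; the helper name flag_to_env is hypothetical and is not part of Surveyor.

```python
# Prefix described in this README for Surveyor environment variables.
ENV_PREFIX = "NATS_SURVEYOR_"


def flag_to_env(flag):
    """Map a long flag name (with or without leading dashes) to its
    environment variable name: uppercase, dashes become underscores."""
    name = flag.lstrip("-")
    return ENV_PREFIX + name.upper().replace("-", "_")


if __name__ == "__main__":
    for flag in ("--servers", "--accounts", "--log-level"):
        print(flag, "->", flag_to_env(flag))
```

Running the script prints the three variable names used in the example above.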
Scrape output is in the form of nats_core_NNN_metric, where NNN is server, route, or gateway.
To aid filtering, each metric has labels. These include server_cluster, server_name, and server_id.
Routes have the additional label server_route_name and gateways have the additional label server_gateway_name.
The info metric has a nats_server_version label with the current version.
Additionally, there is a nats_up metric that will normally return 1, but will return 0
and no additional NATS metrics when there is no connectivity to the NATS system. This
allows users to differentiate between a problem with the exporter itself and a problem
with connectivity to the NATS system.
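As a rough sketch of how a health check might use nats_up, the Python function below scans scrape output in the Prometheus text format and reports whether nats_up is 1. The function name nats_is_up is hypothetical, and the sample strings are made up for illustration rather than captured from a real exporter. If the scrape itself fails, the exporter is the problem; if the scrape succeeds and this returns False, the problem is connectivity to NATS.

```python
def nats_is_up(scrape_text):
    """Return True if the scrape text contains a nats_up sample equal to 1."""
    for raw in scrape_text.splitlines():
        line = raw.strip()
        # Comment lines (# HELP, # TYPE) and other metrics are skipped here.
        if not line.startswith("nats_up"):
            continue
        rest = line[len("nats_up"):]
        if rest.startswith("{"):
            # Drop an optional label set such as {label="value"}.
            rest = rest[rest.rfind("}") + 1:]
        elif not rest.startswith(" "):
            # A different metric whose name merely begins with nats_up.
            continue
        fields = rest.split()
        if not fields:
            continue
        try:
            return float(fields[0]) == 1.0
        except ValueError:
            continue
    # No nats_up sample found at all.
    return False


if __name__ == "__main__":
    sample = "# TYPE nats_up gauge\nnats_up 0\n"
    print("NATS reachable:", nats_is_up(sample))
```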
An easy way to start the NATS Surveyor stack (Grafana, Prometheus, and NATS Surveyor) is through docker-compose.
Follow these links for installation instructions:
The following environment variables MUST be set, either in your environment or
through the .env file that is automatically read by docker-compose. There is a
survey.sh script that will set them for you as a convenience.
Environment Variable | Example | Description |
---|---|---|
NATS_SURVEYOR_SERVERS | nats://hostname:4222 | The URLs of any deployed NATS server(s) |
NATS_SURVEYOR_CREDS | ./SYS.creds | NATS 2.0 System Account credentials |
NATS_SURVEYOR_SERVER_COUNT | 9 | Number of expected NATS servers |
PROMETHEUS_STORAGE | ./storage/prometheus | Path to store prometheus data locally |
SURVEYOR_DOCKER_TAG | latest | Surveyor docker tag to pull |
PROMETHEUS_DOCKER_TAG | latest | Prometheus docker tag to pull |
GRAFANA_DOCKER_TAG | latest | Grafana docker tag to pull |
Note: For referencing files and paths, docker always expects volume mounts
to be either a fully qualified directory, or a relative directory beginning
with ./
You only need to connect to a single NATS server to monitor your entire NATS
deployment. In configuring NATS_SURVEYOR_SERVERS, only one server is required,
but it's recommended that you provide a list of backup servers to connect to, e.g.
nats://host1:4222,nats://host2:5222. Valid URLs are formatted as hostname
(defaulting to port 4222), hostname:port, or nats://hostname:port.
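To make the three accepted URL forms concrete, here is a small Python sketch that expands a comma-separated server list into full nats://hostname:port entries. It only illustrates the formats described above and is not how Surveyor or the NATS client parses URLs; the function names are hypothetical, and IPv6 literals are not handled.

```python
DEFAULT_PORT = 4222  # default port stated in this README


def normalize_server(entry):
    """Expand hostname, hostname:port, or nats://hostname:port
    into the full nats://hostname:port form."""
    entry = entry.strip()
    scheme = "nats"
    if "://" in entry:
        scheme, entry = entry.split("://", 1)
    host, _, port = entry.partition(":")
    if not host:
        raise ValueError("missing hostname in server entry")
    return "{}://{}:{}".format(scheme, host, port or DEFAULT_PORT)


def normalize_servers(value):
    """Split a comma-separated list and normalize each non-empty entry."""
    return [normalize_server(e) for e in value.split(",") if e.strip()]


if __name__ == "__main__":
    print(normalize_servers("host1,host2:5222,nats://host3:4222"))
```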
You can start the Surveyor stack two ways. The first is through docker
compose. Ensure the environment variables are set and that you are working
from the /docker-compose directory, then run docker-compose up.
$ docker-compose up
Recreating nats-surveyor ... done
Recreating prometheus ... done
Recreating grafana ... done
Attaching to nats-surveyor, prometheus, grafana
...
Alternatively, you can pass variables into the survey.sh script in the
docker-compose directory.
$ ./survey.sh
usage: survey.sh <url> <server count> <system credentials>
e.g.
./survey.sh nats://mydeployment:4222 24 /privatekeys/SYS.creds
If things aren't working, look in the output for any lines that contain
"exited with code 1" and address the problem. They are usually docker
volume mount problems or connectivity problems.
Next, with your browser, navigate to http://127.0.0.1:3000, or if you are
running the Surveyor stack remotely, the hostname of the host running the
NATS surveyor stack, e.g. http://yourremotehost:3000.
The first time you connect, you'll need to log in:
- User: admin
- Password: admin
After logging in, navigate to "Manage dashboards" and you'll see a dashboard available named NATS Surveyor, where you'll be able to monitor your entire NATS deployment.
To stop the surveyor stack but keep the containers, run: docker-compose stop
To restart the surveyor stack after being stopped, run: docker-compose up
To clean up your installation, run: docker-compose down
For platforms that support systemd, surveyor.service is provided as a service
definition template. Modify and save this file as /etc/systemd/system/surveyor.service.
Running systemctl start surveyor will launch the service.
The logs should normally contain enough information about the cause of problems or errors.
If you encounter a Prometheus error of
"panic: Unable to create mmap-ed active query log", set the UID of the
container to match the UID of your user in the docker-compose file, e.g.:
prometheus:
image: prom/prometheus:${PROMETHEUS_DOCKER_TAG}
user: "1000:1000"
If the above doesn't work, using root will work but may pose a
security threat to the node it is running on.
prometheus:
image: prom/prometheus:${PROMETHEUS_DOCKER_TAG}
user: root
More information can be found here.
Services can be observed by creating JSON files in the observations directory.
The file extension must be .json.
Only one authentication method needs to be provided.
Example file format:
{
"name": "my service",
"topic": "email.subscribe.>",
"jwt": "jwt portion of creds, must include seed also",
"seed": "seed portion of creds, must include jwt also",
"credential": "/path/to/file.creds",
"nkey": "nkey seed",
"token": "token",
"username": "username, must include password also",
"password": "password, must include user also",
"tls_ca": "/path/to/ca.pem, defaults to surveyor's ca if one exists",
"tls_cert": "/path/to/cert.pem, defaults to surveyor's cert if one exists",
"tls_key": "/path/to/key.pem, defaults to surveyor's key if one exists"
}
Files are watched and updated using fsnotify.
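Since only one authentication method is needed, it can help to check a config before dropping it into the directory. The Python sketch below lists which authentication methods a config supplies, using the field names from the example format above and treating jwt/seed and username/password as pairs. How Surveyor itself validates these files is not described here, so treat this as an assumed pre-flight check; the function names are hypothetical.

```python
import json


def auth_methods(config):
    """Return the names of the authentication methods the config supplies."""
    methods = []
    # jwt and seed must be given together.
    if config.get("jwt") and config.get("seed"):
        methods.append("jwt+seed")
    for key in ("credential", "nkey", "token"):
        if config.get(key):
            methods.append(key)
    # username and password must be given together.
    if config.get("username") and config.get("password"):
        methods.append("username+password")
    return methods


def build_observation(name, topic, **auth):
    """Render an observation config as JSON text, requiring at least
    one complete authentication method."""
    config = {"name": name, "topic": topic}
    config.update(auth)
    if not auth_methods(config):
        raise ValueError("no complete authentication method provided")
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(build_observation("my service", "email.subscribe.>",
                            credential="/path/to/file.creds"))
```

The printed JSON could then be saved with a .json extension in the observations directory.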
JetStream can be monitored on a per-account basis by creating JSON files in the jetstream directory.
The file extension must be .json.
Only one authentication method needs to be provided.
Be sure that you give your user access to the $JS.EVENT.> subject.
Example file format:
{
"name": "my account",
"jwt": "jwt portion of creds, must include seed also",
"seed": "seed portion of creds, must include jwt also",
"credential": "/path/to/file.creds",
"nkey": "nkey seed",
"token": "token",
"username": "username, must include password also",
"password": "password, must include user also",
"tls_ca": "/path/to/ca.pem, defaults to surveyor's ca if one exists",
"tls_cert": "/path/to/cert.pem, defaults to surveyor's cert if one exists",
"tls_key": "/path/to/key.pem, defaults to surveyor's key if one exists"
}
Files are watched and updated using fsnotify.
- Windows builds
- Other events (connections, disconnects, etc)
- Best Guess Server Count