2025-03-03 ZeroTier - health checking - alternative proposal #38

Open. Paraphraser (Contributor) wants to merge 2 commits into base: main.

@Paraphraser Paraphraser commented Mar 3, 2025

This PR follows on from the extensive discussion associated with #37.

Never before have I even contemplated submitting a PR covering the same ground as an existing open PR. However, on this occasion I thought it might be useful to have a concrete proposal to compare and contrast with #37.

I sincerely hope that laying this on the (virtual) table and then minimising further interaction might help us converge on a solution.


  • docker-compose.yml and docker-compose-router.yml:

    • replaces deprecated version statement with ---.

    • adds example environment variables.

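  • Dockerfile

    • corrects case of "as" to "AS" (silences build warning).

    • adds and configures healthcheck.sh (as per #37).

    • includes tzdata package (moved from Dockerfile.router) so messages have local timestamps.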

  • Dockerfile.router

    • removes tzdata (moved to Dockerfile).
  • entrypoint-router.sh:

    • code for first launch auto join of listed networks expanded to include additional help material.
  • entrypoint.sh:

    • "first launch" auto join of listed networks (code copied from entrypoint-router.sh, as modified per above).

    • "self repair" of permissions in persistent store (code copied from entrypoint-router.sh).

    • adds launch-time message to make it clear that the client is launching (complements messages in entrypoint-router.sh).

    • abstracts some common strings to environment variables (opportunistic change).

  • README.md:

    • updates examples.

    • describes new environment variables (including move of ZEROTIER_ONE_NETWORK_IDS from README-router.md).

    • documents health-checking.

  • README-router.md

    • updates examples.
    • explains relationship of router and client.

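Added:

  • healthcheck.sh, based on original proposal in #37 and subsequent suggestions for modification by me.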

I gave serious consideration to the code for synchronising networks in the entry point scripts. The idea is quite attractive. It is safe to automate joins in a "clean slate" situation. However, a leave followed by a join is not guaranteed to be idempotent. That's because the leave destroys the network-specific configuration options (allowManaged, allowGlobal, allowDefault, allowDNS).

On balance I think it's better left to users to send explicit leave commands via the CLI and take responsibility for restoring lost configuration options on any subsequent join.
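
As an illustration of "taking responsibility" for the lost options, here is a minimal sketch of a re-join helper. It is not part of this PR. The names `ZT_CLI` and `rejoin_with_options` are hypothetical, and the sketch assumes the user recorded the wanted `option=value` pairs before issuing the earlier leave.

```shell
#!/bin/sh
# Sketch only. ZT_CLI and rejoin_with_options are hypothetical names.
# Override ZT_CLI to reach the CLI inside the container, for example:
#   ZT_CLI="docker exec zerotier zerotier-cli"
ZT_CLI="${ZT_CLI:-zerotier-cli}"

# Usage: rejoin_with_options <networkID> [option=value ...]
# Joins the network, then re-applies each option the caller recorded
# before the earlier leave.
rejoin_with_options() {
    network_id="$1"
    shift
    # ZT_CLI is left unquoted on purpose so it may hold several words.
    $ZT_CLI join "$network_id" || return 1
    for setting in "$@"; do
        $ZT_CLI set "$network_id" "$setting" || return 1
    done
}
```

A call might look like `rejoin_with_options 9999888877776666 allowDNS=1 allowGlobal=0`. The option values are the caller's responsibility; the helper does not try to discover what was lost.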

I will post the results of testing this PR separately.


Additional changes as at 2025-04-06

healthcheck.sh:

  1. Full rewrite, including copious comments explaining theory of
    operation. In essence, if a «networkID» is mentioned in
    (internal path):

    /var/lib/zerotier-one/networks.d
    

    then it should be matched by a route in the host's routing table:

    • Zero «networkID» = zero routes
    • One «networkID» = one route
    • Two «networkID» = two routes
    • ...

    Any mismatch causes the container to go unhealthy. From the
    perspective of the health-checking script, the question of which
    network is immaterial so there is no need to employ techniques such
    as iterating to discover which network is causing a problem. This
    is because the script has no way of communicating anything other
    than an exit status. Any echo statements it issues will not make
    it into the container's log.

    The simple presence/absence of a «networkID» in networks.d is
    taken to indicate which networks the user intends the container
    to join.

    There is no reliance on environment variables to propagate any
    health-checking information into the container. No other variables
    are introduced so the argument about naming conventions goes away.
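
The counting rule described above can be sketched as follows. This is not the healthcheck.sh from this PR, only an illustration of the idea. It assumes each joined network is represented in networks.d by a file named with 16 hex digits plus `.conf`, and it reuses the route pattern shown in the tests (`dev zt.* scope link`). The function names are hypothetical.

```shell
#!/bin/sh
# Sketch of the counting rule only; not the script from this PR.
# Assumption: each joined network has a file named <16 hex digits>.conf
# in networks.d (any *.local.conf companions are not counted).

# Count the networks the user intends the container to join.
count_networks() {
    dir="$1"
    [ -d "$dir" ] || { echo 0; return 0; }
    ls "$dir" | grep -c -E '^[0-9a-fA-F]{16}\.conf$' || true
}

# Count ZeroTier routes in routing-table text read from stdin.
count_routes() {
    grep -c 'dev zt.* scope link' || true
}

# Usage: is_healthy <networks.d path> "<routing table text>"
# Succeeds only when the two counts agree.
is_healthy() {
    expected="$(count_networks "$1")"
    actual="$(printf '%s\n' "$2" | count_routes)"
    [ "$expected" -eq "$actual" ]
}
```

A health check built on this sketch could end with `is_healthy /var/lib/zerotier-one/networks.d "$(ip r)" || exit 1`, so that the exit status is the only thing communicated.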

Dockerfile:

  1. Removes --start-interval flag from HEALTHCHECK command. The flag
    was preventing buildah builds from succeeding.

README.md:

  1. Removes references to the following environment variables:

    • ZEROTIER_ONE_CHK_SPECIFIC_NETWORKS
    • ZEROTIER_ONE_CHK_MIN_ROUTES_FOR_HEALTH
  2. Rewrites explanation of health-checking.

README-router.md:

  1. Removes references to the following environment variables:

    • ZEROTIER_ONE_CHK_SPECIFIC_NETWORKS
    • ZEROTIER_ONE_CHK_MIN_ROUTES_FOR_HEALTH

@gb-123-git

@Paraphraser @zyclonite

My humble question for a small use-case scenario:
What if the user wants to check ALL networks he has joined (which can change dynamically)?
How do we check that in this proposal?

@zyclonite
Owner

first i was on vacation and last week a bit drowning in work - sorry for the delay

regarding buildah version - i fear we have to live with the one provided with the ubuntu 24 gh actions runner (they might upgrade it from time to time)
i did attempt to upgrade it individually in the past but they have quite some security hurdles around that, so it's not easy to achieve (this was my old project but it does not work anymore with the latest runners https://github.com/zyclonite/setup-podman)

about the sponsored by comment - i fully support individual contributions, so you are free to add your real name but i would like to not go down the route with having companies sponsoring code as this might be tricky from a licensing perspective if not in sync with the individual contributor and so on...

@gb-123-git

@zyclonite
After due discussions with PMGA Tech LLP, I have changed the License text to match most major MIT license texts:
E.g.
TailWindCSS
https://github.com/tailwindlabs/tailwindcss/blob/main/LICENSE
VS-Code (By Microsoft)
https://github.com/microsoft/vscode/blob/main/LICENSE.txt
This should alleviate your fears as the text is matching most major MIT projects.

However, as per my agreement with PMGA Tech LLP, the code cannot be used without attaching the proper licensing text. It is now not possible for me to go back on it.

Further, I still do not understand the fuss behind all this as many major MIT projects owned by corporates also include license files (Examples already given above); and also since ZeroTier itself is not MIT but BSL, you can check here: https://github.com/zerotier/ZeroTierOne?tab=License-1-ov-file

As I do not want to take this argument further, it's your call on whether to merge the code or cancel it.

Please note one more thing: in case you intend to merge this proposal, the proper licensing text needs to be copied from PR #37.
I'm afraid if the text is not attached 'as-is', it would be a deliberate copyright infringement, since the same has knowingly been removed and/or not attached.

I would suggest you merge PR #37 since this is just a copy of that PR with minor changes as already admitted by the OP in the first post. (attaching screenshot for future reference in case required)

Derivative Proof

@Paraphraser
Contributor Author

Paraphraser commented Apr 6, 2025

Testing (with 2025-04-06 changes)

Reference service definition

  zerotier:
    container_name: zerotier
    image: "zyclonite/zerotier:local"
    restart: unless-stopped
    network_mode: host
    environment:
      - TZ=${TZ:-Etc/UTC}
    # - ZEROTIER_ONE_NETWORK_IDS=${ZEROTIER_ONE_NETWORK_IDS}
    volumes:
      - ./volumes/zerotier-one:/var/lib/zerotier-one
    devices:
      - "/dev/net/tun:/dev/net/tun"
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN

Note:

  • ZEROTIER_ONE_NETWORK_IDS commented-out to disable auto-join in clean-slate situation.
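
For readers unfamiliar with the clean-slate auto-join that this variable controls, here is a rough sketch of the idea. It is not the entrypoint code from this PR. It assumes the IDs are space-separated and that an empty `<networkID>.conf` file in networks.d is enough for the daemon to join on start. `seed_networks` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only; not the entrypoint code from this PR.
# Assumptions: network IDs are space-separated, and an empty
# <networkID>.conf file in networks.d makes the daemon join at start.

# Usage: seed_networks <networks.d path> "<id> <id> ..."
seed_networks() {
    networks_dir="$1"
    network_ids="$2"
    # An existing directory means this is not a first launch.
    if [ -d "$networks_dir" ]; then
        return 0
    fi
    # Nothing listed: leave the persistent store untouched.
    if [ -z "$network_ids" ]; then
        return 0
    fi
    mkdir -p "$networks_dir" || return 1
    for id in $network_ids; do
        touch "$networks_dir/$id.conf" || return 1
    done
}
```

With the variable commented out, as in the reference service definition, the second argument is empty and nothing is created, which is the starting point for Test 1.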

Test 1 - clean slate

  1. Show container not running and no persistent store:

    $ docker ps
    CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
    
    $ ls -ld ~/IOTstack/volumes/zerotier-one
    ls: cannot access '/home/moi/IOTstack/volumes/zerotier-one': No such file or directory
  2. Start container

    $ docker compose up -d zerotier

    Show persistent store created:

    $ ls -ld ~/IOTstack/volumes/zerotier-one
    drwxr-xr-x 4 999 994 4096 Apr  6 09:46 /home/moi/IOTstack/volumes/zerotier-one

    Show networks directory does not exist:

    $ ls -ld ~/IOTstack/volumes/zerotier-one/networks.d
    ls: cannot access '/home/moi/IOTstack/volumes/zerotier-one/networks.d': No such file or directory

    Show container healthy:

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED         STATUS                   PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   3 minutes ago   Up 3 minutes (healthy)             zerotier
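
Rather than re-running `docker ps` by hand, the health state can be polled. The following is a minimal sketch, not something shipped in this PR. `DOCKER` and `wait_healthy` are hypothetical names, and the retry count and pause are arbitrary defaults.

```shell
#!/bin/sh
# Sketch only. DOCKER and wait_healthy are hypothetical names.
DOCKER="${DOCKER:-docker}"

# Usage: wait_healthy <container> [tries] [pause-seconds]
# Returns 0 as soon as the container reports "healthy",
# or 1 if it never does within the allowed number of tries.
wait_healthy() {
    container="$1"
    tries="${2:-30}"
    pause="${3:-2}"
    while [ "$tries" -gt 0 ]; do
        status="$($DOCKER inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)"
        if [ "$status" = "healthy" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$pause"
    done
    return 1
}
```

For example, `wait_healthy zerotier 60 5` would poll for up to five minutes, which may be useful while waiting for a client to be authorised in ZeroTier Central.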
    

Test 2 - Join first network

  1. Join network:

    $ docker exec zerotier zerotier-cli join 9999888877776666
    200 join OK

    Show container goes unhealthy:

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED         STATUS                     PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   6 minutes ago   Up 6 minutes (unhealthy)             zerotier
    

    Explore reason:

    $ docker exec zerotier zerotier-cli listnetworks
    200 listnetworks <nwid> <name> <mac> <status> <type> <dev> <ZT assigned ips>
    200 listnetworks 9999888877776666  22:ef:5f:10:91:a9 ACCESS_DENIED PRIVATE ztr2qsmswx -
    
    $ docker exec zerotier zerotier-cli get 9999888877776666 status
    ACCESS_DENIED
    

    Authorise client in ZeroTier Central. Then show container goes healthy.

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED          STATUS                    PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   10 minutes ago   Up 10 minutes (healthy)             zerotier
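
The "explore reason" step can be scripted. The sketch below filters `listnetworks` output for networks whose status is not OK. `non_ok_networks` is a hypothetical helper, and it assumes the last column (assigned IPs) never contains whitespace.

```shell
#!/bin/sh
# Sketch only; non_ok_networks is a hypothetical helper.
# Reads `zerotier-cli listnetworks` output on stdin and prints
# "<nwid> <status>" for every network whose status is not OK.
# Fields are counted from the end of the line because the <name>
# column can be empty, as in the ACCESS_DENIED example above.
# Assumption: the assigned-IPs column contains no whitespace.
non_ok_networks() {
    awk 'NF >= 8 && $1 == "200" && $2 == "listnetworks" && $3 != "<nwid>" {
        status = $(NF - 3)
        if (status != "OK") print $3, status
    }'
}
```

Usage would be along the lines of `docker exec zerotier zerotier-cli listnetworks | non_ok_networks`.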

Test 3 - interrupt first network

  1. List networks:

    $ ip r | grep "dev zt.* scope link"
    10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233 
  2. Destroy network:

    $ sudo nmcli conn down ztr2qsmswx
    Connection 'ztr2qsmswx' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/144)

    Show route removed:

    $ ip r | grep "dev zt.* scope link"
    $

    Show container goes unhealthy:

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED          STATUS                     PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   12 minutes ago   Up 12 minutes (unhealthy)             zerotier
    

    Show agent unaware of problem:

    $ docker exec zerotier zerotier-cli get 9999888877776666 status
    OK
    
  3. Restart container:

    $ docker compose restart zerotier
    [+] Restarting 1/1
     ✔ Container zerotier  Started                                                                                                                             1.5s 

    Show route restored:

    $ ip r | grep "dev zt.* scope link"
    10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233 

    Show container healthy:

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED          STATUS                   PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   13 minutes ago   Up 9 seconds (healthy)             zerotier

Test 4 - join second network

  1. Join network:

    $ docker exec zerotier zerotier-cli join 9999888877775555
    200 join OK

    Show container goes unhealthy:

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED          STATUS                     PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   17 minutes ago   Up 4 minutes (unhealthy)             zerotier
    

    Explore reason:

    $ docker exec zerotier zerotier-cli get 9999888877775555 status
    ACCESS_DENIED
    

    Authorise client in ZeroTier Central. Then show container goes healthy.

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED          STATUS                   PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   18 minutes ago   Up 5 minutes (healthy)             zerotier

Test 5 - interrupt one network

  1. List networks:

    $ ip r | grep "dev zt.* scope link"
    10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233 
    10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233 
    
  2. Destroy one of the networks (would not matter which one):

    $ sudo nmcli conn down ztr2qsmswx
    Connection 'ztr2qsmswx' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/145)

    Show route removed:

    $ ip r | grep "dev zt.* scope link"
    10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233 
    $

    Show container goes unhealthy:

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED          STATUS                      PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   25 minutes ago   Up 12 minutes (unhealthy)             zerotier
    

    Show agent unaware of problem:

    $ docker exec zerotier zerotier-cli listnetworks
    200 listnetworks <nwid> <name> <mac> <status> <type> <dev> <ZT assigned ips>
    200 listnetworks 9999888877776666 My_ZeroTier 22:ef:5f:10:91:a9 OK PRIVATE ztr2qsmswx 10.244.235.233/16
    200 listnetworks 9999888877775555 Test 5e:d2:83:cc:ff:c4 OK PRIVATE ztc3qzoglu 10.242.235.233/16
    
  3. Restart container:

    $ docker compose restart zerotier
    [+] Restarting 1/1
     ✔ Container zerotier  Started                                                                                                                             1.5s 

    Show route restored:

    $ ip r | grep "dev zt.* scope link"
    10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233 
    10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233 

    Show container healthy:

    $ docker ps
    CONTAINER ID   IMAGE                      COMMAND              CREATED          STATUS                    PORTS     NAMES
    0e5b59715ff3   zyclonite/zerotier:local   "entrypoint.sh -U"   27 minutes ago   Up 23 seconds (healthy)             zerotier
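
The health check deliberately does not say which network lost its route, but a user investigating an unhealthy container may want to know. The following sketch cross-checks the two outputs shown in this test. `networks_without_route` is a hypothetical helper, and it assumes the assigned-IPs column of `listnetworks` contains no whitespace.

```shell
#!/bin/sh
# Sketch only; networks_without_route is a hypothetical helper.
# $1: output of `zerotier-cli listnetworks`
# $2: output of `ip r`
# Prints "<nwid> <dev>" for each network the agent reports as OK
# whose interface has no "scope link" route.
networks_without_route() {
    printf '%s\n' "$1" |
    awk 'NF >= 8 && $1 == "200" && $3 != "<nwid>" && $(NF - 3) == "OK" {
        print $3, $(NF - 1)
    }' |
    while read -r nwid dev; do
        if ! printf '%s\n' "$2" | grep -q "dev $dev .*scope link"; then
            echo "$nwid $dev"
        fi
    done
}
```

One way to call it: `networks_without_route "$(docker exec zerotier zerotier-cli listnetworks)" "$(ip r)"`. With the outputs from step 2 of this test it would name network 9999888877776666 on ztr2qsmswx.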

Test 6 - down, up the container

$ docker compose down zerotier
[+] Running 1/1
 ✔ Container zerotier  Removed                                                                                                                             2.4s 

$ ip r | grep "dev zt.* scope link"

$ docker compose up -d zerotier
[+] Running 1/1
 ✔ Container zerotier  Started                                                                                                                             0.2s 

$ ip r | grep "dev zt.* scope link"
10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233 
10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233 

$ docker exec zerotier zerotier-cli listnetworks
200 listnetworks <nwid> <name> <mac> <status> <type> <dev> <ZT assigned ips>
200 listnetworks 9999888877776666 My_ZeroTier 22:ef:5f:10:91:a9 OK PRIVATE ztr2qsmswx 10.244.235.233/16
200 listnetworks 9999888877775555 Test 5e:d2:83:cc:ff:c4 OK PRIVATE ztc3qzoglu 10.242.235.233/16

$ docker ps
CONTAINER ID   IMAGE                      COMMAND              CREATED         STATUS                   PORTS     NAMES
c2111263a821   zyclonite/zerotier:local   "entrypoint.sh -U"   8 seconds ago   Up 8 seconds (healthy)             zerotier
