Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packaged executable integration tests - initial integration tests with docker #391

Merged
merged 30 commits into from
Jul 27, 2022

Conversation

tegefaulkes
Copy link
Contributor

@tegefaulkes tegefaulkes commented Jun 28, 2022

Description

This PR modifies the tests/bin tests to target a packaged executable for the tests. This will be done by setting the pk_pkg_executable_path as the path to the desired executable we wish to test. This should then apply all bin tests to this executable.

The details are inside of issue MatrixAI/Polykey-CLI#10.

Issues Fixed

Tasks

Given the scope of this PR, sub-tasks will be tracked with each issue.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@tegefaulkes tegefaulkes self-assigned this Jun 28, 2022
@CMCDragonkai
Copy link
Member

Can we keep the env variable simple? Just PK_PATH could be used. Then pkPath is the variable.

I'm also wondering if there's any self executing logic where in our code and if there is, we may want to out out bundled executable into the PATH itself so that it can be done. But any self executing logic should be done more robustly by not relying on the PATH and instead finding the path to it's own script.

Note that in comparison to describeIf and testIf, this env variable is acquired and centralised to the globals in jest.config.js. Whereas environmental conditions are done test-module-specific where describeIf and testIf are used.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jun 28, 2022

One should note that using PK_PATH in our .env.example to ensure that such testing can also be done locally. It must by default be unset.

Only if set, is it used.

And you want to have a check that if set, and is empty string or an invalid path or path to non-executable, throw an exception. This could be done in jest.config.js.

@tegefaulkes
Copy link
Contributor Author

I've modified pkSpawn to use the packaged executable if the pkgPath is set. I've been testing this on tests/bin/agent/start.test.ts/ starting in foreground. It mostly works however it seems that it is not getting terminated properly. The process is getting killed properly. However the SIGTERM signal is not getting caught and as such polykey is not stopping gracefully. It just immediately terminates. I'll have to check if windows has any quirks causing this.

@tegefaulkes
Copy link
Contributor Author

Tested it under Ubuntu and it works fine. So windows seems to be the problem here.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Jun 28, 2022

According to https://nodejs.org/docs/latest-v16.x/api/process.html#processkillpid-signal.

"On Windows, where POSIX signals do not exist, the signal argument will be ignored, and the process will be killed forcefully and abruptly (similar to 'SIGKILL'). See Signal Events for more details.".

This might just be a child_process problem. I'll check if there is an alternative.

Using process.kill(agentProcess.pid!, 'SIGTERM'); has the same problem.

@CMCDragonkai
Copy link
Member

As far I know windows does have signals. But in this case, you're right that windows does not support SIGTERM.

This becomes complicated, because we are using SIGTERM to perform graceful exit of the agent. There are equivalent windows events in https://stackoverflow.com/questions/2055753/how-to-gracefully-terminate-a-process#2055810.

So we might have to find a way for nodejs to receive the equivalent of a graceful exit signal, and send that signal instead in our tests if we want it to also work on windows.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Jun 28, 2022

According to this we may need to add some new handlers. We will then need to confirm the behaviour on windows by closing the window, ending via task manager and using kill PID.

  • SIGHUP is generated on Windows when the console window is closed, and on other platforms under various similar conditions. See signal(7). It can have a listener installed, however Node.js will be unconditionally terminated by Windows about 10 seconds later. On non-Windows platforms, the default behavior of SIGHUP is to terminate Node.js, but once a listener has been installed its default behavior will be removed.
  • SIGBREAK is delivered on Windows when Ctrl+Break is pressed. On non-Windows platforms, it can be listened on, but there is no way to send or generate it.

I can't find any info on what exactly happens when you do kill PID via CMD or powershell. Hopefully we receive a windows flavoured signal in that case. Worst case we receive no signal and the process exits. In any case I think the tests need to be modified to supply the correct signal depending on platform. Using process.kill or agentProcess.kill with SIGTERM or no signal results in the process abruptly exiting.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jun 28, 2022

Yea that makes sense. Basically to deal with Windows, we have to also listen for SIGINT, SIGHUP and SIGBREAK.

So in our jest tests, when we want to signal graceful termination, we send it SIGINT, that is likely going to be portable. I just coded it to SIGTERM since that was considered the standard on unix systems.

I think we could just make our graceful termination trigger upon all of these signals for the polykey agent:

  1. SIGINT
  2. SIGHUP
  3. SIGBREAK
  4. SIGTERM

Then change all of our tests that are using SIGTERM to instead use SIGINT which should be portable across all operating systems.

We should also have tests that test each signal for a given system. So on windows, we expect SIGINT, SIGHUP, SIGBREAK all should gracefully shutdown the program.

@CMCDragonkai
Copy link
Member

Sometimes SIGINT is sometimes useful for other things, but I think for daemons generally SIGINT means program shutdown too. Interactive shells use SIGINT to interrupt what it is currently doing, but not kill the shell. For all other foreground program/daemon, this should should just shutdown the program.

@tegefaulkes
Copy link
Contributor Author

Just some extra notes on what I've tried when testing the signal problem with windows.

  • so far all of the testing was done via the start in foreground test under tests/bin/agent/start.test.ts. we should separate the test into it's own file to reduce possible side effects with the test.
  • SIGINT should work universally and we already have a handler for it. However under windows we still have the same problem where the process ends abruptly without the signal getting handled.
  • SIGHUP and SIGBREAK causes the agentProcess.kill and process.kill to throw an error even though they should be valid signals. So we may need an alternative method to trigger these signals via the tests.
  • I have not tested what signals happen under windows if you do the following to the running agent under windows;
    • close the console window running the agent. should be a SIGHUP?
    • kill the process via task manager. should be a SIGBREAK?
    • use ctrl + c, ctrl + d or other keyboard break signals in the console. should be a SIGBREAK?
    • kill using the kill <PID> command. Not sure about this one. I just hope we can catch and handle it.
  • We need a to work out a way to trigger these signals during tests. using process.kill or child_process's agentProcess.kill doesn't work the way we expect under windows.

@emmacasolin

@tegefaulkes
Copy link
Contributor Author

I have a working bash script that allows us to run the docker image in a similar way to running the normal polykey command. This runs fine in the terminal. However I can't for the life of me get an output on the stdout and stderr when running the script via child_process.

@tegefaulkes
Copy link
Contributor Author

Problem was PBCAK... I wasn't allowing the process any time to output anything before the test ended. it's outputting the streams with no problem.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Jun 29, 2022

The script works now. It should mostly be a drop in replacement for running the bin commands. One major difference is that the node path will have to be specified with the PK_NODE_PATH env variable. I'm going to look at applying it to the start tests now.

@tegefaulkes
Copy link
Contributor Author

Testing this with the start in foreground test has revealed some new quirks.

  1. the PID inside the contain is always 1. more importantly it will not match the PID of the process we are spawning. I think we'll have to remove checks for this in tests or override it depending on what we're testing.
  2. We can bind mount a folder and work out of that. but anything the container creates has weird permissions. Essentially the host doesn't have permission to read files created by the docker image. In the case of start in foreground test this means we can't read the status.json file as part of the test. Annoyingly there doesn't seem to be a quick and easy fix for this. https://techflare.blog/permission-problems-in-bind-mount-in-docker-volume/ outlines the problem.
  3. Sending a signal to the child process doesn't actually propagate to the docker image. so we will have to kill it some other way.

@CMCDragonkai
Copy link
Member

The container has the root PID... I think you could resolve this by also mapping the host pids in the docker container, just like host network and host IPC?

@CMCDragonkai
Copy link
Member

For 2., this is because the container is the root user. It's possible to have a container run as a separate user, maybe inherit the user of the host system? That may be a compile time thing or a runtime thing, you can have a check.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jun 29, 2022

For 3., see this https://hynek.me/articles/docker-signals/, we would need to make that this is correct, as our orchestration system will rely on signals. (AWS ECS).

Also the linked article: https://engineeringblog.yelp.com/2016/01/dumb-init-an-init-for-docker.html

@tegefaulkes
Copy link
Contributor Author

  1. and 3. is solved by using the --pid host flag to share the pid namespace with the host. This avoids the quirks with having a PID of 1 so any signals are properly passed along.
  2. Is solved by using --userns host --user <UID> to allow the permissions to match up with the host.

With this and cutting out some un-needed expectations,the start in foreground test is passing.

@tegefaulkes
Copy link
Contributor Author

Things I need to look into changing.

  1. change the docker image to use entrypoint using the path to the file we run instead of using cmd.
  2. Make sure that the entrypoint is using the [ "command" ] syntax.

@CMCDragonkai
Copy link
Member

And using pstree to confirm that /bin/sh is not an intermediate process between docker and polykey.

@CMCDragonkai
Copy link
Member

Tomorrow separate out what can be merged into staging so we can do the testnet deployment procedures. Focus on agent start and bootstrap related commands, any fixes in source, or only fixes to tests right now?

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Jun 30, 2022

All of the agent start tests are functional now. The tests rely on some env variables to be set.

  • PK_TEST_CMD=scripts/docker-run.sh - path to the command you want to execute, this case the docker testing script.
  • PK_TEST_PLATFORM=DOCKER - This is used to filter what tests to run.
  • PK_DOCKER_IMAGE=polykey-1.0.0:lyi2g9qpaq5i05c01hh4gcidm0jv2vh3 - Should be set to the image name after loading the docker image with docker load --input <path to image>

Note that I've created a modified testIf and describeIf called runTestIf and runDescribeIf that can be used like runTestIf(condition)('name', ()=>{}, timeout)) so we can do things like runTestIf(condition).only('name', ()=>{}, timeout)).

Still todo:

  • modify the CI/CD integration:docker test job to use the new tests.
  • clean up commits and cherry pick to staging.
  • convert bootstrap.test.ts to test docker image.
  • update entrypoint of the docker image

@CMCDragonkai
Copy link
Member

Note that I've created a modified testIf and describeIf called runTestIf and runDescribeIf that can be used like runTestIf(condition)('name', ()=>{}, timeout)) so we can do things like runTestIf(condition).only('name', ()=>{}, timeout)).

That seems like an improvement over testIf and describeIf @emmacasolin. @tegefaulkes why not just replace testIf and describeIf usage?

  • pk_pkg_path=scripts/docker-run.sh - path to the command you want to execute, this case the docker testing script.

This should be PK_COMMAND right? We're using the PK_ prefix for all of env variables as can be seen in bin/utils.options.ts, but because this is a development time env variable, you must put it into .env.example, make sure to use #PK_COMMAND=. I believe that sets the env variable to empty string. By default users are expected to copy .env.example to .env which is autoloaded by nix-shell. So it should be commented out to indicate you only use this to run tests against a different polykey executable. And later this can expand to an actual command string, like docker run... Note it would never be a shell string, but instead a command string just means it would be split by newlines across the child process exec.

  • PK_TEST_PLATFORM=DOCKER - This is used to filter what tests to run.
  • PK_DOCKER_IMAGE="polykey-1.0.0:lyi2g9qpaq5i05c01hh4gcidm0jv2vh3" - Should be set to the image name after loading the docker image with docker load --input <path to image>

Are these needed? I was expecting the CI/CD to automatically set the docker image path inside the docker run script that is used by PK_COMMAND above?

At the same time, the PK_TEST_PLATFORM seems strange. Our existing usage of describeIf and testIf are based on platform differences that are acquired from in-code, instead of referring to environment variables. The main idea was that npm test should just always work, and figure out what to test based on the platform.

In this case, testing a separate pk executable, should just trigger automatically based on the existence of PK_COMMAND.

However I imagine that now our integration tests would be changed to run npm test -- ./tests/bin to focus only on the bin tests.

Kind of like:

echo 'exec docker run ...' > ./tmp/docker-run.sh
chmod u+x ./tmp/docker-run.sh
PK_COMMAND='./tmp/docker-run.sh' npm test -- ./tests/bin

# In the ideal case, this should be possible
PK_COMMAND='docker run ...' npm test -- ./tests/bin

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jun 30, 2022

I am not entirely convinced that some of the bin tests are not able to run via the integration? I think all the bin tests should be runnable for integration. Using docker run with all the right namespace options should be no different from running polykey directly on the host operating system.

Later we can have "docker" specific tests, but that's separate from tests/bin which are meant to be generic tests on the executable of polykey.

Another issue is tests/nat which we can address later, since they rely on alot of OS facilities, we don't know if they will run in a docker in side docker. We only know that they run inside docker (but not in dind) - @emmacasolin

@CMCDragonkai
Copy link
Member

Regarding:

update entrypoint of the docker image

Before you do this, please verify with pstree whether there is in fact a /bin/sh intermediate process.

If there isn't, then we don't need to modify this, we can continue using Cmd. On the other hand, because polykey container is a single-command container, it may be worth while to replace with Entrypoint. But verify it first, and even after using Entrypoint, you want to check with pstree to see the process tree.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Jun 30, 2022

Right now the two tests i'm skipping are

  • 'start in background' - the docker container ends when the main process exits. since this test starts a detached process and exits the main one we end up the the container closing and killing the new process. Since the background process is not really a use case for docker I don't think it's worth converting this test.
  • 'start with global agent - theses tests are using pkStdio, a global agent and mocking. it's not worth converting pkStdio, global agent and we can't use mocking if we do. So I don't think it's worth the time to convert these tests at this stage.

When using pstree -pa I only saw the container -> polykey and some processes under that. I'm not super certain what i'm looking for but I don't think it's running under a /bin/sh shell there.

@CMCDragonkai
Copy link
Member

When using pstree -pa I only saw the container -> polykey and some processes under that. I'm not super certain what i'm looking for but I don't think it's running under a /bin/sh shell there.

Can you copy paste the output of pstree to here on the relevant container running?

@tegefaulkes
Copy link
Contributor Author

Are these needed? I was expecting the CI/CD to automatically set the docker image path inside the docker run script that is used by PK_COMMAND above?

I think it's much simpler to just set an env variable for the docker image here than trying to meta-generate the script to contain it. Using PK_TEST_PLATFORM to filter the tests seemed the simplest since we can just set it when running the job. I wasn't sure of an in code way of checking if we're running against a docker image unless we want to check that PK_TEST_CMD is explicitly the path to the docker-run.sh script.

@tegefaulkes
Copy link
Contributor Author

pstree output.

  |-containerd-shim,485519 -namespace moby -id60ea62c4654cd
  |   |-polykey-agent,485540
  |   |   |-{polykey-agent},485553
  |   |   |-{polykey-agent},485554
  |   |   |-{polykey-agent},485555
  |   |   |-{polykey-agent},485556
  |   |   |-{polykey-agent},485557
  |   |   |-{polykey-agent},485558
  |   |   |-{polykey-agent},485559
  |   |   |-{polykey-agent},485560
  |   |   |-{polykey-agent},485561
  |   |   |-{polykey-agent},485562
  |   |   |-{polykey-agent},485563
  |   |   |-{polykey-agent},485564
  |   |   |-{polykey-agent},485565
  |   |   |-{polykey-agent},485566
  |   |   |-{polykey-agent},485567
  |   |   |-{polykey-agent},485568
  |   |   |-{polykey-agent},485569
  |   |   `-{polykey-agent},485570
  |   |-{containerd-shim},485521
  |   |-{containerd-shim},485522
  |   |-{containerd-shim},485523
  |   |-{containerd-shim},485524
  |   |-{containerd-shim},485525
  |   |-{containerd-shim},485526
  |   |-{containerd-shim},485527
  |   |-{containerd-shim},485528
  |   `-{containerd-shim},485546

@tegefaulkes tegefaulkes force-pushed the feature-pkg_integration_tests branch from d8078d3 to c0cd81d Compare July 27, 2022 02:26
@tegefaulkes
Copy link
Contributor Author

Commits have been cleaned up.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jul 27, 2022

Regarding the usage of shell: true. There are some differences now in the command executions.

There are 3 functions for executing things:

  • child_process.exec - defaults to shell: true
  • child_process.execFile - defaults to shell: false
  • child_process.spawn - defaults to shell: false

The exec is never used. Because it's the most basic one, but we do not have this used anywhere in our tests.

Now the problem is that shell: true affects the interpretation of spaced arguments. By default shell: false, means that ['a b'] is one single atomic argument value. So child_process.spawnSync('touch', ['a b'], { shell: true }); means touch a b and not touch 'a b'. This means you end up with 2 files being created rather than 1 file a b being created.

Furthermore, when using child_process.execFile, this actually interprets the arguments incorrectly. It expects the first argument to be a file path. For example: child_process.execFileSync('touch', ['a b'], { shell: true });, this fails as it tries to execute /bin/sh -c touch a b, and it doesn't properly quote the command as /bin/sh -c 'touch a b'.

Therefore, if we are switching to using shell: true, we cannot actually use execFile or execFileSync. They won't be portable.

Instead any usage of shell: true must be using spawn.

We are using execFile in a few places, it's possible to replace these usages with spawn instead, since spawn can do everything that execFile does, and we end up only using spawn. This means pkExec should be swapped to using childProcess.spawn in order to maintain consistency so that when the PK_TEST_COMMAND is being used, it will continue to using spawn.

However as for the 'a b' interpretation, when the shell: true, (and this occurs when PK_TEST_COMMAND is being used), then any spaced arguments must be either quoted around the argument value OR it has be escaped.

The simplest solution is to escape every space character when shell: true, so you maintain consistent behaviour between shell: false and shell: true. That way ['a b'] can contineu be used as a normal argument and test writers don't have to change their behaviour. Inside your testing utilities targetting PK_TEST_COMMAND, you have to map every argument to an escaped version of the argument using a regex replace with \ .

Something like:

arg.replace(/ /g, '\\ ')

@CMCDragonkai
Copy link
Member

Note that there could still be issues in the future, it's best to make sure that our test arguments doesn't use too many fancy stuff in the future to avoid problems.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jul 27, 2022

This might be a better escaping function:

arg.replace(/(["\s'$`\\])/g,'\\$1');

It also escapes a few other special variables that may be interpreted specially when shell: true.

I just tested it with:

touch \"\ \$\`\\\'

It works quite nicely to make sure it's all interpreted as a raw string.

@tegefaulkes
Copy link
Contributor Author

I'm consolidating all of the child_process spawning commands inside of tests/utils/exec. Nat test utils contain 2 such commands pkExecNs and pkSpawnNs. I can move them over but they depend on a function that's used extensively in nat/utils.ts called nsenter().

I don't want to split this function between the two files and I'm not sure if it's a good idea to have the two files depend on each other. I could;

  1. Move nsenter() to the exec.ts and have nat/utils import it. this would be the easiest but nsenter would be out of place.
  2. nsenter() just creates some extra args, we could just pass them into pkXNs. and make a wrapper pkXNs inside nat/utils.ts to recreate the original signature.

Not sure what to do here, I'm leaning towards 2.

@CMCDragonkai

@CMCDragonkai
Copy link
Member

We're moving nsenter into tests/utils/exec.ts.

Future enhancement requires the tests/nat to also use the PK_TEST_PLATFORM and PK_TEST_COMMAND when both are set. That would be the case where nat tests can run against a linux binary, and this would most likely occur on integration:linux job, which runs with an ubuntu container.

This should allow for faster testing since we skip the CPU intensive keypair generation.

Related #420
Also enabled `FF_NETWORK_PER_BUILD` flag for CI.

Related #407
This will remove the container when it ends.

#407
Disabling tests that don't work with docker for now.

#407
`DOCKER_OPTIONS` env variable containing all the docker run parameters to make the test work is provided to the command.

#410
…time

The `runTestIfPlatform` commands will default to running if `PK_TEST_PLATFORM` is not set.

#410
This is to key every major spawning command in one place in case we need to modify how we spawn using `ts-node`.
@tegefaulkes tegefaulkes force-pushed the feature-pkg_integration_tests branch from 882516f to 3affbca Compare July 27, 2022 09:17
Is is due to a compatability problems when using `shell: true` with `child_process.execFile`.
@tegefaulkes
Copy link
Contributor Author

Ok, this should be good to merge now. all the tests should be passing. I haven't enabled the integration tests to check them for a little bit but they are passing locally so they should be passing there as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment