
@unkcpz commented Nov 3, 2021

  • Used AEP template from AEP 0
  • Status is submitted
  • Added type & status labels to PR
  • Added AEP to README.md
  • Provided github handles for authors

@giovannipizzi

Thanks @unkcpz for the AEP!

Here are a few comments from me:

  • I agree that one possible way is to make Code a general base class, and have a ContainerizedCode as a subclass (code.containerized entry point?). However, if we go this way, I would then in the process also split code.remote and code.local in the same way, for consistency.
  • I see the point of not making the code bound to a given computer. However, we need to see whether this makes things more complex for the user, who then also needs to specify the computer at submission time. A few comments and suggestions on this:
    • local codes are also not bound to a computer; when thinking about this issue, maybe we can address it in the very same way for both
    • I wouldn't use a syntax with two @ signs. Rather, we can have a load_container_code(code_string, computer) that loads the code and in some way "binds" it to the computer, so that if you then get a builder, it will also properly set the computer on the underlying CalcJobNode. This would also perform some validation, e.g. whether the given computer supports the given container technology (see below, and the sketch after this list).
  • We indeed need to "configure", for a given computer, which container technologies it supports. I wouldn't add one more step to verdi computer configure, but rather also ask which container technology (if any) is supported. As a note: currently the computer setup sets the metadata of the computer (which is the very same for all users, and in principle immutable), while the computer configure sets the AuthInfo and other user-specific mutable metadata (e.g. the time between opening new SSH connections). We should probably think about what should go where (I don't know if this is something @chrisjsewell had in mind when mentioning the redesign of transport etc.).
  • One could have a computer support multiple container technologies, but then it becomes complex to provide (e.g. from the CLI) a list of them, with the corresponding configuration parameters for each (name of the executable to call, like sarus; list of additional parameters; ...). Also, practically, a given computer will probably support at most one containerisation technology. So I would suggest that a given computer can have zero or at most one container technology supported, and that which one goes in verdi computer setup (default: none, with a list of plugins, similar to transport and scheduler). So you get a prompt, Container technology (leave empty if none), and only a set of plugin strings is accepted (sarus, docker, singularity, ...). If you need more, you configure more than one computer (if there is a simple way to configure more, I'm OK with doing this, but I feel it becomes very complex for little benefit).
    • Then, in the configure step, one also gets asked for the information of the container technology, if one was specified in the setup step (if too complex to do in the same step, it's OK to have a new call to verdi computer configure sarus, but let's see if we can avoid it; this might in any case require a bit of redesign of the setup/configure part).
      • The idea here is that the computer will have one specific technology, which is immutable (i.e., it is the "sarus-enabled version of that computer" even when you share it). However, the exact command line you specify is user-specific (e.g. what the sarus executable name is, how the command-line options --mount=... should look, ...).
    • A containerised Code will also contain, in its attributes (set with verdi code setup, being asked whether the code is a container, remote or local code), which container technology it is for; in the load_container_code(code_string, computer) I suggest above, it is validated that the code_string refers to a containerised code and that the computer supports that technology.
  • Note: one might also need to adapt a bit the logic that generates the submission-line script.
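
For concreteness, here is a minimal sketch of the load_container_code helper suggested above. It is only an illustration: the helper itself, the container_technology attribute on the code, and the container_technology property on the computer are all assumptions, not existing AiiDA API.

```python
from aiida import orm

def load_container_code(code_string, computer):
    """Hypothetical helper: load a containerized code and validate that the
    given computer supports the code's container technology."""
    code = orm.load_code(code_string)

    # Assumed storage location: the technology is an attribute of the code.
    technology = code.attributes.get('container_technology')
    if technology is None:
        raise ValueError(f'`{code_string}` is not a containerized code')

    # Assumed storage location: the supported technology is a computer property.
    supported = computer.get_property('container_technology', None)
    if technology != supported:
        raise ValueError(f'computer `{computer.label}` does not support `{technology}`')

    # The returned code would be "bound" so that a builder obtained from it
    # sets this computer on the underlying CalcJobNode (mechanism left open).
    return code
```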

It would be good to already suggest which technologies we want to support (e.g. Docker, Singularity and Sarus?) and to have working examples of how the submission strings would look (something that actually works on one of the supercomputers we use, similar to what you already provided for Sarus), ensuring that you tested it and it works, and mentioning which command-line parameters are essential to specify and which are optional.
(One question: does one also have to add lines to the #SBATCH part, or is this not required?)
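
To make this concrete, here is a rough sketch of what such a Sarus-based submission script could look like. It is purely illustrative and untested: the image name, input/output files and resource lines are hypothetical; only srun, sarus run, --mpi and --mount=type=bind,... are actual Slurm/Sarus syntax.

```bash
#!/bin/bash
#SBATCH --job-name=aiida-qe
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12

# --mpi enables Sarus' native MPI hook; --mount makes the job's working
# directory visible inside the container. Image and file names are hypothetical.
srun sarus run --mpi \
    --mount=type=bind,source="$PWD",destination=/workdir \
    ethcscs/quantum-espresso:v6.7 pw.x -in /workdir/aiida.in > aiida.out
```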

Regarding your comment on the pwd: is specifying the bind explicitly required, or is this done by default by Sarus on CSCS? In any case, maybe one solution is to do what we also do for codes, i.e. specify template strings that will be substituted by AiiDA (e.g. we can write srun -n {tot_num_cpus}, so one could write sarus run --mount=...,source={pwd},destination=...).
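
As a minimal illustration of that template idea (the {pwd} and {image} field names and the substitution mechanism are assumptions by analogy with the existing {tot_num_cpus}, not current AiiDA behaviour):

```python
# Sketch: AiiDA would substitute job-specific values into a user-provided
# template when generating the submission script.
template = (
    'sarus run --mount=type=bind,source={pwd},destination=/workdir '
    '{image} {executable}'
)

command = template.format(
    pwd='/scratch/snx3000/someuser/aiida_run/12/ab',  # job working directory
    image='ethcscs/quantum-espresso:v6.7',            # hypothetical image name
    executable='pw.x',
)
print(command)
```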

One final thing to mention: in order for this to work, one probably has to first log in to the supercomputer and pull the corresponding image, right? Or can this be done automatically in the submission script with a #SBATCH command? It's still a bit better than compiling a code, but we shouldn't forget this. Maybe, as we are implementing a verdi code test, we could have a verdi code containerised-setup --computer=eiger that SSHs into the machine and pulls the correct image?
Similarly, I think verdi code test should do something different when the code is containerised, e.g. check whether the image has been pulled.
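
A very rough sketch of what such a check could do, reusing the existing transport machinery; the helper and the image_name attribute are hypothetical, while `sarus images` is the actual Sarus command listing locally available images:

```python
from aiida import orm

def check_image_pulled(code, computer):
    """Hypothetical check for a containerized `verdi code test`: connect to
    the computer and verify that the code's image has already been pulled."""
    image = code.attributes.get('image_name')  # assumed attribute on the code
    authinfo = computer.get_authinfo(orm.User.objects.get_default())
    with authinfo.get_transport() as transport:
        retval, stdout, _ = transport.exec_command_wait('sarus images')
        return retval == 0 and image in stdout
```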

Also, a few minor details on the text:

  • I wouldn't call them virtual technologies. Maybe just "Containerized Codes (CC)", and then using the acronym CC across the text, would be clearer and more correct?
  • I wouldn't stress that "It makes code setting process more delicate to typos": you will have the same issue with containerised codes, when writing down the exact name and version of the image.


sphuber commented Nov 3, 2021

However, if we go this way, I would then in the process also split code.remote and code.local in the same way, for consistency.

I was going to mention this as well. I think their not being split causes a lot of problems in the interface and implementation. The only question is whether we can make this change with no or very little code breaking, which I don't think will be trivial. But I would definitely advocate for the container code being a separate class, with aiida.data:core.code.container as an entry point.
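
To make the proposed split concrete, here is a bare-bones sketch of such a hierarchy; the class names and docstrings are assumptions, not the final design:

```python
from aiida.orm import Data

class Code(Data):
    """Common base class: label, description, default calculation plugin."""

class RemoteCode(Code):
    """Executable pre-installed on a specific computer (today's 'remote' code)."""

class LocalCode(Code):
    """Executable stored in the repository and uploaded at runtime."""

class ContainerizedCode(Code):
    """Executable inside a container image: stores the image name and the
    container technology (e.g. sarus, singularity, docker)."""

# Each subclass would then be registered under its own entry point in the
# `aiida.data` group, e.g. `core.code.container` -> ContainerizedCode.
```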

However, we need to see whether this makes things more complex for the user, who then also needs to specify the computer at submission time.

I haven't yet read what the advantage of not hard-coding a computer is, but if there is a real advantage, it is already possible to pass a computer at runtime through the inputs as metadata.computer. So this might actually be a very reasonable setup.
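
For reference, this is roughly what that looks like today; the code and computer labels are hypothetical, and the example assumes the aiida-quantumespresso plugin is installed:

```python
from aiida import orm
from aiida.plugins import CalculationFactory

code = orm.load_code('pw-container')   # hypothetical code without a computer
computer = orm.load_computer('eiger')  # hypothetical computer label

builder = CalculationFactory('quantumespresso.pw').get_builder()
builder.code = code
builder.metadata.computer = computer  # choose the machine only at submission time
```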

We indeed need to "configure", for a given computer, which container technologies it supports. I wouldn't add one more step to verdi computer configure, but rather also ask which container technology (if any) is supported. As a note: currently the computer setup sets the metadata of the computer (which is the very same for all users, and in principle immutable), while the computer configure sets the AuthInfo and other user-specific mutable metadata (e.g. the time between opening new SSH connections). We should probably think about what should go where (I don't know if this is something @chrisjsewell had in mind when mentioning the redesign of transport etc.).

What container technology is supported definitely seems like something that should be part of the Computer and not the AuthInfo. However, this information may change in the future, so you may want to change it, but that would not be possible if it is added to the Computer, for provenance reasons. Yet this kind of change would have zero impact on provenance, so it is weird to forbid it. We really should have "mutable" attributes on Computer, making a distinction between those that affect provenance and those that don't.

@giovannipizzi

When you say "this information might change in the future", you mean that a computer might stop supporting, say, singularity and start supporting sarus? In this case, I would just define a new code (in the same way we would define a new code if the computer passes from PBS to slurm), no? So it's OK to keep it in the immutable properties. (I would be very careful in allowing mutable attributes unless really crucial, since this will make all the exporting and sharing code more difficult; if we can limit this to extras, and very little else, we save ourselves a lot of headaches I think, e.g. what happens if I import a computer that used to be with singularity but became sarus?) I would keep as mutable only those things that are really user-specific, e.g. the connection username, or the settings on how often to reconnect, which are only for performance and do not change the results.


sphuber commented Nov 3, 2021

When you say "this information might change in the future", you mean that a computer might stop supporting, say, singularity and start supporting sarus?

Exactly. The way you phrase it, in order to use a container plugin with a certain Computer, that computer would have to be configured such that it says explicitly which container plugins it supports. If that is the case, it would be useful if that information on the Computer were mutable, since it shouldn't affect provenance and it doesn't really belong in the AuthInfo.

In this case, I would just define a new code (in the same way we would define a new code if the computer passes from PBS to slurm), no?

I might be misunderstanding, but not really: the scheduler type is specified on the Computer, not on the Code.

@giovannipizzi

Sorry, I meant: "In this case, I would just define a new computer (in the same way we would define a new computer if it passes from PBS to slurm), no?"


sphuber commented Nov 3, 2021

Sorry, I meant: "In this case, I would just define a new computer (in the same way we would define a new computer if it passes from PBS to slurm), no?"

Sure, the question is whether in this case it is really necessary or just an arbitrary limitation of our current setup. If there really is no reason not to allow it to change on an existing computer, it would be annoying to be forced to nonetheless. Again, this is barring that changing it would actually affect the provenance in any way. I agree that the simplest is to not change anything and simply require creating a new computer.


unkcpz commented Dec 4, 2021

Thanks @giovannipizzi, thanks @sphuber!

Sorry for the delay in updating and replying. Before the meeting with CSCS yesterday, I had no strong opinions on most of the questions you mentioned, @giovannipizzi ;) I still don't have strong opinions on most of them, so I have updated the AEP and listed them as open questions. Let's settle them during the coding week!

Just to try to clarify something as far as I know.

Also, practically, a given computer will probably support at most one containerisation technology. So I would suggest that a given computer can have zero or at most one container technology supported, and that which one goes in verdi computer setup (default: none, with a list of plugins, similar to transport and scheduler).

Eiger and Daint now support both Sarus and Singularity.
I think it makes sense, and is quite normal, to have more than one container technology on an HPC machine.
On the one hand, different container technologies do not conflict with each other, unlike scheduler managers.
On the other hand, container technologies are more like a code or a library (in the HPC sense) used to launch images.
Therefore, users from different communities are able to choose their preference.
This is why, to me, container technologies are more like the current code (in the AiiDA sense), and why I want to set them up bound to a specific computer and use them through @.

(One question: does one also have to add lines to the #SBATCH part, or is this not required?)

I think for the --mpi=pmi2 setting, the answer is yes. This option is passed to srun, as described in the documentation.
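
For example, the option would then appear on the srun line of the generated script rather than in the #SBATCH header; the image and file names below are hypothetical:

```bash
# Passed to srun, not added as a #SBATCH line:
srun --mpi=pmi2 sarus run ethcscs/quantum-espresso:v6.7 pw.x -in aiida.in
```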

P.S. I just ran into some issues when running and testing OpenMPI-compiled QE images with Sarus; where should I report these issues officially?
Via the CSCS Service Desk (https://support.cscs.ch/) for questions.

@unkcpz force-pushed the aep/008/container-code branch from c9a77f2 to d4c5b97 on December 6, 2021 09:10
@sphuber changed the title from "container code proposal" to "AEP: Add native support for containerized codes" on Dec 16, 2021

ltalirz commented Jan 22, 2023

@unkcpz does this PR reflect the implementation route that was chosen in the end?

Should this PR be closed or merged (and if it should be merged, should it be updated)?


unkcpz commented Jan 23, 2023

This PR should be updated and merged. I'll update it with the final decision we made and the things we implemented.


sphuber commented Jan 23, 2023

Note that there is this PR on aiida-core to add support for Docker. We could still include the suggestion to add it in this AEP.
