crew with a nearly-airgapped HPC? #153

r2evans · 2024-02-26T14:56:03Z


I'm hoping to get confirmation that crew would be a good fit for this use-case with minimal shoe-horning.

I do some real-time strategy-recommendation for car racing that involves 1000s of simultaneous models. The VM where the data and API are located is available via the internet (auth required), and the only access to the HPC is over SSH from the VM: no port-forwarding allowed, no other ports exposed. The VM needs to get current-state (times, positions, etc.) to the HPC every couple of laps, so I've set up a file-based mechanism that provides atomic transfer of data in both directions; this uses a dedicated process on both the VM and the HPC (I'll call these processes "brokers"). This is the only way I have to get anything other than human-typed information between the VM and the HPC.

VM -> HPC data is almost entirely "tasks": run the model with given parameters.
HPC -> VM data is entirely model results. We can consider all of this tabular (multiple tables per model).
FYI: model runtime is on the order of a few minutes each. This process runs continuously for 3-6 hours.
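
The file-based mechanism itself is not shown in this thread. Purely as an illustration of what "atomic transfer" can mean for a polling reader, here is a minimal base R sketch that assumes a write-then-rename pattern; `write_atomic()`, `read_ready()`, the `.rds` format, and the file names are all hypothetical, and rename atomicity is only assumed to hold within a single POSIX-like filesystem.

```r
# Hypothetical helper: write an object so that a polling reader never sees a
# partially written file. The names and the .rds format are assumptions.
write_atomic <- function(object, path) {
  # Create the temporary file in the destination directory so that the
  # rename below stays on one filesystem.
  tmp <- tempfile(pattern = ".partial-", tmpdir = dirname(path))
  saveRDS(object, tmp)
  if (!file.rename(tmp, path)) {
    unlink(tmp)
    stop("could not move ", tmp, " to ", path)
  }
  invisible(path)
}

# Hypothetical reader side: only pick up files that carry the final suffix.
read_ready <- function(dir) {
  files <- list.files(dir, pattern = "\\.rds$", full.names = TRUE)
  lapply(stats::setNames(files, basename(files)), readRDS)
}

# Usage with a throwaway directory standing in for the shared filesystem.
inbox <- file.path(tempdir(), "inbox")
dir.create(inbox, showWarnings = FALSE)
write_atomic(list(model = "m001", lap = 12L), file.path(inbox, "task-m001.rds"))
tasks <- read_ready(inbox)
```

The reader only ever matches completed `.rds` files, so a half-written temporary file is ignored until the rename makes it visible under its final name.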

I'm able to SSH into the HPC (and use tmux) and start up a (personal?) redis instance, then I start the brokers. They poll redis and the filesystem; this part works well (enough). Once all that is started, I start up the HPC batch, and 1000s of processes periodically poll the redis instance for new tasks.

The legacy system (for the 2023 season) used Python and Pyomo on the HPC, and that is changing for this year, so a complete rewrite of that end of it is required. (The VM is running entirely R, no change.)

1. Are crew and rrq a good fit for this type of coordination? I think I'd use a single worker type (not sure if a group is required).
2. Would this require a controller (and zero workers) on the VM, and a controller and all workers on the HPC? Is it a problem to have multiple controllers like this?
3. The current method of returning the results is via compressed files, so when a model is complete, it uses the data-broker mechanism to get the results to the VM. However, it might be nice if the on-HPC controller were notified of task completion without having to poll each task. Is that possible?
4. The way the HPC batch works, when a process stops I lose the CPUs. This isn't something I can adjust, so having an R session stop (i.e., "transient" workers) is not good. I'm assuming I'd go with "persistent"; other than typical R exception handling, is there anything I need to be aware of with regard to ensuring persistence for as long as I want it? (At the end of the race, I manually shut down the batch job, though it might be nice if I could notify the nodes to shut themselves down.)
5. The workers can access the HPC gateway (where redis and one data-broker are running), but there is no direct way for me (on the gateway) to initiate a connection to any of the workers. All communication with the workers must initiate on the controller and/or be on the shared filesystem. (If it matters, it's a lustre gpfs, and lag between nodes can be on the order of 10 seconds.) I think this is fine (the controller doesn't reach out to workers, it's all "poll" in a sense), but please confirm that I'm seeing this correctly.
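
On the "typical R exception handling" part of the persistence question above, a minimal base R sketch of the idea might look like the following. `run_model()` and `safe_run()` are hypothetical stand-ins, not part of crew, mirai, or rrq, and `tryCatch()` only covers ordinary R errors; it cannot protect a worker from a segfault or from being killed by the scheduler.

```r
# Hypothetical stand-in for one model run; it fails for negative input.
run_model <- function(params) {
  if (params$x < 0) stop("infeasible parameters")
  list(objective = sqrt(params$x))
}

# Wrap a task so that an R error becomes a returned value instead of
# propagating up and ending the work loop of a long-lived session.
safe_run <- function(params) {
  tryCatch(
    list(ok = TRUE, value = run_model(params), error = NULL),
    error = function(e) {
      list(ok = FALSE, value = NULL, error = conditionMessage(e))
    }
  )
}

good <- safe_run(list(x = 4))   # succeeds
bad <- safe_run(list(x = -1))   # fails, but the session keeps running
```

Returning the error message as data also gives the result path something concrete to ship back for failed models.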

Will, I met you several years ago at an R conference (UCLA, perhaps?) when you were still forming targets, I hope things are going well. I'm really intrigued by the notion of crew and want to wrap my brain around it. (FYI, my first "full-up run" with this system should be this upcoming weekend ... I'm feeling no pressure here ;-)

wlandau · 2024-02-26T20:58:37Z

Maintainer

1. Looks like rrq might work if the HPC workers can dial into your local Redis instance. For crew to use SSH connections though, the workers would still have to have some way of dialing into the mirai host, and so those workers would have to initiate all those SSH connections and treat your local machine/node as a server. Please correct me if I am wrong @shikokuchuo, but in mirai I was under the impression that SSH connections would have to go from host to daemon, not the other way around.
2. Is the VM your local machine / node? If so, that's where the controller typically lives. If you need different pools of workers that run in different places, you might have a look at https://wlandau.github.io/crew/articles/groups.html.
3. This is precisely why the event-driven async and networking of mirai is so nice. There is no polling, and everything from task completion to worker connections is detected via NNG synchronization primitives rather than local files.
4. If you do get crew to work for you, the most useful options for the controller are seconds_idle and tasks_max. Those will allow you to choose any point on a smooth continuum between fully transient workers and fully persistent ones.
5. If the HPC workers can dial into the VM where the controller is running, using a local IP and port, you should be fine. But if it doesn't work the first time, you might have to experiment with the host argument to see what kind of host name or IP address would work. If you are having a hard time, you can peel back layers (crew without targets, mirai without crew) as I explain at "How to configure a routine containerized slurm cluster setup?" (crew.cluster#36).
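
As a rough illustration of the seconds_idle, tasks_max, and host options mentioned above, here is a minimal sketch using crew's local controller as a stand-in for a real cluster launcher. It assumes the crew package is installed; the controller name, worker count, and host value are placeholders, and an actual HPC setup would more likely use a crew.cluster controller with settings specific to the scheduler.

```r
library(crew)

# Fully persistent end of the continuum: workers never exit because of
# idle time or task count. All values below are placeholders.
controller <- crew_controller_local(
  name = "race_models",   # hypothetical controller name
  workers = 2L,
  seconds_idle = Inf,     # never time out for idleness
  tasks_max = Inf,        # never retire after a fixed number of tasks
  host = "127.0.0.1"      # address the workers dial back into
)

controller$start()
controller$push(name = "demo_task", command = 1 + 1)
controller$wait()             # block until the pushed task has finished
result <- controller$pop()    # one-row data frame describing the task
controller$terminate()        # shut the workers down explicitly
```

Finite values, for example a small `seconds_idle` or `tasks_max = 1`, move toward the transient end, where workers exit and are relaunched on demand.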

I hope things are going well for you too. Good luck with your prototyping.


r2evans · 2024-02-26T21:23:30Z

Author

Okay, good points.

1. The redis instance will be within the HPC, so HPC workers will have direct access to that. One purpose of the data-brokers is to populate the HPC-local redis with things. (FYI, using a filesystem as a redis-transport-medium is not the most performant ...)
2. The VM is local-enough, I have full access and control over it. (I had read about groups, I currently don't need more than one type of worker, but that may be because that's all I've been able to engineer.)

From your first two points, I think I should likely shift away from having a controller on the VM. I think that was a mis-application on my part. So let's assume that there is one controller on the HPC gateway (fully accessible by all workers) that is colocated with a redis instance.

3. Got it, more reading.
4. If I need to choose a point on the continuum, I need to go with fully-persistent, no transience at all. Dropping workers is a really bad thing: the batch system prevents me from getting them back without stopping everything and restarting (which is neither automated nor instantaneous).
5. I think this is resolved by removing the notion of a VM-based controller, instead doing all of the controlling on the HPC gateway.

But something I didn't say before: the HPC has no internet access, no ability to reach back to the VM. The only way to get any information back from the HPC is via the data-brokers and files. (Next R conference we're at together, if you're really curious, we can chat over a beer or whiskey or espresso, depending. This was an adventure for creative architectures.)

It looks as though mirai, crew, and rrq are all relatively simple packages, the only non-R dependency would be for redux (which is covered).

Thank you for the encouraging feedback, I'll let you know if I find anything. Any R confs in your future?


wlandau Feb 28, 2024
Maintainer

> But something I didn't say before: the HPC has no internet access, no ability to reach back to the VM. The only way to get any information back from the HPC is via the data-brokers and files.

crew doesn't need access to the public internet (trying to do this would be a security risk) but it does need a local network to connect back on. So if data brokers and files are needed, that might be a hard limit.

> Any R confs in your future?

@shikokuchuo and I submitted joint abstracts to Posit conf and useR! this year. I will be in person at Posit conf if everything works out as planned, and then virtual at useR!
