Performance of Jailed space #111
Comments
Hello thien-lm, Thank you for the question! The impact on performance largely depends on the storage solution you're using. In general, shared storage tends to have higher latencies for I/O operations compared to non-shared options, though it can offer much higher overall throughput. We tested three shared storage solutions in practice:
Here’s a breakdown based on two common usage scenarios:
Since Soperator is primarily designed for ML model training, distributed filesystems like GlusterFS or the Nebius shared filesystem are well-suited for the most demanding tasks such as checkpointing and dataset loading. These operations benefit the most from high throughput, while tasks like installing software are typically less frequent and it's not a big deal if they take 2-3 times longer. If you use PyTorch, you can also set higher
Additionally, Soperator allows for flexible storage customization. For example, the "Jail" storage can be an NFS share, while "Jail submounts" can be backed by distributed filesystems. These submounts can leverage any storage type supported by your Kubernetes cluster (e.g., ephemeral or persistent, local or shared, disk-based or in-memory, S3 or OCI) to meet specific use cases.
Some files and directories are non-shared by default: all virtual filesystems including
Some links:
In theory, it seems that the jailed space will have poor performance.
Has anyone faced this issue as the number of workers in the Slurm cluster increases?
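A rough way to check this on a given storage backend is to compare a latency-bound workload (many small files, similar to installing software) with a throughput-bound one (a single large sequential write, similar to checkpointing). The sketch below is only an illustration and is not part of Soperator: the function names and the `PROBE_DIR` environment variable are hypothetical, and it measures a single client, so it says nothing about contention between many workers.

```python
import os
import tempfile
import time


def small_file_latency(directory, count=200, size=4096):
    """Average seconds to create, fsync, and delete one small file."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"probe_{i}.bin")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write through to storage
        os.remove(path)
    return (time.perf_counter() - start) / count


def sequential_write_throughput(directory, total_mib=64, chunk_mib=4):
    """MiB/s for one large sequential write, including the final fsync."""
    chunk = b"x" * (chunk_mib * 1024 * 1024)
    path = os.path.join(directory, "probe_large.bin")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mib // chunk_mib):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    os.remove(path)
    return total_mib / elapsed


if __name__ == "__main__":
    # PROBE_DIR is a hypothetical variable: point it at a directory inside
    # the jail or a submount. Falls back to a local temp directory.
    target = os.environ.get("PROBE_DIR") or tempfile.mkdtemp()
    print(f"target: {target}")
    print(f"small-file latency: {small_file_latency(target) * 1000:.2f} ms per file")
    print(f"sequential write:   {sequential_write_throughput(target):.1f} MiB/s")
```

Running it once against a local disk and once against the shared storage gives a first impression of how the two compare for each workload type.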