Striping the restart.mesh.data file #6035

Closed
lhy11009 opened this issue Sep 11, 2024 · 11 comments · Fixed by #6104

Comments

@lhy11009
Contributor

Hi all.

(Chris Ramos forwarded)

I am running a large 3D model on Frontera with 4000 CPUs. A few days into the run, I was notified about one large file, namely restart.mesh.data.
The problem is that all processes try to write to this one file simultaneously. As I understand it, the situation resembles a traffic jam.

Chris, the Frontera administrator, suggests applying the striping technique to this big restart file. Otherwise, the total number of tasks cannot exceed 840, according to his estimate based on the system capacity.

A further point I learned from Chris is that reducing the model size, and therefore the data size, won't help much. What matters is mainly the number of processes that checkpoint simultaneously.

I'd like to know whether there is a ready-made feature for this in ASPECT. If not, how else could this issue be addressed?

@bangerth
Contributor

The restart.mesh.data file contains information about the triangulation that is stored in pieces across all processes. As a consequence, all processes need to write into this file at the same time, as each process needs to save what it knows about the mesh. We do this via the MPI I/O functionality, which hopefully is implemented as efficiently as possible.

It is possible that one could do this more efficiently if one could tell the operating system to "stripe" the file, i.e., to store it not as one big block on one disk, but as many blocks on many disks at the same time. But, at least to my understanding, telling the operating system to do this is not part of the regular C/C++ interface used to create/open files. How would one set the striping property on a new file? Would you or Chris be able to offer a piece of code we could use to do this, and do it in a way that is as portable as possible across systems?
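
One possible answer to the question above, offered as a sketch rather than a tested solution: the MPI standard reserves the info keys `striping_factor` (number of stripes) and `striping_unit` (stripe size in bytes) as hints for `MPI_File_open`. An MPI implementation is free to ignore them, and they generally only matter at the moment the file is created. The helper `make_striping_hints` below is hypothetical and not part of ASPECT or deal.II; it only assembles the key/value strings, so the example compiles without MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (an assumption, not ASPECT or deal.II API):
// build the key/value hints that could be passed to MPI_Info_set()
// before a file is created with MPI_File_open().
// "striping_factor" and "striping_unit" are hint names reserved by the
// MPI standard; an MPI library may silently ignore them.
std::vector<std::pair<std::string, std::string>>
make_striping_hints(const unsigned int stripe_count,
                    const std::size_t  stripe_size_bytes)
{
  assert(stripe_count > 0);
  assert(stripe_size_bytes > 0);

  return {{"striping_factor", std::to_string(stripe_count)},
          {"striping_unit", std::to_string(stripe_size_bytes)}};
}
```

With MPI available, each pair would be passed to `MPI_Info_set` on an object created with `MPI_Info_create`, and that object would then be handed to `MPI_File_open`. Whether the checkpointing code in deal.II exposes a place to pass such hints is not established here.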

@lhy11009
Contributor Author

@bangerth Thank you for your reply. Your input will greatly enhance my upcoming discussion with Chris. I must admit, I don't have extensive knowledge on this particular issue, so I will be relying on both you and Chris to help address it.

Resolving this matter is currently a high priority for me, especially as we have a 20,000 SU allocation on Frontera to run our 3D models. Our ongoing simulations are utilizing over 1,000 cores, and we've thoroughly tested the memory and computational requirements to ensure they meet the demands of our 3D cases. As such, I’m counting on this issue being resolved soon.

I'll reach out to Chris to arrange a follow-up discussion during our next meeting, and I will keep you updated on the progress.

@lhy11009
Contributor Author

I should also mention that I am currently using ASPECT version 9.4.0 on deal.II 9.4.0. If the MPI I/O functionality differs in newer versions of ASPECT or deal.II, I would greatly appreciate any clarification on that.

@gassmoeller
Member

As Wolfgang mentioned, you are probably thinking about "striping" a file. Striping on Frontera is handled by the operating system; see the Frontera manual: https://docs.tacc.utexas.edu/hpc/frontera/#files-striping. We have in the past successfully checkpointed and restarted models on up to several thousand CPUs on Frontera, so I think this is something you will have to resolve with the Frontera staff first. Only if there is no solution that can be implemented on the cluster could we start to think about mitigation measures inside ASPECT, which would almost certainly be more complicated than a more appropriate configuration of the cluster filesystem.

@lhy11009
Contributor Author

As René mentioned, checkpointing doesn’t seem to be a problem, and it works fine for my case with 4000 CPUs. However, Chris reached out to inform me that my checkpoint behavior is causing some issues with the Frontera system. He suspects that the problem lies with the restart.mesh.data file and suggested a limit of 840 tasks, which is lower than the number of CPUs I typically use. He also proposed that striping the file might be a solution.

At this point, I believe it would be helpful to have a brief discussion to identify the exact issue and explore potential solutions.

@bangerth changed the title from "Stripping the restart.mesh.data file" to "Striping the restart.mesh.data file" on Sep 23, 2024
@lhy11009
Contributor Author

lhy11009 commented Oct 6, 2024

I wanted to provide an update on the debugging process related to file striping on Frontera’s system. After some testing, I found that with 4000 MPI processes and three large restart files, each around 100 GB in size, setting the stripe count to 8 appears to be optimal. It turns out that file striping can be configured specifically for these three files.

I'll let you know how my test goes.

@gassmoeller
Member

Thank you for the update 👍 When you conclude your analysis, would you be willing to summarize your findings in a bullet point or a few sentences in our wiki section on Frontera: https://github.com/geodynamics/aspect/wiki/Installation-on-Frontera? That would help others who encounter similar issues in the future.

@lhy11009
Contributor Author

lhy11009 commented Oct 7, 2024

Of course. I'll keep that in mind.

@tjhei
Member

tjhei commented Oct 12, 2024

Are you talking about the restart.mesh_fixed.data file? How do you enable striping for the affected files?

One issue I can see is that we create restart.mesh.new* files and rename them afterwards:

triangulation.save (parameters.output_directory + "restart.mesh.new");

This means the actual IO happens into different files.
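
The write-then-rename pattern described here can be illustrated with a minimal sketch. The function `write_then_rename` is a hypothetical stand-in and not ASPECT code; it uses plain `std::filesystem` (C++17) instead of MPI I/O. The point it shows is that all bytes are written while the file still carries its ".new" name, so any per-file setting would presumably have to be in place for that name.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical stand-in for the checkpoint logic (an assumption for
// illustration): data is written to "<base>.new" and only afterwards
// renamed to "<base>".
fs::path write_then_rename(const fs::path    &directory,
                           const std::string &base,
                           const std::string &payload)
{
  assert(!base.empty());
  fs::create_directories(directory);

  const fs::path temporary  = directory / (base + ".new");
  const fs::path final_name = directory / base;

  {
    // All bytes are written while the file still has its ".new" name.
    std::ofstream out(temporary, std::ios::binary | std::ios::trunc);
    out << payload;
  } // the stream is closed here, before the rename

  // On POSIX systems, rename() replaces an existing regular file.
  fs::rename(temporary, final_name);
  return final_name;
}
```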

Maybe we should put all large restart files into a separate folder. This way one can specify the striping behavior for the whole folder (and consequently for each new file created there).

@lhy11009
Contributor Author

That makes a lot of sense. My current solution is to set striping for all the following files: restart.mesh_fixed.data, restart.mesh.new_fixed.data, and restart.mesh_fixed.data.old. Grouping these files in a dedicated folder would make the process much easier to manage.

In practice, this would just require adding one additional line in the SLURM file like:

lfs setstripe -c 8 ./output/{new restart folder}

This should help streamline the setup. Let me know if you have any thoughts or further suggestions!
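
On the ASPECT side, the dedicated folder proposed above could look roughly like the following sketch. The function `restart_directory` and the folder name "restart_files" are assumptions made for illustration, not existing ASPECT names. The helper creates the subfolder and returns its path with a trailing separator, so that file names can be appended in the same way as in `parameters.output_directory + "restart.mesh.new"`.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper (not existing ASPECT code): create and return
// "<output_directory>/<subdirectory>/". The default folder name
// "restart_files" is an assumption.
std::string restart_directory(const std::string &output_directory,
                              const std::string &subdirectory = "restart_files")
{
  assert(!subdirectory.empty());

  const std::filesystem::path dir =
    std::filesystem::path(output_directory) / subdirectory;

  // Does nothing if the directory already exists.
  std::filesystem::create_directories(dir);

  // Trailing separator so callers can append file names directly.
  return dir.string() + "/";
}
```

The `lfs setstripe -c 8` line in the job script would then target this folder once, before the first checkpoint is written, and the files created in it afterwards would pick up that setting.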

@lhy11009
Contributor Author

lhy11009 commented Oct 15, 2024

Hi Timo,

As of now, my test case with 4000 MPI processes and a 100 GB restart.mesh_fixed.data file is running well on Frontera after also setting striping for the restart.mesh.new_fixed.data file.
I just wanted to add that if it’s feasible to make the change to ASPECT in a small PR, I’d be happy to handle that. If you could point me to what needs to be changed, I can take it from there.

Looking forward to your guidance!


4 participants