ASGS Install on Unity #1064
Replies: 21 comments 22 replies
-
Hey @notstarboard great to hear from you! @wwlwpd is the expert here, but I will offer the following suggestions:
Have fun and let us know how it goes! :-)
-
Thanks @notstarboard and @jasonfleming - another bit of advice I can offer is to take a look at this
-
@notstarboard - also, we use Docker mainly when targeting desktops, because Ubuntu and other "desktop" Linux distributions do not ship with things like a build environment already in place. HPC environments necessarily have all of that, since their users require it. Hopefully MGHPCC / Unity has these things installed already. If not, you're really going to want to use Docker, because we are not going to be supporting the installation of our own build environment.
-
Thank you both for your help on this! I created a branch for the issue on my fork of ASGS, made platform files for Unity as per the platforms README, and was able to complete ASGS instantiation. It turns out there was indeed a flex module available on Unity, and when I loaded that before running init-asgs the NCO errors went away. The modules I loaded were mpich-intel, icc/2022.2.3, and flex. I've been having no luck getting ADCIRC to build, though. I tried building a few of Dave's custom versions of ADCIRC first, but ran into a bunch of "undefined reference" errors on both attempts. I next tried to install v55.01 to rule out any issues with the custom versions, and I hit the exact same errors. I've attached the end of the console output as well as the platform files I installed with - would appreciate any suggestions you may have!
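For anyone else who hits the NCO errors at instantiation time, the fix boiled down to loading the modules before running the installer. Roughly (module names are the ones available on Unity; other clusters will differ):

```shell
# Load the compiler, MPI, and flex modules before instantiating ASGS;
# without flex, the NCO Toolkit build fails as described above.
module load mpich-intel
module load icc/2022.2.3
module load flex

# Then run the installer from the ASGS repo root.
./init-asgs.sh
```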
-
The machine needs a couple of entries in
-
Also, I don't plan on accepting this as an official platform for right now, but would rather flesh it out as a user-provided PLATFORM that is made available via
This was put in so that sites can maintain their own set of platforms without adding them to the list of officially supported machines. An additional benefit is that sites or individuals can maintain their own repository of platform sets.
-
At this point the ADCIRC instance is pretty much all set up! When I run the "verify" command the only things that don't succeed are the TDS connections, which makes sense, since I didn't set those up. Last night I kicked off a test job that appeared to spin up successfully ("the padcirc.hindcast job appears to have run to completion successfully"), but it's running into issues when trying to download GFS data. There are hundreds of repetitive errors like the following at the bottom of the log file:
From looking at get_gfs_status.pl, this appears to be a path on the NOAA server. Is this likely just an issue with the site hosting the GFS data, then, or can you think of something I may not have done / may have done wrong in the setup to cause this?
-
Update on this! GFS data appeared to download successfully and padcirc jobs got kicked off as expected, but these jobs just spun forever, even on really coarse meshes like ec95d. I've been going back and forth with some Unity folks on this, and it seems that we need to build ADCIRC using a specific set of modules in order for Infiniband to be used, so if I build within ASGS it's going to run substantially slower than it should. The current approach is to try to build ADCIRC on my own and then point ASGS to the appropriate ADCIRCDIR. So far I'm still hitting compiler errors for padcirc/padcswan that folks are helping me with, but once that builds successfully it'll hopefully solve the performance issue. ASGS seems to be working as intended, though, which is why I've been quiet on here!
-
Hey Josh, great to hear that the ASGS workflow part of this is running as expected. :-) Each HPC has its idiosyncrasies; we've definitely had experience accommodating that over the years, especially @wwlwpd. Hopefully it will get sorted quickly.
-
Hi All! I still need to test with the URI mesh and with SWAN enabled, but based on a recent ec95d + WAVES=off test we seem to have resolved the performance issues. With that said, I'm running into errors with my post-processing code that I'm hoping you may be able to help me with. My code relies on the scipy module, which I have already installed via pip3. However, when I attempt to import that module, I get the following error:
I found this StackOverflow post with a similar error, and based on the replies there I confirmed that libffi-dev is indeed installed on Unity. In case that was a recent change, I also tried the suggestion of uninstalling and reinstalling Python; I ran the following via ASGS:
No dice, though. Do you have any other thoughts on how to resolve this? I figured that I'd start here instead of asking the Unity admins, since ASGS should be using its own Python instance instead of the one installed on Unity. Perhaps that libffi.so file needs to be made available to ASGS's Python somehow? Thanks!
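In case it helps with debugging, here's a quick check against the ASGS Python (assuming, per that StackOverflow thread, this is the usual case of the `_ctypes` extension failing to find libffi at import time):

```shell
# scipy pulls in ctypes along the way; if this bare import fails, the
# Python build itself is missing a working libffi link, and
# reinstalling packages via pip3 won't fix it.
python3 -c "import _ctypes; print('_ctypes OK')"

# On Linux, list the shared objects _ctypes wants, to check whether a
# libffi.so is resolved (run with the ASGS Python first on PATH).
ldd "$(python3 -c 'import _ctypes; print(_ctypes.__file__)')"
```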
-
The post-processing code does indeed work both inside and outside ASGS! So, I'm back to performance testing. And, as is tradition, this brings me to another question: Kevin (a Unity admin) suggested I run ADCIRC with the following flags: "--exclusive", "--mem=125g" (125g being the actual cap for a job on the URI nodes), and "--nodes=2" (or 3, or 4... the idea is to test and see how it goes). His thought is that it will be better for performance partition-wide, and also for the jobs themselves, if they're the only things running on each node they use. Is there a way to implement this in ASGS? I know we typically configure things at the CPU level with NCPU and NCPUCAPACITY, but I've never attempted this at the node level. Would it be enough to update the JOBLAUNCHER environment variable (currently just 'srun ') with these three flags and to comment out NCPU and NCPUCAPACITY in the config file?

Also, from reading through the wiki and some config files to see what I could dig up on this, I encountered the NUMWRITERS flag. So far I haven't been using it, but I do see that some other config files dedicate one, or a few, processors to it. Do you have any general advice as to when this is a good idea? Figure I may as well squeeze out every bit of performance while I'm down this rabbit hole :)
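To make the question concrete, what I have in mind (whether JOBLAUNCHER is even the right place for these flags is exactly what I'm unsure about) is something like this in the config file:

```shell
# Hypothetical: fold Kevin's suggested flags into the job launcher.
#   --exclusive : don't share nodes with other jobs
#   --mem=125g  : the per-job memory cap on the URI nodes
#   --nodes=2   : start at 2 nodes and scale up while benchmarking
JOBLAUNCHER='srun --exclusive --mem=125g --nodes=2 '

# ...and comment out the CPU-level knobs:
#NCPU=...
#NCPUCAPACITY=...
```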
-
@notstarboard great to hear that the post-processing is working! :-) Where did they suggest that you supply these flags? In the queue script itself, or on the command line?
-
This weekend I was finally able to complete a benchmark run with SWAN and one of the RICHAMP project meshes! I had been fighting seg faults on all of the larger meshes too, but "ulimit -s unlimited" ultimately got me past those. The padcswan runtime for the 5-day forecast on 256 CPUs was 2:02:48, which was better than I was expecting. Deb had done a similar benchmark with 360 CPUs on Hatteras and it took 7-8 minutes longer despite the 104 extra CPUs! The combination of the new nodes on Unity, running ADCIRC exclusively on any nodes used, and adding that one writer CPU made a pretty big difference. I think we can finally consider this discussion complete. Thanks as always for all of your help :)
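For anyone who hits the same seg faults on larger meshes: the fix was just raising the stack limit in the shell that launches the job, e.g. near the top of the queue script (where exactly it belongs will vary by site):

```shell
# SWAN/ADCIRC place large automatic arrays on the stack; the default
# soft limit (often 8 MB) causes segmentation faults on big meshes.
ulimit -s unlimited
```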
-
@jasonfleming I saw your post on the other issue - thanks for checking in! I had been sidetracked for a while with more post-processing related things (all on the richamp-support repo), but I'm getting back to this. There are two things I need to accomplish before we're basically ready on our end for a full live test, including dashboarding (probably in a few months - RIEMA is trying to schedule this but they're pretty jammed through the summer).
I still have some things I can try before I bug you both for help, and Dave Ullman is probably the first guy to talk to anyway, since he got this working on Hatteras! So, unless you have a really good idea on either of these pieces off the top of your head, I'll just keep plugging away and keep you posted.
-
Hey @notstarboard great to hear from you! Yes, here are some ideas:
Enjoy. :-)
-
I do not recommend using S3 for anything productive where interactive human effort is concerned; it's incredibly inefficient for this process because you have to futz with non-traditional utilities and JSON API clients. It's built for machines, not humans. I would be surprised if y'all don't have access to a simple anonymous FTP (read-only, anyway) that lets users copy files internally. This part of the process has been fraught with overcomplication over the last decade, and it's really unfortunate. Anyway, find a place that can expose either a public anonymous FTP or a directory listing over http/https. An alternative is some place like mega.nz or, worst case, Dropbox.
-
My comment above applies to interactive/iterative collaboration between people, not to supporting the uploading of files. If you want to automate that, then it would be done as part of a
-
Hey all, @wwlwpd has a good point: S3 vs. THREDDS really depends on what you plan to do with the data once you've uploaded it. My response above was just related to automating the upload to S3 with a bash script, without a thought about what happens to the data once it gets where it's going. @notstarboard can you give some detail about the fate of the uploaded data? Will there just be a subsequent automated download to the RICHAMP GIS stuff? Is the S3 or THREDDS server supposed to be archival storage? Is there an audience for the raw netCDF data that you will want to give interactive access to?
-
I appreciate all of the thoughts here! I don't know Kevin's reasons for suggesting S3, but I can share your suggestions once you've had a chance to reply to this post, and see what his take is. I do know that he wasn't very familiar with THREDDS, so some of it could just be trying to keep things simple by sticking to what other groups are already using. The basic need we have is an intermediate place to store files that collaborators on the RICHAMP project will use to drive dashboards and such for RIEMA. Chris Damon (the collaborator working on the dashboarding) has consistently said he needs some sort of public URL for this. I don't know what tools he's using to check for new files, but I've floated the idea of him just grabbing them directly from Unity and it seemed to be a nonstarter. This intermediate location doesn't necessarily need to be long-term storage, as we'll still have the local runs done on Unity. We'll probably want some sort of backup location with massive amounts of storage to archive past forecasts, as I believe Isaac does want to archive them and I can see the utility in that. I suppose long term this could be S3, or some other volume on Unity, or something else entirely. It's a less pressing concern than just finding a location that all URI teams can access and that allows for simple uploading and downloading of files. End users will only be interacting with this data via dashboards as far as I know, so they will have no need to get at the underlying NetCDFs. I did give s3cmd a shot and it was easy enough to get going! I was able to upload a test file, anyway. It should be easy enough to build into ASGS as a POSTPROCESS task as Brett mentioned.

At a minimum I need to check in with Chris Damon again about how he wants to access the files. For example, is it best to just have one path, where I overwrite any existing files whenever we run a forecast, or should I create some sort of unique folder structure like ASGS does with THREDDS? If the former, could I just scp the files to their actual final destination instead, to save a few minutes and cut out the middleman?
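For reference, my s3cmd test amounted to something like this (bucket and file names here are placeholders, not our real ones):

```shell
# One-time interactive setup of access keys:
#   s3cmd --configure
#
# Upload a results file; a POSTPROCESS hook would just wrap this.
s3cmd put maxele.63.nc s3://richamp-example-bucket/forecasts/latest/
```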
-
@notstarboard you have an option we haven't mentioned. You could ask the Unity admins to set up a simple web server with
The last paragraph above gets to another issue: how do you notify Chris that new results are available, and what those results contain (e.g., which advisory was used, etc.)? We use email for that. Chris can also poll the server and parse metadata to see if there is new data.
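To sketch what "poll the server and parse metadata" could look like (the listing below is invented for illustration), Chris's side would only need to diff the filenames in the listing against the previous poll:

```shell
# Stand-in for `curl -s <listing-url> > index.html`; the filenames
# here are illustrative, not real run outputs.
cat > index.html <<'EOF'
<html><body>
<a href="maxele.63.nc">maxele.63.nc</a>
<a href="fort.61.nc">fort.61.nc</a>
</body></html>
EOF

# Extract the linked filenames; any name not seen on the last poll
# means new results are available.
grep -o 'href="[^"]*"' index.html | sed 's/^href="//; s/"$//'
```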
-
Hi All! Update on the above - I was able to convince Chris to get an account on Unity, so that I can just move the files to a staging directory via cp and he can then use scp to move them wherever he would like. While testing with S3 it quickly became clear that upload times were going to be a concern, so this was definitely the new path of least resistance. So, everything is pretty much working at this point - all that's left is more testing. I'm currently working on some hindcasts for Henri that RIEMA will use during a tabletop exercise next month to test out the system. I do have one question related to that, but I'll post it as its own discussion since it's not specific to Unity.
-
Hello & Happy New Year, Brett, Jason, and others!
I am currently trying to install ASGS on Unity so that we can make use of the URI computing nodes that were recently installed there. Given that Unity isn't a known environment for ASGS and therefore can't be selected via the installer, I just tried to install a "desktop-serial" instance on Unity with the thought that I could manually configure any cluster-specific stuff later. I loaded a few Intel compiler modules ahead of time and overrode the default desktop-serial option (gfortran) to install using the Intel compilers. init-asgs.sh chugged along for almost two hours, but it ultimately bombed out while trying to install the NCO Toolkit. I've attached the end of the console output to this post.
I was wondering if you had any advice for getting this installed, whether that's a different approach to try or help troubleshooting the error. If it matters, I'm not sure whether Unity even has a THREDDS server set up, so I'm happy for the results to stay in the scenario directories for now. Provided I can get through the installation and get the simulations to run, that'd be an awesome start. Thanks!
console_output.txt