Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenMP in Pyclaw #527

Open
wants to merge 23 commits into
base: master
Choose a base branch
from
Open

OpenMP in Pyclaw #527

wants to merge 23 commits into from

Conversation

weslowrie
Copy link
Contributor

I have updated step3ds.f90 in PyClaw to be able to use openmp. Some small changes were necessary to classic/solver.py to get the openmp environment variable OMP_NUM_THREADS and to properly set the size of the work array. Also the call to step3ds() was changed slightly such that there are not assumed size arrays (f2py does not like these; nthreads is passed in instead)
Additionally classic/setup.py was modified to explicitly add the '-fopenmp' f90 compiler flag as well as adding the 'gomp' library. I'm not sure if this is the best way to handle this, but it does work.

I have not added the openmp directive to step3.f90, but it should be mostly trivial at this point.

…lit algorithm. The unsplit algorithm can also be updated. Usage requires setting the environment variable OMP_NUM_THREADS before program execution. In addition the gfortran flag '-fopenmp' and the f2py flag '-lgomp' are required when compiling step3ds.f90. Added F90 and f2py options for OpenMP (-fopenmp, and gomp => -lgomp) in classic/setup.py so that when compiling 'classic3.so' the openmp lines are compiled.
…e in a dimension size in order to trick f2py into not making nthreads an optional argument. Otherwise it used nthreads=shape(qadd,2) as a size. Since qadd is optional, it get a bad value when qadd is not in the call to step3ds(), which is the default for pyclaw.
self.aux2 = np.empty((num_aux,maxm+2*num_ghost,3),order='F')
self.aux3 = np.empty((num_aux,maxm+2*num_ghost,3),order='F')
mwork = (maxm+2*num_ghost) * (31*num_eqn + num_waves + num_eqn*num_waves)
print "nthreads:",self.nthreads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to remove the print line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. It was left from debugging.

@mandli
Copy link
Member

mandli commented Oct 17, 2015

We should modify the travis test case so that it tests this code (set OMP_NUM_THREADS=2 maybe). I am also not entirely comfortable with having to directly use the environment variable rather than leave it up to the system. I noticed that the number of threads is being used to allocate work space but I am uncertain if I completely get why OpenMP cannot do this for us.

@weslowrie
Copy link
Contributor Author

The line with
'int(os.environ.get('OMP_NUM_THREADS',1))'
gets the value set by the environment variable or if not set defaults to 1.
That would be nice if openmp could handle the work array sizing, as well as the extra dimension on aux1, qadd, etc. You may have noticed there is a slight hack to get f2py to not assume that 'nthreads' is optional.

@mandli
Copy link
Member

mandli commented Oct 17, 2015

Yeah, f2py is a bit annoying like that. In any case, if an array enters a parallel region and forks to the threads it should make copies of anything in the stack unless otherwise specified (via a shared directive). That being said, if these threads are being forked and collected often then it might be better to create persistent storage. I wonder if instead of using the environment variable that there is some way to fetch the desired number of threads.

removed debugging print statement
@weslowrie
Copy link
Contributor Author

@mandli

I wonder if instead of using the environment variable that there is some way to fetch the desired number of threads.

Are you wondering if we can query the system to find an optimal number of threads? Or maybe setting the threads via program input rather than using the environment variable?

We could set input on the python side with something like this:
nthreads = os.environ["OMP_NUM_THREADS"] = "2"

@mandli
Copy link
Member

mandli commented Oct 19, 2015

There are a number of ways to set the number of threads that OpenMP can use beyond just the environment variable. In the OpenMP API there is a way to query this but I am not sure how to do this in Python.

@mandli
Copy link
Member

mandli commented Oct 24, 2015

I did a bit of looking into this but could not find a good way to call omp_get_max_threads which is what I was thinking. I do have a question though, how would this interace with the process based parallelism, is their a way to spawn say 16 threads on a node and have the PETSc library handle inter-node communication?

@ketch
Copy link
Member

ketch commented Oct 25, 2015

This is great -- thanks, @weslowrie ! My main question is: does this cost us anything when running without OpenMP? I see two potential concerns:

  • Does compiling the code with -fopenmp make anything slower?
  • Is it at all probable that users will try to compile the code on a system without OpenMP and thus get compilation failure? Or does installation of gcc/gfortran always include OpenMP?

@mandli I guess the first thing to do is just try a PETSc+OpenMP run. @weslowrie have you tried that already by any chance?

@weslowrie
Copy link
Contributor Author

Does compiling the code with -fopenmp make anything slower?
Is it at all probable that users will try to compile the code on a system without OpenMP and thus get compilation failure? Or does installation of gcc/gfortran always include OpenMP?

I'm not sure about either of these, probably worth some tests. I think it is possible to NOT include the 'gomp' library while still having gcc, so some systems might not have the proper openmp library.

I recently noticed that the self.nthreads = int(os.environ.get('OMP_NUM_THREADS',1)) line is not working as intended when the environment variable is not set. I'll look into this.

@mandli I guess the first thing to do is just try a PETSc+OpenMP run. @weslowrie have you tried that already by any chance?

I have not tried the MPI+OpenMP combo yet. I might be able to run a test on a machine with multiple nodes to see if it works. I'll let you know if I am able to successfully test this.

@weslowrie
Copy link
Contributor Author

I'm trying to have a default behavior when the openmp env variable OMP_NUM_THREADS is unset. The problem is the line in step3ds():
!$ nthreads = omp_get_num_threads()
sets nthreads to the number of processors present on the system if the env variable is unset. This contradicts the python side, which sets self.nthreads=1 when unset, and the mwork array is sized incorrectly.

I'm not sure how to handle this yet. A child process cannot change the shell env variable, which is what openmp looks for. I also did not see an obvious way to override openmp's decision when omp_get_num_threads() is called. Anyone see an acceptable way to handle this?

@mandli
Copy link
Member

mandli commented Oct 26, 2015

omp_get_num_threads should return the current number of threads regardless of what OMP_NUM_THREADS is. omp_get_max_threads without OMP_NUM_THREADS defaults to the number of threads defined by the system as the maximum number allowed. This was one of the reasons I was a bit concerned about using OMP_NUM_THREADS in Python. Maybe we should just require that OMP_NUM_THREADS is set or create a simple Fortran subroutine that asks it how many threads are available and call it from python.

@ketch
Copy link
Member

ketch commented Oct 28, 2015

@weslowrie Some testing is being done now by @aymkhalil -- we're going to see if we can run with MPI+OpenMP on the new supercomputer here, and if it helps over just vanilla MPI.

I'm pretty sure we'll want to merge this in, so could you go ahead and add the OMP modifications to step3.f90 too?

@weslowrie
Copy link
Contributor Author

@ketch Yes no problem. I'll update step3.f90 as well.

We should find a resolution to the case where OMP_NUM_THREADS is unset or there is no openmp on a system. One possible easy solution that @mandli suggested is to just check for the OMP_NUM_THREADS env variable and exit the program if it is unset.
Would we want the default to be a run with the maximum possible number of threads?

Also, in the python setup.py do we need to have a way to skip -fopenmp and gomp compile flags if openmp is not present on the system?

@donnacalhoun
Copy link

Can you just use the

#if defined(_OPENMP)
    #pragma omp ...
#endif

pre-processing macro? This is defined by OpenMP standard. See

http://stackoverflow.com/questions/1300180/ignore-openmp-on-machine-that-doesnt-have-it

and

http://bisqwit.iki.fi/story/howto/openmp/#Discussion

These are C/C++ macros, but I would imagine that some thing similar is available in Fortran.

@mandli
Copy link
Member

mandli commented Oct 28, 2015

It is but I think @weslowrie needs it in Python which we have not been able to figure out beyond running a call through C or Fortran.

…of CPUs when the OMP_NUM_THREADS environment varibale is unset. This is consistent with what OpenMP returns for omp_get_num_threads() when the env variable is unset. Updated step3.f90 to use OpenMP, based on what is done in clawpack/classic. This was sucessfully tested with the Sedov.py test problem in the euler_3d example folder.
@weslowrie
Copy link
Contributor Author

I have added a modified step3.f90 to include OpenMP as done in classic clawpack. I tested this with the euler_3d, Sedov.py and the regression test within the folder. OpenMP appears to be working properly.
Also as a possibly temporary fix, I used the python multiprocessing module to check for the number of CPUs. I have used this to set the default number of threads, which is consistent with what omp_get_num_thread() returns when the environment variable is unset. This may not work on all systems, and a more rigorous solution probably needs to be implemented.

@mandli
Copy link
Member

mandli commented Oct 29, 2015

Good solution!

@mandli
Copy link
Member

mandli commented Nov 4, 2015

I may be wrong about this but if someone were to not set OMP_NUM_THREADS=1 and not compile with the OpenMP library incur a memory footprint penalty up to the number of cores they have, at least for the aux arrays?

@weslowrie
Copy link
Contributor Author

@mandli yes you are correct, and it is a problem. Not only are the aux arrays larger, but the work array is larger as well set in _allocate_workspace() in solver.py. This would occur even if they did not compile with the OpenMP library. I'll have to think of a better solution.

How does the PyClaw setup typically deal with custom compiler flags? Should OpenMP be the default, or should we expect the user to set custom flags while compiling/installing PyClaw?

@mandli
Copy link
Member

mandli commented Nov 4, 2015

The flags are usually in the setup.py files in the relevant src directories.

Interesting question about the default though, should OpenMP be enabled and use all of the cores on a machine if someone did not set OMP_NUM_THREADS? Seems we may want to require OMP_NUM_THREADS to be set to use this but I may be convinced otherwise.

@ketch
Copy link
Member

ketch commented Nov 5, 2015

@mandli: Don't AMRClaw and GeoClaw use OpenMP now? How are these issues handled there?

…arning that the 'OMP_NUM_THREADS' environment varibale is unset. This allows an unaware user to run their codes without any changes, and they will see the warning and can adjust accordingly.
@weslowrie
Copy link
Contributor Author

I made some changes, and I think it gives a desirable default behavior. The python code checks if the OMP_NUM_THREADS variable is set, if not it gives a warning and sets the default array sizes with self.nthreads=1.
On the Fortran side, it also checks to see if the environment variable is set, if not it forces the OpenMP number of threads to 1 with:
!$ call omp_set_num_threads(1)
Otherwise it continues as before and reads the environment variable.

I think this gives a desirable default behavior as an unsuspecting user will only see a warning about the unset env variable, and can set it if they want. If one sets the env variable, then it just does the calculation with set number of threads. This way we don't have any arrays that are larger than necessary.

Possible last fix, and I don't know how to do this:
Check if the OpenMP compiler flags and libraries are used in compiling PyClaw and then suppress the OMP_NUM_THREADS env variable warning that was just added to solver.py

@mandli
Copy link
Member

mandli commented Nov 5, 2015

I just cannot remember if warnings are printed to the console or just to the log file if using the standard PyClaw loggers. Speaking of that you also might want to instead use the logger attached to the solver instead. You can use it with

self.logger.warning(warning_stuff)

@weslowrie
Copy link
Contributor Author

@mandli I modified it so it uses the PyClaw logger as suggested. It writes the warning to the console and the log file with this method. I did have to move the code below the
super(ClawSolver3D,self).__init__(riemann_solver, claw_package)
line because the logger otherwise had not been instantiated yet.

@mandli
Copy link
Member

mandli commented Nov 6, 2015

I would perhaps send the message to a level that is not sent to the console so as to not worry users who are not concerned with such things. This should still send the message to the log file though so astute users can look there.

@weslowrie
Copy link
Contributor Author

@mandli Do you mean setting it to the INFO or DEBUG logger level? I'm not sure why, but if debug level is used, it is written neither to the console or file.

@weslowrie
Copy link
Contributor Author

@mandli I guess the first thing to do is just try a PETSc+OpenMP run. @weslowrie have you tried that already by any chance?

I tried a PETSc + OpenMP run, and superficially it looks like it is working well. I did not time the runs, but it was a noticeable speedup. I ran on 4 nodes (4 MPI jobs), with 4 OpenMP threads per node. The machine I used has 4 CPUs per node.

@mandli
Copy link
Member

mandli commented Nov 6, 2015

Yes to the logger level. Which is not being sent to the logfile?

@weslowrie
Copy link
Contributor Author

If I modify the code to:
self.logger.debug(message)
The message is sent to neither the console nor logfile. Probably due to the default level set for the solver logging. I don't think I should modify the solver logging levels here.

@mandli
Copy link
Member

mandli commented Nov 6, 2015

Huh, I thought that one of the levels got sent only to the file. Must have been wrong. I would avoid changing the logging perhaps.

@weslowrie
Copy link
Contributor Author

@mandli Do you think it is worth investigating other solutions to the warning output/logging? Or just leave this as-is?

@mandli
Copy link
Member

mandli commented Nov 14, 2015

@weslowrie I think at this point it is fine as-is

…lit algorithm. The unsplit algorithm can also be updated. Usage requires setting the environment variable OMP_NUM_THREADS before program execution. In addition the gfortran flag '-fopenmp' and the f2py flag '-lgomp' are required when compiling step3ds.f90. This is done for a custom version of classic3.so, and might also be necessary for the general version.
…so that when compiling 'classic3.so' the openmp lines are compiled.
…of CPUs when the OMP_NUM_THREADS environment varibale is unset. This is consistent with what OpenMP returns for omp_get_num_threads() when the env variable is unset. Updated step3.f90 to use OpenMP, based on what is done in clawpack/classic. This was sucessfully tested with the Sedov.py test problem in the euler_3d example folder.
…arning that the 'OMP_NUM_THREADS' environment varibale is unset. This allows an unaware user to run their codes without any changes, and they will see the warning and can adjust accordingly.
@clawpack clawpack deleted a comment from coveralls Jul 11, 2017
@clawpack clawpack deleted a comment from coveralls Jul 11, 2017
@mandli
Copy link
Member

mandli commented Nov 25, 2017

This is already in a PR, the question is whether the default method in that example should also be changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants