`PwBaseWorkChain`: Always do full restart for `ERROR_OUT_OF_WALLTIME` #1012

mbercx · 2024-02-23T10:53:10Z

Fixes #968

The current error handler for the ERROR_OUT_OF_WALLTIME exit code of the PwCalculation will restart from scratch in case the structure has changed during the pw.x run, as is typically the case for relax/vc-relax calculations. For larger structures and more complex calculations - such as those using Hubbard corrections - this can be quite inefficient since obtaining the electronic ground state is often more challenging and hence expensive.

Here we adapt the error handler to always do a full restart from the previous calculation. In case the structure has changed, we still set it as the input structure of the restart calculation.

mbercx · 2024-02-23T13:19:37Z

@AndresOrtegaGuerrero this should do the job, but there is one thing I'm still considering. For provenance reasons, in case the structure has changed I still pass it as an input to the restart calculation. However, for a full restart (CONTROL.restart_mode = 'restart') Quantum ESPRESSO also restarts from the structure saved in the out/aiida.save folder:

     Atomic positions and unit cell read from directory:
     ./out/aiida.save/
     Atomic positions from file used, from input discarded

This means that we're passing an input structure that will be ignored in any case, and hence this might be slightly confusing for someone that's observing the provenance afterwards. However, running without the input structure is not to QE's liking:

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine init_pos (1):
     atomic position info missing
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

So you always have to provide an input structure, even when the restart settings you specify explicitly decides to ignore it. ^^

I was considering perhaps restarting from the charge density instead, which should be pretty efficient. However, a full restart will most likely still be even more efficient, since it will also start from the correct kpt:

     Atomic positions and unit cell read from directory:
     ./out/aiida.save/
     Atomic positions from file used, from input discarded


     Check: negative core charge=   -0.000016

     The initial density is read from file :
     ./out/aiida.save/charge-density

     Starting wfcs from file
     Calculation restarted from scf iteration #     3

     total cpu time spent up to now is        2.7 secs

     Self-consistent Calculation

     iteration #  3     ecut=    45.00 Ry     beta= 0.40
     Calculation restarted from kpoint #    20

However, this calculation failed shortly thereafter:

     WARNING: integrated charge=     6.39693880, expected=    68.00000000

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine electrons (1):
     charge is wrong
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Restarting from the charge density did not have this issue. So I would favor for this approach, which is perhaps slightly less efficient but probably more robust?

bastonero · 2024-03-25T09:48:25Z

I'll leave my 2-cents considerations:

If I remember correctly, the restart for a relaxation also involves the BFGS history (if using BFGS), which would be otherwise discarted, hence probably leading to doing more steps (although, I don't think there's any restart flag just for this restart)
I also experienced that restarting from wfcs is usually more problematic (don't know should be so), but in many cases is important, e.g. to keep the same magnetization, polarization branch, or Hubbard occupations for instance. So, to consider carefully. Maybe also the code could be compiled with better IO? E.g. using HDF5? When doing so, I did not have so much problems restarting fom wfcs.

mbercx · 2024-03-27T12:43:26Z

Thanks @bastonero! Unfortunately this doesn't make the decision easier. ^^ Can we see from the outputs if QE has been compiled with HDF5?

I suppose we could do a full restart, and in case this fails restart from scratch with the final structure.

bastonero · 2024-03-27T13:18:50Z

No clue if there's a way. I know the wfcs files have .hdf5 extension, if it can help

bastonero

Shall we merge this? I think it's quite useful, I am currently running some hybrid functional calculations and due to the from_scratch is never converging since it alwats lose the BFGS history, thus making the convergence extremely slow and inefficient.

mbercx · 2025-01-17T09:06:30Z

@bastonero ah yes, I discussed this with @superstar54 last week, who was also keen to merge. Let me have a quick look again and then I'll merge.

mbercx · 2025-01-17T09:49:16Z

Ok, after reading into this again, I'm fine merging this PR, but perhaps with the following addition: since a full restart can sometimes fail with the charge is wrong error:

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine electrons (1):
     charge is wrong
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Perhaps we should also add an error handler that handles this, that restarts from scratch?

EDIT: this should probably only trigger in case we're doing a geometry optimization though.

The current error handler for the `ERROR_OUT_OF_WALLTIME` exit code of the `PwCalculation` will restart from scratch in case the structure has changed during the `pw.x` run, as is typically the case for `relax`/`vc-relax` calculations. For larger structures and more complex calculations - such as those using Hubbard corrections - this can be quite inefficient since obtaining the electronic ground state is often more challenging and hence expensive. Here we adapt the error handler to always do a full restart from the previous calculation. In case the structure has changed, we still set it as the input structure of the restart calculation. Even though Quantum ESPRESSO will restart from the `.save` directory and hence ignore the structure information in the input file, this at least makes the structure in the provenance consistent with the one run by Quantum ESPRESSO.

mbercx · 2025-01-17T10:14:34Z

Adding such an error handler raises more issues that will slow down this PR, and I think this is still a vast improvement versus the status quo. So let's merge this and open a new issue/PR to deal with the charge is wrong error.

I am a bit worried this will happen a lot more now with these full restarts, but I guess we'll find out. ^^

bastonero · 2025-01-17T10:16:32Z

i am still puzzled that this error was thrown when doing the full restart. but as you say, let's see what happens in the long run, as indeed there is no general solution to that charge is wrong error AFAIK.

mbercx · 2025-01-17T10:19:08Z

as indeed there is no general solution to that charge is wrong error AFAIK.

Yeah, it was only one test case, and I can't reproduce it. If restarting fully is really an issue, I guess we'll notice quickly. :)

mbercx · 2025-01-17T10:28:51Z

Strange, the docs build passed, but it's not updating here. I'll just go ahead and merge.

bastonero approved these changes Jan 16, 2025

View reviewed changes

mbercx force-pushed the fix/968/restart-walltime branch from 6299c46 to cfdfc51 Compare January 17, 2025 10:13

mbercx merged commit fcb8da9 into aiidateam:main Jan 17, 2025
13 checks passed

mbercx deleted the fix/968/restart-walltime branch January 17, 2025 10:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`PwBaseWorkChain`: Always do full restart for `ERROR_OUT_OF_WALLTIME` #1012

`PwBaseWorkChain`: Always do full restart for `ERROR_OUT_OF_WALLTIME` #1012

mbercx commented Feb 23, 2024

mbercx commented Feb 23, 2024

bastonero commented Mar 25, 2024

mbercx commented Mar 27, 2024

bastonero commented Mar 27, 2024

bastonero left a comment

mbercx commented Jan 17, 2025

mbercx commented Jan 17, 2025 •

edited

Loading

mbercx commented Jan 17, 2025

bastonero commented Jan 17, 2025

mbercx commented Jan 17, 2025

mbercx commented Jan 17, 2025

PwBaseWorkChain: Always do full restart for ERROR_OUT_OF_WALLTIME #1012

PwBaseWorkChain: Always do full restart for ERROR_OUT_OF_WALLTIME #1012

Conversation

mbercx commented Feb 23, 2024

mbercx commented Feb 23, 2024

bastonero commented Mar 25, 2024

mbercx commented Mar 27, 2024

bastonero commented Mar 27, 2024

bastonero left a comment

Choose a reason for hiding this comment

mbercx commented Jan 17, 2025

mbercx commented Jan 17, 2025 • edited Loading

mbercx commented Jan 17, 2025

bastonero commented Jan 17, 2025

mbercx commented Jan 17, 2025

mbercx commented Jan 17, 2025

`PwBaseWorkChain`: Always do full restart for `ERROR_OUT_OF_WALLTIME` #1012

`PwBaseWorkChain`: Always do full restart for `ERROR_OUT_OF_WALLTIME` #1012

mbercx commented Jan 17, 2025 •

edited

Loading