Job cannot be restarted from UI in 1.6.2 #18547
Comments
I've just hit this issue as well. Using 1.6.2, I stopped 4 jobs, planning to start them again, but they won't start. I haven't previously noticed a new version being created when stopping a job, but in this case one is. After stopping the job, clicking Start does nothing and gives no indication of why the job hasn't started (and there is no CLI alternative if you don't have the nomad.hcl file ready for your CLI to use, as described in #18558). I tried to work around this by reverting to the previous version, and that seems to work. |
Hi @Rumbles! See my comment here: #18558 (comment) |
@tgross this should not be closed. The issue is that if you stop a job and then click Start, you see something spin, then nothing happens. This is new behaviour; Nomad used to let you start the job this way. I have a group of jobs I would regularly restart in this way, and now it doesn't work. Now you have to find the last version and revert in the UI; clicking Start doesn't start the job like it should. |
@Rumbles if you can produce a minimal repro that demonstrates you're not seeing what I've described in #18558 (comment), and that the behavior is specifically different in the UI vs the CLI, then go ahead and I'll be happy to re-open this. |
@tgross the behaviour in the UI isn't the same as the cli, in the UI after pressing the stop button, it changes to start. This button used to allow you to start the job back up. It doesn't work any more. What more info do you need? |
@tgross this issue is specific to the UI. It is no longer possible to do a basic stop/start in the Nomad UI; it has nothing to do with the CLI. Reproduction steps are quite simple: We are stuck in step 4/5, and nothing happens. The job is never restarted; the only way is to go to the Versions tab. |
I want the job you're running because there's a half dozen other issues open recently where people are saying "I can't restart", and one of them is trying to restart a batch job (which never restarts), another one is actually trying to restart individual allocations, another one is someone asking for a new feature that didn't previously exist, another one is actually a result of HCL variables, and no one in this issue has a specific repro. But that's ok if you don't want to provide a minimal repro, we'll just have to start from zero and that means it'll take that much longer to decide whether there's even anything here. I'm going to reopen this and mark it for investigation. I'm not sure what the Start button actually did previously in terms of the HTTP API call, but if it's a regression then that's a bug. |
If you could define 'repro' then it might help get you the information
you're after
|
I have just had a look at one of our Nomad clusters and it looks like all the jobs we have defined there do the same thing: if you stop them, the button changes to say Start but it no longer works. Some are raw_exec, some are docker jobs, but for every job I have tried, if I stop it, I cannot start it by pressing Start; I have to go to Versions and revert to the previous version. Let me know if you need more info to go on @tgross, but it seems to be fairly easy to reproduce and quite a big issue that needs fixing quickly. |
Hi @philrenaud , I can confirm:
|
After stopping the job in the UI and then starting again it sends a request to: GET And the response returned is:
|
Hello @philrenaud,
I notice a new version is created with the stop flag set to true: And the start button tries to start this version but the call I think there is no need to create a version with Stop set to true, since the Stop flag is mostly a state and not a specification of the job. |
See my comment in #18558 (comment) that describes why this is. |
I've read it, and stopped jobs should not be evaluated by the scheduler since they are stopped. Scheduling is actually the most complex process (here I have to give great thanks to the HashiCorp teams, since it works very well); nevertheless, stopped jobs should not interfere in this process. So I don't think it breaks the idempotence of evaluation, since the job shouldn't be evaluated at all. The stop flag should just indicate whether the job is scheduled or not. If set to true, no evaluation or scheduling should be done. If set to false, the job is submitted to the scheduler to be run. And a personal thought, using To focus again on the issue, the start button should revert to the most recent version with the stop flag set to false, if I understand correctly. And for now, this is the regression that needs to be fixed. Many thanks. |
The scheduler is what submits the plan to tell all the allocations to stop. No decisions about the life of an allocation get made without going thru the scheduler, so that there's a single source of truth. This is why you can do
Correct. As far as the Read Job Submission API call you're seeing, that has to do with the feature added in the recent version of Nomad where you can edit the HCL version in the UI. To support that, we had to start storing the original HCL in the state store (we were throwing it away previously), the Because you're getting a 404 on that, either we've dropped the source somehow or the job existed before the new feature was added. My guess is the latter because the start/stop workflow in the UI you're reporting is working just fine for me with a fresh job. Or it could be that the job was created before 1.6.2, which included a patch for this feature: #18120. I was under the impression we're supposed to be handling that 404 gracefully though, so there could be another bug lurking in this same area. |
I think you are mixing two concepts:
A new job specification version implies a new job state. So a new version of the job state doesn't imply a new version of the job specification. I understand that it has been decided that the stop flag is part of the specification, but that is not obvious, since stop/start is either an action or a state (started/stopped). Concerning the issue, I stopped a job, purged it, ran a system garbage collection, and recreated it from scratch. Restart doesn't work. The job is not in the default namespace. Hope this helps. |
@luhhujbb while one could model job states that way, I'm explaining to you what the data model actually is, in hopes that you can use that knowledge to be a more skillful operator of Nomad. I'm sure you understand that we're unlikely to change the data model at this point. |
@tgross The reason jobs can be STOPPED but then not STARTED is that the UI is now attempting to use the submitted source file, which is a new feature in 1.6. These are the situations that will cause the Nomad 1.6.* UI to be unable to start a job.
This code here in the UI code needs to perform a fallback to the previous behavior where it was using the definition of the job to perform the restart. So the root issue is if the job doesn't have the Source submitted, the UI will not be able to start the job. |
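
The fallback described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Nomad UI code; `fetch_submission`, `fetch_definition`, and `SubmissionNotFound` are hypothetical names standing in for the Read Job Submission call, the job definition lookup, and a 404 response.

```python
from typing import Callable, Dict


class SubmissionNotFound(Exception):
    """Hypothetical error standing in for a 404 from the submission endpoint."""


def definition_for_start(
    fetch_submission: Callable[[], Dict],
    fetch_definition: Callable[[], Dict],
) -> Dict:
    """Decide what to resubmit when Start is clicked on a stopped job."""
    try:
        # Preferred path: the stored source, if this job version has one.
        return {"kind": "source", "payload": fetch_submission()}
    except SubmissionNotFound:
        # Fallback: reuse the job definition and clear the stop flag.
        job = dict(fetch_definition())
        job["Stop"] = False
        return {"kind": "definition", "payload": job}


def _missing_source() -> Dict:
    # Simulates a job that has no stored source for the requested version.
    raise SubmissionNotFound("no stored source for this version")


result = definition_for_start(_missing_source, lambda: {"ID": "my-job", "Stop": True})
print(result["kind"], result["payload"]["Stop"])
```

Injecting the two lookups as callables keeps the sketch independent of any HTTP client; the point is only that a missing source should not be a dead end.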
@ghthor that's exactly it, yes. Change in the works. |
Nomad version
1.6.2
Operating system and Environment details
Ubuntu 22.04
Issue
Restarting a job from the UI fails.
Reproduction steps
Say the job version is 4,
The job doesn't restart
When attempting to restart the job, an API call is made: https://nomad/v1/job/my-job/submission?version=5 which returns a 404.
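
To compare versions by hand, the URL for that call can be built as in the sketch below. The helper name `submission_url` is hypothetical, and the optional `namespace` query parameter is an assumption based on Nomad's usual API conventions (relevant when the job is not in the default namespace).

```python
from typing import Optional
from urllib.parse import quote, urlencode


def submission_url(base: str, job_id: str, version: int,
                   namespace: Optional[str] = None) -> str:
    """Build the job submission URL for one job version (hypothetical helper)."""
    params = {"version": version}
    if namespace:
        # Assumed: the standard namespace query parameter.
        params["namespace"] = namespace
    path = f"/v1/job/{quote(job_id, safe='')}/submission"
    return f"{base.rstrip('/')}{path}?{urlencode(params)}"


print(submission_url("https://nomad", "my-job", 5))
```

Requesting the same URL with `version=4` instead of `version=5` would show whether any stored source exists for the version that was running before the stop.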
Expected Result
The job should start using version 4 of the job. Maybe there is no need to create a new version of the job when changing the stop flag.
Actual Result
A new version of the job is created with the stop flag set to true, preventing the job from starting.
Workaround
Go to the Versions tab and revert the job to version 4, where the Stop flag is set to false. The workaround is not always possible, since the new version is not always present in the UI.
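
The choice of version to revert to can be sketched as below, assuming a version list shaped like the job versions API output, where each entry carries `Version` and `Stop` fields. The helper names are hypothetical, and the payload shape (`JobID`, `JobVersion`) is assumed to match what the job revert endpoint expects.

```python
from typing import Dict, List, Optional


def latest_runnable_version(versions: List[Dict]) -> Optional[int]:
    """Return the highest version number whose Stop flag is not set."""
    candidates = [v["Version"] for v in versions if not v.get("Stop", False)]
    return max(candidates) if candidates else None


def revert_payload(job_id: str, version: int) -> Dict:
    """Body for a revert request (assumed field names)."""
    return {"JobID": job_id, "JobVersion": version}


# Example data mirroring this report: version 5 was created by the stop.
versions = [
    {"Version": 5, "Stop": True},
    {"Version": 4, "Stop": False},
    {"Version": 3, "Stop": False},
]
target = latest_runnable_version(versions)
if target is not None:
    print(revert_payload("my-job", target))
```

With the data above, the selected target is version 4, matching the manual workaround.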