
Job cannot be restarted from UI in 1.6.2 #18547

Closed
luhhujbb opened this issue Sep 21, 2023 · 21 comments · Fixed by #18621


luhhujbb commented Sep 21, 2023

Nomad version

1.6.2

Operating system and Environment details

Ubuntu 22.04

Issue

Restarting a job from the UI fails.

Reproduction steps

Say the job is at version 4:

  • Stop the job: this creates a new version of the job (version 5) with the Stop flag set to true.
  • Try to restart it: the UI attempts to start the previously created version, i.e. version 5, with the Stop flag set to true.
    The job doesn't restart.

When attempting to restart the job, an API call is made to https://nomad/v1/job/my-job/submission?version=5, which returns a 404.
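The failing lookup can be modeled with a toy version of the server-side check (illustrative Python, not Nomad's actual code): the submission endpoint only finds something if the job's original source was stored at register time, which, as later comments in this thread suggest, is not the case for this job.

```python
# Toy model of GET /v1/job/<id>/submission?version=N.
# Assumption (hypothetical data shape): a map from job version to the
# original HCL source, which is empty when the source was never stored.
submissions = {}  # version -> original HCL source

def get_submission(version):
    """Return an (http_status, body) pair, like the submission endpoint."""
    if version not in submissions:
        return 404, "job source not found"
    return 200, submissions[version]

print(get_submission(5))  # the stop-created version 5 has no stored source
```

Under this model, every version of the job misses the lookup, which matches the 404 reported above.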

Expected Result

The job should start using version 4 of the job. Arguably there is no need to create a new version of the job just to change the Stop flag.

Actual Result

A new version of the job is created with the Stop flag set to true, preventing the job from starting.

Workaround

Go to the Versions tab and revert the job to version 4, where the Stop flag is set to false. This workaround is not always possible, since the version to revert to is not always present in the UI.

luhhujbb (Contributor Author) commented:

cf #18536 (comment)


Rumbles commented Sep 26, 2023

I've just hit this issue as well on 1.6.2: I stopped 4 jobs, planning on starting them again, but they won't start.

I had never noticed a new version being created on stopping a job, but that is what happens here. After stopping the job, clicking Start does nothing (and there is no CLI alternative if you don't have the nomad.hcl file ready for your CLI to use, as described in #18558), and the UI tells you nothing about why the job hasn't started.

I tried to work around it by reverting to the previous version, and that seems to work.


tgross commented Sep 26, 2023

Hi @Rumbles! See my comment here: #18558 (comment)

@tgross closed this as not planned on Sep 26, 2023

Rumbles commented Sep 26, 2023

@tgross this should not be closed. The issue is: if you stop a job and then click Start, you see something spin, then nothing happens.

This is new behaviour; Nomad used to let you start the job this way. I have a group of jobs I would regularly restart in this way, and now it doesn't work.

Now you have to find the last version and revert to it in the UI; clicking Start doesn't start the job like it should.


tgross commented Sep 26, 2023

@Rumbles if you can produce a minimal repro that demonstrates you're not seeing what I've described in #18558 (comment), and that the behavior is specifically different in the UI vs the CLI, then go ahead and I'll be happy to re-open this.


Rumbles commented Sep 26, 2023

@tgross the behaviour in the UI isn't the same as the CLI. In the UI, after pressing the Stop button, it changes to Start. That button used to let you start the job back up; it doesn't work any more.

What more info do you need?

luhhujbb (Contributor Author) commented:

@tgross this issue is specific to the UI. It is no longer possible to do a basic stop/start in the Nomad UI; it has nothing to do with the CLI.

Reproduction steps are quite simple:
In the UI, go to a job and, in the Overview tab, stop it:

  1. Screenshot from 2023-09-26 18-36-55
  2. Screenshot from 2023-09-26 18-37-05
  3. Screenshot from 2023-09-26 18-37-12
  4. Screenshot from 2023-09-26 18-37-17
  5. Screenshot from 2023-09-26 18-37-29

We get stuck at steps 4/5 and nothing happens. The job is never restarted; the only way out is to go to the Versions tab.


tgross commented Sep 26, 2023

What more info do you need?

I want the job you're running, because there are a half dozen other issues opened recently where people are saying "I can't restart": one of them is trying to restart a batch job (which never restarts), another is actually trying to restart individual allocations, another is asking for a new feature that didn't previously exist, another is actually the result of HCL variables, and no one in this issue has a specific repro. That's ok if you don't want to provide a minimal repro; we'll just have to start from zero, and that means it'll take that much longer to decide whether there's even anything here.

I'm going to reopen this and mark it for investigation. I'm not sure what the Start button actually did previously in terms of the HTTP API call, but if it's a regression then that's a bug.

@tgross reopened this on Sep 26, 2023


Rumbles commented Sep 27, 2023

I have just had a look at one of our Nomad clusters, and it looks like all the jobs we have defined there do the same thing: if you stop them, the button changes to say Start, but it no longer works. Some are raw_exec jobs, some are Docker jobs, but for every job I have tried, if I stop it, I cannot start it by pressing Start; I have to go to Versions and revert to the previous version. Let me know if you need more info to go on, @tgross, but it seems fairly easy to reproduce, and it's quite a big issue that needs fixing quickly.

philrenaud (Contributor) commented:

A few questions, @Rumbles / @luhhujbb, that will help us diagnose this a bit better:

  • Could you confirm that the jobs are service jobs?
  • Are you running them via Nomad Pack?
  • Is initial submission of the job happening in the UI as well, or via CLI/API?
  • Are you using HCL variables in your jobspec?


Rumbles commented Sep 27, 2023

Hi @philrenaud, I can confirm:

  • some are service jobs, but my batch and system jobs exhibit the same symptoms
  • no idea what that is
  • initial submission is via Terraform
  • yes, we use HCL variables in templates (e.g. fetching Vault or Consul K/Vs with code like {{ with secret "secret_v2/data/autoscaler" }}{{ .Data.data.secret }}{{ end }})


fernandomora commented Sep 27, 2023

After stopping the job in the UI and then starting it again, the UI sends a request to:

GET /v1/job/${service}/submission?version=${version}

And the response returned is:

404 - Not Found

job source not found


luhhujbb commented Sep 27, 2023

Hello @philrenaud,

  • This issue occurs for all job types (system/service/batch)
  • no usage of Nomad Pack
  • jobs are initially deployed either via Terraform (system jobs) or via the CLI (service/batch jobs) through our CI
  • no usage of HCL variables

I notice a new version is created with the Stop flag set to true:
Job: "telegraf" +/- Stop: "false" => "true" Task Group: "telegraf" Task: "telegraf"

The Start button then tries to start this version, but the call
GET /v1/job/${service}/submission?version=${version}
returns a 404, as already mentioned.

I think there is no need to create a version with Stop set to true, since the Stop flag is mostly a state, not part of the job's specification.


tgross commented Sep 27, 2023

I think there is no need to create a version with Stop set to true, since Stop flag is mostly a state and not a specification of the job.

See my comment in #18558 (comment) that describes why this is.

luhhujbb (Contributor Author) commented:

I think there is no need to create a version with Stop set to true, since Stop flag is mostly a state and not a specification of the job.

See my comment in #18558 (comment) that describes why this is.

I've read it, but stopped jobs should not be evaluated by the scheduler, since they are stopped. Scheduling is actually the most complex process (here I must give great thanks to the HashiCorp teams, since it works very well); nevertheless, stopped jobs should not interfere in this process.

So I don't think this breaks the idempotency of evaluation, since a stopped job shouldn't be evaluated at all.

The Stop flag should just indicate whether the job is scheduled or not. If set to true, no evaluation or scheduling should be done. If set to false, the job is submitted to the scheduler to be run.

And a personal thought: using nomad run is maybe confusing. Some people may think that if you run a job with Stop set to false, it will be run, because that's just what the word means. The API call seems to be a submission, so maybe the CLI should move to the word submit, which is more accurate to what is really done. When I submit a periodic batch job, it is in fact not run; it is submitted to the scheduler to be run periodically. But this is out of the scope of this issue.

To focus again on the issue: the Start button should revert to the most recent version with the Stop flag set to false, if I understand correctly. For now, this is the regression that needs to be fixed.
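The behavior suggested above, reverting to the most recent version whose Stop flag is false, can be sketched like this (illustrative Python, not the UI's actual Ember code; the dict shape is an assumption loosely based on the job versions API):

```python
def version_to_start(versions):
    """Return the most recent job version that is not a stop, or None.

    `versions` is a list of dicts with 'Version' and 'Stop' keys
    (hypothetical shape, for illustration only).
    """
    candidates = [v for v in versions if not v["Stop"]]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["Version"])

versions = [
    {"Version": 3, "Stop": False},
    {"Version": 4, "Stop": False},
    {"Version": 5, "Stop": True},   # the version created by the stop itself
]
print(version_to_start(versions)["Version"])  # 4
```

With the repro from this issue, this picks version 4 rather than the stop-created version 5.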

Many thanks.


tgross commented Sep 27, 2023

The Stop flag should just indicate whether the job is scheduled or not. If set to true, no evaluation or scheduling should be done. If set to false, the job is submitted to the scheduler to be run.

The scheduler is what submits the plan to tell all the allocations to stop. No decisions about the life of an allocation get made without going through the scheduler, so that there's a single source of truth. This is why you can do nomad job stop -detach and the job will still be shut down in an orderly fashion. So without the stop flag being set on a new version of the job and an evaluation being processed, none of the allocations would stop when we stop the job. You can see this if you stop a job at the command line; afterwards there'll be an evaluation for the deregister:

$ nomad eval list
ID        Priority  Triggered By    Job ID  Namespace  Node ID  Status    Placement Failures
6cb1539b  50        job-deregister  httpd   default    <none>   complete  false
b0abb15f  50        job-register    httpd   default    <none>   complete  false

To focus again on the issue: the Start button should revert to the most recent version with the Stop flag set to false, if I understand correctly. For now, this is the regression that needs to be fixed.

Correct.

As far as the Read Job Submission API call you're seeing, that has to do with the feature added in a recent version of Nomad where you can edit the HCL in the UI. To support that, we had to start storing the original HCL in the state store (we were throwing it away previously); the /v1/job/${id}/submission API is how the UI queries for that.

Because you're getting a 404 on that, either we've dropped the source somehow or the job existed before the new feature was added. My guess is the latter, because the start/stop workflow in the UI you're reporting works just fine for me with a fresh job. Or it could be that the job was created before 1.6.2, which included a patch for this feature: #18120. I was under the impression we're supposed to be handling that 404 gracefully though, so there could be another bug lurking in this same area.

luhhujbb (Contributor Author) commented:

I think you are mixing two concepts:

  • a job specification (with its related version)
  • a job state (with its related version in the scheduler)

A new job specification version implies a new job state.
The submission of a job specification to the scheduler is absolutely not idempotent, since allocation IDs are unique. It's easy to see this using Terraform: the Terraform resource takes a jobspec and commits the job state (including allocation IDs) to the tfstate. If a hardware failure occurs while the job has running allocations, new allocations will be created (i.e. the job state changes and the tfstate is updated), but the specification doesn't change.

So a new version of the job state doesn't imply a new version of the job specification.

I understand that it has been decided that the Stop flag is part of the specification, but this is not obvious, since stop/start is either an action or a state (started/stopped).

Concerning the issue: I've stopped a job, purged it, run a system garbage collect, and recreated it from scratch. Restart still doesn't work. The job is not in the default namespace.

Hope this helps.


tgross commented Sep 28, 2023

@luhhujbb while one could model job states that way, I'm explaining to you what the data model actually is, in hopes that you can use that knowledge to be a more skillful operator of Nomad. I'm sure you understand that we're unlikely to change the data model at this point.


ghthor commented Sep 28, 2023

@tgross The reason jobs can be stopped but then not started again is that the UI now attempts to use the submitted source file, which is a new feature in 1.6.

These are the situations that will cause the Nomad 1.6.* UI to be unable to start a job:

  1. the job was submitted to the cluster before the cluster was upgraded to 1.6.*
  2. the job is submitted to the cluster with an older version of the CLI tool that doesn't send the new Source to the register API call
  3. the job is submitted directly via the API and doesn't provide the Source
  4. the job is submitted using an SDK and doesn't provide the Source

https://github.com/hashicorp/nomad/pull/18120/files#diff-3efda46884f2e09893ae51fb4c250902d408ce40faf86d52048ae60e0094023fL74

This code in the UI needs to fall back to the previous behavior, where it used the job's definition to perform the restart.

So the root issue is: if the job doesn't have a Source submitted, the UI will not be able to start it.
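The fallback described above, preferring the stored submission source but falling back to the job definition on a 404, might look roughly like this (a sketch in Python rather than the UI's actual Ember code; function names and shapes are hypothetical):

```python
def fetch_spec_for_start(get_submission, get_definition, job_id, version):
    """Prefer the originally submitted source; fall back to the definition.

    `get_submission` and `get_definition` stand in for the HTTP calls
    GET /v1/job/<id>/submission?version=N and GET /v1/job/<id>.
    """
    status, source = get_submission(job_id, version)
    if status == 200:
        return source
    # Jobs registered before 1.6, or via older CLIs/SDKs, have no stored
    # source: fall back to the job definition, as the UI did before 1.6.
    return get_definition(job_id)

# Toy stand-ins for the two endpoints:
def no_source(job_id, version):
    return 404, None

def definition(job_id):
    return {"ID": job_id, "Stop": False}

print(fetch_spec_for_start(no_source, definition, "telegraf", 5))
```

This mirrors the change that ultimately fixed the issue (#18621), in spirit if not in detail.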

philrenaud (Contributor) commented:

@ghthor that's exactly it, yes. A change is in the works.

@philrenaud self-assigned this on Sep 29, 2023