
Job cannot be restarted from UI in 1.6.2 #18547

Closed
luhhujbb opened this issue Sep 21, 2023 · 21 comments · Fixed by #18621


luhhujbb commented Sep 21, 2023

Nomad version

1.6.2

Operating system and Environment details

Ubuntu 22.04

Issue

Restarting a job from the UI fails.

Reproduction steps

Say the job is at version 4:

  • Stop the job: this creates a new version of the job (version 5) with the Stop flag set to true.
  • Try to restart it: the UI attempts to start the previously created version, i.e. version 5, with the Stop flag set to true.
    The job doesn't restart.

When attempting to restart the job, an API call is made to https://nomad/v1/job/my-job/submission?version=5, which returns a 404.
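The failing lookup can be modeled with a toy version of the server-side check (illustrative Python, not Nomad's actual code): the submission endpoint only finds something if the job's original source was stored at register time, which, as later comments in this thread suggest, is not the case for this job.

```python
# Toy model of GET /v1/job/<id>/submission?version=N.
# Assumption (hypothetical data shape): a map from job version to the
# original HCL source, which is empty when the source was never stored.
submissions = {}  # version -> original HCL source

def get_submission(version):
    """Return an (http_status, body) pair, like the submission endpoint."""
    if version not in submissions:
        return 404, "job source not found"
    return 200, submissions[version]

print(get_submission(5))  # the stop-created version 5 has no stored source
```

Under this model, every version of the job misses the lookup, which matches the 404 reported above.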

Expected Result

The job should start using version 4 of the job. Arguably there is no need to create a new version of the job just to change the Stop flag.

Actual Result

A new version of the job is created with the Stop flag set to true, preventing the job from starting.

Workaround

Go to the Versions tab and revert the job to version 4, where the Stop flag is set to false. This workaround is not always possible, since the version to revert to is not always present in the UI.

luhhujbb (Contributor Author) commented:

cf #18536 (comment)


Rumbles commented Sep 26, 2023

I've just hit this issue as well on 1.6.2: I stopped 4 jobs, planning on starting them again, but they won't start.

I had never noticed a new version being created on stopping a job, but that is what happens here. After stopping the job, clicking Start does nothing (and there is no CLI alternative if you don't have the nomad.hcl file ready for your CLI to use, as described in #18558), and the UI tells you nothing about why the job hasn't started.

I tried to work around it by reverting to the previous version, and that seems to work.


tgross commented Sep 26, 2023

Hi @Rumbles! See my comment here: #18558 (comment)

@tgross closed this as not planned on Sep 26, 2023

Rumbles commented Sep 26, 2023

@tgross this should not be closed. The issue is: if you stop a job and then click Start, you see something spin, then nothing happens.

This is new behaviour; Nomad used to let you start the job this way. I have a group of jobs I would regularly restart in this way, and now it doesn't work.

Now you have to find the last version and revert to it in the UI; clicking Start doesn't start the job like it should.


tgross commented Sep 26, 2023

@Rumbles if you can produce a minimal repro that demonstrates you're not seeing what I've described in #18558 (comment), and that the behavior is specifically different in the UI vs the CLI, then go ahead and I'll be happy to re-open this.


Rumbles commented Sep 26, 2023

@tgross the behaviour in the UI isn't the same as the CLI. In the UI, after pressing the Stop button, it changes to Start. That button used to let you start the job back up; it doesn't work any more.

What more info do you need?

luhhujbb (Contributor Author) commented:

@tgross this issue is specific to the UI. It is no longer possible to do a basic stop/start in the Nomad UI; it has nothing to do with the CLI.

Reproduction steps are quite simple:
In the UI, go to a job and, in the Overview tab, stop it:

  1. Screenshot from 2023-09-26 18-36-55
  2. Screenshot from 2023-09-26 18-37-05
  3. Screenshot from 2023-09-26 18-37-12
  4. Screenshot from 2023-09-26 18-37-17
  5. Screenshot from 2023-09-26 18-37-29

We get stuck at steps 4/5 and nothing happens. The job is never restarted; the only way out is to go to the Versions tab.


tgross commented Sep 26, 2023

What more info do you need?

I want the job you're running, because there are a half dozen other issues opened recently where people are saying "I can't restart": one of them is trying to restart a batch job (which never restarts), another is actually trying to restart individual allocations, another is asking for a new feature that didn't previously exist, another is actually the result of HCL variables, and no one in this issue has a specific repro. That's ok if you don't want to provide a minimal repro; we'll just have to start from zero, and that means it'll take that much longer to decide whether there's even anything here.

I'm going to reopen this and mark it for investigation. I'm not sure what the Start button actually did previously in terms of the HTTP API call, but if it's a regression then that's a bug.

@tgross reopened this on Sep 26, 2023


Rumbles commented Sep 27, 2023

I have just had a look at one of our Nomad clusters, and it looks like all the jobs we have defined there do the same thing: if you stop them, the button changes to say Start, but it no longer works. Some are raw_exec jobs, some are Docker jobs, but for every job I have tried, if I stop it, I cannot start it by pressing Start; I have to go to Versions and revert to the previous version. Let me know if you need more info to go on, @tgross, but it seems fairly easy to reproduce, and it's quite a big issue that needs fixing quickly.

philrenaud (Contributor) commented:

A few questions, @Rumbles / @luhhujbb, that will help us diagnose this a bit better:

  • Could you confirm that the jobs are service jobs?
  • Are you running them via Nomad Pack?
  • Is initial submission of the job happening in the UI as well, or via CLI/API?
  • Are you using HCL variables in your jobspec?


Rumbles commented Sep 27, 2023

Hi @philrenaud, I can confirm:

  • some are service jobs, but my batch and system jobs exhibit the same symptoms
  • no idea what that is
  • initial submission is via Terraform
  • yes, we use HCL variables in templates (e.g. fetching Vault or Consul K/Vs with code like {{ with secret "secret_v2/data/autoscaler" }}{{ .Data.data.secret }}{{ end }})


fernandomora commented Sep 27, 2023

After stopping the job in the UI and then starting it again, the UI sends a request to:

GET /v1/job/${service}/submission?version=${version}

And the response returned is:

404 - Not Found

job source not found


luhhujbb commented Sep 27, 2023

Hello @philrenaud,

  • This issue occurs for all job types (system/service/batch)
  • no usage of Nomad Pack
  • jobs are initially deployed either via Terraform (system jobs) or via the CLI (service/batch jobs) through our CI
  • no usage of HCL variables

I notice a new version is created with the Stop flag set to true:
Job: "telegraf" +/- Stop: "false" => "true" Task Group: "telegraf" Task: "telegraf"

The Start button then tries to start this version, but the call
GET /v1/job/${service}/submission?version=${version}
returns a 404, as already mentioned.

I think there is no need to create a version with Stop set to true, since the Stop flag is mostly a state, not part of the job's specification.


tgross commented Sep 27, 2023

I think there is no need to create a version with Stop set to true, since Stop flag is mostly a state and not a specification of the job.

See my comment in #18558 (comment) that describes why this is.

luhhujbb (Contributor Author) commented:

I think there is no need to create a version with Stop set to true, since Stop flag is mostly a state and not a specification of the job.

See my comment in #18558 (comment) that describes why this is.

I've read it, but stopped jobs should not be evaluated by the scheduler, since they are stopped. Scheduling is actually the most complex process (here I must give great thanks to the HashiCorp teams, since it works very well); nevertheless, stopped jobs should not interfere in this process.

So I don't think this breaks the idempotency of evaluation, since a stopped job shouldn't be evaluated at all.

The Stop flag should just indicate whether the job is scheduled or not. If set to true, no evaluation or scheduling should be done. If set to false, the job is submitted to the scheduler to be run.

And a personal thought: using nomad run is maybe confusing. Some people may think that if you run a job with Stop set to false, it will be run, because that's just what the word means. The API call seems to be a submission, so maybe the CLI should move to the word submit, which is more accurate to what is really done. When I submit a periodic batch job, it is in fact not run; it is submitted to the scheduler to be run periodically. But this is out of the scope of this issue.

To focus again on the issue: the Start button should revert to the most recent version with the Stop flag set to false, if I understand correctly. For now, this is the regression that needs to be fixed.
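The behavior suggested above, reverting to the most recent version whose Stop flag is false, can be sketched like this (illustrative Python, not the UI's actual Ember code; the dict shape is an assumption loosely based on the job versions API):

```python
def version_to_start(versions):
    """Return the most recent job version that is not a stop, or None.

    `versions` is a list of dicts with 'Version' and 'Stop' keys
    (hypothetical shape, for illustration only).
    """
    candidates = [v for v in versions if not v["Stop"]]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["Version"])

versions = [
    {"Version": 3, "Stop": False},
    {"Version": 4, "Stop": False},
    {"Version": 5, "Stop": True},   # the version created by the stop itself
]
print(version_to_start(versions)["Version"])  # 4
```

With the repro from this issue, this picks version 4 rather than the stop-created version 5.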

Many thanks.


tgross commented Sep 27, 2023

The Stop flag should just indicate whether the job is scheduled or not. If set to true, no evaluation or scheduling should be done. If set to false, the job is submitted to the scheduler to be run.

The scheduler is what submits the plan to tell all the allocations to stop. No decisions about the life of an allocation get made without going through the scheduler, so that there's a single source of truth. This is why you can do nomad job stop -detach and the job will still be shut down in an orderly fashion. So without the stop flag being set on a new version of the job and an evaluation being processed, none of the allocations would stop when we stop the job. You can see this if you stop a job at the command line; afterwards there'll be an evaluation for the deregister:

$ nomad eval list
ID        Priority  Triggered By    Job ID  Namespace  Node ID  Status    Placement Failures
6cb1539b  50        job-deregister  httpd   default    <none>   complete  false
b0abb15f  50        job-register    httpd   default    <none>   complete  false

To focus again on the issue: the Start button should revert to the most recent version with the Stop flag set to false, if I understand correctly. For now, this is the regression that needs to be fixed.

Correct.

As far as the Read Job Submission API call you're seeing, that has to do with the feature added in a recent version of Nomad where you can edit the HCL in the UI. To support that, we had to start storing the original HCL in the state store (we were throwing it away previously); the /v1/job/${id}/submission API is how the UI queries for that.

Because you're getting a 404 on that, either we've dropped the source somehow or the job existed before the new feature was added. My guess is the latter, because the start/stop workflow in the UI you're reporting works just fine for me with a fresh job. Or it could be that the job was created before 1.6.2, which included a patch for this feature: #18120. I was under the impression we're supposed to be handling that 404 gracefully though, so there could be another bug lurking in this same area.

luhhujbb (Contributor Author) commented:

I think you are mixing two concepts:

  • a job specification (with its related version)
  • a job state (with its related version in the scheduler)

A new job specification version implies a new job state.
The submission of a job specification to the scheduler is absolutely not idempotent, since allocation IDs are unique. It's easy to see this using Terraform: the Terraform resource takes a jobspec and commits the job state (including allocation IDs) to the tfstate. If a hardware failure occurs while the job has running allocations, new allocations will be created (i.e. the job state changes and the tfstate is updated), but the specification doesn't change.

So a new version of the job state doesn't imply a new version of the job specification.

I understand that it has been decided that the Stop flag is part of the specification, but this is not obvious, since stop/start is either an action or a state (started/stopped).

Concerning the issue: I've stopped a job, purged it, run a system garbage collect, and recreated it from scratch. Restart still doesn't work. The job is not in the default namespace.

Hope this helps.


tgross commented Sep 28, 2023

@luhhujbb while one could model job states that way, I'm explaining to you what the data model actually is, in hopes that you can use that knowledge to be a more skillful operator of Nomad. I'm sure you understand that we're unlikely to change the data model at this point.


ghthor commented Sep 28, 2023

@tgross The reason jobs can be stopped but then not started again is that the UI now attempts to use the submitted source file, which is a new feature in 1.6.

These are the situations that will cause the Nomad 1.6.* UI to be unable to start a job:

  1. the job was submitted to the cluster before the cluster was upgraded to 1.6.*
  2. the job is submitted to the cluster with an older version of the CLI tool that doesn't send the new Source to the register API call
  3. the job is submitted directly via the API and doesn't provide the Source
  4. the job is submitted using an SDK and doesn't provide the Source

https://github.com/hashicorp/nomad/pull/18120/files#diff-3efda46884f2e09893ae51fb4c250902d408ce40faf86d52048ae60e0094023fL74

This code in the UI needs to fall back to the previous behavior, where it used the job's definition to perform the restart.

So the root issue is: if the job doesn't have a Source submitted, the UI will not be able to start it.
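The fallback described above, preferring the stored submission source but falling back to the job definition on a 404, might look roughly like this (a sketch in Python rather than the UI's actual Ember code; function names and shapes are hypothetical):

```python
def fetch_spec_for_start(get_submission, get_definition, job_id, version):
    """Prefer the originally submitted source; fall back to the definition.

    `get_submission` and `get_definition` stand in for the HTTP calls
    GET /v1/job/<id>/submission?version=N and GET /v1/job/<id>.
    """
    status, source = get_submission(job_id, version)
    if status == 200:
        return source
    # Jobs registered before 1.6, or via older CLIs/SDKs, have no stored
    # source: fall back to the job definition, as the UI did before 1.6.
    return get_definition(job_id)

# Toy stand-ins for the two endpoints:
def no_source(job_id, version):
    return 404, None

def definition(job_id):
    return {"ID": job_id, "Stop": False}

print(fetch_spec_for_start(no_source, definition, "telegraf", 5))
```

This mirrors the change that ultimately fixed the issue (#18621), in spirit if not in detail.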

philrenaud (Contributor) commented:

@ghthor that's exactly it, yes. A change is in the works.

@philrenaud self-assigned this on Sep 29, 2023