Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scaling policy created with wrong namespace if namespace is not defined in job file #24039

Closed
eduardolmedeiros opened this issue Sep 23, 2024 · 5 comments · Fixed by #24065
Closed
Assignees
Labels
theme/autoscaling Issues related to supporting autoscaling type/bug

Comments

@eduardolmedeiros
Copy link
Contributor

eduardolmedeiros commented Sep 23, 2024

The nomad-autoscaler appears to have an issue with creating scaling policies when the namespace is not hardcoded in the job file. Even when specifying the namespace using the -namespace flag during job deployment, the autoscaler fails to create the corresponding scaling policy.

Steps to Reproduce

  1. Create a job file (e.g., http-echo.nomad.hcl) without a hardcoded namespace.
job "http-echo" {
  datacenters = ["dc1"]
  type        = "service"

  group "apps" {
    count = 1

    network {
      port "http-echo" {
        to = 8081
      }
    }

    scaling {
      min     = 1
      max     = 4
      enabled = true
      policy {
        cooldown            = "3m"
        evaluation_interval = "1m"
        check "avg_cpu" {
          source       = "nomad-apm"
          query        = "avg_cpu-allocated"
          query_window = "1m"
          strategy "target-value" {
            target = 5
          }
        }
        check "avg_memory" {
          source       = "nomad-apm"
          query        = "avg_memory-allocated"
          query_window = "1m"
          strategy "target-value" {
            target = 70
          }
        }
      }
    }

    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    task "http-echo" {
      driver       = "docker"
      kill_timeout = "30s"

      resources {
        cpu    = 250
        memory = 200
      }

      config {
        image = "hashicorp/http-echo:latest"
        ports = ["http-echo"]
        args = [
          "-listen", ":8081",
          "-text", "Hello! This is an autoscaling sample job based on Nomad metrics."
        ]
      }

      service {
        name = "http-echo"
        port = "http-echo"
        tags = ["http-echo", "http"]

        check {
          name     = "http-echo port alive"
          type     = "http"
          path     = "/health"
          interval = "5s"
          timeout  = "5s"
        }
      }
    }
  }
}
  1. Deploy the job using the following command:
nomad job run -namespace=test http-echo.nomad.hcl

or

export NOMAD_NAMESPACE=test
nomad job run http-echo.nomad.hcl
  1. Check for scaling policies in the specified namespace:
nomad scaling policy list -namespace=test

Expected Behavior

The nomad-autoscaler should create a scaling policy for the job in the specified namespace.

Actual Behavior

No scaling policies are created. The command nomad scaling policy list -namespace=test returns "No policies found".

Environment

  • Nomad version: 1.8.4
  • nomad-autoscaler version: 0.4.5
  • Operating System: Oracle Linux 9

Additional Information

  • This issue occurs when the namespace is not hardcoded in the job file but specified using the -namespace flag during deployment.
  • It's unclear if this is a bug in nomad-autoscaler or if there's a specific configuration required to handle dynamically specified namespaces.

Possible Solutions

  • Investigate if nomad-autoscaler needs to be updated to recognize namespaces specified via command-line flags.
  • Check if there's a configuration option in nomad-autoscaler that needs to be set to handle dynamic namespaces.
@jrasell
Copy link
Member

jrasell commented Sep 23, 2024

Hi @eduardolmedeiros and thanks for raising this issue. The command and application in question is Nomad as the Nomad Autoscaler doesn't create the scaling policies; this happens during the Nomad job registration phase. I'll move this into the Nomad repository.

@jrasell jrasell transferred this issue from hashicorp/nomad-autoscaler Sep 23, 2024
@tgross
Copy link
Member

tgross commented Sep 24, 2024

Hi @eduardolmedeiros! I was able to reproduce this. It looks like scaling policy is being created, but the target prefix doesn't include the correct namespace:

$ nomad job run -namespace prod ./hello.nomad.hcl
...

$ nomad scaling policy list
ID        Enabled  Type        Target
0ad1ed51  true     horizontal  Namespace:default,Job:hello,Group:web

The policy gets created in the same Raft entry as the job. But at a quick glance it doesn't look like we're including the job's namespace when inserting them into the state store. I'll investigate further and report back.

@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Sep 24, 2024
@tgross tgross self-assigned this Sep 24, 2024
@tgross tgross added theme/autoscaling Issues related to supporting autoscaling type/bug labels Sep 24, 2024
@tgross tgross changed the title nomad-autoscaler does not create scaling policy when namespace is not defined in job file scaling policy created with wrong namespace if namespace is not defined in job file Sep 24, 2024
@eduardolmedeiros
Copy link
Contributor Author

Hi @eduardolmedeiros! I was able to reproduce this. It looks like scaling policy is being created, but the target prefix doesn't include the correct namespace:

$ nomad job run -namespace prod ./hello.nomad.hcl
...

$ nomad scaling policy list
ID        Enabled  Type        Target
0ad1ed51  true     horizontal  Namespace:default,Job:hello,Group:web

The policy gets created in the same Raft entry as the job. But at a quick glance it doesn't look like we're including the job's namespace when inserting them into the state store. I'll investigate further and report back.

Hi @tgross , thank you for investigating this issue. Please let me know if you need any additional information or assistance from me.

@tgross
Copy link
Member

tgross commented Sep 25, 2024

Ok, I've got the bug root-caused but I don't yet have a fix that I can be sure won't break somewhere else. Normally when we have a piece of data that's incomplete in an object we get from the jobspec, we mutate the object either in the RPC handler or the state store code, which both have access to the namespace we need. But for some reason when scaling policies were written, we do this mutation to add the missing namespace content in the conversion from an HTTP API struct to a RPC struct. That's not how we normally do this kind of thing and those are supposed to be pretty dumb conversions.

The fix is to move that into the state store code, but I need to make sure that nothing is consuming the mutated policy object along the way so we don't introduce any new bugs in the process, particularly around upgrades.

@tgross
Copy link
Member

tgross commented Sep 25, 2024

I've got a quick hack here that in my testing fixes the bug: #24065 But I do want to move this code into the state store anyways to avoid any lurking similar problems.

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Sep 25, 2024
tgross added a commit that referenced this issue Sep 27, 2024
When jobs are submitted with a scaling policy, the scaling policy's target only
includes the job's namespace if the `namespace` field is set in the jobspec and
not from the request. Normally jobs are canonicalized in the RPC handler before
being written to Raft. But the scaling policy targets are instead written during
the conversion from `api.Job` to `structs.Job`. We populate the `structs.Job`
namespace from the request here as well, but only after the conversion has
occurred. Swap the order of these operations so that the conversion is always
happening with a correct namespace.

Fixes: #24039
@tgross tgross closed this as completed in 5e1ad14 Oct 1, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Oct 1, 2024
tgross added a commit that referenced this issue Oct 1, 2024
…24065)

When jobs are submitted with a scaling policy, the scaling policy's target only
includes the job's namespace if the `namespace` field is set in the jobspec and
not from the request. Normally jobs are canonicalized in the RPC handler before
being written to Raft. But the scaling policy targets are instead written during
the conversion from `api.Job` to `structs.Job`. We populate the `structs.Job`
namespace from the request here as well, but only after the conversion has
occurred. Swap the order of these operations so that the conversion is always
happening with a correct namespace.

Long-term we should not be making mutations during conversion either. But we
can't remove it immediately because API requests may come from any agent across
upgrades. Move the scaling target creation into the `Canonicalize` method and
mark it for future removal in the API conversion code path.

Fixes: #24039
tgross added a commit that referenced this issue Oct 1, 2024
…in jobspec (#24065) (#24096)

When jobs are submitted with a scaling policy, the scaling policy's target only
includes the job's namespace if the `namespace` field is set in the jobspec and
not from the request. Normally jobs are canonicalized in the RPC handler before
being written to Raft. But the scaling policy targets are instead written during
the conversion from `api.Job` to `structs.Job`. We populate the `structs.Job`
namespace from the request here as well, but only after the conversion has
occurred. Swap the order of these operations so that the conversion is always
happening with a correct namespace.

Long-term we should not be making mutations during conversion either. But we
can't remove it immediately because API requests may come from any agent across
upgrades. Move the scaling target creation into the `Canonicalize` method and
mark it for future removal in the API conversion code path.

Fixes: #24039

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/autoscaling Issues related to supporting autoscaling type/bug
Projects
Development

Successfully merging a pull request may close this issue.

3 participants