Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UDP: define minimally required additional job options #545

Open
jdries opened this issue Sep 18, 2024 · 20 comments · May be fixed by #471
Open

UDP: define minimally required additional job options #545

jdries opened this issue Sep 18, 2024 · 20 comments · May be fixed by #471
Assignees

Comments

@jdries
Copy link

jdries commented Sep 18, 2024

Some UDP's depend on very specific job options to run them successfully. For instance:

"driver-memory": "4g",
"executor-memory": "1500m",
"python-memory": "5g",
"udf-dependency-archives": ["https://artifactory.vgt.vito.be/artifactory/auxdata-public/openeo/onnx_dependencies_1.16.3.zip#onnx_deps"]

Here, udf-dependency-archives is really mandatory.
As for the other options, these can usually be considered as lower bounds. (I don't immediately know of a case where an upper bound would be relevant.)

Here's an example, where I called it 'minimal_job_options':
https://github.com/ESA-APEx/apex_algorithms/blob/dd81a53463e5c913e09329dc02832e8db5a6350e/openeo_udp/worldcereal_inference.json#L1416

@m-mohr
Copy link
Member

m-mohr commented Sep 18, 2024

Quick note: Should probably be aligned with #471

@soxofaan
Copy link
Member

soxofaan commented Sep 19, 2024

about udf-dependency-archives: I think it even makes sense to attach this kind of info directly to the UDF, instead of indirectly associating it through the UDP that contains the UDF.

This is kind of related to the idea being brainstormed at Open-EO/openeo-processes#374 to upgrade the current "just a string" udf argument of run_udf to a richer construct (e.g. an array or object).

Another path could be to use the context argument of run_udf, currently documented as:

Additional data such as configuration options to be passed to the UDF.

The current interpretation is that the context is passed directly as argument to the entrypoint function of the UDF, but in a more relaxed interpretation it could also serve to define extra runtime settings like udf-dependency-archives

The nice things about using run_udf's context is that there is no change required at openEO API/processes specification level

@jdries
Copy link
Author

jdries commented Sep 19, 2024

@soxofaan indeed, but we still need to configure the other job options as well.

@m-mohr
Copy link
Member

m-mohr commented Sep 20, 2024

+1 on adding the UDF options to the run_udf call.

Adding job options to processes seems to mix concerns. The process could in principle also be executed in other modes, what happens then? What if I load the process into the Web Editor and change the extent to reasonable small or utterly large and execute it then? So my thinking is that adding such options to a process is not inline with the initial vision for openEO, especially when it's for CPU/mem consumption, which was always meant to be abstracted away. I think if job options are important and you want a job to be exeucted as a job, you need to actually schare the job metadata, not just the process, so pretty much a partial body for the POST /jobs request with process and the other additional properties.

Thinking about it a bit more now, could job options also be provided as process? For example:

  • configure_runtime(options) -> bool
  • configure_runtime({"driver-memory": "4g", "executor-memory": "1500m", ...})

Just a spontaneaous idea. Still somewhat mixing concerns, but doesn't need a spec change and is more visible to users. Thoughts?
(If that's fully embraced, we would not even need #471 as options could be described in the process parameter schema. But there's no distinction between job/sync/services unless we implement #429)

@soxofaan
Copy link
Member

Thinking about it a bit more now, could job options also be provided as process? For example:
configure_runtime(options) -> bool
configure_runtime({"driver-memory": "4g", "executor-memory": "1500m", ...})
Just a spontaneaous idea. Still somewhat mixing concerns, but doesn't need a spec change and is more visible to users. Thoughts?

This feels a bit too procedural/stateful to me and as such conflicts with the openEO concept of expressing your workflow as a graph of linked processing nodes. How would these configure_runtime nodes be connected to the other nodes of the graph? And related: how to resolve conflicts if there are multiple configure_runtime nodes active, not only because the user has put multiple, but additional ones indirectly pulled in by using UDPs.

@m-mohr
Copy link
Member

m-mohr commented Sep 22, 2024

That's a good question and there's no obvious solution yet. Having unconnected nodes is probably a bit of a hassle... On the other hand, adding job metadata to the process is also not very clean as pointed out above. So maybe it's really sharing jobs (i.e. job metadata) instead of processes in this case? Similar things could appear for web services, where you'd also don't want to add the metadata for the service creation to the process, but instead probably share web service metadata.

@m-mohr
Copy link
Member

m-mohr commented Oct 12, 2024

Any more thoughts/discussions?

@soxofaan
Copy link
Member

Because you asked 😄

Another consideration against something like configure_runtime nodes in the process graph:
the job options or runtime configuration is something the back-end wants to know before executing the process graph, while configure_runtime kind of implies you have to evaluate part of the process graph to get the configuration values, which muddles the separation of concerns here.
You could say that it is trivial to extract configure_runtime from a JSON doc, but I bet at some point users are going to want/expect to make the configure_runtime arguments dynamic, e.g. through UDP parameters, or even more difficult, dependent on spatio-temporal extent size of a cube, ... While this would be an interesting feature, this is a couple of orders of magnitude more complex (on level of API, client, backends, ...) than the original request here of standardizing on a field to put some configuration values.

@m-mohr
Copy link
Member

m-mohr commented Oct 14, 2024

I agree that adding it as a process graph is not ideal, so let's drop that idea.
But I also think that adding it to the process is also not ideal as it also mixes concerns and reduces portability as these options are not standardized. Also, then you have two places where these options could be.
So I was thinking why we don't just share jobs in this case? You only commented on the "in the process graph" part though, not on the other part of my comment and I was wondering about your thoughts.

@soxofaan
Copy link
Member

So maybe it's really sharing jobs (i.e. job metadata) instead of processes in this case?

I think there are quite some hurdles with this:

  • assuming you are talking about sharing GET /jobs/{job_d}:
    • this is a per-user endpoint, and there is no recommendation to have a publicly accessible variant of it, e.g. through signed/canonical URLs (like we have for batch job result related endpoints)
    • there is no standard to expose job options in the job metadata either
  • jobs can be removed or cleaned up automatically by the backend, so there is no guarantee on a long term reference.
  • likewise, if there would be some solution with signed URLs: these typically have a limited lifetime, which again undermines having a long term reference
  • sharing jobs implies a concrete back-end, while we are talking about remote process definitions that are not directly tied to a specifc deployment (e.g. job options for geopyspark-driver based backends, but not necessarily the VITO one, or some other deployment)

@m-mohr
Copy link
Member

m-mohr commented Oct 15, 2024

No, I meant to share what you send to POST /jobs, so just the JSON....

@soxofaan
Copy link
Member

soxofaan commented Oct 15, 2024

Well yes that's basically what we want to do, but instead of a concrete batch job with job options, we want it for a deployment-agnostic parameterized (remote) process with recommended/minimum job options

And to make this a bit more concrete:

for batch jobs you have

POST /jobs
{
    "process": {
        "process_graph": { ... }
    },
    "job_option_foo": "bar",
    ...
}

While UDPs and by consequence remote process definitions follow this format:

PUT /process_graphs/{process_graph_id}    # or GET  /process_graphs/{process_graph_id}
{
    "process_graph": { ... },
    "recommended_job_options_here": "?"
    ...
}

Note that with batch jobs we put job options at the top level, but UDPs start a level deeper, so the best we can do is to put it one level deeper

@m-mohr
Copy link
Member

m-mohr commented Oct 15, 2024

Thanks for the examples.

The title of the issue "define minimally required additional job options" is different to what you propose with "recommended/minimum job options"... so what do you actually want?

we want it for a deployment-agnostic parameterized (remote) process with recommended/minimum job options

Job options are usually backend specific, so that doesn't go together for me with "deployment-agnostic".

Note that with batch jobs we put job options at the top level, but UDPs start a level deeper, so the best we can do is to put it one level deeper

Nothing prevents to share a Job Creation JSON instead of a UDP...
More generally speaking: The job options were always meant as being niche, for some special cases. I don't like that they become crucial/so important now, this pretty much departs from the original openEO vision. That VITO depends so heavily on them seems like a shortcoming to me in how the system is set up. We always wanted it to be so that users don't need to worry too much about memory, CPU, but now they need to do it more and more. Not ideal, I think.

If we really want to make such options (memory, CPU, etc) a thing in openEO, they should be standardized across backends.

@soxofaan
Copy link
Member

The title of the issue "define minimally required additional job options" is different to what you propose with "recommended/minimum job options"... so what do you actually want?

I think @jdries and I mean the same with "minimally required" and "recommended/minimum" job options: job options that one is recommended to use, unless they know very well what they are doing.

Job options are usually backend specific, so that doesn't go together for me with "deployment-agnostic".

I intentionally used "deployment-agnostic" instead of "backend-agnostic", because the job options used in the examples here would apply to any backend deployment powered by the geopyspark-driver, so not just the "VITO" backend, but also CDSE (and consequently the CDSE federation too), other project-specific deployments we internally have, future deployments leveraging EOEPCA building blocks, etc.

Nothing prevents to share a Job Creation JSON instead of a UDP...

Kind of disagree: we want to build on the remote process defintion extension, which follows the UDP schemas. Going for the job creation schema instead would cause incompatibility issues with remote process definition related tooling and support.

I don't like that they become crucial/so important now, this pretty much departs from the original openEO vision. That VITO depends so heavily on them seems like a shortcoming to me in how the system is set up. We always wanted it to be so that users don't need to worry too much about memory, CPU, but now they need to do it more and more.

I understand that this conflicts a bit with the general openEO vision, but note that most users or use cases do not have to worry about explicitly setting job options. This is just a pragmatic approach for us in more advanced and heavy use cases.

If we really want to make such options (memory, CPU, etc) a thing in openEO, they should be standardized across backends.

Agree, and I think this feature request actually is in line with that: without deciding about the actual job option fields (which would lead us too far here), we can already find a convention about their location in the relevant schemas and JSON structures.

@soxofaan
Copy link
Member

I don't like that they become crucial/so important now, this pretty much departs from the original openEO vision. That VITO depends so heavily on them seems like a shortcoming to me in how the system is set up. We always wanted it to be so that users don't need to worry too much about memory, CPU, but now they need to do it more and more.

Another thought about this. These job options are usually necessary for use cases with UDFs that are quite heavy in some sense (e.g. consuming large chunks of data, or doing non-trivial ML operations). Unlike with pre-defined openEO processes, back-ends can not automatically estimate necessary resource allocation to support the UDFs, so we need some kind of user-provided indicators here to get this working in practice.

@m-mohr
Copy link
Member

m-mohr commented Nov 16, 2024

I've put this on the PSC agenda. This request is not really in line with the original vision of openEO (keep away this complexity from the user + interoperability), so want to hear from the PSC how they think about it.

@jdries
Copy link
Author

jdries commented Nov 21, 2024

Thanks, it's a very good discussion to have at that level!

@m-mohr
Copy link
Member

m-mohr commented Dec 4, 2024

Discussed in the PSC, I'll try to summarize the outcomes:

  • The general mechanism to add custom properties to jobs/services/sync. processing is fine, VITO indicates some property may fade out over time in favor of other solutions
  • Processing Parameters Extension #276 #545 #471 is generally a good idea as a first step. We want to keep the separation between jobs, services and sync. This should probably lead to alignment between the backends for a set of commonly used options. The PR should not be part of the core API, instead add it as an extension.
  • Minimally required additional options can generally be placed in UDPs, there's no schematic restriction on additional properties. The options should probably go into a object-typed property that is also separated between jobs/services/sync. So named similar to the Processing Parameters Extension #276 #545 #471 (e.g. default_job_parameters). These are the defaults unless a user overrides them in the actual job object. This addition should also go into the new extension.

@jdries Did I capture everything correctly?

@m-mohr m-mohr linked a pull request Dec 4, 2024 that will close this issue
@jdries
Copy link
Author

jdries commented Dec 4, 2024

Indeed, thanks!
As an extra detail that was mentioned: because these default_job_parameters are evaluated on the backend side, the backend has the liberty to decide if it uses or ignores what is specified in there. So it shouldn't crash when there is an unrecognized option.
Note that we'll further validate the proposal in the open source geotrellis implementation as well.

@m-mohr
Copy link
Member

m-mohr commented Dec 18, 2024

A PR for this is available here: #471

@m-mohr m-mohr linked a pull request Dec 18, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants