Create Kueue service backend #706

Open
2 of 20 tasks
cortadocodes opened this issue Jan 6, 2025 · 0 comments
Labels: epic (Contains links to a collection of issues), major-missing-feature

cortadocodes (Member) commented Jan 6, 2025

Epic

End User Goal

Allow a user to run a question of any size without it timing out.

Overview

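Cloud Run limits our ability to run questions that take longer than an hour or that need more powerful hardware. It also locks us into a set of frustrating problems, such as questions being rerun unnecessarily and having no way to cancel or individually monitor a running question.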

Creating a Kueue service backend will let us:

  • Run questions that take any amount of time (in particular, runs longer than 1 hour)
  • Access hardware we can't currently access (e.g. GPUs)
  • Access arbitrarily provisioned hardware (CPU, memory, storage etc.)
  • Stop pointless question reruns by controlling when we acknowledge question events
  • Cancel running questions
  • Monitor running questions individually
  • Run questions on providers other than Google (i.e. on any Kubernetes cluster)
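
To make the hardware and run-time points concrete, here is a minimal sketch of what a question could look like as a Kubernetes Job submitted through Kueue. It is an illustration under assumptions, not the planned implementation: the function name, image, queue name, and resource values are hypothetical, and the arguments to `octue question ask` are left out because the issue does not specify them. The parts taken from Kubernetes and Kueue themselves are the `kueue.x-k8s.io/queue-name` label, the `suspend` field, and the `nvidia.com/gpu` extended resource.

```python
import json


def build_kueue_job(name, image, queue_name, cpu="1", memory="2Gi", gpus=0):
    """Build a batch/v1 Job manifest (as a dict) for Kueue to admit.

    All argument values are supplied by the caller; nothing here is taken
    from the real service registry or Dockerfiles.
    """
    resources = {"requests": {"cpu": cpu, "memory": memory}}

    if gpus:
        # Extended resources such as GPUs are requested under `limits`.
        resources["limits"] = {"nvidia.com/gpu": gpus}

    container = {
        "name": "question",
        "image": image,
        # Entrypoint named in this issue; its arguments are not specified here.
        "command": ["octue", "question", "ask"],
        "resources": resources,
    }

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # This label tells Kueue which LocalQueue the Job belongs to.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            # Created suspended; Kueue unsuspends the Job once it is admitted.
            "suspend": True,
            # No `activeDeadlineSeconds` is set, so the Job has no time limit.
            "template": {
                "spec": {"containers": [container], "restartPolicy": "Never"}
            },
        },
    }


# Hypothetical names throughout.
manifest = build_kueue_job(
    name="question-example",
    image="example-registry/example-service:1.0.0",
    queue_name="question-queue",
    cpu="4",
    memory="16Gi",
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied with `kubectl apply -f` or sent through a Kubernetes client. Cancelling a question would then amount to deleting the Job, and monitoring it would amount to reading the Job's status.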

Contents

  • Investigate whether Kueue can integrate directly with Pub/Sub (i.e. can we link Kueue up to push subscriptions?) #711
  • If it can't, upgrade the event handler cloud function to route appropriate questions to Kueue
  • Experiment with Kueue #712
  • Acknowledge question events on receipt by Kueue or the cloud function
  • Upgrade service registries
    • Add routing information so questions can be routed to Kueue or not
    • Add configuration information (or pointers to it) for each service revision
    • Set up a lightweight service registry with a cloud function and BigQuery
  • Update the standard dockerfiles to support Kueue (if necessary)
    • Use the new octue question ask CLI command as the container entrypoint
  • Run questions via CLI without starting a service #710
  • Determine if the configuration values and manifest should be built into the service revision image
  • Use Terraform to:
    • Create a Kubernetes cluster for running questions
    • Set up Kueue on the Kubernetes cluster
  • Test the Kueue service backend with the example service
  • Add ability to request specific hardware or resource flavours for a question
  • Add ability to cancel a question
  • Add ability to check a question's status
  • Determine some useful default resource flavours and cluster queue configurations
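
The routing tasks above (adding routing information to the service registries and having the event handler send appropriate questions to Kueue) could be sketched roughly as follows. Everything here is an assumption for illustration: the `RegistryEntry` schema, the `recipient` and `question_uuid` event keys, the backend names, and the `gpu` flavour name are hypothetical, and the dispatchers are stand-ins for whatever actually submits a question to Kueue or Cloud Run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class RegistryEntry:
    """Routing and configuration info for one service revision (hypothetical schema)."""

    backend: str  # "kueue" or "cloud-run"
    resource_flavour: Optional[str] = None  # only meaningful for the Kueue backend


def route_question(event: dict, registry: Dict[str, RegistryEntry]) -> RegistryEntry:
    """Look up the registry entry for the service revision a question is addressed to."""
    recipient = event["recipient"]  # hypothetical event attribute
    try:
        return registry[recipient]
    except KeyError:
        raise LookupError(f"Service revision {recipient!r} is not registered.") from None


def handle_question_event(
    event: dict,
    registry: Dict[str, RegistryEntry],
    dispatchers: Dict[str, Callable[[dict, RegistryEntry], None]],
) -> str:
    """Dispatch a question to its backend and return the backend's name.

    Returning without raising is the point at which the caller would
    acknowledge the event, so the question is not redelivered and rerun.
    """
    entry = route_question(event, registry)
    dispatchers[entry.backend](event, entry)
    return entry.backend


# Usage with an in-memory registry and dispatchers that only record calls.
registry = {
    "octue/example-service:1.0.0": RegistryEntry(backend="cloud-run"),
    "octue/example-service:2.0.0": RegistryEntry(backend="kueue", resource_flavour="gpu"),
}

dispatched = []
dispatchers = {
    "kueue": lambda event, entry: dispatched.append(
        ("kueue", event["question_uuid"], entry.resource_flavour)
    ),
    "cloud-run": lambda event, entry: dispatched.append(
        ("cloud-run", event["question_uuid"], None)
    ),
}

backend = handle_question_event(
    {"recipient": "octue/example-service:2.0.0", "question_uuid": "abc"},
    registry,
    dispatchers,
)
```

Keeping the lookup separate from the dispatch means the registry can later move to a cloud function backed by BigQuery without changing the handler.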
cortadocodes added the epic and major-missing-feature labels Jan 6, 2025
cortadocodes changed the title from "Create Kueue backend" to "Create Kueue service backend" Jan 6, 2025
cortadocodes self-assigned this Jan 9, 2025