[history server] Web Server + Event Processor #4329
Conversation
Future-Outlier left a comment:
cc @chiayi @KunWuLuan to help review, thank you!
```go
const (
	NIL                                        TaskStatus = "NIL"
	PENDING_ARGS_AVAIL                         TaskStatus = "PENDING_ARGS_AVAIL"
	PENDING_NODE_ASSIGNMENT                    TaskStatus = "PENDING_NODE_ASSIGNMENT"
	PENDING_OBJ_STORE_MEM_AVAIL                TaskStatus = "PENDING_OBJ_STORE_MEM_AVAIL"
	PENDING_ARGS_FETCH                         TaskStatus = "PENDING_ARGS_FETCH"
	SUBMITTED_TO_WORKER                        TaskStatus = "SUBMITTED_TO_WORKER"
	PENDING_ACTOR_TASK_ARGS_FETCH              TaskStatus = "PENDING_ACTOR_TASK_ARGS_FETCH"
	PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus = "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
	RUNNING                                    TaskStatus = "RUNNING"
	RUNNING_IN_RAY_GET                         TaskStatus = "RUNNING_IN_RAY_GET"
	RUNNING_IN_RAY_WAIT                        TaskStatus = "RUNNING_IN_RAY_WAIT"
	FINISHED                                   TaskStatus = "FINISHED"
	FAILED                                     TaskStatus = "FAILED"
)
```
LGTM! Just a question about something you mentioned: how does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own history server with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?
todo:

Yes, it will. This will be solved in the beta version.
I see. Thanks for the tips!
cc @chiayi @KunWuLuan to do a final pass, thank you!
cursor review
cursor review
✅ Bugbot reviewed your changes and found no bugs!
chiayi left a comment:
LGTM!
LGTM /approve
```go
	Mu sync.RWMutex
}

func (c *ClusterTaskMap) RLock() {
	// Body inferred from the method name; the diff shows only the signature.
	c.Mu.RLock()
}
```
Why do we need these funcs?
Go maps are not thread-safe. Concurrent reads and writes cause undefined behavior, so we guard the map with locks.
https://go.dev/blog/maps#concurrency
┌─────────────────────┐ ┌─────────────────────┐
│ Event Processor │ │ HTTP Handler │
│ (goroutine 1..N) │ │ (goroutine 1..M) │
└──────────┬──────────┘ └──────────┬──────────┘
│ WRITE │ READ
▼ ▼
┌──────────────────────────────────────────┐
│ ClusterTaskMap (RWMutex) │
│ ┌────────────────────────────────────┐ │
│ │ TaskMap per cluster (Mutex) │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ map[taskId] → []Task │ │ │
│ │ └──────────────────────────────┘ │ │
│ └────────────────────────────────────┘ │
└──────────────────────────────────────────┘
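A minimal sketch of the pattern in the diagram above, assuming simplified `Task` fields and method names that may differ from the PR's actual code: event-processor goroutines take the write lock, HTTP handlers take the read lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a simplified stand-in for the PR's task record (hypothetical fields).
type Task struct {
	TaskID string
	Status string
}

// ClusterTaskMap guards a task map with an RWMutex so event-processor
// goroutines (writers) and HTTP handlers (readers) can share it safely.
type ClusterTaskMap struct {
	Mu    sync.RWMutex
	Tasks map[string][]Task // taskId -> task state transitions
}

// AddTask is the write path an event-processor goroutine would use.
func (c *ClusterTaskMap) AddTask(t Task) {
	c.Mu.Lock()
	defer c.Mu.Unlock()
	c.Tasks[t.TaskID] = append(c.Tasks[t.TaskID], t)
}

// GetTasks is the read path an HTTP handler would use.
func (c *ClusterTaskMap) GetTasks(taskID string) []Task {
	c.Mu.RLock()
	defer c.Mu.RUnlock()
	return c.Tasks[taskID]
}

func main() {
	m := &ClusterTaskMap{Tasks: make(map[string][]Task)}
	m.AddTask(Task{TaskID: "t1", Status: "RUNNING"})
	fmt.Println(m.GetTasks("t1"))
}
```

An RWMutex fits this workload because the HTTP path is read-heavy: many handlers can hold the read lock concurrently, and only event-processor writes need exclusive access.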
Co-authored-by: @chiayi chiayiliang327@gmail.com
Co-authored-by: @KunWuLuan kunwuluan@gmail.com
Why are these changes needed?
This web server serves the history server's frontend and fetches data from the event server (processor).
As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.
https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
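A minimal sketch of that split, with hypothetical addresses, ports, and paths (the PR's actual flags and routing may differ): static frontend assets are served directly, while dashboard API calls are reverse-proxied to the event processor.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical event-processor address; in a real deployment this
	// would come from config or a Kubernetes Service DNS name.
	eventProcessor, err := url.Parse("http://event-processor:8080")
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()

	// Forward dashboard API calls to the event processor.
	mux.Handle("/api/", httputil.NewSingleHostReverseProxy(eventProcessor))

	// Serve the history server's frontend bundle (hypothetical path).
	mux.Handle("/", http.FileServer(http.Dir("./frontend/dist")))

	log.Fatal(http.ListenAndServe(":8265", mux))
}
```

Because the web server holds no task state itself in this split, it is the natural target for HPA scaling, while consistency concerns stay confined to the event processor.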
Note: this PR combines code from #4187 and #4253, then fixes a number of bugs.
Architecture
How to test and develop in your local env
Related issue number
#3966
#4374
HistoryServer Alpha Milestone Gap Analysis
Summary
API Endpoints (Terminated Clusters)
- /clusters
- /nodes
- /nodes/{node_id}
- /events
- /api/cluster_status
- /api/grafana_health
- /api/prometheus_health
- /api/data/datasets/{job_id}
- /api/serve/applications/
- /api/v0/placement_groups/
- /api/v0/tasks
- /api/v0/tasks/summarize
- /api/v0/logs
- /api/v0/logs/file
- /logical/actors
- /logical/actors/{actor_id}
- /api/jobs
- /api/jobs/{job_id}

Remaining Work (Priority)
- /api/jobs, /api/jobs/{job_id}
- /events endpoint
- /nodes/{node_id}
- /api/v0/logs/file
- /api/cluster_status
- /api/grafana_health, /api/prometheus_health
- /api/serve/applications/, /api/v0/placement_groups/
- others:
Overall Progress: ~75%
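For illustration only, a sketch of how a couple of the endpoints listed above could be registered using Go 1.22+ net/http path patterns; the handlers, response bodies, and port are placeholders, not the PR's actual implementation.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Placeholder handler: the real server would return job records
	// built by the event processor.
	mux.HandleFunc("GET /api/jobs", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode([]any{})
	})

	// Path wildcards ({node_id}) require Go 1.22 or newer.
	mux.HandleFunc("GET /nodes/{node_id}", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"node_id": r.PathValue("node_id")})
	})

	log.Fatal(http.ListenAndServe(":8265", mux))
}
```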