KOJO-127 | StateTransitionLogger LADR #74

Open
wants to merge 2 commits into
base: 4.x
83 changes: 83 additions & 0 deletions docs/StateTransitionLogger.md
@@ -0,0 +1,83 @@
# Kōjō State Transition Logger
* Start Date: 2019-08-08
* Author: Przemyslaw Mucha (przemyslaw.mucha@55places.com)

## Summary
This document proposes a design for Kōjō logic that will emit a log message for every state transition of a job (e.g. from `waiting` to `working`).

## Problem
* Why are we doing this?
    * To gain visibility into the lifecycle of any given job
* What use cases does it support?
    * Forensic analysis of a job that didn't behave as expected
    * Monitoring how much work is being done (more accurately than polling `kojo_job`)
    * Monitoring how much time is spent in each state

## Proposed Solution
Job state transitions are persisted to the RDBMS in `Neighborhoods\Kojo\State\Service::applyRequest()`.
To that logic we will add another `INSERT` into a new table: `kojo_job_state_transitions` (working title).
The schema of that table will include all job information (ID, type, etc.) as well as process information (execution environment, PID, etc.).
Those two statements will be wrapped in a database transaction to ensure that nothing is written to `kojo_job_state_transitions` unless the actual transition succeeds.
Contributor

@mucha55 how are you planning on using PDO to accomplish this with user-land transactions? testing PDO::inTransaction?

Contributor Author

Doctrine DBAL supports "nested" transactions, but even without them this is fine: if a userspace transaction is in progress and gets rolled back, the job state never changes, so nothing should be written to kojo_job_state_transitions anyway

Contributor Author

Per an offline discussion, we have to use \PDO for managing the transaction, since a Doctrine Connection isn't aware of what happens in the \PDO connection that it was instantiated with
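
The atomic write discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions, not the Kōjō implementation: Python's `sqlite3` stands in for `\PDO` and the production RDBMS, `apply_request` is a hypothetical analogue of `applyRequest()`, the column names are invented, and `conn.in_transaction` plays the role that `PDO::inTransaction()` would play in PHP.

```python
import sqlite3

# Hypothetical, simplified schema; the real kojo_job table has more columns.
SCHEMA = """
CREATE TABLE kojo_job (
    id INTEGER PRIMARY KEY,
    type_code TEXT NOT NULL,
    previous_state TEXT,
    assigned_state TEXT NOT NULL
);
CREATE TABLE kojo_job_state_transitions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id INTEGER NOT NULL,
    job_type_code TEXT NOT NULL,
    from_state TEXT NOT NULL,
    to_state TEXT NOT NULL,
    host TEXT NOT NULL,
    pid INTEGER NOT NULL
);
"""


def apply_request(conn, job_id, new_state, process):
    """Update the job and record the transition as one atomic unit.

    If the caller already opened a transaction (the userspace case), join it
    instead of starting a new one, so a userspace rollback also discards the
    transition row.
    """
    owns_tx = not conn.in_transaction
    if owns_tx:
        conn.execute("BEGIN")
    try:
        row = conn.execute(
            "SELECT type_code, assigned_state FROM kojo_job WHERE id = ?",
            (job_id,),
        ).fetchone()
        if row is None:
            raise LookupError(f"no such job: {job_id}")
        type_code, old_state = row
        conn.execute(
            "UPDATE kojo_job SET previous_state = ?, assigned_state = ? WHERE id = ?",
            (old_state, new_state, job_id),
        )
        conn.execute(
            "INSERT INTO kojo_job_state_transitions "
            "(job_id, job_type_code, from_state, to_state, host, pid) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (job_id, type_code, old_state, new_state, process["host"], process["pid"]),
        )
        if owns_tx:
            conn.execute("COMMIT")
    except Exception:
        # Only undo a transaction this function started; an outer
        # transaction is left for its owner to roll back.
        if owns_tx and conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

The connection is assumed to be opened with `isolation_level=None` so that `BEGIN`, `COMMIT`, and `ROLLBACK` are issued explicitly rather than by the driver.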


We will add another first-class process type to the Kōjō complement (e.g. `Server`, `Root`) called the `StateTransitionLogger` (working title).
The `StateTransitionLogger` will be a child of the `Root`, and will be most like `Worker` processes.
There will only be one `StateTransitionLogger` per Kōjō cluster, in the same way that there is only one `Worker` acting as the `Maintainer`.
`Root`s will be responsible for babysitting the `StateTransitionLogger`: if nothing is holding the mutex, a `Root` inserts a `command.addProcess('state_transition_logger')` message into the "publication" Redis list.
Once the `StateTransitionLogger` is instantiated, it will poll the `kojo_job_state_transitions` table continuously.
For each transition event the `StateTransitionLogger` pulls into memory, it will emit a message and then delete that row.
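Because a row is deleted only after its message has been emitted, this guarantees at-least-once delivery of transition messages: a crash between the emit and the delete causes the message to be emitted again, never lost.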
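
A minimal sketch of the emit-then-delete loop described above, assuming a simplified table layout. Python's `sqlite3` stands in for the real RDBMS, and `drain_once`, `poll_forever`, the column names, and the JSON message format are all hypothetical choices made for illustration.

```python
import json
import sqlite3
import time

# Hypothetical column set for kojo_job_state_transitions.
COLUMNS = ("id", "job_id", "from_state", "to_state", "host", "pid")

TRANSITIONS_SCHEMA = """
CREATE TABLE kojo_job_state_transitions (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL,
    from_state TEXT NOT NULL,
    to_state TEXT NOT NULL,
    host TEXT NOT NULL,
    pid INTEGER NOT NULL
)
"""


def drain_once(conn, emit, batch_size=100):
    """Emit and then delete up to batch_size transition rows, oldest first.

    Returns the number of rows that were read in this pass.
    """
    rows = conn.execute(
        f"SELECT {', '.join(COLUMNS)} FROM kojo_job_state_transitions "
        "ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row in rows:
        # Emit first. If emit() raises, or the process dies before the
        # DELETE commits, the row survives and is emitted again later.
        emit(json.dumps(dict(zip(COLUMNS, row))))
        conn.execute("DELETE FROM kojo_job_state_transitions WHERE id = ?", (row[0],))
        conn.commit()
    return len(rows)


def poll_forever(conn, emit, idle_sleep_seconds=1.0):
    """Poll continuously, sleeping briefly whenever the table is empty."""
    while True:
        if drain_once(conn, emit) == 0:
            time.sleep(idle_sleep_seconds)
```

The ordering is the whole point: deleting before emitting would give at-most-once delivery instead, so downstream consumers should expect occasional duplicates.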

## Backward Incompatible Changes
If there are any assumptions within Kōjō about the process hierarchy, they could be violated by the addition of a new process type, but this is unlikely.
Otherwise it's a purely additive modification to Kōjō.

## Example 1
1. There exists a dynamically scheduled job with `previous_state: 'new'`, `assigned_state: 'waiting'`, `next_state_request: 'working'`, and `work_at_datetime < NOW()`
1. A recently spawned `Worker` process selects that job to work
1. That `Worker` process reaches `Neighborhoods\Kojo\Foreman::_updateJobAsWorking()` and invokes `Neighborhoods\Kojo\State\Service::applyRequest()` to transition the job from `waiting` to `working`
1. The job, process, and transition information are inserted into `kojo_job_state_transitions`
1. The `Worker` process continues execution and hands over control to userspace
1. Concurrently, the `StateTransitionLogger` process for that cluster (not necessarily in the same execution environment) queries `kojo_job_state_transitions` and pulls that transition information into memory.
1. The `StateTransitionLogger` emits a message, deletes the row, and moves on to the next transition event
1. Concurrently, our logging infrastructure consumes the emitted messages
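
On the consuming side, the last step might look like the sketch below, which also covers the "time spent in each state" use case from the Problem section. The message shape (`id`, `job_id`, `from_state`, `to_state`, and an ISO 8601 `at` timestamp) and the function name `time_in_states` are assumptions made for illustration; the proposal does not define a message format. Because delivery is at-least-once, the consumer drops duplicates by transition `id`.

```python
from collections import defaultdict
from datetime import datetime


def time_in_states(messages):
    """Return total seconds spent in each state across all jobs.

    Duplicate messages (same transition id) are ignored, since the
    logger may emit a transition more than once.
    """
    seen_ids = set()
    events_by_job = defaultdict(list)
    for message in messages:
        if message["id"] in seen_ids:
            continue
        seen_ids.add(message["id"])
        events_by_job[message["job_id"]].append(message)

    totals = defaultdict(float)
    for events in events_by_job.values():
        events.sort(key=lambda m: datetime.fromisoformat(m["at"]))
        # A job sits in prev["to_state"] until its next transition.
        for prev, nxt in zip(events, events[1:]):
            entered = datetime.fromisoformat(prev["at"])
            left = datetime.fromisoformat(nxt["at"])
            totals[prev["to_state"]] += (left - entered).total_seconds()
    return dict(totals)
```

The final state of each job contributes nothing to the totals, because no later transition marks when it ended.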

## Future Scope
Part of the design for the `StateTransitionLogger` is to delegate more one-per-cluster responsibilities to first-class processes.
This is in contrast to the typical Kōjō pattern of requiring every newly spawned `Worker` process to attempt to perform these responsibilities.
If this implementation of the `StateTransitionLogger` is successful, it would be desirable to refactor out `Maintainer`, `Scheduler`, etc. responsibilities in the same way.

## Drawbacks
There are multiple unknowns with this design which could cause it to become infeasible to implement, in which case we'd have to go with the approach outlined in Alternative 3.

## Unresolved Questions
* Should we log when a job is created (i.e. scheduled)?
Contributor

On the surface I would answer "yes". When an execution cluster is constrained on resources or limited by process pool size, there may be jobs that never exit the waiting state. In this condition, we'd never see a StateTransitionLogger message about that job, and only polling kojo_job would display it.

Is there an implementation detail that makes logging on creation difficult?

Contributor Author

After we talked through the implementation a little bit last week, I realized that including job creation is easier than not including it


## Alternatives
1. `Neighborhoods\Kojo\State\Service::applyRequest()` emits the message itself when the transition happens
    1. There exists a dynamically scheduled job with `previous_state: 'new'`, `assigned_state: 'waiting'`, `next_state_request: 'working'`, and `work_at_datetime < NOW()`
    1. A recently spawned `Worker` process selects that job to work
    1. That `Worker` process reaches `Neighborhoods\Kojo\Foreman::_updateJobAsWorking()` and invokes `Neighborhoods\Kojo\State\Service::applyRequest()` to transition the job from `waiting` to `working`
    1. `Neighborhoods\Kojo\State\Service::applyRequest()` emits the transition message itself
    1. The `Worker` process continues execution and hands over control to userspace
    1. Userspace overrides this process's `\PDO` connection using `Neighborhoods\Kojo\Api\V1\RDBMS\Connection\Service::usePDO()`
    1. Userspace begins a transaction and issues a `complete_success` request via the Kōjō API (which causes a message to be emitted)
    1. Userspace rolls back the transaction, and issues a `complete_failed` request (which causes a contradictory message to be emitted)
1. `kojo_job_state_transitions` is populated via triggers on `kojo_job`
    1. There exists a dynamically scheduled job with `previous_state: 'new'`, `assigned_state: 'waiting'`, `next_state_request: 'working'`, and `work_at_datetime < NOW()`
    1. A recently spawned `Worker` process selects that job to work
    1. That `Worker` process reaches `Neighborhoods\Kojo\Foreman::_updateJobAsWorking()` and invokes `Neighborhoods\Kojo\State\Service::applyRequest()` to transition the job from `waiting` to `working`
    1. Once `Neighborhoods\Kojo\State\Service::applyRequest()` updates `kojo_job`, a trigger writes the old and new row information to `kojo_job_state_transitions`
    1. The `Worker` process continues execution and hands over control to userspace
    1. Concurrently, the `StateTransitionLogger` process for that cluster (not necessarily in the same execution environment) queries `kojo_job_state_transitions` and pulls that transition information into memory
    1. The `StateTransitionLogger` emits a message, deletes the row, and moves on to the next transition event
    1. Unfortunately, in this scenario, there's no process information available, so only job information is emitted
    1. Without process information, we can't determine whether there's a systemic issue on a particular execution environment
    1. Concurrently, our logging infrastructure consumes the emitted messages
1. Make the `StateTransitionLogger` another responsibility of each process when it starts up (vs. a first-class process type)
    1. There are no flaws with this approach per se, but we are of the opinion that continuing to add responsibilities to newly spawned `Worker` processes is unsustainable and results in less deterministic behavior than first-class, single-responsibility processes
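
The contradictory-message problem in the first alternative can be reproduced in a few lines. This is a sketch under assumptions, not Kōjō code: Python's `sqlite3` stands in for `\PDO`, the two functions and the reduced schema are hypothetical, and "emitting" is modeled as appending to a list.

```python
import sqlite3

# Hypothetical, heavily reduced schema for illustration only.
DEMO_SCHEMA = """
CREATE TABLE kojo_job (id INTEGER PRIMARY KEY, assigned_state TEXT NOT NULL);
CREATE TABLE kojo_job_state_transitions (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL,
    to_state TEXT NOT NULL
);
"""


def transition_emit_directly(conn, job_id, new_state, emit):
    """First alternative: emit as soon as the UPDATE is issued.

    The message escapes immediately, so a later ROLLBACK by the caller
    cannot take it back.
    """
    conn.execute(
        "UPDATE kojo_job SET assigned_state = ? WHERE id = ?", (new_state, job_id)
    )
    emit((job_id, new_state))


def transition_via_table(conn, job_id, new_state):
    """Proposed design: record the transition in the same transaction.

    A ROLLBACK by the caller discards the row along with the state change,
    so the logger never sees a transition that did not happen.
    """
    conn.execute(
        "UPDATE kojo_job SET assigned_state = ? WHERE id = ?", (new_state, job_id)
    )
    conn.execute(
        "INSERT INTO kojo_job_state_transitions (job_id, to_state) VALUES (?, ?)",
        (job_id, new_state),
    )
```

Running the userspace sequence from the first alternative (begin, request `complete_success`, roll back, request `complete_failed`) against each function shows the difference: the direct version has emitted both outcomes, while the table holds only `complete_failed`.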

## Rejected Features
As mentioned above, refactoring all `Worker` process responsibilities is outside the scope of implementing this "prototype" process type.

## References
* [LDR Google calendar](https://calendar.google.com/calendar?cid=NTVwbGFjZXMuY29tX3JrNG12NzFnYzEwNDhwZ3EwcWptMDZidGdjQGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20)