DO NOT MERGE. Intro serializer lib with json and protobuf. #107

Open
wants to merge 16 commits into
base: main


yb01
Collaborator

@yb01 yb01 commented Jul 27, 2022

This PR contains changes in two major commits for:

  1. a new serializer lib, which is largely taken from the apimachinery lib from k8s
  2. a refactor of the types folder to make the current pb auto-gen easier
  3. other fixes for using protobuf, applied ONLY to the ResponseFromRRM object for now, for 730 experiments

Fix for issue #102.

yb01 added 6 commits July 26, 2022 20:03
use gogo protobuf
use bytes for rv map in transit because protobuf does not support struct as map key

separate type def

separate type def

separate type def
protobuf auto-gen files and hard-coded to protobuf serializer for pulling data from region manager

handle time in protobuf
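
For illustration, here is a minimal Go sketch of one way to move a struct-keyed map through a protobuf message. Protobuf map keys must be integral or string types, so the map is flattened into a list of entries and carried as bytes. `RvLocation`, `rvEntry`, `encodeRvMap`, and `decodeRvMap` are hypothetical names, and JSON is used only to keep the sketch stdlib-only; it is not necessarily the encoding this PR uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RvLocation is a hypothetical struct key, standing in for whatever
// struct the resource version (rv) map is keyed by.
type RvLocation struct {
	Region    int
	Partition int
}

// rvEntry is one flattened key/value pair of the map.
type rvEntry struct {
	Loc RvLocation
	Rv  uint64
}

// encodeRvMap flattens the struct-keyed map into a list of entries and
// serializes it to bytes, which can travel in a protobuf bytes field.
func encodeRvMap(m map[RvLocation]uint64) ([]byte, error) {
	entries := make([]rvEntry, 0, len(m))
	for loc, rv := range m {
		entries = append(entries, rvEntry{Loc: loc, Rv: rv})
	}
	return json.Marshal(entries)
}

// decodeRvMap rebuilds the struct-keyed map on the receiving side.
func decodeRvMap(data []byte) (map[RvLocation]uint64, error) {
	var entries []rvEntry
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, err
	}
	m := make(map[RvLocation]uint64, len(entries))
	for _, e := range entries {
		m[e.Loc] = e.Rv
	}
	return m, nil
}

func main() {
	in := map[RvLocation]uint64{{Region: 1, Partition: 2}: 42}
	data, err := encodeRvMap(in)
	if err != nil {
		panic(err)
	}
	out, err := decodeRvMap(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(out), out[RvLocation{Region: 1, Partition: 2}])
}
```

The in-memory type keeps its struct key; only the wire representation changes.
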
@yb01 yb01 requested review from Sindica, q131172019 and sonyafenge and removed request for Sindica, q131172019 and sonyafenge July 27, 2022 03:35
@@ -118,6 +112,9 @@ func (a *Aggregator) Run() (err error) {
if eventProcess {
a.postCRV(c, crv)
}
} else {
Collaborator

Typically, there will be a comparison between actual data size and batch size. If data size == batch size, no need to wait for subsequent pull. Wait a bit if data size < batch size. Without waiting, even a single event can cause immediate subsequent pull. This seems a bit misaligned with 100ms for empty pull.
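
A minimal Go sketch of the comparison described above, under stated assumptions: `nextPullWait` and `partialPullWait` are hypothetical names and values, and only the 100ms wait for an empty pull comes from this thread.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// emptyPullWait mirrors the 100ms wait for an empty pull mentioned above.
	emptyPullWait = 100 * time.Millisecond
	// partialPullWait is an assumed value for a pull that returned
	// fewer events than the batch size.
	partialPullWait = 20 * time.Millisecond
)

// nextPullWait returns how long to wait before the next pull, given the
// number of events received and the requested batch size.
func nextPullWait(received, batchSize int) time.Duration {
	switch {
	case received <= 0:
		return emptyPullWait // nothing came back: wait the longest
	case received >= batchSize:
		return 0 // full batch: more data is likely waiting, pull again now
	default:
		return partialPullWait // partial batch: wait a bit
	}
}

func main() {
	for _, n := range []int{0, 1, 500, 1000} {
		fmt.Printf("received=%d wait=%v\n", n, nextPullWait(n, 1000))
	}
}
```

With this shape, a single event no longer triggers an immediate follow-up pull, while a full batch still does.
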

Collaborator Author

good point. will fix.

Collaborator Author

Currently the batch size is not used, so comparing the expected size (the batch size) with the data actually received (its length) won't help here for now.

Since the goal is to avoid back-to-back pull()s from the aggregator, which would busy-spin the CPU, we will check the durations of pull() and/or processEvent() and adjust the wait time here.
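
One possible shape for that adjustment, as a hedged Go sketch: measure how long a pull/process cycle took and sleep only for the remainder of a minimum cycle time. `minCycle` and `waitAfter` are hypothetical, and the 100ms value is an assumption, not something this PR fixes.

```go
package main

import (
	"fmt"
	"time"
)

// minCycle is an assumed lower bound on the time between the starts of
// two consecutive pulls.
const minCycle = 100 * time.Millisecond

// waitAfter returns how long to sleep after a pull/process cycle that
// took elapsed, so that cycles never start closer together than minCycle.
func waitAfter(elapsed time.Duration) time.Duration {
	if elapsed >= minCycle {
		return 0 // the cycle was slow enough on its own: no extra wait
	}
	return minCycle - elapsed
}

func main() {
	start := time.Now()
	// pull() and processEvent() would run here.
	elapsed := time.Since(start)

	wait := waitAfter(elapsed)
	fmt.Printf("cycle took %v, sleeping %v\n", elapsed, wait)
	time.Sleep(wait)
}
```

A fast, near-empty cycle then costs at most one wake-up per `minCycle`, while a long cycle is followed by the next pull immediately.
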

@Sindica
Collaborator

Sindica commented Jul 27, 2022

FYI, this change caused a 20% performance loss for distributor_concurrency_test (for 2M events; 200K events seem OK) when metrics are disabled.


// NewRegionNodeEvents creates a Region Node Events handler with the given logger
//
func NewRegionNodeEventsHander() *RegionNodeEventHandler {
-return &RegionNodeEventHandler{}
+return &RegionNodeEventHandler{
+// serializer: localJson.NewSerializer("foo", false),
Collaborator

This commented line should be removed, right?

Collaborator Author

yeah, i should have removed it. will do.

Collaborator

@q131172019 q131172019 left a comment

I suggest adding instructions on how to use protobuf to generate the files automatically.

Reference: CentaurusInfra/arktos#1281.

Also, before merging into the main branch, can this PR be tested in a GCE environment, since it modifies so many files?

@yb01
Collaborator Author

yb01 commented Jul 31, 2022

some perf comparison for the event queue change in commit fb9b951
-- client side end to end time:

root@ip-172-31-10-115:/work/src/global-resource-service/resource-management# tail cent-main-c.log 
W0731 01:31:33.748972   31700 singleClientTest.go:225] Prolonged watch node from server: 459bc842-ee00-4f9d-aacc-97609d18b840 with time (2.970314753s)
W0731 01:31:33.749024   31700 singleClientTest.go:225] Prolonged watch node from server: 6c9e7ab9-6cfc-4be3-9770-54ec6e54fb71 with time (2.970374009s)
W0731 01:31:33.749048   31700 singleClientTest.go:225] Prolonged watch node from server: e6e457f6-593d-4e89-82d5-4fa34b663a7c with time (2.970332255s)
W0731 01:31:33.749085   31700 singleClientTest.go:225] Prolonged watch node from server: 71393aa6-ade8-44e6-bd51-8aa47a73e4f6 with time (2.970361383s)
I0731 01:34:22.024268   31700 streamwatcher.go:115] Unexpected EOF during watch stream event decoding: unexpected EOF
I0731 01:34:22.024318   31700 singleClientTest.go:184] End of results
I0731 01:34:22.024333   31700 stats.go:28] [Metrics][Register]RegisterClientDuration: 120.607243ms
I0731 01:34:22.024345   31700 stats.go:41] [Metrics][List]ListDuration: 3.480990221s. Number of nodes listed: 25001
I0731 01:34:22.024355   31700 stats.go:60] [Metrics][Watch]Watch session last: 6m50.706639813s. Number of nodes Added :0, Updated: 3362, Deleted: 0. watch prolonged than 1s: 3362
I0731 01:34:22.024608   31700 stats.go:65] [Metrics][Watch] perc50 2.818915196s, perc90 2.925316215s, perc99 2.969276625s. Total count 3362
root@ip-172-31-10-115:/work/src/global-resource-service/resource-management# tail cent-main-event-queue-c.log 
W0731 00:36:35.287190   31090 singleClientTest.go:225] Prolonged watch node from server: b5b58a57-30fc-48a2-9e4f-922e6f1bf927 with time (2.740699344s)
W0731 00:36:35.287232   31090 singleClientTest.go:225] Prolonged watch node from server: 417ff6e8-92a1-408b-88ec-125fb6746650 with time (2.740691147s)
W0731 00:36:35.287247   31090 singleClientTest.go:225] Prolonged watch node from server: 72594f20-670f-46d1-8466-e11c9e8e3b95 with time (2.740718864s)
W0731 00:36:35.287286   31090 singleClientTest.go:225] Prolonged watch node from server: be59e4c5-9e2a-4346-98c7-676dee694fb8 with time (2.740677759s)
I0731 00:39:04.908875   31090 streamwatcher.go:115] Unexpected EOF during watch stream event decoding: unexpected EOF
I0731 00:39:04.908912   31090 singleClientTest.go:184] End of results
I0731 00:39:04.908928   31090 stats.go:28] [Metrics][Register]RegisterClientDuration: 130.604209ms
I0731 00:39:04.908938   31090 stats.go:41] [Metrics][List]ListDuration: 2.888267345s. Number of nodes listed: 25108
I0731 00:39:04.908947   31090 stats.go:60] [Metrics][Watch]Watch session last: 6m40.960074136s. Number of nodes Added :0, Updated: 2513, Deleted: 0. watch prolonged than 1s: 2513
I0731 00:39:04.909115   31090 stats.go:65] [Metrics][Watch] perc50 2.64036751s, perc90 2.736392793s, perc99 2.740376538s. Total count 2513
root@ip-172-31-10-115:/work/src/global-resource-service/resource-management# tail pr107-c.log 
W0731 01:12:35.181080   31407 singleClientTest.go:224] Prolonged watch node from server: 4a7e1c70-887c-4375-8678-66aaa72fd267 with time (2.181077207s)
W0731 01:12:35.181096   31407 singleClientTest.go:224] Prolonged watch node from server: 430cb301-4ce5-4ba9-9f19-8a8928fb7a56 with time (2.18109399s)
W0731 01:12:35.181140   31407 singleClientTest.go:224] Prolonged watch node from server: e2257edb-7944-4f67-85e9-88ad59e2f2ca with time (2.18113716s)
W0731 01:12:35.181166   31407 singleClientTest.go:224] Prolonged watch node from server: 02390311-6041-4e8d-adeb-222b10f97ab2 with time (2.181163647s)
I0731 01:13:10.864430   31407 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0731 01:13:10.864482   31407 singleClientTest.go:183] End of results
I0731 01:13:10.864498   31407 stats.go:28] [Metrics][Register]RegisterClientDuration: 122.791053ms
I0731 01:13:10.864511   31407 stats.go:41] [Metrics][List]ListDuration: 2.370043357s. Number of nodes listed: 25053
I0731 01:13:10.864521   31407 stats.go:60] [Metrics][Watch]Watch session last: 5m15.717064136s. Number of nodes Added :0, Updated: 2880, Deleted: 0. watch prolonged than 1s: 2880
I0731 01:13:10.864671   31407 stats.go:65] [Metrics][Watch] perc50 2.118337951s, perc90 2.17130846s, perc99 2.180023096s. Total count 2880
root@ip-172-31-10-115:/work/src/global-resource-service/resource-management# 

--- server side metrics:

root@ip-172-31-36-170:/work/src/global-resource-service/resource-management# tail cent-main-t.log 
I0731 01:27:31.342370    5204 installer.go:169] Serving watch for client: Client-6efc8aeb-4520-415a-8492-d7528fb54936
I0731 01:27:31.342427    5204 installer.go:189] Start watching distributor for client: Client-6efc8aeb-4520-415a-8492-d7528fb54936
I0731 01:27:31.342471    5204 installer.go:211] Start processing watch event for client: Client-6efc8aeb-4520-415a-8492-d7528fb54936
I0731 01:31:33.160481    5204 aggregator.go:97] Total (25000) region node events are pulled successfully in (10) RPs
I0731 01:31:54.682923    5204 event_metrics.go:105] [Metrics][AGG_RECEIVED] perc50 2.397479655s, perc90 2.410722505s, perc99 2.41372357s. Total count 3362
I0731 01:31:54.682959    5204 event_metrics.go:106] [Metrics][DIS_RECEIVED] perc50 2.40112165s, perc90 2.414844607s, perc99 2.41783481s. Total count 3362
I0731 01:31:54.682966    5204 event_metrics.go:107] [Metrics][DIS_SENDING] perc50 2.416017464s, perc90 2.419745487s, perc99 2.420972926s. Total count 3362
I0731 01:31:54.682971    5204 event_metrics.go:108] [Metrics][DIS_SENT] perc50 2.416026118s, perc90 2.419745713s, perc99 2.420973144s. Total count 3362
I0731 01:31:54.682976    5204 event_metrics.go:109] [Metrics][SER_ENCODED] perc50 2.416831977s, perc90 2.420030218s, perc99 2.421060159s. Total count 3362
I0731 01:31:54.682981    5204 event_metrics.go:110] [Metrics][SER_SENT] perc50 2.416832056s, perc90 2.42003035s, perc99 2.421060264s. Total count 3362           
root@ip-172-31-36-170:/work/src/global-resource-service/resource-management# tail cent-main-event-queue-t.log 
I0731 00:32:23.972636    3652 installer.go:169] Serving watch for client: Client-db1d707d-25cb-40fc-9381-2e51587f49e4
I0731 00:32:23.972687    3652 installer.go:189] Start watching distributor for client: Client-db1d707d-25cb-40fc-9381-2e51587f49e4
I0731 00:32:23.972737    3652 installer.go:211] Start processing watch event for client: Client-db1d707d-25cb-40fc-9381-2e51587f49e4
I0731 00:36:34.801489    3652 aggregator.go:97] Total (25000) region node events are pulled successfully in (10) RPs
I0731 00:36:53.778701    3652 event_metrics.go:105] [Metrics][AGG_RECEIVED] perc50 2.27149438s, perc90 2.284506426s, perc99 2.287312368s. Total count 2513
I0731 00:36:53.778760    3652 event_metrics.go:106] [Metrics][DIS_RECEIVED] perc50 2.274924693s, perc90 2.288500822s, perc99 2.291312045s. Total count 2513
I0731 00:36:53.778769    3652 event_metrics.go:107] [Metrics][DIS_SENDING] perc50 2.286279444s, perc90 2.293075514s, perc99 2.294217557s. Total count 2513
I0731 00:36:53.778776    3652 event_metrics.go:108] [Metrics][DIS_SENT] perc50 2.286287825s, perc90 2.293075714s, perc99 2.294217727s. Total count 2513
I0731 00:36:53.778784    3652 event_metrics.go:109] [Metrics][SER_ENCODED] perc50 2.287111212s, perc90 2.293252343s, perc99 2.294351256s. Total count 2513
I0731 00:36:53.778791    3652 event_metrics.go:110] [Metrics][SER_SENT] perc50 2.287111289s, perc90 2.293252401s, perc99 2.294351381s. Total count 2513
root@ip-172-31-36-170:/work/src/global-resource-service/resource-management# tail pr107-t.log 
I0731 01:07:55.175301    4600 installer.go:168] Serving watch for client: Client-f46a4a18-1155-42ef-b8cc-cbaf2e28a9f8
I0731 01:07:55.175370    4600 installer.go:188] Start watching distributor for client: Client-f46a4a18-1155-42ef-b8cc-cbaf2e28a9f8
I0731 01:07:55.175401    4600 installer.go:210] Start processing watch event for client: Client-f46a4a18-1155-42ef-b8cc-cbaf2e28a9f8
I0731 01:12:34.693864    4600 aggregator.go:92] Total (25000) region node events are pulled successfully in (10) RPs. pull duration 941.800415ms
I0731 01:12:39.487791    4600 event_metrics.go:105] [Metrics][AGG_RECEIVED] perc50 1.691580379s, perc90 1.693398308s, perc99 1.693814964s. Total count 2880
I0731 01:12:39.487834    4600 event_metrics.go:106] [Metrics][DIS_RECEIVED] perc50 1.695657601s, perc90 1.696835842s, perc99 1.697109142s. Total count 2880
I0731 01:12:39.487840    4600 event_metrics.go:107] [Metrics][DIS_SENDING] perc50 1.706647292s, perc90 1.715874898s, perc99 1.717099747s. Total count 2880
I0731 01:12:39.487846    4600 event_metrics.go:108] [Metrics][DIS_SENT] perc50 1.706647377s, perc90 1.715875037s, perc99 1.717127428s. Total count 2880
I0731 01:12:39.487851    4600 event_metrics.go:109] [Metrics][SER_ENCODED] perc50 1.707397986s, perc90 1.71634485s, perc99 1.717605537s. Total count 2880
I0731 01:12:39.487856    4600 event_metrics.go:110] [Metrics][SER_SENT] perc50 1.707398056s, perc90 1.716344969s, perc99 1.717605596s. Total count 2880
root@ip-172-31-36-170:/work/src/global-resource-service/resource-management# mv cent-main-envqueue-change-t.log cent-main-event-queue-t.log
root@ip-172-31-36-170:/work/src/global-resource-service/resource-management# 

@yb01
Collaborator Author

yb01 commented Aug 30, 2022

On hold; to be revisited should extra perf for the GRS be needed.

@@ -201,7 +202,7 @@ func (eq *NodeEventQueue) getAllEventsSinceResourceVersion(rvs types.InternalRes
}
}

-nodeEvents := make([]*types.NodeEvent, 0)
+nodeEvents := make([]*types.NodeEvent, 0, 1000)
Collaborator

Based on testing, this change does not improve performance.
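
For anyone who wants to check this kind of change in isolation, here is a hedged Go sketch that compares the two `make` forms using `testing.Benchmark` from a plain program. `NodeEvent` and `collect` are hypothetical stand-ins, not the types or code path in this PR, and the event count of 1000 is an assumption.

```go
package main

import (
	"fmt"
	"testing"
)

// NodeEvent is a hypothetical stand-in for types.NodeEvent.
type NodeEvent struct{ ID int }

// collect appends n events to a slice created with the given capacity hint.
func collect(n, capHint int) []*NodeEvent {
	events := make([]*NodeEvent, 0, capHint)
	for i := 0; i < n; i++ {
		events = append(events, &NodeEvent{ID: i})
	}
	return events
}

func main() {
	for _, hint := range []int{0, 1000} {
		hint := hint // capture the loop variable for the closure
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				_ = collect(1000, hint)
			}
		})
		fmt.Printf("cap hint %4d: %s %s\n", hint, res.String(), res.MemString())
	}
}
```

A micro-benchmark like this only measures slice growth; whether the difference is visible end to end depends on how much of the total time the append path accounts for.
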

@Sindica
Collaborator

Sindica commented Sep 14, 2022

perf-code
perf-result

@yb01
Collaborator Author

yb01 commented Sep 20, 2022

put this PR on hold.

@yb01 yb01 changed the title from "Intro serializer lib with json and protobuf." to "DO NOT MERGE. Intro serializer lib with json and protobuf." Sep 20, 2022