cdc,tests: integration test flow_control case is flaky #383

Closed
pingyu opened this issue Jan 3, 2024 · 5 comments · Fixed by #392
Labels
type/bug Something isn't working

Comments

Collaborator

pingyu commented Jan 3, 2024

Bug Report

1. Describe the bug

cdc server used memory: 346708
Maybe flow control is not working.

See https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Fmigration%2Fpull_unit_test/detail/pull_unit_test/11/pipeline/

2. Minimal reproduce step (Required)

Run integration tests in CI.

3. What did you see instead (Required)

Test failed.

4. What did you expect to see? (Required)

Test succeeds.

5. What is your migration tool and TiKV version? (Required)

pingyu added the type/bug label Jan 3, 2024
Collaborator Author

pingyu commented Jan 3, 2024

Another case: https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Fmigration%2Fpull_integration_test/detail/pull_integration_test/15/pipeline/304

[2024/01/03 15:54:45.678 +08:00] [INFO] [region_request.go:794] ["mark store's regions need be refill"] [id=1] [addr=127.0.0.1:20161] [error="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:56668->127.0.0.1:20161: read: connection timed out"] [errorVerbose="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:56668->127.0.0.1:20161: read: connection timed out\ngithub.com/tikv/client-go/v2/internal/client.sendBatchRequest\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client_batch.go:789\ngithub.com/tikv/client-go/v2/internal/client.(*RPCClient).sendRequest\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client.go:490\ngithub.com/tikv/client-go/v2/internal/client.(*RPCClient).SendRequest\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client.go:533\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionRequestSender).sendReqToRegion\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/locate/region_request.go:1184\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionRequestSender).SendReqCtx\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/locate/region_request.go:1017\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionRequestSender).SendReq\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/locate/region_request.go:233\ngithub.com/tikv/client-go/v2/rawkv.(*Client).sendReq\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/rawkv/rawkv.go:678\ngithub.com/tikv/client-go/v2/rawkv.(*Client).PutWithTTL\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/rawkv/rawkv.go:318\ngithub.com/tikv/client-go/v2/rawkv.(*Client).Put\n\t/disk1/home/zhouzemin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/rawkv/rawkv.go:366\ngithub.com/pingcap/go-ycsb/db/tikv.(*rawDB).Insert\n\t/disk1/home/zhouzemin/run/go-ycsb/db/tikv/raw.go:173\ngithub.com/pingcap/go-ycsb/pkg/client.DbWrapper.Insert\n\t/disk1/home/zhouzemin/run/go-ycsb/pkg/client/dbwrapper.go:121\ngithub.com/pingcap/go-ycsb/pkg/workload.(*core).DoInsert\n\t/disk1/home/zhouzemin/run/go-ycsb/pkg/workload/core.go:274\ngithub.com/pingcap/go-ycsb/pkg/client.(*worker).run\n\t/disk1/home/zhouzemin/run/go-ycsb/pkg/client/client.go:133\ngithub.com/pingcap/go-ycsb/pkg/client.(*Client).Run.func2\n\t/disk1/home/zhouzemin/run/go-ycsb/pkg/client/client.go:212\nruntime.goexit\n\t/disk1/home/zhouzemin/.local/go/src/runtime/asm_amd64.s:1571"]

All TiKV servers seemed to be down.

pingyu changed the title from "tests: integration test flow_control case is flaky" to "cdc,tests: integration test flow_control case is flaky" Jan 4, 2024
Collaborator Author

pingyu commented Feb 13, 2024

The heap profile (see here) showed that the function using the most memory was (*Event_Row).Unmarshal:

❯ go tool pprof heap-dump.log
File: tikv-cdc
Type: inuse_space
Time: Feb 13, 2024 at 10:37am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 216.10MB, 94.76% of 228.05MB total
Dropped 140 nodes (cum <= 1.14MB)
Showing top 10 nodes out of 84
      flat  flat%   sum%        cum   cum%
  154.16MB 67.60% 67.60%   154.16MB 67.60%  github.com/pingcap/kvproto/pkg/cdcpb.(*Event_Row).Unmarshal
      20MB  8.77% 76.37%       20MB  8.77%  github.com/tikv/migration/cdc/cdc/kv.assembleRowEvent
   17.60MB  7.72% 84.09%    17.60MB  7.72%  golang.org/x/net/webdav.(*memFile).Write
    9.51MB  4.17% 88.26%     9.51MB  4.17%  github.com/tinylib/msgp/msgp.Require
       5MB  2.19% 90.45%     9.14MB  4.01%  github.com/tikv/migration/cdc/cdc/sorter/unified.(*heapSorter).init.func1
    4.14MB  1.82% 92.27%     4.14MB  1.82%  github.com/tikv/migration/cdc/cdc/sorter/unified.(*sortHeap).Push
    2.50MB  1.10% 93.36%     2.50MB  1.10%  github.com/tikv/migration/cdc/cdc/model.NewPolymorphicEvent
    2.03MB  0.89% 94.25%     2.03MB  0.89%  bytes.growSlice
    0.58MB  0.26% 94.51%     1.66MB  0.73%  github.com/tikv/pd/client.(*tsoClient).createTSODispatcher
    0.57MB  0.25% 94.76%     2.09MB  0.91%  google.golang.org/protobuf/internal/impl.legacyLoadMessageInfo
(pprof)

So this does not seem to be a memory leak or a flow control issue.

Since the flow_control case uses go-ycsb with a 1KB record size as its workload, decreasing the record size may reduce the memory used by this part.
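
For anyone who wants to capture a comparable heap dump from a Go process, here is a minimal sketch using the standard `runtime/pprof` package. It is not necessarily how tikv-cdc exposes its profile; the `heapProfile` helper and the 1KB dummy allocations in `main` are hypothetical and only stand in for a workload.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// heapProfile is a hypothetical helper that returns the heap profile of the
// current process in the gzip-compressed protobuf format read by `go tool pprof`.
func heapProfile() ([]byte, error) {
	// The heap profile reflects the most recently completed GC, so force one
	// first to get up-to-date statistics.
	runtime.GC()

	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Keep some 1KB records alive, loosely mimicking buffered row events.
	records := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		records = append(records, make([]byte, 1024))
	}

	data, err := heapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Printf("heap profile: %d bytes, %d records still referenced\n", len(data), len(records))
}
```

Writing the returned bytes to a file gives something that can be opened with `go tool pprof <file>` and inspected with `top`, as in the session above.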

Collaborator Author

pingyu commented Feb 25, 2024

profile (1).pb.gz

Sample=alloc_space:

image

pingyu added a commit that referenced this issue Mar 4, 2024
…eam` (#392)

* collect pprof heap
* unlimit retry for pd connection
* reduce record size
* log level: info
* reduce data size; add grafana panel
* fix encoder size
* fix
* MQMessage pool
* fix release
* wip
* fix flaky ut
* logging
* fix ut
* adjust memory release parameter
* polish
* polish
* polish

---------

Signed-off-by: Ping Yu <yuping@pingcap.com>