Documentation Enhancement #7

Open
garyelephant opened this issue Jul 1, 2020 · 2 comments
Labels: documentation (Improvements or additions to documentation)

Comments

garyelephant (Member) commented Jul 1, 2020

V2 Flink:

garyelephant added the documentation label on Jul 1, 2020
pdlovedy (Contributor) commented:

For Waterdrop's Elasticsearch input (ES as an input plugin, https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/configuration/input-plugins/Elasticsearch), please add a description of the relevant ES configuration parameters (https://www.elastic.co/guide/en/elasticsearch/hadoop/6.2/configuration.html) so that users can maximize read throughput from ES. The key parameter is es.input.max.docs.per.partition, which determines the partition count as: number of partitions = total number of documents / es.input.max.docs.per.partition. Letting users choose a suitable partition count addresses the case where a large data volume causes an overly long shuffle and slow transfer between ES and other components; controlling the number of partitions used to read ES likewise speeds up the shuffle, so documenting it would make the plugin more convenient and efficient to use. Note that this differs somewhat from the Elasticsearch documentation's advice of sizing reads by the number of CPU cores (threads) per shard, so an appropriate value has to be found through experimentation. In my tests, a well-chosen value improved ES read efficiency by 3-10x.
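For reference, a minimal sketch of what such a documented example might look like, assuming the v1 Elasticsearch input plugin accepts the hosts, index, and result_table_name options from its docs and forwards es.* keys (including es.input.max.docs.per.partition) to elasticsearch-hadoop; the hostname, index name, result table name, and the value 100000 below are illustrative only:

input {
    elasticsearch {
        hosts = ["localhost:9200"]      # illustrative cluster address
        index = "my_index"              # illustrative index name
        result_table_name = "es_source" # illustrative result table name
        # Forwarded to elasticsearch-hadoop: caps the documents per partition,
        # so partition count ≈ total documents / this value.
        es.input.max.docs.per.partition = 100000
    }
}

Lowering the value creates more, smaller partitions for the subsequent shuffle; raising it does the opposite, so the right setting depends on the data volume and cluster size.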

garyelephant changed the title from Documentation Problems to Documentation Enhancement on Sep 28, 2020
garyelephant (Member, Author) commented Oct 26, 2020

V1 Spark:

filter {
    split {
        source_field = "raw_message"
        delimiter = "\\|"
        fields = ["field1", "field2"]
    }
}
  • In this example, triple quotes should be used (a corrected sketch follows below).
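A hedged sketch of the suggested fix, wrapping the delimiter in HOCON triple quotes so the backslashes reach the plugin without config-level escape processing; whether the final regex needs one backslash or two depends on how the split plugin consumes the value, so treat the exact delimiter string as an assumption:

filter {
    split {
        source_field = "raw_message"
        # Triple-quoted string: HOCON performs no escape processing,
        # so the delimiter is passed through exactly as written.
        delimiter = """\\|"""
        fields = ["field1", "field2"]
    }
}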
