That does seem like the safe approach. What about minSplitSizeNode and minSplitSizeRack? I'd expect to read them from the conf using the CombineFileInputFormat (CFIF) config keys, then set them under Elephant Bird's SplitUtil conf keys as well. However, DelegateCombineFileInputFormat doesn't appear to have any notion of a min size.
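To make the "read from the CFIF keys, then set the Elephant Bird keys" idea concrete, here is a minimal sketch. It uses a plain Map as a stand-in for Hadoop's Configuration so it runs on its own. The two Hadoop min-size key names are the ones I believe Hadoop 2 uses, the value shown for COMBINE_SPLIT_SIZE is my recollection of SplitUtil and should be checked against the source, and the two Elephant Bird min-size keys are purely hypothetical, since no such keys appear to exist today.

```java
import java.util.HashMap;
import java.util.Map;

class SplitKeyPropagation {
    // Each pair is {Hadoop CombineFileInputFormat key, Elephant Bird key}.
    static final String[][] KEY_PAIRS = {
        // Target is the assumed value of SplitUtil.COMBINE_SPLIT_SIZE; verify against the source.
        {"mapreduce.input.fileinputformat.split.maxsize",
         "elephantbird.combine.split.size"},
        // HYPOTHETICAL Elephant Bird keys: invented here for illustration only.
        {"mapreduce.input.fileinputformat.split.minsize.per.node",
         "elephantbird.combine.split.minsize.per.node"},
        {"mapreduce.input.fileinputformat.split.minsize.per.rack",
         "elephantbird.combine.split.minsize.per.rack"},
    };

    /**
     * Copies each Hadoop value to its Elephant Bird key, but only when the
     * Hadoop key is set and the Elephant Bird key is not already set, so an
     * explicit Elephant Bird setting always wins.
     */
    static void propagate(Map<String, String> conf) {
        for (String[] pair : KEY_PAIRS) {
            String hadoopValue = conf.get(pair[0]);
            if (hadoopValue != null && !conf.containsKey(pair[1])) {
                conf.put(pair[1], hadoopValue);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapreduce.input.fileinputformat.split.maxsize", "268435456");
        propagate(conf);
        System.out.println(conf.get("elephantbird.combine.split.size"));
    }
}
```

With a real Configuration the same logic would use conf.get(key), which returns null for an unset key, and conf.set(key, value). Where to call it inside DelegateCombineFileInputFormat is still the open question.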
I'm trying to use DelegateCombineFileInputFormat + LzoTextInputFormat + LzoTextOutputFormat. I'm also trying to specify the maxSplitSize for combining files. I've found that DelegateCombineFileInputFormat doesn't honor maxSplitSize, minSplitSizeNode, or minSplitSizeRack if they are configured before the job is run.
Per @jcoveney: "If there is a maxInputSplitSize in Hadoop's CombineFileInputFormat no, it is not honored." See:
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/util/SplitUtil.java#L35
I can see a couple of approaches for a fix:
1. Change SplitUtil.getCombinedSplitSize(Configuration) so that it first tries getLong on COMBINE_SPLIT_SIZE and, if that isn't set, falls back to CombineFileInputFormat's "mapreduce.input.fileinputformat.split.maxsize", which apparently isn't a static constant but a hard-coded string...
2. DelegateCombineFileInputFormat could set the SplitUtil.COMBINE_SPLIT_SIZE key to CombineFileInputFormat's max split size if one was set. The same approach could be used for minSplitSizeNode and minSplitSizeRack. Where in DelegateCombineFileInputFormat would this go?
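A rough sketch of the first approach, the fallback lookup, might look like the following. It is not the actual SplitUtil code: a Map stands in for Hadoop's Configuration so the example is self-contained, the getLong helper only mimics Configuration.getLong(key, default), the value of COMBINE_SPLIT_SIZE is my assumption, and the defaultSize parameter is a placeholder for whatever default SplitUtil uses today.

```java
import java.util.HashMap;
import java.util.Map;

class SplitSizeFallback {
    // Assumed value of SplitUtil.COMBINE_SPLIT_SIZE; verify against the Elephant Bird source.
    static final String COMBINE_SPLIT_SIZE = "elephantbird.combine.split.size";
    // Hadoop key quoted in this issue.
    static final String HADOOP_SPLIT_MAXSIZE = "mapreduce.input.fileinputformat.split.maxsize";

    /** Stand-in for Configuration.getLong(key, defaultValue) over a plain map. */
    static long getLong(Map<String, String> conf, String key, long defaultValue) {
        String raw = conf.get(key);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        return Long.parseLong(raw.trim());
    }

    /**
     * Lookup order: the Elephant Bird key first, then the Hadoop max split
     * size, then the caller-supplied default (a placeholder in this sketch).
     */
    static long getCombinedSplitSize(Map<String, String> conf, long defaultSize) {
        long hadoopMax = getLong(conf, HADOOP_SPLIT_MAXSIZE, defaultSize);
        return getLong(conf, COMBINE_SPLIT_SIZE, hadoopMax);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(HADOOP_SPLIT_MAXSIZE, "134217728");
        // Only the Hadoop key is set, so the Hadoop value is used.
        System.out.println(getCombinedSplitSize(conf, 67108864L));
    }
}
```

This keeps existing jobs that set COMBINE_SPLIT_SIZE unchanged and only alters behavior for jobs that set the Hadoop key alone. It does nothing for minSplitSizeNode or minSplitSizeRack.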