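Flink Windows: Window Streams

A window stream is a data stream whose elements are assigned to windows by a WindowAssigner. Calling APIs such as reduce, max, sum, or aggregate on it creates a WindowOperator or EvictingWindowOperator that performs the window operation.

There are two kinds of window stream: AllWindowedStream, obtained by calling windowAll on a DataStream, and WindowedStream, obtained by calling window on a KeyedStream. The two are nearly identical, because the AllWindowedStream constructor calls keyBy on its input DataStream with the KeySelector below, which gives every element the key 0 and thereby mocks a KeyedStream.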

public class NullByteKeySelector<T> implements KeySelector<T, Byte> {
	@Override
	public Byte getKey(T value) throws Exception {
		return 0;
	}
}
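
To see why a constant key turns a keyed stream into a single global group, here is a minimal sketch in plain Java with no Flink dependency. SimpleKeySelector, ConstantKeyDemo, groupByKey and nullByteKey are hypothetical names made up for this illustration; only the idea of returning the byte 0 for every element comes from NullByteKeySelector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Flink's KeySelector; only the shape is similar.
interface SimpleKeySelector<T, K> {
    K getKey(T value);
}

final class ConstantKeyDemo {

    // Groups elements by the key the selector returns, roughly how a keyed stream partitions them.
    static <T, K> Map<K, List<T>> groupByKey(List<T> elements, SimpleKeySelector<T, K> selector) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T element : elements) {
            groups.computeIfAbsent(selector.getKey(element), k -> new ArrayList<>()).add(element);
        }
        return groups;
    }

    // Same idea as NullByteKeySelector: every element maps to the byte key 0.
    static <T> SimpleKeySelector<T, Byte> nullByteKey() {
        return value -> (byte) 0;
    }
}
```

With this selector every element lands in the same group, so any per-key window logic effectively runs once over the whole stream.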

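The rest of this article uses WindowedStream. The window functions covered in the previous article, "Flink Windows: Window Functions", are all meant to be used with WindowedStream.

WindowedStream fields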

// Input stream. Every WindowedStream API calls input.transform(...) to create
// a OneInputTransformation and add it to the transformation tree.
private final KeyedStream<T, K> input;

// Window assigner
private final WindowAssigner<? super T, W> windowAssigner;

// Trigger
private Trigger<? super T, ? super W> trigger;

// Evictor
private Evictor<? super T, ? super W> evictor;

// User-specified allowed lateness
private long allowedLateness = 0L;

// Side-output tag for late data; if no lateDataOutputTag is set, late data is dropped
private OutputTag<T> lateDataOutputTag;

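The WindowedStream reduce method

reduce requires at least a ReduceFunction. Like reduce on KeyedStream, it aggregates elements, here all the elements of a window.

Without a WindowFunction

When no WindowFunction is passed, Flink supplies a PassThroughWindowFunction instance by default.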

 public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> function) {

 	function = input.getExecutionEnvironment().clean(function);

 	String callLocation = Utils.getCallLocationName();
 	String udfName = "WindowedStream." + callLocation;

 	return reduce(function, new PassThroughWindowFunction<K, W, T>());
 }

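With a WindowFunction

This overload takes a ReduceFunction and a WindowFunction, then creates a different StateDescriptor depending on whether the WindowedStream has an evictor (the comments in the code explain why). With an evictor it creates an EvictingWindowOperator; otherwise it creates a WindowOperator. Note that only the AllWindowedStream version of this method ends with a call to forceNonParallel, which fixes its parallelism at 1. The WindowedStream version shown here returns the result of input.transform directly.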

 public <R> SingleOutputStreamOperator<R> reduce(
 		ReduceFunction<T> reduceFunction,
 		WindowFunction<T, R, K, W> function,
 		TypeInformation<R> resultType) {

 	function = input.getExecutionEnvironment().clean(function);
 	reduceFunction = input.getExecutionEnvironment().clean(reduceFunction);

 	String callLocation = Utils.getCallLocationName();
 	String udfName = "WindowedStream." + callLocation;

 	String opName;
 	KeySelector<T, K> keySel = input.getKeySelector();

 	OneInputStreamOperator<T, R> operator;

 	if (evictor != null) {
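 		// With an evictor, the evictingWindowState of EvictingWindowOperator
 		// must keep every element, because evictBefore and evictAfter
 		// run in emitWindowContents.
 		// Hence a ListStateDescriptor is used here.
 		// The reduce then has to be done by hand, so the functions are wrapped in ReduceApplyWindowFunction.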
 		TypeSerializer<StreamRecord<T>> streamRecordSerializer =
 			(TypeSerializer<StreamRecord<T>>) new StreamElementSerializer(input.getType().createSerializer(getExecutionEnvironment().getConfig()));

 		ListStateDescriptor<StreamRecord<T>> stateDesc =
 			new ListStateDescriptor<>("window-contents", streamRecordSerializer);

 		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + evictor + ", " + udfName + ")";

 		operator =
 			new EvictingWindowOperator<>(windowAssigner,
 				windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
 				keySel,
 				input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
 				stateDesc,
 				new InternalIterableWindowFunction<>(new ReduceApplyWindowFunction<>(reduceFunction, function)),
 				trigger,
 				evictor,
 				allowedLateness,
 				lateDataOutputTag);

 	} else {
 		// State descriptor: without an evictor the raw elements need not be kept,
 		// so a ReducingStateDescriptor is created directly.
 		// When WindowOperator calls windowState.add(element.getValue()),
 		// windowState performs the reduce right away.
 		// The InternalWindowFunction is therefore InternalSingleValueWindowFunction,
 		// because after the reduce the window holds a single value.
 		ReducingStateDescriptor<T> stateDesc = new ReducingStateDescriptor<>("window-contents",
 			reduceFunction,
 			input.getType().createSerializer(getExecutionEnvironment().getConfig()));

 		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + udfName + ")";

 		operator =
 			new WindowOperator<>(windowAssigner,
 				windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
 				keySel,
 				input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
 				stateDesc,
 				new InternalSingleValueWindowFunction<>(function),
 				trigger,
 				allowedLateness,
 				lateDataOutputTag);
 	}

 	return input.transform(opName, resultType, operator);
 }
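
The difference between the two branches can be modeled without Flink. The sketch below is plain Java and purely illustrative: WindowStateSketch, ReducingBuffer, ListBuffer and evictAndReduce are hypothetical names, and the count-based eviction is just one assumed eviction policy. It is not how WindowOperator or EvictingWindowOperator are implemented, only a picture of why one path can keep a single value while the other must keep every element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

final class WindowStateSketch {

    // Models the no-evictor path: state holds one value and folds each element in on arrival.
    static final class ReducingBuffer<T> {
        private final BinaryOperator<T> reduceFunction;
        private T current; // null until the first element arrives

        ReducingBuffer(BinaryOperator<T> reduceFunction) {
            this.reduceFunction = reduceFunction;
        }

        void add(T value) {
            current = (current == null) ? value : reduceFunction.apply(current, value);
        }

        T get() {
            return current;
        }
    }

    // Models the evictor path: state keeps every element so that some can be skipped at firing time.
    static final class ListBuffer<T> {
        private final List<T> elements = new ArrayList<>();

        void add(T value) {
            elements.add(value);
        }

        // Reduces only the newest maxCount elements, as if the older ones had been evicted.
        T evictAndReduce(int maxCount, BinaryOperator<T> reduceFunction) {
            int from = Math.max(0, elements.size() - maxCount);
            T result = null;
            for (T value : elements.subList(from, elements.size())) {
                result = (result == null) ? value : reduceFunction.apply(result, value);
            }
            return result;
        }

        int size() {
            return elements.size();
        }
    }
}
```

The reducing buffer never stores more than one value per window, which is why the evictor-free path is cheaper in state size. The list buffer pays for its flexibility by holding the raw elements until the window fires.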

With a ProcessWindowFunction

Passing a ProcessWindowFunction works much like passing a WindowFunction; only the internal window function wrapper differs, so the details are skipped here.

 public <R> SingleOutputStreamOperator<R> reduce(ReduceFunction<T> reduceFunction, ProcessWindowFunction<T, R, K, W> function, TypeInformation<R> resultType) {

 	function = input.getExecutionEnvironment().clean(function);
 	reduceFunction = input.getExecutionEnvironment().clean(reduceFunction);

 	String callLocation = Utils.getCallLocationName();
 	String udfName = "WindowedStream." + callLocation;

 	String opName;
 	KeySelector<T, K> keySel = input.getKeySelector();

 	OneInputStreamOperator<T, R> operator;

 	// The choice of InternalWindowFunction and StateDescriptor follows the same reasoning as in the reduce overload above
 	if (evictor != null) {
 		@SuppressWarnings({"unchecked", "rawtypes"})
 		TypeSerializer<StreamRecord<T>> streamRecordSerializer =
 			(TypeSerializer<StreamRecord<T>>) new StreamElementSerializer(input.getType().createSerializer(getExecutionEnvironment().getConfig()));

 		ListStateDescriptor<StreamRecord<T>> stateDesc =
 			new ListStateDescriptor<>("window-contents", streamRecordSerializer);

 		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + evictor + ", " + udfName + ")";

 		operator =
 			new EvictingWindowOperator<>(windowAssigner,
 				windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
 				keySel,
 				input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
 				stateDesc,
 				new InternalIterableProcessWindowFunction<>(new ReduceApplyProcessWindowFunction<>(reduceFunction, function)),
 				trigger,
 				evictor,
 				allowedLateness,
 				lateDataOutputTag);

 	} else {
 		ReducingStateDescriptor<T> stateDesc = new ReducingStateDescriptor<>("window-contents",
 			reduceFunction,
 			input.getType().createSerializer(getExecutionEnvironment().getConfig()));

 		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + udfName + ")";

 		operator =
 			new WindowOperator<>(windowAssigner,
 				windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
 				keySel,
 				input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
 				stateDesc,
 				new InternalSingleValueProcessWindowFunction<>(function),
 				trigger,
 				allowedLateness,
 				lateDataOutputTag);
 	}

 	return input.transform(opName, resultType, operator);
 }

The WindowedStream sum/max/min/maxBy/minBy methods

The sum/max/min/maxBy/minBy methods of WindowedStream mirror those of KeyedStream. They simply delegate to reduce, which works because AggregationFunction implements the ReduceFunction interface.

public SingleOutputStreamOperator<T> sum(int positionToSum) {
	return aggregate(new SumAggregator<>(positionToSum, input.getType(), input.getExecutionConfig()));
}

private SingleOutputStreamOperator<T> aggregate(AggregationFunction<T> aggregator) {
	return reduce(aggregator);
}
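
To make the "an aggregator is just a reduce function" point concrete, here is a small sketch in plain Java. PositionSumSketch and sumAt are hypothetical names, and int[] stands in for a tuple; this is not Flink's SumAggregator, only an assumed simplification of the same idea.

```java
import java.util.function.BinaryOperator;

final class PositionSumSketch {

    // Returns a reduce function that sums the field at `position`
    // and keeps the other fields of the left-hand argument.
    static BinaryOperator<int[]> sumAt(int position) {
        return (left, right) -> {
            int[] result = left.clone();
            result[position] = left[position] + right[position];
            return result;
        };
    }
}
```

Because the aggregation is expressed as a two-argument function that returns a value of the same type, it can be plugged in anywhere a reduce function is expected.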

The WindowedStream aggregate method

aggregate is similar to reduce, except that the ReduceFunction is replaced by an AggregateFunction. Here is the source of the aggregate overload that takes a WindowFunction:

public <ACC, V, R> SingleOutputStreamOperator<R> aggregate(
		AggregateFunction<T, ACC, V> aggregateFunction,
		WindowFunction<V, R, K, W> windowFunction,
		TypeInformation<ACC> accumulatorType,
		TypeInformation<R> resultType) {

	windowFunction = input.getExecutionEnvironment().clean(windowFunction);
	aggregateFunction = input.getExecutionEnvironment().clean(aggregateFunction);

	final String callLocation = Utils.getCallLocationName();
	final String udfName = "WindowedStream." + callLocation;

	final String opName;
	final KeySelector<T, K> keySel = input.getKeySelector();

	OneInputStreamOperator<T, R> operator;

	// The choice of StateDescriptor and InternalWindowFunction
	// follows the same reasoning as in reduce above
	if (evictor != null) {
		TypeSerializer<StreamRecord<T>> streamRecordSerializer =
				(TypeSerializer<StreamRecord<T>>) new StreamElementSerializer(
						input.getType().createSerializer(getExecutionEnvironment().getConfig()));

		ListStateDescriptor<StreamRecord<T>> stateDesc =
				new ListStateDescriptor<>("window-contents", streamRecordSerializer);

		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + evictor + ", " + udfName + ")";

		operator =
				new EvictingWindowOperator<>(windowAssigner,
						windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
						keySel,
						input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
						stateDesc,
						new InternalIterableWindowFunction<>(
								new AggregateApplyWindowFunction<>(aggregateFunction, windowFunction)),
						trigger,
						evictor,
						allowedLateness,
						lateDataOutputTag);

	} else {
		AggregatingStateDescriptor<T, ACC, V> stateDesc = new AggregatingStateDescriptor<>(
				"window-contents",
				aggregateFunction,
				accumulatorType.createSerializer(getExecutionEnvironment().getConfig()));

		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + udfName + ")";

		operator = new WindowOperator<>(
						windowAssigner,
						windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
						keySel,
						input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
						stateDesc,
						new InternalSingleValueWindowFunction<>(windowFunction),
						trigger,
						allowedLateness,
						lateDataOutputTag);
	}

	return input.transform(opName, resultType, operator);
}
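
What an AggregateFunction adds over a ReduceFunction is that the input, accumulator and output types may all differ. The sketch below shows this with an average, in plain Java. SimpleAggregateFunction is a hypothetical local interface that copies the four-method shape of Flink's AggregateFunction so the example runs without Flink, and AverageSketch is made up for illustration.

```java
// Hypothetical local interface with the same four-method shape as Flink's AggregateFunction.
interface SimpleAggregateFunction<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC accumulator);
    OUT getResult(ACC accumulator);
    ACC merge(ACC a, ACC b);
}

// Average of long values; the accumulator is {sum, count}.
final class AverageSketch implements SimpleAggregateFunction<Long, long[], Double> {

    @Override
    public long[] createAccumulator() {
        return new long[] {0L, 0L};
    }

    @Override
    public long[] add(Long value, long[] acc) {
        acc[0] += value; // running sum
        acc[1] += 1;     // running count
        return acc;
    }

    @Override
    public Double getResult(long[] acc) {
        // An empty accumulator yields 0.0 here; that is a choice made for this sketch.
        return acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1];
    }

    @Override
    public long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }
}
```

Here the input is Long, the accumulator is a two-element long[], and the result is Double, which matches the T, ACC and V roles in the aggregate signature.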

The WindowedStream apply method

If you do not want any incremental aggregation on the window, call the public apply method with a WindowFunction, or the process method with a ProcessWindowFunction. Both wrap the WindowFunction or ProcessWindowFunction in an InternalWindowFunction and then call the private apply method below.

As the code shows, a ListStateDescriptor is used whether or not an evictor is set, and the collected elements are handed directly to the WindowFunction or ProcessWindowFunction.

private <R> SingleOutputStreamOperator<R> apply(InternalWindowFunction<Iterable<T>, R, K, W> function, TypeInformation<R> resultType, String callLocation) {

	String udfName = "WindowedStream." + callLocation;

	String opName;
	KeySelector<T, K> keySel = input.getKeySelector();

	WindowOperator<K, T, Iterable<T>, R, W> operator;

	// No incremental aggregation happens here, so both branches use a ListStateDescriptor
	if (evictor != null) {
		TypeSerializer<StreamRecord<T>> streamRecordSerializer =
				(TypeSerializer<StreamRecord<T>>) new StreamElementSerializer(input.getType().createSerializer(getExecutionEnvironment().getConfig()));

		ListStateDescriptor<StreamRecord<T>> stateDesc =
				new ListStateDescriptor<>("window-contents", streamRecordSerializer);

		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + evictor + ", " + udfName + ")";

		operator =
			new EvictingWindowOperator<>(windowAssigner,
				windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
				keySel,
				input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
				stateDesc,
				function,
				trigger,
				evictor,
				allowedLateness,
				lateDataOutputTag);

	} else {
		ListStateDescriptor<T> stateDesc = new ListStateDescriptor<>("window-contents",
				input.getType().createSerializer(getExecutionEnvironment().getConfig()));

		opName = "TriggerWindow(" + windowAssigner + ", " + stateDesc + ", " + trigger + ", " + udfName + ")";

		operator =
			new WindowOperator<>(windowAssigner,
				windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
				keySel,
				input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
				stateDesc,
				function,
				trigger,
				allowedLateness,
				lateDataOutputTag);
	}

	return input.transform(opName, resultType, operator);
}
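
Some computations cannot be expressed as an incremental reduce and really do need the whole window content, which is the case apply and process serve. A median is a typical example. The sketch below is plain Java: MedianWindowSketch and median are hypothetical names, and the Iterable parameter only plays the role of the window contents that a WindowFunction receives.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MedianWindowSketch {

    // Plays the role of a WindowFunction body: it sees the whole window content at once.
    static double median(Iterable<Integer> windowContents) {
        List<Integer> sorted = new ArrayList<>();
        for (Integer value : windowContents) {
            sorted.add(value);
        }
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("window is empty");
        }
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        // Odd size: middle element. Even size: mean of the two middle elements.
        return sorted.size() % 2 == 1
                ? sorted.get(mid)
                : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }
}
```

The price is that every element of the window has to be kept in state until the window fires, which is exactly what the list state in both branches does.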

A small window example

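import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

String host = "127.0.0.1";
int port = 9000;

// Read lines from a socket and turn each line into a (line, 1) pair
DataStream<String> dataStream = env.socketTextStream(host, port);
dataStream.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String value) throws Exception {
        return Tuple2.of(value, 1);
    }
}).keyBy(0) // key by the text field
        .window(TumblingProcessingTimeWindows.of(Time.of(5, TimeUnit.SECONDS)))
        .sum(1) // add up the counts per key within each 5-second window
        .printToErr();

// Nothing runs until the job is submitted
env.execute("window stream example");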
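
To get a feel for what this job computes, the following plain-Java sketch replays the same logic on a fixed list of events instead of a socket. TumblingCountSketch, windowStart and countPerWindow are hypothetical names. The sketch assumes non-negative timestamps and no window offset, and passes timestamps in explicitly rather than reading the processing-time clock.

```java
import java.util.Map;
import java.util.TreeMap;

final class TumblingCountSketch {

    // Start of the tumbling window that contains `timestamp` (no offset, non-negative timestamps).
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - (timestamp % windowSize);
    }

    // Counts occurrences per "windowStart|word"; words[i] arrived at timestamps[i].
    static Map<String, Integer> countPerWindow(String[] words, long[] timestamps, long windowSize) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < words.length; i++) {
            String key = windowStart(timestamps[i], windowSize) + "|" + words[i];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, with a window size of 5000 ms, two occurrences of "a" at 1000 and 4999 fall into the window starting at 0, while an "a" at 5000 opens the next window.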

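Summary

Across these four articles we looked at Flink's window operations from four angles. Windows are one of Flink's signature features, and I hope these notes help you understand them well.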
