当前位置: 首页 > news >正文

一文搞懂 Flink Graph 构建过程源码

一文搞懂 Flink Graph 构建过程

  • 1. StreamGraph构建过程
    • 1.1 transform(): 构建的核心
    • 1.2 transformOneInputTransform
    • 1.3 构造顶点
    • 1.4 构造边
    • 1.5 transformSource
    • 1.6 transformPartition
    • 1.7 transformSink

1. StreamGraph构建过程

链接: 一文搞懂 Flink 其他重要源码点击我

env.execute将进行任务的提交和执行,在执行之前会对任务进行StreamGraph和JobGraph的构建,然后再提交JobGraph。
那么现在就来分析一下StreamGraph的构建过程,在正式环境下,会调用StreamContextEnvironment.execute()方法:

public JobExecutionResult execute(String jobName) throws Exception {Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");// 获取 StreamGraphreturn execute(getStreamGraph(jobName));
}public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {// 构建StreamGraphStreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();if (clearTransformations) {// 构建完StreamGraph之后,可以清空transformations列表this.transformations.clear();}return streamGraph;
}

实现在StreamGraphGenerator.generate(this, transformations);这里的transformations列表就是在之前调用map、flatMap、filter算子时添加进去的生成的DataStream的StreamTransformation,是DataStream的描述信息。
接着看:

public StreamGraph generate() {// 实例化 StreamGraphstreamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);streamGraph.setStateBackend(stateBackend);streamGraph.setChaining(chaining);streamGraph.setScheduleMode(scheduleMode);streamGraph.setUserArtifacts(userArtifacts);streamGraph.setTimeCharacteristic(timeCharacteristic);streamGraph.setJobName(jobName);streamGraph.setGlobalDataExchangeMode(globalDataExchangeMode);// 记录了已经transformed的StreamTransformation。alreadyTransformed = new HashMap<>();for (Transformation<?> transformation: transformations) {// 根据 transformation 创建StreamGraphtransform(transformation);}final StreamGraph builtStreamGraph = streamGraph;alreadyTransformed.clear();alreadyTransformed = null;streamGraph = null;return builtStreamGraph;
}

1.1 transform(): 构建的核心

private Collection<Integer> transform(Transformation<?> transform) {if (alreadyTransformed.containsKey(transform)) {return alreadyTransformed.get(transform);}LOG.debug("Transforming " + transform);if (transform.getMaxParallelism() <= 0) {// if the max parallelism hasn't been set, then first use the job wide max parallelism// from the ExecutionConfig.int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();if (globalMaxParallelismFromConfig > 0) {transform.setMaxParallelism(globalMaxParallelismFromConfig);}}// call at least once to trigger exceptions about MissingTypeInfotransform.getOutputType();Collection<Integer> transformedIds;// 判断 transform 属于哪类 Transformationif (transform instanceof OneInputTransformation<?, ?>) {transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);} else if (transform instanceof TwoInputTransformation<?, ?, ?>) {transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);} else if (transform instanceof AbstractMultipleInputTransformation<?>) {transformedIds = transformMultipleInputTransform((AbstractMultipleInputTransformation<?>) transform);} else if (transform instanceof SourceTransformation) {transformedIds = transformSource((SourceTransformation<?>) transform);} else if (transform instanceof LegacySourceTransformation<?>) {transformedIds = transformLegacySource((LegacySourceTransformation<?>) transform);} else if (transform instanceof SinkTransformation<?>) {transformedIds = transformSink((SinkTransformation<?>) transform);} else if (transform instanceof UnionTransformation<?>) {transformedIds = transformUnion((UnionTransformation<?>) transform);} else if (transform instanceof SplitTransformation<?>) {transformedIds = transformSplit((SplitTransformation<?>) transform);} else if (transform instanceof SelectTransformation<?>) {transformedIds = transformSelect((SelectTransformation<?>) transform);} else if (transform instanceof FeedbackTransformation<?>) {transformedIds = transformFeedback((FeedbackTransformation<?>) transform);} else if (transform instanceof CoFeedbackTransformation<?>) {transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);} else if (transform instanceof PartitionTransformation<?>) {transformedIds = transformPartition((PartitionTransformation<?>) transform);} else if (transform instanceof SideOutputTransformation<?>) {transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);} else {throw new IllegalStateException("Unknown transformation: " + transform);}// need this check because the iterate transformation adds itself before// transforming the feedback edgesif (!alreadyTransformed.containsKey(transform)) {alreadyTransformed.put(transform, transformedIds);}if (transform.getBufferTimeout() >= 0) {streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());} else {streamGraph.setBufferTimeout(transform.getId(), defaultBufferTimeout);}if (transform.getUid() != null) {streamGraph.setTransformationUID(transform.getId(), transform.getUid());}if (transform.getUserProvidedNodeHash() != null) {streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());}if (!streamGraph.getExecutionConfig().hasAutoGeneratedUIDsEnabled()) {if (transform instanceof PhysicalTransformation &&transform.getUserProvidedNodeHash() == null &&transform.getUid() == null) {throw new IllegalStateException("Auto generated UIDs have been disabled " +"but no UID or hash has been assigned to operator " + transform.getName());}}if (transform.getMinResources() != null && transform.getPreferredResources() != null) {streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());}streamGraph.setManagedMemoryWeight(transform.getId(), transform.getManagedMemoryWeight());return transformedIds;
}

拿WordCount举例,在flatMap、map、reduce、addSink过程中会将生成的DataStream的StreamTransformation添加到transformations列表中。
addSource没有将StreamTransformation添加到transformations,但是flatMap生成的StreamTransformation的input持有SourceTransformation的引用。
keyBy算子会生成KeyedStream,但是它的StreamTransformation并不会添加到transformations列表中,不过reduce生成的DataStream中的StreamTransformation中持有了KeyedStream的StreamTransformation的引用,作为它的input。
所以,WordCount中有4个StreamTransformation,前3个算子均为OneInputTransformation,最后一个为SinkTransformation。

1.2 transformOneInputTransform

由于transformations中第一个为OneInputTransformation,所以代码首先会走到transformOneInputTransform((OneInputTransformation<?, ?>) transform)

private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {//首先递归的transform该OneInputTransformation的inputCollection<Integer> inputIds = transform(transform.getInput());//如果递归transform input时发现input已经被transform,那么直接获取结果即可// the recursive call might have already transformed thisif (alreadyTransformed.containsKey(transform)) {return alreadyTransformed.get(transform);}//获取共享slot资源组,默认分组名”default”String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);//将该StreamTransformation添加到StreamGraph中,当做一个顶点streamGraph.addOperator(transform.getId(),slotSharingGroup,transform.getCoLocationGroupKey(),transform.getOperator(),transform.getInputType(),transform.getOutputType(),transform.getName());if (transform.getStateKeySelector() != null) {TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);}//设置顶点的并行度和最大并行度streamGraph.setParallelism(transform.getId(), transform.getParallelism());streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());//根据该StreamTransformation有多少个input,依次给StreamGraph添加边,即input——>currentfor (Integer inputId: inputIds) {streamGraph.addEdge(inputId, transform.getId(), 0);}//返回该StreamTransformation的id,OneInputTransformation只有自身的一个id列表return Collections.singleton(transform.getId());
}

transformOneInputTransform()方法的实现如下:
1、先递归的transform该OneInputTransformation的input,如果input已经transformed,那么直接从map中取数据即可
2、将该StreamTransformation作为一个图的顶点添加到StreamGraph中,并设置顶点的并行度和共享资源组
3、根据该StreamTransformation的input,构造图的边,有多少个input,就会生成多少边,不过OneInputTransformation顾名思义就是一个input,所以会构造一条边,即input——>currentId

1.3 构造顶点

//StreamGraph类
public <IN, OUT> void addOperator(Integer vertexID,String slotSharingGroup,@Nullable String coLocationGroup,StreamOperator<OUT> operatorObject,TypeInformation<IN> inTypeInfo,TypeInformation<OUT> outTypeInfo,String operatorName) {//将StreamTransformation作为一个顶点,添加到streamNodes中if (operatorObject instanceof StoppableStreamSource) {addNode(vertexID, slotSharingGroup, coLocationGroup, StoppableSourceStreamTask.class, operatorObject, operatorName);} else if (operatorObject instanceof StreamSource) {//如果operator是StreamSource,则Task类型为SourceStreamTaskaddNode(vertexID, slotSharingGroup, coLocationGroup, SourceStreamTask.class, operatorObject, operatorName);} else {//如果operator不是StreamSource,Task类型为OneInputStreamTaskaddNode(vertexID, slotSharingGroup, coLocationGroup, OneInputStreamTask.class, operatorObject, operatorName);}TypeSerializer<IN> inSerializer = inTypeInfo != null && !(inTypeInfo instanceof MissingTypeInfo) ? inTypeInfo.createSerializer(executionConfig) : null;TypeSerializer<OUT> outSerializer = outTypeInfo != null && !(outTypeInfo instanceof MissingTypeInfo) ? outTypeInfo.createSerializer(executionConfig) : null;setSerializers(vertexID, inSerializer, null, outSerializer);if (operatorObject instanceof OutputTypeConfigurable && outTypeInfo != null) {@SuppressWarnings("unchecked")OutputTypeConfigurable<OUT> outputTypeConfigurable = (OutputTypeConfigurable<OUT>) operatorObject;// sets the output type which must be know at StreamGraph creation timeoutputTypeConfigurable.setOutputType(outTypeInfo, executionConfig);}if (operatorObject instanceof InputTypeConfigurable) {InputTypeConfigurable inputTypeConfigurable = (InputTypeConfigurable) operatorObject;inputTypeConfigurable.setInputType(inTypeInfo, executionConfig);}if (LOG.isDebugEnabled()) {LOG.debug("Vertex: {}", vertexID);}
}protected StreamNode addNode(Integer vertexID,String slotSharingGroup,@Nullable String coLocationGroup,Class<? extends AbstractInvokable> vertexClass,StreamOperator<?> operatorObject,String operatorName) {if (streamNodes.containsKey(vertexID)) {throw new RuntimeException("Duplicate vertexID " + vertexID);}//构造顶点,添加到streamNodes中,streamNodes是一个MapStreamNode vertex = new StreamNode(environment,vertexID,slotSharingGroup,coLocationGroup,operatorObject,operatorName,new ArrayList<OutputSelector<?>>(),vertexClass);// 添加到 streamNodesstreamNodes.put(vertexID, vertex);return vertex;
}

1.4 构造边

 public void addEdge(Integer upStreamVertexID, Integer downStreamVertexID, int typeNumber) {addEdgeInternal(upStreamVertexID,downStreamVertexID,typeNumber,null, //这里开始传递的partitioner都是nullnew ArrayList<String>(),null //outputTag对侧输出有效);}private void addEdgeInternal(Integer upStreamVertexID,Integer downStreamVertexID,int typeNumber,StreamPartitioner<?> partitioner,List<String> outputNames,OutputTag outputTag) {if (virtualSideOutputNodes.containsKey(upStreamVertexID)) {//针对侧输出int virtualId = upStreamVertexID;upStreamVertexID = virtualSideOutputNodes.get(virtualId).f0;if (outputTag == null) {outputTag = virtualSideOutputNodes.get(virtualId).f1;}addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, null, outputTag);} else if (virtualSelectNodes.containsKey(upStreamVertexID)) {int virtualId = upStreamVertexID;upStreamVertexID = virtualSelectNodes.get(virtualId).f0;if (outputNames.isEmpty()) {// selections that happen downstream override earlier selectionsoutputNames = virtualSelectNodes.get(virtualId).f1;}addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);} else if (virtualPartitionNodes.containsKey(upStreamVertexID)) {//keyBy算子产生的PartitionTransform作为下游的input,下游的StreamTransformation添加边时会走到这int virtualId = upStreamVertexID;upStreamVertexID = virtualPartitionNodes.get(virtualId).f0;if (partitioner == null) {partitioner = virtualPartitionNodes.get(virtualId).f1;}addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);} else {//一般的OneInputTransform会走到这里StreamNode upstreamNode = getStreamNode(upStreamVertexID);StreamNode downstreamNode = getStreamNode(downStreamVertexID);// If no partitioner was specified and the parallelism of upstream and downstream// operator matches use forward partitioning, use rebalance otherwise.//如果上下顶点的并行度一致,则用ForwardPartitioner,否则用RebalancePartitionerif (partitioner == null && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {partitioner = new ForwardPartitioner<Object>();} else if (partitioner == null) {partitioner = new RebalancePartitioner<Object>();}if (partitioner instanceof ForwardPartitioner) {if (upstreamNode.getParallelism() != downstreamNode.getParallelism()) {throw new UnsupportedOperationException("Forward partitioning does not allow " +"change of parallelism. Upstream operation: " + upstreamNode + " parallelism: " + upstreamNode.getParallelism() +", downstream operation: " + downstreamNode + " parallelism: " + downstreamNode.getParallelism() +" You must use another partitioning strategy, such as broadcast, rebalance, shuffle or global.");}}//一条边包括上下顶点,顶点之间的分区器等信息StreamEdge edge = new StreamEdge(upstreamNode, downstreamNode, typeNumber, outputNames, partitioner, outputTag);//分别给顶点添加出边和入边getStreamNode(edge.getSourceId()).addOutEdge(edge);getStreamNode(edge.getTargetId()).addInEdge(edge);}
}

边的属性包括上下游的顶点,和顶点之间的partitioner等信息。如果上下游的并行度一致,那么他们之间的partitioner是ForwardPartitioner,如果上下游的并行度不一致是RebalancePartitioner,当然这前提是没有设置partitioner的前提下。如果显示设置了partitioner的情况,例如keyBy算子,在内部就确定了分区器是KeyGroupStreamPartitioner,那么它们之间的分区器就是KeyGroupStreamPartitioner。

1.5 transformSource

上述说道,transformOneInputTransform会先递归的transform该OneInputTransform的input,那么对于WordCount中在transform第一个OneInputTransform时会首先transform它的input,也就是SourceTransformation,方法在transformSource()

private <T> Collection<Integer> transformSource(SourceTransformation<T> source) {String slotSharingGroup = determineSlotSharingGroup(source.getSlotSharingGroup(), Collections.emptyList());//将source作为图的顶点和图的source添加StreamGraph中streamGraph.addSource(source.getId(),slotSharingGroup,source.getCoLocationGroupKey(),source.getOperator(),null,source.getOutputType(),"Source: " + source.getName());if (source.getOperator().getUserFunction() instanceof InputFormatSourceFunction) {InputFormatSourceFunction<T> fs = (InputFormatSourceFunction<T>) source.getOperator().getUserFunction();streamGraph.setInputFormat(source.getId(), fs.getFormat());}//设置source的并行度streamGraph.setParallelism(source.getId(), source.getParallelism());streamGraph.setMaxParallelism(source.getId(), source.getMaxParallelism());//返回自身id的单列表return Collections.singleton(source.getId());
}//StreamGraph类
public <IN, OUT> void addSource(Integer vertexID,String slotSharingGroup,@Nullable String coLocationGroup,StreamOperator<OUT> operatorObject,TypeInformation<IN> inTypeInfo,TypeInformation<OUT> outTypeInfo,String operatorName) {addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorObject, inTypeInfo, outTypeInfo, operatorName);sources.add(vertexID);
}

transformSource()逻辑也比较简单,就是将source添加到StreamGraph中,注意Source是图的根节点,没有input,所以它不需要添加边。

1.6 transformPartition

在WordCount中,由reduce生成的DataStream的StreamTransformation是一个OneInputTransformation,同样,在transform它的时候,会首先transform它的input,而它的input就是KeyedStream中生成的PartitionTransformation。所以代码会执行transformPartition((PartitionTransformation<?>) transform)

private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {StreamTransformation<T> input = partition.getInput();List<Integer> resultIds = new ArrayList<>();//首先会transform PartitionTransformation的inputCollection<Integer> transformedIds = transform(input);for (Integer transformedId: transformedIds) {//生成一个新的虚拟节点int virtualId = StreamTransformation.getNewNodeId();//将虚拟节点添加到StreamGraph中streamGraph.addVirtualPartitionNode(transformedId, virtualId, partition.getPartitioner());resultIds.add(virtualId);}return resultIds;
}//StreamGraph类
public void addVirtualPartitionNode(Integer originalId, Integer virtualId, StreamPartitioner<?> partitioner) {if (virtualPartitionNodes.containsKey(virtualId)) {throw new IllegalStateException("Already has virtual partition node with id " + virtualId);}// virtualPartitionNodes是一个Map,存储虚拟的Partition节点virtualPartitionNodes.put(virtualId,new Tuple2<Integer, StreamPartitioner<?>>(originalId, partitioner));
}

transformPartition会为PartitionTransformation生成一个新的虚拟节点,同时将该虚拟节点保存到StreamGraph的virtualPartitionNodes中,并且会保存该PartitionTransformation的input,将PartitionTransformation的partitioner作为其input的分区器,在WordCount中,也就是作为map算子生成的DataStream的分区器。

在进行transform reduce生成的OneInputTransform时,它的inputIds便是transformPartition时为PartitionTransformation生成新的虚拟节点,在添加边的时候会走到下面的代码。边的形式大致是OneInputTransform(map)—KeyGroupStreamPartitioner—>OneInputTransform(window)

} else if (virtualPartitionNodes.containsKey(upStreamVertexID)) {//keyBy算子产生的PartitionTransform作为下游的input,下游的StreamTransform添加边时会走到这int virtualId = upStreamVertexID;upStreamVertexID = virtualPartitionNodes.get(virtualId).f0;if (partitioner == null) {partitioner = virtualPartitionNodes.get(virtualId).f1;}addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);

1.7 transformSink

在WordCount中,transformations的最后一个StreamTransformation是SinkTransformation,方法在
transformSink()

private <T> Collection<Integer> transformSink(SinkTransformation<T> sink) {//首先transform inputCollection<Integer> inputIds = transform(sink.getInput());String slotSharingGroup = determineSlotSharingGroup(sink.getSlotSharingGroup(), inputIds);//给图添加sink,并且构造并添加图的顶点streamGraph.addSink(sink.getId(),slotSharingGroup,sink.getCoLocationGroupKey(),sink.getOperator(),sink.getInput().getOutputType(),null,"Sink: " + sink.getName());streamGraph.setParallelism(sink.getId(), sink.getParallelism());streamGraph.setMaxParallelism(sink.getId(), sink.getMaxParallelism());//构造并添加StreamGraph的边for (Integer inputId: inputIds) {streamGraph.addEdge(inputId,sink.getId(),0);}if (sink.getStateKeySelector() != null) {TypeSerializer<?> keySerializer = sink.getStateKeyType().createSerializer(env.getConfig());streamGraph.setOneInputStateKey(sink.getId(), sink.getStateKeySelector(), keySerializer);}//因为sink是图的末尾节点,没有下游的输出,所以返回空了return Collections.emptyList();
}
//StreamGraph类
public <IN, OUT> void addSink(Integer vertexID,String slotSharingGroup,@Nullable String coLocationGroup,StreamOperator<OUT> operatorObject,TypeInformation<IN> inTypeInfo,TypeInformation<OUT> outTypeInfo,String operatorName) {addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorObject, inTypeInfo, outTypeInfo, operatorName);sinks.add(vertexID);
}

transformSink()中的逻辑也比较简单,和transformSource()类似,不同的是sink是有边的,而且sink的下游没有输出了,也就不需要作为下游的input,所以返回空列表。

在transformSink()之后,也就把所有的StreamTransformation的都进行transform了,那么这时候StreamGraph中的顶点、边、partition的虚拟顶点都构建好了,返回StreamGraph即可。下一步就是根据StreamGraph构造JobGraph了。
可以看出StreamGraph的结构还是比较简单的,每个DataStream的StreamTransformation会作为一个图的顶点(PartitionTransform是虚拟顶点),根据StreamTransformation的input来构建图的边。

http://www.lryc.cn/news/437652.html

相关文章:

  • 【spring】IDEA 新建一个spring boot 项目
  • LeetCode[简单] 搜索插入位置
  • (代码可运行)Bootstrap框架的HTML示例
  • IntelliJ IDEA 2024创建Java项目
  • Python之 条件与循环(Python‘s Conditions and loops)
  • C++学习,多态纯虚函数
  • 飞速(FS)与西门子联合打造交换机自动化灌装测试生产线
  • Vue组合式API:setup()函数
  • Redis底层数据结构(详细篇)
  • 树和二叉树基本术语、性质
  • FEDERATED引擎
  • Android NDK工具
  • 使用 Docker 进入容器并运行命令的详细指南
  • 【人工智能】OpenAI最新发布的o1-preview模型,和GPT-4o到底哪个更强?最新分析结果就在这里!
  • Spring Boot-版本兼容性问题
  • Java原生HttpURLConnection实现Get、Post、Put和Delete请求完整工具类分享
  • 如何微调(Fine-tuning)大语言模型?
  • wopop靶场漏洞挖掘练习
  • 探索Python的隐秘角落:Keylogger库的神秘面纱
  • JAVA开源项目 校园管理系统 计算机毕业设计
  • Java--常见的接口--Comparable
  • luogu基础课题单 入门 上
  • 物理设计-物理数据模型优化策略
  • 产学研合作赋能产业升级新动能
  • uniapp tabBar不显示
  • 论文阅读《Robust Steganography for High Quality Images》高质量因子图片的鲁棒隐写
  • node前端开发基本设置
  • 深入掌握:如何进入Docker容器并运行命令
  • 把设计模式用起来(3)用不好的原因之时机不对
  • 【机器学习随笔】基于kmeans的车牌类型分类注意点