Flink fundamentals(Flink 基础)¶
Apache Flink is a distributed computation engine capable of handling unbounded datasets with low latency, allowing it to handle common streaming workflows. Foundry streaming uses Flink as the underlying engine to execute user code and other in-platform streaming applications such as hydrating the Ontology in real time and streaming time series ingestion.
In order to understand whether you need additional job configuration for your streaming use case, it is helpful to have some understanding of how Flink works.
More detailed documentation about Flink can be found at the Flink homepage ↗.
Flink jobs¶
All streaming jobs are described as a series of operations acting on a set of data sources, and which write results to a data sink. These operations include things like aggregations, joins, and other row-level actions like string parsing or arithmetic. Each operation is represented by Flink as the Operator abstraction. In Flink, sources and sinks are also described by operators.
Flink jobs are represented internally in terms of “job graphs” or "logical graphs.” Job graphs are directed graphs with nodes made up of operators and where the edges define the relationships between operators. When a job is submitted to Flink, it creates the job graph. A preview of your Flink job’s job graph is rendered in the Details section of jobs in the Foundry Job Tracker.
When actually executing jobs, Flink translates the logical graph into a physical graph, a representation of how operators will be executed on the compute runtime. Physical graphs are made of up tasks, which are the basic units of work in a Flink job, and which can represent either one instance of an operator or many operators chained together.
Job Managers and Task Managers¶
Like Spark, Flink’s runtime architecture includes different types of workers. Where Spark uses the driver to coordinate and manage jobs and executors to perform job tasks, Flink uses Job Managers and Task Managers, which fulfill roles roughly analogous to Spark drivers and executors.
The Flink Job Manager is responsible for scheduling tasks and allocating resources for tasks, handling finished or failed tasks, coordinating job checkpoints and failure recovery, as well as providing programmatic access to job information. Typically there is only one active Job Manager - the leader - at any given time, with backup(s) kept warm in the case of an unrecoverable error.
The Flink Task Manager is responsible for the execution of tasks as well as buffering and exchanging data between streams. There is always at least one Task Manager, but there may be more in order to parallelise stream processing. When Task Managers need to handle very large records they may require additional resources. If your stream has extremely high throughput, you may need to increase your job’s parallelism, which results in increasing the number of Task Managers.
Job state¶
While some Flink operations only need to look at single events in isolation, others need to remember information across multiple events. These are stateful operations. Some examples of stateful operations are:
- Aggregations: For example, counting the total number of events over a rolling five minute window, or calculating the running average of all known events.
- Joins: The execution engine needs to know about previously-seen events in order to join them with events that are currently been ingested.
The information required for stateful operations is known as job state, and is stored by Flink using a state backend. State is managed and stored by Task Managers and coordinated by the Job Manager in the form of checkpoints. When you have a larger state (such as when an aggregation or join has a very large window), your Job Managers and Task Managers may require additional resources.
中文翻译¶
Flink 基础¶
Apache Flink 是一个分布式计算引擎,能够以低延迟处理无界数据集,从而支持常见的流式处理工作流。Foundry 流式处理使用 Flink 作为底层引擎来执行用户代码及其他平台内流式应用,例如实时更新本体论(Ontology)以及流式时间序列数据摄取。
为了判断您的流式处理用例是否需要额外的作业配置,了解 Flink 的工作原理会有所帮助。
关于 Flink 的更详细文档,请访问 Flink 官网 ↗。
Flink 作业¶
所有流式作业都被描述为一系列对一组数据源执行的操作,并将结果写入数据接收器。这些操作包括聚合、连接以及其他行级操作,如字符串解析或算术运算。每个操作在 Flink 中都被抽象为算子(Operator)。在 Flink 中,数据源和数据接收器也由算子描述。
Flink 作业在内部以“作业图”或“逻辑图”的形式表示。作业图是有向图,其节点由算子组成,边定义了算子之间的关系。当作业提交到 Flink 时,它会创建作业图。在 Foundry 作业跟踪器(Job Tracker)中,作业的详情(Details)部分会呈现 Flink 作业图的预览。
在实际执行作业时,Flink 会将逻辑图转换为物理图(Physical Graph),即算子如何在计算运行时上执行的表示。物理图由任务(Task)组成,任务是 Flink 作业中的基本工作单元,可以表示一个算子的单个实例,也可以表示多个链接在一起的算子。
作业管理器(Job Manager)与任务管理器(Task Manager)¶
与 Spark 类似,Flink 的运行时架构包含不同类型的 worker。Spark 使用驱动程序(Driver)来协调和管理作业,使用执行器(Executor)来执行作业任务,而 Flink 则使用作业管理器(Job Manager)和任务管理器(Task Manager),它们大致相当于 Spark 的驱动程序和执行器。
Flink 作业管理器负责调度任务、为任务分配资源、处理已完成或失败的任务、协调作业检查点和故障恢复,并提供对作业信息的编程访问。通常在任何给定时间只有一个活跃的作业管理器(即领导者),并会保持一个或多个备份处于热备状态,以应对不可恢复的错误。
Flink 任务管理器负责执行任务,以及在流之间缓冲和交换数据。始终至少有一个任务管理器,但为了并行化流处理,可能会有更多。当任务管理器需要处理非常大的记录时,可能需要额外的资源。如果您的流具有极高的吞吐量,可能需要增加作业的并行度,这会导致任务管理器数量的增加。
作业状态¶
虽然某些 Flink 操作只需要孤立地查看单个事件,但其他操作需要记住跨多个事件的信息。这些是有状态操作。有状态操作的一些示例包括:
- 聚合: 例如,计算滚动五分钟窗口内的事件总数,或计算所有已知事件的运行平均值。
- 连接: 执行引擎需要了解之前见过的事件,以便将它们与当前正在摄取的事件进行连接。
有状态操作所需的信息称为作业状态(Job State),并由 Flink 使用状态后端进行存储。状态由任务管理器管理和存储,并以检查点的形式由作业管理器协调。当您拥有较大的状态时(例如聚合或连接具有非常大的窗口),您的作业管理器和任务管理器可能需要额外的资源。