Key by

Supported in: Streaming

Keys the input by the specified key by columns. This does not re-sort the data; per-key ordering is maintained only from the point at which the keys are set. Re-keying can be unsafe: if the newly keyed data depends on a specific ordering that the previous keying did not already maintain, that ordering cannot be guaranteed. If CDC (change data capture) mode is enabled, this transform also sets the primary key. The primary key defines the columns that indicate which rows are updates or deletes, and the ordering to apply when the data is read as a current view.
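
The ordering behavior described above can be illustrated with a small in-memory model. This is a sketch only, not the transform's actual implementation: `key_by` is a hypothetical helper, the column names `device`, `region`, and `seq` are invented for the example, and the way partitions are flattened before re-keying is an assumption standing in for whatever interleaving a real stream would produce.

```python
from collections import defaultdict


def key_by(rows, key_columns):
    """Hypothetical model: group rows by key, keeping arrival order within each key."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        partitions[key].append(row)  # appended in arrival order, never re-sorted
    return dict(partitions)


# Rows in arrival order; "seq" records that order for inspection.
rows = [
    {"device": "a", "region": "eu", "seq": 1},
    {"device": "b", "region": "eu", "seq": 2},
    {"device": "a", "region": "eu", "seq": 3},
]

# First keying: each device sees its own rows in arrival order.
by_device = key_by(rows, ["device"])

# Re-keying: rows now flow partition by partition (an assumption of this model),
# so the original arrival order within a region is no longer preserved.
flattened = [row for partition in by_device.values() for row in partition]
by_region = key_by(flattened, ["region"])

device_a_order = [row["seq"] for row in by_device[("a",)]]
region_eu_order = [row["seq"] for row in by_region[("eu",)]]
```

In this model `device_a_order` keeps the arrival order of device `a`, while `region_eu_order` no longer matches the original arrival order, because the first keying only maintained ordering per device.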

Transform categories: Other

Declared arguments

  • Dataset: Dataset to perform key by on.
    Table
  • Enable cdc mode: If false, only the partition columns are set, which control the ordering guarantee; rows with the same values for all key columns are output in the order they are received. If true, both the partition columns and the primary key are set. In addition to modifying the ordering guarantee, this configures how rows with the same values for all key columns are de-duplicated.
    Literal\<Boolean>
  • Key by columns: Columns on which to partition the input by key. If change data capture is enabled (by specifying primary key parameters), these columns are also used to define updates. A row is considered an update if its key values exactly match the key values of a previous row.
    Set\<Column\<Binary | Boolean | Byte | Double | Float | Integer | Long | Short | String | Timestamp>>
  • optional Primary key is deleted column: Requires CDC mode to be enabled. Used in change data capture to define deletes. A row is considered a delete if its deleted column value is set to true and its key columns exactly match those of the row to be deleted.
    Column\<Boolean>
  • optional Primary key ordering columns: Requires CDC mode to be enabled. Recommended to leave blank for streams. Used in change data capture to define ordering for batch and archive datasets. A row may only be an update or delete if it comes after the row it is trying to modify. If left blank in streaming CDC mode, the most recently streamed row always wins.
    List\<Column\<Byte | Date | Decimal | Integer | Long | Short | String | Timestamp>>
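
To make the CDC arguments above concrete, the following sketch resolves a sequence of rows into a current view. It is a conceptual model under stated assumptions, not the transform's implementation: `current_view` and the columns `id`, `name`, `deleted`, and `version` are hypothetical, ties on the ordering columns are assumed to go to the later-arriving row, and a delete is assumed to be kept as a tombstone so that stale rows cannot resurrect a deleted key.

```python
def current_view(rows, key_columns, is_deleted_column=None, ordering_columns=None):
    """Hypothetical model of resolving CDC rows into a current view."""
    latest = {}  # key -> (ordering tuple or None, winning row)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        order = None
        if ordering_columns:
            order = tuple(row[column] for column in ordering_columns)
        if key in latest and order is not None and order < latest[key][0]:
            continue  # row does not come after the row it would modify; ignore it
        # No ordering columns: the most recently received row always wins.
        latest[key] = (order, row)
    # Rows whose deleted flag is true are tombstones and are hidden from the view.
    return [
        row
        for _, row in latest.values()
        if not (is_deleted_column and row[is_deleted_column])
    ]


# Rows in arrival order. The fourth row arrives late with an older version.
rows = [
    {"id": 1, "name": "a", "deleted": False, "version": 1},
    {"id": 2, "name": "b", "deleted": False, "version": 1},
    {"id": 1, "name": "a2", "deleted": False, "version": 3},
    {"id": 1, "name": "stale", "deleted": False, "version": 2},
    {"id": 2, "name": "b", "deleted": True, "version": 2},
]

# Ordering columns left blank: last arrival wins, so the late row overwrites "a2".
streaming_view = current_view(rows, ["id"], is_deleted_column="deleted")

# Ordering columns set: the late row with version 2 cannot modify version 3.
ordered_view = current_view(
    rows, ["id"], is_deleted_column="deleted", ordering_columns=["version"]
)
```

In both calls the row with `id` 2 is removed by its delete. The two views differ only for `id` 1, which shows why ordering columns matter when rows can arrive out of order, as in batch and archive datasets.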
