跳转至

KNN join(KNN 连接(KNN join))

Supported in: Batch

Return the 'k' nearest rows from the right dataset for each row in the left dataset, based on the distance measure.

Transform categories: Join

Declared arguments

  • Condition for columns to select on the left: All columns in the left input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
    ColumnPredicate
  • Condition for columns to select on the right: All columns in the right input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
    ColumnPredicate
  • Distance measure expression: Distance measure expression between columns in the left and right datasets. E.g. Levenshtein distance.
    Expression\
  • K nearest: The number of nearest rows to return, i.e. if k=2 then the number of output rows will be at least doubled and the nearest 2 rows will be joined from the right. In case of ties, more rows may be returned.
    Literal\
  • Left dataset: Left dataset to use in join.
    Table
  • Rank column name: Name of the column to store the rank of the distance.
    Literal\
  • Right dataset: Right dataset to use in join.
    Table
  • optional Prefix for columns from right: Prefix to add to all columns on the right hand side.
    Literal\

Examples

Example 1: Base case

Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • Distance measure expression:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

fuzzy_airline home_airport
air LHR
new airline CPH
new plane JFK
old air IAD

Output:

rank distance tail_number airline fuzzy_airline home_airport
1 3 PA-452 new air old air IAD
2 4 PA-452 new air air LHR
2 4 PA-452 new air new airline CPH
2 4 PA-452 new air new plane JFK
1 0 MT-222 new airline new airline CPH
2 4 MT-222 new airline new plane JFK
1 5 XB-123 foundry air old air IAD
2 8 XB-123 foundry air air LHR

Example 2: Base case

Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [home_airport],
    )
  • Distance measure expression:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

airline home_airport
air LHR
new airline CPH
new plane JFK
old air IAD

Output:

rank distance tail_number airline home_airport
1 3 PA-452 new air IAD
2 4 PA-452 new air LHR
2 4 PA-452 new air CPH
2 4 PA-452 new air JFK
1 0 MT-222 new airline CPH
2 4 MT-222 new airline JFK
1 5 XB-123 foundry air IAD
2 8 XB-123 foundry air LHR

Example 3: Base case

Description: If the distance measure returns null, this is considered the furthest distance.

Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • Distance measure expression:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

fuzzy_airline home_airport
air LHR
null CPH
new plane JFK
old air IAD

Output:

rank distance tail_number airline fuzzy_airline home_airport
1 3 PA-452 new air old air IAD
2 4 PA-452 new air air LHR
2 4 PA-452 new air new plane JFK
1 4 MT-222 new airline new plane JFK
2 7 MT-222 new airline old air IAD
1 5 XB-123 foundry air old air IAD
2 8 XB-123 foundry air air LHR

Example 4: Base case

Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • Distance measure expression:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: right_

Inputs:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

fuzzy_airline home_airport
air LHR
new airline CPH
new plane JFK
old air IAD

Output:

rank distance tail_number airline right_fuzzy_airline right_home_airport
1 3 PA-452 new air old air IAD
2 4 PA-452 new air air LHR
2 4 PA-452 new air new airline CPH
2 4 PA-452 new air new plane JFK
1 0 MT-222 new airline new airline CPH
2 4 MT-222 new airline new plane JFK
1 5 XB-123 foundry air old air IAD
2 8 XB-123 foundry air air LHR


中文翻译

KNN 连接(KNN join)

支持:批处理(Batch)

基于距离度量,为左侧数据集中的每一行返回右侧数据集中最近的 'k' 行。

转换类别:连接(Join)

声明的参数(Declared arguments)

  • 左侧列选择条件(Condition for columns to select on the left): 左侧输入模式中的所有列都将被测试是否匹配此条件。如果匹配,该列将被选中输出。
    ColumnPredicate
  • 右侧列选择条件(Condition for columns to select on the right): 右侧输入模式中的所有列都将被测试是否匹配此条件。如果匹配,该列将被选中输出。
    ColumnPredicate
  • 距离度量表达式(Distance measure expression): 左侧和右侧数据集中列之间的距离度量表达式。例如:Levenshtein距离(Levenshtein distance)。
    Expression\
  • K 最近邻(K nearest): 要返回的最近行数,即如果 k=2,则输出行数至少翻倍,并且将从右侧连接最近的 2 行。在平局情况下,可能会返回更多行。
    Literal\
  • 左侧数据集(Left dataset): 用于连接的左侧数据集。
    Table
  • 排名列名(Rank column name): 用于存储距离排名的列名。
    Literal\
  • 右侧数据集(Right dataset): 用于连接的右侧数据集。
    Table
  • 可选 右侧列前缀(Prefix for columns from right): 添加到右侧所有列的前缀。
    Literal\

示例

示例 1:基本情况

参数值:

  • 左侧列选择条件:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • 右侧列选择条件:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • 距离度量表达式:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K 最近邻: 2
  • 左侧数据集: ri.foundry.main.dataset.left
  • 排名列名: rank
  • 右侧数据集: ri.foundry.main.dataset.right
  • 右侧列前缀: null

输入:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

fuzzy_airline home_airport
air LHR
new airline CPH
new plane JFK
old air IAD

输出:

rank distance tail_number airline fuzzy_airline home_airport
1 3 PA-452 new air old air IAD
2 4 PA-452 new air air LHR
2 4 PA-452 new air new airline CPH
2 4 PA-452 new air new plane JFK
1 0 MT-222 new airline new airline CPH
2 4 MT-222 new airline new plane JFK
1 5 XB-123 foundry air old air IAD
2 8 XB-123 foundry air air LHR

示例 2:基本情况

参数值:

  • 左侧列选择条件:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • 右侧列选择条件:
    columnNameIsIn(
     columnNames: [home_airport],
    )
  • 距离度量表达式:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: airline,
    ),
    )
  • K 最近邻: 2
  • 左侧数据集: ri.foundry.main.dataset.left
  • 排名列名: rank
  • 右侧数据集: ri.foundry.main.dataset.right
  • 右侧列前缀: null

输入:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

airline home_airport
air LHR
new airline CPH
new plane JFK
old air IAD

输出:

rank distance tail_number airline home_airport
1 3 PA-452 new air IAD
2 4 PA-452 new air LHR
2 4 PA-452 new air CPH
2 4 PA-452 new air JFK
1 0 MT-222 new airline CPH
2 4 MT-222 new airline JFK
1 5 XB-123 foundry air IAD
2 8 XB-123 foundry air LHR

示例 3:基本情况

描述: 如果距离度量返回 null,则视为最远距离。

参数值:

  • 左侧列选择条件:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • 右侧列选择条件:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • 距离度量表达式:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K 最近邻: 2
  • 左侧数据集: ri.foundry.main.dataset.left
  • 排名列名: rank
  • 右侧数据集: ri.foundry.main.dataset.right
  • 右侧列前缀: null

输入:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

fuzzy_airline home_airport
air LHR
null CPH
new plane JFK
old air IAD

输出:

rank distance tail_number airline fuzzy_airline home_airport
1 3 PA-452 new air old air IAD
2 4 PA-452 new air air LHR
2 4 PA-452 new air new plane JFK
1 4 MT-222 new airline new plane JFK
2 7 MT-222 new airline old air IAD
1 5 XB-123 foundry air old air IAD
2 8 XB-123 foundry air air LHR

示例 4:基本情况

参数值:

  • 左侧列选择条件:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • 右侧列选择条件:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • 距离度量表达式:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K 最近邻: 2
  • 左侧数据集: ri.foundry.main.dataset.left
  • 排名列名: rank
  • 右侧数据集: ri.foundry.main.dataset.right
  • 右侧列前缀: right_

输入:

ri.foundry.main.dataset.left

tail_number airline miles factor
XB-123 foundry air 124 2
MT-222 new airline 1123 5
PA-452 new air 212 2

ri.foundry.main.dataset.right

fuzzy_airline home_airport
air LHR
new airline CPH
new plane JFK
old air IAD

输出:

rank distance tail_number airline right_fuzzy_airline right_home_airport
1 3 PA-452 new air old air IAD
2 4 PA-452 new air air LHR
2 4 PA-452 new air new airline CPH
2 4 PA-452 new air new plane JFK
1 0 MT-222 new airline new airline CPH
2 4 MT-222 new airline new plane JFK
1 5 XB-123 foundry air old air IAD
2 8 XB-123 foundry air air LHR