KNN join(KNN 连接(KNN join))¶
Supported in: Batch
Return the 'k' nearest rows from the right dataset for each row in the left dataset, based on the distance measure.
Transform categories: Join
Declared arguments¶
- Condition for columns to select on the left: All columns in the left input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
ColumnPredicate - Condition for columns to select on the right: All columns in the right input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
ColumnPredicate - Distance measure expression: Distance measure expression between columns in the left and right datasets. E.g. Levenshtein distance.
Expression\ - K nearest: The number of nearest rows to return, i.e. if k=2 then the number of output rows will be at least doubled and the nearest 2 rows will be joined from the right. In case of ties, more rows may be returned.
Literal\ - Left dataset: Left dataset to use in join.
Table - Rank column name: Name of the column to store the rank of the distance.
Literal\ - Right dataset: Right dataset to use in join.
Table - optional Prefix for columns from right: Prefix to add to all columns on the right hand side.
Literal\
Examples¶
Example 1: Base case¶
Argument values:
- Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
) - Condition for columns to select on the right:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
) - Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:fuzzy_airline,
),
) - K nearest: 2
- Left dataset: ri.foundry.main.dataset.left
- Rank column name: rank
- Right dataset: ri.foundry.main.dataset.right
- Prefix for columns from right: null
Inputs:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| fuzzy_airline | home_airport |
|---|---|
| air | LHR |
| new airline | CPH |
| new plane | JFK |
| old air | IAD |
Output:
| rank | distance | tail_number | airline | fuzzy_airline | home_airport |
|---|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | old air | IAD |
| 2 | 4 | PA-452 | new air | air | LHR |
| 2 | 4 | PA-452 | new air | new airline | CPH |
| 2 | 4 | PA-452 | new air | new plane | JFK |
| 1 | 0 | MT-222 | new airline | new airline | CPH |
| 2 | 4 | MT-222 | new airline | new plane | JFK |
| 1 | 5 | XB-123 | foundry air | old air | IAD |
| 2 | 8 | XB-123 | foundry air | air | LHR |
Example 2: Base case¶
Argument values:
- Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
) - Condition for columns to select on the right:
columnNameIsIn(
columnNames: [home_airport],
) - Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:airline,
),
) - K nearest: 2
- Left dataset: ri.foundry.main.dataset.left
- Rank column name: rank
- Right dataset: ri.foundry.main.dataset.right
- Prefix for columns from right: null
Inputs:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| airline | home_airport |
|---|---|
| air | LHR |
| new airline | CPH |
| new plane | JFK |
| old air | IAD |
Output:
| rank | distance | tail_number | airline | home_airport |
|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | IAD |
| 2 | 4 | PA-452 | new air | LHR |
| 2 | 4 | PA-452 | new air | CPH |
| 2 | 4 | PA-452 | new air | JFK |
| 1 | 0 | MT-222 | new airline | CPH |
| 2 | 4 | MT-222 | new airline | JFK |
| 1 | 5 | XB-123 | foundry air | IAD |
| 2 | 8 | XB-123 | foundry air | LHR |
Example 3: Base case¶
Description: If the distance measure returns null, this is considered the furthest distance.
Argument values:
- Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
) - Condition for columns to select on the right:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
) - Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:fuzzy_airline,
),
) - K nearest: 2
- Left dataset: ri.foundry.main.dataset.left
- Rank column name: rank
- Right dataset: ri.foundry.main.dataset.right
- Prefix for columns from right: null
Inputs:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| fuzzy_airline | home_airport |
|---|---|
| air | LHR |
| null | CPH |
| new plane | JFK |
| old air | IAD |
Output:
| rank | distance | tail_number | airline | fuzzy_airline | home_airport |
|---|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | old air | IAD |
| 2 | 4 | PA-452 | new air | air | LHR |
| 2 | 4 | PA-452 | new air | new plane | JFK |
| 1 | 4 | MT-222 | new airline | new plane | JFK |
| 2 | 7 | MT-222 | new airline | old air | IAD |
| 1 | 5 | XB-123 | foundry air | old air | IAD |
| 2 | 8 | XB-123 | foundry air | air | LHR |
Example 4: Base case¶
Argument values:
- Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
) - Condition for columns to select on the right:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
) - Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:fuzzy_airline,
),
) - K nearest: 2
- Left dataset: ri.foundry.main.dataset.left
- Rank column name: rank
- Right dataset: ri.foundry.main.dataset.right
- Prefix for columns from right: right_
Inputs:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| fuzzy_airline | home_airport |
|---|---|
| air | LHR |
| new airline | CPH |
| new plane | JFK |
| old air | IAD |
Output:
| rank | distance | tail_number | airline | right_fuzzy_airline | right_home_airport |
|---|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | old air | IAD |
| 2 | 4 | PA-452 | new air | air | LHR |
| 2 | 4 | PA-452 | new air | new airline | CPH |
| 2 | 4 | PA-452 | new air | new plane | JFK |
| 1 | 0 | MT-222 | new airline | new airline | CPH |
| 2 | 4 | MT-222 | new airline | new plane | JFK |
| 1 | 5 | XB-123 | foundry air | old air | IAD |
| 2 | 8 | XB-123 | foundry air | air | LHR |
中文翻译¶
KNN 连接(KNN join)¶
支持:批处理(Batch)
基于距离度量,为左侧数据集中的每一行返回右侧数据集中最近的 'k' 行。
转换类别:连接(Join)
声明的参数(Declared arguments)¶
- 左侧列选择条件(Condition for columns to select on the left): 左侧输入模式中的所有列都将被测试是否匹配此条件。如果匹配,该列将被选中输出。
ColumnPredicate - 右侧列选择条件(Condition for columns to select on the right): 右侧输入模式中的所有列都将被测试是否匹配此条件。如果匹配,该列将被选中输出。
ColumnPredicate - 距离度量表达式(Distance measure expression): 左侧和右侧数据集中列之间的距离度量表达式。例如:Levenshtein距离(Levenshtein distance)。
Expression\ - K 最近邻(K nearest): 要返回的最近行数,即如果 k=2,则输出行数至少翻倍,并且将从右侧连接最近的 2 行。在平局情况下,可能会返回更多行。
Literal\ - 左侧数据集(Left dataset): 用于连接的左侧数据集。
Table - 排名列名(Rank column name): 用于存储距离排名的列名。
Literal\ - 右侧数据集(Right dataset): 用于连接的右侧数据集。
Table - 可选 右侧列前缀(Prefix for columns from right): 添加到右侧所有列的前缀。
Literal\
示例¶
示例 1:基本情况¶
参数值:
- 左侧列选择条件:
columnNameIsIn(
columnNames: [tail_number, airline],
) - 右侧列选择条件:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
) - 距离度量表达式:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:fuzzy_airline,
),
) - K 最近邻: 2
- 左侧数据集: ri.foundry.main.dataset.left
- 排名列名: rank
- 右侧数据集: ri.foundry.main.dataset.right
- 右侧列前缀: null
输入:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| fuzzy_airline | home_airport |
|---|---|
| air | LHR |
| new airline | CPH |
| new plane | JFK |
| old air | IAD |
输出:
| rank | distance | tail_number | airline | fuzzy_airline | home_airport |
|---|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | old air | IAD |
| 2 | 4 | PA-452 | new air | air | LHR |
| 2 | 4 | PA-452 | new air | new airline | CPH |
| 2 | 4 | PA-452 | new air | new plane | JFK |
| 1 | 0 | MT-222 | new airline | new airline | CPH |
| 2 | 4 | MT-222 | new airline | new plane | JFK |
| 1 | 5 | XB-123 | foundry air | old air | IAD |
| 2 | 8 | XB-123 | foundry air | air | LHR |
示例 2:基本情况¶
参数值:
- 左侧列选择条件:
columnNameIsIn(
columnNames: [tail_number, airline],
) - 右侧列选择条件:
columnNameIsIn(
columnNames: [home_airport],
) - 距离度量表达式:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:airline,
),
) - K 最近邻: 2
- 左侧数据集: ri.foundry.main.dataset.left
- 排名列名: rank
- 右侧数据集: ri.foundry.main.dataset.right
- 右侧列前缀: null
输入:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| airline | home_airport |
|---|---|
| air | LHR |
| new airline | CPH |
| new plane | JFK |
| old air | IAD |
输出:
| rank | distance | tail_number | airline | home_airport |
|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | IAD |
| 2 | 4 | PA-452 | new air | LHR |
| 2 | 4 | PA-452 | new air | CPH |
| 2 | 4 | PA-452 | new air | JFK |
| 1 | 0 | MT-222 | new airline | CPH |
| 2 | 4 | MT-222 | new airline | JFK |
| 1 | 5 | XB-123 | foundry air | IAD |
| 2 | 8 | XB-123 | foundry air | LHR |
示例 3:基本情况¶
描述: 如果距离度量返回 null,则视为最远距离。
参数值:
- 左侧列选择条件:
columnNameIsIn(
columnNames: [tail_number, airline],
) - 右侧列选择条件:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
) - 距离度量表达式:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:fuzzy_airline,
),
) - K 最近邻: 2
- 左侧数据集: ri.foundry.main.dataset.left
- 排名列名: rank
- 右侧数据集: ri.foundry.main.dataset.right
- 右侧列前缀: null
输入:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| fuzzy_airline | home_airport |
|---|---|
| air | LHR |
| null | CPH |
| new plane | JFK |
| old air | IAD |
输出:
| rank | distance | tail_number | airline | fuzzy_airline | home_airport |
|---|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | old air | IAD |
| 2 | 4 | PA-452 | new air | air | LHR |
| 2 | 4 | PA-452 | new air | new plane | JFK |
| 1 | 4 | MT-222 | new airline | new plane | JFK |
| 2 | 7 | MT-222 | new airline | old air | IAD |
| 1 | 5 | XB-123 | foundry air | old air | IAD |
| 2 | 8 | XB-123 | foundry air | air | LHR |
示例 4:基本情况¶
参数值:
- 左侧列选择条件:
columnNameIsIn(
columnNames: [tail_number, airline],
) - 右侧列选择条件:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
) - 距离度量表达式:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left:airline,
right:fuzzy_airline,
),
) - K 最近邻: 2
- 左侧数据集: ri.foundry.main.dataset.left
- 排名列名: rank
- 右侧数据集: ri.foundry.main.dataset.right
- 右侧列前缀: right_
输入:
ri.foundry.main.dataset.left
| tail_number | airline | miles | factor |
|---|---|---|---|
| XB-123 | foundry air | 124 | 2 |
| MT-222 | new airline | 1123 | 5 |
| PA-452 | new air | 212 | 2 |
ri.foundry.main.dataset.right
| fuzzy_airline | home_airport |
|---|---|
| air | LHR |
| new airline | CPH |
| new plane | JFK |
| old air | IAD |
输出:
| rank | distance | tail_number | airline | right_fuzzy_airline | right_home_airport |
|---|---|---|---|---|---|
| 1 | 3 | PA-452 | new air | old air | IAD |
| 2 | 4 | PA-452 | new air | air | LHR |
| 2 | 4 | PA-452 | new air | new airline | CPH |
| 2 | 4 | PA-452 | new air | new plane | JFK |
| 1 | 0 | MT-222 | new airline | new airline | CPH |
| 2 | 4 | MT-222 | new airline | new plane | JFK |
| 1 | 5 | XB-123 | foundry air | old air | IAD |
| 2 | 8 | XB-123 | foundry air | air | LHR |