KNN join（KNN 连接（KNN join））¶

Supported in: Batch

Return the 'k' nearest rows from the right dataset for each row in the left dataset, based on the distance measure.

Transform categories: Join

Declared arguments¶

Condition for columns to select on the left: All columns in the left input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
ColumnPredicate
Condition for columns to select on the right: All columns in the right input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
ColumnPredicate
Distance measure expression: Distance measure expression between columns in the left and right datasets. E.g. Levenshtein distance.
Expression\
K nearest: The number of nearest rows to return, i.e. if k=2 then the number of output rows will be at least doubled and the nearest 2 rows will be joined from the right. In case of ties, more rows may be returned.
Literal\
Left dataset: Left dataset to use in join.
Table
Rank column name: Name of the column to store the rank of the distance.
Literal\
Right dataset: Right dataset to use in join.
Table
optional Prefix for columns from right: Prefix to add to all columns on the right hand side.
Literal\

Examples¶

Example 1: Base case¶

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
)
Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: fuzzy_airline,
),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

fuzzy_airline	home_airport
air	LHR
new airline	CPH
new plane	JFK
old air	IAD

Output:

rank	distance	tail_number	airline	fuzzy_airline	home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new airline	CPH
2	4	PA-452	new air	new plane	JFK
1	0	MT-222	new airline	new airline	CPH
2	4	MT-222	new airline	new plane	JFK
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR

Example 2: Base case¶

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
columnNames: [home_airport],
)
Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: airline,
),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

airline	home_airport
air	LHR
new airline	CPH
new plane	JFK
old air	IAD

Output:

rank	distance	tail_number	airline	home_airport
1	3	PA-452	new air	IAD
2	4	PA-452	new air	LHR
2	4	PA-452	new air	CPH
2	4	PA-452	new air	JFK
1	0	MT-222	new airline	CPH
2	4	MT-222	new airline	JFK
1	5	XB-123	foundry air	IAD
2	8	XB-123	foundry air	LHR

Example 3: Base case¶

Description: If the distance measure returns null, this is considered the furthest distance.

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
)
Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: fuzzy_airline,
),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

fuzzy_airline	home_airport
air	LHR
null	CPH
new plane	JFK
old air	IAD

Output:

rank	distance	tail_number	airline	fuzzy_airline	home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new plane	JFK
1	4	MT-222	new airline	new plane	JFK
2	7	MT-222	new airline	old air	IAD
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR

Example 4: Base case¶

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
)
Distance measure expression:
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: fuzzy_airline,
),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: right_

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

fuzzy_airline	home_airport
air	LHR
new airline	CPH
new plane	JFK
old air	IAD

Output:

rank	distance	tail_number	airline	right_fuzzy_airline	right_home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new airline	CPH
2	4	PA-452	new air	new plane	JFK
1	0	MT-222	new airline	new airline	CPH
2	4	MT-222	new airline	new plane	JFK
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR

中文翻译¶

KNN 连接（KNN join）¶

支持：批处理（Batch）

基于距离度量，为左侧数据集中的每一行返回右侧数据集中最近的 'k' 行。

转换类别：连接（Join）

声明的参数（Declared arguments）¶

左侧列选择条件（Condition for columns to select on the left）： 左侧输入模式中的所有列都将被测试是否匹配此条件。如果匹配，该列将被选中输出。
ColumnPredicate
右侧列选择条件（Condition for columns to select on the right）： 右侧输入模式中的所有列都将被测试是否匹配此条件。如果匹配，该列将被选中输出。
ColumnPredicate
距离度量表达式（Distance measure expression）： 左侧和右侧数据集中列之间的距离度量表达式。例如：Levenshtein距离（Levenshtein distance）。
Expression\
K 最近邻（K nearest）： 要返回的最近行数，即如果 k=2，则输出行数至少翻倍，并且将从右侧连接最近的 2 行。在平局情况下，可能会返回更多行。
Literal\
左侧数据集（Left dataset）： 用于连接的左侧数据集。
Table
排名列名（Rank column name）： 用于存储距离排名的列名。
Literal\
右侧数据集（Right dataset）： 用于连接的右侧数据集。
Table
可选 右侧列前缀（Prefix for columns from right）： 添加到右侧所有列的前缀。
Literal\

示例¶

示例 1：基本情况¶

参数值：

左侧列选择条件：
columnNameIsIn(
columnNames: [tail_number, airline],
)
右侧列选择条件：
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
)
距离度量表达式：
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: fuzzy_airline,
),
)
K 最近邻： 2
左侧数据集： ri.foundry.main.dataset.left
排名列名： rank
右侧数据集： ri.foundry.main.dataset.right
右侧列前缀： null

输入：

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

fuzzy_airline	home_airport
air	LHR
new airline	CPH
new plane	JFK
old air	IAD

输出：

rank	distance	tail_number	airline	fuzzy_airline	home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new airline	CPH
2	4	PA-452	new air	new plane	JFK
1	0	MT-222	new airline	new airline	CPH
2	4	MT-222	new airline	new plane	JFK
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR

示例 2：基本情况¶

参数值：

左侧列选择条件：
columnNameIsIn(
columnNames: [tail_number, airline],
)
右侧列选择条件：
columnNameIsIn(
columnNames: [home_airport],
)
距离度量表达式：
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: airline,
),
)
K 最近邻： 2
左侧数据集： ri.foundry.main.dataset.left
排名列名： rank
右侧数据集： ri.foundry.main.dataset.right
右侧列前缀： null

输入：

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

airline	home_airport
air	LHR
new airline	CPH
new plane	JFK
old air	IAD

输出：

rank	distance	tail_number	airline	home_airport
1	3	PA-452	new air	IAD
2	4	PA-452	new air	LHR
2	4	PA-452	new air	CPH
2	4	PA-452	new air	JFK
1	0	MT-222	new airline	CPH
2	4	MT-222	new airline	JFK
1	5	XB-123	foundry air	IAD
2	8	XB-123	foundry air	LHR

示例 3：基本情况¶

描述： 如果距离度量返回 null，则视为最远距离。

参数值：

左侧列选择条件：
columnNameIsIn(
columnNames: [tail_number, airline],
)
右侧列选择条件：
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
)
距离度量表达式：
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: fuzzy_airline,
),
)
K 最近邻： 2
左侧数据集： ri.foundry.main.dataset.left
排名列名： rank
右侧数据集： ri.foundry.main.dataset.right
右侧列前缀： null

输入：

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

fuzzy_airline	home_airport
air	LHR
null	CPH
new plane	JFK
old air	IAD

输出：

rank	distance	tail_number	airline	fuzzy_airline	home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new plane	JFK
1	4	MT-222	new airline	new plane	JFK
2	7	MT-222	new airline	old air	IAD
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR

示例 4：基本情况¶

参数值：

左侧列选择条件：
columnNameIsIn(
columnNames: [tail_number, airline],
)
右侧列选择条件：
columnNameIsIn(
columnNames: [fuzzy_airline, home_airport],
)
距离度量表达式：
alias(
alias: distance,
expression:
levenshteinDistance(
ignoreCase: true,
left: airline,
right: fuzzy_airline,
),
)
K 最近邻： 2
左侧数据集： ri.foundry.main.dataset.left
排名列名： rank
右侧数据集： ri.foundry.main.dataset.right
右侧列前缀： right_

输入：

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

fuzzy_airline	home_airport
air	LHR
new airline	CPH
new plane	JFK
old air	IAD

输出：

rank	distance	tail_number	airline	right_fuzzy_airline	right_home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new airline	CPH
2	4	PA-452	new air	new plane	JFK
1	0	MT-222	new airline	new airline	CPH
2	4	MT-222	new airline	new plane	JFK
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR