K-means clustering
Supported in: Batch
K-means clustering is an unsupervised machine learning algorithm that groups the vectors in a dataset into k clusters. The value of k is chosen by training a model for each candidate k between minimum k and maximum k and keeping the one with the best silhouette score. Number of k values sets how many candidate values are tried within this range, inclusive of both boundaries.
Transform categories: Other
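To make the selection criterion concrete, here is a minimal sketch of a mean silhouette score in plain Python. It assumes Euclidean distance and the common convention that a point in a single-member cluster contributes 0; the transform's actual implementation is not documented here, and `silhouette_score` is a hypothetical helper name.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # assumed convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Scores lie between -1 and 1, and higher values indicate tighter, better separated clusters, so the candidate k whose labeling scores highest would be kept.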
Declared arguments
- Input dataset: Source dataset containing the vector column.
  Type: Table
- Maximum k: Maximum number of clusters.
  Type: Literal<Integer>
- Minimum k: Minimum number of clusters.
  Type: Literal<Integer>
- Number of k values: Number of k values to test, from minimum k to maximum k. One clustering model is trained for each k value tested, so pipeline execution may be slow if this is set to a high value. The best model is selected based on the silhouette score.
  Type: Literal<Integer>
- Vector column: Column containing the float vectors used for the clustering.
  Type: Column<Array<Float>>
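The way Minimum k, Maximum k, and Number of k values could combine into a list of candidates can be sketched as follows. The even spacing and rounding used here are assumptions made for illustration; the page only states that the range is inclusive of its boundaries, and `candidate_k_values` is a hypothetical helper.

```python
def candidate_k_values(min_k, max_k, num_k_values):
    """Return up to num_k_values integers from min_k to max_k, boundaries included.

    Assumption: candidates are evenly spaced and rounded to the nearest
    integer. The transform's real spacing rule is not documented here.
    """
    if min_k > max_k:
        raise ValueError("min_k must not exceed max_k")
    if num_k_values < 1:
        raise ValueError("num_k_values must be at least 1")
    if num_k_values == 1:
        return [min_k]  # assumption: a single candidate is the lower bound

    step = (max_k - min_k) / (num_k_values - 1)
    values = []
    for i in range(num_k_values):
        k = round(min_k + i * step)
        if k not in values:  # rounding may collapse neighbouring candidates
            values.append(k)
    return values


print(candidate_k_values(3, 12, 4))
```

Under this assumption, the settings used in the first example below (minimum 3, maximum 12, four values) would train one model for each of 3, 6, 9, and 12 clusters.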
Examples
Example 1: Base case
Argument values:
- Input dataset: ri.foundry.main.dataset.a
- Maximum k: 12
- Minimum k: 3
- Number of k values: 4
- Vector column: feature_column
Input:
| feature_column |
|---|
| [ 0.05, 3.1, 2.3 ] |
| [ 1.0, 3.1, 2.3 ] |
| [ 1.0, 3.5, 2.3 ] |
| [ 19.0, 12.3, -1.4 ] |
Output:
| feature_column | cluster_id |
|---|---|
| [ 1.0, 3.1, 2.3 ] | 0 |
| [ 1.0, 3.5, 2.3 ] | 0 |
| [ 19.0, 12.3, -1.4 ] | 1 |
| [ 0.05, 3.1, 2.3 ] | 2 |
Example 2: Null case
Argument values:
- Input dataset: ri.foundry.main.dataset.a
- Maximum k: 12
- Minimum k: 3
- Number of k values: 4
- Vector column: feature_column
Input:
| feature_column |
|---|
| [ 0.05, 3.1, 2.3 ] |
| null |
| [ 1.0, 3.1, 2.3 ] |
| [ 1.0, 3.5, 2.3 ] |
| [ 19.0, 12.3, -1.4 ] |
Output:
| feature_column | cluster_id |
|---|---|
| [ 1.0, 3.1, 2.3 ] | 0 |
| [ 1.0, 3.5, 2.3 ] | 0 |
| [ 19.0, 12.3, -1.4 ] | 1 |
| [ 0.05, 3.1, 2.3 ] | 2 |
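The output above omits the row whose vector is null, which suggests that such rows are dropped rather than assigned a cluster. A minimal sketch of that filtering step, assuming rows are represented as plain dictionaries (`drop_null_vectors` is a hypothetical helper, and whether the transform filters before or after training is not stated):

```python
def drop_null_vectors(rows, vector_column):
    """Keep only the rows whose vector column holds a value."""
    return [row for row in rows if row.get(vector_column) is not None]


rows = [
    {"feature_column": [0.05, 3.1, 2.3]},
    {"feature_column": None},
    {"feature_column": [1.0, 3.1, 2.3]},
    {"feature_column": [1.0, 3.5, 2.3]},
    {"feature_column": [19.0, 12.3, -1.4]},
]

# Only the four rows with a vector remain available for clustering.
clusterable = drop_null_vectors(rows, "feature_column")
```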
Example 3: Edge case
Argument values:
- Input dataset: ri.foundry.main.dataset.a
- Maximum k: 3
- Minimum k: 3
- Number of k values: 1
- Vector column: feature_column
Input:
| feature_column |
|---|
| [ 0.05, 3.1, 2.3 ] |
| [ 1.0, 3.5, 2.3 ] |
| [ 19.0, 12.3, -1.4 ] |
Output:
| feature_column | cluster_id |
|---|---|
| [ 19.0, 12.3, -1.4 ] | 0 |
| [ 0.05, 3.1, 2.3 ] | 1 |
| [ 1.0, 3.5, 2.3 ] | 2 |
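With k equal to the number of distinct input vectors, every vector ends up in its own cluster, as in the table above. The sketch below shows this with a plain Lloyd's-algorithm k-means in Python. It is an illustration only: the deterministic initialization (first k distinct points) is an assumption chosen for reproducibility, the transform's own training procedure is not documented here, and cluster ids are arbitrary labels, so the numbering need not match the table.

```python
from math import dist


def kmeans(points, k, max_iterations=100):
    """Plain Lloyd's algorithm; returns one cluster id per point."""
    # Assumed deterministic start: the first k distinct points are centroids.
    centroids = []
    for point in points:
        if list(point) not in centroids:
            centroids.append(list(point))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


points = [[0.05, 3.1, 2.3], [1.0, 3.5, 2.3], [19.0, 12.3, -1.4]]
labels = kmeans(points, k=3)  # three points, three clusters: one point each
```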