
K-means clustering

Supported in: Batch

K-means clustering is an unsupervised machine learning algorithm that groups the vectors in a dataset into k clusters. To choose k, the transform trains one model for each candidate value between minimum k and maximum k and keeps the model with the best silhouette score. Number of k values sets how many candidate values are tried within this range, inclusive of the boundaries.
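To make the selection criterion concrete, here is a minimal sketch of a mean silhouette score in plain Python. It assumes Euclidean distance and scores single-member clusters as 0; the transform's actual distance measure and implementation are not documented here, and the function name `silhouette_score` is illustrative only. The sample points and labels are taken from Example 1 below.

```python
import math

Vector = list[float]


def silhouette_score(points: list[Vector], labels: list[int]) -> float:
    """Mean silhouette over all points (assumption: Euclidean distance)."""
    # Group point indices by cluster label.
    clusters: dict[int, list[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a single-member cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Points and cluster ids from Example 1 (rows in input order).
points = [[0.05, 3.1, 2.3], [1.0, 3.1, 2.3], [1.0, 3.5, 2.3], [19.0, 12.3, -1.4]]
labels = [2, 0, 0, 1]
score = silhouette_score(points, labels)  # roughly 0.30
```

Scores range from -1 to 1, and a higher score indicates tighter, better-separated clusters, which is why the model with the highest score is kept.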

Transform categories: Other

Declared arguments

  • Input dataset: Source dataset containing the vector column.
    Table
  • Maximum k: Maximum number of clusters.
    Literal<Integer>
  • Minimum k: Minimum number of clusters.
    Literal<Integer>
  • Number of k values: Number of k values to test between minimum k and maximum k. One clustering model is trained per k value, so a high setting can slow pipeline execution. The best model is selected based on its silhouette score.
    Literal<Integer>
  • Vector column: Column containing the float vectors used for clustering.
    Column<Array<Float>>
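The arguments do not state how the tested k values are spread across the range. As an illustration only, the sketch below assumes they are spaced evenly between the two boundaries and rounded to integers; `candidate_ks` is a hypothetical helper, not part of the transform.

```python
def candidate_ks(min_k: int, max_k: int, num_k_values: int) -> list[int]:
    """Evenly spaced integer k values from min_k to max_k, inclusive.

    Assumption: even spacing. The transform may pick its values differently.
    """
    if min_k > max_k:
        raise ValueError("min_k must not exceed max_k")
    if num_k_values < 1:
        raise ValueError("num_k_values must be at least 1")
    if num_k_values == 1 or min_k == max_k:
        return [min_k]
    step = (max_k - min_k) / (num_k_values - 1)
    # Rounding can map two steps to the same integer, so deduplicate.
    return sorted({round(min_k + i * step) for i in range(num_k_values)})


print(candidate_ks(3, 12, 4))  # [3, 6, 9, 12]
print(candidate_ks(3, 3, 1))   # [3]
```

Under this assumption, the settings used in Example 1 would train four models, and the settings in Example 3 would train exactly one.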

Examples

Example 1: Base case

Argument values:

  • Input dataset: ri.foundry.main.dataset.a
  • Maximum k: 12
  • Minimum k: 3
  • Number of k values: 4
  • Vector column: feature_column

Input:

feature_column
[ 0.05, 3.1, 2.3 ]
[ 1.0, 3.1, 2.3 ]
[ 1.0, 3.5, 2.3 ]
[ 19.0, 12.3, -1.4 ]

Output:

feature_column | cluster_id
[ 1.0, 3.1, 2.3 ] | 0
[ 1.0, 3.5, 2.3 ] | 0
[ 19.0, 12.3, -1.4 ] | 1
[ 0.05, 3.1, 2.3 ] | 2
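The grouping above can be reproduced with a minimal sketch of Lloyd's algorithm for a fixed k of 3. This is not the transform's implementation: seeding the centroids with the first k points is a simplifying assumption, and the `kmeans` function is illustrative. Cluster ids are arbitrary labels, so only the grouping is comparable with the output above, not the numbering.

```python
import math

Vector = list[float]


def mean(points: list[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def nearest(point: Vector, centroids: list[Vector]) -> int:
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points: list[Vector], k: int, max_iter: int = 100) -> list[int]:
    """Return one cluster label per point for a fixed k."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: seed centroids with the first k points (naive initialization).
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels


points = [[0.05, 3.1, 2.3], [1.0, 3.1, 2.3], [1.0, 3.5, 2.3], [19.0, 12.3, -1.4]]
labels = kmeans(points, 3)
print(labels)  # [0, 1, 1, 2]
```

The second and third rows share a label while the first and fourth rows each form their own cluster, matching the grouping in the example output.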

Example 2: Null case

Argument values:

  • Input dataset: ri.foundry.main.dataset.a
  • Maximum k: 12
  • Minimum k: 3
  • Number of k values: 4
  • Vector column: feature_column

Input:

feature_column
[ 0.05, 3.1, 2.3 ]
null
[ 1.0, 3.1, 2.3 ]
[ 1.0, 3.5, 2.3 ]
[ 19.0, 12.3, -1.4 ]

Output:

feature_column | cluster_id
[ 1.0, 3.1, 2.3 ] | 0
[ 1.0, 3.5, 2.3 ] | 0
[ 19.0, 12.3, -1.4 ] | 1
[ 0.05, 3.1, 2.3 ] | 2
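Judging from this output, the row with a null vector does not appear in the result. The sketch below shows equivalent filtering in plain Python; it is an inference from the example rather than documented behavior, and `drop_null_vectors` is a hypothetical helper.

```python
from typing import Optional

Vector = list[float]


def drop_null_vectors(rows: list[Optional[Vector]]) -> list[Vector]:
    """Keep only the rows that carry a vector to cluster."""
    return [row for row in rows if row is not None]


rows = [
    [0.05, 3.1, 2.3],
    None,  # the null row from the example input
    [1.0, 3.1, 2.3],
    [1.0, 3.5, 2.3],
    [19.0, 12.3, -1.4],
]
kept = drop_null_vectors(rows)
print(len(kept))  # 4
```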

Example 3: Edge case

Argument values:

  • Input dataset: ri.foundry.main.dataset.a
  • Maximum k: 3
  • Minimum k: 3
  • Number of k values: 1
  • Vector column: feature_column

Input:

feature_column
[ 0.05, 3.1, 2.3 ]
[ 1.0, 3.5, 2.3 ]
[ 19.0, 12.3, -1.4 ]

Output:

feature_column | cluster_id
[ 19.0, 12.3, -1.4 ] | 0
[ 0.05, 3.1, 2.3 ] | 1
[ 1.0, 3.5, 2.3 ] | 2

